CN106297805B - Speaker recognition method based on breathing characteristics - Google Patents

Speaker recognition method based on breathing characteristics

Info

Publication number
CN106297805B
CN106297805B (application CN201610626034.0A)
Authority
CN
China
Prior art keywords
breathing
unknown
speaker
frame
breath
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610626034.0A
Other languages
Chinese (zh)
Other versions
CN106297805A (en)
Inventor
鲁力
刘玲霜
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201610626034.0A
Publication of CN106297805A
Application granted
Publication of CN106297805B
Status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a speaker recognition method based on breathing characteristics. The method mainly comprises: inputting an unknown speech segment; extracting the breath sounds in the unknown speech segment by means of a breathing template built from mel-frequency cepstral coefficients (MFCC), the zero-crossing rate (ZCR), and the short-time energy E; removing false-positive portions of the breath sounds with a boundary-detection algorithm that eliminates false troughs, yielding precisely separated breath sounds; and finally, using the precisely separated breath sounds, identifying whether the speaker of the unknown speech segment belongs to a set of sample speakers and verifying whether that speaker is a legitimate speaker. The invention is the first to study and exploit the uniqueness of human breathing, applying it effectively in a speaker recognition system and overcoming the two major challenges facing breathing-based speaker recognition: breath-signal extraction and breath-signal processing. The speaker recognition system provided by the invention is therefore simple and efficient, and its recognition results are accurate and reliable.

Description

Speaker recognition method based on breathing characteristics
Technical field
The present invention relates to a system and method for contactless biometric signal detection, and more particularly to a speaker recognition system and method based on breathing characteristics.
Background technique
Speaker recognition (Speaker Recognition) is a fundamental problem that divides into two sub-problems: the speaker identification problem (Speaker Identification) and the speaker verification problem (Speaker Verification). The former decides whether an unknown speaker is a member of a known sample database of speakers; the latter confirms whether a claimed identity is legitimate. Speaker recognition proceeds in two stages, training and testing: the training stage builds the speaker feature templates, while the testing stage computes the similarity between the test data and the feature templates and produces a decision. According to the degree of dependence on the spoken text, speaker recognition further divides into text-dependent (valid only for a specific text), text-independent (valid for any text), and text-prompted (valid for a designated set of texts). Although voice features can be weakened by the microphone or the channel, are affected by health and mood, and can even be imitated, speech-processing technology has developed rapidly in recent years and many real-time applications have appeared, so speech-processing problems have attracted increasing attention and research.
Existing speaker recognition schemes are based on the Source-Filter (source-filter) model, on the Source-System (source-system) model, or extract feature vectors from both simultaneously. Excitation-source information can be represented by linear-prediction residual samples of the glottal waveform. Vocal-tract information can be captured by cepstral features. Prosodic information can be obtained from the temporal dynamics of duration, pitch, and energy. Breathing, which is aerodynamic in nature, is one of the energy sources of speech production and can be extracted and processed like a complete stretch of speech. Existing research has focused on detecting and removing breath signals from speech in order to improve sound quality, improve speech-to-text conversion, train typists, identify psychological state, and so on.
Source-Filter (source-filter) theory holds that speech is the response of the vocal-tract system and gives a good approximation of the nonlinear, time-varying speech signal. The "source" refers to four kinds of source signals: the aspiration source, the frication source, the glottal (phonation) source, and the transient source. The vocal tract acts as a filter: its input is generated by the above four source signals, and its output forms vowels, consonants, or any other speech sound. The vocal tract also governs pitch production, voice quality, harmonics, resonance characteristics, radiation response, and so on.
In the source/system (source/system) model, speech is modeled as a linear, slowly time-varying discrete-time system. The system is excited either by random noise (the unvoiced speech source) or by a quasi-periodic pulse train (the voiced speech source). Because the source contains error-prone features such as pitch, source models are rarely used in speaker recognition and are seldom combined with other features. In contrast, the system model corresponds to the smooth power-spectral envelope, which can be obtained by linear prediction or mel-filter analysis; it is therefore widely used in speaker recognition systems based on cepstral coefficients.
Both models treat breathing merely as part of the speech source, converting it into the voiced source or into noise in the unvoiced source. In fact, breathing is an energy-transfer mechanism that converts airflow energy into sound. Moreover, breathing during speech is constrained: exhalation generally lasts longer than inhalation, whereas outside speech the exhalation and inhalation times are roughly equal.
The respiratory system comprises the lungs, the diaphragm, the intercostal muscles, and the respiratory channel formed by the bronchi, trachea, larynx, vocal tract, and oral cavity. We regard breathing as a physiological fingerprint of the entire respiratory system, governed by intrapulmonary pressure, airflow direction, and muscular movement. During inhalation the respiratory muscles contract, intrapulmonary pressure drops, and air flows from outside into the lungs. Similarly, during exhalation intrapulmonary pressure rises, the lung volume is compressed, and air is exhaled from the lungs to the outside. By anatomical principle, a silent interval necessarily exists before and after each breath. Breathing is affected by age and sex; a breath normally lasts 100-400 milliseconds, and the silent gap lasts at least 20 milliseconds. The silent gap is the key to describing and separating breaths.
A breath is the joint product of the lungs, intrapulmonary pressure, diaphragm, vocal tract, trachea, and respiratory muscles, and is in this sense a physiological fingerprint of the respiratory system. Because airflow does not start or stop instantaneously, a silent gap (at least 20 milliseconds) exists both before and after each breath. Compared with ordinary speech signals (excluding breaths), a breath signal is weak in energy, short in duration (100-400 milliseconds), and low in frequency of occurrence (12-18 breaths per minute), and it overlaps non-breath speech signals at low frequencies (100 Hz - 1 kHz). In addition, breath sounds closely resemble certain phonemes, especially fricative consonants such as /tʃ/ in "church" and /ʒ/ in "vision". The exploitation of breathing in speaker recognition therefore faces two major challenges, breath-signal extraction and breath-signal processing; as a result, breathing has not been exploited in speaker recognition and is usually rejected as breath noise.
Summary of the invention
In view of the fact that breathing has not hitherto been effectively used in speaker recognition, and that breathing-based speaker recognition faces the two major challenges of breath-signal extraction and breath-signal processing, the object of the present invention is to provide a speaker recognition system and method based on breathing characteristics.
The technical solution adopted by the invention is as follows:
A speaker recognition method based on breathing characteristics, characterized by comprising the following steps:
Step 1: input a breath sample set and divide it into frames, obtaining breathing frames; build a breathing template from the breathing frames by means of mel-frequency cepstral coefficients (MFCC); compute the similarity between each breathing frame of the breath sample set and the breathing template, and take the minimum similarity Bm;
Step 2: input an unknown speech segment and divide it into frames, obtaining unknown speech frames; compute the similarity between each unknown speech frame and the breathing template; compute the zero-crossing rate ZCR and the short-time energy E of each unknown speech frame; according to the similarity to the breathing template, Bm, the zero-crossing rate ZCR, and the short-time energy E, filter out the breath sounds in the unknown speech segment; the filtered-out breath sounds constitute the initially separated breath sounds;
Step 3: detect the silent gaps of the initially separated breath sounds using a boundary-detection algorithm that eliminates false troughs, and reject the false-positive portions of the initially separated breath sounds according to the silent gaps, obtaining the precisely separated breath sounds;
Step 4: choose a group of sample speakers, collect a breathing segment from each sample speaker, and build a speaker sample database; if it is necessary to decide whether the speaker of the unknown speech segment comes from the sample speakers, go to step 5; if it is necessary to decide whether the speaker of the unknown speech segment is a legitimate speaker, go to step 6;
Step 5: compute the similarity between the precisely separated breath sounds of the unknown speech segment and each speaker's breath sample in the speaker sample database, take the sample speaker with the maximum similarity as the speaker of the unknown speech segment, and end;
Step 6: collect test samples from each sample speaker and choose one test sample;
Step 7: compute the similarity between the chosen test sample and each speaker's breath sample in the speaker sample database, and take the maximum of these similarities as the maximum similarity of that test sample;
Step 8: choose another test sample and repeat step 7 until the maximum similarity of every test sample has been obtained, yielding a group of maximum similarities;
Step 9: collect a speech segment of the legitimate speaker, extract the legitimate speaker's breathing segment using the breath sample set, and compute the similarity between the precisely separated breath sounds of the unknown speech segment and the breathing segment of the legitimate speaker;
Step 10: if the similarity between the precisely separated breath sounds of the unknown speech segment and the breathing segment of the legitimate speaker is greater than the minimum of the group of maximum similarities, identify the speaker of the unknown speech segment as the legitimate speaker; otherwise identify the speaker as illegitimate.
In the above scheme, step 1 comprises the following steps:
Step 1.1: input the breath sample set and divide it into breathing frames of 100 milliseconds; divide each breathing frame in turn into consecutive, mutually overlapping breathing subframes, each 10 ms long, with a 5 ms overlap between adjacent subframes;
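The framing scheme of step 1.1 (100 ms frames, 10 ms subframes, 5 ms overlap) can be sketched as follows. The 8 kHz sampling rate is an assumption for illustration; the patent does not state one.

```python
def split_frames(signal, fs=8000, frame_ms=100, sub_ms=10, overlap_ms=5):
    # Samples per 100 ms breathing frame and per 10 ms subframe.
    frame_len = fs * frame_ms // 1000
    sub_len = fs * sub_ms // 1000
    hop = sub_len - fs * overlap_ms // 1000  # advance 5 ms between subframes
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    # Each breathing frame becomes a list of overlapping subframes.
    return [[f[j:j + sub_len] for j in range(0, len(f) - sub_len + 1, hop)]
            for f in frames]

frames = split_frames([0.0] * 8000)  # one second of signal at 8 kHz
```

With these parameters, one second of audio yields 10 breathing frames of 19 overlapping subframes each.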
Step 1.2: apply pre-emphasis to each breathing subframe with a first-order difference filter, obtaining the pre-emphasized breathing subframes; the first-order difference filter H is
H(z) = 1 - α·z⁻¹
where α is the pre-emphasis parameter, α ≈ 0.095, and z is the z-transform variable;
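The filter H(z) = 1 - α·z⁻¹ corresponds to the difference equation y[n] = x[n] - α·x[n-1]; a minimal sketch follows. Note the patent states α ≈ 0.095, whereas conventional speech front ends usually use α ≈ 0.95.

```python
def preemphasis(subframe, alpha=0.095):
    # y[n] = x[n] - alpha * x[n-1]; the first sample passes through unchanged.
    return [subframe[0]] + [subframe[n] - alpha * subframe[n - 1]
                            for n in range(1, len(subframe))]

y = preemphasis([1.0, 1.0, 1.0])  # -> [1.0, 0.905, 0.905]
```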
Step 1.3: compute the MFCC of each pre-emphasized breathing subframe of each breathing frame to obtain the short-time cepstrum matrix of each breathing frame; remove the DC component from each column of the short-time cepstrum matrix to obtain the MFCC cepstrum matrix of each breathing frame;
Step 1.4: compute the mean matrix T of the breath sample set:
T = (1/N) · Σᵢ M(Xᵢ), i ∈ [1, 2, …, N]
where N is the number of breathing frames in the breath sample set and M(Xᵢ) is the MFCC cepstrum matrix of the i-th breathing frame;
Compute the variance matrix V of the breath sample set:
V = (1/N) · Σᵢ (M(Xᵢ) - T)², i ∈ [1, 2, …, N] (element-wise square)
Step 1.5: concatenate the MFCC cepstrum matrices of all breathing frames into one large matrix Mb: Mb = [M(X1), …, M(Xi), M(Xi+1), …, M(XN)]
Perform a singular value decomposition of this large matrix:
Mb = U·Σ·V*
where U is an m × m unitary matrix, Σ is a positive semi-definite m × n diagonal matrix, and V* is the conjugate transpose of V, an n × n unitary matrix; the elements {λ1, λ2, λ3, …} on the diagonal of Σ are the singular values of Mb, giving the singular value vector {λ1, λ2, λ3, …};
Normalize the singular value vector by the largest singular value λm = max{λ1, λ2, λ3, …}, obtaining the normalized singular value vector S = {λ1/λm, λ2/λm, λ3/λm, …};
Step 1.6: the resulting breathing template comprises the normalized singular value vector S, the variance matrix V of the breath sample set, and the mean matrix T of the breath sample set.
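For the singular values, the template-building of steps 1.5-1.6 reduces to dividing the vector by its largest entry; a minimal sketch of that normalization follows (in practice the SVD itself would come from a numerical library such as numpy.linalg.svd).

```python
def normalize_singular_values(sv):
    # Divide every singular value by the largest one, so the normalized
    # vector starts at 1.0 regardless of the overall scale of the data.
    lam_max = max(sv)
    return [lam / lam_max for lam in sv]

S = normalize_singular_values([4.0, 2.0, 1.0])  # -> [1.0, 0.5, 0.25]
```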
In the above scheme, step 2 comprises the following steps:
Step 2.1: input the unknown speech segment and divide it into frames, obtaining unknown speech frames and unknown speech subframes; compute the similarity B(X, T, V, S) between each unknown speech frame and the breathing template; compute the similarity between each breathing frame of the breath sample set and the breathing template, and take the minimum similarity as Bm;
Compute the short-time energy E of each unknown speech frame:
E = Σₙ x[n]², n ∈ [N0, N0 + N - 1]
where n indexes the signal samples, x[n] is the n-th speech sample, N is the window length of the sample, and N0 is the sample at which the window starts;
Compute the average short-time energy of all unknown speech frames;
Compute the zero-crossing rate ZCR of each unknown speech frame:
ZCR = (1/(N - 1)) · Σₙ |sgn(x[n]) - sgn(x[n - 1])| / 2, n ∈ [N0 + 1, N0 + N - 1]
where n indexes the signal samples, x[n] is the n-th speech sample, sgn(·) is the sign function, N is the window length of the sample, and N0 is the sample at which the window starts;
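Short-time energy and zero-crossing rate as used in step 2.1 can be sketched as follows; the exact forms are the standard textbook definitions and are assumed here, since the patent's equation images are not reproduced in the text.

```python
def short_time_energy(x, n0, n):
    # Sum of squared samples over a window of length n starting at n0.
    return sum(s * s for s in x[n0:n0 + n])

def zero_crossing_rate(x):
    # Fraction of adjacent sample pairs whose signs differ.
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(x) - 1)

sig = [1.0, -1.0, 1.0, -1.0]
e = short_time_energy(sig, 0, 4)   # -> 4.0
z = zero_crossing_rate(sig)        # -> 1.0
```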
Step 2.2: choose an unknown speech frame;
Step 2.3: if the similarity B(X, T, V, S) between the chosen unknown speech frame and the breathing template is greater than the threshold Bm/2, the zero-crossing rate ZCR of the chosen frame is less than 0.25, and the short-time energy E of the chosen frame is less than the average over all unknown speech frames, judge the chosen unknown speech frame to be a breath sound; if any of these conditions is not satisfied, judge the chosen unknown speech frame to be a non-breath sound.
Step 2.4: choose the remaining unknown speech frames in turn and repeat step 2.3 until every unknown speech frame in the unknown speech segment has been judged;
Step 2.5: retain the breath sounds and reject the non-breath sounds, obtaining the initially separated breath sounds;
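The three-way test of step 2.3 can be written directly; the frame-level feature values are assumed to be precomputed by the routines of step 2.1.

```python
def is_breath_frame(similarity, bm, zcr, energy, mean_energy):
    # Step 2.3: a frame is breath iff its template similarity exceeds Bm/2,
    # its ZCR is below 0.25, and its short-time energy is below the
    # average energy over all unknown speech frames.
    return similarity > bm / 2 and zcr < 0.25 and energy < mean_energy

keep = is_breath_frame(similarity=0.8, bm=1.0, zcr=0.1,
                       energy=0.2, mean_energy=0.5)   # -> True
```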
In the above scheme, the similarity between a breathing frame or an unknown speech frame and the breathing template in step 2.1 is computed as follows:
Step 2.1.1: input the breath sample set or the unknown speech segment and divide it into breathing frames or unknown speech frames of 100 milliseconds; divide each breathing frame or unknown speech frame in turn into consecutive, mutually overlapping breathing subframes or unknown speech subframes, each 10 ms long, with a 5 ms overlap between adjacent subframes;
Step 2.1.2: apply pre-emphasis to each subframe with the first-order difference filter H, obtaining the pre-emphasized breathing frames or unknown speech frames:
H(z) = 1 - α·z⁻¹
where α is the pre-emphasis parameter, α ≈ 0.095, and z is the z-transform variable;
Step 2.1.3: compute the MFCC of each pre-emphasized subframe of each breathing frame or unknown speech frame to obtain its short-time cepstrum matrix; remove the DC component from each column of the short-time cepstrum matrix, obtaining the MFCC cepstrum matrix M(X) of each breathing frame or unknown speech frame;
Step 2.1.4: choose a breathing frame or unknown speech frame X;
Step 2.1.5: compute the normalized difference matrix D of the chosen breathing frame or unknown speech frame:
where T is the mean matrix of the breath sample set, V is the variance matrix of the breath sample set, and M(X) is the MFCC cepstrum matrix of the chosen breathing frame or unknown speech frame;
Step 2.1.6: multiply each column of D by half a Hamming window so that the low-frequency cepstral coefficients are emphasized:
D(:, j) = D(:, j) * hamming, j ∈ [1, Nc]
where Nc is the number of MFCC parameters in each breathing subframe or unknown speech subframe, i.e. the number of columns of D, and hamming denotes the Hamming window.
Step 2.1.7: compute the component Cp of the similarity B(X, T, V, S) between the chosen breathing frame or unknown speech frame X and the breathing template:
where n is the number of breathing subframes or unknown speech subframes in the chosen frame X, k ∈ [1, n], and Dkj is the j-th MFCC parameter of the k-th subframe of the frame X;
Compute the other component Cn of the similarity B(X, T, V, S) between the chosen frame X and the breathing template:
Step 2.1.8: compute the similarity B(X, T, V, S) between the chosen breathing frame or unknown speech frame X and the breathing template:
B(X, T, V, S) = Cp * Cn;
Step 2.1.9: choose the MFCC cepstrum matrix of another breathing frame or unknown speech frame and repeat steps 2.1.5-2.1.8;
Step 2.1.10: repeat step 2.1.9 until the similarity to the breathing template of every breathing frame or unknown speech frame has been obtained.
In the above scheme, the boundary-detection algorithm eliminating false troughs in step 3 uses a breath-duration threshold, an energy threshold, upper and lower ZCR thresholds, and the spectral slope to locate breath boundaries accurately; step 3 uses a binary 0-1 mask to indicate precisely where breaths occur in the current speech segment.
In the above scheme, the similarity in step 5 between the precisely separated breath sounds of the unknown speech segment and each speaker's breath sample in the speaker sample database is computed as follows:
Step 5.1: let the MFCC feature vector of a sample in the breath sample database be (a1, a2, …, an); compute the mean matrix M of the MFCC feature vector of the speaker's breath sample in the speaker sample database:
M = (1/n) · Σₖ aₖ, k ∈ [1, 2, …, n]
where aᵢ is the i-th MFCC cepstrum matrix of the MFCC feature vector of the speaker's breath sample and n is the number of MFCC cepstrum matrices in that feature vector, i ∈ [1, 2, …, n];
Compute the variance matrix V of the MFCC feature vector of the speaker's breath sample in the speaker sample database:
Step 5.2: compute the MFCC feature vectors of all the precisely separated breath sounds of the unknown speech segment, denoted (b1, b2, …, bn), where bᵢ is the MFCC cepstrum matrix of the i-th precisely separated breath sound of the unknown speech segment;
Step 5.3: normalize the feature vector (a1, a2, …, an) of the speaker's breath sample in the speaker sample database:
where r and c are the numbers of rows and columns of Saₖ, Saₖ is the normalized difference matrix of aₖ, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 5.4: sort (Sa1, Sa2, …, San) in ascending order, obtaining (S1, S2, …, Sn);
Step 5.5: normalize the MFCC feature vectors (b1, b2, …, bn) of all the precisely separated breath sounds of the unknown speech segment:
where r and c are the numbers of rows and columns of Sbₖ, Sbₖ is the normalized difference matrix of bₖ, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 5.6: compute the similarity Pk between bk and the reference template: compare Sbk element by element with the ordered vector (S1, S2, …, Sn); Pk is the number of elements in the ordered vector smaller than Sbk divided by the total number of elements; compute the average of Pk over all k to obtain the similarity between the precisely separated breath sounds of the unknown speech segment and the sample in the breath sample database.
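The rank-based comparison of steps 5.4-5.6 can be sketched with a sorted reference vector and a binary search; for brevity the normalized matrices are flattened to scalar values here, which is an illustrative assumption rather than the patent's exact matrix-wise procedure.

```python
import bisect

def rank_similarity(reference_sorted, test_values):
    # Step 5.6: for each test value, Pk is the fraction of reference
    # entries strictly smaller than it; return the average Pk.
    n = len(reference_sorted)
    pks = [bisect.bisect_left(reference_sorted, t) / n for t in test_values]
    return sum(pks) / len(pks)

sim = rank_similarity([0.1, 0.2, 0.3, 0.4], [0.25, 0.35])  # -> 0.625
```

`bisect_left` returns the count of sorted elements strictly below the probe, which matches the "number of elements less than Sbk" in step 5.6.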
In the above scheme, the similarity in step 9 between the precisely separated breath sounds of the unknown speech segment and the breathing segment of the legitimate speaker is computed as follows:
Step 9.1: let the MFCC feature vector of the breathing segment of the legitimate speaker be (a1, a2, …, an); compute the mean matrix M of that feature vector:
M = (1/n) · Σₖ aₖ, k ∈ [1, 2, …, n]
where aᵢ is the i-th MFCC cepstrum matrix of the feature vector of the legitimate speaker's breathing segment and n is the number of MFCC cepstrum matrices in that feature vector, i ∈ [1, 2, …, n];
Compute the variance matrix V of the MFCC feature vector of the breathing segment of the legitimate speaker:
Step 9.2: compute the MFCC feature vectors of all the precisely separated breath sounds of the unknown speech segment, denoted (b1, b2, …, bn), where bᵢ is the MFCC cepstrum matrix of the i-th precisely separated breath sound of the unknown speech segment;
Step 9.3: normalize the feature vector (a1, a2, …, an) of the breathing segment of the legitimate speaker:
where r and c are the numbers of rows and columns of Saₖ, Saₖ is the normalized difference matrix of aₖ, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 9.4: sort (Sa1, Sa2, …, San) in ascending order, obtaining (S1, S2, …, Sn);
Step 9.5: normalize the MFCC feature vectors (b1, b2, …, bn) of all the precisely separated breath sounds of the unknown speech segment:
where r and c are the numbers of rows and columns of Sbₖ, Sbₖ is the normalized difference matrix of bₖ, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 9.6: compute the similarity Pk between bk and the reference template: compare Sbk element by element with the ordered vector (S1, S2, …, Sn); Pk is the number of elements in the ordered vector smaller than Sbk divided by the total number of elements; compute the average of Pk over all k to obtain the similarity between the precisely separated breath sounds of the unknown speech segment and the breathing segment of the legitimate speaker.
In the above scheme, the similarity in step 7 between the chosen test sample and each speaker's breath sample in the speaker sample database is computed by the same method as the similarity in step 5 between the precisely separated breath sounds of the unknown speech segment and each speaker's breath sample in the speaker sample database.
In the above scheme, the MFCC in steps 1.3 and 5.2 is computed as follows: apply a fast Fourier transform (FFT) to the signal whose MFCC is required, compute the complex sinusoid coefficients, and finally pass the result through a mel-scale filter bank to obtain the output.
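The mel-scale filter bank mentioned above places filter centres uniformly on the mel scale; a minimal sketch using the common 2595·log10(1 + f/700) mapping follows (the patent does not specify which mel formula it uses, so this mapping is an assumption).

```python
import math

def hz_to_mel(f):
    # Common mel-scale mapping.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_filters, f_low, f_high):
    # n_filters triangular filters need n_filters + 2 edge frequencies,
    # equally spaced on the mel scale and mapped back to Hz.
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + i * step) for i in range(n_filters + 2)]

edges = mel_filter_edges(10, 0.0, 4000.0)  # 12 edges from 0 Hz to 4 kHz
```

Because the edges are uniform in mel rather than in Hz, the filters are narrow at low frequencies and wide at high frequencies, which is what lets the filter bank emphasize the low-frequency region where breath energy concentrates.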
In conclusion by adopting the above-described technical solution, the beneficial effects of the present invention are:
1) present invention as a set of Verification System based on breathing, paid close attention to by the uniqueness for realizing human body respiration for the first time And research, and it is effectively applied the development and utilization that the speaker Recognition Technology based on breathing is overcome in Speaker Recognition System " extraction of breath signal " and " breath signal processing " two faced is challenged greatly.
2) the present invention is based on the knowledge of mathematical statistics, devise a light similarity algorithm for decision: the calculation Method is a series of simple vector operations using MFCC Mean Matrix and variance matrix.Compared with traditional classification algorithm, this hair Similarity algorithm in bright has more preferably classification performance.
3) present invention can operate with speaker identification's experiment and speaker verification's experiment;Simultaneously because if people's exhales Haustorium official is interfered, then his breathing signature may be modified, therefore the invention can be used for judging human body respiration organ Whether it is interfered.
4) present invention can be achieved to need the identification under mute occasion.
5) present invention can be achieved can not sounding tester identification.
6) classification method that uses of the present invention is opposite with traditional complex model classification side based on multi-parameter, more assumed Method has lower time complexity and space complexity.In addition, the present invention uses the algorithm process data based on MFCC more Fastly, required training sample is less, and ensures recognition accuracy, thus Speaker Recognition System provided by the invention is simple and efficient, And recognition result is accurate and reliable.
Brief description of the drawings
Fig. 1 is the system framework for judging whether an unknown speaker's identity is legitimate;
Fig. 2 is the framework of the preliminary breath detection in step 2;
Fig. 3 is the framework of the final breath detection in step 3;
Fig. 4 is a table of experimental results for steps 6-8;
Fig. 5 compares the effect of the mel filter bank on a breath signal and on a non-breath speech signal;
Fig. 6 shows the characteristics of ZCR, spectral slope, and short-time energy (STE);
Fig. 7 shows the formants of a breath signal and of a non-breath speech signal;
Fig. 8 shows breath signals under normal and abnormal conditions.
Specific embodiments
All features disclosed in this specification may be combined in any manner, except for mutually exclusive features and/or steps.
The present invention is elaborated below with reference to Figs. 1-8.
The invention proposes a speaker recognition method based on breathing characteristics, which achieves good results when applied to speaker recognition. The overall algorithm is illustrated in Fig. 1 and comprises the following steps:
Step 1: such as Fig. 1, inputting breath sample collection, sub-frame processing is carried out to breath sample collection, obtain breathing frame, pass through plum Your frequency cepstral coefficient MFCC will breathe frame and be established as breathing template;Step 1 specifically includes the following steps:
Step 1.1: the breath sample collection is divided into the breathing frame that length is 100 milliseconds by input breath sample collection, will be every A breathing frame is divided into continuous and overlapped breathing subframe again, and each subframe lengths that breathe is 10ms, and adjacent breather Overlapped length is 5ms between frame;
Step 1.2 carries out preemphasis to each breathing subframe using first-order difference filter, the breather after obtaining preemphasis Frame;Wherein, first-order difference filter H:
H (z)=1- α z-1
Wherein, α is pre-emphasis parameters α ≈ 0.095, and z is signal sampling point data;
Step 1.3: compute the MFCC of each pre-emphasized breathing subframe of each breathing frame to obtain the short-time cepstrum matrix of each breathing frame; remove the DC component from each column of the short-time cepstrum matrix of each breathing frame to obtain the MFCC cepstrum matrix of each breathing frame;
Step 1.4: calculate the mean matrix T of the breath sample set:
T = (1/N) · Σ M(Xi), i = 1, …, N
where N is the number of breathing frames in the breath sample set and M(Xi) is the MFCC cepstrum matrix of the i-th breathing frame, i ∈ [1, 2, …, N];
Calculate the variance matrix V of the breath sample set:
V = (1/N) · Σ (M(Xi) - T)², element-wise, i = 1, …, N;
Step 1.5: concatenate the MFCC cepstrum matrices of all breathing frames into one big matrix Mb: Mb = [M(X1), …, M(Xi), M(Xi+1), …, M(XN)]
Perform singular value decomposition on this big matrix:
Mb = UΣV*
where U is an m×m unitary matrix; Σ is a positive semi-definite m×n diagonal matrix; V* denotes the conjugate transpose of V and is an n×n unitary matrix. The elements {λ1, λ2, λ3, …} on the diagonal of Σ are the singular values of Mb, giving the singular value vector {λ1, λ2, λ3, …};
Normalize the singular value vector by the maximum singular value λm to obtain the normalized singular value vector S = {λ1/λm, λ2/λm, λ3/λm, …}, where λm = max{λ1, λ2, λ3, …};
Step 1.6: obtain one set of breathing templates, a breathing template comprising the normalized singular value vector S, the variance matrix V of the breath sample set, and the mean matrix T of the breath sample set.
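Steps 1.4-1.6 admit the following NumPy sketch. Stacking equally sized MFCC matrices and taking the element-wise population variance are assumptions where the patent's equation figures are not reproduced:

```python
import numpy as np

def build_breath_template(mfcc_mats):
    """Build the breathing template (S, V, T) from the per-frame MFCC cepstrum
    matrices of the breath sample set (steps 1.4-1.6)."""
    stack = np.stack(mfcc_mats)             # shape (N, rows, cols)
    T = stack.mean(axis=0)                  # mean matrix of the sample set
    V = stack.var(axis=0)                   # element-wise variance matrix
    Mb = np.concatenate(mfcc_mats, axis=1)  # big matrix Mb = [M(X1), ..., M(XN)]
    sv = np.linalg.svd(Mb, compute_uv=False)
    S = sv / sv.max()                       # normalize by the largest singular value
    return S, V, T
```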
Step 2: as in Fig. 2, input an unknown speech segment and perform frame division on it to obtain unknown speech frames; calculate the similarity between each unknown speech frame and the breathing template; calculate the zero-crossing rate ZCR and the short-time energy E of the unknown speech frames; according to the similarity between the unknown speech frames and the breathing template, the threshold Bm, the zero-crossing rate ZCR and the short-time energy E, filter out the breath sounds in the unknown speech segment, the filtered-out breath sounds constituting the roughly separated breath sound;
Step 2 comprises the following steps:
Step 2.1: input the unknown speech segment, perform frame division on it to obtain unknown speech frames and unknown speech subframes, and calculate the similarity B(X, T, V, S) between each unknown speech frame and the breathing template;
Calculate the similarity between each breathing frame of the breath sample set and the breathing template, and take the minimum similarity as Bm;
Calculate the short-time energy E of each unknown speech frame:
E = Σ x[n]², n = N0, …, N0+N-1
where n indexes the signal samples, x[n] is the n-th speech sample, N is the window length, and N0 is the sample index at which the window starts;
Calculate the average short-time energy Ē of all unknown speech frames;
Calculate the zero-crossing rate ZCR of each unknown speech frame:
ZCR = (1/(2N)) · Σ |sgn(x[n]) - sgn(x[n-1])|, n = N0+1, …, N0+N-1
with the same notation as above;
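The exact energy and zero-crossing expressions appear only as figures in the original; the standard short-time definitions they describe can be sketched as:

```python
import numpy as np

def short_time_energy(x):
    """Short-time energy: sum of squared samples over the analysis window."""
    return float(np.sum(np.asarray(x, dtype=float) ** 2))

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ (standard ZCR form;
    the patent's exact normalization constant is not reproduced here)."""
    x = np.asarray(x, dtype=float)
    signs = np.sign(x)
    return float(np.mean(signs[1:] != signs[:-1]))
```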
The method for calculating the similarity between a breathing frame or unknown speech frame and the breathing template in step 2.1 comprises the following steps:
Step 2.1.1: input the breath sample set or the unknown speech segment and divide it into breathing frames or unknown speech frames 100 milliseconds in length; divide each breathing frame or unknown speech frame further into consecutive, mutually overlapping breathing subframes or unknown speech subframes, each subframe being 10 ms long with a 5 ms overlap between adjacent subframes;
Step 2.1.2: apply pre-emphasis to each subframe with the first-order difference filter to obtain the pre-emphasized breathing frames or unknown speech frames, the first-order difference filter H being:
H(z) = 1 - αz⁻¹
where α is the pre-emphasis parameter, α ≈ 0.095, and z is the z-transform variable of the sampled signal;
Step 2.1.3: compute the MFCC of each pre-emphasized subframe of each breathing frame or unknown speech frame to obtain the short-time cepstrum matrix of each frame; remove the DC component from each column of the short-time cepstrum matrix to obtain the MFCC cepstrum matrix M(X) of each breathing frame or unknown speech frame;
Step 2.1.4: choose one breathing frame or unknown speech frame X;
Step 2.1.5: calculate the normalized difference matrix D of the chosen breathing frame or unknown speech frame:
where T is the mean matrix of the breath sample set, V is the variance matrix of the breath sample set, and M(X) is the MFCC cepstrum matrix of the chosen breathing frame or unknown speech frame;
Step 2.1.6: multiply each column of D by a half Hamming window so that the low-order cepstral coefficients are strengthened:
D(:, j) = D(:, j) · hamming, j ∈ [1, Nc]
where Nc is the number of MFCC parameters in each breathing subframe or unknown speech subframe, i.e., the number of columns of D, and hamming denotes the Hamming window.
Step 2.1.7: calculate the component Cp of the similarity B(X, T, V, S) between the chosen breathing frame or unknown speech frame X and the breathing template:
where n is the number of breathing subframes or unknown speech subframes in the chosen frame X, k ∈ [1, n], and Dkj is the j-th MFCC parameter of the k-th subframe of frame X;
Calculate the other component Cn of the similarity B(X, T, V, S) between the chosen frame X and the breathing template;
Step 2.1.8: calculate the similarity between frame X and the breathing template:
B(X, T, V, S) = Cp · Cn;
Step 2.1.9: choose the MFCC cepstrum matrix of another breathing frame or unknown speech frame and repeat steps 2.1.5-2.1.8;
Step 2.1.10: repeat step 2.1.9 until the similarities between all breathing frames or unknown speech frames and the breathing template are obtained.
In the above scheme, the values used by the false-trough-eliminating boundary detection algorithm in step 3 include a breath duration threshold, an energy threshold, upper and lower zero-crossing rate ZCR thresholds and the spectral slope, so as to find the breathing boundaries accurately; step 3 uses binary 0-1 indicators to mark the positions of breaths in the current speech segment precisely.
Step 2.2: choose one unknown speech frame;
Step 2.3: if the similarity B(X, T, V, S) between the chosen unknown speech frame and the breathing template is greater than the threshold Bm/2, the zero-crossing rate ZCR of the frame is less than 0.25 (at a sampling rate of 44 kHz), and the short-time energy E of the chosen frame is less than the average energy Ē of all unknown speech frames, then the chosen unknown speech frame is judged to be breath sound; if these conditions are not met, it is judged to be non-breath sound.
Step 2.4: choose the other unknown speech frames in turn and repeat step 2.3 until every unknown speech frame in the unknown speech segment has been judged;
Step 2.5: retain the breath sounds and reject the non-breath sounds to obtain the roughly separated breath sound;
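The frame-level decision of steps 2.3-2.5 reduces to three threshold tests; `is_breath_frame` and `rough_separation` are illustrative names, and each frame is represented here by a precomputed (similarity, zcr, energy) triple:

```python
def is_breath_frame(similarity, zcr, energy, Bm, mean_energy):
    """Step 2.3: a frame is breath sound only if all three conditions hold
    (the 0.25 ZCR threshold assumes 44 kHz sampling, as stated above)."""
    return (similarity > Bm / 2) and (zcr < 0.25) and (energy < mean_energy)

def rough_separation(frames, Bm):
    """Steps 2.4-2.5: keep only the frames judged to be breath sound."""
    mean_e = sum(e for _, _, e in frames) / len(frames)
    return [f for f in frames if is_breath_frame(f[0], f[1], f[2], Bm, mean_e)]
```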
Step 3: as in Fig. 3, detect the silence gaps of the roughly separated breath sound with the false-trough-eliminating boundary detection algorithm, and reject the false-positive parts of the roughly separated breath sound according to the silence gaps to obtain the precisely separated breath sound. For a concrete implementation of the false-trough-eliminating boundary detection algorithm, see "An effective algorithm for automatic detection and exact demarcation of breath sounds in speech and song", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, March 2007;
Step 4: choose a group of sample speakers and collect the breathing segments of each sample speaker to establish a speaker sample database; if it is to be determined whether the speaker of the unknown speech segment comes from the sample speakers, go to step 5; if it is to be determined whether the speaker of the unknown speech segment is the legitimate speaker, go to step 6;
Step 5: calculate the similarity between the precisely separated breath sound of the unknown speech segment and each speaker's breath sample in the speaker sample database; take the sample speaker with the maximum similarity as the speaker of the unknown speech segment, and terminate;
Calculating the similarity between the precisely separated breath sound of the unknown speech segment and a speaker's breath sample in the speaker sample database in step 5 comprises the following steps:
Step 5.1: let the MFCC feature vector of a speaker's breath sample in the speaker sample database be (a1, a2, …, an); calculate the mean matrix M of this MFCC feature vector:
M = (1/n) · Σ ai, i = 1, …, n
where ai is the i-th MFCC cepstrum matrix of the MFCC feature vector of the speaker's breath sample and n is the number of MFCC cepstrum matrices in that feature vector, i ∈ [1, 2, …, n];
Calculate the variance matrix V of the MFCC feature vector of the speaker's breath sample in the speaker sample database:
Step 5.2: calculate the MFCC feature vectors of all precisely separated breath sounds of the unknown speech segment, denoted (b1, b2, …, bn), where bi is the MFCC cepstrum matrix of the i-th precisely separated breath sound of the unknown speech segment;
Step 5.3: normalize the feature vector (a1, a2, …, an) of the speaker's breath sample in the speaker sample database:
where r and c are the numbers of rows and columns of Sak, Sak being the normalized difference matrix of ak, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 5.4: sort (Sa1, Sa2, …, San) in ascending order to obtain (S1, S2, …, Sn);
Step 5.5: normalize the MFCC feature vectors (b1, b2, …, bn) of all precisely separated breath sounds of the unknown speech segment:
where r and c are the numbers of rows and columns of Sbk, Sbk being the normalized difference matrix of bk, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 5.6: calculate the similarity degree Pk between bk and the reference template: compare Sbk one by one with the elements of the ordered vector (S1, S2, …, Sn); Pk is the number of elements in the ordered vector smaller than Sbk divided by the total number of elements. Calculate the average of Pk to obtain the similarity between the precisely separated breath sound of the unknown speech segment and the sample in the breath sample database.
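Step 5.6 is a percentile-rank comparison. The sketch below flattens each normalized matrix Sbk and ranks its entries against the sorted reference values; the element-wise handling of the matrices is an assumption, since the patent describes the comparison only in scalar terms:

```python
import numpy as np

def percentile_similarity(Sb_list, S_ref):
    """For each normalized test matrix Sbk, Pk is the fraction of reference
    values smaller than its entries; the mean of Pk over k is the similarity."""
    ref = np.sort(np.ravel(S_ref))
    Pk = [np.mean(np.searchsorted(ref, np.ravel(Sb), side='left') / ref.size)
          for Sb in Sb_list]
    return float(np.mean(Pk))
```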
Step 6: collect test samples for each sample speaker and choose one test sample;
Step 7: as in Fig. 4, calculate the similarity between the chosen test sample and each speaker's breath sample in the speaker sample database, and take the maximum of these similarities to obtain one maximum similarity;
The method for calculating the similarity between the chosen test sample and each speaker's breath sample in step 7 is the same as the method in step 5 for calculating the similarity between the precisely separated breath sound of the unknown speech segment and each speaker's breath sample in the speaker sample database.
Step 8: as in Fig. 4, choose another test sample and repeat step 7 until the maximum similarity corresponding to every test sample is obtained, yielding the maximum similarity group;
Step 9: collect the breathing segment of the legitimate speaker, and calculate the similarity between the precisely separated breath sound of the unknown speech segment and the breathing segment of the legitimate speaker;
Calculating the similarity between the precisely separated breath sound of the unknown speech segment and the breathing segment of the legitimate speaker in step 9 comprises the following steps:
Step 9.1: let the MFCC feature vector of the breathing segment of the legitimate speaker be (a1, a2, …, an); calculate the mean matrix M of this MFCC feature vector:
M = (1/n) · Σ ai, i = 1, …, n
where ai is the i-th MFCC cepstrum matrix of the MFCC feature vector of the breathing segment of the legitimate speaker and n is the number of MFCC cepstrum matrices in that feature vector, i ∈ [1, 2, …, n];
Calculate the variance matrix V of the MFCC feature vector of the breathing segment of the legitimate speaker:
Step 9.2: calculate the MFCC feature vectors of all precisely separated breath sounds of the unknown speech segment, denoted (b1, b2, …, bn), where bi is the MFCC cepstrum matrix of the i-th precisely separated breath sound of the unknown speech segment;
Step 9.3: normalize the feature vector (a1, a2, …, an) of the breathing segment of the legitimate speaker:
where r and c are the numbers of rows and columns of Sak, Sak being the normalized difference matrix of ak, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 9.4: sort (Sa1, Sa2, …, San) in ascending order to obtain (S1, S2, …, Sn);
Step 9.5: normalize the MFCC feature vectors (b1, b2, …, bn) of all precisely separated breath sounds of the unknown speech segment:
where r and c are the numbers of rows and columns of Sbk, Sbk being the normalized difference matrix of bk, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 9.6: calculate the similarity degree Pk between bk and the reference template: compare Sbk one by one with the elements of the ordered vector (S1, S2, …, Sn); Pk is the number of elements in the ordered vector smaller than Sbk divided by the total number of elements. Calculate the average of Pk to obtain the similarity between the precisely separated breath sound of the unknown speech segment and the breathing segment of the legitimate speaker.
Step 10: if the similarity between the precisely separated breath sound of the unknown speech segment and the breathing segment of the legitimate speaker is greater than the minimum value of the maximum similarity group, the speaker of the unknown speech segment is identified as the legitimate speaker; otherwise, as an illegitimate speaker.
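The acceptance test of step 10 is a single comparison against the minimum of the maximum similarity group collected in steps 6-8:

```python
def verify_speaker(test_similarity, max_similarity_group):
    """Step 10: accept the unknown speaker as the legitimate speaker iff the
    similarity to the legitimate speaker's breathing segment exceeds the
    minimum of the per-test-sample maximum similarities."""
    return test_similarity > min(max_similarity_group)
```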
The method of computing the MFCC in steps 1.3 and 5.2 comprises: applying a fast Fourier transform to the signal whose MFCC is to be computed, then computing the complex sinusoid coefficients, and finally producing the output through a filter bank based on the mel scale.
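The front end just described can be sketched with a standard triangular mel filter bank. The patent only names the FFT, the sinusoid coefficients and a mel-scale filter bank, so the triangular filter shapes, the filter count and the FFT size below are assumptions, and the log/DCT stages of a full MFCC chain are omitted:

```python
import numpy as np

def mel_filterbank_energies(subframe, fs=44100, n_filters=26, n_fft=512):
    """FFT magnitude spectrum passed through a triangular mel-scale filter bank."""
    spec = np.abs(np.fft.rfft(subframe, n_fft))        # 257 bins for n_fft = 512
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # filter edges equally spaced on the mel scale from 0 Hz to fs/2
    mel_pts = np.linspace(to_mel(0.0), to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(spec)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank @ spec
```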
The present invention has been explained by the above embodiments, but it should be understood that the above embodiments are given for the purpose of illustration and explanation only and are not intended to limit the invention to their scope. Furthermore, those skilled in the art will understand that the invention is not limited to the above embodiments, and that many further variants and modifications can be made in accordance with its teaching, all of which fall within the claimed scope. The protection scope of the present invention is defined by the appended claims and their equivalents.

Claims (9)

1. A speaker recognition method based on respiratory characteristics, characterized by comprising the following steps:
Step 1: input a breath sample set, perform frame division on the breath sample set to obtain breathing frames, build the breathing frames into a breathing template via Mel-frequency cepstral coefficients (MFCC), calculate the similarity between each breathing frame of the breath sample set and the breathing template, and obtain its minimum value Bm;
Step 2: input an unknown speech segment, perform frame division on it to obtain unknown speech frames, and calculate the similarity between each unknown speech frame and the breathing template; calculate the zero-crossing rate ZCR and the short-time energy E of the unknown speech frames; according to the similarity between the unknown speech frames and the breathing template, Bm, the zero-crossing rate ZCR and the short-time energy E of the unknown speech frames, filter out the breath sounds in the unknown speech segment, the filtered-out breath sounds constituting the roughly separated breath sound;
Step 3: detect the silence gaps of the roughly separated breath sound with a false-trough-eliminating boundary detection algorithm, and reject the false-positive parts of the roughly separated breath sound according to the silence gaps to obtain the precisely separated breath sound;
Step 4: choose a group of sample speakers, collect the breathing segments of each sample speaker, and establish a speaker sample database; if it is to be determined whether the speaker of the unknown speech segment comes from the sample speakers, go to step 5; if it is to be determined whether the speaker of the unknown speech segment is the legitimate speaker, go to step 6;
Step 5: calculate the similarity between the precisely separated breath sound of the unknown speech segment and each speaker's breath sample in the speaker sample database, take the sample speaker with the maximum similarity as the speaker of the unknown speech segment, and terminate;
Step 6: collect test samples for each sample speaker and choose one test sample;
Step 7: calculate the similarity between the chosen test sample and each speaker's breath sample in the speaker sample database, and take the maximum of these similarities to obtain one maximum similarity;
Step 8: choose another test sample and repeat step 7 until the maximum similarity corresponding to every test sample is obtained, yielding the maximum similarity group;
Step 9: collect the speech segment of the legitimate speaker, extract the breathing segment of the legitimate speaker using the breath sample set, and calculate the similarity between the precisely separated breath sound of the unknown speech segment and the breathing segment of the legitimate speaker;
Step 10: if the similarity between the precisely separated breath sound of the unknown speech segment and the breathing segment of the legitimate speaker is greater than the minimum value of the maximum similarity group, identify the speaker of the unknown speech segment as the legitimate speaker; otherwise, as an illegitimate speaker.
2. The speaker recognition method based on respiratory characteristics according to claim 1, characterized in that step 1 comprises the following steps:
Step 1.1: input the breath sample set and divide it into breathing frames 100 milliseconds in length; divide each breathing frame further into consecutive, mutually overlapping breathing subframes, each breathing subframe being 10 ms long with a 5 ms overlap between adjacent breathing subframes;
Step 1.2: apply pre-emphasis to each breathing subframe with a first-order difference filter to obtain the pre-emphasized breathing subframes, the first-order difference filter H being:
H(z) = 1 - αz⁻¹
where α is the pre-emphasis parameter, α ≈ 0.095, and z is the z-transform variable of the sampled signal;
Step 1.3: compute the MFCC of each pre-emphasized breathing subframe of each breathing frame to obtain the short-time cepstrum matrix of each breathing frame, and remove the DC component from each column of the short-time cepstrum matrix of each breathing frame to obtain the MFCC cepstrum matrix of each breathing frame;
Step 1.4: calculate the mean matrix T of the breath sample set:
T = (1/N) · Σ M(Xi), i = 1, …, N
where N is the number of breathing frames in the breath sample set and M(Xi) is the MFCC cepstrum matrix of the i-th breathing frame, i ∈ [1, 2, …, N];
calculate the variance matrix V of the breath sample set:
V = (1/N) · Σ (M(Xi) - T)², element-wise, i = 1, …, N;
Step 1.5: concatenate the MFCC cepstrum matrices of all breathing frames into one big matrix Mb:
Mb = [M(X1), …, M(Xi), M(Xi+1), …, M(XN)]
perform singular value decomposition on this big matrix:
Mb = UΣV*
where U is an m×m unitary matrix, Σ is a positive semi-definite m×n diagonal matrix, and V* denotes the conjugate transpose of V and is an n×n unitary matrix; the elements {λ1, λ2, λ3, …} on the diagonal of Σ are the singular values of Mb, giving the singular value vector {λ1, λ2, λ3, …};
normalize the singular value vector by the maximum singular value λm to obtain the normalized singular value vector S = {λ1/λm, λ2/λm, λ3/λm, …}, where λm = max{λ1, λ2, λ3, …};
Step 1.6: obtain one set of breathing templates, a breathing template comprising the normalized singular value vector S, the variance matrix V of the breath sample set, and the mean matrix T of the breath sample set.
3. The speaker recognition method based on respiratory characteristics according to claim 1, characterized in that step 2 comprises the following steps:
Step 2.1: input the unknown speech segment, perform frame division on it to obtain unknown speech frames and unknown speech subframes, and calculate the similarity B(X, T, V, S) between each unknown speech frame and the breathing template; calculate the similarity between each breathing frame of the breath sample set and the breathing template, and take the minimum similarity as Bm;
calculate the short-time energy E of each unknown speech frame:
E = Σ x[n]², n = N0, …, N0+N-1
where n indexes the signal samples, x[n] is the n-th speech sample, N is the window length, and N0 is the sample index at which the window starts;
calculate the average short-time energy Ē of all unknown speech frames;
calculate the zero-crossing rate ZCR of each unknown speech frame:
ZCR = (1/(2N)) · Σ |sgn(x[n]) - sgn(x[n-1])|, n = N0+1, …, N0+N-1
with the same notation as above;
Step 2.2: choose one unknown speech frame;
Step 2.3: if the similarity B(X, T, V, S) between the chosen unknown speech frame and the breathing template is greater than the threshold Bm/2, the zero-crossing rate ZCR of the frame is less than 0.25, and the short-time energy E of the chosen frame is less than the average energy Ē of all unknown speech frames, judge the chosen unknown speech frame to be breath sound; if the above conditions are not met, judge it to be non-breath sound, where X denotes a breathing frame or unknown speech frame, T denotes the mean matrix of the breath sample set, V denotes the variance matrix of the speaker's breath sample, and S denotes the normalized singular value vector;
Step 2.4: choose the other unknown speech frames in turn and repeat step 2.3 until every unknown speech frame in the unknown speech segment has been judged;
Step 2.5: retain the breath sounds and reject the non-breath sounds to obtain the roughly separated breath sound.
4. The speaker recognition method based on respiratory characteristics according to claim 3, characterized in that the method of calculating the similarity between a breathing frame or unknown speech frame and the breathing template in step 2.1 comprises the following steps:
Step 2.1.1: input the breath sample set or the unknown speech segment and divide it into breathing frames or unknown speech frames 100 milliseconds in length; divide each breathing frame or unknown speech frame further into consecutive, mutually overlapping breathing subframes or unknown speech subframes, each subframe being 10 ms long with a 5 ms overlap between adjacent subframes;
Step 2.1.2: apply pre-emphasis to each subframe with the first-order difference filter to obtain the pre-emphasized breathing frames or unknown speech frames, the first-order difference filter H being:
H(z) = 1 - αz⁻¹
where α is the pre-emphasis parameter, α ≈ 0.095, and z is the z-transform variable of the sampled signal;
Step 2.1.3: compute the MFCC of each pre-emphasized subframe of each breathing frame or unknown speech frame to obtain the short-time cepstrum matrix of each frame, and remove the DC component from each column of the short-time cepstrum matrix to obtain the MFCC cepstrum matrix M(X) of each breathing frame or unknown speech frame;
Step 2.1.4: choose one breathing frame or unknown speech frame X;
Step 2.1.5: calculate the normalized difference matrix D of the chosen breathing frame or unknown speech frame:
where T is the mean matrix of the breath sample set, V is the variance matrix of the breath sample set, and M(X) is the MFCC cepstrum matrix of the chosen frame;
Step 2.1.6: multiply each column of D by a half Hamming window so that the low-order cepstral coefficients are strengthened:
D(:, j) = D(:, j) · hamming, j ∈ [1, Nc]
where Nc is the number of MFCC parameters in each subframe, i.e., the number of columns of D, and hamming denotes the Hamming window;
Step 2.1.7: calculate the component Cp of the similarity B(X, T, V, S) between the chosen frame X and the breathing template:
where n is the number of subframes in the chosen frame X, k ∈ [1, n], and Dkj is the j-th MFCC parameter of the k-th subframe of frame X;
calculate the other component Cn of the similarity B(X, T, V, S) between the chosen frame X and the breathing template;
Step 2.1.8: calculate the similarity between frame X and the breathing template:
B(X, T, V, S) = Cp · Cn;
Step 2.1.9: choose the MFCC cepstrum matrix of another breathing frame or unknown speech frame and repeat steps 2.1.5-2.1.8;
Step 2.1.10: repeat step 2.1.9 until the similarities between all breathing frames or unknown speech frames and the breathing template are obtained.
5. The speaker recognition method based on respiratory characteristics according to any one of claims 1-4, characterized in that the values used by the false-trough-eliminating boundary detection algorithm in step 3 include a breath duration threshold, an energy threshold, upper and lower zero-crossing rate ZCR thresholds and the spectral slope, so as to find the breathing boundaries accurately, and step 3 uses binary 0-1 indicators to mark the positions of breaths in the current speech segment precisely.
6. The speaker recognition method based on respiratory characteristics according to any one of claims 1-4, characterized in that calculating the similarity between the precisely separated breath sound of the unknown speech segment and each speaker's breath sample in the speaker sample database in step 5 comprises the following steps:
Step 5.1: let the MFCC feature vector of a speaker's breath sample in the speaker sample database be (a1, a2, …, an); calculate the mean matrix M of this MFCC feature vector:
M = (1/n) · Σ ai, i = 1, …, n
where ai is the i-th MFCC cepstrum matrix of the MFCC feature vector of the speaker's breath sample and n is the number of MFCC cepstrum matrices in that feature vector, i ∈ [1, 2, …, n];
calculate the variance matrix V of the MFCC feature vector of the speaker's breath sample in the speaker sample database:
Step 5.2: calculate the MFCC feature vectors of all precisely separated breath sounds of the unknown speech segment, denoted (b1, b2, …, bn), where bi is the MFCC cepstrum matrix of the i-th precisely separated breath sound of the unknown speech segment;
Step 5.3: normalize the feature vector (a1, a2, …, an) of the speaker's breath sample:
where r and c are the numbers of rows and columns of Sak, Sak being the normalized difference matrix of ak, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 5.4: sort (Sa1, Sa2, …, San) in ascending order to obtain (S1, S2, …, Sn);
Step 5.5: normalize the MFCC feature vectors (b1, b2, …, bn) of all precisely separated breath sounds of the unknown speech segment:
where r and c are the numbers of rows and columns of Sbk, Sbk being the normalized difference matrix of bk, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 5.6: calculate the similarity degree Pk between bk and the reference template: compare Sbk one by one with the elements of the ordered vector (S1, S2, …, Sn); Pk is the number of elements in the ordered vector smaller than Sbk divided by the total number of elements; calculate the average of Pk to obtain the similarity between the precisely separated breath sound of the unknown speech segment and the sample in the breath sample database.
7. The speaker recognition method based on respiratory characteristics according to any one of claims 1-4, characterized in that calculating the similarity between the precisely separated breath sound of the unknown speech segment and the breathing segment of the legitimate speaker in step 9 comprises the following steps:
Step 9.1: let the MFCC feature vector of the breathing segment of the legitimate speaker be (a1, a2, …, an); calculate the mean matrix M of this MFCC feature vector:
M = (1/n) · Σ ai, i = 1, …, n
where ai is the i-th MFCC cepstrum matrix of the MFCC feature vector of the breathing segment of the legitimate speaker and n is the number of MFCC cepstrum matrices in that feature vector, i ∈ [1, 2, …, n];
calculate the variance matrix V of the MFCC feature vector of the breathing segment of the legitimate speaker:
Step 9.2: calculate the MFCC feature vectors of all precisely separated breath sounds of the unknown speech segment, denoted (b1, b2, …, bn), where bi is the MFCC cepstrum matrix of the i-th precisely separated breath sound of the unknown speech segment;
Step 9.3: normalize the feature vector (a1, a2, …, an) of the breathing segment of the legitimate speaker:
where r and c are the numbers of rows and columns of Sak, Sak being the normalized difference matrix of ak, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 9.4: sort (Sa1, Sa2, …, San) in ascending order to obtain (S1, S2, …, Sn);
Step 9.5: normalize the MFCC feature vectors (b1, b2, …, bn) of all precisely separated breath sounds of the unknown speech segment:
where r and c are the numbers of rows and columns of Sbk, Sbk being the normalized difference matrix of bk, k ∈ [1, 2, …, n], i ∈ [1, 2, …, r], j ∈ [1, 2, …, c];
Step 9.6: calculate the similarity degree Pk between bk and the reference template: compare Sbk one by one with the elements of the ordered vector (S1, S2, …, Sn); Pk is the number of elements in the ordered vector smaller than Sbk divided by the total number of elements; calculate the average of Pk to obtain the similarity between the precisely separated breath sound of the unknown speech segment and the breathing segment of the legitimate speaker.
8. The speaker recognition method based on breathing characteristics according to any one of claims 1-4, wherein the method in step 7 for calculating the similarity between the selected test sample and each speaker's breath sample in the speaker sample database is the same as the method in step 5 for calculating the similarity between the breath sounds precisely separated from the unknown speech segment and each speaker's breath sample in the speaker sample database.
9. The speaker recognition method based on breathing characteristics according to claim 6, wherein the method for calculating the MFCC in step 1.3 and step 5.2 comprises: applying a Fast Fourier Transform (FFT) to the signal for which the MFCC is to be calculated, then calculating the complex sinusoid coefficients, and finally producing the output through a filter bank based on the Mel scale.
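Claim 9 names only the FFT and the Mel-scale filter bank; the sketch below fills in the textbook MFCC pipeline (power spectrum, triangular mel filters, log energies, DCT-II) as an assumption. The function name `mfcc_frame` and all parameter defaults are hypothetical, not from the patent:

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_mels=26, n_ceps=13):
    """Minimal MFCC for one windowed frame: FFT -> mel filter bank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum via FFT
    n_fft = len(frame)
    # standard triangular mel filter bank construction (assumed, not from claim 9)
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                # rising slope
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):               # falling slope
            fbank[m - 1, k] = (right - k) / (right - center)
    energies = np.log(fbank @ spec + 1e-10)          # log mel-filter energies
    # DCT-II of the log energies gives the cepstrum coefficients
    n = np.arange(n_mels)
    return np.array([np.sum(energies * np.cos(np.pi * k * (n + 0.5) / n_mels))
                     for k in range(n_ceps)])
```

A usage example would window a 512-sample frame (e.g. with a Hamming window) before passing it in, matching the framing and windowing described elsewhere in the patent's steps.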
CN201610626034.0A 2016-08-02 2016-08-02 A kind of method for distinguishing speek person based on respiratory characteristic Active CN106297805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610626034.0A CN106297805B (en) 2016-08-02 2016-08-02 A kind of method for distinguishing speek person based on respiratory characteristic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610626034.0A CN106297805B (en) 2016-08-02 2016-08-02 A kind of method for distinguishing speek person based on respiratory characteristic

Publications (2)

Publication Number Publication Date
CN106297805A CN106297805A (en) 2017-01-04
CN106297805B true CN106297805B (en) 2019-07-05

Family

ID=57664264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610626034.0A Active CN106297805B (en) 2016-08-02 2016-08-02 A kind of method for distinguishing speek person based on respiratory characteristic

Country Status (1)

Country Link
CN (1) CN106297805B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473563A (en) * 2019-08-19 2019-11-19 山东省计算中心(国家超级计算济南中心) Breathing detection method, system, equipment and medium based on time-frequency characteristics
CN111568400B (en) * 2020-05-20 2024-02-09 山东大学 Human body sign information monitoring method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1547191A (en) * 2003-12-12 2004-11-17 北京大学 Semantic and sound groove information combined speaking person identity system
JP2005530214A (en) * 2002-06-19 2005-10-06 Koninklijke Philips Electronics N.V. Mega speaker identification (ID) system and method corresponding to its purpose
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN102486922A (en) * 2010-12-03 2012-06-06 株式会社理光 Speaker recognition method, device and system
CN103280220A (en) * 2013-04-25 2013-09-04 北京大学深圳研究生院 Real-time recognition method for baby cry
CN104112446A (en) * 2013-04-19 2014-10-22 华为技术有限公司 Breathing voice detection method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9704495B2 (en) * 2012-02-21 2017-07-11 Tata Consultancy Services Limited Modified mel filter bank structure using spectral characteristics for sound analysis

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005530214A (en) * 2002-06-19 2005-10-06 Koninklijke Philips Electronics N.V. Mega speaker identification (ID) system and method corresponding to its purpose
CN1547191A (en) * 2003-12-12 2004-11-17 北京大学 Semantic and sound groove information combined speaking person identity system
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof
CN102486922A (en) * 2010-12-03 2012-06-06 株式会社理光 Speaker recognition method, device and system
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN104112446A (en) * 2013-04-19 2014-10-22 华为技术有限公司 Breathing voice detection method and device
CN103280220A (en) * 2013-04-25 2013-09-04 北京大学深圳研究生院 Real-time recognition method for baby cry

Also Published As

Publication number Publication date
CN106297805A (en) 2017-01-04

Similar Documents

Publication Publication Date Title
Kinnunen Spectral features for automatic text-independent speaker recognition
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
Bocklet et al. Automatic evaluation of Parkinson's speech: acoustic, prosodic and voice related cues.
Patel et al. Speech recognition and verification using MFCC & VQ
Samantaray et al. A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages
Yusnita et al. Malaysian English accents identification using LPC and formant analysis
Zhao et al. Speaker identification from the sound of the human breath
Usman On the performance degradation of speaker recognition system due to variation in speech characteristics caused by physiological changes
CN106297805B (en) A kind of method for distinguishing speek person based on respiratory characteristic
Chamoli et al. Detection of emotion in analysis of speech using linear predictive coding techniques (LPC)
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Kadiri et al. Discriminating neutral and emotional speech using neural networks
Kumari et al. An efficient algorithm for Gender Detection using voice samples
Deshpande et al. Automatic Breathing Pattern Analysis from Reading-Speech Signals
Sahoo et al. Analyzing the vocal tract characteristics for out-of-breath speech
Kumar et al. Text dependent speaker identification in noisy environment
Mohamad Jamil et al. A flexible speech recognition system for cerebral palsy disabled
Dumpala et al. Analysis of the Effect of Speech-Laugh on Speaker Recognition System.
Tavi Prosodic cues of speech under stress: Phonetic exploration of finnish emergency calls
Kabir et al. Vector quantization in text dependent automatic speaker recognition using mel-frequency cepstrum coefficient
Stadelmann et al. Unfolding speaker clustering potential: a biomimetic approach
Julia et al. Detection of emotional expressions in speech
Ozdas Analysis of paralinguistic properties of speech for near-term suicidal risk assessment
Elisha et al. Automatic detection of obstructive sleep apnea using speech signal analysis
Patil et al. Person recognition using humming, singing and speech

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant