CN1871639A - Topological voiceprints for speaker identification - Google Patents

Topological voiceprints for speaker identification

Info

Publication number
CN1871639A
Authority
CN
China
Prior art keywords
speaker
topological
group
sound
topological index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200480030850
Other languages
Chinese (zh)
Inventor
贝尔纳多·加布里埃尔·明德林
马科斯·阿尔贝托·特雷维桑
曼努埃尔·卡米洛·埃吉亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BUENOS AIRES, University of
Universidad Nacional de Quilmes
University of California
Original Assignee
BUENOS AIRES, University of
Universidad Nacional de Quilmes
University of California
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BUENOS AIRES, University of; Universidad Nacional de Quilmes; University of California

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The speaker recognition techniques of this application use a topological description of the spectral properties of a speaker's voice as a biometric characterization of the speaker. Distinctly different from computing distances between spectral curves obtained from the voices of different speakers, as in various spectral analysis methods, such topological features provide a one-to-one correspondence between a subject and a template represented by a set of rational numbers.

Description

Topological voiceprints for speaker identification
This application claims priority to U.S. Provisional Patent Application No. 60/497,007, entitled "TOPOLOGICAL VOICEPRINTS FOR SPEAKER IDENTIFICATION" and filed on August 20, 2003, the entire content of which is incorporated herein by reference.
Technical field
This application relates to recognizing speakers by their voices.
Background
The voices of different people have different acoustic characteristics. Differences in the voice characteristics extracted from different people can serve as a unique tool for distinguishing and recognizing speakers. To some extent, speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information obtained from voice or speech signals. Depending on the application, speaker recognition can be divided into speaker identification and speaker verification. Speaker identification determines which registered speaker, among a set of known speakers, produced a given utterance. The utterance is analyzed and compared with the voice information of the known speakers to determine whether there is a match. In speaker verification, an unknown speaker first claims the identity of a known speaker; an utterance of the unknown speaker is then obtained and compared with the information of the claimed known speaker to determine whether there is a match.
Speaker recognition technology has many uses. For example, a speaker's voice can be used to control access to restricted facilities, computer systems, databases and various services, such as telephone access to banking, database services, shopping and voice mail, and access to secured equipment and computer systems. In both speaker identification and verification, users are required to "enroll" in the speaker recognition system by providing speech samples, so that the system can characterize and analyze the user's voice patterns.
In the field of speaker recognition, a number of methods have been developed that distinguish speakers by the distances between vectors of voice features (for example, spectral parameters). In these spectral analysis methods, the distance between the extracted voice features and the voice template of a known speaker is computed. Based on statistical and other suitable analysis, a received voice or utterance is attributed to a known speaker if the computed distance falls within a predetermined threshold for that speaker.
Summary of the invention
The speaker recognition techniques described in this application were developed in part in recognition of several technical limitations of spectral analysis methods based on computing distances between spectral parameters. For example, such methods may not be sufficiently accurate, at least because different utterances of the same speaker may have slightly different spectra, and because the choice of a suitable threshold depends essentially on the voice spectrum database used to fit it.
The speaker recognition techniques of this application use topological features computed from the voice of a single speaker to form a set of discrete rational numbers (for example, integers) as the biometric characterization of each speaker, and use these rational numbers to identify the speaker or subject under examination. Distinctly different from computing distances between spectral curves obtained from the voices of different speakers, as in various spectral analysis methods, such topological features provide a one-to-one correspondence between a subject and a template, or voiceprint, represented by a set of rational numbers. A database of rational numbers for different known speakers can therefore be built for various applications, including speaker identification and verification. Such a database is small compared with the conventional personal voice databases used in spectral analysis methods. Each voiceprint comprises a set of topological parameters, in the form of discrete integers or rational numbers, that distinguish a speaker from other speakers, and is obtained by embedding the spectral functions of the speaker's voiced sounds.
In one embodiment, a method for determining a speaker's identity by voice is described. First, a set of topological indices is extracted by embedding the spectral functions of the speaker's voiced sounds. Next, topological indices are selected as the speaker's biometric characterization, for identifying the speaker and verifying the speaker against other speakers.
In another embodiment, the topological parameters are rational numbers, for example integers, obtained from relative rotation rates (rrr). Each subject is assigned a set of rational numbers that can be reconstructed from short utterances. A subset of these numbers does not vary across utterances of the same speaker but differs from subject to subject. In this way, a standard method of describing a voice can be established regardless of the size and characteristics of the database. The set of rational numbers that characterizes a voice is very stable and can easily be encoded in various devices (for example, magnetic or printed media).
A typical method described in this application includes the following steps. A speaker's voice signal is recorded and digitized. The linear prediction coefficients of the discrete signal are computed. A power spectrum is computed from these linear prediction coefficients. A three-dimensional periodic orbit is then constructed from this power spectrum, and a second three-dimensional periodic orbit is constructed from a reference power spectrum (for example, a natural reference signal). Topological information about the periodic orbits of the voice signal and of the natural reference signal is then obtained. A selected set of topological indices is used to distinguish the speaker who produced the voice signal from other speakers with different topological indices.
This application also describes speaker recognition systems. In one example, a speaker recognition system includes a microphone to receive a voice sample from a speaker; a reader head to read, from a portable storage device, voice identification data consisting of rational numbers that uniquely represent the voice of a known speaker; and a processing unit. The processing unit is connected to the microphone and the reader head and is operable to extract topological information from the speaker's voice sample to produce topological discrete numbers from the sample. The processing unit is also operable to compare the discrete numbers of the known speaker with the topological discrete numbers derived from the voice sample to determine whether the speaker is the known speaker. Because the numeric code of discrete rational numbers used for speaker recognition has a sufficiently small file size, one or more voiceprints of one or more speakers can be stored in a portable storage device carried by the user.
These and other examples and embodiments are described in greater detail in the drawings, the detailed description and the claims.
Brief description of the drawings
Fig. 1 shows the periodic functions used for the embedding, from a single speaker (solid lines) and from the universal reference (dotted line). These functions are built from $\log|H(f)|^2$ taken over half of the original period.
Fig. 2 shows three examples, for each of two different speakers, of $\log|H(f)|^2$ over the whole period of the function, obtained with the maximum entropy approximation. Beyond the second formant, the spectra cluster naturally into two distinct groups. The original speech segments correspond to the Spanish vowel [a] extracted from normal speech utterances.
Fig. 3 shows an example of the delay embedding (Δf = 40 Hz) of the function F(f) computed from a voiced segment (solid line).
Fig. 4 shows the vowelprints of three male speakers of nearly the same age, built from short vowel segments (about 100 ms) of about 10 utterances collected in different enrollment sessions.
Fig. 5A shows an example of a voice sample, as a function of time, obtained from a speaker through a microphone.
Fig. 5B shows the power spectrum obtained from the voice sample of Fig. 5A.
Fig. 5C shows the linking of two three-dimensional orbits 1 and 2, as used in the topological method for extracting rotation numbers from a voice signal.
Fig. 5D shows the relative rotation numbers obtained from the relative topological relationship between the orbit built from the voice sample and the reference orbit built from the reference signal.
Figs. 6A, 6B and 6C show an example of the process of selecting the constant rotation numbers, from several rotation matrices of the same voiced sound of a given speaker, as that speaker's voiceprint.
Fig. 7 shows an example of comparing the voice of an unknown speaker with the voiceprint of a known speaker by full-match analysis.
Fig. 8 shows the steps of verifying two candidates against the three voiceprints of three known speakers.
Fig. 9 shows an example of a speaker recognition system.
Fig. 10 shows the operation of the system of Fig. 9.
Detailed description
The speaker recognition techniques described here can be implemented in various forms. In one embodiment, a set of discrete rational numbers (for example, integers) is extracted from a speaker's voice sample. A subset of the extracted rational numbers is present in every utterance of the speaker and, under normal speaking conditions and in low-noise environments, does not change from one utterance to the next. This subset is called the voiceprint and is used as the speaker's biometric characterization, for identifying the speaker and verifying the speaker against other speakers.
This biometric characterization can therefore be used to verify a speaker as follows. First, a voice sample from a second speaker is analyzed to extract a set of rational numbers for the second speaker. This set of discrete rational numbers is compared with the speaker's voiceprint, without using any threshold in the comparison. If the second speaker's set of rational numbers fully matches the speaker's voiceprint, the second speaker is verified to be the speaker. If there is no match, the second speaker is deemed to be a different person.
In an embodiment of speaker identification, voiceprints are extracted from the voice samples of different known speakers. A voice sample from an unknown speaker is then analyzed to extract a set of rational numbers for the unknown speaker, and this set of discrete rational numbers is compared with the voiceprints of the known speakers to determine whether there is a match, thereby identifying whether the unknown speaker is one of the known speakers.
Note that in the verification and identification processes above, different sets of discrete rational numbers are compared to determine whether they match. There is no need to determine whether the difference between two spectral features lies within a selected threshold. This and other features make the speaker recognition techniques described here superior to the various spectral analysis methods based on computing distances between spectral parameters.
Voice recognition is non-invasive and in this respect is preferable to other biometric methods such as retinal scanning. However, spectral analysis methods for speaker recognition are not as widely used as other biometric methods, including fingerprinting, because when spectral features of different voices are compared it is difficult to decide how close is close enough for a positive identification. The techniques described here avoid the uncertainty of comparing spectral features against a threshold and provide a new way of extracting biometric characteristics from voice spectral information.
It is well known that the spectral properties of the human voice carry unique traits of the speaker and can therefore be used for speaker recognition. In producing voiced sounds, the spectrally rich sound signal generated by the vocal folds modulating the airflow is filtered by the speaker's vocal tract. The resonances of the vocal tract, which acts as a passive filter, are determined by the speaker's anatomy and can therefore be used to identify the speaker. The physics of the human voice can be described by the standard source-filter theory. In the production of vowel-like voiced sounds, the airflow induces periodic oscillations of the vocal folds. These oscillations produce a time-varying pressure fluctuation at the input of a passive linear filter, the vocal tract. The separation between source and filter assumes that feedback on the vocal fold oscillations is negligible, a hypothesis thoroughly validated for normal speech conditions by Laje et al., Phys. Rev. E 64, 056201 (2001). The spectrally rich input pressure presents harmonics of a fundamental frequency of about 100 Hz. The vocal tract selects some frequencies from these harmonics. The spectrum of a voiced sound thus carries information about the vocal tract; because each speaker's vocal tract is unique, the spectrum of a voiced sound can be used as a biometric characterization of the speaker.
Typical methods in the field of speaker recognition (for example, the various spectral analysis methods) use feature vectors whose values characterize different subjects, forming clusters in a multidimensional space, and then separate the clusters associated with different subjects by a metric on the feature vectors. In the framework of spectral voice features, one way to verify identity is to compute a distance (distortion measure) between features calculated from utterances, for example the integral of the difference between two spectra on a log magnitude scale. Another distortion measure is based on the difference between spectral slopes, for example the first derivative of the power spectrum with respect to the logarithm of frequency.
These spectral analysis methods have a number of technical shortcomings. Fig. 1 shows an example of the log power spectra of three different utterances by the same speaker. For different utterances of the same speaker, the power spectra differ slightly in their spectral peaks and contours. Therefore, when differences between spectral features are computed, measuring the distance between curves and deciding what error is acceptable for speaker recognition is inherently difficult and complex. For example, the results of such spectral methods typically scatter over ranges that differ from speaker to speaker. Likewise, it is uncertain where to set the boundary between acceptable values for two speakers whose ranges are close.
The speaker recognition techniques described here use a completely different approach to extracting unique biometric characteristics from voice and utterances. The spectral comparison described above can alternatively be carried out through another set of coefficients known as cepstral coefficients, which are the Fourier amplitudes of the spectral function. To some extent, this amounts to treating the voice spectrum as a "time" series in which the frequency f plays the role of time. From this viewpoint, the inventors found that techniques used in dynamical systems theory to compare two periodic orbits can be applied to the analysis of voiced-sound spectra. This way of representing the information entirely avoids computing differences between spectral features. In particular, the inventors explored the use of topological tools that capture the main morphological features of the orbits while disregarding slight deformations. Topological analysis of nonlinear dynamical systems is a well-established field; Robert Gilmore describes its fundamentals and analytical framework in detail in "Topological analysis of chaotic dynamical systems," Reviews of Modern Physics, Vol. 70, No. 4, pp. 1455-1592 (October 1998).
The following sections describe how topological tools developed in a different field, that of dynamical systems, are used to characterize spectra by sets of rational numbers. In particular, within a relatively small group of speakers there are subsets of these rational numbers that appear to encode speaker identity. These results point to a new direction in identifying subjects by voice: the arrangement of rational numbers defines a voiceprint that stands on its own, with no need for any acceptance/rejection threshold.
In the analysis of three-dimensional dynamical systems, periodic orbits are closed curves that can be characterized by the way they are knotted and linked to each other and to themselves. See, for example, Solari and Gilmore, "Relative rotation rates for driven dynamical systems," Physical Review A 37, pp. 3096-3109 (1988); Mindlin et al., "Classification of strange attractors by integers," Physical Review Letters, Vol. 64, pp. 2350-2353 (1990); and Mindlin and Gilmore, Physica D 58, p. 229 (1992). To apply this analysis to speaker recognition, the power spectrum of a voiced sound, on a logarithmic scale, is treated as a periodic data string, using techniques commonly applied in the analysis of periodic "time" series. A delay embedding can then be used to build a three-dimensional orbit from this data string.
Fig. 2 shows examples of the log power spectra of three utterances for each of two speakers. The spectra cluster naturally into two groups corresponding to the two speakers. The topological properties of their embeddings turn out to be suitable tools for identity verification.
The relative rotation rates described in the publication of Solari and Gilmore cited above were introduced to help describe the topological invariants of periodically driven two-dimensional dynamical systems, and can be used to extract biometric information from the spectral properties of the human voice. Relative rotation rates can also be constructed for a large class of autonomous dynamical systems in $\mathbb{R}^3$: those for which a Poincaré section can be found.
To describe the frequency response of the vocal tract, the maximum entropy approximation of the power spectrum of each stored voiced segment is computed. This is done by computing m linear prediction coefficients of the voiced segment $\{y_n\}$, sampled at rate $r = 1/\Delta$:
$$y_n = \sum_{k=1}^{m} d_k\, y_{n-k} + x_n \qquad (1)$$
where the lp (linear prediction) coefficients $d_1, d_2, \ldots, d_m$ are assumed constant over the whole speech segment and are chosen so that the residual $x_n$ is minimal. These lp coefficients can be used to estimate the power spectrum $|H(f)|^2$ as a rational function with m poles:
$$H(f) = \frac{d_0}{1 - \sum_{k=1}^{m} d_k\, e^{i k 2\pi f \Delta}} \qquad (2)$$
which is periodic on the Nyquist interval $[-1/(2\Delta),\, 1/(2\Delta)]$. The spectra of the two speakers in Fig. 2 are examples of spectra reconstructed from equation (2).
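As a minimal sketch of equations (1) and (2), the following Python code fits the lp coefficients and evaluates the log power spectrum. The text does not say how the coefficients are computed, so the autocorrelation (Yule-Walker) method used here is an assumption, as are the function names `lp_coefficients` and `log_power_spectrum` and the choice of $d_0$ as the square root of the residual power.

```python
import numpy as np

def lp_coefficients(y, m=13):
    """Fit d_1..d_m of equation (1) by the autocorrelation method (an assumed choice)."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    n = len(y)
    # Biased autocorrelation estimates r[0], ..., r[m]
    r = np.array([np.dot(y[:n - k], y[k:]) for k in range(m + 1)]) / n
    # Normal equations R d = r[1:], where R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(m)] for i in range(m)])
    d = np.linalg.solve(R, r[1:])
    # Gain d_0 taken as the square root of the residual power (assumption)
    d0 = np.sqrt(max(r[0] - np.dot(d, r[1:]), 0.0))
    return d0, d

def log_power_spectrum(d0, d, freqs, delta):
    """Evaluate log|H(f)|^2 from equation (2) at the frequencies in freqs (Hz)."""
    k = np.arange(1, len(d) + 1)
    phases = np.exp(1j * 2.0 * np.pi * np.outer(freqs, k) * delta)
    denom = 1.0 - phases @ d
    return np.log(d0 ** 2 / np.abs(denom) ** 2)
```

Because the coefficients are real, the spectrum computed this way is symmetric about f = 0 and repeats with period 1/Δ, which is the periodicity the embedding relies on.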
Equation (2) with m = 13 coefficients is used to estimate the logarithm of the power spectral function, $\log|H(f)|^2$. This spectrum is symmetric about f = 0, so only one half of each spectrum is relevant to the analysis and extraction of the topological rational numbers. In processing the raw voice spectrum data, we remove the difference between $\log|H(f)|^2$ and $\log|H(\pi/\Delta)|$ by adding a linear function, and subtract the mean value. The resulting spectral function F(f) is periodic, with a period equal to half of the original period.
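The preprocessing step can be sketched as follows. The text leaves the exact endpoints of the correction ambiguous; this sketch assumes that the linear function is chosen so that the values at the two ends of the half-period coincide, which is what makes F(f) periodic. The function name `periodize_half_spectrum` and the sampling convention (uniform grid, both endpoints included) are assumptions for illustration.

```python
def periodize_half_spectrum(samples):
    """Turn log-spectrum samples over one half-period into a zero-mean periodic sequence.

    samples: values on a uniform frequency grid that includes both endpoints.
    """
    n = len(samples) - 1
    gap = samples[0] - samples[-1]
    # Linear ramp: zero at the first point, closes the endpoint gap at the last
    closed = [s + gap * i / n for i, s in enumerate(samples)]
    periodic = closed[:-1]  # the last point now duplicates the first; drop it
    mean = sum(periodic) / len(periodic)
    return [v - mean for v in periodic]
```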
Referring again to Fig. 1, several examples of F(f) for different utterances of the same speaker are shown together with the reference spectral function. The resulting function F(f) can be embedded in phase space using a delay δ. Fig. 3 shows an example of an orbit obtained with δ = 40 Hz. The embedding defined by the delay coordinates F(f), F(f-δ) and F(f-2δ) always shows a hole around the line F(f) = F(f-δ) = F(f-2δ) in phase space. Therefore, the half-plane defined by F(f) = F(f-2δ); F(f-δ) < F(f-2δ) provides a good Poincaré section.
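The delay embedding can be sketched in a few lines. This is an illustration under stated assumptions: F is given as one period of samples on a uniform frequency grid with spacing `df` (a hypothetical parameter name), and the delay is rounded to a whole number of samples.

```python
def delay_embed(F, df, delta=40.0):
    """Return the points (F(f), F(f - delta), F(f - 2*delta)) of the embedded orbit.

    F: one period of samples of the periodic spectral function, spaced df Hz apart.
    Indices wrap around because F is periodic, so the orbit is a closed curve.
    """
    n = len(F)
    shift = int(round(delta / df))
    return [(F[i], F[(i - shift) % n], F[(i - 2 * shift) % n]) for i in range(n)]
```

With `df = 10.0` and the default delay of 40 Hz, each coordinate lags the previous one by four samples.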
The relative rotations with respect to a reference are chosen as the topological characterization of these periodic orbits. For example, a universal reference is used: a flat, non-articulated vocal tract (the zero assumption for a voiced sound). This universal reference is independent of the database and, for the examples described in this application, corresponds to the embedding of the power spectrum of an open-closed uniform tube of length 17.5 cm.
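For orientation, a uniform tube closed at one end and open at the other has the textbook quarter-wave resonances $f_n = (2n-1)\,c/(4L)$. The sketch below evaluates them for L = 17.5 cm; the speed of sound c = 350 m/s is an assumed value that the text does not give, and the function name is hypothetical.

```python
def tube_resonances(length_m=0.175, c=350.0, count=4):
    """Quarter-wave resonances (Hz) of a uniform open-closed tube.

    c is the speed of sound in m/s (assumed value, not stated in the text).
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]
```

Under these assumptions the first resonances fall near 500, 1500 and 2500 Hz, evenly spaced as expected for a flat, non-articulated tract.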
Assuming that the orbits have $p_A$ and $p_B$ periods, the relative rotations of these embedded spectra are computed as follows. The relative rotation matrix of orbits A and B, $M \in \mathbb{Z}^{p_A \times p_B}$, is built so that the element $M_{ij}$ equals the sum of the signed crossings of the i-th period of orbit A with respect to the j-th period of orbit B. The signed crossings can be computed by projecting the two orbits A and B onto a two-dimensional subspace. In this projection, the tangent vectors to the two segments at a crossing are drawn in the direction of the flow. The upper tangent vector is rotated onto the lower tangent vector; if this rotation is right-handed (left-handed), the crossing is assigned +1 (-1). The elements of the relative rotation matrix built in this way are rational numbers.
This relative rotation matrix is related to the relative rotation rates by:
$$R_{ij}(A,B) = \frac{1}{p_A\, p_B} \sum_{k=0}^{p_A p_B - 1} M_{i+k,\, j+k} \qquad (3)$$
where periodic boundary conditions are applied to the indices of the matrix.
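Equation (3) can be evaluated directly once the crossing matrix M is known. The sketch below assumes M is a plain nested list of integers (how the crossings are counted is not shown here) and uses exact fractions so that the result stays a rational number; the function name is hypothetical.

```python
from fractions import Fraction

def relative_rotation_rate(M, i, j):
    """Evaluate R_ij from equation (3) with periodic boundary conditions on the indices."""
    p_a, p_b = len(M), len(M[0])
    total = sum(M[(i + k) % p_a][(j + k) % p_b] for k in range(p_a * p_b))
    return Fraction(total, p_a * p_b)
```

Using `Fraction` rather than floating point matters for the later matching step, which compares rotation numbers for exact equality.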
To build a speaker's voice signature, each vowel spoken by the speaker is characterized. One way to characterize a vowel is to superimpose all the relative rotation matrices corresponding to the same voiced sound and the same speaker and to look for coincidences among them, that is, rotation numbers that do not change when they are computed from different utterances by the speaker. These coincidences are called "stable rotation numbers" and are rational numbers. Tests show that these stable rotation numbers are unique to a speaker and differ between speakers. A speaker's stable rotation numbers are therefore analogous to a fingerprint and can be used as a vocal biometric characterization to distinguish the speaker from other speakers.
The arrangement of the stable rotation numbers, kept at their positions in the original matrix, is called the speaker's "vowelprint." The set of a speaker's vowelprints is called the "voiceprint." Fig. 4 shows three examples of vowelprints of the Spanish vowel [a] for three male subjects of nearly the same age.
The voiceprint described above is a set of discrete rational numbers that represents the speaker's unique vocal biometric characterization. A speaker can be recognized by comparing the rational numbers obtained from the speaker's voice with the set of rational numbers obtained from a known speaker. This comparison between two sets of discrete rational numbers avoids computing distances between spectral features, and avoids the uncertainty inherent in matching different spectral features against some predetermined threshold. In addition, the digital file holding these rational numbers is relatively small compared with the typically larger voice databases of spectral features used in spectral analysis methods. A person's voiceprint can therefore be stored as a numeric code in various portable storage devices, for example magnetic strips on credit cards, identification cards (for example, driver's licenses) and bank cards; bar codes printed on various surfaces such as printed documents (for example, passports and driver's licenses) and identification cards; miniature electronic storage devices; and others. A person can conveniently carry the voiceprint and use it for identification, verification and other purposes.
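To illustrate how little data such a numeric code needs, the sketch below serializes one vowelprint to a short text string and reads it back. The text does not define a storage format, so the 'row,col=num/den' layout and both function names are purely hypothetical; empty (unstable) matrix positions are represented by `None`.

```python
from fractions import Fraction

def encode_voiceprint(template):
    """Serialize the stable entries of one template as 'row,col=num/den' fields."""
    fields = []
    for i, row in enumerate(template):
        for j, value in enumerate(row):
            if value is not None:
                q = Fraction(value)
                fields.append(f"{i},{j}={q.numerator}/{q.denominator}")
    return ";".join(fields)

def decode_voiceprint(code, rows, cols):
    """Rebuild a rows x cols template; positions absent from the code stay None."""
    template = [[None] * cols for _ in range(rows)]
    for field in filter(None, code.split(";")):
        position, value = field.split("=")
        i, j = (int(s) for s in position.split(","))
        template[i][j] = Fraction(value)
    return template
```

A template with a handful of stable entries encodes to a few dozen characters, which is in line with the claim that a voiceprint fits on a magnetic strip or bar code.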
In various embodiments, a computer or a microprocessor-based electronic device or system can be used to receive and process voice signals from a speaker and to extract the rational numbers of the speaker's voiceprint. The voiceprint can be stored for subsequent speaker identification and verification. For example, a microphone connected to the computer or to the microprocessor-based device or system can be used to obtain voice samples from a speaker. The voice signal received by the microphone is digitized, and the digitized signal is then processed using the orbit analysis described above to obtain a set of stable rotation numbers as the voiceprint of each speaker.
Fig. 5A shows an example of a speaker's voice signal, produced by a microphone, as a function of time. Some segments of the voice signal are selected to form the voice spectra for further processing. Fig. 5B shows an example of the voice power spectrum obtained from a signal segment in Fig. 5A, together with the spectrum of the selected reference signal. In actual training of the system, training utterances are recorded from a group of speakers in different recording sessions.
Fig. 5C shows an example of the linking of two simple three-dimensional orbits 1 and 2. As described above, the knotting and linking of the two orbits 1 and 2 can be used to obtain relative rotation indices or relative rotation numbers. An orbit generated from a speaker's voice signal, similar to that in Fig. 3, and a reference orbit can be used to obtain a relative rotation matrix based on the relative topological relationship of the two orbits. Fig. 5D shows an example of the relative rotation numbers obtained by topological analysis of a voice sample. To extract the rational numbers, a periodic function is built from the spectral features of the recorded voiced sound. A closed three-dimensional orbit is built using phase-space reconstruction techniques. Following the analysis of three-dimensional dynamical systems, the linking and knotting properties are extracted from the closed orbits or curves. The extracted sets of rational numbers (rotation numbers) are arranged in matrix form as shown in Fig. 5D. A template is then formed from the final arrangement of the rotation numbers that remain unchanged across each speaker's utterances. The matrix that contains only the stable numbers, at their positions in the original matrix, can be used as the speaker's voice signature or voice template.
Figs. 6A, 6B and 6C show the construction of a voice template for a particular speaker. The rotation rates of the orbit of the voice signal F(f) with respect to the selected reference can be computed. For a function F(f) whose embedded orbit has p periods and a reference with q periods, a p × q matrix of rotation numbers is obtained. Fig. 6A shows an example of a 4 × 4 matrix of rotation numbers. The matrix element (i, j) equals the number of turns of the i-th period of the speaker's periodic orbit with respect to the j-th period of the reference. Each matrix element is a rotation number. The voice template is computed as the rotation numbers that are constant across all utterances of the training set. As an example, Fig. 6B shows four different matrices obtained from the same speaker for the same voiced sound. Among the four matrices, some rotation numbers vary from matrix to matrix. Fig. 6B also shows four shaded matrix elements that are constant across the four matrices. Based on the four samples in Fig. 6B, the final matrix of the voice template shown in Fig. 6C is built. Like the original matrices, the template is a p × q matrix, except that only the constant elements are kept and the remaining elements are empty. The empty elements correspond to the topological indices that vary most. There is one template for each speaker and each voiced sound. The training process is repeated for each speaker to build a voice database of the templates of all speakers.
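The template construction described for Figs. 6A to 6C amounts to an element-wise comparison of the training matrices. A minimal sketch, assuming each matrix is a nested list of equal shape and that an empty template element is represented by `None` (both choices, and the function name, are assumptions):

```python
def build_template(matrices):
    """Keep the rotation numbers that are identical in every training matrix.

    matrices: list of p x q nested lists for one speaker and one voiced sound.
    Returns a p x q nested list; elements that vary across matrices become None.
    """
    first = matrices[0]
    rows, cols = len(first), len(first[0])
    template = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if all(m[i][j] == first[i][j] for m in matrices):
                template[i][j] = first[i][j]
    return template
```

The comparison is exact equality, so the rotation numbers should be stored as integers or exact fractions rather than floating-point values.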
Once a database of voice models of known speakers has been built and stored, or made accessible to the speaker recognition system, the system can verify or identify a speaker at any time. First, a voice sample is obtained from the unknown speaker who claims membership in the database, and a set of rotation matrices is computed from that sample. These test matrices are compared, for each voiced sound, with the corresponding voice models. The unknown speaker is verified only when the test matrices fully match one of the voice models in the database (model matching). As long as the full-match criterion is used, no acceptance or rejection threshold is needed.
The left side of Fig. 7 shows an example of a speaker's voice model (for example, a code stored on a credit card), and the right side shows a test matrix obtained from an unknown speaker. Of the 6 constant rotation numbers in the voice model on the left, only 3 are matched by rotation numbers in the matrix on the right. The match is therefore not complete in this example, and the unknown speaker is determined not to be the known speaker.
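The full-match criterion can be illustrated with a short sketch that counts how many stable numbers of a model are reproduced by a test matrix. The names `count_matches` and `is_full_match` and the matrix values are hypothetical; the numbers below are chosen only so that 3 of 6 stable entries match, mirroring the situation described for Fig. 7 without reproducing its data.

```python
from fractions import Fraction as F


def count_matches(model, test_matrix):
    """Return (matched, required) over the non-empty entries of the model."""
    matched = required = 0
    for model_row, test_row in zip(model, test_matrix):
        for stable, value in zip(model_row, test_row):
            if stable is None:  # empty element: not part of the voiceprint
                continue
            required += 1
            if stable == value:
                matched += 1
    return matched, required


def is_full_match(model, test_matrix):
    """Accept only if every stable number is reproduced exactly."""
    matched, required = count_matches(model, test_matrix)
    return required > 0 and matched == required


# Hypothetical model with 6 stable numbers and a test matrix matching 3.
model = [
    [F(1, 2), None, F(1, 3)],
    [None, F(2, 3), F(0)],
    [F(1, 4), None, F(3, 4)],
]
test_matrix = [
    [F(1, 2), F(1, 5), F(1, 2)],
    [F(0), F(2, 3), F(1, 3)],
    [F(1, 4), F(1, 2), F(1, 2)],
]
result = count_matches(model, test_matrix)
```

No threshold appears anywhere in the decision: the outcome depends only on exact equality at the stable positions.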
The topological method for speaker recognition described above has been tested successfully. A voice database was built by recording each of 18 speakers repeating, six times, a sentence containing the 5 Spanish vowels, and then constructing topological matrices from short segments (about 100 ms) taken from those vowels. The final voice database contains the voiceprints computed from the topological matrices of each of the 18 speakers.
Next, a voice sample is recorded from a speaker who claims membership in the database, and topological matrices are computed from the recorded sample. These candidate matrices are compared with the corresponding vowel lines in the database. The speaker is identified as a member of the database only when the set of candidate matrices fully matches a single stored voiceprint. Here, a full match means that all the stable numbers in all the vowel lines appear in the corresponding candidate matrices.
Fig. 8 shows an example comparison against a single vowel line obtained from the 18 speakers. In Fig. 8, two candidate matrices are compared with the database of models. For each of the two candidate matrices, a single vowel line is shown. If a speaker's candidate matrices fully match one of the stored voiceprints, the speaker is identified as a member of the database. The gray regions in the models correspond to the positions in the matrix that contain stable numbers. For a candidate to be identified as a member of the database (that is, a full match), the numbers at those positions in the candidate matrix must be identical to the stable numbers of the model. Each of the 108 utterances in the voice database was used as a candidate for identification. The test achieved perfect recognition, with no false positive or false negative identifications.
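The identification test against a database of voiceprints can be sketched as below. This is an assumption-laden illustration: the dictionary layout (speaker name, then vowel, then model matrix), the function names, and all rotation numbers are invented here; the sketch only shows the rule that a candidate is identified when its matrices fully match exactly one stored voiceprint across all vowels.

```python
from fractions import Fraction as F


def matches(model, matrix):
    """True if every stable (non-None) model entry equals the candidate entry."""
    return all(
        stable is None or stable == value
        for model_row, row in zip(model, matrix)
        for stable, value in zip(model_row, row)
    )


def identify(database, candidate):
    """Return the single speaker whose voiceprint fully matches, else None."""
    hits = [
        speaker
        for speaker, voiceprint in database.items()
        if all(
            vowel in candidate and matches(model, candidate[vowel])
            for vowel, model in voiceprint.items()
        )
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical two-speaker, two-vowel database of stable numbers.
database = {
    "speaker_a": {
        "a": [[F(1, 2), None], [None, F(1, 3)]],
        "e": [[None, F(2, 3)], [F(0), None]],
    },
    "speaker_b": {
        "a": [[F(1, 2), None], [None, F(1, 4)]],
        "e": [[None, F(2, 3)], [F(1, 2), None]],
    },
}
member = {
    "a": [[F(1, 2), F(3, 4)], [F(1, 5), F(1, 3)]],
    "e": [[F(1, 7), F(2, 3)], [F(0), F(1, 2)]],
}
outsider = {
    "a": [[F(1, 2), F(0)], [F(0), F(1, 5)]],
    "e": [[F(1, 7), F(2, 3)], [F(0), F(1, 2)]],
}
```

Here `identify(database, member)` selects `speaker_a`, while `outsider` matches no stored voiceprint and is rejected.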
The selection of a subset of rotation numbers made when constructing a voiceprint suggests that some information might be lost. To test this hypothesis, each voiceprint in the database was replaced with the set of all the individual matrices from which the voiceprint was built, so that all the topological information was preserved. Each of the 108 utterances in the database was used as a candidate for identification. The number of coincidences between the candidate matrix and the set of matrices characterizing each speaker in the database was counted. The results show that this method performs poorly, since several false positive and false negative identifications were found. It therefore appears that the topological stable numbers enhance the relevant spectral information by discarding the unnecessary information carried by the indices that change from utterance to utterance.
In addition, the topological method was compared with a metric method. In the metric method, the quadratic distance between spectra is computed, and coincidences are computed under an optimal threshold. In this case, each speaker's voiceprint in the database is replaced by the spectral functions used to construct the rotation matrices. As a means of speaker recognition, the metric method performed worse than the topological method.
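For contrast, a metric comparison of the kind described above might look like the following sketch. It assumes that "quadratic distance" means the sum of squared differences between two sampled spectra, which the text does not spell out; the function names, spectra, and thresholds are illustrative only.

```python
def quadratic_distance(spectrum_a, spectrum_b):
    """Sum of squared differences between two equally sampled spectra."""
    return sum((a - b) ** 2 for a, b in zip(spectrum_a, spectrum_b))


def accept(candidate, stored, threshold):
    """Metric decision: accept when the distance is at or below a threshold."""
    return quadratic_distance(candidate, stored) <= threshold


# Illustrative spectra; a real threshold would have to be tuned per database.
stored = [1.0, 0.5, 0.25]
candidate = [1.0, 0.75, 0.25]
distance = quadratic_distance(candidate, stored)
```

The same candidate is accepted with a threshold of 0.1 and rejected with 0.01, which illustrates why a metric method needs a threshold chosen for the database at hand.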
The topological method offers several useful advantages over metric methods. Metric methods that compute the distance between spectra require a threshold to be defined, and that threshold depends on the database. Using topological voiceprints made of rational numbers together with the full-match criterion introduces a new method that is independent of the database and requires no threshold for acceptance.
An embodiment of the topological method has been implemented on a standard personal computer, and tests show that the topological processing runs very quickly on a PC. Once an utterance has been recorded, the voiced segments can be extracted easily. Using a simple cross-counting algorithm (see, for example, the Gilmore paper cited), their relative rotation matrices can be constructed, and the voiceprint is computed simply by finding the coincidences across the set of small matrices. Once the voice database has been built, the entire recognition task reduces to matching small matrices.
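As a rough illustration of counting crossings, the sketch below computes the linking number of two closed polygonal curves by projecting them onto the xy-plane and summing signed crossings. This is not the relative rotation rate computation of the cited paper; it is a simpler, related invariant shown under the assumptions that the curves are given as lists of 3D points and are in generic position. All names and the two test curves are invented for the example.

```python
import math


def linking_number(curve_a, curve_b):
    """Half the sum of signed crossings between two closed polygonal curves,
    as seen in the projection onto the xy-plane."""
    total = 0
    na, nb = len(curve_a), len(curve_b)
    for i in range(na):
        a0, a1 = curve_a[i], curve_a[(i + 1) % na]
        dax, day = a1[0] - a0[0], a1[1] - a0[1]
        for j in range(nb):
            b0, b1 = curve_b[j], curve_b[(j + 1) % nb]
            dbx, dby = b1[0] - b0[0], b1[1] - b0[1]
            det = dax * dby - day * dbx  # z-component of da x db
            if det == 0:
                continue  # parallel projections: no transverse crossing
            rx, ry = b0[0] - a0[0], b0[1] - a0[1]
            t = (rx * dby - ry * dbx) / det  # position along segment of A
            u = (rx * day - ry * dax) / det  # position along segment of B
            if not (0 <= t < 1 and 0 <= u < 1):
                continue
            za = a0[2] + t * (a1[2] - a0[2])
            zb = b0[2] + u * (b1[2] - b0[2])
            sign = 1 if det > 0 else -1  # sign if A passes over B
            if zb > za:                  # B passes over A: sign flips
                sign = -sign
            total += sign
    return total // 2


def circle(n):
    """Unit circle in the plane z = 0."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n), 0.0)
            for k in range(n)]


def tilted_loop(n, z_shift):
    """Closed loop that threads the unit circle when z_shift is 0."""
    pts = []
    for k in range(n):
        s = 2 * math.pi * (k + 0.5) / n
        pts.append((1 + math.cos(s), 0.5 * math.sin(s), z_shift + math.sin(s)))
    return pts


linked = linking_number(circle(200), tilted_loop(211, 0.0))
stacked = linking_number(circle(200), tilted_loop(211, 2.0))
```

The two loops have the same projection in both cases; only the over and under information at the crossings changes, which is what separates the linked configuration from the one where the second loop simply lies above the first.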
In the topological method, the number of stable numbers was found to vary as a function of the size of the training set. For training sets larger than 10 vowels, the number of stable numbers converges to approximately 8. These stable numbers describe the relative peak heights of the spectral function of the voiced sound with respect to the reference spectrum, which do not change from utterance to utterance. Topological indices obtained from utterances recorded while a subject's voice was altered by a severe cold were compared with the stable numbers of that subject in the database. The test showed a moderate loss of information in the matrix of stable numbers: only the indices associated with the highest frequencies changed, and most of the voiceprint remained unchanged.
A variety of systems can employ the topological voice recognition method of the present invention. A simple embodiment can use a computer or a processing unit that includes a microprocessor to process the voice signal from a microphone connected to the processing unit. A storage medium such as an electronic storage device, a magnetic storage device (for example, the hard disk drive of a PC), or an optical storage device can be used to store the topological voiceprints of known speakers. The user provides a voice sample by speaking into the microphone. The processing unit first processes the voice sample from the user to extract the user's topological voice indices, and then compares them with the indices stored in the storage device to search for a match between the user and one of the known speakers in the database.
Fig. 9 shows an example of a speaker recognition system that implements the topological method described above. Fig. 10 shows the operational flow of the system in Fig. 9. The system includes: a processing unit, which can be a computer or can include a microprocessor, for processing the audio signal according to the topological method and for comparing the voice model read by the read head with the test matrix formed from the voice signal; an input microphone, connected to the processing unit, for recording the voice signal from the speaker; and a read head, also connected to the processing unit, for reading the rational numbers of the voice models of one or more known speakers stored on a portable storage device, such as a magnetic card, an optical storage device, a card printed with a bar code encoding the rational numbers, an electronic storage device, or a memory card.
For example, suppose that the read head is a magnetic reader and the portable storage device is a magnetic card storing the numeric codes of one or more voice models of a known speaker. A cardholder claiming to be the known speaker is asked to swipe the card through the reader and to speak into the microphone so that a voice sample can be obtained. The processing unit processes this voice sample to extract the topological rational numbers and compares them with the rational numbers read from the card. When all the rational numbers match fully, the user of the card is verified as the known speaker whose voiceprint is stored on the card. The user of the card can then be allowed access to, for example, a bank account or a computer system.
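One way to picture the numeric code stored on a card is as a short text string listing each stable position and its rational number. The encoding below (`row,column=numerator/denominator` entries separated by semicolons) is purely hypothetical and is not specified by the patent; it only shows that a voiceprint of a few rational numbers fits in a compact code that can be read back and compared exactly.

```python
from fractions import Fraction


def encode_voiceprint(model):
    """Serialize the non-empty entries of a model matrix to a text code."""
    parts = []
    for i, row in enumerate(model):
        for j, value in enumerate(row):
            if value is not None:
                parts.append(f"{i},{j}={value.numerator}/{value.denominator}")
    return ";".join(parts)


def decode_voiceprint(code):
    """Parse the text code back into a {(row, column): Fraction} mapping."""
    stable = {}
    for part in code.split(";"):
        if not part:
            continue
        position, ratio = part.split("=")
        i, j = (int(x) for x in position.split(","))
        stable[(i, j)] = Fraction(ratio)
    return stable


def verify(code, test_matrix):
    """Full-match check of a test matrix against the code read from a card."""
    stable = decode_voiceprint(code)
    return bool(stable) and all(
        test_matrix[i][j] == value for (i, j), value in stable.items()
    )


# Hypothetical card content and two test matrices.
card_code = encode_voiceprint([[Fraction(1, 2), None], [None, Fraction(2, 3)]])
genuine = [[Fraction(1, 2), Fraction(5)], [Fraction(1, 9), Fraction(2, 3)]]
impostor = [[Fraction(1, 2), Fraction(5)], [Fraction(1, 9), Fraction(1, 3)]]
```

With this layout the card holds only the stable numbers and their positions, so the comparison at the reader is a handful of exact equality checks.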
A computer security verification system based on the topological method can be implemented over a computer network, in which a digitized voice sample from a user is sent over the network to a processing unit, and the processing unit determines whether the user's voice sample matches the voice of a known speaker stored in the topological database. This use applies to the Internet, telephone lines, and wireless communication links of networks such as wireless telephone networks or wireless data networks. Various applications can incorporate the topological voice recognition of the present invention as part or all of a verification process, for example electronic banking or finance, online shopping, verification of various identification documents (for example, passports and identity cards), bank cards, credit cards, electronic commerce, telephone access, keyless entry (automobiles, homes, offices, and so on), and verification of user identity for driver's licenses.
Only a few embodiments have been described above. It should be understood, however, that various modifications and enhancements can be made to these embodiments.

Claims (26)

1. A method for determining the identity of a speaker by voice, comprising:
extracting a set of topological indices from an embedding of a spectral function of the speaker's voice; and
using selected ones of the topological indices as biometric features of the speaker, to identify the speaker and distinguish the speaker from other speakers.
2. The method according to claim 1, further comprising:
analyzing a voice sample from a second speaker to extract a set of topological indices of the second speaker;
comparing the set of topological indices of the second speaker with the set of topological indices of the speaker;
when the set of topological indices of the second speaker matches the set of topological indices of the speaker, verifying that the second speaker is the speaker; and
when there is no match, determining that the second speaker is a person different from the speaker.
3. The method according to claim 1, further comprising:
extracting multiple sets of topological indices from the voices of different known speakers;
analyzing a voice sample from an unknown speaker to extract a set of topological indices of the unknown speaker;
comparing the set of topological indices of the unknown speaker with the multiple sets of topological indices of the known speakers to determine whether there is a match; and
when there is a match, verifying that the unknown speaker is the known speaker whose set of topological indices matches the set of topological indices of the unknown speaker.
4. The method according to claim 1, further comprising:
storing the set of topological indices of the speaker in a portable device;
obtaining a voice sample from a user who holds the portable device;
analyzing the voice sample obtained from the user to extract a set of topological indices of the user;
providing a reading device to read the set of topological indices of the speaker from the portable device;
comparing the set of topological indices of the speaker read from the portable device with the set of topological indices of the user to determine whether there is a match; and
when there is a match, verifying that the user is the speaker.
5. The method according to claim 4, further comprising using a magnetic storage device as the portable device.
6. The method according to claim 5, wherein the portable device is a magnetic card, and the set of topological indices of the speaker is stored in the magnetic card.
7. The method according to claim 6, wherein the magnetic card comprises a magnetic stripe that stores the set of topological indices of the speaker.
8. The method according to claim 4, wherein the portable device has a surface printed with a bar code pattern, and the set of topological indices of the speaker is stored in the bar code pattern.
9. The method according to claim 4, further comprising using an electronic storage device as the portable device.
10. The method according to claim 4, further comprising using an optical storage device as the portable device.
11. The method according to claim 1, wherein extracting the set of topological indices from the speaker's voice comprises:
processing a voice signal from the speaker to obtain a spectral function;
constructing a closed three-dimensional orbit from the spectral function;
obtaining a set of topological indices from the orbit with respect to a reference; and
selecting a subset of the topological indices as the biometric features of the speaker.
12. A method, comprising:
recording and processing a voice signal from a speaker;
computing linear prediction coefficients from the voice signal;
computing a power spectrum from the linear prediction coefficients;
constructing a three-dimensional periodic orbit based on the power spectrum;
constructing a three-dimensional periodic orbit from the power spectrum of a natural reference signal;
obtaining topological information about the periodic orbits of the voice signal and of the natural reference signal; and
using a selected set of topological indices to distinguish the speaker who produced the voice signal from other speakers having different topological indices.
13. The method according to claim 12, wherein the topological information is obtained from the relative rotation rates between the periodic orbit of the voice signal and another reference orbit and/or from the rotation rates of the periodic orbit with respect to itself.
14. The method according to claim 12, wherein the topological information is obtained from the orbits by computing linking properties and/or self-linking properties.
15. The method according to claim 12, wherein the topological information is obtained from the orbits by computing the knot type in the embedding.
16. The method according to claim 12, wherein each three-dimensional periodic orbit is constructed with respect to Cartesian coordinates, and the axes of the three-dimensional periodic orbit are defined by the power spectrum with different phase delays.
17. The method according to claim 12, wherein each three-dimensional periodic orbit is constructed with respect to Cartesian coordinates, and the axes of the three-dimensional periodic orbit are defined by other integro-differential embeddings.
18. The method according to claim 12, further comprising:
forming a database comprising different selected sets of topological indices of a plurality of known speakers; and
comparing a selected set of topological indices of an unknown speaker with the database to determine whether there is a match.
19. A method, comprising:
providing a database comprising voiceprints of known speakers, wherein each voiceprint comprises a set of topological numbers for distinguishing a speaker from other speakers, and is derived from the relationship, in a three-dimensional space, between a periodic orbit obtained from the power spectrum of the speaker's voice and a periodic orbit obtained from the power spectrum of an audio reference; and
comparing the voiceprint of an unknown speaker with the database to determine whether there is a match.
20. The method according to claim 19, wherein the three-dimensional space is defined by the energy spectrum function with different delay values.
21. The method according to claim 20, wherein the three-dimensional space is defined according to a three-dimensional integro-differential embedding.
22. A voiceprint for identifying a speaker from among other speakers, comprising:
a set of rational numbers characterizing topological features of a spectral function, for distinguishing the speaker from other speakers,
wherein the topological parameters are derived from the relationship, in a three-dimensional space, between a periodic orbit obtained from the power spectrum of the speaker and a periodic orbit obtained from the power spectrum of an audio reference.
23. A speaker recognition system, comprising:
a microphone for receiving a voice sample from a speaker;
a read head for reading, from a portable storage device, voice identification data representing the rational numbers of a known speaker; and
a processing unit connected to the microphone and the read head, the processing unit being operable to extract topological information from the voice sample from the speaker so as to produce topological rational numbers from the voice sample, and to compare the rational numbers of the known speaker with the topological rational numbers derived from the voice sample, to determine whether the speaker is the known speaker.
24. The system according to claim 22, wherein the read head is a magnetic reader that reads data from a magnetic portable storage device.
25. The system according to claim 22, wherein the read head is an optical reader that reads data from an optical portable storage device.
26. The system according to claim 22, wherein the read head is an electronic reader that reads data from an electronic portable storage device.
CN 200480030850 2003-08-20 2004-08-20 Topological voiceprints for speaker identification Pending CN1871639A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US49700703P 2003-08-20 2003-08-20
US60/497,007 2003-08-20

Publications (1)

Publication Number Publication Date
CN1871639A true CN1871639A (en) 2006-11-29

Family

ID=34216064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200480030850 Pending CN1871639A (en) 2003-08-20 2004-08-20 Topological voiceprints for speaker identification

Country Status (3)

Country Link
CN (1) CN1871639A (en)
AR (1) AR047710A1 (en)
WO (1) WO2005020208A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129859B (en) * 2010-01-18 2013-10-30 盛乐信息技术(上海)有限公司 Voiceprint authentication system and method for rapid channel compensation
CN105359132A (en) * 2013-06-18 2016-02-24 Mtcom公司 Method for generating and retrieving electronic document and recording medium therefor

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086519B2 (en) 2004-10-14 2011-12-27 Cfph, Llc System and method for facilitating a wireless financial transaction
US7860778B2 (en) 2004-11-08 2010-12-28 Cfph, Llc System and method for implementing push technology in a wireless financial transaction

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4415767A (en) * 1981-10-19 1983-11-15 Votan Method and apparatus for speech recognition and reproduction
US5799276A (en) * 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
US6470315B1 (en) * 1996-09-11 2002-10-22 Texas Instruments Incorporated Enrollment and modeling method and apparatus for robust speaker dependent speech models
US5940791A (en) * 1997-05-09 1999-08-17 Washington University Method and apparatus for speech analysis and synthesis using lattice ladder notch filters
US6006186A (en) * 1997-10-16 1999-12-21 Sony Corporation Method and apparatus for a parameter sharing speech recognition system
US6092039A (en) * 1997-10-31 2000-07-18 International Business Machines Corporation Symbiotic automatic speech recognition and vocoder
JP2986792B2 (en) * 1998-03-16 1999-12-06 株式会社エイ・ティ・アール音声翻訳通信研究所 Speaker normalization processing device and speech recognition device
US6618702B1 (en) * 2002-06-14 2003-09-09 Mary Antoinette Kohler Method of and device for phone-based speaker recognition

Also Published As

Publication number Publication date
WO2005020208A3 (en) 2005-04-28
WO2005020208A2 (en) 2005-03-03
AR047710A1 (en) 2006-02-15

Similar Documents

Publication Publication Date Title
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
Tiwari MFCC and its applications in speaker recognition
US20070198262A1 (en) Topological voiceprints for speaker identification
US8099288B2 (en) Text-dependent speaker verification
Campbell Speaker recognition: A tutorial
Lin et al. Audio classification and categorization based on wavelets and support vector machine
CN101465123B (en) Verification method and device for speaker authentication and speaker authentication system
CN1170239C (en) Palm acoustic-print verifying system
CN1808567A (en) Voice-print authentication device and method of authenticating people presence
WO2016119604A1 (en) Voice information search method and apparatus, and server
CN1547191A (en) Semantic and sound groove information combined speaking person identity system
Pawar et al. Review of various stages in speaker recognition system, performance measures and recognition toolkits
CN101154380A (en) Method and device for registration and validation of speaker's authentication
Fong Using hierarchical time series clustering algorithm and wavelet classifier for biometric voice classification
Eshwarappa et al. Multimodal biometric person authentication using speech, signature and handwriting features
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
Campbell Speaker recognition
Mansour et al. Voice recognition Using back propagation algorithm in neural networks
Mohammed et al. Advantages and disadvantages of automatic speaker recognition systems
CN112035700B (en) Voice deep hash learning method and system based on CNN
CN1871639A (en) Topological voiceprints for speaker identification
Kumalasari et al. Speech classification using combination virtual center of gravity and k-means clustering based on audio feature extraction
Gupta et al. Speech Recognition Using Correlation Technique
Chauhan et al. A review of automatic speaker recognition system
CN2763935Y (en) Spenker certification identifying system by combined lexeme and sound groove information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20061129