CN1932974A - Speaker identifying equipment, speaker identifying program and speaker identifying method - Google Patents


Info

Publication number
CN1932974A
Authority
CN
China
Prior art keywords
feature vector
speech feature
vector sequence
distance
speaker
Prior art date
Legal status: Pending
Application number
CNA2005100995434A
Other languages
Chinese (zh)
Inventor
柿野友成
伊久美智则
Current Assignee
Toshiba TEC Corp
Original Assignee
Toshiba TEC Corp
Priority date: 2005-09-13
Filing date: 2005-09-13
Publication date: 2007-03-21
Application filed by Toshiba TEC Corp
Priority to CNA2005100995434A
Publication of CN1932974A


Abstract

The invention relates to speaker identification equipment. In the speaker distance calculation part of the equipment, quantized distances are obtained between the speech feature vectors of a speech feature vector sequence produced from an unidentified speaker's voice and the representative vectors in a codebook; each speech feature vector is quantized based on its quantized distance, and a quantization distortion is obtained using the high-order speech feature vectors in the sequence. In the speaker identification part of the equipment, speaker identification is performed based on the quantization distortion, taken as the average of the individual quantization distortions.

Description

Speaker identification equipment, speaker identification program and speaker identification method
Technical field
The present invention relates to speaker identification equipment, a computer program for speaker identification, and a speaker identification method, which identify a speaker by using the personal information contained in a sound wave.
Background art
As speaker identification equipment, both text-dependent equipment, which identifies a speaker from speech of a predetermined content, and text-independent equipment, which identifies a speaker from speech of arbitrary content, have been proposed.
Speaker identification equipment usually converts an input sound wave into an analog signal, converts the analog signal into a digital signal, analyzes the digital signal, and then produces a speech feature vector sequence containing personal information. Here, cepstrum coefficients serve as the speech feature vectors. In enrollment mode, the equipment clusters the speech feature vector sequence into a predetermined number of clusters, for example 32, and generates a representative vector as the centroid of each cluster (see Furui, Speech Information Processing, Morikita Shuppan Co., Ltd., Japan, 1st ed., pp. 56-57). In identification mode, the equipment calculates, for each speech feature vector, the distance between the speech feature vector sequence produced from the input sound wave and the codebook registered in advance in enrollment mode, calculates the mean value (mean distance), and identifies the speaker based on this mean distance.
When the speaker identification equipment is used as speaker verification equipment, it calculates the distance between the speech feature vector sequence produced from the speaker to be identified and the codebook of that speaker, and compares the distance with a threshold to perform speaker verification. When used as speaker identification equipment, it calculates the distances between the speech feature vector sequence produced from the speaker to be identified and the codebooks of all registered speakers, and identifies the registered speaker corresponding to the shortest of these distances.
Currently, cepstrum coefficients, which reflect the vocal tract shape, or the pitch, which indicates the vibration frequency of the vocal cords, are often used as speech feature quantities. This information includes phonological information, which indicates the speech content, and personal information, which depends on the speaker. When the difference between speakers' voices is computed as a distance, the deviation due to phonological information is larger than the deviation due to personal information, so comparing the two directly is undesirable; rather, it is preferable to compare speech carrying the same phonological information. Therefore, existing speaker identification equipment performs an approximate normalization of phonemes by clustering the vector deviations in the observation space, and computes the speaker distance, which reflects the individuality obtained by comparing approximately identical phonemes, as a quantization distortion.
However, when clustering the speech feature vector sequence, the question arises of what order the speech feature vectors should have. In general, much of the phonological information resides in the low orders, and much of the personal information resides in the high orders. Therefore, if the speech feature vectors are set to a low order to improve the phonological resolving performance during clustering, the speaker resolving performance may degrade. Conversely, if they are set to a high order to improve the speaker resolving performance, the phonological resolving performance may degrade. This creates a trade-off. Because of this problem, the order of the speech feature vectors is currently set only to a value determined experimentally.
Summary of the invention
Therefore, the object of the present invention is to eliminate the trade-off between phonological resolving performance and speaker resolving performance, and to realize accurate speaker identification.
According to one aspect of the present invention, speaker identification equipment is provided in which distances between the speech feature vectors of a first speech feature vector sequence, produced from the voice of a speaker to be registered, are obtained based on the low-order speech feature vector group of that sequence. The first speech feature vector sequence is clustered based on the obtained distances, and a codebook comprising a plurality of representative vectors is produced and stored. Based on the low-order speech feature vector group of a second speech feature vector sequence, produced from the voice of a speaker to be identified, a quantized distance is obtained between (a) each speech feature vector in the second sequence and (b) the corresponding one of the plurality of representative vectors stored in the codebook. Each speech feature vector in the second sequence is quantized based on the obtained quantized distance. Then, based on the high-order speech feature vector group of the second sequence, a quantization distortion is obtained between each speech feature vector of the second sequence and the corresponding one of the representative vectors stored in the codebook. Speaker identification is performed based on the obtained quantization distortion.
According to another aspect of the present invention, speaker identification equipment is provided in which weighted vector distances, based on a first weight, are obtained between the speech feature vectors of a first speech feature vector sequence produced from the voice of a speaker to be registered. The first sequence is clustered based on the obtained weighted vector distances, and a codebook comprising a plurality of representative vectors is produced and stored. A weighted quantized distance, based on a second weight, is obtained between the corresponding one of the representative vectors stored in the codebook and each speech feature vector of a second speech feature vector sequence produced from the voice of a speaker to be identified. Each speech feature vector in the second sequence is quantized based on the obtained weighted quantized distance. Then, a weighted quantization distortion, based on a third weight different from the first and second weights, is obtained between the corresponding one of the representative vectors stored in the codebook and each speech feature vector of the second sequence. Speaker identification is performed based on this quantization distortion.
According to the present invention, highly accurate speaker identification can be realized.
Description of drawings
Fig. 1 is a block diagram showing the structure of speaker identification equipment of the present invention;
Fig. 2 is a schematic diagram illustrating the clustering performed to obtain representative vectors from a speech feature vector sequence;
Fig. 3 is a block diagram showing the structure of the speaker identification part provided in the speaker identification equipment;
Fig. 4 is a schematic diagram showing the structure of the feature vectors; and
Fig. 5 is a block diagram showing an example structure of the speaker identification equipment of the present invention realized by software.
Embodiment
A first embodiment of the present invention will be described with reference to Figs. 1 to 4. Fig. 1 is a block diagram showing the structure of the speaker identification equipment 100 of the first embodiment.
As shown in Fig. 1, the speaker identification equipment 100 comprises a microphone 1, a low-pass filter 2, an A/D converter 3, a feature vector producing part 4, a speaker identification part 5, a speaker model producing part 6, and a storage part 7. With these parts and elements, various devices (or steps) can be realized.
The microphone 1 converts input speech into an electrical analog signal. The low-pass filter 2 removes frequencies higher than a predetermined frequency from the input analog signal. The A/D converter 3 converts the input analog signal into a digital signal with a predetermined sampling frequency and quantization bit count. A speech input part 8 comprises the microphone 1, the low-pass filter 2, and the A/D converter 3.
The feature vector producing part 4 analyzes the supplied digital signal and generates and outputs an M-th order speech feature vector sequence (a time series of characteristic parameters). In addition, the feature vector producing part 4 includes a switch (not shown) for selecting between enrollment mode and identification mode. According to the selected mode, in enrollment mode the feature vector producing part 4 is electrically connected to the speaker model producing part 6 and outputs the M-th order speech feature vector sequence to it, while in identification mode it is electrically connected to the speaker identification part 5 and outputs the M-th order speech feature vector sequence to that part. In this embodiment of the present invention, the M-th order speech feature vector sequence is a 16th-order speech feature vector sequence (M = 16), whose feature vectors comprise the 1st to 16th order LPC cepstrum coefficients, although the invention is not limited to this example.
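For illustration only (this sketch is not part of the patent), a 16th-order LPC cepstrum sequence of the kind described above might be computed as follows; the frame length, hop size, and LPC order are assumptions, and the LPC-to-cepstrum conversion uses the standard recursion:

```python
import numpy as np

def levinson_durbin(r, p):
    """Coefficients a[0..p] of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p."""
    a = np.zeros(p + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= 1.0 - k * k
    return a

def lpc_cepstrum(frame, p=16, m=16):
    """1st..m-th order LPC cepstrum of one windowed frame."""
    r = np.correlate(frame, frame, "full")[len(frame) - 1:len(frame) + p]
    alpha = -levinson_durbin(r, p)          # predictor coefficients alpha_1..alpha_p
    c = np.zeros(m + 1)
    for n in range(1, m + 1):               # c_n = alpha_n + sum_k (k/n) c_k alpha_{n-k}
        c[n] = alpha[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            c[n] += (k / n) * c[k] * alpha[n - k]
    return c[1:]                            # one M-th order feature vector (M = 16)

def feature_sequence(signal, frame_len=320, hop=160):
    """Feature vector sequence: one cepstrum vector per 20 ms frame at 16 kHz (assumed)."""
    window = np.hamming(frame_len)
    return np.array([lpc_cepstrum(signal[i:i + frame_len] * window)
                     for i in range(0, len(signal) - frame_len + 1, hop)])
```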
In enrollment mode, the speaker model producing part 6 produces a codebook, as a speaker model, from the speech feature vector sequence generated at the feature vector producing part 4. The storage part 7 stores (registers) the codebook produced at the speaker model producing part 6 in a dictionary.
The speaker identification part 5 calculates the distance between a codebook stored in advance in the storage part 7 and the speech feature vector sequence generated at the feature vector producing part 4, identifies the speaker based on this distance, and outputs the result as the speaker identification result.
Next, the speaker model producing part 6 will be described with reference to Fig. 2, which schematically shows the clustering performed to obtain representative vectors (centroids) from a speech feature vector sequence.
As shown in Fig. 2, in enrollment mode the speaker model producing part 6 clusters the M-th order speech feature vector sequence, produced at the feature vector producing part 4 from the voice of the speaker to be registered, into a plurality of clusters corresponding to a predetermined codebook size. The speaker model producing part 6 then obtains, for each cluster, the centroid, i.e. the weighted center of the cluster, as the representative vector of that cluster, and registers the plurality of representative vectors (the centroid of each cluster) in the storage part 7 (dictionary) as codebook elements. A codebook is produced for each registered speaker.
Here, the clustering is performed using the N-th order (N < M) components of the M-th order speech feature vector sequence (the shaded region in Fig. 2), and M-th order representative vectors are obtained. The N-th order speech feature vector sequence is the low-order speech feature vector group.
The vector-to-vector distance D1 used in the clustering can be obtained from the following formula (1). In this embodiment of the present invention, N = 8, M = 16, and the codebook size is 32.
$$D_1 = \left[\sum_{k=1}^{N} (X_k - Y_k)^2\right]^{1/2} \qquad (1)$$

where
D_1: vector distance
X_k, Y_k: M-th order feature vectors
N < M
That is, the speaker model producing part 6 uses the N-th order components of the M-th order speech feature vector sequence generated at the feature vector producing part 4 in enrollment mode to obtain the vector distances D1 according to formula (1), clusters the M-th order speech feature vector sequence based on the obtained distances D1, and produces a codebook composed of a plurality of M-th order representative vectors.
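The patent does not name the clustering algorithm; the following k-means-style sketch (function name, iteration count, and initialization are assumptions) is one way to realize the described behavior, assigning vectors by the low-order distance of formula (1) while keeping full M-th order centroids:

```python
import numpy as np

def train_codebook(features, n_low=8, book_size=32, iters=20, seed=0):
    """features: (T, M) array of M-th order speech feature vectors."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), book_size, replace=False)]
    for _ in range(iters):
        # Formula (1): distance over the low-order components 1..N only.
        d1 = np.linalg.norm(features[:, None, :n_low] - codebook[None, :, :n_low],
                            axis=2)
        assign = d1.argmin(axis=1)
        for j in range(book_size):
            members = features[assign == j]
            if len(members):
                codebook[j] = members.mean(axis=0)  # centroid keeps all M components
    return codebook  # one codebook per registered speaker
```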
Next, the speaker identification part 5 will be described with reference to Fig. 3, which is a block diagram showing its structure.
As shown in Fig. 3, the speaker identification part 5 comprises a speaker distance calculation part 11 and an identification part 12.
The speaker distance calculation part 11 calculates the distance between the plurality of representative vectors stored in the codebook and the M-th order speech feature vector sequence produced at the feature vector producing part 4 from the voice of the speaker to be identified (the distance between the codebook and the feature vector sequence). That is, for each feature vector in the M-th order speech feature vector sequence produced at the feature vector producing part 4, the speaker distance calculation part 11 calculates the distance between that feature vector and a representative vector of the codebook in the storage part 7 (the distance between representative vector and feature vector).
Here, the distance between the codebook and the feature vector sequence is obtained as follows: (a) each M-th order speech feature vector in the sequence is quantized based on the quantized distance D2 between representative vector and feature vector, calculated using the N-th order components; and (b) the distortion distance D3 (quantization distortion) between representative vector and feature vector is obtained using the M-th order speech feature vector. The distance between the codebook and the feature vector sequence is then calculated as the mean value of the obtained quantization distortions. Here, the N-th order speech feature vector sequence is the low-order speech feature vector group, and the M-th order speech feature vector sequence is the high-order speech feature vector group.
The quantized distance D2 between representative vector and feature vector used in the quantization process can be obtained from the following formula (2), and the distortion distance D3 from the following formula (3).
$$D_2 = \left[\sum_{k=1}^{N} (C_k - X_k)^2\right]^{1/2} \qquad (2)$$

where
D_2: representative vector-feature vector distance (quantized distance)
C_k: representative vector
X_k: M-th order feature vector
$$D_3 = \left[\sum_{k=1}^{M} (C_k - X_k)^2\right]^{1/2} \qquad (3)$$

where
D_3: representative vector-feature vector distance (distortion distance)
C_k: representative vector
X_k: M-th order feature vector
The speaker distance calculation part 11 obtains the quantized distance D2 according to formula (2). D2 is the quantized distance between representative vector and feature vector, that is, between each speech feature vector in the M-th order speech feature vector sequence produced at the feature vector producing part 4 and the plurality of representative vectors stored in enrollment mode in the codebook of the storage part 7. Based on the obtained quantized distances D2, the M-th order speech feature vector sequence is then quantized using its N-th order components; that is, each speech feature vector in the M-th order sequence is quantized. Then, the distortion distance D3 between representative vector and feature vector is calculated according to formula (3). D3 is the distortion distance between the plurality of representative vectors stored in the codebook of the storage part 7 and the M-th order speech feature vector sequence produced at the feature vector producing part 4.
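A minimal sketch of this two-stage distance, assuming numpy and a (T, M) feature array (the function name is illustrative):

```python
import numpy as np

def codebook_distance(features, codebook, n_low=8):
    """Mean quantization distortion between a feature sequence and one codebook.
    features: (T, M) array; codebook: (32, M) array of representative vectors."""
    # Formula (2): choose each vector's nearest centroid by the low-order
    # quantized distance D2 (components 1..N only).
    d2 = np.linalg.norm(features[:, None, :n_low] - codebook[None, :, :n_low],
                        axis=2)
    nearest = d2.argmin(axis=1)
    # Formula (3): distortion D3 over all M components to the chosen centroid.
    d3 = np.linalg.norm(features - codebook[nearest], axis=1)
    return d3.mean()  # the distance between codebook and feature vector sequence
```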
In this embodiment, the quantization distortion is obtained using the full M-th order speech feature vector sequence, but the invention is not limited to this example. For example, the quantization distortion can be obtained using a speech feature vector sequence that includes the (m to M)-th order components (N < m < M), i.e. the high-order speech feature vector sequence. A speech feature vector sequence including the (m to M)-th order components (N < m < M) can serve as the high-order speech feature vector group; it suffices that the group include the high-order components. The sequence including the (m to M)-th order components (N < m < M) can be any of the following: a sequence comprising only the (m to M)-th order cepstrum coefficients, shown by the shaded region in Fig. 4(b); a sequence comprising the (m to M)-th order cepstrum coefficients together with part of the (1 to N)-th order cepstrum coefficients, shown by the shaded region in Fig. 4(c); or a sequence comprising the (1 to M)-th order cepstrum coefficients (the full M-th order sequence), shown by the shaded region in Fig. 4(d). Here, the (1 to N)-th order cepstrum coefficients (the shaded region in Fig. 4(a)) are the low-order cepstrum coefficients, and the (m to M)-th order cepstrum coefficients are the high-order cepstrum coefficients. The high-order cepstrum coefficients contain more personal information than the low-order cepstrum coefficients, and the low-order cepstrum coefficients contain more phonological information than the high-order ones. In this embodiment N = 8 and M = 16, but the invention is not limited to these values. Index masks for these variants are sketched below.
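The Fig. 4 variants amount to choosing which components enter the distortion computation; as a sketch (m = 12 and the particular low-order subset in variant (c) are arbitrary choices for illustration):

```python
import numpy as np

N, m, M = 8, 12, 16   # m = 12 is illustrative; the patent only requires N < m < M

# 0-indexed component sets for the high-order group (orders are 1-indexed in the text):
hi_b = np.arange(m - 1, M)                    # (b) orders m..M only
hi_c = np.concatenate([np.arange(0, 4),       # (c) part of orders 1..N ...
                       np.arange(m - 1, M)])  #     ... together with orders m..M
hi_d = np.arange(0, M)                        # (d) all orders 1..M

def distortion(x, c, idx):
    """D3 restricted to the chosen component set (x, c: M-th order vectors)."""
    return np.linalg.norm(x[idx] - c[idx])
```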
The identification part 12 identifies the speaker based on the mean value of the quantization distortions obtained at the speaker distance calculation part 11, and outputs the recognition result as the speaker identification result. When the speaker identification equipment 100 is used as speaker verification equipment, the speaker distance calculation part 11 calculates the distance (mean quantization distortion) between the speech feature vector sequence produced from the voice of the speaker to be identified and the plurality of representative vectors in the stored codebook of that speaker, and the identification part 12 identifies the speaker by comparing this distance with a threshold. When the speaker identification equipment 100 is used as speaker identification equipment, the speaker distance calculation part 11 calculates the distances between the speech feature vector sequence produced from the voice of the speaker to be identified and the plurality of representative vectors in the codebooks of all registered speakers, and the speaker is identified by selecting the shortest of these distances.
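Building on the codebook_distance sketch above, the two modes of operation reduce to a threshold test and an argmin (the threshold value and the dict-of-codebooks interface are assumptions):

```python
def verify(features, codebook, threshold):
    """Speaker verification: accept when the mean distortion is below the threshold."""
    return codebook_distance(features, codebook) <= threshold

def identify(features, codebooks):
    """Speaker identification: the registered speaker whose codebook is nearest.
    codebooks: dict mapping speaker name -> (32, M) codebook array."""
    return min(codebooks, key=lambda name: codebook_distance(features, codebooks[name]))
```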
According to the first embodiment of the present invention, in enrollment mode the vector-to-vector distance D1 of each speech feature vector can be obtained using the N-th order components of the M-th order speech feature vector sequence produced from the voice of the speaker to be registered. The M-th order sequence is clustered based on the vector distances D1, and a codebook composed of a plurality of M-th order centroids is produced. In identification mode, each speech feature vector in the M-th order sequence produced from the voice of the speaker to be identified is quantized based on the quantized distance D2 between its N-th order components and each representative vector of the codebook; the distortion distance D3 is obtained using the M-th order components; and speaker identification is performed based on the mean quantization distortion. With this structure, the trade-off between phonological resolving performance and speaker resolving performance can be eliminated, and a good balance between them can be ensured. Therefore, highly accurate speaker identification can be realized.
In this embodiment of the present invention, both the first speech feature vector sequence, produced from the voice of the speaker to be registered, and the second speech feature vector sequence, produced from the voice of the speaker to be identified, are M-th order speech feature vector sequences; the low-order speech feature vector group is the N-th order (N < M) speech feature vector sequence; the codebook comprises M-th order representative vectors; and the high-order speech feature vector group is the M-th order speech feature vector sequence. Therefore, stable recognition performance can be reliably ensured.
Alternatively, according to this embodiment of the present invention, both the first and second speech feature vector sequences are M-th order speech feature vector sequences, the low-order speech feature vector group is the N-th order (N < M) speech feature vector sequence, the codebook comprises M-th order representative vectors, and the high-order speech feature vector group is a speech feature vector sequence including the (m to M)-th order components (N < m < M). Therefore, stable recognition performance can be reliably ensured.
Now a second embodiment of the present invention will be described. The second embodiment is a modification of the speaker identification part 5 and the speaker model producing part 6 of the first embodiment. Therefore, parts of the second embodiment having the same structure as in the first embodiment are denoted by the same reference numerals, and their explanation is omitted except for the speaker identification part 5 and the speaker model producing part 6.
The speaker model producing part 6 according to the second embodiment will be described with reference to Fig. 2. In enrollment mode, the speaker model producing part 6 clusters the M-th order speech feature vector sequence produced at the feature vector producing part 4 from the voice of the speaker to be registered into a plurality of clusters corresponding to a predetermined codebook size, obtains for each cluster the centroid, i.e. the weighted center of the cluster, as the representative vector of that cluster, and registers the plurality of representative vectors in the storage part (dictionary) 7 as a codebook. A codebook is produced for each registered speaker.
Here, the clustering is performed using the full M-th order speech feature vector sequence, and M-th order representative vectors are obtained. The weighted vector-to-vector distance D1 used in the clustering can be obtained from the following formula (4). In this embodiment, N = 8, M = 16, and the codebook size is 32.
$$D_1 = \left[\sum_{k=1}^{M} U_k (X_k - Y_k)^2\right]^{1/2} \qquad (4)$$

where
D_1: vector distance
U_k: weight, U_k = 1 (k ≤ N), 0 (k > N)
X_k, Y_k: M-th order feature vectors
The speaker model producing part 6 uses the M-th order speech feature vector sequence produced at the feature vector producing part 4 to obtain each weighted vector distance D1 according to formula (4), clusters the M-th order speech feature vector sequence based on the obtained weighted distances D1, and produces a codebook composed of a plurality of M-th order representative vectors.
Next, the speaker identification part 5 according to the second embodiment will be described (see Fig. 3). The speaker identification part 5 has basically the same structure as in the first embodiment and comprises the speaker distance calculation part 11 and the identification part 12.
The speaker distance calculation part 11 calculates the distance between the plurality of representative vectors stored in the codebook of the storage part 7 and the M-th order speech feature vector sequence produced at the feature vector producing part 4 from the voice of the speaker to be identified (the distance between codebook and feature vector sequence). That is, for each feature vector in the M-th order sequence produced at the feature vector producing part 4, the speaker distance calculation part 11 calculates the distance between that feature vector and a representative vector of the codebook (the distance between representative vector and feature vector).
Here, each M-th order speech feature vector in the sequence is quantized based on the weighted quantized distance D2, and then the weighted distortion distance D3 (quantization distortion) between representative vector and feature vector is obtained using the M-th order speech feature vector; the distance between codebook and feature vector sequence is obtained as the mean value of the quantization distortions.
According to the second embodiment, the weighted quantized distance D2 between the representative vector and the feature vector used in the quantization can be obtained from the following formula (5), and the weighted distortion distance D3 used to obtain the quantization distortion can be obtained from the following formula (6).
$$D_2 = \left[\sum_{k=1}^{M} U_k (C_k - X_k)^2\right]^{1/2} \qquad (5)$$

where
D_2: representative vector-feature vector distance (quantized distance)
U_k: weight, U_k = 1 (k ≤ N), 0 (k > N)
C_k: representative vector
X_k: M-th order feature vector
$$D_3 = \left[\sum_{k=1}^{M} V_k (C_k - X_k)^2\right]^{1/2} \qquad (6)$$

where
D_3: representative vector-feature vector distance (distortion distance)
V_k: weight (V_k = 1)
C_k: representative vector
X_k: M-th order feature vector
Therefore, the speaker distance calculation part 11 obtains, according to formula (5), the weighted quantized distance D2 between representative vector and feature vector, i.e. the distance between the plurality of representative vectors stored in the codebook of the storage part 7 and each speech feature vector in the M-th order sequence produced at the feature vector producing part 4 in identification mode. The M-th order speech feature vector sequence is then quantized based on the obtained weighted quantized distances D2; that is, each speech feature vector in the M-th order sequence is quantized. Then, the weighted distortion distance D3 between the plurality of representative vectors stored in the codebook of the storage part 7 and each speech feature vector in the M-th order sequence produced at the feature vector producing part 4 is obtained according to formula (6), and the mean value of the obtained weighted distortion distances D3 (the mean quantization distortion) is computed.
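As a compact illustration of formulas (4)-(6) (not from the patent text; numpy assumed), the binary weight of the clustering/quantization steps and the all-ones weight of the distortion step can be written as weight vectors:

```python
import numpy as np

M, N = 16, 8
U = (np.arange(1, M + 1) <= N).astype(float)  # formulas (4)/(5): U_k = 1 (k <= N), else 0
V = np.ones(M)                                # formula (6): V_k = 1

def weighted_dist(a, b, w):
    """Weighted Euclidean distance used in formulas (4)-(6)."""
    return np.sqrt(np.sum(w * (a - b) ** 2))

# With these weights the second embodiment reproduces the first embodiment:
# weighted_dist(x, y, U) equals the low-order distance of formulas (1)/(2), and
# weighted_dist(c, x, V) equals the full-order distortion D3 of formula (3).
```

Zeroing the high-order terms during clustering and quantization preserves the phonological resolution, while the unweighted distortion retains the personal information carried in the high orders.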
The identification part 12 identifies the speaker based on the mean quantization distortion obtained at the speaker distance calculation part 11 and outputs the recognition result as the speaker identification result. When the speaker identification equipment 100 is used as speaker verification equipment, the speaker distance calculation part 11 calculates the distance between the speech feature vector sequence produced from the voice of the speaker to be identified and the plurality of representative vectors in the stored codebook of that speaker, and the identification part 12 verifies the speaker by comparing the distance with a threshold. When the speaker identification equipment 100 is used as speaker identification equipment, the speaker distance calculation part 11 calculates the distances (mean quantization distortions) between the speech feature vector sequence produced from the voice of the speaker to be identified and the plurality of representative vectors in the codebooks of all registered speakers, and the speaker is identified by selecting the shortest of the obtained distances.
According to the second embodiment of the invention described above, in enrollment mode the weighted vector-to-vector distance D1 is obtained for each vector of the M-th order speech feature vector sequence produced from the voice of the speaker to be registered, the M-th order sequence is clustered based on the obtained weighted distances D1, and a codebook comprising a plurality of M-th order representative vectors is produced. In identification mode, each speech feature vector in the M-th order sequence produced from the voice of the speaker to be identified is quantized based on the weighted quantized distance D2 between it and each representative vector of the codebook; the quantization distortion is obtained from the distortion distance D3 using the M-th order speech feature vector sequence; and speaker identification is then performed based on the mean quantization distortion. With this structure, the trade-off between phonological resolving performance and speaker resolving performance can be eliminated, and a good balance between them can be ensured. Therefore, highly accurate speaker identification can be realized.
In the second embodiment of the present invention, formulas (4), (5) and (6) were used, but the invention is not limited to these formulas. For example, formula (4) can be replaced with the following formula (7) (weight: U_k = 1), formula (5) with the following formula (8) (weight: U_k = 1), and formula (6) with the following formula (9) (weight: V_k = 1/S_k). Here, the standard deviation S_k, a measure of the deviation of each speech feature vector component, is obtained statistically in advance.
$$D_1 = \left[\sum_{k=1}^{M} U_k (X_k - Y_k)^2\right]^{1/2} \qquad (7)$$

where
D_1: vector distance
U_k: weight (U_k = 1)
X_k, Y_k: M-th order feature vectors
$$D_2 = \left[\sum_{k=1}^{M} U_k (C_k - X_k)^2\right]^{1/2} \qquad (8)$$

where
D_2: representative vector-feature vector distance (quantized distance)
U_k: weight (U_k = 1)
C_k: representative vector
X_k: M-th order feature vector
$$D_3 = \left[\sum_{k=1}^{M} V_k (C_k - X_k)^2\right]^{1/2} \qquad (9)$$

where
D_3: representative vector-feature vector distance (distortion distance)
V_k: weight (V_k = 1/S_k)
C_k: representative vector
X_k: M-th order feature vector
S_k: standard deviation of the k-th order component
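A sketch of this variant (the placeholder data and vectors are illustrative; only the weight definitions come from formulas (7)-(9)):

```python
import numpy as np

M = 16
rng = np.random.default_rng(0)
train = rng.standard_normal((1000, M))  # placeholder for enrollment feature vectors
S = train.std(axis=0)                   # S_k: per-order standard deviation, obtained in advance

def weighted_dist(a, b, w):
    return np.sqrt(np.sum(w * (a - b) ** 2))

x, y, c = rng.standard_normal((3, M))   # placeholder feature / representative vectors
d1 = weighted_dist(x, y, np.ones(M))    # formula (7): U_k = 1
d2 = weighted_dist(c, x, np.ones(M))    # formula (8): U_k = 1
d3 = weighted_dist(c, x, 1.0 / S)       # formula (9): V_k = 1/S_k
```

Dividing each squared component difference by S_k keeps any single cepstral order from dominating the distortion.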
In the second embodiment of the present invention, both the first speech feature vector sequence, produced from the voice of the speaker to be registered, and the second speech feature vector sequence, produced from the voice of the speaker to be identified, are M-th order speech feature vector sequences. The weighted vector distance and the weighted quantized distance are obtained using weights satisfying the relation

U_k = 1 (k ≤ N), 0 (k > N), where N < M,

with U_k serving as both the first weight and the second weight. The weighted distortion distance is obtained using the third weight V_k, which satisfies

V_k = 1 (k ≤ M).
Therefore, highly accurate recognition performance can be realized.
Alternatively, in the second embodiment of the present invention, both the first and second speech feature vector sequences are M-th order speech feature vector sequences. The weighted vector distance and the weighted quantized distance are obtained using weights satisfying the relation

U_k = 1 (k ≤ M),

with U_k serving as both the first weight and the second weight. The weighted distortion distance is obtained using the third weight V_k, which satisfies

V_k = 1/S_k (k ≤ M).
Therefore, highly accurate recognition performance can be realized.
The hardware configuration is not limited to the specific structure described above; it can also be realized by software. The speaker identification part 5 and the speaker model producing part 6 can be realized by software. Fig. 5 is a block diagram showing the speaker identification equipment 100 realized by software.
As shown in Fig. 5, the speaker identification equipment 100 comprises a CPU 101, connected via a bus to a ROM storing the BIOS and the like, and a memory 102 comprising ROM and RAM, together constituting a microcomputer. The CPU 101 is connected via the bus and an I/O (not shown) to an HDD 103, a CD-ROM drive 105 that reads a computer-readable CD-ROM 104, a communication device 106 for communicating with the Internet and the like, a keyboard 107, a display 108 such as a CRT or LCD, and the microphone 1.
The CD-ROM 104, as a computer-readable storage medium, stores a program realizing the speaker identification function of the present invention, and the CPU 101 can realize the speaker identification function of the present invention by installing this program. In addition, the voice input through the microphone 1 is stored in the HDD 103 or the like. Then, when the program runs, the speech data stored in the HDD 103 or the like is read to perform the speaker identification process. The speaker identification process realizes functions similar to those of the feature vector producing part 4, the speaker identification part 5, the speaker model producing part 6, and so on, so effects similar to those described above can be obtained.
As the storage medium, various optical discs such as DVDs, various magneto-optical discs, various magnetic disks such as floppy disks, semiconductor memories, and the like can be used. The present invention can also be realized by downloading the program from a network such as the Internet and installing it in the HDD 103 as the storage part. In this case, the storage device storing the program at the server of the transmitting side also constitutes a storage medium of the present invention. The program may run on a given OS (operating system), in which case the program may let the OS execute some part of the processing described above; the program may also be part of a program file group constituting given application software such as word processing software, an OS, or the like.

Claims (8)

1. Speaker identification equipment (100), characterized by comprising:
means for obtaining, based on the low-order speech feature vector group of a first speech feature vector sequence produced from the voice of a speaker to be registered, distances between the speech feature vectors of the first speech feature vector sequence, clustering the first speech feature vector sequence based on the obtained distances, and producing a codebook comprising a plurality of representative vectors;
means for storing the produced codebook;
means for obtaining, based on the low-order speech feature vector group of a second speech feature vector sequence produced from the voice of a speaker to be identified, a quantized distance between (a) each speech feature vector in the second speech feature vector sequence and (b) the corresponding one of the plurality of representative vectors stored in the codebook, quantizing each said speech feature vector in the second speech feature vector sequence based on the obtained quantized distance, and obtaining, based on the high-order speech feature vector group of the second speech feature vector sequence, a quantization distortion between each said speech feature vector in the second speech feature vector sequence and the corresponding one of the plurality of representative vectors stored in the codebook; and
means for performing speaker identification based on the obtained quantization distortion.
2. The speaker identification equipment as claimed in claim 1, wherein each of the first and second speech feature vector sequences is an M-th order speech feature vector sequence, the low-order speech feature vector group is an N-th order (N < M) speech feature vector sequence, the corresponding vectors are M-th order representative vectors, and the high-order speech feature vector group is the M-th order speech feature vector sequence.
3. The speaker identification equipment as claimed in claim 1, wherein each of the first and second speech feature vector sequences is an M-th order speech feature vector sequence, the low-order speech feature vector group is an N-th order (N < M) speech feature vector sequence, the representative vectors are M-th order representative vectors, and the high-order speech feature vector group is a speech feature vector sequence comprising the m-th to M-th order components (N < m < M).
4. Speaker identification equipment (100), characterized by comprising:
means for obtaining, based on a first weight, weighted vector distances between the speech feature vectors of a first speech feature vector sequence produced from the voice of a speaker to be registered, clustering the first speech feature vector sequence based on the obtained weighted vector distances, and producing a codebook comprising a plurality of representative vectors;
means for storing the produced codebook;
means for obtaining, based on a second weight, a weighted quantized distance between the corresponding one of the plurality of representative vectors stored in the codebook and each speech feature vector in a second speech feature vector sequence produced from the voice of a speaker to be identified, quantizing each said speech feature vector in the second speech feature vector sequence based on the obtained weighted quantized distance, and obtaining a weighted quantization distortion, based on a third weight different from the first and second weights, between the corresponding one of the plurality of representative vectors stored in the codebook and each said speech feature vector in the second speech feature vector sequence; and
means for performing speaker identification based on this quantization distortion.
5. The speaker identification equipment as claimed in claim 4, wherein each of the first and second speech feature vector sequences is an M-th order speech feature vector sequence;
wherein both the first weight of the weighted vector distance and the second weight of the weighted quantized distance are U_k, where:
U_k = 1 (k ≤ N), 0 (k > N),
and N < M; and
wherein the third weight of the weighted distortion distance is V_k, where:
V_k = 1 (k ≤ M).
6. The speaker identification equipment as claimed in claim 4, wherein each of the first and second speech feature vector sequences is an M-th order speech feature vector sequence;
wherein both the first weight of the weighted vector distance and the second weight of the weighted quantized distance are U_k, where:
U_k = 1 (k ≤ M); and
wherein the third weight of the weighted distortion distance is V_k, where:
V_k = 1/S_k (k ≤ M),
and the standard deviation of each of the M orders is S_k.
7. A speaker identification method, characterized by comprising:
a step of obtaining, based on the low-order speech feature vector group of a first speech feature vector sequence produced from the voice of a speaker to be registered, the vector distances between the speech feature vectors in the first speech feature vector sequence, clustering the first speech feature vector sequence based on the obtained vector distances, and producing a codebook comprising a plurality of representative vectors;
a step of storing the produced codebook;
a step of obtaining, based on the low-order speech feature vector group of a second speech feature vector sequence produced from the voice of a speaker to be identified, a quantized distance between (a) each speech feature vector in the second speech feature vector sequence and (b) the corresponding one of the plurality of representative vectors stored in the codebook, quantizing each said speech feature vector in the second speech feature vector sequence based on the obtained quantized distance, and obtaining, based on the high-order speech feature vector group of the second speech feature vector sequence, a quantization distortion between each said speech feature vector in the second speech feature vector sequence and the corresponding one of the plurality of representative vectors stored in the codebook; and
a step of performing speaker identification based on the obtained quantization distortion.
8. A speaker identification method, characterized by comprising:
a step of obtaining, based on a first weight, weighted vector distances between the speech feature vectors of a first speech feature vector sequence produced from the voice of a speaker to be registered, clustering the first speech feature vector sequence based on the obtained weighted vector distances, and producing a codebook comprising a plurality of representative vectors;
a step of storing the produced codebook;
a step of obtaining, based on a second weight, a weighted quantized distance between the corresponding one of the plurality of representative vectors stored in the codebook and each speech feature vector in a second speech feature vector sequence produced from the voice of a speaker to be identified, quantizing each said speech feature vector in the second speech feature vector sequence based on the obtained weighted quantized distance, and obtaining a weighted quantization distortion, based on a third weight different from the first and second weights, between the corresponding one of the plurality of representative vectors stored in the codebook and each said speech feature vector in the second speech feature vector sequence; and
a step of performing speaker identification based on the obtained quantization distortion.
CNA2005100995434A 2005-09-13 2005-09-13 Speaker identifying equipment, speaker identifying program and speaker identifying method Pending CN1932974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2005100995434A CN1932974A (en) 2005-09-13 2005-09-13 Speaker identifying equipment, speaker identifying program and speaker identifying method


Publications (1)

Publication Number Publication Date
CN1932974A true CN1932974A (en) 2007-03-21

Family

ID=37878761

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2005100995434A Pending CN1932974A (en) 2005-09-13 2005-09-13 Speaker identifying equipment, speaker identifying program and speaker identifying method

Country Status (1)

Country Link
CN (1) CN1932974A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544337A (en) * 2012-05-29 2014-01-29 通用汽车环球科技运作有限责任公司 Dialogue models for vehicle occupants
CN107210040A (en) * 2015-02-11 2017-09-26 三星电子株式会社 The operating method of phonetic function and the electronic equipment for supporting this method
CN107210040B (en) * 2015-02-11 2021-01-12 三星电子株式会社 Method for operating voice function and electronic device supporting the same
WO2019136811A1 (en) * 2018-01-09 2019-07-18 平安科技(深圳)有限公司 Audio comparison method, and terminal and computer-readable storage medium
CN113168837A (en) * 2018-11-22 2021-07-23 三星电子株式会社 Method and apparatus for processing human voice data of voice


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20070321