WO2007111169A1 - Speaker model registration device and method in a speaker recognition system, and computer program - Google Patents
Speaker model registration device and method in a speaker recognition system, and computer program
- Publication number
- WO2007111169A1 (PCT/JP2007/055433)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speaker
- speaker model
- registration
- utterances
- model
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Definitions
- The present invention relates to the technical field of speaker recognition systems provided in various computer devices and electronic appliances, such as car navigation devices, net banking devices, and auto-lock devices.
- In particular, it relates to a speaker model registration device and method in such a system, and to a computer program that causes a computer to function as such a speaker model registration device.
- Speaker recognition methods are broadly divided into the text-fixed or text-dependent type, in which the uttered text used for recognition is registered in advance, and the text-independent type, which requires no such registration and recognizes arbitrary text.
- Of these, the text-dependent type has come into practical use, and various proposals have been made (see Patent Document 1).
- Patent Document 1 Japanese Unexamined Patent Application Publication No. 2004-294755
- The present invention has been made in view of the above problems, and its object is to provide a speaker model registration device and method for a speaker recognition system in which both the processing on the computer and the operations required of the user are relatively simple when registering text for speaker recognition, a speaker recognition system provided with such a speaker model registration device, and a computer program for causing a computer to function as such a device.

Means for Solving the Problem
- To solve the above problems, the speaker model registration apparatus in the speaker recognition system according to the present invention is an apparatus for registering a speaker model for speaker recognition, comprising: acquisition means for acquiring utterances n + α times (where n is an integer of 2 or more and α is an integer of 1 or more); calculation means for calculating a speaker model using n of the acquired utterances as registration utterances; verification means for verifying the calculated speaker model using α of the acquired utterances as verification utterances; and registration means for registering, among the verified speaker models, one whose verification result satisfies a predetermined criterion as the speaker model for speaker recognition.
- According to the speaker model registration apparatus of the present invention, registration proceeds as follows at the speaker model registration stage of the speaker recognition system.
- Here, "utterance" means the voice or speech information of text spoken by a speaker as a user, which is used at some stage of the speaker recognition process.
- The speaker model is calculated by calculation means comprising, for example, a processor and a memory, after n of the acquired utterances are selected as registration utterances.
- "Registration utterance" means an utterance used for registration.
- A registration utterance need only be used for registration; it is not limited to utterances for which registration ultimately proved valid.
- The verification means, comprising, for example, a processor and a memory, selects α of the utterances acquired by the acquisition means as verification utterances and verifies the speaker model calculated as described above.
- "Verification utterance" means an utterance used as a reference for verification, that is, as a comparison target or comparison standard.
- A verification utterance need only be used for verification; it is not limited to utterances for which verification ultimately proved effective.
- The verification utterance here is used at the registration stage, not for actual speaker recognition.
- The calculation means selects the acquired n utterances as registration utterances either passively or actively, and likewise the verification means selects the acquired α utterances as verification utterances either passively or actively.
- Here, "passively" means selecting according to a predetermined rule, for example selecting the first n utterances (e.g., the first three) as registration utterances and the subsequent α utterances (e.g., the fourth only) as verification utterances, without the calculation means or verification means exercising any particular choice.
- "Actively" means selecting with some deliberate selection action, whether systematic or by trial and error, for example selecting as registration or verification utterances those n or α utterances for which relatively good verification results were obtained.
- A speaker model whose verification result by the verification means satisfies a predetermined criterion is registered by the registration means as the speaker model for speaker recognition.
- Conversely, a speaker model whose verification result does not satisfy the predetermined criterion is not registered as the speaker model for speaker recognition.
- In one aspect, the registration means registers the speaker model for speaker recognition if, as the predetermined criterion, the speaker is accepted as the genuine speaker β or more times out of the α verifications (where β is an integer of 1 or more and α or less).
- Conversely, if the speaker cannot be accepted as the genuine speaker β or more times out of the α verifications, the registration means does not register the model as the speaker model for speaker recognition.
- The determination of whether the verification result satisfies the predetermined criterion may be performed by either the registration means or the verification means. In this way, the registration means can reliably register a highly reliable speaker model.
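As a minimal sketch of this β-out-of-α registration criterion (the function and argument names below are ours, not from the patent), the decision can be written as:

```python
def should_register(similarities, sim_threshold, beta):
    """Register the candidate speaker model only if at least beta of the
    alpha verification utterances are accepted as the genuine speaker,
    i.e., their similarity to the model exceeds the predetermined threshold."""
    accepted = sum(1 for s in similarities if s > sim_threshold)
    return accepted >= beta

# alpha = 4 verification utterances, beta = 3 acceptances required
print(should_register([0.82, 0.91, 0.40, 0.88], sim_threshold=0.7, beta=3))  # True
```

With β = α the criterion demands acceptance on every verification utterance; with β = 1 a single acceptance suffices, so β tunes how strict registration is.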
- the registration means when the registration means does not register as a speaker model for speaker recognition, or the result of the comparison is If the predetermined standard is not satisfied! / ⁇ , the prompting means for discarding the speaker model for which the matching has been performed and prompting the acquisition means to acquire the utterance is further provided.
- the registration unit when the registration unit does not register as a speaker model for speaker recognition, or when the result of collation does not satisfy a predetermined criterion, for example, a display device, a voice output device, a controller or a processor,
- the prompting means having a memory or the like discards the verified speaker model, and then prompts the acquisition means to acquire the utterance.
- the speaker who is the user is prompted to speak again through the display output on the display screen and the voice output in the sound field in front of the speaker model registration device. Therefore, it is possible to reliably register a speaker model with high reliability by the registration means while avoiding registration of a speaker model with low reliability.
- In another aspect, the calculation means changes the selection method for selecting the registration utterances from the utterances acquired n + α times, and performs the calculation again.
- According to this aspect, the calculation means changes the combination of utterances selected as registration utterances from among the n + α acquired utterances and recalculates the speaker model. Thus, even if noise is mixed into some of the utterances, its adverse effect on the results of calculation and verification can be reduced or eliminated by changing the selection of registration utterances and starting over from the calculation of the speaker model.
- As a result, a highly reliable speaker model can be registered.
- In another aspect, the verification means changes the selection method for selecting the verification utterances from the utterances acquired n + α times, and performs the verification again.
- According to this aspect, when the registration means does not register the model, or when the verification result does not satisfy the predetermined criterion, the verification means changes which of the n + α acquired utterances are selected as verification utterances and then performs verification again. Thus, even if noise is mixed into some of the utterances, its adverse effect on the verification result can be reduced or eliminated by changing the selection of verification utterances and re-verifying.
- In another aspect, the calculation means performs the calculation in a plurality of ways by changing the combination of registration utterances selected from the utterances acquired n + α times, and the registration means registers, among the plurality of calculated speaker models, the one with the best verification result.
- According to this aspect, a plurality of speaker models are calculated by the calculation means after changing the combination of registration utterances selected from the n + α acquired utterances. Thus, even if noise is mixed into some of the utterances, the adverse effect on the results of calculating and verifying the speaker model can be reduced or eliminated by adopting a case in which the calculation was performed without problem. By excluding utterances into which noise was mixed, or utterances that themselves failed, the registration means can register a highly reliable speaker model.
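Enumerating the candidate registration/verification splits described above is a simple combinatorial step. The hypothetical helper below (names are ours) yields every way of choosing n registration utterances out of the n + α acquired ones, with the leftover indices serving as verification utterances:

```python
from itertools import combinations

def registration_splits(num_utts, n):
    """Yield (registration_indices, verification_indices) pairs for every
    way of choosing n registration utterances from num_utts = n + alpha
    acquired utterances; the remainder become verification utterances."""
    idx = range(num_utts)
    for reg in combinations(idx, n):
        ver = tuple(i for i in idx if i not in reg)
        yield reg, ver

# n + alpha = 4 acquired utterances, n = 3 -> 4 candidate splits
splits = list(registration_splits(4, n=3))
print(len(splits))  # 4
print(splits[0])    # ((0, 1, 2), (3,))
```

A device would train one candidate model per split and, per this aspect, keep the model whose verification result is best.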
- In another aspect, the verification means performs the verification in a plurality of ways by changing the combination of verification utterances selected from the utterances acquired n + α times, and the registration means determines whether a statistical value of, or at least one of, the results of the plurality of verifications satisfies the predetermined criterion.
- According to this aspect, verification is performed by the verification means after changing which of the n + α acquired utterances are selected as verification utterances. Thus, even if noise is mixed into some of the utterances, its adverse effect on the verification result can be reduced or eliminated by adopting a case in which verification was performed without problem.
- In this way, repetition of the processing and operations involved in acquiring utterances can be effectively avoided, and a highly reliable speaker model can be registered.
- To solve the above problems, a speaker recognition system according to the present invention comprises the above-described speaker model registration device (including its various aspects) and recognition means for recognizing an utterance by an arbitrary speaker based on the registered speaker model.
- According to the speaker recognition system of the present invention, since the speaker model registration device of the present invention is provided, highly reliable speaker recognition is possible through a relatively simple registration process or registration operation.
- Another speaker recognition system according to the present invention comprises the above-described speaker model registration device (including its various aspects), wherein the verification means also functions as recognition means for recognizing an utterance by an arbitrary speaker based on the registered speaker model.
- According to this system as well, highly reliable speaker recognition is possible through a relatively simple registration process or operation. Moreover, since the verification means used for registration also serves as the recognition means used for recognition, the system configuration can be simplified, which is highly advantageous.
- In another aspect, the recognition means performs the recognition based on the similarity of the utterance by the arbitrary speaker to the registered speaker model.
- To solve the above problems, the speaker model registration method in the speaker recognition system according to the present invention is a method for registering a speaker model for speaker recognition, comprising: an acquisition step of acquiring utterances n + α times (where n is an integer of 2 or more and α is an integer of 1 or more); a calculation step of calculating a speaker model using n of the acquired utterances as registration utterances; a verification step of verifying the calculated speaker model using α of the acquired utterances as verification utterances; and a registration step of registering, among the verified speaker models, one whose verification result satisfies a predetermined criterion as the speaker model for speaker recognition.
- The speaker model registration method of the present invention can adopt various aspects similar to those of the speaker model registration apparatus described above.
- To solve the above problems, a computer program according to the present invention causes a computer provided in a speaker model registration apparatus for registering a speaker model for speaker recognition to function as: acquisition means for acquiring utterances n + α times (where n is an integer of 2 or more and α is an integer of 1 or more); calculation means for calculating a speaker model using n of the acquired utterances as registration utterances; verification means for verifying the calculated speaker model using α of the acquired utterances as verification utterances; and registration means for registering, among the verified speaker models, one whose verification result satisfies a predetermined criterion as the speaker model for speaker recognition.
- According to the computer program of the present invention, the speaker model registration apparatus of the present invention can be constructed relatively easily by reading the program into the computer from a recording medium such as a CD-ROM or DVD-ROM on which it is stored, or by downloading it via communication means, and then executing it.
- As a result, as with the speaker model registration apparatus described above, even when acquisition of utterances must be repeated because noise is mixed into the speaker's utterances or the utterances themselves fail, the situation in which the registration operation must be repeated can be avoided very efficiently, and registration of an unreliable speaker model can be avoided very reliably.
- To solve the above problems, a computer program product in a computer-readable medium according to the present invention causes a computer to function as: the acquisition means for acquiring utterances n + α times (where n is an integer of 2 or more and α is an integer of 1 or more); the calculation means for calculating a speaker model using n of the acquired utterances as registration utterances; the verification means for verifying the calculated speaker model using α of the acquired utterances as verification utterances; and the registration means for registering, among the verified speaker models, one whose verification result satisfies the predetermined criterion as the speaker model for speaker recognition.
- According to the computer program product of the present invention, the above-described speaker model registration apparatus can be realized relatively easily by reading the product into a computer from a recording medium such as a ROM, CD-ROM, DVD-ROM, or hard disk on which it is stored, or by downloading the product, which may be carried on a transmission wave, to the computer via communication means, and then executing it.
- More specifically, the computer program product may be configured by computer-readable code (or computer-readable instructions) that causes the computer to function as the above-described speaker model registration apparatus of the present invention.
- As explained above, the speaker model registration apparatus of the present invention comprises the calculation means, the verification means, and the registration means, and the speaker model registration method comprises the calculation step, the verification step, and the registration step, so that the situation in which the registration operation must be repeated can be avoided very efficiently, and registration of an unreliable speaker model can be avoided.
- According to the speaker recognition system of the present invention, since the speaker model registration device of the present invention is provided, speaker recognition with extremely high reliability is possible through a relatively simple registration process or registration operation.
- According to the computer program of the present invention, the computer functions as the calculation means, the verification means, and the registration means, so that the speaker model registration apparatus of the present invention can be constructed relatively easily.
- FIG. 1 is a block diagram conceptually showing the basic structure of a speaker model registration device in a speaker recognition system according to a first embodiment of the present invention.
- FIG. 2 is a block diagram conceptually showing the basic structure of the speaker model registration device in the speaker recognition system according to a second embodiment.
- FIG. 3 is a flowchart showing an operation process of the speaker model registration device in the speaker recognition system according to the second embodiment.
- FIG. 4 is a flowchart showing an operation process of the speaker model registration device in the speaker recognition system according to a third embodiment.
- FIG. 5 is a flowchart showing an operation process of the speaker model registration device in the speaker recognition system according to a fourth embodiment.
- FIG. 6 is a flowchart showing an operation process of the speaker model registration device in the speaker recognition system according to a fifth embodiment.
- FIG. 7 is a flowchart showing an operation process at the time of speaker recognition in the speaker recognition system according to a sixth embodiment.

Explanation of Symbols
- FIG. 1 is a block diagram conceptually showing the basic structure of the speaker model registration device in the speaker recognition system according to the first embodiment of the present invention.
- In FIG. 1, the speaker model registration device 10 in the speaker recognition system 1 comprises an acquisition unit 13 as an example of the "acquisition means" according to the present invention, a calculation unit 20 as an example of the "calculation means", a verification unit 30 as an example of the "verification means" and the "recognition means", a registration unit 40 as an example of the "registration means", and a prompting unit 50 as an example of the "prompting means" according to the present invention.
- The acquisition unit 13 comprises a voice input device such as a microphone, and acquires an utterance each time the user 12 (for example, Mr. Suzuki) speaks a keyword (for example, "hiratake goma"), n + α times in total.
- Here, n is the number of registration utterances, that is, the number of utterances required to calculate and register the speaker model 25.
- α is the number of verification utterances, that is, the number of utterances required to check whether the calculated speaker model 25 is appropriate.
- In this example, n = 3; that is, the speaker model 25 (for example, a Suzuki model) is calculated based on three utterances.
- Also, α = 1; that is, the speaker model 25 is verified based on one verification utterance.
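To make the n = 3 / α = 1 division concrete, the sketch below stands in a mean feature vector for the speaker model 25 and cosine similarity for the comparison; the patent itself uses HMM or DP-matching models, so both choices here are simplifying assumptions for illustration only:

```python
import numpy as np

def train_model(reg_feats):
    """Stand-in 'speaker model': the mean of the n registration feature
    vectors (the patent uses HMM/DP-matching models instead)."""
    return np.mean(reg_feats, axis=0)

def similarity(model, feat):
    """Cosine similarity between the model and one verification utterance."""
    return float(np.dot(model, feat) /
                 (np.linalg.norm(model) * np.linalg.norm(feat)))

utts = np.array([
    [1.0, 0.2], [0.9, 0.1], [1.1, 0.3],  # n = 3 registration utterances
    [1.0, 0.25],                         # alpha = 1 verification utterance
])
model = train_model(utts[:3])
print(similarity(model, utts[3]) > 0.99)  # True: same speaker, high similarity
```

The single verification utterance is the one later compared against the predetermined threshold to decide whether the model may be registered.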
- The calculation unit 20 is logically constructed according to a program in a computer having a processor, a memory, and the like, and calculates, based on n of the utterances acquired by the acquisition unit 13, the speaker model 25 capturing the characteristics of the user 12 (Mr. Suzuki) when uttering the keyword.
- The verification unit 30 is likewise logically constructed according to a program in a computer having a processor, a memory, and the like, and compares α of the utterances by the user 12 (Mr. Suzuki), as verification utterances, with the calculated speaker model 25.
- In this example, a single verification utterance by the user 12 (Mr. Suzuki) is compared with the calculated speaker model 25.
- As described later, the verification unit 30 may also function as a recognition unit.
- The registration unit 40 is logically constructed according to a program in a computer having a processor, a memory, and the like.
- As a result of the verification by the verification unit 30, a speaker model 25 that satisfies the predetermined criterion is registered, as the speaker model for speaker recognition, in a speaker model database 45 built in a large-capacity storage device such as a hard disk device or an optical disk device provided in the computer.
- That is, the speaker model 25 of the user 12 (Mr. Suzuki) is registered in the speaker model database 45 after it has been verified to be appropriate, i.e., to function properly.
- In addition, an utterance by another person (for example, Mr. Sato instead of Mr. Suzuki) may be used as a verification utterance serving as a negative control, so that the model is registered after confirming that such a speaker is recognized as not being the genuine speaker.
- The prompting unit 50 prompts the user 12 to provide registration utterances again when there is a problem or inappropriateness in the speaker model 25 calculated by the calculation unit 20 or in the utterances on which it is based. For example, a prompt message such as "Please speak again" is displayed on the display, or output as voice. Processing based on the above configuration is repeated until the prompting unit 50 no longer issues a prompt, in other words, until the speaker model 25 for speaker recognition is registered.
- The device may further include the following recognition unit 30.
- The recognition unit 30 is logically constructed according to a program in a computer having a processor, a memory, and the like.
- It compares an utterance by an arbitrary speaker (here, a speaker not limited to the person who registered the speaker model 25; for example, a third party attempting to impersonate Mr. Suzuki) with the registered speaker model 25, and recognizes whether or not the arbitrary speaker seeking recognition is the speaker of the registered speaker model 25. Specifically, as a result of the comparison, if the similarity satisfies a predetermined criterion, the arbitrary speaker seeking recognition is recognized as the speaker of the registered speaker model 25; if not, the speaker is recognized as not being that speaker.
- As described above, in the first embodiment, the speaker model 25 for speaker recognition is registered appropriately.
- FIG. 2 is a block diagram conceptually showing the basic structure of the speaker model registration apparatus in the speaker recognition system according to the second embodiment. In FIGS. 2 and 3, the same reference numerals are given to the same components as those of the first embodiment shown in FIG. 1, and their description is omitted as appropriate.
- The microphone 132 is a device that converts each utterance into an electrical signal and inputs it to the speaker recognition system 1 when the user 12 utters the keyword n + α times.
- The voice portion extraction unit 142 is logically constructed according to a program in a computer having a processor, a memory, and the like, and is a computing device that cuts out the uttered-voice portion of the keyword from the electrical signal of the converted utterance by, for example, a general voice segment detection method that uses the power difference between background noise and the voiced utterance segment.
- The feature amount calculation unit 201 is logically constructed according to a program in a computer having a processor, a memory, and the like, and is a computing device that converts an input uttered-voice portion into a feature amount such as MFCC (Mel Frequency Cepstrum Coefficients) or LPC (Linear Predictive Coding) cepstrum.
- Of the calculated feature amounts, one part (for the n registration utterances) is transmitted to the speaker model calculation unit 202, and another part (for the α verification utterances) is transmitted to the verification/registration unit 41.
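As a rough illustration of the cepstral features mentioned here, the NumPy-only sketch below computes a log-magnitude spectrum followed by a DCT-II for one windowed frame. Real MFCCs additionally warp the spectrum onto the mel scale with a filterbank before the DCT; that step is deliberately omitted, so this is a toy stand-in rather than the unit's actual computation:

```python
import numpy as np

def simple_cepstrum(frame, n_coeffs=13):
    """Toy cepstral feature for one windowed frame: log magnitude
    spectrum followed by a DCT-II (mel filterbank omitted)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spec = np.log(spectrum + 1e-10)
    k = np.arange(len(log_spec))
    # DCT-II basis applied to the log spectrum
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1)
                   / (2 * len(log_spec)))
    return basis @ log_spec

sr = 8000
t = np.arange(sr // 100) / sr          # one 10 ms frame at 8 kHz
frame = np.sin(2 * np.pi * 440 * t)    # a 440 Hz test tone
feats = simple_cepstrum(frame)
print(feats.shape)  # (13,)
```

Each utterance thus becomes a sequence of such per-frame vectors, which is what the downstream model calculation and matching units consume.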
- The speaker model calculation unit 202 is logically constructed according to a program in a computer having a processor, a memory, and the like, and is a computing device that calculates and learns the speaker model used for matching, using the n feature amounts calculated by the feature amount calculation unit 201 as a batch.
- Here, the speaker model is represented as a speaker template in various speech recognition algorithms, such as a speaker HMM (Hidden Markov Model) or DP (Dynamic Programming) matching.
- The matching unit 30 is a computing device that compares the speaker model calculated by the speaker model calculation unit 202 with the feature amounts for verification and calculates their similarity. As the similarity, the likelihood or the reciprocal of a distance measure is used; when the distance measure itself is used rather than its reciprocal, the control method must be changed accordingly. Specifically, the direction of the inequality sign used when comparing with the predetermined threshold in the verification/registration unit 41 is reversed.
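The threshold comparison and its reversed inequality for distance-type scores can be sketched as follows (argument names are ours, for illustration):

```python
def accepted(score, threshold, score_is_distance=False):
    """Threshold test as in the verification/registration unit: for a
    likelihood/similarity score, higher is better (accept if score >=
    threshold); for a raw distance score, lower is better, so the
    direction of the inequality is reversed."""
    return score <= threshold if score_is_distance else score >= threshold

print(accepted(0.9, 0.7))                          # similarity 0.9 >= 0.7 -> True
print(accepted(0.9, 0.7, score_is_distance=True))  # distance 0.9 > 0.7 -> False
```

Keeping the score convention explicit like this avoids silently accepting poor matches when a similarity-based threshold is reused with a distance measure.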
- The verification/registration unit 41 is logically constructed according to a program in a computer having a processor, a memory, and the like, and compares the similarity calculated by the matching unit 30 with a predetermined threshold to verify whether the feature amount of each of the α verification utterances is accepted as the genuine speaker, in other words, whether the calculated speaker model may be registered in the speaker model database 45.
- The verification/registration unit 41 then registers the speaker model verified as fit for registration in the speaker model database 45.
- The display screen 52 is a display device, for example a liquid crystal display, that displays the verification result or a prompting message.
- FIG. 3 is a flowchart showing an operation process of the speaker model registration device in the speaker recognition system according to the second embodiment.
- In FIG. 3, first, n + α utterances are input to the speaker model registration device 10 via the microphone 132 (step S101).
- At this time, utterances other than the keyword should be avoided, for example by instructing the user through text displayed on the screen or through guidance voice.
- Next, the voice portions of the input n + α utterances are extracted by the voice portion extraction unit 142 (step S102).
- Next, the user's speaker model is calculated and learned using the voice portions of the n + α utterances (step S103). Specifically, the voice portion of each of the n + α utterances is converted into a feature amount by the feature amount calculation unit 201; of these, the feature amounts of the n registration utterances are transmitted to the speaker model calculation unit 202, where the user's speaker model is calculated, and the feature amounts of the remaining α utterances (the verification utterances) are transmitted to the matching unit 30 for verification.
- Next, the calculated speaker model of the user is compared by the matching unit 30 with the feature amounts of the α verification utterances (step S104). For example, the similarity between the calculated speaker model of the user and the feature amount of each of the α verification utterances is calculated.
- The verification results, i.e., the similarities between the calculated speaker model of the user and each verification utterance, are aggregated by the verification/registration unit 41 (step S105), and it is determined whether this aggregated result satisfies the registration criterion, that is, whether the calculated speaker model of the user may be registered (step S106). For example, it is determined whether the model was accepted as the genuine speaker β or more times out of the α verification utterances. Specifically, it is determined whether the number of verification utterances, out of the α, whose similarity to the calculated speaker model exceeds a predetermined similarity threshold is β or more.
- Here, the "predetermined similarity threshold" is the similarity corresponding to the registration criterion, and the value may have a margin. However, if the margin is too large, persons other than the user will be recognized as the user; conversely, if it is too small, even the user may fail to be recognized depending on his or her physical condition. Therefore, the "predetermined similarity threshold" should be determined by experiment or simulation, taking the above into account, as a similarity that allows the user's utterances and non-users' utterances to be distinguished sufficiently well in practice.
- when it is determined in step S106 that the aggregated result satisfies the registration judgment criterion (step S106: Yes), the verification/registration unit 41 registers the calculated speaker model of the user as speaker model data (step S1071), the user is notified via the display screen 52 (step S1081), and the registration is completed.
- when it is not determined in step S106 that the aggregated result satisfies the registration judgment criterion (step S106: No), the user's speaker model held in the memory 50 is discarded (step S1072), and a notification prompting the user to re-register is given via the display screen 52 (step S1082). The above process is repeated until a speaker model is registered.
- since the speaker model registration device 10 in the speaker recognition system 1 operates as shown in FIG. 3, the speaker model is appropriately registered.
- the registration utterances and verification utterances are acquired first, the model is learned from the registration utterances, and the speaker recognition performance of the learned speaker model is then verified with the verification utterances. Thus, once the keyword text has been entered, the user is not forced to perform any extra operations, and even if noise is mixed into the first utterances, the user or administrator is not burdened with manual tasks such as confirmation. This is very convenient in practice.
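The overall acquire-learn-verify loop of steps S103 through S106 (with discard and re-prompt on failure, steps S1072/S1082) can be sketched as below. The callables `acquire`, `learn`, and `score` are hypothetical hooks standing in for the acquisition unit, the speaker model calculation unit 202, and the verification unit 30 respectively.

```python
def register_speaker(acquire, learn, score, threshold, beta):
    """Repeat acquire -> learn -> verify until a model satisfies the
    registration judgment criterion (an illustrative sketch).

    acquire() -> (registration_utts, verification_utts)
    learn(registration_utts) -> model
    score(model, utterance) -> similarity
    """
    while True:
        registration, verification = acquire()
        model = learn(registration)                      # step S103
        accepted = sum(1 for u in verification
                       if score(model, u) > threshold)   # steps S104-S105
        if accepted >= beta:                             # step S106: Yes
            return model                                 # step S1071
        # step S106: No -- discard the model and prompt re-utterance
```

The loop terminates only when a model passes, mirroring "the above process is repeated until a speaker model is registered".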
- FIG. 4 is a flowchart showing the operation process of the speaker model registration device in the speaker recognition system according to the third embodiment.
- in FIG. 4, the same reference numerals are given to the same components or processes as those in the above drawings, and their description is omitted as appropriate.
- the flowchart in FIG. 4 differs from the flowchart in FIG. 3 mainly in the processing after the speaker model is discarded (step S1072).
- when the speaker model is discarded (step S1072), the user is not immediately prompted to re-utter; instead, it is confirmed whether the choices of the n registration utterances and the α verification utterances have been exhausted (step S3073). For example, a plurality of ways of selecting may be decided in advance, and it is checked whether all of them have already been tried.
- if the choices are exhausted (step S3073: Yes), the user is notified of re-registration via the display screen 52 (step S1082).
- if the choices are not exhausted (step S3073: No), the speaker model is calculated again by changing the way of selecting the n registration utterances or the way of selecting the α verification utterances.
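The set of selection ways checked in step S3073 could, for instance, simply be the combinations of which n of the n + α acquired utterances serve as registration utterances, with the rest used for verification. A sketch under that assumption (the function name is illustrative):

```python
from itertools import combinations

def selection_candidates(num_total, n):
    """Enumerate, in advance, the ways of choosing which n of the
    num_total (= n + alpha) utterances serve as registration utterances;
    step S3073 then amounts to checking whether this list is exhausted."""
    return [set(c) for c in combinations(range(num_total), n)]
```

With n = 2 registration utterances and α = 2 verification utterances there are C(4, 2) = 6 candidate selections, so up to six models can be tried before the user is asked to re-utter.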
- as described above with reference to FIG. 4 in addition to FIGS. 2 and 3, according to the speaker model registration device 10 in the speaker recognition system 1 according to the present embodiment, the speaker model is appropriately registered. Moreover, since utterances already entered are reused, the burden on the user is reduced, which is very advantageous in practice.
- FIG. 5 is a flowchart showing the operation process of the speaker model registration device in the speaker recognition system according to the fourth embodiment.
- the same reference numerals are given to the same components or processes as those according to the above drawings, and the description thereof will be omitted as appropriate.
- the flowchart in FIG. 5 differs from the flowchart in FIG. 3 mainly in the processing from the extraction of the utterance voice portion of the input utterances (step S102) up to the determination of whether the registration judgment criterion is satisfied (step S106).
- a plurality of speaker models of the user are calculated and learned using the utterance voice portions of the n + α utterances (step S403).
- the verification unit 30 collates the calculated plurality of speaker models of the user with the feature quantities of the α verification utterances (step S404).
- the verification results, i.e. the similarities between the plurality of speaker models of the user calculated in this way and each of the verification utterances, are tabulated by the verification/registration unit 41 (step S405), and the speaker model with the best matching result is selected from among the plurality of speaker models (step S406). For example, the speaker model having the highest average similarity with the verification utterances that can be recognized as the person in question is selected as having the best matching result. At this time, instead of the average value, another measure such as the maximum value, the minimum value, or the median value may be determined in advance and adopted.
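The best-model selection of steps S405 and S406 can be sketched as below. The aggregate statistic (mean by default; max, min, or median may be fixed in advance instead, as the text notes) is passed in as a parameter; all names are illustrative.

```python
from statistics import mean

def best_model(models, score, verification, stat=mean):
    """Pick, from several candidate speaker models, the one whose
    similarities against the verification utterances are best under
    the chosen aggregate statistic (illustrative sketch of S405-S406)."""
    def aggregate(model):
        return stat(score(model, u) for u in verification)
    return max(models, key=aggregate)
```

Swapping `stat=max`, `stat=min`, or `statistics.median` changes the selection rule without altering the rest of the flow, matching the patent's remark that another measure may be determined in advance.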
- in step S106, it is determined whether the aggregated result for the speaker model with the best matching result satisfies the registration judgment criterion.
- since a plurality of speaker models are stored and the best one is selected, for example, utterances made by the speaker while noise is mixed in, or utterances that themselves fail, are excluded, so the processes and operations related to utterance acquisition need not be repeated inefficiently.
- therefore, the verification/registration unit 41 can select and register a highly reliable speaker model.
- FIG. 6 is a flowchart showing the operation process of the speaker model registration device in the speaker recognition system according to the fifth embodiment.
- in FIG. 6, the same components or processes as those in the above drawings are denoted by the same reference numerals, and their description is omitted as appropriate.
- the flowchart of FIG. 6 differs from the flowchart of FIG. 3 mainly in that, when the speaker model is verified and found to satisfy the registration judgment criterion, the speaker model is learned and registered again based on the n registration utterances plus the β verification utterances recognized as the person himself/herself, instead of registering the model based on the n registration utterances as is.
- the verification results, i.e. the similarities between the calculated speaker model of the user and each of the verification utterances, are aggregated by the verification/registration unit 41 (step S105). Assume that it is determined that the registration judgment criterion is satisfied (step S106: Yes).
- the speaker model is then re-calculated by the speaker model calculation unit 202 by further adding the β verification utterances recognized as the person himself/herself to the n registration utterances (step S5071), and finally a speaker model based on these n + β utterances is registered.
- alternatively, adaptation processing using the β verification utterances may be performed on the speaker model.
- in this way, the speaker model calculation unit 202 can perform speaker model calculation or adaptation processing with high reliability.
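The relearning of step S5071 — augmenting the n registration utterances with the β accepted verification utterances before the final model is learned — can be sketched as follows. The `learn` hook and index bookkeeping are assumptions for illustration.

```python
def relearn_with_accepted(learn, registration, verification, accepted_idx):
    """Re-learn the speaker model from the n registration utterances
    plus the beta verification utterances that were accepted as the
    user himself/herself (illustrative sketch of step S5071)."""
    accepted = [verification[i] for i in accepted_idx]
    return learn(registration + accepted)  # model from n + beta utterances
```

Only the verification utterances that actually passed the similarity check are reused, so noisy or failed utterances do not contaminate the final model.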
- FIG. 7 is a flowchart showing an operation process at the time of speaker recognition in the speaker recognition system according to the sixth embodiment.
- the user, that is, the speaker, utters, and the utterance voice at this time is recorded via the microphone 132 (step S601).
- a voice utterance section is extracted by the extraction unit 142 (step S602).
- the extracted utterance voice section is converted into a feature quantity by the feature quantity calculation unit 201 and sent to the verification unit 30 (step S603).
- the verification unit 30 collates the sent feature quantity with each speaker model registered by the speaker model registration device 10 according to the above-described embodiments, and calculates a similarity corresponding to each speaker model (step S604). The speaker corresponding to the speaker model with the highest similarity (hereinafter referred to as the maximum similarity) is then selected as a recognition result candidate (step S605).
- in step S606, the maximum similarity is compared with a threshold set in advance so that the speech of another person can be rejected with sufficient accuracy. If the maximum similarity is higher than the threshold (step S606: Yes), it is determined that the speaker is the corresponding speaker (step S6071), and the result is output to the display screen 52 (step S6081).
- if the maximum similarity is lower than the threshold (step S606: No), the recognition result candidate is not recognized as the speaker; the speaker is rejected (step S6072), and a recognition failure screen is displayed (step S6082).
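The identification flow of steps S604 through S607 — score against every registered model, take the best-scoring speaker as the candidate, then accept or reject by the threshold — can be sketched as below, with hypothetical names:

```python
def recognize(feature, models, score, reject_threshold):
    """Compare the input feature against every registered speaker model
    (step S604), take the speaker with the maximum similarity as the
    candidate (step S605), and accept only if that maximum similarity
    clears the rejection threshold (step S606). Illustrative sketch."""
    similarities = {spk: score(m, feature) for spk, m in models.items()}
    candidate = max(similarities, key=similarities.get)
    if similarities[candidate] > reject_threshold:
        return candidate   # recognized as the corresponding speaker
    return None            # rejected as another person (step S6072)
```

Raising `reject_threshold` trades false acceptances of other persons for false rejections of the user, which is why the text requires it to be tuned so that another person's speech is rejected with sufficient accuracy.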
- alternatively, the speaker model to be verified may be narrowed down to one by the speaker declaring in advance who he or she is, by speech or by keyboard input.
- in that case, the similarity may be obtained by comparison with that one model, and whether the speaker is recognized or rejected may be determined by comparison with a threshold value.
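This one-to-one verification variant, where the claimed identity selects a single model, reduces to a single threshold comparison. A minimal sketch with illustrative names:

```python
def verify_claimed(feature, claimed_model, score, threshold):
    """Verification variant: the speaker declares an identity first, so
    only that one registered speaker model is compared; accept or
    reject by a single threshold comparison (illustrative sketch)."""
    return score(claimed_model, feature) > threshold
```

Compared with open-set identification over all models, this avoids scoring every registered speaker and only answers whether the claimed identity is plausible.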
- the speaker recognition system 1 includes the speaker model registration device 10 according to the above-described embodiments. Therefore, highly reliable speaker recognition is possible through a relatively simple registration operation.
- the operation processes shown in the above embodiments may be realized by operating the speaker recognition system based on a speaker model registration method in the speaker recognition system 1 including an acquisition process, a calculation process, a collation process, and a registration process. Alternatively, they may be realized by causing a computer provided in the speaker recognition system 1, having an acquisition means, a calculation means, a verification means, and a registration means, to read a computer program.
- the speaker model registration apparatus and method in a speaker recognition system can be installed in various computer devices and various electronic and electrical devices such as car navigation devices, net banking devices, auto-lock devices, and computer recognition devices, and can be used for a speaker model registration device in a speaker recognition system that performs speaker recognition based on the utterances of the speaker who is the user.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- User Interface Of Digital Computer (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/293,943 US20090106025A1 (en) | 2006-03-24 | 2007-03-16 | Speaker model registering apparatus and method, and computer program |
JP2008507435A JP4854732B2 (ja) | 2006-03-24 | 2007-03-16 | 話者認識システムにおける話者モデル登録装置及び方法、並びにコンピュータプログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006-084275 | 2006-03-24 | ||
JP2006084275 | 2006-03-24 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007111169A1 true WO2007111169A1 (ja) | 2007-10-04 |
Family
ID=38541089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2007/055433 WO2007111169A1 (ja) | 2006-03-24 | 2007-03-16 | 話者認識システムにおける話者モデル登録装置及び方法、並びにコンピュータプログラム |
Country Status (3)
Country | Link |
---|---|
US (1) | US20090106025A1 (ja) |
JP (1) | JP4854732B2 (ja) |
WO (1) | WO2007111169A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20180010923A (ko) * | 2015-07-22 | 2018-01-31 | 구글 엘엘씨 | 개별화된 핫워드 검출 모델들 |
US10832685B2 (en) | 2015-09-15 | 2020-11-10 | Kabushiki Kaisha Toshiba | Speech processing device, speech processing method, and computer program product |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9042867B2 (en) * | 2012-02-24 | 2015-05-26 | Agnitio S.L. | System and method for speaker recognition on mobile devices |
GB201802309D0 (en) * | 2017-11-14 | 2018-03-28 | Cirrus Logic Int Semiconductor Ltd | Enrolment in speaker recognition system |
US20230215422A1 (en) * | 2022-01-05 | 2023-07-06 | Google Llc | Multimodal intent understanding for automated assistant |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS584198A (ja) * | 1981-06-30 | 1983-01-11 | 株式会社日立製作所 | 音声認識装置における標準パタ−ン登録方式 |
JPS62245295A (ja) * | 1986-04-18 | 1987-10-26 | 株式会社リコー | 特定話者音声認識装置 |
JPH02210500A (ja) * | 1989-02-10 | 1990-08-21 | Ricoh Co Ltd | 標準パターン登録方式 |
JPH02298996A (ja) * | 1989-05-12 | 1990-12-11 | Toshiba Corp | 単語音声認識装置 |
JPH09218696A (ja) * | 1996-02-14 | 1997-08-19 | Ricoh Co Ltd | 音声認識装置 |
JPH1020882A (ja) * | 1996-07-01 | 1998-01-23 | Ricoh Co Ltd | 音声認識装置および標準パターン登録方法 |
JP2000155595A (ja) * | 1998-11-19 | 2000-06-06 | Canon Inc | 撮像装置 |
JP2004279770A (ja) * | 2003-03-17 | 2004-10-07 | Kddi Corp | 話者認証装置及び判別関数設定方法 |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS5681781A (en) * | 1979-12-05 | 1981-07-04 | Nippon Electric Co | Sound lock system |
JPH10133680A (ja) * | 1996-09-06 | 1998-05-22 | Amtex Kk | 音声データ記憶者判定装置 |
US6182037B1 (en) * | 1997-05-06 | 2001-01-30 | International Business Machines Corporation | Speaker recognition over large population with fast and detailed matches |
US5897616A (en) * | 1997-06-11 | 1999-04-27 | International Business Machines Corporation | Apparatus and methods for speaker verification/identification/classification employing non-acoustic and/or acoustic models and databases |
US6141644A (en) * | 1998-09-04 | 2000-10-31 | Matsushita Electric Industrial Co., Ltd. | Speaker verification and speaker identification based on eigenvoices |
US6748356B1 (en) * | 2000-06-07 | 2004-06-08 | International Business Machines Corporation | Methods and apparatus for identifying unknown speakers using a hierarchical tree structure |
ATE335195T1 (de) * | 2001-05-10 | 2006-08-15 | Koninkl Philips Electronics Nv | Hintergrundlernen von sprecherstimmen |
US6996526B2 (en) * | 2002-01-02 | 2006-02-07 | International Business Machines Corporation | Method and apparatus for transcribing speech when a plurality of speakers are participating |
JP2004309779A (ja) * | 2003-04-07 | 2004-11-04 | Casio Comput Co Ltd | 音声認証装置 |
JP2005241215A (ja) * | 2004-02-27 | 2005-09-08 | Mitsubishi Electric Corp | 電気機器、冷蔵庫、冷蔵庫の操作方法 |
JP4254753B2 (ja) * | 2005-06-30 | 2009-04-15 | ヤマハ株式会社 | 話者認識方法 |
- 2007-03-16 WO PCT/JP2007/055433 patent/WO2007111169A1/ja active Application Filing
- 2007-03-16 US US12/293,943 patent/US20090106025A1/en not_active Abandoned
- 2007-03-16 JP JP2008507435A patent/JP4854732B2/ja not_active Expired - Fee Related
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20180010923A (ko) * | 2015-07-22 | 2018-01-31 | 구글 엘엘씨 | 개별화된 핫워드 검출 모델들 |
US10438593B2 (en) | 2015-07-22 | 2019-10-08 | Google Llc | Individualized hotword detection models |
US10535354B2 (en) | 2015-07-22 | 2020-01-14 | Google Llc | Individualized hotword detection models |
KR102205371B1 (ko) | 2015-07-22 | 2021-01-20 | 구글 엘엘씨 | 개별화된 핫워드 검출 모델들 |
US10832685B2 (en) | 2015-09-15 | 2020-11-10 | Kabushiki Kaisha Toshiba | Speech processing device, speech processing method, and computer program product |
Also Published As
Publication number | Publication date |
---|---|
US20090106025A1 (en) | 2009-04-23 |
JP4854732B2 (ja) | 2012-01-18 |
JPWO2007111169A1 (ja) | 2009-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111566729B (zh) | 用于远场和近场声音辅助应用的利用超短语音分段进行的说话者标识 | |
JP6394709B2 (ja) | 話者識別装置および話者識別用の登録音声の特徴量登録方法 | |
US9424837B2 (en) | Voice authentication and speech recognition system and method | |
US8010367B2 (en) | Spoken free-form passwords for light-weight speaker verification using standard speech recognition engines | |
JP4588069B2 (ja) | 操作者認識装置、操作者認識方法、および、操作者認識プログラム | |
AU2013203139A1 (en) | Voice authentication and speech recognition system and method | |
JP5172973B2 (ja) | 音声認識装置 | |
Li et al. | Verbal information verification | |
JP4897040B2 (ja) | 音響モデル登録装置、話者認識装置、音響モデル登録方法及び音響モデル登録処理プログラム | |
JP2010020102A (ja) | 音声認識装置、音声認識方法及びコンピュータプログラム | |
JP2013232017A (ja) | 音声認識システムのパフォーマンスを評価および改善するための方法およびシステム | |
JP4854732B2 (ja) | 話者認識システムにおける話者モデル登録装置及び方法、並びにコンピュータプログラム | |
US11416593B2 (en) | Electronic device, control method for electronic device, and control program for electronic device | |
JP4143541B2 (ja) | 動作モデルを使用して非煩雑的に話者を検証するための方法及びシステム | |
JP2008233305A (ja) | 音声対話装置、音声対話方法及びプログラム | |
JP7339116B2 (ja) | 音声認証装置、音声認証システム、および音声認証方法 | |
JP3837061B2 (ja) | 音信号認識システムおよび音信号認識方法並びに当該音信号認識システムを用いた対話制御システムおよび対話制御方法 | |
JP2017161581A (ja) | 音声認識装置、音声認識プログラム | |
WO2007111197A1 (ja) | 話者認識システムにおける話者モデル登録装置及び方法、並びにコンピュータプログラム | |
CN117378006A (zh) | 混合多语种的文本相关和文本无关说话者确认 | |
JP2005092310A (ja) | 音声キーワード認識装置 | |
WO2006027844A1 (ja) | 話者照合装置 | |
CN109559759B (zh) | 具备增量注册单元的电子设备及其方法 | |
JP5088314B2 (ja) | 音声応答装置、及びプログラム | |
WO2008018136A1 (fr) | dispositif de reconnaissance d'un individu en fonction de sa voix, procédé de reconnaissance d'un individu en fonction de sa voix, etc. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07738878 Country of ref document: EP Kind code of ref document: A1 |
ENP | Entry into the national phase |
Ref document number: 2008507435 Country of ref document: JP Kind code of ref document: A |
NENP | Non-entry into the national phase |
Ref country code: DE |
WWE | Wipo information: entry into national phase |
Ref document number: 12293943 Country of ref document: US |
122 | Ep: pct application non-entry in european phase |
Ref document number: 07738878 Country of ref document: EP Kind code of ref document: A1 |