WO2023228542A1 - Authentication system and authentication method - Google Patents

Authentication system and authentication method

Info

Publication number
WO2023228542A1
Authority
WO
WIPO (PCT)
Prior art keywords
authentication
audio signal
utterance
section
speaker
Prior art date
Application number
PCT/JP2023/012047
Other languages
French (fr)
Japanese (ja)
Inventor
Teppei Fukuda
Ryota Fujii
Shintaro Okada
Original Assignee
Panasonic IP Management Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic IP Management Co., Ltd.
Publication of WO2023228542A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/22 Interactive procedures; Man-machine interfaces

Definitions

  • The present disclosure relates to an authentication system and an authentication method.
  • Patent Document 1 discloses a telephone communication device that registers voiceprint data for voiceprint authentication from voice received during a telephone call.
  • The telephone device acquires the received voice, acquires the telephone number of the other party on the call, and extracts voiceprint data from the acquired voice.
  • The telephone device measures the acquisition time of the received voice.
  • The telephone device determines whether the total acquisition time of the voiceprint data stored in the telephone directory in association with the same telephone number as the acquired one is longer than the time required for voiceprint verification. If the telephone device determines that the total acquisition time of the voiceprint data is longer than the time required for voiceprint verification, it associates the acquired telephone number with the voiceprint data and stores them in the storage unit.
  • In Patent Document 1, when the total acquisition time of the speaker's voiceprint data exceeds a predetermined value, the voiceprint data is registered in a database in association with the speaker's telephone number.
  • However, the telephone device disclosed in Patent Document 1 always requires registration voiceprint data whose total acquisition time exceeds the predetermined value. The user therefore needs to speak for more than the predetermined time to register the voiceprint data, and again for the same amount of time at every voiceprint authentication, leaving room for improvement in user convenience.
  • The present disclosure has been devised in view of the circumstances described above, and aims to improve user convenience by determining the utterance time required during authentication according to the total length of the user's uttered voice acquired at the time of registration.
  • The present disclosure provides an authentication system including: an acquisition unit that acquires an audio signal of a speaker's utterance; a detection unit that detects, from the acquired audio signal, a first utterance section in which the speaker is speaking, and, from the audio signals of a database in which the audio signals of each of a plurality of speakers are registered, a second utterance section in which the speaker is speaking; a determination unit that compares the first audio signal of the first utterance section with the second audio signal of the second utterance section and determines, based on the length of the second audio signal of the second utterance section or the number of sounds included in the second utterance section, authentication conditions for authentication using the first audio signal; and an authentication unit that authenticates the speaker based on the determined authentication conditions.
  • The present disclosure also provides an authentication method performed by one or more computers, including: acquiring an audio signal of a speaker's utterance; detecting, from the acquired audio signal, a first utterance section in which the speaker is speaking, and, from the audio signals of a database in which the audio signals of each of a plurality of speakers are registered, a second utterance section in which the speaker is speaking; comparing the first audio signal of the first utterance section with the second audio signal of the second utterance section; determining, based on the length of the second audio signal of the second utterance section or the number of sounds included in the second utterance section, authentication conditions for authentication using the first audio signal; and authenticating the speaker based on the determined authentication conditions.
  • According to the present disclosure, the utterance time during authentication can be determined according to the total length of the user's uttered voice acquired at the time of registration, thereby improving convenience for the user.
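The four steps of the claimed flow (acquire, detect utterance sections, determine authentication conditions, authenticate) can be sketched as follows. Every function body and name below is a hypothetical stand-in for illustration only; the disclosure does not prescribe concrete algorithms.

```python
# Hypothetical end-to-end sketch of the claimed authentication flow.
# The structure mirrors the claim language; the toy stand-ins supplied at
# the call site are assumptions, not part of the disclosure.

def authenticate_speaker(probe_audio, registered_audio,
                         detect_section, compare, decide_condition):
    first_section = detect_section(probe_audio)        # first utterance section
    second_section = detect_section(registered_audio)  # second utterance section
    condition = decide_condition(len(second_section))  # e.g. required threshold
    score = compare(first_section, second_section)
    return score >= condition["threshold"]

# Toy stand-ins: sections are the raw lists, similarity is an overlap count,
# and the authentication condition only carries a fixed threshold.
result = authenticate_speaker(
    [1, 2, 3], [1, 2, 3, 4],
    detect_section=lambda a: a,
    compare=lambda x, y: len(set(x) & set(y)),
    decide_condition=lambda n: {"threshold": 3})
print(result)  # → True
```

The point of the structure is that `decide_condition` sees only the registered (second) utterance section, so the burden placed on the speaker at authentication time adapts to what was enrolled.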
  • Diagram showing an example of a use case of the authentication system according to this embodiment
  • Block diagram showing an example of the internal configuration of the authentication analysis device according to this embodiment
  • Flowchart related to the registration processing of the uttered audio signal for registration
  • Diagram showing an example of setting authentication conditions based on utterance length
  • Diagram showing an example of setting authentication conditions based on utterance length and number of sounds
  • Diagram showing an example of an operator performing identity verification authentication based on the authentication text displayed on the screen
  • Diagram showing an example of performing identity verification authentication based on the identity verification text displayed on the user-side call terminal
  • Diagram showing an example of resetting the required time of the authentication conditions based on the measurement results of the sound collection conditions during authentication, after the authentication conditions have been set
  • Diagram showing an example of resetting the threshold value of the authentication conditions based on the measurement results of the sound collection conditions during authentication, after the authentication conditions have been set
  • Flowchart of processing related to speaker authentication
  • Diagram showing an example of setting restrictions on operations after successful authentication depending on the quality of the registered uttered audio signal
  • FIG. 1 is a diagram showing an example of a use case of the authentication system according to the present embodiment.
  • The authentication system 100 acquires the audio signal or voice data of a person to be authenticated by voice (in the example shown in FIG. 1, the user US) and compares it with the speaker's audio signal or voice data registered (stored) in storage (in the example shown in FIG. 1, the registered speaker database DB). Based on the matching result, the authentication system 100 evaluates the degree of similarity between the voice collected from the user US, who is the authentication target, and the voice registered in storage, and authenticates the user US based on that degree of similarity.
  • The authentication system 100 includes an operator-side call terminal OP1 as an example of a sound collection device, an authentication analysis device P1, a registered speaker database DB, and a display DP as an example of an output device.
  • The authentication analysis device P1 and the display DP may be integrally configured.
  • The operator-side call terminal OP1 may be replaced with an automatic voice device; in this case, the automatic voice device may be configured integrally with the authentication analysis device P1.
  • The authentication system 100 shown in FIG. 1 is used, as an example, to authenticate a speaker (user US) at a call center, and authenticates the user US using voice data collected from the voice of the user US talking to the operator OP.
  • The authentication system 100 shown in FIG. 1 further includes a user-side call terminal UP1 and a network NW. It goes without saying that the overall configuration of the authentication system 100 is not limited to the example shown in FIG. 1.
  • The user-side call terminal UP1 is connected to the operator-side call terminal OP1 via the network NW so that they can communicate wirelessly.
  • The wireless communication referred to here is network communication via a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark).
  • The user-side call terminal UP1 is configured by, for example, a notebook PC, a tablet terminal, a smartphone, or a telephone.
  • The user-side call terminal UP1 is a sound collection device equipped with a microphone (not shown); it collects the voice uttered by the user US, converts it into an audio signal, and transmits the converted audio signal to the operator-side call terminal OP1 via the network NW. Further, the user-side call terminal UP1 acquires the audio signal of the operator OP's utterance transmitted from the operator-side call terminal OP1 and outputs it from a speaker (not shown).
  • The network NW is, for example, an IP (Internet Protocol) network or a telephone network, and connects the user-side call terminal UP1 and the operator-side call terminal OP1 so that audio signals can be transmitted and received. Data transmission and reception is performed by wired or wireless communication.
  • The operator-side call terminal OP1 is connected to the user-side call terminal UP1 and the authentication analysis device P1 so as to be able to transmit and receive data through wired or wireless communication, and transmits and receives audio signals.
  • The operator-side call terminal OP1 is configured by, for example, a notebook PC, a tablet terminal, a smartphone, or a telephone.
  • The operator-side call terminal OP1 acquires the audio signal based on the voice uttered by the user US, transmitted from the user-side call terminal UP1 via the network NW, and transmits it to the authentication analysis device P1. When the acquired audio signal includes both the user US's uttered voice and the operator OP's uttered voice, the operator-side call terminal OP1 may separate the audio signal based on the voice uttered by the user US from the audio signal based on the voice uttered by the operator OP, based on audio parameters such as sound pressure level and frequency band.
  • In this case, the operator-side call terminal OP1 extracts only the audio signal based on the voice uttered by the user US and transmits it to the authentication analysis device P1.
  • The operator-side call terminal OP1 may be communicably connected to each of a plurality of user-side call terminals, and may simultaneously acquire audio signals from each of them.
  • The operator-side call terminal OP1 transmits the acquired audio signals to the authentication analysis device P1.
  • The authentication system 100 can thereby simultaneously perform voice authentication processing and voice analysis processing for each of a plurality of users.
  • The operator-side call terminal OP1 may simultaneously acquire audio signals including the voices uttered by each of a plurality of users.
  • The operator-side call terminal OP1 extracts a voice signal for each user from the voice signals of the plurality of users acquired via the network NW, and transmits each user's voice signal to the authentication analysis device P1.
  • The operator-side call terminal OP1 may analyze the audio signals of the plurality of users and separate and extract the audio signal for each user based on audio parameters such as sound pressure level and frequency band.
  • The operator-side call terminal OP1 may also separate and extract the voice signal for each user based on the arrival direction of the uttered voice.
  • The authentication system 100 can thus perform voice authentication processing and voice analysis processing for each of a plurality of users, even for audio signals collected in an environment where a plurality of users speak at the same time, such as a web conference.
  • The authentication analysis device P1, which is an example of an authentication device and a computer, is connected to the operator-side call terminal OP1, the registered speaker database DB, and the display DP so that data can be transmitted to and received from each of them.
  • The authentication analysis device P1 may be connected to the operator-side call terminal OP1, the registered speaker database DB, and the display DP via a network (not shown) to enable wired or wireless communication.
  • The authentication analysis device P1 acquires the voice signal of the user US transmitted from the operator-side call terminal OP1, analyzes the acquired voice signal, for example for each frequency, and extracts the utterance features of the user US.
  • The authentication analysis device P1 refers to the registered speaker database DB, compares the extracted utterance features with the utterance features of each of the plurality of users registered in advance in the registered speaker database DB, and performs voice authentication of the user US.
  • The authentication analysis device P1 generates an authentication result screen SC including the authentication result of the user US, and transmits it to the display DP for output. It goes without saying that the authentication result screen SC shown in FIG. 1 is just an example and is not limited thereto.
  • The authentication result screen SC shown in FIG. 1 includes, for example, the message "The voice matches Mr. Taro Yamada's voice," which is the authentication result of the user US.
  • The registered speaker database DB, which is an example of a database, is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD (Hard Disk Drive), or an SSD (Solid State Drive).
  • The registered speaker database DB stores (registers) the user information of each of a plurality of users and their utterance feature amounts in association with each other.
  • The user information here is information related to the user, such as a user name, a user ID (Identification), or identification information assigned to each user.
  • The registered speaker database DB may be configured integrally with the authentication analysis device P1.
  • The display DP is configured using, for example, an LCD (Liquid Crystal Display) or an organic EL (Electroluminescence) display, and displays the authentication result screen SC transmitted from the authentication analysis device P1. Note that the display DP may be configured integrally with the authentication analysis device P1.
  • The user-side call terminal UP1 collects the user US's uttered voice COM12 "This is Taro Yamada" and uttered voice COM14 "This is 123245678", converts them into audio signals, and transmits them to the operator-side call terminal OP1.
  • The operator-side call terminal OP1 transmits the audio signals based on the user US's uttered voices COM12 and COM14, transmitted from the user-side call terminal UP1, to the authentication analysis device P1.
  • The operator-side call terminal OP1 receives the operator OP's uttered voice COM11 "Please tell me your name," the uttered voice COM13 "Please tell me your membership number," and the user US's uttered voices COM12 and COM14.
  • The operator-side call terminal OP1 separates and removes the audio signals based on the operator OP's uttered voices COM11 and COM13, extracts only the audio signals based on the user US's uttered voices COM12 and COM14, and transmits them to the authentication analysis device P1. Thereby, the authentication analysis device P1 can improve user authentication accuracy by using only the voice signal of the person to be authenticated.
  • FIG. 2 is a block diagram showing an example of the internal configuration of the authentication analysis device according to the present embodiment.
  • The authentication analysis device P1 is configured to include at least a communication unit 20, a processor 21, and a memory 22.
  • The communication unit 20 is connected so as to enable data communication with each of the operator-side call terminal OP1 and the registered speaker database DB.
  • The communication unit 20 outputs the audio signal transmitted from the operator-side call terminal OP1 to the processor 21.
  • The processor 21 is configured using a semiconductor chip on which at least one of electronic devices such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), a GPU (Graphics Processing Unit), or an FPGA (Field Programmable Gate Array) is mounted.
  • The processor 21 functions as a controller that controls the overall operation of the authentication analysis device P1, and performs control processing to supervise the operation of each part of the authentication analysis device P1, input/output processing of data between the parts, data calculation processing, and data storage processing.
  • Using the programs and data stored in the ROM (Read Only Memory) 22A of the memory 22, the processor 21 realizes the respective functions of an utterance section detection unit 21A, a registration quality determination unit 21B, a feature amount extraction unit 21C, a comparison target setting unit 21D, a similarity calculation unit 21E, an authentication condition setting unit 21F, an authentication sound collection condition measurement unit 21G, and an operation restriction setting unit 21H.
  • The processor 21 uses the RAM (Random Access Memory) 22B of the memory 22 during operation, and temporarily stores data or information generated or acquired by the processor 21 and each unit in the RAM 22B.
  • The utterance section detection unit 21A, which is an example of a detection unit, acquires the audio signal of the utterance at the time of authentication (hereinafter referred to as the "utterance audio signal"), analyzes the acquired utterance audio signal, and detects the utterance section in which the user US is speaking (hereinafter referred to as the first utterance section).
  • The utterance section detection unit 21A outputs the utterance audio signal corresponding to at least one first utterance section detected from the utterance audio signal (hereinafter referred to as the first audio signal) to the feature amount extraction unit 21C.
  • The utterance section detection unit 21A may temporarily store the first audio signal of at least one first utterance section in the RAM 22B of the memory 22. When the utterance section detection unit 21A detects a plurality of first utterance sections, it may concatenate the first audio signals of the detected first utterance sections and output the result to the feature amount extraction unit 21C. Furthermore, the utterance section detection unit 21A detects the utterance section (hereinafter referred to as the second utterance section) of the voice data acquired from the user US when the audio signal used for authenticating the user US is registered in advance. The utterance section detection unit 21A outputs the utterance audio signal corresponding to the second utterance section (hereinafter referred to as the second audio signal) to the registration quality determination unit 21B. When there are a plurality of second utterance sections, the utterance section detection unit 21A may concatenate the second audio signals of the detected second utterance sections and output the concatenated signal to the registration quality determination unit 21B.
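The disclosure does not fix a particular algorithm for detecting utterance sections. As one common, minimal stand-in for the utterance section detection unit 21A, frame-energy thresholding could be sketched as follows (the frame length and threshold values are illustrative assumptions):

```python
# Illustrative energy-based utterance-section (voice activity) detector.
# A frame is considered "speech" when its mean energy exceeds a threshold;
# runs of adjacent active frames are merged into one section.

def detect_utterance_sections(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of detected utterance sections."""
    sections = []
    active_start = None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            if active_start is None:
                active_start = i          # section begins at this frame
        elif active_start is not None:
            sections.append((active_start, i))  # section just ended
            active_start = None
    if active_start is not None:          # section runs to end of signal
        sections.append((active_start, len(samples)))
    return sections

# Silence, then a loud burst, then silence again -> one detected section.
signal = [0.0] * 320 + [0.5] * 320 + [0.0] * 320
print(detect_utterance_sections(signal))  # → [(320, 640)]
```

Detected sections would then be concatenated before being handed to the registration quality determination or feature extraction stage, as the text above describes.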
  • The registration quality determination unit 21B, which is an example of a processing unit, acquires from the utterance section detection unit 21A the second audio signal of a second utterance section, or a second audio signal in which a plurality of second utterance sections are concatenated.
  • The registration quality determination unit 21B determines the quality of the acquired second audio signal.
  • Quality is an indicator of the user's surrounding environment, the accuracy of the user's speech, or both at the time the second audio signal of each user is registered in the registered speaker database DB prior to actual authentication (at the time of registration). In this embodiment, the authentication conditions (described later) imposed on the user during actual authentication are determined based on the quality at the time of registration.
  • The registration quality determination unit 21B determines the quality based on, for example, the length of the utterance in the second audio signal (hereinafter referred to as the utterance length) or the number of sounds included in the second audio signal. The elements used by the registration quality determination unit 21B to determine the quality are not limited to the utterance length and the number of sounds; the number of phonemes or the number of words may also be used.
  • The registration quality determination unit 21B outputs information on the determined quality to the feature amount extraction unit 21C or the authentication condition setting unit 21F.
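A minimal sketch of the registration quality determination unit 21B might look like the following. The two-level quality scale and the 10-second utterance-length boundary follow the FIG. 4 example later in the text; the number-of-sounds threshold and the exact re-acquisition rule are assumptions for illustration:

```python
# Hedged sketch: quality is judged from the utterance length and/or the
# number of sounds in the (concatenated) second utterance sections.

def judge_registration_quality(utterance_seconds: float, num_sounds: int) -> str:
    """Classify enrollment quality as 'low' or 'high'.
    The 10-second boundary follows the FIG. 4 example; the num_sounds
    threshold of 10 is an illustrative assumption."""
    if utterance_seconds >= 10.0 and num_sounds >= 10:
        return "high"
    return "low"

def needs_reacquisition(utterance_seconds: float, num_sounds: int) -> bool:
    """Mirror step St14: ask the speaker to enroll again when the recording
    is clearly unusable (about one second long, or a single sound)."""
    return utterance_seconds <= 1.0 or num_sounds <= 1

print(judge_registration_quality(12.0, 20))  # → high
print(needs_reacquisition(1.0, 5))           # → True
```

As the text notes, phoneme or word counts could replace these inputs without changing the overall shape of the decision.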
  • The feature amount extraction unit 21C, which is an example of a processing unit, analyzes the characteristics of the individual's voice, for example for each frequency, using the one or more utterance audio signals extracted by the utterance section detection unit 21A, and extracts the utterance feature amount.
  • The feature amount extraction unit 21C extracts the utterance feature amount of the first audio signal of the first utterance section output from the utterance section detection unit 21A. Further, the feature amount extraction unit 21C extracts the utterance feature amount of the second audio signal of the second utterance section output from the utterance section detection unit 21A.
  • The utterance feature amount of the second audio signal of the second utterance section may be registered in advance in the registered speaker database DB.
  • The feature amount extraction unit 21C associates the extracted utterance feature amount of the first utterance section with the first audio signal from which it was extracted and outputs it to the similarity calculation unit 21E or the comparison target setting unit 21D, or temporarily stores it in the RAM 22B of the memory 22.
  • The feature amount extraction unit 21C associates the utterance feature amount of the second utterance section with the second audio signal from which it was extracted and outputs it to the similarity calculation unit 21E.
  • The feature amount extraction unit 21C also links the utterance feature amount with the quality information acquired from the registration quality determination unit 21B and temporarily stores them in the RAM 22B of the memory 22.
  • The feature amount extraction unit 21C performs voice recognition on the utterance content of the utterance audio signal.
  • Voice recognition of the utterance content can be realized by a known technique; for example, it may be realized by phoneme analysis of the utterance audio signal and calculated as linguistic information, or by other analysis methods.
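The per-frequency analysis mentioned above can be illustrated with a deliberately naive sketch: compute a DFT magnitude spectrum for a frame and summarize it as a few log band energies. Real speaker-verification systems would use richer features (e.g. MFCCs or learned speaker embeddings); this toy version only shows what "a feature amount per frequency" means:

```python
import math

# Minimal per-frequency feature sketch standing in for the feature amount
# extraction unit 21C. Naive O(n^2) DFT, purely for illustration.

def band_log_energies(frame, num_bands=4):
    """Split the frame's DFT magnitude spectrum into equal bands and
    return the log energy of each band as a small feature vector."""
    n = len(frame)
    mags = []
    for k in range(n // 2):  # keep the non-redundant half of the spectrum
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(re * re + im * im)
    band = len(mags) // num_bands
    return [math.log(sum(mags[b * band:(b + 1) * band]) + 1e-9)
            for b in range(num_bands)]

# A low-frequency sinusoid concentrates its energy in the lowest band.
frame = [math.sin(2 * math.pi * 2 * t / 64) for t in range(64)]
feats = band_log_energies(frame)
print(feats.index(max(feats)))  # → 0
```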
  • The comparison target setting unit 21D, which is an example of a setting unit, acquires the data of the user US, who is the speaker, from the registered speaker database DB.
  • The data of the user US is, for example, personal information such as the date of birth, name, or gender of the user US, or at least one of the voice data related to utterances registered in the past by the user US and the feature amount of that voice data.
  • The comparison target setting unit 21D may, for example, identify the speaker as the user US using the extracted feature amount of the speaker output from the feature amount extraction unit 21C.
  • Alternatively, the speaker may be identified as the user US based on the content (for example, a name or an ID) input by the speaker into the user-side call terminal UP1.
  • The comparison target setting unit 21D outputs the acquired data of the user US to the utterance section detection unit 21A or the similarity calculation unit 21E.
  • The similarity calculation unit 21E, which is an example of an authentication unit, acquires the utterance feature amount of the utterance audio signal output from the feature amount extraction unit 21C.
  • The similarity calculation unit 21E calculates the similarity between the utterance feature amount of the first utterance section and the utterance feature amount of the second utterance section acquired from the feature amount extraction unit 21C.
  • The similarity calculation unit 21E identifies the user corresponding to the utterance audio signal (that is, the audio signal transmitted from the user-side call terminal UP1) based on the calculated similarity, and executes the user's identity verification authentication.
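The disclosure does not specify the similarity measure. One conventional choice is cosine similarity between the two feature vectors; the sketch below scales it to a 0-100 score so that it can be compared against the determination threshold of 70 used in the FIG. 4 example (the scaling scheme itself is an assumption):

```python
import math

# Sketch of the similarity calculation unit 21E: cosine similarity between
# the enrollment (second) and authentication (first) feature vectors,
# scaled so that 100 means identical direction, then thresholded.

def cosine_score(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 100.0 * dot / (na * nb)

def authenticate(enrolled_vec, probe_vec, threshold=70.0):
    """Accept the speaker when the similarity score clears the threshold."""
    return cosine_score(enrolled_vec, probe_vec) >= threshold

print(authenticate([1.0, 2.0, 3.0], [1.1, 2.1, 2.9]))  # → True
print(authenticate([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # → False
```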
  • The authentication condition setting unit 21F, which is an example of a determination unit, sets the authentication conditions based on the quality-related information acquired from the registration quality determination unit 21B.
  • The authentication conditions include, for example, the length of the utterance requested of the user US, the content of the utterance, or a threshold value for determination. The authentication conditions are not limited to these.
  • The authentication sound collection condition measurement unit 21G, which is an example of a measurement unit, measures the sound collection conditions at the time of authentication.
  • The sound collection conditions include, for example, the noise, volume, degree of reverberation, or number of phonemes included in the utterance audio signal collected during authentication. The sound collection conditions are not limited to these.
  • The authentication sound collection condition measurement unit 21G outputs the measured sound collection conditions to the authentication condition setting unit 21F.
  • The operation restriction setting unit 21H, which is an example of a setting unit, sets restrictions on the operations that the user US can perform, based on the quality of the utterance audio signal of the second utterance section. For example, when the authentication system 100 is installed in an ATM (Automatic Teller Machine), the operation restriction setting unit 21H restricts operations such as remittance or transfer when the quality of the utterance audio signal is poor. Examples of machines in which the authentication system 100 is installed are not limited to ATMs.
  • The processor 21 sets the authentication conditions for the user's identity verification based on the quality of the second audio signal determined by the registration quality determination unit 21B.
  • The processor 21 acquires the user's utterance audio signal based on the set authentication conditions.
  • The processor 21 authenticates whether the speaker is the person in question based on a comparison between the first audio signal of the first utterance section detected by the utterance section detection unit 21A and the second audio signal of the second utterance section.
  • The memory 22 includes at least a ROM 22A that stores programs defining the various processes performed by the processor 21 and data used during execution of those programs, and a RAM 22B that serves as a work memory used when the processor 21 executes the various processes.
  • The RAM 22B temporarily stores data or information generated or acquired by the processor 21 (for example, the utterance audio signal or the utterance feature amount corresponding to each utterance audio signal).
  • The display I/F 23 connects the processor 21 and the display DP so as to enable data communication, and outputs the authentication result screen SC generated by the similarity calculation unit 21E of the processor 21 to the display DP.
  • The display I/F 23 causes the display DP to display the authentication status indicating whether or not the speaker is the genuine speaker, based on the authentication result of the processor 21.
  • FIG. 3 is a flowchart related to the registration process of the uttered audio signal for registration. Note that each process related to the flowchart in FIG. 3 is executed by the processor 21.
  • The flowchart in FIG. 3 represents the processing related to registration, that is, the registration of the uttered audio signal stored in advance in the registered speaker database DB.
  • The processor 21 starts receiving the utterance audio signal for registration (hereinafter referred to as the registration audio signal) from the speaker (St10). That is, in the process of step St10, the speaker starts speaking into the user-side call terminal UP1.
  • The processor 21 ends the reception of the registration audio signal from the speaker (St11). That is, in the process of step St11, the speaker finishes speaking into the user-side call terminal UP1.
  • the speech section detection unit 21A detects the second speech section of the registered audio signal acquired in the processing from step St10 to step St11 (St12).
  • the registration quality determination unit 21B determines the quality of the second audio signal of the second speech section detected in the process of step St12 (St13).
  • the registered quality determination unit 21B determines whether or not to re-acquire the registered audio signal based on the quality determined in the process of step St13 (St14). For example, the registration quality determination unit 21B determines not to reacquire when the quality is equal to or higher than a predetermined minimum necessary value, and determines to not reacquire when the quality is less than a predetermined minimum necessary value. judge. For example, if the speaker has not uttered a single word, the length of the utterance is 1 second, or the number of sounds is 1, the registration quality determining unit 21B determines to re-acquire the registered audio signal. Note that the example in which the registration quality determining unit 21B determines to re-acquire is an example and is not limited to these. Moreover, the process of step St14 may be omitted from the process of the flowchart related to FIG. 3.
  • the feature extraction unit 21C extracts the utterance feature of the uttered audio signal in the second utterance section (St15).
  • the feature extracting unit 21C associates the quality determined in the process of step St13 with the utterance feature extracted in the process of step St15 and stores it in the registered speaker database DB (St16).
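As a rough illustration of steps St13 to St16, the registration flow can be sketched in Python as follows. This is a minimal sketch under assumptions: the names (registered_speaker_db, judge_quality, register) are invented for explanation, and the thresholds (10 seconds, 13 sounds) are borrowed from the examples in FIGS. 4 and 5, not from any claim.

```python
# Hypothetical sketch of the registration flow (St13-St16).
# All names and thresholds are illustrative assumptions.

registered_speaker_db = {}  # speaker ID -> (utterance feature, quality)

def judge_quality(utterance_length_sec, num_sounds):
    # St13: quality judgment, using the example thresholds of
    # FIGS. 4 and 5 (10 seconds of utterance, 13 sounds).
    q_length = "high" if utterance_length_sec >= 10 else "low"
    q_sounds = "high" if num_sounds >= 13 else "low"
    # The overall quality is the lower of the two qualities.
    return "high" if q_length == q_sounds == "high" else "low"

def register(speaker_id, feature, utterance_length_sec, num_sounds):
    """Run St13-St16 for one registration audio signal."""
    # St14: clearly unusable signals are re-acquired.
    if utterance_length_sec <= 1 or num_sounds <= 1:
        return "re-acquire"
    quality = judge_quality(utterance_length_sec, num_sounds)
    # St16: store the feature together with its quality.
    registered_speaker_db[speaker_id] = (feature, quality)
    return quality
```

The quality label stored here is what the later condition-setting steps consult.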
  • FIG. 4 is a diagram illustrating an example of setting authentication conditions based on utterance length.
  • the registered voice signal US10, the registered voice signal US11, and the registered voice signal US12 are registered voice signals of the user US registered in the registered speaker database DB in the process related to FIG. 3.
  • when the utterance length is less than 10 seconds, the quality is "low", and when the utterance length is 10 seconds or more, the quality is "high".
  • the threshold number of seconds for which the quality is "low” or “high” is an example and is not limited.
  • the quality is not limited to two levels of “low” and “high”, but may be set to, for example, three levels of "low", “medium” and “high”, or four or more levels.
  • the authentication condition setting unit 21F changes the required time based on the quality results.
  • when the quality is "low", the required time is 15 seconds, and when the quality is "high", the required time is 7 seconds.
  • the required time is the total speaking time that the authentication system 100 requests from the speaker when performing authentication. Note that the length of the required time is just an example and is not limited thereto.
  • the determination thresholds are all set to 70 regardless of the quality result.
  • the determination threshold value is a threshold value used by the similarity calculation unit 21E to determine the degree of similarity between the utterance feature amount of the first utterance section and the utterance feature amount of the second utterance section. The higher the determination threshold, the higher the degree of similarity required. Note that the value of the determination threshold is an example and is not limited to 70.
  • the utterance content of the registered voice signal US10 is "Akasatana desu" (a ka sa ta na de su), and the utterance length is 5 seconds.
  • since the registered audio signal US10 has an utterance length of 5 seconds, which is less than 10 seconds, the quality is "low". As a result, the authentication conditions for the registered audio signal US10 are a required time of 15 seconds and a determination threshold of 70.
  • the utterance content of the registered audio signal US11 is "Akasatana desu, Hamayarawa desu" (a ka sa ta na de su ha ma ya ra wa de su), and the utterance length is 8 seconds.
  • the quality of the registered audio signal US11 is "low" because the utterance length is 8 seconds, which is less than 10 seconds.
  • the required time is 15 seconds and the determination threshold is 70 as authentication conditions for the registered audio signal US11.
  • the utterance content of the registered audio signal US12 is "Akasatana desu, Ichinisan Shigorokunana desu" (a ka sa ta na de su i chi ni sa n shi go ro ku na na de su), and the utterance length is 13 seconds.
  • the quality of the registered audio signal US12 is "high" because the utterance length is 13 seconds, which is 10 seconds or more.
  • the required time is 7 seconds and the determination threshold is 70 as the authentication conditions for the registered audio signal US12.
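The condition setting of FIG. 4 amounts to a small lookup from utterance length to quality and required time. The following sketch assumes the figure's example values (10-second quality threshold, 15/7-second required times, determination threshold 70); the function name and return structure are hypothetical.

```python
def set_auth_conditions_by_length(utterance_length_sec):
    # Quality is "low" below 10 seconds, "high" at 10 seconds or more
    # (the thresholds follow the example in FIG. 4).
    quality = "high" if utterance_length_sec >= 10 else "low"
    return {
        "quality": quality,
        "required_time_sec": 7 if quality == "high" else 15,
        "threshold": 70,  # constant regardless of quality in FIG. 4
    }
```

For example, the 5-second signal US10 maps to a 15-second required time, while the 13-second signal US12 maps to 7 seconds.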
  • FIG. 5 is a diagram illustrating an example of setting authentication conditions based on utterance length and number of sounds.
  • when the utterance length is less than 10 seconds, the quality related to the utterance length is "low", and when the utterance length is 10 seconds or more, the quality related to the utterance length is "high". Similarly, when the number of sounds is less than 13, the quality related to the number of sounds is "low", and when the number of sounds is 13 or more, the quality related to the number of sounds is "high".
  • the threshold number of seconds and the number of sounds for which the quality is "low” or “high” are examples and are not limited.
  • the quality is not limited to “low” and “high”, but may be set to three stages of “low”, “medium” and “high”, or to four or more stages.
  • the authentication condition setting unit 21F changes the required time based on the quality results.
  • the quality of the registered audio signal is the lower of the quality related to the utterance length and the quality related to the number of sounds. In the example shown in FIG. 5, when the quality is "low", the required time is 15 seconds, and when the quality is "high", the required time is 7 seconds. Note that the length of the required time is just an example and is not limited thereto. In the example shown in FIG. 5, the determination thresholds are all set to 70 regardless of the quality result.
  • the utterance content of the registered voice signal US10 is "Akasatana desu" (a ka sa ta na de su), and the number of sounds is 7.
  • since the utterance length of the registered audio signal US10 is 5 seconds, which is less than 10 seconds, the quality related to the utterance length is "low". Since the number of sounds is 7, which is less than 13, the quality related to the number of sounds is "low". Since the quality related to the utterance length and the quality related to the number of sounds are both "low", the quality of the registered audio signal US10 is "low".
  • the required time as an authentication condition for the registered audio signal US10 is 15 seconds, and the determination threshold is 70.
  • the utterance content of the registered audio signal US11 is "Akasatana desu, Hamayarawa desu" (a ka sa ta na de su ha ma ya ra wa de su), and the number of sounds is 14.
  • the utterance length of the registered audio signal US11 is 8 seconds, which is less than 10 seconds, so the quality related to the utterance length is "low.”
  • since the number of sounds is 14, which is 13 or more, the quality related to the number of sounds is "high". However, since the quality related to the utterance length is "low", the quality of the registered audio signal US11 is "low".
  • the required time as an authentication condition for the registered audio signal US11 is 15 seconds, and the determination threshold is 70.
  • the utterance content of the registered audio signal US12 is "Akasatana desu, Ichinisan Shigorokunana desu" (a ka sa ta na de su i chi ni sa n shi go ro ku na na de su), and the number of sounds is 20.
  • since the utterance length of the registered audio signal US12 is 13 seconds, which is 10 seconds or more, the quality related to the utterance length is "high". Since the number of sounds is 20, which is 13 or more, the quality related to the number of sounds is "high".
  • since the quality related to the utterance length and the quality related to the number of sounds are both "high", the quality of the registered audio signal US12 is "high".
  • the required time as an authentication condition for the registered audio signal US12 is 7 seconds, and the determination threshold is 70.
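The combined rule of FIG. 5, where the overall quality is the lower of the length-based and sound-count-based qualities, can be sketched as follows; the names, return structure, and thresholds are assumptions taken from the figure's example.

```python
def set_auth_conditions(utterance_length_sec, num_sounds):
    # Per-criterion qualities, using the example thresholds of FIG. 5.
    q_length = "high" if utterance_length_sec >= 10 else "low"
    q_sounds = "high" if num_sounds >= 13 else "low"
    # The quality of the registered audio signal is the lower of the two.
    quality = "high" if q_length == q_sounds == "high" else "low"
    return {"required_time_sec": 7 if quality == "high" else 15, "threshold": 70}
```

US11 illustrates the "lower of the two" rule: its sound count passes (14) but its length fails (8 seconds), so the long required time applies.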
  • FIG. 6 is a diagram illustrating an example of setting utterance content as an authentication condition according to the quality of a registered audio signal.
  • the quality determination method in FIG. 6 is the same as the determination method in FIG. 4.
  • the authentication condition setting unit 21F specifies words that prompt the user US to speak based on the quality results.
  • the feature extraction unit 21C performs voice recognition of the registered voice signal, analyzes the utterance content, and outputs the result to the authentication condition setting unit 21F.
  • the authentication condition setting unit 21F determines the wording to be specified for the user US based on the voice recognition result obtained from the feature extracting unit 21C.
  • when the quality is "low", the authentication condition setting unit 21F specifies the utterance content, and the required time is not specified. When the quality is "high", no text is specified for the user US.
  • when the quality is "high", a text may nevertheless be specified for the user US, or a text shorter than when the quality is "low" may be specified to improve user convenience.
  • when the quality is "high", the required time is set to 7 seconds as the authentication condition.
  • the required time when the quality is "high” is an example and is not limited to 7 seconds.
  • the determination threshold is set to a constant value of 70 regardless of quality. Note that the determination threshold value may not be a constant value but may be changed depending on the quality.
  • the utterance content of the registered voice signal US10 is "Akasatana desu". Since the quality of the registered audio signal US10 is "low", the feature extraction unit 21C performs speech recognition of the registered audio signal US10, and the recognition result is "Akasatana desu".
  • the authentication condition setting unit 21F specifies the wording as “Akasatana desu” based on the recognition result obtained from the feature extracting unit 21C.
  • the authentication condition setting unit 21F sets the required time to "not specified" and the determination threshold to 70 as the authentication conditions.
  • the utterance content of the registered audio signal US12 is "Akasatana desu, Ichinisan Shigorokunana desu.” Since the quality of the registered audio signal US12 is "high", the feature extraction unit 21C does not perform speech recognition.
  • the authentication condition setting unit 21F sets the required time to 7 seconds and the determination threshold to 70 as authentication conditions. The authentication condition setting unit 21F does not specify the wording because the quality of the registered audio signal US12 is "high".
  • the authentication system 100 can perform authentication in a short time while maintaining high authentication accuracy by specifying a phrase based on the utterance content of the registered voice signal according to its quality and having the user US utter it. Furthermore, when the quality of the registered voice signal is high, the authentication system 100 does not specify the content of the utterance, thereby saving the user US effort.
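The wording specification of FIG. 6 reduces to a small decision. In the sketch below, recognized_text stands for the speech recognition result of the registered signal; the function name, return shape, and the 7-second value (taken from the example) are assumptions.

```python
def specify_utterance(quality, recognized_text=None):
    """Return (text to specify, required time in seconds) per FIG. 6.

    For low quality, the speech-recognized content of the registered
    signal is specified and no required time is set; for high quality,
    no text is specified and the required time is 7 seconds.
    """
    if quality == "low":
        return recognized_text, None
    return None, 7
```

So for the low-quality signal US10 the recognized phrase itself is presented, while for the high-quality signal US12 only a time budget applies.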
  • FIG. 7 is a diagram illustrating an example in which an operator performs identity verification based on the authentication text displayed on the screen.
  • the authentication system 100 uses voice recognition to specify the words to be uttered by the user US.
  • an example will be explained in which the text to be uttered by the user US, specified by the method described with reference to FIG. 6, is displayed on the screen for the operator OP, and the operator OP has the user US utter the specified phrase.
  • Screen SC1 is an example of an operator screen displayed on display DP.
  • Speaker registration information is displayed in frame FR1.
  • in frame FR1, "caller number", "registered name", "registered address", "age", and "speaker registration presence/absence" are displayed as speaker registration information.
  • the "caller number" is, for example, a telephone number.
  • "Speaker registration presence/absence" indicates whether or not a registered voice signal is stored in the registered speaker database DB. Furthermore, if a registered voice signal is stored in the registered speaker database DB, the quality associated with the registered voice signal is also displayed.
  • the "sender number” is XXX-XXXXXX-XXXXX
  • the "registered name” is Ada A
  • the "registered address” is ABCDEFG
  • the "age” is 33
  • " ⁇ Speaker registration presence/absence'' is displayed as Yes (quality: low).
  • in frame FR2, candidates considered to be the speaker are displayed as the speaker authentication results, and the probability that the speaker is each candidate is displayed next to the candidate. In the example of frame FR2, the probability is expressed as a percentage, but it is not limited to this and may be expressed as "low, medium, high". In frame FR2, "A-field A-man: 70%", "B-yama B-ro: 25%", and "C-kawa C-husband: 5%" are displayed as the authentication results.
  • a sentence to be uttered by the speaker (hereinafter referred to as an identity authentication sentence) is displayed.
  • frame FR3 "I am Hamayarawa, I am Hachikyujuzero" is displayed as a text for personal authentication.
  • the button BT1 is a button to start or stop authentication.
  • the operator OP utters the utterance voice OP10, "Please say, 'It's Hamayarawa. It's Hachikyujuzero.'", based on the personal authentication text displayed in frame FR3 of screen SC1. Based on the utterance of the operator OP, the user US utters the utterance voice US13, "It's Hamayarawa. It's Hachikyujuzero."
  • Screen SC2 is an example of an operator screen displayed on display DP.
  • Speaker registration information is displayed in frame FR5.
  • in frame FR5, "caller number", "registered name", "registered address", "age", and "speaker registration presence/absence" are displayed as speaker registration information.
  • the "sender number” is XXX-XXXXXX-XXXXX
  • the "registered name” is B-Yama Brou
  • the "registered address” is GFEDCBA
  • the "age” is 44
  • " ⁇ Speaker registration presence/absence'' is displayed as Yes (quality: high).
  • the button BT2 is a button to start or stop authentication.
  • the operator OP utters the utterance voice OP11, "Please say, 'It's Hamayarawa.'", based on the personal authentication text displayed in frame FR7 of screen SC2. Based on the utterance of the operator OP, the user US utters the utterance voice US14, "It's Hamayarawa."
  • in the example described above, the operator OP reads out the personal authentication text displayed on the operator screen and has the user US speak it; however, the personal authentication text may instead be read out by an automatic voice to have the user US speak it.
  • in this way, the authentication system 100 allows the user US and the operator OP to perform authentication without worrying about the length of the utterance. Further, when the quality of the registered audio signal is high, the authentication system 100 can specify a short phrase to the user US for authentication, thereby reducing the time required for authentication. Furthermore, when the quality of the registered audio signal is low, the authentication system 100 can maintain high authentication accuracy by specifying a longer phrase for the user US than when the quality is high, thereby preventing authentication failures and redoing of authentication.
  • FIG. 8 is a diagram illustrating an example of performing identity verification authentication based on the identity verification text displayed on the user side call terminal.
  • Case CE is an example where the quality of the registered audio signal is low.
  • Screen SC3 is an example of a screen displayed on user side call terminal UP1.
  • the user US looks at the content displayed on the screen SC3 and utters the utterance voice US13, "It's Hamayarawa. It's Hachikyujuzero.”
  • Case CF is an example where the quality of the registered audio signal is high.
  • Screen SC4 is an example of a screen displayed on user side call terminal UP1.
  • the user US looks at the content displayed on the screen SC4 and utters the utterance voice US14, "It's Hamayarawa."
  • in this way, the authentication system 100 can authenticate the user US without the user US worrying about the length of his or her utterance. Furthermore, the authentication system 100 can thereby authenticate the user US unattended, without the intervention of a person such as the operator OP.
  • FIG. 9 is a diagram illustrating an example of resetting the required time of the authentication condition based on the measurement result of the sound collection condition at the time of authentication after setting the authentication condition.
  • FIG. 10 is a diagram showing an example of resetting the threshold value of the authentication condition based on the measurement result of the sound collection condition at the time of authentication after setting the authentication condition.
  • the authentication sound collection condition measurement unit 21G measures, as the sound collection conditions at the time of authentication (hereinafter referred to as the authentication sound collection conditions), the noise, volume, or degree of reverberation of the uttered audio signal collected at the time of authentication, or the number of phonemes included in the uttered audio signal.
  • FIGS. 9 and 10 show examples in which the authentication conditions are set once (hereinafter referred to as the initial authentication conditions) and are then reset based on the measured authentication sound collection conditions.
  • when there is a lot of noise as the authentication sound collection condition, the authentication condition setting unit 21F increases the required time by 3 seconds as the authentication condition.
  • when the volume is low, the authentication condition setting unit 21F increases the required time by 3 seconds as the authentication condition.
  • when the number of phonemes included in the uttered audio signal is small, the authentication condition setting unit 21F increases the required time by 3 seconds as the authentication condition.
  • when the degree of reverberation is large, the authentication condition setting unit 21F increases the required time by 5 seconds as the authentication condition.
  • the length of the required time to be increased regarding noise, volume, number of phonemes, and reverberation is an example and is not limited to these.
  • the initial authentication conditions are a required time of 15 seconds and a determination threshold of 70.
  • note that the initial authentication conditions are just an example and are not limited thereto. If there is a lot of noise as the authentication sound collection condition, the authentication condition setting unit 21F increases the required time by 3 seconds; as a result, the authentication conditions after resetting are a required time of 18 seconds and a determination threshold of 70. If the volume is low as the authentication sound collection condition, the authentication condition setting unit 21F likewise increases the required time by 3 seconds, so the authentication conditions after resetting are a required time of 18 seconds and a determination threshold of 70.
  • the initial authentication conditions are a required time of 7 seconds and a determination threshold of 70.
  • the initial authentication conditions are just an example and are not limited thereto.
  • when there is a lot of noise and the volume is low as the authentication sound collection conditions, for example, the authentication condition setting unit 21F increases the required time by 6 seconds in total.
  • as a result, the authentication conditions after resetting are a required time of 13 seconds and a determination threshold of 70.
  • when the number of phonemes is small, the authentication condition setting unit 21F increases the required time by 3 seconds.
  • as a result, the authentication conditions after resetting are a required time of 10 seconds and a determination threshold of 70.
  • when the degree of reverberation is large, the authentication condition setting unit 21F increases the required time by 5 seconds.
  • as a result, the authentication conditions after resetting are a required time of 12 seconds and a determination threshold of 70. If the authentication sound collection conditions are good, the authentication conditions remain the same as the initial authentication conditions.
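The resetting of the required time in FIG. 9 can be sketched as an additive adjustment. The pairing of conditions to increments below is partly inferred from the figure's examples (noise and low volume are stated explicitly; few phonemes and strong reverberation are deduced from the remaining values), so it is an assumption rather than a definitive mapping.

```python
# Assumed pairing of measured conditions to required-time increments
# per FIG. 9; the actual pairing in the embodiment may differ.
TIME_INCREMENTS_SEC = {
    "much_noise": 3,
    "low_volume": 3,
    "few_phonemes": 3,
    "much_reverberation": 5,
}

def reset_required_time(initial_sec, measured_conditions):
    """Add the increment for each poor sound collection condition."""
    return initial_sec + sum(TIME_INCREMENTS_SEC[c] for c in measured_conditions)
```

With an initial 7-second required time, noise plus low volume yields 13 seconds, and strong reverberation alone yields 12 seconds, matching the figure's examples.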
  • the threshold value of the authentication condition is reset based on the measurement result of the sound collection condition at the time of authentication after the authentication condition is set.
  • when there is a lot of noise as the authentication sound collection condition, the authentication condition setting unit 21F lowers the determination threshold by 10 as the authentication condition.
  • when the volume is low, the authentication condition setting unit 21F lowers the determination threshold by 15 as the authentication condition.
  • when the number of phonemes included in the uttered audio signal is small, the authentication condition setting unit 21F lowers the determination threshold by 10 as the authentication condition.
  • when the degree of reverberation is large, the authentication condition setting unit 21F lowers the determination threshold by 20 as the authentication condition.
  • the values of the determination thresholds to be lowered with respect to noise, volume, number of phonemes, and reverberation are only examples, and are not limited to these.
  • the initial authentication conditions are a required time of 15 seconds and a determination threshold of 70.
  • note that the initial authentication conditions are just an example and are not limited thereto. If there is a lot of noise as the authentication sound collection condition, the authentication condition setting unit 21F lowers the determination threshold by 10; as a result, the authentication conditions after resetting are a required time of 15 seconds and a determination threshold of 60. If the volume is low as the authentication sound collection condition, the authentication condition setting unit 21F lowers the determination threshold by 15, so the authentication conditions after resetting are a required time of 15 seconds and a determination threshold of 55.
  • the initial authentication conditions are a required time of 7 seconds and a determination threshold of 70.
  • the initial authentication conditions are just an example and are not limited thereto.
  • when there is a lot of noise and the volume is low as the authentication sound collection conditions, for example, the authentication condition setting unit 21F lowers the determination threshold by 25 in total.
  • as a result, the authentication conditions after resetting are a required time of 7 seconds and a determination threshold of 45.
  • when the number of phonemes is small, the authentication condition setting unit 21F lowers the determination threshold by 10.
  • as a result, the authentication conditions after resetting are a required time of 7 seconds and a determination threshold of 60.
  • when the degree of reverberation is large, the authentication condition setting unit 21F lowers the determination threshold by 20.
  • as a result, the authentication conditions after resetting are a required time of 7 seconds and a determination threshold of 50. If the authentication sound collection conditions are good, the authentication conditions remain the same as the initial authentication conditions.
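The threshold resetting of FIG. 10 follows the same additive pattern, only subtracting from the determination threshold. Noise (-10) and low volume (-15) are stated in the text; pairing -10 with few phonemes and -20 with strong reverberation is inferred from the remaining examples, so the mapping below is an assumption.

```python
# Assumed pairing of conditions to determination-threshold decrements
# per FIG. 10; noise and low volume are stated, the rest inferred.
THRESHOLD_DECREMENTS = {
    "much_noise": 10,
    "low_volume": 15,
    "few_phonemes": 10,
    "much_reverberation": 20,
}

def reset_threshold(initial_threshold, measured_conditions):
    """Lower the determination threshold for each poor condition."""
    return initial_threshold - sum(THRESHOLD_DECREMENTS[c] for c in measured_conditions)
```

Starting from 70, noise plus low volume yields 45, matching the figure's "lowered by 25 in total" example.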
  • in this way, when the authentication sound collection conditions are poor, the authentication system 100 can perform authentication with high accuracy by lengthening the required time, lowering the determination threshold, or both.
  • FIG. 11 is a flowchart of processing related to speaker authentication. Each process related to FIG. 11 is executed by the processor 21.
  • the comparison target setting unit 21D sets, from among the plurality of people registered in the registered speaker database DB, the person to be used for authentication when performing authentication to confirm the identity of a speaker who is an authentication target (St20).
  • the comparison target setting unit 21D acquires information regarding the quality of the registered voice signal of the person who is the comparison target set in the process of step St20 from the registered speaker database DB (St21).
  • the comparison target setting section 21D outputs the acquired information to the authentication condition setting section 21F. Further, the comparison target setting unit 21D acquires the registered feature amount of the registered voice signal of the person who is the comparison target from the registered speaker database DB, and outputs it to the similarity calculation unit 21E.
  • the authentication sound collection condition measurement unit 21G measures the authentication sound collection conditions (St22). Note that the process of step St22 may be omitted from the process of the flowchart related to FIG. 11.
  • the authentication condition setting unit 21F sets authentication conditions based on the information regarding quality acquired from the comparison target setting unit 21D in the process of step St21 (St23).
  • the processor 21 transmits a signal to start the authentication process to the communication unit 20 (St24).
  • the communication unit 20 transmits an instruction to the operator side telephone terminal OP1 to start authentication.
  • the authentication condition setting unit 21F acquires the registered feature amount of the speaker's registered voice signal from the comparison target setting unit 21D.
  • the authentication condition setting unit 21F specifies the wording of the utterance content to be used for authentication based on the acquired registered feature amount (St25). Note that the process of step St25 may be omitted from the process of the flowchart related to FIG. 11.
  • the processor 21 starts receiving the uttered audio signal used for authentication (St26).
  • the processor 21 outputs the acquired speech audio signal to the feature extraction unit 21C.
  • the authentication sound collection condition measurement unit 21G measures the authentication sound collection conditions (St27).
  • the authentication sound collection condition measuring unit 21G outputs information on the measured authentication sound collection conditions to the authentication condition setting unit 21F. Note that the process of step St27 may be omitted from the process of the flowchart related to FIG. 11.
  • the authentication condition setting unit 21F resets the authentication condition based on the authentication sound collection condition acquired in the process of step St27 (St28).
  • the processor 21 acquires the uttered audio signal based on the authentication condition reset by the authentication condition setting unit 21F in the process of step St28.
  • the processor 21 outputs the acquired speech audio signal to the feature extraction unit 21C. Note that the process of step St28 may be omitted from the process of the flowchart related to FIG. 11.
  • the processor 21 transmits a signal to the communication unit 20 to end the authentication process, that is, to end the reception of the uttered audio signal used for authentication (St29).
  • the communication unit 20 transmits an instruction to end the authentication to the operator side call terminal OP1.
  • the feature amount extraction unit 21C extracts the speech feature amount of the speech audio signal acquired in the process of step St26 or the process of step St28 (St30).
  • the feature extraction unit 21C outputs the extracted utterance feature to the similarity calculation unit 21E.
  • the similarity calculation unit 21E calculates the similarity based on the registered feature obtained in step St21 and the utterance feature obtained in the process of step St30 (St31).
  • the similarity calculation unit 21E determines whether the similarity calculated in the process of step St31 is greater than or equal to a predetermined threshold (St32). When the similarity calculation unit 21E determines that the similarity is greater than or equal to the predetermined threshold (St32, YES), it outputs a signal indicating that the authentication of the speaker's identity was successful to the communication unit 20, the display I/F 23, or both.
  • when the similarity calculation unit 21E determines that the similarity is less than the predetermined threshold (St32, NO), the similarity calculation unit 21E determines whether or not to continue the authentication process (St34).
  • when the similarity calculation unit 21E determines to continue the authentication process, the process of the processor 21 returns to the process of step St23.
  • when the similarity calculation unit 21E determines not to continue the authentication process (St35, NO), it outputs a signal indicating that the authentication of the speaker's identity has failed to the communication unit 20, the display I/F 23, or both.
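Steps St31 and St32 reduce to computing a degree of similarity between the registered feature and the utterance feature and comparing it with the determination threshold. The sketch below uses a cosine similarity scaled to 0-100 as a stand-in metric; the embodiment only requires some degree of similarity, so this metric and all names are assumptions.

```python
import math

def similarity_0_100(registered, uttered):
    # Cosine similarity between two feature vectors, scaled to 0-100.
    # This is an illustrative stand-in, not the embodiment's metric.
    dot = sum(a * b for a, b in zip(registered, uttered))
    norm = math.sqrt(sum(a * a for a in registered)) * math.sqrt(sum(b * b for b in uttered))
    return 100.0 * dot / norm

def verify(registered, uttered, threshold=70):
    # St31-St32: compute the similarity and compare with the threshold
    # (70 is the example determination threshold used throughout).
    return similarity_0_100(registered, uttered) >= threshold
```

Lowering the threshold, as in FIG. 10, makes this comparison more tolerant of degraded sound collection conditions.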
  • FIG. 12 is a diagram illustrating an example in which restrictions are placed on operations after successful authentication depending on the quality of the registered audio signal.
  • Registered audio signal US10 and registered audio signal US12 according to FIG. 12 are similar to registered audio signal US10 and registered audio signal US12 according to FIG. 4. Therefore, in FIG. 12, the explanation up to the setting of the authentication conditions is omitted.
  • the operation restriction setting unit 21H may place restrictions on the operations (for example, deposits) that are available after the speaker's identity has been successfully authenticated, based on the quality of the registered audio signal.
  • although examples in which the authentication system 100 is installed are not limited to bank ATMs, here, for convenience of explanation, it is assumed that the authentication system 100 is installed in a bank ATM.
  • Case CG is an example of authentication when the quality of the registered audio signal is low.
  • in case CG, since the quality of the registered audio signal is low, the operation restriction setting unit 21H restricts the operation mode and operates in the restricted mode.
  • in the restricted mode, for example, only balance inquiries and deposits are possible. Note that the operations that are possible in the restricted mode are merely examples and are not limited to these.
  • Case CH is an example of authentication when the quality of the registered audio signal is high.
  • in case CH, since the quality of the registered audio signal is high, the operation restriction setting unit 21H operates in a normal mode in which no restrictions are placed on the operation mode. In the normal mode, all operations, such as balance inquiries, deposits, remittances, or transfers, are possible. Note that the operations that are possible in the normal mode are just examples and are not limited to these.
  • in this way, when the quality of the registered audio signal is low, the authentication system 100 can reduce the risk arising from an erroneous determination by placing operational restrictions on the machine in which the authentication system 100 is installed.
  • FIG. 13 is a flowchart of a process for setting operation restrictions based on the quality of registered audio signals. Each process in the flowchart in FIG. 13 is executed by the processor 21. It should be noted that processes in the flowchart of FIG. 13 that are similar to those in the flowchart of FIG. 11 are given the same reference numerals, and description thereof will be omitted.
  • the operation restriction setting unit 21H determines whether the quality of the registered audio signal is high (St36).
  • the operation restriction setting unit 21H determines that the quality of the registered audio signal is high (St36, YES), it sets the operation mode to the normal mode (St37).
  • the operation restriction setting unit 21H determines that the quality of the registered audio signal is low (St36, NO), it sets the operation mode to the restriction mode (St38).
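The mode selection of steps St36 to St38 can be sketched as follows. The operation sets merely mirror the bank-ATM illustration above and are not limiting; the names are invented for explanation.

```python
def select_operation_mode(registered_quality):
    # St36-St38: normal mode for high quality, restricted mode otherwise.
    return "normal" if registered_quality == "high" else "restricted"

# Illustrative operation sets, following the bank-ATM example only.
ALLOWED_OPERATIONS = {
    "restricted": {"balance_inquiry", "deposit"},
    "normal": {"balance_inquiry", "deposit", "remittance", "transfer"},
}
```

A machine hosting the system would consult ALLOWED_OPERATIONS[mode] before accepting each requested operation.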
  • the authentication system (eg, authentication system 100) according to the present embodiment includes an acquisition unit (eg, user-side call terminal UP1) that acquires the audio signal of the speaker's uttered voice.
  • the authentication system includes a detection unit (for example, the speech section detection unit 21A) that detects a first utterance section in which the speaker is speaking based on the acquired audio signal, and a second utterance section in which the speaker is speaking based on an audio signal in a database in which the audio signals of each of a plurality of speakers are registered.
  • the authentication system includes a determination unit (for example, the authentication condition setting unit 21F) that determines, based on the length of the second utterance section or the number of sounds included in the second audio signal of the second utterance section, authentication conditions for authentication in which the first audio signal of the first utterance section is compared with the second audio signal of the second utterance section.
  • the authentication system includes an authentication section (eg, similarity calculation section 21E) that authenticates the speaker based on the determined authentication conditions.
  • thereby, the authentication system can determine the authentication conditions required of the user at the time of authentication based on the utterance length or the number of sounds of the registered audio signal, so the authentication conditions can be changed for each user.
  • the authentication system can determine the utterance time at the time of authentication according to the total length of the user's utterance voice acquired at the time of registration, and improve convenience for the user.
  • the authentication unit of the authentication system starts authentication when the length of the second utterance section is equal to or greater than a first predetermined value and the length of the first utterance section is equal to or greater than a second predetermined value.
  • the authentication unit starts authentication when the length of the second utterance section is less than the first predetermined value and the length of the first utterance section is equal to or greater than a third predetermined value that is larger than the second predetermined value.
  • the authentication system thereby requests the user to speak for a number of seconds estimated to be sufficient for authentication, set according to the quality of the registered audio signal as determined from its utterance length.
  • the authentication system 100 can determine the utterance time at the time of authentication according to the length of the total time of the user's uttered voice acquired at the time of registration, and can improve convenience for the user.
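The length-based rule above might be sketched as follows. This is a minimal illustration only; the concrete threshold values and function names are hypothetical, since the disclosure refers only to abstract first, second, and third predetermined values.

```python
# Hypothetical thresholds; the disclosure speaks only of abstract
# "first/second/third predetermined values".
FIRST_PREDETERMINED = 10.0   # seconds: registered (second) utterance section is "long enough"
SECOND_PREDETERMINED = 3.0   # seconds required live when registration is long enough
THIRD_PREDETERMINED = 6.0    # seconds required live otherwise (> SECOND_PREDETERMINED)

def required_authentication_seconds(registered_seconds: float) -> float:
    """Choose how long the user must speak at authentication time,
    based on the length of the registered (second) utterance section."""
    if registered_seconds >= FIRST_PREDETERMINED:
        return SECOND_PREDETERMINED
    return THIRD_PREDETERMINED

def can_start_authentication(registered_seconds: float, live_seconds: float) -> bool:
    """Start authentication once the live (first) utterance section reaches
    the required length for this user's registration quality."""
    return live_seconds >= required_authentication_seconds(registered_seconds)
```

Under these example values, a user whose registered utterance is 12 s long only needs to speak 3 s at authentication, while a user registered with 5 s of speech must speak 6 s.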
  • the authentication unit of the authentication system starts authentication when the number of sounds included in the second speech section is equal to or greater than a fourth predetermined value and the length of the first speech section is equal to or greater than the second predetermined value. The authentication unit starts authentication when the number of sounds included in the second speech section is less than the fourth predetermined value and the length of the first speech section is equal to or greater than the third predetermined value, which is larger than the second predetermined value.
  • the authentication system thereby requests the user to speak for a number of seconds estimated to be sufficient for stable authentication, set according to the quality of the registered audio signal as determined from its number of sounds. This can prevent authentication from failing because the utterance is too short. Thereby, the authentication system can improve user convenience during authentication.
  • the authentication unit of the authentication system starts authentication when the length of the second speech section is equal to or greater than the first predetermined value, the number of sounds included in the second speech section is equal to or greater than the fourth predetermined value, and the length of the first speech section is equal to or greater than the second predetermined value. The authentication unit starts authentication when the length of the second speech section is less than the first predetermined value or the number of sounds included in the second speech section is less than the fourth predetermined value, and the length of the first speech section is equal to or greater than the third predetermined value, which is larger than the second predetermined value.
  • the authentication system can thereby determine the utterance length required of the user at the time of authentication according to both the utterance length and the number of sounds of the registered audio signal. Since the authentication system can determine the authentication conditions based on both the utterance length and the number of sounds, it can improve user convenience and perform more accurate authentication.
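The combined length-and-sound-count condition might be sketched as follows. The threshold values and names are hypothetical illustrations; the disclosure refers only to abstract first through fourth predetermined values.

```python
# Hypothetical thresholds for the combined condition.
FIRST_PREDETERMINED = 10.0   # seconds: required length of the registered utterance
FOURTH_PREDETERMINED = 20    # required number of sounds in the registered utterance
SECOND_PREDETERMINED = 3.0   # seconds required live when registration quality is sufficient
THIRD_PREDETERMINED = 6.0    # seconds required live otherwise (> SECOND_PREDETERMINED)

def required_live_seconds(registered_seconds: float, registered_sound_count: int) -> float:
    """Registration counts as high quality only when BOTH the length and the
    number of sounds of the second utterance section are sufficient; if either
    is deficient, a longer live utterance is required."""
    if (registered_seconds >= FIRST_PREDETERMINED
            and registered_sound_count >= FOURTH_PREDETERMINED):
        return SECOND_PREDETERMINED
    return THIRD_PREDETERMINED
```

The AND in the high-quality branch mirrors the first condition above, and falling through on either deficiency mirrors the OR in the second condition.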
  • the determination unit of the authentication system determines a text that prompts the speaker to speak, based on the length of the second speech section and the speech content included in the audio signal of the second speech section. Thereby, the authentication system can perform authentication in a short time while maintaining high authentication accuracy, by having the user utter a phrase specified based on the utterance content of the registered audio signal according to its utterance length. Furthermore, if the utterance length of the registered audio signal is sufficiently long, the authentication system does not specify the content of the utterance, which saves the user US effort.
  • the authentication system further includes a first display unit (for example, display DP) that displays a screen that the operator refers to during authentication, and the determination unit causes the first display unit to display text.
  • the authentication system allows the user and the operator to perform authentication without worrying about the length of the utterance.
  • when the quality of the registered audio signal is high, the authentication system can specify a short phrase to the user for authentication, thereby reducing the time required for authentication.
  • when the quality of the registered audio signal is low, the authentication system can maintain high authentication accuracy by specifying a longer phrase for the user than when the quality is high, and can prevent authentication failures and retries.
  • the authentication system further includes a second display unit (for example, user-side call terminal UP1) that displays a screen that the speaker refers to during authentication, and the determination unit causes the second display unit to display the text.
  • the authentication system can perform authentication without requiring the user to worry about the length of the utterance. Furthermore, this allows the authentication system to authenticate users unattended without the intervention of a person such as an operator.
  • the authentication system further includes a measurement unit that measures at least one of the noise, volume, number of phonemes, or reverberation level of the audio signal in the first speech section, and the determination unit sets the authentication conditions based on the measurement results obtained from the measurement unit.
  • the authentication system can thereby perform authentication with high accuracy even when the sound collection conditions at the time of authentication are poor, by lengthening the required utterance time, lowering the determination threshold, or both.
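The adjustment described above might look like the following sketch. The quality criteria (an SNR floor and a reverberation ceiling) and the adjustment amounts are hypothetical, chosen only to illustrate the direction of the change.

```python
def adjust_conditions(required_seconds: float, decision_threshold: float,
                      snr_db: float, reverb_time_s: float):
    """If sound collection conditions at authentication time are poor,
    lengthen the required utterance and lower the decision threshold.
    The criteria and amounts below are illustrative assumptions."""
    if snr_db < 10.0 or reverb_time_s > 0.8:
        required_seconds *= 1.5        # ask for a longer live utterance
        decision_threshold -= 0.05     # tolerate a slightly lower similarity
    return required_seconds, decision_threshold
```

With clean audio the conditions pass through unchanged; with a 5 dB SNR, a 4-second requirement becomes 6 seconds and a 0.80 threshold becomes 0.75.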
  • the authentication condition of the authentication system is the length of the audio signal or the threshold value for determination related to the authentication of the speaker's identity.
  • the authentication system can specify the utterance length to be specified to the user according to the quality of the registered audio signal of each user, and can improve convenience for the user.
  • the authentication system can change the threshold value for determination depending on the quality of the user's registered audio signal, and can perform flexible authentication depending on each user.
  • the authentication system further includes a restriction setting unit (for example, an operation restriction setting section 21H) that, when the length of the second utterance section is less than the first predetermined value, places a restriction on operations after authentication.
  • the technology of the present disclosure is useful as an authentication system and method that improves user convenience by determining the utterance time during authentication according to the total length of the user's utterances acquired at the time of registration.

Abstract

The authentication system comprises: an acquisition unit that acquires an audio signal of a speaker's spoken voice; a detection unit that detects a first utterance section in which the speaker is speaking from the audio signal, and a second utterance section in which the speaker is speaking from the audio signals of a database in which the audio signals of a plurality of speakers are registered; a determination unit that compares a first audio signal of the first utterance section with a second audio signal of the second utterance section and determines authentication conditions for authentication using the first audio signal based on the length of the second audio signal of the second utterance section or the number of sounds included in the second utterance section; and an authentication unit that authenticates the speaker based on the determined authentication conditions.

Description

Authentication system and authentication method

The present disclosure relates to an authentication system and an authentication method.
Patent Document 1 discloses a telephone communication device that registers voiceprint data for voiceprint authentication from voice received during a telephone call. The telephone device acquires the received voice, acquires the telephone number of the calling party, and extracts voiceprint data from the acquired voice. Next, the telephone device measures the acquisition time of the received voice. The telephone device determines whether the total acquisition time of at least one piece of voiceprint data stored in the telephone directory in correspondence with the acquired telephone number is longer than the time required for voiceprint verification. If the telephone device determines that the total acquisition time of the voiceprint data is longer than the time required for voiceprint verification, it associates the acquired telephone number with the voiceprint data and stores them in the storage unit.
Japanese Patent Application Publication No. 2016-53598
In Patent Document 1, when the total acquisition time of the speaker's voiceprint data reaches or exceeds a predetermined value, the voiceprint data is registered in a database in association with the speaker's telephone number. In other words, the telephone device disclosed in Patent Document 1 always requires registration voiceprint data whose total acquisition time is at least the predetermined value. The user must therefore speak for at least that length of time to register voiceprint data, and must also speak for a comparable length of time during voiceprint authentication, so there is room for improvement in user convenience.
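The registration behaviour described for Patent Document 1 — accumulating voiceprint data per telephone number until the total capture time reaches the time needed for verification — can be sketched as follows. The class, its names, and the 15-second requirement are hypothetical illustrations, not the document's actual implementation.

```python
from collections import defaultdict

REQUIRED_SECONDS = 15.0  # hypothetical time needed for voiceprint verification

class VoiceprintRegistry:
    """Accumulate captured voiceprint data per telephone number and register
    it only once the cumulative capture time reaches REQUIRED_SECONDS."""

    def __init__(self):
        self._pending = defaultdict(list)   # phone number -> [(data, seconds)]
        self.registered = {}                # phone number -> list of voiceprint data

    def capture(self, phone_number: str, voiceprint: bytes, seconds: float):
        self._pending[phone_number].append((voiceprint, seconds))
        total = sum(s for _, s in self._pending[phone_number])
        if total >= REQUIRED_SECONDS:
            self.registered[phone_number] = [d for d, _ in self._pending[phone_number]]
```

This illustrates the drawback the present disclosure addresses: nothing is registered, and hence nothing can be authenticated, until the caller has spoken for the full predetermined time.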
The present disclosure has been devised in view of the conventional situation described above, and aims to improve user convenience by determining the utterance time during authentication according to the total length of the user's uttered voice acquired at the time of registration.
The present disclosure provides an authentication system including: an acquisition unit that acquires an audio signal of a speaker's uttered voice; a detection unit that detects, from the acquired audio signal, a first utterance section in which the speaker is speaking, and, from the audio signals of a database in which the audio signals of each of a plurality of speakers are registered, a second utterance section in which the speaker is speaking; a determination unit that compares a first audio signal of the first utterance section with a second audio signal of the second utterance section and determines authentication conditions for authentication using the first audio signal based on the length of the second audio signal of the second utterance section or the number of sounds included in the second utterance section; and an authentication unit that authenticates the speaker based on the determined authentication conditions.
The present disclosure also provides an authentication method performed by one or more computers, the method including: acquiring an audio signal of a speaker's uttered voice; detecting, from the acquired audio signal, a first utterance section in which the speaker is speaking, and, from the audio signals of a database in which the audio signals of each of a plurality of speakers are registered, a second utterance section in which the speaker is speaking; comparing a first audio signal of the first utterance section with a second audio signal of the second utterance section and determining authentication conditions for authentication using the first audio signal based on the length of the second audio signal of the second utterance section or the number of sounds included in the second utterance section; and authenticating the speaker based on the determined authentication conditions.
Note that these comprehensive or specific aspects may be realized by a system, an apparatus, a method, an integrated circuit, a computer program, or a recording medium, or by any combination of a system, an apparatus, a method, an integrated circuit, a computer program, and a recording medium.
According to the present disclosure, the utterance time during authentication can be determined according to the total length of the user's uttered voice acquired at the time of registration, thereby improving user convenience.
A diagram showing an example of a use case of the authentication system according to the present embodiment
A block diagram showing an example of the internal configuration of the authentication analysis device according to the present embodiment
A flowchart of the registration processing of an uttered audio signal for registration
A diagram showing an example of setting authentication conditions based on utterance length
A diagram showing an example of setting authentication conditions based on utterance length and number of sounds
A diagram showing an example of setting utterance content as an authentication condition according to the quality of the registered audio signal
A diagram showing an example in which an operator performs identity verification based on an authentication text displayed on a screen
A diagram showing an example in which identity verification is performed based on an identity verification text displayed on the user-side call terminal
A diagram showing an example in which, after the authentication conditions are set, the required time of the authentication conditions is reset based on the measurement results of the sound collection conditions at the time of authentication
A diagram showing an example in which, after the authentication conditions are set, the threshold value of the authentication conditions is reset based on the measurement results of the sound collection conditions at the time of authentication
A flowchart of processing related to speaker authentication
A diagram showing an example in which restrictions are placed on operations after successful authentication depending on the quality of the registered audio signal
A flowchart of processing for setting operation restrictions based on the quality of the registered audio signal
Hereinafter, embodiments specifically disclosing an authentication system and an authentication method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed description of already well-known matters and redundant description of substantially identical configurations may be omitted. This is to avoid unnecessary redundancy in the following description and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided so that those skilled in the art can fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.
First, a use case of the authentication system according to the present embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing an example of a use case of the authentication system according to the present embodiment. The authentication system 100 acquires an audio signal or audio data of a person to be authenticated by voice (the user US in the example shown in FIG. 1), and compares the acquired audio signal or audio data with the speakers' audio signals or audio data registered (stored) in advance in a storage (the registered speaker database DB in the example shown in FIG. 1). Based on the comparison result, the authentication system 100 evaluates the degree of similarity between the audio signal or audio data collected from the user US, who is the authentication target, and the audio signals or audio data registered in the storage, and authenticates the user US based on the evaluated degree of similarity.
The authentication system 100 according to Embodiment 1 includes an operator-side call terminal OP1 as an example of a sound collection device, an authentication analysis device P1, a registered speaker database DB, and a display DP as an example of an output device. Note that the authentication analysis device P1 and the display DP may be configured integrally. The operator-side call terminal OP1 may be replaced with an automatic voice device, in which case the automatic voice device may be configured integrally with the authentication analysis device P1.
Note that the authentication system 100 shown in FIG. 1 is, as an example, used to authenticate a speaker (user US) in a call center, and authenticates the user US using audio data obtained by collecting the voice uttered by the user US while talking with the operator OP. The authentication system 100 shown in FIG. 1 further includes a user-side call terminal UP1 and a network NW. It goes without saying that the overall configuration of the authentication system 100 is not limited to the example shown in FIG. 1.
The user-side call terminal UP1 is connected to the operator-side call terminal OP1 via the network NW so that they can communicate wirelessly. The wireless communication referred to here is network communication via a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark).
The user-side call terminal UP1 is configured by, for example, a notebook PC, a tablet terminal, a smartphone, or a telephone. The user-side call terminal UP1 is a sound collection device equipped with a microphone (not shown); it collects the voice uttered by the user US, converts it into an audio signal, and transmits the converted audio signal to the operator-side call terminal OP1 via the network NW. The user-side call terminal UP1 also acquires the audio signal of the operator OP's uttered voice transmitted from the operator-side call terminal OP1 and outputs it from a speaker (not shown).
The network NW is, for example, an IP (Internet Protocol) network or a telephone network, and connects the user-side call terminal UP1 and the operator-side call terminal OP1 so that audio signals can be transmitted and received between them. Data transmission and reception is performed by wired communication or wireless communication.
The operator-side call terminal OP1 is connected to the user-side call terminal UP1 and the authentication analysis device P1 so that data can be transmitted and received by wired or wireless communication, and transmits and receives audio signals.
The operator-side call terminal OP1 is configured by, for example, a notebook PC, a tablet terminal, a smartphone, or a telephone. The operator-side call terminal OP1 acquires the audio signal based on the voice uttered by the user US, transmitted from the user-side call terminal UP1 via the network NW, and transmits it to the authentication analysis device P1. Note that, when the operator-side call terminal OP1 acquires an audio signal containing both the uttered voice of the user US and the uttered voice of the operator OP, it may separate the audio signal based on the uttered voice of the user US from the audio signal based on the uttered voice of the operator OP, based on audio parameters of the audio signal such as sound pressure level and frequency band. After separation, the operator-side call terminal OP1 extracts only the audio signal based on the uttered voice of the user US and transmits it to the authentication analysis device P1.
The operator-side call terminal OP1 may also be communicably connected to each of a plurality of user-side call terminals and may simultaneously acquire audio signals from each of the plurality of user-side call terminals. The operator-side call terminal OP1 transmits the acquired audio signals to the authentication analysis device P1. Thereby, the authentication system 100 can execute voice authentication processing and voice analysis processing for a plurality of users at the same time.
The operator-side call terminal OP1 may also acquire an audio signal containing the uttered voices of a plurality of users at the same time. The operator-side call terminal OP1 extracts an audio signal for each user from the audio signals of the plurality of users acquired via the network NW, and transmits each user's audio signal to the authentication analysis device P1. In such a case, the operator-side call terminal OP1 may analyze the audio signals of the plurality of users and separate and extract the audio signal of each user based on audio parameters such as sound pressure level and frequency band. When the audio signals are collected by an array microphone or the like, the operator-side call terminal OP1 may separate and extract the audio signal of each user based on the direction of arrival of the uttered voice. Thereby, the authentication system 100 can execute voice authentication processing and voice analysis processing for each of a plurality of users even for audio signals collected in an environment where a plurality of users speak at the same time, such as a web conference.
The authentication analysis device P1, which is an example of an authentication device and a computer, is connected to the operator-side call terminal OP1, the registered speaker database DB, and the display DP so that data can be transmitted to and received from each of them. Note that the authentication analysis device P1 may be connected to the operator-side call terminal OP1, the registered speaker database DB, and the display DP via a network (not shown) by wired or wireless communication.
The authentication analysis device P1 acquires the audio signal of the user US transmitted from the operator-side call terminal OP1, analyzes the acquired audio signal, for example for each frequency, and extracts the utterance feature amount of the individual user US. The authentication analysis device P1 refers to the registered speaker database DB, compares the extracted utterance feature amount with the utterance feature amounts of each of the plurality of users registered in advance in the registered speaker database DB, and executes voice authentication of the user US. The authentication analysis device P1 generates an authentication result screen SC including the authentication result of the user US, and transmits it to the display DP for output. Note that the authentication result screen SC shown in FIG. 1 is merely an example and is not limited thereto. The authentication result screen SC shown in FIG. 1 includes, for example, the message "The voice matches Mr. Taro Yamada's voice.", which is the authentication result of the user US.
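The disclosure does not specify the comparison algorithm used between the extracted and registered utterance feature amounts. As a purely illustrative sketch, a common choice in speaker verification is cosine similarity between feature vectors, compared against a decision threshold; the threshold value below is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def authenticate(live_features, enrolled_features, threshold=0.8):
    """Accept the speaker when the similarity reaches the decision threshold."""
    return cosine_similarity(live_features, enrolled_features) >= threshold
```

A similarity of 1.0 means identical direction in feature space; the determination threshold here corresponds to the "threshold value for determination" that the authentication conditions may adjust.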
The registered speaker database DB, which is an example of a database, is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD (Hard Disk Drive), or an SSD (Solid State Drive). The registered speaker database DB stores (registers) the user information of each of a plurality of users in association with their utterance feature amounts. The user information here is information about a user, such as a user name, a user ID (Identification), or identification information assigned to each user. Note that the registered speaker database DB may be configured integrally with the authentication analysis device P1.
The display DP is configured using, for example, an LCD (Liquid Crystal Display) or an organic EL (Electroluminescence) display, and displays the authentication result screen SC transmitted from the authentication analysis device P1. Note that the display DP may be configured integrally with the authentication analysis device P1.
In the example shown in FIG. 1, the user-side call terminal UP1 collects the user US's uttered voice COM12 "I am Taro Yamada" and uttered voice COM14 "It is 123245678", converts them into audio signals, and transmits them to the operator-side call terminal OP1. The operator-side call terminal OP1 transmits the audio signals based on the user US's uttered voices COM12 and COM14 transmitted from the user-side call terminal UP1 to the authentication analysis device P1.
Note that, when the operator-side call terminal OP1 acquires an audio signal in which the operator OP's uttered voice COM11 "Please tell me your name" and uttered voice COM13 "Please tell me your membership number" are collected together with the user US's uttered voices COM12 and COM14, it separates and removes the audio signals based on the operator OP's uttered voices COM11 and COM13, extracts only the audio signals based on the user US's uttered voices COM12 and COM14, and transmits them to the authentication analysis device P1. Thereby, the authentication analysis device P1 can improve user authentication accuracy by using only the audio signal of the person to be authenticated.
Next, an example of the internal configuration of the authentication analysis device according to the present embodiment will be described with reference to FIG. 2. FIG. 2 is a block diagram showing an example of the internal configuration of the authentication analysis device according to the present embodiment. The authentication analysis device P1 includes at least a communication unit 20, a processor 21, and a memory 22.
The communication unit 20 is connected so as to enable data communication with each of the operator-side call terminal OP1 and the registered speaker database DB. The communication unit 20 outputs the audio signal transmitted from the operator-side call terminal OP1 to the processor 21.
The processor 21 is configured using a semiconductor chip on which at least one of electronic devices such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), a GPU (Graphical Processing Unit), and an FPGA (Field Programmable Gate Array) is mounted. The processor 21 functions as a controller that governs the overall operation of the authentication analysis device P1, and performs control processing for supervising the operation of each unit of the authentication analysis device P1, data input/output processing with each unit of the authentication analysis device P1, data arithmetic processing, and data storage processing.
 プロセッサ21は、メモリ22のROM(Read Only Memory)22Aに記憶されたプログラムおよびデータを用いることで、発話区間検出部21A、登録品質判定部21B、特徴量抽出部21C、比較対象設定部21D、類似度計算部21E、認証条件設定部21F、認証時集音条件測定部21Gおよび動作制限設定部21Hのそれぞれの機能を実現する。プロセッサ21は、動作中にメモリ22のRAM(Random Access Memory)22Bを使用し、プロセッサ21および各部が生成あるいは取得したデータもしくは情報をメモリ22のRAM22Bに一時的に保存する。 The processor 21 uses the programs and data stored in the ROM (Read Only Memory) 22A of the memory 22 to realize the respective functions of the utterance section detection unit 21A, the registration quality determination unit 21B, the feature extraction unit 21C, the comparison target setting unit 21D, the similarity calculation unit 21E, the authentication condition setting unit 21F, the authentication-time sound collection condition measurement unit 21G, and the operation restriction setting unit 21H. The processor 21 uses the RAM (Random Access Memory) 22B of the memory 22 during operation, and temporarily stores data or information generated or acquired by the processor 21 and each unit in the RAM 22B of the memory 22.
 検出部の一例としての発話区間検出部21Aは、認証時の発話音声の音声信号(以下、「発話音声信号」と表記)を取得し、取得された発話音声信号を解析し、ユーザUSが発話している発話区間(以下、第1の発話区間と称する)を検出する。発話区間検出部21Aは、発話音声信号から検出された少なくとも1つの第1の発話区間に対応する発話音声信号(以下、第1音声信号と称する)を特徴量抽出部21Cに出力する。また、発話区間検出部21Aは、少なくとも1つの第1の発話区間の第1音声信号をメモリ22のRAM22Bに一時的に保存してもよい。なお、発話区間検出部21Aは、第1の発話区間を複数検出した場合、検出されたそれぞれの第1の発話区間の第1音声信号を連結して特徴量抽出部21Cに出力してもよい。また発話区間検出部21Aは、ユーザUSの認証に用いる発話音声信号を予め登録する際に、ユーザUSから取得した音声データの発話区間(以下、第2の発話区間と称する)を検出する。発話区間検出部21Aは、第2の発話区間に対応する発話音声信号(以下、第2音声信号と称する)を登録品質判定部21Bに出力する。なお、発話区間検出部21Aは、第2の発話区間が複数存在している場合、検出されたそれぞれの第2の発話区間の第2音声信号を連結して登録品質判定部21Bに出力してもよい。 The utterance section detection unit 21A, which is an example of a detection unit, acquires the audio signal of the utterance at the time of authentication (hereinafter referred to as the "uttered audio signal"), analyzes the acquired uttered audio signal, and detects the section in which the user US is speaking (hereinafter referred to as the first utterance section). The utterance section detection unit 21A outputs the uttered audio signal corresponding to at least one first utterance section detected from the uttered audio signal (hereinafter referred to as the first audio signal) to the feature extraction unit 21C. The utterance section detection unit 21A may also temporarily store the first audio signal of at least one first utterance section in the RAM 22B of the memory 22. When a plurality of first utterance sections are detected, the utterance section detection unit 21A may concatenate the first audio signals of the detected first utterance sections and output the result to the feature extraction unit 21C. Furthermore, when an uttered audio signal to be used for authenticating the user US is registered in advance, the utterance section detection unit 21A detects the utterance section of the voice data acquired from the user US (hereinafter referred to as the second utterance section). The utterance section detection unit 21A outputs the uttered audio signal corresponding to the second utterance section (hereinafter referred to as the second audio signal) to the registration quality determination unit 21B. When there are a plurality of second utterance sections, the utterance section detection unit 21A may concatenate the second audio signals of the detected second utterance sections and output the result to the registration quality determination unit 21B.
 処理部の一例としての登録品質判定部21Bは、発話区間検出部21Aから第2の発話区間もしくは複数の第2の発話区間のそれぞれが連結された第2音声信号を取得する。登録品質判定部21Bは、取得した第2音声信号の品質を判定する。品質とは、実際の認証に先だってユーザごとに第2音声信号を登録話者データベースDBに登録する際(登録時)、登録時のユーザの周囲環境の良し悪しあるいはユーザの発話精度、またはその両方を示す指標である。本実施の形態では、この登録時の品質に基づいて、実際の認証時にユーザに課す認証条件(後述参照)が決定される。登録品質判定部21Bは、例えば、第2音声信号の発話の長さ(以下、発話長と称する)または第2音声信号に含まれる音数に基づき品質を判定する。なお、登録品質判定部21Bが品質を判定するのに用いる要素は、発話長と音数に限られず、音素数または単語数でもよい。登録品質判定部21Bは、判定した品質の情報を特徴量抽出部21Cまたは認証条件設定部21Fに出力する。 The registration quality determination unit 21B, which is an example of a processing unit, acquires from the utterance section detection unit 21A the second audio signal of a second utterance section, or a second audio signal in which a plurality of second utterance sections are concatenated. The registration quality determination unit 21B determines the quality of the acquired second audio signal. Quality is an index indicating, for the point when the second audio signal is registered in the registered speaker database DB for each user prior to actual authentication (at registration), how good or bad the user's surrounding environment was at registration, the accuracy of the user's utterance, or both. In the present embodiment, the authentication conditions imposed on the user at actual authentication (described later) are determined based on this quality at registration. The registration quality determination unit 21B determines the quality based on, for example, the length of the utterance of the second audio signal (hereinafter referred to as the utterance length) or the number of sounds included in the second audio signal. Note that the elements used by the registration quality determination unit 21B to determine the quality are not limited to the utterance length and the number of sounds, and may be the number of phonemes or the number of words. The registration quality determination unit 21B outputs information on the determined quality to the feature extraction unit 21C or the authentication condition setting unit 21F.
 処理部の一例としての特徴量抽出部21Cは、発話区間検出部21Aにより抽出された1以上の発話音声信号を用いて個人の音声の特徴を、例えば周波数ごとに解析して、発話特徴量を抽出する。特徴量抽出部21Cは、発話区間検出部21Aから出力された第1の発話区間の第1音声信号の発話特徴量を抽出する。また、特徴量抽出部21Cは、発話区間検出部21Aから出力された第2の発話区間の第2音声信号の発話特徴量を抽出する。なお、第2の発話区間の第2音声信号の発話特徴量は予め登録話者データベースDBに登録されていてもよい。特徴量抽出部21Cは、抽出された第1の発話区間の発話特徴量と、この発話特徴量が抽出された第1音声信号とを対応付けて類似度計算部21Eに出力したり、比較対象設定部21Dに出力したりメモリ22のRAM22Bに一時的に保存したりする。特徴量抽出部21Cは、第2の発話区間の発話特徴量と、この発話特徴量が抽出された第2音声信号とを対応付けて類似度計算部21Eに出力したり、第2の発話区間の発話特徴量と、登録品質判定部21Bから取得した品質に係る情報を紐づけてメモリ22のRAM22Bに一時的に保存したりする。 The feature extraction unit 21C, which is an example of a processing unit, analyzes the characteristics of an individual's voice, for example per frequency, using the one or more uttered audio signals extracted by the utterance section detection unit 21A, and extracts utterance features. The feature extraction unit 21C extracts the utterance features of the first audio signal of the first utterance section output from the utterance section detection unit 21A. The feature extraction unit 21C also extracts the utterance features of the second audio signal of the second utterance section output from the utterance section detection unit 21A. Note that the utterance features of the second audio signal of the second utterance section may be registered in advance in the registered speaker database DB. The feature extraction unit 21C associates the extracted utterance features of the first utterance section with the first audio signal from which they were extracted, and outputs the pair to the similarity calculation unit 21E or the comparison target setting unit 21D, or temporarily stores it in the RAM 22B of the memory 22. The feature extraction unit 21C associates the utterance features of the second utterance section with the second audio signal from which they were extracted and outputs the pair to the similarity calculation unit 21E, or links the utterance features of the second utterance section with the quality information acquired from the registration quality determination unit 21B and temporarily stores them in the RAM 22B of the memory 22.
 特徴量抽出部21Cは、発話音声信号の発話内容を音声認識する。発話内容の音声認識の方法は、公知技術により実現可能であり、例えば発話音声信号の音素解析を行い言語情報として算出してもよいし、他の解析方法により実現されてもよい。 The feature extraction unit 21C performs voice recognition on the utterance content of the utterance audio signal. The method of voice recognition of the utterance content can be realized by a known technique, for example, it may be realized by phoneme analysis of the uttered voice signal and calculated as linguistic information, or it may be realized by other analysis methods.
 設定部の一例としての比較対象設定部21Dは、登録話者データベースDBから話者であるユーザUSのデータを取得する。ここでユーザUSのデータとは、例えば、ユーザUSの生年月日、名前または性別などの個人情報またはユーザUSが過去に登録した発話に係る音声データもしくは音声データの特徴量の少なくとも1つである。比較対象設定部21Dは、話者をユーザUSと設定するのに、例えば、特徴量抽出部21Cから出力された話者の抽出特徴量を用いて話者をユーザUSと特定してもよいし、話者がユーザ側通話端末UP1に入力した内容(例えば、名前またはIDなど)から話者をユーザUSと特定してもよい。比較対象設定部21Dは、取得したユーザUSのデータを発話区間検出部21Aまたは類似度計算部21Eに出力する。 The comparison target setting unit 21D, which is an example of a setting unit, acquires the data of the user US, who is the speaker, from the registered speaker database DB. Here, the data of the user US is, for example, at least one of personal information such as the date of birth, name, or gender of the user US, voice data related to utterances registered in the past by the user US, or feature quantities of that voice data. To set the speaker as the user US, the comparison target setting unit 21D may, for example, identify the speaker as the user US using the speaker's extracted feature quantities output from the feature extraction unit 21C, or may identify the speaker as the user US from the content (for example, a name or an ID) that the speaker has input into the user-side call terminal UP1. The comparison target setting unit 21D outputs the acquired data of the user US to the utterance section detection unit 21A or the similarity calculation unit 21E.
 認証部の一例としての類似度計算部21Eは、特徴量抽出部21Cから出力された発話音声信号の発話特徴量を取得する。類似度計算部21Eは、特徴量抽出部21Cから取得した第1の発話区間の発話特徴量と第2の発話区間の発話特徴量との類似度を算出する。類似度計算部21Eは、算出された類似度に基づいて、発話音声信号(つまり、ユーザ側通話端末UP1から送信された音声信号)に対応するユーザを特定してユーザの本人確認の認証を実行する。 The similarity calculation unit 21E, which is an example of an authentication unit, acquires the utterance features of the uttered audio signal output from the feature extraction unit 21C. The similarity calculation unit 21E calculates the similarity between the utterance features of the first utterance section and the utterance features of the second utterance section acquired from the feature extraction unit 21C. Based on the calculated similarity, the similarity calculation unit 21E identifies the user corresponding to the uttered audio signal (that is, the audio signal transmitted from the user-side call terminal UP1) and performs identity-verification authentication of the user.
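 As an illustrative sketch (not part of the disclosure), the comparison performed by the similarity calculation unit 21E can be modeled as follows. The publication does not specify the similarity metric; this sketch assumes the utterance features are fixed-length embedding vectors compared by cosine similarity, scaled to 0-100 to match the determination threshold of 70 used in the later examples. All function names are hypothetical.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two feature vectors, scaled to 0-100.
    Assumes non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 100.0 * dot / norm

def authenticate(enrolled_vec, probe_vec, threshold=70.0):
    """Accept the speaker when the similarity between the enrolled
    (second utterance section) and probe (first utterance section)
    features meets the determination threshold."""
    return cosine_similarity(enrolled_vec, probe_vec) >= threshold
```

 Identical vectors score 100 and are accepted at the default threshold; orthogonal vectors score 0 and are rejected.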
 決定部の一例としての認証条件設定部21Fは、登録品質判定部21Bから取得した品質に係る情報に基づき認証条件を設定する。認証条件とは、例えば、ユーザUSが発話する長さ、発話する内容または判定に係る閾値などである。なお、認証条件は、これらに限られない。 The authentication condition setting unit 21F, which is an example of a determining unit, sets authentication conditions based on the quality-related information acquired from the registered quality determination unit 21B. The authentication conditions include, for example, the length of the speech made by the user US, the content of the speech, or a threshold value for determination. Note that the authentication conditions are not limited to these.
 測定部の一例としての認証時集音条件測定部21Gは、認証時の集音条件を測定する。集音条件とは、例えば、認証時に集音された発話音声信号のノイズ、音量、残響の度合いまたは発話音声信号に含まれる音素数などである。なお、集音条件は、これらに限られない。認証時集音条件測定部21Gは、測定した集音条件を認証条件設定部21Fに出力する。 The authentication sound collection condition measurement unit 21G, which is an example of a measurement unit, measures the sound collection conditions at the time of authentication. The sound collection conditions include, for example, the noise, volume, degree of reverberation, or number of phonemes included in the spoken audio signal collected during authentication. Note that the sound collection conditions are not limited to these. The authentication sound collection condition measuring section 21G outputs the measured sound collection conditions to the authentication condition setting section 21F.
 設定部の一例としての動作制限設定部21Hは、第2の発話区間の発話音声信号の品質に基づき、ユーザUSのできる動作に制限を設定する。例えば、認証システム100がATM(Automatic Teller Machine)に搭載されている場合、動作制限設定部21Hは、発話音声信号の品質が悪い場合送金または振替などの動作を制限する。なお、認証システム100が搭載される機械の例はATMに限られない。 The operation restriction setting unit 21H, which is an example of a setting unit, sets restrictions on the operations that the user US can perform based on the quality of the uttered audio signal in the second utterance section. For example, when the authentication system 100 is installed in an ATM (Automatic Teller Machine), the operation restriction setting unit 21H restricts operations such as remittance or transfer when the quality of the spoken audio signal is poor. Note that examples of machines in which the authentication system 100 is installed are not limited to ATMs.
 これらによって、プロセッサ21は、登録品質判定部21Bが判定した第2音声信号の品質に基づきユーザの本人確認の認証時の認証条件を設定する。プロセッサ21は、設定した認証条件に基づきユーザの発話音声信号を取得する。プロセッサ21は、発話区間検出部21Aにより検出された第1の発話区間の第1音声信号と、第2の発話区間の第2音声信号との照合に基づいて、話者が本人であるか否かを認証する。 Through these units, the processor 21 sets the authentication conditions for the identity-verification authentication of the user based on the quality of the second audio signal determined by the registration quality determination unit 21B. The processor 21 acquires the user's uttered audio signal based on the set authentication conditions. The processor 21 authenticates whether or not the speaker is the registered person based on the comparison between the first audio signal of the first utterance section detected by the utterance section detection unit 21A and the second audio signal of the second utterance section.
 メモリ22は、例えばプロセッサ21が行う各種の処理を規定したプログラムとそのプログラムの実行中に使用するデータとを格納するROM22Aと、プロセッサ21が行う各種の処理を実行する際に用いるワークメモリとしてのRAM22Bと、を少なくとも有する。ROM22Aには、プロセッサ21が行う各種の処理を規定したプログラムとそのプログラムの実行中に使用するデータとが書き込まれている。RAM22Bには、プロセッサ21により生成あるいは取得されたデータもしくは情報(例えば、発話音声信号または各発話音声信号に対応する発話特徴量等)が一時的に保存される。 The memory 22 has at least a ROM 22A that stores programs defining the various processes performed by the processor 21 and data used during execution of those programs, and a RAM 22B that serves as a work memory used when the processor 21 executes the various processes. Programs defining the various processes performed by the processor 21 and data used during execution of those programs are written in the ROM 22A. The RAM 22B temporarily stores data or information generated or acquired by the processor 21 (for example, uttered audio signals or the utterance features corresponding to each uttered audio signal).
 表示I/F23は、プロセッサ21とディスプレイDPとの間をデータ通信可能に接続し、プロセッサ21の類似度計算部21Eにより生成された認証結果画面SCをディスプレイDPに出力する。表示I/F23は、プロセッサ21の認証結果に基づき話者が本人であるか否かを示す認証状況をディスプレイDPに表示させる。 The display I/F 23 connects the processor 21 and the display DP to enable data communication, and outputs the authentication result screen SC generated by the similarity calculation unit 21E of the processor 21 to the display DP. The display I/F 23 causes the display DP to display an authentication status indicating whether or not the speaker is the authentic speaker based on the authentication result of the processor 21.
 次に、図3を参照して、登録用の発話音声信号の登録処理を説明する。図3は、登録用の発話音声信号の登録処理に係るフローチャートである。なお、図3のフローチャートに係る各処理はプロセッサ21によって実行される。 Next, with reference to FIG. 3, the registration process of the uttered audio signal for registration will be described. FIG. 3 is a flowchart related to the registration process of the uttered audio signal for registration. Note that each process related to the flowchart in FIG. 3 is executed by the processor 21.
 図3のフローチャートは、登録時、つまり予め登録話者データベースDBに保存しておく発話音声信号の登録に係る処理を表す。 The flowchart in FIG. 3 represents the process related to registration, that is, the registration of the uttered audio signal that is stored in the registered speaker database DB in advance.
 プロセッサ21は、話者からの登録用の発話音声信号(以下、登録音声信号と称する)の受信を開始する(St10)。つまり、ステップSt10の処理で、話者はユーザ側通話端末UP1に対して発話を開始する。 The processor 21 starts receiving an utterance audio signal for registration (hereinafter referred to as a registration audio signal) from the speaker (St10). That is, in the process of step St10, the speaker starts speaking to the user-side telephone terminal UP1.
 プロセッサ21は、話者からの登録音声信号の受信を終了する(St11)。つまり、ステップSt11の処理で、話者はユーザ側通話端末UP1に対して発話を終了する。 The processor 21 ends receiving the registered audio signal from the speaker (St11). That is, in the process of step St11, the speaker finishes speaking to the user-side telephone terminal UP1.
 発話区間検出部21Aは、ステップSt10からステップSt11までの処理で取得した登録音声信号の第2の発話区間を検出する(St12)。 The speech section detection unit 21A detects the second speech section of the registered audio signal acquired in the processing from step St10 to step St11 (St12).
 登録品質判定部21Bは、ステップSt12の処理で検出された第2の発話区間の第2音声信号の品質を判定する(St13)。 The registration quality determination unit 21B determines the quality of the second audio signal of the second speech section detected in the process of step St12 (St13).
 登録品質判定部21Bは、ステップSt13の処理で判定した品質に基づき登録音声信号を再取得するか否かを判定する(St14)。登録品質判定部21Bは、例えば品質が予め定められた必要最低限の値以上である場合に再取得しないと判定し、品質が予め定められた必要最低限の値未満である場合に再取得すると判定する。例えば、話者が一言も発話していない、発話長が1秒であるまたは音数が1音である場合、登録品質判定部21Bは、登録音声信号を再取得すると判定する。なお、登録品質判定部21Bが再取得すると判定する例は、一例でありこれらに限定されない。また、ステップSt14の処理は図3に係るフローチャートの処理から省略されてもよい。 The registration quality determination unit 21B determines whether or not to re-acquire the registered audio signal based on the quality determined in the process of step St13 (St14). For example, the registration quality determination unit 21B determines not to re-acquire when the quality is equal to or higher than a predetermined minimum value, and determines to re-acquire when the quality is less than the predetermined minimum value. For example, when the speaker has not uttered a single word, the utterance length is 1 second, or the number of sounds is 1, the registration quality determination unit 21B determines to re-acquire the registered audio signal. Note that these examples in which the registration quality determination unit 21B determines to re-acquire are illustrative and not limiting. The process of step St14 may also be omitted from the flowchart of FIG. 3.
 登録品質判定部21Bは、登録音声信号を再取得すると判定した場合(St14,YES)、プロセッサ21の処理はステップSt10の処理に戻る。 When the registration quality determination unit 21B determines to re-acquire the registered audio signal (St14, YES), the process of the processor 21 returns to the process of step St10.
 登録品質判定部21Bは、登録音声信号を再取得しないと判定した場合(St14,NO)、特徴量抽出部21Cは、第2の発話区間の発話音声信号の発話特徴量を抽出する(St15)。 When the registration quality determining unit 21B determines that the registered audio signal is not to be reacquired (St14, NO), the feature extracting unit 21C extracts the utterance feature of the uttered audio signal in the second utterance section (St15). .
 特徴量抽出部21Cは、ステップSt13の処理で判定された品質とステップSt15の処理で抽出された発話特徴量とを紐づけて登録話者データベースDBに保存する(St16)。 The feature extracting unit 21C associates the quality determined in the process of step St13 with the utterance feature extracted in the process of step St15 and stores it in the registered speaker database DB (St16).
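 The registration flow of FIG. 3 (St10-St16) can be sketched as follows. This is a minimal illustration only: the retry limit `max_retries`, the minimum quality level `MIN_QUALITY`, and all callback names are hypothetical additions not found in the disclosure, and the processing units are abstracted as plain functions.

```python
MIN_QUALITY = 2  # hypothetical minimum acceptable quality level

def register_speaker(receive_signal, detect_segment, judge_quality,
                     extract_features, store, max_retries=3):
    """Registration flow of FIG. 3: re-acquire the signal while its
    quality stays below the required minimum (St14, YES branch)."""
    for _ in range(max_retries):
        signal = receive_signal()                 # St10-St11: receive utterance
        segment = detect_segment(signal)          # St12: second utterance section
        quality = judge_quality(segment)          # St13: judge registration quality
        if quality >= MIN_QUALITY:                # St14, NO: no re-acquisition
            features = extract_features(segment)  # St15: utterance features
            store(features, quality)              # St16: save features + quality
            return True
    return False  # gave up after repeated low-quality signals
```

 For example, a first low-quality signal triggers one re-acquisition, and the second, acceptable signal is stored together with its quality level.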
 次に、図4を参照して、発話長に基づく認証条件の設定の一例を説明する。図4は、発話長に基づく認証条件の設定の一例を示す図である。 Next, an example of setting authentication conditions based on utterance length will be described with reference to FIG. 4. FIG. 4 is a diagram illustrating an example of setting authentication conditions based on utterance length.
 登録音声信号US10、登録音声信号US11および登録音声信号US12は、図3に係る処理で登録話者データベースDBに登録されたユーザUSの登録音声信号である。 The registered voice signal US10, the registered voice signal US11, and the registered voice signal US12 are registered voice signals of the user US registered in the registered speaker database DB in the process related to FIG. 3.
 図4に係る例では、発話長が10秒未満の場合、品質は「低」に、発話長が10秒以上の場合、品質は「高」となる。なお、品質が「低」または「高」となる閾値の秒数は一例であり限定されない。また、品質は、「低」および「高」の2段階に限られず、例えば「低」、「中」および「高」の3段階、あるいは4段階以上に設定されてもよい。 In the example shown in FIG. 4, when the utterance length is less than 10 seconds, the quality is "low", and when the utterance length is 10 seconds or more, the quality is "high". Note that the threshold number of seconds for which the quality is "low" or "high" is an example and is not limited. Furthermore, the quality is not limited to two levels of "low" and "high", but may be set to, for example, three levels of "low", "medium" and "high", or four or more levels.
 認証条件設定部21Fは、品質の結果に基づき要求時間を変更する。図4に係る例では品質が「低」の場合、要求時間は15秒であり、品質が「高」の場合、要求時間は7秒となる。要求時間とは、認証システム100が認証を行う際に話者に要求する発話の合計時間である。なお、要求時間の長さは一例でありこれらに限定されない。図4に係る例では、判定閾値は品質の結果に関わらず全て70とする。判定閾値とは、類似度計算部21Eが第1の発話区間の発話特徴量と第2の発話区間の発話特徴量との類似度の判定に用いる閾値のことである。判定閾値が高いほど、より高い類似度が必要となる。なお、判定閾値の値は一例であり70に限定されない。 The authentication condition setting unit 21F changes the required time based on the quality results. In the example shown in FIG. 4, when the quality is "low", the required time is 15 seconds, and when the quality is "high", the required time is 7 seconds. The required time is the total speaking time that the authentication system 100 requests from the speaker when performing authentication. Note that the length of the requested time is just an example and is not limited thereto. In the example shown in FIG. 4, the determination threshold is all set to 70 regardless of the quality result. The determination threshold value is a threshold value used by the similarity calculation unit 21E to determine the degree of similarity between the utterance feature amount of the first utterance section and the utterance feature amount of the second utterance section. The higher the determination threshold, the higher the degree of similarity required. Note that the value of the determination threshold is an example and is not limited to 70.
 登録音声信号US10の発話内容は、「あ か さ た な で す(a ka sa ta na de su)」であり、発話長は5秒となる。登録音声信号US10は、発話長が5秒で10秒未満であるため品質は「低」となる。この結果、登録音声信号US10に係る認証条件として要求時間は15秒、判定閾値は70となる。 The utterance content of the registered voice signal US10 is "a ka sa tana de su" and the utterance length is 5 seconds. The registered audio signal US10 has a utterance length of 5 seconds and is less than 10 seconds, so the quality is "low". As a result, the required time is 15 seconds and the determination threshold is 70 as authentication conditions for the registered audio signal US10.
 登録音声信号US11の発話内容は、「あ か さ た な で す は ま や ら わ で す(a ka sa ta na de su ha ma ya ra wa de su)」であり、発話長は8秒となる。登録音声信号US11は、発話長が8秒で10秒未満であるため品質は「低」となる。この結果、登録音声信号US11に係る認証条件として要求時間は15秒、判定閾値は70となる。 The utterance content of the registered audio signal US11 is "a ka sa ta na de su ha ma ya ra wa de su," and the utterance length is 8 seconds. Since the utterance length of the registered audio signal US11 is 8 seconds, which is less than 10 seconds, the quality is "low." As a result, the required time as the authentication condition for the registered audio signal US11 is 15 seconds and the determination threshold is 70.
 登録音声信号US12の発話内容は、「あ か さ た な で す い ち に さ ん し ご ろ く な な で す(a ka sa ta na de su i chi ni sa n shi go ro ku na na de su)」であり、発話長は13秒となる。登録音声信号US12は、発話長が13秒で10秒以上であるため品質は「高」となる。この結果、登録音声信号US12に係る認証条件として要求時間は7秒、判定閾値は70となる。 The utterance content of the registered audio signal US12 is "a ka sa ta na de su i chi ni sa n shi go ro ku na na de su," and the utterance length is 13 seconds. Since the utterance length of the registered audio signal US12 is 13 seconds, which is 10 seconds or more, the quality is "high." As a result, the required time as the authentication condition for the registered audio signal US12 is 7 seconds and the determination threshold is 70.
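 The mapping of FIG. 4 can be sketched as follows: utterances shorter than 10 seconds are graded "low" quality, which raises the required speaking time at authentication from 7 to 15 seconds while the determination threshold stays fixed at 70. The function names are hypothetical; the thresholds mirror the example values in the text.

```python
def judge_quality_by_length(utterance_len_sec, threshold_sec=10):
    """FIG. 4: an utterance shorter than 10 s is judged 'low' quality."""
    return "high" if utterance_len_sec >= threshold_sec else "low"

def authentication_conditions(quality):
    """Map registration quality to the conditions imposed at authentication:
    the required total speaking time and the fixed decision threshold of 70."""
    required_time = {"low": 15, "high": 7}[quality]
    return {"required_time_sec": required_time, "decision_threshold": 70}
```

 Applied to the examples above: US10 (5 s) and US11 (8 s) are graded "low" and require 15 seconds of speech, while US12 (13 s) is graded "high" and requires only 7 seconds.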
 次に、図5を参照して、発話長および音数に基づく認証条件の設定の一例を説明する。図5は、発話長および音数に基づく認証条件の設定の一例を示す図である。 Next, an example of setting authentication conditions based on utterance length and number of sounds will be described with reference to FIG. 5. FIG. 5 is a diagram illustrating an example of setting authentication conditions based on utterance length and number of sounds.
 図5に係る例では、発話長が10秒未満の場合、品質は「低」に、発話長が10秒以上の場合、品質は「高」となる。音数が13音未満の場合、品質は「低」に、音数が13音以上の場合、品質は「高」となる。なお、品質が「低」または「高」となる閾値の秒数および音数は一例であり限定されない。また、品質は、「低」および「高」に限られず「低」、「中」および「高」の3段階、あるいは4段階以上に設定されてもよい。 In the example shown in FIG. 5, when the utterance length is less than 10 seconds, the quality is "low", and when the utterance length is 10 seconds or more, the quality is "high". When the number of tones is less than 13, the quality is "low", and when the number of tones is 13 or more, the quality is "high". Note that the threshold number of seconds and the number of sounds for which the quality is "low" or "high" are examples and are not limited. Further, the quality is not limited to "low" and "high", but may be set to three stages of "low", "medium" and "high", or to four or more stages.
 発話長および音数の品質の結果に基づき、認証条件設定部21Fは、要求時間を変更する。発話長および音数の品質のうち品質が低い方を登録音声信号の品質とする。図5に係る例では品質が「低」の場合、要求時間は15秒であり、品質が「高」の場合、要求時間は7秒となる。なお、要求時間の長さは一例でありこれらに限定されない。図5に係る例では、判定閾値は品質の結果に関わらず全て70とする。 Based on the quality of the utterance length and number of sounds, the authentication condition setting unit 21F changes the required time. The quality of the registered audio signal is determined by the quality of the utterance length and the quality of the number of sounds, whichever is lower. In the example shown in FIG. 5, when the quality is "low", the required time is 15 seconds, and when the quality is "high", the required time is 7 seconds. Note that the length of the requested time is just an example and is not limited thereto. In the example shown in FIG. 5, the determination threshold is all set to 70 regardless of the quality result.
 登録音声信号US10の発話内容は「あ か さ た な で す(a ka sa ta na de su)」であり、音数は7音となる。登録音声信号US10の発話長は5秒であり10秒未満であるため発話長に係る品質は「低」となる。音数は7音であり13音未満であるため音数に係る品質は「低」となる。発話長および音数の品質がどちらも「低」であり、登録音声信号US10の品質は「低」となる。この結果、登録音声信号US10に係る認証条件としての要求時間は15秒、判定閾値は70となる。 The utterance content of the registered voice signal US10 is "a ka sa tana de su", and the number of tones is seven. The utterance length of the registered audio signal US10 is 5 seconds, which is less than 10 seconds, so the quality related to the utterance length is "low." Since the number of tones is 7, which is less than 13, the quality related to the number of tones is "low." The quality of the utterance length and the number of tones are both "low", and the quality of the registered audio signal US10 is "low". As a result, the required time as an authentication condition for the registered audio signal US10 is 15 seconds, and the determination threshold is 70.
 登録音声信号US11の発話内容は「あ か さ た な で す は ま や ら わ で す(a ka sa ta na de su ha ma ya ra wa de su)」であり、音数は14音となる。登録音声信号US11の発話長は8秒であり10秒未満であるため発話長に係る品質は「低」となる。音数は14音であり13音以上であるため音数に係る品質は「高」となる。音数の品質は「高」であるが発話長の品質が「低」のため、登録音声信号US11の品質は「低」となる。この結果、登録音声信号US11に係る認証条件としての要求時間は15秒、判定閾値は70となる。 The utterance content of the registered audio signal US11 is "a ka sa ta na de su ha ma ya ra wa de su," and the number of sounds is 14. The utterance length of the registered audio signal US11 is 8 seconds, which is less than 10 seconds, so the quality for the utterance length is "low." The number of sounds is 14, which is 13 or more, so the quality for the number of sounds is "high." Although the quality for the number of sounds is "high," the quality for the utterance length is "low," so the quality of the registered audio signal US11 is "low." As a result, the required time as the authentication condition for the registered audio signal US11 is 15 seconds and the determination threshold is 70.
 登録音声信号US12の発話内容は「あ か さ た な で す い ち に さ ん し ご ろ く な な で す(a ka sa ta na de su i chi ni sa n shi go ro ku na na de su)」であり、音数は20音となる。登録音声信号US12の発話長は13秒であり10秒以上であるため発話長に係る品質は「高」となる。音数は20音であり13音以上であるため音数に係る品質は「高」となる。発話長および音数の品質がどちらも「高」であり、登録音声信号US12の品質は「高」となる。この結果、登録音声信号US12に係る認証条件としての要求時間は7秒、判定閾値は70となる。 The utterance content of the registered audio signal US12 is "a ka sa ta na de su i chi ni sa n shi go ro ku na na de su," and the number of sounds is 20. The utterance length of the registered audio signal US12 is 13 seconds, which is 10 seconds or more, so the quality for the utterance length is "high." The number of sounds is 20, which is 13 or more, so the quality for the number of sounds is "high." The quality for the utterance length and the quality for the number of sounds are both "high," so the quality of the registered audio signal US12 is "high." As a result, the required time as the authentication condition for the registered audio signal US12 is 7 seconds and the determination threshold is 70.
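 The combined rule of FIG. 5, where the lower of the length-based grade and the sound-count-based grade becomes the quality of the registered signal, can be sketched as follows. The function name and parameter names are hypothetical; the thresholds (10 seconds, 13 sounds) follow the example values in the text.

```python
def combined_quality(utterance_len_sec, num_sounds,
                     len_threshold=10, sound_threshold=13):
    """FIG. 5: grade utterance length and sound count separately and
    take the lower of the two grades as the overall quality."""
    len_q = "high" if utterance_len_sec >= len_threshold else "low"
    cnt_q = "high" if num_sounds >= sound_threshold else "low"
    return "low" if "low" in (len_q, cnt_q) else "high"
```

 Applied to the three examples: US10 (5 s, 7 sounds) is "low," US11 (8 s, 14 sounds) is still "low" because the length grade is low, and US12 (13 s, 20 sounds) is "high."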
 次に、図6を参照して、登録音声信号の品質に応じて認証条件として発話内容を設定する一例を説明する。図6は、登録音声信号の品質に応じて認証条件として発話内容を設定する一例を示す図である。図6に係る品質の判定方法は、図4の判定方法と同様とする。 Next, with reference to FIG. 6, an example of setting the utterance content as an authentication condition according to the quality of the registered audio signal will be described. FIG. 6 is a diagram illustrating an example of setting utterance content as an authentication condition according to the quality of a registered audio signal. The quality determination method in FIG. 6 is the same as the determination method in FIG. 4.
 認証条件設定部21Fは、品質の結果に基づきユーザUSに発話を促す文言を指定する。図6に係る例では品質が「低」の場合、特徴量抽出部21Cは登録音声信号の音声認識を実行し発話内容を解析し認証条件設定部21Fに出力する。認証条件設定部21Fは、特徴量抽出部21Cから取得した音声認識の結果に基づきユーザUSに指定する文言を決定する。品質が「低」となり、認証条件設定部21Fが発話内容を指定する場合、要求時間は指定なしとなる。品質が「高」の場合、ユーザUSに文言の指定はしない。なお、品質が「高」であってもユーザUSに文言の指定をしてもよく、品質が「低」のときよりも短い文章を指定しユーザの利便性を向上させてもよい。図6に係る例では、品質が「高」の場合、認証条件として要求時間を7秒と設定する。なお、品質が「高」の場合の要求時間は一例であり7秒に限られない。判定閾値は、図6の例では品質に関わらず70の一定値とする。なお、判定閾値は一定値とせず品質によって変更してもよい。 The authentication condition setting unit 21F specifies the wording that prompts the user US to speak based on the quality result. In the example of FIG. 6, when the quality is "low," the feature extraction unit 21C performs voice recognition of the registered audio signal, analyzes the utterance content, and outputs it to the authentication condition setting unit 21F. The authentication condition setting unit 21F determines the wording to specify to the user US based on the voice recognition result acquired from the feature extraction unit 21C. When the quality is "low" and the authentication condition setting unit 21F specifies the utterance content, no required time is specified. When the quality is "high," no wording is specified to the user US. Note that a wording may still be specified to the user US even when the quality is "high," and a sentence shorter than when the quality is "low" may be specified to improve the user's convenience. In the example of FIG. 6, when the quality is "high," the required time is set to 7 seconds as the authentication condition. Note that the required time when the quality is "high" is an example and is not limited to 7 seconds. In the example of FIG. 6, the determination threshold is a constant value of 70 regardless of the quality. Note that the determination threshold need not be a constant value and may be changed depending on the quality.
 登録音声信号US10の発話内容は、「あかさたなです」である。登録音声信号US10の品質は「低」となるため、特徴量抽出部21Cは、登録音声信号US10の音声認識を実行し、認識結果が「あかさたなです」となる。認証条件設定部21Fは、特徴量抽出部21Cから取得した認識結果に基づき、文言を「あかさたなです」に指定する。認証条件設定部21Fは、要求時間は指定なし、判定閾値は70とする。 The utterance content of the registered voice signal US10 is "Akasatana desu". Since the quality of the registered audio signal US10 is "low", the feature extraction unit 21C performs speech recognition of the registered audio signal US10, and the recognition result is "Akasatana Desu". The authentication condition setting unit 21F specifies the wording as “Akasatana desu” based on the recognition result obtained from the feature extracting unit 21C. The authentication condition setting unit 21F sets the request time to no specification and the determination threshold to 70.
 登録音声信号US12の発話内容は、「あかさたなです いちにさんしごろくななです」である。登録音声信号US12の品質は「高」となるため、特徴量抽出部21Cは音声認識を実行しない。認証条件設定部21Fは、認証条件として要求時間は7秒と設定し、判定閾値は70と設定する。認証条件設定部21Fは、登録音声信号US12の品質が「高」のため文言は指定しない。 The utterance content of the registered audio signal US12 is "Akasatana desu, Ichinisan Shigorokunana desu." Since the quality of the registered audio signal US12 is "high", the feature extraction unit 21C does not perform speech recognition. The authentication condition setting unit 21F sets the required time to 7 seconds and the determination threshold to 70 as authentication conditions. The authentication condition setting unit 21F does not specify the wording because the quality of the registered audio signal US12 is "high".
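 The condition selection of FIG. 6 can be sketched as follows: when the registration quality is "low," the recognized text of the enrolled utterance is specified as the phrase to speak and no required time is set; when it is "high," 7 seconds of free speech is required instead. The function name and the dictionary keys are hypothetical.

```python
def conditions_with_phrase(quality, recognized_text=None):
    """FIG. 6: choose between a specified phrase (low registration
    quality) and a required free-speech time (high quality), with the
    determination threshold fixed at 70 in this example."""
    if quality == "low":
        return {"phrase": recognized_text, "required_time_sec": None,
                "decision_threshold": 70}
    return {"phrase": None, "required_time_sec": 7, "decision_threshold": 70}
```

 Applied to the examples above: for US10 the recognized text "あかさたなです" is specified as the phrase with no required time, while for US12 no phrase is specified and 7 seconds of speech are required.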
 これにより、認証システム100は、登録音声信号の品質に応じて登録音声信号の発話内容に基づく文言をユーザUSに指定し発話させることで高い認証の精度を保ちつつ短時間で認証することができる。また、認証システム100は、登録音声信号の品質が高い場合は、発話内容を指定されずユーザUSの手間を省くことができる。 As a result, the authentication system 100 can authenticate in a short time while maintaining high authentication accuracy by specifying and having the user US utter a phrase based on the utterance content of the registered voice signal according to the quality of the registered voice signal. . Furthermore, when the quality of the registered voice signal is high, the authentication system 100 does not specify the content of the utterance, thereby saving the user US's effort.
Next, with reference to FIG. 7, an example will be described in which an operator performs identity verification based on an identity verification sentence displayed on a screen. FIG. 7 is a diagram illustrating an example in which an operator performs identity verification based on the authentication sentence displayed on the screen.
When the quality shown in FIG. 6 is "low", the authentication system 100 uses speech recognition to specify the phrase to be uttered by the user US. FIG. 7 illustrates an example in which the phrase to be uttered by the user US, specified by the method of FIG. 6, is displayed on the screen that the operator views when verifying the identity of the user US (hereinafter referred to as the operator screen), and the operator OP has the user US utter the specified phrase.
First, the example shown in case CC will be described. Screen SC1 is an example of the operator screen displayed on the display DP.
The speaker's registration information is displayed in frame FR1. In frame FR1, "caller number", "registered name", "registered address", "age", and "speaker registration status" are displayed as the speaker's registration information. The "caller number" is, for example, a telephone number. "Speaker registration status" indicates whether a registered audio signal is stored in the registered speaker database DB. When a registered audio signal is stored in the registered speaker database DB, the quality associated with the registered audio signal is also displayed. For example, frame FR1 displays a "caller number" of ××-××××-××××, a "registered name" of A田A男, a "registered address" of ABCDEFG, an "age" of 33, and a "speaker registration status" of registered (quality: low).
In frame FR2, candidates considered to be the speaker are displayed as the speaker authentication result. Next to each candidate, the probability that the speaker is that candidate is displayed. In the example of frame FR2, the probability is expressed as a percentage, but this is not limiting; an indication such as "low, medium, high" may also be used. Frame FR2 displays "A田A男: 70%", "B山B郎: 25%", and "C川C夫: 5%" as the authentication result.
Frame FR3 displays the phrase to be uttered by the speaker (hereinafter referred to as the identity authentication sentence). Frame FR3 displays "hamayarawa desu hachi kyuu juu zero desu" as the identity authentication sentence.
Button BT1 is a button for starting or stopping authentication.
The operator OP, based on the identity authentication sentence displayed in frame FR3 of screen SC1, utters speech OP10: "Please say 'hamayarawa desu hachi kyuu juu zero desu'." Based on the operator OP's utterance, the user US utters speech US13: "hamayarawa desu hachi kyuu juu zero desu".
Next, the example shown in case CD will be described. Screen SC2 is an example of the operator screen displayed on the display DP.
The speaker's registration information is displayed in frame FR5. In frame FR5, "caller number", "registered name", "registered address", "age", and "speaker registration status" are displayed as the speaker's registration information. For example, frame FR5 displays a "caller number" of ××-××××-××××, a "registered name" of B山B郎, a "registered address" of GFEDCBA, an "age" of 44, and a "speaker registration status" of registered (quality: high).
In frame FR6, candidates considered to be the speaker are displayed as the speaker authentication result. Next to each candidate, the probability that the speaker is that candidate is displayed. Frame FR6 displays "A田A男: 15%", "B山B郎: 60%", and "C川C夫: 25%" as the authentication result.
Frame FR7 displays "hamayarawa desu" as the identity authentication sentence. Since the quality in case CD is "high", a phrase shorter than in case CC, whose quality is "low", is specified as the identity verification sentence.
Button BT2 is a button for starting or stopping authentication.
The operator OP, based on the identity authentication sentence displayed in frame FR7 of screen SC2, utters speech OP11: "Please say 'hamayarawa desu'." Based on the operator OP's utterance, the user US utters speech US14: "hamayarawa desu".
In FIG. 7, the operator OP reads out the identity authentication sentence displayed on the operator screen and has the user US utter it; alternatively, the identity authentication sentence may be played by an automated voice and the user US made to utter it.
In this way, the authentication system 100 allows the user US and the operator OP to perform authentication without worrying about the utterance length. When the quality of the registered audio signal is high, the authentication system 100 can perform authentication by specifying a short phrase to the user US, which shortens the time required for authentication. When the quality of the registered audio signal is low, the authentication system 100 can maintain high authentication accuracy by specifying to the user US a phrase longer than when the quality is high, which prevents authentication failures and retries.
Next, with reference to FIG. 8, an example will be described in which identity verification is performed based on an identity verification sentence displayed on the user-side call terminal. FIG. 8 is a diagram illustrating an example in which identity verification is performed based on the identity verification sentence displayed on the user-side call terminal.
First, the example shown in case CE will be described. Case CE is an example in which the quality of the registered audio signal is low. Screen SC3 is an example of a screen displayed on the user-side call terminal UP1.
Screen SC3 displays "Identity verification sentence: please utter 'hamayarawa desu hachi kyuu juu zero desu'". Frame FR9 displays "hamayarawa desu hachi kyuu juu zero desu" as the identity verification sentence; the displayed sentence differs depending on the user US.
The user US looks at the content displayed on screen SC3 and utters speech US13: "hamayarawa desu hachi kyuu juu zero desu".
Next, the example shown in case CF will be described. Case CF is an example in which the quality of the registered audio signal is high. Screen SC4 is an example of a screen displayed on the user-side call terminal UP1.
Screen SC4 displays "Identity verification sentence: please utter 'hamayarawa desu'". Frame FR9 displays "hamayarawa desu" as the identity verification sentence.
The user US looks at the content displayed on screen SC4 and utters speech US14: "hamayarawa desu".
In this way, the authentication system 100 can perform authentication without making the user US worry about the utterance length. This also allows the authentication system 100 to authenticate the user US unattended, without the intervention of a person such as the operator OP.
Next, with reference to FIGS. 9 and 10, an example will be described in which, after the authentication conditions have been set, they are reset based on the measurement result of the sound collection conditions during authentication. FIG. 9 is a diagram illustrating an example of resetting the required time of the authentication conditions based on the measurement result of the sound collection conditions during authentication after the authentication conditions have been set. FIG. 10 is a diagram illustrating an example of resetting the threshold of the authentication conditions based on the measurement result of the sound collection conditions during authentication after the authentication conditions have been set.
The authentication-time sound collection condition measurement unit 21G measures, as the sound collection conditions during authentication (hereinafter referred to as the authentication-time sound collection conditions), the noise, volume, and degree of reverberation of the utterance audio signal collected during authentication, the number of phonemes contained in the utterance audio signal, and the like. FIGS. 9 and 10 show examples in which, after the authentication conditions have been set once (hereinafter referred to as the initial authentication conditions), the authentication conditions are reset according to the measured authentication-time sound collection conditions.
When the noise of the utterance audio signal is equal to or greater than a predetermined noise value, that is, when there is a lot of noise, the authentication condition setting unit 21F lengthens the required time by 3 seconds as the authentication condition.
When the volume of the utterance audio signal is less than a predetermined volume value, that is, when the volume is low, the authentication condition setting unit 21F lengthens the required time by 3 seconds as the authentication condition.
When the number of phonemes in the utterance audio signal is less than a predetermined phoneme-count value, that is, when the number of phonemes is small, the authentication condition setting unit 21F lengthens the required time by 3 seconds as the authentication condition.
When the reverberation of the utterance audio signal is equal to or greater than a predetermined reverberation value, that is, when the reverberation is large, the authentication condition setting unit 21F lengthens the required time by 5 seconds as the authentication condition.
Note that the amounts by which the required time is lengthened for noise, volume, phoneme count, and reverberation are examples and are not limited to these.
When the quality of the registered audio signal is "low", the initial authentication conditions are a required time of 15 seconds and a determination threshold of 70. Note that the initial authentication conditions are an example and are not limited to these. When there is a lot of noise as the authentication-time sound collection condition, the authentication condition setting unit 21F lengthens the required time by 3 seconds. As a result, the authentication conditions after resetting are a required time of 18 seconds and a determination threshold of 70. When the volume is low as the authentication-time sound collection condition, the authentication condition setting unit 21F lengthens the required time by 3 seconds. As a result, the authentication conditions after resetting are a required time of 18 seconds and a determination threshold of 70.
When the quality of the registered audio signal is "high", the initial authentication conditions are a required time of 7 seconds and a determination threshold of 70. Note that the initial authentication conditions are an example and are not limited to these. When the volume is low and there is a lot of noise as the authentication-time sound collection conditions, the authentication condition setting unit 21F lengthens the required time by a total of 6 seconds. As a result, the authentication conditions after resetting are a required time of 13 seconds and a determination threshold of 70. When the number of phonemes is small as the authentication-time sound collection condition, the authentication condition setting unit 21F lengthens the required time by 3 seconds. As a result, the authentication conditions after resetting are a required time of 10 seconds and a determination threshold of 70. When the reverberation is large as the authentication-time sound collection condition, the authentication condition setting unit 21F lengthens the required time by 5 seconds. As a result, the authentication conditions after resetting are a required time of 12 seconds and a determination threshold of 70. When the authentication-time sound collection conditions are good, the authentication conditions remain the same as the initial authentication conditions.
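The required-time adjustments of the FIG. 9 example can be sketched as follows. This is a hypothetical Python illustration in which the boolean flags stand for the threshold comparisons described above; the function and parameter names are assumptions, and the increments are the example values, which the embodiment does not fix.

```python
def adjust_required_time(base_time, noisy=False, quiet=False,
                         few_phonemes=False, reverberant=False):
    """Lengthen the required utterance time per the FIG. 9 example increments."""
    t = base_time
    if noisy:           # noise at or above the predetermined noise value
        t += 3
    if quiet:           # volume below the predetermined volume value
        t += 3
    if few_phonemes:    # phoneme count below the predetermined value
        t += 3
    if reverberant:     # reverberation at or above the predetermined value
        t += 5
    return t
```

This reproduces the examples above: adjust_required_time(15, noisy=True) yields 18, and adjust_required_time(7, noisy=True, quiet=True) yields 13.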
Next, with reference to FIG. 10, an example will be described in which the threshold of the authentication conditions is reset based on the measurement result of the sound collection conditions during authentication after the authentication conditions have been set.
When the noise of the utterance audio signal is equal to or greater than a predetermined noise value, that is, when there is a lot of noise, the authentication condition setting unit 21F lowers the determination threshold by 10 as the authentication condition.
When the volume of the utterance audio signal is less than a predetermined volume value, that is, when the volume is low, the authentication condition setting unit 21F lowers the determination threshold by 15 as the authentication condition.
When the number of phonemes in the utterance audio signal is less than a predetermined phoneme-count value, that is, when the number of phonemes is small, the authentication condition setting unit 21F lowers the determination threshold by 10 as the authentication condition.
When the reverberation of the utterance audio signal is equal to or greater than a predetermined reverberation value, that is, when the reverberation is large, the authentication condition setting unit 21F lowers the determination threshold by 20 as the authentication condition.
Note that the amounts by which the determination threshold is lowered for noise, volume, phoneme count, and reverberation are examples and are not limited to these.
When the quality of the registered audio signal is "low", the initial authentication conditions are a required time of 15 seconds and a determination threshold of 70. Note that the initial authentication conditions are an example and are not limited to these. When there is a lot of noise as the authentication-time sound collection condition, the authentication condition setting unit 21F lowers the determination threshold by 10. As a result, the authentication conditions after resetting are a required time of 15 seconds and a determination threshold of 60. When the volume is low as the authentication-time sound collection condition, the authentication condition setting unit 21F lowers the determination threshold by 15. As a result, the authentication conditions after resetting are a required time of 15 seconds and a determination threshold of 55.
When the quality of the registered audio signal is "high", the initial authentication conditions are a required time of 7 seconds and a determination threshold of 70. Note that the initial authentication conditions are an example and are not limited to these. When the volume is low and there is a lot of noise as the authentication-time sound collection conditions, the authentication condition setting unit 21F lowers the determination threshold by a total of 25. As a result, the authentication conditions after resetting are a required time of 7 seconds and a determination threshold of 45. When the number of phonemes is small as the authentication-time sound collection condition, the authentication condition setting unit 21F lowers the determination threshold by 10. As a result, the authentication conditions after resetting are a required time of 7 seconds and a determination threshold of 60. When the reverberation is large as the authentication-time sound collection condition, the authentication condition setting unit 21F lowers the determination threshold by 20. As a result, the authentication conditions after resetting are a required time of 7 seconds and a determination threshold of 50. When the authentication-time sound collection conditions are good, the authentication conditions remain the same as the initial authentication conditions.
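The threshold adjustments of the FIG. 10 example can be sketched in the same way. Again a hypothetical Python illustration; the decrements are the example values from the description above, which the embodiment explicitly does not fix.

```python
def adjust_threshold(base_threshold, noisy=False, quiet=False,
                     few_phonemes=False, reverberant=False):
    """Lower the determination threshold per the FIG. 10 example decrements."""
    th = base_threshold
    if noisy:           # noise at or above the predetermined noise value
        th -= 10
    if quiet:           # volume below the predetermined volume value
        th -= 15
    if few_phonemes:    # phoneme count below the predetermined value
        th -= 10
    if reverberant:     # reverberation at or above the predetermined value
        th -= 20
    return th
```

This reproduces the examples above: adjust_threshold(70, quiet=True) yields 55, and adjust_threshold(70, noisy=True, quiet=True) yields 45.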
In this way, regardless of the quality of the registered audio signal, the authentication system 100 can authenticate with high accuracy when the sound collection conditions during authentication are poor, by lengthening the required time, lowering the determination threshold, or both.
Next, with reference to FIG. 11, processing related to speaker authentication will be described. FIG. 11 is a flowchart of the processing related to speaker authentication. Each process in FIG. 11 is executed by the processor 21.
When performing authentication to verify the identity of the speaker who is the person to be authenticated, the comparison target setting unit 21D sets the person to be used for authentication from among the plurality of persons registered in the registered speaker database DB (St20).
The comparison target setting unit 21D acquires, from the registered speaker database DB, information on the quality of the registered audio signal of the comparison target person set in the process of step St20 (St21). The comparison target setting unit 21D outputs the acquired information to the authentication condition setting unit 21F. The comparison target setting unit 21D also acquires the registered feature quantity of the registered audio signal of the comparison target person from the registered speaker database DB and outputs it to the similarity calculation unit 21E.
The authentication-time sound collection condition measurement unit 21G measures the authentication-time sound collection conditions (St22). Note that the process of step St22 may be omitted from the flowchart of FIG. 11.
The authentication condition setting unit 21F sets the authentication conditions based on the quality information acquired from the comparison target setting unit 21D in the process of step St21 (St23).
The processor 21 transmits a signal to start the authentication process to the communication unit 20 (St24). The communication unit 20 transmits an instruction to start authentication to the operator-side call terminal OP1.
The authentication condition setting unit 21F acquires the registered feature quantity of the speaker's registered audio signal from the comparison target setting unit 21D. The authentication condition setting unit 21F specifies the phrase of the utterance content to be used for authentication based on the acquired registered feature quantity (St25). Note that the process of step St25 may be omitted from the flowchart of FIG. 11.
The processor 21 starts receiving the utterance audio signal used for authentication (St26). The processor 21 outputs the acquired utterance audio signal to the feature extraction unit 21C.
The authentication-time sound collection condition measurement unit 21G measures the authentication-time sound collection conditions (St27). The authentication-time sound collection condition measurement unit 21G outputs information on the measured authentication-time sound collection conditions to the authentication condition setting unit 21F. Note that the process of step St27 may be omitted from the flowchart of FIG. 11.
The authentication condition setting unit 21F resets the authentication conditions based on the authentication-time sound collection conditions acquired in the process of step St27 (St28). The processor 21 acquires the utterance audio signal based on the authentication conditions reset by the authentication condition setting unit 21F in the process of step St28. The processor 21 outputs the acquired utterance audio signal to the feature extraction unit 21C. Note that the process of step St28 may be omitted from the flowchart of FIG. 11.
The processor 21 transmits to the communication unit 20 a signal to end the authentication process, that is, to end reception of the utterance audio signal used for authentication (St29). The communication unit 20 transmits an instruction to end authentication to the operator-side call terminal OP1.
The feature extraction unit 21C extracts the utterance feature quantity of the utterance audio signal acquired in the process of step St26 or step St28 (St30). The feature extraction unit 21C outputs the extracted utterance feature quantity to the similarity calculation unit 21E.
The similarity calculation unit 21E calculates the similarity based on the registered feature quantity acquired in step St21 and the utterance feature quantity acquired in the process of step St30 (St31).
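The embodiment does not fix the similarity metric used in step St31. As a hedged illustration only, one common choice for comparing speaker feature vectors is cosine similarity; the sketch below scales the cosine to the 0-100 range so that it is comparable to the determination thresholds in the examples above. Both the metric and the scaling are assumptions, not the disclosed computation of the similarity calculation unit 21E.

```python
import math

def similarity_score(registered, utterance):
    """Cosine similarity between two feature vectors, mapped to [0, 100].

    Assumption for illustration: the cosine in [-1, 1] is linearly mapped
    to [0, 100] so it can be compared against thresholds such as 70.
    """
    dot = sum(a * b for a, b in zip(registered, utterance))
    norm = (math.sqrt(sum(a * a for a in registered))
            * math.sqrt(sum(b * b for b in utterance)))
    return 50.0 * (1.0 + dot / norm)
```

Identical vectors score 100, and orthogonal vectors score 50 under this mapping.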
The similarity calculation unit 21E determines whether the similarity calculated in the process of step St31 is equal to or greater than a predetermined threshold (St32). When the similarity calculation unit 21E determines that the similarity is equal to or greater than the predetermined threshold (St32, YES), it outputs a signal indicating that the speaker's identity verification has succeeded to the communication unit 20, the display I/F 23, or both.
When the similarity calculation unit 21E determines that the similarity is less than the predetermined threshold (St32, NO), it determines whether to continue the authentication process (St34).
When the similarity calculation unit 21E determines to continue the authentication process (St34, YES), the processing of the processor 21 returns to step St22. When the process of step St22 is omitted, the processing of the processor 21 returns to step St23.
When the similarity calculation unit 21E determines not to continue the authentication process (St34, NO), it outputs a signal indicating that the speaker's identity verification has failed to the communication unit 20, the display I/F 23, or both.
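The decision loop of steps St31 through St34 can be sketched as follows. This simplified Python illustration replaces signal acquisition, feature extraction, and similarity calculation with a single callback, and models the St34 continuation decision as a fixed attempt budget; both simplifications are assumptions, since the embodiment leaves the continuation criterion open.

```python
def run_authentication(compute_similarity, threshold, max_attempts=3):
    """Retry until the similarity reaches the threshold (St32, YES) or the
    continuation decision (St34) ends the process.

    compute_similarity: callable that acquires an utterance and returns
        its similarity to the registered feature quantity (St31).
    """
    for _ in range(max_attempts):
        if compute_similarity() >= threshold:
            return True   # identity verification succeeded
        # St34: continue the authentication process with a new utterance
    return False          # identity verification failed
```

For example, with successive similarities 50, 65, 80 and a threshold of 70, the third attempt succeeds.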
Next, with reference to FIG. 12, an example will be described in which restrictions are placed on operations after successful authentication depending on the quality of the registered audio signal. FIG. 12 is a diagram illustrating an example in which restrictions are placed on operations after successful authentication depending on the quality of the registered audio signal.
The registered audio signal US10 and the registered audio signal US12 in FIG. 12 are the same as the registered audio signal US10 and the registered audio signal US12 in FIG. 4. Therefore, for FIG. 12, the description up to the setting of the authentication conditions is omitted.
When the authentication system 100 is installed in, for example, a bank ATM, the operation restriction setting unit 21H may, based on the quality of the registered audio signal, place restrictions on the operations (for example, deposits) available after the speaker's identity verification has succeeded. The places where the authentication system 100 may be installed are not limited to bank ATMs, but here, for convenience of description, the authentication system 100 is assumed to be installed in a bank ATM.
Case CG is an example of authentication when the quality of the registered audio signal is low. In case CG, since the quality of the registered audio signal is low, the operation restriction setting unit 21H restricts the operation mode and operates the system in a restricted mode. In the restricted mode, for example, only account balance inquiries and deposits are possible. Note that the operations allowed in the restricted mode are examples and are not limited to these.
Case CH is an example of authentication when the quality of the registered audio signal is high. In case CH, since the quality of the registered audio signal is high, the operation restriction setting unit 21H operates the system in a normal mode that places no restriction on the operation mode. In the normal mode, all operations, for example account balance inquiries, deposits, remittances, and transfers, are possible. Note that the operations allowed in the normal mode are examples and are not limited to these.
In this way, when the quality of the registered audio signal is low, the authentication system 100 can reduce the risk posed by an erroneous determination by placing operation restrictions on the machine in which the authentication system 100 is installed.
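The mode selection of the FIG. 12 example can be sketched as follows. A hypothetical Python illustration: the function and the operation names are assumptions chosen to mirror the bank-ATM example; the embodiment only states which operations each mode permits in that example.

```python
def allowed_operations(registered_quality):
    """Return the ATM operations permitted after successful authentication."""
    if registered_quality == "low":
        # Restricted mode: only balance inquiry and deposit are possible.
        return ["balance inquiry", "deposit"]
    # Normal mode: no restriction is placed on the operation mode.
    return ["balance inquiry", "deposit", "remittance", "transfer"]
```

This corresponds to case CG (restricted mode) for low quality and case CH (normal mode) for high quality.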
 次に、図13を参照して、登録音声信号の品質に基づき動作制限を設ける処理を説明する。図13は、登録音声信号の品質に基づき動作制限を設ける処理のフローチャートである。図13に係るフローチャートの各処理はプロセッサ21によって実行される。なお、図13のフローチャートの処理で図11のフローチャートの処理と同様の処理は同一符合を付記し説明を省略する。 Next, with reference to FIG. 13, a process for setting operation restrictions based on the quality of registered audio signals will be described. FIG. 13 is a flowchart of a process for setting operation restrictions based on the quality of registered audio signals. Each process in the flowchart in FIG. 13 is executed by the processor 21. It should be noted that processes in the flowchart of FIG. 13 that are similar to those in the flowchart of FIG. 11 are given the same reference numerals, and description thereof will be omitted.
 When the authentication for verifying the speaker's identity succeeds in step St34, the operation restriction setting unit 21H determines whether the quality of the registered audio signal is high (St36).
 When the operation restriction setting unit 21H determines that the quality of the registered audio signal is high (St36, YES), it sets the operation mode to the normal mode (St37).
 When the operation restriction setting unit 21H determines that the quality of the registered audio signal is low (St36, NO), it sets the operation mode to the restricted mode (St38).
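 Steps St36 to St38 above can be sketched as follows. This is an illustrative Python sketch only: the function and mode names, the boolean quality criterion, and the concrete operations assigned to each mode are assumptions for illustration, not part of the publication.

```python
from enum import Enum

class OperationMode(Enum):
    NORMAL = "normal"          # all operations, e.g. balance inquiry, deposit, remittance, transfer
    RESTRICTED = "restricted"  # limited operations, e.g. balance inquiry and deposit only

def select_operation_mode(registered_quality_is_high: bool) -> OperationMode:
    """After identity verification succeeds (St34), choose the operation mode
    from the quality of the registered audio signal (St36): normal mode when
    the quality is high (St37), restricted mode when it is low (St38)."""
    if registered_quality_is_high:
        return OperationMode.NORMAL   # St37: no restriction on operations
    return OperationMode.RESTRICTED   # St38: restrict operations to lower the misjudgment risk
```

The point of the branch is that a low-quality registration makes a false accept more likely, so the set of operations reachable after a match is narrowed rather than refusing service outright.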
 As described above, the authentication system according to the present embodiment (for example, the authentication system 100) includes an acquisition unit (for example, the user-side call terminal UP1) that acquires an audio signal of a speaker's uttered voice. The authentication system includes a detection unit (for example, the utterance section detection unit 21A) that detects, from the acquired audio signal, a first utterance section in which the speaker is speaking and, from an audio signal in a database in which the audio signals of a plurality of speakers are registered, a second utterance section in which the speaker is speaking. The authentication system includes a determination unit (for example, the authentication condition setting unit 21F) that compares the first audio signal of the first utterance section with the second audio signal of the second utterance section and determines authentication conditions for authentication using the first audio signal, based on the length of the second audio signal of the second utterance section or the number of sounds included in the second utterance section. The authentication system also includes an authentication unit (for example, the similarity calculation unit 21E) that authenticates the speaker based on the determined authentication conditions.
 Accordingly, the authentication system according to the present embodiment can determine the authentication conditions required of the user at the time of authentication based on the utterance length or number of sounds of the registered audio signal, so the authentication conditions can be varied for each user. The authentication system can thus determine the utterance time required at authentication according to the total length of the user's uttered voice acquired at registration, improving user convenience.
 The authentication unit of the authentication system according to the present embodiment starts the authentication when the length of the second utterance section is equal to or greater than a first predetermined value and the length of the first utterance section is equal to or greater than a second predetermined value. The authentication unit starts the authentication when the length of the second utterance section is less than the first predetermined value and the length of the first utterance section is equal to or greater than a third predetermined value that is larger than the second predetermined value. Accordingly, based on the quality given by the utterance length of the registered audio signal, the authentication system requests the user to speak for a number of seconds estimated to be sufficient for the decision at authentication, which prevents the user from speaking longer than necessary and prevents authentication from failing because the utterance is too short. The authentication system 100 can thus determine the utterance time required at authentication according to the total length of the user's uttered voice acquired at registration, improving user convenience.
 The authentication unit of the authentication system according to the present embodiment starts the authentication when the number of sounds included in the second utterance section is equal to or greater than a fourth predetermined value and the length of the first utterance section is equal to or greater than the second predetermined value. The authentication unit starts the authentication when the number of sounds included in the second utterance section is less than the fourth predetermined value and the length of the first utterance section is equal to or greater than the third predetermined value, which is larger than the second predetermined value. Accordingly, based on the quality given by the number of sounds in the registered audio signal, the authentication system requests the user to speak for a number of seconds estimated to be sufficient for a stable decision at authentication, which prevents the user from speaking longer than necessary and prevents authentication from failing because the utterance is too short. The authentication system can thus improve user convenience at authentication.
 The authentication unit of the authentication system according to the present embodiment starts the authentication when the length of the second utterance section is equal to or greater than the first predetermined value, the number of sounds included in the second utterance section is equal to or greater than the fourth predetermined value, and the length of the first utterance section is equal to or greater than the second predetermined value. The authentication unit starts the authentication when the length of the second utterance section is less than the first predetermined value or the number of sounds included in the second utterance section is less than the fourth predetermined value, and the length of the first utterance section is equal to or greater than the third predetermined value, which is larger than the second predetermined value. Accordingly, the authentication system can determine the utterance length required of the user at authentication according to the utterance length and the number of sounds of the registered audio signal. Since the authentication system can determine the authentication conditions from both the utterance length and the number of sounds, it can improve user convenience and perform more accurate authentication.
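 The combined start condition above (registered utterance length and sound count together determining how long an input utterance must be) can be summarized in the following Python sketch. The concrete default values assigned to the first through fourth predetermined values are arbitrary assumptions for illustration; the publication specifies only their ordering (the third value is larger than the second).

```python
def may_start_authentication(registered_len_s: float,
                             registered_sound_count: int,
                             input_len_s: float,
                             first_value_s: float = 10.0,
                             second_value_s: float = 3.0,
                             third_value_s: float = 6.0,
                             fourth_value: int = 50) -> bool:
    """Start authentication only when the input (first) utterance section is long
    enough, where 'long enough' depends on the registered (second) utterance
    section: a short input suffices when the registration is both long
    (>= first predetermined value) and rich in sounds (>= fourth predetermined
    value); otherwise a longer input (>= third predetermined value) is required."""
    if registered_len_s >= first_value_s and registered_sound_count >= fourth_value:
        return input_len_s >= second_value_s   # high-quality registration: short input allowed
    return input_len_s >= third_value_s        # low-quality registration: longer input required
```

For example, a user enrolled with 12 seconds of speech could be authenticated from a 3.5-second utterance, while a user enrolled with only 5 seconds would be asked to keep speaking until 6 seconds have been collected.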
 When the length of the second utterance section is less than the first predetermined value, the determination unit of the authentication system according to the present embodiment determines, from the utterance content included in the audio signal of the second utterance section, a text that prompts the speaker to speak. Accordingly, by specifying a phrase based on the utterance content of the registered audio signal according to its utterance length and having the user utter it, the authentication system can authenticate in a short time while maintaining high authentication accuracy. Furthermore, when the utterance length of the registered audio signal is sufficiently long, no utterance content is specified, saving the user US the effort.
 The authentication system according to the present embodiment further includes a first display unit (for example, the display DP) that displays a screen that the operator refers to during authentication, and the determination unit causes the first display unit to display the text. This allows the user and the operator to perform authentication without worrying about the utterance length. When the quality of the registered audio signal is high, the authentication system can authenticate with a short phrase specified to the user, shortening the time required for authentication. When the quality of the registered audio signal is low, the authentication system can maintain high authentication accuracy by specifying a longer phrase to the user than when the quality is high, preventing authentication failures and retries.
 The authentication system according to the present embodiment further includes a second display unit (for example, the user-side call terminal UP1) that displays a screen that the speaker refers to during authentication, and the determination unit causes the second display unit to display the text. The authentication system can thus authenticate without making the user worry about the utterance length. This also allows the authentication system to authenticate the user unattended, without the intervention of a person such as an operator.
 The authentication system according to the present embodiment further includes a measurement unit that measures at least one of the noise, volume, number of phonemes, or amount of reverberation of the audio signal in the first utterance section, and the determination unit sets the authentication conditions based on the measurement results acquired from the measurement unit. Accordingly, regardless of the quality of the registered audio signal, when the sound-collection conditions at authentication are poor, the authentication system can authenticate with high accuracy by lengthening the required utterance time, lowering the decision threshold, or both.
 The authentication condition of the authentication system according to the present embodiment is the length of the audio signal or a decision threshold for the authentication verifying the speaker's identity. Accordingly, the authentication system can specify the utterance length required of each user according to the quality of that user's registered audio signal, improving user convenience. The authentication system can also change the decision threshold according to the quality of the user's registered audio signal, performing flexible authentication suited to each user.
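 One way the measurement results could feed into the two authentication conditions named above (required utterance length and decision threshold) is sketched below. Summarizing the sound-collection conditions as a single signal-to-noise ratio, and the specific scaling factors, are assumptions for illustration only.

```python
def adjust_authentication_condition(required_len_s: float,
                                    decision_threshold: float,
                                    snr_db: float,
                                    min_snr_db: float = 15.0) -> tuple[float, float]:
    """When the measured sound-collection conditions are poor (here summarized
    as a low signal-to-noise ratio), require a longer utterance and relax the
    similarity decision threshold so authentication remains usable and accurate."""
    if snr_db < min_snr_db:
        return required_len_s * 1.5, decision_threshold * 0.9
    return required_len_s, decision_threshold
```

With these assumed factors, a 4-second requirement and 0.8 threshold would become 6 seconds and 0.72 under a 10 dB SNR, and stay unchanged under a 20 dB SNR.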
 The authentication system according to the present embodiment further includes a restriction setting unit (for example, the operation restriction setting unit 21H) that, when the length of the second utterance section is less than the first predetermined value, places restrictions on the operations that the speaker can perform after authentication. Thus, when the quality of the registered audio signal is low, the authentication system can reduce the risk of a misjudgment by placing operational restrictions on the machine in which the authentication system is installed.
 Although embodiments have been described above with reference to the accompanying drawings, the present disclosure is not limited to these examples. It is clear that those skilled in the art can conceive of various changes, modifications, substitutions, additions, deletions, and equivalents within the scope of the claims, and it is understood that these also fall within the technical scope of the present disclosure. The constituent elements of the embodiments described above may also be combined as desired without departing from the spirit of the invention.
 This application is based on a Japanese patent application filed on May 27, 2022 (Japanese Patent Application No. 2022-086892), the contents of which are incorporated herein by reference.
 The technology of the present disclosure is useful as an authentication system and authentication method that determine the utterance time at authentication according to the total length of the user's uttered voice acquired at registration, improving user convenience.
NW  Network
UP1  User-side call terminal
OP1  Operator-side call terminal
US  User
OP  Operator
COM11, COM12, COM13, COM14  Uttered voice
P1  Authentication analysis device
DB  Registered speaker database
DP  Display
SC  Authentication result screen
20  Communication unit
21  Processor
21A  Utterance section detection unit
21B  Registration quality determination unit
21C  Feature extraction unit
21D  Comparison target setting unit
21E  Similarity calculation unit
21F  Authentication condition setting unit
21G  Authentication-time sound-collection condition measurement unit
21H  Operation restriction setting unit
22  Memory
22A  ROM
22B  RAM
23  Display I/F
US10, US11, US12  Registered audio signal
CA, CB, CC, CD, CE, CF, CG, CH  Case
SC1, SC2, SC3, SC4  Screen
FR1, FR2, FR3, FR4, FR5, FR6, FR7  Frame
BT1, BT2  Button
OP10, OP11, US13, US14  Uttered voice

Claims (11)

  1.  An authentication system comprising:
     an acquisition unit that acquires an audio signal of a speaker's uttered voice;
     a detection unit that detects, from the acquired audio signal, a first utterance section in which the speaker is speaking and, from an audio signal in a database in which an audio signal of each of a plurality of speakers is registered, a second utterance section in which the speaker is speaking;
     a determination unit that compares a first audio signal of the first utterance section with a second audio signal of the second utterance section and determines an authentication condition for authentication using the first audio signal, based on a length of the second audio signal of the second utterance section or a number of sounds included in the second utterance section; and
     an authentication unit that authenticates the speaker based on the determined authentication condition.
  2.  The authentication system according to claim 1, wherein the authentication unit
     starts the authentication when a length of the second utterance section is equal to or greater than a first predetermined value and a length of the first utterance section is equal to or greater than a second predetermined value, and
     starts the authentication when the length of the second utterance section is less than the first predetermined value and the length of the first utterance section is equal to or greater than a third predetermined value that is larger than the second predetermined value.
  3.  The authentication system according to claim 1, wherein the authentication unit
     starts the authentication when a number of sounds included in the second utterance section is equal to or greater than a fourth predetermined value and a length of the first utterance section is equal to or greater than a second predetermined value, and
     starts the authentication when the number of sounds included in the second utterance section is less than the fourth predetermined value and the length of the first utterance section is equal to or greater than a third predetermined value that is larger than the second predetermined value.
  4.  The authentication system according to claim 1, wherein the authentication unit
     starts the authentication when a length of the second utterance section is equal to or greater than a first predetermined value, a number of sounds included in the second utterance section is equal to or greater than a fourth predetermined value, and a length of the first utterance section is equal to or greater than a second predetermined value, and
     starts the authentication when the length of the second utterance section is less than the first predetermined value or the number of sounds included in the second utterance section is less than the fourth predetermined value, and the length of the first utterance section is equal to or greater than a third predetermined value that is larger than the second predetermined value.
  5.  The authentication system according to claim 1, wherein, when a length of the second utterance section is less than a first predetermined value, the determination unit determines, from utterance content included in the audio signal of the second utterance section, a text that prompts the speaker to speak.
  6.  The authentication system according to claim 5, further comprising a first display unit that displays a screen that an operator refers to during the authentication,
     wherein the determination unit causes the first display unit to display the text.
  7.  The authentication system according to claim 5, further comprising a second display unit that displays a screen that the speaker refers to during the authentication,
     wherein the determination unit causes the second display unit to display the text.
  8.  The authentication system according to claim 5, further comprising a measurement unit that measures at least one of noise, volume, a number of phonemes, or an amount of reverberation of the audio signal in the first utterance section,
     wherein the determination unit sets the authentication condition based on a measurement result acquired from the measurement unit.
  9.  The authentication system according to claim 8, wherein the authentication condition is a length of the audio signal or a decision threshold for the authentication verifying the speaker's identity.
  10.  The authentication system according to claim 1, further comprising a restriction setting unit that, when a length of the second utterance section is less than a first predetermined value, places a restriction on operations that the speaker can perform after the authentication.
  11.  An authentication method performed by one or more computers, the method comprising:
     acquiring an audio signal of a speaker's uttered voice;
     detecting, from the acquired audio signal, a first utterance section in which the speaker is speaking and, from an audio signal in a database in which an audio signal of each of a plurality of speakers is registered, a second utterance section in which the speaker is speaking;
     comparing a first audio signal of the first utterance section with a second audio signal of the second utterance section;
     determining an authentication condition for authentication using the first audio signal, based on a length of the second audio signal of the second utterance section or a number of sounds included in the second utterance section; and
     authenticating the speaker based on the determined authentication condition.
PCT/JP2023/012047 2022-05-27 2023-03-24 Authentication system and authentication method WO2023228542A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022086892A JP2023174185A (en) 2022-05-27 2022-05-27 Authentication system and authentication method
JP2022-086892 2022-05-27

Publications (1)

Publication Number Publication Date
WO2023228542A1 true WO2023228542A1 (en) 2023-11-30

Family

ID=88919028


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016053598A (en) * 2014-09-02 2016-04-14 株式会社Kddiテクノロジー Communication device, method and program for registering voice print
WO2016092807A1 (en) * 2014-12-11 2016-06-16 日本電気株式会社 Speaker identification device and method for registering features of registered speech for identifying speaker
JP2017187642A (en) * 2016-04-06 2017-10-12 日本電信電話株式会社 Registered utterance division device, speaker likelihood evaluation device, speaker identification device, registered utterance division method, speaker likelihood evaluation method, and program
US20190341055A1 (en) * 2018-05-07 2019-11-07 Microsoft Technology Licensing, Llc Voice identification enrollment
US20200152206A1 (en) * 2017-12-26 2020-05-14 Robert Bosch Gmbh Speaker Identification with Ultra-Short Speech Segments for Far and Near Field Voice Assistance Applications
US20200175993A1 (en) * 2018-11-30 2020-06-04 Samsung Electronics Co., Ltd. User authentication method and apparatus


Also Published As

Publication number Publication date
JP2023174185A (en) 2023-12-07


Legal Events

Date Code Title Description
121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 23811421; Country of ref document: EP; Kind code of ref document: A1)