WO2023182014A1 - Voice authentication device and voice authentication method - Google Patents

Voice authentication device and voice authentication method

Info

Publication number
WO2023182014A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
registered
sound collection
similarity
similarity calculation
Prior art date
Application number
PCT/JP2023/009467
Other languages
French (fr)
Japanese (ja)
Inventor
正成 宮本
Original Assignee
Panasonic Intellectual Property Management Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co., Ltd.
Publication of WO2023182014A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates

Definitions

  • The present disclosure relates to a voice authentication device and a voice authentication method.
  • Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice.
  • The speech recognition device stores a plurality of motion noise models, each created for one of a plurality of motions, in association with those motions; detects input speech that includes the subject's voice; identifies the subject's motion; and reads out the motion noise model corresponding to the identified motion.
  • The speech recognition device then reads out an environmental noise model corresponding to the subject's current position, synthesizes the environmental noise model with the read motion noise model, and uses the synthesized noise-superimposition model to recognize the subject's voice contained in the detected input speech.
  • In Patent Document 1, however, it is necessary to collect in advance both the motion noise generated by each of the plurality of motions and the environmental noise at each of the plurality of positions where voice recognition can be performed, which is very time-consuming.
  • In voiceprint authentication, the feature amount indicating the individuality of the authentication target (person) extracted from an audio signal changes depending on the noise contained in the audio signal and on the sound collection conditions, such as the sound collection device with which the audio signal was collected. Therefore, when voiceprint authentication is performed using the voice recognition device described above, if the sound collection conditions of the pre-registered audio signal differ from those of the audio signal collected at the time of voiceprint authentication, the feature amounts extracted from the respective audio signals do not indicate the individuality of the same person, and the accuracy of voiceprint authentication may decrease.
  • The present disclosure has been devised in view of the conventional situation described above, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
  • The present disclosure provides a voice authentication device comprising: an acquisition unit that acquires audio data; a detection unit that detects, from the audio data, an utterance section in which a speaker is speaking; an extraction unit that extracts an utterance feature amount of the speaker from the detected utterance section; a selection unit that selects, from among a plurality of similarity calculation models, a first similarity calculation model used for authenticating the speaker, based on the extracted utterance feature amount of the speaker and the utterance feature amount of at least one registered speaker registered in advance; and an authentication unit that authenticates the speaker by comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model.
  • The present disclosure also provides a voice authentication method performed by a terminal device, comprising: acquiring audio data; detecting, from the audio data, an utterance section in which a speaker is speaking; extracting an utterance feature amount of the speaker from the detected utterance section; selecting, from among a plurality of similarity calculation models, a first similarity calculation model used for authenticating the speaker, based on the extracted utterance feature amount of the speaker and the utterance feature amount of at least one registered speaker registered in advance; and authenticating the speaker by comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model.
  • Block diagram showing an example of the internal configuration of a voice authentication system
  • Diagram illustrating each process performed by a processor of a terminal device in the embodiment
  • Flowchart showing an example of the operation procedure of the terminal device in the embodiment
  • Flowchart illustrating an example of a procedure for determining sound collection conditions of the terminal device in the embodiment
  • Diagram illustrating an example of determining sound collection conditions and calculating reliability
  • Flowchart showing an example of a speaker authentication procedure of the terminal device in the embodiment
  • Diagram explaining an example of a correspondence list when the noise type at the time of voice registration and the noise type at the time of voice authentication are different
  • Diagram explaining a specific example of the correspondence list
  • FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system 100 according to an embodiment.
  • FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
  • The voice authentication system 100 includes a terminal device P1, which is an example of a voice authentication device. The voice authentication system 100 may further include the microphone MK and the monitor MN.
  • The microphone MK picks up the voice uttered by the speaker US in order to register the voice in the terminal device P1 in advance.
  • The microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data to be registered in the terminal device P1.
  • The microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • The microphone MK also picks up the voice uttered by the speaker US that is used for speaker authentication.
  • The microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data.
  • The microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • In the following, the voice data for voice registration or the voice data already registered in the terminal device P1 is referred to as "registered voice data", and the voice data for voice authentication is referred to as "authentication voice data", to distinguish between them.
  • The microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, or a tablet terminal. Further, the microphone MK may transmit an audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown).
  • The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes a voice registration process using the registered voice data of the speaker US and a speaker authentication process using the authentication voice data.
  • The terminal device P1 includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a sound collection condition-specific learning database DB4.
  • The communication unit 10, which is an example of an acquisition unit, is connected to the microphone MK and the monitor MN so that data can be transmitted and received by wire or wirelessly.
  • The wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless local area network (LAN) such as Wi-Fi (registered trademark).
  • The communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as Universal Serial Bus (USB). Furthermore, the communication unit 10 may transmit and receive data to and from the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
  • The processor 11 is configured using, for example, a central processing unit (CPU) or a field programmable gate array (FPGA), and performs various processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12, and executes the program to realize the functions of each unit, such as the feature amount extraction unit 111, the sound collection condition determination unit 112, the speaker registration unit 113, the similarity calculation model selection unit 114, the reliability calculation unit 115, and the authentication unit 116.
  • When registering the voice of the speaker US, the processor 11 realizes the functions of the feature amount extraction unit 111, the speaker registration unit 113, and the sound collection condition determination unit 112, thereby executing the process of newly registering (storing) the voice of the speaker US in the registered speaker database DB2.
  • During voice authentication of the speaker US, the processor 11 realizes the functions of the feature amount extraction unit 111, the sound collection condition determination unit 112, the similarity calculation model selection unit 114, the reliability calculation unit 115, and the authentication unit 116, thereby executing the speaker authentication process.
  • The feature amount extraction unit 111, which is an example of a detection unit and an extraction unit, acquires the voice data (registered voice data or authentication voice data) of the speaker US transmitted from the microphone MK.
  • The feature amount extraction unit 111 executes the voice registration process or the voice authentication process based on a control command associated with the voice data.
  • At the time of voice registration, the feature amount extraction unit 111 detects the utterance section in which the speaker US is speaking from the registered voice data.
  • The feature amount extraction unit 111 extracts a feature amount indicating the individuality of the speaker US from the detected utterance section, and outputs the extracted feature amount to the sound collection condition determination unit 112 and the speaker registration unit 113.
  • At the time of voice authentication, the feature amount extraction unit 111 extracts the feature amount of the speaker US from the authentication voice data and outputs it to each of the sound collection condition determination unit 112 and the authentication unit 116.
  • Based on the feature amount of the speaker US output from the feature amount extraction unit 111, the sound collection condition determination unit 112, which is an example of a determination unit, determines the sound collection conditions under which the voice data from which the feature amount was extracted (that is, the uttered voice of the speaker US) was collected. The sound collection condition determination unit 112 outputs the information on the sound collection conditions of the feature amount of the speaker US to the speaker registration unit 113 at the time of voice registration, and to the similarity calculation model selection unit 114 at the time of voice authentication.
  • The sound collection conditions here refer to, for example, the sound collection device that picked up the utterance of the speaker US or a registered speaker, the speaker's language, gender, and age, and the type of noise included in the features of the speaker US or the registered speaker.
  • The sound collection device is, for example, a microphone, a telephone, a headset, or the like.
  • Noise is sound that is picked up due to the environment (background) at the time of sound collection, and includes, for example, surrounding voices, music, the sound of a vehicle running, the sound of the wind, and the like.
  • The noise type indicates, for example, the environment (place) or position where the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or inside a station.
  • The noise type may further include information on time zones such as early morning, daytime, and nighttime.
  • The speaker registration unit 113 acquires the feature amount of the speaker US output from the feature amount extraction unit 111 and the speaker information of the speaker US associated with the registered voice data.
  • The speaker registration unit 113 also acquires the information on the sound collection conditions output from the sound collection condition determination unit 112.
  • The speaker registration unit 113 associates the feature amount of the speaker US, the speaker information of the speaker US, and the information on the sound collection conditions with one another and registers them in the registered speaker database DB2, as sketched below.
  • The speaker information may be extracted from the registered voice data by voice recognition, or may be obtained from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, or a tablet terminal).
  • The speaker information here includes, for example, identification information that can identify the speaker US, such as the name of the speaker US or a speaker identification (ID).
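  • To make the structure of a registered-speaker entry concrete, the following is a minimal Python sketch; the class, field, and function names (RegisteredSpeaker, register_speaker, and so on) are illustrative assumptions, not names from the disclosure.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class RegisteredSpeaker:
    """One entry of the registered speaker database DB2 (illustrative layout)."""
    speaker_id: str       # speaker identification (ID)
    name: str             # name of the speaker
    feature: np.ndarray   # feature amount extracted from the utterance section
    conditions: dict      # sound collection conditions, e.g.
                          # {"gender": "male", "equipment": "telephone",
                          #  "noise_type": "in-store noise"}

# The registered speaker database DB2, modeled here as an in-memory list.
registered_speaker_db2: list[RegisteredSpeaker] = []

def register_speaker(speaker_id: str, name: str,
                     feature: np.ndarray, conditions: dict) -> None:
    """Associates the feature amount, the speaker information, and the sound
    collection conditions, and registers them (cf. speaker registration unit 113)."""
    registered_speaker_db2.append(
        RegisteredSpeaker(speaker_id, name, feature, conditions))
```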
  • The similarity calculation model selection unit 114 acquires the information on the sound collection conditions of the feature amount of the speaker US output from the sound collection condition determination unit 112, and the information on the sound collection conditions of the feature amounts of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • Based on the acquired information on the sound collection conditions of the feature amount of the speaker US and the information on the sound collection conditions of the feature amounts of each of the plurality of registered speakers, the similarity calculation model selection unit 114 selects the similarity calculation model (a first similarity calculation model) to be used in the similarity calculation process between the feature amount of the speaker US and the feature amount of any one registered speaker.
  • Specifically, the similarity calculation model selection unit 114 refers to the correspondence lists LST, LST1, and LST2 (see FIGS. 7, 8, and 9), each of which associates the information on the sound collection conditions of the feature amount of the speaker US and the information on the sound collection conditions of the feature amounts of each of the plurality of registered speakers with a selectable similarity calculation model, selects one model (the first similarity calculation model) from among the plurality of selectable models, and outputs it to each of the reliability calculation unit 115 and the authentication unit 116, as shown in the sketch below.
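  • Conceptually, each correspondence list is a lookup table keyed by the pair of sound collection condition determination results. A minimal sketch, assuming the list is represented as a Python dictionary whose entries follow FIGS. 7 and 8, with "Model Z" as the general-purpose fallback:

```python
# (conditions of the speaker US at authentication, conditions of the registered
#  speaker at registration) -> selected similarity calculation model.
CORRESPONDENCE_LIST = {
    ("in-store noise, male, telephone voice",
     "in-store noise, male, telephone voice"): "Model A",  # cf. FIG. 7
    ("outdoor noise, male, headset voice",
     "in-store noise, male, telephone voice"): "Model D",  # cf. FIG. 8
}

GENERAL_PURPOSE_MODEL = "Model Z"  # used when no suitable model exists

def select_similarity_model(cond_auth: str, cond_reg: str) -> str:
    """Selects the first similarity calculation model from the correspondence
    list (cf. similarity calculation model selection unit 114)."""
    return CORRESPONDENCE_LIST.get((cond_auth, cond_reg), GENERAL_PURPOSE_MODEL)
```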
  • The reliability calculation unit 115, which is an example of a reliability calculation unit, calculates (evaluates) the reliability (score) indicating the certainty of the identification result of the speaker US based on the similarity calculated by the authentication unit 116.
  • Specifically, the reliability calculation unit 115 calculates the reliability based on the distance between the learning data distribution of the similarity calculation model used in the similarity calculation process by the authentication unit 116 and the feature amount of the speaker US.
  • The reliability calculation unit 115 outputs the calculated reliability information to the authentication unit 116.
  • The authentication unit 116, which is an example of a calculation unit, acquires the feature amount of the speaker US output from the feature amount extraction unit 111, and acquires the feature amounts of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • The authentication unit 116 also obtains the selected models of the correspondence lists LST, LST1, and LST2 output from the similarity calculation model selection unit 114.
  • The authentication unit 116 uses the similarity calculation model selected based on the correspondence lists LST, LST1, and LST2 to calculate the similarity between the feature amounts of each of the plurality of registered speakers and the feature amount of the speaker US.
  • The authentication unit 116 identifies the speaker US based on the calculated similarity.
  • The authentication unit 116 also obtains the reliability information output from the reliability calculation unit 115.
  • The authentication unit 116 generates an authentication result screen SC based on the speaker information of the identified speaker US and the reliability information, and transmits it to the monitor MN.
  • The memory 12 includes, for example, a random access memory (hereinafter referred to as "RAM") as a work memory used when executing each process of the processor 11, and a read-only memory (hereinafter referred to as "ROM") that stores programs and data defining the operation of the processor 11.
  • Data or information generated or acquired by the processor 11 is temporarily stored in the RAM.
  • A program that defines the operation of the processor 11 is written in the ROM.
  • The feature extraction model database DB1 is a so-called storage, and is configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter referred to as "HDD"), or a Solid State Drive (hereinafter referred to as "SSD").
  • The feature extraction model database DB1 stores a feature extraction model capable of detecting the utterance section of the speaker US from registered voice data or authentication voice data and extracting the feature amount of the speaker US.
  • The feature extraction model is, for example, a learning model generated by training using deep learning or the like.
  • The registered speaker database DB2 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD.
  • The registered speaker database DB2 stores the feature amounts of each of the plurality of registered speakers registered in advance, the information on the determination results of the sound collection conditions corresponding to those feature amounts, and the registered speaker information in association with one another.
  • The similarity calculation model database DB3 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD.
  • The similarity calculation model database DB3 stores similarity calculation models that can calculate the similarity between a feature amount extracted from authentication voice data and the feature amount of a registered speaker registered in the registered speaker database DB2.
  • A similarity calculation model is a learning model generated by training with learning data under predetermined sound collection conditions using deep learning or the like.
  • A similarity calculation model learns and retains in advance the dimensions in which individuality is likely to be expressed, in order to calculate the similarity between two multidimensional vectors with high precision.
  • The method of calculating similarity using a model is just one example of a method for calculating the similarity between vectors; known techniques such as Euclidean distance and cosine similarity may also be used, as illustrated below.
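  • As a concrete illustration of the alternative vector-similarity measures mentioned above, a minimal sketch (not part of the disclosure) that computes the Euclidean distance and the cosine similarity between two feature vectors:

```python
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Euclidean distance between two feature vectors (smaller = more similar)."""
    return float(np.linalg.norm(x - y))

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity between two feature vectors (closer to 1 = more similar)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```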
  • The sound collection condition-specific learning database DB4 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD.
  • The sound collection condition-specific learning database DB4 stores distribution information of the learning data used for training the similarity calculation models.
  • The monitor MN is configured using a display such as a liquid crystal display (LCD) or an organic electroluminescence (EL) display.
  • The monitor MN displays the authentication result screen SC output from the terminal device P1.
  • The authentication result screen SC is a screen that notifies the administrator (for example, the person viewing the monitor MN) of the speaker authentication result, and contains authentication result information such as "Matched Mr. XX's voice" and reliability information such as "Reliability: High".
  • The authentication result screen SC may also include other registered speaker information (for example, a face image). The authentication result screen SC does not need to include the reliability information.
  • FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
  • The terminal device P1 acquires audio data from the microphone MK (St11).
  • The microphone MK here may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
  • The terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
  • In step St12, if the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines that the feature amount of the speaker US is to be newly registered in the registered speaker database DB2 (St12, YES), and extracts the feature amount of the speaker US from the voice data (registered voice data) (St13).
  • In step St12, if the control command is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines that the feature amount of the speaker US is not to be newly registered in the registered speaker database DB2 (St12, NO), and extracts the feature amount of the speaker US from the voice data (authentication voice data) (St14).
  • At the time of voice registration, the terminal device P1 executes the process of determining the sound collection conditions of the registered voice data (uttered voice) from which the feature amount has been extracted (St15A).
  • The terminal device P1 stores (registers) the feature amount of the speaker US, the information on the sound collection conditions, and the speaker information of the speaker US in the registered speaker database DB2 in association with one another (St16).
  • At the time of voice authentication, the terminal device P1 executes the process of determining the sound collection conditions of the authentication voice data (uttered voice) from which the feature amount has been extracted (St15B).
  • The terminal device P1 acquires the feature amounts and the information on the sound collection conditions of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • Based on the information on the feature amount and sound collection conditions of the speaker US and the information on the feature amounts and sound collection conditions of each of the plurality of registered speakers, the terminal device P1 selects a similarity calculation model for calculating the similarity between the feature amount of the speaker US and the feature amount of each of the plurality of registered speakers (St18).
  • Specifically, the terminal device P1 refers to the correspondence lists LST, LST1, and LST2 (see FIGS. 7, 8, and 9) and selects one of the plurality of selectable models.
  • The terminal device P1 executes the speaker authentication process based on the selected model of the referenced correspondence list LST, LST1, or LST2 (St19).
  • FIG. 4 is a flowchart illustrating an example of a sound collection condition determination procedure of the terminal device P1 in the embodiment.
  • FIG. 5 is a diagram illustrating an example of determining sound collection conditions and an example of calculating reliability.
  • The terminal device P1 acquires the feature amount of the speaker US from the authentication voice data (St151), and acquires the learning data information for each similarity calculation model registered in the similarity calculation model database DB3 (St152).
  • The learning data here is data for calculating the similarity of feature amounts under predetermined sound collection conditions.
  • The terminal device P1 calculates, for example, the distance between the feature amount of the speaker US and each learning data (St153).
  • The terminal device P1 identifies and selects the similarity calculation model with the smallest calculated distance (St154).
  • The terminal device P1 determines that the sound collection condition corresponding to this similarity calculation model is the sound collection condition of the feature amount of the speaker US (St154).
  • In the example shown in FIG. 5, the terminal device P1 calculates the distances d_XA, d_XB, and d_XC between the feature amount PT1 of the speaker US and the learning data distributions of the similarity calculation models DBA, DBB, and DBC, respectively.
  • The three similarity calculation models DBA, DBB, and DBC shown in FIG. 5 illustrate an example in which each area includes five learning data, but it is sufficient if the number of learning data used for generating (training) a similarity calculation model is one or more. Furthermore, the number of similarity calculation models is not limited to three and may be two or more.
  • The terminal device P1 calculates the distances d_XA, d_XB, and d_XC using the following (Formula 1):
  • (Formula 1): d_XA = (1/N) × Σ_{i=1}^{N} d(X, A_i), where X is the feature amount of the speaker US, A_i is the i-th learning data of the similarity calculation model DBA, d(X, A_i) is the distance between the two feature vectors, and N (an integer of 1 or more) is the number of learning data used to generate the similarity calculation model, or a representative number of learning data of the similarity calculation model used for the distance calculation.
  • The number N of learning data does not have to be the same for each similarity calculation model.
  • In the example shown in FIG. 5, N = 5, and the distance d_XA is the average of the distances between the feature amount X and each of the learning data A_1, A_2, A_3, A_4, and A_5.
  • The distances d_XB and d_XC are calculated in the same manner, using the learning data of the similarity calculation models DBB and DBC, respectively.
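  • The following is a minimal Python sketch (not part of the disclosure) of (Formula 1) and of the selection in steps St153 to St154, assuming Euclidean distance and representing each similarity calculation model's learning data as a matrix with one row per learning datum; all function and variable names are illustrative.

```python
import numpy as np

def mean_distance(x: np.ndarray, learning_data: np.ndarray) -> float:
    """(Formula 1): d = (1/N) * sum_i ||x - a_i|| over the N learning data rows."""
    return float(np.mean(np.linalg.norm(learning_data - x, axis=1)))

def determine_sound_collection_condition(x: np.ndarray,
                                         models: dict[str, np.ndarray]) -> tuple[str, float]:
    """St153-St154: calculates the distance between the feature amount x and the
    learning data of each similarity calculation model, then selects the model
    with the smallest distance; its sound collection condition is taken as the
    sound collection condition of x."""
    distances = {name: mean_distance(x, data) for name, data in models.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]
```

  • For the FIG. 5 example, models would map "DBA", "DBB", and "DBC" to matrices of five learning data each, and the function would return "DBC" when d_XC is the minimum.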
  • The terminal device P1 selects, as the similarity calculation model used for calculating the similarity, the similarity calculation model corresponding to the learning data distribution with the minimum distance among the plurality of calculated distances. For example, in the example shown in FIG. 5, the distance d_XC is the minimum among the calculated distances d_XA, d_XB, and d_XC. In such a case, the terminal device P1 determines that the sound collection conditions of the voice uttered by the speaker US (feature amount PT1) are the same as the sound collection conditions corresponding to the similarity calculation model DBC.
  • The terminal device P1 calculates the reliability of the calculated similarity based on the distance between the learning data distribution of the similarity calculation model used to calculate the similarity and the feature amount of the speaker US (that is, the distance d_XA, d_XB, or d_XC).
  • Specifically, the terminal device P1 determines whether the distance between the learning data distribution of the similarity calculation model used to calculate the similarity and the feature amount of the speaker US is less than or equal to a predetermined value.
  • If the terminal device P1 determines that the distance is less than or equal to the predetermined value, it calculates (evaluates) the reliability of the similarity as "high". On the other hand, if the terminal device P1 determines that the distance between the learning data distribution of the similarity calculation model and the feature amount of the speaker US is not less than or equal to the predetermined value, it calculates (evaluates) the reliability of the similarity as "low".
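  • The reliability evaluation described above reduces to a threshold test on the distance obtained from (Formula 1). A minimal sketch, assuming a placeholder threshold value (the disclosure does not specify the predetermined value):

```python
RELIABILITY_THRESHOLD = 1.0  # the "predetermined value"; an assumed placeholder

def evaluate_reliability(distance: float) -> str:
    """Reliability of the similarity: "high" if the feature amount of the speaker
    US lies within the predetermined distance of the learning data distribution
    of the similarity calculation model, otherwise "low"."""
    return "high" if distance <= RELIABILITY_THRESHOLD else "low"
```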
  • FIG. 6 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
  • Based on the correspondence lists LST, LST1, and LST2, the terminal device P1 reads, from the similarity calculation model database DB3, the similarity calculation model used to calculate the degree of similarity between the feature amount of the speaker US and the feature amount of one registered speaker among the plurality of registered speakers (St191).
  • The terminal device P1 uses the similarity calculation model to calculate the similarity between the feature amount of the speaker US and the feature amount of the registered speaker (St192). Furthermore, the terminal device P1 calculates the reliability of the calculated similarity based on the distance, calculated using (Formula 1), between the similarity calculation model and the feature amount of the speaker US (St192). The terminal device P1 repeats the process of step St192 until it has calculated the similarity and the reliability between the feature amount of the voice data of the speaker US and the feature amounts of all registered speakers registered in the registered speaker database DB2.
  • The terminal device P1 determines whether there is a degree of similarity equal to or greater than a threshold among the calculated degrees of similarity (St193).
  • If the terminal device P1 determines in the process of step St193 that there is a degree of similarity equal to or greater than the threshold among the calculated degrees of similarity (St193, YES), it determines that the registered speaker corresponding to that degree of similarity and the speaker US are the same person. The terminal device P1 identifies the speaker US based on the registered speaker information of this registered speaker (St194). Note that if there are multiple degrees of similarity determined to be equal to or greater than the threshold, the terminal device P1 may determine that the registered speaker corresponding to the highest calculated degree of similarity and the speaker US are the same person.
  • If the terminal device P1 determines in the process of step St193 that there is no degree of similarity equal to or greater than the threshold among the calculated degrees of similarity (St193, NO), it determines that the speaker US cannot be identified (St195).
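  • Steps St192 to St195 amount to scoring every registered speaker and accepting the best score only if it is equal to or greater than the threshold. A minimal sketch under the same illustrative assumptions as the earlier snippets (the similarity function, the threshold value, and the database layout are not specified by the disclosure):

```python
from typing import Callable, Optional

import numpy as np

SIMILARITY_THRESHOLD = 0.8  # assumed placeholder for the threshold

def authenticate(feature: np.ndarray,
                 registered_db: list,
                 compute_similarity: Callable[[np.ndarray, np.ndarray], float]) -> Optional[str]:
    """St192-St195: calculates the similarity against every registered speaker,
    identifies the registered speaker with the highest similarity if it is equal
    to or greater than the threshold, and otherwise returns None (speaker US
    cannot be identified)."""
    if not registered_db:
        return None
    scores = [(compute_similarity(feature, spk.feature), spk) for spk in registered_db]
    best_score, best_speaker = max(scores, key=lambda s: s[0])
    if best_score >= SIMILARITY_THRESHOLD:
        return best_speaker.speaker_id  # St194: speaker US identified
    return None  # St195: speaker US cannot be identified
```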
  • The terminal device P1 generates the authentication result screen SC based on the registered speaker information of the identified speaker US.
  • The terminal device P1 outputs the generated authentication result screen SC to the monitor MN for display (St196).
  • As described above, at the time of voice registration, the terminal device P1 registers the speaker information, the feature amount of the speaker US, and the sound collection conditions of the feature amount of the speaker US in association with one another.
  • Because the terminal device P1 registers, in association with the feature amount, the sound collection conditions that change the feature amount indicating the individuality of the speaker US, a similarity calculation model can be selected based on the sound collection conditions of each feature amount even if the sound collection conditions of the feature amount at the time of voice registration and those at the time of voice authentication differ, and the feature amounts therefore differ; speaker authentication can thus be performed with higher accuracy. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by noise included in the authentication voice data.
  • The terminal device P1 also calculates and displays the reliability, which indicates the likelihood of the speaker identified by the speaker authentication process as indicated by the calculated similarity, based on the similarity calculation model used to calculate the similarity. Thereby, the terminal device P1 can present the certainty of the speaker authentication result to the administrator viewing the monitor MN. Therefore, by presenting the reliability, the terminal device P1 can inform the administrator that there was no similarity calculation model suitable for calculating the similarity and that speaker authentication was performed using the general-purpose similarity calculation model described later.
  • FIG. 7 is a diagram illustrating an example of the correspondence list LST1 when the sound collection condition estimation result at the time of voice registration and the sound collection condition estimation result at the time of voice authentication are the same.
  • FIG. 8 is a diagram illustrating an example of the correspondence list LST2 when the sound collection conditions at the time of voice registration and the sound collection conditions at the time of voice authentication are different.
  • In the example shown in FIG. 7, the audio data at the time of voice registration and at the time of voice authentication have the same sound collection conditions: "in-store noise, male, telephone voice".
  • The terminal device P1 extracts the feature amount of the speaker US from the authentication voice data of the speaker US transmitted from a sound collection device such as the microphone MK.
  • The terminal device P1 refers to the sound collection condition-specific learning database DB4 and, based on the extracted feature amount and the distribution of the learning data for each sound collection condition, determines that the sound collection conditions of the feature amount of the speaker US are "in-store noise, male, telephone voice".
  • Based on the information on the sound collection conditions of the speaker US and the information on the sound collection conditions of each of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects a similarity calculation model corresponding to the combination of the sound collection conditions of the speaker US and the sound collection conditions of each of the plurality of registered speakers.
  • Specifically, the terminal device P1 refers to the correspondence list LST1, which associates the information on the sound collection conditions of the speaker US, the information on the sound collection conditions of each of the plurality of registered speakers, and each of the selectable similarity calculation models.
  • The correspondence list LST1 is data that associates the determination result "sound collection condition determination result 1" of the sound collection conditions of the feature amount of the speaker US, the determination result "sound collection condition determination result 2" of the sound collection conditions of the registered speakers registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two sound collection conditions.
  • The sound collection condition determination result "sound collection condition determination result 1" indicates the determination result of the sound collection conditions of the speaker US.
  • The sound collection condition determination result "sound collection condition determination result 2" indicates the determination result of the sound collection conditions of the registered speakers registered in the registered speaker database DB2. Information on the determination probability corresponding to each sound collection condition determination result is not essential and may be omitted.
  • The determination result of the sound collection conditions may include a plurality of conditions such as, for example, the gender of the speaker US, the sound collection device, and the type of noise.
  • The similarity calculation model "selected model" is the similarity calculation model selected corresponding to the sound collection condition determination result "sound collection condition determination result 1" and the sound collection condition determination result "sound collection condition determination result 2", and includes each of the models "Model A", "Model B", "Model C", and "Model Z".
  • If the similarity calculation model selection unit 114 determines that there is no similarity calculation model suitable for the similarity calculation process based on the combination of the sound collection conditions of the feature amount of the speaker US and the sound collection conditions of the feature amount of the registered speaker, it selects the general-purpose similarity calculation model "Model Z".
  • The similarity calculation model selection unit 114 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST1. For example, in the example shown in FIG. 7, the similarity calculation model selection unit 114 selects the similarity calculation model "Model A".
  • In the example shown in FIG. 8, the registered voice data at the time of voice registration has the sound collection conditions "in-store noise, male, telephone voice".
  • The audio data at the time of voice authentication has sound collection conditions "outdoor noise, male, headset voice" that are different from the sound collection conditions of the registered voice data at the time of voice registration.
  • The terminal device P1 extracts the feature amount of the speaker US from the authentication voice data of the speaker US transmitted from a sound collection device such as the microphone MK.
  • The terminal device P1 refers to the sound collection condition-specific learning database DB4 and, based on the extracted feature amount and the distribution of the learning data for each sound collection condition, determines that the sound collection conditions of the feature amount of the speaker US are "outdoor noise, male, headset voice".
  • Based on the information on the sound collection conditions of the speaker US and the information on the sound collection conditions of each of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects a similarity calculation model corresponding to the combination of the sound collection conditions of the speaker US and the sound collection conditions of each of the plurality of registered speakers.
  • Specifically, the terminal device P1 refers to the correspondence list LST2, which associates the information on the sound collection conditions of the speaker US, the information on the sound collection conditions of each of the plurality of registered speakers, and each of the selectable similarity calculation models.
  • The correspondence list LST2 is data that associates the determination result "sound collection condition determination result 3" of the sound collection conditions of the feature amount of the speaker US, the determination result "sound collection condition determination result 4" of the sound collection conditions of the registered speakers registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two sound collection conditions.
  • The sound collection condition determination result "sound collection condition determination result 3" indicates the determination result of the sound collection conditions of the speaker US.
  • The sound collection condition determination result "sound collection condition determination result 4" indicates the determination result of the sound collection conditions of the registered speakers registered in the registered speaker database DB2. Information on the determination probability corresponding to each sound collection condition determination result is not essential and may be omitted.
  • The determination result of the sound collection conditions may include a plurality of conditions such as, for example, the gender of the speaker US, the sound collection device, and the type of noise.
  • The similarity calculation model "selected model" is the similarity calculation model selected corresponding to the sound collection condition determination result "sound collection condition determination result 3" and the sound collection condition determination result "sound collection condition determination result 4", and includes each of the models "Model D", "Model E", "Model F", and "Model Z".
  • If the similarity calculation model selection unit 114 determines that there is no similarity calculation model suitable for the similarity calculation process based on the combination of the sound collection conditions of the feature amount of the speaker US and the sound collection conditions of the feature amount of the registered speaker, it selects the general-purpose similarity calculation model "Model Z".
  • The similarity calculation model selection unit 114 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST2. For example, in the example shown in FIG. 8, the similarity calculation model selection unit 114 selects the similarity calculation model "Model D".
  • As described above, the terminal device P1 can select the similarity calculation model best suited to the similarity calculation process for the two feature amounts subject to similarity calculation (the feature amount of the speaker US and the feature amount of the registered speaker), based on the combination of the sound collection conditions included in each feature amount.
  • As a result, even if the sound collection conditions included in the feature amount at the time of voice registration and those included in the feature amount at the time of voice authentication change, the terminal device P1 can select the optimal similarity calculation model and calculate the degree of similarity between the two feature amounts. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by noise included in the audio data.
  • Note that the terminal device P1 may, for example, calculate a degree of similarity using each of the similarity calculation models corresponding to the respective conditions and adopt the average value as the degree of similarity, as sketched below.
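  • A sketch of the averaging variant just mentioned, in which each of several condition-specific similarity calculation models scores the same pair of feature amounts and the average is adopted as the degree of similarity (the model set and scoring functions are illustrative assumptions):

```python
from typing import Callable

import numpy as np

def averaged_similarity(feature_auth: np.ndarray,
                        feature_reg: np.ndarray,
                        models: list[Callable[[np.ndarray, np.ndarray], float]]) -> float:
    """Calculates the similarity with each condition-specific similarity
    calculation model and adopts the average value as the degree of similarity."""
    scores = [model(feature_auth, feature_reg) for model in models]
    return sum(scores) / len(scores)
```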
  • FIG. 9 is a diagram illustrating a specific example of the correspondence list LST.
  • The correspondence list LST associates the sound collection condition determination result "sound collection condition determination result AA" of the feature amount of the speaker US, the sound collection condition determination result "sound collection condition determination result BB" of the feature amount of the registered speaker, and the selected similarity calculation model "selected model".
  • The sound collection condition determination result "sound collection condition determination result AA" of the feature amount of the speaker US and the sound collection condition determination result "sound collection condition determination result BB" of the feature amount of the registered speaker each include, for example, gender, equipment (sound quality), and noise type.
  • The sound collection conditions shown in FIG. 9 are merely an example, and the present invention is not limited thereto.
  • In the correspondence list LST, sound collection conditions that are the same in the sound collection condition determination result "sound collection condition determination result AA" and the sound collection condition determination result "sound collection condition determination result BB" are indicated in bold letters.
  • The sound collection condition "gender" indicates the gender of the speaker US determined based on the feature amount of the speaker US.
  • The sound collection condition "equipment (sound quality)" indicates information regarding the sound collection device determined based on the feature amount of the speaker US.
  • The sound collection condition "noise type" indicates the type of environmental sound, noise, or the like at the time of speaking that is included in the feature amount of the speaker US.
  • The similarity calculation model "selected model" is the similarity calculation model used to calculate the similarity between the feature amount of the speaker US and the feature amount of the registered speaker.
  • For example, suppose the determination results of the two sound collection conditions are: the sound collection condition "gender" is "male", the sound collection condition "equipment (sound quality)" is "telephone", and the sound collection condition "noise type" is "in-store noise".
  • In this case, the terminal device P1 selects the similarity calculation model "male phone store noise model", which is suited to calculating the similarity of feature amounts having the same sound collection conditions "male", "telephone", and "in-store noise" in the two determination results.
  • Next, suppose the determination results of the two sound collection conditions are: the sound collection condition "gender" is "male" in both, while the sound collection condition "equipment (sound quality)" is "telephone" and "headset", and the sound collection condition "noise type" is "in-store noise" and "outdoor noise".
  • In this case, the terminal device P1 selects the similarity calculation model "male model", which is suited to calculating the similarity of feature amounts having the sound collection condition "male", the condition that is the same or similar in the two determination results.
  • Further, suppose the determination results of the two sound collection conditions are: the sound collection condition "gender" is "female" and the sound collection condition "equipment (sound quality)" is "headset" in both, while the sound collection conditions "noise type" are "none (clean voice)" and "in-store noise".
  • In this case, the terminal device P1 selects the similarity calculation model "female headset model", which is suited to calculating the similarity of feature amounts having the same or similar sound collection conditions "female" and "headset" in the two determination results.
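  • The three FIG. 9 cases follow a common pattern: use the model matching all shared conditions when every condition agrees, fall back to a model covering only the conditions shared by both determination results, and ultimately fall back to the general-purpose model. A minimal sketch of that fallback; the way model names are composed from condition values here is an illustrative assumption, not the disclosure's naming scheme:

```python
GENERAL_PURPOSE_MODEL = "Model Z"  # general-purpose similarity calculation model

def select_model_from_shared_conditions(cond_aa: dict, cond_bb: dict) -> str:
    """Selects a similarity calculation model from the sound collection
    conditions shared by determination results AA and BB (cf. FIG. 9)."""
    shared = {k: v for k, v in cond_aa.items() if cond_bb.get(k) == v}
    if not shared:
        return GENERAL_PURPOSE_MODEL
    # e.g. {"gender": "male"} -> "male model";
    #      {"gender": "female", "equipment": "headset"} -> "female headset model"
    ordered = [shared[k] for k in ("gender", "equipment", "noise_type") if k in shared]
    return " ".join(ordered) + " model"
```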
  • As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit), which acquires audio data; the feature amount extraction unit 111 (an example of a detection unit), which detects, from the audio data, an utterance section in which the speaker US is speaking; the feature amount extraction unit 111 (an example of an extraction unit), which extracts the feature amount of the speaker US (an example of an utterance feature amount) from the detected utterance section; the similarity calculation model selection unit 114 (an example of a selection unit), which selects, from among a plurality of similarity calculation models, a first similarity calculation model (an example of a similarity calculation model, selected in the process of step St18) used for authenticating the speaker US, based on the extracted feature amount of the speaker US and the feature amount of at least one registered speaker registered in advance; and the authentication unit 116, which authenticates the speaker US by comparing the feature amount of the speaker US with the feature amount of the registered speaker using the selected first similarity calculation model.
  • Thereby, the terminal device P1 can select a similarity calculation model better suited to the similarity calculation process for the two feature amounts (the feature amount of the speaker US and the feature amount of the registered speaker), and can perform speaker authentication based on the two feature amounts with higher accuracy.
  • Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by differences in noise (environmental noise) between the time of voice registration and the time of voice authentication.
  • The terminal device P1 according to the embodiment further includes the sound collection condition determination unit 112 (an example of a determination unit), which determines the sound collection conditions of the uttered voice corresponding to a feature amount based on that feature amount.
  • The similarity calculation model selection unit 114 selects the first similarity calculation model based on the sound collection conditions corresponding to the feature amount of the speaker US and the sound collection conditions corresponding to the feature amount of the registered speaker.
  • Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model better suited to feature amount matching based on the combination of the sound collection conditions at the time of voice registration and the sound collection conditions at the time of voice authentication. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy.
  • The plurality of similarity calculation models in the terminal device P1 according to the embodiment are each generated using at least one learning data under a predetermined sound collection condition.
  • The sound collection condition determination unit 112 determines the sound collection conditions of the speaker US or a registered speaker based on the distance between the feature amount of the speaker US or the registered speaker and each of the plurality of similarity calculation models. Thereby, the terminal device P1 according to the embodiment can determine the sound collection conditions corresponding to each feature amount with higher accuracy, based on the distance between each similarity calculation model generated using learning data for each sound collection condition and the feature amount of the speaker US or the registered speaker.
  • Further, the sound collection condition determination unit 112 in the terminal device P1 according to the embodiment selects the similarity calculation model for which the distance between the feature amount of the speaker US or the registered speaker and each of the plurality of similarity calculation models is the shortest (an example of a second similarity calculation model, selected in the process of step St154), and determines the sound collection conditions of the speaker US or the registered speaker based on the sound collection conditions corresponding to the selected second similarity calculation model. Thereby, the terminal device P1 according to the embodiment selects the similarity calculation model whose characteristics are closest when calculating the similarity with the feature amount of the speaker US or the registered speaker, and can determine the sound collection conditions corresponding to the feature amounts of the speaker US or the registered speakers with higher accuracy.
  • The terminal device P1 according to the embodiment further includes the authentication unit 116 (an example of a calculation unit), which calculates the degree of similarity between the feature amount of the audio data of the utterance section and the feature amount of each of the plurality of registered speakers.
  • The authentication unit 116 authenticates the speaker US based on the plurality of calculated similarities. Thereby, the terminal device P1 according to the embodiment can perform speaker authentication using the degrees of similarity between the feature amounts of the plurality of registered speakers registered in advance and the feature amount of the speaker US.
  • The authentication unit 116 in the terminal device P1 according to the embodiment identifies the registered speaker whose degree of similarity is equal to or greater than the threshold as the speaker US.
  • Thereby, the terminal device P1 according to the embodiment can perform speaker authentication using the degrees of similarity between the feature amounts of the plurality of registered speakers registered in advance and the feature amount of the speaker US.
  • The authentication unit 116 in the terminal device P1 according to the embodiment generates and outputs the authentication result screen SC, which includes information regarding the registered speaker whose degree of similarity is equal to or greater than the threshold. Thereby, the terminal device P1 according to the embodiment can present the speaker authentication result to the speaker US or the administrator.
  • If there is no degree of similarity equal to or greater than the threshold, the authentication unit 116 in the terminal device P1 according to the embodiment determines that the speaker US cannot be identified. Thereby, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy and more effectively suppress erroneous authentication of the speaker US.
  • The terminal device P1 according to the embodiment also includes the authentication unit 116, which calculates the similarity between the feature amount of the audio data of the utterance section and the feature amount of each of the plurality of registered speakers, and further includes the reliability calculation unit 115 (an example of a reliability calculation unit), which calculates the reliability of the similarity. The reliability calculation unit 115 calculates the reliability of the similarity based on the distance between the feature amount of the speaker US and the second similarity calculation model. Thereby, the terminal device P1 according to the embodiment can calculate the degree of confidence in the similarity, in the sound collection conditions determined in the process of calculating the similarity, in the first similarity calculation model used to calculate the similarity, and the like.
  • The authentication unit 116 in the terminal device P1 according to the embodiment identifies the registered speaker whose degree of similarity is equal to or greater than the threshold as the speaker US, and generates and outputs the authentication result screen SC, which includes information regarding that registered speaker and the calculated reliability information.
  • Thereby, the terminal device P1 according to the embodiment displays the speaker authentication result together with its reliability, and can prompt the administrator to confirm whether the speaker authentication result is trustworthy.
  • The sound collection conditions include at least one of: the gender of the speaker US, the age of the speaker US, the language of the speaker US, the sound collection device with which the uttered voice was collected, and the type of noise included in the uttered voice.
  • Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model based on the sound collection conditions that change the feature amounts used for speaker authentication and that can cause a decrease in speaker authentication accuracy. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy.
  • the present disclosure is useful as a voice authentication device and voice authentication method that can suppress a decrease in speaker authentication accuracy due to changes in environmental noise.


Abstract

This voice authentication device comprises: an acquisition unit that acquires voice data; a detection unit that detects from the voice data an utterance section in which a speaker is uttering; an extraction unit that extracts an utterance feature amount of the speaker from the detected utterance section; a selection unit that, on the basis of the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker that is previously registered, selects a first similarity calculation model used for authenticating the speaker from among a plurality of similarity calculation models; and an authentication unit that, using the selected first similarity calculation model, compares the utterance feature amount of the speaker and the utterance feature amount of the registered speaker, and authenticates the speaker.

Description

Voice authentication device and voice authentication method
The present disclosure relates to a voice authentication device and a voice authentication method.
Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice. The speech recognition device stores a plurality of motion noise models, each created for one of a plurality of motions, in association with those motions, detects input speech including the subject's voice, identifies the subject's motion, and reads out the motion noise model corresponding to the identified motion. The speech recognition device then reads an environmental noise model corresponding to the subject's current position, synthesizes the environmental noise model with the read motion noise model, and uses the synthesized noise superimposition model to recognize the subject's voice contained in the detected input speech.
Japanese Patent Application Publication No. 2008-250059
However, Patent Document 1 requires collecting in advance both the motion noise generated by each of the plurality of motions and the environmental noise at each of the plurality of positions where voice recognition can be performed, which is very time-consuming. In addition, in voiceprint authentication, the feature quantity indicating the individuality of the authentication target (person) extracted from an audio signal changes depending on the noise contained in the audio signal and on the sound collection conditions, such as the sound collection device with which the audio signal was collected. Therefore, when voiceprint authentication is performed using the voice recognition device described above and the sound collection conditions of the pre-registered voice signal differ from those of the voice signal collected at authentication time, the features extracted from the two voice signals do not indicate the individuality of the same person, and voiceprint authentication accuracy may decrease.
The present disclosure was devised in view of the above-described conventional situation, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
The present disclosure provides a voice authentication device including: an acquisition unit that acquires voice data; a detection unit that detects, from the voice data, an utterance section in which a speaker is speaking; an extraction unit that extracts an utterance feature amount of the speaker from the detected utterance section; a selection unit that selects, from among a plurality of similarity calculation models, a first similarity calculation model used for authenticating the speaker, based on the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker registered in advance; and an authentication unit that compares the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model and authenticates the speaker.
The present disclosure also provides a voice authentication method performed by a terminal device, including: acquiring voice data; detecting, from the voice data, an utterance section in which a speaker is speaking; extracting an utterance feature amount of the speaker from the detected utterance section; selecting, from among a plurality of similarity calculation models, a first similarity calculation model used for authenticating the speaker, based on the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker registered in advance; and comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model to authenticate the speaker.
According to the present disclosure, it is possible to suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system according to an embodiment.
FIG. 2 is a diagram illustrating each process performed by a processor of a terminal device in the embodiment.
FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device in the embodiment.
FIG. 4 is a flowchart showing an example of a sound collection condition determination procedure of the terminal device in the embodiment.
FIG. 5 is a diagram illustrating an example of determining sound collection conditions and an example of calculating reliability.
FIG. 6 is a flowchart showing an example of a speaker authentication procedure of the terminal device in the embodiment.
FIG. 7 is a diagram illustrating an example of a correspondence list when the sound collection condition estimation result at the time of voice registration and the sound collection condition estimation result at the time of voice authentication are the same.
FIG. 8 is a diagram illustrating an example of a correspondence list when the noise type at the time of voice registration and the noise type at the time of voice authentication are different.
FIG. 9 is a diagram illustrating a specific example of the correspondence list.
Hereinafter, embodiments specifically disclosing a voice authentication device and a voice authentication method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, more detailed explanation than necessary may be omitted. For example, detailed explanations of already well-known matters and redundant explanations of substantially the same configurations may be omitted. This is to avoid unnecessary redundancy in the following description and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided to enable those skilled in the art to fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.
First, a voice authentication system 100 according to an embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing an example of the internal configuration of the voice authentication system 100 according to the embodiment. FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
The voice authentication system 100 includes a terminal device P1 as an example of a voice authentication device, and a monitor MN. Note that the voice authentication system 100 may be configured to include a microphone MK or the monitor MN.
The microphone MK picks up the voice uttered by the speaker US in order to register the voice in the terminal device P1 in advance. The microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data to be registered in the terminal device P1, and transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
The microphone MK also picks up the voice uttered by the speaker US that is used for speaker authentication. The microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data, and transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
In the following explanation, to make the explanation easier to understand, the voice data for voice registration, or the voice data already registered in the terminal device P1, is referred to as "registered voice data", and the voice data for voice authentication is referred to as "authentication voice data", to distinguish the two.
Note that the microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, or a tablet terminal. The microphone MK may also transmit the audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown).
The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes a voice registration process using registered voice data of the speaker US and a speaker authentication process using authentication voice data. The terminal device P1 includes a communication unit 10, a processor 11, a memory 12, a feature amount extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a sound collection condition-specific learning database DB4.
The communication unit 10, which is an example of the acquisition unit, is connected to the microphone MK and the monitor MN so that data can be transmitted and received by wired or wireless communication. The wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless Local Area Network (LAN) such as Wi-Fi (registered trademark).
Note that the communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as Universal Serial Bus (USB). The communication unit 10 may also transmit and receive data to and from the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
The processor 11 is configured using, for example, a Central Processing Unit (CPU) or a Field Programmable Gate Array (FPGA), and performs various processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12 and, by executing the program, realizes the functions of the feature amount extraction unit 111, the sound collection condition determination unit 112, the speaker registration unit 113, the similarity calculation model selection unit 114, the reliability calculation unit 115, and the authentication unit 116.
When registering the voice of the speaker US, the processor 11 implements the functions of the feature amount extraction unit 111, the speaker registration unit 113, and the sound collection condition determination unit 112, thereby executing new registration (storage) processing of the speaker US into the registered speaker database DB2.
When authenticating the voice of the speaker US, the processor 11 implements the functions of the feature amount extraction unit 111, the sound collection condition determination unit 112, the similarity calculation model selection unit 114, the reliability calculation unit 115, and the authentication unit 116, thereby executing speaker authentication processing.
The feature amount extraction unit 111, which is an example of the detection unit and the extraction unit, acquires the voice data (registered voice data or authentication voice data) of the speaker US transmitted from the microphone MK. The feature amount extraction unit 111 executes the voice registration process or the voice authentication process based on a control command associated with the voice data.
At the time of voice registration, the feature amount extraction unit 111 detects, from the registered voice data, the utterance section in which the speaker US is speaking. The feature amount extraction unit 111 extracts a feature amount indicating the individuality of the speaker US from the detected utterance section, and outputs it to the sound collection condition determination unit 112 and the speaker registration unit 113.
At the time of voice authentication, the feature amount extraction unit 111 extracts the feature amount of the speaker US from the authentication voice data and outputs it to each of the sound collection condition determination unit 112 and the authentication unit 116.
The sound collection condition determination unit 112, which is an example of the determination unit, determines, based on the feature amount of the speaker US output from the feature amount extraction unit 111, the sound collection conditions under which the voice data from which that feature amount was extracted (that is, the uttered voice of the speaker US) was collected. At the time of voice registration, the sound collection condition determination unit 112 outputs information on the sound collection conditions of the feature amount of the speaker US to the speaker registration unit 113. At the time of voice authentication, the sound collection condition determination unit 112 outputs information on the sound collection conditions of the feature amount of the speaker US to the similarity calculation model selection unit 114.
The sound collection conditions referred to here are, for example, the sound collection device with which the uttered voice of the speaker US or of a registered speaker was collected, the language, gender, and age of the speaker US or of the registered speaker, and the noise type of the noise included in the feature amount. The sound collection device is, for example, a microphone, a telephone, a headset, or the like.
The noise is noise collected due to the environment (background) at the time of sound collection, and includes, for example, surrounding voices, music, the sound of passing vehicles, the sound of the wind, etc. The noise type indicates the environment (place) or position where the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or noise inside a station. The noise type may further include information on time zones such as early morning, daytime, and nighttime.
The speaker registration unit 113 acquires the feature amount of the speaker US output from the feature amount extraction unit 111 and the speaker information of the speaker US associated with the registered voice data. The speaker registration unit 113 also acquires the information on the sound collection conditions output from the sound collection condition determination unit 112. The speaker registration unit 113 associates the feature amount of the speaker US, the speaker information of the speaker US, and the information on the sound collection conditions with one another, and registers them in the registered speaker database DB2.
Note that the speaker information may be extracted from the registered voice data by voice recognition, or may be acquired from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, or a tablet terminal). The speaker information referred to here is, for example, identification information that can identify the speaker US, the name of the speaker US, a speaker Identification (ID), or the like.
The similarity calculation model selection unit 114, which is an example of the selection unit, acquires the information on the sound collection conditions of the feature amount of the speaker US output from the sound collection condition determination unit 112 and the information on the sound collection conditions of the feature amounts of the plurality of registered speakers registered in the registered speaker database DB2. Based on the acquired information on the sound collection conditions of the feature amount of the speaker US and the information on the sound collection conditions of the feature amounts of the plurality of registered speakers, the similarity calculation model selection unit 114 selects the similarity calculation model (an example of the similarity calculation model) used in the process of calculating the similarity between the feature amount of the speaker US and the feature amount of any one registered speaker.
The similarity calculation model selection unit 114 refers to correspondence lists LST, LST1, LST2 (see FIGS. 7, 8, and 9) that associate the acquired information on the sound collection conditions of the feature amount of the speaker US, the information on the sound collection conditions of the feature amounts of the plurality of registered speakers, and the selected similarity calculation models with one another, selects one selection model from among the plurality of selection models (first similarity calculation models), and outputs it to each of the reliability calculation unit 115 and the authentication unit 116.
The reliability calculation unit 115, which is an example of the reliability calculation unit, calculates (evaluates) the reliability (score) indicating the certainty of the identification result of the speaker US based on the similarity calculated by the authentication unit 116. The reliability calculation unit 115 calculates the reliability based on the distance between the model learning data distribution of the similarity calculation model used in the similarity calculation process by the authentication unit 116 and the feature amount of the speaker US. The reliability calculation unit 115 outputs the calculated reliability information to the authentication unit 116.
The authentication unit 116, which is an example of the calculation unit, acquires the feature amount of the speaker US output from the feature amount extraction unit 111, and acquires the feature amounts of the plurality of registered speakers registered in the registered speaker database DB2. The authentication unit 116 also acquires the selection model of the correspondence lists LST, LST1, LST2 output from the similarity calculation model selection unit 114.
The authentication unit 116 uses the similarity calculation model based on the correspondence lists LST, LST1, LST2 to calculate the similarity between the feature amount of each of the plurality of registered speakers and the feature amount of the speaker US. The authentication unit 116 identifies the speaker US based on the calculated similarities. The authentication unit 116 also acquires the reliability information output from the reliability calculation unit 115. The authentication unit 116 generates an authentication result screen SC based on the speaker information of the identified speaker US and the reliability information, and transmits it to the monitor MN.
The memory 12 has, for example, a Random Access Memory (hereinafter, "RAM") as a work memory used when executing each process of the processor 11, and a Read Only Memory (hereinafter, "ROM") that stores programs and data defining the operation of the processor 11. Data or information generated or acquired by the processor 11 is temporarily stored in the RAM. A program that defines the operation of the processor 11 is written in the ROM.
The feature amount extraction model database DB1 is a so-called storage, and is configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter, "HDD"), or a Solid State Drive (hereinafter, "SSD"). The feature amount extraction model database DB1 stores a feature amount extraction model capable of detecting the utterance section of the speaker US from registered voice data or authentication voice data and extracting the feature amount of the speaker US. The feature amount extraction model is, for example, a learning model generated by learning using deep learning or the like.
The registered speaker database DB2 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD. The registered speaker database DB2 stores the feature amounts of a plurality of registered speakers registered in advance, information on the determination results of the sound collection conditions corresponding to those feature amounts, and registered speaker information in association with one another.
The similarity calculation model database DB3 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD. The similarity calculation model database DB3 stores similarity calculation models capable of calculating the similarity between a feature amount extracted from authentication voice data and the feature amount of a registered speaker registered in the registered speaker database DB2. Each similarity calculation model is a learning model trained and generated by deep learning or the like using learning data under a predetermined sound collection condition.
For example, a similarity calculation model learns in advance, and retains, the dimensions in which individuality tends to appear, in order to calculate the similarity between two multidimensional vectors with high precision. Note that calculating similarity with a model is just one example of a method for calculating the similarity between vectors, and existing techniques such as Euclidean distance or cosine similarity may be used instead.
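The following is a minimal, illustrative sketch (not taken from the patent) of how the similarity between two feature amount vectors could be computed with cosine similarity or Euclidean distance, the existing techniques mentioned above; the vector values and function names are assumptions for illustration only.

    import math

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        # Cosine similarity: 1.0 means the vectors point in the same direction.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    def euclidean_distance(a: list[float], b: list[float]) -> float:
        # Euclidean distance: 0.0 means the vectors are identical.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Hypothetical 4-dimensional feature amounts of a speaker and a registered speaker.
    speaker_feature = [0.12, -0.40, 0.88, 0.05]
    registered_feature = [0.10, -0.35, 0.90, 0.02]
    print(cosine_similarity(speaker_feature, registered_feature))   # close to 1.0
    print(euclidean_distance(speaker_feature, registered_feature))  # close to 0.0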
The sound collection condition-specific learning database DB4 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD. The sound collection condition-specific learning database DB4 stores distribution information and the like of the learning data used for training the similarity calculation models.
The monitor MN is configured using a display such as a Liquid Crystal Display (LCD) or an organic Electroluminescence (EL) display. The monitor MN displays the authentication result screen SC output from the terminal device P1.
The authentication result screen SC is a screen that notifies an administrator (for example, a person viewing the monitor MN) of the speaker authentication result, and includes the authentication result information "The voice matched that of XX XX." and the reliability information "Reliability: High". The authentication result screen SC may include other registered speaker information (for example, a face image). The authentication result screen SC also does not need to include the reliability information.
Next, the operation procedure of the terminal device P1 will be described with reference to FIG. 3. FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
The terminal device P1 acquires voice data from the microphone MK (St11). Note that the microphone MK may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
The terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
In the process of step St12, if the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines to newly register the feature amount of the speaker US in the registered speaker database DB2 (St12, YES) and extracts the feature amount of the speaker US from the voice data (registered voice data) (St13).
On the other hand, in the process of step St12, if the control command is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines not to newly register the feature amount of the speaker US in the registered speaker database DB2 (St12, NO) and extracts the feature amount of the speaker US from the voice data (authentication voice data) (St14).
Based on the feature amount of the speaker US extracted from the registered voice data, the terminal device P1 executes a process of determining the sound collection conditions of the registered voice data (uttered voice) from which this feature amount was extracted (St15A).
The terminal device P1 associates the feature amount of the speaker US, the information on the sound collection conditions, and the speaker information of the speaker US with one another, and stores (registers) them in the registered speaker database DB2 (St16).
Based on the feature amount of the speaker US extracted from the authentication voice data, the terminal device P1 executes a process of determining the sound collection conditions of the authentication voice data (uttered voice) from which this feature amount was extracted (St15B).
The terminal device P1 acquires the information on the feature amounts and sound collection conditions of the plurality of registered speakers registered in the registered speaker database DB2. Based on the information on the feature amount and sound collection conditions of the speaker US and the information on the feature amounts and sound collection conditions of the plurality of registered speakers, the terminal device P1 selects a similarity calculation model for calculating the similarity between the feature amount of the speaker US and the feature amount of each of the plurality of registered speakers (St18).
Here, the terminal device P1 refers to correspondence lists LST, LST1, LST2 (see FIGS. 7, 8, and 9) that associate the selected similarity calculation models, the sound collection conditions of the speaker US, and the sound collection conditions of the plurality of registered speakers with one another, and selects one selection model from among the plurality of selection models.
The terminal device P1 executes speaker authentication processing based on the selected selection model of the correspondence lists LST, LST1, LST2 (St19).
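As a rough sketch of the branching in FIG. 3 (St11 to St19), the flow could be expressed as follows; every function body here is an illustrative stand-in (the real feature extraction, condition determination, and model selection are described in the text), and all names and values are assumptions.

    REGISTERED_DB: list[dict] = []  # stand-in for the registered speaker database DB2

    def extract_feature(audio_data: bytes) -> list[float]:
        # St13/St14: stand-in for the feature amount extraction model (DB1).
        return [b / 255.0 for b in audio_data[:4]]

    def judge_sound_condition(feature: list[float]) -> str:
        # St15A/St15B: stand-in for the distance-based condition determination (FIG. 4).
        return "in-store noise, male, telephone voice"

    def handle_audio(audio_data: bytes, command: str, speaker_info: str = "") -> None:
        feature = extract_feature(audio_data)
        condition = judge_sound_condition(feature)
        if command == "register":  # St12, YES
            REGISTERED_DB.append({"feature": feature, "condition": condition,
                                  "speaker": speaker_info})  # St16
        else:  # St12, NO -> model selection (St18) and authentication (St19)
            print(f"authenticate with the model selected for: {condition}")

    handle_audio(b"\x10\x20\x30\x40", "register", "speaker US")
    handle_audio(b"\x11\x21\x31\x41", "authenticate")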
Next, the sound collection condition determination procedure shown in each of steps St15A and St15B of FIG. 3 will be described with reference to FIGS. 4 and 5. FIG. 4 is a flowchart showing an example of the sound collection condition determination procedure of the terminal device P1 in the embodiment. FIG. 5 is a diagram illustrating an example of determining sound collection conditions and an example of calculating reliability.
The terminal device P1 acquires the feature amount of the speaker US from the authentication voice data (St151), and acquires information on the learning data of each similarity calculation model registered in the similarity calculation model database DB3 (St152). The learning data referred to here is data for calculating the similarity of feature amounts under a predetermined sound collection condition.
In order to select the optimal model, the terminal device P1 calculates, for example, the distance between the feature amount of the speaker US and each set of learning data (St153). The terminal device P1 identifies and selects the similarity calculation model with the smallest calculated distance, and determines that the sound collection condition corresponding to this similarity calculation model is the sound collection condition of the feature amount of the speaker US (St154).
For example, as shown in FIG. 5, the terminal device P1 calculates the distances d_XA, d_XB, and d_XC between the feature amount PT1 of the speaker US and the distributions of the learning data of the similarity calculation models DBA, DBB, and DBC, respectively.
Note that although FIG. 5 shows an example in which each of the three similarity calculation models DBA, DBB, and DBC is a region containing five learning data, the number of learning data used for generating (training) a similarity calculation model may be one or more. The number of similarity calculation models DBA, DBB, DBC, and so on may be two or more.
The terminal device P1 calculates the distances d_XA, d_XB, and d_XC between the feature amount PT1 of the speaker US and each of the similarity calculation models DBA, DBB, and DBC using (Equation 1) below, reconstructed here from the surrounding description as the average distance between the feature amount and the N learning data of a model:

$$d_{XM} = \frac{1}{N}\sum_{i=1}^{N}\left\|x - m_{i}\right\| \qquad \text{(Equation 1)}$$

where $x$ is the feature amount of the speaker US and $m_{i}$ is the $i$-th learning data of similarity calculation model $M$. The number N (N: an integer of 1 or more) in (Equation 1) is the number of learning data used to generate the similarity calculation model, or the number of representative learning data of the similarity calculation model used for the distance calculation. The number N of learning data need not be the same for each similarity calculation model.
For example, in the example shown in FIG. 5, N = 5, and the distance d_XA is the average distance between the feature amount PT1 of the speaker US and the five learning data A_1, A_2, A_3, A_4, A_5 included in the similarity calculation model DBA. The distance d_XB is the average distance between the feature amount PT1 of the speaker US and the five learning data B_1, B_2, B_3, B_4, B_5 included in the similarity calculation model DBB. The distance d_XC is the average distance between the feature amount PT1 of the speaker US and the five learning data C_1, C_2, C_3, C_4, C_5 included in the similarity calculation model DBC.
The terminal device P1 selects the similarity calculation model corresponding to the learning data distribution with the minimum distance among the plurality of calculated distances as the similarity calculation model used for calculating the similarity. For example, in the example shown in FIG. 5, the distance d_XC is the minimum of the calculated distances d_XA, d_XB, and d_XC. In such a case, the terminal device P1 determines that the sound collection condition of the uttered voice (feature amount PT1) of the speaker US is the same as the sound collection condition corresponding to the similarity calculation model DBC.
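A minimal sketch of steps St153 and St154 follows, assuming (Equation 1) is the average Euclidean distance reconstructed above; the learning data values and names are hypothetical.

    import math

    def avg_distance(x: list[float], learning_data: list[list[float]]) -> float:
        # (Equation 1): mean Euclidean distance from feature x to a model's learning data.
        def dist(a: list[float], b: list[float]) -> float:
            return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
        return sum(dist(x, m) for m in learning_data) / len(learning_data)

    def select_model(x: list[float], models: dict[str, list[list[float]]]) -> str:
        # St153-St154: pick the model whose learning data distribution is closest to x.
        return min(models, key=lambda name: avg_distance(x, models[name]))

    models = {  # hypothetical learning data distributions for DBA, DBB, DBC
        "DBA": [[0.9, 0.1], [0.8, 0.2]],
        "DBB": [[0.1, 0.9], [0.2, 0.8]],
        "DBC": [[0.5, 0.5], [0.4, 0.6]],
    }
    pt1 = [0.45, 0.55]  # stand-in for the feature amount PT1 of the speaker US
    print(select_model(pt1, models))  # -> "DBC" (smallest average distance)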
The terminal device P1 also calculates the reliability of the calculated similarity based on the distance between the distribution of the learning data of the similarity calculation model used for the similarity calculation and the feature amount of the speaker US (that is, the distance d_XA, d_XB, or d_XC).
The terminal device P1 determines whether the distance between the similarity calculation model used for the similarity calculation and the feature amount of the speaker US is equal to or less than a predetermined value.
If the terminal device P1 determines that the distance between the learning data distribution of the similarity calculation model and the feature amount of the speaker US is equal to or less than the predetermined value, it calculates (evaluates) the reliability of the similarity as "high". On the other hand, if the terminal device P1 determines that the distance is not equal to or less than the predetermined value, it calculates (evaluates) the reliability of the similarity as "low".
For example, in the example shown in FIG. 5, if the distance d_XC between the similarity calculation model DBC and the feature amount PT1 of the speaker US is equal to or less than the predetermined value, the terminal device P1 calculates (evaluates) the reliability of the similarity as "high"; otherwise, it calculates (evaluates) the reliability as "low".
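A short sketch of this reliability evaluation; the patent states only "a predetermined value", so the threshold below is an assumption.

    def reliability(distance_to_model: float, predetermined_value: float = 0.2) -> str:
        # FIG. 5: distance at or below the predetermined value -> "high", otherwise "low".
        return "high" if distance_to_model <= predetermined_value else "low"

    print(reliability(0.07))  # -> "high"
    print(reliability(0.45))  # -> "low"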
Next, the speaker authentication procedure shown in step St19 of FIG. 3 will be described with reference to FIG. 6. FIG. 6 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
Based on the correspondence lists LST, LST1, LST2, the terminal device P1 reads, from the similarity calculation model database DB3, the similarity calculation model used for determining the similarity between the feature amount of the speaker US and the feature amount of one of the plurality of registered speakers (St191).
The terminal device P1 uses the similarity calculation model to calculate the similarity between the feature amount of the speaker US and the feature amount of the registered speaker (St192). The terminal device P1 also calculates the reliability of the calculated similarity based on the distance, calculated using (Equation 1), between the similarity calculation model and the feature amount of the speaker US (St192). The terminal device P1 repeats the process of step St192 until it has calculated the similarity and reliability between the feature amount of the voice data of the speaker US and the feature amounts of all registered speakers registered in the registered speaker database DB2.
The terminal device P1 determines whether any of the calculated similarities is equal to or greater than a threshold (St193).
If the terminal device P1 determines in the process of step St193 that at least one of the calculated similarities is equal to or greater than the threshold (St193, YES), it determines that the registered speaker corresponding to the similarity determined to be equal to or greater than the threshold and the speaker US are the same person. The terminal device P1 identifies the speaker US based on the registered speaker information of this registered speaker (St194). Note that if there are multiple similarities determined to be equal to or greater than the threshold, the terminal device P1 may determine that the registered speaker corresponding to the highest calculated similarity and the speaker US are the same person.
If the terminal device P1 determines in the process of step St193 that none of the calculated similarities is equal to or greater than the threshold (St193, NO), it determines that the speaker US cannot be identified (St195).
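The loop over registered speakers (St192 to St195) could be sketched as follows; the similarity function is passed in (for example, the cosine similarity sketched earlier), and the threshold and data values are illustrative assumptions.

    from typing import Callable, Optional

    def authenticate(speaker_feature: list[float],
                     registered: dict[str, list[float]],
                     similarity: Callable[[list[float], list[float]], float],
                     threshold: float = 0.75) -> Optional[str]:
        # St192: score the speaker against every registered speaker.
        scores = {name: similarity(speaker_feature, feature)
                  for name, feature in registered.items()}
        best = max(scores, key=scores.get)
        # St193-St195: return the best match at or above the threshold, else None.
        return best if scores[best] >= threshold else None

    toy_similarity = lambda a, b: 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    registered = {"XX XX": [0.1, 0.2], "YY YY": [0.9, 0.8]}
    print(authenticate([0.12, 0.18], registered, toy_similarity))  # -> "XX XX"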
The terminal device P1 generates the authentication result screen SC based on the registered speaker information of the identified speaker US. The terminal device P1 outputs the generated authentication result screen SC to the monitor MN for display (St196).
As described above, the terminal device P1 registers the speaker information, the feature amount of the speaker US, and the sound collection conditions of the feature amount of the speaker US in association with one another at the time of voice registration. By registering, in association with the feature amount, the sound collection conditions that change the feature amount indicating the individuality of the speaker US, the terminal device P1 can perform speaker authentication with higher accuracy even when the sound collection conditions of the feature amount at the time of voice registration differ from those at the time of voice authentication, and the feature amounts therefore differ between registration and authentication, by selecting a similarity calculation model based on the sound collection conditions of each feature amount. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by noise included in the authentication voice data.
The terminal device P1 also calculates and displays, based on the similarity calculation model used for calculating the similarity, a reliability indicating the certainty of the speaker identified by the speaker authentication process as indicated by the calculated similarity. Thereby, the terminal device P1 can present the certainty of the speaker authentication result to the administrator viewing the monitor MN. Therefore, by presenting the reliability, the terminal device P1 can inform the administrator that there was no similarity calculation model suited to the similarity calculation and that speaker authentication was performed using the general-purpose similarity calculation model described later.
An example of the correspondence lists LST1 and LST2 will be described with reference to FIGS. 7 and 8. FIG. 7 is a diagram illustrating an example of the correspondence list LST1 when the sound collection condition estimation result at the time of voice registration and the sound collection condition estimation result at the time of voice authentication are the same. FIG. 8 is a diagram illustrating an example of the correspondence list LST2 when the sound collection conditions at the time of voice registration and the sound collection conditions at the time of voice authentication are different.
In FIGS. 7 and 8, to make the explanation easier to understand, an example is described in which the correspondence lists LST1 and LST2 are referred to based on the information on the sound collection conditions associated with the feature amount of any one registered speaker registered in the registered speaker database DB2 and on the sound collection conditions of the authentication voice data of the speaker US who is the speaker authentication target.
In the reference example of the correspondence list LST1 shown in FIG. 7, the voice data at the time of voice registration and at the time of voice authentication have the same sound collection condition, "in-store noise, male, telephone voice".
The terminal device P1 extracts the feature amount of the speaker US from the authentication voice data of the speaker US transmitted from a sound collection device such as the microphone MK. The terminal device P1 refers to the sound collection condition-specific learning database DB4 and, based on the extracted feature amount and the distribution of the learning data for each sound collection condition, determines that the sound collection condition of the feature amount of the speaker US is "in-store noise, male, telephone voice".
Based on the information on the sound collection conditions of the speaker US and the information on the sound collection conditions of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects the similarity calculation model corresponding to the combination of the sound collection conditions of the speaker US and the sound collection conditions of each registered speaker. The terminal device P1 refers to the correspondence list LST1, which associates the information on the sound collection conditions of the speaker US, the information on the sound collection conditions of the plurality of registered speakers, and the selected similarity calculation models with one another.
The correspondence list LST1 is data that associates the determination result "sound collection condition determination result 1" of the sound collection conditions of the feature amount of the speaker US, the determination result "sound collection condition determination result 2" of the sound collection conditions of the registered speakers registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two sound collection conditions.
The sound collection condition determination result "sound collection condition determination result 1" indicates the determination result of the sound collection conditions of the speaker US.
The sound collection condition determination result "sound collection condition determination result 2" indicates the determination result of the sound collection conditions of the registered speakers registered in the registered speaker database DB2. Information on the determination probability corresponding to each sound collection condition determination result is not essential and may be omitted.
The sound collection condition determination result may include a plurality of conditions, such as the gender of the speaker US, the sound collection device, and the noise type.
The similarity calculation model "selected model" includes the similarity calculation models "Model A", "Model B", "Model C", and "Model Z", each selected corresponding to a pair of the sound collection condition determination result "sound collection condition determination result 1" and the sound collection condition determination result "sound collection condition determination result 2".
For example, the similarity calculation model "Model A" is the similarity calculation model determined to be optimal for the process of calculating the similarity between the feature amount of the speaker US and the feature amount of the registered speaker when the sound collection condition of the feature amount of the speaker US is "XX1" and the information on the sound collection condition of the feature amount of the registered speaker is "XX1".
Note that if the similarity calculation model selection unit 114 determines, based on the combination of the sound collection conditions of the feature amount of the speaker US and the sound collection conditions of the feature amount of the registered speaker, that there is no similarity calculation model suited to the similarity calculation process, it selects the general-purpose similarity calculation model "Model Z".
The similarity calculation model selection unit 114 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST1. For example, in the example shown in FIG. 7, the similarity calculation model selection unit 114 selects the similarity calculation model "Model A".
Next, in the reference example of the correspondence list LST2 shown in FIG. 8, the registered voice data at the time of voice registration has the sound collection condition "in-store noise, male, telephone voice". The voice data at the time of voice authentication has a sound collection condition, "outdoor noise, male, headset voice", that differs from the sound collection condition of the registered voice data at the time of voice registration.
The terminal device P1 extracts the feature amount of the speaker US from the authentication voice data of the speaker US transmitted from a sound collection device such as the microphone MK. The terminal device P1 refers to the sound collection condition-specific learning database DB4 and, based on the extracted feature amount and the distribution of the learning data for each sound collection condition, determines that the sound collection condition of the feature amount of the speaker US is "outdoor noise, male, headset voice".
Based on the information on the sound collection conditions of the speaker US and the information on the sound collection conditions of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects the similarity calculation model corresponding to the combination of the sound collection conditions of the speaker US and the sound collection conditions of each registered speaker. The terminal device P1 refers to the correspondence list LST2, which associates the information on the sound collection conditions of the speaker US, the information on the sound collection conditions of the plurality of registered speakers, and the selected similarity calculation models with one another.
 対応リストLST2は、話者USの特徴量の収音条件の判定結果「収音条件判定結果3」と、登録話者データベースDB2に登録された登録話者の収音条件の判定結果「収音条件判定結果4」と、これら2つの収音条件に基づいて選択された類似度計算モデル「選択モデル」とを対応付けたデータである。 The correspondence list LST2 includes the determination result "Sound collection condition determination result 3" of the sound collection condition of the feature quantity of the speaker US, and the determination result "Sound collection condition determination result 3" of the sound collection condition of the registered speaker registered in the registered speaker database DB2. This is data that associates "condition determination result 4" with a similarity calculation model "selected model" selected based on these two sound collection conditions.
 収音条件の判定結果「収音条件判定結果3」は、話者USの収音条件の判定結果を示す。 Sound collection condition determination result "Sound collection condition determination result 3" indicates the determination result of the sound collection condition of the speaker US.
 収音条件の判定結果「収音条件判定結果4」は、登録話者データベースDB2に登録された登録話者の収音条件の判定結果を示す。また、各収音条件の判定結果に対応する判定確率の情報は、必須でなく省略されてもよい。 Sound collection condition determination result "Sound collection condition determination result 4" indicates the determination result of the sound collection conditions of the registered speakers registered in the registered speaker database DB2. Further, information on the determination probability corresponding to the determination result of each sound collection condition is not essential and may be omitted.
 また、収音条件の判定結果は、例えば、話者USの性別、収音装置、ノイズ種類等の複数の条件を含んでよい。 Further, the determination result of the sound collection conditions may include a plurality of conditions such as the gender of the speaker US, the sound collection device, the type of noise, etc., for example.
The similarity calculation model "selected model" includes each of the similarity calculation models "Model D", "Model E", "Model F", and "Model Z" selected in accordance with the determination results "sound collection condition determination result 3" and "sound collection condition determination result 4".
For example, the similarity calculation model "Model D" is the similarity calculation model determined to be optimal for the process of calculating the similarity between the feature amount of the speaker US and the feature amount of a registered speaker when the sound collection condition of the feature amount of the speaker US is "XX1" and the information on the sound collection condition of the registered speaker's feature amount is "XX4".
Note that, when the similarity calculation model selection unit 114 determines, based on the combination of the sound collection condition of the feature amount of the speaker US and the sound collection condition of the feature amount of the registered speaker, that there is no similarity calculation model suited to the similarity calculation process, it selects the similarity calculation model "general-purpose model".
The similarity calculation model selection unit 114 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST2. For example, in the example shown in FIG. 8, the similarity calculation model selection unit 114 selects the similarity calculation model "Model D".
As described above, the terminal device P1 can select the similarity calculation model optimal for the process of calculating the similarity between the two feature amounts (the feature amount of the speaker US and the feature amount of the registered speaker) based on the combination of the sound collection conditions contained in each of the feature amounts subject to the similarity calculation. Consequently, even when the sound collection condition contained in the feature amount at the time of voice registration and the sound collection condition contained in the feature amount at the time of voice authentication have changed, the terminal device P1 can select the similarity calculation model optimal for calculating the similarity between the two feature amounts. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by noise contained in the voice data. When a plurality of candidate conditions exist at the time of the sound collection condition determination, the terminal device P1 may, for example, calculate a similarity using the similarity calculation model corresponding to each candidate, compute the average value, and adopt it as the similarity.
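The averaging behavior mentioned at the end of the preceding paragraph can be sketched as follows; the toy scoring functions are stand-ins for the embodiment's similarity calculation models and are assumptions for illustration.

```python
from statistics import mean

# Minimal sketch: when several candidate conditions remain after the sound
# collection condition determination, score with each candidate's model and
# adopt the average as the similarity. The toy models are illustrative.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def model_for_candidate_1(us, reg):
    return dot(us, reg)

def model_for_candidate_2(us, reg):
    return 0.5 * dot(us, reg)

def averaged_similarity(us_feature, registered_feature, candidate_models):
    """Average the similarity over the models of all candidate conditions."""
    return mean(m(us_feature, registered_feature) for m in candidate_models)

print(averaged_similarity([1.0, 0.5], [0.8, 0.4],
                          [model_for_candidate_1, model_for_candidate_2]))  # 0.75
```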
The correspondence list will now be described concretely with reference to FIG. 9. FIG. 9 is a diagram illustrating a specific example of the correspondence list LST.
Note that, to simplify the explanation, FIG. 9 describes, as in the reference example of the correspondence list LST1 shown in FIG. 7, an example in which the voice data at the time of voice registration and at the time of voice authentication share the same sound collection condition, "in-store noise, male, telephone voice".
The correspondence list LST associates the determination result "sound collection condition determination result AA" for the sound collection condition of the feature amount of the speaker US, the sound collection condition information "sound collection condition determination result BB" of the registered speaker's feature amount, and the selected similarity calculation model "selected model".
The determination result "sound collection condition determination result AA" for the sound collection condition of the feature amount of the speaker US and the sound collection condition information "sound collection condition determination result BB" of the registered speaker's feature amount include, for example, the sound collection conditions "gender", "device (sound quality)", and "noise type". Note that the sound collection conditions shown in FIG. 9 are merely an example, and the present invention is not limited thereto. In addition, in the correspondence list LST, sound collection conditions that are identical between the determination result "sound collection condition determination result AA" and the information "sound collection condition determination result BB" are shown in bold.
The sound collection condition "gender" indicates the gender of the speaker US determined based on the feature amount of the speaker US. The sound collection condition "device (sound quality)" indicates information on the sound collection device determined based on the feature amount of the speaker US. The sound collection condition "noise type" indicates the type of environmental sound, noise, or the like at the time of utterance contained in the feature amount of the speaker US.
The similarity calculation model "selected model" is the similarity calculation model used to calculate the similarity between the feature amount of the speaker US and the feature amount of the registered speaker.
For example, suppose the two sound collection condition determination results indicate that the sound collection condition "gender" is "male", the sound collection condition "device (sound quality)" is "telephone", and the sound collection condition "noise type" is "in-store noise". In such a case, the terminal device P1 selects the similarity calculation model "male telephone in-store noise model", which is suited to calculating the similarity of feature amounts under the sound collection conditions "male", "telephone", and "in-store noise" that are identical between the two sets of conditions.
Also, for example, suppose the two sound collection condition determination results indicate that the sound collection condition "gender" is "male", that the sound collection conditions "device (sound quality)" are "telephone" and "headset", and that the sound collection conditions "noise type" are "in-store noise" and "outdoor noise". In such a case, the terminal device P1 selects the similarity calculation model "male model", which is suited to calculating the similarity of feature amounts under the sound collection condition "male", the condition that is identical or similar between the two sets of conditions.
Also, for example, suppose the two sound collection condition determination results indicate that the sound collection condition "gender" is "female", that the sound collection condition "device (sound quality)" is "headset", and that the sound collection conditions "noise type" are "none (clean voice)" and "in-store noise". In such a case, the terminal device P1 selects the similarity calculation model "female headset model", which is suited to calculating the similarity of feature amounts under the sound collection conditions "female" and "headset" that are identical or similar between the two sets of conditions.
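The three examples above share one pattern: the selected model is the most specific one covering every condition the two determination results have in common. A minimal sketch of that pattern, with a model table whose keys are assumptions mirroring FIG. 9, might read:

```python
# Minimal sketch, assuming a table from shared-condition sets to models that
# mirrors the FIG. 9 examples; keys, names, and the "most specific subset"
# rule are illustrative assumptions.
MODEL_TABLE = {
    frozenset({("gender", "male"), ("device", "telephone"),
               ("noise", "in-store noise")}): "male telephone in-store noise model",
    frozenset({("gender", "male")}): "male model",
    frozenset({("gender", "female"), ("device", "headset")}): "female headset model",
}

def select_by_shared_conditions(cond_us: dict, cond_reg: dict) -> str:
    # Conditions (gender / device / noise type) identical on both sides.
    shared = {(k, v) for k, v in cond_us.items() if cond_reg.get(k) == v}
    matching = [key for key in MODEL_TABLE if key <= shared]
    if not matching:
        return "general-purpose model"  # no suitable specific model
    return MODEL_TABLE[max(matching, key=len)]  # most specific match

print(select_by_shared_conditions(
    {"gender": "male", "device": "telephone", "noise": "in-store noise"},
    {"gender": "male", "device": "headset", "noise": "outdoor noise"},
))  # -> "male model" (only the gender condition is shared)
```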
As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit) that acquires voice data; the feature amount extraction unit 111 (an example of a detection unit) that detects, from the voice data, an utterance section in which the speaker US is speaking; the feature amount extraction unit 111 (an example of an extraction unit) that extracts the feature amount of the speaker US (an example of an utterance feature amount) from the detected utterance section; the similarity calculation model selection unit 114 (an example of a selection unit) that selects, based on the extracted feature amount of the speaker US and the feature amount of at least one registered speaker registered in advance, the first similarity calculation model (an example of a first similarity calculation model, namely the similarity calculation model selected in the process of step St18) to be used for authenticating the speaker US from among the plurality of similarity calculation models (an example of similarity calculation models); and the authentication unit 116 that authenticates the speaker US by comparing the feature amount of the speaker US with the feature amount of the registered speaker using the selected first similarity calculation model.
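For reference, the flow through these units can be sketched as the skeleton below; the method signatures and data types are assumptions, and the bodies are placeholders rather than the embodiment's actual processing.

```python
# Minimal skeleton of the described flow. Unit names follow the embodiment's
# terminology; signatures, types, and the run() orchestration are assumptions.
class VoiceAuthenticationDevice:
    def acquire(self) -> bytes:                          # communication unit 10
        ...

    def detect_utterance(self, audio: bytes):            # feature amount extraction unit 111
        ...

    def extract_feature(self, utterance_section):        # feature amount extraction unit 111
        ...

    def select_model(self, us_feature, registered):      # similarity calculation model selection unit 114
        ...

    def authenticate(self, model, us_feature, registered):  # authentication unit 116
        ...

    def run(self, registered_features: dict):
        audio = self.acquire()
        section = self.detect_utterance(audio)
        us_feature = self.extract_feature(section)
        model = self.select_model(us_feature, registered_features)
        return self.authenticate(model, us_feature, registered_features)
```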
Thereby, the terminal device P1 according to the embodiment selects the similarity calculation model better suited to the process of calculating the similarity between the two feature amounts (the feature amount of the speaker US and the feature amount of the registered speaker), and, by using the selected similarity calculation model, can perform speaker authentication based on the two feature amounts with higher accuracy. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by the difference in noise (environmental noise) between the time of voice registration and the time of voice authentication.
The terminal device P1 according to the embodiment further includes the sound collection condition determination unit 112 (an example of a determination unit) that determines, based on a feature amount, the sound collection condition of the uttered voice corresponding to that feature amount. The sound collection condition determination unit 112 selects the first similarity calculation model based on the sound collection condition corresponding to the feature amount of the speaker US and the sound collection condition corresponding to the feature amount of the registered speaker. Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model better suited to the comparison of the feature amounts based on the combination of the sound collection condition at the time of voice registration and the sound collection condition at the time of voice authentication. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy.
In addition, the plurality of similarity calculation models in the terminal device P1 according to the embodiment are generated using at least one piece of learning data under a predetermined sound collection condition. The sound collection condition determination unit 112 determines the sound collection condition of the speaker US or a registered speaker based on the distance between the feature amount of the speaker US or the registered speaker and each of the plurality of similarity calculation models. Thereby, the terminal device P1 according to the embodiment can determine, with higher accuracy, the sound collection condition corresponding to each feature amount based on the distance between the feature amount of the speaker US or the registered speaker and the similarity calculation models generated using the learning data for each sound collection condition.
In addition, the sound collection condition determination unit 112 in the terminal device P1 according to the embodiment selects the second similarity calculation model (an example of a second similarity calculation model, namely the similarity calculation model selected in the process of step St154) having the shortest distance to the feature amount of the speaker US or the registered speaker among the plurality of similarity calculation models, and determines the sound collection condition of the speaker US or the registered speaker based on the sound collection condition corresponding to the selected second similarity calculation model. Thereby, when calculating the similarity with the feature amount of the speaker US or the registered speaker, the terminal device P1 according to the embodiment selects the similarity calculation model whose characteristics are the closest, and can therefore determine, with higher accuracy, the sound collection condition corresponding to the feature amount of the speaker US or the registered speaker.
In addition, the terminal device P1 according to the embodiment further includes the authentication unit 116 (an example of a calculation unit) that calculates the similarity between the feature amount of the voice data of the utterance section and the feature amount of each of the plurality of registered speakers. The authentication unit 116 authenticates the speaker US based on the plurality of calculated similarities. Thereby, the terminal device P1 according to the embodiment can perform speaker authentication using the similarities between the feature amounts of the plurality of registered speakers registered in advance and the feature amount of the speaker US.
In addition, the authentication unit 116 in the terminal device P1 according to the embodiment identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker US. Thereby, the terminal device P1 according to the embodiment can perform speaker authentication using the similarities between the feature amounts of the plurality of registered speakers registered in advance and the feature amount of the speaker US.
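A minimal sketch of this threshold rule follows; cosine similarity and the threshold value are illustrative assumptions, since the embodiment's score comes from the selected first similarity calculation model.

```python
import numpy as np

# Minimal sketch of threshold-based identification. Cosine similarity and
# THRESHOLD = 0.8 are illustrative assumptions.
THRESHOLD = 0.8

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(us_feature: np.ndarray, registered: dict):
    """Return the best-matching registered speaker, or None when no
    similarity is equal to or greater than the threshold."""
    scores = {name: cosine(us_feature, feat) for name, feat in registered.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= THRESHOLD else None

registered = {"speaker_a": np.array([0.9, 0.1]), "speaker_b": np.array([0.2, 0.8])}
print(identify(np.array([0.88, 0.12]), registered))  # -> speaker_a
```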
In addition, the authentication unit 116 in the terminal device P1 according to the embodiment generates and outputs the authentication result screen SC, which includes information on the registered speaker whose similarity is equal to or greater than the threshold. Thereby, the terminal device P1 according to the embodiment can present the speaker authentication result to the speaker US or an administrator.
In addition, when the authentication unit 116 in the terminal device P1 according to the embodiment determines that none of the plurality of calculated similarities is equal to or greater than the threshold, it determines that the speaker US cannot be identified. Thereby, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy and can more effectively suppress erroneous authentication of the speaker US.
In addition, the terminal device P1 according to the embodiment further includes the authentication unit 116, which calculates the similarity between the feature amount of the voice data of the utterance section and the feature amount of each of the plurality of registered speakers, and the reliability calculation unit 115 (an example of a reliability calculation unit), which calculates the reliability of the similarity. The reliability calculation unit 115 calculates the reliability of the similarity based on the distance between the feature amount of the speaker US and the second similarity calculation model. Thereby, the terminal device P1 according to the embodiment can calculate the reliability of the similarity, of the sound collection condition determined in the course of the similarity calculation process, of the first similarity calculation model used for the similarity calculation, and so on.
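One conceivable realization, assuming the second similarity calculation model can be summarized by a representative vector, is a reliability that decreases monotonically with the distance, for example 1/(1+d); this mapping is an assumption for illustration only, as the disclosure specifies only that the reliability is computed from the distance.

```python
import numpy as np

# Minimal sketch: reliability shrinking as the distance between the feature
# amount of the speaker US and the second similarity calculation model grows.
# The representative vector and the 1/(1+d) mapping are assumptions.
def reliability(us_feature: np.ndarray, model_representative: np.ndarray) -> float:
    d = float(np.linalg.norm(us_feature - model_representative))
    return 1.0 / (1.0 + d)  # 1.0 at distance 0, approaching 0 far away

print(reliability(np.array([0.2, 0.7]), np.array([0.1, 0.8])))  # ~0.876
```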
In addition, the authentication unit 116 in the terminal device P1 according to the embodiment identifies a registered speaker whose similarity is equal to or greater than the threshold as the speaker US, and generates and outputs the authentication result screen SC, which includes information on the registered speaker whose similarity is equal to or greater than the threshold and information on the calculated reliability. Thereby, by displaying the speaker authentication result and the reliability of the speaker authentication result, the terminal device P1 according to the embodiment can prompt the administrator to confirm whether the speaker authentication result is reliable.
In addition, in the terminal device P1 according to the embodiment, the sound collection condition includes at least one of the gender of the speaker US, the age of the speaker US, the language of the speaker US, the sound collection device with which the uttered voice was collected, and the type of noise contained in the uttered voice. Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model based on the sound collection conditions that alter the feature amounts used for speaker authentication and that can be a factor in decreasing speaker authentication accuracy. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy.
While various embodiments have been described above with reference to the drawings, it goes without saying that the present disclosure is not limited to these examples. It is obvious that those skilled in the art can conceive of various alterations, modifications, substitutions, additions, deletions, and equivalents within the scope described in the claims, and it is understood that these naturally fall within the technical scope of the present disclosure. Furthermore, the constituent elements of the various embodiments described above may be combined in any manner without departing from the gist of the invention.
This application is based on a Japanese patent application filed on March 22, 2022 (Japanese Patent Application No. 2022-045391), the contents of which are incorporated herein by reference.
The present disclosure is useful as a voice authentication device and a voice authentication method capable of suppressing a decrease in speaker authentication accuracy caused by changes in environmental noise.
10 Communication unit
11 Processor
12 Memory
100 Voice authentication system
111 Feature amount extraction unit
112 Sound collection condition determination unit
113 Speaker registration unit
114 Similarity calculation model selection unit
115 Reliability calculation unit
116 Authentication unit
DB1 Feature amount extraction model database
DB2 Registered speaker database
DB3 Similarity calculation model database
DB4 Sound-collection-condition-specific learning database
MK Microphone
MN Monitor
P1 Terminal device
SC Authentication result screen
US Speaker

Claims (12)

1.  A voice authentication device comprising:
    an acquisition unit that acquires voice data;
    a detection unit that detects, from the voice data, an utterance section in which a speaker is speaking;
    an extraction unit that extracts an utterance feature amount of the speaker from the detected utterance section;
    a selection unit that selects, based on the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker registered in advance, a first similarity calculation model to be used for authenticating the speaker from among a plurality of similarity calculation models; and
    an authentication unit that authenticates the speaker by comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model.
2.  The voice authentication device according to claim 1, further comprising a determination unit that determines, based on an utterance feature amount, a sound collection condition of an uttered voice corresponding to the utterance feature amount,
    wherein the selection unit selects the first similarity calculation model based on the sound collection condition corresponding to the utterance feature amount of the speaker and the sound collection condition corresponding to the utterance feature amount of the registered speaker.
3.  The voice authentication device according to claim 2, wherein the plurality of similarity calculation models are generated using at least one piece of learning data under a predetermined sound collection condition, and
    the determination unit determines the sound collection condition of the speaker or the registered speaker based on a distance between the utterance feature amount of the speaker or the registered speaker and each of the plurality of similarity calculation models.
4.  The voice authentication device according to claim 3, wherein the determination unit selects a second similarity calculation model having the shortest distance to the utterance feature amount of the speaker or the registered speaker among the plurality of similarity calculation models, and determines the sound collection condition of the speaker or the registered speaker based on the sound collection condition corresponding to the selected second similarity calculation model.
5.  The voice authentication device according to claim 1, further comprising a calculation unit that calculates a similarity between the utterance feature amount of the voice data of the utterance section and the utterance feature amount of each of a plurality of registered speakers,
    wherein the authentication unit authenticates the speaker based on the plurality of calculated similarities.
6.  The voice authentication device according to claim 5, wherein the authentication unit identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker.
7.  The voice authentication device according to claim 6, wherein the authentication unit generates and outputs an authentication result screen including information on the registered speaker whose similarity is equal to or greater than the threshold.
8.  The voice authentication device according to claim 6, wherein the authentication unit determines that the speaker cannot be identified when determining that none of the plurality of calculated similarities is equal to or greater than the threshold.
9.  The voice authentication device according to claim 4, further comprising:
    a calculation unit that calculates a similarity between the utterance feature amount of the voice data of the utterance section and the utterance feature amount of each of a plurality of registered speakers; and
    a reliability calculation unit that calculates a reliability of the similarity,
    wherein the reliability calculation unit calculates the reliability of the similarity based on the distance between the utterance feature amount of the speaker and the second similarity calculation model.
10.  The voice authentication device according to claim 9, wherein the authentication unit identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker, and generates and outputs an authentication result screen including information on the registered speaker whose similarity is equal to or greater than the threshold and information on the calculated reliability.
11.  The voice authentication device according to claim 2, wherein the sound collection condition includes at least one of a gender of the speaker, an age of the speaker, a language of the speaker, a sound collection device with which the uttered voice was collected, and a type of noise contained in the uttered voice.
12.  A voice authentication method performed by a terminal device, the method comprising:
    acquiring voice data;
    detecting, from the voice data, an utterance section in which a speaker is speaking;
    extracting an utterance feature amount of the speaker from the detected utterance section;
    selecting, based on the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker registered in advance, a first similarity calculation model to be used for authenticating the speaker from among a plurality of similarity calculation models; and
    authenticating the speaker by comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model.
PCT/JP2023/009467 2022-03-22 2023-03-10 Voice authentication device and voice authentication method WO2023182014A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-045391 2022-03-22
JP2022045391 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023182014A1 true WO2023182014A1 (en) 2023-09-28

Family

ID=88101357

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/009467 WO2023182014A1 (en) 2022-03-22 2023-03-10 Voice authentication device and voice authentication method

Country Status (1)

Country Link
WO (1) WO2023182014A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008070596A (en) * 2006-09-14 2008-03-27 Yamaha Corp Voice authentication apparatus, voice authentication method, and program
JP2009003162A (en) * 2007-06-21 2009-01-08 Panasonic Corp Strained voice detector
US20190295553A1 (en) * 2018-03-21 2019-09-26 Hyundai Mobis Co., Ltd. Apparatus for recognizing voice speaker and method for the same
WO2021192719A1 (en) * 2020-03-27 2021-09-30 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Speaker identification method, speaker identification device, speaker identification program, sex identification model generation method, and speaker identification model generation method


Similar Documents

Publication Publication Date Title
CN105741836B (en) Voice recognition device and voice recognition method
US10204619B2 (en) Speech recognition using associative mapping
US10607597B2 (en) Speech signal recognition system and method
US9443511B2 (en) System and method for recognizing environmental sound
US9959863B2 (en) Keyword detection using speaker-independent keyword models for user-designated keywords
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
US10157272B2 (en) Systems and methods for evaluating strength of an audio password
US9601112B2 (en) Speech recognition system and method using incremental device-based acoustic model adaptation
WO2019080639A1 (en) Object identifying method, computer device and computer readable storage medium
US20190392858A1 (en) Intelligent voice outputting method, apparatus, and intelligent computing device
US20160118039A1 (en) Sound sample verification for generating sound detection model
CN104217149A (en) Biometric authentication method and equipment based on voice
JP2019053126A (en) Growth type interactive device
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
KR20210044475A (en) Apparatus and method for determining object indicated by pronoun
US9224388B2 (en) Sound recognition method and system
JP6676009B2 (en) Speaker determination device, speaker determination information generation method, and program
US10997972B2 (en) Object authentication device and object authentication method
WO2023182014A1 (en) Voice authentication device and voice authentication method
US11107476B2 (en) Speaker estimation method and speaker estimation device
US20180285643A1 (en) Object recognition device and object recognition method
KR101840363B1 (en) Voice recognition apparatus and terminal device for detecting misprononced phoneme, and method for training acoustic model
WO2023182016A1 (en) Voice authentication device and voice authentication method
US11437044B2 (en) Information processing apparatus, control method, and program
KR20210063698A (en) Electronic device and method for controlling the same, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23774613

Country of ref document: EP

Kind code of ref document: A1