WO2023182016A1 - Voice authentication device and voice authentication method - Google Patents

Voice authentication device and voice authentication method

Info

Publication number
WO2023182016A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
noise
registered
similarity
authentication
Prior art date
Application number
PCT/JP2023/009469
Other languages
French (fr)
Japanese (ja)
Inventor
正成 宮本
Original Assignee
Panasonic Intellectual Property Management Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co., Ltd.
Publication of WO2023182016A1 publication Critical patent/WO2023182016A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/20Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Definitions

  • the present disclosure relates to a voice authentication device and a voice authentication method.
  • Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice.
  • The speech recognition device stores a plurality of motion noise models, each created for and associated with one of a plurality of motions, detects input speech including the subject's voice, specifies the subject's motion, and reads out the motion noise model corresponding to the specified motion.
  • The speech recognition device reads an environmental noise model corresponding to the subject's current position, synthesizes it with the read motion noise model, and uses the synthesized noise superimposition model to recognize the subject's voice contained in the detected input speech.
  • In Patent Document 1, it is necessary to collect in advance both the motion noise generated by each of a plurality of motions and the environmental noise at each of a plurality of positions where voice recognition can be performed, which is very time-consuming. Furthermore, in voiceprint authentication, the feature quantity indicating the individuality of the authentication target (person), extracted from a voice signal, changes depending on the noise contained in that voice signal. Therefore, when voiceprint authentication is performed using the voice recognition device described above, if the noise contained in the pre-registered voice signal differs from the noise contained in the voice signal collected at the time of voiceprint authentication, the features extracted from the two voice signals do not indicate the individuality of the same person, and the accuracy of voiceprint authentication may decrease.
  • the present disclosure was devised in view of the above-mentioned conventional situation, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
  • The present disclosure provides a voice authentication device including: an acquisition unit that acquires audio data; a detection unit that detects, from the audio data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; an extraction unit that extracts the utterance feature amount of the speech section and the noise included in the non-speech section of the audio data; a selection unit that selects one similarity calculation model from a plurality of similarity calculation models based on the extracted noise and the noise associated with the registered feature amounts of a plurality of pre-registered speakers; and an authentication unit that authenticates the speaker by using the selected similarity calculation model to compare the utterance feature amount of the speaker with the registered feature amount of a registered speaker.
  • The present disclosure also provides a voice authentication method performed by a terminal device, including: acquiring voice data; detecting, from the voice data, a speech section in which the speaker is speaking and a non-speech section in which the speaker is not speaking; extracting the utterance feature amount of the speech section and the noise included in the non-speech section of the voice data; selecting one similarity calculation model based on the extracted noise and the noise associated with the registered feature amounts of a plurality of pre-registered speakers; and authenticating the speaker by using the selected similarity calculation model to compare the utterance feature amount of the speaker with the registered feature amount of a registered speaker.
  • Block diagram showing an example of the internal configuration of a voice authentication system in the embodiment
  • Diagram illustrating each process performed by a processor of a terminal device in the embodiment
  • Flowchart showing an example of the operation procedure of the terminal device in the embodiment
  • Flowchart showing an example of the speaker authentication procedure of the terminal device in the embodiment
  • FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system 100 according to an embodiment.
  • FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
  • The voice authentication system 100 includes a terminal device P1 as an example of a voice authentication device, a noise determination device P2, and a network NW. Note that the voice authentication system 100 may further include the microphone MK or the monitor MN.
  • the microphone MK picks up the voice uttered by the speaker US in order to register the voice in the terminal device P1 in advance.
  • the microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data that is registered in the terminal device P1.
  • The microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • the microphone MK picks up the voice uttered by the speaker US, which is used for speaker authentication.
  • the microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data.
  • The microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • Hereinafter, the voice data for voice registration or already registered in the terminal device P1 is referred to as "registered voice data", and the voice data for voice authentication is referred to as "authentication voice data", to differentiate between them.
  • the microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, a tablet terminal, or the like. Further, the microphone MK may transmit an audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown).
  • The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes a voice registration process using the registered voice data of the speaker US and a speaker authentication process using the authentication voice data. It includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a noise-similarity calculation model correspondence list DB4.
  • The communication unit 10, which is an example of the acquisition unit, is connected to the microphone MK, the monitor MN, and the noise determination device P2 so that data can be transmitted and received by wired or wireless communication.
  • the wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark).
  • the communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as a Universal Serial Bus (USB). Furthermore, the communication unit 10 may perform data transmission and reception with the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
  • The processor 11 is configured using, for example, a central processing unit (CPU) or a field programmable gate array (FPGA), and performs various processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12 and executes the program, thereby implementing the functions of the noise extraction unit 111, the feature amount extraction unit 112, the noise determination unit 113, the speaker registration unit 114, the similarity calculation model selection unit 115, the reliability calculation unit 116, and the authentication unit 117.
  • When registering the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature extraction unit 112, the noise determination unit 113, and the speaker registration unit 114, thereby executing a process of newly registering (storing) the feature amount of the speaker US in the registered speaker database DB2.
  • the feature amount here is a feature amount indicating the individuality of the speaker US extracted from the registered voice data.
  • When authenticating the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature amount extraction unit 112, the noise determination unit 113, the similarity calculation model selection unit 115, the reliability calculation unit 116, and the authentication unit 117, thereby executing the speaker authentication process.
  • The noise extraction unit 111, which is an example of a detection unit and an extraction unit, acquires the registered voice data or the authentication voice data of the speaker US transmitted from the microphone MK.
  • The noise extraction unit 111 detects, from the registered voice data or the authentication voice data, the utterance section in which the speaker US is speaking and the section in which the speaker US is not speaking (hereinafter referred to as the "non-speech section").
  • the noise extraction unit 111 extracts the noise included in the detected non-speech period, and outputs the extracted noise data (hereinafter referred to as “noise data”) to the noise determination unit 113.
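The split into speech and non-speech sections can be pictured with a simple frame-energy rule. This is a generic voice-activity-detection stand-in, not the method disclosed here; the frame length, threshold, and signal values are arbitrary assumptions for illustration.

```python
import numpy as np

# Minimal energy-based sketch of separating speech from non-speech sections,
# standing in for the noise extraction unit 111. Frame size and threshold
# are illustrative assumptions, not values from this disclosure.
def split_sections(samples, frame_len=160, threshold=0.01):
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    speech, noise = [], []
    for f in frames:
        energy = float(np.mean(np.square(f)))
        # high-energy frames are treated as speech, the rest as noise
        (speech if energy >= threshold else noise).append(f)
    return speech, noise

# Example: a loud speech-like burst surrounded by low-level background noise.
rng = np.random.default_rng(0)
sig = np.concatenate([rng.normal(0, 0.01, 800),   # non-speech (background)
                      rng.normal(0, 0.5, 800),    # speech-like burst
                      rng.normal(0, 0.01, 800)])  # non-speech again
speech, noise = split_sections(sig)
```

The noise frames collected this way would then be handed to the noise determination step, while the speech frames feed feature extraction.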
  • The feature extraction unit 112, which is an example of a detection unit, acquires the registered voice data or the authentication voice data of the speaker US transmitted from the microphone MK.
  • The feature extraction unit 112 detects a speech section from the registered voice data or the authentication voice data, and uses the feature extraction model to extract, from the detected speech section, a feature amount indicating the individuality of the speaker US.
  • When a control command requesting registration of the feature amount of the speaker US and the registered speaker information of the speaker US are associated with the registered voice data of the speaker US transmitted from the microphone MK, the feature amount extraction unit 112 proceeds to the voice registration process of the speaker US.
  • the feature amount extraction unit 112 outputs the extracted feature amount of the speaker US to the speaker registration unit 114.
  • the feature extraction unit 112 executes speaker authentication processing when a control command requesting speaker authentication is associated with the authentication voice data of the speaker US transmitted from the microphone MK.
  • the feature amount extraction unit 112 outputs the extracted feature amount of the speaker US to the authentication unit 117.
  • the noise determination unit 113 acquires the noise data output from the noise extraction unit 111.
  • the noise determination unit 113 transmits noise data to the noise determination device P2 via the network NW, and determines the type of noise included in the registered voice data or authentication voice data of the speaker US.
  • The noise referred to here is sound picked up due to the environment (background) at the time of sound collection, and includes, for example, surrounding voices, music, the sound of vehicles running, the sound of wind, and the like.
  • the noise type indicates, for example, the environment (place) or position where the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or inside a station.
  • the noise type may further include information on time zones such as early morning, daytime, and nighttime, for example.
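As a rough illustration, a noise type determination result carrying the place-derived type, optional time-zone information, and an optional determination probability (mentioned again with FIG. 5) could look as follows; the field names and values are assumptions, not the actual data format.

```python
# Illustrative shape of a noise type determination result returned by the
# noise determination device P2. Field names and values are assumptions.
noise_type_info = {
    "type": "in-store noise",   # environment (place) where the noise occurs
    "time_zone": "daytime",     # e.g. early morning, daytime, or nighttime
    "probability": 0.87,        # determination probability (may be omitted)
}
```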
  • In the voice registration process of the speaker US, the noise determination unit 113 outputs noise type information corresponding to the noise type determination result transmitted from the noise determination device P2 to the speaker registration unit 114. In the speaker authentication process, the noise determination unit 113 outputs noise type information corresponding to the noise type determination result transmitted from the noise determination device P2 to the similarity calculation model selection unit 115.
  • The speaker registration unit 114 acquires the feature amount of the speaker US output from the feature amount extraction unit 112 and the noise type information of the noise included in the registered voice data of the speaker US output from the noise determination unit 113.
  • the speaker registration unit 114 associates the feature amount of the speaker US, noise type information, and speaker information of the speaker US, and registers them in the registered speaker database DB2.
  • the speaker information may be extracted from the registered speech data by voice recognition, or may be obtained from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, a tablet terminal).
  • the speaker information here includes, for example, identification information that can identify the speaker US, the name of the speaker US, speaker identification (ID), and the like.
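A record in the registered speaker database DB2 therefore ties together three pieces of information. The sketch below shows one plausible shape for such a record; the class, field names, and example values are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass
from typing import List

# Sketch of a registered speaker database (DB2) record, associating the
# feature amount, the noise type of the registration-time voice data, and
# the speaker information. Names and values are illustrative assumptions.
@dataclass
class RegisteredSpeaker:
    speaker_id: str          # speaker identification (ID)
    name: str                # name of the registered speaker
    features: List[float]    # feature amount indicating individuality
    noise_type: str          # noise type of the registration-time voice data

# Registration associates all three pieces of information.
registered_speaker_db = []
registered_speaker_db.append(
    RegisteredSpeaker("spk-001", "Mr. XX", [0.2, 0.9, 0.4], "in-store noise"))
```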
  • The similarity calculation model selection unit 115, which is an example of the selection unit, calculates (evaluates) the similarity between the noise type associated with the feature amount of each of the plurality of registered speakers registered in the registered speaker database DB2 and the noise type of the noise extracted from the voice data of the speaker US. The similarity calculation model selection unit 115 selects a similarity calculation model stored in the similarity calculation model database DB3 based on the calculated similarity.
  • the similarity calculation model selected here is a model that is more suitable or optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of any registered speaker.
  • Specifically, the similarity calculation model selection unit 115 refers to the correspondence lists LST1 and LST2 (see FIGS. 5 and 6), in which the noise type information corresponding to the voice data of the speaker US, the noise type information corresponding to each of the plurality of registered speakers, and the information on the similarity calculation model selected for that combination of noise types are associated with one another, selects one of the plurality of similarity calculation models, and outputs the selection to each of the reliability calculation unit 116 and the authentication unit 117.
  • The reliability calculation unit 116 calculates (evaluates) the reliability (score) indicating the certainty of the identification result of the speaker US based on the similarity calculated by the authentication unit 117.
  • The reliability calculation unit 116 calculates the reliability based on the similarity calculated by the authentication unit 117, the noise type information used in the similarity calculation process, the similarity calculation model, and the like, and outputs the calculated reliability information to the authentication unit 117. Note that the reliability calculation process by the reliability calculation unit 116 is not essential and may be omitted.
  • The authentication unit 117, which is an example of a calculation unit, acquires the feature amount of the speaker US output from the feature amount extraction unit 112, and acquires the feature amount of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • the authentication unit 117 also obtains the selected models of the correspondence lists LST1 and LST2 output from the similarity calculation model selection unit 115.
  • the authentication unit 117 uses a similarity calculation model based on the correspondence lists LST1 and LST2 to calculate the similarity between the feature amounts of each of the plurality of registered speakers and the feature amount of the speaker US.
  • the authentication unit 117 identifies the speaker US based on the calculated similarity.
  • the authentication unit 117 generates an authentication result screen SC based on the speaker information of the identified speaker US, and transmits it to the monitor MN.
  • The memory 12 includes, for example, a random access memory (hereinafter referred to as "RAM") as a work memory used when executing each process of the processor 11, and a read-only memory (hereinafter referred to as "ROM") that stores programs and data that define the operation of the processor 11.
  • Data or information generated or acquired by the processor 11 is temporarily stored in the RAM.
  • a program that defines the operation of the processor 11 is written in the ROM.
  • The feature extraction model database DB1 is a so-called storage, and is configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter referred to as "HDD"), or a Solid State Drive (hereinafter referred to as "SSD").
  • The feature extraction model database DB1 stores a feature extraction model capable of detecting the utterance section of the speaker US from the registered voice data or the authentication voice data and extracting the feature amount of the speaker US.
  • the feature extraction model is, for example, a learning model generated by learning using deep learning or the like.
  • the registered speaker database DB2 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the registered speaker database DB2 includes feature quantities of each of a plurality of registered speakers registered in advance, information on the noise type of noise included in the registered voice data from which the feature quantities were extracted, and registered speaker information. Store information in association with other information.
  • the similarity calculation model database DB3 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the similarity calculation model database DB3 stores a similarity calculation model that can calculate the similarity between two feature amounts.
  • the similarity calculation model is, for example, a learning model generated by learning using deep learning or the like.
  • a similarity calculation model is one in which dimensions in which individuality is likely to be expressed are learned in advance and retained in order to calculate the similarity between two multidimensional vectors with high precision.
  • The method of calculating the similarity using a model is one example of calculating the similarity between vectors; known techniques such as Euclidean distance or cosine similarity may also be used.
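The two known vector-similarity techniques named above can be written directly for feature vectors; this is standard linear algebra, independent of any particular similarity calculation model in this disclosure.

```python
import numpy as np

# Cosine similarity and Euclidean distance between two feature vectors,
# the known vector-similarity techniques mentioned in the description.
def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

# Identical vectors have cosine similarity 1.0 and distance 0.0.
same = cosine_similarity([0.2, 0.9, 0.4], [0.2, 0.9, 0.4])
```

A learned similarity calculation model would replace such a fixed measure with one that weights the dimensions where individuality is likely to be expressed.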
  • the noise-similarity calculation model correspondence list DB4 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the noise-similarity calculation model correspondence list DB4 stores similarity calculation models used in similarity calculation processing for each combination of noise types.
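The correspondence list can be pictured as a table keyed by the pair of noise types. The entries below, and the fallback to the "universal model" referred to later in this description, are illustrative assumptions rather than the actual stored table.

```python
# Illustrative sketch of the noise-similarity calculation model
# correspondence list (DB4): each pair of noise types maps to a model name.
# Entries and the fallback are assumptions, not the actual table.
CORRESPONDENCE_LIST = {
    ("in-store noise", "in-store noise"): "model A",
    ("in-store noise", "outdoor wind noise"): "model B",
    ("inside a station", "in-store music"): "model C",
}

def select_model(auth_noise_type, registered_noise_type):
    # fall back to a general-purpose model when no pair-specific entry exists
    return CORRESPONDENCE_LIST.get((auth_noise_type, registered_noise_type),
                                   "universal model")
```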
  • The monitor MN is configured using a display such as a Liquid Crystal Display (LCD) or an organic Electroluminescence (EL) display, for example.
  • the monitor MN displays the authentication result screen SC output from the terminal device P1.
  • The authentication result screen SC is a screen that notifies the administrator (for example, the person viewing the monitor MN) of the speaker authentication result, and contains the authentication result information "Matched Mr. XX's voice." and the reliability information "Reliability: High".
  • the authentication result screen SC may also include other registered speaker information (for example, a face image, etc.). Further, the authentication result screen SC does not need to include reliability information.
  • the network NW connects the terminal device P1 and the noise determination device P2 to enable data communication.
  • the noise determination device P2 may not only be connected to the terminal device P1 via the network NW, but may also be a part of the terminal device P1.
  • the noise determination device P2 acquires the noise extracted from the registered voice data of the speaker US or the authentication voice data transmitted from the terminal device P1.
  • the noise determination device P2 determines the type of noise based on the acquired noise.
  • the noise determination device P2 transmits noise type information to the terminal device P1.
  • FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
  • the terminal device P1 acquires audio data from the microphone MK (St11).
  • the microphone MK may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
  • the terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
  • If the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines that the feature amount of the speaker US is to be newly registered in the registered speaker database DB2 (St12, YES), and extracts the noise included in the non-speech section from the voice data (registered voice data) (St13).
  • the noise referred to here is the noise included in the voice data (registered voice data or authentication voice data), and is the surrounding environmental sound, noise, etc. when the utterance voice of the speaker US is collected.
  • In step St12, if the control command is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines that the feature amount of the speaker US is not to be newly registered in the registered speaker database DB2 (St12, NO), and extracts the noise included in the non-speech section of the voice data (authentication voice data) (St14).
  • the terminal device P1 associates a control command requesting determination of the extracted noise type with the noise, and transmits the extracted noise to the noise determination device P2.
  • the terminal device P1 executes noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St15).
  • the terminal device P1 extracts a feature amount indicating the individuality of the speaker US from the utterance section of the registered voice data (St16).
  • the feature amounts extracted from the utterance sections of the registered voice data and the authentication voice data include a feature amount indicating the individuality of the speaker US and a noise feature amount.
  • the terminal device P1 associates the feature amount of the speaker US extracted from the registered voice data, noise type information, and speaker information and registers them in the registered speaker database DB2 (St17).
  • The terminal device P1 further extracts the noise from the non-speech section of the authentication voice data.
  • the terminal device P1 associates the extracted noise with a control command requesting a noise type determination, and transmits the extracted noise to the noise determination device P2.
  • the terminal device P1 executes noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St18).
  • The terminal device P1 extracts the feature amount of the speaker US from the utterance section of the authentication voice data of the speaker US (St19). Further, the terminal device P1 acquires the feature amount of each of the plurality of registered speakers registered in the registered speaker database DB2 (St20), and executes the speaker authentication process (St21).
  • FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
  • In the speaker authentication process, the terminal device P1 selects, for each registered speaker, a similarity calculation model suitable for calculating the similarity between the feature amount of the speaker US and the feature amount of each of the plurality of registered speakers.
  • The terminal device P1 refers to the correspondence lists LST1 and LST2 (FIGS. 5 and 6).
  • Using the referenced correspondence lists LST1 and LST2, the terminal device P1 reads, from the similarity calculation model database DB3, the similarity calculation model for calculating the similarity between the feature amount of the speaker US and the feature amount of one of the plurality of registered speakers (St211).
  • The terminal device P1 uses the similarity calculation model to calculate the similarity between the feature amount of the voice data of the speaker US and the feature amount of one of the plurality of registered speakers registered in the registered speaker database DB2 (St212).
  • The terminal device P1 repeatedly executes the process of step St212 until it has calculated the similarity between the feature amount of the voice data of the speaker US and the feature amount of every registered speaker registered in the registered speaker database DB2.
  • the terminal device P1 determines whether there is a degree of similarity greater than or equal to a threshold value among the calculated degrees of similarity (St213).
  • If the terminal device P1 determines in step St213 that there is a similarity equal to or greater than the threshold among the calculated similarities (St213, YES), it identifies the speaker US based on the registered speaker information corresponding to that similarity (St214). Note that if multiple similarities are determined to be equal to or greater than the threshold, the terminal device P1 may identify the speaker US based on the registered speaker information corresponding to the highest calculated similarity.
  • If the terminal device P1 determines in step St213 that there is no similarity equal to or greater than the threshold among the calculated similarities (St213, NO), it determines that the speaker US cannot be identified (St215).
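Steps St212 through St215 amount to a thresholded best-match search over the registered speakers. The sketch below uses cosine similarity as a stand-in for the selected similarity calculation model; the speaker names, feature vectors, and threshold are illustrative assumptions.

```python
import numpy as np

# Sketch of steps St212-St215: score every registered speaker, then identify
# the one with the highest similarity at or above the threshold, or report
# that the speaker cannot be identified. Cosine similarity stands in for the
# selected similarity calculation model.
def identify(auth_features, registered, threshold=0.8):
    def cos(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # St212 (repeated): similarity with every registered speaker
    scores = {name: cos(auth_features, feats)
              for name, feats in registered.items()}
    # St213: is any similarity at or above the threshold?
    above = {name: s for name, s in scores.items() if s >= threshold}
    if not above:
        return None                 # St215: speaker cannot be identified
    # St214: if several pass, take the one with the highest similarity
    return max(above, key=above.get)

registered = {"speaker A": [1.0, 0.0, 0.1], "speaker B": [0.0, 1.0, 0.2]}
```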
  • the terminal device P1 generates an authentication result screen SC based on the registered speaker information of the identified speaker US.
  • the terminal device P1 outputs the generated authentication result screen SC to the monitor MN for display (St216).
  • At the time of voice registration, the terminal device P1 registers the speaker information, the feature amount of the speaker US, and the noise type information of the voice data of the speaker US in association with each other.
  • Thereby, the terminal device P1 can select a similarity calculation model suitable for each noise type. Therefore, by using the selected similarity calculation model, the terminal device P1 can more accurately calculate the similarity between the feature amount of the speaker US, which includes different noise, and the feature amount of the registered speaker. This makes it possible to more effectively suppress a decrease in speaker authentication accuracy caused by the noise contained in the authentication voice data.
  • The terminal device P1 calculates and displays the reliability, which indicates the likelihood of the speaker identified by the speaker authentication process, based on the calculated similarity and the similarity calculation model used to calculate it. Thereby, the terminal device P1 can present the certainty of the speaker authentication result to the administrator viewing the monitor MN. For example, by presenting the reliability, the terminal device P1 can inform the administrator that there was no similarity calculation model suitable for calculating the similarity and that speaker authentication was performed using the similarity calculation model "universal model", which will be described later.
  • FIG. 5 is a diagram illustrating an example of the correspondence list LST1 when the noise type at the time of voice registration and the noise type at the time of voice authentication are the same.
  • FIG. 6 is a diagram illustrating an example of the correspondence list LST2 when the noise type at the time of voice registration and the noise type at the time of voice authentication are different.
  • An example will be described in which the correspondence lists LST1 and LST2 are referred to based on the noise type information associated with the feature amount of any one registered speaker registered in the registered speaker database DB2 and the noise type of the noise included in the authentication voice data of the speaker US, who is the target of speaker authentication.
  • each of the voice data at the time of voice registration and voice authentication includes noise that corresponds to the same noise type "in-store noise.”
  • the terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US transmitted from the microphone MK to the noise determination device P2.
  • the terminal device P1 refers to the predefined correspondence list LST1 based on the noise type determination result transmitted from the noise determination device P2 and the noise type information of the registered speakers registered in the registered speaker database DB2, and selects one similarity calculation model from among the plurality of similarity calculation models.
  • the correspondence list LST1 is data that associates the noise type determination result "Noise determination result 1" of the noise included in the authentication voice data of the speaker US, the noise type determination result "Noise determination result 2" of the registered speaker registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two noise types.
  • the noise type determination result "Noise determination result 1" indicates, for example, the noise determination result in the voice at the time of authentication.
  • the noise type determination result "Noise determination result 2" indicates, for example, the noise determination result of registered speech.
  • the number of noise type information registered in the registered speaker database DB2 may be one or more. In other words, if there are multiple noise type candidates, multiple pieces of information may be retained. Furthermore, information on determination probabilities indicating respective reliability corresponding to determination results of a plurality of noise types is not essential and may be omitted.
  • the similarity calculation model "selected model" includes, for example, each of the similarity calculation models "Model A", "Model B", "Model C", and "Model Z" selected corresponding to the noise type determination result "Noise determination result 1" and the noise type determination result "Noise determination result 2".
  • For example, the similarity calculation model "Model A" is the similarity calculation model determined to be optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of the registered speaker when the noise type information of the noise included in the feature amount of the speaker US is "Noise A" and the noise type information of the noise included in the feature amount of the registered speaker is also "Noise A".
  • "Model Z" is a general-purpose similarity calculation model, selected when no similarity calculation model suited to the combination of the two noise types is available.
  • the similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST1. For example, in the example shown in FIG. 5, the similarity calculation model selection unit 115 selects the similarity calculation model "Model A."
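As a rough illustration, the selection logic that the correspondence list describes can be sketched as a lookup table keyed by the pair of noise-type determination results, falling back to the general-purpose model "Model Z". The specific noise-type and model names in the table below are illustrative assumptions, not values taken from the patent figures.

```python
# Hypothetical sketch of correspondence-list-based model selection.
# Keys pair the authentication-side noise type ("Noise determination result 1")
# with the registered-side noise type ("Noise determination result 2").
CORRESPONDENCE_LIST = {
    ("Noise A", "Noise A"): "Model A",
    ("Noise B", "Noise B"): "Model B",
    ("Noise C", "Noise C"): "Model C",
    ("Noise A", "Noise D"): "Model G",
}

GENERAL_PURPOSE_MODEL = "Model Z"  # used when no suitable model exists

def select_similarity_model(auth_noise: str, registered_noise: str) -> str:
    """Select one similarity calculation model for a pair of noise-type results."""
    return CORRESPONDENCE_LIST.get((auth_noise, registered_noise), GENERAL_PURPOSE_MODEL)
```

The fallback mirrors the behaviour described for the general-purpose model: any noise-type pair that has no dedicated entry resolves to "Model Z".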
  • In the example shown in FIG. 6, the authentication voice data of the speaker US at the time of voice authentication includes noise corresponding to the noise type "in-store noise." On the other hand, the registered voice data of the registered speaker at the time of voice registration includes noise corresponding to "outdoor noise," a noise type different from that of the authentication voice data of the speaker US.
  • the terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US transmitted from the microphone MK to the noise determination device P2.
  • the terminal device P1 refers to the predefined correspondence list LST2 based on the noise type determination result transmitted from the noise determination device P2 and the noise type information of the registered speakers registered in the registered speaker database DB2, and selects one similarity calculation model from among the plurality of similarity calculation models.
  • the correspondence list LST2 is data that associates the noise type determination result "Noise determination result 3" of the noise included in the authentication voice data of the speaker US, the noise type determination result "Noise determination result 4" of the registered speaker registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two noise types.
  • the noise type determination result "Noise determination result 3" indicates the determination result of the noise type of the noise extracted from the authentication voice data of the speaker US.
  • the noise type determination result "Noise determination result 4" indicates the determination result of the noise type information of the registered speakers registered in the registered speaker database DB2.
  • the number of noise type information registered in the registered speaker database DB2 may be one or more. In other words, if there are multiple noise type candidates, multiple pieces of information may be retained. Further, information on the determination probability indicating the reliability corresponding to the determination result of each noise type is not essential and may be omitted.
  • the similarity calculation model "selected model" includes, for example, each of the similarity calculation models "Model G", "Model H", "Model I", and "Model Z" selected corresponding to the noise type determination result "Noise determination result 3" and the noise type determination result "Noise determination result 4".
  • For example, the similarity calculation model "Model G" is the similarity calculation model determined to be optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of the registered speaker when the noise type information of the noise included in the authentication voice data is "Noise A" and the noise type information of the registered speaker is "Noise D".
  • If the similarity calculation model selection unit 115 determines that there is no similarity calculation model suitable for the similarity calculation process based on the combination of the noise type information of the noise included in the feature amount of the speaker US and the noise type information of the registered speaker, it selects the general-purpose similarity calculation model "Model Z".
  • the similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST2. For example, in the example shown in FIG. 6, the similarity calculation model selection unit 115 selects the similarity calculation model "model E."
  • In this way, the terminal device P1 can select the optimal similarity calculation model for the similarity calculation process of the two feature amounts (that is, the feature amount of the speaker US and the feature amount of the registered speaker). As a result, even if the noise type included in the feature amount at the time of voice registration differs from the noise type included in the feature amount at the time of voice authentication, the terminal device P1 can select the optimal similarity calculation model for calculating the similarity between the two feature amounts. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by the noise included in the audio data. Note that, if there are multiple candidate conditions at the time of noise determination, the terminal device P1 may, for example, calculate the similarity with the similarity calculation model corresponding to each condition and use the average of these values as the final similarity.
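The averaging behaviour for multiple candidate conditions can be sketched as follows. The per-condition similarity values are assumed to have already been computed, each with the similarity calculation model corresponding to one candidate condition; this function only performs the averaging step.

```python
# Sketch of handling multiple candidate noise conditions: compute the
# similarity under each candidate's model (done elsewhere) and use the
# mean of those values as the final similarity.
def average_similarity(similarities_per_condition: list) -> float:
    """Average the similarities obtained with each candidate condition's model."""
    if not similarities_per_condition:
        raise ValueError("at least one candidate condition is required")
    return sum(similarities_per_condition) / len(similarities_per_condition)
```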
  • FIG. 7 is a diagram illustrating an example of calculating reliability.
  • Although FIG. 7 shows an example in which the reliability is expressed with the two levels "high" and "low", the reliability may instead be calculated as a numerical value from 0 (zero) to 100, for example.
  • Here, an example will be described in which the reliability is calculated in the case, explained in FIG. 5, where the registered voice data at the time of voice registration and the authentication voice data at the time of voice authentication are of the same noise type.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • Specifically, the reliability calculation unit 116 determines whether the similarity calculation model used to calculate the similarity is a similarity calculation model based on a known noise type, and determines the reliability based on the determination probability of the noise type.
  • In "Case 1", the noise type determination result indicates that the determination probability for the noise type "outdoor wind noise" is "90%", the determination probability for the noise type "in-store music" is "6%", and the determination probability for the noise type "unknown noise" is "4%".
  • The noise type determination results shown in "Case 1" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the similarity calculation model "outdoor wind noise model” based on the determination result of the noise type of the noise.
  • the authentication unit 117 uses the similarity calculation model "outdoor wind noise model" to calculate the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • Since the similarity calculation model "outdoor wind noise model" used to calculate the similarity is a similarity calculation model based on a known noise type and the noise determination probability is "90%", the reliability calculation unit 116 calculates the reliability as "high".
  • If the reliability calculation unit 116 determines that the noise determination probability is equal to or higher than a predetermined probability (for example, 85%, 90%, etc.), it calculates the reliability as "high"; if it determines that the noise determination probability is lower than the predetermined probability, it calculates the reliability as "low".
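The reliability rule just described can be sketched as below. Treating the "predetermined probability" as a fixed 85% threshold and the output as the two levels "high"/"low" follows the examples in FIG. 7, but hard-coding these values is an assumption for illustration.

```python
RELIABILITY_THRESHOLD = 0.85  # assumed "predetermined probability" (e.g. 85%)

def calculate_reliability(model_based_on_known_noise: bool,
                          determination_probability: float) -> str:
    """Reliability is 'high' only when the selected similarity calculation model
    is based on a known noise type AND the noise determination probability is
    at or above the predetermined probability; otherwise it is 'low'."""
    if model_based_on_known_noise and determination_probability >= RELIABILITY_THRESHOLD:
        return "high"
    return "low"
```

This reproduces the three cases of FIG. 7: a known-noise model at 90% yields "high" (Case 1), a known-noise model at 48% yields "low" (Case 2), and the general-purpose model for unknown noise at 55% yields "low" (Case 3).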
  • In "Case 2", the noise type determination result indicates that the determination probability for the noise type "outdoor wind noise" is "48%", the determination probability for the noise type "unknown noise" is "39%", and the determination probability for the noise type "in-store music" is "13%".
  • The noise type determination results shown in "Case 2" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the similarity calculation model "outdoor wind noise model” based on the determination result of the noise type of the noise.
  • the authentication unit 117 uses the similarity calculation model "outdoor wind noise model" to calculate the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • Since the similarity calculation model "outdoor wind noise model" used to calculate the similarity is a similarity calculation model based on a known noise type but the noise determination probability is "48%", the reliability calculation unit 116 calculates the reliability as "low".
  • In "Case 3", the noise type determination result indicates that the determination probability for the noise type "unknown noise" is "55%", the determination probability for the noise type "outdoor wind noise" is "28%", and the determination probability for the noise type "in-store music" is "17%".
  • The noise type determination results shown in "Case 3" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the general-purpose similarity calculation model based on the determination result of the noise type of the noise.
  • the authentication unit 117 uses the general-purpose similarity calculation model to calculate the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • Since the similarity calculation model used to calculate the similarity is the general-purpose model corresponding to an unknown noise type and the noise determination probability is "55%", the reliability calculation unit 116 calculates the reliability as "low".
  • As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit) that acquires audio data; the noise extraction unit 111 and the feature amount extraction unit 112 (examples of a detection unit and an extraction unit) that detect, from the audio data, the utterance section in which the speaker is speaking and the non-speech section in which the speaker is not speaking, and that extract the feature amount of the utterance section (an example of an utterance feature amount) and the noise included in the non-speech section of the audio data; the similarity calculation model selection unit 115 (an example of a selection unit) that selects any one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and the noise associated with the feature amounts of a plurality of registered speakers registered in advance (an example of registered feature amounts); and the authentication unit 117 that authenticates the speaker by comparing the feature amount of the speaker US with the feature amount of the registered speaker using the selected similarity calculation model.
  • Thereby, the terminal device P1 can select a similarity calculation model suitable for speaker authentication based on the combination of the noise types included in the feature amount of the speaker US and the feature amount of the registered speaker. In other words, even if the noise type included in the feature amount at the time of voice registration differs from that included in the feature amount at the time of voice authentication, the terminal device P1 can select a similarity calculation model more suitable for speaker authentication. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by the noise included in the audio data.
  • the communication unit 10 in the terminal device P1 according to the embodiment further acquires noise type information of the extracted noise.
  • the similarity calculation model selection unit 115 selects a similarity calculation model based on the acquired speaker noise type information and the registered speaker noise type information.
  • Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model more suitable for the similarity calculation process of the two feature amounts (that is, the feature amount of the speaker US and the feature amount of the registered speaker).
  • the terminal device P1 according to the embodiment further includes an authentication unit 117 (an example of a calculation unit) that calculates the degree of similarity between the feature amount of the speaker US and the feature amount of the registered speaker.
  • the authentication unit 117 authenticates the speaker US based on the plurality of calculated similarities.
  • the terminal device P1 according to the embodiment can perform speaker authentication using the degree of similarity between the feature amounts of a plurality of registered speakers registered in advance and the feature amounts of the speaker US.
  • the terminal device P1 according to the embodiment further includes a reliability calculation unit 116 (an example of a reliability calculation unit) that calculates the reliability of the similarity.
  • the communication unit 10 further acquires noise type information of the extracted noise and a score indicating the noise type of the noise.
  • the reliability calculation unit 116 calculates the reliability of the similarity based on the score. Thereby, the terminal device P1 according to the embodiment can calculate the reliability of the speaker authentication result by calculating the reliability corresponding to the similarity.
  • the authentication unit 117 in the terminal device P1 according to the embodiment identifies the registered speaker whose degree of similarity is equal to or greater than the threshold value as the speaker US.
  • the terminal device P1 according to the embodiment can perform speaker authentication using the degree of similarity between the feature amounts of a plurality of registered speakers registered in advance and the feature amounts of the speaker US.
  • the authentication unit 117 in the terminal device P1 according to the embodiment generates and outputs an authentication result screen SC that includes information regarding registered speakers whose degree of similarity is equal to or greater than the threshold value.
  • the terminal device P1 according to the embodiment can present the speaker authentication result to the speaker US or the administrator.
  • If the authentication unit 117 in the terminal device P1 according to the embodiment determines that none of the plurality of calculated similarities is equal to or greater than the threshold value, it determines that the speaker US cannot be identified. Thereby, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy and more effectively suppress erroneous authentication of the speaker US.
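The threshold rule in the preceding paragraphs can be sketched as follows: the registered speaker with the highest similarity is identified as the speaker US only if that similarity reaches the threshold, and otherwise no speaker is identified. The speaker names and the threshold value of 0.7 are illustrative assumptions, not values from the disclosure.

```python
# Sketch of threshold-based speaker identification over precomputed
# similarities between the speaker US and each registered speaker.
from typing import Optional

def identify_speaker(similarities: dict,
                     threshold: float = 0.7) -> Optional[str]:
    """Return the best-matching registered speaker, or None when unidentifiable."""
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    # Identify only when the best similarity is at or above the threshold.
    return best if similarities[best] >= threshold else None
```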
  • the authentication unit 117 in the terminal device P1 according to the embodiment generates and outputs an authentication result screen SC including information regarding the registered speaker whose similarity is equal to or greater than the threshold value and information on the calculated reliability.
  • Thereby, the terminal device P1 according to the embodiment displays the speaker authentication result together with its reliability, and can thus prompt the administrator to confirm whether the speaker authentication result is reliable.
  • the present disclosure is useful as a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.

Abstract

This voice authentication device comprises: a detection unit for detecting, from speech data, a speech segment in which a speaker is speaking and a non-speech segment in which the speaker is not speaking; an extraction unit for extracting a speech feature amount of the speech segment and noise contained in the non-speech segment; a selection unit for selecting, on the basis of the extracted noise and noise associated with a plurality of pre-registered registered feature amounts, one similarity calculation model from among a plurality of similarity calculation models; and an authentication unit for authenticating the speaker by matching the speech feature amounts of the speaker with the registered feature amounts of the registered speaker using the selected similarity calculation model.

Description

Voice authentication device and voice authentication method
The present disclosure relates to a voice authentication device and a voice authentication method.
Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice. The voice recognition device stores a plurality of motion noise models, each created corresponding to one of a plurality of motions, in association with that motion, detects input speech including the subject's voice, identifies the subject's motion, and reads out the motion noise model corresponding to the motion identified by the motion identifying means. The voice recognition device then reads out an environmental noise model corresponding to the subject's current position, synthesizes the environmental noise model with the read motion noise model, and uses the synthesized noise superimposition model to recognize the subject's voice included in the detected input speech.
Japanese Patent Application Publication No. 2008-250059
However, in Patent Document 1, it is necessary to collect in advance the motion noise generated by each of a plurality of motions and the environmental noise at each of a plurality of positions where voice recognition can be performed, which is very time-consuming. Furthermore, in voiceprint authentication, the feature amount indicating the individuality of the authentication target (person) extracted from a voice signal changes depending on the noise contained in the voice signal. Therefore, when voiceprint authentication is performed using the voice recognition device described above, if the noise contained in the pre-registered voice signal differs from the noise contained in the voice signal collected at the time of voiceprint authentication, the feature amounts extracted from the respective voice signals do not indicate the individuality of the same person, and the voiceprint authentication accuracy may decrease.
The present disclosure has been devised in view of the above-described conventional circumstances, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
The present disclosure provides a voice authentication device including: an acquisition unit that acquires audio data; a detection unit that detects, from the audio data, an utterance section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; an extraction unit that extracts an utterance feature amount of the utterance section and noise included in the non-speech section of the audio data; a selection unit that selects any one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and noise associated with registered feature amounts of a plurality of registered speakers registered in advance; and an authentication unit that authenticates the speaker by comparing the utterance feature amount of the speaker with the registered feature amount of the registered speaker using the selected similarity calculation model.
The present disclosure also provides a voice authentication method performed by a terminal device, including: acquiring audio data; detecting, from the audio data, an utterance section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; extracting an utterance feature amount of the utterance section and noise included in the non-speech section of the audio data; selecting any one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and noise associated with registered feature amounts of a plurality of registered speakers registered in advance; and authenticating the speaker by comparing the utterance feature amount of the speaker with the registered feature amount of the registered speaker using the selected similarity calculation model.
According to the present disclosure, it is possible to suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
Block diagram showing an example of the internal configuration of a voice authentication system according to an embodiment
A diagram illustrating each process performed by a processor of a terminal device in the embodiment
Flowchart showing an example of the operation procedure of the terminal device in the embodiment
Flowchart showing an example of the speaker authentication procedure of the terminal device in the embodiment
A diagram illustrating an example of the correspondence list when the noise type at the time of voice registration and the noise type at the time of voice authentication are the same
A diagram illustrating an example of the correspondence list when the noise type at the time of voice registration and the noise type at the time of voice authentication are different
A diagram illustrating an example of calculating the reliability
Hereinafter, embodiments specifically disclosing a voice authentication device and a voice authentication method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, more detailed description than necessary may be omitted. For example, detailed descriptions of well-known matters and redundant descriptions of substantially identical configurations may be omitted. This is to avoid making the following description unnecessarily redundant and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided to enable those skilled in the art to fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.
(Embodiment)
First, a voice authentication system 100 according to an embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing an example of the internal configuration of the voice authentication system 100 according to the embodiment. FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
The voice authentication system 100 includes a terminal device P1 as an example of a voice authentication device, a monitor MN, a noise determination device P2, and a network NW. Note that the voice authentication system 100 may be configured to include the microphone MK or the monitor MN.
The microphone MK picks up the voice uttered by the speaker US for registering the voice in the terminal device P1 in advance. The microphone MK converts the collected utterance of the speaker US into a voice signal or voice data to be registered in the terminal device P1, and transmits the converted voice signal or voice data to the processor 11 via the communication unit 10.
The microphone MK also picks up the utterance of the speaker US used for speaker authentication, converts it into a voice signal or voice data, and transmits the converted voice signal or voice data to the processor 11 via the communication unit 10.
In the following description, to make the explanation easier to understand, voice data for voice registration or voice data already registered in the terminal device P1 is referred to as "registered voice data", and voice data for voice authentication is referred to as "authentication voice data" to distinguish between them.
Note that the microphone MK may be a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, or a tablet terminal. The microphone MK may also transmit the voice signal or voice data to the terminal device P1 by wireless communication via a network (not shown).
 端末装置P1は、例えば、PC、ノートPC、スマートフォン、タブレット端末等により実現され、話者USの登録音声データを用いた音声登録処理と、認証音声データを用いた話者認証処置とを実行する。通信部10と、プロセッサ11と、メモリ12と、特徴量抽出モデルデータベースDB1と、登録話者データベースDB2と、類似度計算モデルデータベースDB3と、ノイズ-類似度計算モデル対応リストDB4と、を含む。 The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, a tablet terminal, etc., and executes a voice registration process using registered voice data of the speaker US and a speaker authentication process using authentication voice data. . It includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a noise-similarity calculation model correspondence list DB4.
The communication unit 10, which is an example of an acquisition unit, is connected to the microphone MK, the monitor MN, and the noise determination device P2 by wired or wireless communication so that data can be transmitted and received. Wireless communication here means, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark).
The communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as Universal Serial Bus (USB). The communication unit 10 may also transmit and receive data to and from the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
The processor 11 is configured using, for example, a Central Processing Unit (CPU) or a Field Programmable Gate Array (FPGA), and performs various kinds of processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12 and executes the program to realize the functions of a noise extraction unit 111, a feature extraction unit 112, a noise determination unit 113, a speaker registration unit 114, a similarity calculation model selection unit 115, a reliability calculation unit 116, an authentication unit 117, and so on.
When registering the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature extraction unit 112, the noise determination unit 113, and the speaker registration unit 114 to newly register (store) the features of the speaker US in the registered speaker database DB2. The features here are features indicating the individuality of the speaker US, extracted from the registered voice data.
When authenticating the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature extraction unit 112, the noise determination unit 113, the similarity calculation model selection unit 115, the reliability calculation unit 116, and the authentication unit 117 to execute speaker authentication processing.
The noise extraction unit 111, which is an example of a detection unit and an extraction unit, acquires the registered voice data or the authentication voice data of the speaker US transmitted from the microphone MK. The noise extraction unit 111 detects, in the registered voice data or the authentication voice data, the segments in which the speaker US is speaking (speech segments) and the segments in which the speaker US is not speaking (hereinafter "non-speech segments"). The noise extraction unit 111 extracts the noise contained in the detected non-speech segments and outputs the extracted noise data (hereinafter "noise data") to the noise determination unit 113.
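The patent does not specify how the speech/non-speech split is performed. A minimal sketch of the idea, using a crude frame-energy threshold as a stand-in for a real voice activity detector (function and parameter names are illustrative, not from the source):

```python
def extract_noise(samples, frame_len=400, energy_ratio=0.1):
    """Return the samples of the non-speech frames of a mono signal.

    A frame counts as non-speech when its mean energy falls below
    `energy_ratio` times the peak frame energy. This is a crude
    stand-in for a real voice-activity detector; the patent leaves
    the detection method unspecified.
    """
    # Split the signal into fixed-length, non-overlapping frames.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in f) / frame_len for f in frames]
    threshold = energy_ratio * max(energies)
    # Concatenate the low-energy (non-speech) frames as the noise data.
    noise = []
    for frame, energy in zip(frames, energies):
        if energy < threshold:
            noise.extend(frame)
    return noise
```

The returned noise data would then be handed to the noise type determination step.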
The feature extraction unit 112, which is an example of a detection unit, acquires the registered voice data or the authentication voice data of the speaker US transmitted from the microphone MK. The feature extraction unit 112 detects the speech segments in the registered voice data or the authentication voice data and, using a feature extraction model, extracts features indicating the individuality of the speaker US from the detected speech segments.
When the registered voice data of the speaker US transmitted from the microphone MK is associated with a control command requesting registration of the features of the speaker US and with the registered speaker information of the speaker US, the feature extraction unit 112 proceeds to the voice registration processing of the speaker US and outputs the extracted features of the speaker US to the speaker registration unit 114.
When the authentication voice data of the speaker US transmitted from the microphone MK is associated with a control command requesting speaker authentication, the feature extraction unit 112 executes speaker authentication processing and outputs the extracted features of the speaker US to the authentication unit 117.
The noise determination unit 113 acquires the noise data output from the noise extraction unit 111. The noise determination unit 113 transmits the noise data to the noise determination device P2 via the network NW and thereby has the type of the noise contained in the registered voice data or the authentication voice data of the speaker US determined.
The noise here is noise picked up due to the environment (background) at the time of sound collection, such as surrounding conversation, music, vehicle traffic, or wind. The noise type indicates the environment (place) or position in which the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or a station concourse. The noise type may further include time-of-day information such as early morning, daytime, or nighttime.
In the voice registration processing of the speaker US, the noise determination unit 113 outputs noise type information corresponding to the noise type determination result transmitted from the noise determination device P2 to the speaker registration unit 114. In the speaker authentication processing, the noise determination unit 113 outputs the noise type information corresponding to that determination result to the similarity calculation model selection unit 115.
The speaker registration unit 114 acquires the features of the speaker US output from the feature extraction unit 112 and the noise type information of the registered voice data of the speaker US output from the noise determination unit 113. The speaker registration unit 114 associates the features of the speaker US, the noise type information, and the speaker information of the speaker US with one another and registers them in the registered speaker database DB2.
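One possible shape of the record stored in the registered speaker database DB2, sketched as a Python dataclass. The field names are hypothetical; the source only requires that the feature vector, the noise type, and the speaker information be stored in association with one another:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RegisteredSpeaker:
    """One illustrative entry in the registered speaker database DB2.

    Field names are assumptions for this sketch; the patent specifies
    only that features, noise type, and speaker information are
    associated with one another.
    """
    speaker_id: str        # identification information (speaker ID)
    name: str              # name of the registered speaker
    noise_type: str        # e.g. "in-store noise", "outdoor wind noise"
    features: List[float]  # feature vector indicating individuality

def register(db: List[RegisteredSpeaker], entry: RegisteredSpeaker) -> None:
    """Append a newly extracted speaker entry to DB2 (modeled as a list)."""
    db.append(entry)
```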
The speaker information may be extracted from the registered voice data by voice recognition, or may be acquired from a terminal owned by the speaker US (for example, a PC, notebook PC, smartphone, or tablet terminal). The speaker information here is, for example, identification information capable of identifying the speaker US, the name of the speaker US, a speaker Identification (ID), or the like.
The similarity calculation model selection unit 115, which is an example of a selection unit, calculates (evaluates) the similarity between the noise type associated with the features of each of the plurality of registered speakers registered in the registered speaker database DB2 and the noise type of the noise extracted from the voice data of the speaker US. Based on the calculated similarity, the similarity calculation model selection unit 115 selects a similarity calculation model (an example of a similarity calculation model) stored in the similarity calculation model database DB3.
The similarity calculation model selected here is a model that is more suitable, or optimal, for calculating the similarity between the features of the speaker US and the features of one of the registered speakers.
The similarity calculation model selection unit 115 refers to correspondence lists LST1 and LST2 (see FIGS. 5 and 6), in which the noise type information corresponding to the voice data of the speaker US, the noise type information corresponding to the plurality of registered speakers, and information on the similarity calculation model selected based on the similarity of these noise types are associated with one another; selects one selection model (similarity model) from among the plurality of selection models; and outputs it to each of the reliability calculation unit 116 and the authentication unit 117.
The reliability calculation unit 116, which is an example of a reliability calculation unit, calculates (evaluates) a reliability (score) indicating the certainty of the result of identifying the speaker US based on the similarity calculated by the authentication unit 117. The reliability calculation unit 116 outputs the calculated reliability information to the authentication unit 117 based on the similarity calculated by the authentication unit 117, the noise type information used in the similarity calculation processing, the similarity calculation model, and so on. The reliability calculation processing by the reliability calculation unit 116 is not essential and may be omitted.
The authentication unit 117, which is an example of a calculation unit, acquires the features of the speaker US output from the feature extraction unit 112 and the features of each of the plurality of registered speakers registered in the registered speaker database DB2. The authentication unit 117 also acquires the selection model from the correspondence lists LST1 and LST2 output from the similarity calculation model selection unit 115.
Using the similarity calculation model based on the correspondence lists LST1 and LST2, the authentication unit 117 calculates the similarity between the features of each of the plurality of registered speakers and the features of the speaker US. The authentication unit 117 identifies the speaker US based on the calculated similarities, generates an authentication result screen SC based on the speaker information of the identified speaker US, and transmits it to the monitor MN.
The memory 12 has, for example, a Random Access Memory (hereinafter "RAM") used as a work memory when each process of the processor 11 is executed, and a Read Only Memory (hereinafter "ROM") that stores programs and data defining the operation of the processor 11. Data or information generated or acquired by the processor 11 is temporarily stored in the RAM. A program defining the operation of the processor 11 is written in the ROM.
The feature extraction model database DB1 is a so-called storage, configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter "HDD"), or a Solid State Drive (hereinafter "SSD"). The feature extraction model database DB1 stores a feature extraction model capable of detecting the speech segments of the speaker US in the registered voice data or the authentication voice data and extracting the features of the speaker US. The feature extraction model is, for example, a trained model generated by learning using deep learning or the like.
The registered speaker database DB2 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The registered speaker database DB2 stores, in association with one another, the features of each of a plurality of registered speakers registered in advance, information on the noise type of the noise contained in the registered voice data from which those features were extracted, and registered speaker information.
The similarity calculation model database DB3 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The similarity calculation model database DB3 stores similarity calculation models capable of calculating the similarity between two features. The similarity calculation model is, for example, a trained model generated by learning using deep learning or the like.
For example, a similarity calculation model learns in advance, and retains, the dimensions in which individuality tends to appear, in order to calculate the similarity between two multidimensional vectors with high precision. Calculating similarity with a model is only one example of a method for computing the similarity between vectors; existing techniques such as Euclidean distance or cosine similarity may also be used.
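As a concrete instance of the latter alternative, the cosine similarity between two feature vectors can be computed directly, with no trained model involved:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors.

    Returns a value in [-1, 1]; 1 means the vectors point in the
    same direction (maximally similar).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Unlike a learned model, this measure weights every dimension of the feature vector equally, which is why the patent favors a model that has pre-learned the dimensions in which individuality appears.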
The noise/similarity-calculation-model correspondence list DB4 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The noise/similarity-calculation-model correspondence list DB4 stores, for each combination of noise types, the similarity calculation model to be used in the similarity calculation processing.
The monitor MN is configured using a display such as a Liquid Crystal Display (LCD) or an organic Electroluminescence (EL) display. The monitor MN displays the authentication result screen SC output from the terminal device P1.
The authentication result screen SC is a screen that notifies an administrator (for example, a person viewing the monitor MN) of the speaker authentication result, and includes authentication result information ("Matched the voice of XX XX.") and reliability information ("Reliability: High"). The authentication result screen SC may include other registered speaker information (for example, a face image), and need not include the reliability information.
The network NW connects the terminal device P1 and the noise determination device P2 so that data communication between them is possible. The noise determination device P2 is not limited to being connected to the terminal device P1 via the network NW; it may instead be a part of the terminal device P1.
The noise determination device P2 acquires the noise extracted from the registered voice data or the authentication voice data of the speaker US transmitted from the terminal device P1, determines the noise type based on the acquired noise, and transmits the noise type information to the terminal device P1.
Next, the operating procedure of the terminal device P1 will be described with reference to FIG. 3. FIG. 3 is a flowchart showing an example of the operating procedure of the terminal device P1 in the embodiment.
The terminal device P1 acquires voice data from the microphone MK (St11). The microphone MK may be, for example, a microphone built into a PC, a notebook PC, a smartphone, or a tablet terminal.
The terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
If, in step St12, the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines that the features of the speaker US are to be newly registered in the registered speaker database DB2 (St12, YES) and extracts the noise contained in the non-speech segments of the voice data (registered voice data) (St13). The noise here is noise contained in the voice data (registered voice data or authentication voice data), such as ambient environmental sound or background noise present when the speech of the speaker US was picked up.
If, on the other hand, the control command in step St12 is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines that the features of the speaker US are not to be newly registered in the registered speaker database DB2 (St12, NO) and extracts the noise contained in the non-speech segments of the voice data (authentication voice data) (St14).
The terminal device P1 associates the extracted noise with a control command requesting determination of the noise type and transmits the extracted noise to the noise determination device P2. The terminal device P1 executes the noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St15).
The terminal device P1 extracts features indicating the individuality of the speaker US from the speech segments of the registered voice data (St16). The features extracted from the speech segments of the registered voice data and the authentication voice data include features indicating the individuality of the speaker US and features of the noise.
The terminal device P1 associates the features of the speaker US extracted from the registered voice data, the noise type information, and the speaker information with one another and registers them in the registered speaker database DB2 (St17).
The terminal device P1 likewise extracts the noise contained in the non-speech segments of the authentication voice data, associates the extracted noise with a control command requesting determination of the noise type, and transmits the noise to the noise determination device P2. The terminal device P1 executes the noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St18).
The terminal device P1 extracts the features of the speaker US from the speech segments of the authentication voice data of the speaker US (St19). The terminal device P1 also acquires the features of each of the plurality of registered speakers registered in the registered speaker database DB2 (St20) and executes speaker authentication processing (St21).
Next, the speaker authentication procedure of step St21 in FIG. 3 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
Based on the noise type of the voice data of the speaker US and the noise type of each of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects, for each registered speaker, a similarity calculation model suitable for calculating the similarity between the features of the speaker US and the features of that registered speaker. The terminal device P1 refers to the correspondence lists LST1 and LST2 (see FIGS. 5 and 6), in which the noise type of the voice data of the speaker US, the noise types of the plurality of registered speakers, and the selected similarity calculation models are associated with one another.
Based on the referenced correspondence lists LST1 and LST2, the terminal device P1 reads, from the similarity calculation model database DB3, the similarity calculation model to be used for determining the similarity between the features of the speaker US and the features of one of the plurality of registered speakers (St211).
Using the similarity calculation model, the terminal device P1 calculates the similarity between the features of the voice data of the speaker US and the features of one of the plurality of registered speakers registered in the registered speaker database DB2 (St212). The terminal device P1 repeats the processing of step St212 until the similarity between the features of the voice data of the speaker US and the features of every registered speaker registered in the registered speaker database DB2 has been calculated.
The terminal device P1 determines whether any of the calculated similarities is equal to or greater than a threshold (St213).
If the terminal device P1 determines in step St213 that one of the calculated similarities is equal to or greater than the threshold (St213, YES), it identifies the speaker US based on the registered speaker information corresponding to the similarity determined to be equal to or greater than the threshold (St214). When a plurality of similarities are determined to be equal to or greater than the threshold, the terminal device P1 may identify the speaker US based on the registered speaker information corresponding to the highest calculated similarity.
If the terminal device P1 determines in step St213 that none of the calculated similarities is equal to or greater than the threshold (St213, NO), it determines that the speaker US cannot be identified (St215).
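The loop of steps St212 through St215 can be sketched as follows. Cosine similarity stands in for whichever similarity calculation model the correspondence list selects, and the function name, database shape, and threshold value are all illustrative assumptions:

```python
import math

def identify_speaker(query_features, registered, threshold=0.7):
    """Identify a speaker following steps St212-St215.

    `registered` maps speaker name -> feature vector. Returns the
    name of the best match whose similarity is at or above
    `threshold`, or None when no registered speaker qualifies.
    Cosine similarity is a stand-in for the model-based similarity
    calculation described in the patent.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # St212: compute the similarity against every registered speaker.
    scores = {name: cosine(query_features, feats)
              for name, feats in registered.items()}
    # St213/St214: take the highest similarity, if it clears the threshold.
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best
    return None  # St215: the speaker cannot be identified
```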
The terminal device P1 generates the authentication result screen SC based on the registered speaker information of the identified speaker US, outputs the generated authentication result screen SC to the monitor MN, and causes it to be displayed (St216).
As described above, the terminal device P1 registers, at the time of voice registration, the speaker information, the features of the speaker US, and the information on the noise type contained in the speech segments of the speaker US in association with one another. As a result, even when the noise type contained in the features at the time of voice registration differs from the noise type contained in the features at the time of voice authentication, the terminal device P1 can select a similarity calculation model suited to each pair of noise types. By using the selected similarity calculation model, the terminal device P1 can therefore determine with higher precision the similarity between the features of the speaker US, which contain different noise, and the features of a registered speaker, and can more effectively suppress the degradation of speaker authentication accuracy caused by the noise contained in the authentication voice data.
The terminal device P1 also calculates and displays, based on the similarity calculation model used for the similarity calculation, a reliability indicating the certainty of the speaker identified by the speaker authentication processing as indicated by the calculated similarity. The terminal device P1 can thereby present the certainty of the speaker authentication result to the administrator viewing the monitor MN. Accordingly, by presenting the reliability, the terminal device P1 can inform the administrator that no similarity calculation model suitable for the similarity calculation was available and that speaker authentication was performed using the "general-purpose model" similarity calculation model described later.
 図5および図6のそれぞれを参照して、対応リストLST1,LST2の一例について説明する。図5は、音声登録時のノイズ種別と音声認証時のノイズ種別とが同一である場合の対応リストLST1の一例を説明する図である。図6は、音声登録時のノイズ種別と音声認証時のノイズ種別とが異なる場合の対応リストLST2の一例を説明する図である。 An example of the correspondence lists LST1 and LST2 will be described with reference to FIGS. 5 and 6, respectively. FIG. 5 is a diagram illustrating an example of the correspondence list LST1 when the noise type at the time of voice registration and the noise type at the time of voice authentication are the same. FIG. 6 is a diagram illustrating an example of the correspondence list LST2 when the noise type at the time of voice registration and the noise type at the time of voice authentication are different.
 なお、図5および図6では、説明を分かりやすくするために登録話者データベースDB2に登録されたいずれか1人の登録話者の特徴量に対応付けられたノイズ種別の情報と、話者認証対象である話者USの認証音声データに含まれるノイズのノイズ種別とに基づいて、対応リストLST1,LST2を参照する例について説明する。 In addition, in FIGS. 5 and 6, in order to make the explanation easier to understand, noise type information associated with the feature amount of any one registered speaker registered in the registered speaker database DB2 and speaker authentication information are shown. An example will be described in which the correspondence lists LST1 and LST2 are referred to based on the noise type of noise included in the authenticated voice data of the target speaker US.
 図5に示す対応リストLST1の参照例において、音声登録時および音声認証時のそれぞれの音声データは、同一のノイズ種別「店舗内雑音」に該当するノイズを含む。 In the reference example of the correspondence list LST1 shown in FIG. 5, each of the voice data at the time of voice registration and voice authentication includes noise that corresponds to the same noise type "in-store noise."
The terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US, received from the microphone MK, to the noise determination device P2. Based on the noise type determination result returned by the noise determination device P2 and the noise type information of the registered speaker stored in the registered speaker database DB2, the terminal device P1 refers to the predefined correspondence list LST1 and selects one similarity calculation model from among the plurality of candidate models.
The correspondence list LST1 is data that associates the determination result "noise determination result 1" for the noise type of the noise included in the authentication voice data of the speaker US, the determination result "noise determination result 2" for the noise type of the registered speaker stored in the registered speaker database DB2, and the similarity calculation model ("selected model") selected based on these two noise types.
"Noise determination result 1" indicates, for example, the noise type determination result for the voice at the time of authentication.
"Noise determination result 2" indicates, for example, the noise type determination result for the registered voice.
Note that one or more pieces of noise type information may be registered in the registered speaker database DB2; when there are multiple noise type candidates, multiple pieces of information may be retained. Determination probability information indicating the reliability of each noise type determination result is optional and may be omitted.
The "selected model" column contains the similarity calculation models "model A", "model B", "model C", and "model Z", each selected according to the combination of "noise determination result 1" and "noise determination result 2".
For example, the similarity calculation model "model A" is the model determined to be best suited to calculating the similarity between the feature value of the speaker US and the feature value of the registered speaker when the noise type of the noise included in the feature value of the speaker US is "noise A" and the noise type of the noise included in the feature value of the registered speaker is also "noise A".
When the similarity calculation model selection unit 115 determines, based on the combination of the noise type of the noise included in the feature value of the speaker US and the noise type of the noise included in the feature value of the registered speaker, that there is no similarity calculation model suited to the similarity calculation, it selects "model Z", a generic similarity calculation model.
The similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST1. For example, in the example shown in FIG. 5, the similarity calculation model selection unit 115 selects the similarity calculation model "model A".
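The correspondence-list lookup described above can be sketched as a simple table keyed by the pair of noise types, with a fallback to the generic model when no suitable model exists. The following Python sketch is illustrative only: the model names and noise labels mirror FIG. 5, while the function and variable names are hypothetical and not part of the disclosure.

```python
# Illustrative sketch of correspondence-list-based model selection.
# Keys are (authentication noise type, registration noise type) pairs;
# any combination not in the list falls back to the generic "model Z".

CORRESPONDENCE_LIST = {
    ("noise A", "noise A"): "model A",
    ("noise B", "noise B"): "model B",
    ("noise C", "noise C"): "model C",
}

GENERIC_MODEL = "model Z"


def select_similarity_model(auth_noise_type, reg_noise_type):
    """Return the similarity calculation model for the noise-type pair,
    or the generic model when no suitable model is defined."""
    return CORRESPONDENCE_LIST.get((auth_noise_type, reg_noise_type), GENERIC_MODEL)
```

For instance, `select_similarity_model("noise A", "noise A")` returns "model A", while an unlisted pairing returns "model Z", matching the fallback behavior of the selection unit 115.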
Next, in the reference example of the correspondence list LST2 shown in FIG. 6, the authentication voice data of the speaker US at the time of voice authentication includes noise of the noise type "in-store noise", while the registered voice data of the registered speaker, recorded at the time of voice registration, includes noise of a different noise type, "outdoor noise".
The terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US, received from the microphone MK, to the noise determination device P2. Based on the noise type determination result returned by the noise determination device P2 and the noise type information of the registered speaker stored in the registered speaker database DB2, the terminal device P1 refers to the predefined correspondence list LST2 and selects one similarity calculation model from among the plurality of candidate models.
The correspondence list LST2 is data that associates the determination result "noise determination result 3" for the noise type of the noise included in the authentication voice data of the speaker US, the determination result "noise determination result 4" for the noise type of the registered speaker stored in the registered speaker database DB2, and the similarity calculation model ("selected model") selected based on these two noise types.
"Noise determination result 3" indicates the noise type determination result for the noise extracted from the authentication voice data of the speaker US.
"Noise determination result 4" indicates the determination result registered as the noise type information of the registered speaker in the registered speaker database DB2.
Note that one or more pieces of noise type information may be registered in the registered speaker database DB2; when there are multiple noise type candidates, multiple pieces of information may be retained. Determination probability information indicating the reliability of each noise type determination result is optional and may be omitted.
The "selected model" column contains the similarity calculation models "model G", "model H", "model I", and "model Z", each selected according to the combination of "noise determination result 3" and "noise determination result 4".
For example, the similarity calculation model "model G" is the model determined to be best suited to calculating the similarity between the feature value of the speaker US and the feature value of the registered speaker when the noise type of the noise included in the authentication voice data is "noise A" and the noise type of the registered speaker is "noise D".
When the similarity calculation model selection unit 115 determines, based on the combination of the noise type of the noise included in the feature value of the speaker US and the noise type of the registered speaker, that there is no similarity calculation model suited to the similarity calculation, it selects the generic similarity calculation model "model Z".
The similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST2. For example, in the example shown in FIG. 6, the similarity calculation model selection unit 115 selects the similarity calculation model "model G".
As described above, the terminal device P1 can select the similarity calculation model best suited to calculating the similarity between the two feature values (the feature value of the speaker US and the feature value of the registered speaker) based on the combination of the noise types included in each. The terminal device P1 can therefore select a suitable similarity calculation model even when the noise type included in the feature value at the time of voice registration differs from the noise type included in the feature value at the time of voice authentication. In other words, the terminal device P1 can more effectively suppress the degradation of speaker authentication accuracy caused by noise included in the voice data. When the noise determination yields multiple candidate conditions, the terminal device P1 may, for example, calculate a similarity with the similarity calculation model corresponding to each candidate and adopt the average of those values as the final similarity.
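The averaging option mentioned above, used when multiple noise-condition candidates exist, can be sketched as follows. This is a minimal illustration with a hypothetical function name; the per-model similarities are assumed to have been computed beforehand by the respective similarity calculation models.

```python
def combined_similarity(candidate_similarities):
    """Combine the similarities produced by the similarity calculation
    models corresponding to each candidate noise condition by taking
    their mean, which is then adopted as the final similarity."""
    if not candidate_similarities:
        raise ValueError("at least one candidate similarity is required")
    return sum(candidate_similarities) / len(candidate_similarities)
```

For example, if the models for two candidate noise conditions yield similarities of 0.5 and 0.7, the adopted similarity is their mean, 0.6.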
The reliability of the similarity calculation model used in the similarity calculation will now be described with reference to FIG. 7, which illustrates an example of reliability calculation.
Although FIG. 7 shows an example in which the reliability takes one of two levels, "high" and "low", the reliability may instead be calculated as a numerical value, for example from 0 (zero) to 100.
For ease of explanation, FIG. 7 describes the reliability calculation for the case, explained with reference to FIG. 5, in which the registered voice data at the time of voice registration and the authentication voice data at the time of voice authentication have the same noise type.
The reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117. Specifically, the reliability calculation unit 116 determines the reliability based on whether the similarity calculation model used to calculate the similarity is based on a known noise type, on the noise type determination probability, and so on.
In the example shown in "case 1", the noise type determination result gives a determination probability of 90% for the noise type "outdoor noise", 6% for the noise type "in-store music", and 4% for the noise type "unknown noise". The determination result for "case 1" also indicates that the noise types "outdoor noise" and "in-store music" are known noise, while the noise type "unknown noise" is unknown noise.
The similarity calculation model selection unit 115 selects the similarity calculation model "outdoor noise model" based on the noise type determination result. The authentication unit 117 uses the similarity calculation model "outdoor noise model" to calculate the similarity between the feature value of the speaker US and the feature value of one of the registered speakers registered in the registered speaker database DB2.
The reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117. Because the similarity calculation model "outdoor noise model" is based on the known noise type "outdoor noise" and the noise determination probability is 90%, the reliability calculation unit 116 calculates the reliability as "high".
When the reliability calculation unit 116 determines that the noise determination probability is equal to or greater than a predetermined probability (for example, 85% or 90%), it calculates the reliability as "high"; otherwise, it calculates the reliability as "low".
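The two-level reliability decision described above can be sketched as follows. This is a minimal illustration: the 85% threshold, the set of known noise types, and all names are example assumptions drawn from the three cases in FIG. 7, not fixed by the disclosure.

```python
# Example set of known noise types, mirroring the cases in FIG. 7.
KNOWN_NOISE_TYPES = {"outdoor noise", "outdoor wind noise", "in-store music"}


def compute_reliability(top_noise_type, determination_probability, threshold=0.85):
    """Return "high" only when the top-ranked noise type is a known type
    AND its determination probability reaches the threshold; otherwise
    return "low" (e.g. low probability, or an unknown noise type that
    forced the generic model to be used)."""
    if top_noise_type in KNOWN_NOISE_TYPES and determination_probability >= threshold:
        return "high"
    return "low"
```

Under these assumptions, case 1 (known "outdoor noise" at 90%) yields "high", while case 2 (known type at only 48%) and case 3 (unknown type at 55%) both yield "low".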
In the example shown in "case 2", the noise type determination result gives a determination probability of 48% for the noise type "outdoor wind noise", 39% for the noise type "unknown noise", and 13% for the noise type "in-store music". The determination result for "case 2" also indicates that the noise types "outdoor wind noise" and "in-store music" are known noise, while the noise type "unknown noise" is unknown noise.
The similarity calculation model selection unit 115 selects the similarity calculation model "outdoor wind noise model" based on the noise type determination result. The authentication unit 117 uses the similarity calculation model "outdoor wind noise model" to calculate the similarity between the feature value of the speaker US and the feature value of one of the registered speakers registered in the registered speaker database DB2.
The reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117. Although the similarity calculation model "outdoor wind noise model" is based on the known noise type "outdoor wind noise", the noise determination probability is only 48%, so the reliability calculation unit 116 calculates the reliability as "low".
In the example shown in "case 3", the noise type determination result gives a determination probability of 55% for the noise type "unknown noise", 28% for the noise type "outdoor wind noise", and 17% for the noise type "in-store music". The determination result for "case 3" also indicates that the noise types "outdoor wind noise" and "in-store music" are known noise, while the noise type "unknown noise" is unknown noise.
The similarity calculation model selection unit 115 selects the similarity calculation model "generic model" based on the noise type determination result. The authentication unit 117 uses the similarity calculation model "generic model" to calculate the similarity between the feature value of the speaker US and the feature value of one of the registered speakers registered in the registered speaker database DB2.
The reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117. Because the similarity calculation model "generic model" used to calculate the similarity corresponds to an unknown noise type and the noise determination probability is 55%, the reliability calculation unit 116 calculates the reliability as "low".
As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit) that acquires voice data; the noise extraction unit 111 and the feature value extraction unit 112 (an example of a detection unit) that detect, from the voice data, a speech section in which the speaker is speaking and a non-speech section in which the speaker is not speaking; the noise extraction unit 111 and the feature value extraction unit 112 (an example of an extraction unit) that extract the feature value of the speech section (an example of a speech feature value) and the noise included in the non-speech section of the voice data; the similarity calculation model selection unit 115 (an example of a selection unit) that selects one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and the noise associated with the feature values of a plurality of registered speakers registered in advance (an example of registered feature values); and the authentication unit 117 that authenticates the speaker by comparing the feature value of the speaker US with the feature values of the registered speakers using the selected similarity calculation model.
Accordingly, the terminal device P1 according to the embodiment can select a similarity calculation model suited to speaker authentication based on the combination of the noise types included in the feature value of the speaker US and the feature value of the registered speaker. In other words, the terminal device P1 can select a more suitable similarity calculation model even when the noise type included in the feature value at the time of voice registration differs from the noise included in the feature value at the time of voice authentication. The terminal device P1 can therefore more effectively suppress the degradation of speaker authentication accuracy caused by noise included in the voice data.
In the terminal device P1 according to the embodiment, the communication unit 10 further acquires noise type information of the extracted noise, and the similarity calculation model selection unit 115 selects a similarity calculation model based on the acquired noise type information of the speaker and the noise type information of the registered speaker. The terminal device P1 can thereby select a similarity calculation model better suited to calculating the similarity between the two feature values (the feature value of the speaker US and the feature value of the registered speaker) based on the combination of the noise types included in each.
The terminal device P1 according to the embodiment further includes the authentication unit 117 (an example of a calculation unit) that calculates the similarity between the feature value of the speaker US and the feature value of each registered speaker. The authentication unit 117 authenticates the speaker US based on the plurality of calculated similarities. The terminal device P1 can thereby perform speaker authentication using the similarities between the feature values of the plurality of registered speakers registered in advance and the feature value of the speaker US.
The terminal device P1 according to the embodiment further includes the reliability calculation unit 116 (an example of a reliability calculation unit) that calculates the reliability of the similarity. The communication unit 10 further acquires the noise type information of the extracted noise and a score indicating that the noise is of that noise type. The reliability calculation unit 116 calculates the reliability of the similarity based on the score. By calculating the reliability corresponding to the similarity, the terminal device P1 can thereby calculate the reliability of the speaker authentication result.
In the terminal device P1 according to the embodiment, the authentication unit 117 identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker US. The terminal device P1 can thereby perform speaker authentication using the similarities between the feature values of the plurality of registered speakers registered in advance and the feature value of the speaker US.
In the terminal device P1 according to the embodiment, the authentication unit 117 also generates and outputs the authentication result screen SC including information on the registered speaker whose similarity is equal to or greater than the threshold. The terminal device P1 can thereby present the speaker authentication result to the speaker US or the administrator.
In the terminal device P1 according to the embodiment, when the authentication unit 117 determines that none of the plurality of calculated similarities is equal to or greater than the threshold, it determines that the speaker US cannot be identified. The terminal device P1 can thereby more effectively suppress the degradation of speaker authentication accuracy and more effectively suppress erroneous authentication of the speaker US.
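The threshold-based decision in the preceding paragraphs (identify the registered speaker whose similarity reaches the threshold, or declare the speaker unidentifiable when none does) can be sketched as follows. The function and variable names are hypothetical illustrations, not taken from the disclosure.

```python
def identify_speaker(similarities, threshold):
    """Given a mapping {registered speaker: similarity}, return the
    registered speaker with the highest similarity if that similarity is
    equal to or greater than the threshold; return None when no
    similarity reaches the threshold (speaker unidentifiable)."""
    if not similarities:
        return None
    best_speaker, best_score = max(similarities.items(), key=lambda kv: kv[1])
    return best_speaker if best_score >= threshold else None
```

For example, with a threshold of 0.8, similarities {"registered speaker A": 0.92, "registered speaker B": 0.40} identify speaker A, whereas {"registered speaker A": 0.5, "registered speaker B": 0.4} yield no identification.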
In the terminal device P1 according to the embodiment, the authentication unit 117 also generates and outputs the authentication result screen SC including information on the registered speaker whose similarity is equal to or greater than the threshold and information on the calculated reliability. By displaying the speaker authentication result together with its reliability, the terminal device P1 can prompt the administrator to confirm whether the speaker authentication result is trustworthy.
Although various embodiments have been described above with reference to the drawings, it goes without saying that the present disclosure is not limited to these examples. It will be clear to those skilled in the art that various changes, modifications, substitutions, additions, deletions, and equivalents can be conceived within the scope of the claims, and it is understood that they naturally fall within the technical scope of the present disclosure. The components of the various embodiments described above may also be combined as desired without departing from the spirit of the invention.
This application is based on Japanese Patent Application No. 2022-045390 filed on March 22, 2022, the contents of which are incorporated herein by reference.
The present disclosure is useful as a voice authentication device and a voice authentication method that suppress the degradation of speaker authentication accuracy caused by changes in environmental noise.
10 Communication unit
11 Processor
12 Memory
100 Voice authentication system
111 Noise extraction unit
112 Feature value extraction unit
113 Noise determination unit
114 Speaker registration unit
115 Similarity calculation model selection unit
116 Reliability calculation unit
117 Authentication unit
DB1 Feature value extraction model database
DB2 Registered speaker database
DB3 Similarity calculation model database
DB4 Noise-similarity calculation model correspondence list
MK Microphone
MN Monitor
NW Network
P1 Terminal device
P2 Noise determination device
SC Authentication result screen
US Speaker

Claims (9)

  1.  A voice authentication device comprising:
     an acquisition unit that acquires voice data;
     a detection unit that detects, from the voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking;
     an extraction unit that extracts a speech feature value of the speech section and noise included in the non-speech section of the voice data;
     a selection unit that selects one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and noise associated with registered feature values of a plurality of registered speakers registered in advance; and
     an authentication unit that authenticates the speaker by comparing the speech feature value of the speaker with the registered feature values of the registered speakers using the selected similarity calculation model.
  2.  The voice authentication device according to claim 1, wherein
     the acquisition unit further acquires noise type information of the extracted noise, and
     the selection unit selects the similarity calculation model based on the acquired noise type information of the speaker and the noise type information of the registered speaker.
  3.  The voice authentication device according to claim 1, further comprising
     a calculation unit that calculates a similarity between the speech feature value of the speaker and the registered feature value of each registered speaker,
     wherein the authentication unit authenticates the speaker based on the plurality of calculated similarities.
  4.  The voice authentication device according to claim 3, further comprising
     a reliability calculation unit that calculates a reliability of the similarity,
     wherein the acquisition unit further acquires noise type information of the extracted noise and a score indicating that the noise is of the noise type, and
     the reliability calculation unit calculates the reliability of the similarity based on the score.
  5.  The authentication unit identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker.
     The voice authentication device according to claim 3.
  6.  The authentication unit generates and outputs an authentication result screen including information on the registered speaker whose similarity is equal to or greater than the threshold.
     The voice authentication device according to claim 5.
  7.  The authentication unit determines that the speaker cannot be identified when none of the plurality of calculated similarities is equal to or greater than a threshold.
     The voice authentication device according to claim 3.
  8.  The authentication unit identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker, and generates and outputs an authentication result screen including information on that registered speaker and information on the calculated reliability.
     The voice authentication device according to claim 4.
  9.  A voice authentication method performed by a terminal device, the method comprising:
     acquiring audio data;
     detecting, from the audio data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking;
     extracting speech features of the speech section and noise included in the non-speech section of the audio data;
     selecting one similarity calculation model from among a plurality of similarity calculation models, based on the extracted noise and noise associated with the registered features of a plurality of registered speakers registered in advance; and
     authenticating the speaker by comparing the speech features of the speaker with the registered features of the registered speaker using the selected similarity calculation model.
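Read together, the claims describe a pipeline: detect speech and non-speech sections, extract speech features and noise, select a similarity calculation model matched to the noise conditions at enrollment and at query time, then score each registered speaker and apply a threshold. A minimal sketch of that flow follows; the function names, the energy-based voice activity detection, cosine scoring, and the thresholds are all illustrative assumptions and are not taken from the specification.

```python
# Illustrative sketch of the claimed flow; not the patented implementation.
import numpy as np

def detect_sections(audio, frame_len=400, energy_ratio=2.0):
    """Split audio into speech / non-speech frames with a simple energy VAD."""
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, frame_len)]
    energies = np.array([float(np.mean(f ** 2)) for f in frames])
    threshold = energy_ratio * np.median(energies)
    speech = [f for f, e in zip(frames, energies) if e >= threshold]
    noise = [f for f, e in zip(frames, energies) if e < threshold]
    return speech, noise

def noise_type(noise_frames, profiles):
    """Pick the registered noise profile (e.g. 'car', 'office') closest to
    the averaged magnitude spectrum of the non-speech frames."""
    spectrum = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)
    spectrum = spectrum / (np.linalg.norm(spectrum) + 1e-9)
    return max(profiles, key=lambda k: float(spectrum @ profiles[k]))

def select_model(query_noise, registered_noise, models):
    """Select a similarity model keyed by the enrollment/query noise pair."""
    return models.get((registered_noise, query_noise), models["default"])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def authenticate(speaker_emb, registered, threshold=0.7, similarity=cosine):
    """Return the best-matching registered speaker, or None when every
    similarity falls below the threshold (speaker unidentifiable)."""
    scores = {name: similarity(speaker_emb, emb)
              for name, emb in registered.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= threshold else (None, scores[best])
```

In this sketch, `models` would map noise-condition pairs to scoring functions trained under those conditions, which is one plausible reading of the "similarity calculation model" selection in the claims.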
PCT/JP2023/009469 2022-03-22 2023-03-10 Voice authentication device and voice authentication method WO2023182016A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022045390 2022-03-22
JP2022-045390 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023182016A1 true WO2023182016A1 (en) 2023-09-28

Family

ID=88101383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/009469 WO2023182016A1 (en) 2022-03-22 2023-03-10 Voice authentication device and voice authentication method

Country Status (1)

Country Link
WO (1) WO2023182016A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6242198A * 1985-08-20 1987-02-24 Matsushita Electric Industrial Co., Ltd. Voice recognition equipment
JPH0573090A (en) * 1991-09-18 1993-03-26 Fujitsu Ltd Speech recognizing method
JPH0736477A (en) * 1993-07-16 1995-02-07 Ricoh Co Ltd Pattern matching system
JP2006003400A (en) * 2004-06-15 2006-01-05 Honda Motor Co Ltd On-board voice recognition system
JP2019035935A * 2017-08-10 2019-03-07 Toyota Motor Corporation Voice recognition apparatus


Similar Documents

Publication Publication Date Title
US20230368780A1 (en) Wakeword detection
US11734326B2 (en) Profile disambiguation
US20170084274A1 (en) Dialog management apparatus and method
AU2017425675B2 (en) Extracting domain-specific actions and entities in natural language commands
AU2017424116B2 (en) Extracting domain-specific actions and entities in natural language commands
US20160071516A1 (en) Keyword detection using speaker-independent keyword models for user-designated keywords
JP2019053126A (en) Growth type interactive device
JP6280074B2 (en) Rephrase detection device, speech recognition system, rephrase detection method, program
US11514900B1 (en) Wakeword detection
JP2018136493A (en) Voice recognition computer program, voice recognition device and voice recognition method
US20190042560A1 (en) Extracting domain-specific actions and entities in natural language commands
JP6495792B2 (en) Speech recognition apparatus, speech recognition method, and program
US20230386468A1 (en) Adapting hotword recognition based on personalized negatives
US9224388B2 (en) Sound recognition method and system
JP6676009B2 (en) Speaker determination device, speaker determination information generation method, and program
WO2023182016A1 (en) Voice authentication device and voice authentication method
US10997972B2 (en) Object authentication device and object authentication method
WO2023182014A1 (en) Voice authentication device and voice authentication method
KR101840363B1 (en) Voice recognition apparatus and terminal device for detecting misprononced phoneme, and method for training acoustic model
US20210241755A1 (en) Information-processing device and information-processing method
WO2023182015A1 (en) Voice authentication device and voice authentication method
JP7326596B2 (en) Voice data creation device
CN110895938B (en) Voice correction system and voice correction method
US20220335927A1 (en) Learning apparatus, estimation apparatus, methods and programs for the same
KR20200053242A (en) Voice recognition system for vehicle and method of controlling the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23774615

Country of ref document: EP

Kind code of ref document: A1