WO2023182016A1 - Voice authentication device and voice authentication method - Google Patents

Info

Publication number
WO2023182016A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
noise
registered
similarity
authentication
Prior art date
Application number
PCT/JP2023/009469
Other languages
English (en)
Japanese (ja)
Inventor
正成 宮本
Original Assignee
Panasonic Intellectual Property Management Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co., Ltd.
Publication of WO2023182016A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Definitions

  • the present disclosure relates to a voice authentication device and a voice authentication method.
  • Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice.
  • the speech recognition device stores a plurality of motion noise models, each created for and associated with one of a plurality of motions, detects input speech including the subject's voice, identifies the subject's motion, and reads out the motion noise model corresponding to the motion identified by the motion identifying means.
  • the speech recognition device also reads an environmental noise model corresponding to the subject's current position, synthesizes the environmental noise model with the read motion noise model, and uses the synthesized noise-superimposition model to recognize the subject's voice contained in the detected input speech.
  • In Patent Document 1, however, it is necessary to collect in advance the motion noise generated by each of the plurality of motions and the environmental noise at each of the plurality of positions where voice recognition may be performed, which is very time-consuming. Furthermore, in voiceprint authentication, the feature quantity indicating the individuality of the person to be authenticated, which is extracted from a voice signal, changes depending on the noise contained in that signal. Therefore, when performing voiceprint authentication with the voice recognition device described above, if the noise contained in the pre-registered voice signal differs from the noise contained in the voice signal collected at authentication time, the features extracted from the two signals will not indicate the individuality of the same person, and the accuracy of voiceprint authentication may decrease.
  • the present disclosure was devised in view of the above-mentioned conventional situation, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
  • the present disclosure includes: an acquisition unit that acquires audio data; a detection unit that detects, from the audio data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; an extraction unit that extracts the utterance feature amount of the speech section and the noise included in the non-speech section of the audio data; a selection unit that selects one similarity calculation model from a plurality of similarity calculation models based on the extracted noise; and an authentication unit that authenticates the speaker by using the selected similarity calculation model to compare the utterance feature amount of the speaker with the registered feature amounts of registered speakers.
  • the present disclosure also provides a voice authentication method performed by a terminal device, which: acquires voice data; detects, from the voice data, a speech section in which the speaker is speaking and a non-speech section in which the speaker is not speaking; extracts the utterance feature amount of the speech section and the noise included in the non-speech section of the voice data; selects one of a plurality of similarity calculation models based on the extracted noise and the noise associated with the pre-registered feature amounts of a plurality of registered speakers; and authenticates the speaker by using the selected similarity calculation model to compare the utterance feature amount of the speaker with the registered feature amount of each registered speaker.
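The claimed flow (detect sections, extract features and noise, select a model per noise-type pair, compare) can be sketched end to end. Everything below is an illustrative stand-in: the function names, the energy gate, the toy feature vector, and the threshold are assumptions, not taken from the patent, whose actual feature extraction and similarity calculation models are trained models.

```python
# Minimal end-to-end sketch of the claimed flow. All names, gates, and
# thresholds are illustrative assumptions; the patent's feature extraction
# and similarity calculation models are trained models, not these stubs.

def detect_sections(audio, gate=0.1):
    """Split samples into a speech section and a non-speech section by level."""
    speech = [s for s in audio if abs(s) >= gate]
    non_speech = [s for s in audio if abs(s) < gate]
    return speech, non_speech

def extract_features(speech):
    """Stand-in for the feature extraction model (toy 2-D vector)."""
    n = max(len(speech), 1)
    return [sum(abs(s) for s in speech) / n, max(speech, default=0.0)]

def classify_noise(non_speech):
    """Stand-in for the noise determination device P2."""
    n = max(len(non_speech), 1)
    return "in-store" if sum(abs(s) for s in non_speech) / n > 0.01 else "quiet"

def authenticate(audio, registered, models, threshold=0.9):
    """Select a similarity model per noise-type pair, then compare features."""
    speech, non_speech = detect_sections(audio)
    feat = extract_features(speech)
    noise_type = classify_noise(non_speech)
    best, best_score = None, 0.0
    for name, entry in registered.items():
        model = models[(noise_type, entry["noise_type"])]
        score = model(feat, entry["features"])
        if score >= threshold and score > best_score:
            best, best_score = name, score
    return best
```

The key design point mirrored from the claims is that the similarity model is chosen from the noise, not fixed in advance.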
  • Fig. 1: Block diagram showing an example of the internal configuration of a voice authentication system. Fig. 2: Diagram illustrating each process performed by a processor of a terminal device in the embodiment.
  • Fig. 3: Flowchart showing an example of the operation procedure of the terminal device in the embodiment. Fig. 4: Flowchart showing an example of the speaker authentication procedure of the terminal device in the embodiment.
  • FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system 100 according to an embodiment.
  • FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
  • the voice authentication system 100 includes a terminal device P1 as an example of a voice authentication device, a monitor MN, a noise determination device P2, and a network NW. Note that the voice authentication system 100 may include a microphone MK or a monitor MN.
  • the microphone MK picks up the voice uttered by the speaker US in order to register the voice in the terminal device P1 in advance.
  • the microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data that is registered in the terminal device P1.
  • Microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • the microphone MK picks up the voice uttered by the speaker US, which is used for speaker authentication.
  • the microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data.
  • Microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • Hereinafter, the voice data used for voice registration or already registered in the terminal device P1 will be referred to as "registered voice data", and the voice data used for voice authentication will be referred to as "authentication voice data", to distinguish between the two.
  • the microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, a tablet terminal, or the like. Further, the microphone MK may transmit an audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown).
  • the terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes a voice registration process using the registered voice data of the speaker US and a speaker authentication process using the authentication voice data. The terminal device P1 includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a noise-similarity calculation model correspondence list DB4.
  • the communication unit 10, which is an example of the acquisition unit, is connected to the microphone MK, the monitor MN, and the noise determination device P2 so that data can be transmitted and received through wired or wireless communication.
  • the wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark).
  • the communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as a Universal Serial Bus (USB). Furthermore, the communication unit 10 may perform data transmission and reception with the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
  • the processor 11 is configured using, for example, a central processing unit (CPU) or a field programmable gate array (FPGA), and performs various processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the programs and data held in the memory 12 and, by executing the programs, realizes the functions of the noise extraction unit 111, the feature amount extraction unit 112, the noise determination unit 113, the speaker registration unit 114, the similarity calculation model selection unit 115, the reliability calculation unit 116, and the authentication unit 117.
  • When registering the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature extraction unit 112, the noise determination unit 113, and the speaker registration unit 114, and thereby executes the process of newly registering (storing) the feature amount of the speaker US in the registered speaker database DB2.
  • the feature amount here is a feature amount indicating the individuality of the speaker US extracted from the registered voice data.
  • When authenticating the speaker US, the processor 11 executes the speaker authentication process by realizing the functions of the noise extraction unit 111, the feature amount extraction unit 112, the noise determination unit 113, the similarity calculation model selection unit 115, the reliability calculation unit 116, and the authentication unit 117.
  • the noise extraction unit 111, which is an example of a detection unit and an extraction unit, acquires the registered voice data or authentication voice data of the speaker US transmitted from the microphone MK.
  • the noise extraction unit 111 detects, from the registered voice data or authentication voice data, the utterance section in which the speaker US is uttering and the section in which the speaker US is not uttering (hereinafter referred to as the "non-speech section").
  • the noise extraction unit 111 extracts the noise included in the detected non-speech section, and outputs the extracted noise (hereinafter referred to as "noise data") to the noise determination unit 113.
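The section split performed by the noise extraction unit 111 is, in its simplest form, a frame-energy voice activity detector. The sketch below is an illustrative stand-in (the frame length and RMS gate are assumptions; production systems typically use a trained VAD):

```python
# Illustrative frame-energy VAD: frames whose RMS energy falls below a
# threshold are treated as the non-speech section, and their samples are
# collected as noise data. Frame length and gate value are assumptions.

def frame_rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def split_speech_noise(samples, frame_len=160, rms_gate=0.05):
    """Return (speech samples, noise samples) by per-frame RMS gating."""
    speech, noise = [], []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        (speech if frame_rms(frame) >= rms_gate else noise).extend(frame)
    return speech, noise
```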
  • the feature extraction unit 112, which is an example of a detection unit, acquires the registered voice data or authentication voice data of the speaker US transmitted from the microphone MK.
  • the feature extraction unit 112 detects a speech section from the registered voice data or the authentication voice data, and uses the feature extraction model to extract, from the detected speech section of the registered voice data or the authentication voice data, a feature amount indicating the individuality of the speaker US.
  • When a control command requesting registration of the feature amount of the speaker US and the registered speaker information of the speaker US are associated with the registered voice data of the speaker US transmitted from the microphone MK, the feature amount extraction unit 112 proceeds to the voice registration process of the speaker US.
  • the feature amount extraction unit 112 outputs the extracted feature amount of the speaker US to the speaker registration unit 114.
  • the feature extraction unit 112 executes speaker authentication processing when a control command requesting speaker authentication is associated with the authentication voice data of the speaker US transmitted from the microphone MK.
  • the feature amount extraction unit 112 outputs the extracted feature amount of the speaker US to the authentication unit 117.
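The feature amount itself comes from a learned feature extraction model (database DB1, described below). As a placeholder so the surrounding steps can be exercised, here is a toy fixed-length vector computed from a speech section; it is purely illustrative and in no way the patent's model:

```python
# Toy fixed-length "speaker feature" from a speech section: mean absolute
# amplitude, RMS energy, and zero-crossing rate. A real system would use a
# trained speaker-embedding network instead of these hand-picked statistics.

def toy_embedding(speech):
    """Return a 3-D stand-in feature vector for a non-empty speech section."""
    n = len(speech)
    mean_abs = sum(abs(s) for s in speech) / n
    rms = (sum(s * s for s in speech) / n) ** 0.5
    zcr = sum(
        1 for a, b in zip(speech, speech[1:]) if (a < 0) != (b < 0)
    ) / (n - 1)
    return [mean_abs, rms, zcr]
```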
  • the noise determination unit 113 acquires the noise data output from the noise extraction unit 111.
  • the noise determination unit 113 transmits the noise data to the noise determination device P2 via the network NW, which determines the type of noise included in the registered voice data or authentication voice data of the speaker US.
  • the noise referred to here is sound picked up due to the environment (background) at the time of sound collection, and includes, for example, surrounding voices at the time of sound collection, music, the sound of a vehicle running, the sound of the wind, and so on.
  • the noise type indicates, for example, the environment (place) or position where the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or inside a station.
  • the noise type may further include information on time zones such as early morning, daytime, and nighttime, for example.
  • In the voice registration process of the speaker US, the noise determination unit 113 outputs the noise type information corresponding to the noise type determination result transmitted from the noise determination device P2 to the speaker registration unit 114. Further, in the speaker authentication process, the noise determination unit 113 outputs the noise type information corresponding to the noise type determination result transmitted from the noise determination device P2 to the similarity calculation model selection unit 115.
  • the speaker registration unit 114 acquires the feature quantity of the speaker US output from the feature quantity extraction unit 112 and the noise type information, output from the noise determination unit 113, of the noise included in the registered voice data of the speaker US.
  • the speaker registration unit 114 associates the feature amount of the speaker US, noise type information, and speaker information of the speaker US, and registers them in the registered speaker database DB2.
  • the speaker information may be extracted from the registered speech data by voice recognition, or may be obtained from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, a tablet terminal).
  • the speaker information here includes, for example, identification information that can identify the speaker US, the name of the speaker US, speaker identification (ID), and the like.
  • the similarity calculation model selection unit 115, which is an example of the selection unit, calculates (evaluates) the similarity between the noise type associated with the feature amount of each of the plurality of registered speakers registered in the registered speaker database DB2 and the noise type of the noise extracted from the voice data of the speaker US. The similarity calculation model selection unit 115 then selects one of the similarity calculation models stored in the similarity calculation model database DB3 based on the calculated similarity.
  • the similarity calculation model selected here is a model that is more suitable or optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of any registered speaker.
  • Specifically, the similarity calculation model selection unit 115 refers to the correspondence lists LST1 and LST2 (see FIGS. 5 and 6), in which the noise type information corresponding to the voice data of the speaker US, the noise type information corresponding to each of the plurality of registered speakers, and information on the similarity calculation model selected for the similarity of these noise types are associated with one another. Based on these lists, it selects one model from the plurality of similarity calculation models and outputs the selection to each of the reliability calculation unit 116 and the authentication unit 117.
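Functionally, the correspondence lists LST1 and LST2 behave like a lookup table keyed by the pair (noise type at authentication, noise type at registration). A minimal sketch with hypothetical noise-type and model names, falling back to a catch-all model akin to the "universal model" mentioned later:

```python
# Correspondence-list lookup: (authentication noise type, registered noise
# type) -> similarity calculation model name. All keys and model names
# here are illustrative examples, not taken from the patent's tables.

CORRESPONDENCE = {
    ("in-store noise", "in-store noise"): "model A",
    ("in-store noise", "outdoor wind noise"): "model B",
    ("outdoor wind noise", "in-store noise"): "model C",
}

def select_model(auth_noise, registered_noise, fallback="universal model"):
    """Pick the model for a noise-type pair, with a catch-all fallback."""
    return CORRESPONDENCE.get((auth_noise, registered_noise), fallback)
```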
  • the reliability calculation unit 116 calculates (evaluates) the reliability (score) indicating the certainty of the identification result of the speaker US based on the similarity calculated by the authentication unit 117.
  • the reliability calculation unit 116 calculates the reliability based on the similarity calculated by the authentication unit 117, the noise type information used in the similarity calculation process, the similarity calculation model, and so on, and outputs the calculated reliability information to the authentication unit 117. Note that the reliability calculation process by the reliability calculation unit 116 is not essential and may be omitted.
  • the authentication unit 117, which is an example of a calculation unit, acquires the feature amount of the speaker US output from the feature amount extraction unit 112, and acquires the feature amount of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • the authentication unit 117 also obtains the selected models of the correspondence lists LST1 and LST2 output from the similarity calculation model selection unit 115.
  • the authentication unit 117 uses a similarity calculation model based on the correspondence lists LST1 and LST2 to calculate the similarity between the feature amounts of each of the plurality of registered speakers and the feature amount of the speaker US.
  • the authentication unit 117 identifies the speaker US based on the calculated similarity.
  • the authentication unit 117 generates an authentication result screen SC based on the speaker information of the identified speaker US, and transmits it to the monitor MN.
  • the memory 12 includes, for example, a random access memory (hereinafter referred to as "RAM") serving as a work memory used when the processor 11 executes each process, and a read-only memory (hereinafter referred to as "ROM") that stores the programs and data defining the operation of the processor 11.
  • Data or information generated or acquired by the processor 11 is temporarily stored in the RAM.
  • a program that defines the operation of the processor 11 is written in the ROM.
  • the feature extraction model database DB1 is a so-called storage, and is configured using a storage medium such as a flash memory, a hard disk drive (hereinafter referred to as "HDD"), or a solid state drive (hereinafter referred to as "SSD").
  • the feature extraction model database DB1 stores a feature extraction model capable of detecting the utterance section of the speaker US from registered speech data or authenticated speech data and extracting the feature of the speaker US.
  • the feature extraction model is, for example, a learning model generated by learning using deep learning or the like.
  • the registered speaker database DB2 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the registered speaker database DB2 stores, in association with one another, the feature amount of each of a plurality of pre-registered speakers, information on the noise type of the noise included in the registered voice data from which each feature amount was extracted, and the registered speaker information.
  • the similarity calculation model database DB3 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the similarity calculation model database DB3 stores a similarity calculation model that can calculate the similarity between two feature amounts.
  • the similarity calculation model is, for example, a learning model generated by learning using deep learning or the like.
  • a similarity calculation model is a model in which the dimensions likely to express individuality have been learned in advance and retained, in order to calculate the similarity between two multidimensional vectors with high precision.
  • the method of calculating similarity using a model is one example of a method of calculating the similarity between vectors; well-known techniques such as Euclidean distance or cosine similarity may also be used.
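The two well-known vector comparisons named above can be written directly; both operate on the multidimensional feature vectors described here:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note the opposite polarities: higher cosine similarity means more similar, while higher Euclidean distance means less similar, so any threshold test must be oriented accordingly.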
  • the noise-similarity calculation model correspondence list DB4 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the noise-similarity calculation model correspondence list DB4 stores similarity calculation models used in similarity calculation processing for each combination of noise types.
  • the monitor MN is configured using a display such as a liquid crystal display (LCD) or an organic electroluminescence (EL) display, for example.
  • the monitor MN displays the authentication result screen SC output from the terminal device P1.
  • the authentication result screen SC is a screen that notifies the administrator (for example, the person viewing the monitor MN) of the speaker authentication result, and contains the authentication result information "The voice matched Mr. XX's voice." and the reliability information "Reliability: High".
  • the authentication result screen SC may also include other registered speaker information (for example, a face image, etc.). Further, the authentication result screen SC does not need to include reliability information.
  • the network NW connects the terminal device P1 and the noise determination device P2 to enable data communication.
  • the noise determination device P2 may not only be connected to the terminal device P1 via the network NW, but may also be a part of the terminal device P1.
  • the noise determination device P2 acquires the noise extracted from the registered voice data of the speaker US or the authentication voice data transmitted from the terminal device P1.
  • the noise determination device P2 determines the type of noise based on the acquired noise.
  • the noise determination device P2 transmits noise type information to the terminal device P1.
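The patent leaves the determination method of the noise determination device P2 open. As one illustrative stand-in, a nearest-template classifier over two summary statistics can be used; the noise types, features, and template values below are all assumptions:

```python
# Illustrative noise-type determination: compare the noise's summary
# statistics (mean absolute amplitude, zero-crossing rate) against
# per-type templates and return the nearest one. Types and template
# values are invented for illustration only.

def noise_stats(noise):
    """Return (mean absolute amplitude, zero-crossing rate) of noise samples."""
    n = len(noise)
    mean_abs = sum(abs(s) for s in noise) / n
    zcr = sum(1 for a, b in zip(noise, noise[1:]) if (a < 0) != (b < 0)) / (n - 1)
    return (mean_abs, zcr)

TEMPLATES = {
    "in-store noise": (0.05, 0.1),
    "outdoor wind noise": (0.2, 0.02),
}

def determine_noise_type(noise):
    """Return the template type with the smallest squared distance."""
    m, z = noise_stats(noise)
    return min(
        TEMPLATES,
        key=lambda t: (TEMPLATES[t][0] - m) ** 2 + (TEMPLATES[t][1] - z) ** 2,
    )
```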
  • FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
  • the terminal device P1 acquires audio data from the microphone MK (St11).
  • the microphone MK may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
  • the terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
  • if the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines that the feature amount of the speaker US is to be newly registered in the registered speaker database DB2 (St12, YES), and extracts the noise included in the non-speech section of the voice data (registered voice data) (St13).
  • the noise referred to here is the noise included in the voice data (registered voice data or authentication voice data), and is the surrounding environmental sound, noise, etc. when the utterance voice of the speaker US is collected.
  • In step St12, if the control command is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines that the feature amount of the speaker US is not to be newly registered in the registered speaker database DB2 (St12, NO), and extracts the noise included in the non-speech section of the voice data (authentication voice data) (St14).
  • the terminal device P1 associates a control command requesting determination of the extracted noise type with the noise, and transmits the extracted noise to the noise determination device P2.
  • the terminal device P1 executes noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St15).
  • the terminal device P1 extracts a feature amount indicating the individuality of the speaker US from the utterance section of the registered voice data (St16).
  • the feature amounts extracted from the utterance sections of the registered voice data and the authentication voice data include a feature amount indicating the individuality of the speaker US and a noise feature amount.
  • the terminal device P1 associates the feature amount of the speaker US extracted from the registered voice data, noise type information, and speaker information and registers them in the registered speaker database DB2 (St17).
  • the terminal device P1 then extracts the noise from the non-speech section of the authentication voice data.
  • the terminal device P1 associates the extracted noise with a control command requesting a noise type determination, and transmits the extracted noise to the noise determination device P2.
  • the terminal device P1 executes noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St18).
  • the terminal device P1 extracts the feature amount of the speaker US from the utterance section of the authenticated voice data of the speaker US (St19). Further, the terminal device P1 acquires the feature amounts of each of the plurality of registered speakers registered in the registered speaker database DB2 (St20), and executes speaker authentication processing (St21).
  • FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
  • Based on the noise type information, the terminal device P1 selects, for each registered speaker, a similarity calculation model suitable for calculating the similarity between the feature amount of the speaker US and the feature amount of that registered speaker. To do so, the terminal device P1 refers to the correspondence lists LST1 and LST2 (see FIGS. 5 and 6) and reads the selected similarity calculation model from the similarity calculation model database DB3 (St211).
  • the terminal device P1 uses the similarity calculation model to calculate the similarity between the feature amount of the voice data of the speaker US and the feature amount of one of the plurality of registered speakers registered in the registered speaker database DB2 (St212).
  • the terminal device P1 repeatedly executes the process of step St212 until it has calculated the similarity between the feature amount of the voice data of the speaker US and the feature amount of every registered speaker registered in the registered speaker database DB2.
  • the terminal device P1 determines whether there is a degree of similarity greater than or equal to a threshold value among the calculated degrees of similarity (St213).
  • If the terminal device P1 determines in the process of step St213 that one of the calculated similarities is equal to or greater than the threshold (St213, YES), it identifies the speaker US based on the registered speaker information corresponding to that similarity (St214). Note that if multiple similarities are determined to be equal to or greater than the threshold, the terminal device P1 may identify the speaker US based on the registered speaker information corresponding to the highest calculated similarity.
  • If the terminal device P1 determines in the process of step St213 that none of the calculated similarities is equal to or greater than the threshold (St213, NO), it determines that the speaker US cannot be identified (St215).
  • the terminal device P1 generates an authentication result screen SC based on the registered speaker information of the identified speaker US.
  • the terminal device P1 outputs the generated authentication result screen SC to the monitor MN for display (St216).
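Steps St212 through St215 reduce to a loop that scores every registered speaker with the model chosen for the noise-type pair and then applies the threshold; the highest score above the threshold wins, matching the note about multiple matches. The names and threshold value below are illustrative, not from the patent:

```python
# Sketch of St212-St215: score each registered speaker with the model
# selected for the noise-type pair, keep the best score, and identify the
# speaker only if that score clears the threshold. Names are illustrative.

def identify_speaker(feat, noise_type, registered, select_model, threshold=0.8):
    """Return the best-matching registered speaker, or None (St215)."""
    best_name, best_score = None, float("-inf")
    for name, entry in registered.items():
        model = select_model(noise_type, entry["noise_type"])
        score = model(feat, entry["features"])
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name          # St214: speaker identified
    return None                   # St215: speaker cannot be identified
```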
  • the terminal device P1 registers the speaker information, the feature amount of the speaker US, and the information on the noise type included in the utterance section of the speaker US in association with each other at the time of voice registration.
  • As a result, the terminal device P1 can select a similarity calculation model suitable for each noise type. Therefore, by using the selected similarity calculation model, the terminal device P1 can more accurately calculate the similarity between the feature amount of the speaker US, which includes different noise, and the feature amount of each registered speaker. This makes it possible to more effectively suppress the decrease in speaker authentication accuracy caused by the noise contained in the authentication voice data.
  • the terminal device P1 calculates and displays the reliability, which indicates the likelihood of the speaker identified by the speaker authentication process as indicated by the calculated similarity, based on the similarity calculation model used to calculate the similarity. Thereby, the terminal device P1 can present the certainty of the speaker authentication result to the administrator viewing the monitor MN. Furthermore, by presenting the reliability, the terminal device P1 can inform the administrator that no similarity calculation model suitable for calculating the similarity was available and that speaker authentication was performed using the "universal model" similarity calculation model, which will be described later.
  • FIG. 5 is a diagram illustrating an example of the correspondence list LST1 when the noise type at the time of voice registration and the noise type at the time of voice authentication are the same.
  • FIG. 6 is a diagram illustrating an example of the correspondence list LST2 when the noise type at the time of voice registration and the noise type at the time of voice authentication are different.
  • Here, an example will be described in which the correspondence lists LST1 and LST2 are referred to based on the noise type information associated with the feature amount of any one registered speaker registered in the registered speaker database DB2 and the noise type of the noise included in the authentication voice data of the speaker US, who is the target of speaker authentication.
  • In the example shown in FIG. 5, the voice data at the time of voice registration and the voice data at the time of voice authentication each contain noise corresponding to the same noise type, "in-store noise."
  • the terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US transmitted from the microphone MK to the noise determination device P2.
  • the terminal device P1 refers to the predefined correspondence list LST1 based on the noise type determination result transmitted from the noise determination device P2 and the noise type information of the registered speakers registered in the registered speaker database DB2, and selects one similarity calculation model from among the plurality of similarity calculation models.
  • the correspondence list LST1 is data that associates the noise type determination result "noise determination result 1" of the noise contained in the authentication voice data of the speaker US, the noise type determination result "noise determination result 2" of the registered speaker registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two noise types.
  • the noise type determination result "Noise determination result 1" indicates, for example, the noise determination result in the voice at the time of authentication.
  • the noise type determination result "Noise determination result 2" indicates, for example, the noise determination result of registered speech.
  • one or more pieces of noise type information may be registered in the registered speaker database DB2; in other words, if there are multiple noise type candidates, multiple pieces of information may be retained. Furthermore, the information on the determination probabilities indicating the reliability of each of the multiple noise type determination results is not essential and may be omitted.
  • the similarity calculation model "selected model” is, for example, the similarity calculation model “model A” selected in response to the noise type determination result "noise determination result 1" and the noise type determination result "noise determination result 2". , "Model B”, “Model C”, and "Model Z”.
  • the similarity calculation model "Model A”
  • the information on the noise type of the noise included in the features of the speaker US is “Noise A”
  • the information on the noise type of the noise included in the features of the registered speaker is “Noise A”.
  • This similarity calculation model is determined to be optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of the registered speaker when the information is "noise A.”
  • "Model Z" is a general-purpose similarity calculation model.
  • the similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 by referring to the correspondence list LST1. For example, in the example shown in FIG. 5, the similarity calculation model selection unit 115 selects the similarity calculation model "model A."
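The correspondence-list lookup described above can be sketched as follows. This is a hypothetical illustration, not the actual implementation of the terminal device P1: the pair keys and model names ("model A," "model Z," etc.) merely mirror the examples of FIG. 5, and the fallback to the general-purpose model corresponds to the "model Z" case mentioned later.

```python
# Hypothetical sketch of the correspondence-list lookup described above.
# The pair keys and model names only mirror the examples of FIG. 5;
# the real list contents are defined by the device implementation.
CORRESPONDENCE_LIST = {
    # (noise type at authentication, noise type at registration) -> model
    ("noise A", "noise A"): "model A",
    ("noise A", "noise B"): "model B",
    ("noise B", "noise B"): "model C",
}

def select_similarity_model(auth_noise: str, registered_noise: str) -> str:
    """Return the model mapped to the noise-type pair, falling back to the
    general-purpose "model Z" when no dedicated model exists."""
    return CORRESPONDENCE_LIST.get((auth_noise, registered_noise), "model Z")
```

For instance, the pair ("noise A", "noise A") yields "model A," while an unlisted pair yields the general-purpose "model Z."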
  • In the example shown in FIG. 6, the authentication voice data of the speaker US obtained at the time of voice authentication contains noise corresponding to the noise type "in-store noise," while the registered voice data of the registered speaker recorded at the time of voice registration contains noise corresponding to a different noise type, "outdoor noise."
  • the terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US transmitted from the microphone MK to the noise determination device P2.
  • the terminal device P1 refers to the predefined correspondence list LST2 based on the noise type determination result transmitted from the noise determination device P2 and the noise type information of the registered speakers registered in the registered speaker database DB2, and selects one similarity calculation model from among the plurality of similarity calculation models.
  • the correspondence list LST2 is data that associates the noise type determination result "noise determination result 3" of the noise contained in the authentication voice data of the speaker US, the noise type determination result "noise determination result 4" of the registered speaker registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two noise types.
  • the noise type determination result "noise determination result 3" indicates the determination result of the noise type of the noise extracted from the authentication voice data of the speaker US.
  • the noise type determination result "Noise determination result 4" indicates the determination result of the noise type information of the registered speakers registered in the registered speaker database DB2.
  • one or more pieces of noise type information may be registered in the registered speaker database DB2; in other words, if there are multiple noise type candidates, multiple pieces of information may be retained. Further, the information on the determination probability indicating the reliability corresponding to each noise type determination result is not essential and may be omitted.
  • the similarity calculation model "selected model” is a similarity calculation model "model G” selected corresponding to the noise type determination result "noise determination result 3" and the noise type determination result "noise determination result 4", Includes each of "Model H”, “Model I”, and "Model Z”.
  • the similarity calculation model "Model G” is used when the noise type information of the noise included in the authentication speech data is "Noise A” and the noise type information of the registered speaker is "Noise D”. , is a similarity calculation model determined to be optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of the registered speaker.
  • when the similarity calculation model selection unit 115 determines that there is no similarity calculation model suited to the similarity calculation process for the combination of the noise type information of the noise contained in the feature amount of the speaker US and the noise type information of the registered speaker, it selects the general-purpose similarity calculation model "model Z."
  • the similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 by referring to the correspondence list LST2. For example, in the example shown in FIG. 6, the similarity calculation model selection unit 115 selects the similarity calculation model "model E."
  • Thereby, the terminal device P1 can select the similarity calculation model optimal for the process of calculating the similarity between the two feature amounts (for example, the feature amount of the speaker US and the feature amount of the registered speaker). As a result, even if the noise type contained in the feature amount at the time of voice registration differs from the noise type contained in the feature amount at the time of voice authentication, the terminal device P1 can select the optimal similarity calculation model for calculating the similarity between the two feature amounts. In other words, the terminal device P1 can more effectively suppress the decrease in speaker authentication accuracy caused by noise contained in the voice data. If the noise determination yields multiple candidate conditions, the terminal device P1 may, for example, calculate the similarity with the similarity calculation model corresponding to each condition and use the average of those values as the similarity.
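The averaging fallback for ambiguous noise determinations could look like the following minimal sketch. The lambda "models" are toy stand-ins for trained similarity calculation models and are purely illustrative; the function name is hypothetical.

```python
from statistics import mean

def similarity_over_candidates(speaker_feature, registered_feature, candidate_models):
    """When the noise determination yields several candidate conditions,
    compute the similarity with each condition's model and average them."""
    scores = [model(speaker_feature, registered_feature) for model in candidate_models]
    return mean(scores)

# Toy stand-ins for trained similarity calculation models (illustrative only).
model_for_outdoor = lambda a, b: 0.80
model_for_instore = lambda a, b: 0.60
```

With the two toy models above, the averaged similarity comes out to 0.70.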
  • FIG. 7 is a diagram illustrating an example of calculating reliability.
  • Although FIG. 7 shows an example in which the reliability is expressed with the two levels "high" and "low," the reliability may instead be calculated as a numerical value from 0 (zero) to 100, for example.
  • An example will be described in which the reliability is calculated when the registered voice data at the time of voice registration and the authentication voice data at the time of voice authentication are of the same noise type, as explained with reference to FIG. 5.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • the reliability calculation unit 116 determines whether the similarity calculation model used to calculate the similarity is a similarity calculation model based on a known noise type, and determines the reliability based on the determination probability of the noise type.
  • In "Case 1," the noise type determination results for the noise are as follows: the determination probability that the noise type is "outdoor wind noise" is "90%," the determination probability that the noise type is "in-store music" is "6%," and the determination probability that the noise type is "unknown noise" is "4%."
  • the noise type determination results shown in "Case 1" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and that the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the similarity calculation model "outdoor wind noise model” based on the determination result of the noise type of the noise.
  • the authentication unit 117 uses the similarity calculation model "outdoor noise model” to calculate the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2. do.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • the reliability calculation unit 116 calculates that the similarity calculation model "outdoor noise model” is a similarity calculation model based on a known noise type "outdoor wind noise model” and that the noise determination probability is "90%”. ”, the reliability is calculated as “high”.
  • Specifically, the reliability calculation unit 116 calculates the reliability as "high" when it determines that the noise determination probability is equal to or higher than a predetermined probability (for example, 85%, 90%, etc.), and calculates the reliability as "low" when it determines that the noise determination probability is lower than the predetermined probability.
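A minimal sketch of this two-level reliability rule follows. The 85% default is one of the example values given above, and the function name is hypothetical; the actual reliability calculation unit 116 may use other criteria.

```python
def reliability(known_noise_model: bool, determination_probability: float,
                predetermined_probability: float = 0.85) -> str:
    """Two-level reliability: "high" only when a known-noise model was used
    and the noise determination probability meets the predetermined value;
    "low" otherwise."""
    if known_noise_model and determination_probability >= predetermined_probability:
        return "high"
    return "low"
```

Applied to the three cases of FIG. 7, this yields "high" for Case 1 (known noise, 90%) and "low" for Case 2 (known noise, 48%) and Case 3 (unknown noise, 55%).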
  • In "Case 2," the noise type determination results for the noise are as follows: the determination probability that the noise type is "outdoor wind noise" is "48%," the determination probability that the noise type is "unknown noise" is "39%," and the determination probability that the noise type is "in-store music" is "13%."
  • the noise type determination results shown in "Case 2" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and that the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the similarity calculation model "outdoor wind noise model” based on the determination result of the noise type of the noise.
  • the authentication unit 117 uses the similarity calculation model "outdoor noise model” to calculate the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2. do.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • the reliability calculation unit 116 calculates that the similarity calculation model "outdoor noise model” is a similarity calculation model based on a known noise type “outdoor wind noise model” and that the noise determination probability is "48%”. ”, the reliability is calculated as “low”.
  • In "Case 3," the noise type determination results for the noise are as follows: the determination probability that the noise type is "unknown noise" is "55%," the determination probability that the noise type is "outdoor wind noise" is "28%," and the determination probability that the noise type is "in-store music" is "17%."
  • the noise type determination results shown in "Case 3" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and that the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the similarity calculation model "general model” based on the determination result of the noise type of the noise.
  • the authentication unit 117 calculates the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2 using the similarity calculation model "general model."
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • the reliability calculation unit 116 calculates the reliability as "low" because the similarity calculation model "general model" used to calculate the similarity corresponds to an unknown noise and the noise determination probability is "55%."
  • As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit) that acquires voice data; the noise extraction unit 111 and the feature amount extraction unit 112 (examples of a detection unit and an extraction unit) that detect, from the voice data, the utterance section in which the speaker is speaking and the non-utterance section in which the speaker is not speaking, and that extract the feature amount of the utterance section (an example of an utterance feature amount) and the noise contained in the non-utterance section; the similarity calculation model selection unit 115 (an example of a selection unit) that selects any one similarity calculation model from among the plurality of similarity calculation models based on the extracted noise and the noise associated with the feature amounts of the plurality of registered speakers registered in advance (an example of registered feature amounts); and the authentication unit 117 that authenticates the speaker by comparing the feature amount of the speaker US with the feature amount of the registered speaker using the selected similarity calculation model.
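The patent does not fix a particular method for separating the utterance section from the non-utterance section. As one possible sketch only, a toy energy-based splitter might look like this; the frame length and threshold ratio are arbitrary assumptions, not values from the embodiment.

```python
import numpy as np

def split_speech_noise(samples, frame_len=400, energy_ratio=4.0):
    """Toy energy-based split: frames well above the median frame energy are
    treated as the utterance section; the remaining frames form the
    non-speech section, from which the noise is taken."""
    n = len(samples) // frame_len
    frames = np.reshape(samples[: n * frame_len], (n, frame_len))
    energy = (frames ** 2).mean(axis=1)          # mean energy per frame
    threshold = np.median(energy) * energy_ratio  # adaptive threshold
    speech = frames[energy >= threshold].ravel()
    noise = frames[energy < threshold].ravel()
    return speech, noise
```

A real implementation would more likely use a trained voice activity detector; this sketch only illustrates the speech/non-speech split named in the claim summary.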
  • Thereby, the terminal device P1 can select a similarity calculation model suited to speaker authentication based on the combination of the noise types contained in the feature amount of the speaker US and in the feature amount of the registered speaker. In other words, even if the noise type contained in the feature amount at the time of voice registration differs from the noise contained in the feature amount at the time of voice authentication, the terminal device P1 can select a similarity calculation model better suited to speaker authentication. Therefore, the terminal device P1 can more effectively suppress the decrease in speaker authentication accuracy caused by noise contained in the voice data.
  • the communication unit 10 in the terminal device P1 according to the embodiment further acquires noise type information of the extracted noise.
  • the similarity calculation model selection unit 115 selects a similarity calculation model based on the acquired speaker noise type information and the registered speaker noise type information.
  • Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model better suited to the process of calculating the similarity between the two feature amounts (the feature amount of the speaker US and the feature amount of the registered speaker).
  • the terminal device P1 according to the embodiment further includes an authentication unit 117 (an example of a calculation unit) that calculates the degree of similarity between the feature amount of the speaker US and the feature amount of the registered speaker.
  • the authentication unit 117 authenticates the speaker US based on the plurality of calculated similarities.
  • the terminal device P1 according to the embodiment can perform speaker authentication using the degree of similarity between the feature amounts of a plurality of registered speakers registered in advance and the feature amounts of the speaker US.
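As an illustration only — the embodiment does not specify the similarity measure used by the authentication unit 117 — comparing a speaker's feature vector against each registered feature could be done with cosine similarity; the function names here are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(speaker_feature, registered):
    """Score the speaker's feature against every registered feature and
    return the name and similarity of the closest registered speaker."""
    scored = {name: cosine_similarity(speaker_feature, feat)
              for name, feat in registered.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]
```

For example, with registered features for "alice" and "bob," a speaker feature close to "alice" yields "alice" as the best match.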
  • the terminal device P1 according to the embodiment further includes a reliability calculation unit 116 (an example of a reliability calculation unit) that calculates the reliability of the similarity.
  • the communication unit 10 further acquires noise type information of the extracted noise and a score indicating the noise type of the noise.
  • the reliability calculation unit 116 calculates the reliability of the similarity based on the score. Thereby, the terminal device P1 according to the embodiment can calculate the reliability of the speaker authentication result by calculating the reliability corresponding to the similarity.
  • the authentication unit 117 in the terminal device P1 according to the embodiment identifies the registered speaker whose degree of similarity is equal to or greater than the threshold value as the speaker US.
  • the terminal device P1 according to the embodiment can perform speaker authentication using the degree of similarity between the feature amounts of a plurality of registered speakers registered in advance and the feature amounts of the speaker US.
  • the authentication unit 117 in the terminal device P1 according to the embodiment generates and outputs an authentication result screen SC that includes information regarding registered speakers whose degree of similarity is equal to or greater than the threshold value.
  • the terminal device P1 according to the embodiment can present the speaker authentication result to the speaker US or the administrator.
  • when the authentication unit 117 in the terminal device P1 according to the embodiment determines that none of the plurality of calculated similarities is equal to or greater than the threshold value, it determines that the speaker US cannot be identified. Thereby, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy and more effectively suppress erroneous authentication of the speaker US.
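The threshold decision described in the preceding points can be sketched as follows. `authenticate` is a hypothetical name, the threshold value is an arbitrary placeholder, and returning `None` stands for the "speaker cannot be identified" outcome.

```python
def authenticate(similarities, threshold=0.8):
    """Return the registered speaker whose similarity meets the threshold;
    None means the speaker could not be identified."""
    best = max(similarities, key=similarities.get)
    if similarities[best] >= threshold:
        return best
    return None
```

When every calculated similarity falls below the threshold, the function returns `None`, mirroring the "cannot be identified" determination above.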
  • the authentication unit 117 in the terminal device P1 according to the embodiment generates and outputs the authentication result screen SC including information regarding the registered speaker whose degree of similarity is equal to or greater than the threshold value and information on the calculated reliability.
  • Thereby, the terminal device P1 according to the embodiment displays the speaker authentication result together with its reliability, and can thus prompt the administrator to check whether the speaker authentication result is trustworthy.
  • the present disclosure is useful as a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

The invention relates to a voice authentication device comprising: a detection unit that detects, from voice data, an utterance section in which a speaker is speaking and a non-utterance section in which the speaker is not speaking; an extraction unit that extracts an utterance feature amount of the utterance section and the noise contained in the non-utterance section; a selection unit that selects, based on the extracted noise and the noise associated with a plurality of pre-registered feature amounts, one similarity calculation model from among a plurality of similarity calculation models; and an authentication unit that authenticates the speaker by matching the speaker's utterance feature amount against the registered feature amount of the registered speaker using the selected similarity calculation model.
PCT/JP2023/009469 2022-03-22 2023-03-10 Voice authentication device and voice authentication method WO2023182016A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-045390 2022-03-22
JP2022045390 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023182016A1 true WO2023182016A1 (fr) 2023-09-28

Family

ID=88101383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/009469 WO2023182016A1 (fr) 2022-03-22 2023-03-10 Voice authentication device and voice authentication method

Country Status (1)

Country Link
WO (1) WO2023182016A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6242198A (ja) * 1985-08-20 1987-02-24 松下電器産業株式会社 Speech recognition device
JPH0573090A (ja) * 1991-09-18 1993-03-26 Fujitsu Ltd Speech recognition method
JPH0736477A (ja) * 1993-07-16 1995-02-07 Ricoh Co Ltd Pattern matching method
JP2006003400A (ja) * 2004-06-15 2006-01-05 Honda Motor Co Ltd In-vehicle speech recognition system
JP2019035935A (ja) * 2017-08-10 2019-03-07 トヨタ自動車株式会社 Speech recognition device


Similar Documents

Publication Publication Date Title
US11978440B2 (en) Wakeword detection
US11734326B2 (en) Profile disambiguation
EP3144931A1 (fr) Appareil et procédé de gestion de dialogue
CN117577099A (zh) 设备上的多用户认证的方法、系统和介质
US10565986B2 (en) Extracting domain-specific actions and entities in natural language commands
JP2019053126A (ja) 成長型対話装置
US20190042561A1 (en) Extracting domain-specific actions and entities in natural language commands
US20180240460A1 (en) Speech recognition program medium, speech recognition apparatus, and speech recognition method
JP6280074B2 (ja) 言い直し検出装置、音声認識システム、言い直し検出方法、プログラム
US20230386468A1 (en) Adapting hotword recognition based on personalized negatives
US9224388B2 (en) Sound recognition method and system
JP6676009B2 (ja) 話者判定装置、話者判定情報生成方法、プログラム
WO2023182016A1 (fr) Dispositif d'authentification vocale et procédé d'authentification vocale
US10997972B2 (en) Object authentication device and object authentication method
WO2023182014A1 (fr) Dispositif d'authentification vocale et procédé d'authentification vocale
KR101840363B1 (ko) 오류 발음 검출을 위한 단말 및 음성 인식 장치, 그리고 그의 음향 모델 학습 방법
WO2023182015A1 (fr) Dispositif d'authentification vocale et procédé d'authentification vocale
CN110895938B (zh) 语音校正系统及语音校正方法
US20220335927A1 (en) Learning apparatus, estimation apparatus, methods and programs for the same
US20230386508A1 (en) Information processing apparatus, information processing method, and non-transitory recording medium
JP2018018404A (ja) 翻訳システム
Liu et al. Utterance verification on DTW based speech recognition using likelihood
JP6537996B2 (ja) 未知語検出装置、未知語検出方法、プログラム
JP2024034016A (ja) 音声取得装置および音声取得方法
CN117636872A (zh) 音频处理方法、装置、电子设备和可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23774615

Country of ref document: EP

Kind code of ref document: A1