WO2023182016A1 - Voice authentication device and voice authentication method - Google Patents

Voice authentication device and voice authentication method

Info

Publication number
WO2023182016A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
noise
registered
similarity
authentication
Prior art date
Application number
PCT/JP2023/009469
Other languages
French (fr)
Japanese (ja)
Inventor
正成 宮本
Original Assignee
Panasonic Intellectual Property Management Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co., Ltd.
Publication of WO2023182016A1 publication Critical patent/WO2023182016A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/20Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Definitions

  • the present disclosure relates to a voice authentication device and a voice authentication method.
  • Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice.
  • The speech recognition device stores a plurality of motion noise models, each created for and associated with one of a plurality of motions, detects input speech including the subject's voice, specifies the subject's motion, and reads out the motion noise model corresponding to the specified motion.
  • The speech recognition device reads an environmental noise model corresponding to the subject's current position, synthesizes it with the read motion noise model, and uses the synthesized noise superimposition model to recognize the subject's voice contained in the detected input speech.
  • In Patent Document 1, it is necessary to collect in advance both the motion noise generated by each of a plurality of motions and the environmental noise at each of a plurality of positions where voice recognition can be performed, which is very time-consuming. Furthermore, in voiceprint authentication, the feature quantity indicating the individuality of the authentication target (person), extracted from a voice signal, changes depending on the noise contained in that voice signal. Therefore, when voiceprint authentication is performed using the voice recognition device described above, if the noise contained in the pre-registered voice signal differs from the noise contained in the voice signal collected at the time of voiceprint authentication, the features extracted from the two voice signals do not indicate the individuality of the same person, and the accuracy of voiceprint authentication may decrease.
  • the present disclosure was devised in view of the above-mentioned conventional situation, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
  • The present disclosure provides a voice authentication device including: an acquisition unit that acquires audio data; a detection unit that detects, from the audio data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; an extraction unit that extracts the utterance feature amount of the speech section and the noise included in the non-speech section of the audio data; a selection unit that selects one similarity calculation model from a plurality of similarity calculation models based on the extracted noise and the noise associated with the registered feature amounts of a plurality of pre-registered speakers; and an authentication unit that authenticates the speaker by using the selected similarity calculation model to compare the utterance feature amount of the speaker with the registered feature amount of a registered speaker.
  • The present disclosure also provides a voice authentication method performed by a terminal device, including: acquiring voice data; detecting, from the voice data, a speech section in which the speaker is speaking and a non-speech section in which the speaker is not speaking; extracting the utterance feature amount of the speech section and the noise included in the non-speech section of the voice data; selecting one similarity calculation model based on the extracted noise and the noise associated with the registered feature amounts of a plurality of pre-registered speakers; and authenticating the speaker by using the selected similarity calculation model to compare the utterance feature amount of the speaker with the registered feature amount of a registered speaker.
  • Block diagram showing an example of the internal configuration of a voice authentication system in the embodiment
  • Diagram illustrating each process performed by a processor of a terminal device in the embodiment
  • Flowchart showing an example of the operation procedure of the terminal device in the embodiment
  • Flowchart showing an example of the speaker authentication procedure of the terminal device in the embodiment
  • FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system 100 according to an embodiment.
  • FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
  • The voice authentication system 100 includes a terminal device P1 as an example of a voice authentication device, a noise determination device P2, and a network NW. Note that the voice authentication system 100 may further include the microphone MK or the monitor MN.
  • the microphone MK picks up the voice uttered by the speaker US in order to register the voice in the terminal device P1 in advance.
  • the microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data that is registered in the terminal device P1.
  • The microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • the microphone MK picks up the voice uttered by the speaker US, which is used for speaker authentication.
  • the microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data.
  • The microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • Hereinafter, the voice data for voice registration or already registered in the terminal device P1 is referred to as "registered voice data", and the voice data for voice authentication is referred to as "authentication voice data", to differentiate between them.
  • the microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, a tablet terminal, or the like. Further, the microphone MK may transmit an audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown).
  • The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes a voice registration process using the registered voice data of the speaker US and a speaker authentication process using the authentication voice data. It includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a noise-similarity calculation model correspondence list DB4.
  • The communication unit 10, which is an example of the acquisition unit, is connected to the microphone MK, the monitor MN, and the noise determination device P2 so that data can be transmitted and received by wired or wireless communication.
  • the wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark).
  • the communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as a Universal Serial Bus (USB). Furthermore, the communication unit 10 may perform data transmission and reception with the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
  • The processor 11 is configured using, for example, a central processing unit (CPU) or a field programmable gate array (FPGA), and performs various processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12 and executes the program, thereby implementing the functions of the noise extraction unit 111, the feature amount extraction unit 112, the noise determination unit 113, the speaker registration unit 114, the similarity calculation model selection unit 115, the reliability calculation unit 116, and the authentication unit 117.
  • When registering the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature extraction unit 112, the noise determination unit 113, and the speaker registration unit 114, thereby executing a process of newly registering (storing) the feature amount of the speaker US in the registered speaker database DB2.
  • the feature amount here is a feature amount indicating the individuality of the speaker US extracted from the registered voice data.
  • When authenticating the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature amount extraction unit 112, the noise determination unit 113, the similarity calculation model selection unit 115, the reliability calculation unit 116, and the authentication unit 117, thereby executing the speaker authentication process.
  • The noise extraction unit 111, which is an example of a detection unit and an extraction unit, acquires the registered voice data or the authentication voice data of the speaker US transmitted from the microphone MK.
  • The noise extraction unit 111 detects, from the registered voice data or the authentication voice data, the utterance section in which the speaker US is speaking and the section in which the speaker US is not speaking (hereinafter referred to as the "non-speech section").
  • the noise extraction unit 111 extracts the noise included in the detected non-speech period, and outputs the extracted noise data (hereinafter referred to as “noise data”) to the noise determination unit 113.
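The split into speech and non-speech sections can be pictured with a simple frame-energy rule. This is a generic voice-activity-detection stand-in, not the method disclosed here; the frame length, threshold, and signal values are arbitrary assumptions for illustration.

```python
import numpy as np

# Minimal energy-based sketch of separating speech from non-speech sections,
# standing in for the noise extraction unit 111. Frame size and threshold
# are illustrative assumptions, not values from this disclosure.
def split_sections(samples, frame_len=160, threshold=0.01):
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    speech, noise = [], []
    for f in frames:
        energy = float(np.mean(np.square(f)))
        # high-energy frames are treated as speech, the rest as noise
        (speech if energy >= threshold else noise).append(f)
    return speech, noise

# Example: a loud speech-like burst surrounded by low-level background noise.
rng = np.random.default_rng(0)
sig = np.concatenate([rng.normal(0, 0.01, 800),   # non-speech (background)
                      rng.normal(0, 0.5, 800),    # speech-like burst
                      rng.normal(0, 0.01, 800)])  # non-speech again
speech, noise = split_sections(sig)
```

The noise frames collected this way would then be handed to the noise determination step, while the speech frames feed feature extraction.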
  • The feature extraction unit 112, which is an example of a detection unit, acquires the registered voice data or the authentication voice data of the speaker US transmitted from the microphone MK.
  • The feature extraction unit 112 detects a speech section from the registered voice data or the authentication voice data, and uses the feature extraction model to extract, from the detected speech section, a feature amount indicating the individuality of the speaker US.
  • When a control command requesting registration of the feature amount of the speaker US and the registered speaker information of the speaker US are associated with the registered voice data of the speaker US transmitted from the microphone MK, the feature amount extraction unit 112 proceeds to the voice registration process of the speaker US.
  • the feature amount extraction unit 112 outputs the extracted feature amount of the speaker US to the speaker registration unit 114.
  • the feature extraction unit 112 executes speaker authentication processing when a control command requesting speaker authentication is associated with the authentication voice data of the speaker US transmitted from the microphone MK.
  • the feature amount extraction unit 112 outputs the extracted feature amount of the speaker US to the authentication unit 117.
  • the noise determination unit 113 acquires the noise data output from the noise extraction unit 111.
  • the noise determination unit 113 transmits noise data to the noise determination device P2 via the network NW, and determines the type of noise included in the registered voice data or authentication voice data of the speaker US.
  • The noise referred to here is sound picked up due to the environment (background) at the time of sound collection, and includes, for example, surrounding voices, music, the sound of vehicles running, the sound of wind, and the like.
  • the noise type indicates, for example, the environment (place) or position where the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or inside a station.
  • the noise type may further include information on time zones such as early morning, daytime, and nighttime, for example.
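As a rough illustration, a noise type determination result carrying the place-derived type, optional time-zone information, and an optional determination probability (mentioned again with FIG. 5) could look as follows; the field names and values are assumptions, not the actual data format.

```python
# Illustrative shape of a noise type determination result returned by the
# noise determination device P2. Field names and values are assumptions.
noise_type_info = {
    "type": "in-store noise",   # environment (place) where the noise occurs
    "time_zone": "daytime",     # e.g. early morning, daytime, or nighttime
    "probability": 0.87,        # determination probability (may be omitted)
}
```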
  • In the voice registration process of the speaker US, the noise determination unit 113 outputs noise type information corresponding to the noise type determination result transmitted from the noise determination device P2 to the speaker registration unit 114. In the speaker authentication process, the noise determination unit 113 outputs noise type information corresponding to the noise type determination result transmitted from the noise determination device P2 to the similarity calculation model selection unit 115.
  • The speaker registration unit 114 acquires the feature amount of the speaker US output from the feature amount extraction unit 112 and the noise type information of the noise included in the registered voice data of the speaker US output from the noise determination unit 113.
  • the speaker registration unit 114 associates the feature amount of the speaker US, noise type information, and speaker information of the speaker US, and registers them in the registered speaker database DB2.
  • the speaker information may be extracted from the registered speech data by voice recognition, or may be obtained from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, a tablet terminal).
  • the speaker information here includes, for example, identification information that can identify the speaker US, the name of the speaker US, speaker identification (ID), and the like.
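A record in the registered speaker database DB2 therefore ties together three pieces of information. The sketch below shows one plausible shape for such a record; the class, field names, and example values are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass
from typing import List

# Sketch of a registered speaker database (DB2) record, associating the
# feature amount, the noise type of the registration-time voice data, and
# the speaker information. Names and values are illustrative assumptions.
@dataclass
class RegisteredSpeaker:
    speaker_id: str          # speaker identification (ID)
    name: str                # name of the registered speaker
    features: List[float]    # feature amount indicating individuality
    noise_type: str          # noise type of the registration-time voice data

# Registration associates all three pieces of information.
registered_speaker_db = []
registered_speaker_db.append(
    RegisteredSpeaker("spk-001", "Mr. XX", [0.2, 0.9, 0.4], "in-store noise"))
```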
  • The similarity calculation model selection unit 115, which is an example of the selection unit, calculates (evaluates) the similarity between the noise type associated with the feature amount of each of the plurality of registered speakers registered in the registered speaker database DB2 and the noise type of the noise extracted from the voice data of the speaker US. The similarity calculation model selection unit 115 selects a similarity calculation model stored in the similarity calculation model database DB3 based on the calculated similarity.
  • the similarity calculation model selected here is a model that is more suitable or optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of any registered speaker.
  • Specifically, the similarity calculation model selection unit 115 refers to the correspondence lists LST1 and LST2 (see FIGS. 5 and 6), in which the noise type information corresponding to the voice data of the speaker US, the noise type information corresponding to each of the plurality of registered speakers, and the information on the similarity calculation model selected for that combination of noise types are associated with one another, selects one of the plurality of similarity calculation models, and outputs the selection to each of the reliability calculation unit 116 and the authentication unit 117.
  • The reliability calculation unit 116 calculates (evaluates) the reliability (score) indicating the certainty of the identification result of the speaker US based on the similarity calculated by the authentication unit 117.
  • The reliability calculation unit 116 calculates the reliability based on the similarity calculated by the authentication unit 117, the noise type information used in the similarity calculation process, the similarity calculation model, and the like, and outputs the calculated reliability information to the authentication unit 117. Note that the reliability calculation process by the reliability calculation unit 116 is not essential and may be omitted.
  • The authentication unit 117, which is an example of a calculation unit, acquires the feature amount of the speaker US output from the feature amount extraction unit 112, and acquires the feature amount of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • the authentication unit 117 also obtains the selected models of the correspondence lists LST1 and LST2 output from the similarity calculation model selection unit 115.
  • the authentication unit 117 uses a similarity calculation model based on the correspondence lists LST1 and LST2 to calculate the similarity between the feature amounts of each of the plurality of registered speakers and the feature amount of the speaker US.
  • the authentication unit 117 identifies the speaker US based on the calculated similarity.
  • the authentication unit 117 generates an authentication result screen SC based on the speaker information of the identified speaker US, and transmits it to the monitor MN.
  • The memory 12 includes, for example, a random access memory (hereinafter referred to as "RAM") as a work memory used when executing each process of the processor 11, and a read-only memory (hereinafter referred to as "ROM") that stores programs and data that define the operation of the processor 11.
  • Data or information generated or acquired by the processor 11 is temporarily stored in the RAM.
  • a program that defines the operation of the processor 11 is written in the ROM.
  • The feature extraction model database DB1 is a so-called storage, and is configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter referred to as "HDD"), or a Solid State Drive (hereinafter referred to as "SSD").
  • The feature extraction model database DB1 stores a feature extraction model capable of detecting the utterance section of the speaker US from the registered voice data or the authentication voice data and extracting the feature amount of the speaker US.
  • the feature extraction model is, for example, a learning model generated by learning using deep learning or the like.
  • the registered speaker database DB2 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the registered speaker database DB2 includes feature quantities of each of a plurality of registered speakers registered in advance, information on the noise type of noise included in the registered voice data from which the feature quantities were extracted, and registered speaker information. Store information in association with other information.
  • the similarity calculation model database DB3 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the similarity calculation model database DB3 stores a similarity calculation model that can calculate the similarity between two feature amounts.
  • the similarity calculation model is, for example, a learning model generated by learning using deep learning or the like.
  • a similarity calculation model is one in which dimensions in which individuality is likely to be expressed are learned in advance and retained in order to calculate the similarity between two multidimensional vectors with high precision.
  • The method of calculating the similarity using a model is one example of calculating the similarity between vectors; known techniques such as Euclidean distance or cosine similarity may also be used.
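The two known vector-similarity techniques named above can be written directly for feature vectors; this is standard linear algebra, independent of any particular similarity calculation model in this disclosure.

```python
import numpy as np

# Cosine similarity and Euclidean distance between two feature vectors,
# the known vector-similarity techniques mentioned in the description.
def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

# Identical vectors have cosine similarity 1.0 and distance 0.0.
same = cosine_similarity([0.2, 0.9, 0.4], [0.2, 0.9, 0.4])
```

A learned similarity calculation model would replace such a fixed measure with one that weights the dimensions where individuality is likely to be expressed.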
  • the noise-similarity calculation model correspondence list DB4 is a so-called storage, and is configured using a storage medium such as a flash memory, HDD, or SSD.
  • the noise-similarity calculation model correspondence list DB4 stores similarity calculation models used in similarity calculation processing for each combination of noise types.
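The correspondence list can be pictured as a table keyed by the pair of noise types. The entries below, and the fallback to the "universal model" referred to later in this description, are illustrative assumptions rather than the actual stored table.

```python
# Illustrative sketch of the noise-similarity calculation model
# correspondence list (DB4): each pair of noise types maps to a model name.
# Entries and the fallback are assumptions, not the actual table.
CORRESPONDENCE_LIST = {
    ("in-store noise", "in-store noise"): "model A",
    ("in-store noise", "outdoor wind noise"): "model B",
    ("inside a station", "in-store music"): "model C",
}

def select_model(auth_noise_type, registered_noise_type):
    # fall back to a general-purpose model when no pair-specific entry exists
    return CORRESPONDENCE_LIST.get((auth_noise_type, registered_noise_type),
                                   "universal model")
```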
  • The monitor MN is configured using a display such as a Liquid Crystal Display (LCD) or an organic Electroluminescence (EL) display, for example.
  • the monitor MN displays the authentication result screen SC output from the terminal device P1.
  • The authentication result screen SC is a screen that notifies the administrator (for example, the person viewing the monitor MN) of the speaker authentication result, and contains the authentication result information "Matched Mr. XX's voice." and the reliability information "Reliability: High".
  • the authentication result screen SC may also include other registered speaker information (for example, a face image, etc.). Further, the authentication result screen SC does not need to include reliability information.
  • the network NW connects the terminal device P1 and the noise determination device P2 to enable data communication.
  • the noise determination device P2 may not only be connected to the terminal device P1 via the network NW, but may also be a part of the terminal device P1.
  • the noise determination device P2 acquires the noise extracted from the registered voice data of the speaker US or the authentication voice data transmitted from the terminal device P1.
  • the noise determination device P2 determines the type of noise based on the acquired noise.
  • the noise determination device P2 transmits noise type information to the terminal device P1.
  • FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
  • the terminal device P1 acquires audio data from the microphone MK (St11).
  • the microphone MK may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
  • the terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
  • If the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines that the feature amount of the speaker US is to be newly registered in the registered speaker database DB2 (St12, YES), and extracts the noise included in the non-speech section from the voice data (registered voice data) (St13).
  • the noise referred to here is the noise included in the voice data (registered voice data or authentication voice data), and is the surrounding environmental sound, noise, etc. when the utterance voice of the speaker US is collected.
  • In step St12, if the control command is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines that the feature amount of the speaker US is not to be newly registered in the registered speaker database DB2 (St12, NO), and extracts the noise included in the non-speech section of the voice data (authentication voice data) (St14).
  • the terminal device P1 associates a control command requesting determination of the extracted noise type with the noise, and transmits the extracted noise to the noise determination device P2.
  • the terminal device P1 executes noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St15).
  • the terminal device P1 extracts a feature amount indicating the individuality of the speaker US from the utterance section of the registered voice data (St16).
  • the feature amounts extracted from the utterance sections of the registered voice data and the authentication voice data include a feature amount indicating the individuality of the speaker US and a noise feature amount.
  • the terminal device P1 associates the feature amount of the speaker US extracted from the registered voice data, noise type information, and speaker information and registers them in the registered speaker database DB2 (St17).
  • The terminal device P1 further extracts the noise from the non-speech section of the authentication voice data.
  • the terminal device P1 associates the extracted noise with a control command requesting a noise type determination, and transmits the extracted noise to the noise determination device P2.
  • the terminal device P1 executes noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St18).
  • The terminal device P1 extracts the feature amount of the speaker US from the utterance section of the authentication voice data of the speaker US (St19). Further, the terminal device P1 acquires the feature amount of each of the plurality of registered speakers registered in the registered speaker database DB2 (St20), and executes the speaker authentication process (St21).
  • FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
  • In the speaker authentication process, the terminal device P1 selects, for each registered speaker, a similarity calculation model suitable for calculating the similarity between the feature amount of the speaker US and the feature amount of each of the plurality of registered speakers.
  • The terminal device P1 refers to the correspondence lists LST1 and LST2 (FIGS. 5 and 6).
  • Using the referenced correspondence lists LST1 and LST2, the terminal device P1 reads, from the similarity calculation model database DB3, the similarity calculation model for calculating the similarity between the feature amount of the speaker US and the feature amount of one of the plurality of registered speakers (St211).
  • The terminal device P1 uses the similarity calculation model to calculate the similarity between the feature amount of the voice data of the speaker US and the feature amount of one of the plurality of registered speakers registered in the registered speaker database DB2 (St212).
  • The terminal device P1 repeatedly executes the process of step St212 until it has calculated the similarity between the feature amount of the voice data of the speaker US and the feature amount of every registered speaker registered in the registered speaker database DB2.
  • the terminal device P1 determines whether there is a degree of similarity greater than or equal to a threshold value among the calculated degrees of similarity (St213).
  • If the terminal device P1 determines in step St213 that there is a similarity equal to or greater than the threshold among the calculated similarities (St213, YES), it identifies the speaker US based on the registered speaker information corresponding to that similarity (St214). Note that if multiple similarities are determined to be equal to or greater than the threshold, the terminal device P1 may identify the speaker US based on the registered speaker information corresponding to the highest calculated similarity.
  • If the terminal device P1 determines in step St213 that there is no similarity equal to or greater than the threshold among the calculated similarities (St213, NO), it determines that the speaker US cannot be identified (St215).
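Steps St212 through St215 amount to a thresholded best-match search over the registered speakers. The sketch below uses cosine similarity as a stand-in for the selected similarity calculation model; the speaker names, feature vectors, and threshold are illustrative assumptions.

```python
import numpy as np

# Sketch of steps St212-St215: score every registered speaker, then identify
# the one with the highest similarity at or above the threshold, or report
# that the speaker cannot be identified. Cosine similarity stands in for the
# selected similarity calculation model.
def identify(auth_features, registered, threshold=0.8):
    def cos(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # St212 (repeated): similarity with every registered speaker
    scores = {name: cos(auth_features, feats)
              for name, feats in registered.items()}
    # St213: is any similarity at or above the threshold?
    above = {name: s for name, s in scores.items() if s >= threshold}
    if not above:
        return None                 # St215: speaker cannot be identified
    # St214: if several pass, take the one with the highest similarity
    return max(above, key=above.get)

registered = {"speaker A": [1.0, 0.0, 0.1], "speaker B": [0.0, 1.0, 0.2]}
```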
  • the terminal device P1 generates an authentication result screen SC based on the registered speaker information of the identified speaker US.
  • the terminal device P1 outputs the generated authentication result screen SC to the monitor MN for display (St216).
  • At the time of voice registration, the terminal device P1 registers the speaker information, the feature amount of the speaker US, and the noise type information of the voice data of the speaker US in association with each other.
  • Thereby, the terminal device P1 can select a similarity calculation model suitable for each noise type. Therefore, by using the selected similarity calculation model, the terminal device P1 can more accurately calculate the similarity between the feature amount of the speaker US, which includes different noise, and the feature amount of the registered speaker. This makes it possible to more effectively suppress a decrease in speaker authentication accuracy caused by the noise contained in the authentication voice data.
  • The terminal device P1 calculates and displays the reliability, which indicates the likelihood of the speaker identified by the speaker authentication process, based on the calculated similarity and the similarity calculation model used to calculate it. Thereby, the terminal device P1 can present the certainty of the speaker authentication result to the administrator viewing the monitor MN. For example, by presenting the reliability, the terminal device P1 can inform the administrator that there was no similarity calculation model suitable for calculating the similarity and that speaker authentication was performed using the similarity calculation model "universal model", which will be described later.
  • FIG. 5 is a diagram illustrating an example of the correspondence list LST1 when the noise type at the time of voice registration and the noise type at the time of voice authentication are the same.
  • FIG. 6 is a diagram illustrating an example of the correspondence list LST2 when the noise type at the time of voice registration and the noise type at the time of voice authentication are different.
  • An example will be described in which the correspondence lists LST1 and LST2 are referred to based on the noise type information associated with the feature amount of any one registered speaker registered in the registered speaker database DB2 and the noise type of the noise included in the authentication voice data of the speaker US, who is the target of speaker authentication.
  • each of the voice data at the time of voice registration and voice authentication includes noise that corresponds to the same noise type "in-store noise.”
  • the terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US transmitted from the microphone MK to the noise determination device P2.
  • the terminal device P1 refers to the predefined correspondence list LST1 based on the noise type determination result transmitted from the noise determination device P2 and the noise type information of the registered speakers registered in the registered speaker database DB2, and selects one similarity calculation model from among the plurality of similarity calculation models.
  • the correspondence list LST1 is data that associates the noise type determination result "Noise determination result 1" of the noise included in the authentication voice data of the speaker US, the noise type determination result "Noise determination result 2" of the registered speaker registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two noise types.
  • the noise type determination result "Noise determination result 1" indicates, for example, the noise determination result in the voice at the time of authentication.
  • the noise type determination result "Noise determination result 2" indicates, for example, the noise determination result of registered speech.
  • the number of noise type information registered in the registered speaker database DB2 may be one or more. In other words, if there are multiple noise type candidates, multiple pieces of information may be retained. Furthermore, information on determination probabilities indicating respective reliability corresponding to determination results of a plurality of noise types is not essential and may be omitted.
  • the similarity calculation model "selected model" includes, for example, each of the similarity calculation models "Model A", "Model B", "Model C", and "Model Z" selected corresponding to the noise type determination result "Noise determination result 1" and the noise type determination result "Noise determination result 2".
  • For example, the similarity calculation model "Model A" is the similarity calculation model determined to be optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of the registered speaker when the noise type information of the noise included in the feature amount of the speaker US is "Noise A" and the noise type information of the noise included in the feature amount of the registered speaker is also "Noise A".
  • "Model Z" is a general-purpose similarity calculation model, selected when no similarity calculation model suited to the combination of the two noise types is available.
  • the similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST1. For example, in the example shown in FIG. 5, the similarity calculation model selection unit 115 selects the similarity calculation model "Model A."
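As a rough illustration, the selection logic that the correspondence list describes can be sketched as a lookup table keyed by the pair of noise-type determination results, falling back to the general-purpose model "Model Z". The specific noise-type and model names in the table below are illustrative assumptions, not values taken from the patent figures.

```python
# Hypothetical sketch of correspondence-list-based model selection.
# Keys pair the authentication-side noise type ("Noise determination result 1")
# with the registered-side noise type ("Noise determination result 2").
CORRESPONDENCE_LIST = {
    ("Noise A", "Noise A"): "Model A",
    ("Noise B", "Noise B"): "Model B",
    ("Noise C", "Noise C"): "Model C",
    ("Noise A", "Noise D"): "Model G",
}

GENERAL_PURPOSE_MODEL = "Model Z"  # used when no suitable model exists

def select_similarity_model(auth_noise: str, registered_noise: str) -> str:
    """Select one similarity calculation model for a pair of noise-type results."""
    return CORRESPONDENCE_LIST.get((auth_noise, registered_noise), GENERAL_PURPOSE_MODEL)
```

The fallback mirrors the behaviour described for the general-purpose model: any noise-type pair that has no dedicated entry resolves to "Model Z".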
  • In the example shown in FIG. 6, the authentication voice data of the speaker US at the time of voice authentication includes noise corresponding to the noise type "in-store noise." On the other hand, the registered voice data of the registered speaker at the time of voice registration includes noise corresponding to "outdoor noise," a noise type different from that of the authentication voice data of the speaker US.
  • the terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US transmitted from the microphone MK to the noise determination device P2.
  • the terminal device P1 refers to the predefined correspondence list LST2 based on the noise type determination result transmitted from the noise determination device P2 and the noise type information of the registered speakers registered in the registered speaker database DB2, and selects one similarity calculation model from among the plurality of similarity calculation models.
  • the correspondence list LST2 is data that associates the noise type determination result "Noise determination result 3" of the noise included in the authentication voice data of the speaker US, the noise type determination result "Noise determination result 4" of the registered speaker registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two noise types.
  • the noise type determination result "Noise determination result 3" indicates the determination result of the noise type of the noise extracted from the authentication voice data of the speaker US.
  • the noise type determination result "Noise determination result 4" indicates the determination result of the noise type information of the registered speakers registered in the registered speaker database DB2.
  • the number of noise type information registered in the registered speaker database DB2 may be one or more. In other words, if there are multiple noise type candidates, multiple pieces of information may be retained. Further, information on the determination probability indicating the reliability corresponding to the determination result of each noise type is not essential and may be omitted.
  • the similarity calculation model "selected model" includes, for example, each of the similarity calculation models "Model G", "Model H", "Model I", and "Model Z" selected corresponding to the noise type determination result "Noise determination result 3" and the noise type determination result "Noise determination result 4".
  • For example, the similarity calculation model "Model G" is the similarity calculation model determined to be optimal for calculating the similarity between the feature amount of the speaker US and the feature amount of the registered speaker when the noise type information of the noise included in the authentication voice data is "Noise A" and the noise type information of the registered speaker is "Noise D".
  • If the similarity calculation model selection unit 115 determines that there is no similarity calculation model suitable for the similarity calculation process based on the combination of the noise type information of the noise included in the feature amount of the speaker US and the noise type information of the registered speaker, it selects the general-purpose similarity calculation model "Model Z".
  • the similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST2. For example, in the example shown in FIG. 6, the similarity calculation model selection unit 115 selects the similarity calculation model "model E."
  • In this way, the terminal device P1 can select the optimal similarity calculation model for the similarity calculation process of the two feature amounts (that is, the feature amount of the speaker US and the feature amount of the registered speaker). As a result, even if the noise type included in the feature amount at the time of voice registration differs from the noise type included in the feature amount at the time of voice authentication, the terminal device P1 can select the optimal similarity calculation model for calculating the similarity between the two feature amounts. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by the noise included in the audio data. Note that, if there are multiple candidate conditions at the time of noise determination, the terminal device P1 may, for example, calculate the similarity with the similarity calculation model corresponding to each condition and use the average of these values as the final similarity.
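The averaging behaviour for multiple candidate conditions can be sketched as follows. The per-condition similarity values are assumed to have already been computed, each with the similarity calculation model corresponding to one candidate condition; this function only performs the averaging step.

```python
# Sketch of handling multiple candidate noise conditions: compute the
# similarity under each candidate's model (done elsewhere) and use the
# mean of those values as the final similarity.
def average_similarity(similarities_per_condition: list) -> float:
    """Average the similarities obtained with each candidate condition's model."""
    if not similarities_per_condition:
        raise ValueError("at least one candidate condition is required")
    return sum(similarities_per_condition) / len(similarities_per_condition)
```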
  • FIG. 7 is a diagram illustrating an example of calculating reliability.
  • Although FIG. 7 shows an example in which the reliability is expressed with the two levels "high" and "low", the reliability may instead be calculated as a numerical value from 0 (zero) to 100, for example.
  • Here, an example will be described in which the reliability is calculated in the case, explained in FIG. 5, where the registered voice data at the time of voice registration and the authentication voice data at the time of voice authentication are of the same noise type.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • Specifically, the reliability calculation unit 116 determines whether the similarity calculation model used to calculate the similarity is a similarity calculation model based on a known noise type, and determines the reliability based on the determination probability of the noise type.
  • In "Case 1", the noise type determination result indicates that the determination probability for the noise type "outdoor wind noise" is "90%", the determination probability for the noise type "in-store music" is "6%", and the determination probability for the noise type "unknown noise" is "4%".
  • The noise type determination results shown in "Case 1" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the similarity calculation model "outdoor wind noise model” based on the determination result of the noise type of the noise.
  • the authentication unit 117 uses the similarity calculation model "outdoor wind noise model" to calculate the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • Since the similarity calculation model "outdoor wind noise model" used to calculate the similarity is a similarity calculation model based on a known noise type and the noise determination probability is "90%", the reliability calculation unit 116 calculates the reliability as "high".
  • If the reliability calculation unit 116 determines that the noise determination probability is equal to or higher than a predetermined probability (for example, 85%, 90%, etc.), it calculates the reliability as "high"; if it determines that the noise determination probability is lower than the predetermined probability, it calculates the reliability as "low".
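The reliability rule just described can be sketched as below. Treating the "predetermined probability" as a fixed 85% threshold and the output as the two levels "high"/"low" follows the examples in FIG. 7, but hard-coding these values is an assumption for illustration.

```python
RELIABILITY_THRESHOLD = 0.85  # assumed "predetermined probability" (e.g. 85%)

def calculate_reliability(model_based_on_known_noise: bool,
                          determination_probability: float) -> str:
    """Reliability is 'high' only when the selected similarity calculation model
    is based on a known noise type AND the noise determination probability is
    at or above the predetermined probability; otherwise it is 'low'."""
    if model_based_on_known_noise and determination_probability >= RELIABILITY_THRESHOLD:
        return "high"
    return "low"
```

This reproduces the three cases of FIG. 7: a known-noise model at 90% yields "high" (Case 1), a known-noise model at 48% yields "low" (Case 2), and the general-purpose model for unknown noise at 55% yields "low" (Case 3).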
  • In "Case 2", the noise type determination result indicates that the determination probability for the noise type "outdoor wind noise" is "48%", the determination probability for the noise type "unknown noise" is "39%", and the determination probability for the noise type "in-store music" is "13%".
  • The noise type determination results shown in "Case 2" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the similarity calculation model "outdoor wind noise model” based on the determination result of the noise type of the noise.
  • the authentication unit 117 uses the similarity calculation model "outdoor wind noise model" to calculate the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • Since the similarity calculation model "outdoor wind noise model" used to calculate the similarity is a similarity calculation model based on a known noise type but the noise determination probability is "48%", the reliability calculation unit 116 calculates the reliability as "low".
  • In "Case 3", the noise type determination result indicates that the determination probability for the noise type "unknown noise" is "55%", the determination probability for the noise type "outdoor wind noise" is "28%", and the determination probability for the noise type "in-store music" is "17%".
  • The noise type determination results shown in "Case 3" indicate that the noise types "outdoor wind noise" and "in-store music" are known noises, and the noise type "unknown noise" is an unknown noise.
  • the similarity calculation model selection unit 115 selects the general-purpose similarity calculation model based on the determination result of the noise type of the noise.
  • the authentication unit 117 uses the general-purpose similarity calculation model to calculate the similarity between the feature amount of the speaker US and the feature amount of any registered speaker registered in the registered speaker database DB2.
  • the reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117.
  • Since the similarity calculation model used to calculate the similarity is the general-purpose model corresponding to an unknown noise type and the noise determination probability is "55%", the reliability calculation unit 116 calculates the reliability as "low".
  • As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit) that acquires audio data; the noise extraction unit 111 and the feature amount extraction unit 112 (examples of a detection unit and an extraction unit) that detect, from the audio data, the utterance section in which the speaker is speaking and the non-speech section in which the speaker is not speaking, and that extract the feature amount of the utterance section (an example of an utterance feature amount) and the noise included in the non-speech section of the audio data; the similarity calculation model selection unit 115 (an example of a selection unit) that selects any one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and the noise associated with the feature amounts of a plurality of registered speakers registered in advance (an example of registered feature amounts); and the authentication unit 117 that authenticates the speaker by comparing the feature amount of the speaker US with the feature amount of the registered speaker using the selected similarity calculation model.
  • Thereby, the terminal device P1 can select a similarity calculation model suitable for speaker authentication based on the combination of the noise types included in the feature amount of the speaker US and the feature amount of the registered speaker. In other words, even if the noise type included in the feature amount at the time of voice registration differs from that included in the feature amount at the time of voice authentication, the terminal device P1 can select a similarity calculation model more suitable for speaker authentication. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by the noise included in the audio data.
  • the communication unit 10 in the terminal device P1 according to the embodiment further acquires noise type information of the extracted noise.
  • the similarity calculation model selection unit 115 selects a similarity calculation model based on the acquired speaker noise type information and the registered speaker noise type information.
  • Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model more suitable for the similarity calculation process of the two feature amounts (that is, the feature amount of the speaker US and the feature amount of the registered speaker).
  • the terminal device P1 according to the embodiment further includes an authentication unit 117 (an example of a calculation unit) that calculates the degree of similarity between the feature amount of the speaker US and the feature amount of the registered speaker.
  • the authentication unit 117 authenticates the speaker US based on the plurality of calculated similarities.
  • the terminal device P1 according to the embodiment can perform speaker authentication using the degree of similarity between the feature amounts of a plurality of registered speakers registered in advance and the feature amounts of the speaker US.
  • the terminal device P1 according to the embodiment further includes a reliability calculation unit 116 (an example of a reliability calculation unit) that calculates the reliability of the similarity.
  • the communication unit 10 further acquires noise type information of the extracted noise and a score indicating the noise type of the noise.
  • the reliability calculation unit 116 calculates the reliability of the similarity based on the score. Thereby, the terminal device P1 according to the embodiment can calculate the reliability of the speaker authentication result by calculating the reliability corresponding to the similarity.
  • the authentication unit 117 in the terminal device P1 according to the embodiment identifies the registered speaker whose degree of similarity is equal to or greater than the threshold value as the speaker US.
  • the terminal device P1 according to the embodiment can perform speaker authentication using the degree of similarity between the feature amounts of a plurality of registered speakers registered in advance and the feature amounts of the speaker US.
  • the authentication unit 117 in the terminal device P1 according to the embodiment generates and outputs an authentication result screen SC that includes information regarding registered speakers whose degree of similarity is equal to or greater than the threshold value.
  • the terminal device P1 according to the embodiment can present the speaker authentication result to the speaker US or the administrator.
  • If the authentication unit 117 in the terminal device P1 according to the embodiment determines that none of the plurality of calculated similarities is equal to or greater than the threshold value, it determines that the speaker US cannot be identified. Thereby, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy and more effectively suppress erroneous authentication of the speaker US.
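The threshold rule in the preceding paragraphs can be sketched as follows: the registered speaker with the highest similarity is identified as the speaker US only if that similarity reaches the threshold, and otherwise no speaker is identified. The speaker names and the threshold value of 0.7 are illustrative assumptions, not values from the disclosure.

```python
# Sketch of threshold-based speaker identification over precomputed
# similarities between the speaker US and each registered speaker.
from typing import Optional

def identify_speaker(similarities: dict,
                     threshold: float = 0.7) -> Optional[str]:
    """Return the best-matching registered speaker, or None when unidentifiable."""
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    # Identify only when the best similarity is at or above the threshold.
    return best if similarities[best] >= threshold else None
```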
  • the authentication unit 117 in the terminal device P1 according to the embodiment generates and outputs an authentication result screen SC including information regarding the registered speaker whose similarity is equal to or greater than the threshold value and information on the calculated reliability.
  • Thereby, the terminal device P1 according to the embodiment displays the speaker authentication result together with its reliability, and can thus prompt the administrator to confirm whether the speaker authentication result is reliable.
  • the present disclosure is useful as a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.

Abstract

This voice authentication device comprises: a detection unit for detecting, from speech data, a speech segment in which a speaker is speaking and a non-speech segment in which the speaker is not speaking; an extraction unit for extracting a speech feature amount of the speech segment and noise contained in the non-speech segment; a selection unit for selecting, on the basis of the extracted noise and noise associated with a plurality of pre-registered registered feature amounts, one similarity calculation model from among a plurality of similarity calculation models; and an authentication unit for authenticating the speaker by matching the speech feature amounts of the speaker with the registered feature amounts of the registered speaker using the selected similarity calculation model.

Description

Voice authentication device and voice authentication method
The present disclosure relates to a voice authentication device and a voice authentication method.
Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice. The voice recognition device stores a plurality of motion noise models, each created corresponding to one of a plurality of motions, in association with that motion, detects input speech including the subject's voice, identifies the subject's motion, and reads out the motion noise model corresponding to the motion identified by the motion identifying means. The voice recognition device then reads out an environmental noise model corresponding to the subject's current position, synthesizes the environmental noise model with the read motion noise model, and uses the synthesized noise superimposition model to recognize the subject's voice included in the detected input speech.
Japanese Patent Application Publication No. 2008-250059
However, in Patent Document 1, it is necessary to collect in advance the motion noise generated by each of a plurality of motions and the environmental noise at each of a plurality of positions where voice recognition can be performed, which is very time-consuming. Furthermore, in voiceprint authentication, the feature amount indicating the individuality of the authentication target (person) extracted from a voice signal changes depending on the noise contained in the voice signal. Therefore, when voiceprint authentication is performed using the voice recognition device described above, if the noise contained in the pre-registered voice signal differs from the noise contained in the voice signal collected at the time of voiceprint authentication, the feature amounts extracted from the respective voice signals do not indicate the individuality of the same person, and the voiceprint authentication accuracy may decrease.
The present disclosure has been devised in view of the above-described conventional circumstances, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
The present disclosure provides a voice authentication device including: an acquisition unit that acquires audio data; a detection unit that detects, from the audio data, an utterance section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; an extraction unit that extracts an utterance feature amount of the utterance section and noise included in the non-speech section of the audio data; a selection unit that selects any one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and noise associated with registered feature amounts of a plurality of registered speakers registered in advance; and an authentication unit that authenticates the speaker by comparing the utterance feature amount of the speaker with the registered feature amount of the registered speaker using the selected similarity calculation model.
The present disclosure also provides a voice authentication method performed by a terminal device, including: acquiring audio data; detecting, from the audio data, an utterance section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; extracting an utterance feature amount of the utterance section and noise included in the non-speech section of the audio data; selecting any one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and noise associated with registered feature amounts of a plurality of registered speakers registered in advance; and authenticating the speaker by comparing the utterance feature amount of the speaker with the registered feature amount of the registered speaker using the selected similarity calculation model.
According to the present disclosure, it is possible to suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
Block diagram showing an example of the internal configuration of a voice authentication system according to an embodiment
A diagram illustrating each process performed by a processor of a terminal device in the embodiment
Flowchart showing an example of the operation procedure of the terminal device in the embodiment
Flowchart showing an example of the speaker authentication procedure of the terminal device in the embodiment
A diagram illustrating an example of the correspondence list when the noise type at the time of voice registration and the noise type at the time of voice authentication are the same
A diagram illustrating an example of the correspondence list when the noise type at the time of voice registration and the noise type at the time of voice authentication are different
A diagram illustrating an example of calculating the reliability
Hereinafter, embodiments specifically disclosing a voice authentication device and a voice authentication method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, more detailed description than necessary may be omitted. For example, detailed descriptions of well-known matters and redundant descriptions of substantially identical configurations may be omitted. This is to avoid making the following description unnecessarily redundant and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided to enable those skilled in the art to fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.
(Embodiment)
First, a voice authentication system 100 according to an embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing an example of the internal configuration of the voice authentication system 100 according to the embodiment. FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
The voice authentication system 100 includes a terminal device P1 as an example of a voice authentication device, a monitor MN, a noise determination device P2, and a network NW. Note that the voice authentication system 100 may be configured to include the microphone MK or the monitor MN.
The microphone MK picks up the voice uttered by the speaker US for registering the voice in the terminal device P1 in advance. The microphone MK converts the collected utterance of the speaker US into a voice signal or voice data to be registered in the terminal device P1, and transmits the converted voice signal or voice data to the processor 11 via the communication unit 10.
The microphone MK also picks up the utterance of the speaker US used for speaker authentication, converts it into a voice signal or voice data, and transmits the converted voice signal or voice data to the processor 11 via the communication unit 10.
In the following description, to make the explanation easier to understand, voice data for voice registration or voice data already registered in the terminal device P1 is referred to as "registered voice data", and voice data for voice authentication is referred to as "authentication voice data" to distinguish between them.
Note that the microphone MK may be a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, or a tablet terminal. The microphone MK may also transmit the voice signal or voice data to the terminal device P1 by wireless communication via a network (not shown).
 端末装置P1は、例えば、PC、ノートPC、スマートフォン、タブレット端末等により実現され、話者USの登録音声データを用いた音声登録処理と、認証音声データを用いた話者認証処置とを実行する。通信部10と、プロセッサ11と、メモリ12と、特徴量抽出モデルデータベースDB1と、登録話者データベースDB2と、類似度計算モデルデータベースDB3と、ノイズ-類似度計算モデル対応リストDB4と、を含む。 The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, a tablet terminal, etc., and executes a voice registration process using registered voice data of the speaker US and a speaker authentication process using authentication voice data. . It includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a noise-similarity calculation model correspondence list DB4.
The communication unit 10, which is an example of an acquisition unit, is connected to the microphone MK, the monitor MN, and the noise determination device P2 by wired or wireless communication so that data can be transmitted and received. Wireless communication here means, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless LAN (Local Area Network) such as Wi-Fi (registered trademark).
The communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as Universal Serial Bus (USB). The communication unit 10 may also transmit and receive data to and from the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
The processor 11 is configured using, for example, a Central Processing Unit (CPU) or a Field Programmable Gate Array (FPGA), and performs various kinds of processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12 and executes the program to realize the functions of a noise extraction unit 111, a feature extraction unit 112, a noise determination unit 113, a speaker registration unit 114, a similarity calculation model selection unit 115, a reliability calculation unit 116, an authentication unit 117, and so on.
When registering the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature extraction unit 112, the noise determination unit 113, and the speaker registration unit 114 to newly register (store) the features of the speaker US in the registered speaker database DB2. The features here are features indicating the individuality of the speaker US, extracted from the registered voice data.
When authenticating the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 111, the feature extraction unit 112, the noise determination unit 113, the similarity calculation model selection unit 115, the reliability calculation unit 116, and the authentication unit 117 to execute speaker authentication processing.
The noise extraction unit 111, which is an example of a detection unit and an extraction unit, acquires the registered voice data or the authentication voice data of the speaker US transmitted from the microphone MK. The noise extraction unit 111 detects, in the registered voice data or the authentication voice data, the segments in which the speaker US is speaking (speech segments) and the segments in which the speaker US is not speaking (hereinafter "non-speech segments"). The noise extraction unit 111 extracts the noise contained in the detected non-speech segments and outputs the extracted noise data (hereinafter "noise data") to the noise determination unit 113.
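The patent does not specify how the speech/non-speech split is performed. A minimal sketch of the idea, using a crude frame-energy threshold as a stand-in for a real voice activity detector (function and parameter names are illustrative, not from the source):

```python
def extract_noise(samples, frame_len=400, energy_ratio=0.1):
    """Return the samples of the non-speech frames of a mono signal.

    A frame counts as non-speech when its mean energy falls below
    `energy_ratio` times the peak frame energy. This is a crude
    stand-in for a real voice-activity detector; the patent leaves
    the detection method unspecified.
    """
    # Split the signal into fixed-length, non-overlapping frames.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in f) / frame_len for f in frames]
    threshold = energy_ratio * max(energies)
    # Concatenate the low-energy (non-speech) frames as the noise data.
    noise = []
    for frame, energy in zip(frames, energies):
        if energy < threshold:
            noise.extend(frame)
    return noise
```

The returned noise data would then be handed to the noise type determination step.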
The feature extraction unit 112, which is an example of a detection unit, acquires the registered voice data or the authentication voice data of the speaker US transmitted from the microphone MK. The feature extraction unit 112 detects the speech segments in the registered voice data or the authentication voice data and, using a feature extraction model, extracts features indicating the individuality of the speaker US from the detected speech segments.
When the registered voice data of the speaker US transmitted from the microphone MK is associated with a control command requesting registration of the features of the speaker US and with the registered speaker information of the speaker US, the feature extraction unit 112 proceeds to the voice registration processing of the speaker US and outputs the extracted features of the speaker US to the speaker registration unit 114.
When the authentication voice data of the speaker US transmitted from the microphone MK is associated with a control command requesting speaker authentication, the feature extraction unit 112 executes speaker authentication processing and outputs the extracted features of the speaker US to the authentication unit 117.
The noise determination unit 113 acquires the noise data output from the noise extraction unit 111. The noise determination unit 113 transmits the noise data to the noise determination device P2 via the network NW and thereby has the type of the noise contained in the registered voice data or the authentication voice data of the speaker US determined.
The noise here is noise picked up due to the environment (background) at the time of sound collection, such as surrounding conversation, music, vehicle traffic, or wind. The noise type indicates the environment (place) or position in which the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or a station concourse. The noise type may further include time-of-day information such as early morning, daytime, or nighttime.
In the voice registration processing of the speaker US, the noise determination unit 113 outputs noise type information corresponding to the noise type determination result transmitted from the noise determination device P2 to the speaker registration unit 114. In the speaker authentication processing, the noise determination unit 113 outputs the noise type information corresponding to that determination result to the similarity calculation model selection unit 115.
The speaker registration unit 114 acquires the features of the speaker US output from the feature extraction unit 112 and the noise type information of the registered voice data of the speaker US output from the noise determination unit 113. The speaker registration unit 114 associates the features of the speaker US, the noise type information, and the speaker information of the speaker US with one another and registers them in the registered speaker database DB2.
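One possible shape of the record stored in the registered speaker database DB2, sketched as a Python dataclass. The field names are hypothetical; the source only requires that the feature vector, the noise type, and the speaker information be stored in association with one another:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RegisteredSpeaker:
    """One illustrative entry in the registered speaker database DB2.

    Field names are assumptions for this sketch; the patent specifies
    only that features, noise type, and speaker information are
    associated with one another.
    """
    speaker_id: str        # identification information (speaker ID)
    name: str              # name of the registered speaker
    noise_type: str        # e.g. "in-store noise", "outdoor wind noise"
    features: List[float]  # feature vector indicating individuality

def register(db: List[RegisteredSpeaker], entry: RegisteredSpeaker) -> None:
    """Append a newly extracted speaker entry to DB2 (modeled as a list)."""
    db.append(entry)
```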
The speaker information may be extracted from the registered voice data by voice recognition, or may be acquired from a terminal owned by the speaker US (for example, a PC, notebook PC, smartphone, or tablet terminal). The speaker information here is, for example, identification information capable of identifying the speaker US, the name of the speaker US, a speaker Identification (ID), or the like.
The similarity calculation model selection unit 115, which is an example of a selection unit, calculates (evaluates) the similarity between the noise type associated with the features of each of the plurality of registered speakers registered in the registered speaker database DB2 and the noise type of the noise extracted from the voice data of the speaker US. Based on the calculated similarity, the similarity calculation model selection unit 115 selects a similarity calculation model (an example of a similarity calculation model) stored in the similarity calculation model database DB3.
The similarity calculation model selected here is a model that is more suitable, or optimal, for calculating the similarity between the features of the speaker US and the features of one of the registered speakers.
The similarity calculation model selection unit 115 refers to correspondence lists LST1 and LST2 (see FIGS. 5 and 6), in which the noise type information corresponding to the voice data of the speaker US, the noise type information corresponding to the plurality of registered speakers, and information on the similarity calculation model selected based on the similarity of these noise types are associated with one another; selects one selection model (similarity model) from among the plurality of selection models; and outputs it to each of the reliability calculation unit 116 and the authentication unit 117.
The reliability calculation unit 116, which is an example of a reliability calculation unit, calculates (evaluates) a reliability (score) indicating the certainty of the result of identifying the speaker US based on the similarity calculated by the authentication unit 117. The reliability calculation unit 116 outputs the calculated reliability information to the authentication unit 117 based on the similarity calculated by the authentication unit 117, the noise type information used in the similarity calculation processing, the similarity calculation model, and so on. The reliability calculation processing by the reliability calculation unit 116 is not essential and may be omitted.
The authentication unit 117, which is an example of a calculation unit, acquires the features of the speaker US output from the feature extraction unit 112 and the features of each of the plurality of registered speakers registered in the registered speaker database DB2. The authentication unit 117 also acquires the selection model from the correspondence lists LST1 and LST2 output from the similarity calculation model selection unit 115.
Using the similarity calculation model based on the correspondence lists LST1 and LST2, the authentication unit 117 calculates the similarity between the features of each of the plurality of registered speakers and the features of the speaker US. The authentication unit 117 identifies the speaker US based on the calculated similarities, generates an authentication result screen SC based on the speaker information of the identified speaker US, and transmits it to the monitor MN.
The memory 12 has, for example, a Random Access Memory (hereinafter "RAM") used as a work memory when each process of the processor 11 is executed, and a Read Only Memory (hereinafter "ROM") that stores programs and data defining the operation of the processor 11. Data or information generated or acquired by the processor 11 is temporarily stored in the RAM. A program defining the operation of the processor 11 is written in the ROM.
The feature extraction model database DB1 is a so-called storage, configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter "HDD"), or a Solid State Drive (hereinafter "SSD"). The feature extraction model database DB1 stores a feature extraction model capable of detecting the speech segments of the speaker US in the registered voice data or the authentication voice data and extracting the features of the speaker US. The feature extraction model is, for example, a trained model generated by learning using deep learning or the like.
The registered speaker database DB2 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The registered speaker database DB2 stores, in association with one another, the features of each of a plurality of registered speakers registered in advance, information on the noise type of the noise contained in the registered voice data from which those features were extracted, and registered speaker information.
The similarity calculation model database DB3 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The similarity calculation model database DB3 stores similarity calculation models capable of calculating the similarity between two features. The similarity calculation model is, for example, a trained model generated by learning using deep learning or the like.
For example, a similarity calculation model learns in advance, and retains, the dimensions in which individuality tends to appear, in order to calculate the similarity between two multidimensional vectors with high precision. Calculating similarity with a model is only one example of a method for computing the similarity between vectors; existing techniques such as Euclidean distance or cosine similarity may also be used.
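As a concrete instance of the latter alternative, the cosine similarity between two feature vectors can be computed directly, with no trained model involved:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors.

    Returns a value in [-1, 1]; 1 means the vectors point in the
    same direction (maximally similar).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Unlike a learned model, this measure weights every dimension of the feature vector equally, which is why the patent favors a model that has pre-learned the dimensions in which individuality appears.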
The noise/similarity-calculation-model correspondence list DB4 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The noise/similarity-calculation-model correspondence list DB4 stores, for each combination of noise types, the similarity calculation model to be used in the similarity calculation processing.
The monitor MN is configured using a display such as a Liquid Crystal Display (LCD) or an organic Electroluminescence (EL) display. The monitor MN displays the authentication result screen SC output from the terminal device P1.
The authentication result screen SC is a screen that notifies an administrator (for example, a person viewing the monitor MN) of the speaker authentication result, and includes authentication result information ("Matched the voice of XX XX.") and reliability information ("Reliability: High"). The authentication result screen SC may include other registered speaker information (for example, a face image), and need not include the reliability information.
The network NW connects the terminal device P1 and the noise determination device P2 so that data communication between them is possible. The noise determination device P2 is not limited to being connected to the terminal device P1 via the network NW; it may instead be a part of the terminal device P1.
The noise determination device P2 acquires the noise extracted from the registered voice data or the authentication voice data of the speaker US transmitted from the terminal device P1, determines the noise type based on the acquired noise, and transmits the noise type information to the terminal device P1.
Next, the operating procedure of the terminal device P1 will be described with reference to FIG. 3. FIG. 3 is a flowchart showing an example of the operating procedure of the terminal device P1 in the embodiment.
The terminal device P1 acquires voice data from the microphone MK (St11). The microphone MK may be, for example, a microphone built into a PC, a notebook PC, a smartphone, or a tablet terminal.
The terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
If, in step St12, the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines that the features of the speaker US are to be newly registered in the registered speaker database DB2 (St12, YES) and extracts the noise contained in the non-speech segments of the voice data (registered voice data) (St13). The noise here is noise contained in the voice data (registered voice data or authentication voice data), such as ambient environmental sound or background noise present when the speech of the speaker US was picked up.
If, on the other hand, the control command in step St12 is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines that the features of the speaker US are not to be newly registered in the registered speaker database DB2 (St12, NO) and extracts the noise contained in the non-speech segments of the voice data (authentication voice data) (St14).
The terminal device P1 associates the extracted noise with a control command requesting determination of the noise type and transmits the extracted noise to the noise determination device P2. The terminal device P1 executes the noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St15).
The terminal device P1 extracts features indicating the individuality of the speaker US from the speech segments of the registered voice data (St16). The features extracted from the speech segments of the registered voice data and the authentication voice data include features indicating the individuality of the speaker US and features of the noise.
The terminal device P1 associates the features of the speaker US extracted from the registered voice data, the noise type information, and the speaker information with one another and registers them in the registered speaker database DB2 (St17).
The terminal device P1 likewise extracts the noise contained in the non-speech segments of the authentication voice data, associates the extracted noise with a control command requesting determination of the noise type, and transmits the noise to the noise determination device P2. The terminal device P1 executes the noise type determination processing by acquiring the noise type information (that is, the determination result) transmitted from the noise determination device P2 (St18).
The terminal device P1 extracts the features of the speaker US from the speech segments of the authentication voice data of the speaker US (St19). The terminal device P1 also acquires the features of each of the plurality of registered speakers registered in the registered speaker database DB2 (St20) and executes speaker authentication processing (St21).
Next, the speaker authentication procedure of step St21 in FIG. 3 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
Based on the noise type of the voice data of the speaker US and the noise type of each of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects, for each registered speaker, a similarity calculation model suitable for calculating the similarity between the features of the speaker US and the features of that registered speaker. The terminal device P1 refers to the correspondence lists LST1 and LST2 (see FIGS. 5 and 6), in which the noise type of the voice data of the speaker US, the noise types of the plurality of registered speakers, and the selected similarity calculation models are associated with one another.
Based on the referenced correspondence lists LST1 and LST2, the terminal device P1 reads, from the similarity calculation model database DB3, the similarity calculation model to be used for determining the similarity between the features of the speaker US and the features of one of the plurality of registered speakers (St211).
Using the similarity calculation model, the terminal device P1 calculates the similarity between the features of the voice data of the speaker US and the features of one of the plurality of registered speakers registered in the registered speaker database DB2 (St212). The terminal device P1 repeats the processing of step St212 until the similarity between the features of the voice data of the speaker US and the features of every registered speaker registered in the registered speaker database DB2 has been calculated.
The terminal device P1 determines whether any of the calculated similarities is equal to or greater than a threshold (St213).
If the terminal device P1 determines in step St213 that one of the calculated similarities is equal to or greater than the threshold (St213, YES), it identifies the speaker US based on the registered speaker information corresponding to the similarity determined to be equal to or greater than the threshold (St214). When a plurality of similarities are determined to be equal to or greater than the threshold, the terminal device P1 may identify the speaker US based on the registered speaker information corresponding to the highest calculated similarity.
If the terminal device P1 determines in step St213 that none of the calculated similarities is equal to or greater than the threshold (St213, NO), it determines that the speaker US cannot be identified (St215).
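The loop of steps St212 through St215 can be sketched as follows. Cosine similarity stands in for whichever similarity calculation model the correspondence list selects, and the function name, database shape, and threshold value are all illustrative assumptions:

```python
import math

def identify_speaker(query_features, registered, threshold=0.7):
    """Identify a speaker following steps St212-St215.

    `registered` maps speaker name -> feature vector. Returns the
    name of the best match whose similarity is at or above
    `threshold`, or None when no registered speaker qualifies.
    Cosine similarity is a stand-in for the model-based similarity
    calculation described in the patent.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # St212: compute the similarity against every registered speaker.
    scores = {name: cosine(query_features, feats)
              for name, feats in registered.items()}
    # St213/St214: take the highest similarity, if it clears the threshold.
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best
    return None  # St215: the speaker cannot be identified
```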
The terminal device P1 generates the authentication result screen SC based on the registered speaker information of the identified speaker US, outputs the generated authentication result screen SC to the monitor MN, and causes it to be displayed (St216).
As described above, the terminal device P1 registers, at the time of voice registration, the speaker information, the features of the speaker US, and the information on the noise type contained in the speech segments of the speaker US in association with one another. As a result, even when the noise type contained in the features at the time of voice registration differs from the noise type contained in the features at the time of voice authentication, the terminal device P1 can select a similarity calculation model suited to each pair of noise types. By using the selected similarity calculation model, the terminal device P1 can therefore determine with higher precision the similarity between the features of the speaker US, which contain different noise, and the features of a registered speaker, and can more effectively suppress the degradation of speaker authentication accuracy caused by the noise contained in the authentication voice data.
The terminal device P1 also calculates and displays, based on the similarity calculation model used for the similarity calculation, a reliability indicating the certainty of the speaker identified by the speaker authentication processing as indicated by the calculated similarity. The terminal device P1 can thereby present the certainty of the speaker authentication result to the administrator viewing the monitor MN. Accordingly, by presenting the reliability, the terminal device P1 can inform the administrator that no similarity calculation model suitable for the similarity calculation was available and that speaker authentication was performed using the "general-purpose model" similarity calculation model described later.
 図5および図6のそれぞれを参照して、対応リストLST1,LST2の一例について説明する。図5は、音声登録時のノイズ種別と音声認証時のノイズ種別とが同一である場合の対応リストLST1の一例を説明する図である。図6は、音声登録時のノイズ種別と音声認証時のノイズ種別とが異なる場合の対応リストLST2の一例を説明する図である。 An example of the correspondence lists LST1 and LST2 will be described with reference to FIGS. 5 and 6, respectively. FIG. 5 is a diagram illustrating an example of the correspondence list LST1 when the noise type at the time of voice registration and the noise type at the time of voice authentication are the same. FIG. 6 is a diagram illustrating an example of the correspondence list LST2 when the noise type at the time of voice registration and the noise type at the time of voice authentication are different.
 なお、図5および図6では、説明を分かりやすくするために登録話者データベースDB2に登録されたいずれか1人の登録話者の特徴量に対応付けられたノイズ種別の情報と、話者認証対象である話者USの認証音声データに含まれるノイズのノイズ種別とに基づいて、対応リストLST1,LST2を参照する例について説明する。 In addition, in FIGS. 5 and 6, in order to make the explanation easier to understand, noise type information associated with the feature amount of any one registered speaker registered in the registered speaker database DB2 and speaker authentication information are shown. An example will be described in which the correspondence lists LST1 and LST2 are referred to based on the noise type of noise included in the authenticated voice data of the target speaker US.
 図5に示す対応リストLST1の参照例において、音声登録時および音声認証時のそれぞれの音声データは、同一のノイズ種別「店舗内雑音」に該当するノイズを含む。 In the reference example of the correspondence list LST1 shown in FIG. 5, each of the voice data at the time of voice registration and voice authentication includes noise that corresponds to the same noise type "in-store noise."
The terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US, received from the microphone MK, to the noise determination device P2. Based on the noise type determination result returned by the noise determination device P2 and the noise type information of the registered speaker stored in the registered speaker database DB2, the terminal device P1 refers to the predefined correspondence list LST1 and selects one similarity calculation model from among the plurality of candidate models.
The correspondence list LST1 is data that associates the determination result "noise determination result 1" for the noise type of the noise included in the authentication voice data of the speaker US, the determination result "noise determination result 2" for the noise type of the registered speaker stored in the registered speaker database DB2, and the similarity calculation model ("selected model") selected based on these two noise types.
"Noise determination result 1" indicates, for example, the noise type determination result for the voice at the time of authentication.
"Noise determination result 2" indicates, for example, the noise type determination result for the registered voice.
Note that one or more pieces of noise type information may be registered in the registered speaker database DB2; when there are multiple noise type candidates, multiple pieces of information may be retained. Determination probability information indicating the reliability of each noise type determination result is optional and may be omitted.
The "selected model" column contains the similarity calculation models "model A", "model B", "model C", and "model Z", each selected according to the combination of "noise determination result 1" and "noise determination result 2".
For example, the similarity calculation model "model A" is the model determined to be best suited to calculating the similarity between the feature value of the speaker US and the feature value of the registered speaker when the noise type of the noise included in the feature value of the speaker US is "noise A" and the noise type of the noise included in the feature value of the registered speaker is also "noise A".
When the similarity calculation model selection unit 115 determines, based on the combination of the noise type of the noise included in the feature value of the speaker US and the noise type of the noise included in the feature value of the registered speaker, that there is no similarity calculation model suited to the similarity calculation, it selects "model Z", a generic similarity calculation model.
The similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST1. For example, in the example shown in FIG. 5, the similarity calculation model selection unit 115 selects the similarity calculation model "model A".
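The correspondence-list lookup described above can be sketched as a simple table keyed by the pair of noise types, with a fallback to the generic model when no suitable model exists. The following Python sketch is illustrative only: the model names and noise labels mirror FIG. 5, while the function and variable names are hypothetical and not part of the disclosure.

```python
# Illustrative sketch of correspondence-list-based model selection.
# Keys are (authentication noise type, registration noise type) pairs;
# any combination not in the list falls back to the generic "model Z".

CORRESPONDENCE_LIST = {
    ("noise A", "noise A"): "model A",
    ("noise B", "noise B"): "model B",
    ("noise C", "noise C"): "model C",
}

GENERIC_MODEL = "model Z"


def select_similarity_model(auth_noise_type, reg_noise_type):
    """Return the similarity calculation model for the noise-type pair,
    or the generic model when no suitable model is defined."""
    return CORRESPONDENCE_LIST.get((auth_noise_type, reg_noise_type), GENERIC_MODEL)
```

For instance, `select_similarity_model("noise A", "noise A")` returns "model A", while an unlisted pairing returns "model Z", matching the fallback behavior of the selection unit 115.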
Next, in the reference example of the correspondence list LST2 shown in FIG. 6, the authentication voice data of the speaker US at the time of voice authentication includes noise of the noise type "in-store noise", while the registered voice data of the registered speaker, recorded at the time of voice registration, includes noise of a different noise type, "outdoor noise".
The terminal device P1 transmits the noise extracted from the authentication voice data of the speaker US, received from the microphone MK, to the noise determination device P2. Based on the noise type determination result returned by the noise determination device P2 and the noise type information of the registered speaker stored in the registered speaker database DB2, the terminal device P1 refers to the predefined correspondence list LST2 and selects one similarity calculation model from among the plurality of candidate models.
The correspondence list LST2 is data that associates the determination result "noise determination result 3" for the noise type of the noise included in the authentication voice data of the speaker US, the determination result "noise determination result 4" for the noise type of the registered speaker stored in the registered speaker database DB2, and the similarity calculation model ("selected model") selected based on these two noise types.
"Noise determination result 3" indicates the noise type determination result for the noise extracted from the authentication voice data of the speaker US.
"Noise determination result 4" indicates the determination result registered as the noise type information of the registered speaker in the registered speaker database DB2.
Note that one or more pieces of noise type information may be registered in the registered speaker database DB2; when there are multiple noise type candidates, multiple pieces of information may be retained. Determination probability information indicating the reliability of each noise type determination result is optional and may be omitted.
The "selected model" column contains the similarity calculation models "model G", "model H", "model I", and "model Z", each selected according to the combination of "noise determination result 3" and "noise determination result 4".
For example, the similarity calculation model "model G" is the model determined to be best suited to calculating the similarity between the feature value of the speaker US and the feature value of the registered speaker when the noise type of the noise included in the authentication voice data is "noise A" and the noise type of the registered speaker is "noise D".
When the similarity calculation model selection unit 115 determines, based on the combination of the noise type of the noise included in the feature value of the speaker US and the noise type of the registered speaker, that there is no similarity calculation model suited to the similarity calculation, it selects the generic similarity calculation model "model Z".
The similarity calculation model selection unit 115 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST2. For example, in the example shown in FIG. 6, the similarity calculation model selection unit 115 selects the similarity calculation model "model G".
As described above, the terminal device P1 can select the similarity calculation model best suited to calculating the similarity between the two feature values (the feature value of the speaker US and the feature value of the registered speaker) based on the combination of the noise types included in each. The terminal device P1 can therefore select a suitable similarity calculation model even when the noise type included in the feature value at the time of voice registration differs from the noise type included in the feature value at the time of voice authentication. In other words, the terminal device P1 can more effectively suppress the degradation of speaker authentication accuracy caused by noise included in the voice data. When the noise determination yields multiple candidate conditions, the terminal device P1 may, for example, calculate a similarity with the similarity calculation model corresponding to each candidate and adopt the average of those values as the final similarity.
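The averaging option mentioned above, used when multiple noise-condition candidates exist, can be sketched as follows. This is a minimal illustration with a hypothetical function name; the per-model similarities are assumed to have been computed beforehand by the respective similarity calculation models.

```python
def combined_similarity(candidate_similarities):
    """Combine the similarities produced by the similarity calculation
    models corresponding to each candidate noise condition by taking
    their mean, which is then adopted as the final similarity."""
    if not candidate_similarities:
        raise ValueError("at least one candidate similarity is required")
    return sum(candidate_similarities) / len(candidate_similarities)
```

For example, if the models for two candidate noise conditions yield similarities of 0.5 and 0.7, the adopted similarity is their mean, 0.6.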
The reliability of the similarity calculation model used in the similarity calculation will now be described with reference to FIG. 7, which illustrates an example of reliability calculation.
Although FIG. 7 shows an example in which the reliability takes one of two levels, "high" and "low", the reliability may instead be calculated as a numerical value, for example from 0 (zero) to 100.
For ease of explanation, FIG. 7 describes the reliability calculation for the case, explained with reference to FIG. 5, in which the registered voice data at the time of voice registration and the authentication voice data at the time of voice authentication have the same noise type.
The reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117. Specifically, the reliability calculation unit 116 determines the reliability based on whether the similarity calculation model used to calculate the similarity is based on a known noise type, on the noise type determination probability, and so on.
In the example shown in "case 1", the noise type determination result gives a determination probability of 90% for the noise type "outdoor noise", 6% for the noise type "in-store music", and 4% for the noise type "unknown noise". The determination result for "case 1" also indicates that the noise types "outdoor noise" and "in-store music" are known noise, while the noise type "unknown noise" is unknown noise.
The similarity calculation model selection unit 115 selects the similarity calculation model "outdoor noise model" based on the noise type determination result. The authentication unit 117 uses the similarity calculation model "outdoor noise model" to calculate the similarity between the feature value of the speaker US and the feature value of one of the registered speakers registered in the registered speaker database DB2.
The reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117. Because the similarity calculation model "outdoor noise model" is based on the known noise type "outdoor noise" and the noise determination probability is 90%, the reliability calculation unit 116 calculates the reliability as "high".
When the reliability calculation unit 116 determines that the noise determination probability is equal to or greater than a predetermined probability (for example, 85% or 90%), it calculates the reliability as "high"; otherwise, it calculates the reliability as "low".
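The two-level reliability decision described above can be sketched as follows. This is a minimal illustration: the 85% threshold, the set of known noise types, and all names are example assumptions drawn from the three cases in FIG. 7, not fixed by the disclosure.

```python
# Example set of known noise types, mirroring the cases in FIG. 7.
KNOWN_NOISE_TYPES = {"outdoor noise", "outdoor wind noise", "in-store music"}


def compute_reliability(top_noise_type, determination_probability, threshold=0.85):
    """Return "high" only when the top-ranked noise type is a known type
    AND its determination probability reaches the threshold; otherwise
    return "low" (e.g. low probability, or an unknown noise type that
    forced the generic model to be used)."""
    if top_noise_type in KNOWN_NOISE_TYPES and determination_probability >= threshold:
        return "high"
    return "low"
```

Under these assumptions, case 1 (known "outdoor noise" at 90%) yields "high", while case 2 (known type at only 48%) and case 3 (unknown type at 55%) both yield "low".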
In the example shown in "case 2", the noise type determination result gives a determination probability of 48% for the noise type "outdoor wind noise", 39% for the noise type "unknown noise", and 13% for the noise type "in-store music". The determination result for "case 2" also indicates that the noise types "outdoor wind noise" and "in-store music" are known noise, while the noise type "unknown noise" is unknown noise.
The similarity calculation model selection unit 115 selects the similarity calculation model "outdoor wind noise model" based on the noise type determination result. The authentication unit 117 uses the similarity calculation model "outdoor wind noise model" to calculate the similarity between the feature value of the speaker US and the feature value of one of the registered speakers registered in the registered speaker database DB2.
The reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117. Although the similarity calculation model "outdoor wind noise model" is based on the known noise type "outdoor wind noise", the noise determination probability is only 48%, so the reliability calculation unit 116 calculates the reliability as "low".
In the example shown in "case 3", the noise type determination result gives a determination probability of 55% for the noise type "unknown noise", 28% for the noise type "outdoor wind noise", and 17% for the noise type "in-store music". The determination result for "case 3" also indicates that the noise types "outdoor wind noise" and "in-store music" are known noise, while the noise type "unknown noise" is unknown noise.
The similarity calculation model selection unit 115 selects the similarity calculation model "generic model" based on the noise type determination result. The authentication unit 117 uses the similarity calculation model "generic model" to calculate the similarity between the feature value of the speaker US and the feature value of one of the registered speakers registered in the registered speaker database DB2.
The reliability calculation unit 116 determines the reliability of the similarity calculated by the authentication unit 117. Because the similarity calculation model "generic model" used to calculate the similarity corresponds to an unknown noise type and the noise determination probability is 55%, the reliability calculation unit 116 calculates the reliability as "low".
As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit) that acquires voice data; the noise extraction unit 111 and the feature value extraction unit 112 (an example of a detection unit) that detect, from the voice data, a speech section in which the speaker is speaking and a non-speech section in which the speaker is not speaking; the noise extraction unit 111 and the feature value extraction unit 112 (an example of an extraction unit) that extract the feature value of the speech section (an example of a speech feature value) and the noise included in the non-speech section of the voice data; the similarity calculation model selection unit 115 (an example of a selection unit) that selects one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and the noise associated with the feature values of a plurality of registered speakers registered in advance (an example of registered feature values); and the authentication unit 117 that authenticates the speaker by comparing the feature value of the speaker US with the feature values of the registered speakers using the selected similarity calculation model.
Accordingly, the terminal device P1 according to the embodiment can select a similarity calculation model suited to speaker authentication based on the combination of the noise types included in the feature value of the speaker US and the feature value of the registered speaker. In other words, the terminal device P1 can select a more suitable similarity calculation model even when the noise type included in the feature value at the time of voice registration differs from the noise included in the feature value at the time of voice authentication. The terminal device P1 can therefore more effectively suppress the degradation of speaker authentication accuracy caused by noise included in the voice data.
In the terminal device P1 according to the embodiment, the communication unit 10 further acquires noise type information of the extracted noise, and the similarity calculation model selection unit 115 selects a similarity calculation model based on the acquired noise type information of the speaker and the noise type information of the registered speaker. The terminal device P1 can thereby select a similarity calculation model better suited to calculating the similarity between the two feature values (the feature value of the speaker US and the feature value of the registered speaker) based on the combination of the noise types included in each.
The terminal device P1 according to the embodiment further includes the authentication unit 117 (an example of a calculation unit) that calculates the similarity between the feature value of the speaker US and the feature value of each registered speaker. The authentication unit 117 authenticates the speaker US based on the plurality of calculated similarities. The terminal device P1 can thereby perform speaker authentication using the similarities between the feature values of the plurality of registered speakers registered in advance and the feature value of the speaker US.
The terminal device P1 according to the embodiment further includes the reliability calculation unit 116 (an example of a reliability calculation unit) that calculates the reliability of the similarity. The communication unit 10 further acquires the noise type information of the extracted noise and a score indicating that the noise is of that noise type. The reliability calculation unit 116 calculates the reliability of the similarity based on the score. By calculating the reliability corresponding to the similarity, the terminal device P1 can thereby calculate the reliability of the speaker authentication result.
In the terminal device P1 according to the embodiment, the authentication unit 117 identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker US. The terminal device P1 can thereby perform speaker authentication using the similarities between the feature values of the plurality of registered speakers registered in advance and the feature value of the speaker US.
In the terminal device P1 according to the embodiment, the authentication unit 117 also generates and outputs the authentication result screen SC including information on the registered speaker whose similarity is equal to or greater than the threshold. The terminal device P1 can thereby present the speaker authentication result to the speaker US or the administrator.
In the terminal device P1 according to the embodiment, when the authentication unit 117 determines that none of the plurality of calculated similarities is equal to or greater than the threshold, it determines that the speaker US cannot be identified. The terminal device P1 can thereby more effectively suppress the degradation of speaker authentication accuracy and more effectively suppress erroneous authentication of the speaker US.
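The threshold-based decision in the preceding paragraphs (identify the registered speaker whose similarity reaches the threshold, or declare the speaker unidentifiable when none does) can be sketched as follows. The function and variable names are hypothetical illustrations, not taken from the disclosure.

```python
def identify_speaker(similarities, threshold):
    """Given a mapping {registered speaker: similarity}, return the
    registered speaker with the highest similarity if that similarity is
    equal to or greater than the threshold; return None when no
    similarity reaches the threshold (speaker unidentifiable)."""
    if not similarities:
        return None
    best_speaker, best_score = max(similarities.items(), key=lambda kv: kv[1])
    return best_speaker if best_score >= threshold else None
```

For example, with a threshold of 0.8, similarities {"registered speaker A": 0.92, "registered speaker B": 0.40} identify speaker A, whereas {"registered speaker A": 0.5, "registered speaker B": 0.4} yield no identification.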
In the terminal device P1 according to the embodiment, the authentication unit 117 also generates and outputs the authentication result screen SC including information on the registered speaker whose similarity is equal to or greater than the threshold and information on the calculated reliability. By displaying the speaker authentication result together with its reliability, the terminal device P1 can prompt the administrator to confirm whether the speaker authentication result is trustworthy.
Although various embodiments have been described above with reference to the drawings, it goes without saying that the present disclosure is not limited to these examples. It will be clear to those skilled in the art that various changes, modifications, substitutions, additions, deletions, and equivalents can be conceived within the scope of the claims, and it is understood that they naturally fall within the technical scope of the present disclosure. The components of the various embodiments described above may also be combined as desired without departing from the spirit of the invention.
This application is based on Japanese Patent Application No. 2022-045390 filed on March 22, 2022, the contents of which are incorporated herein by reference.
The present disclosure is useful as a voice authentication device and a voice authentication method that suppress the degradation of speaker authentication accuracy caused by changes in environmental noise.
10 Communication unit
11 Processor
12 Memory
100 Voice authentication system
111 Noise extraction unit
112 Feature value extraction unit
113 Noise determination unit
114 Speaker registration unit
115 Similarity calculation model selection unit
116 Reliability calculation unit
117 Authentication unit
DB1 Feature value extraction model database
DB2 Registered speaker database
DB3 Similarity calculation model database
DB4 Noise-similarity calculation model correspondence list
MK Microphone
MN Monitor
NW Network
P1 Terminal device
P2 Noise determination device
SC Authentication result screen
US Speaker

Claims (9)

  1.  A voice authentication device comprising:
     an acquisition unit that acquires voice data;
     a detection unit that detects, from the voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking;
     an extraction unit that extracts a speech feature value of the speech section and noise included in the non-speech section of the voice data;
     a selection unit that selects one similarity calculation model from among a plurality of similarity calculation models based on the extracted noise and noise associated with registered feature values of a plurality of registered speakers registered in advance; and
     an authentication unit that authenticates the speaker by comparing the speech feature value of the speaker with the registered feature values of the registered speakers using the selected similarity calculation model.
  2.  The voice authentication device according to claim 1, wherein
     the acquisition unit further acquires noise type information of the extracted noise, and
     the selection unit selects the similarity calculation model based on the acquired noise type information of the speaker and the noise type information of the registered speaker.
  3.  The voice authentication device according to claim 1, further comprising
     a calculation unit that calculates a similarity between the speech feature value of the speaker and the registered feature value of each registered speaker,
     wherein the authentication unit authenticates the speaker based on the plurality of calculated similarities.
  4.  The voice authentication device according to claim 3, further comprising
     a reliability calculation unit that calculates a reliability of the similarity,
     wherein the acquisition unit further acquires noise type information of the extracted noise and a score indicating that the noise is of the noise type, and
     the reliability calculation unit calculates the reliability of the similarity based on the score.
  5.  The authentication unit identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker.
     The voice authentication device according to claim 3.
  6.  The authentication unit generates and outputs an authentication result screen including information on the registered speaker whose similarity is equal to or greater than the threshold.
     The voice authentication device according to claim 5.
  7.  The authentication unit determines that the speaker cannot be identified when none of the plurality of calculated similarities is equal to or greater than a threshold.
     The voice authentication device according to claim 3.
  8.  The authentication unit identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker, and generates and outputs an authentication result screen including information on that registered speaker and information on the calculated reliability.
     The voice authentication device according to claim 4.
  9.  A voice authentication method performed by a terminal device, the method comprising:
     acquiring audio data;
     detecting, from the audio data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking;
     extracting speech features of the speech section and noise included in the non-speech section of the audio data;
     selecting one similarity calculation model from among a plurality of similarity calculation models, based on the extracted noise and noise associated with the registered features of a plurality of registered speakers registered in advance; and
     authenticating the speaker by comparing the speech features of the speaker with the registered features of the registered speaker using the selected similarity calculation model.
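Read together, the claims describe a pipeline: detect speech and non-speech sections, extract speech features and noise, select a similarity calculation model matched to the noise conditions at enrollment and at query time, then score each registered speaker and apply a threshold. A minimal sketch of that flow follows; the function names, the energy-based voice activity detection, cosine scoring, and the thresholds are all illustrative assumptions and are not taken from the specification.

```python
# Illustrative sketch of the claimed flow; not the patented implementation.
import numpy as np

def detect_sections(audio, frame_len=400, energy_ratio=2.0):
    """Split audio into speech / non-speech frames with a simple energy VAD."""
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, frame_len)]
    energies = np.array([float(np.mean(f ** 2)) for f in frames])
    threshold = energy_ratio * np.median(energies)
    speech = [f for f, e in zip(frames, energies) if e >= threshold]
    noise = [f for f, e in zip(frames, energies) if e < threshold]
    return speech, noise

def noise_type(noise_frames, profiles):
    """Pick the registered noise profile (e.g. 'car', 'office') closest to
    the averaged magnitude spectrum of the non-speech frames."""
    spectrum = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)
    spectrum = spectrum / (np.linalg.norm(spectrum) + 1e-9)
    return max(profiles, key=lambda k: float(spectrum @ profiles[k]))

def select_model(query_noise, registered_noise, models):
    """Select a similarity model keyed by the enrollment/query noise pair."""
    return models.get((registered_noise, query_noise), models["default"])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def authenticate(speaker_emb, registered, threshold=0.7, similarity=cosine):
    """Return the best-matching registered speaker, or None when every
    similarity falls below the threshold (speaker unidentifiable)."""
    scores = {name: similarity(speaker_emb, emb)
              for name, emb in registered.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= threshold else (None, scores[best])
```

In this sketch, `models` would map noise-condition pairs to scoring functions trained under those conditions, which is one plausible reading of the "similarity calculation model" selection in the claims.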
PCT/JP2023/009469 2022-03-22 2023-03-10 Voice authentication device and voice authentication method WO2023182016A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022045390 2022-03-22
JP2022-045390 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023182016A1 true WO2023182016A1 (en) 2023-09-28

Family

ID=88101383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/009469 WO2023182016A1 (en) 2022-03-22 2023-03-10 Voice authentication device and voice authentication method

Country Status (1)

Country Link
WO (1) WO2023182016A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6242198A * 1985-08-20 1987-02-24 Matsushita Electric Industrial Co., Ltd. Voice recognition equipment
JPH0573090A (en) * 1991-09-18 1993-03-26 Fujitsu Ltd Speech recognizing method
JPH0736477A (en) * 1993-07-16 1995-02-07 Ricoh Co Ltd Pattern matching system
JP2006003400A (en) * 2004-06-15 2006-01-05 Honda Motor Co Ltd On-board voice recognition system
JP2019035935A * 2017-08-10 2019-03-07 Toyota Motor Corporation Voice recognition apparatus


Similar Documents

Publication Publication Date Title
US20230368780A1 (en) Wakeword detection
US11734326B2 (en) Profile disambiguation
US20170084274A1 (en) Dialog management apparatus and method
AU2017425675B2 (en) Extracting domain-specific actions and entities in natural language commands
AU2017424116B2 (en) Extracting domain-specific actions and entities in natural language commands
US20160071516A1 (en) Keyword detection using speaker-independent keyword models for user-designated keywords
JP2019053126A (en) Growth type interactive device
JP6280074B2 (en) Rephrase detection device, speech recognition system, rephrase detection method, program
US11514900B1 (en) Wakeword detection
JP2018136493A (en) Voice recognition computer program, voice recognition device and voice recognition method
US20190042560A1 (en) Extracting domain-specific actions and entities in natural language commands
JP6495792B2 (en) Speech recognition apparatus, speech recognition method, and program
US20230386468A1 (en) Adapting hotword recognition based on personalized negatives
US9224388B2 (en) Sound recognition method and system
JP6676009B2 (en) Speaker determination device, speaker determination information generation method, and program
WO2023182016A1 (en) Voice authentication device and voice authentication method
US10997972B2 (en) Object authentication device and object authentication method
WO2023182014A1 (en) Voice authentication device and voice authentication method
KR101840363B1 (en) Voice recognition apparatus and terminal device for detecting misprononced phoneme, and method for training acoustic model
US20210241755A1 (en) Information-processing device and information-processing method
WO2023182015A1 (en) Voice authentication device and voice authentication method
JP7326596B2 (en) Voice data creation device
CN110895938B (en) Voice correction system and voice correction method
US20220335927A1 (en) Learning apparatus, estimation apparatus, methods and programs for the same
KR20200053242A (en) Voice recognition system for vehicle and method of controlling the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23774615

Country of ref document: EP

Kind code of ref document: A1