WO2023182014A1 - Voice authentication device and voice authentication method - Google Patents

Voice authentication device and voice authentication method

Info

Publication number
WO2023182014A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
registered
sound collection
similarity
similarity calculation
Prior art date
Application number
PCT/JP2023/009467
Other languages
French (fr)
Japanese (ja)
Inventor
正成 宮本
Original Assignee
Panasonic Intellectual Property Management Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co., Ltd.
Publication of WO2023182014A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates

Definitions

  • The present disclosure relates to a voice authentication device and a voice authentication method.
  • Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice.
  • The speech recognition device stores a plurality of motion noise models, each created for one of a plurality of motions, in association with those motions; detects input speech that includes the subject's voice; identifies the subject's motion; and reads out the motion noise model corresponding to the identified motion.
  • The speech recognition device then reads out an environmental noise model corresponding to the subject's current position, synthesizes the environmental noise model with the read motion noise model, and uses the synthesized noise-superimposition model to recognize the subject's voice contained in the detected input speech.
  • In Patent Document 1, however, it is necessary to collect in advance both the motion noise generated by each of the plurality of motions and the environmental noise at each of the plurality of positions where voice recognition can be performed, which is very time-consuming.
  • In voiceprint authentication, the feature amount indicating the individuality of the authentication target (person) extracted from an audio signal changes depending on the noise contained in the audio signal and on the sound collection conditions, such as the sound collection device with which the audio signal was collected. Therefore, when voiceprint authentication is performed using the voice recognition device described above, if the sound collection conditions of the pre-registered audio signal differ from those of the audio signal collected at the time of voiceprint authentication, the feature amounts extracted from the respective audio signals do not indicate the individuality of the same person, and the accuracy of voiceprint authentication may decrease.
  • The present disclosure has been devised in view of the conventional situation described above, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
  • The present disclosure provides a voice authentication device comprising: an acquisition unit that acquires audio data; a detection unit that detects, from the audio data, an utterance section in which a speaker is speaking; an extraction unit that extracts an utterance feature amount of the speaker from the detected utterance section; a selection unit that selects, from among a plurality of similarity calculation models, a first similarity calculation model used for authenticating the speaker, based on the extracted utterance feature amount of the speaker and the utterance feature amount of at least one registered speaker registered in advance; and an authentication unit that authenticates the speaker by comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model.
  • The present disclosure also provides a voice authentication method performed by a terminal device, comprising: acquiring audio data; detecting, from the audio data, an utterance section in which a speaker is speaking; extracting an utterance feature amount of the speaker from the detected utterance section; selecting, from among a plurality of similarity calculation models, a first similarity calculation model used for authenticating the speaker, based on the extracted utterance feature amount of the speaker and the utterance feature amount of at least one registered speaker registered in advance; and authenticating the speaker by comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model.
  • Block diagram showing an example of the internal configuration of a voice authentication system
  • Diagram illustrating each process performed by a processor of a terminal device in the embodiment
  • Flowchart showing an example of the operation procedure of the terminal device in the embodiment
  • Flowchart illustrating an example of a procedure for determining sound collection conditions of the terminal device in the embodiment
  • Diagram illustrating an example of determining sound collection conditions and calculating reliability
  • Flowchart showing an example of a speaker authentication procedure of the terminal device in the embodiment
  • Diagram explaining an example of a correspondence list when the noise type at the time of voice registration and the noise type at the time of voice authentication are different
  • Diagram explaining a specific example of the correspondence list
  • FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system 100 according to an embodiment.
  • FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
  • The voice authentication system 100 includes a terminal device P1, which is an example of a voice authentication device. The voice authentication system 100 may further include the microphone MK and the monitor MN.
  • The microphone MK picks up the voice uttered by the speaker US in order to register the voice in the terminal device P1 in advance.
  • The microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data to be registered in the terminal device P1.
  • The microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • The microphone MK also picks up the voice uttered by the speaker US that is used for speaker authentication.
  • The microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data.
  • The microphone MK transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
  • In the following, the voice data for voice registration or the voice data already registered in the terminal device P1 is referred to as "registered voice data", and the voice data for voice authentication is referred to as "authentication voice data", to distinguish between them.
  • The microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, or a tablet terminal. Further, the microphone MK may transmit an audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown).
  • The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes a voice registration process using the registered voice data of the speaker US and a speaker authentication process using the authentication voice data.
  • The terminal device P1 includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a sound collection condition-specific learning database DB4.
  • The communication unit 10, which is an example of an acquisition unit, is connected to the microphone MK and the monitor MN so that data can be transmitted and received by wire or wirelessly.
  • The wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless local area network (LAN) such as Wi-Fi (registered trademark).
  • The communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as Universal Serial Bus (USB). Furthermore, the communication unit 10 may transmit and receive data to and from the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
  • The processor 11 is configured using, for example, a central processing unit (CPU) or a field programmable gate array (FPGA), and performs various processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12, and executes the program to realize the functions of each unit, such as the feature amount extraction unit 111, the sound collection condition determination unit 112, the speaker registration unit 113, the similarity calculation model selection unit 114, the reliability calculation unit 115, and the authentication unit 116.
  • When registering the voice of the speaker US, the processor 11 realizes the functions of the feature amount extraction unit 111, the speaker registration unit 113, and the sound collection condition determination unit 112, thereby executing the process of newly registering (storing) the voice of the speaker US in the registered speaker database DB2.
  • During voice authentication of the speaker US, the processor 11 realizes the functions of the feature amount extraction unit 111, the sound collection condition determination unit 112, the similarity calculation model selection unit 114, the reliability calculation unit 115, and the authentication unit 116, thereby executing the speaker authentication process.
  • The feature amount extraction unit 111, which is an example of a detection unit and an extraction unit, acquires the voice data (registered voice data or authentication voice data) of the speaker US transmitted from the microphone MK.
  • The feature amount extraction unit 111 executes the voice registration process or the voice authentication process based on a control command associated with the voice data.
  • At the time of voice registration, the feature amount extraction unit 111 detects the utterance section in which the speaker US is speaking from the registered voice data.
  • The feature amount extraction unit 111 extracts a feature amount indicating the individuality of the speaker US from the detected utterance section, and outputs the extracted feature amount to the sound collection condition determination unit 112 and the speaker registration unit 113.
  • At the time of voice authentication, the feature amount extraction unit 111 extracts the feature amount of the speaker US from the authentication voice data and outputs it to each of the sound collection condition determination unit 112 and the authentication unit 116.
  • Based on the feature amount of the speaker US output from the feature amount extraction unit 111, the sound collection condition determination unit 112, which is an example of a determination unit, determines the sound collection conditions under which the voice data from which the feature amount was extracted (that is, the uttered voice of the speaker US) was collected. The sound collection condition determination unit 112 outputs the information on the sound collection conditions of the feature amount of the speaker US to the speaker registration unit 113 at the time of voice registration, and to the similarity calculation model selection unit 114 at the time of voice authentication.
  • The sound collection conditions here refer to, for example, the sound collection device that picked up the utterance of the speaker US or a registered speaker, the speaker's language, gender, and age, and the type of noise included in the features of the speaker US or the registered speaker.
  • The sound collection device is, for example, a microphone, a telephone, a headset, or the like.
  • Noise is sound that is picked up due to the environment (background) at the time of sound collection, and includes, for example, surrounding voices, music, the sound of a vehicle running, the sound of the wind, and the like.
  • The noise type indicates, for example, the environment (place) or position where the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or inside a station.
  • The noise type may further include information on time zones such as early morning, daytime, and nighttime.
  • The speaker registration unit 113 acquires the feature amount of the speaker US output from the feature amount extraction unit 111 and the speaker information of the speaker US associated with the registered voice data.
  • The speaker registration unit 113 also acquires the information on the sound collection conditions output from the sound collection condition determination unit 112.
  • The speaker registration unit 113 associates the feature amount of the speaker US, the speaker information of the speaker US, and the information on the sound collection conditions with one another and registers them in the registered speaker database DB2, as sketched below.
  • The speaker information may be extracted from the registered voice data by voice recognition, or may be obtained from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, or a tablet terminal).
  • The speaker information here includes, for example, identification information that can identify the speaker US, such as the name of the speaker US or a speaker identification (ID).
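  • To make the structure of a registered-speaker entry concrete, the following is a minimal Python sketch; the class, field, and function names (RegisteredSpeaker, register_speaker, and so on) are illustrative assumptions, not names from the disclosure.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class RegisteredSpeaker:
    """One entry of the registered speaker database DB2 (illustrative layout)."""
    speaker_id: str       # speaker identification (ID)
    name: str             # name of the speaker
    feature: np.ndarray   # feature amount extracted from the utterance section
    conditions: dict      # sound collection conditions, e.g.
                          # {"gender": "male", "equipment": "telephone",
                          #  "noise_type": "in-store noise"}

# The registered speaker database DB2, modeled here as an in-memory list.
registered_speaker_db2: list[RegisteredSpeaker] = []

def register_speaker(speaker_id: str, name: str,
                     feature: np.ndarray, conditions: dict) -> None:
    """Associates the feature amount, the speaker information, and the sound
    collection conditions, and registers them (cf. speaker registration unit 113)."""
    registered_speaker_db2.append(
        RegisteredSpeaker(speaker_id, name, feature, conditions))
```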
  • The similarity calculation model selection unit 114 acquires the information on the sound collection conditions of the feature amount of the speaker US output from the sound collection condition determination unit 112, and the information on the sound collection conditions of the feature amounts of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • Based on the acquired information on the sound collection conditions of the feature amount of the speaker US and the information on the sound collection conditions of the feature amounts of each of the plurality of registered speakers, the similarity calculation model selection unit 114 selects the similarity calculation model (a first similarity calculation model) to be used in the similarity calculation process between the feature amount of the speaker US and the feature amount of any one registered speaker.
  • Specifically, the similarity calculation model selection unit 114 refers to the correspondence lists LST, LST1, and LST2 (see FIGS. 7, 8, and 9), each of which associates the information on the sound collection conditions of the feature amount of the speaker US and the information on the sound collection conditions of the feature amounts of each of the plurality of registered speakers with a selectable similarity calculation model, selects one model (the first similarity calculation model) from among the plurality of selectable models, and outputs it to each of the reliability calculation unit 115 and the authentication unit 116, as shown in the sketch below.
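  • Conceptually, each correspondence list is a lookup table keyed by the pair of sound collection condition determination results. A minimal sketch, assuming the list is represented as a Python dictionary whose entries follow FIGS. 7 and 8, with "Model Z" as the general-purpose fallback:

```python
# (conditions of the speaker US at authentication, conditions of the registered
#  speaker at registration) -> selected similarity calculation model.
CORRESPONDENCE_LIST = {
    ("in-store noise, male, telephone voice",
     "in-store noise, male, telephone voice"): "Model A",  # cf. FIG. 7
    ("outdoor noise, male, headset voice",
     "in-store noise, male, telephone voice"): "Model D",  # cf. FIG. 8
}

GENERAL_PURPOSE_MODEL = "Model Z"  # used when no suitable model exists

def select_similarity_model(cond_auth: str, cond_reg: str) -> str:
    """Selects the first similarity calculation model from the correspondence
    list (cf. similarity calculation model selection unit 114)."""
    return CORRESPONDENCE_LIST.get((cond_auth, cond_reg), GENERAL_PURPOSE_MODEL)
```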
  • The reliability calculation unit 115, which is an example of a reliability calculation unit, calculates (evaluates) the reliability (score) indicating the certainty of the identification result of the speaker US based on the similarity calculated by the authentication unit 116.
  • Specifically, the reliability calculation unit 115 calculates the reliability based on the distance between the learning data distribution of the similarity calculation model used in the similarity calculation process by the authentication unit 116 and the feature amount of the speaker US.
  • The reliability calculation unit 115 outputs the calculated reliability information to the authentication unit 116.
  • The authentication unit 116, which is an example of a calculation unit, acquires the feature amount of the speaker US output from the feature amount extraction unit 111, and acquires the feature amounts of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • The authentication unit 116 also obtains the selected models of the correspondence lists LST, LST1, and LST2 output from the similarity calculation model selection unit 114.
  • The authentication unit 116 uses the similarity calculation model selected based on the correspondence lists LST, LST1, and LST2 to calculate the similarity between the feature amounts of each of the plurality of registered speakers and the feature amount of the speaker US.
  • The authentication unit 116 identifies the speaker US based on the calculated similarity.
  • The authentication unit 116 also obtains the reliability information output from the reliability calculation unit 115.
  • The authentication unit 116 generates an authentication result screen SC based on the speaker information of the identified speaker US and the reliability information, and transmits it to the monitor MN.
  • The memory 12 includes, for example, a random access memory (hereinafter referred to as "RAM") as a work memory used when executing each process of the processor 11, and a read-only memory (hereinafter referred to as "ROM") that stores programs and data defining the operation of the processor 11.
  • Data or information generated or acquired by the processor 11 is temporarily stored in the RAM.
  • A program that defines the operation of the processor 11 is written in the ROM.
  • The feature extraction model database DB1 is a so-called storage, and is configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter referred to as "HDD"), or a Solid State Drive (hereinafter referred to as "SSD").
  • The feature extraction model database DB1 stores a feature extraction model capable of detecting the utterance section of the speaker US from registered voice data or authentication voice data and extracting the feature amount of the speaker US.
  • The feature extraction model is, for example, a learning model generated by training using deep learning or the like.
  • The registered speaker database DB2 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD.
  • The registered speaker database DB2 stores the feature amounts of each of the plurality of registered speakers registered in advance, the information on the determination results of the sound collection conditions corresponding to those feature amounts, and the registered speaker information in association with one another.
  • The similarity calculation model database DB3 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD.
  • The similarity calculation model database DB3 stores similarity calculation models that can calculate the similarity between a feature amount extracted from authentication voice data and the feature amount of a registered speaker registered in the registered speaker database DB2.
  • A similarity calculation model is a learning model generated by training with learning data under predetermined sound collection conditions using deep learning or the like.
  • A similarity calculation model learns and retains in advance the dimensions in which individuality is likely to be expressed, in order to calculate the similarity between two multidimensional vectors with high precision.
  • The method of calculating similarity using a model is just one example of a method for calculating the similarity between vectors; known techniques such as Euclidean distance and cosine similarity may also be used, as illustrated below.
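  • As a concrete illustration of the alternative vector-similarity measures mentioned above, a minimal sketch (not part of the disclosure) that computes the Euclidean distance and the cosine similarity between two feature vectors:

```python
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Euclidean distance between two feature vectors (smaller = more similar)."""
    return float(np.linalg.norm(x - y))

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity between two feature vectors (closer to 1 = more similar)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```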
  • The sound collection condition-specific learning database DB4 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD.
  • The sound collection condition-specific learning database DB4 stores distribution information of the learning data used for training the similarity calculation models.
  • The monitor MN is configured using a display such as a liquid crystal display (LCD) or an organic electroluminescence (EL) display.
  • The monitor MN displays the authentication result screen SC output from the terminal device P1.
  • The authentication result screen SC is a screen that notifies the administrator (for example, the person viewing the monitor MN) of the speaker authentication result, and contains authentication result information such as "Matched Mr. XX's voice" and reliability information such as "Reliability: High".
  • The authentication result screen SC may also include other registered speaker information (for example, a face image). The authentication result screen SC does not need to include the reliability information.
  • FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
  • The terminal device P1 acquires audio data from the microphone MK (St11).
  • The microphone MK here may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
  • The terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
  • In step St12, if the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines that the feature amount of the speaker US is to be newly registered in the registered speaker database DB2 (St12, YES), and extracts the feature amount of the speaker US from the voice data (registered voice data) (St13).
  • In step St12, if the control command is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines that the feature amount of the speaker US is not to be newly registered in the registered speaker database DB2 (St12, NO), and extracts the feature amount of the speaker US from the voice data (authentication voice data) (St14).
  • At the time of voice registration, the terminal device P1 executes the process of determining the sound collection conditions of the registered voice data (uttered voice) from which the feature amount has been extracted (St15A).
  • The terminal device P1 stores (registers) the feature amount of the speaker US, the information on the sound collection conditions, and the speaker information of the speaker US in the registered speaker database DB2 in association with one another (St16).
  • At the time of voice authentication, the terminal device P1 executes the process of determining the sound collection conditions of the authentication voice data (uttered voice) from which the feature amount has been extracted (St15B).
  • The terminal device P1 acquires the feature amounts and the information on the sound collection conditions of each of the plurality of registered speakers registered in the registered speaker database DB2.
  • Based on the information on the feature amount and sound collection conditions of the speaker US and the information on the feature amounts and sound collection conditions of each of the plurality of registered speakers, the terminal device P1 selects a similarity calculation model for calculating the similarity between the feature amount of the speaker US and the feature amount of each of the plurality of registered speakers (St18).
  • Specifically, the terminal device P1 refers to the correspondence lists LST, LST1, and LST2 (see FIGS. 7, 8, and 9) and selects one of the plurality of selectable models.
  • The terminal device P1 executes the speaker authentication process based on the selected model of the referenced correspondence list LST, LST1, or LST2 (St19).
  • FIG. 4 is a flowchart illustrating an example of a sound collection condition determination procedure of the terminal device P1 in the embodiment.
  • FIG. 5 is a diagram illustrating an example of determining sound collection conditions and an example of calculating reliability.
  • The terminal device P1 acquires the feature amount of the speaker US from the authentication voice data (St151), and acquires the learning data information for each similarity calculation model registered in the similarity calculation model database DB3 (St152).
  • The learning data here is data for calculating the similarity of feature amounts under predetermined sound collection conditions.
  • The terminal device P1 calculates, for example, the distance between the feature amount of the speaker US and each learning data (St153).
  • The terminal device P1 identifies and selects the similarity calculation model with the smallest calculated distance (St154).
  • The terminal device P1 determines that the sound collection condition corresponding to this similarity calculation model is the sound collection condition of the feature amount of the speaker US (St154).
  • In the example shown in FIG. 5, the terminal device P1 calculates the distances d_XA, d_XB, and d_XC between the feature amount PT1 of the speaker US and the learning data distributions of the similarity calculation models DBA, DBB, and DBC, respectively.
  • The three similarity calculation models DBA, DBB, and DBC shown in FIG. 5 illustrate an example in which each area includes five learning data, but it is sufficient if the number of learning data used for generating (training) a similarity calculation model is one or more. Furthermore, the number of similarity calculation models is not limited to three and may be two or more.
  • The terminal device P1 calculates the distances d_XA, d_XB, and d_XC using the following (Formula 1):
  • (Formula 1): d_XA = (1/N) × Σ_{i=1}^{N} d(X, A_i), where X is the feature amount of the speaker US, A_i is the i-th learning data of the similarity calculation model DBA, d(X, A_i) is the distance between the two feature vectors, and N (an integer of 1 or more) is the number of learning data used to generate the similarity calculation model, or a representative number of learning data of the similarity calculation model used for the distance calculation.
  • The number N of learning data does not have to be the same for each similarity calculation model.
  • In the example shown in FIG. 5, N = 5, and the distance d_XA is the average of the distances between the feature amount X and each of the learning data A_1, A_2, A_3, A_4, and A_5.
  • The distances d_XB and d_XC are calculated in the same manner, using the learning data of the similarity calculation models DBB and DBC, respectively.
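  • The following is a minimal Python sketch (not part of the disclosure) of (Formula 1) and of the selection in steps St153 to St154, assuming Euclidean distance and representing each similarity calculation model's learning data as a matrix with one row per learning datum; all function and variable names are illustrative.

```python
import numpy as np

def mean_distance(x: np.ndarray, learning_data: np.ndarray) -> float:
    """(Formula 1): d = (1/N) * sum_i ||x - a_i|| over the N learning data rows."""
    return float(np.mean(np.linalg.norm(learning_data - x, axis=1)))

def determine_sound_collection_condition(x: np.ndarray,
                                         models: dict[str, np.ndarray]) -> tuple[str, float]:
    """St153-St154: calculates the distance between the feature amount x and the
    learning data of each similarity calculation model, then selects the model
    with the smallest distance; its sound collection condition is taken as the
    sound collection condition of x."""
    distances = {name: mean_distance(x, data) for name, data in models.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]
```

  • For the FIG. 5 example, models would map "DBA", "DBB", and "DBC" to matrices of five learning data each, and the function would return "DBC" when d_XC is the minimum.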
  • The terminal device P1 selects, as the similarity calculation model used for calculating the similarity, the similarity calculation model corresponding to the learning data distribution with the minimum distance among the plurality of calculated distances. For example, in the example shown in FIG. 5, the distance d_XC is the minimum among the calculated distances d_XA, d_XB, and d_XC. In such a case, the terminal device P1 determines that the sound collection conditions of the voice uttered by the speaker US (feature amount PT1) are the same as the sound collection conditions corresponding to the similarity calculation model DBC.
  • The terminal device P1 calculates the reliability of the calculated similarity based on the distance between the learning data distribution of the similarity calculation model used to calculate the similarity and the feature amount of the speaker US (that is, the distance d_XA, d_XB, or d_XC).
  • Specifically, the terminal device P1 determines whether the distance between the learning data distribution of the similarity calculation model used to calculate the similarity and the feature amount of the speaker US is less than or equal to a predetermined value.
  • If the terminal device P1 determines that the distance is less than or equal to the predetermined value, it calculates (evaluates) the reliability of the similarity as "high". On the other hand, if the terminal device P1 determines that the distance between the learning data distribution of the similarity calculation model and the feature amount of the speaker US is not less than or equal to the predetermined value, it calculates (evaluates) the reliability of the similarity as "low".
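  • The reliability evaluation described above reduces to a threshold test on the distance obtained from (Formula 1). A minimal sketch, assuming a placeholder threshold value (the disclosure does not specify the predetermined value):

```python
RELIABILITY_THRESHOLD = 1.0  # the "predetermined value"; an assumed placeholder

def evaluate_reliability(distance: float) -> str:
    """Reliability of the similarity: "high" if the feature amount of the speaker
    US lies within the predetermined distance of the learning data distribution
    of the similarity calculation model, otherwise "low"."""
    return "high" if distance <= RELIABILITY_THRESHOLD else "low"
```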
  • FIG. 6 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
  • Based on the correspondence lists LST, LST1, and LST2, the terminal device P1 reads, from the similarity calculation model database DB3, the similarity calculation model used to calculate the degree of similarity between the feature amount of the speaker US and the feature amount of one registered speaker among the plurality of registered speakers (St191).
  • The terminal device P1 uses the similarity calculation model to calculate the similarity between the feature amount of the speaker US and the feature amount of the registered speaker (St192). Furthermore, the terminal device P1 calculates the reliability of the calculated similarity based on the distance, calculated using (Formula 1), between the similarity calculation model and the feature amount of the speaker US (St192). The terminal device P1 repeats the process of step St192 until it has calculated the similarity and the reliability between the feature amount of the voice data of the speaker US and the feature amounts of all registered speakers registered in the registered speaker database DB2.
  • The terminal device P1 determines whether there is a degree of similarity equal to or greater than a threshold among the calculated degrees of similarity (St193).
  • If the terminal device P1 determines in the process of step St193 that there is a degree of similarity equal to or greater than the threshold among the calculated degrees of similarity (St193, YES), it determines that the registered speaker corresponding to that degree of similarity and the speaker US are the same person. The terminal device P1 identifies the speaker US based on the registered speaker information of this registered speaker (St194). Note that if there are multiple degrees of similarity determined to be equal to or greater than the threshold, the terminal device P1 may determine that the registered speaker corresponding to the highest calculated degree of similarity and the speaker US are the same person.
  • If the terminal device P1 determines in the process of step St193 that there is no degree of similarity equal to or greater than the threshold among the calculated degrees of similarity (St193, NO), it determines that the speaker US cannot be identified (St195).
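  • Steps St192 to St195 amount to scoring every registered speaker and accepting the best score only if it is equal to or greater than the threshold. A minimal sketch under the same illustrative assumptions as the earlier snippets (the similarity function, the threshold value, and the database layout are not specified by the disclosure):

```python
from typing import Callable, Optional

import numpy as np

SIMILARITY_THRESHOLD = 0.8  # assumed placeholder for the threshold

def authenticate(feature: np.ndarray,
                 registered_db: list,
                 compute_similarity: Callable[[np.ndarray, np.ndarray], float]) -> Optional[str]:
    """St192-St195: calculates the similarity against every registered speaker,
    identifies the registered speaker with the highest similarity if it is equal
    to or greater than the threshold, and otherwise returns None (speaker US
    cannot be identified)."""
    if not registered_db:
        return None
    scores = [(compute_similarity(feature, spk.feature), spk) for spk in registered_db]
    best_score, best_speaker = max(scores, key=lambda s: s[0])
    if best_score >= SIMILARITY_THRESHOLD:
        return best_speaker.speaker_id  # St194: speaker US identified
    return None  # St195: speaker US cannot be identified
```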
  • The terminal device P1 generates the authentication result screen SC based on the registered speaker information of the identified speaker US.
  • The terminal device P1 outputs the generated authentication result screen SC to the monitor MN for display (St196).
  • As described above, at the time of voice registration, the terminal device P1 registers the speaker information, the feature amount of the speaker US, and the sound collection conditions of the feature amount of the speaker US in association with one another.
  • Because the terminal device P1 registers, in association with the feature amount, the sound collection conditions that change the feature amount indicating the individuality of the speaker US, a similarity calculation model can be selected based on the sound collection conditions of each feature amount even if the sound collection conditions of the feature amount at the time of voice registration and those at the time of voice authentication differ, and the feature amounts therefore differ; speaker authentication can thus be performed with higher accuracy. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by noise included in the authentication voice data.
  • The terminal device P1 also calculates and displays the reliability, which indicates the likelihood of the speaker identified by the speaker authentication process as indicated by the calculated similarity, based on the similarity calculation model used to calculate the similarity. Thereby, the terminal device P1 can present the certainty of the speaker authentication result to the administrator viewing the monitor MN. Therefore, by presenting the reliability, the terminal device P1 can inform the administrator that there was no similarity calculation model suitable for calculating the similarity and that speaker authentication was performed using the general-purpose similarity calculation model described later.
  • FIG. 7 is a diagram illustrating an example of the correspondence list LST1 when the sound collection condition estimation result at the time of voice registration and the sound collection condition estimation result at the time of voice authentication are the same.
  • FIG. 8 is a diagram illustrating an example of the correspondence list LST2 when the sound collection conditions at the time of voice registration and the sound collection conditions at the time of voice authentication are different.
  • In the example shown in FIG. 7, the audio data at the time of voice registration and at the time of voice authentication have the same sound collection conditions: "in-store noise, male, telephone voice".
  • The terminal device P1 extracts the feature amount of the speaker US from the authentication voice data of the speaker US transmitted from a sound collection device such as the microphone MK.
  • The terminal device P1 refers to the sound collection condition-specific learning database DB4 and, based on the extracted feature amount and the distribution of the learning data for each sound collection condition, determines that the sound collection conditions of the feature amount of the speaker US are "in-store noise, male, telephone voice".
  • Based on the information on the sound collection conditions of the speaker US and the information on the sound collection conditions of each of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects a similarity calculation model corresponding to the combination of the sound collection conditions of the speaker US and the sound collection conditions of each of the plurality of registered speakers.
  • Specifically, the terminal device P1 refers to the correspondence list LST1, which associates the information on the sound collection conditions of the speaker US, the information on the sound collection conditions of each of the plurality of registered speakers, and each of the selectable similarity calculation models.
  • The correspondence list LST1 is data that associates the determination result "sound collection condition determination result 1" of the sound collection conditions of the feature amount of the speaker US, the determination result "sound collection condition determination result 2" of the sound collection conditions of the registered speakers registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two sound collection conditions.
  • The sound collection condition determination result "sound collection condition determination result 1" indicates the determination result of the sound collection conditions of the speaker US.
  • The sound collection condition determination result "sound collection condition determination result 2" indicates the determination result of the sound collection conditions of the registered speakers registered in the registered speaker database DB2. Information on the determination probability corresponding to each sound collection condition determination result is not essential and may be omitted.
  • The determination result of the sound collection conditions may include a plurality of conditions such as, for example, the gender of the speaker US, the sound collection device, and the type of noise.
  • The similarity calculation model "selected model" is the similarity calculation model selected corresponding to the sound collection condition determination result "sound collection condition determination result 1" and the sound collection condition determination result "sound collection condition determination result 2", and includes each of the models "Model A", "Model B", "Model C", and "Model Z".
  • If the similarity calculation model selection unit 114 determines that there is no similarity calculation model suitable for the similarity calculation process based on the combination of the sound collection conditions of the feature amount of the speaker US and the sound collection conditions of the feature amount of the registered speaker, it selects the general-purpose similarity calculation model "Model Z".
  • The similarity calculation model selection unit 114 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST1. For example, in the example shown in FIG. 7, the similarity calculation model selection unit 114 selects the similarity calculation model "Model A".
  • In the example shown in FIG. 8, the registered voice data at the time of voice registration has the sound collection conditions "in-store noise, male, telephone voice".
  • The audio data at the time of voice authentication has sound collection conditions "outdoor noise, male, headset voice" that are different from the sound collection conditions of the registered voice data at the time of voice registration.
  • The terminal device P1 extracts the feature amount of the speaker US from the authentication voice data of the speaker US transmitted from a sound collection device such as the microphone MK.
  • The terminal device P1 refers to the sound collection condition-specific learning database DB4 and, based on the extracted feature amount and the distribution of the learning data for each sound collection condition, determines that the sound collection conditions of the feature amount of the speaker US are "outdoor noise, male, headset voice".
  • Based on the information on the sound collection conditions of the speaker US and the information on the sound collection conditions of each of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects a similarity calculation model corresponding to the combination of the sound collection conditions of the speaker US and the sound collection conditions of each of the plurality of registered speakers.
  • Specifically, the terminal device P1 refers to the correspondence list LST2, which associates the information on the sound collection conditions of the speaker US, the information on the sound collection conditions of each of the plurality of registered speakers, and each of the selectable similarity calculation models.
  • The correspondence list LST2 is data that associates the determination result "sound collection condition determination result 3" of the sound collection conditions of the feature amount of the speaker US, the determination result "sound collection condition determination result 4" of the sound collection conditions of the registered speakers registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two sound collection conditions.
  • The sound collection condition determination result "sound collection condition determination result 3" indicates the determination result of the sound collection conditions of the speaker US.
  • The sound collection condition determination result "sound collection condition determination result 4" indicates the determination result of the sound collection conditions of the registered speakers registered in the registered speaker database DB2. Information on the determination probability corresponding to each sound collection condition determination result is not essential and may be omitted.
  • The determination result of the sound collection conditions may include a plurality of conditions such as, for example, the gender of the speaker US, the sound collection device, and the type of noise.
  • The similarity calculation model "selected model" is the similarity calculation model selected corresponding to the sound collection condition determination result "sound collection condition determination result 3" and the sound collection condition determination result "sound collection condition determination result 4", and includes each of the models "Model D", "Model E", "Model F", and "Model Z".
  • If the similarity calculation model selection unit 114 determines that there is no similarity calculation model suitable for the similarity calculation process based on the combination of the sound collection conditions of the feature amount of the speaker US and the sound collection conditions of the feature amount of the registered speaker, it selects the general-purpose similarity calculation model "Model Z".
  • The similarity calculation model selection unit 114 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST2. For example, in the example shown in FIG. 8, the similarity calculation model selection unit 114 selects the similarity calculation model "Model D".
  • As described above, the terminal device P1 can select the similarity calculation model best suited to the similarity calculation process for the two feature amounts subject to similarity calculation (the feature amount of the speaker US and the feature amount of the registered speaker), based on the combination of the sound collection conditions included in each feature amount.
  • As a result, even if the sound collection conditions included in the feature amount at the time of voice registration and those included in the feature amount at the time of voice authentication change, the terminal device P1 can select the optimal similarity calculation model and calculate the degree of similarity between the two feature amounts. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by noise included in the audio data.
  • Note that the terminal device P1 may, for example, calculate a degree of similarity using each of the similarity calculation models corresponding to the respective conditions and adopt the average value as the degree of similarity, as sketched below.
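  • A sketch of the averaging variant just mentioned, in which each of several condition-specific similarity calculation models scores the same pair of feature amounts and the average is adopted as the degree of similarity (the model set and scoring functions are illustrative assumptions):

```python
from typing import Callable

import numpy as np

def averaged_similarity(feature_auth: np.ndarray,
                        feature_reg: np.ndarray,
                        models: list[Callable[[np.ndarray, np.ndarray], float]]) -> float:
    """Calculates the similarity with each condition-specific similarity
    calculation model and adopts the average value as the degree of similarity."""
    scores = [model(feature_auth, feature_reg) for model in models]
    return sum(scores) / len(scores)
```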
  • FIG. 9 is a diagram illustrating a specific example of the correspondence list LST.
  • The correspondence list LST associates the sound collection condition determination result "sound collection condition determination result AA" of the feature amount of the speaker US, the sound collection condition determination result "sound collection condition determination result BB" of the feature amount of the registered speaker, and the selected similarity calculation model "selected model".
  • The sound collection condition determination result "sound collection condition determination result AA" of the feature amount of the speaker US and the sound collection condition determination result "sound collection condition determination result BB" of the feature amount of the registered speaker each include, for example, gender, equipment (sound quality), and noise type.
  • The sound collection conditions shown in FIG. 9 are merely an example, and the present invention is not limited thereto.
  • In the correspondence list LST, sound collection conditions that are the same in the sound collection condition determination result "sound collection condition determination result AA" and the sound collection condition determination result "sound collection condition determination result BB" are indicated in bold letters.
  • The sound collection condition "gender" indicates the gender of the speaker US determined based on the feature amount of the speaker US.
  • The sound collection condition "equipment (sound quality)" indicates information regarding the sound collection device determined based on the feature amount of the speaker US.
  • The sound collection condition "noise type" indicates the type of environmental sound, noise, or the like at the time of speaking that is included in the feature amount of the speaker US.
  • The similarity calculation model "selected model" is the similarity calculation model used to calculate the similarity between the feature amount of the speaker US and the feature amount of the registered speaker.
  • For example, suppose the determination results of the two sound collection conditions are: the sound collection condition "gender" is "male", the sound collection condition "equipment (sound quality)" is "telephone", and the sound collection condition "noise type" is "in-store noise".
  • In this case, the terminal device P1 selects the similarity calculation model "male phone store noise model", which is suited to calculating the similarity of feature amounts having the same sound collection conditions "male", "telephone", and "in-store noise" in the two determination results.
  • Next, suppose the determination results of the two sound collection conditions are: the sound collection condition "gender" is "male" in both, while the sound collection condition "equipment (sound quality)" is "telephone" and "headset", and the sound collection condition "noise type" is "in-store noise" and "outdoor noise".
  • In this case, the terminal device P1 selects the similarity calculation model "male model", which is suited to calculating the similarity of feature amounts having the sound collection condition "male", the condition that is the same or similar in the two determination results.
  • Further, suppose the determination results of the two sound collection conditions are: the sound collection condition "gender" is "female" and the sound collection condition "equipment (sound quality)" is "headset" in both, while the sound collection conditions "noise type" are "none (clean voice)" and "in-store noise".
  • In this case, the terminal device P1 selects the similarity calculation model "female headset model", which is suited to calculating the similarity of feature amounts having the same or similar sound collection conditions "female" and "headset" in the two determination results.
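  • The three FIG. 9 cases follow a common pattern: use the model matching all shared conditions when every condition agrees, fall back to a model covering only the conditions shared by both determination results, and ultimately fall back to the general-purpose model. A minimal sketch of that fallback; the way model names are composed from condition values here is an illustrative assumption, not the disclosure's naming scheme:

```python
GENERAL_PURPOSE_MODEL = "Model Z"  # general-purpose similarity calculation model

def select_model_from_shared_conditions(cond_aa: dict, cond_bb: dict) -> str:
    """Selects a similarity calculation model from the sound collection
    conditions shared by determination results AA and BB (cf. FIG. 9)."""
    shared = {k: v for k, v in cond_aa.items() if cond_bb.get(k) == v}
    if not shared:
        return GENERAL_PURPOSE_MODEL
    # e.g. {"gender": "male"} -> "male model";
    #      {"gender": "female", "equipment": "headset"} -> "female headset model"
    ordered = [shared[k] for k in ("gender", "equipment", "noise_type") if k in shared]
    return " ".join(ordered) + " model"
```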
  • As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit), which acquires audio data; the feature amount extraction unit 111 (an example of a detection unit), which detects, from the audio data, an utterance section in which the speaker US is speaking; the feature amount extraction unit 111 (an example of an extraction unit), which extracts the feature amount of the speaker US (an example of an utterance feature amount) from the detected utterance section; the similarity calculation model selection unit 114 (an example of a selection unit), which selects, from among a plurality of similarity calculation models, a first similarity calculation model (an example of a similarity calculation model, selected in the process of step St18) used for authenticating the speaker US, based on the extracted feature amount of the speaker US and the feature amount of at least one registered speaker registered in advance; and the authentication unit 116, which authenticates the speaker US by comparing the feature amount of the speaker US with the feature amount of the registered speaker using the selected first similarity calculation model.
  • Thereby, the terminal device P1 can select a similarity calculation model better suited to the similarity calculation process for the two feature amounts (the feature amount of the speaker US and the feature amount of the registered speaker), and can perform speaker authentication based on the two feature amounts with higher accuracy.
  • Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by differences in noise (environmental noise) between the time of voice registration and the time of voice authentication.
  • The terminal device P1 according to the embodiment further includes the sound collection condition determination unit 112 (an example of a determination unit), which determines the sound collection conditions of the uttered voice corresponding to a feature amount based on that feature amount.
  • The similarity calculation model selection unit 114 selects the first similarity calculation model based on the sound collection conditions corresponding to the feature amount of the speaker US and the sound collection conditions corresponding to the feature amount of the registered speaker.
  • Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model better suited to feature amount matching based on the combination of the sound collection conditions at the time of voice registration and the sound collection conditions at the time of voice authentication. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy.
  • The plurality of similarity calculation models in the terminal device P1 according to the embodiment are each generated using at least one learning data under a predetermined sound collection condition.
  • The sound collection condition determination unit 112 determines the sound collection conditions of the speaker US or a registered speaker based on the distance between the feature amount of the speaker US or the registered speaker and each of the plurality of similarity calculation models. Thereby, the terminal device P1 according to the embodiment can determine the sound collection conditions corresponding to each feature amount with higher accuracy, based on the distance between each similarity calculation model generated using learning data for each sound collection condition and the feature amount of the speaker US or the registered speaker.
  • Further, the sound collection condition determination unit 112 in the terminal device P1 according to the embodiment selects the similarity calculation model for which the distance between the feature amount of the speaker US or the registered speaker and each of the plurality of similarity calculation models is the shortest (an example of a second similarity calculation model, selected in the process of step St154), and determines the sound collection conditions of the speaker US or the registered speaker based on the sound collection conditions corresponding to the selected second similarity calculation model. Thereby, the terminal device P1 according to the embodiment selects the similarity calculation model whose characteristics are closest when calculating the similarity with the feature amount of the speaker US or the registered speaker, and can determine the sound collection conditions corresponding to the feature amounts of the speaker US or the registered speakers with higher accuracy.
  • The terminal device P1 according to the embodiment further includes the authentication unit 116 (an example of a calculation unit), which calculates the degree of similarity between the feature amount of the audio data of the utterance section and the feature amount of each of the plurality of registered speakers.
  • The authentication unit 116 authenticates the speaker US based on the plurality of calculated similarities. Thereby, the terminal device P1 according to the embodiment can perform speaker authentication using the degrees of similarity between the feature amounts of the plurality of registered speakers registered in advance and the feature amount of the speaker US.
  • The authentication unit 116 in the terminal device P1 according to the embodiment identifies the registered speaker whose degree of similarity is equal to or greater than the threshold as the speaker US.
  • Thereby, the terminal device P1 according to the embodiment can perform speaker authentication using the degrees of similarity between the feature amounts of the plurality of registered speakers registered in advance and the feature amount of the speaker US.
  • The authentication unit 116 in the terminal device P1 according to the embodiment generates and outputs the authentication result screen SC, which includes information regarding the registered speaker whose degree of similarity is equal to or greater than the threshold. Thereby, the terminal device P1 according to the embodiment can present the speaker authentication result to the speaker US or the administrator.
  • If there is no degree of similarity equal to or greater than the threshold, the authentication unit 116 in the terminal device P1 according to the embodiment determines that the speaker US cannot be identified. Thereby, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy and more effectively suppress erroneous authentication of the speaker US.
  • The terminal device P1 according to the embodiment also includes the authentication unit 116, which calculates the similarity between the feature amount of the audio data of the utterance section and the feature amount of each of the plurality of registered speakers, and further includes the reliability calculation unit 115 (an example of a reliability calculation unit), which calculates the reliability of the similarity. The reliability calculation unit 115 calculates the reliability of the similarity based on the distance between the feature amount of the speaker US and the second similarity calculation model. Thereby, the terminal device P1 according to the embodiment can calculate the degree of confidence in the similarity, in the sound collection conditions determined in the process of calculating the similarity, in the first similarity calculation model used to calculate the similarity, and the like.
  • The authentication unit 116 in the terminal device P1 according to the embodiment identifies the registered speaker whose degree of similarity is equal to or greater than the threshold as the speaker US, and generates and outputs the authentication result screen SC, which includes information regarding that registered speaker and the calculated reliability information.
  • Thereby, the terminal device P1 according to the embodiment displays the speaker authentication result together with its reliability, and can prompt the administrator to confirm whether the speaker authentication result is trustworthy.
  • The sound collection conditions include at least one of: the gender of the speaker US, the age of the speaker US, the language of the speaker US, the sound collection device with which the uttered voice was collected, and the type of noise included in the uttered voice.
  • Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model based on the sound collection conditions that change the feature amounts used for speaker authentication and that can cause a decrease in speaker authentication accuracy. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy.
  • the present disclosure is useful as a voice authentication device and voice authentication method that can suppress a decrease in speaker authentication accuracy due to changes in environmental noise.


Abstract

This voice authentication device comprises: an acquisition unit that acquires voice data; a detection unit that detects from the voice data an utterance section in which a speaker is uttering; an extraction unit that extracts an utterance feature amount of the speaker from the detected utterance section; a selection unit that, on the basis of the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker that is previously registered, selects a first similarity calculation model used for authenticating the speaker from among a plurality of similarity calculation models; and an authentication unit that, using the selected first similarity calculation model, compares the utterance feature amount of the speaker and the utterance feature amount of the registered speaker, and authenticates the speaker.

Description

Voice authentication device and voice authentication method
The present disclosure relates to a voice authentication device and a voice authentication method.
Patent Document 1 discloses a voice recognition device that recognizes a test subject's voice. The speech recognition device stores a plurality of motion noise models, each created for one of a plurality of motions, in association with those motions, detects input speech including the subject's voice, identifies the subject's motion, and reads out the motion noise model corresponding to the identified motion. The speech recognition device then reads an environmental noise model corresponding to the subject's current position, synthesizes the environmental noise model with the read motion noise model, and uses the synthesized noise superimposition model to recognize the subject's voice contained in the detected input speech.
Japanese Patent Application Publication No. 2008-250059
However, Patent Document 1 requires collecting in advance both the motion noise generated by each of the plurality of motions and the environmental noise at each of the plurality of positions where voice recognition can be performed, which is very time-consuming. In addition, in voiceprint authentication, the feature quantity indicating the individuality of the authentication target (person) extracted from an audio signal changes depending on the noise contained in the audio signal and on the sound collection conditions, such as the sound collection device with which the audio signal was collected. Therefore, when voiceprint authentication is performed using the voice recognition device described above and the sound collection conditions of the pre-registered voice signal differ from those of the voice signal collected at authentication time, the features extracted from the two voice signals do not indicate the individuality of the same person, and voiceprint authentication accuracy may decrease.
The present disclosure was devised in view of the above-described conventional situation, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
The present disclosure provides a voice authentication device including: an acquisition unit that acquires voice data; a detection unit that detects, from the voice data, an utterance section in which a speaker is speaking; an extraction unit that extracts an utterance feature amount of the speaker from the detected utterance section; a selection unit that selects, from among a plurality of similarity calculation models, a first similarity calculation model used for authenticating the speaker, based on the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker registered in advance; and an authentication unit that compares the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model and authenticates the speaker.
The present disclosure also provides a voice authentication method performed by a terminal device, including: acquiring voice data; detecting, from the voice data, an utterance section in which a speaker is speaking; extracting an utterance feature amount of the speaker from the detected utterance section; selecting, from among a plurality of similarity calculation models, a first similarity calculation model used for authenticating the speaker, based on the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker registered in advance; and comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model to authenticate the speaker.
According to the present disclosure, it is possible to suppress a decrease in speaker authentication accuracy due to changes in environmental noise.
FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system according to an embodiment.
FIG. 2 is a diagram illustrating each process performed by a processor of a terminal device in the embodiment.
FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device in the embodiment.
FIG. 4 is a flowchart showing an example of a sound collection condition determination procedure of the terminal device in the embodiment.
FIG. 5 is a diagram illustrating an example of determining sound collection conditions and an example of calculating reliability.
FIG. 6 is a flowchart showing an example of a speaker authentication procedure of the terminal device in the embodiment.
FIG. 7 is a diagram illustrating an example of a correspondence list when the sound collection condition estimation result at the time of voice registration and the sound collection condition estimation result at the time of voice authentication are the same.
FIG. 8 is a diagram illustrating an example of a correspondence list when the noise type at the time of voice registration and the noise type at the time of voice authentication are different.
FIG. 9 is a diagram illustrating a specific example of the correspondence list.
Hereinafter, embodiments specifically disclosing a voice authentication device and a voice authentication method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, more detailed explanation than necessary may be omitted. For example, detailed explanations of already well-known matters and redundant explanations of substantially the same configurations may be omitted. This is to avoid unnecessary redundancy in the following description and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided to enable those skilled in the art to fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.
First, a voice authentication system 100 according to an embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing an example of the internal configuration of the voice authentication system 100 according to the embodiment. FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
The voice authentication system 100 includes a terminal device P1 as an example of a voice authentication device, and a monitor MN. Note that the voice authentication system 100 may be configured to include a microphone MK or the monitor MN.
The microphone MK picks up the voice uttered by the speaker US in order to register the voice in the terminal device P1 in advance. The microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data to be registered in the terminal device P1, and transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
The microphone MK also picks up the voice uttered by the speaker US that is used for speaker authentication. The microphone MK converts the collected voice uttered by the speaker US into an audio signal or audio data, and transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
In the following explanation, to make the explanation easier to understand, the voice data for voice registration, or the voice data already registered in the terminal device P1, is referred to as "registered voice data", and the voice data for voice authentication is referred to as "authentication voice data", to distinguish the two.
Note that the microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter referred to as "PC"), a notebook PC, a smartphone, or a tablet terminal. The microphone MK may also transmit the audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown).
The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes a voice registration process using registered voice data of the speaker US and a speaker authentication process using authentication voice data. The terminal device P1 includes a communication unit 10, a processor 11, a memory 12, a feature amount extraction model database DB1, a registered speaker database DB2, a similarity calculation model database DB3, and a sound collection condition-specific learning database DB4.
The communication unit 10, which is an example of the acquisition unit, is connected to the microphone MK and the monitor MN so that data can be transmitted and received by wired or wireless communication. The wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless Local Area Network (LAN) such as Wi-Fi (registered trademark).
Note that the communication unit 10 may transmit and receive data to and from the microphone MK via an interface such as Universal Serial Bus (USB). The communication unit 10 may also transmit and receive data to and from the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
The processor 11 is configured using, for example, a Central Processing Unit (CPU) or a Field Programmable Gate Array (FPGA), and performs various processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12 and, by executing the program, realizes the functions of the feature amount extraction unit 111, the sound collection condition determination unit 112, the speaker registration unit 113, the similarity calculation model selection unit 114, the reliability calculation unit 115, and the authentication unit 116.
When registering the voice of the speaker US, the processor 11 implements the functions of the feature amount extraction unit 111, the speaker registration unit 113, and the sound collection condition determination unit 112, thereby executing new registration (storage) processing of the speaker US into the registered speaker database DB2.
When authenticating the voice of the speaker US, the processor 11 implements the functions of the feature amount extraction unit 111, the sound collection condition determination unit 112, the similarity calculation model selection unit 114, the reliability calculation unit 115, and the authentication unit 116, thereby executing speaker authentication processing.
The feature amount extraction unit 111, which is an example of the detection unit and the extraction unit, acquires the voice data (registered voice data or authentication voice data) of the speaker US transmitted from the microphone MK. The feature amount extraction unit 111 executes the voice registration process or the voice authentication process based on a control command associated with the voice data.
At the time of voice registration, the feature amount extraction unit 111 detects, from the registered voice data, the utterance section in which the speaker US is speaking. The feature amount extraction unit 111 extracts a feature amount indicating the individuality of the speaker US from the detected utterance section, and outputs it to the sound collection condition determination unit 112 and the speaker registration unit 113.
At the time of voice authentication, the feature amount extraction unit 111 extracts the feature amount of the speaker US from the authentication voice data and outputs it to each of the sound collection condition determination unit 112 and the authentication unit 116.
The sound collection condition determination unit 112, which is an example of the determination unit, determines, based on the feature amount of the speaker US output from the feature amount extraction unit 111, the sound collection conditions under which the voice data from which that feature amount was extracted (that is, the uttered voice of the speaker US) was collected. At the time of voice registration, the sound collection condition determination unit 112 outputs information on the sound collection conditions of the feature amount of the speaker US to the speaker registration unit 113. At the time of voice authentication, the sound collection condition determination unit 112 outputs information on the sound collection conditions of the feature amount of the speaker US to the similarity calculation model selection unit 114.
The sound collection conditions referred to here are, for example, the sound collection device with which the uttered voice of the speaker US or of a registered speaker was collected, the language, gender, and age of the speaker US or of the registered speaker, and the noise type of the noise included in the feature amount. The sound collection device is, for example, a microphone, a telephone, a headset, or the like.
The noise is noise collected due to the environment (background) at the time of sound collection, and includes, for example, surrounding voices, music, the sound of passing vehicles, the sound of the wind, etc. The noise type indicates the environment (place) or position where the noise occurs, such as in-store noise, outdoor wind noise, in-store music, or noise inside a station. The noise type may further include information on time zones such as early morning, daytime, and nighttime.
The speaker registration unit 113 acquires the feature amount of the speaker US output from the feature amount extraction unit 111 and the speaker information of the speaker US associated with the registered voice data. The speaker registration unit 113 also acquires the information on the sound collection conditions output from the sound collection condition determination unit 112. The speaker registration unit 113 associates the feature amount of the speaker US, the speaker information of the speaker US, and the information on the sound collection conditions with one another, and registers them in the registered speaker database DB2.
Note that the speaker information may be extracted from the registered voice data by voice recognition, or may be acquired from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, or a tablet terminal). The speaker information referred to here is, for example, identification information that can identify the speaker US, the name of the speaker US, a speaker Identification (ID), or the like.
The similarity calculation model selection unit 114, which is an example of the selection unit, acquires the information on the sound collection conditions of the feature amount of the speaker US output from the sound collection condition determination unit 112 and the information on the sound collection conditions of the feature amounts of the plurality of registered speakers registered in the registered speaker database DB2. Based on the acquired information on the sound collection conditions of the feature amount of the speaker US and the information on the sound collection conditions of the feature amounts of the plurality of registered speakers, the similarity calculation model selection unit 114 selects the similarity calculation model (an example of the similarity calculation model) used in the process of calculating the similarity between the feature amount of the speaker US and the feature amount of any one registered speaker.
The similarity calculation model selection unit 114 refers to correspondence lists LST, LST1, LST2 (see FIGS. 7, 8, and 9) that associate the acquired information on the sound collection conditions of the feature amount of the speaker US, the information on the sound collection conditions of the feature amounts of the plurality of registered speakers, and the selected similarity calculation models with one another, selects one selection model from among the plurality of selection models (first similarity calculation models), and outputs it to each of the reliability calculation unit 115 and the authentication unit 116.
The reliability calculation unit 115, which is an example of the reliability calculation unit, calculates (evaluates) the reliability (score) indicating the certainty of the identification result of the speaker US based on the similarity calculated by the authentication unit 116. The reliability calculation unit 115 calculates the reliability based on the distance between the model learning data distribution of the similarity calculation model used in the similarity calculation process by the authentication unit 116 and the feature amount of the speaker US. The reliability calculation unit 115 outputs the calculated reliability information to the authentication unit 116.
The authentication unit 116, which is an example of the calculation unit, acquires the feature amount of the speaker US output from the feature amount extraction unit 111, and acquires the feature amounts of the plurality of registered speakers registered in the registered speaker database DB2. The authentication unit 116 also acquires the selection model of the correspondence lists LST, LST1, LST2 output from the similarity calculation model selection unit 114.
The authentication unit 116 uses the similarity calculation model based on the correspondence lists LST, LST1, LST2 to calculate the similarity between the feature amount of each of the plurality of registered speakers and the feature amount of the speaker US. The authentication unit 116 identifies the speaker US based on the calculated similarities. The authentication unit 116 also acquires the reliability information output from the reliability calculation unit 115. The authentication unit 116 generates an authentication result screen SC based on the speaker information of the identified speaker US and the reliability information, and transmits it to the monitor MN.
The memory 12 has, for example, a Random Access Memory (hereinafter, "RAM") as a work memory used when executing each process of the processor 11, and a Read Only Memory (hereinafter, "ROM") that stores programs and data defining the operation of the processor 11. Data or information generated or acquired by the processor 11 is temporarily stored in the RAM. A program that defines the operation of the processor 11 is written in the ROM.
The feature amount extraction model database DB1 is a so-called storage, and is configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter, "HDD"), or a Solid State Drive (hereinafter, "SSD"). The feature amount extraction model database DB1 stores a feature amount extraction model capable of detecting the utterance section of the speaker US from registered voice data or authentication voice data and extracting the feature amount of the speaker US. The feature amount extraction model is, for example, a learning model generated by learning using deep learning or the like.
The registered speaker database DB2 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD. The registered speaker database DB2 stores the feature amounts of a plurality of registered speakers registered in advance, information on the determination results of the sound collection conditions corresponding to those feature amounts, and registered speaker information in association with one another.
The similarity calculation model database DB3 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD. The similarity calculation model database DB3 stores similarity calculation models capable of calculating the similarity between a feature amount extracted from authentication voice data and the feature amount of a registered speaker registered in the registered speaker database DB2. Each similarity calculation model is a learning model trained and generated by deep learning or the like using learning data under a predetermined sound collection condition.
For example, a similarity calculation model learns in advance, and retains, the dimensions in which individuality tends to appear, in order to calculate the similarity between two multidimensional vectors with high precision. Note that calculating similarity with a model is just one example of a method for calculating the similarity between vectors, and existing techniques such as Euclidean distance or cosine similarity may be used instead.
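The following is a minimal, illustrative sketch (not taken from the patent) of how the similarity between two feature amount vectors could be computed with cosine similarity or Euclidean distance, the existing techniques mentioned above; the vector values and function names are assumptions for illustration only.

    import math

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        # Cosine similarity: 1.0 means the vectors point in the same direction.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    def euclidean_distance(a: list[float], b: list[float]) -> float:
        # Euclidean distance: 0.0 means the vectors are identical.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Hypothetical 4-dimensional feature amounts of a speaker and a registered speaker.
    speaker_feature = [0.12, -0.40, 0.88, 0.05]
    registered_feature = [0.10, -0.35, 0.90, 0.02]
    print(cosine_similarity(speaker_feature, registered_feature))   # close to 1.0
    print(euclidean_distance(speaker_feature, registered_feature))  # close to 0.0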
The sound collection condition-specific learning database DB4 is a so-called storage, and is configured using a storage medium such as a flash memory, an HDD, or an SSD. The sound collection condition-specific learning database DB4 stores distribution information and the like of the learning data used for training the similarity calculation models.
The monitor MN is configured using a display such as a Liquid Crystal Display (LCD) or an organic Electroluminescence (EL) display. The monitor MN displays the authentication result screen SC output from the terminal device P1.
The authentication result screen SC is a screen that notifies an administrator (for example, a person viewing the monitor MN) of the speaker authentication result, and includes the authentication result information "The voice matched that of XX XX." and the reliability information "Reliability: High". The authentication result screen SC may include other registered speaker information (for example, a face image). The authentication result screen SC also does not need to include the reliability information.
Next, the operation procedure of the terminal device P1 will be described with reference to FIG. 3. FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
The terminal device P1 acquires voice data from the microphone MK (St11). Note that the microphone MK may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
The terminal device P1 determines whether the control command associated with the voice data is a control command requesting registration in the registered speaker database DB2 (St12).
In the process of step St12, if the control command is a control command requesting registration in the registered speaker database DB2, the terminal device P1 determines to newly register the feature amount of the speaker US in the registered speaker database DB2 (St12, YES) and extracts the feature amount of the speaker US from the voice data (registered voice data) (St13).
On the other hand, in the process of step St12, if the control command is not a control command requesting registration in the registered speaker database DB2 but a control command requesting speaker authentication, the terminal device P1 determines not to newly register the feature amount of the speaker US in the registered speaker database DB2 (St12, NO) and extracts the feature amount of the speaker US from the voice data (authentication voice data) (St14).
Based on the feature amount of the speaker US extracted from the registered voice data, the terminal device P1 executes a process of determining the sound collection conditions of the registered voice data (uttered voice) from which this feature amount was extracted (St15A).
The terminal device P1 associates the feature amount of the speaker US, the information on the sound collection conditions, and the speaker information of the speaker US with one another, and stores (registers) them in the registered speaker database DB2 (St16).
Based on the feature amount of the speaker US extracted from the authentication voice data, the terminal device P1 executes a process of determining the sound collection conditions of the authentication voice data (uttered voice) from which this feature amount was extracted (St15B).
The terminal device P1 acquires the information on the feature amounts and sound collection conditions of the plurality of registered speakers registered in the registered speaker database DB2. Based on the information on the feature amount and sound collection conditions of the speaker US and the information on the feature amounts and sound collection conditions of the plurality of registered speakers, the terminal device P1 selects a similarity calculation model for calculating the similarity between the feature amount of the speaker US and the feature amount of each of the plurality of registered speakers (St18).
Here, the terminal device P1 refers to correspondence lists LST, LST1, LST2 (see FIGS. 7, 8, and 9) that associate the selected similarity calculation models, the sound collection conditions of the speaker US, and the sound collection conditions of the plurality of registered speakers with one another, and selects one selection model from among the plurality of selection models.
The terminal device P1 executes speaker authentication processing based on the selected selection model of the correspondence lists LST, LST1, LST2 (St19).
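As a rough sketch of the branching in FIG. 3 (St11 to St19), the flow could be expressed as follows; every function body here is an illustrative stand-in (the real feature extraction, condition determination, and model selection are described in the text), and all names and values are assumptions.

    REGISTERED_DB: list[dict] = []  # stand-in for the registered speaker database DB2

    def extract_feature(audio_data: bytes) -> list[float]:
        # St13/St14: stand-in for the feature amount extraction model (DB1).
        return [b / 255.0 for b in audio_data[:4]]

    def judge_sound_condition(feature: list[float]) -> str:
        # St15A/St15B: stand-in for the distance-based condition determination (FIG. 4).
        return "in-store noise, male, telephone voice"

    def handle_audio(audio_data: bytes, command: str, speaker_info: str = "") -> None:
        feature = extract_feature(audio_data)
        condition = judge_sound_condition(feature)
        if command == "register":  # St12, YES
            REGISTERED_DB.append({"feature": feature, "condition": condition,
                                  "speaker": speaker_info})  # St16
        else:  # St12, NO -> model selection (St18) and authentication (St19)
            print(f"authenticate with the model selected for: {condition}")

    handle_audio(b"\x10\x20\x30\x40", "register", "speaker US")
    handle_audio(b"\x11\x21\x31\x41", "authenticate")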
Next, the sound collection condition determination procedure shown in each of steps St15A and St15B of FIG. 3 will be described with reference to FIGS. 4 and 5. FIG. 4 is a flowchart showing an example of the sound collection condition determination procedure of the terminal device P1 in the embodiment. FIG. 5 is a diagram illustrating an example of determining sound collection conditions and an example of calculating reliability.
The terminal device P1 acquires the feature amount of the speaker US from the authentication voice data (St151), and acquires information on the learning data of each similarity calculation model registered in the similarity calculation model database DB3 (St152). The learning data referred to here is data for calculating the similarity of feature amounts under a predetermined sound collection condition.
In order to select the optimal model, the terminal device P1 calculates, for example, the distance between the feature amount of the speaker US and each set of learning data (St153). The terminal device P1 identifies and selects the similarity calculation model with the smallest calculated distance, and determines that the sound collection condition corresponding to this similarity calculation model is the sound collection condition of the feature amount of the speaker US (St154).
For example, as shown in FIG. 5, the terminal device P1 calculates the distances d_XA, d_XB, and d_XC between the feature amount PT1 of the speaker US and the distributions of the learning data of the similarity calculation models DBA, DBB, and DBC, respectively.
Note that although FIG. 5 shows an example in which each of the three similarity calculation models DBA, DBB, and DBC is a region containing five learning data, the number of learning data used for generating (training) a similarity calculation model may be one or more. The number of similarity calculation models DBA, DBB, DBC, and so on may be two or more.
The terminal device P1 calculates the distances d_XA, d_XB, and d_XC between the feature amount PT1 of the speaker US and each of the similarity calculation models DBA, DBB, and DBC using (Equation 1) below, reconstructed here from the surrounding description as the average distance between the feature amount and the N learning data of a model:

$$d_{XM} = \frac{1}{N}\sum_{i=1}^{N}\left\|x - m_{i}\right\| \qquad \text{(Equation 1)}$$

where $x$ is the feature amount of the speaker US and $m_{i}$ is the $i$-th learning data of similarity calculation model $M$. The number N (N: an integer of 1 or more) in (Equation 1) is the number of learning data used to generate the similarity calculation model, or the number of representative learning data of the similarity calculation model used for the distance calculation. The number N of learning data need not be the same for each similarity calculation model.
For example, in the example shown in FIG. 5, N = 5, and the distance d_XA is the average distance between the feature amount PT1 of the speaker US and the five learning data A_1, A_2, A_3, A_4, A_5 included in the similarity calculation model DBA. The distance d_XB is the average distance between the feature amount PT1 of the speaker US and the five learning data B_1, B_2, B_3, B_4, B_5 included in the similarity calculation model DBB. The distance d_XC is the average distance between the feature amount PT1 of the speaker US and the five learning data C_1, C_2, C_3, C_4, C_5 included in the similarity calculation model DBC.
The terminal device P1 selects the similarity calculation model corresponding to the learning data distribution with the minimum distance among the plurality of calculated distances as the similarity calculation model used for calculating the similarity. For example, in the example shown in FIG. 5, the distance d_XC is the minimum of the calculated distances d_XA, d_XB, and d_XC. In such a case, the terminal device P1 determines that the sound collection condition of the uttered voice (feature amount PT1) of the speaker US is the same as the sound collection condition corresponding to the similarity calculation model DBC.
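A minimal sketch of steps St153 and St154 follows, assuming (Equation 1) is the average Euclidean distance reconstructed above; the learning data values and names are hypothetical.

    import math

    def avg_distance(x: list[float], learning_data: list[list[float]]) -> float:
        # (Equation 1): mean Euclidean distance from feature x to a model's learning data.
        def dist(a: list[float], b: list[float]) -> float:
            return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
        return sum(dist(x, m) for m in learning_data) / len(learning_data)

    def select_model(x: list[float], models: dict[str, list[list[float]]]) -> str:
        # St153-St154: pick the model whose learning data distribution is closest to x.
        return min(models, key=lambda name: avg_distance(x, models[name]))

    models = {  # hypothetical learning data distributions for DBA, DBB, DBC
        "DBA": [[0.9, 0.1], [0.8, 0.2]],
        "DBB": [[0.1, 0.9], [0.2, 0.8]],
        "DBC": [[0.5, 0.5], [0.4, 0.6]],
    }
    pt1 = [0.45, 0.55]  # stand-in for the feature amount PT1 of the speaker US
    print(select_model(pt1, models))  # -> "DBC" (smallest average distance)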
The terminal device P1 also calculates the reliability of the calculated similarity based on the distance between the distribution of the learning data of the similarity calculation model used for the similarity calculation and the feature amount of the speaker US (that is, the distance d_XA, d_XB, or d_XC).
The terminal device P1 determines whether the distance between the similarity calculation model used for the similarity calculation and the feature amount of the speaker US is equal to or less than a predetermined value.
If the terminal device P1 determines that the distance between the learning data distribution of the similarity calculation model and the feature amount of the speaker US is equal to or less than the predetermined value, it calculates (evaluates) the reliability of the similarity as "high". On the other hand, if the terminal device P1 determines that the distance is not equal to or less than the predetermined value, it calculates (evaluates) the reliability of the similarity as "low".
For example, in the example shown in FIG. 5, if the distance d_XC between the similarity calculation model DBC and the feature amount PT1 of the speaker US is equal to or less than the predetermined value, the terminal device P1 calculates (evaluates) the reliability of the similarity as "high"; otherwise, it calculates (evaluates) the reliability as "low".
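A short sketch of this reliability evaluation; the patent states only "a predetermined value", so the threshold below is an assumption.

    def reliability(distance_to_model: float, predetermined_value: float = 0.2) -> str:
        # FIG. 5: distance at or below the predetermined value -> "high", otherwise "low".
        return "high" if distance_to_model <= predetermined_value else "low"

    print(reliability(0.07))  # -> "high"
    print(reliability(0.45))  # -> "low"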
Next, the speaker authentication procedure shown in step St19 of FIG. 3 will be described with reference to FIG. 6. FIG. 6 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
Based on the correspondence lists LST, LST1, LST2, the terminal device P1 reads, from the similarity calculation model database DB3, the similarity calculation model used for determining the similarity between the feature amount of the speaker US and the feature amount of one of the plurality of registered speakers (St191).
The terminal device P1 uses the similarity calculation model to calculate the similarity between the feature amount of the speaker US and the feature amount of the registered speaker (St192). The terminal device P1 also calculates the reliability of the calculated similarity based on the distance, calculated using (Equation 1), between the similarity calculation model and the feature amount of the speaker US (St192). The terminal device P1 repeats the process of step St192 until it has calculated the similarity and reliability between the feature amount of the voice data of the speaker US and the feature amounts of all registered speakers registered in the registered speaker database DB2.
The terminal device P1 determines whether any of the calculated similarities is equal to or greater than a threshold (St193).
If the terminal device P1 determines in the process of step St193 that at least one of the calculated similarities is equal to or greater than the threshold (St193, YES), it determines that the registered speaker corresponding to the similarity determined to be equal to or greater than the threshold and the speaker US are the same person. The terminal device P1 identifies the speaker US based on the registered speaker information of this registered speaker (St194). Note that if there are multiple similarities determined to be equal to or greater than the threshold, the terminal device P1 may determine that the registered speaker corresponding to the highest calculated similarity and the speaker US are the same person.
If the terminal device P1 determines in the process of step St193 that none of the calculated similarities is equal to or greater than the threshold (St193, NO), it determines that the speaker US cannot be identified (St195).
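The loop over registered speakers (St192 to St195) could be sketched as follows; the similarity function is passed in (for example, the cosine similarity sketched earlier), and the threshold and data values are illustrative assumptions.

    from typing import Callable, Optional

    def authenticate(speaker_feature: list[float],
                     registered: dict[str, list[float]],
                     similarity: Callable[[list[float], list[float]], float],
                     threshold: float = 0.75) -> Optional[str]:
        # St192: score the speaker against every registered speaker.
        scores = {name: similarity(speaker_feature, feature)
                  for name, feature in registered.items()}
        best = max(scores, key=scores.get)
        # St193-St195: return the best match at or above the threshold, else None.
        return best if scores[best] >= threshold else None

    toy_similarity = lambda a, b: 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    registered = {"XX XX": [0.1, 0.2], "YY YY": [0.9, 0.8]}
    print(authenticate([0.12, 0.18], registered, toy_similarity))  # -> "XX XX"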
The terminal device P1 generates the authentication result screen SC based on the registered speaker information of the identified speaker US. The terminal device P1 outputs the generated authentication result screen SC to the monitor MN for display (St196).
As described above, the terminal device P1 registers the speaker information, the feature amount of the speaker US, and the sound collection conditions of the feature amount of the speaker US in association with one another at the time of voice registration. By registering, in association with the feature amount, the sound collection conditions that change the feature amount indicating the individuality of the speaker US, the terminal device P1 can perform speaker authentication with higher accuracy even when the sound collection conditions of the feature amount at the time of voice registration differ from those at the time of voice authentication, and the feature amounts therefore differ between registration and authentication, by selecting a similarity calculation model based on the sound collection conditions of each feature amount. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by noise included in the authentication voice data.
The terminal device P1 also calculates and displays, based on the similarity calculation model used for calculating the similarity, a reliability indicating the certainty of the speaker identified by the speaker authentication process as indicated by the calculated similarity. Thereby, the terminal device P1 can present the certainty of the speaker authentication result to the administrator viewing the monitor MN. Therefore, by presenting the reliability, the terminal device P1 can inform the administrator that there was no similarity calculation model suited to the similarity calculation and that speaker authentication was performed using the general-purpose similarity calculation model described later.
An example of the correspondence lists LST1 and LST2 will be described with reference to FIGS. 7 and 8. FIG. 7 is a diagram illustrating an example of the correspondence list LST1 when the sound collection condition estimation result at the time of voice registration and the sound collection condition estimation result at the time of voice authentication are the same. FIG. 8 is a diagram illustrating an example of the correspondence list LST2 when the sound collection conditions at the time of voice registration and the sound collection conditions at the time of voice authentication are different.
In FIGS. 7 and 8, to make the explanation easier to understand, an example is described in which the correspondence lists LST1 and LST2 are referred to based on the information on the sound collection conditions associated with the feature amount of any one registered speaker registered in the registered speaker database DB2 and on the sound collection conditions of the authentication voice data of the speaker US who is the speaker authentication target.
In the reference example of the correspondence list LST1 shown in FIG. 7, the voice data at the time of voice registration and at the time of voice authentication have the same sound collection condition, "in-store noise, male, telephone voice".
The terminal device P1 extracts the feature amount of the speaker US from the authentication voice data of the speaker US transmitted from a sound collection device such as the microphone MK. The terminal device P1 refers to the sound collection condition-specific learning database DB4 and, based on the extracted feature amount and the distribution of the learning data for each sound collection condition, determines that the sound collection condition of the feature amount of the speaker US is "in-store noise, male, telephone voice".
Based on the information on the sound collection conditions of the speaker US and the information on the sound collection conditions of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects the similarity calculation model corresponding to the combination of the sound collection conditions of the speaker US and the sound collection conditions of each registered speaker. The terminal device P1 refers to the correspondence list LST1, which associates the information on the sound collection conditions of the speaker US, the information on the sound collection conditions of the plurality of registered speakers, and the selected similarity calculation models with one another.
The correspondence list LST1 is data that associates the determination result "sound collection condition determination result 1" of the sound collection conditions of the feature amount of the speaker US, the determination result "sound collection condition determination result 2" of the sound collection conditions of the registered speakers registered in the registered speaker database DB2, and the similarity calculation model "selected model" selected based on these two sound collection conditions.
The sound collection condition determination result "sound collection condition determination result 1" indicates the determination result of the sound collection conditions of the speaker US.
The sound collection condition determination result "sound collection condition determination result 2" indicates the determination result of the sound collection conditions of the registered speakers registered in the registered speaker database DB2. Information on the determination probability corresponding to each sound collection condition determination result is not essential and may be omitted.
The sound collection condition determination result may include a plurality of conditions, such as the gender of the speaker US, the sound collection device, and the noise type.
The similarity calculation model "selected model" includes the similarity calculation models "Model A", "Model B", "Model C", and "Model Z", each selected corresponding to a pair of the sound collection condition determination result "sound collection condition determination result 1" and the sound collection condition determination result "sound collection condition determination result 2".
For example, the similarity calculation model "Model A" is the similarity calculation model determined to be optimal for the process of calculating the similarity between the feature amount of the speaker US and the feature amount of the registered speaker when the sound collection condition of the feature amount of the speaker US is "XX1" and the information on the sound collection condition of the feature amount of the registered speaker is "XX1".
Note that if the similarity calculation model selection unit 114 determines, based on the combination of the sound collection conditions of the feature amount of the speaker US and the sound collection conditions of the feature amount of the registered speaker, that there is no similarity calculation model suited to the similarity calculation process, it selects the general-purpose similarity calculation model "Model Z".
The similarity calculation model selection unit 114 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST1. For example, in the example shown in FIG. 7, the similarity calculation model selection unit 114 selects the similarity calculation model "Model A".
Next, in the reference example of the correspondence list LST2 shown in FIG. 8, the registered voice data at the time of voice registration has the sound collection condition "in-store noise, male, telephone voice". The voice data at the time of voice authentication has a sound collection condition, "outdoor noise, male, headset voice", that differs from the sound collection condition of the registered voice data at the time of voice registration.
The terminal device P1 extracts the feature amount of the speaker US from the authentication voice data of the speaker US transmitted from a sound collection device such as the microphone MK. The terminal device P1 refers to the sound collection condition-specific learning database DB4 and, based on the extracted feature amount and the distribution of the learning data for each sound collection condition, determines that the sound collection condition of the feature amount of the speaker US is "outdoor noise, male, headset voice".
Based on the information on the sound collection conditions of the speaker US and the information on the sound collection conditions of the plurality of registered speakers registered in the registered speaker database DB2, the terminal device P1 selects the similarity calculation model corresponding to the combination of the sound collection conditions of the speaker US and the sound collection conditions of each registered speaker. The terminal device P1 refers to the correspondence list LST2, which associates the information on the sound collection conditions of the speaker US, the information on the sound collection conditions of the plurality of registered speakers, and the selected similarity calculation models with one another.
 対応リストLST2は、話者USの特徴量の収音条件の判定結果「収音条件判定結果3」と、登録話者データベースDB2に登録された登録話者の収音条件の判定結果「収音条件判定結果4」と、これら2つの収音条件に基づいて選択された類似度計算モデル「選択モデル」とを対応付けたデータである。 The correspondence list LST2 includes the determination result "Sound collection condition determination result 3" of the sound collection condition of the feature quantity of the speaker US, and the determination result "Sound collection condition determination result 3" of the sound collection condition of the registered speaker registered in the registered speaker database DB2. This is data that associates "condition determination result 4" with a similarity calculation model "selected model" selected based on these two sound collection conditions.
 収音条件の判定結果「収音条件判定結果3」は、話者USの収音条件の判定結果を示す。 Sound collection condition determination result "Sound collection condition determination result 3" indicates the determination result of the sound collection condition of the speaker US.
 収音条件の判定結果「収音条件判定結果4」は、登録話者データベースDB2に登録された登録話者の収音条件の判定結果を示す。また、各収音条件の判定結果に対応する判定確率の情報は、必須でなく省略されてもよい。 Sound collection condition determination result "Sound collection condition determination result 4" indicates the determination result of the sound collection conditions of the registered speakers registered in the registered speaker database DB2. Further, information on the determination probability corresponding to the determination result of each sound collection condition is not essential and may be omitted.
 また、収音条件の判定結果は、例えば、話者USの性別、収音装置、ノイズ種類等の複数の条件を含んでよい。 Further, the determination result of the sound collection conditions may include a plurality of conditions such as the gender of the speaker US, the sound collection device, the type of noise, etc., for example.
The similarity calculation model "selected model" includes each of the similarity calculation models "Model D", "Model E", "Model F", and "Model Z" selected in accordance with the determination results "sound collection condition determination result 3" and "sound collection condition determination result 4".
For example, the similarity calculation model "Model D" is the similarity calculation model determined to be optimal for the process of calculating the similarity between the feature amount of the speaker US and the feature amount of a registered speaker when the sound collection condition of the feature amount of the speaker US is "XX1" and the information on the sound collection condition of the registered speaker's feature amount is "XX4".
Note that, when the similarity calculation model selection unit 114 determines, based on the combination of the sound collection condition of the feature amount of the speaker US and the sound collection condition of the feature amount of the registered speaker, that there is no similarity calculation model suited to the similarity calculation process, it selects the similarity calculation model "general-purpose model".
The similarity calculation model selection unit 114 selects a similarity calculation model from the similarity calculation model database DB3 based on the referenced correspondence list LST2. For example, in the example shown in FIG. 8, the similarity calculation model selection unit 114 selects the similarity calculation model "Model D".
As described above, the terminal device P1 can select the similarity calculation model optimal for the process of calculating the similarity between the two feature amounts (the feature amount of the speaker US and the feature amount of the registered speaker) based on the combination of the sound collection conditions contained in each of the feature amounts subject to the similarity calculation. Consequently, even when the sound collection condition contained in the feature amount at the time of voice registration and the sound collection condition contained in the feature amount at the time of voice authentication have changed, the terminal device P1 can select the similarity calculation model optimal for calculating the similarity between the two feature amounts. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by noise contained in the voice data. When a plurality of candidate conditions exist at the time of the sound collection condition determination, the terminal device P1 may, for example, calculate a similarity using the similarity calculation model corresponding to each candidate, compute the average value, and adopt it as the similarity.
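The averaging behavior mentioned at the end of the preceding paragraph can be sketched as follows; the toy scoring functions are stand-ins for the embodiment's similarity calculation models and are assumptions for illustration.

```python
from statistics import mean

# Minimal sketch: when several candidate conditions remain after the sound
# collection condition determination, score with each candidate's model and
# adopt the average as the similarity. The toy models are illustrative.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def model_for_candidate_1(us, reg):
    return dot(us, reg)

def model_for_candidate_2(us, reg):
    return 0.5 * dot(us, reg)

def averaged_similarity(us_feature, registered_feature, candidate_models):
    """Average the similarity over the models of all candidate conditions."""
    return mean(m(us_feature, registered_feature) for m in candidate_models)

print(averaged_similarity([1.0, 0.5], [0.8, 0.4],
                          [model_for_candidate_1, model_for_candidate_2]))  # 0.75
```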
The correspondence list will now be described concretely with reference to FIG. 9. FIG. 9 is a diagram illustrating a specific example of the correspondence list LST.
Note that, to simplify the explanation, FIG. 9 describes, as in the reference example of the correspondence list LST1 shown in FIG. 7, an example in which the voice data at the time of voice registration and at the time of voice authentication share the same sound collection condition, "in-store noise, male, telephone voice".
The correspondence list LST associates the determination result "sound collection condition determination result AA" for the sound collection condition of the feature amount of the speaker US, the sound collection condition information "sound collection condition determination result BB" of the registered speaker's feature amount, and the selected similarity calculation model "selected model".
The determination result "sound collection condition determination result AA" for the sound collection condition of the feature amount of the speaker US and the sound collection condition information "sound collection condition determination result BB" of the registered speaker's feature amount include, for example, the sound collection conditions "gender", "device (sound quality)", and "noise type". Note that the sound collection conditions shown in FIG. 9 are merely an example, and the present invention is not limited thereto. In addition, in the correspondence list LST, sound collection conditions that are identical between the determination result "sound collection condition determination result AA" and the information "sound collection condition determination result BB" are shown in bold.
The sound collection condition "gender" indicates the gender of the speaker US determined based on the feature amount of the speaker US. The sound collection condition "device (sound quality)" indicates information on the sound collection device determined based on the feature amount of the speaker US. The sound collection condition "noise type" indicates the type of environmental sound, noise, or the like at the time of utterance contained in the feature amount of the speaker US.
The similarity calculation model "selected model" is the similarity calculation model used to calculate the similarity between the feature amount of the speaker US and the feature amount of the registered speaker.
For example, suppose the two sound collection condition determination results indicate that the sound collection condition "gender" is "male", the sound collection condition "device (sound quality)" is "telephone", and the sound collection condition "noise type" is "in-store noise". In such a case, the terminal device P1 selects the similarity calculation model "male telephone in-store noise model", which is suited to calculating the similarity of feature amounts under the sound collection conditions "male", "telephone", and "in-store noise" that are identical between the two sets of conditions.
Also, for example, suppose the two sound collection condition determination results indicate that the sound collection condition "gender" is "male", that the sound collection conditions "device (sound quality)" are "telephone" and "headset", and that the sound collection conditions "noise type" are "in-store noise" and "outdoor noise". In such a case, the terminal device P1 selects the similarity calculation model "male model", which is suited to calculating the similarity of feature amounts under the sound collection condition "male", the condition that is identical or similar between the two sets of conditions.
Also, for example, suppose the two sound collection condition determination results indicate that the sound collection condition "gender" is "female", that the sound collection condition "device (sound quality)" is "headset", and that the sound collection conditions "noise type" are "none (clean voice)" and "in-store noise". In such a case, the terminal device P1 selects the similarity calculation model "female headset model", which is suited to calculating the similarity of feature amounts under the sound collection conditions "female" and "headset" that are identical or similar between the two sets of conditions.
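The three examples above share one pattern: the selected model is the most specific one covering every condition the two determination results have in common. A minimal sketch of that pattern, with a model table whose keys are assumptions mirroring FIG. 9, might read:

```python
# Minimal sketch, assuming a table from shared-condition sets to models that
# mirrors the FIG. 9 examples; keys, names, and the "most specific subset"
# rule are illustrative assumptions.
MODEL_TABLE = {
    frozenset({("gender", "male"), ("device", "telephone"),
               ("noise", "in-store noise")}): "male telephone in-store noise model",
    frozenset({("gender", "male")}): "male model",
    frozenset({("gender", "female"), ("device", "headset")}): "female headset model",
}

def select_by_shared_conditions(cond_us: dict, cond_reg: dict) -> str:
    # Conditions (gender / device / noise type) identical on both sides.
    shared = {(k, v) for k, v in cond_us.items() if cond_reg.get(k) == v}
    matching = [key for key in MODEL_TABLE if key <= shared]
    if not matching:
        return "general-purpose model"  # no suitable specific model
    return MODEL_TABLE[max(matching, key=len)]  # most specific match

print(select_by_shared_conditions(
    {"gender": "male", "device": "telephone", "noise": "in-store noise"},
    {"gender": "male", "device": "headset", "noise": "outdoor noise"},
))  # -> "male model" (only the gender condition is shared)
```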
As described above, the terminal device P1 according to the embodiment includes: the communication unit 10 (an example of an acquisition unit) that acquires voice data; the feature amount extraction unit 111 (an example of a detection unit) that detects, from the voice data, an utterance section in which the speaker US is speaking; the feature amount extraction unit 111 (an example of an extraction unit) that extracts the feature amount of the speaker US (an example of an utterance feature amount) from the detected utterance section; the similarity calculation model selection unit 114 (an example of a selection unit) that selects, based on the extracted feature amount of the speaker US and the feature amount of at least one registered speaker registered in advance, the first similarity calculation model (an example of a first similarity calculation model, namely the similarity calculation model selected in the process of step St18) to be used for authenticating the speaker US from among the plurality of similarity calculation models (an example of similarity calculation models); and the authentication unit 116 that authenticates the speaker US by comparing the feature amount of the speaker US with the feature amount of the registered speaker using the selected first similarity calculation model.
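For reference, the flow through these units can be sketched as the skeleton below; the method signatures and data types are assumptions, and the bodies are placeholders rather than the embodiment's actual processing.

```python
# Minimal skeleton of the described flow. Unit names follow the embodiment's
# terminology; signatures, types, and the run() orchestration are assumptions.
class VoiceAuthenticationDevice:
    def acquire(self) -> bytes:                          # communication unit 10
        ...

    def detect_utterance(self, audio: bytes):            # feature amount extraction unit 111
        ...

    def extract_feature(self, utterance_section):        # feature amount extraction unit 111
        ...

    def select_model(self, us_feature, registered):      # similarity calculation model selection unit 114
        ...

    def authenticate(self, model, us_feature, registered):  # authentication unit 116
        ...

    def run(self, registered_features: dict):
        audio = self.acquire()
        section = self.detect_utterance(audio)
        us_feature = self.extract_feature(section)
        model = self.select_model(us_feature, registered_features)
        return self.authenticate(model, us_feature, registered_features)
```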
Thereby, the terminal device P1 according to the embodiment selects the similarity calculation model better suited to the process of calculating the similarity between the two feature amounts (the feature amount of the speaker US and the feature amount of the registered speaker), and, by using the selected similarity calculation model, can perform speaker authentication based on the two feature amounts with higher accuracy. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by the difference in noise (environmental noise) between the time of voice registration and the time of voice authentication.
The terminal device P1 according to the embodiment further includes the sound collection condition determination unit 112 (an example of a determination unit) that determines, based on a feature amount, the sound collection condition of the uttered voice corresponding to that feature amount. The sound collection condition determination unit 112 selects the first similarity calculation model based on the sound collection condition corresponding to the feature amount of the speaker US and the sound collection condition corresponding to the feature amount of the registered speaker. Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model better suited to the comparison of the feature amounts based on the combination of the sound collection condition at the time of voice registration and the sound collection condition at the time of voice authentication. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy.
In addition, the plurality of similarity calculation models in the terminal device P1 according to the embodiment are generated using at least one piece of learning data under a predetermined sound collection condition. The sound collection condition determination unit 112 determines the sound collection condition of the speaker US or a registered speaker based on the distance between the feature amount of the speaker US or the registered speaker and each of the plurality of similarity calculation models. Thereby, the terminal device P1 according to the embodiment can determine, with higher accuracy, the sound collection condition corresponding to each feature amount based on the distance between the feature amount of the speaker US or the registered speaker and the similarity calculation models generated using the learning data for each sound collection condition.
In addition, the sound collection condition determination unit 112 in the terminal device P1 according to the embodiment selects the second similarity calculation model (an example of a second similarity calculation model, namely the similarity calculation model selected in the process of step St154) having the shortest distance to the feature amount of the speaker US or the registered speaker among the plurality of similarity calculation models, and determines the sound collection condition of the speaker US or the registered speaker based on the sound collection condition corresponding to the selected second similarity calculation model. Thereby, when calculating the similarity with the feature amount of the speaker US or the registered speaker, the terminal device P1 according to the embodiment selects the similarity calculation model whose characteristics are the closest, and can therefore determine, with higher accuracy, the sound collection condition corresponding to the feature amount of the speaker US or the registered speaker.
In addition, the terminal device P1 according to the embodiment further includes the authentication unit 116 (an example of a calculation unit) that calculates the similarity between the feature amount of the voice data of the utterance section and the feature amount of each of the plurality of registered speakers. The authentication unit 116 authenticates the speaker US based on the plurality of calculated similarities. Thereby, the terminal device P1 according to the embodiment can perform speaker authentication using the similarities between the feature amounts of the plurality of registered speakers registered in advance and the feature amount of the speaker US.
In addition, the authentication unit 116 in the terminal device P1 according to the embodiment identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker US. Thereby, the terminal device P1 according to the embodiment can perform speaker authentication using the similarities between the feature amounts of the plurality of registered speakers registered in advance and the feature amount of the speaker US.
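A minimal sketch of this threshold rule follows; cosine similarity and the threshold value are illustrative assumptions, since the embodiment's score comes from the selected first similarity calculation model.

```python
import numpy as np

# Minimal sketch of threshold-based identification. Cosine similarity and
# THRESHOLD = 0.8 are illustrative assumptions.
THRESHOLD = 0.8

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(us_feature: np.ndarray, registered: dict):
    """Return the best-matching registered speaker, or None when no
    similarity is equal to or greater than the threshold."""
    scores = {name: cosine(us_feature, feat) for name, feat in registered.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= THRESHOLD else None

registered = {"speaker_a": np.array([0.9, 0.1]), "speaker_b": np.array([0.2, 0.8])}
print(identify(np.array([0.88, 0.12]), registered))  # -> speaker_a
```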
In addition, the authentication unit 116 in the terminal device P1 according to the embodiment generates and outputs the authentication result screen SC, which includes information on the registered speaker whose similarity is equal to or greater than the threshold. Thereby, the terminal device P1 according to the embodiment can present the speaker authentication result to the speaker US or an administrator.
In addition, when the authentication unit 116 in the terminal device P1 according to the embodiment determines that none of the plurality of calculated similarities is equal to or greater than the threshold, it determines that the speaker US cannot be identified. Thereby, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy and can more effectively suppress erroneous authentication of the speaker US.
In addition, the terminal device P1 according to the embodiment further includes the authentication unit 116, which calculates the similarity between the feature amount of the voice data of the utterance section and the feature amount of each of the plurality of registered speakers, and the reliability calculation unit 115 (an example of a reliability calculation unit), which calculates the reliability of the similarity. The reliability calculation unit 115 calculates the reliability of the similarity based on the distance between the feature amount of the speaker US and the second similarity calculation model. Thereby, the terminal device P1 according to the embodiment can calculate the reliability of the similarity, of the sound collection condition determined in the course of the similarity calculation process, of the first similarity calculation model used for the similarity calculation, and so on.
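One conceivable realization, assuming the second similarity calculation model can be summarized by a representative vector, is a reliability that decreases monotonically with the distance, for example 1/(1+d); this mapping is an assumption for illustration only, as the disclosure specifies only that the reliability is computed from the distance.

```python
import numpy as np

# Minimal sketch: reliability shrinking as the distance between the feature
# amount of the speaker US and the second similarity calculation model grows.
# The representative vector and the 1/(1+d) mapping are assumptions.
def reliability(us_feature: np.ndarray, model_representative: np.ndarray) -> float:
    d = float(np.linalg.norm(us_feature - model_representative))
    return 1.0 / (1.0 + d)  # 1.0 at distance 0, approaching 0 far away

print(reliability(np.array([0.2, 0.7]), np.array([0.1, 0.8])))  # ~0.876
```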
In addition, the authentication unit 116 in the terminal device P1 according to the embodiment identifies a registered speaker whose similarity is equal to or greater than the threshold as the speaker US, and generates and outputs the authentication result screen SC, which includes information on the registered speaker whose similarity is equal to or greater than the threshold and information on the calculated reliability. Thereby, by displaying the speaker authentication result and the reliability of the speaker authentication result, the terminal device P1 according to the embodiment can prompt the administrator to confirm whether the speaker authentication result is reliable.
In addition, in the terminal device P1 according to the embodiment, the sound collection condition includes at least one of the gender of the speaker US, the age of the speaker US, the language of the speaker US, the sound collection device with which the uttered voice was collected, and the type of noise contained in the uttered voice. Thereby, the terminal device P1 according to the embodiment can select a similarity calculation model based on the sound collection conditions that alter the feature amounts used for speaker authentication and that can be a factor in decreasing speaker authentication accuracy. Therefore, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy.
While various embodiments have been described above with reference to the drawings, it goes without saying that the present disclosure is not limited to these examples. It is obvious that those skilled in the art can conceive of various alterations, modifications, substitutions, additions, deletions, and equivalents within the scope described in the claims, and it is understood that these naturally fall within the technical scope of the present disclosure. Furthermore, the constituent elements of the various embodiments described above may be combined in any manner without departing from the gist of the invention.
This application is based on a Japanese patent application filed on March 22, 2022 (Japanese Patent Application No. 2022-045391), the contents of which are incorporated herein by reference.
The present disclosure is useful as a voice authentication device and a voice authentication method capable of suppressing a decrease in speaker authentication accuracy caused by changes in environmental noise.
10 Communication unit
11 Processor
12 Memory
100 Voice authentication system
111 Feature amount extraction unit
112 Sound collection condition determination unit
113 Speaker registration unit
114 Similarity calculation model selection unit
115 Reliability calculation unit
116 Authentication unit
DB1 Feature amount extraction model database
DB2 Registered speaker database
DB3 Similarity calculation model database
DB4 Sound-collection-condition-specific learning database
MK Microphone
MN Monitor
P1 Terminal device
SC Authentication result screen
US Speaker

Claims (12)

1.  A voice authentication device comprising:
    an acquisition unit that acquires voice data;
    a detection unit that detects, from the voice data, an utterance section in which a speaker is speaking;
    an extraction unit that extracts an utterance feature amount of the speaker from the detected utterance section;
    a selection unit that selects, based on the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker registered in advance, a first similarity calculation model to be used for authenticating the speaker from among a plurality of similarity calculation models; and
    an authentication unit that authenticates the speaker by comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model.
2.  The voice authentication device according to claim 1, further comprising a determination unit that determines, based on an utterance feature amount, a sound collection condition of an uttered voice corresponding to the utterance feature amount,
    wherein the selection unit selects the first similarity calculation model based on the sound collection condition corresponding to the utterance feature amount of the speaker and the sound collection condition corresponding to the utterance feature amount of the registered speaker.
3.  The voice authentication device according to claim 2, wherein the plurality of similarity calculation models are generated using at least one piece of learning data under a predetermined sound collection condition, and
    the determination unit determines the sound collection condition of the speaker or the registered speaker based on a distance between the utterance feature amount of the speaker or the registered speaker and each of the plurality of similarity calculation models.
4.  The voice authentication device according to claim 3, wherein the determination unit selects a second similarity calculation model having the shortest distance to the utterance feature amount of the speaker or the registered speaker among the plurality of similarity calculation models, and determines the sound collection condition of the speaker or the registered speaker based on the sound collection condition corresponding to the selected second similarity calculation model.
5.  The voice authentication device according to claim 1, further comprising a calculation unit that calculates a similarity between the utterance feature amount of the voice data of the utterance section and the utterance feature amount of each of a plurality of registered speakers,
    wherein the authentication unit authenticates the speaker based on the plurality of calculated similarities.
6.  The voice authentication device according to claim 5, wherein the authentication unit identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker.
7.  The voice authentication device according to claim 6, wherein the authentication unit generates and outputs an authentication result screen including information on the registered speaker whose similarity is equal to or greater than the threshold.
8.  The voice authentication device according to claim 6, wherein the authentication unit determines that the speaker cannot be identified when determining that none of the plurality of calculated similarities is equal to or greater than the threshold.
9.  The voice authentication device according to claim 4, further comprising:
    a calculation unit that calculates a similarity between the utterance feature amount of the voice data of the utterance section and the utterance feature amount of each of a plurality of registered speakers; and
    a reliability calculation unit that calculates a reliability of the similarity,
    wherein the reliability calculation unit calculates the reliability of the similarity based on the distance between the utterance feature amount of the speaker and the second similarity calculation model.
10.  The voice authentication device according to claim 9, wherein the authentication unit identifies a registered speaker whose similarity is equal to or greater than a threshold as the speaker, and generates and outputs an authentication result screen including information on the registered speaker whose similarity is equal to or greater than the threshold and information on the calculated reliability.
11.  The voice authentication device according to claim 2, wherein the sound collection condition includes at least one of a gender of the speaker, an age of the speaker, a language of the speaker, a sound collection device with which the uttered voice was collected, and a type of noise contained in the uttered voice.
12.  A voice authentication method performed by a terminal device, the method comprising:
    acquiring voice data;
    detecting, from the voice data, an utterance section in which a speaker is speaking;
    extracting an utterance feature amount of the speaker from the detected utterance section;
    selecting, based on the extracted utterance feature amount of the speaker and an utterance feature amount of at least one registered speaker registered in advance, a first similarity calculation model to be used for authenticating the speaker from among a plurality of similarity calculation models; and
    authenticating the speaker by comparing the utterance feature amount of the speaker with the utterance feature amount of the registered speaker using the selected first similarity calculation model.
PCT/JP2023/009467 2022-03-22 2023-03-10 Voice authentication device and voice authentication method WO2023182014A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-045391 2022-03-22
JP2022045391 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023182014A1 true WO2023182014A1 (en) 2023-09-28

Family

ID=88101357

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/009467 WO2023182014A1 (en) 2022-03-22 2023-03-10 Voice authentication device and voice authentication method

Country Status (1)

Country Link
WO (1) WO2023182014A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008070596A (en) * 2006-09-14 2008-03-27 Yamaha Corp Voice authentication apparatus, voice authentication method, and program
JP2009003162A (en) * 2007-06-21 2009-01-08 Panasonic Corp Strained voice detector
US20190295553A1 (en) * 2018-03-21 2019-09-26 Hyundai Mobis Co., Ltd. Apparatus for recognizing voice speaker and method for the same
WO2021192719A1 (en) * 2020-03-27 2021-09-30 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Speaker identification method, speaker identification device, speaker identification program, sex identification model generation method, and speaker identification model generation method


Similar Documents

Publication Publication Date Title
CN105741836B (en) Voice recognition device and voice recognition method
US10204619B2 (en) Speech recognition using associative mapping
US10607597B2 (en) Speech signal recognition system and method
US9443511B2 (en) System and method for recognizing environmental sound
US9959863B2 (en) Keyword detection using speaker-independent keyword models for user-designated keywords
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
US10157272B2 (en) Systems and methods for evaluating strength of an audio password
US9601112B2 (en) Speech recognition system and method using incremental device-based acoustic model adaptation
WO2019080639A1 (en) Object identifying method, computer device and computer readable storage medium
US20190392858A1 (en) Intelligent voice outputting method, apparatus, and intelligent computing device
US20160118039A1 (en) Sound sample verification for generating sound detection model
CN104217149A (en) Biometric authentication method and equipment based on voice
JP2019053126A (en) Growth type interactive device
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
KR20210044475A (en) Apparatus and method for determining object indicated by pronoun
US9224388B2 (en) Sound recognition method and system
JP6676009B2 (en) Speaker determination device, speaker determination information generation method, and program
US10997972B2 (en) Object authentication device and object authentication method
WO2023182014A1 (en) Voice authentication device and voice authentication method
US11107476B2 (en) Speaker estimation method and speaker estimation device
US20180285643A1 (en) Object recognition device and object recognition method
KR101840363B1 (en) Voice recognition apparatus and terminal device for detecting misprononced phoneme, and method for training acoustic model
WO2023182016A1 (en) Voice authentication device and voice authentication method
US11437044B2 (en) Information processing apparatus, control method, and program
KR20210063698A (en) Electronic device and method for controlling the same, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23774613

Country of ref document: EP

Kind code of ref document: A1