WO2023182015A1 - Voice authentication device and voice authentication method - Google Patents

Voice authentication device and voice authentication method

Info

Publication number
WO2023182015A1
WO2023182015A1 (PCT/JP2023/009468)
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
data
speech
voice
authentication
Prior art date
Application number
PCT/JP2023/009468
Other languages
French (fr)
Japanese (ja)
Inventor
正成 宮本
Original Assignee
Panasonic Intellectual Property Management Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co., Ltd.
Publication of WO2023182015A1 publication Critical patent/WO2023182015A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Telephone Function (AREA)

Abstract

Provided is a voice authentication device comprising an acquisition unit which acquires voice data, a detection unit which detects, from the acquired voice data, an utterance section in which a speaker is making an utterance and a non-utterance section in which the speaker is not making an utterance, a combining unit which combines voice data in the non-utterance section with voice data of each of a plurality of speakers registered in advance, and an authentication unit which authenticates the speaker on the basis of voice data in the utterance section and the plurality of pieces of combined voice data in which the voice data in the non-utterance section is combined.

Description

Voice authentication device and voice authentication method
The present disclosure relates to a voice authentication device and a voice authentication method.
Patent Document 1 discloses a speech recognition device that receives a noise-containing speech signal as a recognition target, removes the noise from the input speech signal, adds known noise to the noise-removed signal, converts the noise-added signal into parameters for speech recognition, and performs speech recognition by comparing those parameters with an acoustic model. The known noise here refers to a pattern referenced by the speech recognition device, background noise learned during acoustic model training in a statistical method, or noise having similar characteristics. By adding known noise to the input speech signal, the speech recognition device reduces the discrepancy between a speech signal containing residual, unremoved noise and the speech signals recognized by the acoustic model prepared in advance, thereby improving the accuracy of speech recognition.
Patent Document 1: Japanese Patent Application Publication No. 2004-12884
However, when voiceprint authentication is performed using speech signals, the speech signals registered in advance are often recorded in a quiet environment where noise is unlikely to occur. Therefore, when voiceprint authentication is performed by comparing a speech signal to which known noise has been added with a registered speech signal, the added known noise may reduce voiceprint authentication accuracy.
The present disclosure was devised in view of the conventional situation described above, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
The present disclosure provides a voice authentication device including: an acquisition unit that acquires voice data; a detection unit that detects, from the acquired voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; a synthesis unit that synthesizes the voice data of the non-speech section with the voice data of each of a plurality of speakers registered in advance; and an authentication unit that authenticates the speaker based on the plurality of pieces of synthesized voice data, in which the voice data of the non-speech section has been synthesized, and the voice data of the speech section.
The present disclosure also provides a voice authentication method that acquires voice data, detects, from the acquired voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking, synthesizes the voice data of the non-speech section with the voice data of each of a plurality of speakers registered in advance, and authenticates the speaker based on the plurality of pieces of synthesized voice data, in which the voice data of the non-speech section has been synthesized, and the voice data of the speech section.
According to the present disclosure, it is possible to suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system according to an embodiment.
FIG. 2 is a diagram illustrating each process performed by a processor of a terminal device in the embodiment.
FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device in the embodiment.
FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device in the embodiment.
Hereinafter, embodiments that specifically disclose a voice authentication device and a voice authentication method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed descriptions of well-known matters and redundant descriptions of substantially identical configurations may be omitted. This is to avoid unnecessary redundancy in the following description and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided to enable those skilled in the art to fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.
First, a voice authentication system 100 according to an embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing an example of the internal configuration of the voice authentication system 100 according to the embodiment. FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
The voice authentication system 100 includes a terminal device P1, as an example of a voice authentication device, and a monitor MN. Note that the voice authentication system 100 may be configured to include the microphone MK or the monitor MN.
The microphone MK picks up the voice uttered by the speaker US that is to be registered in advance in the terminal device P1. The microphone MK converts the collected utterance of the speaker US into an audio signal or audio data to be registered in the terminal device P1, and transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
The microphone MK also picks up the voice uttered by the speaker US that is used for speaker authentication. The microphone MK converts the collected utterance of the speaker US into an audio signal or audio data, and transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
Note that the microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter "PC"), a notebook PC, a smartphone, or a tablet terminal. The microphone MK may also transmit the audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown). In the following description, speaker authentication processing using audio data will be described.
The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes speaker authentication based on the voice uttered by the speaker US. It includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, and a similarity calculation model database DB3.
The communication unit 10, which is an example of an acquisition unit, is connected to the microphone MK and the monitor MN by wired or wireless communication so that data can be transmitted and received. The wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless Local Area Network (LAN) such as Wi-Fi (registered trademark).
Note that the communication unit 10 may exchange data with the microphone MK via an interface such as Universal Serial Bus (USB), and may exchange data with the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
The processor 11 is configured using, for example, a Central Processing Unit (CPU) or a Field Programmable Gate Array (FPGA), and performs various kinds of processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12 and executes the program to realize the functions of a speaker registration unit 111, a noise extraction unit 112, a noise synthesis unit 113, a first feature extraction unit 114, a second feature extraction unit 115, an authentication unit 116, and the like.
When registering the voice of the speaker US, the processor 11 realizes the function of the speaker registration unit 111 to register (store) the voice data of the speaker US in the registered speaker database DB2. When authenticating the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 112, the noise synthesis unit 113, the first feature extraction unit 114, the second feature extraction unit 115, and the authentication unit 116 to execute speaker authentication processing.
The speaker registration unit 111 associates the voice data of the speaker US transmitted from the microphone MK with the speaker information of the speaker US, and registers them in the registered speaker database DB2.
Note that the speaker information may be extracted from the voice data by speech recognition, or may be acquired from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, or a tablet terminal). The speaker information here includes, for example, identification information that can identify the speaker US, the name of the speaker US, a speaker identification (ID), and the like.
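The disclosure does not fix a storage schema for DB2; as a minimal sketch, the registered speaker database can be modeled as a mapping from a speaker ID to speaker information plus enrollment audio. The `SpeakerRecord` type, the in-memory dict, and `register_speaker` below are hypothetical names chosen for illustration only; a real device would use persistent storage.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SpeakerRecord:
    """One entry in the registered speaker database DB2 (hypothetical schema)."""
    speaker_id: str          # identification information that can identify the speaker
    name: str                # the speaker's name
    voice_data: np.ndarray   # enrollment waveform, mono samples in [-1, 1]
    sample_rate: int


# DB2 modeled as an in-memory dict keyed by speaker ID.
registered_speakers: dict[str, SpeakerRecord] = {}


def register_speaker(record: SpeakerRecord) -> None:
    """Associate voice data with speaker information and store it (St14)."""
    registered_speakers[record.speaker_id] = record
```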
The noise extraction unit 112, which is an example of a detection unit, acquires the voice data of the speaker US transmitted from the microphone MK. The noise extraction unit 112 detects, from the voice data, a speech section in which the speaker US is speaking and a section in which the speaker US is not speaking (hereinafter, a "non-speech section"). The noise extraction unit 112 extracts the noise included in the detected non-speech section and outputs the extracted noise data (hereinafter, "noise data") to the noise synthesis unit 113.
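The disclosure leaves the detection algorithm open (elsewhere it attributes speech-section detection to the trained feature extraction model). A minimal sketch under the assumption of a simple frame-energy detector, which a deployed system would replace with the learned one:

```python
import numpy as np


def split_speech_nonspeech(audio: np.ndarray, sample_rate: int,
                           frame_ms: float = 30.0,
                           energy_ratio: float = 0.1) -> tuple[np.ndarray, np.ndarray]:
    """Return (speech_samples, nonspeech_samples) from a mono waveform.

    Frames whose RMS energy exceeds `energy_ratio` times the loudest frame
    are treated as the speech section; the remaining frames form the
    non-speech section, i.e. the noise that unit 112 outputs.
    Assumes `audio` is at least one frame long.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    speech_mask = rms >= energy_ratio * rms.max()
    return frames[speech_mask].ravel(), frames[~speech_mask].ravel()
```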
The noise referred to here is noise picked up due to the environment (background) at the time of sound collection, for example, surrounding voices, music, the sound of passing vehicles, the sound of the wind, and the like.
The noise synthesis unit 113, which is an example of a synthesis unit, acquires the noise data output from the noise extraction unit 112. The noise synthesis unit 113 synthesizes the acquired noise data with each piece of the registered voice data of the plurality of speakers registered in the registered speaker database DB2, and outputs the result to the first feature extraction unit 114.
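How the noise data is mixed into each registered utterance is likewise not specified; a straightforward reading is additive mixing, with the extracted noise tiled to the utterance length. A sketch under that assumption (`noise_gain` is an illustrative parameter, not part of the disclosure):

```python
import numpy as np


def synthesize_noise(registered_voice: np.ndarray, noise: np.ndarray,
                     noise_gain: float = 1.0) -> np.ndarray:
    """Overlay extracted non-speech noise onto a registered utterance (St15).

    The noise segment is tiled (repeated) to cover the full utterance.
    Assumes `noise` is non-empty and both signals share one sample rate.
    """
    reps = int(np.ceil(len(registered_voice) / len(noise)))
    tiled = np.tile(noise, reps)[: len(registered_voice)]
    mixed = registered_voice + noise_gain * tiled
    peak = np.abs(mixed).max()
    # Normalize only if the mix clips, keeping float audio within [-1, 1].
    return mixed / peak if peak > 1.0 else mixed
```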
The first feature extraction unit 114, which is an example of an extraction unit, acquires from the noise synthesis unit 113 each piece of the registered voice data of the plurality of speakers into which the noise data has been synthesized. Using the feature extraction model stored in the feature extraction model database DB1, the first feature extraction unit 114 extracts, from each piece of the registered voice data, a feature quantity indicating the individuality of each speaker, and outputs the feature quantities of the plurality of speakers to the authentication unit 116.
The second feature extraction unit 115, which is an example of an extraction unit, acquires the voice data of the speaker US transmitted from the microphone MK. Using the feature extraction model stored in the feature extraction model database DB1, the second feature extraction unit 115 extracts a feature quantity indicating the individuality of the speaker US from the voice data of the speaker US, and outputs the feature quantity of the speaker US to the authentication unit 116.
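The feature extraction model in DB1 is described only as a trained deep-learning model that outputs a feature quantity indicating speaker individuality. As a self-contained stand-in for such a model, not the patented one, the sketch below averages MFCCs computed with librosa into a fixed-length vector:

```python
import librosa
import numpy as np


def extract_features(audio: np.ndarray, sample_rate: int,
                     n_mfcc: int = 20) -> np.ndarray:
    """Return a fixed-length vector standing in for the speaker feature quantity.

    A deployed system would run the trained feature extraction model from
    DB1 here; time-averaged MFCCs are only a placeholder embedding.
    """
    mfcc = librosa.feature.mfcc(y=audio.astype(np.float32),
                                sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # shape: (n_mfcc,)
```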
The authentication unit 116, which is an example of a calculation unit, calculates the similarity between the feature quantity of each of the plurality of speakers output from the first feature extraction unit 114 and the feature quantity of the speaker US output from the second feature extraction unit 115, using the similarity calculation model stored in the similarity calculation model database DB3. The authentication unit 116 identifies the speaker US based on the calculated similarities, generates an authentication result screen SC based on the speaker information of the identified speaker US, and outputs it to the monitor MN.
The memory 12 includes, for example, a Random Access Memory (hereinafter "RAM") as a work memory used when the processor 11 executes each process, and a Read Only Memory (hereinafter "ROM") that stores a program and data defining the operation of the processor 11. Data or information generated or acquired by the processor 11 is temporarily stored in the RAM. The program that defines the operation of the processor 11 is written in the ROM.
The feature extraction model database DB1 is a so-called storage, configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter "HDD"), or a Solid State Drive (hereinafter "SSD"). The feature extraction model database DB1 stores a feature extraction model capable of detecting the speech section of a person such as the speaker US from voice data and extracting the person's feature quantity. The feature extraction model is, for example, a trained model generated by learning using deep learning or the like.
The registered speaker database DB2 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The registered speaker database DB2 stores the voice data of a plurality of speakers registered in advance in association with their speaker information.
The similarity calculation model database DB3 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The similarity calculation model database DB3 stores a similarity calculation model capable of calculating the similarity between two feature quantities. The similarity calculation model is, for example, a trained model generated by learning using deep learning or the like.
For example, the similarity calculation model learns in advance, and retains, the dimensions in which individuality tends to appear, in order to calculate the similarity between two multidimensional vectors with high accuracy. Note that calculating similarity with a model is merely one example of computing similarity between vectors; established techniques such as Euclidean distance or cosine similarity may be used instead.
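Since the text explicitly allows established vector similarity measures in place of the learned model, the two it names can be written directly:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors, in [-1, 1]; larger is more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two feature vectors; smaller is more similar."""
    return float(np.linalg.norm(a - b))
```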
The monitor MN is configured using a display such as a Liquid Crystal Display (hereinafter "LCD") or an organic Electroluminescence (hereinafter "EL") display. The monitor MN displays the authentication result screen SC output from the terminal device P1.
The authentication result screen SC is a screen that notifies the speaker US or an administrator (for example, a person viewing the monitor MN) of the speaker authentication result, and includes authentication result information such as "Matched the voice of Mr./Ms. XX." The authentication result screen SC may also include other speaker information (for example, a face image).
Next, the operation procedure of the terminal device P1 will be described with reference to FIG. 3. FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
The terminal device P1 acquires voice data from the microphone MK (St11). Note that the microphone MK may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
The terminal device P1 determines whether to register the acquired voice data in the registered speaker database DB2 based on a control command or speaker information associated with the voice data (St12). Note that if the acquired voice data contains a large amount of noise, the voice data may be acquired again from the microphone MK without being registered in the registered speaker database DB2. Keeping the amount of noise contained in the voice data at or below a certain level can improve speaker authentication accuracy.
In step St12, if a control command requesting registration of the voice data, or speaker information, is associated with the voice data, the terminal device P1 determines that the voice data is to be registered in the registered speaker database DB2 (St12, YES), associates the voice data with the speaker information, and registers them in the registered speaker database DB2 (St14).
In step St12, if no control command requesting registration of the voice data is associated with the voice data, the terminal device P1 determines that the voice data is not to be registered in the registered speaker database DB2 (St12, NO), and extracts the noise included in the non-speech section of the voice data (St13). The noise referred to here is the environmental noise and other background sound present around the time the voice data was collected.
The terminal device P1 synthesizes the extracted noise data with the voice data of each of the plurality of speakers stored (registered) in the registered speaker database DB2 (St15).
The terminal device P1 extracts, from the noise-synthesized voice data of each of the plurality of speakers, a feature quantity indicating the individuality of each speaker registered in the registered speaker database DB2 (St16).
The terminal device P1 extracts a feature quantity indicating the individuality of the speaker US from the speech section of the acquired voice data (St17).
The terminal device P1 executes speaker authentication processing based on the extracted feature quantity of the speaker US and the feature quantities of the plurality of speakers (St20).
As described above, by synthesizing the noise extracted from the voice data used for speaker authentication with the voice data of each of the plurality of speakers registered in the registered speaker database DB2, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by the noise contained in the voice data.
Next, the speaker authentication procedure of step St20 in FIG. 3 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
The terminal device P1 reads the similarity calculation model from the similarity calculation model database DB3 (St21).
The terminal device P1 calculates the similarity between the acquired feature quantity of the speaker US and the feature quantity of each of the plurality of speakers after noise synthesis (St22).
The terminal device P1 determines whether any of the calculated similarities is greater than or equal to a threshold (St23).
In step St23, if the terminal device P1 determines that one of the calculated similarities is greater than or equal to the threshold (St23, YES), it identifies the speaker US based on the speaker information corresponding to that similarity (St24). Note that if multiple similarities are determined to be greater than or equal to the threshold, the terminal device P1 may identify the speaker US based on the speaker information corresponding to the highest calculated similarity.
In step St23, if the terminal device P1 determines that none of the calculated similarities is greater than or equal to the threshold (St23, NO), it determines that the speaker US cannot be identified (St25). The decision logic of steps St22 to St25 is sketched below.
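Taken together, steps St22 to St25 amount to scoring the probe feature vector against every noise-synthesized enrollment vector and accepting the best score only if it clears the threshold. A minimal sketch, assuming cosine similarity as the score and 0.75 as an illustrative threshold (the disclosure leaves the value arbitrary):

```python
import numpy as np


def identify_speaker(probe: np.ndarray,
                     enrolled: dict[str, np.ndarray],
                     threshold: float = 0.75) -> str | None:
    """Return the best-matching speaker ID, or None when unidentifiable (St25).

    `enrolled` maps speaker IDs to feature vectors extracted from the
    noise-synthesized registered voice data; assumes it is non-empty.
    """
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # St22: similarity of the probe against every enrollment vector.
    scores = {sid: cos(probe, vec) for sid, vec in enrolled.items()}
    # St23/St24: keep the highest similarity only if it reaches the threshold.
    best_id, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_id if best_score >= threshold else None
```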
The terminal device P1 generates an authentication result screen SC based on the speaker information of the identified speaker US, and outputs the generated authentication result screen SC to the monitor MN for display (St26).
As described above, even when the voice data of the speaker US contains no noise at the time of voice registration but does contain noise at the time of voice authentication, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
Note that the terminal device P1 may determine whether the voice data of the speaker US contains noise at the time of voice registration. For example, if the terminal device P1 determines that the noise contained in the voice data at the time of voice registration is below a threshold, it may determine that the voice data contains no noise and execute the voice registration process. This threshold may be set to an arbitrary value for determining whether the noise is negligible in speaker authentication processing using feature quantities, or small enough not to induce false authentication. One way such a check could look is sketched below.
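One possible reading of this registration-time check is to estimate the noise level from the non-speech section of the enrollment audio and register only when it falls below a tolerance. A sketch assuming an RMS-based measure; the threshold value is illustrative, as the disclosure leaves it arbitrary:

```python
import numpy as np


def noise_below_threshold(nonspeech: np.ndarray,
                          rms_threshold: float = 0.01) -> bool:
    """Decide whether enrollment audio is clean enough to register.

    `nonspeech` holds the samples of the detected non-speech section;
    `rms_threshold` is an assumed tolerance, not a value from the patent.
    """
    if len(nonspeech) == 0:  # no non-speech frames detected: treat as clean
        return True
    rms = float(np.sqrt(np.mean(nonspeech ** 2)))
    return rms < rms_threshold
```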
 以上により、実施の形態に係る端末装置P1(音声認証装置の一例)は、音声データを取得する通信部10(取得部の一例)と、取得された音声データから話者が発話している発話区間と、話者USが発話していない非発話区間とを検出するノイズ抽出部112(検出部の一例)と、事前に登録された複数の話者のそれぞれの音声データに非発話区間の音声データを合成するノイズ合成部113(合成部の一例)と、非発話区間の音声データが合成された複数の合成音声データと、発話区間の音声データとに基づいて、話者USを認証する認証部116と、を備える。 As described above, the terminal device P1 (an example of a voice authentication device) according to the embodiment includes a communication unit 10 (an example of an acquisition unit) that acquires voice data, and an utterance uttered by a speaker based on the acquired voice data. A noise extraction unit 112 (an example of a detection unit) that detects a non-speech interval and a non-speech interval in which the speaker US does not speak, and a noise extraction unit 112 (an example of a detection unit) that detects the non-speech interval and the non-speech interval in which the voice data of each of the plurality of speakers registered in advance is Authentication that authenticates the speaker US based on the noise synthesis unit 113 (an example of a synthesis unit) that synthesizes data, a plurality of synthesized speech data obtained by synthesizing the speech data of the non-speech section, and the speech data of the speech section. 116.
Thereby, even when the voice data of the speaker US contains no noise at registration time but does contain noise at authentication time, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy.
The terminal device P1 according to the embodiment further includes a first feature extraction unit 114 and a second feature extraction unit 115 (examples of an extraction unit) that extract the speaker's features from voice data. The authentication unit 116 authenticates the speaker US based on the extracted features of the plurality of synthesized voice data and the features of the voice data of the speech section. The terminal device P1 according to the embodiment can thus perform speaker authentication using features that represent the individuality of the speaker US.
The terminal device P1 according to the embodiment further includes the authentication unit 116 (an example of a calculation unit), which calculates the similarity between each of the plurality of synthesized voice data and the voice data of the speech section. The authentication unit 116 authenticates the speaker US based on the plurality of calculated similarities. The terminal device P1 according to the embodiment can thus perform speaker authentication using the similarity between the features of the plurality of pre-registered speakers and the features of the speaker US.
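One common, though not mandated, choice for this similarity computation is cosine similarity between feature vectors; the patent does not name a specific measure, so the sketch below is an illustrative assumption. It could serve as the `similarity` callable in the earlier `VoiceAuthenticator` sketch.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two speaker-feature vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```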
The authentication unit 116 in the terminal device P1 according to the embodiment determines whether each calculated similarity is equal to or greater than the threshold, and identifies, as the speaker US, the speaker corresponding to the synthesized voice data whose similarity is determined to be equal to or greater than the threshold. The terminal device P1 according to the embodiment can thus identify the speaker US with higher accuracy based on the similarity between the features of the speaker US and those of each registered speaker.
The authentication unit 116 in the terminal device P1 according to the embodiment also generates and outputs the authentication result screen SC, which includes information on the speaker whose similarity is equal to or greater than the threshold. The terminal device P1 according to the embodiment can thus present the speaker authentication result to the speaker US or to an administrator.
When the authentication unit 116 in the terminal device P1 according to the embodiment determines that none of the plurality of calculated similarities is equal to or greater than the threshold, it determines that the speaker US cannot be identified. The terminal device P1 according to the embodiment can thus suppress a decrease in speaker authentication accuracy and more effectively prevent false authentication of the speaker US.
The noise extraction unit 112 in the terminal device P1 according to the embodiment extracts the noise contained in the non-speech section, and the noise synthesis unit 113 synthesizes that noise with the voice data of each of the plurality of speakers. By synthesizing the authentication-time noise with the registration-time voice data, the terminal device P1 according to the embodiment brings the sound collection environments of the uttered voice (voice data) at registration time and at authentication time closer together, and can thus more effectively suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
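A minimal sketch of this noise synthesis step, assuming the non-speech audio is tiled to the length of the enrollment audio and mixed at an assumed unit gain; both choices are illustrative, not specified by the patent. It could serve as the `mix` callable in the earlier `VoiceAuthenticator` sketch.

```python
import numpy as np

def mix_noise(registered_voice: np.ndarray, noise: np.ndarray,
              gain: float = 1.0) -> np.ndarray:
    """Overlay noise from the authentication-time non-speech section
    onto enrollment audio, tiling the noise to match the length."""
    reps = int(np.ceil(len(registered_voice) / len(noise)))
    tiled = np.tile(noise, reps)[:len(registered_voice)]
    mixed = registered_voice + gain * tiled
    return np.clip(mixed, -1.0, 1.0)  # keep samples in a valid float range
```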
Although various embodiments have been described above with reference to the drawings, the present disclosure is of course not limited to these examples. It is evident that those skilled in the art can conceive of various changes, modifications, substitutions, additions, deletions, and equivalents within the scope of the claims, and these naturally fall within the technical scope of the present disclosure. The constituent elements of the various embodiments described above may also be combined in any manner without departing from the spirit of the invention.
This application is based on Japanese Patent Application No. 2022-045389 filed on March 22, 2022, the contents of which are incorporated herein by reference.
The present disclosure is useful as a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
10 Communication unit
11 Processor
12 Memory
100 Voice authentication system
111 Speaker registration unit
112 Noise extraction unit
113 Noise synthesis unit
114 First feature extraction unit
115 Second feature extraction unit
116 Authentication unit
DB1 Feature extraction model database
DB2 Registered speaker database
DB3 Similarity calculation model database
MK Microphone
MN Monitor
P1 Terminal device
SC Authentication result screen
US Speaker

Claims (8)

1.  A voice authentication device comprising:
    an acquisition unit that acquires voice data;
    a detection unit that detects, from the acquired voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking;
    a synthesis unit that synthesizes the voice data of the non-speech section with the voice data of each of a plurality of speakers registered in advance; and
    an authentication unit that authenticates the speaker based on a plurality of synthesized voice data, in which the voice data of the non-speech section has been synthesized, and the voice data of the speech section.
2.  The voice authentication device according to claim 1, further comprising an extraction unit that extracts features of a speaker from voice data,
    wherein the authentication unit authenticates the speaker based on the extracted features of the plurality of synthesized voice data and the features of the voice data of the speech section.
3.  The voice authentication device according to claim 1, further comprising a calculation unit that calculates the similarity between each of the plurality of synthesized voice data and the voice data of the speech section,
    wherein the authentication unit authenticates the speaker based on the plurality of calculated similarities.
4.  The voice authentication device according to claim 3, wherein the authentication unit determines whether each calculated similarity is equal to or greater than a threshold, and identifies, as the speaker, the speaker corresponding to the synthesized voice data whose similarity is determined to be equal to or greater than the threshold.
5.  The voice authentication device according to claim 4, wherein the authentication unit generates and outputs an authentication result screen including information on the speaker whose similarity is equal to or greater than the threshold.
6.  The voice authentication device according to claim 5, wherein the authentication unit determines that the speaker cannot be identified when it determines that none of the plurality of calculated similarities is equal to or greater than the threshold.
7.  The voice authentication device according to claim 1, wherein the detection unit extracts noise contained in the non-speech section, and the synthesis unit synthesizes the noise with the voice data of each of the plurality of speakers.
8.  A voice authentication method comprising:
    acquiring voice data;
    detecting, from the acquired voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking;
    synthesizing the voice data of the non-speech section with the voice data of each of a plurality of speakers registered in advance; and
    authenticating the speaker based on a plurality of synthesized voice data, in which the voice data of the non-speech section has been synthesized, and the voice data of the speech section.
PCT/JP2023/009468 2022-03-22 2023-03-10 Voice authentication device and voice authentication method WO2023182015A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022045389A JP2023139711A (en) 2022-03-22 2022-03-22 Voice authentication device and voice authentication method
JP2022-045389 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023182015A1 (en)

Family

ID=88101353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/009468 WO2023182015A1 (en) 2022-03-22 2023-03-10 Voice authentication device and voice authentication method

Country Status (2)

Country Link
JP (1) JP2023139711A (en)
WO (1) WO2023182015A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09120293A (en) * 1995-10-24 1997-05-06 Ricoh Co Ltd System and method for recognizing speaker
JPH11205451A (en) * 1998-01-19 1999-07-30 Canon Inc Speech recognition device, method therefor and computer readable memory
JP2006079079A (en) * 2004-09-06 2006-03-23 Samsung Electronics Co Ltd Distributed speech recognition system and its method
JP2020060757A (en) * 2018-10-05 2020-04-16 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Speaker recognition device, speaker recognition method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZEN, HEIGA ET AL.: "ICSLP 2006 Summary: Acoustic Modeling and Speech Synthesis", IPSJ SIG Technical Reports, Information Processing Society of Japan, vol. 2006, no. 136, 1 January 2006, pp. 179-184, XP009549622, ISSN: 0919-6072 *

Also Published As

Publication number Publication date
JP2023139711A (en) 2023-10-04

Similar Documents

Publication Publication Date Title
KR102339594B1 (en) Object recognition method, computer device, and computer-readable storage medium
EP3525204A1 (en) Method and apparatus to provide comprehensive smart assistant services
CN109493850B (en) Growing type dialogue device
JP3967952B2 (en) Grammar update system and method
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
US20150221305A1 (en) Multiple speech locale-specific hotword classifiers for selection of a speech locale
US20120022863A1 (en) Method and apparatus for voice activity detection
CN104217149A (en) Biometric authentication method and equipment based on voice
TW201250670A (en) Speech recognition device and a speech recognition method thereof
WO2022166218A1 (en) Method for adding punctuation during voice recognition and voice recognition device
JP2011059186A (en) Speech section detecting device and speech recognition device, program and recording medium
US9953633B2 (en) Speaker dependent voiced sound pattern template mapping
KR20180012639A (en) Voice recognition method, voice recognition device, apparatus comprising Voice recognition device, storage medium storing a program for performing the Voice recognition method, and method for making transformation model
CN105338327A (en) Video monitoring networking system capable of achieving speech recognition
WO2023182015A1 (en) Voice authentication device and voice authentication method
US11107476B2 (en) Speaker estimation method and speaker estimation device
JP2013083796A (en) Method for identifying male/female voice, male/female voice identification device, and program
US11950081B2 (en) Multi-channel speech compression system and method
US20220254361A1 (en) Multi-channel speech compression system and method
WO2023182014A1 (en) Voice authentication device and voice authentication method
WO2023182016A1 (en) Voice authentication device and voice authentication method
WO2023047893A1 (en) Authentication device and authentication method
WO2023233754A1 (en) Voice authentication device and voice authentication method
JP6693340B2 (en) Audio processing program, audio processing device, and audio processing method
US20230395063A1 (en) System and Method for Secure Transcription Generation

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23774614

Country of ref document: EP

Kind code of ref document: A1