WO2023182015A1 - Voice authentication device and voice authentication method - Google Patents

Voice authentication device and voice authentication method

Info

Publication number
WO2023182015A1
WO2023182015A1 (PCT/JP2023/009468)
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
data
speech
voice
authentication
Prior art date
Application number
PCT/JP2023/009468
Other languages
French (fr)
Japanese (ja)
Inventor
正成 宮本
Original Assignee
Panasonic Intellectual Property Management Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co., Ltd.
Publication of WO2023182015A1 publication Critical patent/WO2023182015A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Telephone Function (AREA)

Abstract

Provided is a voice authentication device comprising an acquisition unit which acquires voice data, a detection unit which detects, from the acquired voice data, an utterance section in which a speaker is making an utterance and a non-utterance section in which the speaker is not making an utterance, a combining unit which combines voice data in the non-utterance section with voice data of each of a plurality of speakers registered in advance, and an authentication unit which authenticates the speaker on the basis of voice data in the utterance section and the plurality of pieces of combined voice data in which the voice data in the non-utterance section is combined.

Description

Voice authentication device and voice authentication method
The present disclosure relates to a voice authentication device and a voice authentication method.
Patent Document 1 discloses a speech recognition device that receives a noise-containing speech signal as a recognition target, removes the noise from the input speech signal, adds known noise to the noise-removed signal, converts the noise-added signal into parameters for speech recognition, and performs speech recognition by comparing those parameters with an acoustic model. The known noise here refers to a pattern referenced by the speech recognition device, background noise learned during acoustic model training in a statistical method, or noise having similar characteristics. By adding known noise to the input speech signal, the speech recognition device reduces the discrepancy between a speech signal containing residual, unremoved noise and the speech signals recognized by the acoustic model prepared in advance, thereby improving the accuracy of speech recognition.
Patent Document 1: Japanese Patent Application Publication No. 2004-12884
However, when voiceprint authentication is performed using speech signals, the speech signals registered in advance are often recorded in a quiet environment where noise is unlikely to occur. Therefore, when voiceprint authentication is performed by comparing a speech signal to which known noise has been added with a registered speech signal, the added known noise may reduce voiceprint authentication accuracy.
The present disclosure was devised in view of the conventional situation described above, and aims to provide a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
The present disclosure provides a voice authentication device including: an acquisition unit that acquires voice data; a detection unit that detects, from the acquired voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking; a synthesis unit that synthesizes the voice data of the non-speech section with the voice data of each of a plurality of speakers registered in advance; and an authentication unit that authenticates the speaker based on the plurality of pieces of synthesized voice data, in which the voice data of the non-speech section has been synthesized, and the voice data of the speech section.
The present disclosure also provides a voice authentication method that acquires voice data, detects, from the acquired voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking, synthesizes the voice data of the non-speech section with the voice data of each of a plurality of speakers registered in advance, and authenticates the speaker based on the plurality of pieces of synthesized voice data, in which the voice data of the non-speech section has been synthesized, and the voice data of the speech section.
According to the present disclosure, it is possible to suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
FIG. 1 is a block diagram showing an example of the internal configuration of a voice authentication system according to an embodiment.
FIG. 2 is a diagram illustrating each process performed by a processor of a terminal device in the embodiment.
FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device in the embodiment.
FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device in the embodiment.
Hereinafter, embodiments that specifically disclose a voice authentication device and a voice authentication method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed descriptions of well-known matters and redundant descriptions of substantially identical configurations may be omitted. This is to avoid unnecessary redundancy in the following description and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided to enable those skilled in the art to fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.
First, a voice authentication system 100 according to an embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing an example of the internal configuration of the voice authentication system 100 according to the embodiment. FIG. 2 is a diagram illustrating each process performed by the processor 11 of the terminal device P1 in the embodiment.
The voice authentication system 100 includes a terminal device P1, as an example of a voice authentication device, and a monitor MN. Note that the voice authentication system 100 may be configured to include the microphone MK or the monitor MN.
The microphone MK picks up the voice uttered by the speaker US that is to be registered in advance in the terminal device P1. The microphone MK converts the collected utterance of the speaker US into an audio signal or audio data to be registered in the terminal device P1, and transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
The microphone MK also picks up the voice uttered by the speaker US that is used for speaker authentication. The microphone MK converts the collected utterance of the speaker US into an audio signal or audio data, and transmits the converted audio signal or audio data to the processor 11 via the communication unit 10.
Note that the microphone MK may be, for example, a microphone included in a predetermined device such as a Personal Computer (hereinafter "PC"), a notebook PC, a smartphone, or a tablet terminal. The microphone MK may also transmit the audio signal or audio data to the terminal device P1 by wireless communication via a network (not shown). In the following description, speaker authentication processing using audio data will be described.
The terminal device P1 is realized by, for example, a PC, a notebook PC, a smartphone, or a tablet terminal, and executes speaker authentication based on the voice uttered by the speaker US. It includes a communication unit 10, a processor 11, a memory 12, a feature extraction model database DB1, a registered speaker database DB2, and a similarity calculation model database DB3.
The communication unit 10, which is an example of an acquisition unit, is connected to the microphone MK and the monitor MN by wired or wireless communication so that data can be transmitted and received. The wireless communication referred to here is, for example, short-range wireless communication such as Bluetooth (registered trademark) or NFC (registered trademark), or communication via a wireless Local Area Network (LAN) such as Wi-Fi (registered trademark).
Note that the communication unit 10 may exchange data with the microphone MK via an interface such as Universal Serial Bus (USB), and may exchange data with the monitor MN via an interface such as High-Definition Multimedia Interface (HDMI, registered trademark).
The processor 11 is configured using, for example, a Central Processing Unit (CPU) or a Field Programmable Gate Array (FPGA), and performs various kinds of processing and control in cooperation with the memory 12. Specifically, the processor 11 refers to the program and data held in the memory 12 and executes the program to realize the functions of a speaker registration unit 111, a noise extraction unit 112, a noise synthesis unit 113, a first feature extraction unit 114, a second feature extraction unit 115, an authentication unit 116, and the like.
When registering the voice of the speaker US, the processor 11 realizes the function of the speaker registration unit 111 to register (store) the voice data of the speaker US in the registered speaker database DB2. When authenticating the voice of the speaker US, the processor 11 realizes the functions of the noise extraction unit 112, the noise synthesis unit 113, the first feature extraction unit 114, the second feature extraction unit 115, and the authentication unit 116 to execute speaker authentication processing.
The speaker registration unit 111 associates the voice data of the speaker US transmitted from the microphone MK with the speaker information of the speaker US, and registers them in the registered speaker database DB2.
Note that the speaker information may be extracted from the voice data by speech recognition, or may be acquired from a terminal owned by the speaker US (for example, a PC, a notebook PC, a smartphone, or a tablet terminal). The speaker information here includes, for example, identification information that can identify the speaker US, the name of the speaker US, a speaker identification (ID), and the like.
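The disclosure does not fix a storage schema for DB2; as a minimal sketch, the registered speaker database can be modeled as a mapping from a speaker ID to speaker information plus enrollment audio. The `SpeakerRecord` type, the in-memory dict, and `register_speaker` below are hypothetical names chosen for illustration only; a real device would use persistent storage.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SpeakerRecord:
    """One entry in the registered speaker database DB2 (hypothetical schema)."""
    speaker_id: str          # identification information that can identify the speaker
    name: str                # the speaker's name
    voice_data: np.ndarray   # enrollment waveform, mono samples in [-1, 1]
    sample_rate: int


# DB2 modeled as an in-memory dict keyed by speaker ID.
registered_speakers: dict[str, SpeakerRecord] = {}


def register_speaker(record: SpeakerRecord) -> None:
    """Associate voice data with speaker information and store it (St14)."""
    registered_speakers[record.speaker_id] = record
```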
The noise extraction unit 112, which is an example of a detection unit, acquires the voice data of the speaker US transmitted from the microphone MK. The noise extraction unit 112 detects, from the voice data, a speech section in which the speaker US is speaking and a section in which the speaker US is not speaking (hereinafter, a "non-speech section"). The noise extraction unit 112 extracts the noise included in the detected non-speech section and outputs the extracted noise data (hereinafter, "noise data") to the noise synthesis unit 113.
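The disclosure leaves the detection algorithm open (elsewhere it attributes speech-section detection to the trained feature extraction model). A minimal sketch under the assumption of a simple frame-energy detector, which a deployed system would replace with the learned one:

```python
import numpy as np


def split_speech_nonspeech(audio: np.ndarray, sample_rate: int,
                           frame_ms: float = 30.0,
                           energy_ratio: float = 0.1) -> tuple[np.ndarray, np.ndarray]:
    """Return (speech_samples, nonspeech_samples) from a mono waveform.

    Frames whose RMS energy exceeds `energy_ratio` times the loudest frame
    are treated as the speech section; the remaining frames form the
    non-speech section, i.e. the noise that unit 112 outputs.
    Assumes `audio` is at least one frame long.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    speech_mask = rms >= energy_ratio * rms.max()
    return frames[speech_mask].ravel(), frames[~speech_mask].ravel()
```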
The noise referred to here is noise picked up due to the environment (background) at the time of sound collection, for example, surrounding voices, music, the sound of passing vehicles, the sound of the wind, and the like.
The noise synthesis unit 113, which is an example of a synthesis unit, acquires the noise data output from the noise extraction unit 112. The noise synthesis unit 113 synthesizes the acquired noise data with each piece of the registered voice data of the plurality of speakers registered in the registered speaker database DB2, and outputs the result to the first feature extraction unit 114.
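How the noise data is mixed into each registered utterance is likewise not specified; a straightforward reading is additive mixing, with the extracted noise tiled to the utterance length. A sketch under that assumption (`noise_gain` is an illustrative parameter, not part of the disclosure):

```python
import numpy as np


def synthesize_noise(registered_voice: np.ndarray, noise: np.ndarray,
                     noise_gain: float = 1.0) -> np.ndarray:
    """Overlay extracted non-speech noise onto a registered utterance (St15).

    The noise segment is tiled (repeated) to cover the full utterance.
    Assumes `noise` is non-empty and both signals share one sample rate.
    """
    reps = int(np.ceil(len(registered_voice) / len(noise)))
    tiled = np.tile(noise, reps)[: len(registered_voice)]
    mixed = registered_voice + noise_gain * tiled
    peak = np.abs(mixed).max()
    # Normalize only if the mix clips, keeping float audio within [-1, 1].
    return mixed / peak if peak > 1.0 else mixed
```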
The first feature extraction unit 114, which is an example of an extraction unit, acquires from the noise synthesis unit 113 each piece of the registered voice data of the plurality of speakers into which the noise data has been synthesized. Using the feature extraction model stored in the feature extraction model database DB1, the first feature extraction unit 114 extracts, from each piece of the registered voice data, a feature quantity indicating the individuality of each speaker, and outputs the feature quantities of the plurality of speakers to the authentication unit 116.
The second feature extraction unit 115, which is an example of an extraction unit, acquires the voice data of the speaker US transmitted from the microphone MK. Using the feature extraction model stored in the feature extraction model database DB1, the second feature extraction unit 115 extracts a feature quantity indicating the individuality of the speaker US from the voice data of the speaker US, and outputs the feature quantity of the speaker US to the authentication unit 116.
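The feature extraction model in DB1 is described only as a trained deep-learning model that outputs a feature quantity indicating speaker individuality. As a self-contained stand-in for such a model, not the patented one, the sketch below averages MFCCs computed with librosa into a fixed-length vector:

```python
import librosa
import numpy as np


def extract_features(audio: np.ndarray, sample_rate: int,
                     n_mfcc: int = 20) -> np.ndarray:
    """Return a fixed-length vector standing in for the speaker feature quantity.

    A deployed system would run the trained feature extraction model from
    DB1 here; time-averaged MFCCs are only a placeholder embedding.
    """
    mfcc = librosa.feature.mfcc(y=audio.astype(np.float32),
                                sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # shape: (n_mfcc,)
```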
The authentication unit 116, which is an example of a calculation unit, calculates the similarity between the feature quantity of each of the plurality of speakers output from the first feature extraction unit 114 and the feature quantity of the speaker US output from the second feature extraction unit 115, using the similarity calculation model stored in the similarity calculation model database DB3. The authentication unit 116 identifies the speaker US based on the calculated similarities, generates an authentication result screen SC based on the speaker information of the identified speaker US, and outputs it to the monitor MN.
The memory 12 includes, for example, a Random Access Memory (hereinafter "RAM") as a work memory used when the processor 11 executes each process, and a Read Only Memory (hereinafter "ROM") that stores a program and data defining the operation of the processor 11. Data or information generated or acquired by the processor 11 is temporarily stored in the RAM. The program that defines the operation of the processor 11 is written in the ROM.
The feature extraction model database DB1 is a so-called storage, configured using a storage medium such as a flash memory, a Hard Disk Drive (hereinafter "HDD"), or a Solid State Drive (hereinafter "SSD"). The feature extraction model database DB1 stores a feature extraction model capable of detecting the speech section of a person such as the speaker US from voice data and extracting the person's feature quantity. The feature extraction model is, for example, a trained model generated by learning using deep learning or the like.
The registered speaker database DB2 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The registered speaker database DB2 stores the voice data of a plurality of speakers registered in advance in association with their speaker information.
The similarity calculation model database DB3 is a so-called storage, configured using a storage medium such as a flash memory, an HDD, or an SSD. The similarity calculation model database DB3 stores a similarity calculation model capable of calculating the similarity between two feature quantities. The similarity calculation model is, for example, a trained model generated by learning using deep learning or the like.
For example, the similarity calculation model learns in advance, and retains, the dimensions in which individuality tends to appear, in order to calculate the similarity between two multidimensional vectors with high accuracy. Note that calculating similarity with a model is merely one example of computing similarity between vectors; established techniques such as Euclidean distance or cosine similarity may be used instead.
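Since the text explicitly allows established vector similarity measures in place of the learned model, the two it names can be written directly:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors, in [-1, 1]; larger is more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two feature vectors; smaller is more similar."""
    return float(np.linalg.norm(a - b))
```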
The monitor MN is configured using a display such as a Liquid Crystal Display (hereinafter "LCD") or an organic Electroluminescence (hereinafter "EL") display. The monitor MN displays the authentication result screen SC output from the terminal device P1.
The authentication result screen SC is a screen that notifies the speaker US or an administrator (for example, a person viewing the monitor MN) of the speaker authentication result, and includes authentication result information such as "Matched the voice of Mr./Ms. XX." The authentication result screen SC may also include other speaker information (for example, a face image).
Next, the operation procedure of the terminal device P1 will be described with reference to FIG. 3. FIG. 3 is a flowchart showing an example of the operation procedure of the terminal device P1 in the embodiment.
The terminal device P1 acquires voice data from the microphone MK (St11). Note that the microphone MK may be, for example, a microphone included in a PC, a notebook PC, a smartphone, or a tablet terminal.
The terminal device P1 determines whether to register the acquired voice data in the registered speaker database DB2 based on a control command or speaker information associated with the voice data (St12). Note that if the acquired voice data contains a large amount of noise, the voice data may be acquired again from the microphone MK without being registered in the registered speaker database DB2. Keeping the amount of noise contained in the voice data at or below a certain level can improve speaker authentication accuracy.
In step St12, if a control command requesting registration of the voice data, or speaker information, is associated with the voice data, the terminal device P1 determines that the voice data is to be registered in the registered speaker database DB2 (St12, YES), associates the voice data with the speaker information, and registers them in the registered speaker database DB2 (St14).
In step St12, if no control command requesting registration of the voice data is associated with the voice data, the terminal device P1 determines that the voice data is not to be registered in the registered speaker database DB2 (St12, NO), and extracts the noise included in the non-speech section of the voice data (St13). The noise referred to here is the environmental noise and other background sound present around the time the voice data was collected.
The terminal device P1 synthesizes the extracted noise data with the voice data of each of the plurality of speakers stored (registered) in the registered speaker database DB2 (St15).
The terminal device P1 extracts, from the noise-synthesized voice data of each of the plurality of speakers, a feature quantity indicating the individuality of each speaker registered in the registered speaker database DB2 (St16).
The terminal device P1 extracts a feature quantity indicating the individuality of the speaker US from the speech section of the acquired voice data (St17).
The terminal device P1 executes speaker authentication processing based on the extracted feature quantity of the speaker US and the feature quantities of the plurality of speakers (St20).
As described above, by synthesizing the noise extracted from the voice data used for speaker authentication with the voice data of each of the plurality of speakers registered in the registered speaker database DB2, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by the noise contained in the voice data.
Next, the speaker authentication procedure of step St20 in FIG. 3 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the speaker authentication procedure of the terminal device P1 in the embodiment.
The terminal device P1 reads the similarity calculation model from the similarity calculation model database DB3 (St21).
The terminal device P1 calculates the similarity between the acquired feature quantity of the speaker US and the feature quantity of each of the plurality of speakers after noise synthesis (St22).
The terminal device P1 determines whether any of the calculated similarities is greater than or equal to a threshold (St23).
In step St23, if the terminal device P1 determines that one of the calculated similarities is greater than or equal to the threshold (St23, YES), it identifies the speaker US based on the speaker information corresponding to that similarity (St24). Note that if multiple similarities are determined to be greater than or equal to the threshold, the terminal device P1 may identify the speaker US based on the speaker information corresponding to the highest calculated similarity.
In step St23, if the terminal device P1 determines that none of the calculated similarities is greater than or equal to the threshold (St23, NO), it determines that the speaker US cannot be identified (St25). The decision logic of steps St22 to St25 is sketched below.
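Taken together, steps St22 to St25 amount to scoring the probe feature vector against every noise-synthesized enrollment vector and accepting the best score only if it clears the threshold. A minimal sketch, assuming cosine similarity as the score and 0.75 as an illustrative threshold (the disclosure leaves the value arbitrary):

```python
import numpy as np


def identify_speaker(probe: np.ndarray,
                     enrolled: dict[str, np.ndarray],
                     threshold: float = 0.75) -> str | None:
    """Return the best-matching speaker ID, or None when unidentifiable (St25).

    `enrolled` maps speaker IDs to feature vectors extracted from the
    noise-synthesized registered voice data; assumes it is non-empty.
    """
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # St22: similarity of the probe against every enrollment vector.
    scores = {sid: cos(probe, vec) for sid, vec in enrolled.items()}
    # St23/St24: keep the highest similarity only if it reaches the threshold.
    best_id, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_id if best_score >= threshold else None
```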
The terminal device P1 generates an authentication result screen SC based on the speaker information of the identified speaker US, and outputs the generated authentication result screen SC to the monitor MN for display (St26).
As described above, even when the voice data of the speaker US contains no noise at the time of voice registration but does contain noise at the time of voice authentication, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy. In other words, the terminal device P1 can more effectively suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
Note that the terminal device P1 may determine whether the voice data of the speaker US contains noise at the time of voice registration. For example, if the terminal device P1 determines that the noise contained in the voice data at the time of voice registration is below a threshold, it may determine that the voice data contains no noise and execute the voice registration process. This threshold may be set to an arbitrary value for determining whether the noise is negligible in speaker authentication processing using feature quantities, or small enough not to induce false authentication. One way such a check could look is sketched below.
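One possible reading of this registration-time check is to estimate the noise level from the non-speech section of the enrollment audio and register only when it falls below a tolerance. A sketch assuming an RMS-based measure; the threshold value is illustrative, as the disclosure leaves it arbitrary:

```python
import numpy as np


def noise_below_threshold(nonspeech: np.ndarray,
                          rms_threshold: float = 0.01) -> bool:
    """Decide whether enrollment audio is clean enough to register.

    `nonspeech` holds the samples of the detected non-speech section;
    `rms_threshold` is an assumed tolerance, not a value from the patent.
    """
    if len(nonspeech) == 0:  # no non-speech frames detected: treat as clean
        return True
    rms = float(np.sqrt(np.mean(nonspeech ** 2)))
    return rms < rms_threshold
```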
 以上により、実施の形態に係る端末装置P1(音声認証装置の一例)は、音声データを取得する通信部10(取得部の一例)と、取得された音声データから話者が発話している発話区間と、話者USが発話していない非発話区間とを検出するノイズ抽出部112(検出部の一例)と、事前に登録された複数の話者のそれぞれの音声データに非発話区間の音声データを合成するノイズ合成部113(合成部の一例)と、非発話区間の音声データが合成された複数の合成音声データと、発話区間の音声データとに基づいて、話者USを認証する認証部116と、を備える。 As described above, the terminal device P1 (an example of a voice authentication device) according to the embodiment includes a communication unit 10 (an example of an acquisition unit) that acquires voice data, and an utterance uttered by a speaker based on the acquired voice data. A noise extraction unit 112 (an example of a detection unit) that detects a non-speech interval and a non-speech interval in which the speaker US does not speak, and a noise extraction unit 112 (an example of a detection unit) that detects the non-speech interval and the non-speech interval in which the voice data of each of the plurality of speakers registered in advance is Authentication that authenticates the speaker US based on the noise synthesis unit 113 (an example of a synthesis unit) that synthesizes data, a plurality of synthesized speech data obtained by synthesizing the speech data of the non-speech section, and the speech data of the speech section. 116.
Thereby, even when the voice data of the speaker US contains no noise at registration time but does contain noise at authentication time, the terminal device P1 according to the embodiment can more effectively suppress a decrease in speaker authentication accuracy.
The terminal device P1 according to the embodiment further includes a first feature extraction unit 114 and a second feature extraction unit 115 (examples of an extraction unit) that extract the speaker's features from voice data. The authentication unit 116 authenticates the speaker US based on the extracted features of the plurality of synthesized voice data and the features of the voice data of the speech section. The terminal device P1 according to the embodiment can thus perform speaker authentication using features that represent the individuality of the speaker US.
The terminal device P1 according to the embodiment further includes the authentication unit 116 (an example of a calculation unit), which calculates the similarity between each of the plurality of synthesized voice data and the voice data of the speech section. The authentication unit 116 authenticates the speaker US based on the plurality of calculated similarities. The terminal device P1 according to the embodiment can thus perform speaker authentication using the similarity between the features of the plurality of pre-registered speakers and the features of the speaker US.
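One common, though not mandated, choice for this similarity computation is cosine similarity between feature vectors; the patent does not name a specific measure, so the sketch below is an illustrative assumption. It could serve as the `similarity` callable in the earlier `VoiceAuthenticator` sketch.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two speaker-feature vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```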
The authentication unit 116 in the terminal device P1 according to the embodiment determines whether each calculated similarity is equal to or greater than the threshold, and identifies, as the speaker US, the speaker corresponding to the synthesized voice data whose similarity is determined to be equal to or greater than the threshold. The terminal device P1 according to the embodiment can thus identify the speaker US with higher accuracy based on the similarity between the features of the speaker US and those of each registered speaker.
The authentication unit 116 in the terminal device P1 according to the embodiment also generates and outputs the authentication result screen SC, which includes information on the speaker whose similarity is equal to or greater than the threshold. The terminal device P1 according to the embodiment can thus present the speaker authentication result to the speaker US or to an administrator.
When the authentication unit 116 in the terminal device P1 according to the embodiment determines that none of the plurality of calculated similarities is equal to or greater than the threshold, it determines that the speaker US cannot be identified. The terminal device P1 according to the embodiment can thus suppress a decrease in speaker authentication accuracy and more effectively prevent false authentication of the speaker US.
The noise extraction unit 112 in the terminal device P1 according to the embodiment extracts the noise contained in the non-speech section, and the noise synthesis unit 113 synthesizes that noise with the voice data of each of the plurality of speakers. By synthesizing the authentication-time noise with the registration-time voice data, the terminal device P1 according to the embodiment brings the sound collection environments of the uttered voice (voice data) at registration time and at authentication time closer together, and can thus more effectively suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
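A minimal sketch of this noise synthesis step, assuming the non-speech audio is tiled to the length of the enrollment audio and mixed at an assumed unit gain; both choices are illustrative, not specified by the patent. It could serve as the `mix` callable in the earlier `VoiceAuthenticator` sketch.

```python
import numpy as np

def mix_noise(registered_voice: np.ndarray, noise: np.ndarray,
              gain: float = 1.0) -> np.ndarray:
    """Overlay noise from the authentication-time non-speech section
    onto enrollment audio, tiling the noise to match the length."""
    reps = int(np.ceil(len(registered_voice) / len(noise)))
    tiled = np.tile(noise, reps)[:len(registered_voice)]
    mixed = registered_voice + gain * tiled
    return np.clip(mixed, -1.0, 1.0)  # keep samples in a valid float range
```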
Although various embodiments have been described above with reference to the drawings, the present disclosure is of course not limited to these examples. It is evident that those skilled in the art can conceive of various changes, modifications, substitutions, additions, deletions, and equivalents within the scope of the claims, and these naturally fall within the technical scope of the present disclosure. The constituent elements of the various embodiments described above may also be combined in any manner without departing from the spirit of the invention.
This application is based on Japanese Patent Application No. 2022-045389 filed on March 22, 2022, the contents of which are incorporated herein by reference.
The present disclosure is useful as a voice authentication device and a voice authentication method that suppress a decrease in speaker authentication accuracy caused by changes in environmental noise.
10 Communication unit
11 Processor
12 Memory
100 Voice authentication system
111 Speaker registration unit
112 Noise extraction unit
113 Noise synthesis unit
114 First feature extraction unit
115 Second feature extraction unit
116 Authentication unit
DB1 Feature extraction model database
DB2 Registered speaker database
DB3 Similarity calculation model database
MK Microphone
MN Monitor
P1 Terminal device
SC Authentication result screen
US Speaker

Claims (8)

1.  A voice authentication device comprising:
    an acquisition unit that acquires voice data;
    a detection unit that detects, from the acquired voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking;
    a synthesis unit that synthesizes the voice data of the non-speech section with the voice data of each of a plurality of speakers registered in advance; and
    an authentication unit that authenticates the speaker based on a plurality of synthesized voice data, in which the voice data of the non-speech section has been synthesized, and the voice data of the speech section.
2.  The voice authentication device according to claim 1, further comprising an extraction unit that extracts features of a speaker from voice data,
    wherein the authentication unit authenticates the speaker based on the extracted features of the plurality of synthesized voice data and the features of the voice data of the speech section.
3.  The voice authentication device according to claim 1, further comprising a calculation unit that calculates the similarity between each of the plurality of synthesized voice data and the voice data of the speech section,
    wherein the authentication unit authenticates the speaker based on the plurality of calculated similarities.
4.  The voice authentication device according to claim 3, wherein the authentication unit determines whether each calculated similarity is equal to or greater than a threshold, and identifies, as the speaker, the speaker corresponding to the synthesized voice data whose similarity is determined to be equal to or greater than the threshold.
5.  The voice authentication device according to claim 4, wherein the authentication unit generates and outputs an authentication result screen including information on the speaker whose similarity is equal to or greater than the threshold.
6.  The voice authentication device according to claim 5, wherein the authentication unit determines that the speaker cannot be identified when it determines that none of the plurality of calculated similarities is equal to or greater than the threshold.
7.  The voice authentication device according to claim 1, wherein the detection unit extracts noise contained in the non-speech section, and the synthesis unit synthesizes the noise with the voice data of each of the plurality of speakers.
8.  A voice authentication method comprising:
    acquiring voice data;
    detecting, from the acquired voice data, a speech section in which a speaker is speaking and a non-speech section in which the speaker is not speaking;
    synthesizing the voice data of the non-speech section with the voice data of each of a plurality of speakers registered in advance; and
    authenticating the speaker based on a plurality of synthesized voice data, in which the voice data of the non-speech section has been synthesized, and the voice data of the speech section.
PCT/JP2023/009468 2022-03-22 2023-03-10 Voice authentication device and voice authentication method WO2023182015A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022045389A JP2023139711A (en) 2022-03-22 2022-03-22 Voice authentication device and voice authentication method
JP2022-045389 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023182015A1 (en)

Family

ID=88101353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/009468 WO2023182015A1 (en) 2022-03-22 2023-03-10 Voice authentication device and voice authentication method

Country Status (2)

Country Link
JP (1) JP2023139711A (en)
WO (1) WO2023182015A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09120293A (en) * 1995-10-24 1997-05-06 Ricoh Co Ltd System and method for recognizing speaker
JPH11205451A (en) * 1998-01-19 1999-07-30 Canon Inc Speech recognition device, method therefor and computer readable memory
JP2006079079A (en) * 2004-09-06 2006-03-23 Samsung Electronics Co Ltd Distributed speech recognition system and its method
JP2020060757A (en) * 2018-10-05 2020-04-16 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Speaker recognition device, speaker recognition method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZEN, HEIGA ET AL.: "ICSLP 2006 Summary: Acoustic Modeling and Speech Synthesis", IPSJ SIG Technical Reports, Information Processing Society of Japan, vol. 2006, no. 136, 1 January 2006, pp. 179-184, XP009549622, ISSN: 0919-6072 *

Also Published As

Publication number Publication date
JP2023139711A (en) 2023-10-04

Similar Documents

Publication Publication Date Title
KR102339594B1 (en) Object recognition method, computer device, and computer-readable storage medium
EP3525204A1 (en) Method and apparatus to provide comprehensive smart assistant services
CN109493850B (en) Growing type dialogue device
JP3967952B2 (en) Grammar update system and method
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
US20150221305A1 (en) Multiple speech locale-specific hotword classifiers for selection of a speech locale
US20120022863A1 (en) Method and apparatus for voice activity detection
CN104217149A (en) Biometric authentication method and equipment based on voice
TW201250670A (en) Speech recognition device and a speech recognition method thereof
WO2022166218A1 (en) Method for adding punctuation during voice recognition and voice recognition device
JP2011059186A (en) Speech section detecting device and speech recognition device, program and recording medium
US9953633B2 (en) Speaker dependent voiced sound pattern template mapping
KR20180012639A (en) Voice recognition method, voice recognition device, apparatus comprising Voice recognition device, storage medium storing a program for performing the Voice recognition method, and method for making transformation model
CN105338327A (en) Video monitoring networking system capable of achieving speech recognition
WO2023182015A1 (en) Voice authentication device and voice authentication method
US11107476B2 (en) Speaker estimation method and speaker estimation device
JP2013083796A (en) Method for identifying male/female voice, male/female voice identification device, and program
US11950081B2 (en) Multi-channel speech compression system and method
US20220254361A1 (en) Multi-channel speech compression system and method
WO2023182014A1 (en) Voice authentication device and voice authentication method
WO2023182016A1 (en) Voice authentication device and voice authentication method
WO2023047893A1 (en) Authentication device and authentication method
WO2023233754A1 (en) Voice authentication device and voice authentication method
JP6693340B2 (en) Audio processing program, audio processing device, and audio processing method
US20230395063A1 (en) System and Method for Secure Transcription Generation

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23774614

Country of ref document: EP

Kind code of ref document: A1