WO2023281717A1 - Speaker diarization method, speaker diarization device, and speaker diarization program - Google Patents

Speaker diarization method, speaker diarization device, and speaker diarization program

Info

Publication number
WO2023281717A1
WO2023281717A1 (PCT application PCT/JP2021/025849)
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
frame
label
vector
utterance
Prior art date
Application number
PCT/JP2021/025849
Other languages
French (fr)
Japanese (ja)
Inventor
厚志 安藤
有実子 村田
岳至 森
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/025849
Publication of WO2023281717A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/04Training, enrolment or model building

Definitions

  • the present invention relates to a speaker diarization method, a speaker diarization device, and a speaker diarization program.
  • Conventionally, a technology called EEND (End-to-End Neural Diarization) based on deep learning has been disclosed as a speaker diarization technology (see Non-Patent Document 1).
  • In EEND, an acoustic signal is divided into frames, and a speaker label representing whether or not a specific speaker exists in the frame is estimated for each frame from the acoustic features extracted from each frame. When the maximum number of speakers in the acoustic signal is S, the speaker label for each frame is an S-dimensional vector; each dimension is 1 if the corresponding speaker is speaking in that frame and 0 otherwise. That is, EEND implements speaker diarization by performing multi-label binary classification over the S speakers.
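  • The following minimal sketch (not taken from the patent; the speaker count, frame count, and all values are illustrative) shows what this frame-by-frame multi-label representation looks like with S = 3 speakers: each row is a frame, each column a speaker, and overlapped speech simply has several 1s in one row, which is why the task reduces to independent binary decisions per speaker.

```python
import numpy as np

# Hypothetical example: S = 3 speakers, T = 5 frames.
# labels[t, s] == 1 means speaker s is speaking in frame t, 0 otherwise.
labels = np.array([
    [0, 0, 0],  # silence
    [1, 0, 0],  # speaker 0 only
    [1, 1, 0],  # overlap: speakers 0 and 1
    [0, 1, 0],  # speaker 1 only
    [0, 0, 1],  # speaker 2 only
])

# EEND-style output: one independent probability per (frame, speaker) pair,
# thresholded per dimension -> multi-label binary classification.
posteriors = np.array([
    [0.05, 0.10, 0.02],
    [0.90, 0.20, 0.01],
    [0.85, 0.75, 0.05],
    [0.10, 0.95, 0.03],
    [0.02, 0.05, 0.80],
])
estimated = (posteriors > 0.5).astype(int)
print((estimated == labels).all())  # True for this toy example
```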
  • The EEND model used in EEND for estimating the frame-by-frame speaker label sequence is a deep-learning-based model composed of layers through which errors can be backpropagated, and it can estimate the frame-by-frame speaker label sequence from the acoustic feature sequence in a single end-to-end pass.
  • the EEND model includes an RNN (Recurrent Neural Network) layer that performs time-series modeling. As a result, in EEND, it is possible to estimate the speaker label for each frame by using the acoustic features of not only the current frame but also the surrounding frames.
  • Bidirectional LSTM (Long Short-Term Memory)-RNN and Transformer Encoder are used in this RNN layer.
  • Non-Patent Document 2 describes multitask learning. Also, Non-Patent Document 3 describes a loss function based on distance learning. In addition, Non-Patent Document 4 describes an acoustic feature quantity.
  • The conventional EEND model learns only the frame-by-frame speaker utterance labels and does not consider whether speakers are similar to each other. Therefore, for acoustic signals containing different speakers with similar speaking styles and voice qualities, many speaker errors occurred in which the utterance period of one speaker was estimated to be the utterance period of another speaker.
  • the present invention has been made in view of the above, and it is an object of the present invention to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.
  • A speaker diarization method according to the present invention is a speaker diarization method executed by a speaker diarization apparatus, and includes: an extraction step of extracting a vector representing the speaker feature of each frame using a series of frame-by-frame acoustic features of an acoustic signal; an estimation step of estimating, using the extracted vector, a speaker utterance label representing the speaker of the vector; and a learning step of generating, through learning, a model for estimating the speaker utterance label of the vector of each frame, using a loss function that includes a loss function representing the identity of the speaker in each frame, calculated using the extracted vector, the estimated speaker utterance label representing the speaker of the vector, and the correct label of the speaker utterance label representing the speaker in each frame.
  • FIG. 1 is a diagram for explaining the outline of a speaker diarization device.
  • FIG. 2 is a schematic diagram illustrating a schematic configuration of the speaker diarization device.
  • FIG. 3 is a diagram for explaining the processing of the speaker diarization device.
  • FIG. 4 is a diagram for explaining the processing of the speaker diarization device.
  • FIG. 5 is a diagram for explaining the processing of the speaker diarization device.
  • FIG. 6 is a flow chart showing a speaker diarization processing procedure.
  • FIG. 7 is a flow chart showing a speaker diarization processing procedure.
  • FIG. 8 is a diagram illustrating a computer executing a speaker diarization program.
  • FIG. 1 is a diagram for explaining the outline of a speaker diarization device.
  • The speaker diarization apparatus of this embodiment adds a loss term that evaluates speaker identity for each speaker included in the acoustic signal, and trains the speaker diarization model 14a so as to take into account the speaker identity for each speaker included in the acoustic signal. This loss term is set so that speaker identity is high for the same speaker and low for different speakers. As a result, the speaker diarization device makes the speaker diarization model 14a learn that speakers are not the same speaker even if they speak in a similar way, thereby reducing speaker errors.
  • the speaker diarization model 14a has a speaker embedding extraction block, a speaker diarization result generation block, and a speaker vector generation block.
  • The speaker embedding extraction block extracts the speaker embedding of a section from the acoustic features of the (t-N)-th through (t+N)-th frames, a fixed window around the current t-th frame in the input acoustic feature sequence.
  • the speaker embedding is a vector containing speaker characteristics such as gender, age, and speaking style necessary for speaker diarization, and is a higher-dimensional vector than the speaker vector described later.
  • This speaker embedding extraction block is composed of a Linear layer, an RNN layer, an attention mechanism layer, and the like. Moreover, in the speaker diarization apparatus of this embodiment, unlike the conventional speaker diarization model, the speaker embedding extraction block refers only to a fixed-length, limited interval to estimate the speaker embedding.
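  • As a rough illustration of the fixed-length interval mentioned above (an assumed helper, not code from the patent; N, the feature dimension, and the zero-padding scheme are arbitrary choices), the context seen by the speaker embedding extraction block for frame t can be thought of as a (2N+1)-frame slice of the acoustic feature sequence:

```python
import torch

def context_window(features: torch.Tensor, t: int, n: int) -> torch.Tensor:
    """Return the (2n+1, D) slice of a (T, D) feature sequence centered on frame t.

    Frames outside [0, T) are zero-padded; the window length is fixed, so the
    embedding extractor never sees more than this limited interval.
    """
    total, dim = features.shape
    window = torch.zeros(2 * n + 1, dim)
    lo, hi = max(0, t - n), min(total, t + n + 1)
    window[lo - (t - n):hi - (t - n)] = features[lo:hi]
    return window

feats = torch.randn(100, 24)                   # T = 100 frames, D = 24 features
print(context_window(feats, t=0, n=7).shape)   # torch.Size([15, 24])
```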
  • The speaker diarization result generation block is composed of an RNN layer, a Linear layer, a sigmoid layer, and the like, and estimates the sequence of frame-by-frame speaker utterance labels based on the speaker embeddings obtained by the speaker embedding extraction block.
  • the speaker vector generation block generates speaker vectors based on speaker embedding.
  • a speaker vector is a vector that has the same information as the speaker embedding and has a lower dimension than the speaker embedding.
  • the speaker vector generation block is used only during learning of the speaker diarization model 14a, and is not used during inference, which will be described later.
  • This speaker vector generation block consists of a subsample layer and a Linear layer. The subsample layer randomly selects frames in which only one speaker is speaking and inputs the speaker embeddings of the selected frames to the following Linear layer.
  • the Linear layer projects the speaker embedding output from the subsample layer onto a speaker vector with a predetermined number of dimensions.
  • Parameter learning is performed in a multitask learning framework based on two loss functions: the frame-by-frame speaker utterance label loss and the speaker identity loss.
  • the frame-by-frame speaker utterance label loss is calculated using the frame-by-frame speaker utterance correct label sequence and the frame-by-frame speaker utterance estimation result sequence.
  • the speaker identity loss is calculated using the frame-by-frame speaker utterance correct label sequence of the frames selected in the subsample layer and the speaker label output by the Linear layer.
  • the speaker diarization device learns the speaker diarization model 14a so as to consider the speaker identity for each speaker included in the acoustic signal. Therefore, the speaker diarization device allows the speaker diarization model 14a to learn that the speakers are not the same speaker even if their speaking style is similar, thereby reducing speaker errors.
  • FIG. 2 is a schematic diagram illustrating a schematic configuration of the speaker diarization device.
  • FIGS. 3 to 5 are diagrams for explaining the processing of the speaker diarization device.
  • The speaker diarization apparatus 10 of this embodiment is implemented by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
  • the input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information such as processing start to the control unit 15 in response to input operations by the practitioner.
  • the output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like.
  • the communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and an external device such as a server or a device that acquires an acoustic signal via a network.
  • the storage unit 14 is implemented by semiconductor memory devices such as RAM (Random Access Memory) and flash memory, or storage devices such as hard disks and optical disks. Note that the storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13 . In this embodiment, the storage unit 14 stores, for example, a speaker diarization model 14a used for speaker diarization processing, which will be described later.
  • The control unit 15 is implemented using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. As illustrated in FIG. 2, the control unit 15 thereby functions as an acoustic feature extraction unit 15a, a speaker utterance label generation unit 15b, a speaker feature extraction unit 15c, a speaker utterance label estimation unit 15d, a speaker vector generation unit 15e, a learning unit 15f, and an utterance segment estimation unit 15g. Note that these functional units may each be implemented on different hardware; for example, the acoustic feature extraction unit 15a may be implemented on hardware different from that of the other functional units. The control unit 15 may also include other functional units.
  • The acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers. For example, the acoustic feature extraction unit 15a receives input of an acoustic signal via the input unit 11, or via the communication control unit 13 from a device that acquires acoustic signals. The acoustic feature extraction unit 15a divides the acoustic signal into frames, performs a discrete Fourier transform and filter bank multiplication on the signal of each frame to extract an acoustic feature vector, and outputs an acoustic feature sequence in which the vectors are concatenated in the frame direction. In this embodiment, the frame length is 25 ms and the frame shift width is 10 ms.
  • the acoustic feature vector is, for example, a 24-dimensional MFCC (Mel Frequency Cepstral Coefficient), but is not limited to this, and may be, for example, another acoustic feature quantity for each frame such as Mel filter bank output.
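  • A sketch of the acoustic feature extraction described above, using librosa as one possible (assumed) toolkit; the 25 ms frame length, 10 ms shift, and 24-dimensional MFCC follow the embodiment, while the file path and the 16 kHz sampling rate are placeholders:

```python
import librosa
import numpy as np

def extract_mfcc_sequence(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a (T, 24) MFCC sequence with 25 ms frames and a 10 ms shift."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=signal,
        sr=sr,
        n_mfcc=24,
        win_length=int(0.025 * sr),  # 25 ms frame length
        hop_length=int(0.010 * sr),  # 10 ms frame shift
    )
    return mfcc.T  # frames along the first axis: (T, 24)

# features = extract_mfcc_sequence("recording.wav")  # hypothetical file
```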
  • The speaker utterance label generation unit 15b uses the acoustic feature sequence to generate a speaker utterance label for each frame. Specifically, as shown in FIG. 3, which will be described later, the speaker utterance label generation unit 15b generates a frame-by-frame speaker utterance label using the acoustic feature sequence and the correct labels of the speakers' utterance periods. As a result, frame-by-frame speaker utterance labels and correct labels corresponding to the frames of the acoustic feature sequence are generated as teacher data used in the processing of the learning unit 15f, which will be described later.
  • the speaker feature extraction unit 15c extracts a vector representing the speaker feature of each frame using the series of acoustic features for each frame of the acoustic signal. Specifically, the speaker feature extraction unit 15c generates a speaker embedding vector by inputting the acoustic feature sequence acquired from the acoustic feature extraction unit 15a to the speaker embedding extraction block shown in FIG. As described above, the speaker embedding vector includes speaker characteristics such as gender, age, and speaking habits.
  • The speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate the speaker utterance label representing the speaker of the speaker embedding vector. Specifically, the speaker utterance label estimation unit 15d inputs the speaker embedding vector acquired from the speaker feature extraction unit 15c to the speaker diarization result generation block shown in FIG. 1, thereby obtaining a sequence of frame-by-frame speaker utterance label estimation results.
  • The speaker vector generation unit 15e uses the extracted speaker embedding vector and the correct label of the speaker utterance label of each frame to generate a speaker vector representing the speaker feature of a frame in which only one speaker is present.
  • FIG. 3 illustrates the processing in the speaker vector generation block shown in FIG. 1.
  • The speaker vector generation unit 15e inputs the speaker embedding vectors and the equally long sequence of correct frame-by-frame speaker utterance labels (frame-by-frame speaker utterance correct label sequence) to the subsample layer.
  • In the subsample layer, frames in which only one speaker is speaking according to the frame-by-frame correct speaker utterance labels are targeted, and up to K frames per speaker are randomly selected from the targeted frames.
  • In FIG. 3, the 2nd, 3rd, 6th, 8th, and 9th frames indicated by hatching are targeted, while the 1st, 7th, and 10th frames, in which no one is speaking, and the 4th and 5th frames, in which two or more speakers are speaking, are excluded.
  • the speaker embedding vector of the selected frame is input from the subsample layer to the Linear layer.
  • the input speaker embedding vector is projected onto a speaker vector with a predetermined number of dimensions.
  • the speaker vector generator 15e obtains the speaker vector of the selected frame.
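  • The selection-and-projection step above can be sketched as follows (an assumed PyTorch implementation; K, the embedding size, and the speaker-vector size are placeholders): frames whose correct label has exactly one active speaker are kept, at most K frames are sampled per speaker, and their embeddings are projected to lower-dimensional speaker vectors by the Linear layer.

```python
import torch
import torch.nn as nn

def subsample_single_speaker_frames(labels: torch.Tensor, k: int):
    """labels: (T, S) binary correct labels. Returns (frame_indices, speaker_ids)."""
    single = labels.sum(dim=1) == 1                    # frames with exactly one speaker
    frames, speakers = [], []
    for s in range(labels.shape[1]):
        idx = torch.nonzero(single & (labels[:, s] == 1), as_tuple=False).squeeze(1)
        if idx.numel() == 0:
            continue
        chosen = idx[torch.randperm(idx.numel())][:k]  # at most K frames per speaker
        frames.append(chosen)
        speakers.append(torch.full_like(chosen, s))
    return torch.cat(frames), torch.cat(speakers)

embeddings = torch.randn(100, 256)                     # (T, embedding_dim) from the extractor
labels = torch.randint(0, 2, (100, 4)).float()         # (T, S) toy correct labels
frames, speakers = subsample_single_speaker_frames(labels, k=8)

project = nn.Linear(256, 64)                           # Linear layer: embedding -> speaker vector
speaker_vectors = project(embeddings[frames])          # (num_selected, 64)
```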
  • Note that the speaker feature extraction unit 15c, the speaker utterance label estimation unit 15d, and the speaker vector generation unit 15e may be included in the learning unit 15f described later. For example, FIG. 4 shows an example in which the learning unit 15f performs the processing of the speaker feature extraction unit 15c and the speaker utterance label estimation unit 15d.
  • The speaker feature extraction unit 15c may also be included in the speaker utterance label estimation unit 15d. For example, FIG. 4 shows an example in which the speaker utterance label estimation unit 15d performs the processing of the speaker feature extraction unit 15c.
  • The learning unit 15f generates, through learning, the speaker diarization model 14a for estimating the speaker utterance label of the speaker embedding vector of each frame, using a loss function that includes a loss function representing the identity of the speaker in each frame, calculated using the extracted speaker embedding vector, the estimated speaker utterance label representing the speaker of the speaker embedding vector, and the correct label of the speaker utterance label of each frame.
  • Specifically, the learning unit 15f generates the speaker diarization model 14a through learning, using a loss function calculated from the generated speaker vector, the estimated speaker utterance label of the speaker embedding vector, and the correct label of the speaker utterance label of each frame.
  • That is, the learning unit 15f uses the speaker identity loss calculated from the speaker vectors generated by the speaker vector generation unit 15e and the correct frame-by-frame speaker utterance labels corresponding to those speaker vectors (the frame-by-frame speaker utterance correct label sequence after subsampling).
  • FIG. 5 shows an example of calculation of speaker identity loss.
  • the speaker identity loss is calculated using the loss function shown in the following equation (1) based on distance learning such as Generalized End-to-End Loss.
  • the speaker diarization model 14a is learned so that the cosine distance of the speaker vector is short for the same speaker, and the cosine distance of the speaker vector is long for a different speaker.
  • the speaker identity loss is not limited to Generalized End-to-End Loss, and may be a loss function based on other distance learning.
  • Triplet Loss or Siamese Loss may be used.
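  • Below is a compact, assumed sketch of a Generalized-End-to-End-style identity loss over the sub-sampled speaker vectors: each vector is pulled toward the centroid of its own speaker and pushed away from the centroids of the other speakers via scaled cosine similarity. The scaling parameters w and b are fixed here for brevity (they are learnable in GE2E), each centroid includes the vector itself as a simplification, and the exact form used in the patent's equation (1) may differ.

```python
import torch
import torch.nn.functional as F

def speaker_identity_loss(vectors: torch.Tensor, speakers: torch.Tensor,
                          w: float = 10.0, b: float = -5.0) -> torch.Tensor:
    """GE2E-style softmax loss.

    vectors:  (N, D) speaker vectors output by the Linear layer.
    speakers: (N,)   speaker id of each vector, taken from the correct labels.
    """
    ids = speakers.unique()
    centroids = torch.stack([vectors[speakers == s].mean(dim=0) for s in ids])  # (C, D)
    # Scaled cosine similarity of every vector to every centroid.
    sim = w * F.cosine_similarity(vectors.unsqueeze(1), centroids.unsqueeze(0), dim=-1) + b
    # Each vector should be most similar to its own speaker's centroid.
    id_to_idx = {int(s): i for i, s in enumerate(ids)}
    target = torch.tensor([id_to_idx[int(s)] for s in speakers])
    return F.cross_entropy(sim, target)

vecs = torch.randn(12, 64)
spk = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
print(speaker_identity_loss(vecs, spk))
```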
  • the speaker vector is calculated from the speaker embedding vector, and this speaker embedding vector is used in the processing of the speaker utterance label estimation unit 15d. This enables the speaker utterance label estimation unit 15d to estimate the speaker utterance label while considering whether the speaker of each frame is the same speaker.
  • the speaker diarization model 14a has, as shown in FIG. 1, a speaker embedding extraction block, a speaker diarization result generation block, and a speaker vector generation block.
  • The speaker embedding extraction block is composed of, for example, one Linear layer and two bidirectional LSTM-RNN layers, as shown in FIG. 1.
  • the speaker diarization result generation block is composed of, for example, three layers of bi-directional LSTM-RNN layers, one layer of Linear layer, and one layer of sigmoid layer.
  • the speaker vector generation block is composed of one subsample layer and one linear layer. However, the total number of layers and the number of hidden units may be changed.
  • The speaker diarization model 14a receives as input a frame-by-frame acoustic feature sequence of T frames × D dimensions, and outputs an estimated frame-by-frame speaker utterance label sequence of T × S dimensions in which each dimension takes a value in [0, 1].
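  • Putting the layer counts above together, a minimal, assumed PyTorch sketch of the model could look as follows; the hidden sizes are placeholders, the attention mechanism layer and the fixed-window processing of the embedding extraction block are omitted for brevity, and the subsample step before the speaker-vector projection is the selection shown earlier.

```python
import torch
import torch.nn as nn

class SpeakerDiarizationModel(nn.Module):
    """Sketch of the three blocks; layer counts follow the embodiment, sizes are assumed."""

    def __init__(self, feat_dim=24, hidden=256, num_speakers=10, spk_vec_dim=64):
        super().__init__()
        # Speaker embedding extraction block: 1 Linear layer + 2 bidirectional LSTM layers.
        self.embed_linear = nn.Linear(feat_dim, hidden)
        self.embed_rnn = nn.LSTM(hidden, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        # Speaker diarization result generation block:
        # 3 bidirectional LSTM layers + 1 Linear layer + sigmoid.
        self.diar_rnn = nn.LSTM(2 * hidden, hidden, num_layers=3,
                                batch_first=True, bidirectional=True)
        self.diar_linear = nn.Linear(2 * hidden, num_speakers)
        # Speaker vector generation block (training only): Linear projection applied
        # to the speaker embeddings of the sub-sampled single-speaker frames.
        self.spk_linear = nn.Linear(2 * hidden, spk_vec_dim)

    def forward(self, features):                              # features: (B, T, feat_dim)
        emb, _ = self.embed_rnn(self.embed_linear(features))  # speaker embeddings (B, T, 2*hidden)
        out, _ = self.diar_rnn(emb)
        posteriors = torch.sigmoid(self.diar_linear(out))     # (B, T, S), each value in [0, 1]
        return posteriors, emb

model = SpeakerDiarizationModel()
x = torch.randn(2, 500, 24)          # 2 recordings, 500 frames, 24-dim acoustic features
post, emb = model(x)
print(post.shape, emb.shape)         # torch.Size([2, 500, 10]) torch.Size([2, 500, 512])
```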
  • The learning unit 15f calculates the frame-by-frame speaker utterance label loss using the estimated speaker utterance labels and the correct labels of the frame-by-frame speaker utterance labels, and calculates the speaker identity loss using the extracted speaker embedding vectors and the correct labels of the frame-by-frame speaker utterance labels.
  • For the frame-by-frame speaker utterance label loss, the correct label sequence of the frame-by-frame speaker utterance labels generated by the speaker utterance label generation unit 15b is used.
  • As in the conventional method, the frame-by-frame speaker utterance label loss is calculated by multi-label binary cross-entropy that ignores the ordering of the speakers (that is, it is permutation-invariant with respect to the speakers).
  • The learning unit 15f uses the weighted sum of the two loss functions, the frame-by-frame speaker utterance label loss and the speaker identity loss, as the loss function of the entire model, and optimizes the parameters by error backpropagation.
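  • The combination of the two losses might be sketched as follows (assumptions: the permutation over speakers is resolved by brute force, which is only practical for a small S, and the weight on the identity loss is an arbitrary placeholder):

```python
from itertools import permutations

import torch
import torch.nn.functional as F

def frame_label_loss(posteriors: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant multi-label BCE over (T, S) posteriors and correct labels."""
    num_speakers = labels.shape[1]
    losses = []
    for perm in permutations(range(num_speakers)):   # brute force; fine only for small S
        losses.append(F.binary_cross_entropy(posteriors[:, list(perm)], labels))
    return torch.stack(losses).min()

def total_loss(posteriors, labels, identity_loss, weight: float = 0.1) -> torch.Tensor:
    """Weighted sum of the frame-by-frame label loss and the speaker identity loss."""
    return frame_label_loss(posteriors, labels) + weight * identity_loss

post = torch.rand(100, 3)                            # (T, S) estimated posteriors
lab = torch.randint(0, 2, (100, 3)).float()          # (T, S) correct labels
print(total_loss(post, lab, identity_loss=torch.tensor(1.5)))
```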
  • The speaker utterance label estimation unit 15d uses the generated speaker diarization model 14a to estimate a speaker utterance label for each frame of the acoustic signal. Specifically, as shown in FIG. 4, the speaker utterance label estimation unit 15d forward-propagates the acoustic feature sequence through the speaker embedding extraction block and the speaker diarization result generation block shown in FIG. 1, and obtains a sequence of frame-by-frame speaker utterance label estimation results (posterior probabilities).
  • The speech segment estimation unit 15g estimates the speakers' speech segments in the acoustic signal using the output speaker utterance label posterior probabilities. Specifically, the speech segment estimation unit 15g estimates the speaker utterance labels using a moving average over a plurality of frames. That is, the utterance segment estimation unit 15g first calculates, for each frame, the moving average of the frame-by-frame speaker utterance label posterior probabilities over a window of length 6 consisting of the current frame and the preceding 5 frames. This makes it possible to prevent erroneous detection of impractically short speech segments, such as speech lasting only one frame.
  • When the moving average of the posterior probability for a dimension indicates speech (for example, exceeds a predetermined threshold), the utterance segment estimation unit 15g estimates that the frame is an utterance segment of the speaker corresponding to that dimension. For each speaker, the utterance segment estimation unit 15g regards a group of consecutive utterance-segment frames as one utterance, and calculates the start time and end time of the utterance period from those frames to a predetermined time precision. As a result, the speech start time and speech end time, to a predetermined time precision, can be obtained for each utterance of each speaker.
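  • The post-processing just described might be sketched as follows (the 0.5 decision threshold and the conversion from frame indices to seconds via the 10 ms shift are assumptions made for illustration):

```python
import numpy as np

def extract_segments(posteriors: np.ndarray, window: int = 6,
                     threshold: float = 0.5, frame_shift: float = 0.010):
    """posteriors: (T, S) frame-by-frame speaker utterance label posteriors.

    Returns a list of (speaker, start_sec, end_sec) utterances.
    """
    frames, num_speakers = posteriors.shape
    kernel = np.ones(window) / window
    segments = []
    for s in range(num_speakers):
        # Moving average over the current frame and the preceding (window - 1) frames.
        smoothed = np.convolve(posteriors[:, s], kernel)[:frames]
        active = smoothed > threshold
        start = None
        for t in range(frames):
            if active[t] and start is None:
                start = t
            elif not active[t] and start is not None:
                segments.append((s, start * frame_shift, t * frame_shift))
                start = None
        if start is not None:
            segments.append((s, start * frame_shift, frames * frame_shift))
    return segments

post = np.random.rand(1000, 2)       # toy posteriors for 2 speakers
print(extract_segments(post)[:3])
```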
  • FIG. 6 shows the learning processing procedure.
  • the flowchart of FIG. 6 is started, for example, at the timing when an instruction to start the learning process is received.
  • the acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers, and outputs an acoustic feature sequence (step S1).
  • the speaker feature extraction unit 15c extracts a speaker embedding vector representing the speaker feature of each frame using the sequence of acoustic features for each frame of the acoustic signal (step S2).
  • The learning unit 15f generates, through learning, the speaker diarization model 14a for estimating the speaker utterance label of the speaker embedding vector of each frame, using a loss function that includes a loss function representing the identity of the speaker in each frame, calculated using the extracted speaker embedding vector, the estimated speaker utterance label representing the speaker of the speaker embedding vector, and the correct label of the speaker utterance label of each frame (step S3).
  • the speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate the speaker utterance label representing the speaker of the speaker embedding vector.
  • The speaker vector generation unit 15e generates a speaker vector representing the speaker feature of a frame in which only one speaker is present, using the extracted speaker embedding vector and the correct label of the speaker utterance label of each frame.
  • The learning unit 15f calculates a loss function representing the identity of the speaker in each frame, using the generated speaker vectors and the correct labels of the frame-by-frame speaker utterance labels. In addition, the learning unit 15f calculates the loss function for the frame-by-frame speaker utterance labels, using the estimated frame-by-frame speaker utterance label sequence and the generated correct label sequence of the frame-by-frame speaker utterance labels. Then, the learning unit 15f generates the speaker diarization model 14a using the weighted sum of the two loss functions as the loss function of the entire model. This completes the series of learning processes.
  • FIG. 7 shows the estimation processing procedure.
  • the flowchart in FIG. 7 is started, for example, when an input instructing the start of the estimation process is received.
  • the acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers, and outputs an acoustic feature sequence (step S1).
  • the speaker's utterance label estimation unit 15d uses the generated speaker's diarization model 14a to estimate the speaker's utterance label for each frame of the acoustic signal (step S4). Specifically, the speaker utterance label estimation unit 15d outputs the speaker utterance label posterior probability (estimated value of the speaker utterance label) for each frame of the acoustic feature sequence.
  • the utterance segment estimation unit 15g estimates the utterance segment of the speaker in the acoustic signal using the output speaker utterance label posterior probability (step S5). This completes a series of estimation processes.
  • As described above, in the speaker diarization device 10 of this embodiment, the speaker feature extraction unit 15c extracts a speaker embedding vector representing the speaker feature of each frame, using the sequence of frame-by-frame acoustic features of the acoustic signal.
  • a speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate a speaker utterance label representing the speaker of the speaker embedding vector.
  • The learning unit 15f generates, through learning, the speaker diarization model 14a for estimating the speaker utterance label of the speaker feature vector of each frame, using a loss function that includes a loss function representing the identity of the speaker in each frame, calculated using the extracted speaker embedding vector, the estimated speaker utterance label representing the speaker of the speaker embedding vector, and the correct label of the speaker utterance label of each frame.
  • the speaker diarization device 10 can estimate the speaker's utterance label by considering whether the speaker of the speaker embedding vector is the same speaker in each frame. This makes it possible to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.
  • Also, the speaker vector generation unit 15e generates a speaker vector representing the speaker feature of a frame with a single speaker, using the speaker embedding vector and the correct label of the speaker utterance label of each frame.
  • The learning unit 15f then generates the speaker diarization model 14a through learning, using a loss function calculated from the generated speaker vector, the estimated speaker utterance label of the speaker embedding vector, and the correct label of the speaker utterance label of each frame.
  • As a result, the speaker diarization device 10 can estimate the speaker utterance labels while considering whether the speaker of the speaker vector generated from the speaker embedding vector is the same speaker in each frame. This makes it possible to more accurately perform speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.
  • Also, the learning unit 15f generates the speaker diarization model 14a using, as the loss function, the weighted sum of the loss function for the frame-by-frame speaker utterance labels, calculated using the estimated speaker utterance labels and the correct labels of the speaker utterance labels of each frame, and the loss function representing the identity of the speaker, calculated using the extracted speaker embedding vectors and the correct labels of the speaker utterance labels of each frame. This enables the speaker diarization device 10 to perform speaker diarization with higher accuracy.
  • the speaker's utterance label estimation unit 15d uses the generated speaker's diarization model 14a to estimate the speaker's utterance label for each frame of the acoustic signal.
  • the speaker diarization apparatus can perform highly accurate speaker diarization while suppressing speaker errors by considering whether or not the speaker of each frame is the same speaker.
  • the speech period estimation unit 15g estimates the speaker label using the moving average of a plurality of frames. This enables the speaker diarization device 10 to prevent erroneous detection of unrealistically short speech segments.
  • the speaker diarization apparatus 10 can be implemented by installing a speaker diarization program for executing the above-described speaker diarization processing as package software or online software on a desired computer.
  • the information processing apparatus can function as the speaker diarization apparatus 10 by causing the information processing apparatus to execute the speaker diarization program.
  • information processing devices include mobile communication terminals such as smartphones, mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).
  • the functions of the speaker diarization device 10 may be implemented in a cloud server.
  • FIG. 8 is a diagram showing an example of a computer that executes a speaker diarization program.
  • Computer 1000 includes, for example, memory 1010 , CPU 1020 , hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012 .
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1031 .
  • Disk drive interface 1040 is connected to disk drive 1041 .
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example.
  • a mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example.
  • a display 1061 is connected to the video adapter 1060 .
  • the hard disk drive 1031 stores an OS 1091, application programs 1092, program modules 1093 and program data 1094, for example. Each piece of information described in the above embodiment is stored in the hard disk drive 1031 or the memory 1010, for example.
  • the speaker diarization program is stored in the hard disk drive 1031 as a program module 1093 in which instructions to be executed by the computer 1000 are written, for example.
  • the hard disk drive 1031 stores a program module 1093 that describes each process executed by the speaker diarization apparatus 10 described in the above embodiment.
  • Data used for information processing by the speaker diarization program is stored as program data 1094 in the hard disk drive 1031, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary, and executes each procedure described above.
  • Note that the program module 1093 and the program data 1094 relating to the speaker diarization program are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like.
  • Alternatively, the program module 1093 and the program data 1094 relating to the speaker diarization program may be stored in another computer connected via a network such as a LAN (Local Area Network) or WAN (Wide Area Network), and read by the CPU 1020 via the network interface 1070.
  • speaker diarization device 11 input unit 12 output unit 13 communication control unit 14 storage unit 14a speaker diarization model 15 control unit 15a acoustic feature extraction unit 15b speaker utterance label generation unit 15c speaker feature extraction unit 15d speaker utterance Label estimation unit 15e Speaker vector generation unit 15f Learning unit 15g Speech segment estimation unit

Abstract

A speaker feature extraction unit (15c) extracts, using a sequence of acoustic features of frames of an acoustic signal, a speaker embedding vector indicating a speaker feature of each of the frames. A speaker utterance label inference unit (15d) infers, using the extracted speaker embedding vector, a speaker utterance label indicating a speaker of the speaker embedding vector. By using loss functions that are calculated by using the extracted speaker embedding vector, the inferred speaker utterance label indicating the speaker of the speaker embedding vector, and a correct answer label of a speaker utterance label of each of the frames and that include a loss function indicating identity of the speaker of the frames, a training unit (15f) trains and generates a speaker diarization model (14a) for inferring a speaker utterance label of the speaker feature vector of each of the frames.

Description

Speaker diarization method, speaker diarization device, and speaker diarization program
The present invention relates to a speaker diarization method, a speaker diarization device, and a speaker diarization program.
In recent years, expectations have been high for a speaker diarization technology that takes an acoustic signal as input and identifies the utterance intervals of all speakers included in the acoustic signal. According to the speaker diarization technology, various applications are possible, for example, automatic transcription for recording who spoke when in a conference, automatic extraction of speech between an operator and a customer from a call in a contact center, and the like.
Conventionally, a technology called EEND (End-to-End Neural Diarization) based on deep learning has been disclosed as a speaker diarization technology (see Non-Patent Document 1). In EEND, an acoustic signal is divided into frames, and a speaker label representing whether or not a specific speaker exists in the frame is estimated for each frame from the acoustic features extracted from each frame. When the maximum number of speakers in the acoustic signal is S, the speaker label for each frame is an S-dimensional vector; each dimension is 1 if the corresponding speaker is speaking in that frame and 0 otherwise. That is, EEND implements speaker diarization by performing multi-label binary classification over the S speakers.
The EEND model used in EEND for estimating the frame-by-frame speaker label sequence is a deep-learning-based model composed of layers through which errors can be backpropagated, and it can estimate the frame-by-frame speaker label sequence from the acoustic feature sequence in a single end-to-end pass. The EEND model includes an RNN (Recurrent Neural Network) layer that performs time-series modeling. This makes it possible in EEND to estimate the speaker label for each frame using the acoustic features of not only the current frame but also the surrounding frames. A bidirectional LSTM (Long Short-Term Memory)-RNN or a Transformer Encoder is used for this RNN layer.
Note that Non-Patent Document 2 describes multitask learning, Non-Patent Document 3 describes a loss function based on distance learning, and Non-Patent Document 4 describes acoustic features.
However, with the conventional technology, it has been difficult to accurately perform speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities. That is, the conventional EEND model learns only the frame-by-frame speaker utterance labels and does not consider whether speakers are similar to each other. Therefore, for acoustic signals containing different speakers with similar speaking styles and voice qualities, many speaker errors occurred in which the utterance period of one speaker was estimated to be the utterance period of another speaker.
The present invention has been made in view of the above, and an object of the present invention is to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.
In order to solve the above problems and achieve the object, a speaker diarization method according to the present invention is a speaker diarization method executed by a speaker diarization apparatus, and includes: an extraction step of extracting a vector representing the speaker feature of each frame using a series of frame-by-frame acoustic features of an acoustic signal; an estimation step of estimating, using the extracted vector, a speaker utterance label representing the speaker of the vector; and a learning step of generating, through learning, a model for estimating the speaker utterance label of the vector of each frame, using a loss function that includes a loss function representing the identity of the speaker in each frame, calculated using the extracted vector, the estimated speaker utterance label representing the speaker of the vector, and the correct label of the speaker utterance label representing the speaker in each frame.
According to the present invention, it is possible to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.
FIG. 1 is a diagram for explaining the outline of a speaker diarization device. FIG. 2 is a schematic diagram illustrating a schematic configuration of the speaker diarization device. FIGS. 3 to 5 are diagrams for explaining the processing of the speaker diarization device. FIGS. 6 and 7 are flow charts showing speaker diarization processing procedures. FIG. 8 is a diagram illustrating a computer executing a speaker diarization program.
An embodiment of the present invention will be described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals.
[Overview of speaker diarization device]
FIG. 1 is a diagram for explaining the outline of a speaker diarization device. The speaker diarization apparatus of this embodiment adds a loss term that evaluates speaker identity for each speaker included in the acoustic signal, and trains the speaker diarization model 14a so as to take into account the speaker identity for each speaker included in the acoustic signal. This loss term is set so that speaker identity is high for the same speaker and low for different speakers. As a result, the speaker diarization device makes the speaker diarization model 14a learn that speakers are not the same speaker even if they speak in a similar way, thereby reducing speaker errors.
Specifically, as shown in FIG. 1, the speaker diarization model 14a has a speaker embedding extraction block, a speaker diarization result generation block, and a speaker vector generation block.
The speaker embedding extraction block extracts the speaker embedding of a section from the acoustic features of the (t-N)-th through (t+N)-th frames, a fixed window around the current t-th frame in the input acoustic feature sequence. The speaker embedding is a vector containing speaker characteristics such as gender, age, and speaking style that are necessary for speaker diarization, and is a higher-dimensional vector than the speaker vector described later.
This speaker embedding extraction block is composed of a Linear layer, an RNN layer, an attention mechanism layer, and the like. In the speaker diarization apparatus of this embodiment, unlike the conventional speaker diarization model, the speaker embedding extraction block refers only to a fixed-length, limited interval to estimate the speaker embedding.
The speaker diarization result generation block is composed of an RNN layer, a Linear layer, a sigmoid layer, and the like, and estimates the sequence of frame-by-frame speaker utterance labels based on the speaker embeddings obtained by the speaker embedding extraction block.
The speaker vector generation block generates speaker vectors based on the speaker embeddings. A speaker vector is a vector that has the same information as the speaker embedding but is of lower dimension than the speaker embedding. The speaker vector generation block is used only during learning of the speaker diarization model 14a, and is not used during inference, which will be described later.
This speaker vector generation block consists of a subsample (subsampling) layer and a Linear layer. The subsample layer randomly selects frames in which only one speaker is speaking, and inputs the speaker embeddings of the selected frames to the following Linear layer. The Linear layer projects the speaker embeddings output from the subsample layer onto speaker vectors with a predetermined number of dimensions.
In the speaker diarization model 14a, parameter learning is performed in a multitask learning framework based on two loss functions: the frame-by-frame speaker utterance label loss and the speaker identity loss.
As shown in FIG. 1, the frame-by-frame speaker utterance label loss is calculated using the frame-by-frame speaker utterance correct label sequence and the frame-by-frame speaker utterance estimation result sequence. The speaker identity loss is calculated using the frame-by-frame speaker utterance correct label sequence of the frames selected in the subsample layer and the speaker labels output by the Linear layer.
As a result, the speaker diarization device trains the speaker diarization model 14a so as to take into account the speaker identity for each speaker included in the acoustic signal. Therefore, the speaker diarization device allows the speaker diarization model 14a to learn that speakers are not the same speaker even if their speaking styles are similar, thereby reducing speaker errors.
[Configuration of speaker diarization device]
FIG. 2 is a schematic diagram illustrating a schematic configuration of the speaker diarization device, and FIGS. 3 to 5 are diagrams for explaining the processing of the speaker diarization device. First, as illustrated in FIG. 2, the speaker diarization apparatus 10 of this embodiment is implemented by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
The input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator. The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like. The communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication, via a network, between the control unit 15 and external devices such as a server or a device that acquires acoustic signals.
The storage unit 14 is implemented by a semiconductor memory device such as a RAM (Random Access Memory) or flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In this embodiment, the storage unit 14 stores, for example, the speaker diarization model 14a used for the speaker diarization processing described later.
The control unit 15 is implemented using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. As illustrated in FIG. 2, the control unit 15 thereby functions as an acoustic feature extraction unit 15a, a speaker utterance label generation unit 15b, a speaker feature extraction unit 15c, a speaker utterance label estimation unit 15d, a speaker vector generation unit 15e, a learning unit 15f, and an utterance segment estimation unit 15g. These functional units may each be implemented on different hardware; for example, the acoustic feature extraction unit 15a may be implemented on hardware different from that of the other functional units. The control unit 15 may also include other functional units.
The acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers. For example, the acoustic feature extraction unit 15a receives input of an acoustic signal via the input unit 11, or via the communication control unit 13 from a device that acquires acoustic signals. The acoustic feature extraction unit 15a divides the acoustic signal into frames, performs a discrete Fourier transform and filter bank multiplication on the signal of each frame to extract an acoustic feature vector, and outputs an acoustic feature sequence in which the vectors are concatenated in the frame direction. In this embodiment, the frame length is 25 ms and the frame shift width is 10 ms.
Here, the acoustic feature vector is, for example, a 24-dimensional MFCC (Mel Frequency Cepstral Coefficient), but is not limited to this and may be another frame-by-frame acoustic feature, such as a Mel filter bank output.
The speaker utterance label generation unit 15b uses the acoustic feature sequence to generate a speaker utterance label for each frame. Specifically, as shown in FIG. 3 described later, the speaker utterance label generation unit 15b generates frame-by-frame speaker utterance labels using the acoustic feature sequence and the correct labels of the speakers' utterance periods. As a result, frame-by-frame speaker utterance labels and correct labels corresponding to the frames of the acoustic feature sequence are generated as teacher data used in the processing of the learning unit 15f described later.
Here, when the number of speakers is S (speaker 1, speaker 2, ..., speaker S), the speaker utterance label of the t-th frame (t = 0, 1, ..., T) is an S-dimensional vector. For example, when the frame at time t × frame shift width is included in the utterance period of a certain speaker, the value of the dimension corresponding to that speaker is 1, and the values of the other dimensions are 0. Therefore, the frame-by-frame speaker utterance labels form a T × S dimensional binary [0, 1] multi-label. In this embodiment, for example, S = 10.
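As a small illustration of this labeling scheme (an assumed helper, with a 10 ms frame shift and toy utterance intervals; none of this code appears in the publication), the correct label of frame t is obtained by checking which speakers' utterance periods contain the time t × frame shift width:

```python
import numpy as np

def utterances_to_frame_labels(utterances, num_frames, num_speakers,
                               frame_shift: float = 0.010) -> np.ndarray:
    """utterances: list of (speaker_index, start_sec, end_sec).

    Returns a (T, S) binary multi-label matrix: entry (t, s) is 1 when time
    t * frame_shift falls inside an utterance of speaker s, and 0 otherwise.
    """
    labels = np.zeros((num_frames, num_speakers), dtype=np.int64)
    times = np.arange(num_frames) * frame_shift
    for speaker, start, end in utterances:
        labels[(times >= start) & (times < end), speaker] = 1
    return labels

toy = [(0, 0.00, 0.30), (1, 0.25, 0.60)]   # two overlapping toy utterances
print(utterances_to_frame_labels(toy, num_frames=80, num_speakers=10).sum(axis=0))
```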
Returning to FIG. 2, the speaker feature extraction unit 15c extracts a vector representing the speaker feature of each frame, using the sequence of frame-by-frame acoustic features of the acoustic signal. Specifically, the speaker feature extraction unit 15c generates a speaker embedding vector by inputting the acoustic feature sequence acquired from the acoustic feature extraction unit 15a to the speaker embedding extraction block shown in FIG. 1. As described above, the speaker embedding vector encapsulates speaker characteristics such as gender, age, and speaking habits.
The speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate the speaker utterance label representing the speaker of the speaker embedding vector. Specifically, the speaker utterance label estimation unit 15d inputs the speaker embedding vector acquired from the speaker feature extraction unit 15c to the speaker diarization result generation block shown in FIG. 1, thereby obtaining a sequence of frame-by-frame speaker utterance label estimation results.
The speaker vector generation unit 15e uses the extracted speaker embedding vector and the correct labels of the frame-by-frame speaker utterance labels to generate a speaker vector representing the speaker feature of a frame in which only one speaker is present.
Here, FIG. 3 illustrates the processing in the speaker vector generation block shown in FIG. 1. As shown in FIG. 3, the speaker vector generation unit 15e inputs the speaker embedding vectors and the equally long sequence of correct frame-by-frame speaker utterance labels (frame-by-frame speaker utterance correct label sequence) to the subsample layer.
In the subsample layer, frames in which only one speaker is speaking according to the frame-by-frame correct speaker utterance labels are targeted, and up to K frames per speaker are randomly selected from the targeted frames. In FIG. 3, the 2nd, 3rd, 6th, 8th, and 9th frames indicated by hatching are targeted, while the 1st, 7th, and 10th frames, in which no one is speaking, and the 4th and 5th frames, in which two or more speakers are speaking, are excluded.
The speaker embedding vectors of the selected frames are then input from the subsample layer to the Linear layer. In the Linear layer, each input speaker embedding vector is projected onto a speaker vector with a predetermined number of dimensions. The speaker vector generation unit 15e thereby obtains the speaker vectors of the selected frames.
 Note that the speaker feature extraction unit 15c, the speaker utterance label estimation unit 15d, and the speaker vector generation unit 15e may be included in the learning unit 15f described later. For example, FIG. 4 shows an example in which the learning unit 15f performs the processing of the speaker feature extraction unit 15c and the speaker utterance label estimation unit 15d.
 The speaker feature extraction unit 15c may also be included in the speaker utterance label estimation unit 15d. For example, FIG. 4 shows an example in which the speaker utterance label estimation unit 15d performs the processing of the speaker feature extraction unit 15c.
 図2の説明に戻る。学習部15fは、抽出された話者埋め込みベクトルと、推定された該話者埋め込みベクトルの話者を表す話者発話ラベルと、各フレームの話者発話ラベルの正解ラベルとを用いて算出した、各フレームの話者の同一性を表す損失関数を含む損失関数を用いて、各フレームの話者埋め込みベクトルの話者発話ラベルを推定する話者ダイアライゼーションモデル14aを学習により生成する。 Return to the description of Figure 2. The learning unit 15f calculates using the extracted speaker embedding vector, the speaker utterance label representing the speaker of the estimated speaker embedding vector, and the correct label of the speaker utterance label of each frame, A speaker diarization model 14a for estimating the speaker utterance label of the speaker embedding vector of each frame is generated by learning using a loss function including a loss function representing speaker identity of each frame.
 具体的には、学習部15fは、生成された話者ベクトルと、推定された話者埋め込みベクトルの話者発話ラベルと、各フレームの話者発話ラベルの正解ラベルとを用いて算出した損失関数を用いて、話者ダイアライゼーションモデル14aを学習により生成する。 Specifically, the learning unit 15f calculates the loss function using the generated speaker vector, the speaker utterance label of the estimated speaker embedding vector, and the correct label of the speaker utterance label of each frame. is used to generate the speaker diarization model 14a by learning.
 すなわち、学習部15fは、図1に示したように、上記した話者ベクトル生成部15eが生成した話者ベクトルと、この話者ベクトルに対応するフレームのフレームごとの話者発話ラベルの正解ラベル(サブサンプリング後フレームごと話者発話正解ラベル系列)とを用いて算出された、話者同一性損失を用いる。 That is, as shown in FIG. 1, the learning unit 15f generates the speaker vector generated by the speaker vector generation unit 15e and the correct speaker utterance label for each frame corresponding to the speaker vector. (speaker utterance correct label sequence for each frame after subsampling) is used.
 ここで、図5には、話者同一性損失の算出例が示されている。話者同一性損失は、図5に示すように、例えば、Generalized End-to-End Loss等の距離学習に基づく、次式(1)に示す損失関数を用いて算出される。 Here, FIG. 5 shows an example of calculation of speaker identity loss. As shown in FIG. 5, the speaker identity loss is calculated using the loss function shown in the following equation (1) based on distance learning such as Generalized End-to-End Loss.
 (Equation (1) appears as an image, JPOXMLDOC01-appb-M000001, in the original publication.)
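 The formula of Equation (1) itself is published only as an image and is not reproduced in this text. For reference, a widely used softmax formulation of the Generalized End-to-End Loss, which the description cites as one example, can be written as L(e_{ji}) = -\log \frac{\exp(w\,\cos(e_{ji}, c_j) + b)}{\sum_{k=1}^{S}\exp(w\,\cos(e_{ji}, c_k) + b)}, where e_{ji} is the i-th speaker vector of speaker j, c_k is the centroid of the speaker vectors of speaker k, and w > 0 and b are learnable scale and bias parameters; the speaker identity loss is then the sum of L(e_{ji}) over all subsampled frames. This notation follows the GE2E literature rather than the publication itself, so it should be read as an illustrative reconstruction and not as the claimed Equation (1).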
 In this way, the speaker diarization model 14a is trained so that the cosine distance between speaker vectors is small for the same speaker and large for different speakers.
 Note that the speaker identity loss is not limited to the Generalized End-to-End Loss and may be another metric-learning-based loss function. For example, the Triplet Loss or the Siamese Loss may be used.
 The speaker vectors are thus calculated from the speaker embedding vectors, and those speaker embedding vectors are used in the processing of the speaker utterance label estimation unit 15d. This enables the speaker utterance label estimation unit 15d to estimate the speaker utterance labels while taking into account whether the speakers of the respective frames are the same speaker.
 As shown in FIG. 1, the speaker diarization model 14a has a speaker embedding extraction block, a speaker diarization result generation block, and a speaker vector generation block. The speaker embedding extraction block is composed of, for example, one Linear layer and two bidirectional LSTM-RNN layers, as shown in FIG. 1. The speaker diarization result generation block is composed of, for example, three bidirectional LSTM-RNN layers, one Linear layer, and one sigmoid layer. The speaker vector generation block is composed of one subsample layer and one Linear layer. The number of layers and the number of hidden units may, however, be changed.
 The speaker diarization model 14a receives as input a frame-wise acoustic feature sequence of T frames × D dimensions, and outputs an estimated frame-wise speaker utterance label sequence of T × S dimensions in which each dimension takes a value in [0, 1].
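 As an illustration only, the block structure and input/output shapes described above could be realized in PyTorch roughly as follows; the hidden sizes, the class name, and the speaker-vector dimension are assumptions and are not taken from the publication.

    import torch
    import torch.nn as nn

    class SpeakerDiarizationModel(nn.Module):
        """Sketch of the described model: speaker embedding extraction block,
        speaker diarization result generation block, and a speaker vector projection."""

        def __init__(self, feat_dim, num_speakers, hidden=256, spk_dim=128):
            super().__init__()
            # Speaker embedding extraction block: 1 Linear layer + 2 bidirectional LSTM-RNN layers
            self.embed_in = nn.Linear(feat_dim, hidden)
            self.embed_rnn = nn.LSTM(hidden, hidden, num_layers=2,
                                     bidirectional=True, batch_first=True)
            # Speaker diarization result generation block: 3 bidirectional LSTM-RNN layers + Linear + sigmoid
            self.diar_rnn = nn.LSTM(2 * hidden, hidden, num_layers=3,
                                    bidirectional=True, batch_first=True)
            self.diar_out = nn.Linear(2 * hidden, num_speakers)
            # Speaker vector generation block: Linear projection applied to subsampled frames
            self.spk_proj = nn.Linear(2 * hidden, spk_dim)

        def forward(self, feats):                          # feats: (B, T, D)
            emb, _ = self.embed_rnn(self.embed_in(feats))  # (B, T, 2*hidden) speaker embeddings
            h, _ = self.diar_rnn(emb)
            labels = torch.sigmoid(self.diar_out(h))       # (B, T, S), each value in [0, 1]
            return labels, emb

 In this sketch the subsample-and-project step shown earlier would be applied to emb with spk_proj during training only.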
 The learning unit 15f generates the speaker diarization model 14a by learning, using a loss function that is a weighted sum of a loss function for the frame-wise speaker utterance labels, calculated using the estimated speaker utterance labels and the correct frame-wise speaker utterance labels, and a loss function representing speaker identity, calculated using the extracted speaker embedding vectors and the correct frame-wise speaker utterance labels.
 That is, in addition to the speaker identity loss described above, the speaker diarization model 14a uses, as shown in FIG. 1, a frame-wise speaker utterance label loss calculated using the sequence of frame-wise speaker utterance labels estimated by the speaker utterance label estimation unit 15d and the sequence of correct frame-wise speaker utterance labels generated by the speaker utterance label generation unit 15b. As in the conventional method, the frame-wise speaker utterance label loss is calculated as a multi-label binary cross-entropy that ignores the ordering of the speakers, that is, it is permutation-invariant.
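 A permutation-invariant multi-label binary cross-entropy of the kind referred to above could be sketched as follows; the exhaustive search over speaker orderings is an assumption made for clarity and scales factorially with the number of speakers S.

    import itertools
    import torch
    import torch.nn.functional as F

    def permutation_free_bce(pred, target):
        """Frame-wise speaker utterance label loss, minimized over speaker orderings.

        pred:   (T, S) estimated speaker utterance label posteriors in [0, 1]
        target: (T, S) correct 0/1 speaker utterance labels
        """
        target = target.float()
        losses = []
        for perm in itertools.permutations(range(pred.shape[1])):
            losses.append(F.binary_cross_entropy(pred, target[:, list(perm)]))
        return torch.stack(losses).min()      # keep the best-matching speaker order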
 Then, as shown in the following Equation (2), the learning unit 15f uses the weighted sum of the two loss functions, the frame-wise speaker utterance label loss and the speaker identity loss, as the loss function of the entire model, and optimizes the parameters by error backpropagation. Here, α is a weighting parameter; in this embodiment, α = 0.5, for example.
 (Equation (2) appears as an image, JPOXMLDOC01-appb-M000002, in the original publication.)
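 Equation (2) is likewise published only as an image. Based on the surrounding description it is a weighted sum of the frame-wise speaker utterance label loss and the speaker identity loss with weighting parameter α; one plausible reading, in which the exact placement of α is an assumption, is L_total = (1 - α) · L_label + α · L_SI, with α = 0.5 in this embodiment, where L_label denotes the frame-wise speaker utterance label loss and L_SI the speaker identity loss.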
 Returning to the description of FIG. 2, the speaker utterance label estimation unit 15d estimates the frame-wise speaker utterance labels of an acoustic signal using the generated speaker diarization model 14a. Specifically, as shown in FIG. 4, the speaker utterance label estimation unit 15d obtains a sequence of frame-wise speaker utterance label estimation results (posterior probabilities) by forward-propagating the acoustic feature sequence through the speaker embedding extraction block and the speaker diarization result generation block shown in FIG. 1.
 The utterance segment estimation unit 15g estimates the utterance segments of the speakers in the acoustic signal using the output speaker utterance label posterior probabilities. Specifically, the utterance segment estimation unit 15g estimates the speaker utterance labels using a moving average over a plurality of frames. That is, the utterance segment estimation unit 15g first calculates, for the frame-wise speaker utterance label posterior probabilities, a moving average of length 6 over the current frame and the five preceding frames. This makes it possible to prevent erroneous detection of unrealistically short utterance segments, such as utterances lasting only one frame.
 Next, when the calculated moving average value is greater than 0.5, the utterance segment estimation unit 15g determines that the frame belongs to an utterance segment of the speaker corresponding to that dimension. Furthermore, for each speaker, the utterance segment estimation unit 15g regards a group of consecutive utterance segment frames as one utterance, and calculates back from the frame indices the start time and end time of the utterance segment relative to a predetermined reference time. As a result, the utterance start time and utterance end time relative to the reference time can be obtained for each utterance of each speaker.
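 For illustration, and assuming a fixed frame shift of 10 ms together with the function and variable names used below, the smoothing and segment extraction just described might look like this:

    import numpy as np

    def estimate_segments(posteriors, frame_shift=0.01, win=6, threshold=0.5):
        """Convert frame-wise posteriors (T, S) into per-speaker (start, end) segments in seconds."""
        T, S = posteriors.shape
        segments = {s: [] for s in range(S)}
        for s in range(S):
            # Moving average over the current frame and the preceding win - 1 frames
            padded = np.concatenate([np.zeros(win - 1), posteriors[:, s]])
            smoothed = np.convolve(padded, np.ones(win) / win, mode="valid")
            active = smoothed > threshold
            start = None
            for t in range(T):
                if active[t] and start is None:
                    start = t                                  # segment opens
                elif not active[t] and start is not None:
                    segments[s].append((start * frame_shift, t * frame_shift))
                    start = None                               # segment closes
            if start is not None:                              # segment still open at the end
                segments[s].append((start * frame_shift, T * frame_shift))
        return segments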
[Speaker diarization processing]
 Next, the speaker diarization processing performed by the speaker diarization device 10 will be described. FIGS. 6 and 7 are flowcharts showing the speaker diarization processing procedure. The speaker diarization processing of the present embodiment includes a learning process and an estimation process. First, FIG. 6 shows the learning process procedure. The flowchart of FIG. 6 is started, for example, at the timing when an input instructing the start of the learning process is received.
 First, the acoustic feature extraction unit 15a extracts the acoustic features of each frame of an acoustic signal containing the utterances of a plurality of speakers, and outputs an acoustic feature sequence (step S1).
 Next, the speaker feature extraction unit 15c extracts a speaker embedding vector representing the speaker features of each frame, using the sequence of frame-wise acoustic features of the acoustic signal (step S2).
 Then, the learning unit 15f generates, by learning, the speaker diarization model 14a that estimates the speaker utterance label of the speaker embedding vector of each frame, using a loss function that includes a loss function representing the speaker identity of each frame, calculated using the extracted speaker embedding vectors, the estimated speaker utterance labels representing the speakers of those speaker embedding vectors, and the correct frame-wise speaker utterance labels (step S3).
 Specifically, the speaker utterance label estimation unit 15d estimates, using the extracted speaker embedding vectors, the speaker utterance labels representing the speakers of those speaker embedding vectors. In addition, the speaker vector generation unit 15e generates, using the extracted speaker embedding vectors and the correct frame-wise speaker utterance labels, speaker vectors representing the speaker features of frames in which only one speaker is speaking.
 The learning unit 15f then calculates a loss function representing the speaker identity of each frame using the generated speaker vectors and the correct frame-wise speaker utterance labels. The learning unit 15f also calculates a loss function for the frame-wise speaker utterance labels using the estimated sequence of frame-wise speaker utterance labels and the generated sequence of correct frame-wise speaker utterance labels. The learning unit 15f then uses the weighted sum of the two loss functions as the loss function of the entire model to generate the speaker diarization model 14a. This completes the series of learning processes.
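 Pulling the sketches above together, one possible training step is shown below; the speaker_identity_loss helper, the omission of the optimizer handling, and the placement of the weighting parameter α are assumptions rather than details taken from the publication.

    def training_step(model, proj, feats, labels, speaker_identity_loss, alpha=0.5):
        """One gradient step: feats (1, T, D) acoustic features, labels (T, S) correct labels."""
        pred, emb = model(feats)                               # forward pass of the sketch model
        label_loss = permutation_free_bce(pred[0], labels)     # frame-wise speaker utterance label loss
        spk_vecs, spk_ids = subsample_and_project(emb[0], labels, proj)
        si_loss = speaker_identity_loss(spk_vecs, spk_ids)     # e.g. a GE2E-style loss (assumed helper)
        total = (1 - alpha) * label_loss + alpha * si_loss     # weighted sum of the two losses (placement of alpha assumed)
        total.backward()                                       # error backpropagation (optimizer step omitted)
        return total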
 Next, FIG. 7 shows the estimation process procedure. The flowchart of FIG. 7 is started, for example, at the timing when an input instructing the start of the estimation process is received.
 First, the acoustic feature extraction unit 15a extracts the acoustic features of each frame of an acoustic signal containing the utterances of a plurality of speakers, and outputs an acoustic feature sequence (step S1).
 Next, the speaker utterance label estimation unit 15d estimates the frame-wise speaker utterance labels of the acoustic signal using the generated speaker diarization model 14a (step S4). Specifically, the speaker utterance label estimation unit 15d outputs the frame-wise speaker utterance label posterior probabilities (estimated values of the speaker utterance labels) for the acoustic feature sequence.
 Then, the utterance segment estimation unit 15g estimates the utterance segments of the speakers in the acoustic signal using the output speaker utterance label posterior probabilities (step S5). This completes the series of estimation processes.
[Effects]
 As described above, in the speaker diarization device 10 of the present embodiment, the speaker feature extraction unit 15c extracts a speaker embedding vector representing the speaker features of each frame, using the sequence of frame-wise acoustic features of the acoustic signal. The speaker utterance label estimation unit 15d estimates, using the extracted speaker embedding vector, a speaker utterance label representing the speaker of that speaker embedding vector. The learning unit 15f then generates, by learning, the speaker diarization model 14a that estimates the speaker utterance label of the speaker feature vector of each frame, using a loss function that includes a loss function representing the speaker identity of each frame, calculated using the extracted speaker embedding vectors, the estimated speaker utterance labels representing the speakers of those speaker embedding vectors, and the correct frame-wise speaker utterance labels.
 In this way, the speaker diarization device 10 can estimate the speaker utterance labels while taking into account whether the speaker of the speaker embedding vector is the same speaker across frames. This makes it possible to perform highly accurate speaker diarization on acoustic signals containing different speakers with similar speaking styles and voice qualities.
 Specifically, the speaker vector generation unit 15e generates, using the speaker embedding vectors and the correct frame-wise speaker utterance labels, speaker vectors representing the speaker features of frames in which only one speaker is speaking. In this case, the learning unit 15f generates the speaker diarization model 14a by learning, using a loss function calculated using the generated speaker vectors, the estimated speaker utterance labels of the speaker embedding vectors, and the correct frame-wise speaker utterance labels.
 In this way, the speaker diarization device 10 can estimate the speaker utterance labels while taking into account whether the speakers of the single-speaker speaker vectors generated from the speaker embedding vectors are the same speaker across frames. This makes it possible to perform speaker diarization with even higher accuracy on acoustic signals containing different speakers with similar speaking styles and voice qualities.
 The learning unit 15f also generates the speaker diarization model 14a by learning, using a loss function that is a weighted sum of a loss function for the frame-wise speaker utterance labels, calculated using the estimated speaker utterance labels and the correct frame-wise speaker utterance labels, and a loss function representing speaker identity, calculated using the extracted speaker embedding vectors and the correct frame-wise speaker utterance labels. This enables the speaker diarization device 10 to perform speaker diarization with higher accuracy.
 In addition, the speaker utterance label estimation unit 15d estimates the frame-wise speaker utterance labels of the acoustic signal using the generated speaker diarization model 14a. This enables the speaker diarization device to perform highly accurate speaker diarization with suppressed speaker errors, taking into account whether the speaker of each frame is the same speaker.
 Furthermore, the utterance segment estimation unit 15g estimates the speaker labels using a moving average over a plurality of frames. This enables the speaker diarization device 10 to prevent erroneous detection of unrealistically short utterance segments.
[Program]
 It is also possible to create a program in which the processing executed by the speaker diarization device 10 according to the above embodiment is described in a language executable by a computer. In one embodiment, the speaker diarization device 10 can be implemented by installing, on a desired computer, a speaker diarization program that executes the above speaker diarization processing as packaged software or online software. For example, by causing an information processing device to execute the above speaker diarization program, the information processing device can be made to function as the speaker diarization device 10. Information processing devices in this sense also include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants). The functions of the speaker diarization device 10 may also be implemented on a cloud server.
 FIG. 8 is a diagram showing an example of a computer that executes the speaker diarization program. The computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System), for example. The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk, for example, is inserted into the disk drive 1041. A mouse 1051 and a keyboard 1052, for example, are connected to the serial port interface 1050. A display 1061, for example, is connected to the video adapter 1060.
 Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.
 The speaker diarization program is stored in the hard disk drive 1031 as, for example, a program module 1093 in which instructions to be executed by the computer 1000 are described. Specifically, a program module 1093 describing each process executed by the speaker diarization device 10 explained in the above embodiment is stored in the hard disk drive 1031.
 Data used for information processing by the speaker diarization program is stored, for example, in the hard disk drive 1031 as program data 1094. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary, and executes each of the procedures described above.
 Note that the program module 1093 and the program data 1094 relating to the speaker diarization program are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 relating to the speaker diarization program may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.
 Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the descriptions and drawings that form part of this disclosure according to the embodiment. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art and others on the basis of this embodiment are all included within the scope of the present invention.
 10 Speaker diarization device
 11 Input unit
 12 Output unit
 13 Communication control unit
 14 Storage unit
 14a Speaker diarization model
 15 Control unit
 15a Acoustic feature extraction unit
 15b Speaker utterance label generation unit
 15c Speaker feature extraction unit
 15d Speaker utterance label estimation unit
 15e Speaker vector generation unit
 15f Learning unit
 15g Utterance segment estimation unit

Claims (7)

  1.  A speaker diarization method performed by a speaker diarization device, the method comprising:
     an extraction step of extracting a vector representing speaker features of each frame, using a sequence of frame-wise acoustic features of an acoustic signal;
     an estimation step of estimating, using the extracted vector, a speaker utterance label representing a speaker of the vector; and
     a learning step of generating, by learning, a model that estimates a speaker utterance label of the vector of each frame, using a loss function that includes a loss function representing speaker identity of each frame, calculated using the extracted vector, the estimated speaker utterance label representing the speaker of the vector, and a correct label of the speaker utterance label representing the speaker of each frame.
  2.  The speaker diarization method according to claim 1, wherein the learning step generates the model by learning, using a loss function that is a weighted sum of a loss function relating to the speaker utterance label of each frame, calculated using the estimated speaker utterance label and the correct label of the speaker utterance label of each frame, and a loss function representing the speaker identity of each frame, calculated using the extracted vector and the correct label of the speaker utterance label of each frame.
  3.  The speaker diarization method according to claim 1, further comprising a generation step of generating, using the vector and the correct label of the speaker utterance label of each frame, a speaker vector representing speaker features of a frame in which there is only one speaker,
     wherein the learning step generates the model by learning, using the loss function calculated using the generated speaker vector, the estimated speaker utterance label of the vector, and the correct label of the speaker utterance label of each frame.
  4.  The speaker diarization method according to claim 1, wherein the estimation step estimates a speaker utterance label for each frame of an acoustic signal using the generated model.
  5.  The speaker diarization method according to claim 4, wherein the estimation step estimates the speaker utterance label using a moving average over a plurality of frames.
  6.  A speaker diarization device comprising:
     an extraction unit that extracts a vector representing speaker features of each frame, using a sequence of frame-wise acoustic features of an acoustic signal;
     an estimation unit that estimates, using the extracted vector, a speaker utterance label representing a speaker of the vector; and
     a learning unit that generates, by learning, a model that estimates a speaker utterance label of the vector of each frame, using a loss function that includes a loss function representing speaker identity of each frame, calculated using the extracted vector, the estimated speaker utterance label representing the speaker of the vector, and a correct label of the speaker utterance label representing the speaker of each frame.
  7.  A speaker diarization program for causing a computer to execute:
     an extraction step of extracting a vector representing speaker features of each frame, using a sequence of frame-wise acoustic features of an acoustic signal;
     an estimation step of estimating, using the extracted vector, a speaker utterance label representing a speaker of the vector; and
     a learning step of generating, by learning, a model that estimates a speaker utterance label of the vector of each frame, using a loss function that includes a loss function representing speaker identity of each frame, calculated using the extracted vector, the estimated speaker utterance label representing the speaker of the vector, and a correct label of the speaker utterance label representing the speaker of each frame.
PCT/JP2021/025849 2021-07-08 2021-07-08 Speaker diarization method, speaker diarization device, and speaker diarization program WO2023281717A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/025849 WO2023281717A1 (en) 2021-07-08 2021-07-08 Speaker diarization method, speaker diarization device, and speaker diarization program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/025849 WO2023281717A1 (en) 2021-07-08 2021-07-08 Speaker diarization method, speaker diarization device, and speaker diarization program

Publications (1)

Publication Number Publication Date
WO2023281717A1 true WO2023281717A1 (en) 2023-01-12

Family

ID=84800543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/025849 WO2023281717A1 (en) 2021-07-08 2021-07-08 Speaker diarization method, speaker diarization device, and speaker diarization program

Country Status (1)

Country Link
WO (1) WO2023281717A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012118668A (en) * 2010-11-30 2012-06-21 National Institute Of Information & Communication Technology Learning device for pattern classification device and computer program for the same
US20190295553A1 (en) * 2018-03-21 2019-09-26 Hyundai Mobis Co., Ltd. Apparatus for recognizing voice speaker and method for the same
JP2020527248A (en) * 2018-05-28 2020-09-03 平安科技(深▲せん▼)有限公司Ping An Technology (Shenzhen) Co.,Ltd. Speaker separation model training method, separation method for both speakers and related equipment
JP2020052611A (en) * 2018-09-26 2020-04-02 日本電信電話株式会社 Tag estimation device, tag estimation method and program
WO2020240682A1 (en) * 2019-05-28 2020-12-03 日本電気株式会社 Signal extraction system, signal extraction learning method, and signal extraction learning program
JP2021026050A (en) * 2019-07-31 2021-02-22 株式会社リコー Voice recognition system, information processing device, voice recognition method, program

Similar Documents

Publication Publication Date Title
US10332510B2 (en) Method and apparatus for training language model and recognizing speech
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
Hayashi et al. Duration-controlled LSTM for polyphonic sound event detection
CN110689879B (en) Method, system and device for training end-to-end voice transcription model
US9240184B1 (en) Frame-level combination of deep neural network and gaussian mixture models
CN110275939B (en) Method and device for determining conversation generation model, storage medium and electronic equipment
CN108520741A (en) A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN108885870A (en) For by combining speech to TEXT system with speech to intention system the system and method to realize voice user interface
JP2017097162A (en) Keyword detection device, keyword detection method and computer program for keyword detection
CN110648691B (en) Emotion recognition method, device and system based on energy value of voice
JP6780033B2 (en) Model learners, estimators, their methods, and programs
EP3915063B1 (en) Multi-model structures for classification and intent determination
CN112837669B (en) Speech synthesis method, device and server
WO2019138897A1 (en) Learning device and method, and program
CN112750461B (en) Voice communication optimization method and device, electronic equipment and readable storage medium
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN112786028B (en) Acoustic model processing method, apparatus, device and readable storage medium
CN113160855A (en) Method and apparatus for improving on-line voice activity detection system
WO2023281717A1 (en) Speaker diarization method, speaker diarization device, and speaker diarization program
Kaur et al. Speech recognition system; challenges and techniques
Banjara et al. Nepali speech recognition using cnn and sequence models
JP7111017B2 (en) Paralinguistic information estimation model learning device, paralinguistic information estimation device, and program
CN113160801A (en) Speech recognition method, apparatus and computer readable storage medium
US20240038255A1 (en) Speaker diarization method, speaker diarization device, and speaker diarization program
WO2022130471A1 (en) Speaker diarization method, speaker diarization device, and speaker diarization program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21949350

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE