WO2023281717A1

WO2023281717A1 - Speaker diarization method, speaker diarization device, and speaker diarization program

Info

Publication number: WO2023281717A1
Application number: PCT/JP2021/025849
Authority: WO
Inventors: 厚志安藤; 有実子村田; 岳至森
Original assignee: 日本電信電話株式会社
Priority date: 2021-07-08
Filing date: 2021-07-08
Publication date: 2023-01-12

Abstract

A speaker feature extraction unit (15c) extracts, using a sequence of acoustic features of frames of an acoustic signal, a speaker embedding vector indicating a speaker feature of each of the frames. A speaker utterance label inference unit (15d) infers, using the extracted speaker embedding vector, a speaker utterance label indicating a speaker of the speaker embedding vector. By using loss functions that are calculated by using the extracted speaker embedding vector, the inferred speaker utterance label indicating the speaker of the speaker embedding vector, and a correct answer label of a speaker utterance label of each of the frames and that include a loss function indicating identity of the speaker of the frames, a training unit (15f) trains and generates a speaker diarization model (14a) for inferring a speaker utterance label of the speaker feature vector of each of the frames.

Description

Speaker diarization method, speaker diarization device and speaker diarization program

The present invention relates to a speaker diarization method, a speaker diarization device, and a speaker diarization program.

In recent years, expectations have been high for a speaker diarization technology that takes an acoustic signal as input and identifies the utterance intervals of all speakers included in the acoustic signal. According to the speaker diarization technology, various applications are possible, for example, automatic transcription for recording who spoke when in a conference, automatic extraction of speech between an operator and a customer from a call in a contact center, and the like.

Conventionally, a technology called EEND (End-to-End Neural Diarization) based on deep learning has been disclosed as a speaker diarization technology (see Non-Patent Document 1). In EEND, an audio signal is divided into frames, and a speaker label representing whether or not a specific speaker exists in the frame is estimated for each frame from the audio features extracted from each frame. If the maximum number of speakers in the audio signal is S, then the speaker label for each frame is an S-dimensional vector, and in that frame, 1 if a speaker is speaking, becomes 0. That is, EEND implements speaker diarization by performing multi-label binary classification of the number of speakers.

The EEND model used for estimating the speaker label sequence for each frame in EEND is a deep learning-based model composed of layers capable of backpropagating errors, and the speaker label sequence for each frame is extracted from the acoustic feature sequence at once. It can be estimated by penetrating. The EEND model includes an RNN (Recurrent Neural Network) layer that performs time-series modeling. As a result, in EEND, it is possible to estimate the speaker label for each frame by using the acoustic features of not only the current frame but also the surrounding frames. Bidirectional LSTM (Long Short-Term Memory)-RNN and Transformer Encoder are used in this RNN layer.

It should be noted that Non-Patent Document 2 describes multitask learning. Also, Non-Patent Document 3 describes a loss function based on distance learning. In addition, Non-Patent Document 4 describes an acoustic feature quantity.

However, with the conventional technology, it is difficult to accurately perform speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities. That is, the conventional EEND model only learns speaker utterance labels for each frame, and does not consider whether speakers are similar to each other. Therefore, for acoustic signals containing different speakers with similar speaking styles and voice qualities, there have been many speaker errors in which the utterance period of one speaker is assumed to be the utterance period of another speaker.

The present invention has been made in view of the above, and it is an object of the present invention to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.

In order to solve the above-mentioned problems and achieve the object, a speaker diarization method according to the present invention is a speaker diarization method executed by a speaker diarization apparatus, comprising: an extraction step of extracting a vector representing the speaker feature of each frame using the series of; an estimation step of estimating a speaker utterance label representing the speaker of the vector using the extracted vector; The identity of the speaker in each frame calculated using the estimated vector, the speaker utterance label representing the speaker of the estimated vector, and the correct label of the speaker utterance label representing the speaker in each frame and a learning step of generating, through learning, a model for estimating the speaker utterance label of each frame vector using a loss function including a loss function representing gender.

According to the present invention, it is possible to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.

FIG. 1 is a diagram for explaining the outline of a speaker diarization device. FIG. 2 is a schematic diagram illustrating a schematic configuration of the speaker diarization device. FIG. 3 is a diagram for explaining the processing of the speaker diarization device. FIG. 4 is a diagram for explaining the processing of the speaker diarization device. FIG. 5 is a diagram for explaining the processing of the speaker diarization device. FIG. 6 is a flow chart showing a speaker diarization processing procedure. FIG. 7 is a flow chart showing a speaker diarization processing procedure. FIG. 8 is a diagram illustrating a computer executing a speaker diarization program.

An embodiment of the present invention will be described in detail below with reference to the drawings. It should be noted that the present invention is not limited by this embodiment. Moreover, in the description of the drawings, the same parts are denoted by the same reference numerals.

[Overview of speaker diarization device]
FIG. 1 is a diagram for explaining the outline of a speaker diarization device. The speaker diarization apparatus of this embodiment adds a loss term for evaluating speaker identity for each speaker included in the acoustic signal, and considers speaker identity for each speaker included in the acoustic signal. The speaker diarization model 14a is trained as follows. This loss term is set so that speaker identity is high for the same speaker, and speaker identity is low for different speakers. As a result, the speaker diarization device makes the speaker diarization model 14a learn that the speakers are not the same speaker even if they speak in a similar way, thereby reducing speaker errors.

Specifically, as shown in FIG. 1, the speaker diarization model 14a has a speaker embedding extraction block, a speaker diarization result generation block, and a speaker vector generation block.

The speaker embedding extraction block extracts the speaker in this section from the acoustic features from the (t−N)-th frame to the (t+N)-th frame consecutively traced back from the current t-th frame in the input acoustic feature sequence. Extract embeddings. The speaker embedding is a vector containing speaker characteristics such as gender, age, and speaking style necessary for speaker diarization, and is a higher-dimensional vector than the speaker vector described later.

This speaker embedding extraction block is composed of a Linear layer, an RNN layer, an attention mechanism layer, and so on. Moreover, in the speaker diarization apparatus of this embodiment, unlike the conventional speaker diarization model, the speaker embedding extraction block refers only to fixed-length limited intervals to estimate speaker embeddings. do.

The speaker diarization result generation block is composed of an RNN layer, a linear layer, a sigmoid layer, etc., and estimates the sequence of speaker utterance labels for each frame based on the speaker embeddings obtained by the speaker embedding extraction block. .

The speaker vector generation block generates speaker vectors based on speaker embedding. A speaker vector is a vector that has the same information as the speaker embedding and has a lower dimension than the speaker embedding. The speaker vector generation block is used only during learning of the speaker diarization model 14a, and is not used during inference, which will be described later.

This speaker vector generation block consists of a subsample layer and a linear layer, and randomly selects frames in which only one speaker is speaking. Then, the subsample layer inputs the speaker embedding of the selected frame to the next Linear layer. The Linear layer projects the speaker embedding output from the subsample layer onto a speaker vector with a predetermined number of dimensions.

Then, in the speaker diarization model 14a, parameter learning is performed in the framework of multitask learning based on two loss functions, speaker utterance label loss and speaker identity loss for each frame.

As shown in FIG. 1, the frame-by-frame speaker utterance label loss is calculated using the frame-by-frame speaker utterance correct label sequence and the frame-by-frame speaker utterance estimation result sequence. Also, the speaker identity loss is calculated using the frame-by-frame speaker utterance correct label sequence of the frames selected in the subsample layer and the speaker label output by the Linear layer.

As a result, the speaker diarization device learns the speaker diarization model 14a so as to consider the speaker identity for each speaker included in the acoustic signal. Therefore, the speaker diarization device allows the speaker diarization model 14a to learn that the speakers are not the same speaker even if their speaking style is similar, thereby reducing speaker errors.

[Structure of speaker diarization device]
FIG. 2 is a schematic diagram illustrating a schematic configuration of the speaker diarization device. 3 to 5 are diagrams for explaining the processing of the speaker diarization device. First, as illustrated in FIG. 2, the speaker diarization apparatus 10 of this embodiment is implemented by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit. A part 15 is provided.

The input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information such as processing start to the control unit 15 in response to input operations by the practitioner. The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like. The communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and an external device such as a server or a device that acquires an acoustic signal via a network.

The storage unit 14 is implemented by semiconductor memory devices such as RAM (Random Access Memory) and flash memory, or storage devices such as hard disks and optical disks. Note that the storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13 . In this embodiment, the storage unit 14 stores, for example, a speaker diarization model 14a used for speaker diarization processing, which will be described later.

The control unit 15 is implemented using a CPU (Central Processing Unit), NP (Network Processor), FPGA (Field Programmable Gate Array), etc., and executes a processing program stored in memory. 2, the control unit 15 includes an acoustic feature extraction unit 15a, a speaker utterance label generation unit 15b, a speaker feature extraction unit 15c, a speaker utterance label estimation unit 15d, and a speaker vector generation unit. 15e, learning unit 15f, and utterance segment estimating unit 15g. Note that these functional units may be implemented in different hardware. For example, the acoustic feature extraction unit 15a may be implemented in hardware different from other functional units. Also, the control unit 15 may include other functional units.

The acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers. For example, the acoustic feature extraction unit 15a receives input of acoustic signals via the input unit 11 or via the communication control unit 13 from a device that acquires acoustic signals. Further, the acoustic feature extraction unit 15a divides the acoustic signal into frames, performs discrete Fourier transform and filter bank multiplication on the signal from each frame, extracts an acoustic feature vector, and extracts an acoustic feature vector coupled in the frame direction. Output the feature sequence. In this embodiment, the frame length is 25 ms and the frame shift width is 10 ms.

Here, the acoustic feature vector is, for example, a 24-dimensional MFCC (Mel Frequency Cepstral Coefficient), but is not limited to this, and may be, for example, another acoustic feature quantity for each frame such as Mel filter bank output.

The speaker's utterance label generation unit 15b uses the acoustic feature sequence to generate a speaker's utterance label for each frame. Specifically, the speaker's utterance label generation unit 15b generates a speaker's utterance label for each frame by using the acoustic feature sequence and the correct label of the utterance period of the speaker, as shown in FIG. 3, which will be described later. . As a result, a speaker utterance label and a correct label for each frame corresponding to each frame of the acoustic feature sequence are generated as teacher data used in the processing of the learning unit 15f, which will be described later.

Here, when the number of speakers is S (speaker 1, speaker 2, ..., speaker S), the speaker utterance label of the t-th frame (t = 0, 1, ..., T) vector. For example, when a frame of time t×frame shift width is included in the utterance period of any speaker, the value of the dimension corresponding to that speaker is 1, and the value of the other dimensions is 0. Therefore, the speaker utterance label for each frame is a binary [0, 1] multi-label of T×S dimension. In this embodiment, for example, S=10.

Return to the description of Figure 2. The speaker feature extraction unit 15c extracts a vector representing the speaker feature of each frame using the series of acoustic features for each frame of the acoustic signal. Specifically, the speaker feature extraction unit 15c generates a speaker embedding vector by inputting the acoustic feature sequence acquired from the acoustic feature extraction unit 15a to the speaker embedding extraction block shown in FIG. As described above, the speaker embedding vector includes speaker characteristics such as gender, age, and speaking habits.

The speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate the speaker utterance label representing the speaker of the speaker embedding vector. Specifically, the speaker utterance label estimation unit 15d inputs the speaker embedding vector acquired from the speaker feature extraction unit 15c to the speaker diarization result generation block shown in FIG. Obtain a sequence of speaker utterance label estimation results.

The speaker vector generation unit 15e uses the extracted speaker embedding vector and the correct label of the speaker utterance label of each frame to generate a speaker vector representing the speaker feature of the frame in which there is only one speaker. do.

Here, FIG. 3 illustrates the processing of the speaker label generation unit in the speaker vector generation block shown in FIG. As shown in FIG. 3, the speaker vector generation unit 15e generates a speaker embedding vector and a series of correct labels of speaker utterance labels for each frame (speaker utterance correct label series for each frame) of the same length as follows: Input to the subsample layer.

In the subsample layer, frames in which only one speaker speaks out of correct speaker utterance labels for each frame are targeted, and up to K frames per speaker are randomly selected from the target frames. In FIG. 3, the 2nd, 3rd, 6th, 8th, and 9th frames indicated by hatching are targeted, and the 1st, 7th, and 10th frames where no one speaks, and two or more speakers are shown. The 4th and 5th frames of speaking are excluded.

Then, the speaker embedding vector of the selected frame is input from the subsample layer to the Linear layer. In the Linear layer, the input speaker embedding vector is projected onto a speaker vector with a predetermined number of dimensions. Thereby, the speaker vector generator 15e obtains the speaker vector of the selected frame.

Note that the speaker feature extraction unit 15c, the speaker utterance label estimation unit 15d, and the speaker vector generation unit 15e may be included in the learning unit 15f, which will be described later. For example, FIG. 4 shows an example in which the learning unit 15f performs the processing of the speaker feature extraction unit 15c and the speaker utterance label estimation unit 15d.

Also, the speaker feature extraction unit 15c may be included in the speaker utterance label estimation unit 15d. For example, FIG. 4 shows an example in which the speaker utterance label estimation unit 15d performs the processing of the speaker feature extraction unit 15c.

Return to the description of Figure 2. The learning unit 15f calculates using the extracted speaker embedding vector, the speaker utterance label representing the speaker of the estimated speaker embedding vector, and the correct label of the speaker utterance label of each frame, A speaker diarization model 14a for estimating the speaker utterance label of the speaker embedding vector of each frame is generated by learning using a loss function including a loss function representing speaker identity of each frame.

Specifically, the learning unit 15f calculates the loss function using the generated speaker vector, the speaker utterance label of the estimated speaker embedding vector, and the correct label of the speaker utterance label of each frame. is used to generate the speaker diarization model 14a by learning.

That is, as shown in FIG. 1, the learning unit 15f generates the speaker vector generated by the speaker vector generation unit 15e and the correct speaker utterance label for each frame corresponding to the speaker vector. (speaker utterance correct label sequence for each frame after subsampling) is used.

Here, FIG. 5 shows an example of calculation of speaker identity loss. As shown in FIG. 5, the speaker identity loss is calculated using the loss function shown in the following equation (1) based on distance learning such as Generalized End-to-End Loss.

As a result, the speaker diarization model 14a is learned so that the cosine distance of the speaker vector is short for the same speaker, and the cosine distance of the speaker vector is long for a different speaker.

Note that the speaker identity loss is not limited to Generalized End-to-End Loss, and may be a loss function based on other distance learning. For example, Triplet Loss or Siamese Loss may be used.

In this way, the speaker vector is calculated from the speaker embedding vector, and this speaker embedding vector is used in the processing of the speaker utterance label estimation unit 15d. This enables the speaker utterance label estimation unit 15d to estimate the speaker utterance label while considering whether the speaker of each frame is the same speaker.

In addition, the speaker diarization model 14a has, as shown in FIG. 1, a speaker embedding extraction block, a speaker diarization result generation block, and a speaker vector generation block. The speaker embedding block is composed of, for example, one linear layer and two bidirectional LSTM-RNN layers, as shown in FIG. Also, the speaker diarization result generation block is composed of, for example, three layers of bi-directional LSTM-RNN layers, one layer of Linear layer, and one layer of sigmoid layer. Also, the speaker vector generation block is composed of one subsample layer and one linear layer. However, the total number of layers and the number of hidden units may be changed.

This speaker diarization model 14a is input with a frame-by-frame acoustic feature sequence of T frames×D dimensions and an estimated speech feature sequence for each frame of T×S dimensions where each dimension takes the values [0, 1]. A user utterance label sequence is output.

The learning unit 15f uses the estimated speaker utterance label and the correct label of the speaker utterance label of each frame to calculate the loss function for the speaker utterance label of each frame, and the extracted speaker embedding vector. and the correct label of the speaker's utterance label of each frame. .

That is, in the speaker diarization model 14a, in addition to the speaker identity loss described above, as shown in FIG. The speaker utterance label loss for each frame calculated using the correct label sequence of the speaker utterance label for each frame generated by the utterance label generator 15b is used. The frame-by-frame speaker utterance label loss is calculated by multi-label binary cross-entropy ignoring the sorting order, as in the past.

Then, as shown in the following equation (2), the learning unit 15f uses the weighted sum of the two loss functions of the speaker utterance label loss and the speaker identity loss for each frame as the loss function of the entire model to obtain the error Parameter optimization is performed by the backpropagation method. Here, α is a weighting parameter, and in this embodiment, α=0.5, for example.

Return to the description of Figure 2. The speaker utterance label estimation unit 15d uses the generated speaker diarization model 14a to estimate a speaker utterance label for each frame of the acoustic signal. Specifically, as shown in FIG. 4, the speaker utterance label estimation unit 15d forward propagates the acoustic feature sequence to the speaker embedding extraction block and the speaker diarization result generation block shown in FIG. obtains a series of speaker utterance label estimation results (posterior probabilities) for each frame.

The speech segment estimation unit 15g estimates the speaker's speech segment in the acoustic signal using the output speaker's speech label posterior probability. Specifically, the speech period estimation unit 15g estimates the speaker's speech label using a moving average of a plurality of frames. That is, the utterance segment estimation unit 15g first calculates the moving average of the frame-by-frame speaker utterance label posterior probability for the length 6 of the current frame and the preceding 5 frames. This makes it possible to prevent erroneous detection of impractically short speech segments such as speech with only one frame.

Next, when the calculated moving average value is greater than 0.5, the utterance segment estimation unit 15g estimates that the frame is the utterance segment of the speaker of the dimension. For each speaker, the utterance period estimation unit 15g regards a continuous utterance period frame group as one utterance, and calculates the start time and end time of the utterance period up to a predetermined time from the frames. As a result, the speech start time and the speech end time up to a predetermined time can be obtained for each speech of each speaker.

[Speaker diarization processing]
Next, the speaker diarization processing by the speaker diarization device 10 will be described. 6 and 7 are flowcharts showing speaker diarization processing procedures. The speaker diarization processing of this embodiment includes learning processing and estimation processing. First, FIG. 6 shows the learning processing procedure. The flowchart of FIG. 6 is started, for example, at the timing when an instruction to start the learning process is received.

First, the acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers, and outputs an acoustic feature sequence (step S1).

Next, the speaker feature extraction unit 15c extracts a speaker embedding vector representing the speaker feature of each frame using the sequence of acoustic features for each frame of the acoustic signal (step S2).

Then, the learning unit 15f calculates using the extracted speaker embedding vector, the speaker utterance label representing the speaker of the estimated speaker embedding vector, and the correct label of the speaker utterance label of each frame. The speaker diarization model 14a for estimating the speaker utterance label of the speaker embedding vector of each frame is generated by learning using the loss function including the loss function representing the identity of the speaker of each frame (step S3).

Specifically, the speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate the speaker utterance label representing the speaker of the speaker embedding vector. In addition, the speaker vector generation unit 15e uses the extracted speaker embedding vector and the correct label of the speaker utterance label of each frame to generate a speaker vector representing the speaker feature of the frame in which there is only one speaker. to generate

Then, the learning unit 15f uses the generated speaker vector and the correct label of the speaker's utterance label of each frame to calculate a loss function representing speaker identity of each frame. In addition, the learning unit 15f calculates a loss function for the speaker utterance label for each frame using the estimated speaker utterance label sequence for each frame and the generated correct label sequence for the speaker utterance label for each frame. do. Then, the learning unit 15f uses the weighted sum of the two loss functions as the loss function for the entire model to generate the speaker diarization model 14a. This completes a series of learning processes.

Next, FIG. 7 shows the estimation processing procedure. The flowchart in FIG. 7 is started, for example, when an input instructing the start of the estimation process is received.

Next, the speaker's utterance label estimation unit 15d uses the generated speaker's diarization model 14a to estimate the speaker's utterance label for each frame of the acoustic signal (step S4). Specifically, the speaker utterance label estimation unit 15d outputs the speaker utterance label posterior probability (estimated value of the speaker utterance label) for each frame of the acoustic feature sequence.

Then, the utterance segment estimation unit 15g estimates the utterance segment of the speaker in the acoustic signal using the output speaker utterance label posterior probability (step S5). This completes a series of estimation processes.

[effect]
As described above, in the speaker diarization apparatus 10 of the present embodiment, the speaker feature extraction unit 15c uses the sequence of acoustic features for each frame of the acoustic signal to express the speaker features of each frame. extract the person embedding vector. A speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate a speaker utterance label representing the speaker of the speaker embedding vector. Also, the learning unit 15f calculates using the extracted speaker embedding vector, the speaker utterance label representing the speaker of the estimated speaker embedding vector, and the correct label of the speaker utterance label of each frame. A speaker diarization model 14a for estimating the speaker utterance label of the speaker feature vector of each frame is generated by learning using the loss function including the loss function representing the identity of the speaker of each frame.

In this way, the speaker diarization device 10 can estimate the speaker's utterance label by considering whether the speaker of the speaker embedding vector is the same speaker in each frame. This makes it possible to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.

Specifically, the speaker vector generation unit 15e uses the speaker embedding vector and the correct label of the speaker utterance label of each frame to generate a speaker vector representing the speaker feature of a frame with a single speaker. to generate In this case, the learning unit 15f calculates a loss function using the generated speaker vector, the speaker utterance label of the estimated speaker embedding vector, and the correct label of the speaker utterance label of each frame. is used to generate a speaker diarization model 14a by learning.

In this way, the speaker diarization device 10 considers whether or not the speaker of the speaker vector generated from the speaker embedding vector is the same speaker in each frame. Labels can be estimated. This makes it possible to more accurately perform speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.

In addition, the learning unit 15f calculates a loss function for the speaker utterance label of each frame using the estimated speaker utterance label and the correct label of the speaker utterance label of each frame, and the extracted speaker The speaker diarization model 14a is learned by using the loss function of the weighted sum of the loss function representing the identity of the speaker calculated using the embedding vector and the correct label of the speaker utterance label of each frame. Generate. This enables the speaker diarization device 10 to perform speaker diarization with higher accuracy.

In addition, the speaker's utterance label estimation unit 15d uses the generated speaker's diarization model 14a to estimate the speaker's utterance label for each frame of the acoustic signal. As a result, the speaker diarization apparatus can perform highly accurate speaker diarization while suppressing speaker errors by considering whether or not the speaker of each frame is the same speaker.

Also, the speech period estimation unit 15g estimates the speaker label using the moving average of a plurality of frames. This enables the speaker diarization device 10 to prevent erroneous detection of unrealistically short speech segments.

[program]
It is also possible to create a program in which the processing executed by the speaker diarization apparatus 10 according to the above embodiment is described in a computer-executable language. As one embodiment, the speaker diarization apparatus 10 can be implemented by installing a speaker diarization program for executing the above-described speaker diarization processing as package software or online software on a desired computer. For example, the information processing apparatus can function as the speaker diarization apparatus 10 by causing the information processing apparatus to execute the speaker diarization program. In addition, information processing devices include mobile communication terminals such as smartphones, mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants). Also, the functions of the speaker diarization device 10 may be implemented in a cloud server.

FIG. 8 is a diagram showing an example of a computer that executes a speaker diarization program. Computer 1000 includes, for example, memory 1010 , CPU 1020 , hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012 . The ROM 1011 stores a boot program such as BIOS (Basic Input Output System). Hard disk drive interface 1030 is connected to hard disk drive 1031 . Disk drive interface 1040 is connected to disk drive 1041 . A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example. A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example. For example, a display 1061 is connected to the video adapter 1060 .

Here, the hard disk drive 1031 stores an OS 1091, application programs 1092, program modules 1093 and program data 1094, for example. Each piece of information described in the above embodiment is stored in the hard disk drive 1031 or the memory 1010, for example.

Also, the speaker diarization program is stored in the hard disk drive 1031 as a program module 1093 in which instructions to be executed by the computer 1000 are written, for example. Specifically, the hard disk drive 1031 stores a program module 1093 that describes each process executed by the speaker diarization apparatus 10 described in the above embodiment.

Data used for information processing by the speaker diarization program is stored as program data 1094 in the hard disk drive 1031, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary, and executes each procedure described above.

Note that the program module 1093 and program data 1094 relating to the speaker diarization program are not limited to being stored in the hard disk drive 1031. For example, they are stored in a detachable storage medium and stored in the CPU 1020 via the disk drive 1041 or the like. may be read by Alternatively, the program module 1093 and program data 1094 related to the speaker diarization program are stored in another computer connected via a network such as LAN (Local Area Network) or WAN (Wide Area Network), and network interface 1070 may be read by CPU 1020 via

Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the descriptions and drawings forming part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operation techniques, etc. made by those skilled in the art based on this embodiment are all included in the scope of the present invention.

10 speaker diarization device 11 input unit 12 output unit 13 communication control unit 14 storage unit 14a speaker diarization model 15 control unit 15a acoustic feature extraction unit 15b speaker utterance label generation unit 15c speaker feature extraction unit 15d speaker utterance Label estimation unit 15e Speaker vector generation unit 15f Learning unit 15g Speech segment estimation unit

Claims

A speaker diarization method performed by a speaker diarization device, comprising:
an extraction step of extracting a vector representing a speaker feature of each frame using a sequence of acoustic features for each frame of the acoustic signal;
an estimating step of using the extracted vector to estimate a speaker utterance label representing the speaker of the vector;
The speaker's utterance label for each frame calculated using the extracted vector, the speaker's utterance label representing the speaker of the estimated vector, and the correct label of the speaker's utterance label representing the speaker of each frame a learning step of generating a model for estimating a speaker utterance label of a vector of each frame by learning using a loss function including a loss function representing identity;
A speaker diarization method comprising:
In the learning step, a loss function for the speaker utterance label of each frame calculated using the estimated speaker utterance label and a correct label of the speaker utterance label of each frame, and the extracted vector and a loss function representing the identity of the speaker of each frame calculated using the correct label of the speaker's utterance label of each frame. The speaker diarization method according to claim 1, characterized in that:
a generation step of generating a speaker vector representing the speaker feature of a frame in which there is only one speaker, using the vector and the correct label of the speaker utterance label of each frame;
In the learning step, using the loss function calculated using the generated speaker vector, the estimated speaker utterance label of the vector, and the correct label of the speaker utterance label of each frame, the 2. The method of speaker diarization according to claim 1, wherein the model is generated by learning.
The speaker diarization method according to claim 1, wherein the estimation step uses the generated model to estimate a speaker utterance label for each frame of an acoustic signal.
The speaker diarization method according to claim 4, wherein said estimation step estimates said speaker utterance label using a moving average of a plurality of frames.
an extraction unit that extracts a vector representing a speaker feature of each frame using a sequence of acoustic features for each frame of an acoustic signal;
an estimating unit that uses the extracted vector to estimate a speaker utterance label representing the speaker of the vector;
The speaker's utterance label for each frame calculated using the extracted vector, the speaker's utterance label representing the speaker of the estimated vector, and the correct label of the speaker's utterance label representing the speaker of each frame a learning unit that generates a model for estimating a speaker utterance label of a vector of each frame by learning, using a loss function including a loss function representing identity;
A speaker diarization device comprising:
an extraction step of extracting a vector representing a speaker feature of each frame using a sequence of acoustic features for each frame of the acoustic signal;
an estimation step of using the extracted vector to estimate a speaker utterance label representing the speaker of the vector;
The speaker's utterance label for each frame calculated using the extracted vector, the speaker's utterance label representing the speaker of the estimated vector, and the correct label of the speaker's utterance label representing the speaker of each frame a learning step of learning to generate a model for estimating a speaker utterance label of a vector of each frame using a loss function including a loss function representing identity;
A speaker diarization program for running on a computer.