WO2023281717A1 - Speaker diarization method, speaker diarization device, and speaker diarization program - Google Patents

Speaker diarization method, speaker diarization device, and speaker diarization program

Info

Publication number
WO2023281717A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
frame
label
vector
utterance
Prior art date
Application number
PCT/JP2021/025849
Other languages
English (en)
Japanese (ja)
Inventor
厚志 安藤
有実子 村田
岳至 森
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/025849
Publication of WO2023281717A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building

Definitions

  • the present invention relates to a speaker diarization method, a speaker diarization device, and a speaker diarization program.
  • EEND (End-to-End Neural Diarization) is a speaker diarization technology (see Non-Patent Document 1).
  • in EEND, an audio signal is divided into frames, and a speaker label representing whether or not a specific speaker exists in the frame is estimated for each frame from the acoustic features extracted from each frame. If the maximum number of speakers in the audio signal is S, the speaker label for each frame is an S-dimensional vector whose s-th element is 1 if speaker s is speaking in that frame and 0 otherwise. That is, EEND implements speaker diarization by performing multi-label binary classification over the S speakers, as illustrated by the sketch below.
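  • As a rough illustration of this frame-by-frame label format (a minimal sketch with hypothetical values, not taken from the patent):

```python
import numpy as np

# Hypothetical frame-wise speaker labels for T = 6 frames and S = 3 speakers.
# labels[t, s] == 1 means speaker s is speaking in frame t; several entries of a
# row may be 1 at once (overlap), which is why this is multi-label binary
# classification rather than a single S-way classification.
labels = np.array([
    [0, 0, 0],  # silence
    [1, 0, 0],  # speaker 0 only
    [1, 0, 0],
    [1, 1, 0],  # speakers 0 and 1 overlap
    [0, 1, 0],
    [0, 0, 0],
])
assert labels.shape == (6, 3)
```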
  • the EEND model used for estimating the frame-by-frame speaker label sequence in EEND is a deep-learning-based model composed of layers through which errors can be backpropagated, and the frame-by-frame speaker label sequence can be estimated from the acoustic feature sequence in a single forward pass.
  • the EEND model includes an RNN (Recurrent Neural Network) layer that performs time-series modeling. As a result, in EEND, it is possible to estimate the speaker label for each frame by using the acoustic features of not only the current frame but also the surrounding frames.
  • a bidirectional LSTM (Long Short-Term Memory)-RNN or a Transformer encoder is used as this RNN layer.
  • Non-Patent Document 2 describes multitask learning. Also, Non-Patent Document 3 describes a loss function based on distance metric learning. In addition, Non-Patent Document 4 describes an acoustic feature quantity.
  • the conventional EEND model only learns speaker utterance labels for each frame and does not consider whether speakers are similar to each other. Therefore, for acoustic signals containing different speakers with similar speaking styles and voice qualities, there have been many speaker errors in which the utterance period of one speaker is mistaken for the utterance period of another speaker.
  • the present invention has been made in view of the above, and it is an object of the present invention to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.
  • a speaker diarization method according to the present invention is a speaker diarization method executed by a speaker diarization apparatus, comprising: an extraction step of extracting a vector representing the speaker feature of each frame using the series of acoustic features of each frame of an acoustic signal; an estimation step of estimating, using the extracted vector, a speaker utterance label representing the speaker of the vector; and a learning step of generating, through learning, a model for estimating the speaker utterance label of the vector of each frame, using a loss function that is calculated using the extracted vector, the estimated speaker utterance label representing the speaker of the vector, and the correct label of the speaker utterance label representing the speaker of each frame, and that includes a loss function representing the identity of the speaker of each frame.
  • FIG. 1 is a diagram for explaining the outline of a speaker diarization device.
  • FIG. 2 is a schematic diagram illustrating a schematic configuration of the speaker diarization device.
  • FIG. 3 is a diagram for explaining the processing of the speaker diarization device.
  • FIG. 4 is a diagram for explaining the processing of the speaker diarization device.
  • FIG. 5 is a diagram for explaining the processing of the speaker diarization device.
  • FIG. 6 is a flow chart showing a speaker diarization processing procedure.
  • FIG. 7 is a flow chart showing a speaker diarization processing procedure.
  • FIG. 8 is a diagram illustrating a computer executing a speaker diarization program.
  • FIG. 1 is a diagram for explaining the outline of a speaker diarization device.
  • the speaker diarization apparatus of this embodiment adds a loss term for evaluating speaker identity for each speaker included in the acoustic signal, and trains the speaker diarization model 14a so as to take into account the speaker identity of each speaker included in the acoustic signal. This loss term is set so that speaker identity is high for the same speaker and low for different speakers. As a result, the speaker diarization device makes the speaker diarization model 14a learn that speakers are not the same speaker even if they speak in a similar way, thereby reducing speaker errors.
  • the speaker diarization model 14a has a speaker embedding extraction block, a speaker diarization result generation block, and a speaker vector generation block.
  • the speaker embedding extraction block extracts the speaker embedding of this section from the acoustic features of the consecutive frames from the (t−N)-th frame to the (t+N)-th frame around the current t-th frame in the input acoustic feature sequence.
  • the speaker embedding is a vector containing speaker characteristics such as gender, age, and speaking style necessary for speaker diarization, and is a higher-dimensional vector than the speaker vector described later.
  • This speaker embedding extraction block is composed of a Linear layer, an RNN layer, an attention mechanism layer, and so on. Moreover, in the speaker diarization apparatus of this embodiment, unlike the conventional speaker diarization model, the speaker embedding extraction block refers only to a fixed-length, limited interval when estimating speaker embeddings.
  • the speaker diarization result generation block is composed of an RNN layer, a Linear layer, a sigmoid layer, etc., and estimates the sequence of speaker utterance labels for each frame based on the speaker embeddings obtained by the speaker embedding extraction block.
  • the speaker vector generation block generates speaker vectors based on speaker embedding.
  • a speaker vector is a vector that has the same information as the speaker embedding and has a lower dimension than the speaker embedding.
  • the speaker vector generation block is used only during learning of the speaker diarization model 14a, and is not used during inference, which will be described later.
  • This speaker vector generation block consists of a subsample layer and a Linear layer. The subsample layer randomly selects frames in which only one speaker is speaking, and inputs the speaker embeddings of the selected frames to the next Linear layer.
  • the Linear layer projects the speaker embedding output from the subsample layer onto a speaker vector with a predetermined number of dimensions.
  • parameter learning is performed in the framework of multitask learning based on two loss functions: the frame-by-frame speaker utterance label loss and the speaker identity loss.
  • the frame-by-frame speaker utterance label loss is calculated using the frame-by-frame speaker utterance correct label sequence and the frame-by-frame speaker utterance estimation result sequence.
  • the speaker identity loss is calculated using the frame-by-frame speaker utterance correct labels of the frames selected in the subsample layer and the speaker vectors output by the Linear layer.
  • the speaker diarization device learns the speaker diarization model 14a so as to consider the speaker identity for each speaker included in the acoustic signal. Therefore, the speaker diarization device allows the speaker diarization model 14a to learn that the speakers are not the same speaker even if their speaking style is similar, thereby reducing speaker errors.
  • FIG. 2 is a schematic diagram illustrating a schematic configuration of the speaker diarization device.
  • FIGS. 3 to 5 are diagrams for explaining the processing of the speaker diarization device.
  • the speaker diarization apparatus 10 of this embodiment is implemented by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
  • the input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator.
  • the output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like.
  • the communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and an external device such as a server or a device that acquires an acoustic signal via a network.
  • the storage unit 14 is implemented by semiconductor memory devices such as RAM (Random Access Memory) and flash memory, or storage devices such as hard disks and optical disks. Note that the storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In this embodiment, the storage unit 14 stores, for example, a speaker diarization model 14a used for speaker diarization processing, which will be described later.
  • the control unit 15 is implemented using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. As shown in FIG. 2, the control unit 15 includes an acoustic feature extraction unit 15a, a speaker utterance label generation unit 15b, a speaker feature extraction unit 15c, a speaker utterance label estimation unit 15d, a speaker vector generation unit 15e, a learning unit 15f, and an utterance segment estimation unit 15g. Note that these functional units may be implemented in different hardware. For example, the acoustic feature extraction unit 15a may be implemented in hardware different from the other functional units. Also, the control unit 15 may include other functional units.
  • the acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers. For example, the acoustic feature extraction unit 15a receives input of acoustic signals via the input unit 11, or via the communication control unit 13 from a device that acquires acoustic signals. Further, the acoustic feature extraction unit 15a divides the acoustic signal into frames, performs a discrete Fourier transform and filter bank multiplication on the signal of each frame, extracts an acoustic feature vector for each frame, and outputs an acoustic feature sequence obtained by concatenating the acoustic feature vectors in the frame direction. In this embodiment, the frame length is 25 ms and the frame shift width is 10 ms.
  • the acoustic feature vector is, for example, a 24-dimensional MFCC (Mel Frequency Cepstral Coefficient), but is not limited to this, and may be, for example, another acoustic feature quantity for each frame such as Mel filter bank output.
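  • As a rough sketch of such a front end (librosa is used here only as an assumed, off-the-shelf feature extractor; the patent does not prescribe any particular library, and the function name is hypothetical):

```python
import librosa
import numpy as np

def extract_mfcc_sequence(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a (T, 24) frame-by-frame MFCC sequence with a 25 ms window and 10 ms shift."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=signal,
        sr=sr,
        n_mfcc=24,                   # 24-dimensional MFCC, as in the embodiment
        n_fft=int(0.025 * sr),       # 25 ms frame length
        hop_length=int(0.010 * sr),  # 10 ms frame shift
    )
    return mfcc.T                    # shape (T, D) with D = 24
```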
  • the speaker utterance label generation unit 15b uses the acoustic feature sequence to generate a speaker utterance label for each frame. Specifically, as shown in FIG. 3, which will be described later, the speaker utterance label generation unit 15b generates a speaker utterance label for each frame by using the acoustic feature sequence and the correct label of the utterance period of each speaker. As a result, a correct speaker utterance label for each frame corresponding to each frame of the acoustic feature sequence is generated as teacher data used in the processing of the learning unit 15f, which will be described later.
  • the speaker feature extraction unit 15c extracts a vector representing the speaker feature of each frame using the series of acoustic features for each frame of the acoustic signal. Specifically, the speaker feature extraction unit 15c generates a speaker embedding vector by inputting the acoustic feature sequence acquired from the acoustic feature extraction unit 15a to the speaker embedding extraction block shown in FIG. 1. As described above, the speaker embedding vector includes speaker characteristics such as gender, age, and speaking style.
  • the speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate the speaker utterance label representing the speaker of the speaker embedding vector. Specifically, the speaker utterance label estimation unit 15d inputs the speaker embedding vector acquired from the speaker feature extraction unit 15c to the speaker diarization result generation block shown in FIG. 1, and obtains a sequence of frame-by-frame speaker utterance label estimation results.
  • the speaker vector generation unit 15e uses the extracted speaker embedding vector and the correct label of the speaker utterance label of each frame to generate a speaker vector representing the speaker feature of each frame in which only one speaker is speaking.
  • FIG. 3 illustrates the processing of the speaker vector generation unit 15e in the speaker vector generation block shown in FIG. 1.
  • the speaker vector generation unit 15e inputs the speaker embedding vector sequence and the sequence of correct speaker utterance labels for each frame (frame-by-frame speaker utterance correct label sequence), which have the same length, to the subsample layer.
  • in the subsample layer, frames in which only one speaker is speaking according to the frame-by-frame correct speaker utterance labels are targeted, and up to K frames per speaker are randomly selected from the target frames.
  • for example, the 2nd, 3rd, 6th, 8th, and 9th frames indicated by hatching are targeted, while the 1st, 7th, and 10th frames, in which no one is speaking, and the 4th and 5th frames, in which two or more speakers are speaking, are excluded.
  • the speaker embedding vector of the selected frame is input from the subsample layer to the Linear layer.
  • the input speaker embedding vector is projected onto a speaker vector with a predetermined number of dimensions.
  • in this way, the speaker vector generation unit 15e obtains the speaker vectors of the selected frames, as in the sketch below.
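  • A minimal PyTorch sketch of this subsample-and-project step (function and variable names, and the value of K, are illustrative assumptions, not taken from the patent):

```python
import torch

def subsample_single_speaker_frames(
    embeddings: torch.Tensor,          # (T, E) speaker embeddings of one recording
    labels: torch.Tensor,              # (T, S) frame-by-frame correct speaker utterance labels (0/1)
    max_frames_per_speaker: int = 8,   # "K" in the text; the value here is only an assumption
):
    """Pick up to K frames per speaker among frames where exactly one speaker is active."""
    selected_emb, selected_spk = [], []
    single = labels.sum(dim=1) == 1    # frames with exactly one active speaker
    for s in range(labels.shape[1]):
        frames = torch.nonzero(single & (labels[:, s] == 1)).squeeze(1)
        if frames.numel() == 0:
            continue
        chosen = frames[torch.randperm(frames.numel())][:max_frames_per_speaker]
        selected_emb.append(embeddings[chosen])
        selected_spk.append(torch.full((chosen.numel(),), s, dtype=torch.long))
    return torch.cat(selected_emb), torch.cat(selected_spk)

# The Linear layer then projects each selected embedding onto a lower-dimensional speaker vector:
# speaker_vectors = torch.nn.Linear(emb_dim, spk_vec_dim)(selected_embeddings)
```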
  • FIG. 4 shows an example in which the learning unit 15f performs the processing of the speaker feature extraction unit 15c and the speaker utterance label estimation unit 15d.
  • the speaker feature extraction unit 15c may be included in the speaker utterance label estimation unit 15d.
  • FIG. 4 shows an example in which the speaker utterance label estimation unit 15d performs the processing of the speaker feature extraction unit 15c.
  • the learning unit 15f generates, by learning, a speaker diarization model 14a for estimating the speaker utterance label of the speaker embedding vector of each frame, using a loss function that is calculated using the extracted speaker embedding vector, the estimated speaker utterance label representing the speaker of the speaker embedding vector, and the correct label of the speaker utterance label of each frame, and that includes a loss function representing the identity of the speaker of each frame.
  • specifically, the learning unit 15f generates the speaker diarization model 14a by learning using a loss function calculated from the generated speaker vector, the estimated speaker utterance label of the speaker embedding vector, and the correct label of the speaker utterance label of each frame.
  • when calculating the loss function representing speaker identity, the learning unit 15f uses the speaker vectors generated by the speaker vector generation unit 15e and the correct speaker utterance labels of the frames corresponding to those speaker vectors (the frame-by-frame speaker utterance correct label sequence after subsampling).
  • FIG. 5 shows an example of calculation of speaker identity loss.
  • the speaker identity loss is calculated using the loss function shown in the following equation (1), which is based on distance metric learning such as the Generalized End-to-End Loss.
  • the speaker diarization model 14a is trained so that the cosine distance between speaker vectors is small for the same speaker and large for different speakers.
  • the speaker identity loss is not limited to the Generalized End-to-End Loss, and may be a loss function based on other distance metric learning methods.
  • Triplet Loss or Siamese Loss may be used.
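  • Equation (1) itself is not reproduced here; the following is a hedged sketch of a simplified, GE2E-style speaker identity loss over the subsampled speaker vectors (the scaling parameters are fixed for brevity, whereas they are learnable in the original Generalized End-to-End Loss, and details may differ from the patent's equation (1)):

```python
import torch
import torch.nn.functional as F

def speaker_identity_loss(speaker_vectors: torch.Tensor,  # (N, V) subsampled speaker vectors
                          speaker_ids: torch.Tensor,      # (N,) speaker index of each vector
                          w: float = 10.0, b: float = -5.0) -> torch.Tensor:
    """Pull each vector toward its own speaker centroid (high cosine similarity)
    and push it away from the centroids of the other speakers."""
    speakers = torch.unique(speaker_ids)
    centroids = torch.stack([speaker_vectors[speaker_ids == s].mean(dim=0) for s in speakers])
    # Cosine similarity between every vector and every centroid, affinely scaled as in GE2E.
    sim = w * F.cosine_similarity(speaker_vectors.unsqueeze(1), centroids.unsqueeze(0), dim=-1) + b
    # Softmax variant: each vector should be most similar to its own speaker's centroid.
    target = torch.tensor([(speakers == s).nonzero().item() for s in speaker_ids])
    return F.cross_entropy(sim, target)
```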
  • the speaker vector is calculated from the speaker embedding vector, and this speaker embedding vector is used in the processing of the speaker utterance label estimation unit 15d. This enables the speaker utterance label estimation unit 15d to estimate the speaker utterance label while considering whether the speaker of each frame is the same speaker.
  • the speaker diarization model 14a has, as shown in FIG. 1, a speaker embedding extraction block, a speaker diarization result generation block, and a speaker vector generation block.
  • the speaker embedding extraction block is composed of, for example, one Linear layer and two bidirectional LSTM-RNN layers.
  • the speaker diarization result generation block is composed of, for example, three layers of bi-directional LSTM-RNN layers, one layer of Linear layer, and one layer of sigmoid layer.
  • the speaker vector generation block is composed of one subsample layer and one linear layer. However, the total number of layers and the number of hidden units may be changed.
  • This speaker diarization model 14a receives as input a frame-by-frame acoustic feature sequence of T frames × D dimensions, and outputs an estimated frame-by-frame speaker utterance label sequence of T × S dimensions in which each dimension takes a value in [0, 1]. A rough sketch of a model with this structure is given below.
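  • A hedged PyTorch sketch of a model with this block structure (hidden sizes, class and argument names are illustrative assumptions; the fixed-length windowing of the embedding extraction block and the attention mechanism layer are omitted for brevity):

```python
import torch
import torch.nn as nn

class SpeakerDiarizationModel(nn.Module):
    def __init__(self, feat_dim=24, hidden=256, spk_vec_dim=32, n_speakers=3):
        super().__init__()
        # Speaker embedding extraction block: one Linear layer + two bidirectional LSTM-RNN layers.
        self.embed_linear = nn.Linear(feat_dim, hidden)
        self.embed_rnn = nn.LSTM(hidden, hidden, num_layers=2,
                                 bidirectional=True, batch_first=True)
        # Speaker diarization result generation block: three BLSTM layers + Linear + sigmoid.
        self.diar_rnn = nn.LSTM(2 * hidden, hidden, num_layers=3,
                                bidirectional=True, batch_first=True)
        self.diar_out = nn.Linear(2 * hidden, n_speakers)
        # Speaker vector generation block: Linear projection applied, during training only,
        # to the embeddings of the frames picked by the subsample layer.
        self.spk_vec = nn.Linear(2 * hidden, spk_vec_dim)

    def forward(self, feats):                               # feats: (B, T, D)
        emb, _ = self.embed_rnn(self.embed_linear(feats))   # (B, T, 2*hidden) speaker embeddings
        h, _ = self.diar_rnn(emb)
        labels = torch.sigmoid(self.diar_out(h))            # (B, T, S), each value in [0, 1]
        return labels, emb
```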
  • the learning unit 15f calculates the loss function for the speaker utterance label of each frame using the estimated speaker utterance label and the correct label of the speaker utterance label of each frame, and calculates the loss function representing speaker identity using the extracted speaker embedding vector and the correct label of the speaker utterance label of each frame.
  • for the former, the frame-by-frame speaker utterance label loss is calculated using the correct label sequence of the speaker utterance labels for each frame generated by the speaker utterance label generation unit 15b.
  • the frame-by-frame speaker utterance label loss is calculated by multi-label binary cross-entropy that ignores the speaker ordering, as in conventional EEND.
  • the learning unit 15f uses the weighted sum of the two loss functions, the frame-by-frame speaker utterance label loss and the speaker identity loss, as the loss function of the entire model, and performs parameter optimization by the error backpropagation method, as in the sketch below.
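  • A hedged sketch of this multitask objective (the permutation handling and the loss weight are simplified assumptions; `speaker_identity_loss` is the illustrative function sketched earlier):

```python
import torch
import torch.nn.functional as F
from itertools import permutations

def diarization_label_loss(pred, target):
    """Frame-by-frame multi-label binary cross-entropy, minimized over speaker permutations."""
    # pred, target: float tensors of shape (T, S); trying every column permutation
    # of the predictions ignores the speaker ordering.
    losses = [F.binary_cross_entropy(pred[:, list(p)], target)
              for p in permutations(range(pred.shape[1]))]
    return torch.stack(losses).min()

def total_loss(pred, target, speaker_vectors, speaker_ids, identity_weight=0.1):
    """Weighted sum of the frame-by-frame utterance label loss and the speaker identity loss."""
    # speaker_identity_loss: see the earlier sketch.
    return (diarization_label_loss(pred, target)
            + identity_weight * speaker_identity_loss(speaker_vectors, speaker_ids))
```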
  • the speaker utterance label estimation unit 15d uses the generated speaker diarization model 14a to estimate a speaker utterance label for each frame of the acoustic signal. Specifically, as shown in FIG. 4, the speaker utterance label estimation unit 15d forward propagates the acoustic feature sequence through the speaker embedding extraction block and the speaker diarization result generation block shown in FIG. 1, and obtains a series of speaker utterance label estimation results (posterior probabilities) for each frame.
  • the utterance segment estimation unit 15g estimates the speaker's utterance segments in the acoustic signal using the output speaker utterance label posterior probabilities. Specifically, the utterance segment estimation unit 15g estimates the speaker utterance label using a moving average over a plurality of frames. That is, the utterance segment estimation unit 15g first calculates the moving average of the frame-by-frame speaker utterance label posterior probability over a window of length 6, consisting of the current frame and the preceding 5 frames. This makes it possible to prevent erroneous detection of impractically short speech segments, such as speech lasting only one frame.
  • when the moving average for a dimension exceeds a threshold, the utterance segment estimation unit 15g estimates that the frame is an utterance segment of the speaker corresponding to that dimension. For each speaker, the utterance segment estimation unit 15g regards a group of consecutive utterance-segment frames as one utterance, and calculates the start time and end time of the utterance period from those frames. As a result, the utterance start time and the utterance end time are obtained for each utterance of each speaker.
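  • A minimal sketch of this post-processing (the window length, threshold, and frame shift are assumptions for illustration; a centered moving average is used here for simplicity, whereas the embodiment averages the current and preceding 5 frames):

```python
import numpy as np

def estimate_segments(posteriors: np.ndarray, frame_shift: float = 0.010,
                      window: int = 6, threshold: float = 0.5):
    """Smooth frame-wise posteriors of shape (T, S) with a moving average, threshold them,
    and turn runs of consecutive active frames into (speaker, start_sec, end_sec) segments."""
    kernel = np.ones(window) / window
    segments = []
    for s in range(posteriors.shape[1]):
        smoothed = np.convolve(posteriors[:, s], kernel, mode="same")
        active = smoothed > threshold
        # Rising/falling edges of the active mask give the segment boundaries.
        edges = np.flatnonzero(np.diff(np.concatenate(([0], active.astype(int), [0]))))
        for start, end in zip(edges[::2], edges[1::2]):
            segments.append((s, start * frame_shift, end * frame_shift))
    return segments
```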
  • FIG. 6 shows the learning processing procedure.
  • the flowchart of FIG. 6 is started, for example, at the timing when an instruction to start the learning process is received.
  • the acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers, and outputs an acoustic feature sequence (step S1).
  • the speaker feature extraction unit 15c extracts a speaker embedding vector representing the speaker feature of each frame using the sequence of acoustic features for each frame of the acoustic signal (step S2).
  • the learning unit 15f generates, by learning, the speaker diarization model 14a for estimating the speaker utterance label of the speaker embedding vector of each frame, using a loss function that is calculated using the extracted speaker embedding vector, the estimated speaker utterance label representing the speaker of the speaker embedding vector, and the correct label of the speaker utterance label of each frame, and that includes the loss function representing the identity of the speaker of each frame (step S3).
  • the speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate the speaker utterance label representing the speaker of the speaker embedding vector.
  • the speaker vector generation unit 15e uses the extracted speaker embedding vector and the correct label of the speaker utterance label of each frame to generate a speaker vector representing the speaker feature of each frame in which only one speaker is speaking.
  • the learning unit 15f uses the generated speaker vector and the correct label of the speaker utterance label of each frame to calculate the loss function representing the speaker identity of each frame. In addition, the learning unit 15f calculates the loss function for the speaker utterance label of each frame using the estimated speaker utterance label sequence for each frame and the generated correct label sequence of the speaker utterance labels for each frame. Then, the learning unit 15f uses the weighted sum of the two loss functions as the loss function for the entire model to generate the speaker diarization model 14a. This completes the series of learning processes.
  • FIG. 7 shows the estimation processing procedure.
  • the flowchart in FIG. 7 is started, for example, when an input instructing the start of the estimation process is received.
  • the acoustic feature extraction unit 15a extracts acoustic features for each frame of an acoustic signal containing utterances of a plurality of speakers, and outputs an acoustic feature sequence (step S1).
  • the speaker utterance label estimation unit 15d uses the generated speaker diarization model 14a to estimate the speaker utterance label for each frame of the acoustic signal (step S4). Specifically, the speaker utterance label estimation unit 15d outputs the speaker utterance label posterior probability (estimated value of the speaker utterance label) for each frame of the acoustic feature sequence.
  • the utterance segment estimation unit 15g estimates the utterance segment of the speaker in the acoustic signal using the output speaker utterance label posterior probability (step S5). This completes a series of estimation processes.
  • the speaker feature extraction unit 15c uses the sequence of acoustic features for each frame of the acoustic signal to extract a speaker embedding vector representing the speaker feature of each frame.
  • a speaker utterance label estimation unit 15d uses the extracted speaker embedding vector to estimate a speaker utterance label representing the speaker of the speaker embedding vector.
  • the learning unit 15f generates, by learning, a speaker diarization model 14a for estimating the speaker utterance label of the speaker feature vector of each frame, using a loss function that is calculated using the extracted speaker embedding vector, the estimated speaker utterance label representing the speaker of the speaker embedding vector, and the correct label of the speaker utterance label of each frame, and that includes the loss function representing the identity of the speaker of each frame.
  • the speaker diarization device 10 can estimate the speaker's utterance label by considering whether the speaker of the speaker embedding vector is the same speaker in each frame. This makes it possible to perform highly accurate speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.
  • the speaker vector generation unit 15e uses the speaker embedding vector and the correct label of the speaker utterance label of each frame to generate a speaker vector representing the speaker feature of each frame with a single speaker.
  • further, the learning unit 15f generates the speaker diarization model 14a by learning using a loss function calculated from the generated speaker vector, the estimated speaker utterance label of the speaker embedding vector, and the correct label of the speaker utterance label of each frame.
  • as a result, the speaker diarization device 10 can estimate speaker utterance labels while considering whether or not the speaker of the speaker vector generated from the speaker embedding vector is the same speaker in each frame. This makes it possible to more accurately perform speaker diarization for acoustic signals containing different speakers with similar speaking styles and voice qualities.
  • the learning unit 15f generates the speaker diarization model 14a by learning using a loss function that is the weighted sum of the loss function for the speaker utterance label of each frame, calculated using the estimated speaker utterance label and the correct label of the speaker utterance label of each frame, and the loss function representing the identity of the speaker, calculated using the extracted speaker embedding vector and the correct label of the speaker utterance label of each frame. This enables the speaker diarization device 10 to perform speaker diarization with higher accuracy.
  • the speaker utterance label estimation unit 15d uses the generated speaker diarization model 14a to estimate the speaker utterance label for each frame of the acoustic signal.
  • the speaker diarization apparatus can perform highly accurate speaker diarization while suppressing speaker errors by considering whether or not the speaker of each frame is the same speaker.
  • the utterance segment estimation unit 15g estimates the speaker label using the moving average over a plurality of frames. This enables the speaker diarization device 10 to prevent erroneous detection of unrealistically short speech segments.
  • the speaker diarization apparatus 10 can be implemented by installing a speaker diarization program for executing the above-described speaker diarization processing as package software or online software on a desired computer.
  • the information processing apparatus can function as the speaker diarization apparatus 10 by causing the information processing apparatus to execute the speaker diarization program.
  • information processing devices include mobile communication terminals such as smartphones, mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).
  • the functions of the speaker diarization device 10 may be implemented in a cloud server.
  • FIG. 8 is a diagram showing an example of a computer that executes a speaker diarization program.
  • Computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1031.
  • Disk drive interface 1040 is connected to disk drive 1041.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example.
  • a mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example.
  • a display 1061 is connected to the video adapter 1060.
  • the hard disk drive 1031 stores an OS 1091, application programs 1092, program modules 1093 and program data 1094, for example. Each piece of information described in the above embodiment is stored in the hard disk drive 1031 or the memory 1010, for example.
  • the speaker diarization program is stored in the hard disk drive 1031 as a program module 1093 in which instructions to be executed by the computer 1000 are written, for example.
  • the hard disk drive 1031 stores a program module 1093 that describes each process executed by the speaker diarization apparatus 10 described in the above embodiment.
  • Data used for information processing by the speaker diarization program is stored as program data 1094 in the hard disk drive 1031, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary, and executes each procedure described above.
  • note that the program module 1093 and the program data 1094 related to the speaker diarization program are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like.
  • alternatively, the program module 1093 and the program data 1094 related to the speaker diarization program may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network), and read by the CPU 1020 via the network interface 1070.
  • 10 speaker diarization device, 11 input unit, 12 output unit, 13 communication control unit, 14 storage unit, 14a speaker diarization model, 15 control unit, 15a acoustic feature extraction unit, 15b speaker utterance label generation unit, 15c speaker feature extraction unit, 15d speaker utterance label estimation unit, 15e speaker vector generation unit, 15f learning unit, 15g utterance segment estimation unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A speaker feature extraction unit (15c) extracts, using a sequence of acoustic features of each frame of an acoustic signal, a speaker embedding vector representing a speaker feature of each frame. A speaker utterance label estimation unit (15d) estimates, using the extracted speaker embedding vector, a speaker utterance label representing the speaker of the speaker embedding vector. Using a loss function that is calculated using the extracted speaker embedding vector, the estimated speaker utterance label representing the speaker of the speaker embedding vector, and a correct label of the speaker utterance label of each frame, and that includes a loss function representing speaker identity, a learning unit (15f) trains and generates a speaker diarization model (14a) for estimating the speaker utterance label of the speaker feature vector of each frame.
PCT/JP2021/025849 2021-07-08 2021-07-08 Speaker diarization method, speaker diarization device, and speaker diarization program WO2023281717A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/025849 WO2023281717A1 (fr) 2021-07-08 2021-07-08 Speaker diarization method, speaker diarization device, and speaker diarization program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/025849 WO2023281717A1 (fr) 2021-07-08 2021-07-08 Speaker diarization method, speaker diarization device, and speaker diarization program

Publications (1)

Publication Number Publication Date
WO2023281717A1 true WO2023281717A1 (fr) 2023-01-12

Family

ID=84800543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/025849 WO2023281717A1 (fr) 2021-07-08 2021-07-08 Speaker diarization method, speaker diarization device, and speaker diarization program

Country Status (1)

Country Link
WO (1) WO2023281717A1 (fr)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012118668A (ja) * 2010-11-30 2012-06-21 National Institute Of Information & Communication Technology Training device for a pattern classification device and computer program therefor
US20190295553A1 (en) * 2018-03-21 2019-09-26 Hyundai Mobis Co., Ltd. Apparatus for recognizing voice speaker and method for the same
JP2020527248A (ja) * 2018-05-28 2020-09-03 平安科技(深▲せん▼)有限公司Ping An Technology (Shenzhen) Co.,Ltd. Speaker separation model training method, two-speaker separation method, and related equipment
JP2020052611A (ja) * 2018-09-26 2020-04-02 日本電信電話株式会社 Tag estimation device, tag estimation method, and program
WO2020240682A1 (fr) * 2019-05-28 2020-12-03 日本電気株式会社 Signal extraction system, signal extraction learning method, and signal extraction learning program
JP2021026050A (ja) * 2019-07-31 2021-02-22 株式会社リコー Speech recognition system, information processing device, speech recognition method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21949350

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE