WO2022130471A1 - Speaker diarization method, speaker diarization device, and speaker diarization program - Google Patents
Speaker diarization method, speaker diarization device, and speaker diarization program
- Publication number
- WO2022130471A1 (PCT/JP2020/046585)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speaker
- frame
- array
- label
- learning
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
  - G10L15/00—Speech recognition
    - G10L15/04—Segmentation; Word boundary detection
  - G10L17/00—Speaker identification or verification techniques
    - G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    - G10L17/04—Training, enrolment or model building
    - G10L17/18—Artificial neural networks; Connectionist approaches
Definitions
- The present invention relates to a speaker diarization method, a speaker diarization device, and a speaker diarization program.
- EEND (End-to-End Neural Diarization) is a known speaker diarization technique.
- In EEND, the acoustic signal is divided into frames, and a speaker label indicating whether or not a specific speaker is present in each frame is estimated from the acoustic features extracted from that frame.
- The speaker label for each frame is an S-dimensional vector whose element for a given speaker is 1 when that speaker is speaking in the frame and 0 when not. That is, EEND realizes speaker diarization as multi-label binary classification over the number of speakers.
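To make this label format concrete, the following is a minimal sketch (an illustration with invented values, not code from the publication) of a per-frame label matrix for T frames and S speakers:

```python
import numpy as np

T, S = 6, 2  # 6 frames, 2 speakers (illustrative values)

# One S-dimensional binary label per frame; labels[t, s] == 1
# means speaker s is speaking in frame t.
labels = np.array([
    [1, 0],  # only speaker 0 speaks
    [1, 0],
    [1, 1],  # overlap: both speakers speak
    [0, 1],
    [0, 1],
    [0, 0],  # silence
])
assert labels.shape == (T, S)
```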
- The EEND model used in EEND to estimate the per-frame speaker label sequence is a deep-learning model composed of layers through which errors can be backpropagated, and it can estimate the speaker label sequence for all frames at once from the acoustic feature sequence.
- The EEND model includes an RNN (Recurrent Neural Network) layer for time-series modeling. As a result, EEND can estimate the speaker label of each frame using the acoustic features of not only that frame but also the surrounding frames. A bidirectional LSTM (Long Short-Term Memory) RNN or a Transformer encoder is used for this RNN layer.
- Non-Patent Document 2 describes the RNN Transducer, and Non-Patent Document 3 describes acoustic features.
- A BLSTM-RNN estimates the speaker label of a frame using the input frame and the internal states carried over from adjacent frames. Therefore, the farther another frame is from the frame in question, the harder it is to use its acoustic features for estimating the speaker label.
- With a Transformer encoder, on the other hand, the EEND model is trained to estimate in which frames the information useful for estimating a given frame's speaker label resides. Therefore, the longer the acoustic feature sequence, the more candidate frames there are, and the harder it becomes to estimate the speaker label.
- The present invention has been made in view of the above, and an object of the present invention is to perform speaker diarization on long acoustic signals with high accuracy.
- The speaker diarization method includes a generation step of dividing the sequence of per-frame acoustic features of an acoustic signal into segments of a predetermined length and generating an array in which the plurality of divided row-direction segments are arranged in the column direction, and a learning step of generating, by learning using the array, a model that estimates the speaker label of the speaker vector of each frame.
- FIG. 1 is a diagram for explaining an overview of the speaker diarization device.
- FIG. 2 is a schematic diagram illustrating the schematic configuration of the speaker diarization device.
- FIG. 3 is a diagram for explaining the processing of the speaker diarization device.
- FIG. 4 is a diagram for explaining the processing of the speaker diarization device.
- FIG. 5 is a flowchart showing the speaker diarization processing procedure.
- FIG. 6 is a flowchart showing the speaker diarization processing procedure.
- FIG. 7 is a diagram illustrating a computer that executes the speaker diarization program.
- FIG. 1 is a diagram for explaining an overview of the speaker diarization device.
- The speaker diarization device of the present embodiment divides the input two-dimensional acoustic feature sequence into segments and converts it into a three-dimensional acoustic feature array. This acoustic feature array is then input to a speaker diarization model that includes two sequence models: a column-oriented RNN and a row-oriented RNN.
- Specifically, the speaker diarization device divides the T-frame × D-dimensional two-dimensional acoustic feature sequence into L-frame segments with a shift width of N frames. Each segment then becomes one row, and the rows are stacked with their heads aligned in the column direction, producing a three-dimensional acoustic feature array of (T-L)/N rows × L columns × D dimensions.
- First, a row-oriented RNN layer that performs RNN processing on each row is applied to the generated array, and hidden-layer outputs are obtained from the acoustic feature sequence within each segment.
- Next, a column-oriented RNN layer that performs RNN processing on each column of the array is applied to obtain hidden-layer output sequences spanning multiple segments; this yields the embedding sequence used to estimate the speaker label of each frame.
- The rows of this per-frame embedding array are then overlap-added to obtain the speaker label embedding sequence for all T frames.
- Finally, the speaker diarization device applies a Linear layer and a sigmoid layer to obtain the speaker label sequence for each frame.
- By applying the row-oriented RNN layer, the speaker diarization device can perform speaker diarization using local context information. In this case, adjacent frames tend to be given the same speaker label.
- By applying the column-oriented RNN layer, the speaker diarization device can perform speaker diarization using global context information. This makes it possible to include utterances of the same speaker that are separated in time in the scope of speaker diarization.
- FIG. 2 is a schematic diagram illustrating the schematic configuration of the speaker diarization device. FIGS. 3 and 4 are diagrams for explaining the processing of the speaker diarization device.
- The speaker diarization device 10 of the present embodiment is realized by a general-purpose computer such as a personal computer and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
- The input unit 11 is realized by an input device such as a keyboard or a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator.
- The output unit 12 is realized by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like.
- The communication control unit 13 is realized by a NIC (Network Interface Card) or the like, and controls communication over a network between the control unit 15 and external devices such as a server or a device that acquires the acoustic signal.
- The storage unit 14 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk.
- The storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13.
- The storage unit 14 stores, for example, the speaker diarization model 14a used for the speaker diarization processing described later.
- The control unit 15 is realized by a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. As illustrated in FIG. 2, the control unit 15 thereby functions as an acoustic feature extraction unit 15a, an array generation unit 15b, a speaker label generation unit 15c, a learning unit 15d, an estimation unit 15e, and an utterance section estimation unit 15f. These functional units may be implemented on different hardware; for example, the learning unit 15d may be implemented as a learning device and the estimation unit 15e as an estimation device. The control unit 15 may also include other functional units.
- The acoustic feature extraction unit 15a extracts the acoustic features of each frame of an acoustic signal containing speakers' utterances. For example, the acoustic feature extraction unit 15a receives an acoustic signal via the input unit 11, or from a device that acquires acoustic signals via the communication control unit 13. The acoustic feature extraction unit 15a then divides the acoustic signal into frames, extracts an acoustic feature vector from the signal of each frame by applying a discrete Fourier transform or filter bank multiplication, and outputs the acoustic feature sequence obtained by concatenating these vectors in the frame direction. In this embodiment, the frame length is 25 ms and the frame shift width is 10 ms.
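As a rough sketch of this kind of per-frame feature extraction (assuming the librosa library is available; the function name and defaults here are illustrative, not from the publication), using the 25 ms frame length, 10 ms shift, and 24-dimensional MFCCs mentioned in this embodiment:

```python
import numpy as np
import librosa

def extract_features(wav_path: str, n_mfcc: int = 24) -> np.ndarray:
    """Return a T x D acoustic feature sequence (one MFCC vector per frame)."""
    y, sr = librosa.load(wav_path, sr=None)
    frame_length = int(0.025 * sr)  # 25 ms frame length
    hop_length = int(0.010 * sr)    # 10 ms frame shift
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=frame_length, hop_length=hop_length,
    )
    return mfcc.T  # shape: (T frames, n_mfcc dims)
```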
- The acoustic feature vector is, for example, a 24-dimensional MFCC (Mel-Frequency Cepstral Coefficients) vector, but it is not limited to this; other per-frame acoustic features, such as mel filter bank outputs, may be used instead.
- The array generation unit 15b divides the sequence of per-frame acoustic features of the acoustic signal into segments of a predetermined length and generates an array in which the plurality of divided row-direction segments are arranged in the column direction. Specifically, as shown in FIG. 1, the array generation unit 15b divides the input two-dimensional acoustic feature sequence into segments and converts it into a three-dimensional acoustic feature array.
- For example, the array generation unit 15b divides the T-frame × D-dimensional two-dimensional acoustic feature sequence into L-frame segments with a shift width of N frames. Each segment then becomes one row, and the rows are stacked with their heads aligned in the column direction, producing a three-dimensional acoustic feature array of (T-L)/N rows × L columns × D dimensions.
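A minimal sketch of this segment division (assuming NumPy; the function name is invented, and note that counting the final segment gives (T - L)/N + 1 rows, which may differ by one from the text's (T - L)/N depending on the convention):

```python
import numpy as np

def segment_features(x: np.ndarray, seg_len: int, shift: int) -> np.ndarray:
    """Divide a (T, D) feature sequence into overlapping seg_len-frame segments.

    Returns a 3-D array of shape (num_segments, seg_len, D); each row of the
    result is one L-frame segment, and rows are aligned at their heads.
    """
    T, D = x.shape
    num_segments = (T - seg_len) // shift + 1
    return np.stack(
        [x[i * shift : i * shift + seg_len] for i in range(num_segments)]
    )

# Usage: T=1000 frames of D=24 features, L=200-frame segments, N=100-frame shift.
x = np.random.randn(1000, 24)
arr = segment_features(x, seg_len=200, shift=100)
print(arr.shape)  # (9, 200, 24) -> rows x columns x feature dims
```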
- The array generation unit 15b may be included in the learning unit 15d and the estimation unit 15e described later.
- FIGS. 3 and 4, described later, show an example in which the processing of the array generation unit 15b is carried out within the learning unit 15d and the estimation unit 15e.
- The speaker label generation unit 15c generates a speaker label for each frame using the acoustic feature sequence. Specifically, as shown in FIG. 3, the speaker label generation unit 15c generates the per-frame speaker labels using the acoustic feature sequence and the ground-truth labels of the speakers' utterance sections. The resulting pairs of acoustic feature sequence and per-frame speaker labels are used as the training data for the processing of the learning unit 15d described later.
- The learning unit 15d uses the generated array to generate, by learning, a speaker diarization model 14a that estimates the speaker label of the speaker vector of each frame. Specifically, as shown in FIGS. 3 and 4, the learning unit 15d trains the speaker diarization model 14a, which is based on bidirectional RNNs, using the pairs of acoustic feature sequence and per-frame speaker labels as training data.
- FIG. 4 illustrates the configuration of the speaker diarization model 14a based on the bidirectional RNNs of the present embodiment.
- The speaker diarization model 14a is composed of a plurality of layers, including a row-oriented RNN layer and a column-oriented RNN layer, in addition to the segment division/arrangement layer that corresponds to the processing of the array generation unit 15b.
- In the row-oriented RNN layer and the column-oriented RNN layer, bidirectional processing is performed in the row direction and the column direction of the input three-dimensional acoustic feature array, respectively.
- In the present embodiment, a row-oriented BLSTM-RNN is applied as the row-oriented RNN layer, and a column-oriented BLSTM-RNN is applied as the column-oriented RNN layer.
- The speaker diarization model 14a also has an overlap-add layer.
- In the overlap-add layer, as shown in FIG. 1, the rows of the three-dimensional acoustic feature array are placed back at the positions they occupied in the acoustic feature sequence before segmentation and added together where they overlap. As a result, a T × D-dimensional speaker label embedding sequence aligned with the acoustic feature sequence is obtained.
- Furthermore, the speaker diarization model 14a has a Linear layer that performs a linear transformation and a sigmoid layer that applies a sigmoid function. As shown in FIG. 1, inputting the T × D-dimensional speaker label embedding sequence to the Linear layer and the sigmoid layer yields the T × S-dimensional speaker label posterior probabilities for each frame.
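The following is a minimal PyTorch sketch of a model with this structure (row-oriented BLSTM, column-oriented BLSTM, overlap-add, then Linear and sigmoid). It is an illustration under assumptions, not the publication's implementation: the class name, argument names, and hyperparameters are invented.

```python
import torch
import torch.nn as nn

class RowColumnDiarizationModel(nn.Module):
    """Sketch of a bidirectional row/column RNN diarization model."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_speakers: int, shift: int):
        super().__init__()
        self.shift = shift  # segment shift width N, in frames
        self.row_rnn = nn.LSTM(feat_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.col_rnn = nn.LSTM(2 * hidden_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden_dim, num_speakers)

    def forward(self, segments: torch.Tensor, total_frames: int) -> torch.Tensor:
        # segments: (R rows, L columns, D dims) from the segment-division layer
        R, L, D = segments.shape
        h, _ = self.row_rnn(segments)   # process each row: local context
        h = h.transpose(0, 1)           # (L, R, 2H): each column becomes a sequence
        h, _ = self.col_rnn(h)          # process each column: global context
        h = h.transpose(0, 1)           # back to (R, L, 2H)

        # Overlap-add layer: place each row at its original position on the
        # T-frame timeline and add where segments overlap.
        emb = h.new_zeros(total_frames, h.shape[-1])
        for r in range(R):
            emb[r * self.shift : r * self.shift + L] += h[r]

        # Linear + sigmoid: per-frame speaker label posteriors, shape (T, S).
        return torch.sigmoid(self.linear(emb))
```

Combined with the segment_features sketch above, a (T, D) feature sequence is segmented, passed through this model, and yields a (T, S) posterior matrix.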
- The learning unit 15d optimizes the parameters of the Linear layer, the row-oriented BLSTM-RNN layer, and the column-oriented BLSTM-RNN layer of the speaker diarization model 14a by error backpropagation, using as the loss function the multi-label binary cross entropy between the per-frame speaker label posterior probabilities and the per-frame speaker labels.
- For this parameter optimization, the learning unit 15d uses an online optimization algorithm based on stochastic gradient descent.
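A correspondingly hedged sketch of one training step, with binary cross entropy as the multi-label loss and plain SGD standing in for the unspecified stochastic-gradient online optimizer; the learning rate and model sizes are arbitrary assumptions:

```python
import torch

model = RowColumnDiarizationModel(feat_dim=24, hidden_dim=128,
                                  num_speakers=2, shift=100)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed settings
criterion = torch.nn.BCELoss()  # multi-label binary cross entropy

def train_step(segments: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step; labels is the (T, S) binary matrix of FIG. 3."""
    optimizer.zero_grad()
    posteriors = model(segments, total_frames=labels.shape[0])
    loss = criterion(posteriors, labels.float())
    loss.backward()   # error backpropagation through all layers
    optimizer.step()
    return loss.item()
```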
- In this way, the learning unit 15d generates a speaker diarization model 14a that includes an RNN processing the array in the row direction and an RNN processing it in the column direction. This enables speaker diarization using local context information as well as speaker diarization using global context information. The learning unit 15d can therefore learn to treat utterances of the same speaker that are separated in time as targets of speaker diarization.
- The estimation unit 15e estimates the speaker label for each frame of an acoustic signal using the generated speaker diarization model 14a. Specifically, as shown in FIG. 3, the estimation unit 15e forward-propagates the array generated by the array generation unit 15b from the acoustic feature sequence through the speaker diarization model 14a, and obtains the speaker label posterior probability (the estimated value of the speaker label) for each frame of the acoustic feature sequence.
- The utterance section estimation unit 15f estimates the utterance sections of the speakers in the acoustic signal using the output speaker label posterior probabilities. Specifically, the utterance section estimation unit 15f estimates the speaker labels using a moving average over a plurality of frames. That is, it first computes, for the speaker label posterior probability of each frame, a moving average of length 11 spanning the frame itself and the five frames before and after it. This prevents erroneous detection of unrealistically short utterance sections, such as an utterance lasting only a single frame.
- The utterance section estimation unit 15f then judges a frame to be an utterance section of the speaker corresponding to a given dimension when the computed moving average value is larger than 0.5. Furthermore, it treats each run of consecutive utterance-section frames as one utterance per speaker and converts the frame positions back into the start time and end time of the utterance section. As a result, the utterance start time and utterance end time of each utterance of each speaker are obtained.
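A small sketch of this post-processing (length-11 moving average, 0.5 threshold, then run detection); the function name is invented, and the 10 ms frame shift is taken from the feature extraction step above:

```python
import numpy as np

def utterance_sections(posteriors: np.ndarray, frame_shift: float = 0.010):
    """Yield (speaker, start_sec, end_sec) utterances from (T, S) posteriors."""
    T, S = posteriors.shape
    kernel = np.ones(11) / 11  # the frame itself plus five frames on each side
    for s in range(S):
        smoothed = np.convolve(posteriors[:, s], kernel, mode="same")
        active = smoothed > 0.5
        # Edges of consecutive runs of active frames: run starts at even
        # positions of `edges`, (exclusive) run ends at odd positions.
        padded = np.concatenate(([0], active.astype(int), [0]))
        edges = np.flatnonzero(np.diff(padded))
        for start, end in zip(edges[::2], edges[1::2]):
            yield s, start * frame_shift, end * frame_shift

# Usage with the (T, S) posterior matrix produced by the model:
# for spk, t0, t1 in utterance_sections(posteriors.detach().numpy()):
#     print(f"speaker {spk}: {t0:.2f}s - {t1:.2f}s")
```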
- FIGS. 5 and 6 are flowcharts showing the speaker diarization processing procedure.
- The speaker diarization processing of the present embodiment includes a learning process and an estimation process.
- FIG. 5 shows the learning processing procedure.
- The flowchart of FIG. 5 is started, for example, at the timing when an input instructing the start of the learning process is received.
- First, the acoustic feature extraction unit 15a extracts the acoustic features of each frame of the acoustic signal containing the speakers' utterances and outputs the acoustic feature sequence (step S1).
- Next, the array generation unit 15b divides the two-dimensional acoustic feature sequence of the acoustic signal into segments of a predetermined length and generates a three-dimensional acoustic feature array in which the plurality of divided row-direction segments are arranged in the column direction (step S2).
- Then, the learning unit 15d generates, by learning using the generated acoustic feature array, a speaker diarization model 14a for estimating the speaker label of the speaker vector of each frame (step S3). At that time, the learning unit 15d generates a speaker diarization model 14a that includes an RNN processing the array in the row direction and an RNN processing it in the column direction. This completes the series of learning processes.
- FIG. 6 shows the estimation processing procedure.
- The flowchart of FIG. 6 is started, for example, at the timing when an input instructing the start of the estimation process is received.
- First, the acoustic feature extraction unit 15a extracts the acoustic features of each frame of the acoustic signal containing the speakers' utterances and outputs the acoustic feature sequence (step S1).
- Next, the array generation unit 15b divides the two-dimensional acoustic feature sequence of the acoustic signal into segments of a predetermined length and generates a three-dimensional acoustic feature array in which the plurality of divided row-direction segments are arranged in the column direction (step S2).
- Then, the estimation unit 15e estimates the speaker label for each frame of the acoustic signal using the generated speaker diarization model 14a (step S4). Specifically, the estimation unit 15e outputs the speaker label posterior probability (the estimated value of the speaker label) for each frame of the acoustic feature sequence.
- Finally, the utterance section estimation unit 15f estimates the utterance sections of the speakers in the acoustic signal using the output speaker label posterior probabilities (step S5). This completes the series of estimation processes.
- As described above, the array generation unit 15b divides the sequence of per-frame acoustic features of an acoustic signal into segments of a predetermined length and generates an array in which the divided row-direction segments are arranged in the column direction. The learning unit 15d then uses the generated array to generate, by learning, a speaker diarization model 14a that estimates the speaker label of the speaker vector of each frame.
- Specifically, the learning unit 15d generates a speaker diarization model 14a including an RNN that processes the array in the row direction and an RNN that processes it in the column direction. This enables speaker diarization using local context information as well as global context information, so utterances of the same speaker that are separated in time can be learned as targets of speaker diarization. As a result, the speaker diarization device 10 can perform speaker diarization on long acoustic signals with high accuracy.
- Furthermore, the estimation unit 15e estimates the speaker label for each frame of an acoustic signal using the generated speaker diarization model 14a. This enables highly accurate speaker diarization for long acoustic signals.
- The utterance section estimation unit 15f also estimates the speaker labels using a moving average over a plurality of frames, which prevents erroneous detection of unrealistically short utterance sections.
- The speaker diarization device 10 can be implemented by installing, on a desired computer, a speaker diarization program that executes the above speaker diarization processing as packaged or online software.
- For example, by causing an information processing device to execute the above speaker diarization program, the information processing device can be made to function as the speaker diarization device 10.
- Such information processing devices include smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) devices, and slate terminals such as PDAs (Personal Digital Assistants).
- The functions of the speaker diarization device 10 may also be implemented on a cloud server.
- FIG. 7 is a diagram showing an example of a computer that executes the speaker diarization program.
- The computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These parts are connected by a bus 1080.
- The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
- The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
- The hard disk drive interface 1030 is connected to the hard disk drive 1031.
- The disk drive interface 1040 is connected to the disk drive 1041.
- A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041.
- A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050.
- A display 1061 is connected to the video adapter 1060.
- The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.
- The speaker diarization program is stored in the hard disk drive 1031 as, for example, a program module 1093 in which the commands to be executed by the computer 1000 are described.
- Specifically, the program module 1093 describing each process executed by the speaker diarization device 10 of the above embodiment is stored in the hard disk drive 1031.
- The data used for information processing by the speaker diarization program is stored as program data 1094 in, for example, the hard disk drive 1031.
- The CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as needed and executes each of the above-described procedures.
- The program module 1093 and the program data 1094 related to the speaker diarization program are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like.
- Alternatively, the program module 1093 and the program data 1094 related to the speaker diarization program may be stored in another computer connected via a network such as a LAN (Local Area Network) or WAN (Wide Area Network), and read by the CPU 1020 via the network interface 1070.
Description
11 Input unit
12 Output unit
13 Communication control unit
14 Storage unit
14a Speaker diarization model
15 Control unit
15a Acoustic feature extraction unit
15b Array generation unit
15c Speaker label generation unit
15d Learning unit
15e Estimation unit
15f Utterance section estimation unit
Claims (6)
- A speaker diarization method executed by a speaker diarization device, the method comprising:
a generation step of dividing a sequence of acoustic features for each frame of an acoustic signal into segments of a predetermined length and generating an array in which the plurality of divided row-direction segments are arranged in the column direction; and
a learning step of generating, by learning using the array, a model that estimates the speaker label of the speaker vector of each frame.
- The speaker diarization method according to claim 1, wherein the learning step generates the model including an RNN that processes the array in the row direction and an RNN that processes the array in the column direction.
- The speaker diarization method according to claim 1, further comprising an estimation step of estimating a speaker label for each frame of an acoustic signal using the generated model.
- The speaker diarization method according to claim 3, wherein the estimation step estimates the speaker label using a moving average over a plurality of frames.
- A speaker diarization device comprising:
a generation unit that divides a sequence of acoustic features for each frame of an acoustic signal into segments of a predetermined length and generates an array in which the plurality of divided row-direction segments are arranged in the column direction; and
a learning unit that generates, by learning using the array, a model that estimates the speaker label of the speaker vector of each frame.
- A speaker diarization program for causing a computer to execute:
a generation step of dividing a sequence of acoustic features for each frame of an acoustic signal into segments of a predetermined length and generating an array in which the plurality of divided row-direction segments are arranged in the column direction; and
a learning step of generating, by learning using the array, a model that estimates the speaker label of the speaker vector of each frame.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/266,513 US20240105182A1 (en) | 2020-12-14 | 2020-12-14 | Speaker diarization method, speaker diarization device, and speaker diarization program |
JP2022569345A JP7505584B2 (ja) | 2020-12-14 | 2020-12-14 | 話者ダイアライゼーション方法、話者ダイアライゼーション装置および話者ダイアライゼーションプログラム |
PCT/JP2020/046585 WO2022130471A1 (ja) | 2020-12-14 | 2020-12-14 | 話者ダイアライゼーション方法、話者ダイアライゼーション装置および話者ダイアライゼーションプログラム |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2020/046585 WO2022130471A1 (ja) | 2020-12-14 | 2020-12-14 | 話者ダイアライゼーション方法、話者ダイアライゼーション装置および話者ダイアライゼーションプログラム |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022130471A1 true WO2022130471A1 (ja) | 2022-06-23 |
Family
ID=82057429
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/046585 WO2022130471A1 (ja) | 2020-12-14 | 2020-12-14 | 話者ダイアライゼーション方法、話者ダイアライゼーション装置および話者ダイアライゼーションプログラム |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240105182A1 (ja) |
JP (1) | JP7505584B2 (ja) |
WO (1) | WO2022130471A1 (ja) |
- 2020
  - 2020-12-14 JP JP2022569345A patent/JP7505584B2/ja active Active
  - 2020-12-14 US US18/266,513 patent/US20240105182A1/en active Pending
  - 2020-12-14 WO PCT/JP2020/046585 patent/WO2022130471A1/ja active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2019086679A (ja) * | 2017-11-08 | 2019-06-06 | 株式会社東芝 | 対話システム、対話方法および対話プログラム |
Also Published As
Publication number | Publication date |
---|---|
JP7505584B2 (ja) | 2024-06-25 |
US20240105182A1 (en) | 2024-03-28 |
JPWO2022130471A1 (ja) | 2022-06-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20965859; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2022569345; Country of ref document: JP; Kind code of ref document: A |
| WWE | Wipo information: entry into national phase | Ref document number: 18266513; Country of ref document: US |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20965859; Country of ref document: EP; Kind code of ref document: A1 |