WO2021220789A1 - Speaker diarization device and speaker diarization method

Speaker diarization device and speaker diarization method

Info

Publication number
WO2021220789A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
feature amount
unit
diarization
clustering
Application number
PCT/JP2021/015202
Other languages
French (fr)
Japanese (ja)
Inventor
翔太 堀口 (Shota Horiguchi)
Original Assignee
株式会社日立製作所 (Hitachi, Ltd.)
Application filed by 株式会社日立製作所 (Hitachi, Ltd.)
Publication of WO2021220789A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L17/00 Speaker identification or verification
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information

Definitions

  • The present invention relates to a speaker diarization device and a speaker diarization method.
  • Patent Document 1 describes a signal analyzer configured for the purpose of performing optimum diarization and the like.
  • The signal analyzer models a sound source position occurrence probability matrix Q, consisting of the probability that a signal arrives from each sound source position candidate in each frame (a time interval), for a plurality of sound source position candidates, as the product of a sound source position probability matrix B, consisting of the probability that a signal from each of a plurality of sound sources arrives from each sound source position candidate, and a sound source existence probability matrix A, consisting of the existence probability of a signal from each sound source in each frame. Based on this modeling, at least one of the sound source position probability matrix B and the sound source existence probability matrix A is estimated.
  • Non-Patent Document 1 describes a method for performing speaker diarization.
  • In this method, the voice sections of voice recorded by a monaural microphone are divided into short segments, a feature amount including speaker characteristics is extracted from each segment, the feature amounts are clustered, and speaker diarization is performed from the clustering result.
  • In Patent Document 1, the direction of a sound source is estimated from sound recorded using microphones (hereinafter referred to as "mics") arranged at predetermined positions, and speaker diarization is performed on the assumption that sounds arriving from different directions belong to different speakers.
  • However, Patent Document 1 exploits the fact that the microphone arrangement is known, and uses probability distributions of feature vectors over frequency bins prepared in advance for each sound source position candidate from measured data. Therefore, if the microphone arrangement is unknown and no such training data (probability distributions) exists, speaker diarization cannot be performed.
  • In Non-Patent Document 1, since a single monaural microphone is used, each segment obtained by dividing the voice section is assigned to exactly one speaker. Therefore, when a plurality of speakers speak at the same time, it is not possible to determine which speaker the segment should be assigned to. Furthermore, since the voices of all speakers are recorded by one monaural microphone, all speakers must speak near that microphone.
  • The present invention has been made in view of this background, and an object of the present invention is to provide a speaker diarization device and a speaker diarization method capable of performing speaker diarization accurately even when a plurality of speakers speak at the same time.
  • One aspect of the present invention for achieving the above object is a speaker diarization device configured using an information processing device, comprising: a signal division unit that divides each of a plurality of signals obtained from a plurality of audio signal input units into a plurality of segments having a predetermined time width; a feature amount extraction unit that extracts a feature amount from each of the segments; a clustering unit that collectively clusters the feature amounts extracted from the segments of the plurality of signals; and a speaker diarization unit that performs speaker diarization based on the result of the clustering.
  • According to the present invention, speaker diarization can be performed with high accuracy even when a plurality of speakers speak at the same time.
  • FIG. 10 shows an example arrangement of speakers and microphones, FIG. 11 is a schematic diagram explaining the distribution of feature amounts in the feature amount space, FIG. 12 is a schematic diagram explaining the result of clustering, and FIG. 13 is a flowchart explaining the speaker diarization process.
  • Numbers for identifying components are used per context, and a number used in one context does not necessarily indicate the same configuration in another context. Further, this does not preclude a component identified by one number from also having the function of a component identified by another number.
  • In the following description, the letter "S" prefixed to a reference numeral denotes a processing step.
  • FIG. 1 shows the hardware configuration of a device that performs speaker diarization (hereinafter referred to as "speaker diarization device 1"), described as the first embodiment.
  • The speaker diarization device 1 is an information processing device (computer) and includes a processor 11, a ROM 12 (Read Only Memory), a RAM 13 (Random Access Memory), and two signal input devices 14a and 14b. These are communicably connected to each other through a bus 10 or the like.
  • The illustrated speaker diarization device 1 includes two signal input devices 14a and 14b, but the speaker diarization device 1 may include three or more signal input devices.
  • The signal input devices 14a and 14b may be voice input devices such as microphones (hereinafter referred to as "mics"), or may be devices that output a voice signal after dereverberation, sound source separation, or the like has been performed.
  • The RAM 13 stores a program for realizing the functions of the speaker diarization device 1 (hereinafter referred to as the "speaker diarization execution unit 131").
  • The speaker diarization device 1 may be configured using a plurality of information processing devices communicably connected to each other.
  • All or part of the speaker diarization device 1 may be realized using virtual information processing resources provided by virtualization technology, process space separation technology, or the like, such as a virtual server provided by a cloud system.
  • All or part of the functions provided by the speaker diarization device 1 may be realized by, for example, services provided by a cloud system via an API (Application Programming Interface) or the like.
  • The functions of the speaker diarization execution unit 131 and the like included in the speaker diarization device 1 may be realized by hardware such as a DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), ASIC (Application Specific Integrated Circuit), or AI (Artificial Intelligence) chip.
  • FIG. 2 is a diagram explaining the details of the speaker diarization execution unit 131.
  • As shown in the figure, the speaker diarization execution unit 131 includes signal input units 1001a and 1001b, signal division units 1002a and 1002b, feature amount extraction units 1003a and 1003b, a clustering unit 1007, and a speaker diarization unit 1008.
  • A signal is input to the signal input unit 1001a from the signal input device 14a, and a signal is input to the signal input unit 1001b from the signal input device 14b.
  • The processing performed on the signal from the signal input device 14a by the signal input unit 1001a, the signal division unit 1002a, and the feature amount extraction unit 1003a is basically the same as the processing performed on the signal from the signal input device 14b by the signal input unit 1001b, the signal division unit 1002b, and the feature amount extraction unit 1003b, so only the former is described below and the latter is omitted unless otherwise required. Further, unless a distinction is needed, the subscripts ("a" and "b") used to distinguish them are omitted.
  • In the present embodiment, the speaker diarization device 1 includes two signal input devices 14, and a set of a signal input unit 1001, a signal division unit 1002, and a feature amount extraction unit 1003 is provided for each signal input device 14.
  • The signal input unit 1001 acquires a signal (hereinafter referred to as an "input signal") from the signal input device 14.
  • The input signal has been converted from an analog value to a digital value by, for example, an AD conversion unit (not shown). When the signal input device 14 is a microphone, the input signal is simply the recorded audio signal.
  • The input signal may also be, for example, an audio signal on which dereverberation, speech enhancement, and sound source separation have been performed in advance.
  • The signal x_m acquired by the signal input unit 1001 from the signal input device 14 can be expressed, for example, as follows, where m is the index of the signal input device and t is the time.
  • The signals input from the two signal input devices 14a and 14b do not necessarily have the same start time t_{m,start} and end time t_{m,end}. That is, the start times t_{m,start} and end times t_{m,end} of the signal input devices 14a and 14b may differ.
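  • (The formula image is not reproduced in this text. A plausible reconstruction inferred from the surrounding definitions, which may differ from the published notation, is:)
    \[ x_m = \{\, x_m(t) \mid t_{m,\mathrm{start}} \le t \le t_{m,\mathrm{end}} \,\} \]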
  • The signal division unit 1002 divides the signal acquired from the signal input unit 1001 into a plurality of segments having a predetermined time width.
  • The portion of the signal acquired from the signal input device 14 that falls in segment s can be written as follows.
  • Here, the start time t_{s,start} and end time t_{s,end} of each segment s are defined as variables that do not depend on the signal input device 14.
  • The time width of segment s is expressed as follows.
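  • (The two formula images are not reproduced in this text. A plausible reconstruction under the same assumed notation is:)
    \[ x_{m,s} = \{\, x_m(t) \mid t_{s,\mathrm{start}} \le t \le t_{s,\mathrm{end}} \,\}, \qquad \Delta t_s = t_{s,\mathrm{end}} - t_{s,\mathrm{start}} \]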
  • The time width of segment s is set to, for example, about 1.5 seconds, but is not limited to this. For example, if a time width longer than 1.5 seconds is adopted, more signal can be used when the feature amount representing speaker characteristics is extracted by the feature amount extraction unit 1003 in a later stage, which improves the reliability of the feature amount. Conversely, if a time width shorter than 1.5 seconds is adopted, the time unit over which speaker diarization is performed becomes shorter, and the speaker diarization unit 1008 in a later stage can realize finer-grained speaker diarization.
  • Each segment s need not simply be cut with the same time width as described above; adjacent segments s may partially overlap each other. For example, if the overlap between adjacent segments s is set shorter than the time width of the segments themselves, fine-grained speaker diarization can be realized without impairing the reliability of the feature amount representing speaker characteristics.
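  • (Illustration only, not part of the publication: a minimal Python sketch of the segmentation step described above, assuming a single-channel signal sampled at fs Hz, a 1.5 s segment width, and a shorter hop so that adjacent segments overlap; all parameter values are arbitrary examples.)

    import numpy as np

    def split_into_segments(x: np.ndarray, fs: int,
                            seg_len_s: float = 1.5,
                            hop_s: float = 0.75) -> list[np.ndarray]:
        """Divide a 1-D signal into fixed-width, possibly overlapping segments."""
        seg_len = int(seg_len_s * fs)   # samples per segment (e.g. 1.5 s)
        hop = int(hop_s * fs)           # hop < seg_len means adjacent segments overlap
        segments = []
        for start in range(0, max(len(x) - seg_len + 1, 1), hop):
            segments.append(x[start:start + seg_len])
        return segments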
  • The feature amount extraction unit 1003 extracts a feature amount representing speaker characteristics from each segment s obtained by the signal division unit 1002.
  • Examples of feature amounts representing speaker characteristics extracted by the feature amount extraction unit 1003 include a vector whose elements are the fundamental frequency and formant frequencies, a GMM (Gaussian Mixture Model) supervector, an HMM (Hidden Markov Model) supervector, an i-vector, a d-vector, an x-vector, and combinations of these.
  • When the two signal input devices 14a and 14b are microphones distributed around a room, the same utterance can be recorded at significantly different sound pressures by the different microphones, unlike a microphone array such as a smart speaker in which the microphones are separated by only a few centimeters.
  • That is, when an utterance is recorded by the microphone closer to the speaker the sound pressure is high, and when it is recorded by a microphone farther from the speaker the sound pressure is low. Therefore, as a feature amount representing the relative position between the microphones and the speaker, a vector of the sound pressures recorded by the microphones, or a vector obtained by reducing its dimension by principal component analysis or the like, may be used.
  • A feature amount obtained by concatenating a feature amount representing speaker characteristics and a feature amount representing the relative position of the microphones and the speaker may also be used.
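  • (Illustration only: a sketch of the kind of concatenated feature amount described above. The function extract_speaker_embedding is a placeholder for any of the speaker-characteristic extractors listed, e.g. an x-vector model, and is not defined in the publication.)

    import numpy as np

    def relative_position_feature(segment_per_mic: list[np.ndarray]) -> np.ndarray:
        """Vector of per-microphone sound pressures (RMS) for one time segment."""
        return np.array([np.sqrt(np.mean(seg ** 2)) for seg in segment_per_mic])

    def concatenated_feature(segment: np.ndarray,
                             segment_per_mic: list[np.ndarray],
                             extract_speaker_embedding) -> np.ndarray:
        """Speaker-characteristic embedding concatenated with the relative-position feature."""
        spk = extract_speaker_embedding(segment)          # assumed embedding extractor
        pos = relative_position_feature(segment_per_mic)  # sound-pressure vector across mics
        return np.concatenate([spk, pos])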
  • The feature amounts v_{m,s} extracted in this way by the feature amount extraction unit 1003 can be expressed as follows.
  • Here, S_m is the set of segments s included in the recording interval of the signal input device 14.
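  • (The formula image is not reproduced in this text. A plausible reconstruction, with f denoting the assumed feature extractor:)
    \[ v_{m,s} = f(x_{m,s}), \qquad s \in S_m \]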
  • The clustering unit 1007 collectively clusters the feature amounts extracted by each of the feature amount extraction units 1003a and 1003b. That is, when the signal input devices 14a and 14b are, for example, microphones, and M denotes the set of microphones, the vectors represented by the following expression are clustered at once.
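  • (The expression is not reproduced in this text. A plausible reconstruction under the assumed notation: the set of vectors clustered at once is)
    \[ \{\, v_{m,s} \mid m \in M,\ s \in S_m \,\} \]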
  • The clustering method is not particularly limited; for example, K-means clustering, mean-shift clustering, agglomerative hierarchical clustering, and the like can be used.
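  • (Illustration only: a minimal sketch of the collective clustering step using K-means. scikit-learn is assumed here, the number of speakers is assumed to be known, and the publication does not mandate a particular library or algorithm.)

    import numpy as np
    from sklearn.cluster import KMeans

    def collective_clustering(features: dict[int, dict[int, np.ndarray]],
                              n_speakers: int) -> dict[tuple[int, int], int]:
        """Pool the features features[m][s] of all microphones and cluster them at once.

        features[m][s] is the feature vector of segment s from microphone m.
        Returns a mapping (m, s) -> cluster label.
        """
        keys, vectors = [], []
        for m, per_segment in features.items():
            for s, v in per_segment.items():
                keys.append((m, s))
                vectors.append(v)
        labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(np.vstack(vectors))
        return dict(zip(keys, labels.tolist()))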
  • FIG. 3 schematically shows the result of clustering by the speaker diarization device 1 and the result of speaker diarization when three microphones 1 to 3 are prepared as the signal input devices 14 and two speakers A and B speak.
  • As shown on the left side of the figure, the voice recorded by the microphones 1 to 3 is clustered into two clusters, cluster A (the area indicated by diagonal lines) and cluster B (the area indicated by dots).
  • Since the clustering unit 1007 collectively clusters the feature amounts extracted from the voices recorded through the three microphones 1 to 3, it suffices for a speaker's voice to be recorded at a sufficiently large sound pressure by any one of the microphones, so speaker diarization over a wider space becomes possible compared with the case of using a single microphone.
  • Also, by clustering the feature amounts collectively, different clusters can be assigned to segments s acquired through different microphones at the same time, so speaker diarization that takes overlapping utterances into account becomes possible.
  • The speaker diarization unit 1008 performs speaker diarization based on the result of the clustering by the clustering unit 1007.
  • Using the clustering result of the clustering unit 1007, the speaker diarization result D can be obtained from the following equations.
  • Here, Ω_c is the set of feature amounts belonging to cluster c, S is the number of segments, and C is the number of clusters (the number of speakers).
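  • (The equations are not reproduced in this text. A plausible reconstruction under the assumed notation, which may differ from the published formulas: the diarization result can be represented as an S × C matrix D whose entry indicates whether speaker (cluster) c is active in segment s,)
    \[ D_{s,c} = \begin{cases} 1 & \text{if } v_{m,s} \in \Omega_c \text{ for some } m \\ 0 & \text{otherwise} \end{cases} \qquad s = 1,\dots,S,\ \ c = 1,\dots,C \]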
  • In this way, speaker diarization is performed as shown on the right side of the figure.
  • FIG. 4 is a flowchart illustrating the process performed by the speaker diarization device 1 (hereinafter referred to as "speaker diarization process S2000").
  • The speaker diarization process S2000 is described below with reference to the figure.
  • First, the signal input unit 1001 inputs the input signal acquired from the signal input device 14 to the signal division unit 1002, and the signal division unit 1002 divides the input signal into a plurality of segments having a predetermined time width (S2001).
  • Next, the signal division unit 1002 inputs the divided segments to the feature amount extraction unit 1003, the feature amount extraction unit 1003 extracts a feature amount from each of the segments, and the extracted feature amounts are input to the clustering unit 1007 (S2002).
  • Next, the clustering unit 1007 collectively clusters the feature amounts input from each of the feature amount extraction units 1003 (1003a, 1003b) and inputs the result to the speaker diarization unit 1008 (S2003).
  • Next, the speaker diarization unit 1008 performs speaker diarization based on the input clustering result (S2004). This completes the speaker diarization process S2000.
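  • (Illustration only: putting steps S2001 to S2004 together, an end-to-end sketch that reuses the helper functions sketched above, split_into_segments and collective_clustering; extract_speaker_embedding is again an assumed placeholder.)

    import numpy as np

    def speaker_diarization(signals: dict[int, np.ndarray], fs: int, n_speakers: int,
                            extract_speaker_embedding) -> dict[int, set[int]]:
        """S2001-S2004: segment each mic signal, extract features, cluster jointly, diarize."""
        features = {}
        for m, x in signals.items():
            segments = split_into_segments(x, fs)                 # S2001
            features[m] = {s: extract_speaker_embedding(seg)      # S2002
                           for s, seg in enumerate(segments)}
        assignment = collective_clustering(features, n_speakers)  # S2003
        # S2004: segment s is attributed to speaker c if any mic's feature fell in cluster c
        diarization: dict[int, set[int]] = {}
        for (m, s), c in assignment.items():
            diarization.setdefault(s, set()).add(c)
        return diarization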
  • As described above, according to the speaker diarization device 1 of the present embodiment, speaker diarization can be performed with high accuracy even when a plurality of speakers speak at the same time.
  • The speaker diarization device 1 of the second embodiment differs from the speaker diarization device 1 of the first embodiment in that it has a function of detecting voice sections before the signal input unit 1001 inputs the acquired input signal to the signal division unit 1002.
  • The other configurations of the speaker diarization device 1 of the second embodiment are basically the same as in the first embodiment. The description below focuses on the differences from the first embodiment.
  • FIG. 5 is a diagram explaining the details of the speaker diarization execution unit 131 of the speaker diarization device 1 shown as the second embodiment.
  • The speaker diarization execution unit 131 of the second embodiment differs from that of the first embodiment in that a voice section detection unit 1005 is interposed between the signal input unit 1001 and the signal division unit 1002.
  • The voice section detection unit 1005 detects voice sections in the input signal input from the signal input unit 1001 and outputs the signals of the detected voice sections to the signal division unit 1002. For example, the voice section detection unit 1005 detects, as a voice section, a section in which the sound pressure of the input signal exceeds a predetermined threshold value. Alternatively, the voice section detection unit 1005 may detect voice sections by inputting the input signal to a machine learning model (a voice activity detector) trained using a method such as a DNN (Deep Neural Network).
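  • (Illustration only: a minimal sketch of the threshold-based voice section detection described above. The frame length and threshold are arbitrary example values, and a trained DNN-based detector could be substituted for the RMS test.)

    import numpy as np

    def detect_voice_sections(x: np.ndarray, fs: int,
                              frame_s: float = 0.02,
                              threshold: float = 0.01) -> list[tuple[int, int]]:
        """Return (start_sample, end_sample) spans whose frame RMS exceeds a threshold."""
        frame = int(frame_s * fs)
        active = [np.sqrt(np.mean(x[i:i + frame] ** 2)) > threshold
                  for i in range(0, len(x) - frame + 1, frame)]
        sections, start = [], None
        for i, is_voiced in enumerate(active):
            if is_voiced and start is None:
                start = i * frame
            elif not is_voiced and start is not None:
                sections.append((start, i * frame))
                start = None
        if start is not None:
            sections.append((start, len(x)))
        return sections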
  • The signal division unit 1002 divides the voice sections of the signal input from the voice section detection unit 1005 into a plurality of segments and inputs the obtained segments to the feature amount extraction unit 1003, and the feature amount extraction unit 1003 extracts feature amounts from the input segments.
  • FIG. 6 is a flowchart illustrating the process performed by the speaker diarization device 1 of the second embodiment (hereinafter referred to as "speaker diarization process S2100").
  • The speaker diarization process S2100 is described below with reference to the figure.
  • First, the signal input unit 1001 inputs the input signal acquired from the signal input device 14 to the voice section detection unit 1005, the voice section detection unit 1005 detects voice sections in the input signal, and the signals of the detected voice sections are input to the signal division unit 1002 (S2101).
  • Next, the signal division unit 1002 divides the voice sections of the signal input from the voice section detection unit 1005 into a plurality of segments and inputs the obtained segments to the feature amount extraction unit 1003 (S2102).
  • Next, the feature amount extraction unit 1003 extracts feature amounts from the input segments and inputs the extracted feature amounts to the clustering unit 1007 (S2103).
  • The processing of S2104 to S2105 is the same as the processing of S2003 to S2004 in FIG. 4, so its description is omitted.
  • In the above description, the voice section detection unit 1005 is interposed between the signal input unit 1001 and the signal division unit 1002, but the voice section detection unit 1005 can also be implemented in another form.
  • For example, the voice section detection unit 1005 may be interposed between the signal division unit 1002 and the feature amount extraction unit 1003, that is, after the signal division unit 1002.
  • In that case, the voice section detection unit 1005 detects voice sections from the plurality of segments produced by the signal division unit 1002.
  • The voice section detection unit 1005 then inputs the segments containing the detected voice sections to the feature amount extraction unit 1003.
  • The feature amount extraction unit 1003 extracts feature amounts from the segments containing voice sections acquired from the voice section detection unit 1005 and inputs them to the clustering unit 1007.
  • The clustering unit 1007 collectively clusters the feature amounts input from the feature amount extraction units 1003a and 1003b and inputs the result to the speaker diarization unit 1008.
  • The speaker diarization unit 1008 performs speaker diarization based on the input clustering result.
  • FIG. 8 is a flowchart illustrating the process performed by the speaker diarization execution unit 131 shown in FIG. 7 (hereinafter referred to as "speaker diarization process S2200").
  • The speaker diarization process S2200 is described below with reference to the figure.
  • The processing of S2201 is the same as S2001 of the speaker diarization process S2000 of the first embodiment shown in FIG. 4: the signal division unit 1002 divides the signal acquired by the signal input unit 1001 into a plurality of segments and inputs the divided segments to the voice section detection unit 1005.
  • Next, the voice section detection unit 1005 detects the segments containing voice sections from the plurality of segments input from the signal division unit 1002 and outputs the segments containing the detected voice sections to the feature amount extraction unit 1003 (S2202).
  • Next, the feature amount extraction unit 1003 extracts feature amounts from the segments containing voice sections input from the voice section detection unit 1005 and inputs the extracted feature amounts to the clustering unit 1007 (S2203).
  • As described above, the speaker diarization device 1 of the second embodiment detects voice sections from the signal acquired by the signal input unit 1001 and extracts feature amounts only for the detected voice sections. Therefore, non-voice sections are excluded from feature extraction, and clustering can be performed efficiently in a short time. In addition, since non-speech sections such as silent sections and noise sections are excluded from feature extraction, the accuracy of speaker diarization can be improved.
  • The speaker diarization device 1 of the first and second embodiments collectively clusters all the feature amounts extracted by the feature amount extraction units 1003 and performs speaker diarization based on the clustering result.
  • In contrast, the speaker diarization device 1 of the third embodiment selects, from the feature amounts extracted by the feature amount extraction units 1003, the feature amounts to be used for clustering, and performs clustering using the selected feature amounts.
  • The speaker diarization device 1 of the third embodiment is described below, focusing on the differences from the speaker diarization device 1 of the first embodiment.
  • The speaker diarization device 1 of the third embodiment may also include the configuration of the speaker diarization device 1 of the second embodiment.
  • FIG. 9 is a diagram explaining the details of the speaker diarization execution unit 131 of the third embodiment.
  • The speaker diarization execution unit 131 of the third embodiment differs from the configuration of the first embodiment in that a feature amount selection unit 1006 is interposed between the feature amount extraction units 1003 and the clustering unit 1007, that is, immediately before the clustering unit 1007.
  • The feature amount selection unit 1006 selects the feature amounts to be used for clustering from the feature amounts extracted by the feature amount extraction units 1003a and 1003b.
  • The clustering unit 1007 performs clustering using the feature amounts selected by the feature amount selection unit 1006.
  • The feature amount selection unit 1006 selects the feature amounts, for example, as follows.
  • FIG. 10 is a diagram illustrating the method by which the feature amount selection unit 1006 selects feature amounts, showing an example in which two speakers A and B and three microphones (1) to (3) are arranged.
  • In this example, the microphones (1) to (3) are arranged in the space between speaker A and speaker B: microphone (1) is located closest to speaker A, microphone (3) is located closest to speaker B, and microphone (2) is placed in the space between microphone (1) and microphone (3).
  • FIG. 11 is a schematic diagram illustrating the distribution of the feature amounts in the feature amount space. If the feature amounts representing the speakers extracted from the voices recorded by the microphones (1) to (3) are denoted feature amounts (1) to (3), the feature amounts (1) to (3) are expected to lie roughly along a line in the feature space according to the mixing ratio of the voices of speaker A and speaker B. These feature amounts fill this line more densely as the number of microphones increases, and clustering using all of them may adversely affect the clustering. That is, for example, the center of speaker A's cluster lies near feature amount (1) and the center of speaker B's cluster lies near feature amount (3), but if feature amount (2) is also used for clustering, the cluster centers shift toward feature amount (2). In the present embodiment, this problem is addressed by appropriately selecting the feature amounts used for clustering.
  • For this purpose, the set V_s of feature amounts in segment s is expressed by the following equation.
  • From this set, feature amounts equal in number to the number of speakers (denoted C) are selected.
  • The feature amounts are selected, for example, according to the following equation.
  • Here, dist(v_i, v_j) is a function representing the distance between feature amounts; for example, the Euclidean distance can be used.
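  • (The two expressions are not reproduced in this text. A plausible reconstruction under the assumed notation, which may differ from the published formulas: with V_s = {v_{m,s} | m ∈ M} the candidate feature amounts of segment s, the selected subset maximizes the pairwise spread,)
    \[ \hat{V}_s = \operatorname*{arg\,max}_{V \subseteq V_s,\ |V| = C} \; \sum_{v_i, v_j \in V} \mathrm{dist}(v_i, v_j) \]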
  • In the example of FIG. 10, the speaker diarization execution unit 131 selects and clusters the two most distant feature amounts among the feature amounts extracted by the feature amount extraction units 1003. That is, the speaker diarization execution unit 131 selects the pair of feature amount (1) and feature amount (3), whose difference in the feature amount space is the largest, and performs clustering based on the selected set of feature amounts. As a result, clustering is performed using only the feature amounts extracted from segments in which the voice of speaker A or speaker B is predominantly recorded.
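  • (Illustration only: a brute-force sketch of the per-segment selection rule described above. It enumerates all C-element subsets, which is practical only for a small number of microphones, and assumes Euclidean distance.)

    from itertools import combinations
    import numpy as np

    def select_features(candidates: list[np.ndarray], n_speakers: int) -> list[np.ndarray]:
        """Pick the n_speakers features whose summed pairwise Euclidean distance is maximal."""
        if len(candidates) <= n_speakers:
            return candidates
        best, best_score = None, -1.0
        for subset in combinations(range(len(candidates)), n_speakers):
            score = sum(np.linalg.norm(candidates[i] - candidates[j])
                        for i, j in combinations(subset, 2))
            if score > best_score:
                best, best_score = subset, score
        return [candidates[i] for i in best]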
  • FIG. 12 is a schematic diagram showing an example of the clustering result obtained by the speaker diarization device of the third embodiment.
  • In the figure, the areas indicated by diagonal lines are clusters into which the feature amounts of speaker A are grouped, and the areas indicated by dots are clusters into which the feature amounts of speaker B are grouped. The areas shown in black are areas not used for clustering.
  • FIG. 13 is a flowchart illustrating the speaker diarization process S2300 performed by the speaker diarization device of the third embodiment.
  • The speaker diarization process S2300 is described below with reference to the figure.
  • The feature amount extraction units 1003 extract a feature amount from each of the plurality of segments and input the extracted feature amounts to the feature amount selection unit 1006.
  • Next, the feature amount selection unit 1006 selects, from the input feature amounts, the set of feature amounts whose differences in the feature amount space are the largest, and inputs the selected set of feature amounts to the clustering unit 1007.
  • Next, the clustering unit 1007 clusters the input set of feature amounts and inputs the clustering result to the speaker diarization unit 1008 (S2304).
  • Next, the speaker diarization unit 1008 performs speaker diarization based on the input clustering result (S2305). This completes the speaker diarization process S2300.
  • As described above, the speaker diarization device 1 of the third embodiment does not use all the feature amounts extracted by the feature amount extraction units 1003 for clustering, but performs clustering using the feature amounts selected by the feature amount selection unit 1006, so highly reliable speaker diarization can be realized.
  • In the above description, the number of feature amounts selected from the feature amounts extracted by the feature amount extraction units 1003 is set to the number of speakers C.
  • Alternatively, the number of speakers may be estimated for each segment, and the estimated number of speakers may be used instead.
  • For this estimation, a so-called bottom-up clustering method, in which feature amounts close to each other are merged sequentially, can be used, for example.
  • In this way, the number of feature amounts selected can be made smaller than the number of speakers C present in the entire signal acquired by the signal input units 1001, and feature amounts in which the voices of two speakers are mixed at nearly the same sound pressure can be prevented from being used for clustering.
  • The feature amount selection by the feature amount selection unit 1006 may also be performed by a method based on sound pressure. For example, when the sound pressure of the signal acquired by the signal input unit 1001 is small, the signal-to-noise ratio is small, so a feature amount representing the speaker extracted from it is expected to have low reliability. Therefore, the feature amount selection unit 1006 may select as many feature amounts as the number of speakers in descending order of sound pressure. As a result, clustering can be performed using only highly reliable feature amounts.
  • The feature amount selection unit 1006 may also use the above two feature amount selection methods in combination.
  • The present invention is not limited to the embodiments described above and includes various modifications. The embodiments described above have been described in detail in order to explain the present invention in an easy-to-understand manner, and the invention is not necessarily limited to configurations including all of the described elements. In addition, part of the configuration of each embodiment may be added to, deleted from, or replaced with another configuration.
  • The functions of the speaker diarization device 1 described above can be used, for example, in the processing portion that performs voice section detection and speaker diarization in a speech recognition system using distributed monaural microphones. The functions of the speaker diarization device 1 can also be applied, for example, to the processing portion of such a speech recognition system that determines who spoke, after the speech recognition result is obtained.
  • 1 Speaker diarization device, 14, 15 Signal input device, 131 Speaker diarization execution unit, 1001 Signal input unit, 1002 Signal division unit, 1003 Feature amount extraction unit, 1005 Voice section detection unit, 1006 Feature amount selection unit, 1007 Clustering unit, 1008 Speaker diarization unit

Abstract

The objective of the present invention is to carry out speaker diarization accurately even when a plurality of speakers are speaking simultaneously. This speaker diarization device divides each of a plurality of signals obtained respectively from a plurality of audio signal input units into a plurality of segments of a prescribed time width, extracts a feature amount from each of the segments, collectively clusters the feature amounts extracted from each of the segments of the plurality of signals, and carries out speaker diarization on the basis of the clustering result. The speaker diarization device detects a voice section, which is a section containing an audio signal, from each of the plurality of signals, divides the voice sections of each of the plurality of signals into segments, and extracts a feature amount from each of the segments obtained by the division.

Description

Speaker diarization device and speaker diarization method
 The present invention relates to a speaker diarization device and a speaker diarization method.
 This application claims priority based on Japanese Patent Application No. 2020-079958 filed on April 30, 2020, the entire disclosure of which is incorporated herein by reference.
 Patent Document 1 describes a signal analyzer configured for the purpose of performing optimum diarization and the like. The signal analyzer models a sound source position occurrence probability matrix Q, consisting of the probability that a signal arrives from each sound source position candidate in each frame (a time interval), for a plurality of sound source position candidates, as the product of a sound source position probability matrix B, consisting of the probability that a signal from each of a plurality of sound sources arrives from each sound source position candidate, and a sound source existence probability matrix A, consisting of the existence probability of a signal from each sound source in each frame. Based on this modeling, at least one of the sound source position probability matrix B and the sound source existence probability matrix A is estimated.
 Non-Patent Document 1 describes a method for performing speaker diarization. In this method, the voice sections of voice recorded by a monaural microphone are divided into short segments, a feature amount including speaker characteristics is extracted from each segment, the feature amounts are clustered, and speaker diarization is performed from the clustering result.
 Patent Document 1: Japanese Unexamined Patent Application Publication No. 2019-184747
 In Patent Document 1, the direction of a sound source is estimated from sound recorded using microphones (hereinafter referred to as "mics") arranged at predetermined positions, and speaker diarization is performed on the assumption that sounds arriving from different directions belong to different speakers. However, Patent Document 1 exploits the fact that the microphone arrangement is known, and uses probability distributions of feature vectors over frequency bins prepared in advance for each sound source position candidate from measured data. Therefore, if the microphone arrangement is unknown and no such training data (probability distributions) exists, speaker diarization cannot be performed.
 In Non-Patent Document 1, since a single monaural microphone is used, each segment obtained by dividing the voice section is assigned to exactly one speaker. Therefore, when a plurality of speakers speak at the same time, it is not possible to determine which speaker the segment should be assigned to. Furthermore, since the voices of all speakers are recorded by one monaural microphone, all speakers must speak near that microphone.
 The present invention has been made in view of this background, and an object of the present invention is to provide a speaker diarization device and a speaker diarization method capable of performing speaker diarization accurately even when a plurality of speakers speak at the same time.
 One aspect of the present invention for achieving the above object is a speaker diarization device configured using an information processing device, comprising: a signal division unit that divides each of a plurality of signals obtained from a plurality of audio signal input units into a plurality of segments having a predetermined time width; a feature amount extraction unit that extracts a feature amount from each of the segments; a clustering unit that collectively clusters the feature amounts extracted from the segments of the plurality of signals; and a speaker diarization unit that performs speaker diarization based on the result of the clustering.
 Other problems disclosed by the present application and their solutions will be clarified by the section on modes for carrying out the invention and by the drawings.
 According to the present invention, speaker diarization can be performed with high accuracy even when a plurality of speakers speak at the same time.
 FIG. 1 is a hardware configuration diagram of the speaker diarization device of the first embodiment. FIG. 2 is a diagram explaining the details of the speaker diarization execution unit. FIG. 3 is a schematic diagram explaining the result of clustering and the result of speaker diarization. FIG. 4 is a flowchart explaining the speaker diarization process. FIG. 5 is a diagram explaining the details of the speaker diarization execution unit of the second embodiment. FIG. 6 is a flowchart explaining the speaker diarization process. FIG. 7 is a diagram showing a modification of the speaker diarization execution unit. FIG. 8 is a flowchart explaining the speaker diarization process. FIG. 9 is a diagram explaining the details of the speaker diarization execution unit of the third embodiment. FIG. 10 is a diagram showing an example arrangement of speakers and microphones. FIG. 11 is a schematic diagram explaining the distribution of feature amounts in the feature amount space. FIG. 12 is a schematic diagram explaining the result of clustering. FIG. 13 is a flowchart explaining the speaker diarization process.
 Hereinafter, embodiments will be described in detail with reference to the drawings. However, the present invention is not to be construed as being limited to the description of the embodiments shown below. Those skilled in the art will readily understand that the specific configuration can be changed without departing from the spirit or gist of the present invention.
 In the configurations of the invention described below, the same reference numerals are used in common between different drawings for identical parts or parts having similar functions, and duplicate explanations may be omitted. When there are a plurality of elements having the same or similar functions, they may be described with different subscripts attached to the same reference numeral; however, when there is no need to distinguish between the elements, the subscripts may be omitted. Notations such as "first", "second", and "third" in this specification are attached to identify components and do not necessarily limit their number, order, or content. Numbers for identifying components are used per context, and a number used in one context does not necessarily indicate the same configuration in another context. This also does not preclude a component identified by one number from having the function of a component identified by another number. In the following description, the letter "S" prefixed to a reference numeral denotes a processing step.
[First Embodiment]
 FIG. 1 shows the hardware configuration of a device that performs speaker diarization (hereinafter referred to as "speaker diarization device 1"), described as the first embodiment. The speaker diarization device 1 is an information processing device (computer) and includes a processor 11, a ROM 12 (Read Only Memory), a RAM 13 (Random Access Memory), and two signal input devices 14a and 14b. These are communicably connected to each other through a bus 10 or the like. The illustrated speaker diarization device 1 includes two signal input devices 14a and 14b, but the speaker diarization device 1 may include three or more signal input devices. The signal input devices 14a and 14b may be voice input devices such as microphones (hereinafter referred to as "mics"), or may be devices that output a voice signal after dereverberation, sound source separation, or the like has been performed. The RAM 13 stores a program for realizing the functions of the speaker diarization device 1 (hereinafter referred to as the "speaker diarization execution unit 131").
 The speaker diarization device 1 may be configured using a plurality of information processing devices communicably connected to each other. All or part of the speaker diarization device 1 may be realized using virtual information processing resources provided by virtualization technology, process space separation technology, or the like, such as a virtual server provided by a cloud system. All or part of the functions provided by the speaker diarization device 1 may be realized by, for example, services provided by a cloud system via an API (Application Programming Interface) or the like. The functions of the speaker diarization execution unit 131 and the like included in the speaker diarization device 1 may be realized by hardware such as a DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), ASIC (Application Specific Integrated Circuit), or AI (Artificial Intelligence) chip.
 FIG. 2 is a diagram explaining the details of the speaker diarization execution unit 131. As shown in the figure, the speaker diarization execution unit 131 includes signal input units 1001a and 1001b, signal division units 1002a and 1002b, feature amount extraction units 1003a and 1003b, a clustering unit 1007, and a speaker diarization unit 1008.
 A signal is input to the signal input unit 1001a from the signal input device 14a, and a signal is input to the signal input unit 1001b from the signal input device 14b. The processing performed on the signal from the signal input device 14a by the signal input unit 1001a, the signal division unit 1002a, and the feature amount extraction unit 1003a is basically the same as the processing performed on the signal from the signal input device 14b by the signal input unit 1001b, the signal division unit 1002b, and the feature amount extraction unit 1003b, so only the former is described below and the latter is omitted unless otherwise required. Further, unless a distinction is needed, the subscripts ("a" and "b") used to distinguish them are omitted. In the present embodiment, the case where the speaker diarization device 1 includes two signal input devices 14 is described, but a set of a signal input unit 1001, a signal division unit 1002, and a feature amount extraction unit 1003 is provided for each signal input device 14.
 The signal input unit 1001 acquires a signal (hereinafter referred to as an "input signal") from the signal input device 14. The input signal has been converted from an analog value to a digital value by, for example, an AD conversion unit (not shown). When the signal input device 14 is a microphone, the input signal is simply the recorded audio signal. The input signal may also be, for example, an audio signal on which dereverberation, speech enhancement, and sound source separation have been performed in advance. The signal x_m acquired by the signal input unit 1001 from the signal input device 14 can be expressed, for example, as follows.
[Math. 1]
 Here, m is the index of the signal input device and t is the time. The signals input from the two signal input devices 14a and 14b do not necessarily have the same start time t_{m,start} and end time t_{m,end}. That is, the start times t_{m,start} and end times t_{m,end} of the signal input devices 14a and 14b may differ.
 The signal division unit 1002 divides the signal acquired from the signal input unit 1001 into a plurality of segments having a predetermined time width. The portion of the signal acquired from the signal input device 14 that falls in segment s can be written as follows.
[Math. 2]
 Here, the start time t_{s,start} and end time t_{s,end} of each segment s are defined as variables that do not depend on the signal input device 14. The time width of segment s is expressed as follows.
[Math. 3]
 The time width of segment s is set to, for example, about 1.5 seconds, but is not limited to this. For example, if a time width longer than 1.5 seconds is adopted, more signal can be used when the feature amount representing speaker characteristics is extracted by the feature amount extraction unit 1003 in a later stage, which improves the reliability of the feature amount. Conversely, if a time width shorter than 1.5 seconds is adopted, the time unit over which speaker diarization is performed becomes shorter, and the speaker diarization unit 1008 in a later stage can realize finer-grained speaker diarization.
 Each segment s need not simply be cut with the same time width as described above; adjacent segments s may partially overlap each other. For example, if the overlap between adjacent segments s is set shorter than the time width of the segments themselves, fine-grained speaker diarization can be realized without impairing the reliability of the feature amount representing speaker characteristics.
 The feature amount extraction unit 1003 extracts a feature amount representing speaker characteristics from each segment s obtained by the signal division unit 1002. Examples of feature amounts representing speaker characteristics extracted by the feature amount extraction unit 1003 include a vector whose elements are the fundamental frequency and formant frequencies, a GMM (Gaussian Mixture Model) supervector, an HMM (Hidden Markov Model) supervector, an i-vector, a d-vector, an x-vector, and combinations of these.
 When the two signal input devices 14a and 14b are microphones distributed around a room, the same utterance can be recorded at significantly different sound pressures by the different microphones, unlike a microphone array such as a smart speaker in which the microphones are separated by only a few centimeters. That is, when an utterance is recorded by the microphone closer to the speaker the sound pressure is high, and when it is recorded by a microphone farther from the speaker the sound pressure is low. Therefore, as a feature amount representing the relative position between the microphones and the speaker, a vector of the sound pressures recorded by the microphones, or a vector obtained by reducing its dimension by principal component analysis or the like, may be used. A feature amount obtained by concatenating a feature amount representing speaker characteristics and a feature amount representing the relative position of the microphones and the speaker may also be used. The feature amounts v_{m,s} extracted in this way by the feature amount extraction unit 1003 can be expressed as follows.
[Math. 4]
 Here, S_m is the set of segments s included in the recording interval of the signal input device 14.
 The clustering unit 1007 collectively clusters the feature amounts extracted by each of the feature amount extraction units 1003a and 1003b. That is, when the signal input devices 14a and 14b are, for example, microphones, and M denotes the set of microphones, the vectors represented by the following expression are clustered at once.
[Math. 5]
 The clustering method is not particularly limited; for example, K-means clustering, mean-shift clustering, agglomerative hierarchical clustering, and the like can be used.
 FIG. 3 schematically shows the result of clustering by the speaker diarization device 1 and the result of speaker diarization when three microphones 1 to 3 are prepared as the signal input devices 14 and two speakers A and B speak.
 As shown on the left side of the figure, the voice recorded by the microphones 1 to 3 is clustered into two clusters, cluster A (the area indicated by diagonal lines) and cluster B (the area indicated by dots). Since the clustering unit 1007 collectively clusters the feature amounts extracted from the voices recorded through the three microphones 1 to 3, it suffices for a speaker's voice to be recorded at a sufficiently large sound pressure by any one of the microphones, so speaker diarization over a wider space becomes possible compared with the case of using a single microphone. Also, by clustering the feature amounts collectively, different clusters can be assigned to segments s acquired through different microphones at the same time, so speaker diarization that takes overlapping utterances into account becomes possible.
The speaker diarization unit 1008 performs speaker diarization based on the result of the clustering by the clustering unit 1007. Using the clustering result, the speaker diarization result D can be obtained from the following equations.
[Math. 6]
[Math. 7]
Here, Ω_c is the set of feature amounts belonging to cluster c, S is the number of segments, and C is the number of clusters (the number of speakers). In this way, speaker diarization is performed as shown on the right side of the figure.
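A minimal sketch of deriving a diarization result from the pooled clustering output is given below. The exact form of D defined by equations (6) and (7) is not reproduced here; the sketch assumes the common convention of marking, for each segment s and cluster c, whether speaker c is active, which is stated as an assumption.

```python
import numpy as np

def diarization_from_clusters(assignments, n_segments, n_speakers):
    """Sketch: derive a diarization matrix D from the pooled clustering result.

    assignments: dict (m, s) -> cluster index c, as returned by the clustering step
    Returns:     D of shape (n_speakers, n_segments), where D[c, s] = 1 means
                 speaker (cluster) c is judged active in segment s.
    """
    D = np.zeros((n_speakers, n_segments), dtype=int)
    for (m, s), c in assignments.items():
        # Any microphone assigning segment s to cluster c marks speaker c as active,
        # which also allows overlapping speakers within the same segment.
        D[c, s] = 1
    return D
```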
FIG. 4 is a flowchart illustrating the process performed by the speaker diarization device 1 (hereinafter referred to as the "speaker diarization process S2000"). The speaker diarization process S2000 is described below with reference to the figure.
First, the signal input unit 1001 inputs the input signal acquired from the signal input device 14 to the signal division unit 1002, and the signal division unit 1002 divides the input signal into a plurality of segments of a predetermined time width (S2001).
Next, the signal division unit 1002 inputs the divided segments to the feature amount extraction unit 1003, the feature amount extraction unit 1003 extracts a feature amount from each of the segments, and the extracted feature amounts are input to the clustering unit 1007 (S2002).
Next, the clustering unit 1007 collectively clusters the feature amounts input from each of the feature amount extraction units 1003 (1003a, 1003b) and inputs the result to the speaker diarization unit 1008 (S2003).
Next, the speaker diarization unit 1008 performs speaker diarization based on the input clustering result (S2004). This completes the speaker diarization process S2000.
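The overall flow S2001 to S2004 could be wired together as in the following sketch, which assumes the helper functions from the earlier sketches (`extract_feature`, `cluster_pooled_features`, `diarization_from_clusters`) and a simple fixed-length segmentation helper defined here; all of these names and the data layout are assumptions for illustration.

```python
def split_into_segments(signal, segment_length):
    """Assumed helper: chop a waveform into consecutive fixed-length segments."""
    n = len(signal) // segment_length
    return [signal[i * segment_length:(i + 1) * segment_length] for i in range(n)]

def speaker_diarization_s2000(input_signals, segment_length, n_speakers, speaker_embedder):
    """Sketch of the S2000 flow (S2001-S2004).

    input_signals: dict m -> waveform recorded by signal input device m
    """
    # S2001: split each input signal into fixed-length segments
    segments = {m: split_into_segments(sig, segment_length)
                for m, sig in input_signals.items()}
    n_segments = max(len(segs) for segs in segments.values())

    # S2002: extract a feature v_{m,s} for every (microphone, segment) pair
    features = {}
    for s in range(n_segments):
        per_mic = {m: segs[s] for m, segs in segments.items() if s < len(segs)}
        features.update({(m, s): v
                         for m, v in extract_feature(per_mic, speaker_embedder).items()})

    # S2003: cluster all features at once
    assignments = cluster_pooled_features(features, n_speakers)

    # S2004: convert the clustering result into a diarization result
    return diarization_from_clusters(assignments, n_segments, n_speakers)
```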
As described above, according to the speaker diarization device 1 of the present embodiment, speaker diarization can be performed with high accuracy even when a plurality of speakers speak at the same time.
[Second Embodiment]
 The speaker diarization device 1 of the second embodiment differs from that of the first embodiment in that it has a function of detecting voice sections before the signal input unit 1001 inputs the acquired input signal to the signal division unit 1002. The other configurations of the speaker diarization device 1 of the second embodiment are basically the same as those of the first embodiment. The following description focuses on the differences from the first embodiment.
FIG. 5 is a diagram explaining the details of the speaker diarization execution unit 131 of the speaker diarization device 1 according to the second embodiment. As shown in the figure, the speaker diarization execution unit 131 of the second embodiment differs from that of the first embodiment in that a voice section detection unit 1005 is interposed between the signal input unit 1001 and the signal division unit 1002.
The voice section detection unit 1005 detects voice sections in the input signal input from the signal input unit 1001 and outputs the signals of the detected voice sections to the signal division unit 1002. For example, the voice section detection unit 1005 detects, as a voice section, a section of the input signal in which the sound pressure exceeds a predetermined threshold value. Alternatively, the voice section detection unit 1005 detects voice sections by inputting the input signal to a machine learning model (voice section detector) trained using a method such as a DNN (Deep Neural Network).
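A minimal sketch of the threshold-based variant is given below; the frame length and the threshold value are illustrative assumptions, not values taken from the disclosure.

```python
import numpy as np

def detect_voice_sections(signal, frame_length=400, threshold=0.01):
    """Sketch of threshold-based voice section detection: frames whose RMS
    sound pressure exceeds `threshold` are treated as voiced.

    Returns a list of (start_sample, end_sample) voice sections.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_length
    voiced = [np.sqrt(np.mean(signal[i * frame_length:(i + 1) * frame_length] ** 2)) > threshold
              for i in range(n_frames)]
    sections, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame_length                     # voiced run begins
        elif not v and start is not None:
            sections.append((start, i * frame_length))   # voiced run ends
            start = None
    if start is not None:
        sections.append((start, n_frames * frame_length))
    return sections
```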
The signal division unit 1002 divides the voice sections in the signal input from the voice section detection unit 1005 into a plurality of segments and inputs the obtained segments to the feature amount extraction unit 1003, and the feature amount extraction unit 1003 extracts a feature amount from each input segment.
FIG. 6 is a flowchart illustrating the process performed by the speaker diarization device 1 of the second embodiment (hereinafter referred to as the "speaker diarization process S2100"). The speaker diarization process S2100 is described below with reference to the figure.
First, the signal input unit 1001 inputs the input signal acquired from the signal input device 14 to the voice section detection unit 1005, the voice section detection unit 1005 detects voice sections in the input signal, and the signals of the detected voice sections are input to the signal division unit 1002 (S2101).
Next, the signal division unit 1002 divides the voice sections in the signal input from the voice section detection unit 1005 into a plurality of segments and inputs the obtained segments to the feature amount extraction unit 1003 (S2102).
Next, the feature amount extraction unit 1003 extracts a feature amount from each input segment and inputs the extracted feature amounts to the clustering unit 1007 (S2103).
The subsequent processes S2104 to S2105 are the same as the processes S2003 to S2004 in FIG. 4, and their description is therefore omitted.
In the above, as shown in FIG. 5, the voice section detection unit 1005 is interposed between the signal input unit 1001 and the signal division unit 1002, but the voice section detection unit 1005 can also be implemented in other manners.
For example, as shown in FIG. 7, the voice section detection unit 1005 may be interposed between the signal division unit 1002 and the feature amount extraction unit 1003, that is, after the signal division unit 1002. In this case, the voice section detection unit 1005 detects voice sections among the plurality of segments produced by the signal division unit 1002 and inputs the signals of the detected voice sections to the feature amount extraction unit 1003. The feature amount extraction unit 1003 extracts feature amounts from the segments containing voice sections acquired from the voice section detection unit 1005 and inputs them to the clustering unit 1007. The clustering unit 1007 collectively clusters the feature amounts input from the feature amount extraction units 1003a and 1003b and inputs the result to the speaker diarization unit 1008. The speaker diarization unit 1008 performs speaker diarization based on the input clustering result.
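A short sketch of this segment-level variant, reusing the RMS sound-pressure criterion of the earlier sketch with an assumed threshold, might look as follows.

```python
import numpy as np

def keep_voiced_segments(segments, threshold=0.01):
    """Sketch of the FIG. 7 variant: the signal is segmented first, and only
    segments whose RMS sound pressure exceeds the (assumed) threshold are
    passed on to feature extraction."""
    return [seg for seg in segments
            if np.sqrt(np.mean(np.asarray(seg, dtype=float) ** 2)) > threshold]
```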
FIG. 8 is a flowchart illustrating the process performed by the speaker diarization execution unit 131 shown in FIG. 7 (hereinafter referred to as the "speaker diarization process S2200"). The speaker diarization process S2200 is described below with reference to the figure.
First, the process of S2201 is the same as S2001 of the speaker diarization process S2000 of the first embodiment shown in FIG. 4: the signal division unit 1002 divides the signal acquired by the signal input unit 1001 into a plurality of segments and inputs the divided segments to the voice section detection unit 1005.
Next, the voice section detection unit 1005 detects, from among the plurality of segments input from the signal division unit 1002, the segments containing voice sections and outputs the detected segments to the feature amount extraction unit 1003 (S2202).
Next, the feature amount extraction unit 1003 extracts feature amounts from the segments containing voice sections input from the voice section detection unit 1005 and inputs the extracted feature amounts to the clustering unit 1007 (S2203).
The subsequent processes S2204 to S2205 are the same as S2104 to S2105 in FIG. 6.
As described above, the speaker diarization device 1 of the second embodiment detects voice sections in the signal acquired by the signal input unit 1001 and extracts feature amounts only from the detected voice sections. Non-voice sections are therefore excluded from feature extraction, and the clustering can be performed efficiently in a short time. In addition, since non-voice sections such as silent sections and noise sections are excluded from feature extraction, the accuracy of speaker diarization can be improved.
[Third Embodiment]
 The speaker diarization devices 1 of the first and second embodiments both collectively cluster all of the feature amounts extracted by the feature amount extraction unit 1003 and perform speaker diarization based on the clustering result. In contrast, the speaker diarization device 1 of the third embodiment selects, from among the feature amounts extracted by the feature amount extraction unit 1003, the feature amounts to be used for clustering, and performs clustering using the selected feature amounts. The speaker diarization device 1 of the third embodiment is described below, focusing on the differences from the first embodiment. The speaker diarization device 1 of the third embodiment may also include the configuration of the second embodiment.
FIG. 9 is a diagram explaining the details of the speaker diarization execution unit 131 of the third embodiment. As shown in the figure, the speaker diarization execution unit 131 of the third embodiment differs in configuration from the first embodiment in that a feature amount selection unit 1006 is interposed between the feature amount extraction unit 1003 and the clustering unit 1007, that is, immediately before the clustering unit 1007.
The feature amount selection unit 1006 selects, from among the feature amounts extracted by the feature amount extraction units 1003a and 1003b, the feature amounts to be used for clustering. The clustering unit 1007 performs clustering using the feature amounts selected by the feature amount selection unit 1006. The feature amount selection unit 1006 selects the feature amounts, for example, as follows.
FIG. 10 is a diagram explaining how the feature amount selection unit 1006 selects feature amounts, using an example in which two speakers A and B and three microphones (1) to (3) are arranged. The microphones (1) to (3) are placed in the space between speaker A and speaker B: microphone (1) is placed closest to speaker A, microphone (3) is placed closest to speaker B, and microphone (2) is placed in the space between microphone (1) and microphone (3).
Consider the case where speaker A and speaker B speak at the same time in the arrangement of FIG. 10. In this case, microphone (1) is expected to record the voice of speaker A at a higher sound pressure than the voice of speaker B, and microphone (3) is expected to record the voice of speaker B at a higher sound pressure than the voice of speaker A.
FIG. 11 is a schematic diagram explaining the distribution of feature amounts in the feature amount space. If the feature amounts representing the speakers extracted from the voices recorded by the microphones (1) to (3) are denoted feature amounts (1) to (3), the feature amounts (1) to (3) are expected to line up substantially in a row in the feature amount space according to the mixing ratio of the voices of speaker A and speaker B. These feature amounts line up more densely along this row as the number of microphones increases, and performing clustering with all of these feature amounts may adversely affect the clustering. That is, for example, the center of the cluster of speaker A lies near feature amount (1) and the center of the cluster of speaker B lies near feature amount (3), but if feature amount (2) is also used for clustering, the cluster centers are pulled toward feature amount (2). In the present embodiment, this problem is addressed by appropriately selecting the feature amounts used for clustering.
First, for the signal acquired by the signal input unit 1001, the set V_s of feature amounts in segment s is expressed by the following equation.
[Math. 8]
When the number of elements of this set is larger than the number of speakers contained in the signal acquired by the signal input unit 1001, feature amounts equal in number to the number of speakers (denoted C) are selected from the set. The feature amounts are selected, for example, according to the following equation.
[Math. 9]
Here, dist(v_i, v_j) is a function representing the distance between feature amounts; for example, the Euclidean distance can be used.
In the example of FIG. 11, the speaker diarization execution unit 131 selects the two most distant feature amounts among the feature amounts extracted by the feature amount extraction unit 1003 and performs clustering with them. That is, the speaker diarization execution unit 131 selects the pair of feature amount (1) and feature amount (3), whose difference in the feature amount space is the largest, and performs clustering based on the selected pair of feature amounts. As a result, clustering is performed using only the feature amounts extracted from the segments in which the voice of each of the speakers A and B is dominantly recorded.
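A hedged sketch of this per-segment selection is given below: it keeps, from the feature amounts V_s of one segment, the C feature amounts whose pairwise Euclidean distances sum to the maximum. The exhaustive search over combinations is an illustrative choice for small microphone counts and is not necessarily the procedure of equation (9).

```python
import itertools
import numpy as np

def select_features_for_segment(segment_features, n_speakers):
    """Sketch: from the feature amounts of one segment (one vector per microphone),
    keep the n_speakers features whose pairwise Euclidean distances sum to the maximum.

    segment_features: dict m -> feature vector v_{m,s} for this segment
    """
    mics = list(segment_features.keys())
    if len(mics) <= n_speakers:
        return segment_features                          # nothing to discard
    best_subset, best_score = None, -np.inf
    for subset in itertools.combinations(mics, n_speakers):
        score = sum(np.linalg.norm(segment_features[a] - segment_features[b])
                    for a, b in itertools.combinations(subset, 2))
        if score > best_score:
            best_subset, best_score = subset, score
    return {m: segment_features[m] for m in best_subset}
```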
FIG. 12 is a schematic diagram showing an example of the clustering result obtained by the speaker diarization device of the third embodiment. In the figure, the hatched areas form the cluster grouping the feature amounts of speaker A, and the dotted areas form the cluster grouping the feature amounts of speaker B. The areas shown in black are not used for clustering.
FIG. 13 is a flowchart illustrating the speaker diarization process S2300 performed by the speaker diarization device of the third embodiment. The speaker diarization process S2300 is described below with reference to the figure.
First, the processes S2301 to S2302 in the figure are the same as the processes S2001 to S2002 of the speaker diarization process S2000 shown in FIG. 4, and their description is therefore omitted. In S2302, however, the feature amount extraction unit 1003 extracts feature amounts from each of the plurality of segments and inputs the extracted feature amounts to the feature amount selection unit 1006.
In the following S2303, the feature amount selection unit 1006 selects, from among the input feature amounts, the set of feature amounts whose differences in the feature amount space are the largest, and inputs the selected set of feature amounts to the clustering unit 1007.
Next, the clustering unit 1007 clusters the input set of feature amounts and inputs the clustering result to the speaker diarization unit 1008 (S2304).
Next, the speaker diarization unit 1008 performs speaker diarization based on the input clustering result (S2305). This completes the speaker diarization process S2300.
As described above, the speaker diarization device 1 of the third embodiment does not use all of the feature amounts extracted by the feature amount extraction unit 1003 for clustering, but performs clustering using the feature amounts selected by the feature amount selection unit 1006, so that highly reliable speaker diarization can be realized.
In the above, the number of feature amounts selected from those extracted by the feature amount extraction unit 1003 is set to the number of speakers C, but the number of speakers may instead be estimated for each segment and the estimated number may be used. When estimating the number of speakers, for example, a so-called bottom-up clustering method that successively groups features whose speaker characteristics are close to each other can be used. As a result, fewer feature amounts are selected than the number of speakers C present in the entire signal acquired by the signal input unit 1001, and it becomes possible to prevent feature amounts in which the voices of two speakers are mixed at comparable sound pressures from being used for clustering.
The selection of feature amounts by the feature amount selection unit 1006 may also be performed by a method based on sound pressure. For example, when the sound pressure of the signal acquired by the signal input unit 1001 is low, the signal-to-noise ratio is low, so the reliability of a feature amount representing speaker characteristics extracted from that signal is expected to be low. Accordingly, the feature amount selection unit 1006 may select feature amounts for the number of speakers in descending order of the sound pressure of the source signals. This allows clustering to be performed using only highly reliable features. The feature amount selection unit 1006 may also use the above two feature amount selection methods in combination.
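The sound-pressure-based alternative could be sketched as follows; the assumption that each feature amount is paired with the RMS sound pressure of the recording it was extracted from is introduced only for this example.

```python
def select_features_by_pressure(segment_features, segment_pressures, n_speakers):
    """Sketch: keep the n_speakers feature amounts whose source signals have the
    highest sound pressure, on the assumption that low-pressure (low-SNR)
    recordings yield unreliable speaker features.

    segment_features:  dict m -> feature vector for this segment
    segment_pressures: dict m -> RMS sound pressure of microphone m's recording
    """
    ranked = sorted(segment_features, key=lambda m: segment_pressures[m], reverse=True)
    return {m: segment_features[m] for m in ranked[:n_speakers]}
```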
Although the embodiments of the present invention have been described above, the present invention is not limited to the embodiments described above and includes various modifications. The embodiments described above have been explained in detail in order to describe the present invention in an easy-to-understand manner, and the invention is not necessarily limited to those including all of the described configurations. Part of the configuration of each embodiment can also be added to, deleted from, or replaced with another configuration.
For example, the functions of the speaker diarization device 1 described above can be used in the processing part that performs voice section detection and speaker diarization in a speech recognition system using distributed monaural microphones. The functions of the speaker diarization device 1 can also be applied, for example, to the processing part of such a speech recognition system that determines whose utterance it was after a speech recognition result is obtained.
1  speaker diarization device
14, 15  signal input device
131  speaker diarization execution unit
1001  signal input unit
1002  signal division unit
1003  feature amount extraction unit
1005  voice section detection unit
1006  feature amount selection unit
1007  clustering unit
1008  speaker diarization unit

Claims (14)

  1.  A speaker diarization device configured using an information processing device, comprising:
     a signal division unit that divides a plurality of signals acquired from each of a plurality of audio signal input units into a plurality of segments of a predetermined time width;
     a feature amount extraction unit that extracts a feature amount from each of the segments;
     a clustering unit that collectively clusters the feature amounts extracted from the segments of each of the plurality of signals; and
     a speaker diarization unit that performs speaker diarization based on a result of the clustering.
  2.  The speaker diarization device according to claim 1, wherein
     the feature amount extraction unit extracts, as the feature amount, a feature amount including speaker characteristics.
  3.  The speaker diarization device according to claim 1, wherein
     the feature amount extraction unit extracts, as the feature amount, a feature amount including sound pressure.
  4.  The speaker diarization device according to claim 1, further comprising a voice section detection unit that detects, from each of the plurality of signals, a voice section that is a section containing a voice signal, wherein
     the signal division unit performs the division into the segments on the voice sections of each of the plurality of signals, and
     the feature amount extraction unit extracts the feature amount from each of the segments obtained by the division.
  5.  The speaker diarization device according to claim 1, further comprising a voice section detection unit that detects, from each of the plurality of signals, a voice section that is a section containing a voice signal, wherein
     the signal division unit divides each of the plurality of signals into a plurality of the segments,
     the voice section detection unit determines whether each of the segments is a voice section, and
     the feature amount extraction unit extracts the feature amount from the segments determined to be voice sections.
  6.  The speaker diarization device according to any one of claims 1, 4 and 5, further comprising a feature amount selection unit that selects, from among the feature amounts extracted by the feature amount extraction unit, the feature amounts to be clustered, wherein
     the clustering unit clusters the selected feature amounts.
  7.  The speaker diarization device according to claim 6, wherein
     the feature amount selection unit selects, as targets of the clustering, a predetermined number of feature amounts whose differences in a feature amount space are the largest, from among a plurality of the feature amounts at the same time extracted by the feature amount extraction unit.
  8.  The speaker diarization device according to claim 6, wherein
     the feature amount selection unit selects, as targets of the clustering, a predetermined number of feature amounts in descending order of the sound pressure of the signals from which they were extracted, from among a plurality of the feature amounts at the same time extracted by the feature amount extraction unit.
  9.  A speaker diarization method in which an information processing device executes:
     a step of dividing a plurality of signals acquired from each of a plurality of audio signal input units into a plurality of segments of a predetermined time width;
     a step of extracting a feature amount from each of the segments;
     a step of collectively clustering the feature amounts extracted from the segments of each of the plurality of signals; and
     a step of performing speaker diarization based on a result of the clustering.
  10.  The speaker diarization method according to claim 9, wherein the information processing device further executes:
     a step of detecting, from each of the plurality of signals, a voice section that is a section containing a voice signal;
     a step of performing the division into the segments on the voice sections of each of the plurality of signals; and
     a step of extracting the feature amount from each of the segments obtained by the division.
  11.  The speaker diarization method according to claim 9, wherein the information processing device further executes:
     a step of detecting, from each of the plurality of signals, a voice section that is a section containing a voice signal;
     a step of dividing each of the plurality of signals into a plurality of the segments;
     a step of determining whether each of the segments is a voice section; and
     a step of extracting the feature amount from the segments determined to be voice sections.
  12.  The speaker diarization method according to any one of claims 9, 10 and 11, wherein the information processing device further executes:
     a step of selecting, from among the extracted feature amounts, the feature amounts to be clustered; and
     a step of clustering the selected feature amounts.
  13.  The speaker diarization method according to claim 12, wherein the information processing device further executes:
     a step of selecting, as targets of the clustering, a predetermined number of feature amounts whose differences in a feature amount space are the largest, from among a plurality of the extracted feature amounts at the same time.
  14.  The speaker diarization method according to claim 12, wherein the information processing device further executes:
     a step of selecting, as targets of the clustering, a predetermined number of feature amounts in descending order of the sound pressure of the signals from which they were extracted, from among a plurality of the extracted feature amounts at the same time.
PCT/JP2021/015202 2020-04-30 2021-04-12 Speaker diarization device and speaker diarization method WO2021220789A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020079958A JP7471139B2 (en) 2020-04-30 2020-04-30 SPEAKER DIARIZATION APPARATUS AND SPEAKER DIARIZATION METHOD
JP2020-079958 2020-04-30

Publications (1)

Publication Number Publication Date
WO2021220789A1 true WO2021220789A1 (en) 2021-11-04

Family

ID=78281765

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/015202 WO2021220789A1 (en) 2020-04-30 2021-04-12 Speaker diarization device and speaker diarization method

Country Status (2)

Country Link
JP (1) JP7471139B2 (en)
WO (1) WO2021220789A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010054733A (en) * 2008-08-27 2010-03-11 Nippon Telegr & Teleph Corp <Ntt> Device and method for estimating multiple signal section, its program, and recording medium
JP2014219557A (en) * 2013-05-08 2014-11-20 カシオ計算機株式会社 Voice processing device, voice processing method, and program
US20180075860A1 (en) * 2016-09-14 2018-03-15 Nuance Communications, Inc. Method for Microphone Selection and Multi-Talker Segmentation with Ambient Automated Speech Recognition (ASR)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DING, N.: "Speaker clustering based on inter-utterance distance using phonological information and directional information", PROCEEDINGS OF THE 2014 AUTUMN MEETING OF THE ACOUSTICAL SOCIETY OF JAPAN; SAPPORO; SEPTEMBER 3-5, 2014, 26 August 2014 (2014-08-26) - 5 September 2014 (2014-09-05), JP, pages 133 - 136, XP009531496 *
IWANO, KOJI: "Dialogue group detection and speaker determination in multi-person conversation voice recorded on multiple smartphones", IEICE TECHNICAL REPORT, vol. 114, no. 151 (SP2014-71), 17 July 2017 (2017-07-17), pages 47 - 52, XP009531495 *

Also Published As

Publication number Publication date
JP7471139B2 (en) 2024-04-19
JP2021173952A (en) 2021-11-01

Similar Documents

Publication Publication Date Title
Lukic et al. Speaker identification and clustering using convolutional neural networks
US10366693B2 (en) Acoustic signature building for a speaker from multiple sessions
Takahashi et al. Recursive speech separation for unknown number of speakers
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
JP6158348B2 (en) Speaker verification and identification using artificial neural network based subphoneme discrimination
EP2048656B1 (en) Speaker recognition
Menne et al. Analysis of deep clustering as preprocessing for automatic speech recognition of sparsely overlapping speech
KR101616112B1 (en) Speaker separation system and method using voice feature vectors
WO2020240682A1 (en) Signal extraction system, signal extraction learning method, and signal extraction learning program
JP4787979B2 (en) Noise detection apparatus and noise detection method
CN113808612B (en) Voice processing method, device and storage medium
CN112397093B (en) Voice detection method and device
Hegde et al. Isolated word recognition for Kannada language using support vector machine
Prabavathy et al. An enhanced musical instrument classification using deep convolutional neural network
WO2021220789A1 (en) Speaker diarization device and speaker diarization method
Wang et al. Synthetic voice detection and audio splicing detection using se-res2net-conformer architecture
Zeinali et al. Spoken pass-phrase verification in the i-vector space
WO2019053544A1 (en) Identification of audio components in an audio mix
JP2005321530A (en) Utterance identification system and method therefor
Alex et al. Variational autoencoder for prosody‐based speaker recognition
Khonglah et al. Indoor/Outdoor Audio Classification Using Foreground Speech Segmentation.
KR101069232B1 (en) Apparatus and method for classifying music genre
Mahum et al. EDL-Det: A Robust TTS Synthesis Detector Using VGG19-Based YAMNet and Ensemble Learning Block
Zhu et al. Identify speakers in cocktail parties with end-to-end attention
JP3322491B2 (en) Voice recognition device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21797241

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21797241

Country of ref document: EP

Kind code of ref document: A1