WO2021220789A1

WO2021220789A1 - Speaker diarization device and speaker diarization method

Info

Publication number: WO2021220789A1
Application number: PCT/JP2021/015202
Authority: WO
Inventors: 翔太堀口
Original assignee: 株式会社日立製作所
Priority date: 2020-04-30
Filing date: 2021-04-12
Publication date: 2021-11-04
Also published as: JP7471139B2; JP2021173952A

Abstract

The objective of the present invention is to carry out speaker diarization accurately even when a plurality of speakers are speaking simultaneously. This speaker diarization device divides each of a plurality of signals obtained respectively from a plurality of audio signal input units into a plurality of segments of a prescribed time width, extracts a feature amount from each of the segments, collectively clusters the feature amounts extracted from each of the segments of the plurality of signals, and carries out speaker diarization on the basis of the clustering result. The speaker diarization device detects a voice section, which is a section containing an audio signal, from each of the plurality of signals, divides the voice sections of each of the plurality of signals into segments, and extracts a feature amount from each of the segments obtained by the division.

Description

Speaker dialing device and speaker dialing method

The present invention relates to a speaker dialing device and a speaker dialing method.

This application claims priority based on Japanese Patent Application No. 2020-079958 filed on April 30, 2020, and incorporates the entire disclosure into this application.

Patent Document 1 describes a signal analyzer configured for the purpose of performing optimum dialification and the like. The signal analyzer uses a sound source position occurrence probability matrix Q consisting of the probability that a signal arrives from each sound source position candidate for each frame, which is a time interval for a plurality of sound source position candidates, and each sound source position for each sound source for a plurality of sound sources. Modeled by the product of the sound source position probability matrix B consisting of the probability that a signal arrives from the candidate and the sound source existence probability matrix A consisting of the existence probability of the signal from each sound source for each frame, and based on this modeling, the sound source position At least one of the probability matrix B and the sound source existence probability matrix A is estimated.

Non-Patent Document 1 describes a method for performing speaker dialification. In this method, the voice section of the voice recorded by the monaural microphone is divided into fine segments, the features including the speaker characteristics are extracted from each segment, the features are clustered, and the speaker dialylation is performed from the clustering result. conduct.

Japanese Unexamined Patent Publication No. 2019-184747

In Patent Document 1, the direction of a sound source is estimated from the sound recorded by using a microphone (hereinafter referred to as "microphone") arranged at a predetermined position, and the sound arriving from a different direction is a different speaker. Perform speaker dialisation as if. However, in Patent Document 1, the probability distribution of the feature vector for the frequency bin for each sound source position candidate prepared in advance using the measured data is used by utilizing the fact that the arrangement of the microphone is known in the speaker dialylation. There is. Therefore, if the arrangement of the microphones is unknown and there is no learning data such as a probability distribution, speaker dialylation cannot be performed.

Further, in Non-Patent Document 1, since one monaural microphone is used, each segment obtained by dividing the audio section is assigned to any speaker. Therefore, for example, when a plurality of speakers speak at the same time, it is not possible to determine which speaker should be assigned to that segment. Furthermore, since the voices of all speakers are recorded by one monaural microphone, it is also necessary for all speakers to speak near the monaural microphone.

The present invention has been made in view of such a background, and provides a speaker dialylation device and a speaker dialylation method capable of accurately performing speaker dialimation even when a plurality of speakers speak at the same time. The purpose is to do.

One of the present inventions for achieving the above object is a speaker dialylation device, which is configured by using an information processing device and obtains a plurality of signals obtained from each of a plurality of audio signal input units. A signal dividing unit that divides into a plurality of segments having a predetermined time width, a feature amount extracting unit that extracts a feature amount from each of the segments, and a feature amount extracted from each segment of the plurality of signals are collectively combined. A clustering unit for clustering is provided, and a speaker dialyization unit for performing speaker dialylation based on the result of the clustering is provided.

In addition, the problems disclosed in the present application and the solutions thereof will be clarified by the column of the form for carrying out the invention and the drawings.

According to the present invention, speaker dialification can be performed with high accuracy even when a plurality of speakers speak at the same time.

It is a hardware block diagram of the speaker dialyization apparatus of 1st Embodiment. It is a figure explaining the detail of the speaker dialyization execution part. It is a schematic diagram explaining the result of clustering and the result of speaker dialyization. It is a flowchart explaining a speaker dialyization process. It is a figure explaining the detail of the speaker dialyization execution part of 2nd Embodiment. It is a flowchart explaining a speaker dialyization process. It is a figure which shows the modification of the speaker dialyization execution part. It is a flowchart explaining a speaker dialyization process. It is a figure explaining the detail of the speaker dialyization execution part of 3rd Embodiment. It is a figure which shows the arrangement example of a speaker and a microphone. It is a schematic diagram explaining the distribution of a feature amount in a feature amount space. It is a schematic diagram explaining the result of clustering. It is a flowchart explaining a speaker dialyization process.

Hereinafter, the embodiment will be described in detail with reference to the drawings. However, the present invention is not construed as being limited to the description of the embodiments shown below. It is easily understood by those skilled in the art that a specific configuration thereof can be changed without departing from the idea or gist of the present invention.

In the configuration of the invention described below, the same reference numerals may be used in common between different drawings for the same parts or parts having similar functions, and duplicate explanations may be omitted. Further, when there are a plurality of elements having the same or similar functions, they may be described by adding different subscripts to the same reference numerals. However, when it is not necessary to distinguish between a plurality of elements, the subscript may be omitted for explanation. In addition, the notations such as "first", "second", and "third" in the present specification and the like are attached to identify the components, and the number, order, or contents thereof are not necessarily limited. It's not something to do. In addition, the numbers for identifying the components are used for each context, and the numbers used in one context do not always indicate the same composition in the other contexts. Further, it does not prevent the component identified by a certain number from having the function of the component identified by another number. In the following description, the letter "S" prefixed with the code means a processing step.

[First Embodiment]
FIG. 1 shows a hardware configuration of a device for performing speaker diarisation (hereinafter, referred to as “speaker diarization device 1”) described as the first embodiment. The speaker dialyration device 1 is an information processing device (computer), and includes a processor 11, a ROM 12 (ROM: Read Only Memory), a RAM 13 (RAM: Random Access Memory), and two

signal input devices

14a and 14b. These are connected to each other so as to be able to communicate with each other through a bus 10 or the like. The speaker dialing device 1 illustrated includes two

signal input devices

14a and 14b, but the speaker dialing device 1 may include three or more signal input devices. The

signal input devices

14a and 14b may be a voice input device such as a microphone (hereinafter, referred to as a “microphone”), or may be a device that outputs a voice signal after reverberation removal, sound source separation, or the like is performed. The RAM 13 stores a program for realizing the function of the speaker dialing device 1 (hereinafter, referred to as "speaker dialing execution unit 131").

The speaker dialyration device 1 may be configured by using a plurality of information processing devices connected so as to be able to communicate with each other. In addition, the speaker dialyration device 1 provides all or part of the virtual information using virtualization technology, process space separation technology, or the like, such as a virtual server provided by a cloud system. It may be realized by using processing resources. Further, all or a part of the functions provided by the speaker dialyrating device 1 may be realized by, for example, a service provided by a cloud system via an API (Application Programming Interface) or the like. In addition, the functions of the speaker dialylation execution unit 131 and the like included in the speaker dialylation device 1 include DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), ASIC (Application Specific Integrated Circuit), and AI (Artificial Intelligence). It may be realized by hardware such as a chip.

FIG. 2 is a diagram for explaining the details of the speaker dialyization execution unit 131. As shown in the figure, the speaker dialyization execution unit 131 includes

signal input units

1001a and 1001b,

signal division units

1002a and 1002b, feature

amount extraction units

1003a and 1003b, clustering unit 1007, and speaker dialation unit 1008. Includes features.

A signal is input to the signal input unit 1001a from the signal input device 14a. Further, a signal is input to the signal input unit 1001b from the signal input device 14b. Regarding the signal input from the signal input device 14a, the processing performed by the signal input unit 1001a, the signal division unit 1002a, and the feature amount extraction unit 1003a, and the signal input from the signal input device 14b are signals. Since the processes performed by the input unit 1001b, the signal division unit 1002b, and the feature amount extraction unit 1003b are basically the same, only the former will be described below, and the latter will be omitted unless otherwise required. Further, unless it is particularly necessary to distinguish between them, the description of the subscripts (“a” and “b”) for distinguishing them is omitted. In the present embodiment, the case where the speaker dialyration device 1 includes two signal input devices 14 will be described, but the set of the signal input unit 1001, the signal division unit 1002, and the feature amount extraction unit 1003 is a signal input. It is provided according to the number of devices 14.

The signal input unit 1001 acquires a signal (hereinafter, referred to as an “input signal”) input from the signal input device 14. The input signal is converted from an analog value to a digital value by, for example, an AD conversion unit (not shown). Further, the input signal is simply a recorded audio signal when the signal input device 14 is a microphone. The input signal may be, for example, an audio signal after reverberation removal, speech enhancement, and sound source separation have been performed in advance. _{The signal x m} acquired by the signal input unit 1001 from the signal input device 14 can be expressed as follows, for example.

Here, m indicates the number of signal input devices, and t indicates the time. The signals input to the two

signal input devices

14a and 14b do not necessarily have the same _{start time t m and start} and end time t _{m and end.} That is, the start times t _{m and start} and the end times t _{m and end} of the

signal input devices

14a and 14b may be different, respectively.

The signal dividing unit 1002 divides the signal acquired from the signal input unit 1001 into a plurality of segments having a predetermined time width. The signal acquired from the signal input device 14 in the segment s can be described as follows.

Here, the start time ts _{, start} and end time ts _{, end} of each segment s are defined as variables that do not depend on the signal input device 14. The time width of the segment s is expressed as follows.

The time width of the segment s is set to, for example, about 1.5 seconds, but is not limited to this. For example, if a time width longer than 1.5 seconds is adopted as the time width of the segment s, more signals can be used when the feature amount representing the speaker character is extracted by the feature amount extraction unit 1003 in the subsequent stage. , The reliability of the feature quantity can be improved. Further, if a time width shorter than 1.5 seconds is adopted as the time width of the segment s, the time unit for performing speaker dialylation becomes shorter, and the speaker dialylation unit 2008 in the subsequent stage has a high-grained speaker dialylation. Can be realized.

Further, each segment s is not limited to being simply divided with the same time width as described above, and a part of adjacent segments s may overlap each other. For example, if the overlapping time width between adjacent segments s is set shorter than the time width of the segments s itself, a high-grain size speaker dialylation is realized without impairing the reliability of the feature quantity representing the speaker characteristics. can do.

The feature amount extraction unit 1003 extracts a feature amount representing speaker characteristics from each segment s obtained by the signal division unit 1002. The feature quantities that represent the speaker characteristics extracted by the feature quantity extraction unit 1003 include, for example, a vector having a fundamental frequency and a formant frequency as elements, a GMM (Gaussian Mixture Model) super vector, an HMM (Hidden Markov Model) super vector, and i-. There are vector, d-vector, x-vector, and combinations of these.

When the two

signal input devices

14a and 14b are microphones distributed in a room, for example, a microphone array such as a smart speaker in which the microphones are separated by only a few centimeters. Different, the same utterance can be recorded with significantly different sound pressures depending on the microphone. That is, when the utterance is recorded by the microphone closer to the speaker, the sound pressure is high, and when the utterance is recorded by the microphone far from the speaker, the sound pressure is low. Therefore, as a feature quantity representing the relative position between the microphone and the speaker, a vector in which the sound pressures recorded by the microphone are arranged, or a vector obtained by reducing the dimension by principal component analysis or the like may be used. Further, a feature amount in which a feature amount representing the speaker character and a feature amount representing the relative position of the microphone and the speaker are connected may be used. _{The feature amounts v m, s} extracted by the feature amount extraction unit 1003 in this way can be expressed as follows.

Here, S _m is a set of segments s included in the recording section of the signal input device 14.

The clustering unit 1007 collectively clusters the feature quantities extracted by the feature

quantity extraction units

1003a and 1003b, respectively. That is, when the

signal input devices

14a and 14b are, for example, microphones, the set is set as M, and the vectors represented by the following equations are clustered at once.

The above clustering method is not necessarily limited, but for example, K-means clustering, Mean-shift clustering, aggregated hierarchical clustering, and the like can be used.

In FIG. 3, three microphones 1 to 3 are prepared as the signal input device 14, and the result of clustering by the speaker dialing device 1 and the result of speaker dialing when two speakers A and B speak. Is a diagram schematically showing.

As shown on the left side of the figure, the voice recorded by the microphones 1 to 3 is clustered into two clusters, cluster A (area indicated by diagonal lines) and cluster B (area indicated by dots). Here, since the clustering unit 1007 collectively clusters the feature quantities extracted from the voices recorded through the three microphones 1 to 3, the speaker's voice is recorded by one of the microphones at a sufficiently large sound pressure. It suffices, and the speaker dialylation in a wider space becomes possible as compared with the case of using one microphone. In addition, by clustering the features collectively, it is possible to assign a different cluster to each segment s acquired through different microphones at the same time, and speaker dialylation considering the overlap of utterances becomes possible. ..

The speaker dialylation unit 1008 performs speaker dialylation based on the result of clustering by the clustering unit 1007. By using the result of clustering by the clustering unit 1007, the result D of the speaker dialylation can be obtained from the following equation.

Here, Ω _c is a set of features belonging to cluster c, S is the number of segments, and C is the number of clusters (number of speakers). As described above, speaker dialylation is performed as shown on the right side of the figure.

FIG. 4 is a flowchart illustrating a process performed by the speaker dialing device 1 (hereinafter, referred to as “speaker dialing process S2000”). Hereinafter, the speaker dialyration process S2000 will be described with reference to the figure.

First, the signal input unit 1001 inputs the input signal acquired from the signal input device 14 to the signal division unit 1002, and the signal division unit 1002 divides the input signal into a plurality of segments having a predetermined time width (S2001).

Subsequently, the signal dividing unit 1002 inputs the divided plurality of segments to the feature amount extracting unit 1003, the feature amount extracting unit 1003 extracts the feature amount from each of the plurality of segments, and the extracted feature amount is used as the clustering unit 1007. (S2002).

Subsequently, the clustering unit 1007 collectively clusters the feature amounts input from each of the feature amount extraction units 1003 (1003a, 1003b), and inputs the result to the speaker dialylation unit 1008 (S2003).

Subsequently, the speaker dialylation unit 1008 performs speaker dialylation based on the input clustering result (S2004). This completes the speaker dialylation process S2000.

As described above, according to the speaker dialylation device 1 of the present embodiment, speaker dialimization can be performed with high accuracy even when a plurality of speakers speak at the same time.

[Second Embodiment]
The speaker dialymating device 1 of the second embodiment has a function of detecting a voice section before the signal input unit 1001 inputs the acquired input signal to the signal dividing unit 1002. It is different from the dialylation device 1. Other configurations of the speaker dialylation device 1 of the second embodiment are basically the same as those of the first embodiment. Hereinafter, the differences from the first embodiment will be mainly described.

FIG. 5 is a diagram for explaining the details of the speaker dialylation execution unit 131 of the speaker dialylation device 1 shown as the second embodiment. As shown in the figure, the speaker dialyization execution unit 131 of the second embodiment has a configuration in which the voice section detection unit 1005 is interposed between the signal input unit 1001 and the signal division unit 1002. It is different from the speaker dialyization execution unit 131 of the form.

The voice section detection unit 1005 detects a voice section for the input signal input from the signal input unit 1001 and outputs the signal of the detected voice section to the signal division unit 1002. For example, the voice section detection unit 1005 detects a section in which the sound pressure exceeds a predetermined threshold value as a voice section for the input signal input from the signal input unit 1001. Further, the voice section detection unit 1005 detects a voice section by inputting an input signal to a machine learning model (voice section detector) learned by using a method such as DNN (Deep Neural Network).

The signal division unit 1002 divides the voice section into a plurality of segments for the signal input from the voice section detection unit 1005, inputs the obtained segment to the feature amount extraction unit 1003, and the feature amount extraction unit 1003 , Extract features from the input segment.

FIG. 6 is a flowchart illustrating a process (hereinafter, referred to as “speaker dialulation process S2100”) performed by the speaker dialyration device 1 of the second embodiment. Hereinafter, the speaker dialyration process S2100 will be described with reference to the figure.

First, the signal input unit 1001 inputs the input signal acquired from the signal input device 14 to the voice section detection unit 1005, the voice section detection unit 1005 detects the voice section from the input signal, and the detected voice section signal is divided into signals. Input to unit 1002 (S2101).

Subsequently, the signal division unit 1002 divides the voice section into a plurality of segments for the signal input from the voice section detection unit 1005, and inputs the obtained segment to the feature amount extraction unit 1003 (S2102).

Subsequently, the feature amount extraction unit 1003 extracts the feature amount from the input segment, and inputs the extracted feature amount to the clustering unit 1007 (S2103).

The subsequent processing of S2104 to S2105 is the same as the processing of S2003 to S2004 in FIG. 4, and thus the description thereof will be omitted.

In the above, as shown in FIG. 5, the voice section detection unit 1005 is interposed between the signal input unit 1001 and the signal division unit 1002, but the voice section detection unit 1005 is implemented in another embodiment. You can also do it.

For example, as shown in FIG. 7, the voice section detection unit 1005 may be interposed between the signal division unit 1002 and the feature amount extraction unit 1003, that is, after the signal division unit 1002. In this case, the voice section detection unit 1005 detects the voice section from a plurality of segments divided by the signal division unit 1002. The voice section detection unit 1005 inputs the signal of the detected voice section to the feature amount extraction unit 1003. The feature amount extraction unit 1003 extracts the feature amount from the segment including the voice section acquired from the voice section detection unit 1005 and inputs it to the clustering unit 1007. The clustering unit 1007 collectively clusters the feature amounts input from the feature

amount extraction units

1003a and 1003b, and inputs the result to the speaker dialyization unit 1008. The speaker dialylation unit 1008 performs speaker dialylation based on the input clustering result.

FIG. 8 is a flowchart illustrating a process (hereinafter, referred to as “speaker dialulation process S2200”) performed by the speaker dialylation execution unit 131 shown in FIG. 7. Hereinafter, the speaker dialyration process S2200 will be described with reference to the figure.

First, the processing of S2201 is the same as that of S2001 of the speaker dialing processing S2200 of the first embodiment shown in FIG. 4, and the signal dividing unit 1002 divides the signal acquired by the signal input unit 1001 into a plurality of segments. , The divided plurality of segments are input to the voice section detection unit 1005.

Subsequently, the voice section detection unit 1005 detects the segment including the voice section from the plurality of segments input from the signal division unit 1002, and outputs the segment including the detected voice section to the feature amount extraction unit 1003. (S2202).

Subsequently, the feature amount extraction unit 1003 extracts the feature amount from the segment including the voice section input from the voice section detection unit 1005, and inputs the extracted feature amount to the clustering unit 1007 (S2203).

Subsequent processing of S2204 to S2205 is the same as that of S2104 to S2105 of FIG.

As described above, the speaker dialyming device 1 of the second embodiment detects a voice section from the signal acquired by the signal input unit 1001 and extracts a feature amount for the detected voice section. Therefore, the non-voice section is excluded from the feature amount extraction target, and clustering can be efficiently performed in a short time. In addition, non-speech sections such as silent sections and noise sections are excluded from the feature amount extraction targets, and the accuracy of speaker dialification can be improved.

[Third Embodiment]
The speaker dialing device 1 in the first embodiment and the second embodiment collectively clusters all the feature quantities extracted by the feature quantity extracting unit 1003, and performs speaker dialiation based on the clustering result. conduct. On the other hand, the speaker dialylation device 1 of the third embodiment selects a feature amount to be used for clustering from the feature amounts extracted by the feature amount extraction unit 1003, and performs clustering using the selected feature amount. Hereinafter, the speaker dialing device 1 of the third embodiment will be described focusing on the differences from the speaker dialing device 1 of the first embodiment. The speaker dialing device 1 of the third embodiment may include the configuration of the speaker dialing device 1 of the second embodiment.

FIG. 9 is a diagram for explaining the details of the speaker dialyization execution unit 131 of the third embodiment. As shown in the figure, the speaker dialyization execution unit 131 of the third embodiment has a feature amount selection unit 1006 interposed between the feature amount extraction unit 1003 and the clustering unit 1007, that is, in front of the clustering unit 1007. The configuration is different from that of the first embodiment.

The feature amount selection unit 1006 selects the feature amount to be used for clustering from the feature amounts extracted by the feature

amount extraction units

1003a and 1003b. The clustering unit 1007 performs clustering using the feature amount selected by the feature amount selection unit 1006. The feature amount selection unit 1006 selects the feature amount as follows, for example.

FIG. 10 is a diagram illustrating a method in which the feature amount selection unit 1006 selects a feature amount, and is an example in which two speakers A and B and three microphones (1) to (3) are arranged. The microphones (1) to (3) are arranged in the space between the speaker A and the speaker B. The microphone (1) is located closest to speaker A, the microphone (3) is located closest to speaker B, and the microphone (2) is of microphone (1) and microphone (3). It is placed in the space between them.

Consider the case where speaker A and speaker B speak at the same time in the arrangement shown in FIG. In this case, the voice of speaker A is recorded at a higher sound pressure than the voice of speaker B in the microphone (1), and the voice of speaker B is recorded in the microphone (3) more than the voice of speaker A. It is expected to be recorded with a large sound pressure.

FIG. 11 is a schematic diagram illustrating the distribution of the feature amount in the feature amount space. If the feature amounts representing the speakers extracted from the voices recorded by the microphones (1) to (3) are the feature amounts (1) to (3), the feature amounts (1) to (3) are It is expected that the voices of the speaker A and the speaker B are substantially arranged in a line according to the mixing ratio of the voices in the feature space. These features are arranged densely in this row as the number of microphones increases, but clustering using all of these features may adversely affect clustering. That is, for example, the center of the cluster of speaker A exists near the feature amount (1), and the center of the cluster of speaker B exists near the feature amount (3), but the feature amount (2) also exists. When used for clustering, the center of the cluster moves in the feature quantity (2) direction. Therefore, in the present embodiment, this problem is solved by appropriately selecting the feature amount used for clustering.

First, in the signal acquired by the signal input unit 1001, the set V _s of the feature quantities in the segment s is expressed by the following equation.

When the number of elements of this set is larger than the number of speakers included in the signal acquired by the signal input unit 1001, the feature amount of the number of speakers (denoted by C) is selected from the number of speakers. The feature amount is selected according to, for example, the following equation.

_{_{Here, dist (v i, v j}} ) is a function representing a distance between the feature amount, for example, can be used as the Euclidean distance.

In the example of FIG. 11, the speaker dialyization execution unit 131 selects and clusters the two most distant features among the features extracted by the feature extraction unit 1003. That is, the speaker dialyization execution unit 131 selects a set of the feature amount (1) and the feature amount (3) that maximizes the difference in the feature amount space, and performs clustering based on the selected feature amount set. As a result, clustering is performed using only the features extracted from the segments in which the voices of the speakers A and B are predominantly recorded.

FIG. 12 is a schematic diagram showing an example of the clustering result by the speaker dialylation device of the third embodiment. In the figure, the areas indicated by diagonal lines are clusters in which the feature amounts of speaker A are grouped, and the areas indicated by dots are clusters in which the feature amounts of speaker B are grouped. Areas shown in black are areas that are not used for clustering.

FIG. 13 is a flowchart illustrating the speaker dialiation process S2300 performed by the speaker dialyration device of the third embodiment. Hereinafter, the speaker dialyration process S2300 will be described with reference to the figure.

First, since the processes S2301 to S2302 in the figure are the same as the processes S2001 to S2002 in the speaker dialylation process S2000 shown in FIG. 4, the description thereof will be omitted. In S2302, the feature amount extraction unit 1003 extracts the feature amount from each of the plurality of segments, and inputs the extracted feature amount to the feature amount selection unit 1006.

In the following S2303, the feature amount selection unit 1006 selects the feature amount set having the maximum difference in the feature amount space from the input feature amounts, and inputs the selected feature amount set to the clustering unit 1007. do.

Subsequently, the clustering unit 1007 clusters the input feature set, and inputs the clustering result to the speaker dialylation unit 1008 (S2304).

Subsequently, the speaker dialylation unit 1008 performs speaker dialylation based on the input clustering result (S2305). This completes the speaker dialylation process S2300.

As described above, the speaker dialylation device 1 of the third embodiment does not use all the feature amounts extracted by the feature amount extraction unit 1003 for clustering, but uses the feature amount selected by the feature amount selection unit 1006. Since clustering is performed, highly reliable speaker dialification can be realized.

In the above, the number of feature quantities selected from the feature quantities extracted by the feature quantity extraction unit 1003 is defined as the number of speakers C. However, the number of speakers is estimated for each segment, and the estimated number of speakers is used. May be used instead. When estimating the number of speakers, for example, a so-called bottom-up clustering method in which features close to the speakers are sequentially grouped can be used. As a result, the feature amount to be selected is smaller than the number of speakers C existing in the entire signal acquired by the signal input unit 1001, and the mixing ratio of the two speakers is close to 0, that is, the sound of the two speakers. It is possible to prevent the features that are mixed at the same sound pressure from being used for clustering.

Further, the feature amount selection by the feature amount selection unit 1006 may be performed by a method based on the sound pressure. For example, when the sound pressure of the signal acquired by the signal input unit 1001 is small, the signal-to-noise ratio is small, so it is expected that the reliability will be low when the feature amount representing the speaker is extracted. Therefore, the feature amount selection unit 1006 may select the feature amount for the number of speakers in descending order of the sound pressure that maximizes the difference in the feature amount space. As a result, clustering can be performed using only highly reliable features. The feature amount selection unit 1006 may use the above two feature amount selection methods in combination.

Although the embodiments of the present invention have been described above, the present invention is not limited to the embodiments described above, and includes various modifications. Further, for example, the embodiments described above have been described in detail in order to explain the present invention in an easy-to-understand manner, and are not necessarily limited to those including all the described configurations. In addition, a part of the configuration of each embodiment can be added, deleted, or replaced with another configuration.

For example, the function of the speaker dialing device 1 described above can be used, for example, in a processing portion for performing voice section detection and speaker dialing in a voice recognition system using a distributed monaural microphone. Further, the function of the speaker dialyration device 1 can be applied to, for example, in the above-mentioned voice recognition system, a processing portion for determining who spoke after the voice recognition result is obtained.

1 Speaker dialing device 14, 15 Signal input device 131 Speaker dialing execution unit 1001 Signal input unit 1002 Signal splitting unit 1003 Feature amount extraction unit 1005 Voice section detection unit 1006 Feature amount selection unit 1007 Clustering unit 1008 Speaker dialing Department

Claims

It is configured using an information processing device,
A signal division unit that divides a plurality of signals acquired from each of the input units of a plurality of audio signals into a plurality of segments having a predetermined time width, respectively.
A feature amount extraction unit that extracts a feature amount from each of the segments, and a feature amount extraction unit.
A clustering unit that collectively clusters the features extracted from each segment of the plurality of signals, and a clustering unit.
A speaker dialylation unit that performs speaker dialylation based on the results of the clustering, and
A speaker dialylation device.
The speaker dialylation device according to claim 1.
The feature amount extraction unit extracts a feature amount including speaker characteristics as the feature amount.
Speaker dialyration device.
The speaker dialylation device according to claim 1.
The feature amount extraction unit extracts a feature amount including sound pressure as the feature amount.
Speaker dialyration device.
The speaker dialylation device according to claim 1.
Further, a voice section detection unit for detecting a voice section which is a section including a voice signal from each of the plurality of signals is provided.
The signal dividing unit divides each of the plurality of signals into the segments for the voice sections.
The feature amount extraction unit extracts the feature amount from each of the segments obtained by the division.
Speaker dialyration device.
The speaker dialylation device according to claim 1.
Further, a voice section detection unit for detecting a voice section which is a section including a voice signal from each of the plurality of signals is provided.
The signal dividing unit divides the plurality of signals into a plurality of the segments, respectively.
The voice section detection unit determines whether or not the segment is a voice section, and determines whether or not the segment is a voice section.
The feature amount extraction unit extracts the feature amount from the segment determined to be a voice section.
Speaker dialyration device.
The speaker dialylation device according to any one of claims 1, 4 and 5.
A feature amount selection unit for selecting the feature amount to be clustered from the feature amount extracted by the feature amount extraction unit is further provided.
The clustering unit clusters the selected features.
Speaker dialyration device.
The speaker dialylation device according to claim 6.
The feature amount selection unit targets a predetermined number of the feature amounts with the maximum difference in the feature amount space from among a plurality of the feature amounts at the same time extracted by the feature amount extraction unit as the target of the clustering. select,
Speaker dialyration device.
The speaker dialylation device according to claim 6.
The feature amount selection unit clusters a predetermined number of the feature amounts from a plurality of the feature amounts extracted by the feature amount extraction unit at the same time in descending order of sound pressure of the signal of the extraction source. Select as a target,
Speaker dialyration device.
Information processing device
A step of dividing a plurality of signals acquired from each of a plurality of audio signal input units into a plurality of segments having a predetermined time width, respectively.
Steps to extract features from each of the segments,
A step of collectively clustering the features extracted from each segment of the plurality of signals, and
A step of performing speaker dialyration based on the result of the clustering, and
How to perform speaker dialification.
The speaker dialylation method according to claim 9.
The information processing device
A step of detecting a voice section which is a section including a voice signal from each of the plurality of signals, a step of dividing the voice section of each of the plurality of signals into the segment, and a step of dividing the voice section into the segments.
A step of extracting the feature amount from each of the segments obtained by the division, and
A speaker dialyization method that further performs.
The speaker dialylation method according to claim 9.
The information processing device
A step of detecting a voice section which is a section including a voice signal from each of the plurality of signals, and a step of detecting the voice section.
A step of dividing the plurality of signals into a plurality of the segments, respectively,
A step of determining whether or not the segment is an audio section,
A step of extracting the feature amount for the segment determined to be a voice section, and
A speaker dialyization method that further performs.
The speaker dialing method according to any one of claims 9, 10 and 11.
The information processing device
A step of selecting the feature amount to be clustered from the extracted feature amounts, and
A step of clustering the selected features and
A speaker dialyization method that further performs.
The speaker dialylation method according to claim 12.
The information processing device
A step of selecting a predetermined number of the feature quantities having the maximum difference in the feature quantity space as the target of the clustering from the plurality of the feature quantities extracted at the same time.
A speaker dialyization method that further performs.
The speaker dialylation method according to claim 12.
The information processing device
A step of selecting a predetermined number of the feature quantities as targets for clustering from a plurality of the feature quantities extracted at the same time in descending order of sound pressure of the signal of the extraction source.
A speaker dialyization method that further performs.