JP2022071960A - Utterance cutting and dividing system and method therefor

Info

Publication number: JP2022071960A
Application number: JP2020181115A
Authority: JP
Inventors: ソロビヨフ・イワン; Solov'ev Ivan; 開平井; Kai Hirai
Original assignee: Nsd Advanced Technology Research Institute Co Ltd
Current assignee: Nsd Advanced Technology Research Institute Co Ltd
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2022-05-17
Anticipated expiration: 2040-10-29
Also published as: JP7356960B2

Abstract

PROBLEM TO BE SOLVED: To reliably cut out and separate speakers and their utterances in real time even when a plurality of speakers are present in the same space.

SOLUTION: An utterance cutting and dividing system 2 comprises: microphones M1 to Mn worn by speakers A to N, respectively; an utterance section detection unit 4 which detects, in the mixed voice data acquired by each of the microphones M1 to Mn, each utterance section from the start to the end of the voice data; a detected voice storage unit 5 which stores the voice data; a similarity determination unit 6 which references the voice data acquired by the respective microphones M1 to Mn in synchronization, calculates the similarities between the pieces of voice data, and determines by comparison whether each similarity is high or low; and a voice energy determination unit 7 which compares the voice energy of the pieces of voice data that the similarity determination unit 6 has determined to be highly similar, identifies the microphone whose voice energy is relatively large, and associates the stored voice data acquired from that microphone with the speaker wearing it.

SELECTED DRAWING: Figure 2

Description

The present invention relates to an utterance cutting and dividing system, and a method therefor, used for preparing meeting minutes and for creating call records in call centers and the like.

Conventionally, in settings where multiple speakers talk, such as conferences and meetings, one known way of cutting out utterance sections that do not overlap with those of other speakers is to select, based on the audio signals of the microphones of a microphone array installed at the venue, the sound pickup beam signal with the highest signal strength, detect the corresponding direction, estimate the direction of arrival of the sound from the direction data, and thereby identify the speaker (see, for example, Patent Document 1).

In another known technique, multiple audio signals acquired from multiple microphones are processed to remove the overlapping portions from the audio data, and when two or more voices are contained, the voices are separated and each audio signal is output individually (see, for example, Patent Document 2).

Patent Document 1: International Publication WO2007-139040A1
Patent Document 2: Japanese Unexamined Patent Publication No. 2008-309856

However, because Patent Document 1 identifies speakers from direction data, it is difficult to separate utterances without overlap, and also difficult to identify the speaker accurately. In Patent Document 2, the speaker's utterances in the separated audio signals are stored as audio signals for each feature quantity, and it is necessary not only to use a dictionary prepared for each feature quantity but also to update the separation filter, which makes the filter computation complicated.

The present invention has been made to solve the above problems. Its object is to provide an utterance cutting and dividing system, and a method therefor, that require neither independence evaluation nor voice separation processing and that, with a simple configuration using only similarity and volume, can cut out the voices of multiple speakers without overlap while accurately identifying each speaker and that speaker's utterances. The intended settings are conferences, call centers, intercom calls, and the like where multiple speakers are present in the same space, as well as online conferences and similar discussions in which each participant hears the other speakers through the loudspeaker of his or her own terminal.

The utterance cutting and dividing system according to claim 1 of the present invention cuts out utterances based on voice data that is acquired and input with the voices of multiple speakers mixed together. The system comprises a voice input unit for each speaker, into which the speaker's own voice and the voices of others are input in a mixed state. For the mixed voice data acquired at each voice input unit, the system detects each utterance section, from the start to the end of the voice data, and accumulates the speaker's own voice data input from that speaker's voice input unit. It then references the voice data accumulated for each speaker from the respective voice input units in synchronization, calculates the similarity between the pieces of voice data, and determines by comparison whether the similarity is high or low. Voice data with low similarity is regarded as voice data of different speakers, and voice data with high similarity is regarded as voice data of the same speaker. For the highly similar voice data regarded as coming from the same speaker, the system compares the magnitude of the voice energy, identifies the voice data determined to have relatively large voice energy as the speaker's own utterance input from that speaker's own voice input unit, and thereby separates the speaker's own utterances from those of others.

The utterance cutting and dividing system according to claim 1 of the present invention is configured as described above: it detects and accumulates each speaker's voice data per utterance section, references the accumulated voice data in synchronization, regards highly similar voice data as coming from the same speaker, and identifies the voice data with relatively large voice energy as the utterance of the speaker at that voice input unit. As a result, the content of each speaker's utterances can be cut out quickly, accurately, and without overlap.
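The similarity-then-energy decision described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not state how similarity is computed, so the zero-lag normalized correlation, the threshold value `sim_threshold`, the greedy grouping, and the function names `similarity`, `energy`, and `attribute_utterances` are all assumptions introduced here. The segments are assumed to be already time-aligned.

```python
import math


def energy(samples):
    """Mean squared amplitude of one utterance segment."""
    return sum(v * v for v in samples) / len(samples) if samples else 0.0


def similarity(a, b):
    """Absolute normalized correlation at zero lag, in [0, 1].

    Assumption: the source only says a similarity is calculated;
    this particular measure is chosen purely for illustration.
    """
    n = min(len(a), len(b))
    dot = sum(a[i] * b[i] for i in range(n))
    norm_a = math.sqrt(sum(v * v for v in a[:n]))
    norm_b = math.sqrt(sum(v * v for v in b[:n]))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return abs(dot) / (norm_a * norm_b)


def attribute_utterances(segments, sim_threshold=0.8):
    """Decide which voice input unit each utterance belongs to.

    segments: dict mapping a voice-input-unit id to its accumulated,
    time-aligned samples for the same utterance section.
    Returns the ids whose segment is attributed to that unit's own speaker.
    """
    groups = []  # each group holds ids regarded as the same speaker's voice
    for unit_id in segments:
        for group in groups:
            # High similarity: the same voice picked up by several units.
            if similarity(segments[unit_id], segments[group[0]]) >= sim_threshold:
                group.append(unit_id)
                break
        else:
            # Low similarity to every group: a different speaker.
            groups.append([unit_id])
    # Within each group, the unit with the largest energy is taken to be
    # the one closest to the speaker, i.e. the speaker's own unit.
    return [max(g, key=lambda u: energy(segments[u])) for g in groups]


if __name__ == "__main__":
    n = 200
    voice_a = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
    voice_c = [math.sin(2 * math.pi * 13 * i / n) for i in range(n)]
    demo = {
        "A": voice_a,                      # speaker A at A's own unit
        "B": [0.3 * v for v in voice_a],   # A's voice leaking into B's unit
        "C": voice_c,                      # a different speaker
    }
    print(attribute_utterances(demo))
```

In the demo, the attenuated copy at unit "B" is highly similar to the signal at unit "A" but has lower energy, so it is treated as leakage of speaker A's voice rather than as an utterance by speaker B.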

In the utterance cutting and dividing system according to the present invention, it is preferable to identify the speaker and the utterance based on the voice input unit whose input was identified as the speaker's own utterance. With this configuration, an utterance can be cut out with both its content and its speaker identified, rather than its content alone.

The utterance cutting and dividing system according to claim 3 of the present invention cuts out utterances based on voice data that is acquired and input with the voices of multiple speakers mixed together, and comprises: a voice input unit for each speaker, into which the speaker's own voice and the voices of others are input in a mixed state; an utterance section detection unit, provided for each voice input unit, which detects, in the mixed voice data acquired from its own voice input unit, the utterance section from the start to the end of the speaker's own utterance; a detected voice storage unit, provided for each utterance section detection unit, which accumulates the voice data of the detected utterance section; a similarity determination unit which references the utterance section detection units and their detected voice storage units in synchronization, calculates the similarity between the pieces of voice data accumulated in the detected voice storage units, determines by comparison whether the similarity is high or low, regards voice data with low similarity as voice data of different speakers, and regards voice data with high similarity as voice data of the same speaker acquired through multiple voice input units; and a voice energy determination unit which, for the voice data that the similarity determination unit has determined to be from the same speaker, calculates the voice energy of each piece of voice data, compares the magnitudes, and identifies the utterance section detection unit from which the voice data with relatively high voice energy was acquired. The speaker and the speaker's utterance are cut out based on the identified utterance section detection unit and the voice data accumulated in its detected voice storage unit.

In the utterance cutting and dividing system according to claim 3 of the present invention, configured with the units described above, the speaker is specified in advance for each voice input unit. The voice data accumulated per utterance section is referenced in synchronization, the similarity determination unit determines whether the similarity is high or low, and the voice energy determination unit compares the voice energy of the highly similar voice data regarded as coming from the same speaker. These steps alone make it possible to cut out each speaker and the voice data that speaker uttered quickly, accurately, and without overlap.

In the utterance cutting and dividing system according to the present invention, it is preferable that the voices of the speaker and of other speakers are input to the voice input unit either through a microphone provided for each speaker or through the microphone of the speaker's own terminal. When input is through per-speaker microphones, speakers and their utterances can be identified without overlap even when multiple speakers talk in the same space. When input is through the microphone of each speaker's own terminal, speakers and their utterances can be identified without overlap even when multiple speakers at remote locations talk through their terminals.

It is also preferable that the voices are input to the voice input unit either in real time through a microphone or through a recording unit in which previously acquired voice has been recorded as voice data. With real-time input through a microphone, the data on the speakers and their utterances is available immediately after the discussion ends. Once voice data has been recorded in the recording unit, the data on the speakers and their utterances can be obtained through the recording unit whenever it is needed.

It is further preferable that the voice of a speaker group consisting of multiple speakers is input to each voice input unit, and that each speaker group and its utterances are cut out. With this configuration, utterance data can be obtained per speaker group rather than per individual speaker.

In the utterance cutting and dividing system according to the present invention, it is preferable that the input voice data is supplied to the utterance section detection unit as voice frames divided at fixed intervals, that each voice frame is handled in one of two detection states, undetected or detecting, and that the utterance section detection unit has: an utterance start detection unit which sets the initial state to undetected and changes the detection state to detecting when it detects the start of an utterance while the state is undetected; and an utterance end detection unit which accumulates voice data in the detected voice storage unit while the detection state is detecting and, on detecting the end of the utterance, outputs or deletes the voice data accumulated in the detected voice storage unit and changes the detection state back to undetected. With this configuration, the voice data of each utterance section can be obtained accurately.

It is also preferable that, where the speaker's own voice, the voices of others, and noise are input to the voice input unit in a mixed state, the system has an utterance discrimination unit which, for the voice frames input to the utterance section detection unit, determines whether a frame is human voice or non-voice noise, based on a predetermined voice energy threshold, at least at one of two points: immediately after the start of the utterance or immediately before its end. The voice data determined to be noise that has been accumulated in the detected voice storage unit of the corresponding utterance section detection unit is then deleted. With this configuration, non-voice noise can be removed from the voice data so that only human voice is reliably captured. The voice energy threshold can also be changed to suit the conditions of the venue or terminal, which improves accuracy.

It is also preferable to have a voice length discrimination unit which, for the voice data that the similarity determination unit has determined to be from the same speaker, determines whether the voice lasts for a predetermined length of time, based on a predetermined threshold for voice duration. If it does, the voice energy determination unit compares the voice energy; if it does not, the accumulated voice data is deleted from the detected voice storage unit. With this configuration, meaningless vocalizations such as throat clearing and tongue clicking are excluded from the voice data, and only meaningful utterances, which are spoken deliberately and have a certain length, are captured as voice data. Loss of voice data can also be prevented.

It is further preferable to have a time shift correction unit which determines the time shift between the pieces of voice data accumulated in the detected voice storage units and uses that time shift to correct the voice data. With this configuration, the original voice can be reliably captured as voice data without any part being lost.

It is also preferable to have a display output unit which, when an identified speaker and that speaker's voice data are output through the detected voice storage unit of the utterance section detection unit, displays or outputs them as at least one of text data, translation data obtained by translating the text data, or audio. With this configuration, minutes or voice records are available immediately after the meeting or call recording ends.

It is further preferable that the microphones are worn by speakers gathered in the same place, by people on call center calls, or by people conversing over an intercom. With this configuration, the system can be used for a wide variety of purposes.
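The two-state utterance section detection described above (undetected and detecting, over fixed-interval voice frames) can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the source does not specify how the start and end of an utterance are detected, so the per-frame energy test, the `hangover` count of silent frames that ends a section, and the `min_frames` length check (standing in for the separate voice length discrimination) are hypothetical choices, as is the function name `detect_utterance_sections`.

```python
def detect_utterance_sections(frames, energy_threshold=0.01,
                              hangover=2, min_frames=3):
    """Return (start, end) frame index pairs, end exclusive.

    frames: list of fixed-length sample lists (one per voice frame).
    energy_threshold, hangover, min_frames: illustrative assumptions,
    not values taken from the source.
    """
    UNDETECTED, DETECTING = "undetected", "detecting"
    state = UNDETECTED          # the initial state is undetected
    sections = []
    start = 0
    silent_run = 0

    for i, frame in enumerate(frames):
        frame_energy = sum(v * v for v in frame) / len(frame) if frame else 0.0
        voiced = frame_energy >= energy_threshold

        if state == UNDETECTED:
            if voiced:
                # Utterance start detected: begin accumulating frames.
                state = DETECTING
                start = i
                silent_run = 0
        else:
            if voiced:
                silent_run = 0
            else:
                silent_run += 1
                if silent_run >= hangover:
                    # Utterance end detected: drop the trailing silence.
                    end = i - silent_run + 1
                    if end - start >= min_frames:
                        sections.append((start, end))   # output the section
                    # Shorter sections are discarded (e.g. a cough).
                    state = UNDETECTED

    if state == DETECTING:
        # Input ended while still detecting: close the open section.
        end = len(frames) - silent_run
        if end - start >= min_frames:
            sections.append((start, end))
    return sections


if __name__ == "__main__":
    loud, quiet = [0.5] * 4, [0.0] * 4
    demo = [quiet, quiet, loud, loud, loud, loud, quiet, quiet, quiet,
            loud, quiet, quiet, loud, loud, loud]
    print(detect_utterance_sections(demo))
```

In the demo, the isolated loud frame at index 9 forms a section of only one frame and is dropped, which mirrors how short vocalizations are excluded from the accumulated voice data.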

The utterance cutting and dividing method according to claim 13 of the present invention cuts out utterances based on voice data that is acquired and input with the voices of multiple speakers mixed together, using a voice input unit provided for each speaker into which the speaker's own voice and the voices of others are input in a mixed state. The method comprises: a first step of detecting, in the mixed voice data acquired at each voice input unit, each utterance section from the start to the end of the voice data, and accumulating the speaker's own voice data input from that speaker's voice input unit; a second step of referencing the voice data accumulated for each speaker from the respective voice input units in synchronization, calculating the similarity between the pieces of voice data, determining by comparison whether the similarity is high or low, regarding voice data with low similarity as voice data of different speakers, and regarding voice data with high similarity as voice data of the same speaker; and a third step of comparing the magnitude of the voice energy of the highly similar voice data regarded as coming from the same speaker, identifying the voice data determined to have relatively large voice energy as the speaker's own utterance input from that speaker's own voice input unit, and thereby separating the speaker's own utterances from those of others.

With the utterance cutting and dividing method according to claim 13 of the present invention, which has the first to third steps described above, the content of each speaker's utterances can be cut out quickly, accurately, and without overlap.

In the utterance cutting and dividing method of the present invention, it is preferable to identify the speaker and the utterance based on the voice input unit whose input was identified as the speaker's own utterance. With this configuration, an utterance can be cut out with both its content and its speaker identified. It is further preferable to display or output the identified speaker and that speaker's voice data as at least one of text data, translation data obtained by translating the text data, or audio. With this configuration, minutes or voice records are available immediately after the meeting or call recording ends.

The utterance cutting and dividing system according to claim 1 of the present invention, configured as described above, uses a simple configuration based only on similarity and volume. In conferences, call centers, intercom calls, and the like where multiple speakers are present in the same space, and in online conferences and similar settings, it can accurately identify and cut out each utterance without the voices of multiple speakers overlapping, so accurate minutes and call records can be obtained.

The utterance cutting and dividing system according to claim 3 of the present invention, having the utterance section detection units, detected voice storage units, similarity determination unit, and voice energy determination unit described above, compares both the similarity and the voice energy of the utterances of multiple speakers. It can thereby easily and reliably identify each speaker and the voice data that speaker uttered without overlap, so more precise minutes and call records can be obtained.

Furthermore, the speech separation method according to claim 13 of the present invention is a method of cutting out remarks from input voice data in which the voices of a plurality of speakers are mixed, using a voice input unit for each speaker into which the speaker's own voice and the voices of others are input in mixed form. The method comprises: a first step of detecting the mixed voice data acquired by each voice input unit in units of speech sections, from the start to the end of each piece of voice data, and accumulating the voice data input through each speaker's own voice input unit; a second step of referring, in synchronization, to the accumulated per-speaker voice data acquired from the voice input units, calculating the similarity of the per-speaker voice data, comparing the level of similarity, regarding voice data with low similarity as voice data of different speakers, and regarding voice data with high similarity as voice data of the same speaker; and a third step of comparing the magnitude of the voice energy of the high-similarity voice data regarded as belonging to the same speaker, identifying the voice data judged to have relatively large voice energy as the speaker's own remark input through that speaker's own voice input unit, and separating the speaker's own remarks from those of others. With this simple configuration, which uses only similarity and volume, each remark can be accurately identified and cut out without duplicating the voices of multiple speakers, both where multiple speakers share the same space, as in meetings, call centers, and intercom calls, and in online conferences, so that accurate conference records and call records can be obtained.

FIG. 1 is a schematic configuration diagram showing the concept of a speech separation system according to an embodiment of the present invention.
FIG. 2 is a system configuration diagram schematically showing the overall configuration of the speech separation system of FIG. 1.
FIG. 3(A) is an explanatory diagram showing an example in which a microphone is set for each speaker in the same space in the speech separation system of FIG. 1. FIG. 3(B) is an explanatory diagram showing an example in which a particular speaker (the self) talks with other, remote speakers through the microphone of the speaker's own terminal and the voices of the other speakers are input through that microphone.
FIG. 4 is a configuration diagram showing the configuration of the speech section detection unit in the speech separation system of FIG. 2.
FIG. 5 is a flowchart showing the operation of the speech start detection unit in the speech section detection unit of FIG. 4.
FIG. 6 is a flowchart showing the operation of the speech end detection unit in the speech section detection unit of FIG. 4.
FIG. 7 is a flowchart showing the operation of the similarity determination unit and the voice energy determination unit in the speech section detection unit of FIG. 4.
FIG. 8(A) is an explanatory diagram showing an image of the voice data accumulated in the detected voice storage unit of the speech section detection unit for each microphone. FIG. 8(B) is an explanatory diagram showing an image in which only similar voices have been extracted from that voice data using a cross-correlation function.
FIG. 9 is an explanatory diagram showing the case where the voices of the self and other speakers are input to the voice input units through a microphone for each speaker, and showing the output of the voice data accumulated in the detected voice storage unit for which the speaker's remark has been identified.
FIG. 10 is an explanatory diagram showing the case where the voices of the self's and other speaker groups are input to the voice input units through a microphone for each speaker group, and showing the output of the voice data accumulated in the detected voice storage unit for which the speaker group's remark has been identified.
FIG. 11(A) is an explanatory diagram showing the output of the accumulated voice data when, while a particular speaker (the self) is speaking, that speaker's voice is picked up by the microphones of several other speakers. FIG. 11(B) is an explanatory diagram showing the output of the accumulated voice data when the self and other speakers each make different remarks.
FIG. 12(A) is a flowchart explaining, step by step, the operation of the speech separation system of FIG. 2 when none of the speakers is speaking. FIG. 12(B) is a flowchart explaining, step by step, the operation of the speech separation system of FIG. 1 when one of the speakers speaks.
FIG. 13 is a flowchart explaining the operation of the speech separation system of FIG. 2 when another speaker speaks while a particular speaker is speaking.
FIG. 14 is a flowchart explaining the operation of the speech separation system of FIG. 2 when a particular speaker's remark is picked up through another speaker's microphone.

The present invention will now be described with reference to an embodiment shown in the drawings. As shown in FIG. 1 through FIG. 3(A), the speech separation system 2 according to an embodiment of the present invention includes microphones (voice input units) M1 to Mn, one for each speaker, into which the speaker's own voice and the voices of others in the same space are input in mixed form. The system detects the mixed voice data acquired by each of the microphones M1 to Mn in units of speech sections, from the start to the end of each piece of voice data, and accumulates the voice data input through each speaker's own microphone. It then refers to the accumulated per-speaker voice data from the microphones M1 to Mn in synchronization, calculates the similarity of the voice data, and compares the level of similarity: voice data with low similarity is regarded as voice data of different speakers, and voice data with high similarity is regarded as voice data of the same speaker. For voice data regarded as belonging to the same speaker, the system compares the magnitude of the voice energy, identifies the voice data judged to have relatively large voice energy as the speaker's own remark input through that speaker's own microphone Mx, and thereby separates the speaker's own remarks from those of others. In other words, it separates the remarks alone, without duplication and without identifying who spoke. In addition, the speech separation system 2 of this embodiment associates the voice data acquired and accumulated from the microphone Mx identified as carrying the speaker's own remark with the speaker of that microphone Mx.

As shown in FIG. 1, the speech separation system 2 of this embodiment not only separates and outputs the voices from the microphones M1 to Mn worn by the participants (speakers) A to N (N being any integer of 2 or more) in a meeting or the like, but also identifies each speaker and that speaker's remarks and outputs them separately. As shown in FIG. 2, one of the microphones M1 to Mn is worn by, or placed near, each participant (speaker) A to N in a meeting or the like held in the same space (see FIG. 3(A)). That is, taking speaker A as the self (with microphone M1 as the self's microphone), the microphones are arranged so that M1 is closer to speaker A than any of the other speakers' microphones M2 to Mn. Each of the microphones M1 to Mn is associated with the speaker A to N who wears it. The microphones M1 to Mn, thus associated with the speakers A to N, are connected to hardware (hard disk, information processing unit, storage unit), a computer, or a cloud computer 3 so that voice data can be input. This embodiment is described using, as an example, a computer (PC) 3 having an information processing unit (CPU), a storage unit, an input/output unit, and a display unit. The PC 3 stores software that performs the operations described below.

The speech separation system 2 of this embodiment comprises: speech section detection units 4 (4:M1, 4:M2, ..., 4:Mn), to which the voice data of the respective microphones M1 to Mn is input; detected voice storage units 5 (5:M1, 5:M2, ..., 5:Mn), one for each speech section detection unit 4, which store the voice data of the speech sections detected by that unit; a similarity determination unit 6; a voice energy determination unit 7; a voice length determination unit 8; a stored voice output unit 9; a speech determination unit (noise determination unit) 10; a speech start detection unit 11; a speech end detection unit 12; and a time lag correction unit 13.

As shown in FIG. 4, a speech section detection unit 4 is provided for each of the microphones M1 to Mn, and the units operate in synchronization with one another. Each unit detects the speech section, from the start of speech to the end of speech, of each of the mixed pieces of voice data acquired from its microphone. Specifically, the speech section detection unit 4 detects speech sections by means of the speech determination unit 10, the speech start detection unit 11, and the speech end detection unit 12 (see FIG. 2). The speech section detection unit 4 receives the voice data as voice frames divided at short, fixed intervals (in this embodiment, for example, 30 msec of voice data per frame). The speech determination unit 10 judges, immediately after the start of speech, immediately before the end of speech, or both, whether a voice frame input to the speech section detection unit 4 is a human voice or non-voice noise, by comparing its voice energy with a predetermined first threshold THR1 or second threshold THR2. Voice data judged by the speech determination unit 10 to be noise is deleted by the information processing unit. The speech start detection unit 11 treats the detection state as either "undetected" or "detecting", with "undetected" as the initial state; when the state is "undetected" and the start of speech is detected, it changes the state to "detecting". While the state is "detecting", the speech end detection unit 12 accumulates voice data in the detected voice storage unit 5; when it detects the end of speech, it outputs or deletes the voice data accumulated in the detected voice storage unit 5 and changes the state back to "undetected".

In more detail, as shown in FIG. 5, when voice data acquired from the single associated microphone Mx, in which the voices of other speakers are mixed, is input to the speech start detection unit 11 (step S1), the information processing unit calculates the voice energy of each voice frame (step S2; this embodiment uses, for example, the root mean square (RMS) of the signal). The calculated value is compared with the first voice energy threshold THR1 preset by the speech determination unit 10 (step S3). If the value is equal to or greater than THR1, the detection state S of the voice data is changed to "detecting" (step S4). If it is less than THR1, the frame is regarded as low-energy, hard-to-hear voice data picked up by microphone Mx, the detection state S is left as "undetected", and processing ends (step S5). That is, the unit refers to the detection state S, analyzes the input voice frame if the state is "undetected", and changes the state to "detecting" when the frame is judged to be a human voice. Although a voice frame is 30 msec of voice data in this embodiment, the frame length is not limited to this and can of course be changed to suit the situation, environment, or application. Likewise, although this embodiment calculates voice energy as the RMS, which imposes a light system load, other calculation methods may be used.
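The start-detection flow of FIG. 5 (steps S1 to S5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the threshold value `THR1 = 0.02`, the state strings, and the function names `rms` and `detect_start` are hypothetical, and samples are assumed to be floats in the range -1 to 1.

```python
import math

THR1 = 0.02  # hypothetical first energy threshold; the text gives no value


def rms(frame):
    """Root mean square of one frame of samples (step S2)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_start(frame, state, thr1=THR1):
    """Return the detection state after one frame (steps S3 to S5).

    The state only changes from "undetected" to "detecting", and only
    when the frame energy is at or above the first threshold.
    """
    if state == "undetected" and rms(frame) >= thr1:
        return "detecting"
    return state
```

At a sampling rate of 16 kHz (also an assumption), one 30 msec frame would hold 480 samples; `detect_start` would then be called once per frame for each microphone.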

Once the speech start detection unit 11 has detected a voice frame and accumulation of voice data has begun, that is, immediately after the start of speech, the speech determination unit 10 judges from the voice energy of the frame, on the basis of the predetermined first threshold THR1, whether the frame is a human voice or non-voice noise (see steps S2 to S5 in FIG. 5). Likewise, when the speech end detection unit 12 detects a voice frame, the speech determination unit 10 judges from the voice energy of that frame, on the basis of the predetermined second voice energy threshold THR2, whether the frame is a human voice or non-voice noise (see steps S15 to S16 in FIG. 6). That is, if the similarity result in step S14 is True, the voice energy is calculated and compared with the second threshold THR2 (step S16). If it is less than THR2, speech is regarded as having ended, and the voice data in the detected voice storage unit 5 is output to the voice length determination unit 8 (see step S17). If it is equal to or greater than THR2, no end processing is performed and the unit waits for the next voice frame. Depending on the venue and microphone conditions, the second threshold THR2 may be the same as or different from the first threshold THR1.

Thus, in the speaker and speech separation system 2 of the above embodiment, voice data is acquired from the microphones M1 to Mn worn by the speakers A to N, as shown in FIG. 9. When each piece of acquired voice data is input to a speech section detection unit 4, voice separated into speech sections is output. One speech section detection unit 4 is therefore required for each microphone used for recording. The speech section detection unit 4 is structured as shown in FIG. 4. Voice data acquired from the microphones M1 to Mn is input to the speech section detection units 4. The voice data arrives as voice frames divided at fixed intervals, and the speech section detection unit 4 performs its processing each time a frame is input. Each speech section detection unit 4 always holds a detection state S, which is either "undetected" or "detecting"; the initial state is "undetected". The detection state S is referred to by the unit itself and by the other speech section detection units 4 running at the same time. The voice analysis units 10, 11, 12, 6, 7, and 13 judge whether the input is speech. Their behavior depends on the detection state S: when the state is "undetected" they analyze the input to detect the start of speech, and when it is "detecting" they analyze it to detect the end of speech.

The speech start detection unit 11 and the speech end detection unit 12 operate as shown in FIG. 5 and FIG. 6, respectively. The speech start detection unit 11 calculates the voice energy of the input and compares it with a preset threshold (the first threshold THR1). If the energy is equal to or greater than THR1, the detection state is changed to "detecting" and processing moves on to the next voice frame. In addition to the energy comparison and state change performed in start detection, end detection handles input to and output from the detected voice storage unit 5 and calculates the similarity used to judge whether another speaker's remark is included. End detection first stores the voice frame in the detected voice storage unit 5. It then calculates the similarity. If the similarity result is not True, that is, if the voices are judged not to be the same, the detection state S is changed to "undetected" and the next input is processed. If the similarity result is True, the voice energy is evaluated. If the voice energy exceeds the second threshold THR2 of the speech determination unit 10, processing of the next voice frame proceeds; if it falls below THR2, speech is regarded as having ended, the voice data held in the detected voice storage unit 5 is output to the voice length determination unit 8, and the detection state is returned to "undetected".
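The end-detection sequence described above (store the frame, check the similarity result, then compare the energy with THR2) can be sketched as one per-frame step. This is an illustrative sketch only: the function name `end_detection_step`, the list-based buffer standing in for the detected voice storage unit 5, and the choice to discard the buffer when the similarity result is not True are assumptions, since the text states only that the state returns to "undetected" in that case.

```python
import math


def rms(frame):
    """Root mean square energy of one frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_detection_step(frame, buffer, similarity_ok, thr2):
    """Process one frame while the state is "detecting".

    Returns (new_state, utterance). `utterance` is the accumulated
    samples when speech is judged to have ended, otherwise None.
    """
    buffer.extend(frame)              # store the frame first
    if not similarity_ok:             # similarity result is not True
        buffer.clear()                # assumption: discard what was stored
        return "undetected", None
    if rms(frame) >= thr2:            # still speaking: wait for next frame
        return "detecting", None
    utterance = list(buffer)          # energy below THR2: speech has ended
    buffer.clear()
    return "undetected", utterance
```

The returned utterance would then be handed to the voice length check; `thr2` may equal the start threshold or differ from it, as the text notes.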

The similarity determination unit 6 checks whether the voice data held in the detected voice storage units 5 of the other speech section detection units 4 contains the same voice. Because this check must be made, in synchronization, once for each speech section detection unit 4, the unit is internally structured as a loop (see step S32). To check for the same voice data, the unit first refers to the detection state S of the other speech section detection unit 4 (see step S33). If the detection state S of the target unit is "undetected", True is output and the check moves on to the next speech section detection unit 4. If it is "detecting", the target's detected voice storage unit 5 may contain the same voice, so the similarity between the voice data held in the detected voice storage unit 5 of the unit's own speech section detection unit 4 (Mx) and that of the target speech section detection unit 4 [(M1 to Mn) - Mx] is calculated (see step S35; in this embodiment, for example, Pearson's product-moment correlation coefficient C is used as the similarity). If the similarity falls below the preset second threshold THR2, the voices are judged not to be the same (see step S36), True is output, and the check moves on to the next speech section detection unit 4. If it exceeds the second threshold THR2, the voices are the same, so the energy of the voice held in the detected voice storage unit 5 is calculated (see step S37), and True is output if the unit's own voice energy is the larger (see steps S38 and S39).
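The loop over the other detection units (steps S32 to S39) can be sketched as follows. This is a hedged illustration, not the patent's code: the names `pearson` and `is_independent`, the `(state, buffer)` tuples, truncating both buffers to a common length, and treating equal energies as "not larger" are all assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two sample sequences."""
    n = min(len(x), len(y))          # assumption: compare the common length
    if n < 2:
        return 0.0
    x, y = x[:n], y[:n]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rms(samples):
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_independent(own, others, sim_threshold):
    """Return True when `own` is not a weaker copy of another buffer.

    others: list of (state, buffer) pairs for the other detection units.
    """
    for state, buf in others:
        if state == "undetected":
            continue                  # nothing to compare: True for this unit
        if pearson(own, buf) < sim_threshold:
            continue                  # different voice: True for this unit
        if rms(own) <= rms(buf):
            return False              # same voice, other microphone is louder
    return True
```

With this rule, a voice picked up by several microphones leaves only the loudest copy marked as independent, which matches the aim of keeping a single "detecting" unit per voice.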

Executing the above processing for every other speech section detection unit 4 yields one check result per other unit. If all of the results are True, the unit's own voice is an independent remark, so True is output and the speech end detection unit 12 carries out the appropriate processing. As a result, for any one voice, only one speech section detection unit 4 is ever in the "detecting" state, which prevents duplication of the voice. For this to work, the speech section detection units 4 must operate synchronously. Specifically, the next voice frame is not input until every speech section detection unit has finished processing all the voice frames for a given time. Otherwise the detection states would drift apart in time and the same voice could no longer be detected.
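The synchronization requirement, that every unit finishes the frame for a given time before any unit receives the next one, can be illustrated with a simple lockstep loop. This is a sketch under assumptions: `run_synchronously`, the dictionary of per-microphone frame lists, and the `process_frame` callback are hypothetical names, and a real system would read frames from live audio rather than from lists.

```python
def run_synchronously(streams, process_frame):
    """Feed frame t of every microphone before any microphone sees frame t + 1.

    streams: dict mapping a microphone id to its list of frames.
    process_frame: callback(mic_id, frame_index, frame) standing in for
    one processing step of that microphone's detection unit.
    """
    if not streams:
        return
    n_frames = min(len(frames) for frames in streams.values())
    for t in range(n_frames):
        for mic_id in sorted(streams):    # fixed order within each time step
            process_frame(mic_id, t, streams[mic_id][t])
```

Because the inner loop completes for all microphones before `t` advances, the detection states consulted during the similarity check always refer to the same point in time.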

The similarity determination unit 6 refers to each speech section detection unit 4 and its detected voice storage unit 5 in synchronization. For the voice data input to each speech section detection unit 4 and the speech-section voice data held in the detected voice storage unit 5 of each microphone, it calculates the similarity and compares its level. Voice data with low similarity is regarded as voice data of different speakers and excluded from further determination (see FIG. 11(B)). Voice data with high similarity is regarded as the same voice data acquired through a plurality of microphones (for example, microphones M1 to M3); the unit determines which microphones hold the same voice data and identifies the speech section detection units 4 (4:M1, 4:M2, 4:M3) holding it (see FIG. 11(A)).

That is, the similarity determination unit 6 refers to the speech section detection units 4 (4:M1 to 4:Mn) and their detected voice storage units 5 (5:M1 to 5:Mn) in synchronization. For the speech-section voice data held in the detected voice storage units 5 of the speech section detection units 4 (4:M1 to 4:Mn) corresponding to the microphones M1 to Mn, it calculates the similarity between the voice data accumulated in the detected voice storage unit 5 of the speech section detection unit 4 (4:Mx) for a particular microphone Mx (the self's microphone M1) and the voice data accumulated in the detected voice storage units 5 of the speech section detection units 4 [(4:M1 to 4:Mn) - (4:Mx)] for each of the other microphones [(M1 to Mn) - Mx]. It then compares the level of similarity between the voice data from the particular (self) detected voice storage unit 5 and each of the other pieces of voice data. Voice data with low similarity is regarded as voice data of different speakers and excluded from further determination; voice data with high similarity is regarded as the same voice data acquired through a plurality of microphones, and the same voice data VidM1, VidM2, and VidM3 of those microphones (see FIG. 14) is determined and identified.

Put another way, for the speech-section voice data accumulated in the detected voice storage units 5 (for example, VidM1, VidM2, VidM3, ...), the similarity determination unit 6 refers to the detected voice storage units 5 (5:M1 to 5:Mn) and the speech section detection units 4 (4:M1 to 4:Mn) in synchronization, has the information processing unit calculate the similarity (step S13 in FIG. 6), and judges, for the particular (self) speech section detection unit 4 (4:M1), whether the similarity result is True (step S14). If the similarity result is not True, that is, if the voice data is judged to be the same, the voice data is output to the voice length determination unit 8 (step S17), which judges whether the voice data is longer than a preset length. If the similarity result is True, the voice data is output to the speech determination unit 10 (see step S15).

For the copies of the same voice VidM1, VidM2, and VidM3 identified by the similarity determination unit 6, the voice energy determination unit 7 calculates the voice energy of the stored voice data and compares the magnitudes. Copies with lower voice energy (for example, VidM2 and VidM3 among VidM1, VidM2, and VidM3) are regarded as the own speaker's voice picked up by the microphones M2 and M3 of the speakers B and C. The voice data judged to have the relatively highest voice energy (for example, VidM1) is treated as the data to be identified: the microphone M1 that acquired it is identified, the voice data acquired from the microphone M1 and stored is associated with the speaker A of the microphone M1, and the data is output to the outside by the stored voice output unit 9. Whether a voice energy is the relatively highest is determined by comparing the calculated values. Although the speech separation system 2 of this embodiment links each speaker to his or her speech (voice data) so that both are identified, the invention is not limited to this; it may of course cut out only the speech of different speakers without identifying the speakers.

The voice length determination unit 8 determines whether the voice data has a preset length (in this embodiment, less than 1 second, preferably 0.5 seconds). This removes meaningless vocalizations such as throat clearing and tongue clicking from the voice data, so that only meaningful speech, which is uttered deliberately and lasts for some length of time, is captured as voice data.

The voice length determination unit 8 operates when voice data below the second threshold THR2 is input from the speech determination unit 10 in step S16, or when the similarity result in step S14 is not True, that is, when the voice data is judged to be the same voice. In either case it determines whether the voice data has the set length. If it does, the stored voice data is output from the detected voice storage unit 5 as the own (specific) voice data (step S18), the voice data is then deleted (step S19), the detection state S is changed to "undetected" (step S20), and the next input is processed. If it does not, the stored voice data is deleted from the own (specific) detected voice storage unit 5 (step S19) and the detection state S is changed to "undetected" (step S20).
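The length check in steps S18 to S20 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `finish_segment`, the sample-rate argument, and the choice of "at least 0.5 seconds" as the pass condition are assumptions made for the example.

```python
MIN_SPEECH_SEC = 0.5  # the embodiment states "less than 1 second, preferably 0.5 seconds"


def finish_segment(samples, sample_rate, min_sec=MIN_SPEECH_SEC):
    """Close a speech section (hypothetical helper).

    Returns (output, state). `output` is a copy of the samples when the
    segment is long enough (step S18) and None when it is discarded as a
    cough or tongue click. In both cases the buffer is considered deleted
    (step S19) and the state returns to "undetected" (step S20).
    """
    duration_sec = len(samples) / sample_rate
    output = list(samples) if duration_sec >= min_sec else None
    return output, "undetected"


if __name__ == "__main__":
    rate = 16000  # assumed sampling rate, not given in the source
    short_out, _ = finish_segment([0.1] * (rate // 5), rate)  # 0.2 s
    long_out, _ = finish_segment([0.1] * rate, rate)          # 1.0 s
    print(short_out is None, long_out is not None)
```

Returning the new state together with the output keeps the "output or discard, then reset" ordering of the flowchart visible in one place.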

The similarity determination unit 6 refers to each detected voice storage unit 5 (step S31) and loops over the speech section detection units 4 in operation (for example, three iterations for VidM1, VidM2, and VidM3) (step S32). It acquires the detection state S (S-1, S-2, S-3) from each operating speech section detection unit Dn (Dn = 5:M1 to 5:Mn) (step S33) and determines whether the detection state S is "detecting" (step S34). If so, it calculates the voice similarity (in this embodiment, the product-moment correlation coefficient C) from the voice data stored in the own (specific) detected voice storage unit 5 (5:M1) and in the other detected voice storage units 5 (5:M2, 5:M3) (step S35), and determines whether this similarity is larger or smaller than a preset third threshold THR3 (step S36).
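As an illustration of steps S35 and S36, the product-moment (Pearson) correlation coefficient between two stored buffers can be computed and compared with THR3 as below. The value 0.8 for THR3 and the function names are assumptions; the source does not state a threshold value or how buffers of different length are handled.

```python
import math

THR3 = 0.8  # hypothetical value; the source only says "preset third threshold"


def pearson(x, y):
    """Product-moment correlation coefficient C of two equal-length buffers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("buffers must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant buffer: treated as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def is_same_voice(own, other, thr3=THR3):
    """True when the similarity exceeds THR3 (step S36)."""
    return pearson(own, other) > thr3


if __name__ == "__main__":
    own = [0, 1, 0, -1, 0, 1, 0, -1]
    leaked = [0.3 * v for v in own]          # same voice, attenuated
    unrelated = [1, 0, -1, 0, 1, 0, -1, 0]   # a different waveform
    print(is_same_voice(own, leaked), is_same_voice(own, unrelated))
```

Because the coefficient is normalized, an attenuated copy of the same waveform still scores close to 1, which is why the energy comparison is needed as a separate step.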

As shown in FIG. 7, when the similarity of the voice data VidM1, VidM2, and VidM3 is larger than the third threshold THR3, the voice energy determination unit 7 calculates the voice energies E_s (E_s: VidM1) and E_Dn (E_d: VidM2, E_d: VidM3) of the voice data stored in the own (specific) detected voice storage unit 5 (5:M1) and in the other detected voice storage units 5 (5:M2, 5:M3) (step S37). It then compares the magnitudes of these voice energies (step S38) and outputs True for the voice energy with the largest calculated value (step S39). When E_s (E_s: VidM1) is the largest and is output as True in step S39, speaker A (microphone M1) is identified, and the voice data VidM1 of that speech section is associated with speaker A and output. Likewise, if E_Dn (E_d: VidM2) is the largest in step S39, speaker B and the voice data of that speech section are identified; if E_Dn (E_d: VidM3) is the largest, speaker C and the voice data of that speech section are identified. In this way, for any one voice, exactly one speech section detection unit 4 is in the "detecting" state at a time, so the speaker and the voice data of the speech section can be identified without duplicating the voice.
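Steps S37 to S39 amount to picking the microphone whose copy of the same voice has the largest energy. The sketch below assumes that "voice energy" means the mean squared amplitude, which the source does not define, and uses the hypothetical function names `energy` and `pick_owner`.

```python
def energy(samples):
    """Mean squared amplitude (an assumed definition of voice energy)."""
    return sum(s * s for s in samples) / len(samples)


def pick_owner(buffers):
    """Given copies of the same voice keyed by microphone, return the
    microphone with the largest energy and a True/False flag per microphone
    (True only for the owner, as in step S39)."""
    energies = {mic: energy(buf) for mic, buf in buffers.items()}
    owner = max(energies, key=energies.get)
    return owner, {mic: mic == owner for mic in energies}


if __name__ == "__main__":
    voice = [0, 1, 0, -1] * 4
    copies = {
        "M1": voice,                     # worn by the actual speaker
        "M2": [0.4 * v for v in voice],  # leaked into a neighbor's microphone
        "M3": [0.2 * v for v in voice],
    }
    owner, flags = pick_owner(copies)
    print(owner, flags)
```

Only one flag is ever True, which mirrors the statement that a single detection unit remains in the "detecting" state for any one voice.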

If the detection state S is "undetected" in step S34, the corresponding detected voice storage unit 5 gives priority to its own voice data and continues to accumulate it. Likewise, if in step S36 the similarity between the voice data stored in the own detected voice storage unit 5 and that stored in another detected voice storage unit 5 in the "detecting" state is smaller than the third threshold THR3, the corresponding detected voice storage unit 5 continues to accumulate voice data.

The speech separation system 2 according to this embodiment also has a time lag correction unit 13. When the speech determination unit 10 calculates voice energy, based on the first voice energy threshold THR1, for the voice data stored in the detected voice storage units 5, the time lag correction unit 13 obtains the time lag between the voice data and uses it to correct the voice data. Specifically, the time lag correction unit 13 extracts only the similar voice and, when the voice energy of the utterance is calculated, uses a cross-correlation function to find the time by which the voices are offset: the lag at which the cross-correlation function takes its maximum value is the offset between the voices. The original voice can then be cut out using that time. FIGS. 8(A) and 8(B) are explanatory diagrams showing, respectively, an image of the voice data stored in the detected voice storage unit 5 of the speech section detection unit 4 for each microphone, and an image in which only the similar voice has been extracted from that voice data by the cross-correlation function.

The time lag arises as follows. In practical use of speaker identification, speakers may talk one after another with no pause between their utterances. As an example, suppose that speaker B speaks immediately after speaker A, and that the utterances are detected by the own speech section detection unit 4:A and the other speech section detection unit 4:B. Suppose also that speaker B's utterance is picked up by the speech section detection unit 4:A as well, with energy exceeding the threshold (first threshold THR1). First, speaker A's utterance is stored in the detected voice storage unit 5:A of the speech section detection unit 4:A. When speaker B then begins to speak, the voice energy exceeds the threshold (first threshold THR1) in both the own (A) and other (B) speech section detection units 4:A and 4:B, so the similarity is calculated. The voice data used for this calculation is the voice stored in the detected voice storage units 5:A and 5:B of the respective speech section detection units 4:A and 4:B, so the data for the speech section detection unit 4:A contains both speaker A's utterance and speaker B's utterance.

FIG. 8(A) shows an image of the voice data in the detected voice storage units 5:A and 5:B of the speech section detection units 4:A and 4:B. The light portion is the voice data of the speech section detection unit 4:A for speaker A's microphone M1, and the dark portion is the voice data of the speech section detection unit 4:B for speaker B's microphone M2. In the similarity calculation, speaker B's utterance is contained in both, so the two are judged to be similar; the problem arises when the energy is calculated. The energy calculation also uses the voice in the detected voice storage units 5:A and 5:B, so if the voice energy of speaker A's utterance was large, speaker A's side may be judged to have the larger voice energy under that influence. As a result, although the similarity determination works correctly, speaker A's side, which has the larger voice energy, takes priority, and speaker B's utterance may be attributed to speaker A. This problem arises because the energy is calculated over the entire voice data stored in the detected voice storage units 5:A and 5:B. To solve it, only the similar voice is extracted, the energy is calculated by the time lag correction unit 13, and the original voice is cut out using the time offset (see FIG. 8(B)).
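The effect of the time lag correction can be reproduced with toy numbers. The sketch below is an assumption-laden illustration: it uses a brute-force, unnormalized cross-correlation over integer sample lags, mean squared amplitude as the energy, and the hypothetical names `best_lag` and `aligned_energies`. Buffer A holds speaker A's loud utterance followed by a quiet leak of speaker B; buffer B holds only speaker B.

```python
def energy(samples):
    """Mean squared amplitude (assumed definition of voice energy)."""
    return sum(v * v for v in samples) / len(samples)


def best_lag(a, b, max_lag):
    """Lag k (in samples) that maximizes sum(a[n + k] * b[n])."""
    def xcorr(k):
        return sum(a[n + k] * b[n] for n in range(len(b)) if 0 <= n + k < len(a))
    return max(range(-max_lag, max_lag + 1), key=xcorr)


def aligned_energies(a, b, max_lag):
    """Cut both buffers to the overlapping (similar) part found by the
    cross-correlation peak and return (lag, energy_a, energy_b)."""
    k = best_lag(a, b, max_lag)
    a_start, b_start = max(k, 0), max(-k, 0)
    length = min(len(a) - a_start, len(b) - b_start)
    seg_a = a[a_start:a_start + length]
    seg_b = b[b_start:b_start + length]
    return k, energy(seg_a), energy(seg_b)


if __name__ == "__main__":
    speech_b = [2, -1, -2, 1, 2, 1, -2, -1]
    buf_a = [3] * 6 + [0.5 * v for v in speech_b]  # A's own speech, then B leaking in
    buf_b = speech_b
    print("whole buffers:", energy(buf_a), energy(buf_b))
    print("after alignment:", aligned_energies(buf_a, buf_b, max_lag=8))
```

Over the whole buffers, A's side has the larger energy and would wrongly win; once both buffers are cut to the part located by the cross-correlation peak, B's side is larger, which is the behavior the correction is meant to secure. A production system would more likely normalize the correlation and compute it with an FFT.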

In the speech separation system 2 according to this embodiment, the speech determination unit 10 not only determines whether a sound is a human voice but also has a noise discrimination function. The speech determination unit 10 determines whether the voice data input to the speech section detection unit 4 is a human voice or non-voice noise; when it is determined to be non-voice noise, the result of step S3 or step S13 is set to False regardless of the voice energy.

The speech separation system 2 according to this embodiment also has a display output unit 20 (see FIG. 1). When the identified speaker A and the voice data VM1 are received from the detected voice storage unit 5 of the speech section detection unit 4, by way of the speech end detection unit 12 and through the stored voice output unit 9, the display output unit 20 displays on a screen, or otherwise outputs, at least one of character data such as text, translation data obtained by translating the character data, and audio. The display output unit 20 consists of a terminal, a display device, or the like.

Next, the speech separation method according to the present invention will be described based on the operation of the speech separation system 2 of the above embodiment. As shown in FIG. 2, the PC 3 of the speech separation system 2 includes an information processing unit (CPU), a storage unit, an input/output unit, and a display unit, together with the speech section detection unit 4, the detected voice storage unit 5, the similarity determination unit 6, the voice energy determination unit 7, the stored voice output unit 9, the speech start detection unit 11, the speech end detection unit 12, and the time lag correction unit 13. A display output unit 20 consisting of a terminal, a display device, or the like is connected to the PC 3.

In the first step (S101), for each of the microphones M1 to Mn, the speech section detection unit 4 detects the mixed voice data acquired from that microphone for each speech section, from the start to the end of each piece of voice data, and stores the voice data V:M1 to V:Mn in the detected voice storage unit 5 corresponding to each speech section detection unit 4. In the second step (S102), the voice data V:M1 to V:Mn acquired from the microphones M1 to Mn are referred to in synchronization, and the similarity determination unit 6 calculates their similarities and determines whether each is high or low. Voice data with low similarity is regarded as voice data of a different speaker and excluded from further determination; voice data with high similarity (for example, VidM1, VidM2, VidM3) is regarded as the same voice. In the third step (S103), for the voice data regarded as the same (for example, VidM1, VidM2, VidM3), the voice energy determination unit 7 compares the voice energies, identifies the microphone Mx (M1) whose voice data has the relatively largest voice energy (for example, VidM1 > VidM2 > VidM3), associates the voice data acquired from that microphone and stored (for example, VidM1) with the speaker A of the microphone Mx (M1), and outputs it to the outside through the stored voice output unit 9. The display output unit 20 can display on a screen, or output, the received speaker A and the voice data VM1 of that speaker's speech as character data such as text, translation data obtained by translating the character data, or an audio file.

Next, a series of operations will be described assuming two speakers A and B. The speakers A and B wear microphones M1 and M2, respectively (see FIG. 3), and speech sections are extracted for the microphones M1 and M2 using the speech section detection units 4:M1 and 4:M2.
First, the case where neither speaker A nor speaker B is speaking will be described.
As shown in FIG. 12(A) (condition a), the speech start detection units 11:M1 and 11:M2 of the speech section detection units 4:M1 and 4:M2 calculate the voice energies E_1 and E_2 of the input voice frames (voice data). Because neither exceeds the first threshold THR1, the detection state S always remains "undetected", and this processing is repeated.

Next, the case where only speaker A is speaking will be described, as shown in FIG. 12(B) (condition b).
In the speech start detection unit 11:M1 of speaker A's speech section detection unit 4:M1, the detection state S is changed to "detecting", and the speech end detection unit 12:M1 operates while speaker A is speaking. The similarity calculation in the speech end detection process refers to the detection state S of speaker B's speech section detection unit 4:M2, but because speaker B is not speaking this state is always "undetected", so the similarity calculation result for speaker A in the similarity determination unit 6 is True. Voice frames therefore continue to be stored in the detected voice storage unit 5:M1 by the speech end detection unit 12:M1 of speaker A's speech section detection unit 4:M1. When the voice energy E_1 falls below the first threshold THR1, the speech is judged to have ended, the speech section is fixed, and the voice in the detected voice storage unit 5 is output.
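Conditions a and b describe a two-state detector driven by the frame energy and THR1. The sketch below is a simplified, single-microphone version: it omits the similarity check, the second threshold THR2, and the length check, and the THR1 value, the energy definition, and the name `segment_frames` are assumptions for illustration.

```python
THR1 = 0.01  # hypothetical value for the first threshold


def frame_energy(frame):
    """Mean squared amplitude of one voice frame (assumed definition)."""
    return sum(s * s for s in frame) / len(frame)


def segment_frames(frames, thr1=THR1):
    """Return (completed_sections, final_state).

    A section starts when the frame energy exceeds THR1 ("undetected" to
    "detecting") and is fixed when the energy falls below THR1 again.
    """
    state = "undetected"
    buffer = []      # plays the role of the detected voice storage unit 5
    sections = []
    for frame in frames:
        e = frame_energy(frame)
        if state == "undetected":
            if e > thr1:
                state = "detecting"
                buffer = [frame]
        elif e < thr1:
            sections.append(buffer)  # speech ended: output the stored frames
            buffer = []
            state = "undetected"
        else:
            buffer.append(frame)
    return sections, state


if __name__ == "__main__":
    silent, loud = [0.0] * 4, [0.5] * 4
    sections, state = segment_frames([silent, loud, loud, silent, silent, loud])
    print(len(sections), len(sections[0]), state)
```

With only silent frames the state never leaves "undetected", which corresponds to condition a.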

Next, the case where speaker B speaks while speaker A is speaking will be described, as shown in FIG. 13 (condition c).
Until speaker B speaks, the processing is the same as in condition b shown in FIG. 12(B). When speaker B begins to speak while speaker A is speaking, the similarity calculation in speaker A's speech section detection unit 4:M1 refers to the detection state S of speaker B's speech section detection unit 4:M2 and finds it to be "detecting". The product-moment correlation coefficient (similarity) C is then calculated for the voice data stored in the detected voice storage units 5:M1 and 5:M2 of the speech section detection units 4:M1 and 4:M2, and it is determined whether this value exceeds the third threshold THR3. In this condition speaker B is saying something different from speaker A, so the voices are not the same. The similarity determination unit 6 therefore outputs True, and each of the speech section detection units 4:M1 and 4:M2 proceeds as in condition b. As long as speaker B continues to speak, voice data continues to accumulate in speaker B's detected voice storage unit 5:M2.

Next, the case where speaker A's speech is also picked up by speaker B's microphone M2 will be described, as shown in FIG. 14 (condition d).
The processing up to the similarity calculation is the same as in condition c. However, the voice data input to speaker A's detected voice storage unit 5:M1 was acquired from the microphone M1 worn by speaker A, so its voice energy should be larger than that of the others. The similarity result in speaker A's speech section detection unit 4:M1 is therefore True, and detection continues. In speaker B's speech section detection unit 4:M2 the voice energy is smaller, so the similarity result is not True; the speech end detection unit 12:M2 changes the detection state to "undetected", and the voice in the detected voice storage unit 5:M2 is not output.

In short, when the speech of speakers A and B is processed by their respective speech section detection units 4:M1 and 4:M2, the detection state S of the speech section detection unit 4:M1 becomes "detecting" while speaker A is speaking, and if speaker B speaks at that point, the similarity determination unit 6 calculates the similarity. If the utterances of speakers A and B are independent, their similarity is low, so the detection state S of speaker B's speech section detection unit 4:M2 becomes "detecting" and speaker B's speech is detected. If speaker B's microphone M2 has merely picked up speaker A's speech, the similarity is high, so speaker A's voice, which has the larger voice energy, takes priority and no speech is detected for speaker B.
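Conditions b, c, and d can be combined into one decision per detection unit: keep detecting unless another active unit holds the same voice with more energy. The following is a minimal sketch under stated assumptions (THR3 = 0.8, mean squared amplitude as energy, buffers truncated to a common length, and the hypothetical name `keeps_detecting`); it is not the flowchart of FIG. 7 verbatim.

```python
import math

THR3 = 0.8  # hypothetical similarity threshold


def pearson(x, y):
    """Product-moment correlation coefficient; 0.0 for degenerate input."""
    n = min(len(x), len(y))
    if n < 2:
        return 0.0
    x, y = x[:n], y[:n]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def energy(samples):
    """Mean squared amplitude (assumed definition of voice energy)."""
    return sum(s * s for s in samples) / len(samples)


def keeps_detecting(own_buf, others, thr3=THR3):
    """others: list of (state, buffer) pairs for the other detection units.

    Returns the True/False similarity result for the own unit.
    """
    for state, buf in others:
        if state != "detecting":
            continue              # condition b: the other speaker is silent
        if pearson(own_buf, buf) <= thr3:
            continue              # condition c: independent utterances
        if energy(buf) > energy(own_buf):
            return False          # condition d: own buffer is only a leak
    return True


if __name__ == "__main__":
    voice_a = [0, 1, 0, -1] * 4
    leak_of_a = [0.3 * v for v in voice_a]
    voice_b = [1, 0, -1, 0] * 4
    print(keeps_detecting(voice_a, [("undetected", leak_of_a)]))
    print(keeps_detecting(voice_a, [("detecting", voice_b)]))
    print(keeps_detecting(leak_of_a, [("detecting", voice_a)]))
```

Evaluating the same function from each unit's point of view is what leaves exactly one unit "detecting" for a given voice.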

Real-time processing is also possible in this embodiment. The voice data is input to the speech section detection unit 4 as voice frames, each a short segment of fixed duration. Suppose two speakers A and B wear microphones M1 and M2, and the speech section detection units 4:M1 and 4:M2 corresponding to the microphones receive voice data X_M1[n] and X_M2[n] as voice frames every 30 msec. Speaker A's speech section detection unit 4:M1 then receives 30 msec of voice data at a time, X_M1[0], X_M1[1], ..., while speaker B's speech section detection unit 4:M2 receives the voice data for the same times, X_M2[0], X_M2[1], .... If the speech section detection unit 4 that runs first is speaker A's unit 4:M1, it processes the input X_M1[0]. It must not begin processing the next voice data X_M1[1] immediately afterwards; unless it waits for speaker B's speech section detection unit 4:M2 to finish processing X_M2[0], a time lag will arise between the speech section detection units 4, so the units must be synchronized. In this example, real-time processing is possible as long as the processing by the speech section detection units 4:M1 and 4:M2 of the voice data for time n completes within 30 msec.
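The lock-step requirement can be pictured as a loop that finishes frame n for every microphone before touching frame n + 1. The sketch below is illustrative only: `run_lockstep`, `is_real_time`, the callback signature, and the assumption that the units run one after another (so their per-frame times add up) are not taken from the source.

```python
FRAME_MS = 30  # frame period used in the embodiment


def run_lockstep(streams, process_frame):
    """Process frame n of every microphone before any frame n + 1.

    streams: dict mapping a microphone name to its list of frames.
    process_frame: callback taking (mic, n, frame).
    Returns the order in which (mic, n) pairs were processed.
    """
    order = []
    n_frames = min(len(frames) for frames in streams.values())
    for n in range(n_frames):
        for mic in sorted(streams):
            process_frame(mic, n, streams[mic][n])
            order.append((mic, n))
    return order


def is_real_time(per_frame_ms, frame_ms=FRAME_MS):
    """True when handling one frame for all microphones fits in one frame
    period, assuming the units are processed sequentially."""
    return sum(per_frame_ms.values()) <= frame_ms


if __name__ == "__main__":
    streams = {"M1": ["X_M1[0]", "X_M1[1]"], "M2": ["X_M2[0]", "X_M2[1]"]}
    print(run_lockstep(streams, lambda mic, n, frame: None))
    print(is_real_time({"M1": 10, "M2": 12}), is_real_time({"M1": 20, "M2": 15}))
```

If the units ran in parallel threads instead, the budget check would use the slowest unit rather than the sum, and a barrier per frame would replace the inner loop.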

Thus, the speech separation system 2 and its method according to this embodiment have a simple configuration that uses only similarity and volume. In settings where a plurality of speakers share the same space, such as meetings, call centers, and intercom calls, or in settings such as online meetings where participants talk while hearing the other speakers through the loudspeaker of their own terminal, the voices of the speakers can be cut out in real time without duplication, with each speaker and his or her speech accurately identified. If speakers are not associated with their speech, the speech of different speakers can still be cut out without duplication.

In the speech separation system 2 of the above embodiment, as shown in FIG. 3(A), the speakers A to N in the same space each wear one of the microphones M1 to Mn, and the voice data input from these microphones, in which the wearer's own voice is mixed with the voices of others, is cut out without duplication. The invention is not limited to this. As shown in FIG. 3(B), when a specific (own) speaker A-R talks through the microphone of his or her own terminal M-1 with other speakers B-R, C-R, ..., N-R at remote locations using terminals M-2 to M-n, and the voices of the other speakers B-R to N-R are input through the microphone of the terminal M-1, the voice data can likewise be cut out speaker by speaker without duplication. It is also possible to cut out the speech as that of different speakers without identifying the speakers. The terminals M-1 to M-n serving as voice input units include notebook PCs, desktop PCs, and smartphones.

In the speech separation system 2 of the above embodiment, as shown in FIGS. 3(A) and 3(B), one's own voice and the voices of others are input either through the individual voice input units worn by and associated with the speakers A to N in the same space, namely the microphones M1 to Mn, or through the microphones of the terminals M-1 to M-n associated with the remote speakers A-R to N-R. The invention is not limited to this. As shown in FIG. 10, the speech of a plurality of speakers (a1, a2, a3), (b1, b2, b3), ..., (n1, n2, n4) may be input to each individual voice input unit (microphone or terminal) G·M1 to G·Mn. That is, the voice input units G·M1 to G·Mn correspond to the speaker groups G·A, G·B, ..., G·N, each consisting of a plurality of speakers. With this configuration, voice data can be obtained per speaker group rather than per individual speaker. When speech is cut out for each of the speaker groups G·A, G·B, ..., G·N, the own and other detected voice storage units 5 are synchronized and the similarity is determined to be high or low; low similarity is regarded as speech of a different speaker group and high similarity as speech of the same speaker group. The relative magnitudes of the voice energies are then compared to identify the speaker group, the stored voice data is output from the detected voice storage unit 5, and the speech is cut out as that of the speaker group. Individual speakers within a speaker group are not identified.

In the above embodiment, the voice frames, each a short segment of fixed duration, are for example 30 msec of voice data, but the invention is not limited to this; the frame length may of course be changed as appropriate according to the application, the conditions of the conference venue, the performance of the microphones, and so on. Likewise, the predetermined length preset in the voice length determination unit 8 is less than 1 second, preferably 0.5 seconds, in this embodiment, but it may of course be set according to the conditions.

2 Speech separation system
4 Speech section detection unit
5 Detected voice storage unit
6 Similarity determination unit
7 Voice energy determination unit
A to N Speakers
M1 to Mn Microphones

Claims

A speech separation system that cuts out remarks based on voice data acquired and input with the voices of a plurality of speakers mixed,
the system comprising a voice input unit for each speaker, to which the speaker's own voice and the voices of others are input in a mixed state,
wherein the plurality of mixed voice data acquired at each voice input unit are detected for each speech section from the start to the end of each voice data, and the own voice data input from the own voice input unit is accumulated; the accumulated voice data for each speaker acquired from each voice input unit is referred to in synchronization, the similarity of the acquired voice data for each speaker is calculated, and the level of the similarity is compared and determined; voice data with low similarity is regarded as voice data of different speakers, and voice data with high similarity is regarded as voice data of the same speaker; for the voice data with high similarity regarded as being of the same speaker, the magnitude of the voice energy is compared and determined; and the voice data determined to have relatively large voice energy is identified as the own remark input from the own voice input unit, and the own remark is cut out.

The speech separation system according to claim 1, wherein the speaker and the remark are identified based on the own remark and the identified voice input unit.

A speech separation system that cuts out remarks based on voice data acquired and input with the voices of a plurality of speakers mixed, the system comprising:
a voice input unit for each speaker, to which the speaker's own voice and the voices of others are input in a mixed state;
a speech section detection unit that is provided for each voice input unit and detects, from the plurality of mixed voice data acquired from the own voice input unit, the speech section from the start to the end of speech of the own voice data;
a detected voice storage unit that is provided for each speech section detection unit and stores the detected voice data of the own speech section;
a similarity determination unit that refers to each speech section detection unit and its detected voice storage unit in synchronization, calculates the similarity of the voice data stored in the detected voice storage unit of each speech section detection unit, compares and determines the level of the similarity, and regards voice data with low similarity as voice data of different speakers and voice data with high similarity as voice data of the same speaker acquired from a plurality of voice input units; and
a voice energy determination unit that, for the voice data of the same speaker determined by the similarity determination unit, calculates the voice energy of each voice data, compares and determines the magnitude of the voice energy, and identifies the speech section detection unit that acquired the voice data determined to have relatively large voice energy,
wherein the speaker and the speaker's remarks are cut out based on the voice data stored in the identified speech section detection unit and its detected voice storage unit.

The speech separation system according to claim 3, wherein the voices of oneself and others are input to the voice input unit through the microphone of each speaker, or through the microphone of each speaker's own terminal.

The speech separation system according to claim 3 or 4, wherein the voices of oneself and others are input to the voice input unit in real time through a microphone, or are input through a recording unit in which voice data already acquired and input has been recorded.

The speech separation system according to any one of claims 3 to 5, wherein the voice of a speaker group composed of a plurality of speakers is input to each voice input unit, and the speaker group and the remarks of the speaker group are cut out.

The speech separation system according to any one of claims 3 to 6, wherein the input voice data is input to the speech section detection unit as voice frames divided at regular intervals, and the speech section detection unit comprises:
a speech start detection unit that holds, for the voice frames, a detection state of either undetected or detecting, sets the initial state to undetected, and changes the detection state to detecting when the detection state is undetected and the start of speech is detected; and
a detection unit that accumulates voice data in the detected voice storage unit while the detection state is detecting and, when the end of speech is detected, outputs or deletes the voice data stored in the detected voice storage unit and changes the detection state to undetected.
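
As a non-limiting illustration of the detection-state handling recited above, the following Python sketch tracks the two states and accumulates frames between the start and the end of speech. The claim does not specify how the start or end of speech is detected, so the energy threshold, the count of trailing quiet frames, and all names are assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one voice frame."""
    return sum(s * s for s in frame) / len(frame)

def detect_sections(frames, threshold=0.01, end_quiet_frames=3):
    """Return a list of speech sections, each a list of accumulated frames.

    Start of speech: a frame at or above the energy threshold (assumption).
    End of speech: end_quiet_frames consecutive frames below it (assumption).
    """
    state = "undetected"          # initial detection state
    stored = []                   # stands in for the detected voice storage unit
    quiet = 0
    sections = []
    for frame in frames:
        voiced = frame_energy(frame) >= threshold
        if state == "undetected":
            if voiced:
                state = "detecting"
                stored = [frame]
                quiet = 0
        else:
            stored.append(frame)
            quiet = 0 if voiced else quiet + 1
            if quiet >= end_quiet_frames:
                # End of speech: output the stored frames without the quiet tail.
                sections.append(stored[:len(stored) - quiet])
                stored = []
                state = "undetected"
    if state == "detecting":
        # The stream ended mid-speech: output whatever was accumulated.
        sections.append(stored[:len(stored) - quiet])
    return sections
```

A caller that decides a section is noise or too short would simply discard it instead of keeping it, which corresponds to the "deletes" branch of the claim.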

The speech separation system according to claim 7, wherein noise is mixed with the voices of oneself and others at the voice input unit,
the system has, for the voice frames input to the speech section detection unit,
a voice determination unit that determines whether each voice frame is human voice or non-voice noise, based on a threshold of voice energy obtained in advance, with respect to the magnitude of the voice energy at least at one of immediately after the start of speech and immediately before the end of speech, and
the voice data determined to be noise, which is accumulated in the detected voice storage unit of the speech section detection unit identified based on the voice data determined to be non-voice noise, is deleted.

The speech separation system according to any one of claims 3 to 8, further comprising a voice length determination unit that determines, for the voice data of the same speaker determined by the similarity determination unit, whether or not the voice duration has a predetermined length, based on a predetermined voice duration threshold,
wherein, when the voice data has the predetermined length, the voice energy determination unit compares and determines the magnitude of the voice energy, and when it does not, the stored voice data is deleted from the detected voice storage unit.

The speech separation system according to any one of claims 3 to 9, further comprising a time lag correction unit that obtains the time lag between the voice data stored in the detected voice storage units and corrects the time lag of the voice data using the obtained time lag.
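
As a non-limiting illustration of obtaining and correcting a time lag, the following Python sketch searches for the shift that maximizes the cross-correlation between two stored recordings and then realigns one of them. The claim does not state how the lag is obtained, so the brute-force cross-correlation search, the zero padding, and the function names are assumptions.

```python
def estimate_lag(ref, other, max_lag):
    """Return the lag in samples by which `other` trails `ref` (negative if it leads)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def correct_lag(other, lag):
    """Shift `other` by the obtained lag so that it lines up with the reference.

    The length is preserved by padding with zeros on the side that is vacated.
    """
    if lag >= 0:
        return other[lag:] + [0.0] * lag
    return [0.0] * (-lag) + other[:lag]
```

Aligning the recordings before the similarity is calculated keeps the same remark, picked up by microphones at different distances, from appearing dissimilar merely because of a propagation or transmission delay.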

The speech separation system according to any one of claims 3 to 10, further comprising a display output unit that, when the identified speaker and its voice data are output through the detected voice storage unit of the speech section detection unit, displays or outputs the identified speaker and its voice data as at least one of character data, translated data obtained by translating the character data, and voice.

The speech separation system according to any one of claims 4 to 11, wherein the voice input unit is composed of a microphone connected to a terminal, and the microphone is worn by any of a speaker gathered with others at the same place, a call center speaker, and a speaker who converses through an intercom.

A speech separation method that cuts out remarks based on voice data acquired and input with the voices of a plurality of speakers mixed,
the method using a voice input unit provided for each speaker, to which the speaker's own voice and the voices of others are input in a mixed state, and comprising:
a first step of detecting the plurality of mixed voice data acquired at each voice input unit for each speech section from the start to the end of each voice data, and accumulating the own voice data input from the own voice input unit;
a second step of referring, in synchronization, to the accumulated voice data for each speaker acquired from each voice input unit, calculating the similarity of the acquired voice data for each speaker, comparing and determining the level of the similarity, and regarding voice data with low similarity as voice data of different speakers and voice data with high similarity as voice data of the same speaker; and
a third step of comparing and determining the magnitude of the voice energy for the voice data with high similarity regarded as being of the same speaker, identifying the voice data determined to have relatively large voice energy as the own remark input from the own voice input unit, and separating the remarks of oneself and others.

The speech separation method according to claim 13, wherein the speaker and the remark are identified based on the own remark and the identified voice input unit.

The speech separation method according to claim 13 or 14, wherein the identified speaker and its voice data are displayed or output as at least one of character data, translated data obtained by translating the character data, and voice.