KR20220067276A

KR20220067276A - Speaker diarization of single channel speech using source separation

Info

Publication number: KR20220067276A
Application number: KR1020200153803A
Authority: KR
Inventors: 최우용; 동성희; 박전규
Original assignee: 한국전자통신연구원
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2022-05-24

Abstract

The present invention relates to a speaker diarization device that uses sound source separation on a single-channel voice signal. The device comprises: a noise removal unit which removes background noise from an input signal; a voice detection unit which extracts voice sections, excluding silent sections, from the input signal from which the background noise has been removed; a feature extraction unit which extracts voice features of each speaker for speaker diarization; a sound source separation unit which separates each voice section extracted by the voice detection unit into two sound sources; a sound source selection unit which selects only the user's voice from the sound sources separated by the sound source separation unit, excluding silence and a third party's voice; a sound source connection unit which generates one voice signal by connecting the sound sources selected by the sound source selection unit; a clustering unit which clusters the extracted voice features by speaker; and a post-processing unit which synchronizes time information with the input signal by reflecting the silent sections and overlapping-speech sections, and outputs the speech sections of each speaker. According to the present invention, the degradation of speaker diarization performance caused by a third party's voice can be reduced.

Description

Speaker diarization of single channel speech using source separation

The present invention relates to speaker diarization, a method of determining who spoke when from voice data collected on a single channel.

In situations where two or more speakers are in conversation, such as a phone call or a consultation at a bank counter, it is sometimes necessary to separate the voice of each speaker, and speaker diarization methods are widely used for this purpose.

However, conventional speaker diarization methods suffer a marked drop in performance when, as in recordings of bank counter consultations, the voice of a third speaker is recorded together with background noise in addition to the voices of the counselor and the customer.

The present invention is intended to solve this problem of the prior art, and aims to provide a speaker diarization device using sound source separation in a single-channel voice signal that improves diarization performance by removing the third speaker through sound source separation.

The objects of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above object, a speaker diarization device using sound source separation in a single-channel voice signal according to an embodiment of the present invention comprises: a noise removal unit that removes background noise from an input signal; a voice detection unit that extracts voice sections, excluding silent sections, from the input signal from which the background noise has been removed; a sound source separation unit that separates each voice section extracted by the voice detection unit into two sound sources; a sound source selection unit that selects only the user's voice from the sound sources separated by the sound source separation unit, excluding silence and a third party's voice; a sound source connection unit that generates one voice signal by connecting the sound sources selected by the sound source selection unit; a feature extraction unit that extracts voice features of each speaker for speaker diarization; a clustering unit that clusters the extracted voice features by speaker; and a post-processing unit that synchronizes time information with the input signal by reflecting the silent sections and overlapping-speech sections, and outputs the speech sections of each speaker.

Here, the two separated sound sources are preferably one of the pairs “user, silence”, “user, third party”, or “user, user”.

The third party's voice is preferably identified using the characteristics of far-field speech.

In the present invention, a section in which two users speak at the same time is marked as an overlapping-speech section; the overlapping-speech section itself is excluded, and only its time information is stored.

According to an embodiment of the present invention, the degradation of speaker diarization performance caused by a third party's voice can be reduced, and diarization performance can be improved by detecting overlapping-speech sections and excluding them from diarization.

FIG. 1 is a block diagram of a speaker diarization device using sound source separation in a single-channel voice signal according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a speaker diarization method using sound source separation in a single-channel voice signal according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in a variety of different forms; these embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the technical field to which the present invention belongs, and the present invention is defined only by the scope of the claims. The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular also includes the plural unless specifically stated otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

FIG. 1 is a block diagram illustrating a speaker diarization device using sound source separation in a single-channel voice signal according to the present invention.

In an embodiment of the present invention, a single channel means an environment in which the voices of a plurality of speakers are received through a single microphone.

As shown in FIG. 1, the speaker diarization device using sound source separation in a single-channel voice signal according to an embodiment of the present invention includes a noise removal unit 110, a voice detection unit 120, a sound source separation unit 130, a sound source selection unit 140, a sound source connection unit 150, a feature extraction unit 160, a clustering unit 170, and a post-processing unit 180.

The noise removal unit 110 removes background noise from the input signal. In this embodiment, the background noise may be removed using a noise removal filter circuit, an algorithm that detects noise from the ratio between speech and noise and then removes it, or a method that removes noise from the input signal using a statistical model.

The voice detection unit 120 extracts voice sections, excluding silent sections, from the input signal from which the background noise has been removed. The voice sections extracted by the voice detection unit 120 preferably include time information.
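The patent does not specify how the voice detection unit 120 decides between voice and silence. As a minimal sketch only, assuming a simple frame-energy threshold (the function name, frame length, and threshold are hypothetical and not taken from the source), voice sections carrying their time information could be extracted like this:

```python
from typing import List, Tuple


def detect_voice_sections(samples: List[float], sample_rate: int,
                          frame_ms: int = 20,
                          threshold: float = 0.01) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) for each run of frames above the energy threshold.

    Assumption: mean squared amplitude per frame stands in for a real
    voice activity detector, which the patent leaves unspecified.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    n_frames = len(samples) // frame_len
    sections: List[Tuple[float, float]] = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / sample_rate  # start time of this frame
        if energy > threshold and start is None:
            start = t                    # a voice section begins
        elif energy <= threshold and start is not None:
            sections.append((start, t))  # the voice section ends
            start = None
    if start is not None:                # signal ended while still in voice
        sections.append((start, n_frames * frame_len / sample_rate))
    return sections
```

Keeping the start and end times with each section is what later allows the output to be aligned with the original input signal.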

The sound source separation unit 130 separates each voice section extracted by the voice detection unit 120 into two sound sources. Here, the two separated sound sources may be one of the pairs “user, silence”, “user, third party”, or “user, user”.

The sound source selection unit 140 selects only the user's voice from the separated sound sources, excluding silence and the third party's voice. Here, the third party's voice can be identified using the characteristics of far-field speech. A section in which two users speak at the same time is marked as an overlapping-speech section.
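The selection rule for the three possible pairs can be sketched as follows. This is an illustration under assumptions: each separated source is taken to already carry a label ("user", "silence", or "third_party"), since the patent does not describe the far-field classifier that would produce it, and the `Source` and `Selection` types are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Source:
    kind: str             # hypothetical label: "user", "silence", or "third_party"
    samples: List[float]


@dataclass
class Selection:
    start: float
    end: float
    samples: Optional[List[float]]  # None for an overlapping-speech section
    is_overlap: bool


def select_sources(start: float, end: float,
                   pair: Tuple[Source, Source]) -> Optional[Selection]:
    """Keep only user speech from one separated voice section."""
    users = [s for s in pair if s.kind == "user"]
    if len(users) == 2:
        # "user, user": two users spoke at once, so mark the section as overlap
        return Selection(start, end, None, True)
    if len(users) == 1:
        # "user, silence" or "user, third party": keep the user source only
        return Selection(start, end, users[0].samples, False)
    return None  # no user source in this section
```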

The sound source connection unit 150 connects the sound sources selected by the sound source selection unit 140 to generate a single voice signal. The sound source connection unit 150 excludes the detected overlapping-speech sections from this signal and stores only their time information.
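A minimal sketch of this connection step, assuming each selected section is a tuple `(start_sec, end_sec, samples, is_overlap)`. The tuple layout, the function name, and the extra `offsets` bookkeeping are assumptions added for illustration, not details from the patent.

```python
from typing import List, Optional, Tuple

Section = Tuple[float, float, Optional[List[float]], bool]


def connect_sources(sections: List[Section], sample_rate: int):
    """Concatenate selected user audio; keep only the times of overlap sections."""
    signal: List[float] = []
    overlap_times: List[Tuple[float, float]] = []
    offsets: List[Tuple[float, float]] = []  # (time in joined signal, original start time)
    for start, end, samples, is_overlap in sections:
        if is_overlap:
            # Overlapping speech is left out of the audio; only its time span is stored.
            overlap_times.append((start, end))
            continue
        offsets.append((len(signal) / sample_rate, start))
        signal.extend(samples)
    return signal, overlap_times, offsets
```

Recording where each piece lands in the joined signal is one way to make the later time synchronization possible.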

The feature extraction unit 160 extracts the voice features of each speaker for speaker diarization.

The clustering unit 170 performs clustering that groups the extracted voice features by speaker.
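The patent does not name a clustering algorithm. Purely as an illustration, the sketch below uses a two-cluster k-means with a deterministic initialization, assuming exactly two speakers and one feature vector per segment; the function name and both assumptions are not from the source.

```python
from typing import List, Optional, Sequence


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Sequence[float]]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def cluster_two_speakers(features: List[Sequence[float]],
                         iterations: int = 20) -> List[int]:
    """Assign each feature vector to speaker 0 or 1 with a small k-means."""
    # Deterministic start: the first vector and the vector farthest from it.
    c0 = list(features[0])
    c1 = list(max(features, key=lambda f: _dist2(f, c0)))
    labels: Optional[List[int]] = None
    for _ in range(iterations):
        new_labels = [0 if _dist2(f, c0) <= _dist2(f, c1) else 1 for f in features]
        if new_labels == labels:
            break                      # assignments stopped changing
        labels = new_labels
        g0 = [f for f, l in zip(features, labels) if l == 0]
        g1 = [f for f, l in zip(features, labels) if l == 1]
        if g0:
            c0 = _mean(g0)             # move each centroid to its group mean
        if g1:
            c1 = _mean(g1)
    return labels if labels is not None else [0] * len(features)
```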

The post-processing unit 180 synchronizes the time information with the input signal by reflecting the silent sections and overlapping-speech sections, and outputs the speech sections of each speaker.
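Because silence and overlapping speech are absent from the connected signal, speaker labels obtained on that signal have to be shifted back onto the original timeline. The following is a sketch under stated assumptions: `pieces` lists the original `(start, end)` of each connected piece in order, `labelled` holds `(start, end, speaker)` in connected-signal time, and the "overlap" label is a hypothetical way of reporting overlap sections.

```python
from typing import List, Tuple


def to_original_timeline(pieces: List[Tuple[float, float]],
                         labelled: List[Tuple[float, float, str]],
                         overlaps: List[Tuple[float, float]]
                         ) -> List[Tuple[float, float, str]]:
    """Map labelled segments from connected-signal time back to input-signal time."""
    result: List[Tuple[float, float, str]] = []
    offset = 0.0  # where the current piece begins in the connected signal
    for orig_start, orig_end in pieces:
        length = orig_end - orig_start
        for seg_start, seg_end, speaker in labelled:
            lo = max(seg_start, offset)
            hi = min(seg_end, offset + length)
            if lo < hi:  # this segment intersects the current piece
                result.append((orig_start + lo - offset,
                               orig_start + hi - offset, speaker))
        offset += length
    # Overlap sections were stored as times only; report them under their own label.
    result.extend((s, e, "overlap") for s, e in overlaps)
    return sorted(result)
```

A segment that spans two connected pieces is split at the piece boundary, so the silent gap between them is restored in the output.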

According to an embodiment of the present invention, rather than performing diarization directly on a voice signal that contains a third party's voice, noise removal and sound source separation are first applied to the voice signal, which reduces the degradation of speaker diarization performance; in addition, overlapping-speech sections are detected and excluded from diarization, which improves speaker diarization performance.

A speaker diarization method using sound source separation in a single-channel voice signal according to an embodiment of the present invention will now be described with reference to FIG. 2.

First, the noise removal unit 110 removes background noise from the input signal (S110).

Next, the voice detection unit 120 extracts voice sections, excluding silent sections, from the input signal from which the background noise has been removed (S120). The voice sections extracted by the voice detection unit 120 preferably include time information.

Then, the sound source separation unit 130 separates each voice section extracted by the voice detection unit 120 into two sound sources based on the detected voice features of each speaker (S130). Here, the two separated sound sources may be one of the pairs “user, silence”, “user, third party”, or “user, user”.

Next, the sound source selection unit 140 selects only the user's voice from the separated sound sources, excluding silence and the third party's voice (S140). Here, the third party's voice is identified using the characteristics of far-field speech. A section in which two users speak at the same time is marked as an overlapping-speech section.

Then, the sound source connection unit 150 connects the sound sources selected by the sound source selection unit 140 to generate a single voice signal (S150). The sound source connection unit 150 excludes the detected overlapping-speech sections from this signal and stores only their time information.

Next, the feature extraction unit 160 extracts the voice features of each speaker for speaker diarization (S160).

The clustering unit 170 then performs clustering that groups the extracted voice features by speaker (S170).

Finally, the post-processing unit 180 synchronizes the time information with the input signal by reflecting the silent sections and overlapping-speech sections, and outputs the speech sections of each speaker (S180).
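The order of steps S110 to S180 can be summarized as a simple pipeline. In this sketch every stage is passed in as a callable, because the patent does not fix the algorithm used at any stage; the `stages` dictionary, its keys, and the argument shapes are hypothetical.

```python
from typing import Any, Callable, Dict


def run_diarization(signal: Any, stages: Dict[str, Callable[..., Any]]) -> Any:
    """Run the eight steps in the order given by the flowchart of FIG. 2."""
    denoised = stages["denoise"](signal)                        # S110
    sections = stages["detect_voice"](denoised)                 # S120
    separated = [stages["separate"](sec) for sec in sections]   # S130
    selected = [stages["select"](pair) for pair in separated]   # S140
    connected = stages["connect"](selected)                     # S150
    features = stages["extract_features"](connected)            # S160
    labels = stages["cluster"](features)                        # S170
    return stages["postprocess"](sections, selected, labels)    # S180
```

Passing the stages in as callables keeps the step order visible while leaving each unit's implementation open, as the description itself does.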

The configuration of the present invention has been described in detail above with reference to the accompanying drawings, but this is merely an example, and those of ordinary skill in the art to which the present invention pertains can of course make various modifications and changes within the scope of the technical spirit of the present invention. Therefore, the scope of protection of the present invention should not be limited to the above-described embodiments and should be defined by the following claims.

Claims

A speaker diarization apparatus using sound source separation in a single-channel voice signal, comprising:
a noise removal unit that removes background noise from an input signal;
a voice detection unit that extracts voice sections, excluding silent sections, from the input signal from which the background noise has been removed;
a sound source separation unit that separates each voice section extracted by the voice detection unit into two sound sources;
a sound source selection unit that selects only the user's voice from the sound sources separated by the sound source separation unit, excluding silence and a third party's voice;
a sound source connection unit that generates one voice signal by connecting the sound sources selected by the sound source selection unit;
a feature extraction unit that extracts voice features of each speaker for speaker diarization;
a clustering unit that clusters the extracted voice features by speaker; and
a post-processing unit that synchronizes time information with the input signal by reflecting the silent sections and overlapping-speech sections, and outputs the speech sections of each speaker.