KR101616112B1 - Speaker separation system and method using voice feature vectors

Info

Publication number: KR101616112B1
Application number: KR1020140095570A
Authority: KR
Inventors: 최성윤
Original assignee: (주)복스유니버스
Priority date: 2014-07-28
Filing date: 2014-07-28
Publication date: 2016-04-27
Also published as: KR20160013592A

Abstract

The present invention relates to a system and method capable of separating speakers using feature vectors of a speech signal on a single channel that includes at least one speaker. The system comprises: a feature vector extraction unit that extracts feature vectors from a speech signal including at least one speaker; a feature vector storage unit that classifies the feature vectors extracted by the feature vector extraction unit into speech segments or non-speech segments and stores them; a speaker separation unit that, once the feature vectors of the speech segments stored in the feature vector storage unit have accumulated, separates the speakers using a pattern recognition technique based on the accumulated data; and an utterance segment detection unit that synchronizes, on a timeline, the speaker separation data output by the speaker separation unit with the accumulated non-speech segment data previously stored in the feature vector storage unit, and thereby detects the start and end times of each speaker's utterance segments.

Description

SPEAKER SEPARATION SYSTEM AND METHOD USING VOICE FEATURE VECTOR

The present invention relates to speech separation, and more particularly to a speaker separation system and method capable of separating speakers using feature vectors of a speech signal on a single channel that includes at least one speaker.

In a call center, when an inbound or outbound call between an agent and a customer is connected, the call is generally treated as one consultation, and its content is recorded and stored. The stored recording can serve as evidence when a customer complaint arises, or as sample selection and supporting material for agent evaluation.

However, unless a call center sets up a professional recording environment, it is difficult to record and store the agent line and the customer line separately. As a result, the voices of the agent and the customer may be recorded and stored together on a single channel.

For example, when the agent and customer lines are recorded separately, the recording is in stereo, so the agent's voice and the customer's voice can be heard separately on the left and right channels. When the call is recorded on a single channel without distinguishing the lines, both the left and right outputs carry the agent's and the customer's voices together.

However, as services that make use of call center recordings have diversified in recent years, single-channel recordings in which speakers such as the agent and the customer are not distinguished are of limited usefulness.

Related prior art includes Korean Registered Patent Publication No. 10-0642455 (title: Method for an automatic evaluation service of customer satisfaction with corporate telephone response; publication date: November 2, 2006).

The present invention was devised to solve the above problem. Its object is to provide a system and method that extract speech feature vectors from a speech signal on a single channel including at least one speaker and, based on the feature vectors and pattern recognition technology, separate the speakers and each speaker's utterance segments.

To this end, a speaker separation system using speech feature vectors according to an embodiment of the present invention includes: a feature vector extraction unit that divides a speech signal including at least one speaker into predetermined frame units and extracts feature vectors from the speech signal of each frame; a feature vector storage unit that classifies the feature vectors extracted by the feature vector extraction unit into speech segments or non-speech segments and stores them; a speaker separation unit that, once the feature vectors of the speech segments stored in the feature vector storage unit have accumulated, separates the speakers using a pattern recognition technique based on the accumulated data; and an utterance segment detection unit that synchronizes, on a timeline, the speaker separation data output by the speaker separation unit with the accumulated non-speech segment data previously stored in the feature vector storage unit, and thereby detects the start and end times of each speaker's utterance segments.

Meanwhile, a speaker separation method using speech feature vectors according to an embodiment of the present invention is a method performed in a speaker separation system that separates speakers from a speech signal including at least one speaker. The method may include: extracting, by the speaker separation system, feature vectors from an input speech signal; classifying, by the speaker separation system, the extracted feature vectors into speech segments or non-speech segments and storing them; separating, by the speaker separation system, the speakers using a pattern recognition technique based on the accumulated data once the feature vectors of the stored speech segments have accumulated; and synchronizing, by the speaker separation system, the speaker separation data output by the speaker separation with the data classified and stored as non-speech segments on a timeline, thereby detecting the start and end times of each speaker's utterance segments.

According to the present invention, speakers and their utterance segments can be separated from a speech signal containing two or more speakers, using speech feature vectors and pattern recognition.

In addition, this speaker separation technique lets audio and video players navigate speech data by the utterance segments of a particular person rather than by time or by frame. When applied to language-learning devices in particular, it can also be used for sentence-level navigation, so its range of applications is broad.

Furthermore, it could be used to provide better analysis by applying word-level or sentence-level ranges to voice commands based on speech recognition technology and to dictation functions that use STT (Sound To Text) technology to convert speech into text.

FIG. 1 is a configuration diagram of a speaker separation system according to an embodiment of the present invention.
FIG. 2 is a detailed configuration diagram of the feature vector extraction unit in the speaker separation system according to an embodiment of the present invention.
FIG. 3 is a detailed configuration diagram of the error correction filter unit in the speaker separation system according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a speaker separation method according to an embodiment of the present invention.

The technical terms used in this specification are used only to describe particular embodiments and are not intended to limit the present invention. Unless specifically defined otherwise in this specification, the technical terms used herein should be interpreted in the sense generally understood by a person of ordinary skill in the art to which the present invention pertains, and should be interpreted neither in an excessively broad sense nor in an excessively narrow sense.

In addition, singular expressions used in this specification include plural expressions unless the context clearly indicates otherwise. In the present application, terms such as "consists of" or "includes" should not be construed as necessarily including all of the components or steps described in the specification; some of those components or steps may be absent, or additional components or steps may be included.

In addition, the suffixes "module" and "unit" attached to components in this specification are assigned or used interchangeably only for ease of drafting, and do not in themselves carry distinct meanings or roles.

Hereinafter, the present embodiment is described in detail with reference to the accompanying drawings.

Referring first to FIGS. 1 and 2, FIG. 1 is a configuration diagram of a speaker separation system according to an embodiment of the present invention, and FIG. 2 is a detailed configuration diagram of the feature vector extraction unit according to an embodiment of the present invention.

As shown in FIG. 1, the speaker separation system according to an embodiment of the present invention includes a feature vector extraction unit (10), a feature vector storage unit (20), a speaker separation unit (30), and an utterance segment detection unit (40). It may further include an error correction filter unit (50) that corrects the speaker-separated data so that each speaker's vocal range is clearly distinguished.

The operation of each of these components may be controlled by the system.

The feature vector extraction unit (10) uses signal characteristics to extract at least one feature vector from a speech signal on a single channel including at least one speaker. The feature vectors are one or more vectors extracted using pitch, energy, and the mel cepstrum (MFCC: Mel Frequency Cepstral Coefficients), and the speech signal is analyzed on divided frames to which a sampling window has been applied.

To implement this, the feature vector extraction unit (10) may include, as shown in FIG. 2, a speech signal division unit (11), a sampling windowing unit (13), a feature vector generation unit (15), and a non-speech segment extraction unit (17).

The speech signal division unit (11) divides the speech signal arriving on the single channel into predetermined frame units. Cutting the signal into smaller units improves the accuracy of feature vector extraction and makes it possible to track accurately how the features change over time.
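The frame division described above can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames`, the overlapping hop, and the example frame length are assumptions introduced here, not values given in the specification.

```python
from typing import List


def split_into_frames(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut a mono signal into fixed-length frames, advancing `hop` samples each time.

    `frame_len` and `hop` are hypothetical parameters; a hop smaller than
    `frame_len` yields overlapping frames. A trailing partial frame is dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames


# Example: 10 samples, frames of 4 samples with a hop of 2 samples.
frames = split_into_frames([float(i) for i in range(10)], frame_len=4, hop=2)
```

Each frame keeps its position in the list, so the frame index multiplied by the hop gives the frame's start sample, which is what later stages would need to recover time information.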

The sampling windowing unit (13) may multiply the frame-divided speech signal by a window with a certain overlap range and output the result. It samples the divided signal so that it can be used for frequency analysis, which suppresses unnecessary signal components and emphasizes the needed ones, raising the reliability of the analysis. For example, this embodiment applies a Hamming window as the window function, which reduces the influence of unwanted high-frequency components on feature extraction.

The sampling windowing unit (13) may also select other window functions as needed, such as a Hanning window or a triangular window.
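As an illustration of the windowing step, the sketch below builds a Hamming window using the conventional coefficients 0.54 and 0.46 and multiplies it into one frame. The function names `hamming` and `apply_window` are assumptions made for this example; the specification does not give an implementation.

```python
import math
from typing import List


def hamming(n: int) -> List[float]:
    """Return an n-point symmetric Hamming window (0.54 - 0.46 * cos(...))."""
    if n <= 0:
        raise ValueError("window length must be positive")
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame: List[float], window: List[float]) -> List[float]:
    """Multiply a frame by a window of the same length, sample by sample."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [s * w for s, w in zip(frame, window)]


# Example: taper a constant frame; the edges are attenuated, the centre is kept.
tapered = apply_window([1.0] * 5, hamming(5))
```

Swapping in a Hanning or triangular window would only require replacing `hamming` with another function that returns a list of the same length.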

The feature vector generation unit (15) may extract feature vectors from the speech signal sampled by the sampling windowing unit (13) using at least one of the pitch, energy, and mel cepstrum (MFCC) extraction techniques. Pitch refers to how high or low the frequency of the speech signal is, and can be the most basic and important feature for analyzing a speech signal. Energy is an indicator of the magnitude of the signal occupied by the speech, and is one of the features that responds sensitively to the recording environment and to speaker characteristics. MFCC yields a number of mathematical coefficients for modeling the speech signal: it represents, as sinusoidal components, the shape of the per-frequency power spectrum of a speech signal expressed on the mel scale. Put simply, it re-expresses the frequency distribution so that it resembles the way the human ear perceives sound.
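Two of the three features can be sketched briefly. The specification does not say how pitch or energy is computed, so the following is an assumption-laden illustration: energy is taken as the sum of squared samples, and pitch is estimated from the strongest autocorrelation peak within a hypothetical 50 to 400 Hz search range. MFCC is omitted because it needs an FFT and a mel filter bank, which would not fit in a short example.

```python
import math
from typing import List


def frame_energy(frame: List[float]) -> float:
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(s * s for s in frame)


def estimate_pitch(frame: List[float], sample_rate: int,
                   fmin: float = 50.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency in Hz from the autocorrelation peak.

    fmin and fmax are hypothetical search bounds. Returns 0.0 when no
    positive autocorrelation peak is found in the search range.
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0


# Example: a synthetic 100 Hz tone sampled at 8 kHz for 0.1 s.
tone = [math.sin(2.0 * math.pi * 100.0 * n / 8000) for n in range(800)]
pitch_hz = estimate_pitch(tone, 8000)
energy = frame_energy(tone)
```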

In addition to the feature vectors for at least one of the pitch, energy, and MFCC of the speech signal, the feature vector generation unit (15) may additionally compute their mean and standard deviation. These mean and standard deviation values of the pitch, energy, and MFCC can be used as feature vectors for analyzing the characteristics of the person producing the speech signal.

The feature vector generation unit (15) may further include a delta extraction function that extracts the change over time of at least one of the pitch, energy, and MFCC of the speech signal. These temporal changes in pitch, energy, and MFCC can likewise be used as feature vectors for analyzing the characteristics of the person producing the speech signal.

The feature vector generation unit (15) may additionally compute the mean and standard deviation of the delta values for at least one of the pitch, energy, and MFCC of the speech signal. These values, too, can be used as feature vectors for analyzing the characteristics of the person producing the speech signal.
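The delta values and the summary statistics can be sketched as below. The specification does not define the delta formula, so a simple first difference between consecutive frames is assumed, and the population standard deviation is an equally arbitrary choice made for this example; `deltas` and `mean_and_std` are hypothetical names.

```python
import statistics
from typing import List, Tuple


def deltas(values: List[float]) -> List[float]:
    """First difference between consecutive frames (assumed delta definition)."""
    return [b - a for a, b in zip(values, values[1:])]


def mean_and_std(values: List[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of a per-frame feature track."""
    return statistics.fmean(values), statistics.pstdev(values)


# Example: a pitch track in Hz over four frames.
pitch_track = [100.0, 104.0, 110.0, 108.0]
pitch_stats = mean_and_std(pitch_track)          # statistics of the feature itself
delta_stats = mean_and_std(deltas(pitch_track))  # statistics of its delta values
```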

The non-speech segment extraction unit (17) extracts non-speech segments from the speech signal. This restricts the data so that only speech feature data is analyzed in the pattern recognition process described below, which improves system performance.

The non-speech segment extraction unit (17) may be placed before, after, or between the speech signal division unit (11) and the sampling windowing unit (13).

For non-speech frames extracted by the non-speech segment extraction unit (17), only the time information is extracted and accumulated separately as silent segments. For frames that the unit judges to be speech, the feature vectors, including time information, are accumulated and stored. Storage takes place in the feature vector storage unit (20 in FIG. 1).

Referring again to FIG. 1, the feature vector storage unit (20) classifies the data (frames) extracted by the feature vector extraction unit (10) into non-speech segment data and speech segment data and stores them. Speech and non-speech segments are kept apart so that only speech segment signals are supplied to the speaker separation unit (30), reducing errors caused by non-speech segment signals.

For segments judged to be speech, feature vectors including time information are accumulated as the input data for speaker separation. For data judged to be non-speech, only the time information is recorded separately, so that the speaker separation data, which covers speech segments only, can later be restored to the original timeline.
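One way to picture this storage scheme is a small container with two accumulators. The class name `FeatureStore` and its fields are assumptions for illustration; the speech/non-speech decision itself is passed in as a flag because the specification does not state how it is made.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FeatureStore:
    """Hypothetical stand-in for the feature vector storage unit (20)."""
    # Speech frames keep their time stamp together with the feature vector.
    speech: List[Tuple[float, List[float]]] = field(default_factory=list)
    # Non-speech frames keep the time stamp only.
    non_speech: List[float] = field(default_factory=list)

    def add(self, time_s: float, features: List[float], is_speech: bool) -> None:
        """File one frame under the speech or the non-speech accumulator."""
        if is_speech:
            self.speech.append((time_s, features))
        else:
            self.non_speech.append(time_s)


# Example: three 20 ms frames, the middle one judged to be silence.
store = FeatureStore()
store.add(0.00, [120.0, 0.8], is_speech=True)
store.add(0.02, [0.0, 0.0], is_speech=False)
store.add(0.04, [210.0, 0.6], is_speech=True)
```

Only `store.speech` would be handed to the clustering stage, while `store.non_speech` is held back until the timeline is rebuilt.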

The speaker separation unit (30) recognizes the one or more speakers present by applying pattern recognition technology to the data accumulated from the extracted feature vectors of the speech segments.

Learning methods for pattern recognition fall into two broad categories: supervised learning and unsupervised learning. Supervised learning applies when the number and types of target classes are fixed; the labeled data is used for training, and the classification criteria are trained so that the machine can reliably distinguish the known outcomes. Unsupervised learning is a method in which the machine analyzes the data set itself and derives arbitrary patterns from it. Because this embodiment is a speaker-independent system that recognizes arbitrary speakers and separates each speaker's voice, it uses unsupervised learning to detect patterns in the given data and classify the accumulated data.

This embodiment is described using an example in which the data is clustered with the K-means algorithm, a representative unsupervised learning technique.

The K-means algorithm aims to partition the input data into K clusters, and seeks the partition that minimizes the variance of the distances between each cluster's center and its member data. To classify the input data set, which consists of the feature data of the speech segments, into the K clusters defined in the system, the K-means algorithm first defines representative values (centers) for K initial clusters. These initial centers may be set to arbitrary values, or the data distribution range may be divided into K parts using the difference between the maximum and minimum values of the data.
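The equation image itself is not reproduced in this text. The standard K-means objective, which matches the symbols named in the description (V, k, x, μ, S), would read as follows; the subsets S_i, denoting the feature vectors assigned to cluster i, are notation assumed here.

```latex
V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^{2} \qquad (1)
```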

Equation 1 above expresses the K-means algorithm as a formula: for every x in the feature vector set S of the speech segments, it seeks the value V that minimizes the distance to the representative values μ of the k clusters.

Once the initial cluster representatives have been defined by the K-means algorithm, ① the relationship between each cluster's representative value and every data point in the input data set is measured. In this embodiment, the relationship is measured using the Euclidean distance.

Equation 2 below is the Euclidean formula for the distance between two points in an n-dimensional space.
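The equation image is not reproduced in this text. In its usual form, with p and q as assumed names for the two n-dimensional points, the Euclidean distance is:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^{2}} \qquad (2)
```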

② Once the distance to each cluster has been obtained for every input data point, each data point is assigned to the nearest cluster. When the assignment is finished, a new center is defined for each cluster using only the data assigned to that cluster.

The K-means algorithm repeats steps ① and ② above, computing the cluster centers, measuring the relationship (distance) between each center and each data point, and assigning each point to the nearest cluster, in search of the optimal classification. The iteration continues until the cluster centers no longer change. When the centers stop changing, the algorithm has found the optimal boundaries for dividing the data set into K clusters. These boundaries can be regarded as the reference by which the one or more speakers are distinguished and separated.
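Steps ① and ② and the stopping rule can be sketched in plain Python as follows. This is an illustrative reading rather than the patented implementation: the names `kmeans`, `euclidean`, and `range_split_centers` are hypothetical, the range-splitting initializer is one possible interpretation of dividing the data range into K parts, and `max_iter` is a safety limit added here.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_split_centers(data: List[Sequence[float]], k: int) -> List[List[float]]:
    """Initial centers at the midpoints of k equal slices of each dimension's range."""
    lo = [min(col) for col in zip(*data)]
    hi = [max(col) for col in zip(*data)]
    return [[l + (h - l) * (i + 0.5) / k for l, h in zip(lo, hi)] for i in range(k)]


def kmeans(data: List[Sequence[float]], centers: List[Sequence[float]],
           max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Repeat assignment and center update until the centers stop changing."""
    current = [list(c) for c in centers]
    labels: List[int] = []
    for _ in range(max_iter):
        # Step 1: measure distances and assign each point to the nearest center.
        labels = [min(range(len(current)), key=lambda k: euclidean(x, current[k]))
                  for x in data]
        # Step 2: recompute each center from its own members only.
        updated = []
        for k in range(len(current)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(current[k])  # keep the center of an empty cluster
        if updated == current:  # converged: centers no longer move
            break
        current = updated
    return labels, current


# Example: two well-separated groups of one-dimensional feature values.
points = [[0.0], [1.0], [10.0], [11.0]]
labels, centers = kmeans(points, range_split_centers(points, 2))
```

With K equal to the expected number of speakers, each resulting label would stand for one speaker.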

Once the speaker separation unit (30) has converged on a result that minimizes the distances between all the data and the cluster centers, the utterance segment detection unit (40) merges the speaker separation result with the accumulated non-speech segment data stored earlier, and synchronizes the data with the timeline of the original speech signal.
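The synchronization step can be illustrated as merging two time-stamped lists and sorting them by time. The function name `restore_timeline` and the use of -1 as the silence label are assumptions made for this example.

```python
from typing import List, Tuple


def restore_timeline(speech_labels: List[Tuple[float, int]],
                     non_speech_times: List[float],
                     silence_label: int = -1) -> List[Tuple[float, int]]:
    """Merge per-frame speaker labels with silent frame times on one timeline.

    speech_labels holds (time, cluster index) pairs for speech frames;
    non_speech_times holds the time stamps stored for non-speech frames.
    """
    timeline = list(speech_labels)
    timeline.extend((t, silence_label) for t in non_speech_times)
    timeline.sort(key=lambda item: item[0])  # order by time stamp
    return timeline


# Example: speaker 0, then silence, then speaker 1.
merged = restore_timeline([(0.00, 0), (0.04, 1)], [0.02])
```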

Next, depending on the filter settings defined in the system, data falling below a certain criterion may pass through the error correction filter unit (50), which forcibly corrects errors in light of the surrounding context. The data output from the error correction filter unit (50) can be defined as the start and end times of the utterance segments of each class, covering all speakers and the non-speech (silent) segments.
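Turning a per-frame label sequence into start and end times amounts to grouping consecutive identical labels. The sketch below assumes equally spaced frames; `labels_to_segments` and `frame_duration` are hypothetical names.

```python
from typing import List, Tuple


def labels_to_segments(labels: List[str],
                       frame_duration: float) -> List[Tuple[str, float, float]]:
    """Group runs of identical labels into (label, start_time, end_time) tuples."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start * frame_duration, i * frame_duration))
            start = i
    return segments


# Example: frames of 1 s each, labeled speaker A, silence, and speaker B.
segments = labels_to_segments(["A", "A", "sil", "B"], frame_duration=1.0)
```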

FIG. 3 is a detailed configuration diagram of the error correction filter unit in the speaker separation system according to an embodiment of the present invention.

The error correction filter unit (50) according to this embodiment operates in two broad stages: the first stage is applied at the level of extraction units, and the second at the level of analysis units.

First, the first correction unit (51) forcibly corrects irregular data that appears momentarily within an otherwise continuous flow in the speaker separation result, which exists in extraction units.

For example, if a single differing data point (the configured threshold) is mixed in among 30 data points labeled as speaker A, that data point can be forcibly corrected to A. Speech lasting one hundredth of a second sandwiched between another speaker's data is treated either as an error caused by frequency ambiguity or, even if it really is another speaker's voice, as a meaningless portion. The error correction filter unit (50) provides a first threshold for judging such data to be errors and correcting them; the first threshold can be added to or modified depending on the deployment site or the speech signal.
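The first-stage correction can be read as relabeling very short runs that are enclosed by the same label on both sides. The sketch below follows that reading; `smooth_short_runs` and `max_run` (standing in for the first threshold) are names assumed for this example.

```python
from typing import List


def smooth_short_runs(labels: List[str], max_run: int = 1) -> List[str]:
    """Relabel runs of at most max_run items enclosed by identical neighbours."""
    out = list(labels)
    n = len(out)
    i = 0
    while i < n:
        # Find the end j of the run of identical labels starting at i.
        j = i
        while j < n and out[j] == out[i]:
            j += 1
        enclosed = i > 0 and j < n and out[i - 1] == out[j]
        if j - i <= max_run and enclosed:
            for k in range(i, j):
                out[k] = out[i - 1]  # absorb the outlier into its surroundings
        i = j
    return out


# Example: one stray "B" inside a stretch of speaker A is corrected to "A".
corrected = smooth_short_runs(["A", "A", "B", "A", "A"])
```

Runs at the very start or end of the sequence are left alone because they have a neighbour on one side only.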

Next, the second correction unit (53) requires that the silent segment data accumulated separately earlier be merged back in to restore the timeline of the original speech signal. The speaker separation data and the silent segment data are combined to produce a classification result for every point in time covered by the original input data, and this result is then converted to the analysis unit time of 1 second. For example, with the settings provided in this embodiment, one analysis unit of 1 second may contain 50 extraction-unit data points. The results are aggregated in groups of 50, and for each analysis unit the count and share of each speaker and of the silent segments are computed and stored.

Looking at the resulting share data for each speaker, data above a particular share can be judged to be segments representative of that speaker, and data below it to be targets of the second-stage correction. The correction targets scattered around each representative segment are then forcibly modified. Among the segments to be corrected, those with short gaps are handled first and those with long gaps last, each corrected on the basis of its share value. This correction stage needs to be tuned to the characteristics of the site or the speech data, and because it has a considerable effect on reliability, it calls for careful configuration by the user.

The result produced after passing through the first correction unit (51) and the second correction unit (53) can be divided into K+1 classes, comprising each speaker and the silent segments. Accordingly, based on the share data of each speaker and of the silent segments within each analysis unit, the utterance segment detection unit (40) can assign each analysis unit to the class that holds the largest share.
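The aggregation into analysis units and the choice of the dominant class can be sketched as below. The unit size of 50 follows the example setting in the description; `summarize_units` is a hypothetical name, and ties between classes are broken arbitrarily by first occurrence, which the specification does not address.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_units(labels: List[str],
                    unit_size: int = 50) -> List[Tuple[str, Dict[str, float]]]:
    """For each analysis unit, return the dominant label and every label's share."""
    result = []
    for start in range(0, len(labels), unit_size):
        chunk = labels[start:start + unit_size]
        counts = Counter(chunk)
        shares = {label: count / len(chunk) for label, count in counts.items()}
        dominant = max(shares, key=shares.get)  # class with the largest share
        result.append((dominant, shares))
    return result


# Example: 2 s of data at 50 extraction units per second.
frame_labels = ["A"] * 30 + ["sil"] * 20 + ["B"] * 50
units = summarize_units(frame_labels)
```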

FIG. 4 is a flowchart illustrating a speaker separation method according to an embodiment of the present invention.

First, in step S10, when a speech signal is input to the speaker separation system, the feature vector extraction unit (10) in the system divides the speech signal into predetermined frame units through the speech signal division unit (11).

Next, in step S11, the feature vector extraction unit (10) performs sampling by multiplying the frame-divided speech signal by a window of a certain range in the sampling windowing unit (13). A Hamming window may be applied as the window function to reduce the influence of unwanted high-frequency components on feature vector extraction. Other window functions, such as a Hanning window or a triangular window, may be selected as needed.

Next, in step S12, the feature vector extraction unit (10) extracts at least one feature vector from the speech signal sampled by the sampling windowing unit (13), using pitch, energy, the mel cepstrum (MFCC), and the like.

When extracting feature vectors, the feature vector extraction unit 10 can additionally compute, as feature vectors for analyzing the characteristics of the person speaking, the mean and standard deviation of at least one of the pitch, energy, and MFCC of the speech signal. If necessary, it can also compute delta values representing the temporal change of the pitch, energy, MFCC, and so on, along with the mean and standard deviation of those delta values.
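The statistics described above can be illustrated for the energy feature alone. The sketch below is an assumption-laden example: pitch and MFCC extraction are omitted, the delta is taken as a simple first-order difference between adjacent frames, and all function names are hypothetical.

```python
import math

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def deltas(values):
    """First-order difference between adjacent frames.

    Assumption: the first frame has no predecessor, so its delta is 0.0;
    this keeps the output the same length as the input.
    """
    return [0.0] + [b - a for a, b in zip(values, values[1:])]

def mean_std(values):
    """Mean and (population) standard deviation of a sequence."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

frames = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
energies = [frame_energy(f) for f in frames]   # [1.0, 4.0, 9.0]
energy_deltas = deltas(energies)               # [0.0, 3.0, 5.0]
energy_stats = mean_std(energies)
delta_stats = mean_std(energy_deltas)
```

The same three helpers could be applied to a pitch track or to each MFCC coefficient once those features are available.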

Then, in step S13, the feature vector extraction unit 10 extracts non-speech sections from the speech signal through the non-speech section extraction unit 17. As described above, this extraction restricts the data so that only speech feature data is analyzed in the subsequent pattern recognition process, which improves system performance.

After extraction, for frames determined to be non-speech, the feature vector extraction unit 10 extracts only the time information and accumulates it as silent sections in the feature vector storage unit 20. For frames determined to be speech, it accumulates the feature vectors, including time information, in the feature vector storage unit 20.
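The two storage paths can be sketched as below. The decision rule is an assumption made only for illustration: a frame is treated as non-speech when its energy falls below a fixed threshold, which is not a criterion stated in this passage. The function name and the data layout are likewise hypothetical.

```python
def store_frames(energies, frame_ms, threshold):
    """Separate frames into speech and non-speech stores.

    Non-speech frames keep only their start time (ms).
    Speech frames keep their start time together with a feature vector
    (here a one-element vector holding the frame energy).
    """
    speech = []    # list of (start_ms, feature_vector)
    silence = []   # list of start_ms
    for index, energy in enumerate(energies):
        start_ms = index * frame_ms
        if energy < threshold:      # assumed non-speech criterion
            silence.append(start_ms)
        else:
            speech.append((start_ms, [energy]))
    return speech, silence

speech, silence = store_frames([0.0, 0.5, 0.6, 0.01],
                               frame_ms=20, threshold=0.1)
```

Keeping the start time in both stores is what later allows the speaker labels and the silent sections to be placed back on one timeline.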

Once the feature vectors of all speech sections have been accumulated, in step S14 the speaker separation unit 30 recognizes and separates the one or more speakers based on the accumulated feature vectors. Here, the speaker separation unit 30 can apply a pattern recognition technique using the K-means algorithm.

For example, when separating speakers from a recording of a conversation between two speakers, the system must recognize and distinguish two speakers, so the initial cluster parameter K of the K-means algorithm can be set to 2. To find the configuration that minimizes the variance of the distances between each cluster's center and its member data, the K-means algorithm repeatedly performs two steps: defining the cluster centers, and assigning each data item to a new cluster based on its distance to the new cluster centers.
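For reference, the quantity reduced by these two alternating steps is usually written as the within-cluster sum of squared distances. The form below is the standard textbook K-means objective, not a formula quoted from this specification; the symbols $S_i$ (the set of feature vectors assigned to cluster $i$) and $\mu_i$ (the center of cluster $i$) are notation assumed here for illustration.

```latex
V = \sum_{i=1}^{K} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x_j \in S_i} x_j
```

With K = 2, as in the two-speaker example, the sum runs over two clusters, one per speaker.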

Then, in step S15, the speaker separation unit 30 checks whether the cluster centers have moved during the K-means iteration. This step is performed only when the K-means algorithm is applied; if a different pattern recognition technique is applied, the step can be omitted or modified as that algorithm requires.

If the check shows that the cluster centers have moved, the process of measuring the distance between each cluster center and each data item and assigning the item to the nearest cluster is repeated, searching for the optimal classification, until the cluster centers no longer change.

If the check shows that the cluster centers have stopped moving, the optimal boundaries dividing the data set into K clusters have been found.
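The loop of steps S14 and S15 can be sketched in plain Python as follows. This is an illustrative version of standard K-means rather than the implementation of the speaker separation unit 30: seeding the centers with the first K points, the squared Euclidean distance, and the iteration cap are all assumptions, and the function name is hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Cluster feature vectors into k groups; returns (labels, centers)."""
    # Assumption: seed the centers with the first k points.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        # Step S15: stop once no center has moved.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Two well-separated groups stand in for two speakers (K = 2).
points = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8],
          [8.8, 9.2], [0.9, 1.1], [9.1, 8.9]]
labels, centers = kmeans(points, k=2)
```

Because K-means converges to a configuration that depends on the initial centers, a practical system would likely use a more careful seeding strategy than the one assumed here.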

Once the speakers have been separated in this way, in step S16 the error correction filter unit 50 can forcibly correct errors in data that falls below a certain criterion, taking the surrounding context into account. For example, if one differing data item, where one is the configured threshold, is mixed in among 30 data items labeled as speaker A, that item can be forcibly corrected to A.
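The example above can be sketched as a run-length filter over the per-frame speaker labels. This is one possible reading, offered as an assumption: a run of differing labels is overwritten only when it is no longer than the threshold and is surrounded on both sides by the same label. The function name and the `max_run` parameter are hypothetical.

```python
def smooth_isolated(labels, max_run=1):
    """Relabel short runs that interrupt an otherwise continuous flow."""
    out = list(labels)
    n = len(out)
    i = 0
    while i < n:
        # Find the end (exclusive) of the run starting at i.
        j = i
        while j < n and out[j] == out[i]:
            j += 1
        run_len = j - i
        # Overwrite only short runs enclosed by identical neighbours.
        if run_len <= max_run and i > 0 and j < n and out[i - 1] == out[j]:
            for t in range(i, j):
                out[t] = out[i - 1]
        i = j
    return out

# One stray 'B' among 30 'A' labels is corrected to 'A'.
noisy = ['A'] * 15 + ['B'] + ['A'] * 15
cleaned = smooth_isolated(noisy, max_run=1)
```

Runs longer than the threshold, and runs at the very start or end of the sequence, are left untouched in this sketch.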

Next, in step S17, the utterance interval detection unit 40 combines the speaker separation result output by the error correction filter unit 50 or the speaker separation unit 30 with the previously stored accumulated data of the non-speech sections, and synchronizes the data with the timeline of the original speech signal. It then detects the start and end times of each utterance interval based on all speakers and the non-speech (silent) sections.
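Step S17 can be sketched as a merge of two time-stamped stores followed by grouping of adjacent frames. The data layout (a mapping from frame start time to speaker label, plus a list of silent frame start times) and the function name are assumptions made for illustration.

```python
def detect_segments(speech_labels, silence_starts, frame_ms):
    """Merge speaker labels and silent frames onto one timeline.

    speech_labels:  dict mapping frame start (ms) to a speaker label
    silence_starts: list of frame starts (ms) judged non-speech
    Returns a list of (label, start_ms, end_ms) segments in time order.
    """
    timeline = dict(speech_labels)
    for start in silence_starts:
        timeline[start] = "silence"

    segments = []
    for start in sorted(timeline):
        label = timeline[start]
        end = start + frame_ms
        # Extend the previous segment when the label continues without a gap.
        if segments and segments[-1][0] == label and segments[-1][2] == start:
            segments[-1] = (label, segments[-1][1], end)
        else:
            segments.append((label, start, end))
    return segments

segments = detect_segments({0: "A", 20: "A", 60: "B"}, [40], frame_ms=20)
```

Here speaker A is found to talk from 0 ms to 40 ms, followed by 20 ms of silence and then speaker B from 60 ms to 80 ms.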

In general, separating speakers based on feature vectors extracted from a speech signal has limits, owing to the frequency characteristics of the feature vectors. Male and female voices usually lie in different frequency bands, so feature-based discrimination between them is relatively good. When voices of the same gender must be separated (two men or two women), however, the speakers' vocal ranges are highly similar and errors can occur. In that case, a correction algorithm that forcibly resolves ambiguity in the speaker separation result needs to be applied.

For data correction, the unit of analysis time is widened and the concept of proportion is introduced. To this end, the system manages two unit times: the extraction unit time, over which the speech is divided into frames and features are extracted, and the analysis unit time, over which the speaker separation results classified per extraction unit are grouped to compute proportions. For example, in the embodiment of the present invention, at a sampling rate of 8 kHz, an extraction unit time of 20 ms is used to extract 50 feature vectors per second and separate the speakers.

After speaker separation and correction, the results can be divided into K+1 classes: one per speaker plus the silent section. Based on the proportion data of each speaker and of the silent section within each analysis unit, the utterance interval detection unit 40 defines the class with the highest proportion as the class of that unit.
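The proportion-based decision can be sketched as a per-unit majority vote. The length of the analysis unit is not given in this passage, so `frames_per_unit` is a hypothetical parameter, as are the function name and the label `'S'` used for silence.

```python
from collections import Counter

def classify_by_proportion(frame_labels, frames_per_unit):
    """Assign each analysis unit the label with the highest proportion.

    Returns a list of (label, proportion) pairs, one per analysis unit.
    On a tie, the label seen first within the unit wins (an assumption).
    """
    result = []
    for i in range(0, len(frame_labels), frames_per_unit):
        unit = frame_labels[i:i + frames_per_unit]
        label, count = Counter(unit).most_common(1)[0]
        result.append((label, count / len(unit)))
    return result

# K = 2 speakers ('A', 'B') plus silence ('S') gives K + 1 = 3 classes.
frame_labels = ['A', 'A', 'B', 'A', 'S', 'S', 'S', 'B']
units = classify_by_proportion(frame_labels, frames_per_unit=4)
```

In this example the first unit is assigned to speaker A and the second to silence, each with a proportion of 0.75, so the two stray 'B' frames no longer affect the result.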

With this process, speakers and each speaker's utterance intervals can be distinguished in a speech signal on a single channel, so playback devices such as music players and video players can navigate a speech signal by speaker and by a specific speaker's utterance interval rather than by time or frame. In language-learning devices, this can also be used for sentence-by-sentence navigation. Furthermore, it could be used to provide better analysis by applying word- or sentence-level ranges to voice commands based on speech recognition and to dictation functions based on STT (Sound To Text) technology, which converts speech into text.

The foregoing description merely illustrates the present invention, and those of ordinary skill in the art to which the invention pertains can make various modifications without departing from its technical spirit. Accordingly, the embodiments disclosed in this specification do not limit the present invention. The scope of the present invention should be construed according to the claims below, and all techniques within an equivalent scope should be construed as falling within the scope of the present invention.

10: feature vector extraction unit 20: feature vector storage unit
30: speaker separation unit 40: utterance interval detection unit
50: error correction filter unit 11: speech signal division unit
13: sampling windowing unit 15: feature vector generation unit
17: non-speech section extraction unit 51: first correction unit
53: second correction unit

Claims

A speaker separation system using voice feature vectors, comprising:
a feature vector extraction unit that divides a speech signal including at least one speaker into predetermined frame units and extracts feature vectors from the speech signal of each frame section;
a feature vector storage unit that classifies the feature vectors extracted by the feature vector extraction unit into speech sections or non-speech sections and stores them;
a speaker separation unit that, when the feature vectors of the speech sections stored in the feature vector storage unit have been accumulated, separates the speakers using a pattern recognition technique based on the accumulated data;
an utterance interval detection unit that synchronizes, on a timeline, the accumulated data of the per-speaker speech sections separated by the speaker separation unit with the accumulated data of the non-speech sections previously stored in the feature vector storage unit, and detects, based on the non-speech sections, the start and end times of each speaker's utterance intervals; and
an error correction filter unit,
wherein the feature vector storage unit
accumulates and stores feature vectors including time information for frame sections classified as speech sections, and extracts and stores only time information for frame sections classified as non-speech sections, and
wherein the error correction filter unit comprises:
a first correction unit that corrects irregular data appearing momentarily within a continuous flow in the accumulated data of the speech sections existing per extraction unit before the speakers are separated; and
a second correction unit that computes the proportion of each speaker and of the non-speech sections from the original speech signal in which the accumulated data of the per-speaker speech sections separated by the speaker separation unit and the accumulated data of the non-speech sections are synchronized on the timeline, and corrects the data to the data that occupies a high proportion.

The speaker separation system using voice feature vectors according to claim 1,
wherein the feature vector extraction unit comprises:
a speech signal division unit that divides a speech signal including at least one speaker into predetermined frame units;
a sampling windowing unit that performs sampling by multiplying each frame unit divided by the speech signal division unit by a window function of a predetermined size;
a feature vector generation unit that computes feature vectors using at least one of pitch, energy, and mel-frequency cepstral coefficients (MFCC) for the speech signal output by the sampling windowing unit; and
a non-speech section extraction unit that extracts non-speech sections (silent sections) from the feature vectors computed by the feature vector generation unit and classifies them into speech sections and non-speech sections.

delete

The speaker separation system using voice feature vectors according to claim 1,
wherein the speaker separation unit
uses, as the pattern recognition technique, a K-means algorithm according to the following equation, and repeatedly performs a step of defining a representative value of each cluster so as to classify the accumulated data of the speech sections into per-speaker clusters, and a step of obtaining the distance between each cluster's representative value and the accumulated data and assigning the data to the nearest cluster.

(Here, the equation means computing, over all x, the value V that minimizes the distance between all input data x, comprising the feature vector set S of the speech sections, and the representative values μ of the k clusters.)

delete

A method, in a speaker separation system, for separating speakers from a speech signal including at least one speaker, the speaker separation method using voice feature vectors comprising:
extracting, by the speaker separation system, feature vectors from an input speech signal;
classifying, by the speaker separation system, the extracted feature vectors into speech sections or non-speech sections and storing them;
separating the speakers using a pattern recognition technique based on the accumulated data when the feature vectors of the stored speech sections have been accumulated;
synchronizing, by the speaker separation system, on a timeline the accumulated data of the separated per-speaker speech sections with the accumulated data of the non-speech sections previously stored in the classifying and storing step, and detecting, from the original speech signal, the start and end times of each speaker's utterance intervals based on the non-speech sections; and
correcting errors before and after the step of separating the speakers,
wherein, in the step of classifying the feature vectors into speech sections or non-speech sections and storing them,
the speaker separation system accumulates and stores feature vectors including time information for frame sections classified as speech sections, and extracts and stores only time information for frame sections classified as non-speech sections, and
wherein the step of correcting errors comprises:
a first correction step of correcting irregular data appearing momentarily within a continuous flow in the accumulated data of the speech sections existing per extraction unit before the speakers are separated; and
a second correction step of computing the proportion of each speaker and of the non-speech sections from the original speech signal in which the accumulated data of the per-speaker speech sections obtained through the step of separating the speakers and the accumulated data of the non-speech sections are synchronized on the timeline, and correcting the data to the data that occupies a high proportion.

The speaker separation method using voice feature vectors according to claim 6,
wherein the step of extracting the feature vectors comprises:
dividing the speech signal including the at least one speaker into predetermined frame units;
performing sampling by multiplying each divided frame section by a window function of a predetermined size;
extracting feature vectors using at least one of pitch, energy, and mel-frequency cepstral coefficients (MFCC) for the sampled speech signal; and
extracting non-speech sections (silent sections) from the extracted feature vectors and classifying them into speech sections and non-speech sections.

delete

The speaker separation method using voice feature vectors according to claim 6,
wherein the step of separating the speakers
uses the following equation,
and separates the speakers by repeatedly performing a process of defining a representative value of each cluster so as to divide the accumulated data of the speech sections into per-speaker clusters, obtaining the distance between each cluster's representative value and the accumulated data, and assigning the data to the nearest cluster.

delete