KR102044520B1

KR102044520B1 - Apparatus and method for discriminating voice presence section

Info

Publication number: KR102044520B1
Application number: KR1020170105485A
Authority: KR
Inventors: 최인규; 배수현; 김남수
Original assignee: 국방과학연구소
Priority date: 2017-08-21
Filing date: 2017-08-21
Publication date: 2019-11-13
Also published as: KR20190020471A

Abstract

An apparatus for determining a voice presence section according to an embodiment of the present invention includes a feature vector extraction unit that extracts a feature vector for each frame of an acoustic signal, a machine learning model unit that receives the feature vector and estimates a per-frame voice presence probability and a per-frame noise type, and a determination unit that identifies frames whose voice presence probability is equal to or greater than a predetermined threshold.

Description

Apparatus and method for discriminating voice presence section

The present invention relates to an apparatus and method for determining a voice presence section, and more particularly to an apparatus and method that determine the voice presence section in an acoustic signal by estimating, from a feature vector extracted from the acoustic signal, the probability that voice is present in the signal and the type of noise the signal contains.

In fields such as voice communication and speech recognition, identifying the voice sections of an acoustic signal is an essential preprocessing step, and a variety of techniques for identifying them more accurately are being actively studied.

Earlier work focused mainly on discriminating voice from noise with statistical models; more recently, research has turned to modeling the characteristics of voice and noise with machine learning.

Korean Laid-Open Patent Publication No. 10-2011-0038447: Target Signal Detection Apparatus and Method Using Statistical Model

An object of embodiments of the present invention is to provide a technique that, in machine-learning-based determination of voice sections in an acoustic signal, uses a feature vector that captures the signal's intrinsic information more efficiently.

A further object is to provide a machine learning model that is robust to noise by having the model also learn the type of noise, which is closely related to deciding whether a section of the acoustic signal contains voice.

However, the technical problems addressed by embodiments of the present invention are not limited to those mentioned above, and various other technical problems may be derived from the following description within a scope apparent to those skilled in the art.

An apparatus for determining a voice presence section according to an embodiment of the present invention includes a feature vector extraction unit that extracts a feature vector for each frame of an acoustic signal, a machine learning model unit that receives the feature vector and estimates the per-frame voice presence probability and the per-frame noise type, and a determination unit that identifies frames whose voice presence probability is equal to or greater than a predetermined threshold.

Here, the frame may be obtained by dividing the acoustic signal into predetermined time units.

The machine learning model unit may be generated by setting feature vectors of training acoustic signals that contain noise as the input data of machine learning, setting whether voice is present in each training acoustic signal and the type of noise it contains as the output data, and learning parameters for producing the output data from the input data. In this way, the parameters may learn the correlation between the presence of voice in a training acoustic signal and the type of noise the signal contains.

In addition, the type of noise set as the output data may be specified as a class that includes identification information for the non-voice sound.

A method for determining a voice presence section according to an embodiment of the present invention is performed by one or more processors and includes: extracting a feature vector for each frame of an acoustic signal; receiving the feature vector and estimating a per-frame voice presence probability and a per-frame noise type; and identifying frames whose voice presence probability is equal to or greater than a predetermined threshold.

According to an embodiment of the present invention, extracting the feature vector from the acoustic signal as expressed in the time domain makes it possible to use, for machine learning, a feature vector that captures the signal's intrinsic information more efficiently.

In addition, both the presence of voice in the acoustic signal and the type of noise the signal contains are set as output data in the machine learning stage, so that the parameters of the machine learning model learn the correlation between the voice presence probability and the noise type. This yields a machine learning model that is robust to noise, so the voice presence section can be determined more accurately even when the acoustic signal contains a large amount of noise.

FIG. 1 is a functional block diagram of an apparatus for determining a voice presence section according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram illustrating how the feature vector extraction unit of the apparatus derives a feature vector from an acoustic signal according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating how the machine learning model unit of the apparatus is trained according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the process of a method for determining a voice presence section according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various forms; the embodiments are provided only so that the disclosure is complete and fully conveys the scope of the invention to those of ordinary skill in the art to which the invention pertains. The scope of the invention is defined only by the claims.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted unless actually necessary. The terms used below are defined in consideration of their functions in the embodiments and may vary according to the intention or practice of users and operators; their definitions should therefore be based on the content of this specification as a whole.

The functional blocks shown in the drawings and described below are only examples of possible implementations. Other implementations may use other functional blocks without departing from the spirit and scope of the detailed description. In addition, although one or more functional blocks of the present invention are shown as separate blocks, one or more of them may be a combination of various hardware and software components that perform the same function.

The expression that something "includes" certain components is open-ended: it merely indicates that those components are present and should not be understood as excluding additional components.

Furthermore, when a component is said to be connected or coupled to another component, it may be directly connected or coupled to that component, but it should be understood that other components may be present in between.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a functional block diagram of a voice presence section determination apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 1, the voice presence section determination apparatus 100 according to an embodiment of the present invention determines the voice presence section in an acoustic signal 10 by estimating, from a feature vector extracted from the acoustic signal 10, the probability that voice is present in the acoustic signal 10 and the type of noise it contains. To this end, the apparatus 100 includes a feature vector extraction unit 110, a machine learning model unit 120, and a determination unit 130.

The feature vector extraction unit 110 extracts a feature vector from the acoustic signal 10. Here, the acoustic signal 10 is a signal that represents sound as physical numerical values, and it may contain both voice and noise.

FIG. 2 is an exemplary diagram illustrating how the feature vector extraction unit 110 of the voice presence section determination apparatus 100 derives a feature vector from the acoustic signal 10 according to an embodiment of the present invention.

Referring to FIG. 2, unlike conventional approaches that convert the acoustic signal 10 into the frequency domain before extracting a feature vector, the feature vector extraction unit 110 according to an embodiment of the present invention may extract the feature vector from the raw waveform of the acoustic signal 10 expressed in the time domain.

For example, the feature vector extraction unit 110 may extract the feature vector from values obtained by sampling the raw waveform of the acoustic signal 10. If the acoustic signal 10 is sampled at a sampling rate of 16 kHz, for instance, the physical value of the analog acoustic signal 10 is recorded 16,000 times per second, and the feature vector may be extracted from the recorded values.

When the sampling rate is high, one second contains a very large number of sample values, so the feature vector may be extracted from the sample values contained in a frame 11 obtained by dividing the acoustic signal 10 into predetermined time units, for example 0.1 seconds.

Conversely, when the sampling rate is low, and each box drawn on the raw waveform in FIG. 2 is taken to represent one second, a five-second span may be used as one frame 11 and the feature vector extracted from the values sampled over those five seconds.
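The framing described above can be sketched as follows. This is a minimal illustration, assuming non-overlapping frames and assuming that a trailing partial frame is discarded; the specification states neither choice, and the function name `split_into_frames` is hypothetical.

```python
import numpy as np

def split_into_frames(samples, sample_rate_hz, frame_seconds):
    """Split a 1-D array of samples into non-overlapping frames.

    Assumption (not stated in the specification): a trailing partial
    frame is dropped rather than padded.
    """
    samples = np.asarray(samples)
    frame_len = int(round(sample_rate_hz * frame_seconds))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return samples[: n_frames * frame_len].reshape(n_frames, frame_len)

# Two seconds of audio sampled at 16 kHz, divided into 0.1-second frames.
sample_rate_hz = 16_000
signal = np.zeros(sample_rate_hz * 2)
frames = split_into_frames(signal, sample_rate_hz, 0.1)
print(frames.shape)  # (20, 1600)
```

Each row of `frames` then plays the role of one frame 11 handed to the feature vector extraction unit 110.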

Here, the feature vector extraction unit 110 may extract a feature vector for each frame 11 of the acoustic signal 10 through a deep convolutional neural network (DCNN) trained to extract feature vectors from sampled values of the acoustic signal 10 expressed in the time domain. This makes it possible to use a feature vector that captures the intrinsic information of the acoustic signal 10 more efficiently than one extracted from the acoustic signal 10 after conversion to the frequency domain.

The layer structure of a DCNN is well known, so a detailed description is omitted. Although the neural network of the feature vector extraction unit 110 has been described using a DCNN as an example, it is not limited to this example, and the feature vector may be extracted from the acoustic signal 10 through various algorithms.
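As a rough illustration of extracting a feature vector directly from time-domain samples, the sketch below applies a single convolutional layer, a ReLU, and global max pooling to one frame. It stands in for the trained DCNN only loosely: the specification does not disclose the layer structure, so the single layer, the filter count, the filter width, and the random untrained weights are all assumptions, and `conv_features` is a hypothetical name.

```python
import numpy as np

def conv_features(frame, kernels):
    """One 1-D convolution layer + ReLU + global max pooling over time.

    frame:   1-D array of raw waveform samples for one frame.
    kernels: 2-D array, one filter per row (untrained here).
    Returns one feature value per filter.
    """
    # np.correlate with mode="valid" slides each filter along the frame.
    responses = np.stack([np.correlate(frame, k, mode="valid") for k in kernels])
    activated = np.maximum(responses, 0.0)  # ReLU
    return activated.max(axis=1)            # global max pooling

rng = np.random.default_rng(0)
kernels = rng.standard_normal((8, 64))  # 8 hypothetical filters, 64 samples wide
frame = rng.standard_normal(1600)       # one 0.1-second frame at 16 kHz
feature_vector = conv_features(frame, kernels)
print(feature_vector.shape)  # (8,)
```

In a trained network the filters would be learned, and several such layers would be stacked before the feature vector is passed on.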

The machine learning model unit 120 receives the feature vector and estimates the voice presence probability of the acoustic signal 10 from which the feature vector was derived and the type of noise contained in that acoustic signal 10. The machine learning model unit 120 may estimate the voice presence probability and the noise type for each frame 11 of the acoustic signal 10. To this end, the machine learning model unit 120 may be trained as shown in FIG. 3.

FIG. 3 is an exemplary diagram illustrating how the machine learning model unit 120 of the voice presence section determination apparatus 100 is trained according to an embodiment of the present invention.

Referring to FIG. 3, the machine learning model unit 120 may be generated using training acoustic signals that contain noise as the training data for machine learning. For example, the feature vector of a training acoustic signal that contains noise is set as the input data of machine learning, and whether voice is present in the training acoustic signal and the type of noise it contains are set as the output data. After the machine learning model unit 120 has been trained in this way, it can output the voice presence probability and the noise type of an acoustic signal 10 from the feature vector of that acoustic signal 10.

The type of noise set as the output data may be specified as a class that includes identification information for the non-voice sound. For example, the non-voice sound in a training acoustic signal is identified as a car sound, an airplane sound, or other noise, and the machine learning model unit 120 may be trained on training data in which each training acoustic signal is labeled with the class that identifies this information.

The machine learning model unit 120 thus learns whether voice is present in the training acoustic signal 10 while also learning to identify what kind of sound the non-voice noise is. As a result, the machine learning model unit 120 becomes better at separating the various sounds in an acoustic signal 10 into voice and non-voice.

In other words, because the machine learning model unit 120 is trained to output the noise type together with the voice presence probability, the parameters of the machine learning model learn the correlation between the presence of voice in the training acoustic signal 10 and the type of noise it contains, and the machine learning model unit 120 can therefore estimate the voice presence probability of a given acoustic signal 10 more accurately.

By also learning the type of noise, which is closely related to judging the voice presence probability, the machine learning model unit 120 becomes robust to the influence of noise.
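One common way to train a model on both outputs at once is to sum a binary cross-entropy term for voice presence and a categorical cross-entropy term for the noise class. The specification does not state the loss function, so the sketch below is only an assumption about how such joint training could be set up; `joint_loss`, `noise_weight`, and the label set `NOISE_CLASSES` are hypothetical.

```python
import numpy as np

# Hypothetical noise classes, following the examples given in the text.
NOISE_CLASSES = ["car", "airplane", "other_noise"]

def joint_loss(voice_prob, noise_logits, voice_label, noise_label, noise_weight=1.0):
    """Sum of a voice-presence loss and a noise-type loss for one frame.

    voice_prob:   predicted probability that voice is present (0..1).
    noise_logits: unnormalized scores, one per noise class.
    voice_label:  1 if voice is present in the frame, else 0.
    noise_label:  index into NOISE_CLASSES.
    """
    eps = 1e-12  # avoids log(0)
    bce = -(voice_label * np.log(voice_prob + eps)
            + (1 - voice_label) * np.log(1.0 - voice_prob + eps))
    # Numerically stable log-softmax over the noise classes.
    shifted = noise_logits - np.max(noise_logits)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    ce = -log_softmax[noise_label]
    return bce + noise_weight * ce

# A frame containing voice over car noise.
good = joint_loss(0.9, np.array([5.0, 0.0, 0.0]), voice_label=1, noise_label=0)
bad = joint_loss(0.1, np.array([0.0, 5.0, 0.0]), voice_label=1, noise_label=0)
print(good < bad)  # True
```

Because both terms are computed from the same shared parameters, minimizing their sum is one way the parameters could come to reflect the relationship between voice presence and noise type.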

The determination unit 130 identifies the frames 11 whose voice presence probability, as estimated by the machine learning model unit 120, is equal to or greater than a predetermined threshold. For example, if the predetermined threshold is 75% and the machine learning model unit 120 estimates a voice presence probability of 80% for a particular frame 11, the determination unit 130 may determine that voice is present in that frame 11.
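The threshold comparison in the example above can be written in a few lines. The sketch uses the 75% threshold from the text; the function name `voiced_frames` and the sample probabilities are illustrative assumptions.

```python
import numpy as np

def voiced_frames(voice_probs, threshold=0.75):
    """Return a boolean mask: True where the per-frame voice presence
    probability is equal to or greater than the threshold."""
    return np.asarray(voice_probs) >= threshold

probs = [0.80, 0.60, 0.75, 0.10]
print(voiced_frames(probs).tolist())  # [True, False, True, False]
```

Note that a frame whose probability equals the threshold exactly is treated as containing voice, matching the "equal to or greater than" wording.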

In addition, the determination unit 130 may extract only the voice signal through a filtering process that removes noise from the frames 11 determined to contain voice. Separating noise from the acoustic signal 10 is a well-known technique, so a detailed description is omitted.

The feature vector extraction unit 110, the machine learning model unit 120, and the determination unit 130 of the embodiment described above may be implemented by a computing device that includes a memory containing instructions programmed to perform their functions and a microprocessor that executes those instructions.

FIG. 4 is a flowchart illustrating the process of a method for determining a voice presence section according to an embodiment of the present invention. Each step of the method of FIG. 4 may be performed by the voice presence section determination apparatus 100 described with reference to FIGS. 1 to 3, and the steps are as follows.

First, the feature vector extraction unit 110 extracts a feature vector for each frame 11 of the acoustic signal 10 (S410). Next, the machine learning model unit 120 receives the feature vector and estimates the voice presence probability and the noise type for each frame 11 (S420). The determination unit 130 then identifies the frames 11 whose voice presence probability is equal to or greater than a predetermined threshold (S430).
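The three steps can be chained as in the sketch below. It is a skeleton only: `extract_features` and `estimate` are caller-supplied placeholders for the trained feature extractor and machine learning model, and the toy stand-ins in the usage example (mean absolute amplitude as the "feature", a clipped value as the "probability") are assumptions chosen purely so the example runs.

```python
from typing import Callable, List, Sequence, Tuple

def determine_voice_frames(
    frames: Sequence[Sequence[float]],
    extract_features: Callable[[Sequence[float]], float],
    estimate: Callable[[float], Tuple[float, str]],
    threshold: float = 0.75,
) -> List[int]:
    """Return the indices of frames judged to contain voice."""
    voiced = []
    for index, frame in enumerate(frames):
        features = extract_features(frame)            # S410
        voice_prob, _noise_type = estimate(features)  # S420
        if voice_prob >= threshold:                   # S430
            voiced.append(index)
    return voiced

# Toy stand-ins, not the trained components described in the text.
def toy_features(frame):
    return sum(abs(x) for x in frame) / len(frame)

def toy_estimate(feature):
    return min(1.0, feature), "other_noise"

frames = [[0.0, 0.0], [0.9, -0.9], [0.2, 0.2]]
print(determine_voice_frames(frames, toy_features, toy_estimate))  # [1]
```

The estimated noise type is produced in S420 but is not used by the threshold test in S430, which depends on the voice presence probability alone.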

The way in which the components that carry out each of the above steps perform them has already been described with reference to FIGS. 1 to 3, so a repeated description is omitted.

According to the embodiment described above, extracting the feature vector from the acoustic signal 10 as expressed in the time domain makes it possible to use, for machine learning, a feature vector that captures the intrinsic information of the acoustic signal 10 more efficiently.

In addition, both the presence of voice in the acoustic signal 10 and the type of noise it contains are set as output data in the machine learning stage, so that the parameters of the machine learning model learn the correlation between the voice presence probability and the noise type. This yields a machine learning model that is robust to noise, so the voice presence section can be determined more accurately even when the acoustic signal 10 contains a large amount of noise.

The embodiments of the present invention described above may be implemented through various means. For example, they may be implemented by hardware, firmware, software, or a combination thereof.

In the case of a hardware implementation, a method according to embodiments of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.

In the case of a firmware or software implementation, a method according to embodiments of the present invention may be implemented in the form of a module, procedure, or function that performs the functions or operations described above. A computer program in which software code or the like is recorded may be stored in a computer-readable recording medium or a memory unit and run by a processor. The memory unit may be located inside or outside the processor and may exchange data with the processor by various known means.

Those skilled in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

10: acoustic signal
11: frame
100: voice presence section determination apparatus
110: feature vector extraction unit
120: machine learning model unit
130: determination unit

Claims

An apparatus for determining a voice presence section, comprising:
a feature vector extraction unit that extracts, for each frame, a feature vector from values obtained by sampling the raw waveform of an acoustic signal expressed in the time domain;
a machine learning model unit that receives the feature vector and estimates a per-frame voice presence probability and a per-frame noise type; and
a determination unit that identifies frames whose voice presence probability is equal to or greater than a predetermined threshold,
wherein the machine learning model unit is generated by setting feature vectors of a plurality of training acoustic signals, each associated with one of a plurality of noise types, as input data of machine learning, setting whether voice is present in each of the plurality of training acoustic signals and the type of noise contained in each of the plurality of training acoustic signals as output data, and learning parameters for producing the output data from the input data,
wherein the parameters learn a correlation between the presence of voice in the training acoustic signals and the type of noise contained in the training acoustic signals, and
wherein the type of noise set as the output data is specified as a class that includes identification information for non-voice sound.

The apparatus of claim 1,
wherein the frame is obtained by dividing the acoustic signal into predetermined time units.

(Deleted)

A method for determining a voice presence section, the method being performed by one or more processors and comprising:
extracting, for each frame, a feature vector from values obtained by sampling the raw waveform of an acoustic signal expressed in the time domain;
receiving the feature vector and estimating a per-frame voice presence probability and a per-frame noise type; and
identifying frames whose voice presence probability is equal to or greater than a predetermined threshold,
wherein the estimating uses a model generated by setting the feature vector of a training acoustic signal that contains noise as input data of machine learning, setting whether voice is present in the training acoustic signal and the type of noise contained in the training acoustic signal as output data, and learning parameters for producing the output data from the input data,
wherein the parameters learn a correlation between the presence of voice in the training acoustic signal and the type of noise contained in the training acoustic signal, and
wherein the type of noise set as the output data is specified as a class that includes identification information for non-voice sound.

A computer-readable recording medium on which a program is recorded, the program containing instructions that cause a processor to perform:
extracting, for each frame, a feature vector from values obtained by sampling the raw waveform of an acoustic signal expressed in the time domain;
receiving the feature vector and estimating a per-frame voice presence probability and a per-frame noise type; and
identifying frames whose voice presence probability is equal to or greater than a predetermined threshold,
wherein the estimating uses a model generated by setting the feature vector of a training acoustic signal that contains noise as input data of machine learning, setting whether voice is present in the training acoustic signal and the type of noise contained in the training acoustic signal as output data, and learning parameters for producing the output data from the input data,
wherein the parameters learn a correlation between the presence of voice in the training acoustic signal and the type of noise contained in the training acoustic signal, and
wherein the type of noise set as the output data is specified as a class that includes identification information for non-voice sound.