KR20190135916A

KR20190135916A - Apparatus and method for determining user stress using speech signal

Info

Publication number: KR20190135916A
Application number: KR1020190047116A
Authority: KR
Inventors: 강홍구; 변경근; 한혜원; 권오성
Original assignee: 연세대학교 산학협력단
Priority date: 2018-05-29
Filing date: 2019-04-23
Publication date: 2019-12-09
Also published as: KR102241364B1

Abstract

The present invention provides an apparatus and a method for determining user stress using a speech signal, which can accurately determine the stress of a user. The apparatus comprises: a power spectrum conversion unit that converts a speech signal divided into a plurality of frames into a power spectrum; a filter bank unit that includes a plurality of Mel-filter banks, each having a designated frequency band and pattern, and filters the power spectrum of each of the plurality of frames to acquire a plurality of Mel-filter bank energies; a feature vector acquisition unit that applies a plurality of weights, determined in accordance with a pre-learned pattern estimation method, to the corresponding Mel-filter bank energies to acquire a feature vector for each frame; a frame feature extraction unit that sequentially encodes the plurality of per-frame feature vectors in accordance with a pre-learned pattern estimation method to extract frame features, encoding each per-frame feature vector together with the previously acquired frame feature; and a speech feature extraction unit that receives the final frame feature among the frame features acquired by the frame feature extraction unit and extracts, in accordance with a pre-learned pattern estimation method, a speech feature corresponding to the stress of the user from the final frame feature.

Description

Apparatus and method for determining user stress using speech signal

The present invention relates to an apparatus and method for determining user stress, and more particularly to an apparatus and method that can determine the presence or absence of user stress from a speech signal based on deep learning.

Techniques for determining stress from a speech signal generally proceed through a statistical method that extracts features related to emotion or stress recognition from the speech signal and models the relationship between the extracted features and predetermined stress labels.

Here, feature extraction is mainly performed based on methods proposed by skilled experts. Although such methods are highly useful, there is a limitation in that one cannot be sure whether the selected feature extraction method gives the best result for the data to be trained on.

Existing statistical methods include the Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) algorithms. The GMM algorithm represents the probability distribution of the data to be classified as a sum of multiple normal distributions, so that when data is given at test time it can classify which group the given data belongs to. The SVM algorithm classifies data by optimizing a hyperplane that separates two groups of data.

However, these statistical methods have difficulty modeling changes along the time axis, which makes it difficult to determine stress from a speech signal whose characteristics change very rapidly in the time domain.

Korean Laid-Open Patent Publication No. 10-2017-0117019 (published Oct. 20, 2017)

An object of the present invention is to provide an apparatus and method for determining user stress using a speech signal that can accurately determine a user's stress by extracting an optimized feature vector from the speech signal.

To achieve the above object, an apparatus for determining user stress using a speech signal according to an embodiment of the present invention includes: a power spectrum conversion unit that converts a speech signal divided into a plurality of frames into a power spectrum; a filter bank unit that includes a plurality of Mel-filter banks, each having a predetermined frequency band and pattern, and filters the power spectrum of each of the plurality of frames to acquire a plurality of Mel-filter bank energies; a feature vector acquisition unit that applies a plurality of weights, determined according to a pre-learned pattern estimation method, to the corresponding Mel-filter bank energies among the plurality of Mel-filter bank energies to acquire a feature vector for each frame; a frame feature extraction unit that sequentially encodes the plurality of per-frame feature vectors according to a pre-learned pattern estimation method to extract frame features, encoding the previously acquired frame feature together with each feature vector; and a speech feature extraction unit that receives the final frame feature among the frame features acquired by the frame feature extraction unit and extracts, according to a pre-learned pattern estimation method, a speech feature corresponding to the user's stress from the final frame feature.

The frame feature extraction unit may be implemented as a pre-trained artificial neural network including a plurality of encoders, each of which receives the corresponding per-frame feature vector among the plurality of per-frame feature vectors together with the previously acquired frame feature and extracts the current frame feature.

The speech feature extraction unit may combine the plurality of frame features extracted by the respective encoders and extract the speech feature from the combined frame features according to a pre-learned pattern estimation method.

The apparatus may further include a learning unit that, during training, acquires training data labeled with stress levels, extracts an error by comparing the speech feature obtained for the training data with the labeled stress level, and back-propagates the error to the frame feature extraction unit and the speech feature extraction unit to train them. If the error between the labeled stress level and the speech feature obtained with the trained frame feature extraction unit and speech feature extraction unit exceeds a predetermined reference error, the learning unit back-propagates the error to the feature vector acquisition unit to update the weights and then trains the frame feature extraction unit and the speech feature extraction unit again.

Each of the plurality of Mel-filter banks may be implemented as a triangular filter that follows a Mel-frequency scale predetermined for each frequency band, the filters having mutually different bandwidths.

To achieve the above object, a method for determining user stress using a speech signal according to another embodiment of the present invention includes: dividing a speech signal into a plurality of frames and converting each of the frames into a power spectrum; filtering the per-frame power spectrum using a filter bank having a plurality of filters, each with a predetermined frequency band and pattern, to acquire the Mel-filter bank energies of each frame; acquiring a feature vector by applying, to the plurality of Mel-filter bank energies, weights determined for each frame by a pre-learned pattern estimation method; sequentially encoding the weighted Mel-filter bank energies of the feature vectors according to a pre-learned pattern estimation method to extract frame features, encoding the previously acquired frame feature together with each feature vector; and receiving the final frame feature among the acquired frame features and extracting, according to a pre-learned pattern estimation method, a speech feature corresponding to the user's stress from the final frame feature.

Accordingly, the apparatus and method for determining user stress using a speech signal according to embodiments of the present invention extract an optimized feature vector from the speech signal and use an artificial neural network that can reflect previous information in the extracted feature vectors, which can greatly improve the accuracy of stress determination. In addition, the parameters of speech that are related to stress can be modeled accurately.

FIG. 1 shows a schematic structure of an apparatus for determining user stress using a speech signal according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams for describing the operation of the feature vector acquisition unit of FIG. 1.
FIG. 4 shows a detailed configuration of the feature extraction unit of FIG. 1.
FIG. 5 shows a method for determining user stress using a speech signal according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. However, the present invention may be embodied in many different forms and is not limited to the embodiments described. To describe the present invention clearly, parts irrelevant to the description are omitted, and like reference numerals in the drawings denote like members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification mean a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 shows a schematic structure of an apparatus for determining user stress using a speech signal according to an embodiment of the present invention, FIGS. 2 and 3 are diagrams for describing the operation of the feature vector acquisition unit of FIG. 1, and FIG. 4 shows a detailed configuration of the feature extraction unit of FIG. 1.

Referring to FIG. 1, the apparatus for determining user stress using a speech signal according to this embodiment includes a speech signal acquisition unit 110, a feature vector acquisition unit 120, and a stress determination unit 130.

The speech signal acquisition unit 110 acquires the speech signal of the user whose stress is to be determined. The speech signal acquisition unit 110 may be implemented as a device such as a microphone, but it may also be implemented as various other devices, such as a communication device that acquires the speech signal through wired or wireless communication or a storage device in which the speech signal is stored in advance.

The feature vector acquisition unit 120 receives the speech signal acquired by the speech signal acquisition unit 110 and acquires a feature vector in a predetermined manner. The feature vector acquisition unit 120 divides the time-domain speech signal into frames of a predetermined length (for example, 5 ms) and analyzes the energy of each frequency band in each frame to acquire the feature vector of the speech signal.

For example, the feature vector acquisition unit 120 may acquire the feature vector by extracting per-frame Mel-filter bank energies from the speech signal based on the Mel-frequency cepstral coefficients (MFCC) technique.

The MFCC technique is a representative existing technique for acquiring feature vectors from a speech signal. Taking human auditory characteristics into account, it uses a plurality of Mel-filter banks, each preset to have a frequency band of a corresponding width (referred to as the Mel-frequency scale), to divide the speech signal into several frequency bands. It then obtains the energy of the signal filtered by each Mel-filter bank and post-processes the obtained energies with various statistical techniques to acquire the feature vector.
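The specification does not state which Hz-to-mel mapping defines the Mel-frequency scale. As a minimal sketch, assuming the widely used convention m = 2595 * log10(1 + f / 700), the band edges of a Mel-filter bank can be placed at equal steps on the mel axis. The function names and the 0 to 8000 Hz range below are illustrative assumptions, not part of the disclosure.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Hz to mel, using the common 2595 * log10(1 + f / 700) convention (assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(num_filters: int, f_min_hz: float, f_max_hz: float) -> list:
    """Return num_filters + 2 edge frequencies in Hz, equally spaced in mel."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_filters + 2)]


edges = mel_band_edges(4, 0.0, 8000.0)  # hypothetical range for illustration
```

Because the spacing is uniform in mel, the resulting bands widen toward higher frequencies, which matches the description of filters with different bandwidths.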

The MFCC technique is an extraction method that lets the speech signal carry different meaning depending on the frequency range filtered by each Mel-filter bank, and it is used in various fields such as speech recognition and speech synthesis.

In this embodiment, however, the feature vector acquisition unit 120 uses the per-band energies filtered by the Mel-filter banks themselves as the feature vector. That is, it performs no separate post-processing by statistical techniques. This allows the stress determination unit 130, described later, to determine stress by extracting features that are as close as possible to the original speech signal.

That said, so that the user's stress can be determined accurately, the feature vector acquisition unit 120 in this embodiment acquires the feature vector by applying a corresponding weight w to the per-band energy of each frame. The weights w for the respective per-band energies may be updated and optimized during the training process of the stress determination apparatus.

The feature vector acquisition unit 120 may include a frame division unit 121, a power spectrum conversion unit 123, a filter bank unit 125, and a weight application unit 127.

The frame division unit 121 receives the speech signal from the speech signal acquisition unit 110 and divides it into a plurality of frames of a predetermined time unit (for example, 5 ms).
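The frame division step can be sketched as follows. The description gives only the frame length (5 ms as an example); the non-overlapping layout, the dropping of a trailing partial frame, and the 16 kHz sample rate in the usage line are assumptions made for illustration.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=5.0):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Assumptions (not stated in the source): frames do not overlap and a
    trailing partial frame is discarded.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    num_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(num_frames)]


# 200 samples at a hypothetical 16 kHz rate give two full 80-sample frames.
frames = split_into_frames(list(range(200)), 16000)
```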

The power spectrum conversion unit 123 converts the speech signal divided into a plurality of frames into a power spectrum. FIG. 2(a) shows an example of the power spectrum converted for each frame. For example, the power spectrum conversion unit 123 may obtain the power spectrum by performing a fast Fourier transform (FFT) on each of the frames.
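A minimal sketch of the per-frame power spectrum follows. To stay dependency-free it uses a direct discrete Fourier transform, which yields the same values an FFT would, only more slowly. The 1/N normalization and the choice to keep only the non-negative frequency bins are assumptions, since the description does not fix them.

```python
import cmath
import math


def power_spectrum(frame):
    """Return |X(k)|^2 / N for k = 0 .. N/2 of a real-valued frame.

    A direct DFT stands in for the FFT here; the 1/N scaling is an assumed
    convention, not taken from the source.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


# A cosine with exactly 4 cycles in a 32-sample frame peaks at bin 4.
frame = [math.cos(2.0 * math.pi * 4 * t / 32) for t in range(32)]
s = power_spectrum(frame)
```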

The filter bank unit 125 includes a plurality of Mel-filter banks, each configured to filter a designated frequency band. Each Mel-filter bank filters the power spectrum to acquire the Mel-filter bank energy, which represents the energy of each frame in each frequency band. As shown in FIG. 2(b), the Mel-filter banks may be implemented as triangular filters having different bandwidths on a Mel-frequency scale predetermined for each frequency band. The Mel-filter banks filter each of the plurality of frames to extract the Mel-filter bank energy for each frame and frequency band, as shown in FIG. 2(c).

The filtering function of each of the plurality of Mel-filter banks may be expressed as H_m(k), and the power spectrum of each frame may be filtered according to Equation 1.

Here, k denotes the index of the power spectrum, m denotes the Mel-filter bank index, and f(·) denotes the frequency band of the Mel-filter bank with index m.
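Equation 1 itself is not reproduced in this text. For orientation only, the triangular Mel-filter that is conventionally written with this notation is shown below, assuming f(m) denotes the power-spectrum index at the center of the m-th filter and f(m-1), f(m+1) denote its lower and upper edges. This is the textbook form and may differ in detail from the expression in the filing.

```latex
H_m(k) =
\begin{cases}
0, & k < f(m-1) \\[4pt]
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[8pt]
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\[8pt]
0, & k > f(m+1)
\end{cases}
```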

The Mel-filter bank energy e_m is obtained for each frame as the sum of the power spectrum filtered by each of the Mel-filter banks, as in Equation 2.

Here, s(k) denotes the power spectrum at the k-th index in an arbitrary frame.
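Equation 2 is likewise not reproduced in this text. From the description (the sum, over the power-spectrum indices, of the power spectrum filtered by the m-th Mel-filter bank), it presumably has the following form, where K, the number of power-spectrum bins in a frame, is a symbol introduced here for illustration.

```latex
e_m = \sum_{k=0}^{K-1} s(k)\, H_m(k)
```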

That is, the Mel-filter bank energy e_m obtained through filtering by the m-th Mel-filter in an arbitrary frame is calculated according to Equation 2. In this way, the Mel-filter bank energies e_m of one frame can be calculated.
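The filtering and summation can be sketched as follows, assuming the conventional triangular filter shape and assuming the band edges are already given as power-spectrum indices. The `edge_bins` list and both function names are hypothetical.

```python
def triangular_filter(edge_bins, m, num_bins):
    """Filter m (1-based) rises from edge_bins[m-1] to a peak of 1 at
    edge_bins[m] and falls back to 0 at edge_bins[m+1]."""
    left, center, right = edge_bins[m - 1], edge_bins[m], edge_bins[m + 1]
    h = [0.0] * num_bins
    for k in range(num_bins):
        if left <= k <= center and center > left:
            h[k] = (k - left) / (center - left)
        elif center < k <= right and right > center:
            h[k] = (right - k) / (right - center)
    return h


def mel_filter_bank_energies(power_spec, edge_bins):
    """e_m = sum over k of s(k) * H_m(k), for each of the M filters."""
    num_filters = len(edge_bins) - 2
    energies = []
    for m in range(1, num_filters + 1):
        h = triangular_filter(edge_bins, m, len(power_spec))
        energies.append(sum(s * w for s, w in zip(power_spec, h)))
    return energies


# Flat spectrum, two filters with edges at bins 0, 2, 4, 8 (illustrative values).
energies = mel_filter_bank_energies([1.0] * 9, [0, 2, 4, 8])
```

With a flat spectrum, the wider second filter collects more energy than the first (3.0 versus 2.0 here), which shows how bandwidth alone changes e_m.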

The weight application unit 127 is implemented as a pre-trained artificial neural network and applies a corresponding weight w_m to each of the Mel-filter bank energies e_m acquired by the filter bank unit 125. The weight application unit 127 may apply a different weight w_m to each Mel-filter bank according to a pre-learned pattern. The weight application unit 127 then acquires, for each frame, a feature vector composed of the weighted Mel-filter bank energies w_m * e_m.

Here, the weights w_m may be applied to the per-band energies filtered by the Mel-filter banks, or they may instead be applied as the gains of the Mel-filters in the Mel-filter banks. That is, the weight w_m may be applied directly to each of the Mel-filter banks so that the weighted per-band energy e_m * w_m is obtained, and the obtained per-band Mel-filter bank energies are then summed.

FIG. 3 illustrates the concept by which the feature vector acquisition unit 120 acquires the feature vector. In FIG. 3, (a) shows the power spectrum s(k) of one frame that has been divided by the frame division unit 121 and converted by the power spectrum conversion unit 123; (b) shows the filtering function H_m(k) of each Mel-filter bank; (c) shows the Mel-filter bank energies e_m obtained by filtering the power spectrum s(k) with each Mel-filter bank; and (d) shows the process of applying the weight w_m to each Mel-filter bank energy e_m. Finally, (e) shows the feature vector v_f composed of the weighted Mel-filter bank energies we_m = e_m * w_m, to which the weight w_m corresponding to each Mel-filter bank has been applied in each frame. As shown in FIG. 3, when there are M Mel-filter banks, the per-frame feature vector v_f consists of M weighted Mel-filter bank energies we_1 to we_M.
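The two ways of applying the weights give the same result, because scaling the filter by a gain w_m scales its summed output by the same factor. A minimal sketch, with hypothetical function names and made-up numbers:

```python
def weighted_feature_vector(energies, weights):
    """v_f = [w_1 * e_1, ..., w_M * e_M] for one frame."""
    if len(energies) != len(weights):
        raise ValueError("one weight is needed per Mel-filter bank energy")
    return [w * e for w, e in zip(weights, energies)]


def energy_with_filter_gain(power_spec, filter_response, gain):
    """Apply the weight as a gain on the filter itself, then sum."""
    return sum(s * (gain * h) for s, h in zip(power_spec, filter_response))


# Illustrative values only.
v_f = weighted_feature_vector([2.0, 3.0], [0.5, 2.0])

spec, response, gain = [1.0, 2.0, 3.0], [0.0, 0.5, 1.0], 2.0
as_gain = energy_with_filter_gain(spec, response, gain)
as_weight = gain * sum(s * h for s, h in zip(spec, response))
```

Here `as_gain` and `as_weight` are equal, so the choice between the two placements is an implementation detail rather than a change in the resulting feature vector.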

The stress determination unit 130 is implemented as an artificial neural network in which a pattern estimation method has been learned in advance. It receives the per-frame feature vectors v_1 to v_F acquired from the respective frames, extracts stress-related features f_1 to f_F from them, and determines the user's stress. For example, the stress determination unit 130 may be implemented as a recurrent neural network (RNN) or a long short-term memory (LSTM) network, which are artificial neural networks that reflect previously extracted features when extracting the current feature.

Referring to FIG. 4, the stress determination unit 130 includes a frame feature extraction unit 131 and a speech feature extraction unit 133.

The frame feature extraction unit 131 is implemented as an artificial neural network including a plurality of encoders EN_1 to EN_F, each of which encodes the corresponding one of the per-frame feature vectors v_1 to v_F to extract the frame features f_1 to f_F. FIG. 4 shows the case in which the frame feature extraction unit 131 is implemented as an LSTM, in which case the encoders EN_1 to EN_F can be regarded as LSTM cells.

Each encoder passes its extracted frame feature to the encoder of the next stage, and the final-stage encoder EN_F passes the extracted final frame feature f_F to the speech feature extraction unit 133.

The encoders EN_1 to EN_(F-1) pass their extracted frame features f_1 to f_(F-1) to the next-stage encoders so that each previously extracted frame feature is taken into account when the next frame feature (f_2 to f_F) is extracted. This links the features of the frames, which are divided by a predetermined time unit, so that temporal information is reflected. The final frame feature f_F can therefore be regarded as reflecting the accumulated characteristics of the preceding frame features f_1 to f_(F-1).
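The chaining of encoders can be illustrated with a deliberately simplified recurrent cell. The sketch below uses a plain tanh recurrence rather than an LSTM cell, so it shows only how each frame feature depends on the previous one. The function name, the parameter names, and all numeric values are assumptions made for illustration.

```python
import math


def encode_frames(feature_vectors, w_in, w_rec, bias):
    """Compute f_t = tanh(W_in v_t + W_rec f_(t-1) + b) for t = 1 .. F.

    A simple recurrent cell stands in for the LSTM cell of the description.
    Returns the list [f_1, ..., f_F]; the last entry is the final frame feature.
    """
    hidden_size = len(bias)
    f_prev = [0.0] * hidden_size  # no previous feature before the first frame
    frame_features = []
    for v in feature_vectors:
        f_cur = []
        for j in range(hidden_size):
            z = bias[j]
            z += sum(w_in[j][i] * v[i] for i in range(len(v)))
            z += sum(w_rec[j][i] * f_prev[i] for i in range(hidden_size))
            f_cur.append(math.tanh(z))
        frame_features.append(f_cur)
        f_prev = f_cur  # handed to the next-stage encoder
    return frame_features


# Two sequences that differ only in their first frame (toy values).
feats_a = encode_frames([[1.0], [0.0]], [[1.0]], [[0.5]], [0.0])
feats_b = encode_frames([[0.0], [0.0]], [[1.0]], [[0.5]], [0.0])
```

The final features of the two sequences differ even though their last frames are identical, which is the accumulation effect described above.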

The speech feature extraction unit 133 is also implemented as a pre-trained artificial neural network and extracts a speech feature s corresponding to the user's stress from the final frame feature f_F.

In some cases, the speech feature extraction unit 133 may be configured to receive all of the frame features f_1 to f_F extracted by the encoders EN_1 to EN_F, concatenate them, and extract the speech feature s from the result.

In addition, when extracting the frame features f_1 to f_F, the encoders EN_1 to EN_F of the frame feature extraction unit 131 may additionally extract hidden features h_1 to h_F and pass each extracted hidden feature to the next-stage encoder, so that the previously extracted hidden feature is also taken into account when the next frame feature is extracted.

The speech feature extraction unit 133 may extract the speech feature s corresponding to the user's stress as the presence or absence of stress, or as a stress level divided into predetermined units.
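The two output forms can be sketched with a single linear layer on top of the final frame feature: a sigmoid for presence or absence, and a softmax for discrete levels. The description does not specify the layer structure of the speech feature extraction unit, so the layer shapes, the 0.5 threshold, and the function names are all assumptions.

```python
import math


def stress_presence(final_feature, weights, bias, threshold=0.5):
    """Binary output: (is_stressed, probability) from a sigmoid unit."""
    z = bias + sum(w * x for w, x in zip(weights, final_feature))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= threshold, probability


def stress_level(final_feature, weight_rows, biases):
    """Multi-level output: (most likely level index, per-level probabilities)."""
    logits = [b + sum(w * x for w, x in zip(row, final_feature))
              for row, b in zip(weight_rows, biases)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    level = max(range(len(probs)), key=probs.__getitem__)
    return level, probs


# Toy parameters; a trained network would supply these.
present, p = stress_presence([1.0], [2.0], 0.0)
level, probs = stress_level([1.0], [[-1.0], [0.0], [3.0]], [0.0, 0.0, 0.0])
```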

In the stress determination apparatus according to this embodiment, the stress determination unit 130 and the weight application unit 127, which are implemented as artificial neural networks, must be trained and optimized in advance as described above. The stress determination apparatus may therefore further include a learning unit 140 for training the stress determination unit 130 and the weight application unit 127.

During training of the stress determination apparatus, the learning unit 140 may acquire a plurality of training data, which are speech signals labeled with the presence or absence of stress or with a stress level, and provide the acquired training data to the speech signal acquisition unit 110. The learning unit 140 may then receive the stress determination result from the stress determination unit 130, compare it with the presence or absence of stress or the stress level labeled in the training data to determine an error, and back-propagate the determined error to the stress determination unit 130 to train it. For convenience, the following description assumes that the training data are labeled with stress levels, but training data labeled with the presence or absence of stress may equally be used, as noted above.

학습부(140)는 다수의 학습용 데이터를 이용하여 스트레스 판별부(130)를 반복학습 시키며, 이때, 특징 벡터 획득부(120)의 다수의 멜-필터 뱅크 각각에 대응하는 다수의 가중치(w₁ ~ w_M)는 프레임별로 이전 획득된 값으로 고정된다. 그리고 학습부(140)는 현재 가중치들(w₁ ~ w_M)에 기반하여 다수의 학습용 데이터에 대한 스트레스 판별 결과에 따른 오차를 스트레스 판별부(130)로 역전파하여 스트레스 판별부(130)를 학습시킨다. 그리고 학습된 스트레스 판별부(130)에 기반하여 다시 다수의 학습용 데이터에 대한 스트레스 판별 결과에 따른 오차를 획득하고, 획득된 오차를 가중치 적용부(127)로 전달하여, 다수의 가중치(w₁ ~ w_M)를 업데이트한다.The learner 140 repeatedly learns the stress determiner 130 using a plurality of pieces of learning data, and at this time, a plurality of weights w ₁ corresponding to each of the plurality of mel-filter banks of the feature vector acquirer 120. ~ w _M ) is fixed to a previously obtained value for each frame. In addition, the learner 140 reverse-propagates the error according to the stress determination result for the plurality of learning data based on the current weights w ₁ to w _M to the stress determination unit 130 to perform the stress determination unit 130. To learn. Based on the learned stress determination unit 130, an error according to the stress determination result of the plurality of learning data is obtained again, and the obtained error is transmitted to the weight application unit 127, and a plurality of weights (w ₁ ˜) are obtained. w _M ) is updated.

When the weights (w₁ to w_M) corresponding to the plurality of mel-filter banks have been updated by the weight update unit 150, the process of training the stress determination unit 130 with the plurality of training data items is repeated using the updated weights (w₁ to w_M). That is, each time the weights (w₁ to w_M) are updated, the stress determination unit 130 is trained again.

The learning unit 140 ends the training of the stress determination apparatus when the number of training iterations reaches a predetermined reference count or when the error falls within a predetermined reference error.
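The alternating training described above can be illustrated with a minimal sketch. It is not the embodiment itself: a single logistic unit stands in for the stress determination unit 130, each training item is reduced to one vector of mel-filter bank energies, and the names `train`, `ref_error`, `max_rounds` and `lr` are hypothetical. What it shows is only the control flow: train the discriminator with the weights held fixed, check the error, and back-propagate to the weights only when the error still exceeds the reference error.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(e, w, theta, b):
    # Weighted energies w_m * e_m feed a toy logistic "discriminator".
    z = sum(t * wi * ei for t, wi, ei in zip(theta, w, e)) + b
    return sigmoid(z)

def mean_error(data, w, theta, b):
    # Mean squared error between predictions and labels (assumed error measure).
    return sum((predict(e, w, theta, b) - y) ** 2 for e, y in data) / len(data)

def train(data, n_bands, ref_error=0.05, max_rounds=200, lr=0.5):
    w = [1.0] * n_bands       # filter-bank weights w_1..w_M
    theta = [0.0] * n_bands   # discriminator parameters (stand-in for unit 130)
    b = 0.0
    err = mean_error(data, w, theta, b)
    round_no = 0
    for round_no in range(1, max_rounds + 1):
        # Phase 1: weights held fixed, discriminator trained on all data.
        for e, y in data:
            p = predict(e, w, theta, b)
            g = 2.0 * (p - y) * p * (1.0 - p)   # d(error)/dz
            theta = [t - lr * g * wi * ei for t, wi, ei in zip(theta, w, e)]
            b -= lr * g
        err = mean_error(data, w, theta, b)
        if err <= ref_error:
            break                               # error within reference error
        # Phase 2: discriminator held fixed, error back-propagated to weights.
        grad_w = [0.0] * n_bands
        for e, y in data:
            p = predict(e, w, theta, b)
            g = 2.0 * (p - y) * p * (1.0 - p)
            for m in range(n_bands):
                grad_w[m] += g * theta[m] * e[m] / len(data)
        w = [wi - lr * gw for wi, gw in zip(w, grad_w)]
    return w, theta, b, err, round_no
```

Calling `train(data, n_bands=M)` with `data` as a list of `(energies, label)` pairs returns the final weights, the discriminator parameters, the last error and the number of rounds used; the loop stops on whichever of the two termination conditions is met first.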

FIG. 5 illustrates a method for determining user stress using a speech signal according to an embodiment of the present invention.

The method for determining user stress using a speech signal according to this embodiment is described below with reference to FIGS. 1 to 4. First, a plurality of training data items for training the stress determination apparatus are acquired (S11). Here, each training data item is a speech signal labeled with the presence or absence of stress or with a stress level.

The acquired training data is then input to the untrained stress determination apparatus to obtain a stress determination result (S12). Once the stress determination result for the training data is obtained, it is compared with the stress level labeled in the training data to determine an error, and the determined error is back-propagated to the stress determination unit 130 of the stress determination apparatus to train the stress determination unit (S13).

It is then determined whether the stress determination unit 130 has been trained on all of the acquired training data (S14). If training has not yet been performed on all of the training data, further training data is input to the stress determination apparatus to obtain a stress determination result (S12). If training has been performed on all of the acquired training data, it is determined whether the determined error is less than or equal to the predetermined reference error (S15).

If the error exceeds the reference error, the error is back-propagated to the weight application unit 127 of the feature vector acquisition unit 120 to update the weights (w₁ to w_M) (S16). The stress determination unit 130 is then trained again using all of the training data.

If, on the other hand, the error is within the reference error, training of the stress determination apparatus ends, and a speech signal of the user whose stress is to be determined is acquired (S17). The acquired speech signal is divided into a plurality of frames of a predetermined time unit (S18). Each of the frames is then converted into a power spectrum (S19).
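Steps S18 and S19 can be sketched as follows. The frame length, hop size and Hamming window are assumptions, since the text specifies only a predetermined time unit, and the helper names `split_frames` and `power_spectrum` are hypothetical. A naive DFT is used so that the sketch depends only on the standard library; a practical implementation would more likely use an FFT routine.

```python
import cmath
import math

def split_frames(signal, frame_len, hop):
    """Split a list of samples into fixed-length, possibly overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Power spectrum of one frame for bins 0..N/2, via a naive DFT."""
    n = len(frame)
    # Hamming window (an assumed choice) to reduce spectral leakage.
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spec.append(abs(acc) ** 2 / n)
    return spec
```

Applying `power_spectrum` to every frame returned by `split_frames` yields one power spectrum per frame, which is the input expected by the filter bank stage.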

Meanwhile, the plurality of mel-filter banks of the feature vector acquisition unit 120 filter the power spectrum of each frame according to a predetermined mel-frequency scale and frequency bandwidth to obtain a plurality of mel-filter bank energies (e_m) (S20). The weights (w_m) obtained through training are then applied to the obtained mel-filter bank energies (e_m) to obtain a plurality of weighted mel-filter bank energies (we_m), and per-frame feature vectors (v₁ to v_F), each composed of the weighted mel-filter bank energies (we_m), are obtained (S21).
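As an illustration of steps S20 and S21, the sketch below builds triangular filters whose centres are evenly spaced on the mel scale and multiplies each filter bank energy e_m by its weight w_m. The mel conversion formula (2595·log10(1 + f/700)), the unnormalised triangles and the function names are assumptions; the text does not state which mel formula or normalisation the embodiment uses, and no logarithm is applied here because none is mentioned.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_bins, sample_rate):
    """Triangular filters with centres evenly spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # n_filters + 2 edge points, expressed as fractional bin positions.
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) / nyquist * (n_bins - 1)
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))    # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def weighted_energies(power_spec, bank, weights):
    """Per-frame feature vector: we_m = w_m * e_m for each filter m."""
    energies = [sum(h * p for h, p in zip(row, power_spec)) for row in bank]
    return [w * e for w, e in zip(weights, energies)]
```

Because the filters widen toward higher frequencies, each e_m pools a different number of spectral bins; the learned weights then rescale the bands relative to one another, and a weight of zero removes a band from the feature vector altogether.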

Once the per-frame feature vectors (v₁ to v_F) are obtained, frame features (f₁ to f_F) are extracted from the per-frame feature vectors (v₁ to v_F) obtained for the respective frames, according to a pre-learned pattern recognition scheme (S22). Here, each of the frame features (f₁ to f_F) may be extracted so as to also reflect the previously extracted frame features (f₁ to f_(F-1)). Once the frame features (f₁ to f_F) have been extracted, a speech feature (s) is extracted from the final frame feature (f_F) among them, according to a pre-learned pattern recognition scheme, to determine the user's stress (S23).
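Steps S22 and S23 can be pictured with a minimal recurrent sketch. A plain tanh recurrent cell and a single logistic output are assumed purely for illustration; the text says only that each frame feature is encoded together with the previously obtained one, and does not fix the network type. The names `encode_frames`, `stress_score`, `w_in`, `w_rec` and `w_out` are hypothetical, and the weights would come from training rather than from the random initialiser shown.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random matrix used here only as a placeholder for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(mat, vec):
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]

def encode_frames(vectors, w_in, w_rec):
    """Return frame features f_1..f_F; each f_t depends on v_t and f_(t-1)."""
    f_prev = [0.0] * len(w_rec)
    features = []
    for v in vectors:
        pre = [a + b for a, b in zip(matvec(w_in, v), matvec(w_rec, f_prev))]
        f_prev = [math.tanh(x) for x in pre]
        features.append(f_prev)
    return features

def stress_score(final_feature, w_out, bias=0.0):
    """Map the final frame feature f_F to a score in (0, 1)."""
    z = sum(w * f for w, f in zip(w_out, final_feature)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Since every f_t is computed from f_(t-1), the final frame feature f_F carries information from the whole utterance, which is why it alone can be passed on to the speech feature stage.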

As a result, the apparatus and method for determining user stress using a speech signal according to this embodiment extract, from each of the frames into which the speech signal is divided, a per-frame feature vector (v₁ to v_F) composed of a plurality of weighted mel-filter bank energies (we_m); extract, from the per-frame feature vectors (v₁ to v_F), frame features (f₁ to f_F) that take the previously extracted frame features (f₁ to f_(F-1)) into account; and thereby extract a speech feature (s) corresponding to the user's stress. The user's stress can therefore be determined accurately.

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

110: speech signal acquisition unit 120: feature vector acquisition unit
130: stress determination unit 140: learning unit
121: frame division unit 123: power spectrum conversion unit
125: filter bank unit 127: weight application unit
131: frame feature extraction unit 133: speech feature extraction unit

Claims

A user stress determination apparatus comprising:
a power spectrum conversion unit configured to convert a speech signal divided into a plurality of frames into a power spectrum;
a filter bank unit including a plurality of mel-filter banks, each having a designated frequency band and pattern, and configured to filter the power spectrum of each of the plurality of frames to obtain a plurality of mel-filter bank energies;
a feature vector acquisition unit configured to obtain a feature vector for each frame by applying a plurality of weights, determined according to a pre-learned pattern estimation scheme, to the corresponding mel-filter bank energies among the plurality of mel-filter bank energies;
a frame feature extraction unit configured to extract frame features by sequentially encoding the plurality of per-frame feature vectors according to a pre-learned pattern estimation scheme, each per-frame feature vector being encoded together with a previously obtained frame feature; and
a speech feature extraction unit configured to receive a final frame feature among the frame features obtained by the frame feature extraction unit and to extract, from the final frame feature, a speech feature corresponding to the user's stress according to a pre-learned pattern estimation scheme.

The apparatus of claim 1, wherein the frame feature extraction unit
comprises a plurality of encoders, each configured to extract a current frame feature by receiving a corresponding per-frame feature vector among the plurality of per-frame feature vectors and a previously obtained frame feature.

The apparatus of claim 1, wherein the speech feature extraction unit
combines the plurality of frame features extracted by the respective encoders and extracts the speech feature from the combined frame features according to a pre-learned pattern estimation scheme.

The apparatus of claim 1, further comprising a learning unit configured to:
acquire, during training, training data labeled with a stress level, determine an error by comparing the speech feature of the acquired training data with the stress level labeled in the training data, and back-propagate the obtained error to the frame feature extraction unit and the speech feature extraction unit to train them; and
when the error of the speech feature extracted using the trained frame feature extraction unit and speech feature extraction unit exceeds a predetermined reference error, back-propagate the error to the feature vector acquisition unit to update the weights, and then train the frame feature extraction unit and the speech feature extraction unit again.

The apparatus of claim 1, wherein each of the plurality of mel-filter banks
is implemented as a triangular filter having a different mel-frequency scale and a different bandwidth for each frequency band.

A user stress determination method comprising:
dividing a speech signal into a plurality of frames and converting each of the plurality of frames into a power spectrum;
filtering the power spectrum of each of the plurality of frames using a plurality of mel-filter banks, each having a designated frequency band and pattern, to obtain a plurality of mel-filter bank energies;
obtaining a feature vector for each frame by applying a plurality of weights, determined according to a pre-learned pattern estimation scheme, to the corresponding mel-filter bank energies among the plurality of mel-filter bank energies;
extracting frame features by sequentially encoding the plurality of per-frame feature vectors according to a pre-learned pattern estimation scheme, each per-frame feature vector being encoded together with a previously obtained frame feature; and
receiving a final frame feature among the obtained frame features and extracting, from the final frame feature, a speech feature corresponding to the user's stress according to a pre-learned pattern estimation scheme.

The method of claim 6, wherein extracting the speech feature comprises:
receiving and combining all of the plurality of frame features, including the final frame feature; and
extracting the speech feature from the combined frame features according to a pre-learned pattern estimation scheme.

The method of claim 6, further comprising a training step,
wherein the training step comprises:
acquiring training data labeled with a stress level;
determining an error by comparing the speech feature of the acquired training data with the stress level labeled in the training data, and back-propagating the obtained error to update the pattern estimation scheme for extracting the frame features and the pattern estimation scheme for extracting the speech feature; and
when the error between the speech feature obtained with the updated pattern estimation schemes and the stress level labeled in the training data exceeds a predetermined reference error, back-propagating the obtained error to update the weights, and then updating again the pattern estimation scheme for extracting the frame features and the pattern estimation scheme for extracting the speech feature.

The method of claim 6, wherein obtaining the mel-filter bank energies comprises
obtaining the mel-filter bank energies using a plurality of mel-filter banks, each formed as a triangular filter having a different mel-frequency scale and a different bandwidth.