KR20200113432A

KR20200113432A - Electronic apparatus for multiscale speech emotion recognization and operating method thereof

Info

Publication number: KR20200113432A
Application number: KR1020190033607A
Authority: KR
Inventors: 이수영; 최신국; 신영훈; 김태호; 김태훈
Original assignee: 한국과학기술원
Priority date: 2019-03-25
Filing date: 2019-03-25
Publication date: 2020-10-07
Also published as: WO2020196978A1; KR102163862B1

Abstract

An electronic device and an operating method thereof according to various embodiments of the present invention are for multiscale speech emotion recognition. The electronic device may be configured to compress a speech signal into a plurality of scales, assign a weight according to attention to each of the scales, extract a speech feature of the speech signal from the scales based on the weight, and recognize the emotion of the speech signal based on the speech feature.

Description

Electronic device for multi-scale voice emotion recognition and its operation method {ELECTRONIC APPARATUS FOR MULTISCALE SPEECH EMOTION RECOGNIZATION AND OPERATING METHOD THEREOF}

Various embodiments relate to an electronic device for multiscale speech emotion recognition and a method of operating the same.

In general, a speaker expresses emotion through a speech signal. The emotion is not expressed evenly across the whole speech signal but is concentrated in some regions of it. That is, the speech features related to emotion are concentrated in some regions of the speech signal. Accordingly, techniques for recognizing emotion from speech signals are being developed.

There is a need for a way to recognize emotion from a speech signal more accurately. Various embodiments provide an electronic device that recognizes the emotion of a speech signal by converting the speech signal into multiple scales, and a method of operating the same.

A method of operating an electronic device according to various embodiments is for multiscale speech emotion recognition and may include an operation of compressing a speech signal into a plurality of scales, an operation of assigning a weight according to attention to each of the scales, an operation of extracting a speech feature of the speech signal from the scales based on the weight, and an operation of recognizing an emotion of the speech signal based on the speech feature.

An electronic device according to various embodiments is for multiscale speech emotion recognition and may include an input module for inputting a speech signal, and a processor connected to the input module and configured to recognize the emotion of the speech signal. According to various embodiments, the processor may be configured to compress the speech signal into a plurality of scales, assign a weight according to attention to each of the scales, extract a speech feature of the speech signal from the scales based on the weight, and recognize the emotion based on the speech feature.

According to various embodiments, the electronic device may recognize the emotion of a speech signal by converting the speech signal into multiple scales. As the electronic device compresses the speech signal into a plurality of scales, it may generate compressed signals of various temporal sizes. The electronic device can thereby recognize the emotion of the speech signal by considering frame changes in a relatively long compressed signal together with frame changes in a relatively short compressed signal. In addition, the electronic device may efficiently detect at least one of the frames of the speech signal in which emotion is concentrated. Accordingly, the electronic device can recognize emotion from the speech signal more accurately.

FIG. 1 is a diagram for explaining general speech emotion recognition.
FIG. 2 is a diagram illustrating an electronic device according to various embodiments.
FIG. 3 is a diagram illustrating the processor of FIG. 2.
FIG. 4 is a diagram illustrating the multiscale module of FIG. 3.
FIG. 5 is a diagram illustrating a method of operating an electronic device according to various embodiments.

Hereinafter, various embodiments of the present document will be described with reference to the accompanying drawings.

FIG. 1 is a diagram for explaining general speech emotion recognition.

Referring to FIG. 1, speech features related to emotion may be extracted from a speech signal. As one example, a speech feature may be extracted from the entire section of the speech signal. In this case, the speech feature may represent the prosodic flow of the speech signal, for example changes in intonation and stress. As another example, speech features may be extracted from unit sections of the speech signal. In this case, the unit sections are determined at predetermined time intervals and are divided out of the entire section of the speech signal, and the speech feature may represent at least one of the excitation source or the vocal tract in each of the unit sections.

According to various embodiments, a technique for recognizing the emotion of a speech signal by converting the speech signal into multiple scales may be provided. Here, the speech signal may be divided into a plurality of frames in the time domain. As the speech signal is compressed into a plurality of scales, compressed signals of various temporal sizes may be generated. The emotion of the speech signal can thereby be recognized by considering frame changes in a relatively long compressed signal together with frame changes in a relatively short compressed signal.

FIG. 2 is a diagram illustrating an electronic device 200 according to various embodiments.

Referring to FIG. 2, the electronic device 200 according to various embodiments may include at least one of an input module 210, an output module 220, a memory 230, or a processor 240.

The input module 210 may receive, from outside the electronic device 200, a command or data to be used by a component of the electronic device 200. The input module 210 may include at least one of an input device configured to let a user enter a command or data directly into the electronic device 200, or a communication device configured to receive a command or data by communicating with an external electronic device in a wired or wireless manner. For example, the input device may include at least one of a microphone, a mouse, a keyboard, or a camera. For example, the communication device may include at least one of a wired communication device or a wireless communication device, and the wireless communication device may include at least one of a short-range communication device or a long-range communication device.

The output module 220 may provide information to the outside of the electronic device 200. The output module 220 may include at least one of an audio output device configured to output information audibly, a display device configured to output information visually, or a communication device configured to transmit information by communicating with an external electronic device in a wired or wireless manner. For example, the communication device may include at least one of a wired communication device or a wireless communication device, and the wireless communication device may include at least one of a short-range communication device or a long-range communication device.

The memory 230 may store data used by the components of the electronic device 200. The data may include input data or output data for a program or a command related to the program. For example, the memory 230 may include at least one of a volatile memory or a nonvolatile memory.

The processor 240 may execute a program in the memory 230 to control the components of the electronic device 200 and to perform data processing or computation. The processor 240 may recognize the emotion of a speech signal by converting the speech signal into multiple scales. The processor 240 may compress the speech signal into a plurality of scales. The processor 240 may assign a weight according to attention to each of the scales. The processor 240 may extract a speech feature of the speech signal from the scales based on the weight. The processor 240 may then recognize the emotion of the speech signal based on the speech feature.

FIG. 3 is a diagram illustrating the processor 240 of FIG. 2. FIG. 4 is a diagram illustrating the multiscale module 310 of FIG. 3.

Referring to FIG. 3, the processor 240 according to various embodiments may include a preprocessing module 300, a multiscale module 310, an attention pooling module 320, a combining module (concatenate module) 350, and an emotion recognition module 360.

The preprocessing module 300 may preprocess the speech signal. The speech signal results from an utterance, and the preprocessing module 300 may detect the speech signal input through the input module 210. As one example, the preprocessing module 300 may detect a speech signal input directly through the input device. As another example, the preprocessing module 300 may detect a speech signal received from an external electronic device through the communication device. For example, the preprocessing module 300 may perform at least one of denoising, pre-emphasis, or voice activity detection (VAD) on the speech signal. The preprocessing module 300 may then convert the speech signal into any one of a log-scaled mel-spectrogram, a mel-spectrogram, pitch, energy, MFCC (mel frequency cepstral coefficient), or LPCC (linear predictive cepstral coefficient), and normalize the result to zero mean and unit variance.
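Two of the preprocessing steps named above, pre-emphasis and zero-mean unit-variance normalization, can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the 0.97 pre-emphasis coefficient (a commonly used value), and the choice to normalize over one whole list of values are assumptions that do not come from the description.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[0] = x[0], y[n] = x[n] - coeff * x[n-1].

    The coefficient 0.97 is an assumed, commonly used value.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]

def normalize(values, eps=1e-8):
    """Rescale a sequence of feature values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) + eps  # eps avoids division by zero
    return [(v - mean) / std for v in values]

# Hypothetical toy samples standing in for a speech signal.
samples = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
emphasized = pre_emphasis(samples)
normalized = normalize(emphasized)
```

In a fuller pipeline the normalization would typically be applied to the extracted feature (for example each mel-spectrogram band) rather than to raw samples; the scalar list here only keeps the sketch short.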

The multiscale module 310 may convert the speech signal into multiple scales. The multiscale module 310 may compress the speech signal into a plurality of scales. Here, the multiscale module 310 may divide the speech signal into a plurality of frames in the time domain and compress the frames to reduce the size of the frames over time.

The multiscale module 310 may include a plurality of, that is, N, scale modules 311, 313, …. The scales of the scale modules 311, 313, … may differ from one another. The scale modules 311, 313, … may thereby compress the speech signal at different ratios to generate a plurality of compressed signals. For example, the scale modules 311, 313, … may compress the speech signal by applying a convolutional neural network (CNN) and pooling to it. Here, the pooling may include at least one of max pooling or average pooling. As one example, the scale modules 311, 313, … may compress the speech signal at different ratios depending on how many times the convolutional neural network and pooling are applied. Each of the scale modules 311, 313, … may be connected to the attention pooling module 320. According to an embodiment, the scale modules 311, 313, … may be connected in a chain.

According to an embodiment, the multiscale module 310 may include a first scale module 311 and a second scale module 313 connected to the first scale module 311. Similarly, it may further include at least one additional scale module (not shown) connected in a chain from the second scale module 313. The first scale module 311 may compress the speech signal at a first scale to generate a first compressed signal. The first scale module 311 may pass the first compressed signal to the attention pooling module 320 and to the second scale module 313. The second scale module 313 may compress the speech signal at a second scale to generate a second compressed signal. Specifically, the second scale module 313 may generate the second compressed signal by compressing the first compressed signal received from the first scale module 311. The second scale module 313 may pass the second compressed signal to the attention pooling module 320. If an additional scale module (not shown) is connected to the second scale module 313, the second scale module 313 may also pass the second compressed signal to the additional scale module (not shown).

For example, as shown in FIG. 4, the first scale module 311 may include a first convolutional neural network module 411 and a first pooling module 412, and the second scale module 313 may include a second convolutional neural network module 413 and a second pooling module 414. Similarly, if at least one additional scale module (not shown) is connected in a chain from the second scale module 313, the additional scale module may be configured like the first scale module 311 and the second scale module 313.

The speech signal input from the input module 210 may be converted into the first compressed signal as it passes through the first convolutional neural network module 411 and the first pooling module 412 of the first scale module 311. Here, the size of the frames of the speech signal may be reduced to, for example, 1/2. In some embodiments, after passing through the first convolutional neural network module 411 and the first pooling module 412, the speech signal may return to the first convolutional neural network module 411 and pass through the first convolutional neural network module 411 and the first pooling module 412 repeatedly to be converted into the first compressed signal. In that case, the size of the frames of the speech signal may be reduced to, for example, any one of 1/4, 1/8, or 1/16. The first compressed signal may then be passed from the first pooling module 412 to the attention pooling module 320 and to the second convolutional neural network module 413 of the second scale module 313.

The first compressed signal may be converted into the second compressed signal as it passes through the second convolutional neural network module 413 and the second pooling module 414 of the second scale module 313. Here, the size of the frames of the first compressed signal may be reduced to, for example, 1/2. In other words, the size of the frames of the speech signal may be reduced to, for example, 1/4. In some embodiments, after passing through the second convolutional neural network module 413 and the second pooling module 414, the first compressed signal may return to the second convolutional neural network module 413 and pass through the second convolutional neural network module 413 and the second pooling module 414 repeatedly to be converted into the second compressed signal. In that case, the size of the frames of the first compressed signal may be reduced to, for example, any one of 1/4, 1/8, or 1/16. The second compressed signal may then be passed from the second pooling module 414 to the attention pooling module 320. If an additional scale module (not shown) is connected to the second scale module 313, the second compressed signal may also be passed to the additional scale module (not shown).
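The chained compression described above, in which each scale module halves the number of frames and feeds the next, can be sketched in plain Python. The sketch makes several assumptions that are not in the description: a fixed three-tap smoothing kernel stands in for the learned convolutional layers, pooling is max pooling with a window of 2, and all function names are hypothetical.

```python
def conv1d_same(frames, kernel=(0.25, 0.5, 0.25)):
    """Temporal convolution with edge clamping; the fixed kernel is an
    assumed stand-in for a learned CNN layer."""
    half = len(kernel) // 2
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = [0.0] * len(frames[0])
        for k, weight in enumerate(kernel):
            src = frames[min(max(t + k - half, 0), last)]
            for d, value in enumerate(src):
                row[d] += weight * value
        out.append(row)
    return out

def max_pool(frames, size=2):
    """Non-overlapping max pooling along time; a trailing partial window
    is dropped."""
    return [
        [max(column) for column in zip(*frames[i:i + size])]
        for i in range(0, len(frames) - size + 1, size)
    ]

def scale_module(frames, repeats=1):
    """One scale module: convolution followed by pooling, optionally
    repeated to reach 1/4, 1/8, ... of the input length."""
    for _ in range(repeats):
        frames = max_pool(conv1d_same(frames))
    return frames

def multiscale(frames, num_scales=2):
    """Chain scale modules; every intermediate output is kept for the
    attention pooling stage."""
    outputs = []
    current = frames
    for _ in range(num_scales):
        current = scale_module(current)
        outputs.append(current)
    return outputs

# Hypothetical input: 16 frames with 2 features each.
signal = [[float(t), float(t % 3)] for t in range(16)]
first, second = multiscale(signal)
print(len(signal), len(first), len(second))  # 16 8 4
```

With one pass per module, the first compressed signal has 1/2 and the second has 1/4 of the original frames, matching the example ratios in the text; passing `repeats=2` to `scale_module` gives the 1/4 reduction within a single module.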

The attention pooling module 320 may assign a weight according to attention to each of the scales. For each of the compressed signals, the attention pooling module 320 may calculate the attention of each of the frames. For each of the compressed signals, the attention pooling module 320 may then assign to each of the frames a weight according to the attention. The attention pooling module 320 may include a plurality of, that is, N, neural network modules 321, 323, …, N attention modules 331, 333, …, and N weight modules 341, 343, ….

The neural network modules 321, 323, … may each process the compressed signals input from the scale modules 311, 313, … of the multiscale module 310 according to a predetermined neural network structure. For example, the neural network modules 321, 323, … may each process the compressed signals according to a recurrent neural network structure of the long short-term memory (LSTM) type. To this end, the neural network modules 321, 323, … may be connected to the scale modules 311, 313, …, respectively.

According to an embodiment, the neural network modules 321, 323, … may include a first neural network module 321 connected to the first scale module 311 and a second neural network module 323 connected to the second scale module 313. Similarly, if an additional scale module (not shown) is connected to the second scale module 313, the neural network modules 321, 323, … may further include an additional neural network module (not shown) connected to the additional scale module (not shown). The first neural network module 321 may process the first compressed signal input from the first scale module 311. The second neural network module 323 may process the second compressed signal input from the second scale module 313.

The attention modules 331, 333, … may calculate attention for each of the compressed signals. The attention modules 331, 333, … may calculate the attention of each of the frames, specifically the attention to the speech feature in each of the frames. To this end, the attention modules 331, 333, … may be connected to the neural network modules 321, 323, …, respectively.

According to an embodiment, the attention modules 331, 333, … may include a first attention module 331 connected to the first neural network module 321 and a second attention module 333 connected to the second neural network module 323. Similarly, if the neural network modules 321, 323, … further include an additional neural network module (not shown) connected to an additional scale module (not shown), the attention modules 331, 333, … may further include an additional attention module (not shown) connected to the additional neural network module (not shown). The first attention module 331 may calculate the attention of each of the frames for the first compressed signal. The second attention module 333 may calculate the attention of each of the frames for the second compressed signal.

The weight modules 341, 343, … may assign weights according to attention for each of the compressed signals. The weight modules 341, 343, … may assign to each of the frames a weight according to its attention. To this end, the weight modules 341, 343, … may be connected to the neural network modules 321, 323, … and to the attention modules 331, 333, …, respectively.

According to an embodiment, the weight modules 341, 343, … may include a first weight module 341 connected to the first neural network module 321 and the first attention module 331, and a second weight module 343 connected to the second neural network module 323 and the second attention module 333. Similarly, if the neural network modules 321, 323, … further include an additional neural network module (not shown) and the attention modules 331, 333, … further include an additional attention module (not shown), the weight modules 341, 343, … may further include an additional weight module (not shown) connected to the additional neural network module (not shown) and the additional attention module (not shown). The first weight module 341 may assign to each of the frames a weight according to attention for the first compressed signal. The second weight module 343 may assign to each of the frames a weight according to attention for the second compressed signal.
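The per-frame attention and weighting described above can be sketched for a single scale as follows. The description does not give the attention formula, so this sketch assumes a common choice: a dot product between each frame's hidden vector and a learned query vector, followed by a softmax over time. The names `attention_weights`, `attention_pool`, and `query`, and all numeric values, are hypothetical.

```python
import math

def attention_weights(frames, query):
    """Score each frame against a query vector and normalize the scores
    with a softmax over time (an assumed form of the attention module)."""
    scores = [sum(q * h for q, h in zip(query, frame)) for frame in frames]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, query):
    """Weight every frame by its attention and sum over time
    (the role played by the weight module in this sketch)."""
    weights = attention_weights(frames, query)
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights

# Hypothetical hidden vectors, standing in for the output of a neural
# network module on one compressed signal (3 frames, 2 features).
hidden = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
pooled, weights = attention_pool(hidden, query)
```

Because the second frame has the largest score, it receives the largest weight and dominates the pooled vector, which is the behavior the text describes for frames in which emotion is concentrated. In the described device one such computation would run per scale.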

The combining module 350 may extract a speech feature of the speech signal from the scales based on the weights. To this end, the combining module 350 may combine the results for all of the scales input from the attention pooling module 320 to derive a final result. The combining module 350 may compare the frames across all of the compressed signals. The combining module 350 may then extract, from the frames, a speech feature to which a weight exceeding a predetermined value has been assigned.
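One way to read the combining step is sketched below: frames whose weight exceeds a threshold are selected, and the per-scale results are joined into one feature vector. Treating "combine" as vector concatenation follows from the module's name (concatenate module) but is otherwise an assumption, as are the function names, the threshold of 0.5, and the toy values.

```python
def select_salient(frames, weights, threshold):
    """Return (index, frame) pairs whose attention weight exceeds the
    predetermined threshold."""
    return [
        (i, frame)
        for i, (frame, weight) in enumerate(zip(frames, weights))
        if weight > threshold
    ]

def concatenate(per_scale_vectors):
    """Join the pooled vectors of all scales into one feature vector."""
    merged = []
    for vector in per_scale_vectors:
        merged.extend(vector)
    return merged

# Hypothetical per-frame vectors and weights for one scale.
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
weights = [0.1, 0.7, 0.2]
salient = select_salient(frames, weights, threshold=0.5)

# Hypothetical pooled vectors for two scales.
feature = concatenate([[0.5, 1.0], [0.2, 0.3]])
print(salient)  # [(1, [2.0, 1.0])]
print(feature)  # [0.5, 1.0, 0.2, 0.3]
```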

The emotion recognition module 360 may recognize the emotion of the speech signal based on the speech feature. The emotion recognition module 360 may select, based on the speech feature, any one of a plurality of emotion labels defined for classifying emotions. For example, the emotion labels may include at least one of anger, disgust, fear, happy, neutral, sad, or surprise.
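Selecting one label from the speech feature can be sketched as a linear layer followed by a softmax and an argmax. The description names the labels but not the classifier, so the linear form, the function names, and the hand-picked weights below are assumptions for illustration only; a real system would learn the weights from data.

```python
import math

EMOTION_LABELS = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(feature, weight_rows, biases):
    """Score each label with one weight row, then pick the most likely."""
    logits = [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weight_rows, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTION_LABELS[best], probs

# Hypothetical parameters: only the "happy" row responds to the first
# feature dimension, so this toy input is classified as "happy".
weight_rows = [[0.0, 0.0] for _ in EMOTION_LABELS]
weight_rows[EMOTION_LABELS.index("happy")] = [2.0, 0.0]
biases = [0.0] * len(EMOTION_LABELS)

label, probs = classify([1.0, 0.0], weight_rows, biases)
print(label)  # happy
```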

The electronic device 200 according to various embodiments is for multiscale speech emotion recognition and may include an input module 210 for inputting a speech signal, and a processor 240 connected to the input module 210 and configured to recognize the emotion of the speech signal.

According to various embodiments, the processor 240 may be configured to compress the speech signal into a plurality of scales, assign a weight according to attention to each of the scales, extract a speech feature of the speech signal from the scales based on the weight, and recognize the emotion based on the speech feature.

According to various embodiments, the processor 240 may be configured to divide the speech signal into a plurality of frames.

According to various embodiments, the processor 240 may include attention modules 331, 333, … configured to calculate the attention of each of the frames, and weight modules 341, 343, … configured to assign to each of the frames a weight according to the attention.

According to various embodiments, the processor 240 may include a multiscale module 310 configured to compress the speech signal into the scales to generate a plurality of compressed signals.

According to various embodiments, the multiscale module 310 may include a first scale module 311 configured to compress the speech signal to generate a first compressed signal of a first scale, and a second scale module 313 configured to compress the first compressed signal to generate a second compressed signal of a second scale.

According to various embodiments, the multiscale module 310 is configured to compress the speech signal by applying a convolutional neural network and pooling to it, and the pooling may include at least one of max pooling or average pooling.

According to various embodiments, the compressed signals may be distinguished from one another by the number of times the convolutional neural network and pooling are applied to the speech signal.

An electronic device 200 according to various embodiments is for multi-scale speech emotion recognition and may include an input module 210 for inputting a speech signal, a multiscale module 310 configured to compress the speech signal into a plurality of scales, attention pooling modules 331, 333, … configured to assign a weight according to attention to each of the scales, a combining module 350 configured to extract a speech feature of the speech signal from the scales based on the weights, and an emotion recognition module 360 configured to recognize an emotion based on the speech feature.

According to various embodiments, the multiscale module 310 may include a first scale module 311 configured to compress the speech signal to generate a first compressed signal of a first scale, and a second scale module 313 configured to compress the first compressed signal to generate a second compressed signal of a second scale.

FIG. 5 is a diagram illustrating a method of operating the electronic device 200 according to various embodiments.

Referring to FIG. 5, the electronic device 200 may detect a speech signal in operation 510. The processor 240 may detect a speech signal input through the input module 210. As one example, the processor 240 may detect a speech signal input directly through an input device. As another example, the processor 240 may detect a speech signal received from an external electronic device through a communication device. Here, the preprocessing module 300 may preprocess the speech signal. For example, the preprocessing module 300 may perform at least one of denoising, pre-emphasis, or voice activity detection (VAD) on the speech signal. The preprocessing module 300 may then convert the speech signal into any one of a log-scaled mel-spectrogram, a mel-spectrogram, pitch, energy, mel-frequency cepstral coefficients (MFCC), or linear predictive cepstral coefficients (LPCC), and normalize the result to zero mean and unit variance.
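
As an illustration only, two of the preprocessing steps named above, pre-emphasis and zero-mean unit-variance normalization, might be sketched as follows. The function names, the coefficient 0.97, and the use of a single mean and variance over a flat list of values are assumptions made for this sketch; the embodiments do not specify them, and the spectrogram conversion itself is omitted.

```python
import math


def pre_emphasis(samples, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is passed through.

    The coefficient 0.97 is a commonly used value, assumed here.
    """
    if not samples:
        return []
    rest = [samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))]
    return [samples[0]] + rest


def normalize(values, eps=1e-8):
    """Scale values to zero mean and (approximately) unit variance.

    eps avoids division by zero when all values are identical.
    """
    if not values:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) + eps
    return [(v - mean) / std for v in values]


emphasized = pre_emphasis([1.0, 1.0, 1.0])
normalized = normalize([1.0, 2.0, 3.0, 4.0])
```

In practice the normalization would more likely be applied per feature dimension of the chosen representation, for example per mel band; the flat list is used only to keep the sketch short.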

The electronic device 200 may compress the speech signal into a plurality of scales in operation 520. The processor 240 may compress the speech signal in the time domain. In this case, the multiscale module 310 may divide the speech signal into a plurality of frames and compress the frames, thereby reducing the size of the frames along the time axis. The multiscale module 310 may compress the speech signal at different ratios to generate a plurality of compressed signals. For example, the multiscale module 310 may compress the speech signal by applying a convolutional neural network (CNN) and pooling to it. Here, the pooling may include at least one of max pooling or average pooling. As one example, the multiscale module 310 may compress the speech signal at different ratios according to the number of times the convolutional neural network and pooling are applied.
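
The cascaded compression along the time axis can be sketched as below. This is a minimal illustration under stated assumptions: the convolutional layers, whose weights would be learned, are left out, and only the pooling step is shown. The names `pool_time` and `multiscale`, the pooling window of 2, and the choice to drop trailing frames that do not fill a window are all hypothetical.

```python
def pool_time(frames, size=2, mode="max"):
    """Pool consecutive frames along the time axis.

    frames is a list of equal-length feature vectors, one per time step.
    Trailing frames that do not fill a whole window are dropped.
    """
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    pooled = []
    for start in range(0, len(frames) - size + 1, size):
        window = frames[start:start + size]
        if mode == "max":
            pooled.append([max(column) for column in zip(*window)])
        else:
            pooled.append([sum(column) / size for column in zip(*window)])
    return pooled


def multiscale(frames, num_scales=2, size=2, mode="max"):
    """Return one compressed signal per scale.

    Scale k is obtained by pooling the output of scale k-1, so each
    compressed signal corresponds to a different number of pooling passes.
    """
    scales = []
    current = frames
    for _ in range(num_scales):
        current = pool_time(current, size, mode)
        scales.append(current)
    return scales


frames = [[1, 0], [3, 2], [5, 4], [7, 6], [2, 9], [4, 1], [6, 3], [8, 5]]
first_scale, second_scale = multiscale(frames, num_scales=2)
# first_scale has 4 frames and second_scale has 2 frames
```

With a window of 2, each pass halves the number of frames, so the second compressed signal covers the same utterance with a quarter of the original time steps.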

The electronic device 200 may calculate attention for each scale in operation 530. In this case, the processor 240 may process each of the compressed signals according to a predetermined neural network structure. The processor 240 may then calculate, for each of the compressed signals, the attention of each of its frames. Here, for each of the compressed signals, the attention to a speech feature may be calculated in each of the frames.

The electronic device 200 may assign weights according to the attention for each scale in operation 540. For each compressed signal, the processor 240 may assign to each of the frames a weight according to its attention.
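
One way to picture operations 530 and 540 is a softmax over per-frame scores within a single compressed signal. The sketch below is not the scoring network of the embodiments, which is only described as a predetermined neural network structure; it assumes a dot product between each frame and a hypothetical learned vector `query`, and the function names are likewise illustrative.

```python
import math


def attention_weights(frames, query):
    """Return one weight per frame; the weights are positive and sum to 1.

    frames must be non-empty. The score of a frame is its dot product with
    `query` (an assumed stand-in for a learned scoring network), and the
    scores are turned into weights with a numerically stable softmax.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_weights(frames, weights):
    """Scale every frame by its attention weight."""
    return [[w * f for f in frame] for frame, w in zip(frames, weights)]


frames = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
weights = attention_weights(frames, query=[1.0, 1.0])
weighted = apply_weights(frames, weights)
```

The same two functions would be called once per compressed signal, so that each scale receives its own set of frame weights.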

The electronic device 200 may extract a speech feature from all of the scales based on the weights in operation 550. The processor 240 may combine the results associated with all of the scales to derive a final result. In this case, the processor 240 may compare the frames across all of the compressed signals. The processor 240 may then extract, from the frames, a speech feature to which a weight exceeding a predetermined value has been assigned.
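
The combination step could be sketched as follows. The embodiments state only that results from all scales are combined and that features whose weight exceeds a predetermined value are extracted; the weighted sum per scale, the concatenation across scales, and the names `attention_pool`, `select_frames`, and `combine_scales` are assumptions chosen for illustration.

```python
def attention_pool(frames, weights):
    """Collapse one compressed signal into a single vector (weighted sum)."""
    dim = len(frames[0])
    return [sum(w * frame[d] for frame, w in zip(frames, weights))
            for d in range(dim)]


def select_frames(frames, weights, threshold):
    """Keep only the frames whose weight exceeds the given threshold."""
    return [frame for frame, w in zip(frames, weights) if w > threshold]


def combine_scales(scales, weights_per_scale):
    """Concatenate the attention-pooled vector of every scale.

    Concatenation is an assumed way of combining the scales.
    """
    feature = []
    for frames, weights in zip(scales, weights_per_scale):
        feature.extend(attention_pool(frames, weights))
    return feature


scales = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
weights_per_scale = [[0.25, 0.75], [1.0]]
feature = combine_scales(scales, weights_per_scale)
salient = select_frames(scales[0], weights_per_scale[0], threshold=0.5)
```

Here `feature` has one block of values per scale, and `salient` holds the frames of the first scale that received more than half of the attention.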

The electronic device 200 may recognize the emotion of the speech signal based on the speech feature in operation 560. In this case, the processor 240 may select, based on the speech feature, any one of a plurality of emotion labels defined for classifying emotions. For example, the emotion labels may include at least one of anger, disgust, fear, happy, neutral, sad, or surprise. The electronic device 200 may then store the emotion recognition result for the speech signal in the memory 230 or output it through the output module 220.
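
Selecting one label from the speech feature can be illustrated with a linear layer followed by a softmax. The label list comes from the example above; the classifier form, the parameters `weight_rows` and `biases`, and the function name `classify` are hypothetical, since the embodiments do not specify how the emotion recognition module 360 is built.

```python
import math

EMOTION_LABELS = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def classify(feature, weight_rows, biases):
    """Return (label, probabilities) for one speech feature vector.

    weight_rows holds one weight vector per label and biases one bias per
    label; both are assumed to come from training, which is not shown.
    """
    if len(weight_rows) != len(EMOTION_LABELS) or len(biases) != len(EMOTION_LABELS):
        raise ValueError("expected one weight row and one bias per label")
    logits = [sum(w * x for w, x in zip(row, feature)) + b
              for row, b in zip(weight_rows, biases)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTION_LABELS[best], probs


# Toy parameters in which only the "happy" row responds to the feature.
weight_rows = [[0.0, 0.0] for _ in EMOTION_LABELS]
weight_rows[EMOTION_LABELS.index("happy")] = [2.0, 0.0]
label, probs = classify([1.0, 0.0], weight_rows, [0.0] * len(EMOTION_LABELS))
```

The returned label is what would be stored or output, and the probabilities show how strongly the other labels competed.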

A method of operating the electronic device 200 according to various embodiments is for multi-scale speech emotion recognition and may include compressing a speech signal into a plurality of scales, assigning a weight according to attention to each of the scales, extracting a speech feature of the speech signal from the scales based on the weights, and recognizing an emotion of the speech signal based on the speech feature.

According to various embodiments, the speech signal may be divided into a plurality of frames.

According to various embodiments, the assigning operation may include calculating the attention of each of the frames, and assigning a weight according to the attention to each of the frames.

According to various embodiments, the compressing operation may include compressing the speech signal to generate a first compressed signal of a first scale, and compressing the first compressed signal to generate a second compressed signal of a second scale.

According to various embodiments, the compressing operation compresses the speech signal by applying a convolutional neural network and pooling to it, and the pooling may include at least one of max pooling or average pooling.

According to various embodiments, the compressed signals may be distinguished from one another by the number of times the convolutional neural network and pooling are applied to the speech signal.

According to various embodiments, the electronic device 200 may recognize the emotion of a speech signal by converting the speech signal into multiple scales. In this case, by compressing the speech signal into a plurality of scales, the electronic device 200 may generate compressed signals of various temporal sizes. This allows the electronic device 200 to recognize the emotion of the speech signal by considering frame changes in a relatively long compressed signal together with frame changes in a relatively short compressed signal. In addition, the electronic device 200 may efficiently detect at least one of the frames of the speech signal in which the emotion is concentrated. Accordingly, the electronic device 200 may recognize the emotion from the speech signal more accurately.

The various embodiments of this document and the terms used therein are not intended to limit the technology described in this document to a specific embodiment, and should be understood to include various modifications, equivalents, and/or substitutes of the corresponding embodiment. In connection with the description of the drawings, similar reference numerals may be used for similar elements. A singular expression may include the plural unless the context clearly indicates otherwise. In this document, expressions such as "A or B", "at least one of A and/or B", "A, B or C" or "at least one of A, B and/or C" may include all possible combinations of the items listed together. Expressions such as "1st", "2nd", "first" or "second" may modify the corresponding elements regardless of order or importance, and are used only to distinguish one element from another without limiting the elements. When a certain (e.g., first) element is referred to as being "(functionally or communicatively) connected" or "coupled" to another (e.g., second) element, the certain element may be connected to the other element directly or through yet another element (e.g., a third element).

The term "module" used in this document includes a unit composed of hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be an integrally formed component, or a minimum unit, or a part thereof, that performs one or more functions. For example, a module may be configured as an application-specific integrated circuit (ASIC).

Various embodiments of this document may be implemented as software including one or more instructions stored in a storage medium (e.g., the memory 230) readable by a machine (e.g., the electronic device 200). For example, a processor of the machine (e.g., the processor 240) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" only means that the storage medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between the case where data is stored semi-permanently in the storage medium and the case where it is stored temporarily.

According to various embodiments, each of the described components (e.g., a module or a program) may include a single entity or a plurality of entities. According to various embodiments, one or more of the above-described components or operations may be omitted, or one or more other components or operations may be added. Alternatively or additionally, a plurality of components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the plurality of components in the same or a similar manner as they were performed by the corresponding component before the integration. According to various embodiments, operations performed by a module, a program, or another component may be executed sequentially, in parallel, repeatedly, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.

Claims

A method of operating an electronic device for multi-scale speech emotion recognition, the method comprising:
compressing a speech signal into a plurality of scales;
assigning a weight according to attention to each of the scales;
extracting a speech feature of the speech signal from the scales based on the weight; and
recognizing an emotion of the speech signal based on the speech feature.

The method of claim 1,
wherein the speech signal is divided into a plurality of frames.

The method of claim 2, wherein the assigning of the weight comprises:
calculating the attention of each of the frames; and
assigning the weight according to the attention to each of the frames.

The method of claim 1, wherein the compressing comprises:
compressing the speech signal to generate a first compressed signal of a first scale; and
compressing the first compressed signal to generate a second compressed signal of a second scale.

The method of claim 1, wherein the compressing comprises
compressing the speech signal by applying a convolutional neural network and pooling,
wherein the pooling includes at least one of max pooling or average pooling.

The method of claim 5,
wherein the compressed signals are distinguished according to the number of times the convolutional neural network and pooling are applied to the speech signal.

An electronic device for multi-scale speech emotion recognition, comprising:
an input module for inputting a speech signal; and
a processor connected to the input module and configured to recognize an emotion of the speech signal,
wherein the processor is configured to:
compress the speech signal into a plurality of scales;
assign a weight according to attention to each of the scales;
extract a speech feature of the speech signal from the scales based on the weight; and
recognize the emotion based on the speech feature.

The electronic device of claim 7, wherein the processor
is configured to divide the speech signal into a plurality of frames.

The electronic device of claim 8, wherein the processor comprises:
an attention module configured to calculate the attention of each of the frames; and
a weight module configured to assign the weight according to the attention to each of the frames.

The electronic device of claim 7, wherein the processor comprises
a multiscale module configured to compress the speech signal into the scales to generate a plurality of compressed signals.

The electronic device of claim 10, wherein the multiscale module comprises:
a first scale module configured to compress the speech signal to generate a first compressed signal of a first scale; and
a second scale module configured to compress the first compressed signal to generate a second compressed signal of a second scale.

The electronic device of claim 10, wherein the multiscale module
is configured to compress the speech signal by applying a convolutional neural network and pooling,
wherein the pooling includes at least one of max pooling or average pooling.

The electronic device of claim 12,
wherein the compressed signals are distinguished according to the number of times the convolutional neural network and pooling are applied to the speech signal.

An electronic device for multi-scale speech emotion recognition, comprising:
an input module for inputting a speech signal;
a multiscale module configured to compress the speech signal into a plurality of scales;
an attention pooling module configured to assign a weight according to attention to each of the scales;
a combining module configured to extract a speech feature of the speech signal from the scales based on the weight; and
an emotion recognition module configured to recognize an emotion of the speech signal based on the speech feature.

The electronic device of claim 14, wherein the multiscale module comprises:
a first scale module configured to compress the speech signal to generate a first compressed signal of a first scale; and
a second scale module configured to compress the first compressed signal to generate a second compressed signal of a second scale.