WO2022114347A1 - Voice signal-based method and apparatus for recognizing stress using adversarial training with speaker information - Google Patents


Info

Publication number
WO2022114347A1
Authority
WO
WIPO (PCT)
Prior art keywords
stress
speaker
recognition
feature vector
domain
Prior art date
Application number
PCT/KR2020/017789
Other languages
French (fr)
Korean (ko)
Inventor
강홍구
한혜원
신현경
변경근
Original Assignee
연세대학교 산학협력단 (Yonsei University Industry-Academic Cooperation Foundation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 연세대학교 산학협력단 (Yonsei University Industry-Academic Cooperation Foundation)
Publication of WO2022114347A1 publication Critical patent/WO2022114347A1/en

Links

Images

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques specially adapted for comparison or discrimination for estimating an emotional state
    • G10L 25/66 Speech or voice analysis techniques specially adapted for comparison or discrimination for extracting parameters related to health condition
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • The present invention relates to an apparatus and method for recognizing stress based on a voice signal.
  • This study is related to research and development on emotional intelligence capable of inferring and judging the other party's emotions and conversing and responding accordingly, carried out in the advanced convergence content technology development project funded by the Ministry of Science and ICT and supported by the Institute of Information & Communications Technology Planning & Evaluation (No. 1711116331).
  • Techniques for determining stress using a voice signal are largely divided into a part for extracting a voice feature vector and a part for modeling the relationship between the extracted vector and the stress state by a statistical method.
  • Feature vectors of speech previously used for stress discrimination include pitch, Mel-Frequency Cepstral Coefficients (MFCC), and per-frame energy.
  • In neural network structures, features are input to the network, the loss between the nonlinearly predicted output and the label is calculated, and parameters are learned through a backpropagation algorithm. These neural-network-based algorithms reflect the statistical characteristics of the data, effectively model the image-like characteristics of a spectrogram or the sequential characteristics of speech, and are currently being actively used in image- and speech-related research fields.
  • Patent Document 1 Korean Patent Publication No. 10-2019-0135916 (2019.12.09.)
  • Embodiments of the present invention remove a speaker-dependent tendency from a voice signal using domain adversarial training with speaker information given in the training process, and learn a feature vector related to psychological stress independently of speaker information. Its main purpose is to improve the stress recognition accuracy.
  • There is provided a stress recognition apparatus characterized in that the processor obtains a voice signal, extracts a feature vector from the voice signal, and outputs a stress recognition result through a domain adversarial stress discrimination model using the feature vector.
  • The domain adversarial stress discrimination model may include a stress recognition unit for determining the presence or absence of stress and a speaker recognition unit for classifying the speaker.
  • the speaker recognizer may apply a gradient reversal layer.
  • The domain adversarial stress discrimination model may be trained to increase the loss of the speaker recognition unit so that the feature vectors are spread out in the vector space.
  • The domain adversarial stress discrimination model may determine stress in the voice signal independently of the speaker through domain adversarial learning.
  • According to the present embodiments, the speaker-dependent tendency is removed from the voice signal by using domain adversarial training with the speaker information given in the training process, and the feature vectors related to psychological stress are learned independently of the speaker information, with the effect of improving stress recognition accuracy.
  • FIG. 1 is a block diagram illustrating an apparatus for recognizing stress according to an embodiment of the present invention.
  • FIG. 2 is a diagram illustrating an operation of a stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 3 is a diagram illustrating a voice signal pre-processing operation of the apparatus for recognizing stress according to an embodiment of the present invention.
  • FIGS. 4 and 5 are diagrams illustrating a domain adversarial stress discrimination model of the stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 7 is a flowchart illustrating a stress recognition method according to another embodiment of the present invention.
  • Neural-network-based models that can model the continuously changing characteristics of speech signals well are being used in fields such as stress and emotion recognition.
  • When only stress information is used in the training process, learning of the sentence-level feature vector is insufficient, so the vector acquires characteristics dependent on the speaker information as well as the stress information, which degrades discrimination performance on data that differ from the training data. That is, stress discrimination performance changes when a different speaker is speaking.
  • Since a neural-network-based model largely reflects the statistical characteristics of the data, a mismatch between the distributions of the training data and the test data affects model performance; there is a problem in that recognition performance deteriorates in an environment in which the domain changes.
  • The stress recognition apparatus improves discrimination accuracy by learning, from the voice signal, a feature vector for stress discrimination that is independent of the speaker information, utilizing domain adversarial training with the speaker information given in the training process.
  • Domain adversarial learning constructs a network composed of a first recognizer serving the main recognition purpose and a second recognizer distinguishing different domains. It refers to a training technique that, while training the first recognizer, simultaneously trains the network to reduce the domain recognition performance of the second recognizer, thereby improving the performance of the first recognizer.
  • FIG. 1 is a block diagram illustrating an apparatus for recognizing stress according to an embodiment of the present invention.
  • the stress recognition device 110 includes at least one processor 120 , a computer readable storage medium 130 , and a communication bus 170 .
  • the processor 120 may control to operate as the stress recognition device 110 .
  • the processor 120 may execute one or more programs stored in the computer-readable storage medium 130 .
  • the one or more programs may include one or more computer-executable instructions, which when executed by the processor 120 may be configured to cause the stress recognition device 110 to perform operations in accordance with the exemplary embodiment.
  • Computer-readable storage medium 130 is configured to store computer-executable instructions or program code, program data, and/or other suitable form of information.
  • the program 140 stored in the computer-readable storage medium 130 includes a set of instructions executable by the processor 120 .
  • The computer-readable storage medium 130 may include memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that can be accessed by the stress recognition apparatus 110 and can store desired information, or a suitable combination thereof.
  • The communication bus 170 interconnects various other components of the stress recognition device 110, including the processor 120 and the computer-readable storage medium 130.
  • the stress recognition device 110 may also include one or more input/output interfaces 150 and one or more communication interfaces 160 that provide interfaces for one or more input/output devices.
  • the input/output interface 150 and the communication interface 160 are connected to the communication bus 170 .
  • the input/output device may be connected to other components of the stress recognition device 110 through the input/output interface 150 .
  • The stress recognition device 110 adds a speaker recognition unit including a gradient reversal layer to the domain adversarial stress discrimination model and utilizes domain adversarial training with the speaker information given in the training process, so that the feature vector for stress discrimination is learned from the speech signal independently of the speaker information, improving discrimination accuracy.
  • FIG. 2 is a diagram illustrating an operation of a stress recognition apparatus according to an embodiment of the present invention.
  • The stress recognition apparatus extracts a feature vector from a speech database, trains a model that takes the acquired feature vector as input and updates the internal parameters of a neural network, and finally determines the presence or absence of stress.
  • In the feature extraction process, a vector reflecting the characteristics of the speech signal is extracted in a specified manner. After the audio signal is divided into frames of a certain length (5 to 40 ms) on the time axis, energy is extracted for each frequency band of each frame, yielding a power spectrogram that represents energy as a function of the time and frequency of the voice signal. Thereafter, a mel-spectrogram representing the energy of each mel-scale frequency band is obtained by multiplying by a mel-filter bank having filters for a plurality of frequency regions. Since the frequency-band energies have very small values close to 0, a log function is applied to convert them to a log scale; expanding the scale between the energies and reshaping their distribution makes the model easier to compute and train.
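The frame-and-filterbank pipeline described above can be sketched in numpy. The 25 ms window and 10 ms hop follow the analysis settings mentioned later in the document; the 16 kHz sampling rate, the 40 mel bands, and the triangular-filter construction are common-practice assumptions rather than values stated in the source.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                      # rising edge of the triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(signal, sr=16000, frame=400, hop=160, n_mels=40):
    """Frame (25 ms / 10 ms hop) -> Hann window -> |FFT|^2 -> mel bank -> log."""
    n = 1 + (len(signal) - frame) // hop
    win = np.hanning(frame)
    frames = np.stack([signal[i * hop:i * hop + frame] * win for i in range(n)])
    power = np.abs(np.fft.rfft(frames, n=frame, axis=1)) ** 2   # power spectrogram
    mel = power @ mel_filterbank(n_mels, frame, sr).T           # mel-band energies
    return np.log(mel + 1e-10)  # log scale keeps near-zero energies trainable
```

The small additive constant before the log is a numerical-stability guard, standing in for the document's remark that the band energies are very close to 0.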
  • the initial model created through parameter generation is trained so that the parameter values reflect the statistical characteristics of the data through the backpropagation algorithm.
  • the acquired speech feature vector is set in the input layer
  • label information indicating stress/non-stress and additional speaker label information are set in the output layer.
  • a backpropagation algorithm is used to train to minimize the error in recognizing the emotional state.
  • the network maximizes the error predicted by the speaker recognition unit. That is, the parameters inside the trained model minimize the error rate in estimating the emotional state while maximizing the error rate for speaker recognition, so that a feature vector suitable for emotional state recognition is learned independently of the speaker.
  • a voice feature vector is passed through the trained neural network model to determine the presence/absence of stress in a given voice, and in the speaker estimation step, a voice feature vector is passed through the same to determine who the speaker of the given voice is.
  • FIG. 3 is a diagram illustrating a voice signal pre-processing operation of the apparatus for recognizing stress according to an embodiment of the present invention.
  • the stress recognition device needs to apply a plurality of preprocessing steps to the recorded speech signal.
  • the stress recognition apparatus performs an operation of removing noise from a voice signal.
  • Unwanted background noise components are removed; for example, the noise can be removed by applying Wiener filtering.
  • the stress recognition device compensates for distortion generated in the speech signal processing process by reducing the dynamic range of the frequency spectrum by applying a pre-emphasis filter, that is, a simple high pass filter. By emphasizing high frequencies through a pre-emphasis filter, it is possible to balance the dynamic range between the high-frequency region and the low-frequency region.
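The pre-emphasis step is the standard first-order high-pass filter y[n] = x[n] - a·x[n-1]; a minimal sketch follows. The coefficient 0.97 is a conventional default, not a value given in the document.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass: y[n] = x[n] - alpha * x[n-1] (boosts high frequencies)."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])   # a flat (purely low-frequency) signal
y = pre_emphasis(x)                   # flat content is strongly attenuated after n=0
```

Because low-frequency content dominates speech spectra, attenuating it narrows the dynamic range between the low- and high-frequency regions, which is the balancing effect the text describes.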
  • the stress recognizing apparatus performs an operation of discriminating whether or not a voice is present in a voice signal.
  • a voice segment is acquired by applying the VAD (Voice Activity Detection) algorithm in the process of silence processing. After the silence section is searched for in the voice signal and removed (processed as 0), a voice segment is acquired.
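The document names VAD but not a specific algorithm; the sketch below is a simple energy-threshold detector, one of the most basic VAD schemes. The frame length and the -40 dB threshold are illustrative assumptions.

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-40.0):
    """Mark frames whose energy exceeds a threshold relative to the loudest frame."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    db = 10.0 * np.log10(energy / energy.max() + 1e-12)  # energy relative to peak, in dB
    return db > threshold_db  # True for voiced frames, False for silence
```

Frames flagged False would then be zeroed out (the "processed as 0" step) before the voiced segments are passed on.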
  • the stress recognition apparatus performs analysis and transformation in units of a predetermined window.
  • the Fourier transform unit 440 may analyze the input voice signal every 10 ms with a Hanning window of 25 ms.
  • Short-Time Fourier Transform may be performed to analyze the frequency change of the voice signal over time.
  • the speech signal can be divided into frame units of a constant length (5 to 40 ms) on the time axis.
  • the stress recognition apparatus may extract energy for each frequency band from each of the divided frames.
  • a power spectrogram representing energy according to time and frequency of a voice signal may be extracted.
  • a mel-spectrogram representing energy for each frequency band of the mel scale is obtained by multiplying a mel-filter bank having a pattern for a plurality of frequency domains.
  • a log function is applied to convert the energy into log scale energy to extract features.
  • the stress recognition device normalizes the features of the Mel-filter bank. In the normalization process, normalization is performed so that the Mel-filter bank features have zero mean and unit variance.
  • The stress recognition device divides the feature into segments of a fixed time length (e.g., 2, 4, or 5 seconds) according to the setting of the stress discrimination algorithm, and outputs the finally extracted feature vector.
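The normalization and segmentation steps can be sketched as follows. The 200-frame segment corresponds to 2 s at a 10 ms hop, one of the example lengths above; computing the statistics per mel band is an assumption about how the zero-mean, unit-variance normalization is applied.

```python
import numpy as np

def normalize_and_segment(feats, seg_frames=200):
    """Zero-mean / unit-variance per mel band, then split into fixed-length chunks.

    feats: (T, n_mels) log-mel features; returns a list of (seg_frames, n_mels) arrays.
    """
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-8          # guard against constant bands
    normed = (feats - mu) / sigma
    n_seg = len(normed) // seg_frames          # the trailing remainder is dropped
    return [normed[i * seg_frames:(i + 1) * seg_frames] for i in range(n_seg)]
```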
  • the feature vector output from the segmentation process is transferred to the domain adversarial stress discrimination model to learn the deep neural network model for stress discrimination.
  • FIGS. 4 and 5 are diagrams illustrating a domain adversarial stress discrimination model of the stress recognition apparatus according to an embodiment of the present invention.
  • The domain adversarial stress discrimination model includes an encoder that converts the mel-spectrogram into an embedding vector suitable for stress judgment; an attention-weighted sum unit that calculates the relationship between the embedding vector extracted for each frame and the stress label, assigns weights accordingly, and extracts sentence-level speech characteristics by weighted summation; and a sentence-level hidden layer that performs stress determination and speaker recognition using the sentence-level feature vectors.
  • the encoder at the bottom of the network plays a role in modeling the temporal and frequency components of the input speech feature vector to be suitable for stress recognition.
  • the encoder consists of several layers of convolutional neural networks.
  • As the input to the convolutional neural network, a log-scale power mel-spectrum extracted at a preset time interval (e.g., 10 ms) by the feature vector acquisition unit is used.
  • Each frame of the voice signal is extracted over a short interval, and the information between neighboring frames has a continuous, sequential character.
  • a neural network structure that models overall information between neighboring frames should be applied, and a representative structure is a recurrent neural network.
  • After the convolutional neural network, the features pass through the recurrent neural network to generate an output vector reduced in size in terms of time and frequency.
  • the two-dimensional output vector is transferred to the attention hiding layer having multiple heads.
  • the network structure of the encoder unit may utilize a dilated convolution layer instead of a recurrent neural network structure in order to reduce the number of parameters if necessary.
  • the attention weighted summation unit corresponds to a multi-head attention hidden layer.
  • the attention head is calculated as in Equation 1.
  • The relationship with the stress label is calculated to obtain a weight for each time step.
  • A fully-connected layer is applied to the vector of each frame so that the weight is calculated.
  • The frame-level vectors are transformed into a sentence-level vector representing the sentence by attention pooling, which multiplies the calculated weight vector W by the encoder output values and sums the products along the time axis. This sentence-level vector is transferred to the sentence-level hidden layer.
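Since Equation 1 is not reproduced in this text, the sketch below shows a generic single-head attention pooling of the kind described: a fully-connected scoring layer, a softmax over time, and a weighted sum along the time axis. The multi-head case would repeat this with several projection vectors and concatenate the results.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, W_att):
    """H: (T, D) frame-level encoder outputs; W_att: (D,) learned projection.

    A linear layer scores each frame, softmax turns the scores into
    weights, and the weighted sum yields one sentence-level vector.
    """
    scores = H @ W_att        # (T,) one relevance score per frame
    alpha = softmax(scores)   # attention weights, summing to 1 over time
    return alpha @ H          # (D,) sentence-level feature vector
```

With a zero projection vector the weights are uniform and the pooling reduces to a plain temporal average, which makes the role of the learned scores easy to see.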
  • The sentence-level hidden layer includes a stress recognition unit modeling a non-linear relationship between the sentence-level feature vector and the stress label, and a speaker recognition unit modeling a non-linear relationship between the sentence-level feature vector and the speaker label.
  • A gradient reversal layer is added before the fully-connected layer for speaker recognition.
  • This hidden layer reverses the direction of the gradient by multiplying the gradient value calculated in the backpropagation training process by -1, and it is trained to make the information about the speaker difficult to distinguish. That is, the sentence-level feature vector is learned to contain information that can distinguish the stress characteristic regardless of the speaker information.
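A gradient reversal layer is the identity on the forward pass and negates (and optionally scales) the gradient on the backward pass. The minimal sketch below shows just that contract, outside any particular autograd framework; the scaling factor `lam` is a common generalization of the plain -1 multiplication described above, not a value from the document.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam backward."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # features pass through to the speaker head unchanged

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lam * grad_output  # reversed gradient flows down to the encoder
```

The speaker head above the layer still descends its own loss normally, but the encoder below receives the negated gradient, so it is pushed to discard speaker cues while the stress branch keeps training as usual.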
  • the stress recognition unit determines the presence/absence of stress by passing through the sentence level hidden layer.
  • The speaker recognition unit is connected to a fully-connected layer having as many dimensions as the number of speakers, and serves to distinguish speakers.
  • the entire neural network model is trained with a backpropagation algorithm by calculating the error between the output value of the network and the stress label.
  • For speaker recognition, the error between the network output and the speaker label is calculated, and the gradient value is propagated in the direction opposite to the learning direction through the gradient reversal layer, so that the speaker cannot be distinguished.
  • The loss function is expressed as a weighted sum of the stress recognition loss function (cross-entropy) and the speaker recognition loss function with its sign reversed.
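The weighted-sum objective can be sketched as below; the weight `lam` is a hypothetical hyperparameter, not a value from the document. In practice the sign flip is usually realized by the gradient reversal layer rather than by negating the loss directly, but the effect on the shared encoder parameters is the same.

```python
import numpy as np

def cross_entropy(probs: np.ndarray, label: int) -> float:
    """Negative log-likelihood of the correct class."""
    return -np.log(probs[label] + 1e-12)

def total_loss(stress_probs, stress_label, spk_probs, spk_label, lam=0.1):
    """Stress cross-entropy plus the speaker cross-entropy with reversed sign."""
    return (cross_entropy(stress_probs, stress_label)
            - lam * cross_entropy(spk_probs, spk_label))
```

Minimizing this total drives the stress error down while rewarding a high speaker error, i.e. a sentence-level feature from which the speaker cannot be identified.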
  • the learning data is kept the same.
  • FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention.
  • non-stress is indicated by ' X '
  • stress is indicated by ' ⁇ '.
  • FIG. 6(a) visualizes in two dimensions, using the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm, the sentence-level feature vectors obtained when training uses only the ordinary stress recognition loss function. Sentence-level feature vectors for several speakers are shown, and vectors belonging to the same speaker can be seen to cluster together. This indicates that the sentence-level feature vector contains not only stress recognition information but also speaker recognition information, and is therefore unsuitable for stress recognition independent of the speaker information.
  • FIG. 6(b) visualizes the sentence-level feature vectors in the case of domain adversarial learning. It can be confirmed that the sentence-level feature vectors are clustered not by speaker but only by the presence or absence of stress. That is, the sentence-level feature vector is a stress recognition vector generalized across speakers.
  • the stress recognition method may be performed by a stress recognition device.
  • step S10 the stress recognition device acquires the generated voice signal.
  • step S20 the stress recognition apparatus extracts a feature vector by analyzing the speech signal in units of a predetermined window.
  • In step S30, the stress recognition device performs deep learning by inputting the feature vector, trains a feature vector independent of the speaker information through domain adversarial learning by providing the speaker information and the stress information together, and outputs the stress recognition result.
  • the feature vector extraction step S20 includes a noise removal step of removing noise from the voice signal.
  • the feature vector extraction step S20 includes a distortion compensation step of compensating for distortion by emphasizing a high frequency through a pre-emphasis filter.
  • the feature vector extraction step (S20) includes a silence processing step of dividing a section in which a speech is present in the distortion-compensated speech signal, obtaining a speech segment, and transferring the speech segment to the Fourier transform step.
  • the feature vector extraction step ( S20 ) includes a Fourier transform step of transforming the speech signal into frames of a certain length on the time axis in order to analyze the frequency change of the speech signal according to the temporal flow in units of a predetermined window.
  • each of the divided frames is multiplied by a Mel-filter bank having a pattern for a plurality of frequency domains to obtain a Mel-spectrogram representing energy for each frequency band of the Mel scale, and the Mel-filter It includes a filter bank processing step of extracting the characteristics of the bank.
  • the feature vector extraction step S20 includes a normalization step of normalizing the features of the Mel-filter bank.
  • the feature vector extraction step S20 includes a division processing step of extracting the feature vector by dividing the normalized feature by a predetermined fixed length of time.
  • the model learning step S30 includes an encoder processing step of receiving a feature vector, modeling temporal and frequency components of the feature vector to be suitable for stress determination, and outputting an output vector for each frame.
  • The model learning step (S30) includes a weight processing step of calculating a weight vector for each time step based on the output vector produced for each frame, and converting the frame-level output vectors into a sentence-level vector by combining the per-time weight vector with the output vectors.
  • the model learning step ( S30 ) includes a stress classification processing step of modeling a non-linear relationship between a sentence level feature vector and a stress label to generate a determination result for the existence of stress.
  • the model learning step S30 includes a speaker classification processing step of generating a speaker discrimination result of a speech signal by modeling a non-linear relationship between a sentence level feature vector and a speaker label.
  • In the model learning step (S30), the weighted sum of the error between the discrimination result derived from the feature vector and the stress label and the error with respect to the speaker label is calculated as the loss function; then, using a backpropagation algorithm, training that minimizes the stress label error while increasing the speaker label error is performed repeatedly.
  • The existing stress recognition model utilizes only stress information in the training process; since the sentence-level vector can learn characteristics that depend on the speaker information along with the emotional state, the recognition rate can drop when the test environment differs from the real environment.
  • In the present invention, by additionally providing speaker information in the training process of the deep learning model and performing domain adversarial learning, the parameters inside the model learn to estimate stress information independently of the speaker information. Psychological stress in the voice signal can thus be recognized effectively by using the speaker information adversarially.
  • the stress recognition apparatus may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, or may be implemented using a general-purpose or special-purpose computer.
  • the device may be implemented using a hardwired device, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like.
  • the device may be implemented as a system on chip (SoC) including one or more processors and controllers.
  • the stress recognition apparatus may be mounted in a form of software, hardware, or a combination thereof on a computing device or server provided with hardware elements.
  • A computing device or server may mean any of various devices including all or part of a communication device, such as a communication modem for performing communication with various devices or wired/wireless communication networks, a memory for storing data for executing a program, and a microprocessor for executing operations and commands by running the program.
  • Computer-readable medium represents any medium that participates in providing instructions to a processor for execution.
  • Computer-readable media may include program instructions, data files, data structures, or a combination thereof. For example, there may be a magnetic medium, an optical recording medium, a memory, and the like.
  • a computer program may be distributed over a networked computer system so that computer readable code is stored and executed in a distributed manner. Functional programs, codes, and code segments for implementing the present embodiment may be easily inferred by programmers in the art to which this embodiment belongs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Molecular Biology (AREA)
  • Psychiatry (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Signal Processing (AREA)
  • Veterinary Medicine (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Hospice & Palliative Care (AREA)
  • Theoretical Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Developmental Disabilities (AREA)
  • Social Psychology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Psychology (AREA)
  • Software Systems (AREA)

Abstract

The present embodiments provide a stress recognition apparatus and method with improved recognition accuracy. The method uses domain adversarial training with speaker information given during training to remove speaker-dependent traits from the voice signal and to learn the feature vector associated with psychological stress independently of speaker information.

Description

Speech signal-based stress recognition apparatus and method using adversarial learning with speaker information

The technical field to which the present invention pertains relates to an apparatus and method for voice signal-based stress recognition. This work was carried out under the advanced convergence content technology development program funded by the Ministry of Science and ICT and supported by the Institute of Information & Communications Technology Planning & Evaluation, and relates to research and development on emotional intelligence that can infer and judge the emotions of a conversation partner and converse and respond accordingly (No. 1711116331).

The content described in this section merely provides background information for the present embodiment and does not constitute prior art.

Techniques for determining stress from a voice signal are largely divided into a part that extracts feature vectors from the speech and a part that statistically models the relationship between the extracted vectors and the stress state. Speech feature vectors previously used for stress discrimination include pitch, Mel-Frequency Cepstral Coefficients (MFCC), and per-frame energy. These speech feature vectors are obtained through feature vector extraction algorithms.

Earlier models for recognizing emotional states from voice signals relied on statistics-based discrimination methods such as the hidden Markov model, but recently, models based on neural networks such as convolutional, recurrent, and attention networks have been proposed that automatically learn and extract feature vectors related to the stress state from the voice signal. Such neural network architectures model the relationship between input and label non-linearly, have the advantage of effectively reflecting the statistical characteristics of the data through data-driven training, and show performance superior to the earlier statistical approaches.

Neural network architectures learn their parameters through the backpropagation algorithm by computing the loss between the label and the value predicted non-linearly from the network input. Because such neural network-based algorithms reflect the statistical characteristics of the data and effectively model the image-like characteristics of a spectrogram or the sequential characteristics of speech, they are actively used in recent image- and speech-related research.

(Patent Document 1) Korean Patent Publication No. 10-2019-0135916 (2019.12.09.)

The main purpose of embodiments of the present invention is to improve stress recognition accuracy by using domain adversarial training with speaker information given during training to remove speaker-dependent tendencies from the voice signal and to learn feature vectors related to psychological stress independently of speaker information.

Other objects not specified in the present invention may additionally be considered within the scope easily inferred from the following detailed description and its effects.

According to one aspect of the present embodiment, there is provided a stress recognition apparatus comprising one or more processors and a memory storing one or more programs executed by the one or more processors, wherein the processor obtains a voice signal, extracts a feature vector from the voice signal, and outputs a stress recognition result through a domain adversarial stress discrimination model using the feature vector.

The domain adversarial stress discrimination model may include a stress recognition unit that determines the presence or absence of stress and a speaker recognition unit that distinguishes speakers.

The speaker recognition unit may apply a gradient reversal layer.

The domain adversarial stress discrimination model may be trained so that the loss of the speaker recognition unit increases, dispersing the feature vectors in the vector space.

The domain adversarial stress discrimination model may determine stress in the voice signal independently of the speaker through domain adversarial training.

As described above, according to embodiments of the present invention, domain adversarial training with speaker information given during training removes speaker-dependent tendencies from the voice signal and learns feature vectors related to psychological stress independently of speaker information, with the effect of improving stress recognition accuracy.

Even for effects not explicitly mentioned herein, the effects described in the following specification that are expected from the technical features of the present invention, and their potential effects, are treated as if described in this specification.
FIG. 1 is a block diagram illustrating a stress recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating the operation of a stress recognition apparatus according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating the voice signal preprocessing operation of a stress recognition apparatus according to an embodiment of the present invention.

FIG. 4 and FIG. 5 are diagrams illustrating the domain adversarial stress discrimination model of a stress recognition apparatus according to an embodiment of the present invention.

FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by a stress recognition apparatus according to an embodiment of the present invention.

FIG. 7 is a flowchart illustrating a stress recognition method according to another embodiment of the present invention.
Hereinafter, in describing the present invention, detailed descriptions of related known functions will be omitted when they are obvious to those skilled in the art and are judged liable to unnecessarily obscure the subject matter of the present invention, and some embodiments of the present invention will be described in detail with reference to exemplary drawings.

Neural network-based models that can model well the continuously varying characteristics of speech signals are used in fields such as stress and emotion recognition. However, when only stress information is used during training, the trained sentence-level feature vector is insufficiently learned and acquires characteristics dependent not only on stress information but also on speaker information, which affects stress discrimination performance when a speaker different from those in the training data speaks. That is, stress discrimination performance varies in situations where a different speaker is speaking.

Because neural network-based models strongly reflect the statistical characteristics of the data, model performance is affected when the distributions of the training data and the test data differ. There is a problem in that recognition performance deteriorates in environments where the domain changes.

The stress recognition apparatus according to the present embodiment improves discrimination accuracy by using domain adversarial training with speaker information given during training to learn, from the voice signal, feature vectors dedicated solely to stress discrimination, independently of speaker information.

Domain adversarial training refers to a training technique that constructs a network consisting of a first recognition unit serving the main recognition objective and a second recognition unit distinguishing between different domains, and improves the performance of the main first recognition unit by training it while simultaneously degrading the domain recognition performance of the second recognition unit.
FIG. 1 is a block diagram illustrating a stress recognition apparatus according to an embodiment of the present invention.

The stress recognition apparatus 110 includes at least one processor 120, a computer-readable storage medium 130, and a communication bus 170.

The processor 120 may control the apparatus to operate as the stress recognition apparatus 110. For example, the processor 120 may execute one or more programs stored in the computer-readable storage medium 130. The one or more programs may include one or more computer-executable instructions which, when executed by the processor 120, may be configured to cause the stress recognition apparatus 110 to perform operations according to the exemplary embodiment.

The computer-readable storage medium 130 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 140 stored in the computer-readable storage medium 130 includes a set of instructions executable by the processor 120. In one embodiment, the computer-readable storage medium 130 may be memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other forms of storage media that can be accessed by the stress recognition apparatus 110 and store desired information, or a suitable combination thereof.

The communication bus 170 interconnects the various other components of the stress recognition apparatus 110, including the processor 120 and the computer-readable storage medium 130.

The stress recognition apparatus 110 may also include one or more input/output interfaces 150 that provide interfaces for one or more input/output devices, and one or more communication interfaces 160. The input/output interface 150 and the communication interface 160 are connected to the communication bus 170. Input/output devices may be connected to the other components of the stress recognition apparatus 110 through the input/output interface 150.

The stress recognition apparatus 110 adds a speaker recognition unit including a gradient reversal layer to the domain adversarial stress discrimination model, and improves discrimination accuracy by using domain adversarial training with speaker information given during training to learn, from the voice signal, feature vectors dedicated solely to stress discrimination, independently of speaker information.
FIG. 2 is a diagram illustrating the operation of a stress recognition apparatus according to an embodiment of the present invention.

The stress recognition apparatus extracts feature vectors from a speech database, trains a model that takes the obtained feature vectors as input and updates the internal parameters of a neural network, and finally determines the presence or absence of stress.

In the feature vector acquisition process, a vector reflecting the characteristics of the speech signal is extracted in a specified manner. The speech signal is divided on the time axis into frames of constant length (5-40 ms), and the energy per frequency band is extracted for each frame, yielding a power spectrogram representing the signal's energy over time and frequency. A mel filter bank, which has a pattern over multiple frequency regions, is then applied to obtain a mel-spectrogram representing the energy in each mel-scale frequency band. Because the frequency-band energies take very small values close to zero, a log function is applied to convert them to a log scale, widening the scale between energies and changing the distribution, which facilitates computation and model training.

In the deep neural network training process, an initial model created through parameter generation is trained via the backpropagation algorithm so that the parameter values reflect the statistical characteristics of the data. To apply the adversarial training method, the obtained speech feature vectors are set at the input layer, and label information indicating stress/non-stress plus additional speaker label information is set at the output layer. The error between the value the network predicts from the input and the label representing the emotional state is computed with a loss function, and the backpropagation algorithm trains the network to minimize the emotional-state recognition error. At the same time, using the speaker label information, the network maximizes the error predicted by the speaker recognition unit. That is, the parameters inside the trained model minimize the emotional-state estimation error rate while maximizing the speaker recognition error rate, so that a feature vector suitable for emotional-state recognition is learned independently of the speaker.

In the stress estimation step, a speech feature vector is passed through the trained neural network model to determine whether the given speech contains stress; in the speaker estimation step, a speech feature vector is likewise passed through to determine who the speaker of the given speech is.
FIG. 3 is a diagram illustrating the voice signal preprocessing operation of a stress recognition apparatus according to an embodiment of the present invention.

To extract features robust enough for training a deep neural network model, the stress recognition apparatus must apply a number of preprocessing steps to the recorded speech signal.

The stress recognition apparatus removes noise from the speech signal. Unwanted background noise components are removed; Wiener filtering may be applied to remove the noise.

The stress recognition apparatus applies a pre-emphasis filter, that is, a simple high-pass filter, to reduce the dynamic range of the frequency spectrum and compensate for distortion arising during speech signal processing. By emphasizing high frequencies, the pre-emphasis filter balances the dynamic range between the high-frequency and low-frequency regions.
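The pre-emphasis step described above amounts to a first-order high-pass filter. A minimal sketch follows; the coefficient 0.97 is a conventional choice assumed here, not a value specified in this document:

```python
def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1].

    Boosts high frequencies to balance the dynamic range between the
    high- and low-frequency regions. coeff=0.97 is a common convention
    (an assumption here, not fixed by the document).
    """
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]
```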
The stress recognition apparatus distinguishes whether each section of the voice signal contains speech. In the silence-processing step, a Voice Activity Detection (VAD) algorithm is applied to obtain speech segments: silent sections of the signal are located and removed (treated as 0), after which the speech segments are obtained.
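The document does not specify which VAD algorithm is used; a minimal energy-threshold sketch of the silence-removal idea (the threshold value below is an illustrative assumption) might look like:

```python
def remove_silence(frames, energy_threshold=0.01):
    """Keep only frames whose mean energy reaches the threshold;
    silent frames are dropped (treated as 0), leaving speech segments."""
    voiced = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)  # mean energy
        if energy >= energy_threshold:
            voiced.append(frame)  # speech segment
    return voiced
```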
The stress recognition apparatus performs analysis and transformation in units of a predetermined window. The Fourier transform unit 440 may analyze the input speech signal every 10 ms with a 25 ms Hanning window.

A Short-Time Fourier Transform (STFT) may be performed to analyze how the frequency content of the speech signal changes over time. In the Fourier transform step, the speech signal may be divided on the time axis into frames of constant length (5-40 ms).

The stress recognition apparatus may extract the energy per frequency band for each of the divided frames. In the filter bank step, a power spectrogram representing the signal's energy over time and frequency may be extracted.

In the filter bank step, a mel filter bank having a pattern over multiple frequency regions is applied to obtain a mel-spectrogram representing the energy in each mel-scale frequency band. To narrow the energy-scale difference between frequency bands, a log function is applied to convert the energies to a log scale before the features are extracted.
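The framing, power spectrogram, mel filter bank, and log steps can be sketched end to end. The sketch below assumes a 16 kHz sampling rate, a 25 ms Hanning window, and a 10 ms hop (matching the windowing described above); the triangular mel filter construction is a simplified illustration, not the document's exact implementation:

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, frame_len=400, hop=160, n_mels=26):
    """Log-scale mel-spectrogram: STFT power -> mel filter bank -> log."""
    # 1) split into overlapping frames and apply a Hanning window
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # 2) per-frame power spectrum (energy per frequency bin)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3) triangular mel filter bank, equally spaced on the mel scale
    n_bins = power.shape[1]
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    hz_points = 700.0 * (10 ** (np.linspace(0.0, mel_max, n_mels + 2)
                                / 2595.0) - 1.0)
    bins = np.floor((n_bins - 1) * hz_points / (sr / 2)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    mel_energy = power @ fbank.T
    # 4) log scale, with a small floor since band energies approach zero
    return np.log(mel_energy + 1e-10)
```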
The stress recognition apparatus normalizes the mel filter bank features. In the normalization step, the features are normalized to have zero mean and unit variance.

The stress recognition apparatus divides the features into segments of fixed duration (e.g., 2 s, 4 s, 5 s) according to the settings of the stress discrimination algorithm, and outputs the finally extracted feature vectors. The feature vectors output from the segmentation step are passed to the domain adversarial stress discrimination model to train the deep neural network model for stress discrimination.
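The normalization and fixed-length segmentation steps can be sketched as follows; the default of 200 frames corresponds to 2 s at a 10 ms hop, one of the durations mentioned above, while the per-dimension normalization is an illustrative assumption:

```python
import numpy as np

def normalize_and_segment(features, seg_frames=200):
    """Normalize a (time, n_mels) feature matrix to zero mean and unit
    variance per dimension, then split it along time into fixed-length
    segments for the discrimination model."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8  # guard against division by zero
    normalized = (features - mean) / std
    n_segments = len(normalized) // seg_frames  # trailing remainder dropped
    return [normalized[i * seg_frames:(i + 1) * seg_frames]
            for i in range(n_segments)]
```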
FIG. 4 and FIG. 5 are diagrams illustrating the domain adversarial stress discrimination model of a stress recognition apparatus according to an embodiment of the present invention.

The domain adversarial stress discrimination model can be divided into an encoder, which converts the mel-spectrogram into an embedding vector suitable for stress determination; an attention weighted-sum unit, which computes the relationship between the per-frame embedding vectors and the stress label to assign weights and then extracts sentence-level speech characteristics via a weighted sum; and a sentence-level hidden layer, which performs stress determination and speaker recognition using the sentence-level feature vector.

The encoder at the bottom of the network models the temporal and frequency components of the input speech feature vector so that they are suitable for stress recognition. The encoder consists of several convolutional neural network layers. As the input layer of the convolutional network, the log-scale power mel-spectrum extracted by the feature vector acquisition unit at a preset time interval (e.g., 10 ms) is used as the network input.

Each frame of the speech signal is extracted over a short interval, and the information in neighboring frames has a continuous, sequential character. To model such a speech signal effectively, a neural network structure that models the overall information across neighboring frames must be applied; a representative such structure is the recurrent neural network. After the convolutional network, the signal passes through a recurrent network, producing an output vector whose size is reduced by a fixed ratio in both time and frequency. This two-dimensional output vector is passed to an attention hidden layer with multiple heads. If needed, the encoder may use dilated convolution layers instead of a recurrent structure to reduce the number of parameters.

The attention weighted-sum unit corresponds to a multi-head attention layer. For the query Q, key K, and value V of the i-th head, with dimension d for the number of heads r, the attention head is computed as in Equation 1.
Figure PCTKR2020017789-appb-M000001
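The equation image referenced above is not reproduced in this text. Under the assumption that Equation 1 is the standard scaled dot-product formulation of a multi-head attention head, it would read:

```latex
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i,
\qquad i = 1, \dots, r
```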
For each vector output by the encoder per frame, the relationship with the stress label is computed to obtain a per-time weight. A fully-connected layer is added on top of each per-frame vector so that the weight is computed, as in Equation 2.
Figure PCTKR2020017789-appb-M000002
The frame-level vectors are converted into a sentence-level vector representing the sentence by attention pooling, which multiplies the computed weight vector W by the encoder output values and sums along the time axis. This sentence-level vector is passed to the sentence-level hidden layer.
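The per-frame scoring and attention pooling described above can be sketched for a single head; the scoring parameters below stand in for the fully-connected layer and are illustrative assumptions:

```python
import numpy as np

def attention_pooling(frame_vectors, score_weights, score_bias=0.0):
    """Collapse frame-level vectors (T, D) into one sentence-level
    vector (D,) via a learned softmax-weighted sum over time.

    score_weights (D,) and score_bias play the role of the
    fully-connected scoring layer; they are hypothetical parameters.
    """
    scores = frame_vectors @ score_weights + score_bias  # (T,) per-frame score
    scores = scores - scores.max()                       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax over time
    return weights @ frame_vectors                       # weighted sum -> (D,)
```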
The sentence-level hidden layer includes a stress recognition unit that models the non-linear relationship between the sentence-level feature vector and the stress label, and a speaker recognition unit that models the non-linear relationship with the speaker label.

In the speaker recognition unit, a gradient reversal layer is added before the fully-connected layer for speaker recognition. This hidden layer reverses the direction of the gradient by multiplying the gradient computed during backpropagation by -1, so that during training the sentence-level feature vector is trained in the direction that raises the stress recognition rate while simultaneously raising the speaker recognition loss, making information about the speaker hard to distinguish. That is, the sentence-level feature vector is learned to contain information that distinguishes stress characteristics regardless of speaker information.
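The gradient reversal layer is the identity in the forward pass and flips the gradient's sign in the backward pass. A framework-free sketch of the idea follows (in practice this would be a custom autograd operation in a deep learning framework; the scaling factor lam is an illustrative assumption):

```python
class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam in
    the backward pass, pushing the layers below to *increase* the
    speaker recognition loss."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reverse the gradient direction
```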
The stress recognition unit determines the presence or absence of stress by passing through the sentence-level hidden layer. The speaker recognition unit is connected to a fully-connected layer whose dimension equals the number of speakers and serves to distinguish speakers.

The entire neural network model is trained with the backpropagation algorithm by computing the error between the network output and the stress label. At the same time, the speaker label is treated and assigned as a domain label, the error between the network outputs is computed, and, as the gradient propagates through the gradient reversal layer in the direction opposite to the learning direction, the model is trained by this adversarial method so that it cannot distinguish speakers.

The loss function is expressed as a weighted sum of the stress recognition loss function (cross-entropy) and the speaker recognition loss function with its sign reversed. A typical domain adversarial training method provides separate target data for training, but since the goal in the present invention is to learn sentence-level feature vectors independent of speaker information, the same training data is used throughout.
Figure PCTKR2020017789-appb-M000003
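The weighted-sum loss just described can be sketched for one utterance; the weighting factor lam is an illustrative assumption, not a value given in the document:

```python
import math

def adversarial_loss(stress_probs, stress_label,
                     speaker_probs, speaker_label, lam=0.1):
    """Total loss = stress cross-entropy minus lam times the speaker
    cross-entropy (the sign reversal realized by the gradient reversal
    layer): L = CE_stress - lam * CE_speaker."""
    ce_stress = -math.log(stress_probs[stress_label] + 1e-12)
    ce_speaker = -math.log(speaker_probs[speaker_label] + 1e-12)
    return ce_stress - lam * ce_speaker
```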
도 6은 본 발명의 일 실시예에 따른 스트레스 인식 장치가 처리하는 문장 레벨 특징 벡터의 분산을 예시한 도면이다. 도면에서 Non-stress는 'X'로 표시되고 Stress는 '●'로 표시되어 있다. 'X'와 '●'의 분산을 보면 소규모 군집여부를 확인할 수 있고, Non-stress와 Stress 간의 경계 영역 유무를 확인할 수 있다.6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention. In the drawing, non-stress is indicated by ' X ' and stress is indicated by '●'. By looking at the variance of ' X ' and '●', it is possible to check whether small clusters exist and whether there is a boundary region between non-stress and stress.
도 6의 (a)는 일반 스트레스 인식 손실함수만으로 훈련했을 때의 문장 레벨 특징 벡터를 t-SNE (t-Distributed Stochastic Neighbor Embedding) 알고리즘으로 시각화하여 2차원으로 나타낸 도면이다. 일부 몇 명의 화자에 대한 문장 레벨 특징벡터를 표현한 것인데, 화자정보가 동일한 문장 레벨 특징 벡터끼리 군집화 되어있음을 볼 수 있다. 이는 판별장치의 문장 레벨 특징 벡터가 스트레스 인식 정보뿐만 아니라 화자 인식 정보까지 포함하고 있어 화자 정보에 독립적인 스트레스 인식 목적에 적합하지 않음을 나타낸다. FIG. 6(a) is a two-dimensional view of a sentence-level feature vector when trained only with a general stress recognition loss function by visualizing it with a t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm. Sentence-level feature vectors for several speakers are expressed, and it can be seen that the same speaker information is clustered among sentence-level feature vectors. This indicates that the sentence level feature vector of the discrimination device includes not only the stress recognition information but also the speaker recognition information, so it is not suitable for the purpose of stress recognition independent of the speaker information.
FIG. 6(b) visualizes the sentence-level feature vectors when domain adversarial training is applied. The sentence-level feature vectors no longer cluster by speaker but only by the presence or absence of stress. That is, the sentence-level feature vector encodes stress information generalized across speakers.
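A visualization like FIG. 6 can be reproduced in outline with scikit-learn's t-SNE. This is a minimal sketch on synthetic data standing in for the encoder output; the feature dimensionality, sample counts, and perplexity are all assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is available

rng = np.random.default_rng(0)
# Toy sentence-level feature vectors: two classes (non-stress / stress)
# in a 32-dimensional space, standing in for real encoder outputs.
non_stress = rng.normal(loc=0.0, scale=1.0, size=(20, 32))
stress = rng.normal(loc=3.0, scale=1.0, size=(20, 32))
features = np.vstack([non_stress, stress])

# Project to 2-D for plotting, as in FIG. 6. Perplexity must be
# smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(features)
```

Plotting `embedding` with one marker per class ('X' vs '●') would then show whether the vectors cluster by stress state or by speaker.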
FIG. 7 is a flowchart illustrating a stress recognition method according to another embodiment of the present invention. The stress recognition method may be performed by a stress recognition apparatus.
In step S10, the stress recognition apparatus acquires a generated voice signal.
In step S20, the stress recognition apparatus extracts a feature vector by analyzing the voice signal in units of a predetermined window.
In step S30, the stress recognition apparatus performs deep learning with the feature vector as input, trains a feature vector independent of speaker information through domain adversarial training in which speaker information and stress information are provided together, and outputs a stress recognition result.
The feature vector extraction step (S20) includes a noise removal step of removing noise from the voice signal.
The feature vector extraction step (S20) includes a distortion compensation step of compensating for distortion by emphasizing high frequencies through a pre-emphasis filter.
The feature vector extraction step (S20) includes a silence processing step of identifying the sections in which speech is present in the distortion-compensated voice signal, obtaining speech segments, and passing them to the Fourier transform step.
The feature vector extraction step (S20) includes a Fourier transform step of converting the voice signal on the time axis into frames of a fixed length in order to analyze the frequency change of the voice signal over time in units of a predetermined window.
The feature vector extraction step (S20) includes a filter bank processing step of multiplying each of the divided frames by a mel-filter bank having patterns over a plurality of frequency regions to obtain a mel-spectrogram representing the energy in each mel-scale frequency band, thereby extracting mel-filter bank features.
The feature vector extraction step (S20) includes a normalization step of normalizing the mel-filter bank features.
The feature vector extraction step (S20) includes a segmentation step of dividing the normalized features into segments of a preset fixed length of time to extract the feature vector.
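The feature extraction steps of S20 (pre-emphasis, framing, Fourier transform, mel filter bank, normalization) can be sketched in plain numpy. This is an illustrative reconstruction under assumed parameters (16 kHz sampling, 25 ms frames, 10 ms hop, 40 mel bands); noise removal and voice-activity detection are omitted:

```python
import numpy as np

def mel_filter_bank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def extract_features(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    # Pre-emphasis: boost high frequencies to compensate for spectral tilt.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal along the time axis into fixed-length frames.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Magnitude spectrum per frame (short-time Fourier transform).
    spectrum = np.abs(np.fft.rfft(frames, n=512, axis=1))
    # Mel filter bank energies -> log mel-spectrogram.
    mel = spectrum @ mel_filter_bank(n_mels, 512, sr).T
    log_mel = np.log(mel + 1e-10)
    # Per-band normalization (zero mean, unit variance over time).
    return (log_mel - log_mel.mean(axis=0)) / (log_mel.std(axis=0) + 1e-10)

# One second of a 440 Hz tone stands in for a recorded voice signal.
features = extract_features(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

The final segmentation step would then slice `features` into fixed-duration chunks before they are fed to the model.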
The model learning step (S30) includes an encoder processing step of receiving the feature vector, modeling its temporal and frequency components so as to be suitable for stress discrimination, and outputting an output vector for each frame.
The model learning step (S30) includes a weight processing step of computing a time-wise weight vector based on the output vectors produced for each frame and combining the weight vector with the output vectors to convert the frame-level output vectors into a sentence-level vector.
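The conversion of frame-level encoder outputs into a single sentence-level vector via time-wise weights can be sketched as softmax attention pooling. The scoring function and dimensions below are assumptions for illustration, since the patent does not specify them:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frame_vectors, score_weights):
    """Collapse (n_frames, dim) frame-level outputs into one
    sentence-level vector using a time-wise weight vector."""
    scores = frame_vectors @ score_weights  # one scalar score per frame
    weights = softmax(scores)               # time-wise weights, sum to 1
    return weights @ frame_vectors          # weighted sum over time

rng = np.random.default_rng(1)
frames = rng.normal(size=(98, 64))  # encoder output: 98 frames, 64-dim
w = rng.normal(size=64)             # learned scoring vector (assumed)
sentence_vector = attention_pool(frames, w)
```

The resulting `sentence_vector` is what the stress and speaker classifiers below both consume.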
The model learning step (S30) includes a stress classification processing step of modeling the non-linear relationship between the sentence-level feature vector and the stress label to produce a determination result on the presence or absence of stress.
The model learning step (S30) includes a speaker classification processing step of modeling the non-linear relationship between the sentence-level feature vector and the speaker label to produce a speaker discrimination result for the voice signal.
In the model learning step (S30), the weighted sum of the error between the determination result derived from the feature vector and the stress label and the error with respect to the speaker label is computed as a loss function, and training is repeatedly performed with a backpropagation algorithm so that the stress-label error is minimized while the speaker-label error is simultaneously increased.
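The opposing gradient flows of this training rule, descending on the stress error while ascending on the speaker error, are usually realized with a gradient reversal layer between the encoder and the speaker classifier. A minimal hand-rolled sketch (the scaling factor `lam` is an assumed hyperparameter):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the
    backward pass. The speaker classifier above the layer still
    descends on its own loss, but the shared encoder below it is
    updated to *increase* the speaker loss."""
    def __init__(self, lam=0.5):
        self.lam = lam

    def forward(self, x):
        return x  # no change to activations

    def backward(self, grad_output):
        return -self.lam * grad_output  # flip and scale the gradient

# Toy check: the forward pass is the identity, and a gradient headed
# toward the encoder is reversed and scaled by lam.
grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(grl.forward(x), x)
reversed_grad = grl.backward(np.array([0.2, 0.4, -0.6]))
```

In an autograd framework this backward rule would be attached to a custom function, so ordinary backpropagation implements the sign-reversed speaker term of the loss automatically.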
A conventional stress recognition model uses only stress information during training; the sentence-level vector may then learn characteristics that depend on speaker information according to emotional state, which can lower the recognition rate when the test environment differs from the real environment.
In the present invention, speaker information is additionally provided during training of the deep learning model and domain adversarial training is performed, so that the internal parameters of the model learn to estimate stress information independently of speaker information. By using speaker information adversarially, psychological stress in a voice signal can be recognized effectively.
The stress recognition apparatus may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, or may be implemented using a general-purpose or special-purpose computer. The apparatus may be implemented using a hardwired device, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or the like. In addition, the apparatus may be implemented as a system on chip (SoC) including one or more processors and controllers.
The stress recognition apparatus may be mounted in the form of software, hardware, or a combination thereof on a computing device or server provided with hardware elements. The computing device or server may refer to various apparatuses including all or some of a communication device, such as a communication modem for communicating with various devices or wired/wireless communication networks, a memory for storing data for executing programs, and a microprocessor for executing programs to perform operations and issue commands.
Although FIG. 7 describes the processes as being executed sequentially, this is merely illustrative; those skilled in the art may apply various modifications and variations, such as changing the order described in FIG. 7, executing one or more processes in parallel, or adding other processes, without departing from the essential characteristics of the embodiment of the present invention.
The operations according to the present embodiments may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. A computer-readable medium refers to any medium that participates in providing instructions to a processor for execution, and may include program instructions, data files, data structures, or a combination thereof; examples include magnetic media, optical recording media, and memory. A computer program may be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiments can be easily inferred by programmers skilled in the art to which the embodiments belong.
The present embodiments are intended to explain the technical idea of the invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present embodiments.

Claims (5)

  1. A stress recognition apparatus comprising one or more processors and a memory storing one or more programs executed by the one or more processors,
    wherein the processor is configured to:
    acquire a voice signal,
    extract a feature vector from the voice signal, and
    output a stress recognition result through a domain adversarial stress discrimination model using the feature vector.
  2. The stress recognition apparatus of claim 1,
    wherein the domain adversarial stress discrimination model comprises a stress recognition unit that determines the presence or absence of stress and a speaker recognition unit that distinguishes speakers.
  3. The stress recognition apparatus of claim 2,
    wherein the speaker recognition unit applies a gradient reversal layer.
  4. The stress recognition apparatus of claim 2,
    wherein the domain adversarial stress discrimination model is trained so that the loss of the speaker recognition unit increases, thereby dispersing the feature vectors in the vector space.
  5. The stress recognition apparatus of claim 2,
    wherein the domain adversarial stress discrimination model determines stress in the voice signal independently of the speaker through domain adversarial training.
PCT/KR2020/017789 2020-11-27 2020-12-07 Voice signal-based method and apparatus for recognizing stress using adversarial training with speaker information WO2022114347A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2020-0161957 2020-11-27
KR1020200161957A KR102389610B1 (en) 2020-11-27 2020-11-27 Method and apparatus for determining stress in speech signal learned by domain adversarial training with speaker information

Publications (1)

Publication Number Publication Date
WO2022114347A1 true WO2022114347A1 (en) 2022-06-02

Family

ID=81437234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/017789 WO2022114347A1 (en) 2020-11-27 2020-12-07 Voice signal-based method and apparatus for recognizing stress using adversarial training with speaker information

Country Status (2)

Country Link
KR (1) KR102389610B1 (en)
WO (1) WO2022114347A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308318A (en) * 2018-08-14 2019-02-05 深圳大学 Training method, device, equipment and the medium of cross-domain texts sentiment classification model
KR20190135916A (en) * 2018-05-29 2019-12-09 연세대학교 산학협력단 Apparatus and method for determining user stress using speech signal
KR20200114705A (en) * 2019-03-29 2020-10-07 연세대학교 산학협력단 User adaptive stress state classification Method using speech signal
US20200364539A1 (en) * 2020-07-28 2020-11-19 Oken Technologies, Inc. Method of and system for evaluating consumption of visual information displayed to a user by analyzing user's eye tracking and bioresponse data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ABDELWAHAB MOHAMMED, BUSSO CARLOS: "Domain Adversarial for Acoustic Emotion Recognition", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE, USA, vol. 26, no. 12, 1 December 2018 (2018-12-01), USA, pages 2423 - 2435, XP055933927, ISSN: 2329-9290, DOI: 10.1109/TASLP.2018.2867099 *

Also Published As

Publication number Publication date
KR102389610B1 (en) 2022-04-21

Similar Documents

Publication Publication Date Title
Mannepalli et al. MFCC-GMM based accent recognition system for Telugu speech signals
CN107610707A (en) A kind of method for recognizing sound-groove and device
Qamhan et al. Digital audio forensics: microphone and environment classification using deep learning
CN107767869A (en) Method and apparatus for providing voice service
Hibare et al. Feature extraction techniques in speech processing: a survey
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN108962231B (en) Voice classification method, device, server and storage medium
Wijethunga et al. Deepfake audio detection: a deep learning based solution for group conversations
Antony et al. Speaker identification based on combination of MFCC and UMRT based features
WO2023163383A1 (en) Multimodal-based method and apparatus for recognizing emotion in real time
CN114722812A (en) Method and system for analyzing vulnerability of multi-mode deep learning model
CN112489623A (en) Language identification model training method, language identification method and related equipment
Xue et al. Cross-modal information fusion for voice spoofing detection
KS et al. Comparative performance analysis for speech digit recognition based on MFCC and vector quantization
CN112397093B (en) Voice detection method and device
Wu et al. A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling
WO2022114347A1 (en) Voice signal-based method and apparatus for recognizing stress using adversarial training with speaker information
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Chen et al. An intelligent nocturnal animal vocalization recognition system
Singh et al. A critical review on automatic speaker recognition
Ali et al. Fake audio detection using hierarchical representations learning and spectrogram features
Eltanashi et al. Proposed speaker recognition model using optimized feed forward neural network and hybrid time-mel speech feature
WO2021153843A1 (en) Method for determining stress of voice signal by using weights, and device therefor
Radha et al. Improving recognition of speech system using multimodal approach
Kari et al. Real time implementation of speaker recognition system with MFCC and neural networks on FPGA

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20963741

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20963741

Country of ref document: EP

Kind code of ref document: A1