WO2022114347A1 - Voice signal-based method and apparatus for recognizing stress using adversarial training with speaker information - Google Patents


Info

Publication number
WO2022114347A1
Authority
WO
WIPO (PCT)
Prior art keywords
stress
speaker
recognition
feature vector
domain
Prior art date
Application number
PCT/KR2020/017789
Other languages
French (fr)
Korean (ko)
Inventor
강홍구
한혜원
신현경
변경근
Original Assignee
연세대학교 산학협력단 (Yonsei University Industry-Academic Cooperation Foundation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 연세대학교 산학협력단 (Yonsei University Industry-Academic Cooperation Foundation)
Publication of WO2022114347A1 publication Critical patent/WO2022114347A1/en

Links

Images

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques specially adapted for comparison or discrimination for estimating an emotional state
    • G10L 25/66 Speech or voice analysis techniques specially adapted for comparison or discrimination for extracting parameters related to health condition
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • The present invention relates to an apparatus and method for recognizing stress based on a voice signal.
  • This study is related to research and development on emotional intelligence capable of inferring and judging the other party's emotions and conversing and responding accordingly, carried out in the advanced convergence content technology development project funded by the Ministry of Science and ICT and supported by the Institute of Information & Communications Technology Planning & Evaluation (No. 1711116331).
  • Techniques for determining stress using a voice signal are largely divided into a part for extracting a voice feature vector and a part for modeling the relationship between the extracted vector and the stress state by a statistical method.
  • Feature vectors of speech previously used for stress discrimination include pitch, Mel-Frequency Cepstral Coefficients (MFCC), and per-frame energy.
  • In neural network structures, features are input to the network, the loss between the nonlinearly predicted output and the label is calculated, and parameters are learned through a backpropagation algorithm. These neural-network-based algorithms reflect the statistical characteristics of the data, effectively model the image-like characteristics of a spectrogram or the sequential characteristics of speech, and are currently being actively used in image- and speech-related research fields.
  • Patent Document 1 Korean Patent Publication No. 10-2019-0135916 (2019.12.09.)
  • Embodiments of the present invention remove a speaker-dependent tendency from a voice signal using domain adversarial training with speaker information given in the training process, and learn a feature vector related to psychological stress independently of speaker information. Its main purpose is to improve the stress recognition accuracy.
  • There is provided a stress recognition apparatus characterized in that the processor obtains a voice signal, extracts a feature vector from the voice signal, and outputs a stress recognition result through a domain adversarial stress discrimination model using the feature vector.
  • The domain adversarial stress discrimination model may include a stress recognition unit for determining the presence or absence of stress and a speaker recognition unit for classifying the speaker.
  • the speaker recognizer may apply a gradient reversal layer.
  • The domain adversarial stress discrimination model may be trained to increase the loss of the speaker recognition unit so that the feature vectors are spread out in the vector space.
  • The domain adversarial stress discrimination model may determine stress in the voice signal independently of the speaker through domain adversarial learning.
  • According to the present embodiments, the speaker-dependent tendency is removed from the voice signal by using domain adversarial training with the speaker information given in the training process, and the feature vectors related to psychological stress are learned independently of the speaker information, with the effect of improving stress recognition accuracy.
  • FIG. 1 is a block diagram illustrating an apparatus for recognizing stress according to an embodiment of the present invention.
  • FIG. 2 is a diagram illustrating an operation of a stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 3 is a diagram illustrating a voice signal pre-processing operation of the apparatus for recognizing stress according to an embodiment of the present invention.
  • FIGS. 4 and 5 are diagrams illustrating a domain adversarial stress discrimination model of the stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 7 is a flowchart illustrating a stress recognition method according to another embodiment of the present invention.
  • Neural-network-based models that can model the continuously changing characteristics of speech signals well are being used in fields such as stress and emotion recognition.
  • When only stress information is used in the training process, learning of the sentence-level feature vector is insufficient, so the vector acquires characteristics dependent on the speaker information as well as the stress information, which degrades discrimination performance on data that differ from the training data. That is, stress discrimination performance changes when a different speaker is speaking.
  • Since a neural-network-based model largely reflects the statistical characteristics of the data, a mismatch between the distributions of the training data and the test data affects model performance; there is a problem in that recognition performance deteriorates in an environment in which the domain changes.
  • The stress recognition apparatus improves discrimination accuracy by learning, from the voice signal, a feature vector for stress discrimination that is independent of the speaker information, utilizing domain adversarial training with the speaker information given in the training process.
  • Domain adversarial learning constructs a network composed of a first recognizer serving the main recognition purpose and a second recognizer distinguishing different domains. It refers to a training technique that, while training the first recognizer, simultaneously trains the network to reduce the domain recognition performance of the second recognizer, thereby improving the performance of the first recognizer.
  • FIG. 1 is a block diagram illustrating an apparatus for recognizing stress according to an embodiment of the present invention.
  • the stress recognition device 110 includes at least one processor 120 , a computer readable storage medium 130 , and a communication bus 170 .
  • the processor 120 may control to operate as the stress recognition device 110 .
  • the processor 120 may execute one or more programs stored in the computer-readable storage medium 130 .
  • the one or more programs may include one or more computer-executable instructions, which when executed by the processor 120 may be configured to cause the stress recognition device 110 to perform operations in accordance with the exemplary embodiment.
  • Computer-readable storage medium 130 is configured to store computer-executable instructions or program code, program data, and/or other suitable form of information.
  • the program 140 stored in the computer-readable storage medium 130 includes a set of instructions executable by the processor 120 .
  • The computer-readable storage medium 130 may include memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that can be accessed by the stress recognition apparatus 110 and can store desired information, or a suitable combination thereof.
  • The communication bus 170 interconnects various other components of the stress recognition device 110, including the processor 120 and the computer-readable storage medium 130.
  • the stress recognition device 110 may also include one or more input/output interfaces 150 and one or more communication interfaces 160 that provide interfaces for one or more input/output devices.
  • the input/output interface 150 and the communication interface 160 are connected to the communication bus 170 .
  • the input/output device may be connected to other components of the stress recognition device 110 through the input/output interface 150 .
  • The stress recognition device 110 adds a speaker recognition unit including a gradient reversal layer to the domain adversarial stress discrimination model and utilizes domain adversarial training with the speaker information given in the training process, so that the feature vector for stress discrimination is learned from the speech signal independently of the speaker information, improving discrimination accuracy.
  • FIG. 2 is a diagram illustrating an operation of a stress recognition apparatus according to an embodiment of the present invention.
  • The stress recognition apparatus extracts a feature vector from a speech database, trains a model that takes the acquired feature vector as input and updates the internal parameters of a neural network, and finally determines the presence or absence of stress.
  • In the feature extraction process, a vector reflecting the characteristics of the speech signal is extracted in a specified manner. After the audio signal is divided into frames of a certain length (5 to 40 ms) on the time axis, energy is extracted for each frequency band of each frame, yielding a power spectrogram that represents energy as a function of the time and frequency of the voice signal. Thereafter, a mel-spectrogram representing the energy of each mel-scale frequency band is obtained by multiplying by a mel-filter bank having filters for a plurality of frequency regions. Since the frequency-band energies have very small values close to 0, a log function is applied to convert them to a log scale; expanding the scale between the energies and reshaping their distribution makes the model easier to compute and train.
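The frame-and-filterbank pipeline described above can be sketched in numpy. The 25 ms window and 10 ms hop follow the analysis settings mentioned later in the document; the 16 kHz sampling rate, the 40 mel bands, and the triangular-filter construction are common-practice assumptions rather than values stated in the source.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                      # rising edge of the triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(signal, sr=16000, frame=400, hop=160, n_mels=40):
    """Frame (25 ms / 10 ms hop) -> Hann window -> |FFT|^2 -> mel bank -> log."""
    n = 1 + (len(signal) - frame) // hop
    win = np.hanning(frame)
    frames = np.stack([signal[i * hop:i * hop + frame] * win for i in range(n)])
    power = np.abs(np.fft.rfft(frames, n=frame, axis=1)) ** 2   # power spectrogram
    mel = power @ mel_filterbank(n_mels, frame, sr).T           # mel-band energies
    return np.log(mel + 1e-10)  # log scale keeps near-zero energies trainable
```

The small additive constant before the log is a numerical-stability guard, standing in for the document's remark that the band energies are very close to 0.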
  • the initial model created through parameter generation is trained so that the parameter values reflect the statistical characteristics of the data through the backpropagation algorithm.
  • the acquired speech feature vector is set in the input layer
  • label information indicating stress/non-stress and additional speaker label information are set in the output layer.
  • a backpropagation algorithm is used to train to minimize the error in recognizing the emotional state.
  • the network maximizes the error predicted by the speaker recognition unit. That is, the parameters inside the trained model minimize the error rate in estimating the emotional state while maximizing the error rate for speaker recognition, so that a feature vector suitable for emotional state recognition is learned independently of the speaker.
  • a voice feature vector is passed through the trained neural network model to determine the presence/absence of stress in a given voice, and in the speaker estimation step, a voice feature vector is passed through the same to determine who the speaker of the given voice is.
  • FIG. 3 is a diagram illustrating a voice signal pre-processing operation of the apparatus for recognizing stress according to an embodiment of the present invention.
  • the stress recognition device needs to apply a plurality of preprocessing steps to the recorded speech signal.
  • the stress recognition apparatus performs an operation of removing noise from a voice signal.
  • Unwanted background noise components are removed; for example, the noise can be removed by applying Wiener filtering.
  • the stress recognition device compensates for distortion generated in the speech signal processing process by reducing the dynamic range of the frequency spectrum by applying a pre-emphasis filter, that is, a simple high pass filter. By emphasizing high frequencies through a pre-emphasis filter, it is possible to balance the dynamic range between the high-frequency region and the low-frequency region.
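The pre-emphasis step is the standard first-order high-pass filter y[n] = x[n] - a·x[n-1]; a minimal sketch follows. The coefficient 0.97 is a conventional default, not a value given in the document.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass: y[n] = x[n] - alpha * x[n-1] (boosts high frequencies)."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])   # a flat (purely low-frequency) signal
y = pre_emphasis(x)                   # flat content is strongly attenuated after n=0
```

Because low-frequency content dominates speech spectra, attenuating it narrows the dynamic range between the low- and high-frequency regions, which is the balancing effect the text describes.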
  • the stress recognizing apparatus performs an operation of discriminating whether or not a voice is present in a voice signal.
  • a voice segment is acquired by applying the VAD (Voice Activity Detection) algorithm in the process of silence processing. After the silence section is searched for in the voice signal and removed (processed as 0), a voice segment is acquired.
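The document names VAD but not a specific algorithm; the sketch below is a simple energy-threshold detector, one of the most basic VAD schemes. The frame length and the -40 dB threshold are illustrative assumptions.

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-40.0):
    """Mark frames whose energy exceeds a threshold relative to the loudest frame."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    db = 10.0 * np.log10(energy / energy.max() + 1e-12)  # energy relative to peak, in dB
    return db > threshold_db  # True for voiced frames, False for silence
```

Frames flagged False would then be zeroed out (the "processed as 0" step) before the voiced segments are passed on.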
  • the stress recognition apparatus performs analysis and transformation in units of a predetermined window.
  • the Fourier transform unit 440 may analyze the input voice signal every 10 ms with a Hanning window of 25 ms.
  • Short-Time Fourier Transform may be performed to analyze the frequency change of the voice signal over time.
  • the speech signal can be divided into frame units of a constant length (5 to 40 ms) on the time axis.
  • the stress recognition apparatus may extract energy for each frequency band from each of the divided frames.
  • a power spectrogram representing energy according to time and frequency of a voice signal may be extracted.
  • a mel-spectrogram representing energy for each frequency band of the mel scale is obtained by multiplying a mel-filter bank having a pattern for a plurality of frequency domains.
  • a log function is applied to convert the energy into log scale energy to extract features.
  • the stress recognition device normalizes the features of the Mel-filter bank. In the normalization process, normalization is performed so that the Mel-filter bank features have zero mean and unit variance.
  • The stress recognition device divides the feature into segments of a fixed time length (e.g., 2, 4, or 5 seconds) according to the setting of the stress discrimination algorithm, and outputs the finally extracted feature vector.
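The normalization and segmentation steps can be sketched as follows. The 200-frame segment corresponds to 2 s at a 10 ms hop, one of the example lengths above; computing the statistics per mel band is an assumption about how the zero-mean, unit-variance normalization is applied.

```python
import numpy as np

def normalize_and_segment(feats, seg_frames=200):
    """Zero-mean / unit-variance per mel band, then split into fixed-length chunks.

    feats: (T, n_mels) log-mel features; returns a list of (seg_frames, n_mels) arrays.
    """
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-8          # guard against constant bands
    normed = (feats - mu) / sigma
    n_seg = len(normed) // seg_frames          # the trailing remainder is dropped
    return [normed[i * seg_frames:(i + 1) * seg_frames] for i in range(n_seg)]
```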
  • the feature vector output from the segmentation process is transferred to the domain adversarial stress discrimination model to learn the deep neural network model for stress discrimination.
  • FIGS. 4 and 5 are diagrams illustrating a domain adversarial stress discrimination model of the stress recognition apparatus according to an embodiment of the present invention.
  • The domain adversarial stress discrimination model includes an encoder that converts the mel-spectrogram into an embedding vector suitable for stress judgment; an attention-weighted sum unit that calculates the relationship between the embedding vector extracted for each frame and the stress label, assigns weights accordingly, and extracts sentence-level speech characteristics by weighted summation; and a sentence-level hidden layer that performs stress determination and speaker recognition using the sentence-level feature vectors.
  • the encoder at the bottom of the network plays a role in modeling the temporal and frequency components of the input speech feature vector to be suitable for stress recognition.
  • the encoder consists of several layers of convolutional neural networks.
  • As the input to the convolutional neural network, a log-scale power mel-spectrum extracted at a preset time interval (e.g., 10 ms) by the feature vector acquisition unit is used.
  • Each frame of the voice signal is extracted over a short interval, and the information between neighboring frames has a continuous, sequential character.
  • a neural network structure that models overall information between neighboring frames should be applied, and a representative structure is a recurrent neural network.
  • After the convolutional neural network, the features pass through the recurrent neural network to generate an output vector reduced in size in terms of time and frequency.
  • the two-dimensional output vector is transferred to the attention hiding layer having multiple heads.
  • the network structure of the encoder unit may utilize a dilated convolution layer instead of a recurrent neural network structure in order to reduce the number of parameters if necessary.
  • the attention weighted summation unit corresponds to a multi-head attention hidden layer.
  • the attention head is calculated as in Equation 1.
  • The relationship with the stress label is calculated to obtain a weight for each time step.
  • A fully-connected layer is applied to the vector of each frame so that the weight is calculated.
  • The frame-level vectors are transformed into a sentence-level vector representing the sentence by attention pooling, which multiplies the calculated weight vector W by the encoder output values and sums the products along the time axis. This sentence-level vector is transferred to the sentence-level hidden layer.
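Since Equation 1 is not reproduced in this text, the sketch below shows a generic single-head attention pooling of the kind described: a fully-connected scoring layer, a softmax over time, and a weighted sum along the time axis. The multi-head case would repeat this with several projection vectors and concatenate the results.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, W_att):
    """H: (T, D) frame-level encoder outputs; W_att: (D,) learned projection.

    A linear layer scores each frame, softmax turns the scores into
    weights, and the weighted sum yields one sentence-level vector.
    """
    scores = H @ W_att        # (T,) one relevance score per frame
    alpha = softmax(scores)   # attention weights, summing to 1 over time
    return alpha @ H          # (D,) sentence-level feature vector
```

With a zero projection vector the weights are uniform and the pooling reduces to a plain temporal average, which makes the role of the learned scores easy to see.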
  • The sentence-level hidden layer includes a stress recognition unit modeling a non-linear relationship between the sentence-level feature vector and the stress label, and a speaker recognition unit modeling a non-linear relationship between the sentence-level feature vector and the speaker label.
  • A gradient reversal layer is added before the fully-connected layer for speaker recognition.
  • This hidden layer reverses the direction of the gradient by multiplying the gradient value calculated in the backpropagation training process by -1, and it is trained to make the information about the speaker difficult to distinguish. That is, the sentence-level feature vector is learned to contain information that can distinguish the stress characteristic regardless of the speaker information.
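A gradient reversal layer is the identity on the forward pass and negates (and optionally scales) the gradient on the backward pass. The minimal sketch below shows just that contract, outside any particular autograd framework; the scaling factor `lam` is a common generalization of the plain -1 multiplication described above, not a value from the document.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam backward."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # features pass through to the speaker head unchanged

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lam * grad_output  # reversed gradient flows down to the encoder
```

The speaker head above the layer still descends its own loss normally, but the encoder below receives the negated gradient, so it is pushed to discard speaker cues while the stress branch keeps training as usual.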
  • the stress recognition unit determines the presence/absence of stress by passing through the sentence level hidden layer.
  • The speaker recognition unit is connected to a fully-connected layer having as many dimensions as the number of speakers, and serves to distinguish speakers.
  • the entire neural network model is trained with a backpropagation algorithm by calculating the error between the output value of the network and the stress label.
  • For speaker recognition, the error between the network output and the speaker label is calculated, and the gradient value is propagated in the direction opposite to the learning direction through the gradient reversal layer, so that the speaker cannot be distinguished.
  • The loss function is expressed as a weighted sum of the stress recognition loss function (cross-entropy) and the speaker recognition loss function with its sign reversed.
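The weighted-sum objective can be sketched as below; the weight `lam` is a hypothetical hyperparameter, not a value from the document. In practice the sign flip is usually realized by the gradient reversal layer rather than by negating the loss directly, but the effect on the shared encoder parameters is the same.

```python
import numpy as np

def cross_entropy(probs: np.ndarray, label: int) -> float:
    """Negative log-likelihood of the correct class."""
    return -np.log(probs[label] + 1e-12)

def total_loss(stress_probs, stress_label, spk_probs, spk_label, lam=0.1):
    """Stress cross-entropy plus the speaker cross-entropy with reversed sign."""
    return (cross_entropy(stress_probs, stress_label)
            - lam * cross_entropy(spk_probs, spk_label))
```

Minimizing this total drives the stress error down while rewarding a high speaker error, i.e. a sentence-level feature from which the speaker cannot be identified.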
  • the learning data is kept the same.
  • FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention.
  • non-stress is indicated by ' X '
  • stress is indicated by ' ⁇ '.
  • FIG. 6(a) visualizes in two dimensions, using the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm, the sentence-level feature vectors obtained when training uses only the ordinary stress recognition loss function. Sentence-level feature vectors for several speakers are shown, and vectors belonging to the same speaker can be seen to cluster together. This indicates that the sentence-level feature vector contains not only stress recognition information but also speaker recognition information, and is therefore unsuitable for stress recognition independent of the speaker information.
  • FIG. 6(b) visualizes the sentence-level feature vectors in the case of domain adversarial learning. It can be confirmed that the sentence-level feature vectors are clustered not by speaker but only by the presence or absence of stress. That is, the sentence-level feature vector is a stress recognition vector generalized across speakers.
  • the stress recognition method may be performed by a stress recognition device.
  • step S10 the stress recognition device acquires the generated voice signal.
  • step S20 the stress recognition apparatus extracts a feature vector by analyzing the speech signal in units of a predetermined window.
  • In step S30, the stress recognition device performs deep learning by inputting the feature vector, trains a feature vector independent of the speaker information through domain adversarial learning by providing the speaker information and the stress information together, and outputs the stress recognition result.
  • the feature vector extraction step S20 includes a noise removal step of removing noise from the voice signal.
  • the feature vector extraction step S20 includes a distortion compensation step of compensating for distortion by emphasizing a high frequency through a pre-emphasis filter.
  • the feature vector extraction step (S20) includes a silence processing step of dividing a section in which a speech is present in the distortion-compensated speech signal, obtaining a speech segment, and transferring the speech segment to the Fourier transform step.
  • the feature vector extraction step ( S20 ) includes a Fourier transform step of transforming the speech signal into frames of a certain length on the time axis in order to analyze the frequency change of the speech signal according to the temporal flow in units of a predetermined window.
  • each of the divided frames is multiplied by a Mel-filter bank having a pattern for a plurality of frequency domains to obtain a Mel-spectrogram representing energy for each frequency band of the Mel scale, and the Mel-filter It includes a filter bank processing step of extracting the characteristics of the bank.
  • the feature vector extraction step S20 includes a normalization step of normalizing the features of the Mel-filter bank.
  • the feature vector extraction step S20 includes a division processing step of extracting the feature vector by dividing the normalized feature by a predetermined fixed length of time.
  • the model learning step S30 includes an encoder processing step of receiving a feature vector, modeling temporal and frequency components of the feature vector to be suitable for stress determination, and outputting an output vector for each frame.
  • The model learning step (S30) includes a weight processing step of calculating a weight vector for each time step based on the output vector produced for each frame, and converting the frame-level output vectors into a sentence-level vector by combining the per-time weight vector with the output vectors.
  • the model learning step ( S30 ) includes a stress classification processing step of modeling a non-linear relationship between a sentence level feature vector and a stress label to generate a determination result for the existence of stress.
  • the model learning step S30 includes a speaker classification processing step of generating a speaker discrimination result of a speech signal by modeling a non-linear relationship between a sentence level feature vector and a speaker label.
  • In the model learning step (S30), the weighted sum of the error between the discrimination result derived from the feature vector and the stress label and the error with respect to the speaker label is calculated as the loss function; then, using a backpropagation algorithm, training that minimizes the stress label error while increasing the speaker label error is performed repeatedly.
  • The existing stress recognition model utilizes only stress information in the training process; since the sentence-level vector can learn characteristics that depend on the speaker information along with the emotional state, the recognition rate can drop when the test environment differs from the real environment.
  • In the present invention, by additionally providing speaker information in the training process of the deep learning model and performing domain adversarial learning, the parameters inside the model learn to estimate stress information independently of the speaker information. Psychological stress in the voice signal can thus be recognized effectively by using the speaker information adversarially.
  • the stress recognition apparatus may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, or may be implemented using a general-purpose or special-purpose computer.
  • the device may be implemented using a hardwired device, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like.
  • the device may be implemented as a system on chip (SoC) including one or more processors and controllers.
  • the stress recognition apparatus may be mounted in a form of software, hardware, or a combination thereof on a computing device or server provided with hardware elements.
  • A computing device or server may mean any of various devices including all or part of a communication device, such as a communication modem for performing communication with various devices or wired/wireless communication networks, a memory for storing data for executing a program, and a microprocessor for executing operations and commands by running the program.
  • Computer-readable medium represents any medium that participates in providing instructions to a processor for execution.
  • Computer-readable media may include program instructions, data files, data structures, or a combination thereof. For example, there may be a magnetic medium, an optical recording medium, a memory, and the like.
  • a computer program may be distributed over a networked computer system so that computer readable code is stored and executed in a distributed manner. Functional programs, codes, and code segments for implementing the present embodiment may be easily inferred by programmers in the art to which this embodiment belongs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Molecular Biology (AREA)
  • Psychiatry (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Signal Processing (AREA)
  • Veterinary Medicine (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Hospice & Palliative Care (AREA)
  • Theoretical Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Developmental Disabilities (AREA)
  • Social Psychology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Psychology (AREA)
  • Software Systems (AREA)

Abstract

The present embodiments provide a stress recognition apparatus and method with improved recognition accuracy. The method uses domain adversarial training with speaker information given during training to remove speaker-dependent traits from the voice signal and to learn the feature vector associated with psychological stress independently of speaker information.

Description

Speech signal-based stress recognition apparatus and method using adversarial learning with speaker information

The technical field to which the present invention pertains relates to an apparatus and method for voice signal-based stress recognition. This work was carried out under the advanced convergence content technology development program funded by the Ministry of Science and ICT and supported by the Institute of Information & Communications Technology Planning & Evaluation, and relates to research and development on emotional intelligence that can infer and judge the emotions of a conversation partner and converse and respond accordingly (No. 1711116331).

The content described in this section merely provides background information for the present embodiment and does not constitute prior art.

Techniques for determining stress from a voice signal are largely divided into a part that extracts feature vectors from the speech and a part that statistically models the relationship between the extracted vectors and the stress state. Speech feature vectors previously used for stress discrimination include pitch, Mel-Frequency Cepstral Coefficients (MFCC), and per-frame energy. These speech feature vectors are obtained through feature vector extraction algorithms.

Earlier models for recognizing emotional states from voice signals relied on statistics-based discrimination methods such as the hidden Markov model, but recently, models based on neural networks such as convolutional, recurrent, and attention networks have been proposed that automatically learn and extract feature vectors related to the stress state from the voice signal. Such neural network architectures model the relationship between input and label non-linearly, have the advantage of effectively reflecting the statistical characteristics of the data through data-driven training, and show performance superior to the earlier statistical approaches.

Neural network architectures learn their parameters through the backpropagation algorithm by computing the loss between the label and the value predicted non-linearly from the network input. Because such neural network-based algorithms reflect the statistical characteristics of the data and effectively model the image-like characteristics of a spectrogram or the sequential characteristics of speech, they are actively used in recent image- and speech-related research.

(Patent Document 1) Korean Patent Publication No. 10-2019-0135916 (2019.12.09.)

The main purpose of embodiments of the present invention is to improve stress recognition accuracy by using domain adversarial training with speaker information given during training to remove speaker-dependent tendencies from the voice signal and to learn feature vectors related to psychological stress independently of speaker information.

Other objects not specified in the present invention may additionally be considered within the scope easily inferred from the following detailed description and its effects.

According to one aspect of the present embodiment, there is provided a stress recognition apparatus comprising one or more processors and a memory storing one or more programs executed by the one or more processors, wherein the processor obtains a voice signal, extracts a feature vector from the voice signal, and outputs a stress recognition result through a domain adversarial stress discrimination model using the feature vector.

The domain adversarial stress discrimination model may include a stress recognition unit that determines the presence or absence of stress and a speaker recognition unit that distinguishes speakers.

The speaker recognition unit may apply a gradient reversal layer.

The domain adversarial stress discrimination model may be trained so that the loss of the speaker recognition unit increases, dispersing the feature vectors in the vector space.

The domain adversarial stress discrimination model may determine stress in the voice signal independently of the speaker through domain adversarial training.

As described above, according to embodiments of the present invention, domain adversarial training with speaker information given during training removes speaker-dependent tendencies from the voice signal and learns feature vectors related to psychological stress independently of speaker information, with the effect of improving stress recognition accuracy.

Even for effects not explicitly mentioned herein, the effects described in the following specification that are expected from the technical features of the present invention, and their potential effects, are treated as if described in this specification.
FIG. 1 is a block diagram illustrating a stress recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating the operation of a stress recognition apparatus according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating the voice signal preprocessing operation of a stress recognition apparatus according to an embodiment of the present invention.

FIG. 4 and FIG. 5 are diagrams illustrating the domain adversarial stress discrimination model of a stress recognition apparatus according to an embodiment of the present invention.

FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by a stress recognition apparatus according to an embodiment of the present invention.

FIG. 7 is a flowchart illustrating a stress recognition method according to another embodiment of the present invention.
Hereinafter, in describing the present invention, detailed descriptions of related known functions will be omitted when they are obvious to those skilled in the art and are judged liable to unnecessarily obscure the subject matter of the present invention, and some embodiments of the present invention will be described in detail with reference to exemplary drawings.

Neural network-based models that can model well the continuously varying characteristics of speech signals are used in fields such as stress and emotion recognition. However, when only stress information is used during training, the trained sentence-level feature vector is insufficiently learned and acquires characteristics dependent not only on stress information but also on speaker information, which affects stress discrimination performance when a speaker different from those in the training data speaks. That is, stress discrimination performance varies in situations where a different speaker is speaking.

Because neural network-based models strongly reflect the statistical characteristics of the data, model performance is affected when the distributions of the training data and the test data differ. There is a problem in that recognition performance deteriorates in environments where the domain changes.

The stress recognition apparatus according to the present embodiment improves discrimination accuracy by using domain adversarial training with speaker information given during training to learn, from the voice signal, feature vectors dedicated solely to stress discrimination, independently of speaker information.

Domain adversarial training refers to a training technique that constructs a network consisting of a first recognition unit serving the main recognition objective and a second recognition unit distinguishing between different domains, and improves the performance of the main first recognition unit by training it while simultaneously degrading the domain recognition performance of the second recognition unit.
FIG. 1 is a block diagram illustrating a stress recognition apparatus according to an embodiment of the present invention.

The stress recognition apparatus 110 includes at least one processor 120, a computer-readable storage medium 130, and a communication bus 170.

The processor 120 may control the apparatus to operate as the stress recognition apparatus 110. For example, the processor 120 may execute one or more programs stored in the computer-readable storage medium 130. The one or more programs may include one or more computer-executable instructions which, when executed by the processor 120, may be configured to cause the stress recognition apparatus 110 to perform operations according to the exemplary embodiment.

The computer-readable storage medium 130 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 140 stored in the computer-readable storage medium 130 includes a set of instructions executable by the processor 120. In one embodiment, the computer-readable storage medium 130 may be memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other forms of storage media that can be accessed by the stress recognition apparatus 110 and store desired information, or a suitable combination thereof.

The communication bus 170 interconnects the various other components of the stress recognition apparatus 110, including the processor 120 and the computer-readable storage medium 130.

The stress recognition apparatus 110 may also include one or more input/output interfaces 150 that provide interfaces for one or more input/output devices, and one or more communication interfaces 160. The input/output interface 150 and the communication interface 160 are connected to the communication bus 170. Input/output devices may be connected to the other components of the stress recognition apparatus 110 through the input/output interface 150.

The stress recognition apparatus 110 adds a speaker recognition unit including a gradient reversal layer to the domain adversarial stress discrimination model, and improves discrimination accuracy by using domain adversarial training with speaker information given during training to learn, from the voice signal, feature vectors dedicated solely to stress discrimination, independently of speaker information.
FIG. 2 is a diagram illustrating the operation of a stress recognition apparatus according to an embodiment of the present invention.

The stress recognition apparatus extracts feature vectors from a speech database, trains a model that takes the obtained feature vectors as input and updates the internal parameters of a neural network, and finally determines the presence or absence of stress.

In the feature vector acquisition process, a vector reflecting the characteristics of the speech signal is extracted in a specified manner. The speech signal is divided on the time axis into frames of constant length (5-40 ms), and the energy per frequency band is extracted for each frame, yielding a power spectrogram representing the signal's energy over time and frequency. A mel filter bank, which has a pattern over multiple frequency regions, is then applied to obtain a mel-spectrogram representing the energy in each mel-scale frequency band. Because the frequency-band energies take very small values close to zero, a log function is applied to convert them to a log scale, widening the scale between energies and changing the distribution, which facilitates computation and model training.

In the deep neural network training process, an initial model created through parameter generation is trained via the backpropagation algorithm so that the parameter values reflect the statistical characteristics of the data. To apply the adversarial training method, the obtained speech feature vectors are set at the input layer, and label information indicating stress/non-stress plus additional speaker label information is set at the output layer. The error between the value the network predicts from the input and the label representing the emotional state is computed with a loss function, and the backpropagation algorithm trains the network to minimize the emotional-state recognition error. At the same time, using the speaker label information, the network maximizes the error predicted by the speaker recognition unit. That is, the parameters inside the trained model minimize the emotional-state estimation error rate while maximizing the speaker recognition error rate, so that a feature vector suitable for emotional-state recognition is learned independently of the speaker.

In the stress estimation step, a speech feature vector is passed through the trained neural network model to determine whether the given speech contains stress; in the speaker estimation step, a speech feature vector is likewise passed through to determine who the speaker of the given speech is.
FIG. 3 is a diagram illustrating the voice signal preprocessing operation of a stress recognition apparatus according to an embodiment of the present invention.

To extract features robust enough for training a deep neural network model, the stress recognition apparatus must apply a number of preprocessing steps to the recorded speech signal.

The stress recognition apparatus removes noise from the speech signal. Unwanted background noise components are removed; Wiener filtering may be applied to remove the noise.

The stress recognition apparatus applies a pre-emphasis filter, that is, a simple high-pass filter, to reduce the dynamic range of the frequency spectrum and compensate for distortion arising during speech signal processing. By emphasizing high frequencies, the pre-emphasis filter balances the dynamic range between the high-frequency and low-frequency regions.
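The pre-emphasis step described above amounts to a first-order high-pass filter. A minimal sketch follows; the coefficient 0.97 is a conventional choice assumed here, not a value specified in this document:

```python
def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1].

    Boosts high frequencies to balance the dynamic range between the
    high- and low-frequency regions. coeff=0.97 is a common convention
    (an assumption here, not fixed by the document).
    """
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]
```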
The stress recognition apparatus distinguishes whether each section of the voice signal contains speech. In the silence-processing step, a Voice Activity Detection (VAD) algorithm is applied to obtain speech segments: silent sections of the signal are located and removed (treated as 0), after which the speech segments are obtained.
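The document does not specify which VAD algorithm is used; a minimal energy-threshold sketch of the silence-removal idea (the threshold value below is an illustrative assumption) might look like:

```python
def remove_silence(frames, energy_threshold=0.01):
    """Keep only frames whose mean energy reaches the threshold;
    silent frames are dropped (treated as 0), leaving speech segments."""
    voiced = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)  # mean energy
        if energy >= energy_threshold:
            voiced.append(frame)  # speech segment
    return voiced
```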
The stress recognition apparatus performs analysis and transformation in units of a predetermined window. The Fourier transform unit 440 may analyze the input speech signal every 10 ms with a 25 ms Hanning window.

A Short-Time Fourier Transform (STFT) may be performed to analyze how the frequency content of the speech signal changes over time. In the Fourier transform step, the speech signal may be divided on the time axis into frames of constant length (5-40 ms).

The stress recognition apparatus may extract the energy per frequency band for each of the divided frames. In the filter bank step, a power spectrogram representing the signal's energy over time and frequency may be extracted.

In the filter bank step, a mel filter bank having a pattern over multiple frequency regions is applied to obtain a mel-spectrogram representing the energy in each mel-scale frequency band. To narrow the energy-scale difference between frequency bands, a log function is applied to convert the energies to a log scale before the features are extracted.
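The framing, power spectrogram, mel filter bank, and log steps can be sketched end to end. The sketch below assumes a 16 kHz sampling rate, a 25 ms Hanning window, and a 10 ms hop (matching the windowing described above); the triangular mel filter construction is a simplified illustration, not the document's exact implementation:

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, frame_len=400, hop=160, n_mels=26):
    """Log-scale mel-spectrogram: STFT power -> mel filter bank -> log."""
    # 1) split into overlapping frames and apply a Hanning window
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # 2) per-frame power spectrum (energy per frequency bin)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3) triangular mel filter bank, equally spaced on the mel scale
    n_bins = power.shape[1]
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    hz_points = 700.0 * (10 ** (np.linspace(0.0, mel_max, n_mels + 2)
                                / 2595.0) - 1.0)
    bins = np.floor((n_bins - 1) * hz_points / (sr / 2)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    mel_energy = power @ fbank.T
    # 4) log scale, with a small floor since band energies approach zero
    return np.log(mel_energy + 1e-10)
```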
The stress recognition apparatus normalizes the mel filter bank features. In the normalization step, the features are normalized to have zero mean and unit variance.

The stress recognition apparatus divides the features into segments of fixed duration (e.g., 2 s, 4 s, 5 s) according to the settings of the stress discrimination algorithm, and outputs the finally extracted feature vectors. The feature vectors output from the segmentation step are passed to the domain adversarial stress discrimination model to train the deep neural network model for stress discrimination.
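The normalization and fixed-length segmentation steps can be sketched as follows; the default of 200 frames corresponds to 2 s at a 10 ms hop, one of the durations mentioned above, while the per-dimension normalization is an illustrative assumption:

```python
import numpy as np

def normalize_and_segment(features, seg_frames=200):
    """Normalize a (time, n_mels) feature matrix to zero mean and unit
    variance per dimension, then split it along time into fixed-length
    segments for the discrimination model."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8  # guard against division by zero
    normalized = (features - mean) / std
    n_segments = len(normalized) // seg_frames  # trailing remainder dropped
    return [normalized[i * seg_frames:(i + 1) * seg_frames]
            for i in range(n_segments)]
```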
FIG. 4 and FIG. 5 are diagrams illustrating the domain adversarial stress discrimination model of a stress recognition apparatus according to an embodiment of the present invention.

The domain adversarial stress discrimination model can be divided into an encoder, which converts the mel-spectrogram into an embedding vector suitable for stress determination; an attention weighted-sum unit, which computes the relationship between the per-frame embedding vectors and the stress label to assign weights and then extracts sentence-level speech characteristics via a weighted sum; and a sentence-level hidden layer, which performs stress determination and speaker recognition using the sentence-level feature vector.

The encoder at the bottom of the network models the temporal and frequency components of the input speech feature vector so that they are suitable for stress recognition. The encoder consists of several convolutional neural network layers. As the input layer of the convolutional network, the log-scale power mel-spectrum extracted by the feature vector acquisition unit at a preset time interval (e.g., 10 ms) is used as the network input.

Each frame of the speech signal is extracted over a short interval, and the information in neighboring frames has a continuous, sequential character. To model such a speech signal effectively, a neural network structure that models the overall information across neighboring frames must be applied; a representative such structure is the recurrent neural network. After the convolutional network, the signal passes through a recurrent network, producing an output vector whose size is reduced by a fixed ratio in both time and frequency. This two-dimensional output vector is passed to an attention hidden layer with multiple heads. If needed, the encoder may use dilated convolution layers instead of a recurrent structure to reduce the number of parameters.

The attention weighted-sum unit corresponds to a multi-head attention layer. For the query Q, key K, and value V of the i-th head, with dimension d for the number of heads r, the attention head is computed as in Equation 1.
Figure PCTKR2020017789-appb-M000001
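The equation image referenced above is not reproduced in this text. Under the assumption that Equation 1 is the standard scaled dot-product formulation of a multi-head attention head, it would read:

```latex
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i,
\qquad i = 1, \dots, r
```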
For each vector output by the encoder per frame, the relationship with the stress label is computed to obtain a per-time weight. A fully-connected layer is added on top of each per-frame vector so that the weight is computed, as in Equation 2.
Figure PCTKR2020017789-appb-M000002
The frame-level vectors are converted into a sentence-level vector representing the sentence by attention pooling, which multiplies the computed weight vector W by the encoder output values and sums along the time axis. This sentence-level vector is passed to the sentence-level hidden layer.
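The per-frame scoring and attention pooling described above can be sketched for a single head; the scoring parameters below stand in for the fully-connected layer and are illustrative assumptions:

```python
import numpy as np

def attention_pooling(frame_vectors, score_weights, score_bias=0.0):
    """Collapse frame-level vectors (T, D) into one sentence-level
    vector (D,) via a learned softmax-weighted sum over time.

    score_weights (D,) and score_bias play the role of the
    fully-connected scoring layer; they are hypothetical parameters.
    """
    scores = frame_vectors @ score_weights + score_bias  # (T,) per-frame score
    scores = scores - scores.max()                       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax over time
    return weights @ frame_vectors                       # weighted sum -> (D,)
```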
The sentence-level hidden layer includes a stress recognition unit that models the non-linear relationship between the sentence-level feature vector and the stress label, and a speaker recognition unit that models the non-linear relationship with the speaker label.

In the speaker recognition unit, a gradient reversal layer is added before the fully-connected layer for speaker recognition. This hidden layer reverses the direction of the gradient by multiplying the gradient computed during backpropagation by -1, so that during training the sentence-level feature vector is trained in the direction that raises the stress recognition rate while simultaneously raising the speaker recognition loss, making information about the speaker hard to distinguish. That is, the sentence-level feature vector is learned to contain information that distinguishes stress characteristics regardless of speaker information.
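The gradient reversal layer is the identity in the forward pass and flips the gradient's sign in the backward pass. A framework-free sketch of the idea follows (in practice this would be a custom autograd operation in a deep learning framework; the scaling factor lam is an illustrative assumption):

```python
class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam in
    the backward pass, pushing the layers below to *increase* the
    speaker recognition loss."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reverse the gradient direction
```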
The stress recognition unit determines the presence or absence of stress by passing through the sentence-level hidden layer. The speaker recognition unit is connected to a fully-connected layer whose dimension equals the number of speakers and serves to distinguish speakers.

The entire neural network model is trained with the backpropagation algorithm by computing the error between the network output and the stress label. At the same time, the speaker label is treated and assigned as a domain label, the error between the network outputs is computed, and, as the gradient propagates through the gradient reversal layer in the direction opposite to the learning direction, the model is trained by this adversarial method so that it cannot distinguish speakers.

The loss function is expressed as a weighted sum of the stress recognition loss function (cross-entropy) and the speaker recognition loss function with its sign reversed. A typical domain adversarial training method provides separate target data for training, but since the goal in the present invention is to learn sentence-level feature vectors independent of speaker information, the same training data is used throughout.
Figure PCTKR2020017789-appb-M000003
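The weighted-sum loss just described can be sketched for one utterance; the weighting factor lam is an illustrative assumption, not a value given in the document:

```python
import math

def adversarial_loss(stress_probs, stress_label,
                     speaker_probs, speaker_label, lam=0.1):
    """Total loss = stress cross-entropy minus lam times the speaker
    cross-entropy (the sign reversal realized by the gradient reversal
    layer): L = CE_stress - lam * CE_speaker."""
    ce_stress = -math.log(stress_probs[stress_label] + 1e-12)
    ce_speaker = -math.log(speaker_probs[speaker_label] + 1e-12)
    return ce_stress - lam * ce_speaker
```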
도 6은 본 발명의 일 실시예에 따른 스트레스 인식 장치가 처리하는 문장 레벨 특징 벡터의 분산을 예시한 도면이다. 도면에서 Non-stress는 'X'로 표시되고 Stress는 '●'로 표시되어 있다. 'X'와 '●'의 분산을 보면 소규모 군집여부를 확인할 수 있고, Non-stress와 Stress 간의 경계 영역 유무를 확인할 수 있다.6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention. In the drawing, non-stress is indicated by ' X ' and stress is indicated by '●'. By looking at the variance of ' X ' and '●', it is possible to check whether small clusters exist and whether there is a boundary region between non-stress and stress.
도 6의 (a)는 일반 스트레스 인식 손실함수만으로 훈련했을 때의 문장 레벨 특징 벡터를 t-SNE (t-Distributed Stochastic Neighbor Embedding) 알고리즘으로 시각화하여 2차원으로 나타낸 도면이다. 일부 몇 명의 화자에 대한 문장 레벨 특징벡터를 표현한 것인데, 화자정보가 동일한 문장 레벨 특징 벡터끼리 군집화 되어있음을 볼 수 있다. 이는 판별장치의 문장 레벨 특징 벡터가 스트레스 인식 정보뿐만 아니라 화자 인식 정보까지 포함하고 있어 화자 정보에 독립적인 스트레스 인식 목적에 적합하지 않음을 나타낸다. FIG. 6(a) is a two-dimensional view of a sentence-level feature vector when trained only with a general stress recognition loss function by visualizing it with a t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm. Sentence-level feature vectors for several speakers are expressed, and it can be seen that the same speaker information is clustered among sentence-level feature vectors. This indicates that the sentence level feature vector of the discrimination device includes not only the stress recognition information but also the speaker recognition information, so it is not suitable for the purpose of stress recognition independent of the speaker information.
FIG. 6(b) visualizes the sentence-level feature vectors when domain adversarial training is applied. The sentence-level feature vectors no longer cluster by speaker but only by the presence or absence of stress. That is, the sentence-level feature vector encodes stress information generalized across speakers.
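A visualization like FIG. 6 can be reproduced in outline with scikit-learn's t-SNE. This is a minimal sketch on synthetic data standing in for the encoder output; the feature dimensionality, sample counts, and perplexity are all assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is available

rng = np.random.default_rng(0)
# Toy sentence-level feature vectors: two classes (non-stress / stress)
# in a 32-dimensional space, standing in for real encoder outputs.
non_stress = rng.normal(loc=0.0, scale=1.0, size=(20, 32))
stress = rng.normal(loc=3.0, scale=1.0, size=(20, 32))
features = np.vstack([non_stress, stress])

# Project to 2-D for plotting, as in FIG. 6. Perplexity must be
# smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(features)
```

Plotting `embedding` with one marker per class ('X' vs '●') would then show whether the vectors cluster by stress state or by speaker.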
FIG. 7 is a flowchart illustrating a stress recognition method according to another embodiment of the present invention. The stress recognition method may be performed by a stress recognition apparatus.
In step S10, the stress recognition apparatus acquires a generated voice signal.
In step S20, the stress recognition apparatus extracts a feature vector by analyzing the voice signal in units of a predetermined window.
In step S30, the stress recognition apparatus performs deep learning with the feature vector as input, trains a feature vector independent of speaker information through domain adversarial training in which speaker information and stress information are provided together, and outputs a stress recognition result.
The feature vector extraction step (S20) includes a noise removal step of removing noise from the voice signal.
The feature vector extraction step (S20) includes a distortion compensation step of compensating for distortion by emphasizing high frequencies through a pre-emphasis filter.
The feature vector extraction step (S20) includes a silence processing step of identifying the sections in which speech is present in the distortion-compensated voice signal, obtaining speech segments, and passing them to the Fourier transform step.
The feature vector extraction step (S20) includes a Fourier transform step of converting the voice signal on the time axis into frames of a fixed length in order to analyze the frequency change of the voice signal over time in units of a predetermined window.
The feature vector extraction step (S20) includes a filter bank processing step of multiplying each of the divided frames by a mel-filter bank having patterns over a plurality of frequency regions to obtain a mel-spectrogram representing the energy in each mel-scale frequency band, thereby extracting mel-filter bank features.
The feature vector extraction step (S20) includes a normalization step of normalizing the mel-filter bank features.
The feature vector extraction step (S20) includes a segmentation step of dividing the normalized features into segments of a preset fixed length of time to extract the feature vector.
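The feature extraction steps of S20 (pre-emphasis, framing, Fourier transform, mel filter bank, normalization) can be sketched in plain numpy. This is an illustrative reconstruction under assumed parameters (16 kHz sampling, 25 ms frames, 10 ms hop, 40 mel bands); noise removal and voice-activity detection are omitted:

```python
import numpy as np

def mel_filter_bank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def extract_features(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    # Pre-emphasis: boost high frequencies to compensate for spectral tilt.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal along the time axis into fixed-length frames.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Magnitude spectrum per frame (short-time Fourier transform).
    spectrum = np.abs(np.fft.rfft(frames, n=512, axis=1))
    # Mel filter bank energies -> log mel-spectrogram.
    mel = spectrum @ mel_filter_bank(n_mels, 512, sr).T
    log_mel = np.log(mel + 1e-10)
    # Per-band normalization (zero mean, unit variance over time).
    return (log_mel - log_mel.mean(axis=0)) / (log_mel.std(axis=0) + 1e-10)

# One second of a 440 Hz tone stands in for a recorded voice signal.
features = extract_features(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

The final segmentation step would then slice `features` into fixed-duration chunks before they are fed to the model.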
The model learning step (S30) includes an encoder processing step of receiving the feature vector, modeling its temporal and frequency components so as to be suitable for stress discrimination, and outputting an output vector for each frame.
The model learning step (S30) includes a weight processing step of computing a time-wise weight vector based on the output vectors produced for each frame and combining the weight vector with the output vectors to convert the frame-level output vectors into a sentence-level vector.
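The conversion of frame-level encoder outputs into a single sentence-level vector via time-wise weights can be sketched as softmax attention pooling. The scoring function and dimensions below are assumptions for illustration, since the patent does not specify them:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frame_vectors, score_weights):
    """Collapse (n_frames, dim) frame-level outputs into one
    sentence-level vector using a time-wise weight vector."""
    scores = frame_vectors @ score_weights  # one scalar score per frame
    weights = softmax(scores)               # time-wise weights, sum to 1
    return weights @ frame_vectors          # weighted sum over time

rng = np.random.default_rng(1)
frames = rng.normal(size=(98, 64))  # encoder output: 98 frames, 64-dim
w = rng.normal(size=64)             # learned scoring vector (assumed)
sentence_vector = attention_pool(frames, w)
```

The resulting `sentence_vector` is what the stress and speaker classifiers below both consume.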
The model learning step (S30) includes a stress classification processing step of modeling the non-linear relationship between the sentence-level feature vector and the stress label to produce a determination result on the presence or absence of stress.
The model learning step (S30) includes a speaker classification processing step of modeling the non-linear relationship between the sentence-level feature vector and the speaker label to produce a speaker discrimination result for the voice signal.
In the model learning step (S30), the weighted sum of the error between the determination result derived from the feature vector and the stress label and the error with respect to the speaker label is computed as a loss function, and training is repeatedly performed with a backpropagation algorithm so that the stress-label error is minimized while the speaker-label error is simultaneously increased.
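The opposing gradient flows of this training rule, descending on the stress error while ascending on the speaker error, are usually realized with a gradient reversal layer between the encoder and the speaker classifier. A minimal hand-rolled sketch (the scaling factor `lam` is an assumed hyperparameter):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the
    backward pass. The speaker classifier above the layer still
    descends on its own loss, but the shared encoder below it is
    updated to *increase* the speaker loss."""
    def __init__(self, lam=0.5):
        self.lam = lam

    def forward(self, x):
        return x  # no change to activations

    def backward(self, grad_output):
        return -self.lam * grad_output  # flip and scale the gradient

# Toy check: the forward pass is the identity, and a gradient headed
# toward the encoder is reversed and scaled by lam.
grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(grl.forward(x), x)
reversed_grad = grl.backward(np.array([0.2, 0.4, -0.6]))
```

In an autograd framework this backward rule would be attached to a custom function, so ordinary backpropagation implements the sign-reversed speaker term of the loss automatically.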
A conventional stress recognition model uses only stress information during training; the sentence-level vector may then learn characteristics that depend on speaker information according to emotional state, which can lower the recognition rate when the test environment differs from the real environment.
In the present invention, speaker information is additionally provided during training of the deep learning model and domain adversarial training is performed, so that the internal parameters of the model learn to estimate stress information independently of speaker information. By using speaker information adversarially, psychological stress in a voice signal can be recognized effectively.
The stress recognition apparatus may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, or may be implemented using a general-purpose or special-purpose computer. The apparatus may be implemented using a hardwired device, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or the like. In addition, the apparatus may be implemented as a system on chip (SoC) including one or more processors and controllers.
The stress recognition apparatus may be mounted in the form of software, hardware, or a combination thereof on a computing device or server provided with hardware elements. The computing device or server may refer to various apparatuses including all or some of a communication device, such as a communication modem for communicating with various devices or wired/wireless communication networks, a memory for storing data for executing programs, and a microprocessor for executing programs to perform operations and issue commands.
Although FIG. 7 describes the processes as being executed sequentially, this is merely illustrative; those skilled in the art may apply various modifications and variations, such as changing the order described in FIG. 7, executing one or more processes in parallel, or adding other processes, without departing from the essential characteristics of the embodiment of the present invention.
The operations according to the present embodiments may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. A computer-readable medium refers to any medium that participates in providing instructions to a processor for execution, and may include program instructions, data files, data structures, or a combination thereof; examples include magnetic media, optical recording media, and memory. A computer program may be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiments can be easily inferred by programmers skilled in the art to which the embodiments belong.
The present embodiments are intended to explain the technical idea of the invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present embodiments.

Claims (5)

  1. A stress recognition apparatus comprising one or more processors and a memory storing one or more programs executed by the one or more processors,
    wherein the processor is configured to:
    acquire a voice signal,
    extract a feature vector from the voice signal, and
    output a stress recognition result through a domain adversarial stress discrimination model using the feature vector.
  2. The stress recognition apparatus of claim 1,
    wherein the domain adversarial stress discrimination model comprises a stress recognition unit that determines the presence or absence of stress and a speaker recognition unit that distinguishes speakers.
  3. The stress recognition apparatus of claim 2,
    wherein the speaker recognition unit applies a gradient reversal layer.
  4. The stress recognition apparatus of claim 2,
    wherein the domain adversarial stress discrimination model is trained so that the loss of the speaker recognition unit increases, thereby dispersing the feature vectors in the vector space.
  5. The stress recognition apparatus of claim 2,
    wherein the domain adversarial stress discrimination model determines stress in the voice signal independently of the speaker through domain adversarial training.
PCT/KR2020/017789 2020-11-27 2020-12-07 Voice signal-based method and apparatus for recognizing stress using adversarial training with speaker information WO2022114347A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2020-0161957 2020-11-27
KR1020200161957A KR102389610B1 (en) 2020-11-27 2020-11-27 Method and apparatus for determining stress in speech signal learned by domain adversarial training with speaker information

Publications (1)

Publication Number Publication Date
WO2022114347A1 true WO2022114347A1 (en) 2022-06-02

Family

ID=81437234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/017789 WO2022114347A1 (en) 2020-11-27 2020-12-07 Voice signal-based method and apparatus for recognizing stress using adversarial training with speaker information

Country Status (2)

Country Link
KR (1) KR102389610B1 (en)
WO (1) WO2022114347A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308318A (en) * 2018-08-14 2019-02-05 深圳大学 Training method, device, equipment and the medium of cross-domain texts sentiment classification model
KR20190135916A (en) * 2018-05-29 2019-12-09 연세대학교 산학협력단 Apparatus and method for determining user stress using speech signal
KR20200114705A (en) * 2019-03-29 2020-10-07 연세대학교 산학협력단 User adaptive stress state classification Method using speech signal
US20200364539A1 (en) * 2020-07-28 2020-11-19 Oken Technologies, Inc. Method of and system for evaluating consumption of visual information displayed to a user by analyzing user's eye tracking and bioresponse data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ABDELWAHAB MOHAMMED, BUSSO CARLOS: "Domain Adversarial for Acoustic Emotion Recognition", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE, USA, vol. 26, no. 12, 1 December 2018 (2018-12-01), USA, pages 2423 - 2435, XP055933927, ISSN: 2329-9290, DOI: 10.1109/TASLP.2018.2867099 *

Also Published As

Publication number Publication date
KR102389610B1 (en) 2022-04-21

Similar Documents

Publication Publication Date Title
Mannepalli et al. MFCC-GMM based accent recognition system for Telugu speech signals
CN107610707A (en) A kind of method for recognizing sound-groove and device
Qamhan et al. Digital audio forensics: microphone and environment classification using deep learning
CN107767869A (en) Method and apparatus for providing voice service
Hibare et al. Feature extraction techniques in speech processing: a survey
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN108962231B (en) Voice classification method, device, server and storage medium
Wijethunga et al. Deepfake audio detection: a deep learning based solution for group conversations
Antony et al. Speaker identification based on combination of MFCC and UMRT based features
WO2023163383A1 (en) Multimodal-based method and apparatus for recognizing emotion in real time
CN114722812A (en) Method and system for analyzing vulnerability of multi-mode deep learning model
CN112489623A (en) Language identification model training method, language identification method and related equipment
Xue et al. Cross-modal information fusion for voice spoofing detection
KS et al. Comparative performance analysis for speech digit recognition based on MFCC and vector quantization
CN112397093B (en) Voice detection method and device
Wu et al. A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling
WO2022114347A1 (en) Voice signal-based method and apparatus for recognizing stress using adversarial training with speaker information
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Chen et al. An intelligent nocturnal animal vocalization recognition system
Singh et al. A critical review on automatic speaker recognition
Ali et al. Fake audio detection using hierarchical representations learning and spectrogram features
Eltanashi et al. Proposed speaker recognition model using optimized feed forward neural network and hybrid time-mel speech feature
WO2021153843A1 (en) Method for determining stress of voice signal by using weights, and device therefor
Radha et al. Improving recognition of speech system using multimodal approach
Kari et al. Real time implementation of speaker recognition system with MFCC and neural networks on FPGA

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20963741

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20963741

Country of ref document: EP

Kind code of ref document: A1