WO2022114347A1 - Voice-signal-based method and apparatus for recognizing stress through adversarial training with speaker information
- Publication number
- WO2022114347A1 (PCT/KR2020/017789)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- stress
- speaker
- recognition
- feature vector
- domain
- Prior art date
Classifications
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
- G06N3/08—Computing arrangements based on biological models; Neural networks; Learning methods
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L17/02—Speaker identification or verification; Preprocessing operations, e.g. segment selection; Feature selection or extraction
- G10L17/04—Speaker identification or verification; Training, enrolment or model building
- G10L25/63—Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/66—Speech or voice analysis techniques specially adapted for extracting parameters related to health condition
- G16H50/20—ICT specially adapted for computer-aided medical diagnosis, e.g. based on medical expert systems
Definitions
- The present invention relates to an apparatus and method for recognizing stress based on a voice signal.
- This work is related to emotional-intelligence research and development under the high-tech convergence content technology development project, funded by the Ministry of Science and ICT and supported by the Institute of Information & Communications Technology Planning & Evaluation (No. 1711116331), which aims to infer and judge the emotions of a conversation partner and to communicate and respond accordingly.
- Techniques for determining stress from a voice signal are broadly divided into a part that extracts a voice feature vector and a part that models the relationship between the extracted vector and the stress state by statistical methods.
- Feature vectors of speech previously used for stress discrimination include pitch, Mel-Frequency Cepstral Coefficients (MFCC), and per-frame energy.
- Neural network structures receive these features as input, calculate the loss between the nonlinearly predicted output and the label, and learn parameters through the backpropagation algorithm. Such neural-network-based algorithms reflect the statistical characteristics of the data, effectively model the image-like characteristics of the spectrogram and the sequential characteristics of speech, and are currently actively used in image- and speech-related research fields.
- Patent Document 1: Korean Patent Publication No. 10-2019-0135916 (2019.12.09)
- Embodiments of the present invention remove the speaker-dependent tendency from a voice signal using domain adversarial training with speaker information given during training, and learn a feature vector related to psychological stress independently of the speaker information. The main purpose is to improve stress recognition accuracy.
- In one embodiment, a stress recognition apparatus is characterized in that the processor obtains a voice signal, extracts a feature vector from the voice signal, and outputs a stress recognition result through a domain adversarial stress discrimination model using the feature vector.
- the domain adversarial stress discrimination model may include a stress recognition unit for determining the presence or absence of stress and a speaker recognition unit for classifying the speaker.
- the speaker recognizer may apply a gradient reversal layer.
- the domain adversarial stress discrimination model may be trained to increase the loss of the speaker recognition unit, so that the feature vectors are distributed in the vector space independently of the speaker.
- the domain adversarial stress discrimination model may determine stress independently of the speaker in the voice signal through domain adversarial learning.
- According to the embodiments, the speaker-dependent tendency is removed from the voice signal by using domain adversarial training with the speaker information given during training, and feature vectors related to psychological stress are learned independently of the speaker information, which has the effect of improving stress recognition accuracy.
- FIG. 1 is a block diagram illustrating an apparatus for recognizing stress according to an embodiment of the present invention.
- FIG. 2 is a diagram illustrating an operation of a stress recognition apparatus according to an embodiment of the present invention.
- FIG. 3 is a diagram illustrating a voice signal pre-processing operation of the apparatus for recognizing stress according to an embodiment of the present invention.
- FIGS. 4 and 5 are diagrams illustrating a domain adversarial stress discrimination model of the stress recognition apparatus according to an embodiment of the present invention.
- FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention.
- FIG. 7 is a flowchart illustrating a stress recognition method according to another embodiment of the present invention.
- Neural-network-based models that can model the continuously changing characteristics of speech signals well are used in fields such as stress and emotion recognition.
- However, when only stress information is used during training, the learned sentence-level feature vector is insufficiently constrained: it comes to depend on the speaker information as well as the stress information, which degrades discrimination performance on data that differs from the training data. That is, stress discrimination performance varies when a different speaker is speaking.
- Since a neural-network-based model largely reflects the statistical characteristics of the data, a mismatch between the distributions of the training data and the test data affects model performance; recognition performance deteriorates in an environment in which the domain has changed.
- The stress recognition apparatus according to the embodiments improves discrimination accuracy by learning, from the voice signal, a feature vector for stress discrimination only, independent of the speaker information, by utilizing domain adversarial training with the speaker information given during training.
- Domain adversarial learning constructs a network composed of a first recognizer serving the main recognition purpose and a second recognizer distinguishing between domains. It refers to a training technique that trains the first recognizer while simultaneously training against the second recognizer so as to reduce domain recognition performance, thereby improving the performance of the first recognizer.
- FIG. 1 is a block diagram illustrating an apparatus for recognizing stress according to an embodiment of the present invention.
- the stress recognition device 110 includes at least one processor 120 , a computer readable storage medium 130 , and a communication bus 170 .
- the processor 120 may control the device to operate as the stress recognition device 110.
- the processor 120 may execute one or more programs stored in the computer-readable storage medium 130 .
- the one or more programs may include one or more computer-executable instructions which, when executed by the processor 120, cause the stress recognition device 110 to perform operations in accordance with the exemplary embodiment.
- Computer-readable storage medium 130 is configured to store computer-executable instructions or program code, program data, and/or other suitable form of information.
- the program 140 stored in the computer-readable storage medium 130 includes a set of instructions executable by the processor 120 .
- the computer-readable storage medium 130 may include memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that can be accessed by the stress recognition apparatus 110 and store the desired information, or a suitable combination thereof.
- the communication bus 170 interconnects the various components of the stress recognition device 110, including the processor 120 and the computer-readable storage medium 130.
- the stress recognition device 110 may also include one or more input/output interfaces 150 and one or more communication interfaces 160 that provide interfaces for one or more input/output devices.
- the input/output interface 150 and the communication interface 160 are connected to the communication bus 170 .
- the input/output device may be connected to other components of the stress recognition device 110 through the input/output interface 150 .
- the stress recognition device 110 adds a speaker recognition unit including a gradient reversal layer to the domain adversarial stress discrimination model and utilizes domain adversarial training with the speaker information given during training, so that the feature vector learned from the speech signal serves stress discrimination only, independent of the speaker information, thereby improving discrimination accuracy.
- FIG. 2 is a diagram illustrating an operation of a stress recognition apparatus according to an embodiment of the present invention.
- the stress recognition apparatus extracts feature vectors from a speech database, trains a model that updates the internal parameters of a neural network using the acquired feature vectors as input, and finally determines the presence or absence of stress.
- In the feature extraction process, a vector reflecting the characteristics of the speech signal is extracted in a specified manner. After dividing the audio signal into frames of a certain length (5 to 40 ms) on the time axis, energy is extracted for each frequency band from each frame, yielding a power spectrogram that represents the energy of the voice signal as a function of time and frequency. Thereafter, a mel-spectrogram representing the energy of each mel-scale frequency band is obtained by multiplying by a mel-filter bank having filter patterns over a plurality of frequency domains. Since the frequency-band energies have very small values close to 0, a log function is applied to convert them to a log scale; this expands the scale between energies and changes the distribution, making the model easier to compute and train.
- the initial model created through parameter initialization is trained through the backpropagation algorithm so that the parameter values reflect the statistical characteristics of the data.
- In the training process, the acquired speech feature vector is set in the input layer, and label information indicating stress/non-stress together with additional speaker label information is set in the output layer. A backpropagation algorithm is used to train the network to minimize the error in recognizing the emotional state.
- At the same time, the network maximizes the error predicted by the speaker recognition unit. That is, the parameters inside the trained model minimize the error rate in estimating the emotional state while maximizing the error rate of speaker recognition, so a feature vector suitable for emotional-state recognition is learned independently of the speaker.
- In the stress discrimination step, a voice feature vector is passed through the trained neural network model to determine the presence or absence of stress in a given voice; in the speaker estimation step, a voice feature vector is passed through the same model to determine who the speaker of the given voice is.
- FIG. 3 is a diagram illustrating a voice signal pre-processing operation of the apparatus for recognizing stress according to an embodiment of the present invention.
- the stress recognition device needs to apply a plurality of preprocessing steps to the recorded speech signal.
- the stress recognition apparatus performs an operation of removing noise from a voice signal.
- unwanted background noise components are removed; for example, the noise can be removed by applying Wiener filtering.
- the stress recognition device compensates for distortion introduced in the speech signal processing chain by reducing the dynamic range of the frequency spectrum with a pre-emphasis filter, that is, a simple high-pass filter. By emphasizing high frequencies, the pre-emphasis filter balances the dynamic range between the high-frequency and low-frequency regions.
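The pre-emphasis step described above can be sketched as a one-line difference filter. This is a minimal numpy illustration, not the patent's implementation; the coefficient `alpha = 0.97` is a commonly used value assumed here, not one stated in the document.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Simple high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

# Example: a 1 s, 50 Hz tone sampled at 16 kHz
x = np.sin(2 * np.pi * 50 * np.arange(16000) / 16000)
y = pre_emphasis(x)
```

Because adjacent samples of a low-frequency signal are nearly equal, the difference `x[n] - alpha * x[n-1]` attenuates low frequencies and relatively boosts high ones.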
- the stress recognizing apparatus performs an operation of discriminating whether or not a voice is present in a voice signal.
- a voice segment is acquired by applying a VAD (Voice Activity Detection) algorithm in the silence-processing step. The silence sections in the voice signal are located and removed (processed as 0), and the voice segments are acquired.
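The silence processing can be illustrated with a simple frame-energy threshold. This is a hedged stand-in for a real VAD algorithm (the patent does not specify which VAD is used); the frame length and threshold ratio are assumed values for the sketch.

```python
import numpy as np

def remove_silence(signal, frame_len=400, threshold_ratio=0.1):
    """Zero out frames whose energy falls below a fraction of the mean
    frame energy, and report which frames are voiced."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len).copy()
    energy = (frames ** 2).sum(axis=1)
    voiced = energy >= threshold_ratio * energy.mean()
    frames[~voiced] = 0.0          # silence sections are processed as 0
    return frames.reshape(-1), voiced

# Example: 2 silent frames followed by 2 tone frames at 16 kHz
sr, frame_len = 16000, 400
x = np.concatenate([np.zeros(2 * frame_len),
                    np.sin(2 * np.pi * 440 * np.arange(2 * frame_len) / sr)])
cleaned, voiced = remove_silence(x, frame_len)
```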
- the stress recognition apparatus performs analysis and transformation in units of a predetermined window.
- the Fourier transform unit 440 may analyze the input voice signal every 10 ms with a Hanning window of 25 ms.
- Short-Time Fourier Transform may be performed to analyze the frequency change of the voice signal over time.
- the speech signal can be divided into frame units of a constant length (5 to 40 ms) on the time axis.
- the stress recognition apparatus may extract energy for each frequency band from each of the divided frames.
- a power spectrogram representing energy according to time and frequency of a voice signal may be extracted.
- a mel-spectrogram representing energy for each frequency band of the mel scale is obtained by multiplying a mel-filter bank having a pattern for a plurality of frequency domains.
- a log function is applied to convert the energy into log scale energy to extract features.
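The chain described in the preceding steps (25 ms Hanning window every 10 ms, power spectrogram, mel-filter bank, log scale) can be sketched end to end in numpy. This is an illustrative implementation under common assumptions (16 kHz sampling, 40 mel bands, triangular mel filters); the patent does not fix these parameters.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(signal, sr=16000, win=400, hop=160, n_mels=40):
    """25 ms Hanning window every 10 ms -> power spectrogram -> mel -> log."""
    window = np.hanning(win)
    frames = [signal[s:s + win] * window
              for s in range(0, len(signal) - win + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, n=win)) ** 2    # power spectrogram
    mel = spec @ mel_filter_bank(n_mels, win, sr).T   # mel-band energies
    return np.log(mel + 1e-10)                        # log scale

x = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000)  # 1 s test tone
M = log_mel_spectrogram(x)
```

The small constant added before the log plays the role described above: it keeps near-zero band energies finite while the log expands the scale between energies.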
- the stress recognition device normalizes the Mel-filter bank features so that they have zero mean and unit variance.
- the stress recognition device divides the feature into a fixed length of time (eg, 2 sec, 4 sec, 5 sec, etc.) according to the setting of the stress discrimination algorithm, and outputs the finally extracted feature vector.
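The normalization and fixed-length segmentation steps can be sketched together. This is a minimal illustration: 200 frames per segment is an assumed value corresponding to roughly 2 s at a 10 ms hop, one of the example lengths mentioned above.

```python
import numpy as np

def normalize_and_segment(features, seg_frames=200):
    """Zero-mean / unit-variance normalization per mel coefficient,
    then division into fixed-length segments (~2 s at a 10 ms hop)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8   # avoid division by zero
    normed = (features - mu) / sigma
    n_seg = len(normed) // seg_frames
    return normed[: n_seg * seg_frames].reshape(n_seg, seg_frames, -1)

rng = np.random.default_rng(0)
feats = 5.0 + 3.0 * rng.standard_normal((450, 40))  # 450 frames, 40 mel bands
segs = normalize_and_segment(feats)                 # -> 2 segments of 200 frames
```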
- the feature vector output from the segmentation process is transferred to the domain adversarial stress discrimination model to learn the deep neural network model for stress discrimination.
- FIGS. 4 and 5 are diagrams illustrating the domain adversarial stress discrimination model of the stress recognition apparatus according to an embodiment of the present invention.
- the domain adversarial stress discrimination model consists of: an encoder that converts the Mel-spectrogram into an embedding vector suitable for stress judgment; an attention weighted-sum unit that calculates the relationship between the per-frame embedding vectors and the stress label, assigns weights accordingly, and extracts sentence-level speech characteristics by weighted summation; and a sentence-level hidden layer that performs stress determination and speaker recognition using the sentence-level feature vectors.
- the encoder at the bottom of the network plays a role in modeling the temporal and frequency components of the input speech feature vector to be suitable for stress recognition.
- the encoder consists of several layers of convolutional neural networks.
- As the input to the network's convolutional layers, a log-scale power Mel-spectrum extracted at a preset time interval (e.g., 10 ms) by the feature vector acquisition unit is used.
- Each frame of the voice signal is extracted over a short interval, and the information across neighboring frames has a continuous, sequential characteristic.
- Therefore, a neural network structure that models the overall information across neighboring frames should be applied; a representative such structure is the recurrent neural network.
- After the convolutional neural network, the signal passes through the recurrent neural network to generate an output vector reduced in size along both the time and frequency axes.
- the two-dimensional output vector is transferred to the attention hidden layer having multiple heads.
- the network structure of the encoder unit may utilize a dilated convolution layer instead of a recurrent neural network structure in order to reduce the number of parameters if necessary.
- the attention weighted summation unit corresponds to a multi-head attention hidden layer.
- the attention head is calculated as in Equation 1.
- the relationship with the stress label is calculated to obtain a weight for each time step.
- a fully-connected layer is applied to the vector of each frame so that the weight can be calculated.
- the frame-level vectors are transformed into a sentence-level vector representing the sentence by attention pooling, which multiplies the calculated weight vector W by the encoder output values and sums along the time axis. This sentence-level vector is transferred to the sentence-level hidden layer.
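The attention pooling above can be sketched as follows. This is a simplified single-head illustration; the projection vector `w` stands in for the learned fully-connected layer and is fixed here only for demonstration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, w):
    """Score each frame vector with a learned projection w, softmax the
    scores over time, and take the weighted sum as the sentence-level
    vector."""
    scores = H @ w              # (T,) one scalar score per frame
    alpha = softmax(scores)     # attention weights over time, sum to 1
    return alpha @ H, alpha     # (D,) sentence-level vector, weights

# Toy example: 5 frame vectors of dimension 3
H = np.arange(15, dtype=float).reshape(5, 3)
w = np.array([0.1, 0.0, -0.1])  # hypothetical learned parameter
v, alpha = attention_pool(H, w)
```

With this particular `w`, every frame gets the same score, so the pooled vector reduces to the time average of the frames; a trained `w` would instead emphasize stress-relevant frames.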
- the sentence-level hidden layer includes a stress recognition unit modeling the non-linear relationship between the sentence-level feature vector and the stress label, and a speaker recognition unit modeling the non-linear relationship between the sentence-level feature vector and the speaker label.
- a gradient reversal layer is added before the fully-connected layer for speaker recognition.
- This layer reverses the direction of the gradient by multiplying the gradient value calculated during backpropagation training by -1, so the network is trained to make information about the speaker hard to distinguish. That is, the sentence-level feature vector is learned to contain information that distinguishes the stress characteristic regardless of the speaker information.
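The gradient reversal layer can be sketched independently of any deep learning framework. This is a minimal illustration of the mechanism just described (identity forward, negated gradient backward); in practice it would be registered as a custom autograd operation.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; in the backward pass the incoming
    gradient is multiplied by -lam, so the features feeding this layer
    are pushed to *increase* the speaker-classification loss."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                        # pass features through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reverse the gradient direction

grl = GradientReversal(lam=1.0)
x = np.array([0.3, -1.2, 0.8])
g = np.array([0.5, 0.1, -0.4])
```

Because the stress branch receives the ordinary gradient while the speaker branch's gradient arrives negated, a single backpropagation pass trains the shared encoder toward stress recognition and away from speaker recognition simultaneously.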
- the stress recognition unit determines the presence/absence of stress by passing through the sentence level hidden layer.
- the speaker recognition unit is connected to a fully-connected layer with as many output dimensions as the number of speakers, and serves to distinguish speakers.
- the entire neural network model is trained with a backpropagation algorithm by calculating the error between the output value of the network and the stress label.
- For the speaker recognition unit, the error between the network output and the speaker label is calculated, and the gradient is propagated in the direction opposite to the learning direction through the gradient reversal layer, so that the speakers cannot be distinguished.
- the loss function is expressed as a weighted sum of the stress recognition loss (cross-entropy) and the speaker recognition loss with its sign reversed.
- the learning data is kept the same.
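The combined loss just described can be written out directly. This is a sketch for a single example; the weighting factor `lam` is a hypothetical hyperparameter, and in a real implementation the sign reversal would come from the gradient reversal layer rather than an explicit minus sign.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -np.log(probs[label] + 1e-12)

def adversarial_loss(stress_probs, stress_label,
                     speaker_probs, speaker_label, lam=0.5):
    """Weighted sum of the stress loss and the sign-reversed speaker
    loss, as in the loss function described above."""
    return (cross_entropy(stress_probs, stress_label)
            - lam * cross_entropy(speaker_probs, speaker_label))

# Stress predicted well (p=0.8); speaker prediction uniform over 4 speakers
loss = adversarial_loss(np.array([0.8, 0.2]), 0,
                        np.array([0.25, 0.25, 0.25, 0.25]), 2)
```

Minimizing this quantity lowers the stress error while raising the speaker error, which is exactly the training objective stated above.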
- FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention.
- non-stress is indicated by ' X '
- stress is indicated by ' ⁇ '.
- FIG. 6(a) visualizes the sentence-level feature vectors in two dimensions with the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm when the model is trained only with the general stress recognition loss. Sentence-level feature vectors for several speakers are shown, and vectors sharing the same speaker information can be seen to cluster together. This indicates that the sentence-level feature vector contains not only stress recognition information but also speaker recognition information, and is therefore not suitable for stress recognition independent of the speaker.
- FIG. 6(b) visualizes the sentence-level feature vectors in the case of domain adversarial training. It can be confirmed that the sentence-level feature vectors cluster not by speaker but only by the presence or absence of stress. That is, the sentence-level feature vector represents stress in a generalized way, regardless of the speaker.
- the stress recognition method may be performed by a stress recognition device.
- step S10 the stress recognition device acquires the generated voice signal.
- step S20 the stress recognition apparatus extracts a feature vector by analyzing the speech signal in units of a predetermined window.
- In step S30, the stress recognition device performs deep learning with the feature vector as input, trains the feature vector independently of the speaker information through domain adversarial learning by providing the speaker information and the stress information together, and outputs the stress recognition result.
- the feature vector extraction step S20 includes a noise removal step of removing noise from the voice signal.
- the feature vector extraction step S20 includes a distortion compensation step of compensating for distortion by emphasizing a high frequency through a pre-emphasis filter.
- the feature vector extraction step (S20) includes a silence processing step of dividing a section in which a speech is present in the distortion-compensated speech signal, obtaining a speech segment, and transferring the speech segment to the Fourier transform step.
- the feature vector extraction step (S20) includes a Fourier transform step of dividing the speech signal into frames of a certain length on the time axis, in order to analyze the frequency change of the speech signal over time in units of a predetermined window.
- the feature vector extraction step (S20) includes a filter bank processing step in which each of the divided frames is multiplied by a Mel-filter bank having filter patterns over a plurality of frequency domains to obtain a Mel-spectrogram representing energy for each Mel-scale frequency band, and the Mel-filter bank features are extracted.
- the feature vector extraction step S20 includes a normalization step of normalizing the features of the Mel-filter bank.
- the feature vector extraction step S20 includes a division processing step of extracting the feature vector by dividing the normalized feature by a predetermined fixed length of time.
- the model learning step S30 includes an encoder processing step of receiving a feature vector, modeling temporal and frequency components of the feature vector to be suitable for stress determination, and outputting an output vector for each frame.
- the model learning step (S30) includes a weight processing step of calculating a weight vector for each time step based on the per-frame output vectors, and converting the frame-level output vectors into a sentence-level vector by combining the weight vector with the output vectors.
- the model learning step ( S30 ) includes a stress classification processing step of modeling a non-linear relationship between a sentence level feature vector and a stress label to generate a determination result for the existence of stress.
- the model learning step S30 includes a speaker classification processing step of generating a speaker discrimination result of a speech signal by modeling a non-linear relationship between a sentence level feature vector and a speaker label.
- In the model training step, a weighted sum of the error between the discrimination result derived from the feature vector and the stress label, and the error with respect to the speaker label, is calculated as the loss function; training then repeatedly minimizes the stress label error using the backpropagation algorithm while simultaneously increasing the speaker label error.
- The existing stress recognition model utilizes only stress information during training, and since the sentence-level vector can learn characteristics that depend on the speaker information as well as the emotional state, the recognition rate can drop when the deployment environment differs from the training environment.
- In the present invention, speaker information is additionally provided during training of the deep learning model to perform domain adversarial learning, so that the model's internal parameters learn to estimate stress information independently of the speaker information. Psychological stress in the voice signal can thus be recognized effectively by using the speaker information adversarially.
- the stress recognition apparatus may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, or may be implemented using a general-purpose or special-purpose computer.
- the device may be implemented using a hardwired device, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like.
- the device may be implemented as a system on chip (SoC) including one or more processors and controllers.
- the stress recognition apparatus may be mounted, in the form of software, hardware, or a combination thereof, on a computing device or server equipped with hardware elements.
- a computing device or server may refer to any of a variety of devices including all or part of: a communication device, such as a communication modem, for communicating with various devices or wired/wireless communication networks; a memory that stores data for executing a program; and a microprocessor that executes the program to perform operations and commands.
- a computer-readable medium is any medium that participates in providing instructions to a processor for execution.
- computer-readable media may include program instructions, data files, data structures, or a combination thereof; examples include magnetic media, optical recording media, and memory.
- a computer program may be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiment can be readily inferred by programmers skilled in the art to which this embodiment pertains.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Pathology (AREA)
- Psychiatry (AREA)
- Animal Behavior & Ethology (AREA)
- Surgery (AREA)
- Signal Processing (AREA)
- Heart & Thoracic Surgery (AREA)
- Veterinary Medicine (AREA)
- Data Mining & Analysis (AREA)
- Hospice & Palliative Care (AREA)
- Child & Adolescent Psychology (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Epidemiology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Educational Technology (AREA)
- Social Psychology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Psychology (AREA)
- Software Systems (AREA)
- Developmental Disabilities (AREA)
- Databases & Information Systems (AREA)
Abstract
Embodiments of the present invention relate to an apparatus and method for recognizing stress with improved stress recognition accuracy, the method using domain adversarial training with speaker information obtained during training to remove speaker-dependent traits from the voice signal and to train the feature vector associated with psychological stress independently of the speaker information.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020200161957A KR102389610B1 (ko) | 2020-11-27 | 2020-11-27 | 화자 정보와의 적대적 학습을 활용한 음성 신호 기반 스트레스 인식 장치 및 방법 |
KR10-2020-0161957 | 2020-11-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022114347A1 true WO2022114347A1 (fr) | 2022-06-02 |
Family
ID=81437234
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2020/017789 WO2022114347A1 (fr) | 2020-11-27 | 2020-12-07 | Procédé et appareil à base de signal vocal destinés à reconnaître le stress par entraînement adverse avec informations de locuteur |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR102389610B1 (fr) |
WO (1) | WO2022114347A1 (fr) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109308318A (zh) * | 2018-08-14 | 2019-02-05 | 深圳大学 | 跨领域文本情感分类模型的训练方法、装置、设备及介质 |
KR20190135916A (ko) * | 2018-05-29 | 2019-12-09 | 연세대학교 산학협력단 | 음성 신호를 이용한 사용자 스트레스 판별 장치 및 방법 |
KR20200114705A (ko) * | 2019-03-29 | 2020-10-07 | 연세대학교 산학협력단 | 음성 신호 기반의 사용자 적응형 스트레스 인식 방법 |
US20200364539A1 (en) * | 2020-07-28 | 2020-11-19 | Oken Technologies, Inc. | Method of and system for evaluating consumption of visual information displayed to a user by analyzing user's eye tracking and bioresponse data |
-
2020
- 2020-11-27 KR KR1020200161957A patent/KR102389610B1/ko active IP Right Grant
- 2020-12-07 WO PCT/KR2020/017789 patent/WO2022114347A1/fr active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20190135916A (ko) * | 2018-05-29 | 2019-12-09 | 연세대학교 산학협력단 | 음성 신호를 이용한 사용자 스트레스 판별 장치 및 방법 |
CN109308318A (zh) * | 2018-08-14 | 2019-02-05 | 深圳大学 | 跨领域文本情感分类模型的训练方法、装置、设备及介质 |
KR20200114705A (ko) * | 2019-03-29 | 2020-10-07 | 연세대학교 산학협력단 | 음성 신호 기반의 사용자 적응형 스트레스 인식 방법 |
US20200364539A1 (en) * | 2020-07-28 | 2020-11-19 | Oken Technologies, Inc. | Method of and system for evaluating consumption of visual information displayed to a user by analyzing user's eye tracking and bioresponse data |
Non-Patent Citations (1)
Title |
---|
ABDELWAHAB MOHAMMED, BUSSO CARLOS: "Domain Adversarial for Acoustic Emotion Recognition", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE, USA, vol. 26, no. 12, 1 December 2018 (2018-12-01), USA, pages 2423 - 2435, XP055933927, ISSN: 2329-9290, DOI: 10.1109/TASLP.2018.2867099 * |
Also Published As
Publication number | Publication date |
---|---|
KR102389610B1 (ko) | 2022-04-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mannepalli et al. | MFCC-GMM based accent recognition system for Telugu speech signals | |
CN107610707A (zh) | 一种声纹识别方法及装置 | |
Qamhan et al. | Digital audio forensics: microphone and environment classification using deep learning | |
CN107767869A (zh) | 用于提供语音服务的方法和装置 | |
CN111724770B (zh) | 一种基于深度卷积生成对抗网络的音频关键词识别方法 | |
Wijethunga et al. | Deepfake audio detection: a deep learning based solution for group conversations | |
Antony et al. | Speaker identification based on combination of MFCC and UMRT based features | |
WO2023163383A1 (fr) | Procédé et appareil à base multimodale pour reconnaître une émotion en temps réel | |
CN112397093B (zh) | 一种语音检测方法与装置 | |
CN112489623A (zh) | 语种识别模型的训练方法、语种识别方法及相关设备 | |
CN112735435A (zh) | 具备未知类别内部划分能力的声纹开集识别方法 | |
CN114722812A (zh) | 一种多模态深度学习模型脆弱性的分析方法和系统 | |
Xue et al. | Cross-modal information fusion for voice spoofing detection | |
KS et al. | Comparative performance analysis for speech digit recognition based on MFCC and vector quantization | |
Wu et al. | A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling | |
WO2022114347A1 (fr) | Procédé et appareil à base de signal vocal destinés à reconnaître le stress par entraînement adverse avec informations de locuteur | |
Ali et al. | Fake audio detection using hierarchical representations learning and spectrogram features | |
WO2017111386A1 (fr) | Appareil d'extraction des paramètres caractéristiques du signal d'entrée, et appareil de reconnaissance de locuteur utilisant ledit appareil | |
CN111785262A (zh) | 一种基于残差网络及融合特征的说话人年龄性别分类方法 | |
Chen et al. | An intelligent nocturnal animal vocalization recognition system | |
Singh et al. | A critical review on automatic speaker recognition | |
Eltanashi et al. | Proposed speaker recognition model using optimized feed forward neural network and hybrid time-mel speech feature | |
WO2021153843A1 (fr) | Procédé pour déterminer le stress d'un signal vocal en utilisant des poids, et dispositif associé | |
Radha et al. | Improving recognition of speech system using multimodal approach | |
Maruf et al. | Effects of noise on RASTA-PLP and MFCC based Bangla ASR using CNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20963741 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20963741 Country of ref document: EP Kind code of ref document: A1 |