WO2022114347A1 - Voice-signal-based method and apparatus for recognizing stress through adversarial training with speaker information - Google Patents

Voice-signal-based method and apparatus for recognizing stress through adversarial training with speaker information

Info

Publication number
WO2022114347A1
Authority
WO
WIPO (PCT)
Prior art keywords
stress
speaker
recognition
feature vector
domain
Prior art date
Application number
PCT/KR2020/017789
Other languages
English (en)
Korean (ko)
Inventor
강홍구
한혜원
신현경
변경근
Original Assignee
연세대학교 산학협력단 (Industry-Academic Cooperation Foundation, Yonsei University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 연세대학교 산학협력단 (Industry-Academic Cooperation Foundation, Yonsei University)
Publication of WO2022114347A1

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/66: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • The technical field to which the present invention pertains relates to an apparatus and method for recognizing stress based on a voice signal.
  • This work relates to research and development of emotional intelligence that can infer and judge the emotions of a conversational partner and communicate and respond accordingly, carried out as part of the high-tech convergence content technology development project funded by the Ministry of Science and ICT and supported by the Institute of Information & Communications Technology Planning & Evaluation (No. 1711116331).
  • Techniques for determining stress using a voice signal are largely divided into a part that extracts a voice feature vector and a part that statistically models the relationship between the extracted vector and the stress state.
  • Feature vectors of speech previously used for stress discrimination include pitch, Mel-Frequency Cepstral Coefficients (MFCC), and per-frame energy.
  • Neural-network structures take features as input to the network, calculate the loss value between the nonlinearly predicted result and the label, and learn parameters through the backpropagation algorithm. These neural-network-based algorithms reflect the statistical characteristics of the data and effectively model the image-like characteristics of the spectrogram or the sequential characteristics of the voice, and are currently being actively used in image- and voice-related research fields.
  • Patent Document 1 Korean Patent Publication No. 10-2019-0135916 (2019.12.09.)
  • Embodiments of the present invention remove the speaker-dependent tendency from a voice signal using domain adversarial training with speaker information given in the training process, and learn a feature vector related to psychological stress independently of the speaker information. Their main purpose is to improve stress recognition accuracy.
  • a stress recognition apparatus is provided in which the processor obtains a voice signal, extracts a feature vector from the voice signal, and outputs a stress recognition result through a domain adversarial stress discrimination model using the feature vector.
  • the domain adversarial stress discrimination model may include a stress recognition unit for determining the presence or absence of stress and a speaker recognition unit for classifying a speaker.
  • the speaker recognizer may apply a gradient reversal layer.
  • the domain adversarial stress discrimination model may learn to increase the loss of the speaker recognition unit so as to distribute the feature vectors in the vector space.
  • the domain adversarial stress discrimination model may determine stress in the voice signal independently of the speaker through domain adversarial learning.
  • the speaker-dependent tendency is removed from the voice signal by using domain adversarial training with the speaker information given in the training process, and the feature vectors related to psychological stress are learned independently of the speaker information, which has the effect of improving stress recognition accuracy.
  • FIG. 1 is a block diagram illustrating an apparatus for recognizing stress according to an embodiment of the present invention.
  • FIG. 2 is a diagram illustrating an operation of a stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 3 is a diagram illustrating a voice signal pre-processing operation of the apparatus for recognizing stress according to an embodiment of the present invention.
  • FIGS. 4 and 5 are diagrams illustrating a domain adversarial stress discrimination model of the stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention.
  • FIG. 7 is a flowchart illustrating a stress recognition method according to another embodiment of the present invention.
  • neural-network-based models that can effectively model the continuously changing characteristics of speech signals are being used in fields such as stress and emotion recognition.
  • In stress and emotion recognition, when only stress information is used in the training process, the sentence-level feature vector is insufficiently trained and becomes dependent on speaker information as well as stress information, which degrades discrimination performance on data that differs from the training data. That is, stress discrimination performance changes when a different speaker is speaking.
  • Since neural-network-based models largely reflect the statistical characteristics of the data, model performance suffers when the distributions of the training data and the test data differ; recognition performance deteriorates in an environment in which the domain has changed.
  • the stress recognition apparatus improves discrimination accuracy by learning a feature vector for stress discrimination from the voice signal independently of the speaker information, utilizing domain adversarial training with the speaker information given in the training process.
  • Domain adversarial learning constructs a network composed of a first recognizer, which serves the main recognition purpose, and a second recognizer, which distinguishes between different domains. It refers to a training technique that improves the performance of the first recognizer by training it while simultaneously reducing the domain recognition performance of the second recognizer.
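  • As a concrete illustration of this technique, the following is a minimal sketch of a gradient reversal layer of the kind commonly used in domain adversarial training. The patent does not disclose an implementation; the PyTorch framework, the class name GradReverse, and the scaling factor lam are assumptions.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient
    by -lam in the backward pass, so the layers below the reversal
    point are updated to *worsen* the recognizer stacked above it
    (the second, domain recognizer), while that recognizer itself
    still learns normally."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back to the shared feature extractor.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```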
  • FIG. 1 is a block diagram illustrating an apparatus for recognizing stress according to an embodiment of the present invention.
  • the stress recognition device 110 includes at least one processor 120, a computer-readable storage medium 130, and a communication bus 170.
  • the processor 120 may control the apparatus to operate as the stress recognition device 110.
  • the processor 120 may execute one or more programs stored in the computer-readable storage medium 130 .
  • the one or more programs may include one or more computer-executable instructions, which when executed by the processor 120 may be configured to cause the stress recognition device 110 to perform operations in accordance with the exemplary embodiment.
  • Computer-readable storage medium 130 is configured to store computer-executable instructions or program code, program data, and/or other suitable form of information.
  • the program 140 stored in the computer-readable storage medium 130 includes a set of instructions executable by the processor 120 .
  • computer-readable storage medium 130 may include memory (volatile memory such as random-access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that can be accessed by the stress recognition apparatus 110 and store desired information, or a suitable combination thereof.
  • Communication bus 170 interconnects various other components of the stress recognition device 110, including the processor 120 and the computer-readable storage medium 130.
  • the stress recognition device 110 may also include one or more input/output interfaces 150 and one or more communication interfaces 160 that provide interfaces for one or more input/output devices.
  • the input/output interface 150 and the communication interface 160 are connected to the communication bus 170 .
  • the input/output device may be connected to other components of the stress recognition device 110 through the input/output interface 150 .
  • the stress recognition device 110 adds a speaker recognition unit including a gradient reversal layer to the domain adversarial stress discrimination model, and utilizes domain adversarial training with speaker information given in the training process so that the feature vector for stress discrimination is learned from the speech signal independently of the speaker information, improving the discrimination accuracy.
  • FIG. 2 is a diagram illustrating an operation of a stress recognition apparatus according to an embodiment of the present invention.
  • the stress recognition apparatus extracts a feature vector from a speech database, trains a model that takes the acquired feature vector as input and updates the internal parameters of a neural network, and finally determines the presence or absence of stress.
  • a vector reflecting the characteristics of the speech signal is extracted in a specified manner. The audio signal is divided into frames of a certain length (5-40 ms) on the time axis, and energy is extracted for each frequency band of each frame, yielding a power spectrogram that represents energy according to the time and frequency of the voice signal. Thereafter, a mel-spectrogram representing energy for each frequency band of the mel scale is obtained by multiplying by a mel-filter bank having a pattern for a plurality of frequency domains. Since the frequency-band energies have very small values close to 0, a log function is applied to convert them to a log scale, which expands the spacing between energies and changes the distribution, making the model easier to compute and train.
  • the initial model created through parameter generation is trained so that the parameter values reflect the statistical characteristics of the data through the backpropagation algorithm.
  • the acquired speech feature vector is set in the input layer
  • label information indicating stress/non-stress and additional speaker label information are set in the output layer.
  • a backpropagation algorithm is used to train to minimize the error in recognizing the emotional state.
  • the network maximizes the error predicted by the speaker recognition unit. That is, the parameters inside the trained model minimize the error rate in estimating the emotional state while maximizing the error rate of speaker recognition, so that a feature vector suitable for emotional-state recognition is learned independently of the speaker.
  • in the stress discrimination step, a voice feature vector is passed through the trained neural network model to determine the presence or absence of stress in a given voice; in the speaker estimation step, a voice feature vector is passed through the same model to determine who the speaker of the given voice is.
  • FIG. 3 is a diagram illustrating a voice signal pre-processing operation of the apparatus for recognizing stress according to an embodiment of the present invention.
  • the stress recognition device needs to apply a plurality of preprocessing steps to the recorded speech signal.
  • the stress recognition apparatus performs an operation of removing noise from a voice signal.
  • Components of unwanted background noise are removed; the noise can be removed by applying Wiener filtering.
  • the stress recognition device compensates for distortion generated in the speech-signal processing process by reducing the dynamic range of the frequency spectrum with a pre-emphasis filter, that is, a simple high-pass filter. By emphasizing high frequencies, the pre-emphasis filter balances the dynamic range between the high-frequency and low-frequency regions.
  • the stress recognizing apparatus performs an operation of discriminating whether or not a voice is present in a voice signal.
  • a voice segment is acquired by applying a VAD (Voice Activity Detection) algorithm in the silence-processing step. The silence sections of the voice signal are found and removed (set to 0), and the voice segments are obtained.
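  • A brief sketch of the pre-emphasis and silence-processing steps described above is given below. The filter coefficient (0.97), frame sizes, and the energy threshold are illustrative assumptions; the patent does not fix these values, and a production system might combine this with Wiener filtering and a more elaborate VAD.

```python
import numpy as np

def pre_emphasis(x, coef=0.97):
    # y[n] = x[n] - coef * x[n-1]: a simple high-pass filter that
    # emphasizes high frequencies to balance the dynamic range
    # between the high- and low-frequency regions.
    return np.append(x[0], x[1:] - coef * x[:-1])

def energy_vad(x, sr, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    # Simple energy-threshold VAD: silence samples stay 0, and only
    # frames whose log energy exceeds the threshold are kept.
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    out = np.zeros_like(x)
    for start in range(0, max(len(x) - frame, 0), hop):
        seg = x[start:start + frame]
        energy_db = 10 * np.log10(np.mean(seg ** 2) + 1e-10)
        if energy_db > threshold_db:
            out[start:start + frame] = seg  # keep voiced samples
    return out
```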
  • the stress recognition apparatus performs analysis and transformation in units of a predetermined window.
  • the Fourier transform unit 440 may analyze the input voice signal every 10 ms with a Hanning window of 25 ms.
  • Short-Time Fourier Transform may be performed to analyze the frequency change of the voice signal over time.
  • the speech signal can be divided into frame units of a constant length (5 to 40 ms) on the time axis.
  • the stress recognition apparatus may extract energy for each frequency band from each of the divided frames.
  • a power spectrogram representing energy according to time and frequency of a voice signal may be extracted.
  • a mel-spectrogram representing energy for each frequency band of the mel scale is obtained by multiplying a mel-filter bank having a pattern for a plurality of frequency domains.
  • a log function is applied to convert the energy into log scale energy to extract features.
  • the stress recognition device normalizes the Mel-filter bank features so that they have zero mean and unit variance.
  • the stress recognition device divides the feature into a fixed length of time (eg, 2 sec, 4 sec, 5 sec, etc.) according to the setting of the stress discrimination algorithm, and outputs the finally extracted feature vector.
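  • The remainder of the feature-extraction chain (power spectrogram, Mel-filter bank, log scale, normalization, fixed-length segmentation) can be sketched as follows, assuming the 25 ms window and 10 ms hop mentioned above at a 16 kHz sampling rate; the Mel band count and the 4-second segment length are illustrative assumptions.

```python
import numpy as np
import librosa

def log_mel_segments(x, sr=16000, n_fft=400, hop=160, n_mels=40,
                     seg_sec=4.0):
    # Power spectrogram: energy over time and frequency (STFT with a
    # 25 ms window and 10 ms hop at 16 kHz).
    spec = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop)) ** 2
    # Mel filter bank: energy per Mel-scale frequency band.
    mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels) @ spec
    # Log scale expands the spacing between near-zero band energies.
    log_mel = np.log(mel + 1e-10)
    # Normalize each Mel band to zero mean and unit variance.
    log_mel = (log_mel - log_mel.mean(axis=1, keepdims=True)) \
              / (log_mel.std(axis=1, keepdims=True) + 1e-10)
    # Divide into fixed-length segments along the time axis.
    frames = int(seg_sec * sr / hop)
    return [log_mel[:, i:i + frames]
            for i in range(0, log_mel.shape[1] - frames + 1, frames)]
```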
  • the feature vector output from the segmentation process is transferred to the domain adversarial stress discrimination model to learn the deep neural network model for stress discrimination.
  • FIGS. 4 and 5 are diagrams illustrating a domain adversarial stress discrimination model of the stress recognition apparatus according to an embodiment of the present invention.
  • the domain adversarial stress discrimination model consists of an encoder that converts the Mel-spectrogram into an embedding vector suitable for stress judgment; an attention weighted-sum unit that calculates the relationship between the embedding vector extracted for each frame and the stress label, assigns weights accordingly, and extracts sentence-level speech characteristics by weighted summation; and a sentence-level hidden layer that performs stress determination and speaker recognition using the sentence-level feature vector.
  • the encoder at the bottom of the network plays a role in modeling the temporal and frequency components of the input speech feature vector to be suitable for stress recognition.
  • the encoder consists of several layers of convolutional neural networks.
  • As the input layer of the convolutional neural network, a log-scale power Mel-spectrum extracted at a preset time interval (e.g., 10 ms) by the feature vector acquisition unit is used as the input to the network.
  • Each frame of the voice signal is extracted over a short interval, and the information between neighboring frames has a continuous, sequential characteristic.
  • a neural network structure that models the overall information between neighboring frames should therefore be applied; a representative such structure is the recurrent neural network.
  • After the convolutional neural network, the features pass through the recurrent neural network to generate an output vector reduced in size along both the time and frequency axes.
  • the two-dimensional output vector is transferred to the attention hidden layer having multiple heads.
  • the network structure of the encoder unit may use a dilated convolution layer instead of a recurrent neural network structure in order to reduce the number of parameters if necessary.
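  • A hedged sketch of such an encoder is shown below: stacked convolutions reduce the time and frequency axes, and a recurrent layer models the continuity between neighboring frames. Channel counts, kernel sizes, and the GRU width are assumptions, not values disclosed in the patent.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_mels=40, hidden=128):
        super().__init__()
        # Convolutions model local time-frequency patterns and shrink
        # both axes (frequency by 4x, time by 2x here).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
        )
        # Recurrent layer models sequential information across frames.
        self.rnn = nn.GRU(64 * (n_mels // 4), hidden, batch_first=True)

    def forward(self, x):                     # x: (batch, 1, n_mels, time)
        h = self.conv(x)                      # (batch, 64, n_mels/4, time/2)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, time/2, 64*n_mels/4)
        out, _ = self.rnn(h)                  # frame-level embedding vectors
        return out                            # (batch, time/2, hidden)
```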
  • the attention weighted summation unit corresponds to a multi-head attention hidden layer.
  • the attention head is calculated as in Equation 1.
  • the relationship with the stress label is calculated to obtain a weight for each time step.
  • a fully-connected layer is applied to the vector of each frame so that the weight can be calculated.
  • the frame-level vectors are transformed into a sentence-level vector representing the sentence by attention pooling, which multiplies the calculated weight vector W by the encoder output values and sums the products along the time axis. This sentence-level vector is transferred to the sentence-level hidden layer.
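  • Equation 1 itself is not reproduced in this text. A standard form of the attention pooling just described computes a per-frame score with a fully-connected layer, normalizes the scores with a softmax along the time axis, and takes the weighted sum of the encoder outputs; the sketch below is a multi-head variant under that assumption, with the head count chosen arbitrarily.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Fully-connected layer producing one score per head per frame.
        self.score = nn.Linear(dim, heads)

    def forward(self, h):                        # h: (batch, time, dim)
        # Softmax over the time axis gives a weight for each frame.
        w = torch.softmax(self.score(h), dim=1)  # (batch, time, heads)
        # Weighted sum along time: one pooled vector per head.
        c = torch.einsum('bth,btd->bhd', w, h)   # (batch, heads, dim)
        return c.flatten(1)                      # sentence-level vector
```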
  • the sentence-level hidden layer includes a stress recognition unit that models the non-linear relationship between the sentence-level feature vector and the stress label, and a speaker recognition unit that models the non-linear relationship between the sentence-level feature vector and the speaker label.
  • a gradient reversal layer is added before the fully-connected layer for speaker recognition.
  • This hidden layer reverses the direction of the gradient by multiplying the gradient value calculated in the backpropagation training process by -1, and is trained to make it difficult to distinguish speaker information. That is, the sentence-level feature vector is learned to contain information that distinguishes the stress characteristic regardless of the speaker information.
  • the stress recognition unit determines the presence/absence of stress by passing through the sentence level hidden layer.
  • the speaker recognition unit is connected to a fully-connected layer having as many dimensions as the number of speakers, and serves to distinguish speakers.
  • the entire neural network model is trained with a backpropagation algorithm by calculating the error between the output value of the network and the stress label.
  • the error between the network output values and the speaker labels is calculated, and the gradient is propagated in the direction opposite to the learning direction through the gradient reversal layer, so that the speaker cannot be distinguished.
  • the loss function is expressed as a weighted sum of the stress recognition loss function (cross-entropy) for stress recognition and the speaker recognition loss function with the sign reversed.
  • the training data is kept the same.
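  • Putting the pieces together, one training step under the weighted-sum loss described above might look like the sketch below, reusing the grad_reverse, Encoder, and AttentionPool sketches from earlier. Because the speaker branch sits behind the gradient reversal layer, minimizing the total loss minimizes the stress error while maximizing the speaker error in the shared feature vector; the heads stress_head and speaker_head and the weight lam are assumptions.

```python
import torch.nn.functional as F

def train_step(batch, encoder, pool, stress_head, speaker_head,
               optimizer, lam=0.5):
    x, y_stress, y_speaker = batch
    z = pool(encoder(x))                 # sentence-level feature vector
    # Cross-entropy for the main (stress) recognizer.
    loss_stress = F.cross_entropy(stress_head(z), y_stress)
    # The speaker head still learns to classify speakers, but the GRL
    # flips the gradient reaching z, pushing the shared features to
    # discard speaker information.
    loss_speaker = F.cross_entropy(
        speaker_head(grad_reverse(z, lam)), y_speaker)
    loss = loss_stress + loss_speaker    # sign reversal happens in the GRL
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```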
  • FIG. 6 is a diagram illustrating the distribution of sentence-level feature vectors processed by the stress recognition apparatus according to an embodiment of the present invention.
  • non-stress is indicated by 'X'
  • stress is indicated by '○'.
  • FIG. 6(a) visualizes the sentence-level feature vectors in two dimensions using the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm when the model is trained only with a general stress recognition loss function. Sentence-level feature vectors for several speakers are shown, and vectors belonging to the same speaker can be seen to cluster together. This indicates that the sentence-level feature vector contains not only stress recognition information but also speaker recognition information, and is therefore not suitable for stress recognition independent of speaker information.
  • Figure 6(b) visualizes the sentence-level feature vectors in the case of domain adversarial learning. It can be confirmed that the sentence-level feature vectors are clustered not by speaker but only by the presence or absence of stress. That is, the sentence-level feature vector is a stress recognition vector that generalizes regardless of the speaker.
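  • A visualization like FIG. 6 can be reproduced with t-SNE; the sketch below assumes arrays of sentence-level vectors and their stress and speaker labels collected from a trained model (the variable names are hypothetical).

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(vectors, stress_labels, speaker_labels):
    # Project the sentence-level feature vectors to two dimensions.
    points = TSNE(n_components=2).fit_transform(vectors)
    # Marker encodes the stress label; color encodes the speaker, so
    # speaker-wise clustering (as in FIG. 6(a)) is visible at a glance.
    for label, marker in ((0, 'x'), (1, 'o')):
        idx = stress_labels == label
        plt.scatter(points[idx, 0], points[idx, 1], marker=marker,
                    c=speaker_labels[idx], cmap='tab10')
    plt.title('Sentence-level feature vectors (t-SNE)')
    plt.show()
```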
  • the stress recognition method may be performed by a stress recognition device.
  • In step S10, the stress recognition device acquires the generated voice signal.
  • In step S20, the stress recognition apparatus extracts a feature vector by analyzing the speech signal in units of a predetermined window.
  • In step S30, the stress recognition device performs deep-learning training with the feature vector as input, learns a feature vector independent of the speaker information through domain adversarial learning by providing the speaker information together with the stress information, and outputs the stress recognition result.
  • the feature vector extraction step S20 includes a noise removal step of removing noise from the voice signal.
  • the feature vector extraction step S20 includes a distortion compensation step of compensating for distortion by emphasizing a high frequency through a pre-emphasis filter.
  • the feature vector extraction step (S20) includes a silence processing step of identifying the sections in which speech is present in the distortion-compensated speech signal, obtaining speech segments, and transferring them to the Fourier transform step.
  • the feature vector extraction step (S20) includes a Fourier transform step of dividing the speech signal into frames of a certain length on the time axis in order to analyze the frequency change of the speech signal over time in units of a predetermined window.
  • the feature vector extraction step (S20) includes a filter bank processing step in which each of the divided frames is multiplied by a Mel-filter bank having a pattern for a plurality of frequency domains to obtain a Mel-spectrogram representing energy for each frequency band of the Mel scale, and the features of the Mel-filter bank are extracted.
  • the feature vector extraction step S20 includes a normalization step of normalizing the features of the Mel-filter bank.
  • the feature vector extraction step S20 includes a division processing step of extracting the feature vector by dividing the normalized feature by a predetermined fixed length of time.
  • the model learning step S30 includes an encoder processing step of receiving a feature vector, modeling temporal and frequency components of the feature vector to be suitable for stress determination, and outputting an output vector for each frame.
  • the model learning step (S30) includes a weight processing step of calculating a weight vector for each time step based on the output vector of each frame, and converting the frame-level output vectors into a sentence-level vector by combining the weight vectors with the output vectors.
  • the model learning step ( S30 ) includes a stress classification processing step of modeling a non-linear relationship between a sentence level feature vector and a stress label to generate a determination result for the existence of stress.
  • the model learning step S30 includes a speaker classification processing step of generating a speaker discrimination result of a speech signal by modeling a non-linear relationship between a sentence level feature vector and a speaker label.
  • the weighted sum of the error between the discrimination result derived from the feature vector and the stress label and the error with respect to the speaker label is used as the loss function; training is then performed repeatedly to minimize the stress-label error while increasing the speaker-label error using the backpropagation algorithm.
  • the existing stress recognition model uses only stress information in the training process, and since the sentence-level vector can learn characteristics that depend on speaker information in addition to the emotional state, the recognition rate can drop when the test environment differs from the real environment.
  • in the present invention, speaker information is additionally provided in the training process of the deep learning model to perform domain adversarial learning, so that the parameters inside the model are trained to estimate stress information independently of speaker information. By using the speaker information adversarially, the psychological stress in the voice signal can be recognized effectively.
  • the stress recognition apparatus may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, or may be implemented using a general-purpose or special-purpose computer.
  • the device may be implemented using a hardwired device, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like.
  • the device may be implemented as a system on chip (SoC) including one or more processors and controllers.
  • the stress recognition apparatus may be mounted in a form of software, hardware, or a combination thereof on a computing device or server provided with hardware elements.
  • a computing device or server may mean any of a variety of devices including all or part of a communication device, such as a communication modem for communicating with various devices or over wired/wireless communication networks, a memory for storing data needed to execute a program, and a microprocessor for performing operations and commands by executing the program.
  • Computer-readable medium represents any medium that participates in providing instructions to a processor for execution.
  • Computer-readable media may include program instructions, data files, data structures, or a combination thereof. For example, there may be a magnetic medium, an optical recording medium, a memory, and the like.
  • a computer program may be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiment may be easily inferred by programmers skilled in the art to which this embodiment belongs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Pathology (AREA)
  • Psychiatry (AREA)
  • Animal Behavior & Ethology (AREA)
  • Surgery (AREA)
  • Signal Processing (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Veterinary Medicine (AREA)
  • Data Mining & Analysis (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Educational Technology (AREA)
  • Social Psychology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Psychology (AREA)
  • Software Systems (AREA)
  • Developmental Disabilities (AREA)
  • Databases & Information Systems (AREA)

Abstract

Embodiments of the present invention relate to a stress recognition apparatus and method in which stress recognition accuracy is improved, the method using domain adversarial training with speaker information given during the training process to remove speaker-dependent traits from the voice signal, and learning the feature vector related to psychological stress independently of the speaker information.
PCT/KR2020/017789 2020-11-27 2020-12-07 Voice-signal-based method and apparatus for recognizing stress through adversarial training with speaker information WO2022114347A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020200161957A KR102389610B1 (ko) 2020-11-27 2020-11-27 Voice-signal-based stress recognition apparatus and method using adversarial training with speaker information
KR10-2020-0161957 2020-11-27

Publications (1)

Publication Number Publication Date
WO2022114347A1 true WO2022114347A1 (fr) 2022-06-02

Family

ID=81437234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/017789 WO2022114347A1 (fr) 2020-11-27 2020-12-07 Voice-signal-based method and apparatus for recognizing stress through adversarial training with speaker information

Country Status (2)

Country Link
KR (1) KR102389610B1 (fr)
WO (1) WO2022114347A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308318A (zh) * 2018-08-14 2019-02-05 深圳大学 跨领域文本情感分类模型的训练方法、装置、设备及介质
KR20190135916A (ko) * 2018-05-29 2019-12-09 연세대학교 산학협력단 음성 신호를 이용한 사용자 스트레스 판별 장치 및 방법
KR20200114705A (ko) * 2019-03-29 2020-10-07 연세대학교 산학협력단 음성 신호 기반의 사용자 적응형 스트레스 인식 방법
US20200364539A1 (en) * 2020-07-28 2020-11-19 Oken Technologies, Inc. Method of and system for evaluating consumption of visual information displayed to a user by analyzing user's eye tracking and bioresponse data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190135916A (ko) * 2018-05-29 2019-12-09 연세대학교 산학협력단 음성 신호를 이용한 사용자 스트레스 판별 장치 및 방법
CN109308318A (zh) * 2018-08-14 2019-02-05 深圳大学 跨领域文本情感分类模型的训练方法、装置、设备及介质
KR20200114705A (ko) * 2019-03-29 2020-10-07 연세대학교 산학협력단 음성 신호 기반의 사용자 적응형 스트레스 인식 방법
US20200364539A1 (en) * 2020-07-28 2020-11-19 Oken Technologies, Inc. Method of and system for evaluating consumption of visual information displayed to a user by analyzing user's eye tracking and bioresponse data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ABDELWAHAB MOHAMMED, BUSSO CARLOS: "Domain Adversarial for Acoustic Emotion Recognition", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE, USA, vol. 26, no. 12, 1 December 2018 (2018-12-01), USA, pages 2423 - 2435, XP055933927, ISSN: 2329-9290, DOI: 10.1109/TASLP.2018.2867099 *

Also Published As

Publication number Publication date
KR102389610B1 (ko) 2022-04-21

Similar Documents

Publication Publication Date Title
Mannepalli et al. MFCC-GMM based accent recognition system for Telugu speech signals
CN107610707A (zh) 一种声纹识别方法及装置
Qamhan et al. Digital audio forensics: microphone and environment classification using deep learning
CN107767869A (zh) 用于提供语音服务的方法和装置
CN111724770B (zh) 一种基于深度卷积生成对抗网络的音频关键词识别方法
Wijethunga et al. Deepfake audio detection: a deep learning based solution for group conversations
Antony et al. Speaker identification based on combination of MFCC and UMRT based features
WO2023163383A1 (fr) Multimodal-based method and apparatus for recognizing emotion in real time
CN112397093B (zh) 一种语音检测方法与装置
CN112489623A (zh) 语种识别模型的训练方法、语种识别方法及相关设备
CN112735435A (zh) 具备未知类别内部划分能力的声纹开集识别方法
CN114722812A (zh) 一种多模态深度学习模型脆弱性的分析方法和系统
Xue et al. Cross-modal information fusion for voice spoofing detection
KS et al. Comparative performance analysis for speech digit recognition based on MFCC and vector quantization
Wu et al. A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling
WO2022114347A1 (fr) Voice-signal-based method and apparatus for recognizing stress through adversarial training with speaker information
Ali et al. Fake audio detection using hierarchical representations learning and spectrogram features
WO2017111386A1 (fr) Apparatus for extracting characteristic parameters of an input signal, and speaker recognition apparatus using the same
CN111785262A (zh) 一种基于残差网络及融合特征的说话人年龄性别分类方法
Chen et al. An intelligent nocturnal animal vocalization recognition system
Singh et al. A critical review on automatic speaker recognition
Eltanashi et al. Proposed speaker recognition model using optimized feed forward neural network and hybrid time-mel speech feature
WO2021153843A1 (fr) Method for determining stress of a voice signal using weights, and device therefor
Radha et al. Improving recognition of speech system using multimodal approach
Maruf et al. Effects of noise on RASTA-PLP and MFCC based Bangla ASR using CNN

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20963741

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20963741

Country of ref document: EP

Kind code of ref document: A1