CN116978409A - Depression state evaluation method, device, terminal and medium based on voice signal - Google Patents

Depression state evaluation method, device, terminal and medium based on voice signal

Info

Publication number
CN116978409A
CN116978409A (application number CN202311226304.5A)
Authority
CN
China
Prior art keywords
training
features
depression state
acoustic
depression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311226304.5A
Other languages
Chinese (zh)
Inventor
陈扬斌
陆志伟
胡希塔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Fubian Medical Technology Co ltd
Original Assignee
Suzhou Fubian Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Fubian Medical Technology Co ltd filed Critical Suzhou Fubian Medical Technology Co ltd
Priority to CN202311226304.5A priority Critical patent/CN116978409A/en
Publication of CN116978409A publication Critical patent/CN116978409A/en
Withdrawn legal-status Critical Current

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/165 Evaluating the state of mind, e.g. depression, anxiety
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/48 Other medical applications
    • A61B5/4803 Speech analysis specially adapted for diagnostic purposes
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Veterinary Medicine (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physiology (AREA)
  • Mathematical Physics (AREA)
  • Educational Technology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychology (AREA)
  • Epidemiology (AREA)
  • Developmental Disabilities (AREA)
  • Child & Adolescent Psychology (AREA)
  • Fuzzy Systems (AREA)
  • Social Psychology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a depression state assessment method, device, terminal and medium based on voice signals. The method comprises the following steps: acquiring initial voice data of different subjects labeled with depression categories; preprocessing the initial voice data; extracting acoustic features and speech spectrum features from the resulting speech segments; constructing and training an acoustic feature encoder based on the acoustic features, and constructing and training a speech spectrum feature encoder based on the speech spectrum features; and, based on a cross-attention mechanism, fusing the feature vectors produced by the trained acoustic feature encoder and the trained speech spectrum feature encoder to construct a depression state evaluation model, which is then trained to obtain a depression state evaluator. Building on voice signal processing technology, the invention introduces a self-attention mechanism and a cross-attention mechanism, constructs and trains the feature extraction models and the depression state evaluation model on different voice features, and improves the accuracy of depression state assessment.

Description

Depression state evaluation method, device, terminal and medium based on voice signal
Technical Field
The embodiment of the invention relates to the technical field of voice signal processing in the field of artificial intelligence, in particular to a depression state evaluation method, device, terminal and medium based on voice signals.
Background
Depressive disorder refers to a category of mood disorders, arising from various causes, whose primary clinical feature is a significant and persistent low mood; it is one of the most common mental disorders. The etiology, pathology and pathogenesis of depressive disorders have not yet been clearly established, so clinical features currently serve as the main diagnostic basis: clinicians compare and judge each item against those listed in professional rating scales. This makes assessment results relatively subjective, inconsistent and inefficient.
Artificial intelligence techniques are increasingly being applied in the diagnosis of depressive disorders, including digital diagnostics, big data mining and biosignal processing. Compared with traditional assessment, artificial-intelligence-based assessment has the advantages of objectivity, consistency, convenience and efficiency. However, existing work suffers from small data scale, low data quality and weak model generalization, and remains some distance from clinical application.
Disclosure of Invention
The invention provides a depression state evaluation method, device, terminal and medium based on voice signals, which can efficiently and accurately assist in detection and evaluation of depression states.
In a first aspect, an embodiment of the present invention provides a method for evaluating a depression state based on a speech signal, including:
s1, acquiring initial voice data of different testees marked with depression types;
s2, preprocessing the initial voice data to obtain preprocessed voice fragments;
s3, extracting acoustic features and speech spectrum features from the voice fragments;
s4, constructing and training an acoustic feature coding model based on the acoustic features to obtain a trained acoustic feature coder;
s5, constructing and training a language spectrum feature coding model based on the language spectrum features to obtain a trained language spectrum feature coder;
s6, based on a cross-attention mechanism, fusing feature vectors obtained by the trained acoustic feature encoder and the trained speech spectrum feature encoder, and constructing a depression state evaluation model;
and S7, training the depression state evaluation model according to the acoustic features and the language spectrum features to obtain a trained depression state evaluator, wherein the depression state evaluator is used for outputting a depression state evaluation result of a tested person.
Optionally, the step S3 includes:
windowing and framing the voice fragments, extracting low-order voice features based on each frame of voice fragments and statistics features based on the low-order voice features to obtain acoustic features;
and converting the voice fragment into a spectrogram, and extracting the spectrogram characteristics from the spectrogram.
Optionally, the S4 includes:
and respectively pre-training the acoustic feature coding model by taking the acoustic features and the depression categories as training data and training labels, and generating a trained acoustic feature coder according to the optimal parameters obtained by training.
Optionally, the step S5 includes:
and respectively taking the language spectrum characteristics and the depression categories as training data and training labels to pretrain the language spectrum characteristic coding model, and generating a trained language spectrum characteristic coder according to the optimal parameters obtained by training.
Optionally, the step S6 includes:
and training the depression state evaluation model by taking the acoustic features and the language spectrum features as training data and the depression category as a training label, and generating a trained depression state evaluator according to the optimal parameters obtained by training.
Optionally, the speech spectrum feature encoder is a neural network model employing a self-attention mechanism.
Optionally, the acoustic feature encoder is a neural network model employing a self-attention mechanism.
In a second aspect, an embodiment of the present invention further provides a depression state assessment device based on a voice signal, including:
the data acquisition module is used for acquiring initial voice data of different testees marked with depression categories;
the preprocessing module is used for preprocessing the initial voice data to obtain preprocessed voice fragments;
the feature extraction module is used for extracting acoustic features and speech spectrum features from the speech fragments;
the acoustic feature encoder training module is used for constructing and training an acoustic feature encoding model based on the acoustic features so as to obtain a trained acoustic feature encoder;
the language spectrum feature encoder training module is used for constructing and training a language spectrum feature encoding model based on the language spectrum features to obtain a trained language spectrum feature encoder;
the depression state evaluator construction module is used for constructing a depression state evaluation model based on the feature vectors obtained by the trained acoustic feature encoder and the trained language spectrum feature encoder which are fused by a cross-attention mechanism;
the depression state evaluator training module is used for training the depression state evaluation model according to the acoustic features and the language spectrum features to obtain a trained depression state evaluator, and the depression state evaluator is used for outputting a depression state evaluation result of a tested person.
In a third aspect, an embodiment of the present invention provides a terminal, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech signal based depression state assessment method as described in any of the above embodiments.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for assessing depression state based on speech signals as described in any of the above embodiments.
The invention has the beneficial effects that:
1. The invention acquires voice data for depression state assessment in a face-to-face, multi-scenario manner, which is user-friendly, direct and convenient.
2. The invention combines traditional acoustic features and speech spectrum features to represent the voice signal. Traditional acoustic features are commonly used for emotion detection, while speech spectrum features are commonly used for speech recognition; their combination reflects the characteristics of the voice and, to a certain extent, the semantics, and has the advantage of information complementarity.
3. The invention constructs an acoustic feature coding model and a language spectrum feature coding model and performs pre-training, thereby enhancing the training effect of the subsequent depression state evaluation model.
4. The invention introduces a self-attention mechanism and a cross-attention mechanism, integrates local characteristics and global characteristics, and improves the characteristic coding capability and the characteristic fusion quality compared with a method based on a convolutional neural network or a cyclic neural network.
Drawings
Fig. 1 is a flowchart of a depression state evaluation method based on a voice signal according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of an acoustic feature coding model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a network structure of a speech spectrum feature coding model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a network structure of a depression state assessment model according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an application process of depression state assessment according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Examples
Fig. 1 is a flowchart of a depression state evaluation method based on a voice signal according to an embodiment of the present invention, which specifically includes the following steps:
s1, acquiring initial voice data of different testees marked with depression types.
The depression categories correspond to different degrees of depression and can be divided according to severity.
In this embodiment, the initial voice data are recordings of the subjects collected through online and offline interactions, and a clinician assigns each subject a depression category label.
The online and offline interaction scenarios may include different types of interaction, such as interviews, reading aloud and emotion induction, conducted by a doctor or by an intelligent system. Illustratively, interviews mainly concern the subject's recent daily life, reading aloud requires the subject to recite a given piece of text, and emotion induction involves discussing a topic with emotional polarity with the subject.
During the interaction, the subjects' voice data are recorded; each subject receives a diagnosis through a series of clinical examinations by doctors and is assigned a label accordingly.
All subjects in this example generally met the requirements of balanced sex, uniform age distribution, and a relatively balanced number of non-depressed subjects and subjects with varying degrees of depression; subjects with depression were free of other mental disorders.
By setting up multiple scenarios such as interview, reading aloud and emotion induction, the subject can express and present themselves fully; the influence of incidental factors is weakened, the scenario best suited to depression state assessment can be identified, and the interaction design can be optimized. In addition, the invention acquires the voice data for depression state assessment in a face-to-face, multi-scenario manner, which is user-friendly, direct and convenient.
S2, preprocessing the initial voice data to obtain preprocessed voice fragments.
For example, for dialogue data in the initial voice data, the subject's speech segments are extracted manually or by algorithm, while read-aloud data retain the original recordings.
Specifically, the step S2 includes the following steps:
S21, eliminating lower-quality data from the initial voice data. Screening thresholds can be set on conditions such as duration, background noise and silence proportion, and effective data meeting these conditions are selected from the initial voice data under the different interaction scenarios.
S22, to address the limited total amount of voice data and the long duration of individual recordings, multiple segments are extracted from each original recording in a sliding-window manner, which augments the speech data (a minimal sketch of this step is given after this list).
S23, the voice fragments subjected to screening and data enhancement are reserved according to different situations, and corresponding labels are added to serve as training sets.
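As referenced in S22 above, the following is a minimal sketch of the sliding-window segmentation; the window length, hop size and sampling rate are illustrative assumptions, since the patent does not specify them, and Python with NumPy is used purely for illustration.

    import numpy as np

    def slide_segments(waveform, sr, win_s=10.0, hop_s=5.0):
        # Cut one long recording into overlapping segments; the 10 s window
        # and 5 s hop are hypothetical values, not taken from the patent.
        win, hop = int(win_s * sr), int(hop_s * sr)
        segments = []
        for start in range(0, max(len(waveform) - win, 0) + 1, hop):
            seg = waveform[start:start + win]
            if len(seg) == win:  # drop a trailing incomplete window
                segments.append(seg)
        return segments

    # Usage: every segment inherits its subject's depression category label.
    # segments = slide_segments(audio, sr=16000)
    # training_set += [(seg, label) for seg in segments]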
S3, extracting acoustic features and speech spectrum features from the voice fragments.
The acoustic features comprise basic low-order voice features and statistical features based on the low-order voice features, and the speech spectrum features comprise the representation of voice signals such as a Mel spectrogram and the like in a frequency domain.
Specifically, S3 includes the following steps:
S31, extracting acoustic features from the voice fragments. The speech segments are windowed and framed, and low-order speech features, together with statistical features computed over them, are extracted from each frame. This step follows the classical feature set for speech-based emotion recognition, the eGeMAPS feature set, which comprises low-order speech features (Low-Level Descriptors, LLDs) and statistical features based on them (High-level Statistical Features, HSFs). The low-order speech features include pitch, frequency perturbation (jitter), formants, amplitude perturbation (shimmer), loudness, harmonics-to-noise ratio, mel-frequency cepstral coefficients, and so on. The statistical features apply a symmetric moving average over several adjacent frames, compute the standard deviation and normalize by the arithmetic mean. The acoustic features are denoted A ∈ R^(L×D), where D is the dimension of the acoustic features and L is the total number of frames of the speech segment. Table 1 lists the detailed features.
TABLE 1 eGeMAPS Low-order Speech feature set
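The eGeMAPS features named above are commonly extracted with the openSMILE toolkit; the following sketch assumes its Python binding (the opensmile package) is available, which the patent itself does not mention.

    import opensmile

    # Frame-level low-order descriptors (LLDs) of the eGeMAPS set; the
    # Functionals level yields the statistics computed over them (HSFs).
    lld_extractor = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
    )
    hsf_extractor = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    llds = lld_extractor.process_file("segment.wav")  # frames x features, i.e. L x D
    hsfs = hsf_extractor.process_file("segment.wav")  # one row of statistical features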
S32, extracting the speech spectrum features from the speech fragments. The spectrogram, also called a sound spectrogram, is a heat map describing how the frequency components of a sound signal change over time: the vertical axis represents frequency, the horizontal axis represents time, and the matrix values represent energy intensity.
Optionally, the raw two-dimensional mel spectrogram is used in this embodiment, but it is not the only option: the scheme designed in the invention also applies to mel spectrograms in RGB format, other types of spectrograms, spectrograms after differential transformation, and so on. The speech spectrum features are denoted S ∈ R^(H×W), where H and W represent the height and width of the spectrogram, respectively.
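As an illustration only, a log-mel spectrogram of this kind can be computed with librosa; the sampling rate, FFT size, hop length and number of mel bands below are assumed values, not parameters given in the patent.

    import librosa
    import numpy as np

    y, sr = librosa.load("segment.wav", sr=16000)  # assumed sampling rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128  # illustrative settings
    )
    log_mel = librosa.power_to_db(mel, ref=np.max)  # H x W matrix (n_mels x frames)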
The invention combines traditional acoustic features and speech spectrum features to represent the voice signal, and learns the association between these features and the depression state from the temporal variation of pitch, timbre, fundamental frequency, loudness, the spectrogram and so on. Traditional acoustic features are commonly used for emotion detection, while speech spectrum features are commonly used for speech recognition; their combination reflects the characteristics of the voice and, to a certain extent, the semantics, and has the advantage of information complementarity.
S4, constructing and training an acoustic feature coding model based on the acoustic features to obtain a trained acoustic feature coder.
Specifically, S4 includes the following steps:
S41, designing an acoustic feature coding model based on the acoustic features and a self-attention mechanism. The network structure of the acoustic feature coding model is shown in fig. 2. The output of the acoustic feature coding model is obtained as follows:
Q = A·W_Q, K = A·W_K, V = A·W_V, H_a = softmax(Q·K^T / sqrt(d_k))·V
where W_Q, W_K and W_V are parameter matrices and d_k is a hyperparameter. H_a is the output of the self-attention module for the input features; averaging H_a over the frame dimension yields a single vector. The last module of the acoustic feature encoder is a fully connected layer, and this vector is passed through the fully connected layer to obtain the final output.
S42, pre-training the acoustic feature coding model. The acoustic features A and the depression category y are used, respectively, as the training data and the training labels of the acoustic feature coding model; training is run multiple times and the model parameters with the best fit are retained. The cross-entropy loss function for model training is expressed as follows:
L_a = −Σ_{c=1..C} y_c · log ŷ_c, with ŷ = softmax(FC(Mean(SelfAttn(A))))
where SelfAttn denotes the self-attention encoding module, Mean denotes the averaging operation, FC denotes the fully connected layer, and C is the number of depression categories.
Typically, the depression categories are divided into a normal state and a depressed state (C = 2) under coarse-grained classification, or into normal, mildly depressed, moderately depressed and severely depressed states (C = 4) under fine-grained classification. In practical applications, the value of C can be chosen according to requirements.
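The two sketches below illustrate, in PyTorch, one possible encoder of this shape (self-attention over the frame sequence, mean pooling, fully connected head) and a pre-training loop consistent with S42. The class name, hidden size d_k, class count, optimizer, learning rate and the use of lowest epoch loss as the "best fit" criterion are all assumptions for illustration; they are not the network of fig. 2 itself.

    import torch
    import torch.nn as nn

    class AcousticFeatureEncoder(nn.Module):
        # Self-attention over L frames of D-dimensional acoustic features,
        # averaged to one vector and passed through a fully connected layer.
        def __init__(self, feat_dim, d_k=64, num_classes=2):
            super().__init__()
            self.W_q = nn.Linear(feat_dim, d_k, bias=False)
            self.W_k = nn.Linear(feat_dim, d_k, bias=False)
            self.W_v = nn.Linear(feat_dim, d_k, bias=False)
            self.fc = nn.Linear(d_k, num_classes)

        def encode_seq(self, x):  # x: (batch, L, feat_dim)
            q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
            attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
            return attn @ v  # frame-level outputs H_a: (batch, L, d_k)

        def encode(self, x):
            return self.encode_seq(x).mean(dim=1)  # averaged vector: (batch, d_k)

        def forward(self, x):
            return self.fc(self.encode(x))  # class logits

A pre-training loop under the same assumptions:

    def pretrain(encoder, loader, epochs=30, lr=1e-4, device="cpu"):
        # Pre-train a feature encoder with depression categories as labels,
        # keeping the parameters of the epoch with the lowest training loss.
        encoder.to(device)
        optimizer = torch.optim.Adam(encoder.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        best_loss, best_state = float("inf"), None
        for _ in range(epochs):
            total = 0.0
            for feats, labels in loader:  # feats: (B, L, D), labels: (B,)
                feats, labels = feats.to(device), labels.to(device)
                loss = criterion(encoder(feats), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                total += loss.item()
            if total < best_loss:
                best_loss = total
                best_state = {k: v.detach().clone() for k, v in encoder.state_dict().items()}
        encoder.load_state_dict(best_state)
        return encoder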
S5, constructing and training a language spectrum feature coding model based on the language spectrum features to obtain a trained language spectrum feature coder.
Specifically, S5 includes the following steps:
S51, designing a speech spectrum feature coding model based on the speech spectrum features and a self-attention mechanism. The network structure of the speech spectrum feature coding model is shown in fig. 3. The output of the speech spectrum feature coding model is obtained as follows:
Q = S·W_Q, K = S·W_K, V = S·W_V, H_s = softmax(Q·K^T / sqrt(d_k))·V
where W_Q, W_K and W_V are parameter matrices and d_k is a hyperparameter. H_s is the output of the self-attention module for the input features; averaging H_s yields a single vector. The last module of the speech spectrum feature encoder is a fully connected layer, and this vector is passed through the fully connected layer to obtain the final output.
S52, pre-training the speech spectrum feature coding model. The speech spectrum features S and the depression category y are used, respectively, as the training data and the training labels of the speech spectrum feature coding model; training is run multiple times and the model parameters with the best fit are retained. The cross-entropy loss function for model training is expressed as follows:
L_s = −Σ_{c=1..C} y_c · log ŷ_c, with ŷ = softmax(FC(Mean(SelfAttn(S))))
where SelfAttn denotes the self-attention encoding module, Mean denotes the averaging operation, FC denotes the fully connected layer, and C is the number of depression categories.
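Purely as an illustration, the same self-attention pattern sketched above for the acoustic encoder can be reused for the spectrogram by treating each of the W time frames of the H x W mel spectrogram as an H-dimensional token; this token choice is an assumption, since the structure of fig. 3 is not reproduced in the text.

    import torch

    # Hypothetical reuse of the AcousticFeatureEncoder sketch above: each
    # spectrogram column (time frame) becomes one token of dimension n_mels.
    spec_encoder = AcousticFeatureEncoder(feat_dim=128, d_k=64, num_classes=2)

    log_mel_batch = torch.randn(8, 128, 400)               # (batch, n_mels, frames), dummy data
    logits = spec_encoder(log_mel_batch.transpose(1, 2))   # tokens along the time axis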
Compared with a convolution structure and a multi-layer perceptron structure which are commonly used in the prior art, the self-attention mechanism-based feature encoder design method reduces dependence on external information, is more beneficial to capturing internal correlation of features, can more effectively extract features related to depression state evaluation, and enhances training effect of a subsequent depression state evaluation model.
S6, based on a cross-attention mechanism, feature vectors obtained by the trained acoustic feature encoder and the trained speech spectrum feature encoder are fused, and a depression state evaluation model is constructed.
Referring to fig. 4, fig. 4 is a network structure model of a depression state evaluation model, which adopts a cross-attention neural network and merges feature vectors of an acoustic feature encoder and a speech spectrum feature encoder.
Specifically, the depression state evaluation model processes the training data as follows:
S61, the acoustic features A and the speech spectrum features S of the same voice segment are extracted through S3.
S62, the acoustic features are passed through the acoustic feature encoder, and its output is denoted H_a.
S63, the speech spectrum features are passed through the speech spectrum feature encoder, and its output is denoted H_s.
S64, with H_a as Query and H_s as Key, a fused acoustic representation is obtained by attention computation and is denoted Z_a.
S65, with H_s as Query and H_a as Key, a fused spectrum representation is obtained by attention computation and is denoted Z_s.
S66, H_a and Z_a are transformed through parameter matrices to obtain the vector v_a.
S67, H_s and Z_s are transformed through parameter matrices to obtain the vector v_s.
S68, the vector representations obtained in S66 and S67 are concatenated to obtain the final representation of the voice segment, v = [v_a; v_s].
S69, v is passed through the fully connected layer to obtain the final output.
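A sketch of a fusion module matching steps S61 to S69 is given below; it uses PyTorch's MultiheadAttention for the two attention computations and mean pooling followed by a linear layer for the S66/S67 transforms. The hidden sizes and the exact form of those transforms are assumptions, since the patent describes their roles but not their formulas; encode_seq refers to the encoder sketch given earlier.

    import torch
    import torch.nn as nn

    class DepressionStateEvaluator(nn.Module):
        # Cross-attention fusion of the two encoder outputs (steps S61-S69).
        def __init__(self, acoustic_encoder, spectrum_encoder, d_model=64, num_classes=2):
            super().__init__()
            self.enc_a = acoustic_encoder   # pre-trained in S4
            self.enc_s = spectrum_encoder   # pre-trained in S5
            self.cross_as = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            self.cross_sa = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            self.proj_a = nn.Linear(2 * d_model, d_model)  # assumed form of the S66 transform
            self.proj_s = nn.Linear(2 * d_model, d_model)  # assumed form of the S67 transform
            self.fc = nn.Linear(2 * d_model, num_classes)

        def forward(self, acoustic, spectrum):
            h_a = self.enc_a.encode_seq(acoustic)   # (B, L, d_model), frame-level tokens
            h_s = self.enc_s.encode_seq(spectrum)   # (B, W, d_model)
            # S64: acoustic as Query, spectrum as Key/Value -> fused acoustic Z_a
            z_a, _ = self.cross_as(h_a, h_s, h_s)
            # S65: spectrum as Query, acoustic as Key/Value -> fused spectrum Z_s
            z_s, _ = self.cross_sa(h_s, h_a, h_a)
            # S66/S67: combine each pair and reduce to one vector per segment
            v_a = self.proj_a(torch.cat([h_a, z_a], dim=-1).mean(dim=1))
            v_s = self.proj_s(torch.cat([h_s, z_s], dim=-1).mean(dim=1))
            # S68/S69: concatenate and classify through the fully connected layer
            return self.fc(torch.cat([v_a, v_s], dim=-1))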
And S7, training the depression state evaluation model according to the acoustic features and the language spectrum features to obtain a trained depression state evaluator, wherein the depression state evaluator is used for outputting a depression state evaluation result of a tested person.
The depression state estimator based on the cross-attention mechanism in the embodiment can effectively integrate multiple source characteristics, and has better classification capability on the premise of enough data volume.
In the training stage, in order to improve the accuracy of the assessment, the acoustic feature coding module and the speech spectrum feature coding module are initialized with the pre-trained parameters obtained in S4 and S5. The acoustic features A and the speech spectrum features S are used as training data and the depression category y as the training label; training is run multiple times and the model parameters with the best fit are retained. The depression state evaluator is generated from these best-fit parameters.
The cross-entropy loss function for training the depression state evaluation model is expressed as follows:
L = −Σ_{c=1..C} y_c · log ŷ_c, with ŷ = softmax(FC(v))
where v is the fused representation obtained in S68, FC denotes the fully connected layer, and C is the number of depression categories.
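Under the same assumptions as the sketches above, the end-to-end training of S7 could proceed as follows: the encoders are initialized from the pre-trained weights kept in S4 and S5, and the whole evaluator is trained with cross-entropy loss. The feature dimensions, file names, optimizer and DataLoader are all hypothetical.

    import torch

    enc_a = AcousticFeatureEncoder(feat_dim=25, d_k=64, num_classes=2)    # illustrative LLD dimension
    enc_s = AcousticFeatureEncoder(feat_dim=128, d_k=64, num_classes=2)   # columns of a 128-band mel spectrogram
    enc_a.load_state_dict(torch.load("acoustic_encoder_pretrained.pt"))   # parameters kept in S4
    enc_s.load_state_dict(torch.load("spectrum_encoder_pretrained.pt"))   # parameters kept in S5

    evaluator = DepressionStateEvaluator(enc_a, enc_s, d_model=64, num_classes=2)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(evaluator.parameters(), lr=1e-4)

    for acoustic, spectrum, label in train_loader:  # hypothetical DataLoader of paired features
        logits = evaluator(acoustic, spectrum.transpose(1, 2))  # spectrum given as (B, n_mels, W)
        loss = criterion(logits, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()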
after all model training was completed, the combination was applied to a depression state aid assessment, the overall process is shown in fig. 5.
Further, an embodiment of the present invention also provides a depression state assessment device based on a voice signal, which comprises:
the data acquisition module is used for acquiring initial voice data of different testees marked with depression categories;
the preprocessing module is used for preprocessing the initial voice data to obtain preprocessed voice fragments;
the feature extraction module is used for extracting acoustic features and speech spectrum features from the speech fragments;
the acoustic feature encoder training module is used for constructing and training an acoustic feature encoding model based on the acoustic features so as to obtain a trained acoustic feature encoder;
the language spectrum feature encoder training module is used for constructing and training a language spectrum feature encoding model based on the language spectrum features to obtain a trained language spectrum feature encoder;
the depression state evaluator construction module is used for constructing a depression state evaluation model based on the feature vectors obtained by the trained acoustic feature encoder and the trained language spectrum feature encoder, which are fused by a cross-attention mechanism;
the depression state evaluator training module is used for training the depression state evaluation model according to the acoustic features and the language spectrum features to obtain a trained depression state evaluator, and the depression state evaluator is used for outputting a depression state evaluation result of a tested person.
The feature extraction module is specifically used for:
windowing and framing the voice fragments, extracting low-order voice features based on each frame of voice fragments and statistics features based on the low-order voice features to obtain acoustic features;
and converting the voice fragment into a spectrogram, and extracting the spectrogram characteristics from the spectrogram.
Optionally, the acoustic feature encoder training module is specifically configured to:
and respectively pre-training the acoustic feature coding model by taking the acoustic features and the depression categories as training data and training labels, and generating a trained acoustic feature coder according to the optimal parameters obtained by training.
Optionally, the speech spectrum feature encoder training module is specifically configured to:
and respectively taking the language spectrum characteristics and the depression categories as training data and training labels to pretrain the language spectrum characteristic coding model, and generating a trained language spectrum characteristic coder according to the optimal parameters obtained by training.
The depression state evaluator training module is specifically configured to:
and training the depression state evaluation model by taking the acoustic features and the language spectrum features as training data and the depression category as a training label, and generating a trained depression state evaluator according to the optimal parameters obtained by training.
The speech spectrum feature encoder is a neural network model adopting a self-attention mechanism; the depression state evaluator is a neural network model employing a cross-attention mechanism.
The depression state evaluation device based on the voice signal provided by the embodiment of the invention can execute the depression state evaluation method based on the voice signal provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention. Fig. 6 shows a block diagram of an exemplary terminal 12 suitable for use in implementing embodiments of the invention. The terminal 12 shown in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.
As shown in fig. 6, the terminal 12 is in the form of a general purpose computing device. The components of the terminal 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Terminal 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by terminal 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The terminal 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The terminal 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the terminal 12, and/or any devices (e.g., network card, modem, etc.) that enable the terminal 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the terminal 12 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, via the network adapter 20. As shown, the network adapter 20 communicates with other modules of the terminal 12 via the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with terminal 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, to implement a depression state evaluation method based on a voice signal provided by an embodiment of the present invention.
The embodiment of the present invention also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the depression state assessment method based on a speech signal according to any one of the above embodiments. The method comprises the following steps:
s1, acquiring initial voice data of different testees marked with depression types;
s2, preprocessing the initial voice data to obtain preprocessed voice fragments;
s3, extracting acoustic features and speech spectrum features from the voice fragments;
s4, constructing and training an acoustic feature coding model based on the acoustic features to obtain a trained acoustic feature coder;
s5, constructing and training a language spectrum feature coding model based on the language spectrum features to obtain a trained language spectrum feature coder;
s6, fusing the trained acoustic feature encoder and the trained speech spectrum feature encoder based on a cross-attention mechanism to construct a depression state assessment model;
and S7, training the depression state evaluation model according to the acoustic features and the language spectrum features to obtain a trained depression state evaluator, wherein the depression state evaluator is used for outputting a depression state evaluation result of a tested person.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. A method for assessing depression state based on speech signals, comprising:
s1, acquiring initial voice data of different testees marked with depression types;
s2, preprocessing the initial voice data to obtain preprocessed voice fragments;
s3, extracting acoustic features and speech spectrum features from the voice fragments;
s4, constructing and training an acoustic feature coding model based on the acoustic features to obtain a trained acoustic feature coder;
s5, constructing and training a language spectrum feature coding model based on the language spectrum features to obtain a trained language spectrum feature coder;
s6, based on a cross-attention mechanism, fusing feature vectors obtained by the trained acoustic feature encoder and the trained speech spectrum feature encoder, and constructing a depression state evaluation model;
and S7, training the depression state evaluation model according to the acoustic features and the language spectrum features to obtain a trained depression state evaluator, wherein the depression state evaluator is used for outputting a depression state evaluation result of a tested person.
2. The method according to claim 1, wherein S3 comprises:
windowing and framing the voice fragments, extracting low-order voice features based on each frame of voice fragments and statistics features based on the low-order voice features to obtain acoustic features;
and converting the voice fragment into a spectrogram, and extracting the spectrogram characteristics from the spectrogram.
3. The method according to claim 1, wherein S4 comprises:
and respectively pre-training the acoustic feature coding model by taking the acoustic features and the depression categories as training data and training labels, and generating a trained acoustic feature coder according to the optimal parameters obtained by training.
4. The method according to claim 1, wherein S5 comprises:
and respectively taking the language spectrum characteristics and the depression categories as training data and training labels to pretrain the language spectrum characteristic coding model, and generating a trained language spectrum characteristic coder according to the optimal parameters obtained by training.
5. The method according to claim 1, wherein S6 comprises:
and training the depression state evaluation model by taking the acoustic features and the language spectrum features as training data and the depression category as a training label, and generating a trained depression state evaluator according to the optimal parameters obtained by training.
6. The method of claim 1, wherein the speech spectral feature encoder is a neural network model employing a self-attention mechanism.
7. The method of claim 1, wherein the acoustic feature encoder is a neural network model employing a self-attention mechanism.
8. A depression state evaluation device based on a voice signal, comprising:
the data acquisition module is used for acquiring initial voice data of different testees marked with depression categories;
the preprocessing module is used for preprocessing the initial voice data to obtain preprocessed voice fragments;
the feature extraction module is used for extracting acoustic features and speech spectrum features from the speech fragments;
the acoustic feature encoder training module is used for constructing and training an acoustic feature encoding model based on the acoustic features so as to obtain a trained acoustic feature encoder;
the language spectrum feature encoder training module is used for constructing and training a language spectrum feature encoding model based on the language spectrum features to obtain a trained language spectrum feature encoder;
the depression state evaluator construction module is used for constructing a depression state evaluation model based on the feature vectors obtained by the trained acoustic feature encoder and the trained language spectrum feature encoder which are fused by a cross-attention mechanism;
the depression state evaluator training module is used for training the depression state evaluation model according to the acoustic features and the language spectrum features to obtain a trained depression state evaluator, and the depression state evaluator is used for outputting a depression state evaluation result of a tested person.
9. A terminal, the terminal comprising:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech signal based depression state assessment method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a depression state assessment method based on a speech signal according to any one of claims 1-7.
CN202311226304.5A 2023-09-22 2023-09-22 Depression state evaluation method, device, terminal and medium based on voice signal Withdrawn CN116978409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311226304.5A CN116978409A (en) 2023-09-22 2023-09-22 Depression state evaluation method, device, terminal and medium based on voice signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311226304.5A CN116978409A (en) 2023-09-22 2023-09-22 Depression state evaluation method, device, terminal and medium based on voice signal

Publications (1)

Publication Number Publication Date
CN116978409A true CN116978409A (en) 2023-10-31

Family

ID=88475275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311226304.5A Withdrawn CN116978409A (en) 2023-09-22 2023-09-22 Depression state evaluation method, device, terminal and medium based on voice signal

Country Status (1)

Country Link
CN (1) CN116978409A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200075040A1 (en) * 2018-08-31 2020-03-05 The Regents Of The University Of Michigan Automatic speech-based longitudinal emotion and mood recognition for mental health treatment
CN110728997A (en) * 2019-11-29 2020-01-24 中国科学院深圳先进技术研究院 Multi-modal depression detection method and system based on context awareness
CN111951824A (en) * 2020-08-14 2020-11-17 苏州国岭技研智能科技有限公司 Detection method for distinguishing depression based on sound
KR102365433B1 (en) * 2020-10-23 2022-02-21 서울대학교산학협력단 Method and apparatus for emotion recognition based on cross attention model
CN112349297A (en) * 2020-11-10 2021-02-09 西安工程大学 Depression detection method based on microphone array
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN113065344A (en) * 2021-03-24 2021-07-02 大连理工大学 Cross-corpus emotion recognition method based on transfer learning and attention mechanism
CN113111151A (en) * 2021-04-16 2021-07-13 北京爱抑暖舟科技有限责任公司 Cross-modal depression detection method based on intelligent voice question answering
CN115346561A (en) * 2022-08-15 2022-11-15 南京脑科医院 Method and system for estimating and predicting depression mood based on voice characteristics
CN115064246A (en) * 2022-08-18 2022-09-16 山东第一医科大学附属省立医院(山东省立医院) Depression evaluation system and equipment based on multi-mode information fusion
CN116030271A (en) * 2023-02-22 2023-04-28 云南大学 Depression emotion prediction system based on deep learning and bimodal data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117992597A (en) * 2024-04-03 2024-05-07 江苏微皓智能科技有限公司 Information feedback method, device, computer equipment and computer storage medium
CN117992597B (en) * 2024-04-03 2024-06-07 江苏微皓智能科技有限公司 Information feedback method, device, computer equipment and computer storage medium

Similar Documents

Publication Publication Date Title
Wu et al. Automatic depression recognition by intelligent speech signal processing: A systematic survey
Patel et al. Impact of autoencoder based compact representation on emotion detection from audio
Asgari et al. Inferring clinical depression from speech and spoken utterances
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
Abdusalomov et al. Improved feature parameter extraction from speech signals using machine learning algorithm
Xia et al. Audiovisual speech recognition: A review and forecast
Hashem et al. Speech emotion recognition approaches: A systematic review
CN116978409A (en) Depression state evaluation method, device, terminal and medium based on voice signal
Dhelim et al. Artificial intelligence for suicide assessment using Audiovisual Cues: a review
Zhao et al. Research on depression detection algorithm combine acoustic rhythm with sparse face recognition
Tian et al. Deep learning for depression recognition from speech
Deepa et al. Speech technology in healthcare
Han et al. [Retracted] The Modular Design of an English Pronunciation Level Evaluation System Based on Machine Learning
Radha et al. Towards modeling raw speech in gender identification of children using sincNet over ERB scale
Sharma et al. Comparative analysis of various feature extraction techniques for classification of speech disfluencies
Milewski et al. Comparison of the Ability of Neural Network Model and Humans to Detect a Cloned Voice
Yue English spoken stress recognition based on natural language processing and endpoint detection algorithm
CN117150320A (en) Dialog digital human emotion style similarity evaluation method and system
Vlaj et al. Acoustic gender and age classification as an aid to human–computer interaction in a smart home environment
CN116682463A (en) Multi-mode emotion recognition method and system
Weed et al. Different in different ways: A network-analysis approach to voice and prosody in Autism Spectrum Disorder
Chen et al. An electroglottograph auxiliary neural network for target speaker extraction
Gonzalez-Lopez et al. Non-parallel articulatory-to-acoustic conversion using multiview-based time warping
CN116013371A (en) Neurodegenerative disease monitoring method, system, device and storage medium
Cheng et al. Improving english phoneme pronunciation with automatic speech recognition using voice chatbot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20231031