CN116616770A - Multimode depression screening and evaluating method and system based on voice semantic analysis - Google Patents

Multimode depression screening and evaluating method and system based on voice semantic analysis

Info

Publication number
CN116616770A
Authority
CN
China
Prior art keywords
depression
dialogue
emotion
recognition
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310412560.7A
Other languages
Chinese (zh)
Inventor
郭景桓
翁鼎钧
杨竣宇
罗珮芬
陈俊玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Zhugeliang Technology Co ltd
Original Assignee
Xiamen Zhugeliang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Zhugeliang Technology Co ltd filed Critical Xiamen Zhugeliang Technology Co ltd
Priority to CN202310412560.7A priority Critical patent/CN116616770A/en
Publication of CN116616770A publication Critical patent/CN116616770A/en
Pending legal-status Critical Current

Classifications

    • A61B 5/16 Devices for psychotechnics; testing reaction times; devices for evaluating the psychological state
    • A61B 5/165 Evaluating the state of mind, e.g. depression, anxiety
    • A61B 5/4803 Speech analysis specially adapted for diagnostic purposes
    • A61B 5/7235 Details of waveform analysis
    • A61B 5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B 5/7267 Classification of physiological signals or data involving training the classification device
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G10L 15/02 Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/18 Speech or voice analysis in which the extracted parameters are spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis in which the extracted parameters are the cepstrum
    • G10L 25/63 Speech or voice analysis specially adapted for estimating an emotional state
    • G10L 25/66 Speech or voice analysis specially adapted for extracting parameters related to health condition
    • Y02A 90/10 Information and communication technologies (ICT) supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Public Health (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Pathology (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Veterinary Medicine (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • Theoretical Computer Science (AREA)
  • Hospice & Palliative Care (AREA)
  • General Physics & Mathematics (AREA)
  • Physiology (AREA)
  • Mathematical Physics (AREA)
  • Developmental Disabilities (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Educational Technology (AREA)
  • Social Psychology (AREA)
  • Psychology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Epidemiology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multi-modal depression screening and evaluation method and system based on voice semantic analysis. The method comprises the following steps: responding to dialogue input and dialogue output between a user and a dialogue interface; collecting, managing and analyzing the dialogue information between the user and the dialogue interface with a dialogue management module; performing speech recognition and semantic recognition on the collected dialogue information with pre-trained single-mode emotion recognition models and extracting features; and performing multi-modal fusion on the extracted features to obtain evaluation indexes, so that the degree of depression can be evaluated comprehensively and objectively. The invention addresses the strong subjectivity, easy disguise, and numerous, complex test questions of conventional depression screening tests. In addition, the method is low-cost and easy to popularize, can screen large numbers of subjects for depression efficiently and rapidly, and can serve as an effective aid for physicians in diagnosing depression.

Description

Multimode depression screening and evaluating method and system based on voice semantic analysis
Technical Field
The invention belongs to the technical field of depression screening and evaluation, and particularly relates to a multi-modal depression screening and evaluation method and system based on voice semantic analysis.
Background
Depression is a mental disorder whose core symptoms are depressed mood and diminished interest, and because of its high prevalence, high recurrence rate, high disability rate and high mortality, it is among the most harmful of mental disorders. The key to mitigating the harm of depression is early diagnosis and early treatment, but effective recognition techniques based on objective indicators are currently lacking. Clinical observation and research show that the speech of depressed patients is slow, monotonous, low-pitched and frequently paused, distinguishing it from that of healthy people, so depression recognition technology based on speech signals has become a new research hotspot owing to its low cost, ease of acquisition and contactless nature.
In this context, with the widespread use of artificial intelligence technology, researchers have attempted to develop artificial intelligence detection methods for depression to assist medical staff. Using artificial intelligence to help doctors screen and identify patients reduces their workload and has important practical significance. In particular, in hospitals with limited psychiatric manpower, AI-assisted diagnosis of depression can improve the recognition rate and allow depressed patients to receive intervention and treatment as early as possible. At present, many scholars have carried out depression detection research based on voice, video and other signals, but the accuracy of depression detection in real-world environments still needs to be improved.
Existing depression detection methods based on artificial intelligence mostly use facial expressions and speech. Facial action units (AUs) and landmarks have proven to be effective features for facial-expression-based feature extraction, but such low-dimensional manual features still cannot represent the whole face, resulting in a large loss of information. For audio-based feature extraction, although many acoustic features can be computed, there is no automatically extractable feature set with strong generalization ability that eliminates differences caused by different features or by different implementations of the same feature. For speech emotion recognition, research at MIT (Massachusetts Institute of Technology) can recognize 12 emotions with about 70% accuracy; for speech-based recognition of mild cognitive impairment (MCI), leading techniques at home and abroad reach only about 90% accuracy on small samples and require about 20 minutes.
In addition, there are methods based on questionnaires and social media, and detection methods based on eye trackers or brain imaging devices. For example, Kohrt et al. explored the effect of questionnaires based on the PHQ-9 depression diagnostic criteria on detecting depression; Islam et al. extracted dictionary features from text published by users on social media and used decision tree models for depression detection; Ay et al. proposed using long short-term memory networks (LSTM) and convolutional neural networks (CNN) to process brain wave data for depression detection. However, questionnaire-based depression detection often suffers from limited feedback information and results that are not sufficiently objective and accurate; social-media-based detection requires that users publish enough content and behave actively on the platform, and cannot handle new users or users with sparse behavior; methods based on eye trackers and brain waves involve expensive equipment, which raises the cost of detection. Moreover, these methods all rely on a single modality, and their depression detection accuracy is not satisfactory.
Human psychology is complex and physiological responses vary across individuals, so physiological signals from a single modality are not comprehensive and accurate enough, leading to poor classification and prediction performance in different situations. A better approach is to integrate multiple physiological signals and construct a multi-modal physiological feature fusion strategy, using the complementarity among different types of signals to ensure better reliability and accuracy.
In view of the above, it is highly significant to provide a multi-modal depression screening and evaluation method and system based on voice semantic analysis.
Disclosure of Invention
In order to solve problems such as the single modality used in existing depression detection and the limited accuracy of such detection, the invention provides a multi-modal depression screening and evaluation method and system based on voice semantic analysis to address these technical defects.
In a first aspect, the invention provides a method for screening and evaluating multimodal depression based on speech semantic analysis, which comprises the following steps:
responding to dialogue input and dialogue output between a user and a dialogue interface;
collecting, managing and analyzing dialogue information between a user and a dialogue interface by using a dialogue management module;
further utilizing a pre-trained single-mode emotion recognition model to perform voice recognition and semantic recognition on the collected dialogue information and extracting features; and
and carrying out multi-mode fusion on the extracted features to obtain evaluation indexes so as to comprehensively and objectively evaluate the depression degree.
Preferably, the dialogue interface constructs a virtual character in Reallusion using CrazyTalk, Character Creator and iClone, the constructed virtual character including facial features, lip shapes for different sounds, skin textures and character animations, and applies emotional text-to-speech synthesis technology using Microsoft Azure to provide human-like speech with emotional expression.
Further preferably, the dialogue management module comprises a depression evaluation script and a voice dialogue system, wherein the depression evaluation script is based on scales commonly used for clinical depression assessment and comprises four dimensions, namely a depressive symptom dimension, a manic symptom dimension, an anxiety symptom dimension and a family parenting style dimension, with eight factors in total: a depression factor, a somatic factor, an excitation factor, an emotional instability factor, an emotional elevation factor, an anxiety factor, a family relationship care factor and a family autonomy factor; the voice dialogue system is implemented on the RASA framework and comprises a single-mode emotion recognition model and a dialogue management strategy.
Further preferably, the single-mode emotion recognition model includes speech recognition and semantic recognition; the semantic recognition utilizes the emotion knowledge enhanced pre-training model SKEP to predict the probabilities of positive and negative emotions with sentence-level emotion classification, and predictions whose probability falls below a threshold are classified as neutral emotion;
the speech recognition is obtained by training different classifiers with five-fold cross-validation; the various emotion labels of the different data sets are uniformly mapped to positive, negative and neutral emotions through a valence-arousal model, and the emotion labels are divided into the three segments [-3, -1], [-1, 1] and [1, 3], corresponding to negative, neutral and positive, respectively.
It is further preferred that the dialog management strategy uses a 3-pass algorithm based on emotion perception to decide whether to ask further information or to continue with the next question, including in particular:
in a first pass, determining a "yes" or "no" intent based on the question, identifying a user intent by an intent classifier built by the Rasa framework;
then, the emotion consistency detection block confirms whether the identified intention is consistent with a single-mode emotion identification result from the text and the audio;
if a "no" intent is detected, the dialogue system will proceed with the other questions in the series;
if the probability of intent recognition does not exceed the threshold or the recognized intent is inconsistent with emotion recognition, the process proceeds to a second pass;
in the second pass, semantic emotion recognition is independent of the emotion consistency detection block, and once the probability of semantic emotion recognition does not exceed a threshold value or emotion detected in the text is different from other single-mode emotion recognition, a third pass is performed;
in the third pass, which is also the last pass, the next question is determined by majority voting based on emotion recognition results that do not include text.
Further preferably, the feature extraction includes semantic feature extraction and speech feature extraction; the semantic feature extraction adopts the pre-trained language model BERT, which is pre-trained on two tasks, masked language modeling and next sentence prediction; the tokenized text is input into the pre-trained BERT model, the output of the last layer is selected, and a vector of length 768 is extracted as the text feature vector;
the speech feature extraction includes extracting five spectral features, namely the Mel spectrogram, Mel-frequency cepstral coefficients MFCC, spectral contrast, chromagram and tonal centroid features Tonnetz, from the audio file using the Librosa software package, and concatenating these spectral features to form an audio feature vector of length 193.
Further preferably, the multi-modal fusion comprises extracting two modal characteristics of text and audio when answering each question, and comprehensively and objectively evaluating the depression degree by adopting a method of fusion of a decision layer and a feature layer;
the decision-layer fusion performs model training by constructing several unimodal classifiers for text and audio using different machine learning algorithms, selects the best-performing unimodal classifiers, and uses these classifiers to determine the final depression level prediction by majority voting;
the feature-layer fusion uses a deep neural network to integrate all information from the text and audio modality features; the two given modality features are taken as input, and a deep neural network followed by a softmax layer generates the probabilities of the different depression levels, with all feature vectors of the two modalities directly concatenated as input and fed to a deep neural network comprising two hidden layers.
Further preferably, the evaluation index comprises assessing the multimodal depression level, which includes the five levels of healthy, mild, moderate or major depression, and bipolar disorder, using weighted averages of precision, recall and F1 score. The weighted average of each metric is calculated as:
weighted metric = Σ_{i=1}^{N} weight_i × metric_i
where N is the number of categories and weight_i is the ratio of the number of samples of category i to the total number of samples:
weight_i = n_i / Σ_{j=1}^{N} n_j
The per-category precision_i, recall_i and F1 score_i are calculated as:
precision_i = TP_i / (TP_i + FP_i), recall_i = TP_i / (TP_i + FN_i), F1_i = 2 × precision_i × recall_i / (precision_i + recall_i)
where TP_i, FP_i and FN_i are the numbers of true positives, false positives and false negatives for category i.
in a second aspect, an embodiment of the present invention further provides a system for screening and evaluating multimodal depression based on speech semantic analysis, where the system includes:
the dialogue interface module is used for carrying out dialogue input and dialogue output with a user;
the dialogue management module is used for collecting, managing and analyzing dialogue information between the user and the dialogue interface;
the single-mode emotion recognition model module is used for carrying out voice recognition and semantic recognition on the collected dialogue information and extracting features;
the multi-mode fusion module is used for carrying out multi-mode fusion on the extracted characteristics so as to comprehensively and objectively evaluate the depression degree;
and the evaluation index module is used for evaluating the multi-modal depression level by adopting a weighted average value of the precision, the recall and the F1 score.
In a third aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
(1) The user is actively guided through a voice dialogue, and emotion perception is used to adapt the dialogue content. During the dialogue, features are extracted from the text and the audio for multi-modal depression level assessment, the two modalities are integrated with a feature-level fusion framework, and a deep neural network classifies the different levels of depression, namely healthy, mild, moderate or major depression, and bipolar disorder. The invention addresses the strong subjectivity, easy disguise, and numerous, complex test questions of conventional depression screening tests. In addition, the method is low-cost and easy to popularize, can screen large numbers of subjects for depression efficiently and rapidly, and can serve as an effective aid for physicians in diagnosing depression.
(2) The invention develops a dialogue script based on depression screening scales, integrates voice semantic analysis with artificial intelligence multi-modal fusion technology, and provides an artificial intelligence solution for multi-modal depression screening through AI-created software that can conduct psychological counseling dialogues on a mobile terminal, breaking the limitations of traditional psychological counseling on place and time and giving a judgment of depressive tendency through multi-modal recognition.
(3) With the multi-modal depression screening and evaluation system based on voice semantic analysis of the invention, 168 cases of clinical data were collected at the Shanghai Mental Health Center, and the accuracy of the depression diagnosis/evaluation model reaches 90.26%, effectively identifying mild, moderate and severe depression, bipolar disorder and healthy individuals.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and, together with the description, serve to explain the principles of the invention. Other embodiments and many of their intended advantages will be readily appreciated as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
FIG. 1 is a diagram of an exemplary system architecture to which an embodiment of the present invention may be applied;
FIG. 2 is a flow chart of a multimodal depression screening and evaluating method based on speech semantic analysis according to an embodiment of the present invention;
fig. 3 is a schematic overall flow chart of a multi-modal depression screening and evaluating method based on voice semantic analysis according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a dialogue script design based on depression assessment in a multimodal depression screening and evaluation method based on speech semantic analysis according to an embodiment of the invention;
FIG. 5 is a schematic diagram of feature extraction of speech semantics in a multimodal depression screening and evaluation method based on speech semantic analysis according to an embodiment of the invention;
FIG. 6 is a schematic flow chart of a 3-pass algorithm in a multi-modal depression screening and evaluating method based on speech semantic analysis according to an embodiment of the present invention;
FIGS. 7 (a) and 7 (b) are schematic architectural diagrams of decision-level and feature-level fusion frameworks for multimodal depression assessment in a multimodal depression screening and assessment method based on speech semantic analysis according to an embodiment of the present invention, respectively;
FIG. 8 is a schematic diagram of the architecture of a multimodal depression screening evaluator based on phonetic semantic analysis according to an embodiment of the present invention;
fig. 9 is a schematic structural view of a computer device suitable for use in an electronic apparatus for implementing an embodiment of the present invention.
Detailed Description
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. For this, directional terms, such as "top", "bottom", "left", "right", "upper", "lower", and the like, are used with reference to the orientation of the described figures. Because components of embodiments can be positioned in a number of different orientations, the directional terminology is used for purposes of illustration and is in no way limiting. It is to be understood that other embodiments may be utilized or logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 1 illustrates an exemplary system architecture 100 for a method of processing information or an apparatus for processing information to which embodiments of the present invention may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices with communication capabilities including, but not limited to, smartphones, tablet computers, laptop and desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background information processing server that processes verification request information transmitted by the terminal devices 101, 102, 103. The background information processing server may analyze the received verification request information and obtain a processing result (for example, verification success information for characterizing that the verification request is a legal request).
It should be noted that, the method for processing information provided by the embodiment of the present invention is generally performed by the server 105, and accordingly, the device for processing information is generally disposed in the server 105. In addition, the method for transmitting information provided by the embodiment of the present invention is generally performed by the terminal devices 101, 102, 103, and accordingly, the means for transmitting information is generally provided in the terminal devices 101, 102, 103.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (for example, to provide a distributed service), or may be implemented as a single software or a plurality of software modules, which are not specifically limited herein.
The invention develops a dialogue script based on depression screening scales, integrates voice semantic analysis with artificial intelligence multi-modal fusion technology, and provides an artificial intelligence solution for multi-modal depression screening through AI-created software that can conduct psychological counseling dialogues on a mobile terminal, breaking the limitations of traditional psychological counseling on place and time and giving a judgment of depressive tendency through multi-modal recognition.
The embodiment of the invention discloses a multi-modal depression screening and evaluation method based on voice semantic analysis, which, as shown in fig. 2 and 3, comprises the following steps:
s1, responding to dialogue input and dialogue output between a user and a dialogue interface;
specifically, in this embodiment, the dialog interface builds virtual characters at reallosion 4 using CrazyTalk, character Creator, and iClone, including facial features, lip shapes of different sounds, skin texture, and character animation. Emotion text speech synthesis techniques were applied using Microsoft Azure5 to provide human-like speech with emotion expressions.
S2, collecting, managing and analyzing dialogue information between the user and a dialogue interface by using a dialogue management module;
in particular, dialog management includes depression assessment scripts and a voice dialog system.
Referring to fig. 4, the depression evaluation script is based on scales commonly used for clinical depression assessment, and the dialogue script questions are designed around four dimensions, namely a depressive symptom dimension, a manic symptom dimension, an anxiety symptom dimension and a family parenting style dimension. Each dimension in turn comprises several factors (a simple representation of this structure is sketched after the list):
a. Depressive symptom dimension: depression factor, somatic factor;
b. Manic symptom dimension: excitation factor, emotional instability factor, emotional elevation factor;
c. Anxiety symptom dimension: anxiety factor;
d. Family parenting style dimension: family relationship care factor, family autonomy factor.
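By way of illustration only, the dimension/factor structure of the dialogue script above can be represented as a simple mapping. The English factor names are translations of the scale items listed above and are an assumption for this sketch, not an official schema of the system.

```python
# Illustrative sketch: the four assessment dimensions and their factors from the
# dialogue script, represented as a plain mapping (hypothetical names, not an
# official data format of the patent's system).
ASSESSMENT_DIMENSIONS = {
    "depressive_symptoms": ["depression", "somatic"],
    "manic_symptoms": ["excitation", "emotional_instability", "emotional_elevation"],
    "anxiety_symptoms": ["anxiety"],
    "family_parenting_style": ["family_relationship_care", "family_autonomy"],
}
```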
The voice dialogue system is implemented using RASA, a framework dedicated to building conversational assistants and chatbots. The voice dialogue system includes a single-mode emotion recognition model and a dialogue management policy.
S3, further carrying out voice recognition and semantic recognition on the collected dialogue information by utilizing a pre-trained single-mode emotion recognition model, and carrying out feature extraction; and
specifically, the single-mode emotion recognition model includes speech recognition and semantic recognition.
The semantic recognition uses an emotion knowledge enhancement pre-training (SKEP) model, which is a pre-training algorithm utilizing emotion knowledge enhancement. The model can spontaneously learn and understand emotion semantics from the mining of emotion knowledge through an unsupervised learning method. The probability of positive and negative emotions is predicted using sentence-level emotion classification, and the prediction probability below a threshold is classified as neutral.
For speech recognition, details of the input features are described with reference to fig. 5. In this example we trained different classifiers and verified their accuracy on common datasets, including CMU-MOSEI, IEMOCAP, RAVDESS, EmoDB, MAHNOB-HCI and SEED-IV. All models are trained with five-fold cross-validation, and the various emotion labels of the different datasets are uniformly mapped to positive, negative and neutral emotions through a valence-arousal model. In particular, the emotion label of CMU-MOSEI is a value from -3 to 3; we divide the labels into the three segments [-3, -1], [-1, 1] and [1, 3], corresponding to negative, neutral and positive, respectively.
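As an illustration, the following is a minimal sketch (not the patent's actual code) of how continuous valence labels such as CMU-MOSEI's scores in [-3, 3] could be mapped to the three emotion classes and how a classifier could be scored with five-fold cross-validation; the feature matrix, label values and choice of SVM are placeholders.

```python
# Minimal sketch: map valence scores to three emotion classes and evaluate a
# classifier with five-fold cross-validation. Data are random placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def valence_to_class(v: float) -> str:
    """Map a valence score in [-3, 3] to negative / neutral / positive."""
    if v < -1:
        return "negative"
    if v <= 1:
        return "neutral"
    return "positive"

rng = np.random.default_rng(0)
X = rng.random((120, 193))            # placeholder acoustic feature matrix
valence = rng.uniform(-3, 3, 120)     # placeholder CMU-MOSEI-style labels
y = np.array([valence_to_class(v) for v in valence])

scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)  # five-fold CV
print("mean accuracy:", scores.mean())
```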
In this embodiment, referring to FIG. 6, the dialogue management strategy uses an emotion-aware 3-pass algorithm to decide whether to ask for further information or continue with the next question. In the first pass, a "yes" or "no" intent is determined for the question, for example: "Have you been emotionally unstable recently? Do you sometimes lose your temper for no reason?" An intent classifier built with the Rasa framework identifies the user intent.
Next, the emotion consistency detection block confirms whether the recognized intent matches the single-mode emotion recognition results from text and audio. In the example above, a "yes" intent is associated with a negative emotion, so a follow-up question will be posed, such as: "When your emotions are unstable, are you sometimes unhappy and sometimes suddenly agitated and quick to lose your temper?" A "no" intent is associated with positive emotion. If a "no" intent is detected, the dialogue system proceeds with the other questions in the series. If the probability of intent recognition does not exceed the threshold, or the recognized intent is inconsistent with the emotion recognition result, the process proceeds to the second pass.
In the second pass, semantic emotion recognition is independent of emotion consistency detection block. Once the probability of semantic emotion recognition does not exceed a threshold or the emotion detected in the text is different from other single-mode emotion recognition, a third pass is performed. In the third pass, which is also the last pass, the next question is determined by majority voting based on emotion recognition results, excluding text.
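For illustration, the 3-pass decision logic described above could be sketched as follows. This is a simplified sketch under stated assumptions, not the patent's implementation: the intent, its probability, the text emotion and the other single-mode emotion results are hypothetical inputs from the Rasa intent classifier and the emotion recognizers, and THRESHOLD is an assumed confidence cut-off.

```python
# Simplified sketch of the emotion-aware 3-pass dialogue decision logic.
THRESHOLD = 0.7  # assumed confidence cut-off

def consistent(intent: str, emotions: list) -> bool:
    # A "yes" answer to a symptom question is expected to come with negative
    # emotion, a "no" answer with non-negative emotion.
    if intent == "yes":
        return all(e == "negative" for e in emotions)
    return all(e != "negative" for e in emotions)

def decide(intent, intent_prob, text_emotion, text_prob, other_emotions):
    # Pass 1: intent classification plus emotion-consistency check.
    if intent_prob >= THRESHOLD and consistent(intent, [text_emotion] + other_emotions):
        return "next_question" if intent == "no" else "follow_up_question"
    # Pass 2: semantic (text) emotion recognition, independent of the consistency block.
    if text_prob >= THRESHOLD and all(e == text_emotion for e in other_emotions):
        return "follow_up_question" if text_emotion == "negative" else "next_question"
    # Pass 3: majority vote over the emotion results that do not include text.
    negative_votes = sum(e == "negative" for e in other_emotions)
    return "follow_up_question" if negative_votes * 2 > len(other_emotions) else "next_question"

print(decide("yes", 0.9, "negative", 0.8, ["negative"]))  # -> follow_up_question
```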
Further, the multimodal depression assessment includes feature extraction, multimodal fusion, and assessment indicators.
Specifically, the feature extraction includes semantic feature extraction and speech feature extraction. The semantic feature extraction uses the Bidirectional Encoder Representations from Transformers (BERT) model, a powerful general-purpose pre-trained language model that achieves state-of-the-art results on many natural language processing tasks. We imported from Hugging Face a pre-trained BERT model that has been trained on a large corpus with two tasks: masked language modeling and next sentence prediction. The BERT-based model includes a 12-layer Transformer encoder with 12 bidirectional self-attention heads and contains 110M parameters. We input the tokenized text into the pre-trained BERT model and select the output of the last layer, a vector of length 768, as the text feature representation.
768 is the output dimension set in the BERT model (e.g., bert-base-chinese: the encoder has 12 hidden layers, outputs 768-dimensional tensors, has 12 self-attention heads and 110M parameters in total, and is trained on simplified and traditional Chinese text).
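A minimal sketch of extracting such a length-768 text feature vector with the Hugging Face transformers library is shown below. The use of bert-base-chinese and of mean pooling over the last hidden states is an assumption for illustration, not necessarily the exact setup described above.

```python
# Minimal sketch: length-768 text feature from the last layer of pre-trained BERT.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def text_feature(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: (1, seq_len, 768); average over tokens -> length-768 vector
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

print(text_feature("最近我总是提不起精神").shape)  # torch.Size([768])
```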
For speech feature extraction, five spectral features, namely the Mel spectrogram, Mel-frequency cepstral coefficients (MFCC), spectral contrast, chromagram and tonal centroid features (Tonnetz), are extracted from the audio file using the Librosa software package. The Mel spectrogram is the original spectrogram converted to the Mel scale and represents the characteristics of the audio signal; the Mel-frequency cepstral coefficients reflect the human ear's perception of different frequencies and are widely used in speaker recognition and speech recognition; the octave-based spectral contrast feature indicates the relative spectral distribution, estimating the difference between spectral peaks and valleys in each sub-band; the chromagram features are computed from the audio by projecting the full spectrum onto 12 bins representing the 12 semitones of the musical octave; Tonnetz maps the 12-interval chroma vectors onto a 6-dimensional basis that is able to detect harmonic changes. Finally, all features are concatenated into an audio feature vector of length 193.
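A minimal sketch of such an extraction with Librosa is given below, under the common 40 + 12 + 128 + 7 + 6 = 193 dimensioning convention; the file path, the number of MFCCs and the time-averaging of each feature are assumptions for illustration rather than the patent's exact configuration.

```python
# Minimal sketch: length-193 audio feature vector from the five spectral features.
import numpy as np
import librosa

def audio_feature(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)        # 40
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)          # 12
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)          # 128
    contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)  # 7
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y),
                                              sr=sr), axis=1)                  # 6
    return np.concatenate([mfcc, chroma, mel, contrast, tonnetz])  # length 193

vec = audio_feature("answer_01.wav")  # hypothetical recording of one answer
print(vec.shape)  # (193,)
```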
S4, carrying out multi-mode fusion on the extracted features to obtain evaluation indexes so as to comprehensively and objectively evaluate the depression degree.
In particular, referring to fig. 7 (a) and 7 (b), multi-modal fusion refers to extracting the two modality features, text and audio, for each answered question during the user's dialogue with the system. On this basis, a method combining decision-layer and feature-layer fusion is adopted to achieve a comprehensive, objective evaluation of the degree of depression. The models were implemented using the Python machine learning libraries scikit-learn, TensorFlow and Keras.
For decision-level fusion, several unimodal classifiers for text and audio are first constructed using different machine learning algorithms, including k-nearest neighbors (KNN), support vector machine (SVM), decision tree (DT), random forest (RF), multi-layer perceptron (MLP), adaptive boosting (AdaBoost) and gradient boosting (GB). After model training, the best-performing unimodal classifiers are selected and used to determine the final depression-level prediction by majority voting.
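The decision-level fusion idea could be sketched as follows. This is a simplified illustration with random placeholder data and an arbitrary subset of classifiers; the selection of the best-performing classifiers after training, described above, is omitted.

```python
# Simplified sketch: unimodal classifiers on text and audio features combined
# by majority voting (placeholder data, not the patent's actual pipeline).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def majority_vote(predictions: np.ndarray) -> np.ndarray:
    # predictions: (n_classifiers, n_samples); most frequent label per sample
    return np.array([np.bincount(col).argmax() for col in predictions.T])

rng = np.random.default_rng(0)
X_text, X_audio = rng.random((100, 768)), rng.random((100, 193))
y = rng.integers(0, 5, 100)   # 5 levels: healthy .. bipolar disorder

text_clf = SVC().fit(X_text, y)
audio_knn = KNeighborsClassifier().fit(X_audio, y)
audio_rf = RandomForestClassifier().fit(X_audio, y)

preds = np.stack([text_clf.predict(X_text),
                  audio_knn.predict(X_audio),
                  audio_rf.predict(X_audio)])
final_prediction = majority_vote(preds)
```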
For feature-level fusion, a deep neural network is used to integrate all the information from the two modality features. With the two given modality features as input, a deep neural network followed by a softmax layer generates the probabilities of the different depression levels. All feature vectors of the two modalities are directly concatenated as input and fed to a deep neural network comprising two hidden layers. The first hidden layer contains 512 neurons, applies L2 regularization, and is followed by a dropout layer with probability 0.2. The second hidden layer has 256 neurons and is also followed by a dropout layer with probability 0.2. For the deep neural network with feature-level fusion, we train for 100 epochs and select Adam with a learning rate of 0.001 as the optimizer.
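A minimal Keras sketch of this feature-level fusion network is given below: concatenated text (768) and audio (193) features feed two hidden layers of 512 and 256 units with dropout 0.2, L2 regularization on the first layer, a five-way softmax output, and Adam with learning rate 0.001 for 100 epochs. The L2 weight, batch size and data arrays are assumptions for illustration.

```python
# Minimal sketch: feature-level fusion DNN over concatenated text+audio vectors.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(768 + 193,)),
    layers.Dense(512, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),  # assumed L2 weight
    layers.Dropout(0.2),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(5, activation="softmax"),  # 5 depression levels
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Placeholder data: concatenated feature vectors and level labels 0..4
X_fused = np.random.rand(100, 768 + 193).astype("float32")
y = np.random.randint(0, 5, 100)
model.fit(X_fused, y, epochs=100, batch_size=16, verbose=0)
```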
Further, regarding the evaluation index, multimodal depression level assessment is a multi-class classification task. For multi-class classification, the weighted averages of accuracy, precision, recall and F1 score are used to evaluate model performance and verify generalization ability across the five levels of healthy, mild, moderate or major depression, and bipolar disorder. The weighted average of each metric is calculated as:
weighted metric = Σ_{i=1}^{N} weight_i × metric_i
where N is the number of categories and weight_i is the ratio of the number of samples of category i to the total number of samples:
weight_i = n_i / Σ_{j=1}^{N} n_j
The per-category precision_i, recall_i and F1 score_i are calculated as:
precision_i = TP_i / (TP_i + FP_i), recall_i = TP_i / (TP_i + FN_i), F1_i = 2 × precision_i × recall_i / (precision_i + recall_i)
where TP_i, FP_i and FN_i are the numbers of true positives, false positives and false negatives for category i.
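As an illustration, the same support-weighted averaging can be obtained with scikit-learn, which weights each per-class metric by that class's share of the samples; the label arrays below are placeholders, not experimental results.

```python
# Minimal sketch: weighted precision, recall and F1 over the five levels.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 2, 3, 4, 4, 2]   # placeholder ground-truth depression levels
y_pred = [0, 1, 1, 2, 3, 4, 2, 2]   # placeholder model predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(precision, recall, f1)
```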
the invention actively guides users through voice dialogue, and utilizes emotion perception to change dialogue content, in the dialogue process, features are extracted from texts and audios for multi-modal depression level assessment, two modes are integrated by utilizing a feature level fusion frame, and depression of different levels, including healthy, mild, moderate or severe depression and bipolar affective disorder, are classified by using a deep neural network; the invention well solves the problems of strong subjectivity, strong disguisability, multiple and complex test questions and the like of the conventional depression screening test. In addition, the method has low cost and easy popularization, can identify the depression of the personnel to be tested in a large amount, efficiently and rapidly, and can be used as an effective auxiliary means for diagnosing the depression by doctors.
The multi-mode depression knowledge screening and evaluating system based on voice semantic analysis, disclosed by the invention, collects 168 cases of clinical data in Shanghai city mental health centers, and the accuracy of a depression diagnosis/evaluation model is as high as 90.26%, so that light, medium and severe depression, bidirectional affective disorder and healthy people can be effectively identified.
In a second aspect, the embodiment of the invention further discloses a multimodal depression screening and evaluating system based on voice semantic analysis, as shown in fig. 8, the system comprises: a dialogue interface module 81, a dialogue management module 82, a single-mode emotion recognition model module 83, a multi-mode fusion module 84 and an evaluation index module 85.
In one embodiment, the dialogue interface module 81 is configured to perform dialogue input and dialogue output with the user; the dialogue management module 82 is used to collect, manage and analyze the dialogue information between the user and the dialogue interface; the single-mode emotion recognition model module 83 is used to perform speech recognition and semantic recognition on the collected dialogue information and extract features; the multi-modal fusion module 84 is used to perform multi-modal fusion on the extracted features so as to comprehensively and objectively evaluate the degree of depression; and the evaluation index module 85 is used to assess the multimodal depression level using weighted averages of precision, recall and F1 score.
Referring now to fig. 9, there is illustrated a schematic diagram of a computer apparatus 900 suitable for use in an electronic device (e.g., a server or terminal device as illustrated in fig. 1) for implementing an embodiment of the present invention. The electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 9, the computer apparatus 900 includes a Central Processing Unit (CPU) 901 and a Graphics Processing Unit (GPU) 902, which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 903 or a program loaded from a storage section 909 into a Random Access Memory (RAM) 904. In the RAM 904, various programs and data required for the operation of the apparatus 900 are also stored. The CPU 901, GPU 902, ROM 903 and RAM 904 are connected to each other by a bus 905. An input/output (I/O) interface 906 is also connected to the bus 905.
The following components are connected to the I/O interface 906: an input section 907 including a keyboard, a mouse, and the like; an output section 908 including a display such as a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 909 including a hard disk or the like; and a communication section 910 including a network interface card such as a LAN card, a modem, or the like. The communication section 910 performs communication processing via a network such as the Internet. A drive 911 may also be connected to the I/O interface 906 as needed. A removable medium 912, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 911 as needed so that a computer program read therefrom is installed into the storage section 909 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 910, and/or installed from the removable medium 912. The above-described functions defined in the method of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 901 and a Graphics Processor (GPU) 902.
It should be noted that the computer readable medium according to the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: responding to dialogue input and dialogue output between a user and a dialogue interface; collecting, managing and analyzing dialogue information between a user and a dialogue interface by using a dialogue management module; further utilizing a pre-trained single-mode emotion recognition model to perform voice recognition and semantic recognition on the collected dialogue information and extracting features; and obtaining an evaluation index through multi-mode fusion so as to comprehensively and objectively evaluate the depression degree.
The above description is only illustrative of the preferred embodiments of the present invention and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the invention is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example technical solutions in which the above features are replaced with technical features having similar functions disclosed in (but not limited to) the present invention.

Claims (11)

1. A multi-mode depression screening and evaluating method based on voice semantic analysis is characterized by comprising the following steps:
responding to dialogue input and dialogue output between a user and a dialogue interface;
collecting, managing and analyzing dialogue information between a user and a dialogue interface by using a dialogue management module;
further utilizing a pre-trained single-mode emotion recognition model to perform voice recognition and semantic recognition on the collected dialogue information and extracting features; and
and carrying out multi-mode fusion on the extracted features to obtain evaluation indexes so as to comprehensively and objectively evaluate the depression degree.
2. The method for screening and evaluating multi-modal depression based on speech semantic analysis according to claim 1, wherein the dialogue interface constructs a virtual character in Reallusion using CrazyTalk, Character Creator and iClone, the constructed virtual character including facial features, lip shapes for different sounds, skin textures and character animation, and applies emotional text-to-speech synthesis technology using Microsoft Azure to provide human-like speech with emotional expression.
3. The method for screening and evaluating the multi-modal depression based on the voice semantic analysis according to claim 2, wherein the dialogue management module comprises a depression evaluation script and a voice dialogue system, the depression evaluation script being based on scales commonly used for clinical depression assessment and comprising four dimensions, namely a depressive symptom dimension, a manic symptom dimension, an anxiety symptom dimension and a family parenting style dimension, with eight factors in total, namely a depression factor, a somatic factor, an excitation factor, an emotional instability factor, an emotional elevation factor, an anxiety factor, a family relationship care factor and a family autonomy factor; the voice dialogue system is implemented on the RASA framework and comprises a single-mode emotion recognition model and a dialogue management strategy.
4. The method for screening and evaluating multi-modal depression based on speech semantic analysis according to claim 3, wherein the single-mode emotion recognition model comprises speech recognition and semantic recognition; the semantic recognition utilizes the emotion knowledge enhanced pre-training model SKEP, the probabilities of positive and negative emotions are predicted using sentence-level emotion classification, and predictions whose probability falls below a threshold are classified as neutral emotion;
the speech recognition is obtained by training different classifiers with five-fold cross-validation; the various emotion labels of the different data sets are uniformly mapped to positive, negative and neutral emotions through a valence-arousal model, and the emotion labels are divided into the three segments [-3, -1], [-1, 1] and [1, 3], corresponding to negative, neutral and positive, respectively.
5. The speech semantic analysis based multimodal depression screening and assessment method according to claim 4, wherein the dialogue management strategy uses a emotion perception based 3-pass algorithm to decide whether to ask further information or continue with the next question, specifically comprising:
in a first pass, determining a "yes" or "no" intent based on the question, identifying a user intent by an intent classifier built by the Rasa framework;
then, the emotion consistency detection block confirms whether the identified intention is consistent with a single-mode emotion identification result from the text and the audio;
if no intention is detected, the dialog system will proceed with other questions in the series;
if the probability of intent recognition does not exceed the threshold or the recognized intent is inconsistent with emotion recognition, the process proceeds to a second pass;
in the second pass, semantic emotion recognition is performed independently of the emotion consistency detection block; if the probability of the semantic emotion recognition does not exceed the threshold, or the emotion detected in the text differs from the other single-mode emotion recognition results, a third pass is performed;
in the third pass, which is also the last pass, the next question is determined by majority voting based on emotion recognition results that do not include text.
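As an illustrative aside (not part of the claims), the emotion-aware 3-pass decision of claim 5 might be sketched as follows; the thresholds, the consistency rule and the return values are assumptions made only for this example.

from collections import Counter

INTENT_THRESHOLD = 0.7    # assumed value, not specified in the application
SEMANTIC_THRESHOLD = 0.7  # assumed value, not specified in the application

def is_consistent(intent: str, emotion: str) -> bool:
    """Toy consistency rule between a yes/no intent and an emotion label."""
    return not ((intent == "yes" and emotion == "negative") or
                (intent == "no" and emotion == "positive"))

def decide_next_step(intent, intent_prob, text_emotion, text_emotion_prob,
                     non_text_emotions):
    # Pass 1: intent classification plus the emotion-consistency check.
    if intent is None:
        return "ask_next_question"       # no yes/no intent: continue the series
    if (intent_prob >= INTENT_THRESHOLD and
            all(is_consistent(intent, e)
                for e in [text_emotion, *non_text_emotions])):
        return "ask_next_question"

    # Pass 2: semantic emotion alone, without the consistency block.
    if (text_emotion_prob >= SEMANTIC_THRESHOLD and
            all(text_emotion == e for e in non_text_emotions)):
        return "ask_next_question"

    # Pass 3: majority vote over the emotion results that exclude text.
    majority, _ = Counter(non_text_emotions).most_common(1)[0]
    return "ask_follow_up" if majority == "negative" else "ask_next_question"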
6. The method for screening and evaluating multi-modal depression based on voice semantic analysis according to claim 5, wherein the feature extraction comprises semantic feature extraction and voice feature extraction; the semantic feature extraction adopts the pre-trained language model BERT, which is pre-trained on the two tasks of masked language modeling and next-sentence prediction; the pre-tokenized text is input into the pre-trained BERT model, the output of the last layer is selected, and a vector of length 768 is extracted as the text feature vector;
the voice feature extraction comprises extracting five spectral features, namely the mel spectrogram, the mel-frequency cepstral coefficients (MFCC), the spectral contrast, the chromagram and the tonal centroid features (Tonnetz), from an audio file using the Librosa software package, and concatenating these spectral features to form an audio feature vector of length 193.
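As an illustrative aside (not part of the claims), the feature extraction of claim 6 is commonly implemented as below; the per-feature dimensions (40 MFCC + 12 chroma + 128 mel + 7 spectral contrast + 6 tonnetz = 193), the averaging over time, and the "bert-base-chinese" checkpoint with [CLS] pooling for the 768-dimensional text vector are assumptions, as the application only states the total lengths.

import numpy as np
import librosa
import torch
from transformers import BertModel, BertTokenizer

def extract_audio_features(path: str) -> np.ndarray:
    """193-dimensional audio vector: MFCC, chroma, mel, spectral contrast, tonnetz."""
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)           # 40
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)          # 12
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)             # 128
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)  # 7
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)  # 6
    return np.concatenate([mfcc, chroma, mel, contrast, tonnetz])                 # length 193

def extract_text_features(text: str) -> torch.Tensor:
    """768-dimensional text vector taken from the last layer of a pre-trained BERT."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
    model = BertModel.from_pretrained("bert-base-chinese")
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[:, 0, :].squeeze(0)             # [CLS] vector, length 768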
7. The method for screening and evaluating multi-modal depression based on voice semantic analysis according to claim 6, wherein the multi-mode fusion comprises extracting the text and audio modality features each time a question is answered, and comprehensively and objectively evaluating the degree of depression by means of decision-layer fusion and feature-layer fusion;
the decision-layer fusion carries out model training by constructing a plurality of unimodal text and audio classifiers with different machine learning algorithms, selects the best-performing unimodal classifiers, and uses these classifiers to determine the final depression-level prediction through majority voting;
the feature-layer fusion uses a deep neural network to integrate all the information from the text and audio modality features: given the text and audio modality features as input, a deep neural network followed by a softmax layer generates the probabilities of the different depression levels, with all the feature vectors of the two modalities directly concatenated as input and fed to a deep neural network comprising two hidden layers.
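As an illustrative aside (not part of the claims), the two fusion strategies of claim 7 could be sketched as follows; the hidden-layer sizes and the number of depression levels are assumptions, and the unimodal classifiers are represented only by their predicted labels.

from collections import Counter
import torch
import torch.nn as nn

def decision_layer_fusion(unimodal_predictions):
    """Majority vote over the depression levels predicted by the best unimodal classifiers."""
    return Counter(unimodal_predictions).most_common(1)[0][0]

class FeatureLayerFusion(nn.Module):
    """Concatenate the 768-d text and 193-d audio vectors and feed them to a
    two-hidden-layer network whose softmax output gives depression-level probabilities."""
    def __init__(self, text_dim=768, audio_dim=193, hidden=(256, 64), n_levels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], n_levels),
        )

    def forward(self, text_feat, audio_feat):
        fused = torch.cat([text_feat, audio_feat], dim=-1)   # direct concatenation
        return torch.softmax(self.net(fused), dim=-1)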
8. The speech semantic analysis based multimodal depression screening and assessment method according to claim 7, wherein the assessment index comprises assessing the multimodal depression level using the weighted averages of precision, recall and F1 score, calculated as:
Precision_weighted = Σ_{i=1..N} w_i · Precision_i, Recall_weighted = Σ_{i=1..N} w_i · Recall_i, F1_weighted = Σ_{i=1..N} w_i · F1_i,
where N is the number of categories and the weight w_i is the ratio of the number of samples of category i to the total number of samples, which is equal to:
w_i = n_i / Σ_{j=1..N} n_j;
Precision_i, Recall_i and F1_i are calculated as follows:
Precision_i = TP_i / (TP_i + FP_i), Recall_i = TP_i / (TP_i + FN_i), F1_i = 2 · Precision_i · Recall_i / (Precision_i + Recall_i),
where TP_i, FP_i and FN_i denote the numbers of true positives, false positives and false negatives for category i.
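As an illustrative aside (not part of the claims), the weighted evaluation indexes of claim 8 correspond to what scikit-learn computes with weighted averaging; the labels below are hypothetical.

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 1, 0, 2, 1]   # hypothetical depression-level labels
y_pred = [0, 1, 1, 1, 0, 2, 2]   # hypothetical model predictions

# average="weighted" weights each class by its share of the total samples (w_i = n_i / n).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"weighted precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")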
9. A multimodal depression screening and evaluating system based on speech semantic analysis, the system comprising:
the dialogue interface module is used for carrying out dialogue input and dialogue output with a user;
the dialogue management module is used for collecting, managing and analyzing dialogue information between the user and the dialogue interface;
the single-mode emotion recognition model module is used for carrying out voice recognition and semantic recognition on the collected dialogue information and extracting features;
the multi-mode fusion module is used for carrying out multi-mode fusion on the extracted characteristics so as to comprehensively and objectively evaluate the depression degree;
and the evaluation index module is used for evaluating the multi-modal depression level by adopting a weighted average value of the precision, the recall and the F1 score.
10. An electronic device, comprising:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 8.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1 to 8.
CN202310412560.7A 2023-04-18 2023-04-18 Multimode depression screening and evaluating method and system based on voice semantic analysis Pending CN116616770A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310412560.7A CN116616770A (en) 2023-04-18 2023-04-18 Multimode depression screening and evaluating method and system based on voice semantic analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310412560.7A CN116616770A (en) 2023-04-18 2023-04-18 Multimode depression screening and evaluating method and system based on voice semantic analysis

Publications (1)

Publication Number Publication Date
CN116616770A true CN116616770A (en) 2023-08-22

Family

ID=87612415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310412560.7A Pending CN116616770A (en) 2023-04-18 2023-04-18 Multimode depression screening and evaluating method and system based on voice semantic analysis

Country Status (1)

Country Link
CN (1) CN116616770A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116913527A (en) * 2023-09-14 2023-10-20 北京健康有益科技有限公司 Hypertension evaluation method and system based on multi-round dialogue frame
CN116913527B (en) * 2023-09-14 2023-12-05 北京健康有益科技有限公司 Hypertension evaluation method and system based on multi-round dialogue frame
CN117056536A (en) * 2023-10-10 2023-11-14 湖南创星科技股份有限公司 Knowledge graph driving-based virtual doctor system and operation method thereof
CN117056536B (en) * 2023-10-10 2023-12-26 湖南创星科技股份有限公司 Knowledge graph driving-based virtual doctor system and operation method thereof
CN117158971A (en) * 2023-11-03 2023-12-05 武汉真彩智造科技有限公司 Psychological physical examination method and system based on AI dialogue
CN117158971B (en) * 2023-11-03 2024-01-26 武汉真彩智造科技有限公司 Psychological physical examination method and system based on AI dialogue

Similar Documents

Publication Publication Date Title
Hema et al. Emotional speech recognition using cnn and deep learning techniques
Karnati et al. LieNet: A deep convolution neural network framework for detecting deception
Al-Dujaili et al. Speech emotion recognition: a comprehensive survey
CN116616770A (en) Multimode depression screening and evaluating method and system based on voice semantic analysis
García-Ordás et al. Sentiment analysis in non-fixed length audios using a Fully Convolutional Neural Network
CN111145903B (en) Method and device for acquiring vertigo inquiry text, electronic equipment and inquiry system
Madanian et al. Speech emotion recognition using machine learning—A systematic review
CN113197579A (en) Intelligent psychological assessment method and system based on multi-mode information fusion
Samareh et al. Detect depression from communication: How computer vision, signal processing, and sentiment analysis join forces
Lu et al. Speech depression recognition based on attentional residual network
Danner et al. Advancing mental health diagnostics: GPT-based method for depression detection
Madanian et al. Automatic speech emotion recognition using machine learning: digital transformation of mental health
Shanthi et al. An integrated approach for mental health assessment using emotion analysis and scales
Sadeghi et al. Exploring the capabilities of a language model-only approach for depression detection in text data
CN116978409A (en) Depression state evaluation method, device, terminal and medium based on voice signal
Kumar et al. Machine learning technique-based emotion classification using speech signals
Agrima et al. Emotion recognition from syllabic units using k-nearest-neighbor classification and energy distribution
McTear et al. Affective conversational interfaces
Wang et al. MFCC-based deep convolutional neural network for audio depression recognition
SÖNMEZ et al. In-depth investigation of speech emotion recognition studies from past to present The importance of emotion recognition from speech signal for AI
Stavrianos et al. Enabling speech emotional intelligence as a service in homecare platforms
Teixeira et al. F0, LPC, and MFCC analysis for emotion recognition based on speech
Iliev Perspective Chapter: Emotion Detection Using Speech Analysis and Deep Learning
Chaspari et al. The development of the Athens Emotional States Inventory (AESI): collection, validation and automatic processing of emotionally loaded sentences
Li et al. Using deeply time-series semantics to assess depressive symptoms based on clinical interview speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination