CN114190942B - Method for computer-implemented depression detection based on audio analysis - Google Patents


Info

Publication number
CN114190942B
Authority
CN
China
Prior art keywords
audio
determining
segment
video
time
Prior art date
Legal status
Active
Application number
CN202111523401.1A
Other languages
Chinese (zh)
Other versions
CN114190942A (en)
Inventor
齐中祥
Current Assignee
Womin High New Science & Technology Beijing Co ltd
Original Assignee
Womin High New Science & Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Womin High New Science & Technology Beijing Co ltd
Priority to CN202111523401.1A
Publication of CN114190942A
Application granted
Publication of CN114190942B

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B 5/165 Evaluating the state of mind, e.g. depression, anxiety

Abstract

The invention relates to a depression detection method based on audio analysis, which comprises the following steps: acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned; extracting speech rate, intonation and semantics from the audio; determining time features and behavior features based on the facial video and the audio; and performing depression detection on the facial video based on the speech rate, intonation, semantics, time features, behavior features and autonomic nerve signals to obtain a depression detection result. The method detects whether the user suffers from depression from the audio, facial video and autonomic nerve signals recorded while the user is questioned, thereby realizing automatic detection of depression.

Description

Method for computer-implemented depression detection based on audio analysis
Technical Field
The invention relates to the technical field of psychological assessment, and in particular to a computer-implemented method for detecting depression based on audio analysis.
Background
Currently, depression is the second most prevalent human disease, second only to cardiovascular disease, and its onset has begun to trend toward younger ages. Detection of depression is therefore critical to the medical treatment and prevention of depression.
Disclosure of Invention
(I) Technical problem to be solved
In view of the above-described shortcomings and drawbacks of the prior art, the present invention provides a computer-implemented method of depression detection based on audio analysis.
(II) Technical solution
In order to achieve the above purpose, the main technical solution adopted by the invention is as follows:
A method of depression detection based on audio analysis, the method comprising:
S101, acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned;
S102, extracting speech rate, intonation and semantics from the audio;
S103, determining time features and behavior features based on the facial video and the audio;
S104, performing depression detection on the facial video based on the speech rate, the intonation, the semantics, the time features, the behavior features and the autonomic nerve signals to obtain a depression detection result.
Optionally, the audio and facial video are collected simultaneously;
the step S103 includes:
S103-1, segmenting the audio according to the questions to obtain a plurality of audio segments; each audio segment corresponds to one question and its answer;
S103-2, determining the acquisition time period corresponding to each audio segment, and taking the video within that acquisition time period in the facial video as the video segment corresponding to the audio segment;
S103-3, determining the corresponding time features and behavior features according to each audio segment and its corresponding video segment.
Optionally, each audio segment is composed of three audio sub-segments, wherein the first audio sub-segment is the question audio, the second audio sub-segment is the silence audio after the question, and the third audio sub-segment is the user's reply audio to the question;
the S103-3 comprises:
for any one of the audio segments,
determining the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in said audio segment;
determining a first video sub-segment corresponding to the first audio sub-segment, a second video sub-segment corresponding to the second audio sub-segment and a third video sub-segment corresponding to the third audio sub-segment in the video segment corresponding to said audio segment;
identifying pupil positions and pupil areas in the first video sub-segment, the second video sub-segment and the third video sub-segment respectively, and determining attention time features and attention behavior features according to the recognition results;
taking the attention time feature as the time feature corresponding to said audio segment;
and taking the attention behavior feature as the behavior feature corresponding to said audio segment.
Optionally, the questioner is located directly in front of the user, and the eyes of the questioner and the eyes of the user are located at the same height;
the determining of the attention time feature and the attention behavior feature according to the recognition results comprises the following steps:
determining, from the pupil positions in the first video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the first video sub-segment as well as the maximum pupil area and the minimum pupil area within the frame segment in which the pupil remains at the center of the eye;
determining, from the pupil positions in the second video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the second video sub-segment;
determining, from the pupil positions in the third video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the third video sub-segment as well as the maximum pupil area and the minimum pupil area within the frame segment in which the pupil remains at the center of the eye;
taking all of the above durations as the attention time features;
and taking all of the above pupil areas as the attention behavior features.
Optionally, the S104 includes:
S104-1, determining corresponding weight values according to the speech rate, intonation, semantics, time features and behavior features of each audio segment;
S104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment;
S104-3, determining an emotion coefficient based on the autonomic nerve signals;
S104-4, performing depression detection on the facial video based on the emotion coefficient and the total weight value to obtain a depression detection result.
Optionally, the step S104-1 includes:
for any one of the audio segments,
determining an emotion label for the third audio sub-segment according to the semantics, and obtaining the corresponding emotion weight value W_i1 according to a preset correspondence between emotion labels and emotion weights;
determining the average speech rate of the third audio sub-segment according to the speech rate, and taking |1 - (the average speech rate / a standard speech rate of the user acquired in advance)| as the speech-rate weight value W_i2;
determining the average intonation of the third audio sub-segment according to the intonation, and taking |1 - (the average intonation / a standard intonation of the user collected in advance)| as the intonation weight value W_i3;
determining a first time weight value according to the durations of the audio sub-segments, determining a second time weight value according to the attention time features, and taking the quotient of the second time weight value and the first time weight value as the time weight value W_i4;
calculating the behavior weight value W_i5 from the attention behavior features.
Optionally, the determining of a first time weight value according to the durations and of a second time weight value according to the attention time features comprises:
the first time weight value is determined by the following formula:
the second time weight value is determined by the following formula:
optionally, the step S104-2 includes:
total weight value of the user = max { W i1 ,W i2 ,W i3 }*W i4 *W i5
Optionally, the step S104-3 includes:
forming the autonomic nerve signals into a signal set; wherein each element in the signal set corresponds to an autonomic nerve signal value acquired at one moment, and the elements in the signal set are arranged in order of acquisition time from earliest to latest;
determining the difference between every two adjacent elements in the signal set to form a signal difference set, wherein each difference is the value of the later element minus the value of the earlier element;
determining the standard deviation σ_Δ of all elements in the signal difference set;
determining the element with the largest value and the element with the smallest value in the signal difference set;
determining the element a_max with the largest value and the element a_min with the smallest value in the signal set, as well as the time t_max corresponding to the largest element and the time t_min corresponding to the smallest element;
determining the emotion coefficient I1 from the above quantities.
Optionally, the step S104-4 includes:
identifying micro-expressions of frames in the facial video;
determining the degree of change between the micro expressions of each frame;
determining a maximum number of consecutive frames with a degree of change not greater than a change threshold;
determining a detection value = maximum number × total weight value × I1; wherein I1 is an emotion coefficient;
if the detection value is greater than the depression threshold, it is determined that depression is detected.
(III) Beneficial effects
The method acquires audio, facial video and autonomic nerve signals of a user while the user is being questioned; extracts speech rate, intonation and semantics from the audio; determines time features and behavior features based on the facial video and the audio; and performs depression detection on the facial video based on the speech rate, intonation, semantics, time features, behavior features and autonomic nerve signals to obtain a depression detection result. The method provided by the invention thus detects whether the user suffers from depression from the audio, facial video and autonomic nerve signals recorded while the user is questioned, thereby realizing automatic detection of depression.
Drawings
Fig. 1 is a flowchart of a method for detecting depression based on audio analysis according to an embodiment of the present invention.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.
Currently, depression is the second most prevalent human disease, second only to cardiovascular disease; about 800,000 people die by suicide each year due to depression, and its onset is trending toward younger ages (university students, and even primary and secondary school students). However, medical treatment and prevention of depression in China still suffers from a low recognition rate: the recognition rate in hospitals at or above the prefecture-city level is less than 20%, and less than 10% of patients receive relevant drug treatment. Detection of depression is therefore crucial to the medical treatment and prevention of depression.
Based on this, the present invention provides a method of depression detection based on audio analysis, the method comprising: acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned; extracting speech rate, intonation and semantics from the audio; determining time features and behavior features based on the facial video and the audio; and performing depression detection on the facial video based on the speech rate, intonation, semantics, time features, behavior features and autonomic nerve signals, so that a depression detection result is obtained and automatic depression detection is realized.
Specifically, questions are put to the user, and the user is checked for a tendency toward depression by the method shown in Fig. 1 to determine whether the user suffers from depression.
Referring to Fig. 1, the implementation procedure of the method for detecting depression based on audio analysis provided in this embodiment is as follows:
S101, acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned.
In step S101, a question is put to the user; after the user answers it, the next question is asked, and so on until all the questions have been answered.
In asking questions, the questioner is located directly in front of the user, and the eyes of the questioner are on the same level (i.e., at the same height) as the eyes of the user.
During the questioning process, the audio, facial video and autonomic nerve signals of the user are acquired in real time.
That is, audio, facial video and autonomic nerve signals are acquired simultaneously.
S102, extracting speech rate, intonation and semantics from the audio.
This step uses existing speech-rate, intonation and semantic extraction techniques and is not described in detail here.
S103, determining time features and behavior features based on the facial video and the audio.
Specifically:
s103-1, segmenting the audio according to the questions to obtain a plurality of audio segments.
Each audio segment corresponds to one question and its answer.
That is, the audio is split by question and its corresponding answer, with one audio segment per question; there are as many audio segments as there are questions.
Since each audio segment corresponds to one question and its answer, it contains three phases: the question phase, the silence phase in which the user thinks after the question, and the reply phase in which the user answers. That is, each audio segment is composed of three audio sub-segments: the first audio sub-segment is the question audio, the second audio sub-segment is the silence audio after the question, and the third audio sub-segment is the user's reply audio to the question.
S103-2, determining the acquisition time period corresponding to each audio segment, and taking the video within that acquisition time period in the facial video as the video segment corresponding to the audio segment.
Because the audio and the video are recorded simultaneously, the video segment corresponding to each audio segment is obtained in this step.
That is, an audio segment and its corresponding video segment are the audio and video of the question-answering process for the same question.
S103-3, determining the corresponding time features and behavior features according to each audio segment and its corresponding video segment.
Specifically, for any audio segment (say audio segment i, which corresponds to video segment i):
1. Determine the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in the audio segment.
That is, determine the duration of the question audio, the duration of the silence audio and the duration of the reply audio in audio segment i.
2. Determine, in the video segment corresponding to the audio segment, the first video sub-segment corresponding to the first audio sub-segment, the second video sub-segment corresponding to the second audio sub-segment and the third video sub-segment corresponding to the third audio sub-segment.
That is, determine, in video segment i, the first video sub-segment corresponding to the question audio (i.e., the video of the question phase), the second video sub-segment corresponding to the silence audio (i.e., the video of the silence phase), and the third video sub-segment corresponding to the reply audio (i.e., the video of the reply phase).
By executing the above steps, the audio segment and the video segment of each question-answering process are obtained.
Each question-answering process is divided into a question phase, a thinking-silence phase and a reply phase.
Within each audio segment, three audio sub-segments are obtained, corresponding to the question phase, the thinking-silence phase and the reply phase respectively. At this point, the first audio sub-segment and first video sub-segment of the question phase, the second audio sub-segment and second video sub-segment of the thinking-silence phase, and the third audio sub-segment and third video sub-segment of the reply phase have all been obtained.
3. Identify the pupil positions and pupil areas in the first video sub-segment, the second video sub-segment and the third video sub-segment respectively, and determine the attention time features and attention behavior features according to the recognition results.
Pupil position and pupil area are identified using existing techniques and are not described here.
The attention time features and attention behavior features may be determined from the recognition results as follows:
1) From the pupil positions in the first video sub-segment, determine the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye; also determine the maximum and minimum pupil areas within the frame segment in which the pupil remains at the center of the eye, and the maximum and minimum pupil areas over the whole first video sub-segment.
That is,
the pupil position and pupil area in each frame of the first video sub-segment are determined.
The pupil positions of consecutive frames are compared to decide whether the pupil position changes between them. By comparing the pupil positions of the frames in the first video sub-segment, all frame segments in which the pupil position stays unchanged are obtained. Among these, the frame segments in which the pupil position is at the center of the eye (i.e., the frame segments in which the user looks directly at the questioner) are identified, and the one containing the largest number of frames is found; its duration is the maximum duration at the eye center, and the maximum and minimum pupil areas within that frame segment are recorded.
Among the frame segments in which the pupil position is not at the center of the eye, the one with the largest number of frames is found; its duration is the maximum duration at a non-center position.
The largest pupil area over all frames of the first video sub-segment is taken as the maximum pupil area of the sub-segment, and the smallest as its minimum pupil area.
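The following sketch shows one way this per-sub-segment computation could be implemented, given per-frame pupil detections from an existing pupil-recognition step. It is a hedged example rather than the patent's own code: the frame representation (a quantised position label such as "center" plus a pupil area) and all function and field names are assumptions. The same routine can be applied to the second and third video sub-segments.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Tuple

@dataclass
class PupilFrame:
    position: str   # quantised pupil position, e.g. "center" when looking at the questioner
    area: float     # pupil area detected in this frame (e.g. in pixels)

def attention_features(frames: List[PupilFrame], fps: float):
    """Compute the per-sub-segment attention quantities described above.

    Returns:
      t_center    : longest time (s) the pupil stays continuously at the eye center
      t_other     : longest time (s) the pupil stays continuously at one and the
                    same non-center position
      area_center : (max, min) pupil area within that longest center-gaze run
      area_all    : (max, min) pupil area over the whole sub-segment
    """
    # Group consecutive frames that share the same pupil position ("runs").
    runs: List[Tuple[str, List[PupilFrame]]] = [
        (pos, list(group)) for pos, group in groupby(frames, key=lambda f: f.position)
    ]

    def longest_run(predicate) -> Optional[List[PupilFrame]]:
        candidates = [g for pos, g in runs if predicate(pos)]
        return max(candidates, key=len) if candidates else None

    center_run = longest_run(lambda pos: pos == "center")
    other_run = longest_run(lambda pos: pos != "center")

    t_center = len(center_run) / fps if center_run else 0.0
    t_other = len(other_run) / fps if other_run else 0.0

    area_center = (
        (max(f.area for f in center_run), min(f.area for f in center_run))
        if center_run else (0.0, 0.0)
    )
    area_all = (max(f.area for f in frames), min(f.area for f in frames))
    return t_center, t_other, area_center, area_all
```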
2) From the pupil positions in the second video sub-segment, determine the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determine the maximum and minimum pupil areas in the second video sub-segment.
Pupil position and pupil area in each frame of the second video sub-segment are determined.
The pupil positions of consecutive frames are compared to decide whether the pupil position changes between them. By comparing the pupil positions of the frames in the second video sub-segment, all frame segments in which the pupil position stays unchanged are obtained. Among these, the frame segments in which the pupil position is at the center of the eye (i.e., the frame segments in which the user looks directly at the questioner) are identified, and the one containing the largest number of frames is found; its duration is the maximum duration at the eye center, and the maximum and minimum pupil areas within that frame segment are recorded.
Among the frame segments in which the pupil position is not at the center of the eye, the one with the largest number of frames is found; its duration is the maximum duration at a non-center position.
3) From the pupil positions in the third video sub-segment, determine the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye; also determine the maximum and minimum pupil areas within the frame segment in which the pupil remains at the center of the eye, and the maximum and minimum pupil areas over the whole third video sub-segment.
That is,
the pupil position and pupil area in each frame of the third video sub-segment are determined.
The pupil positions of consecutive frames are compared to decide whether the pupil position changes between them. By comparing the pupil positions of the frames in the third video sub-segment, all frame segments in which the pupil position stays unchanged are obtained. Among these, the frame segments in which the pupil position is at the center of the eye (i.e., the frame segments in which the user looks directly at the questioner) are identified, and the one containing the largest number of frames is found; its duration is the maximum duration at the eye center, and the maximum and minimum pupil areas within that frame segment are recorded.
Among the frame segments in which the pupil position is not at the center of the eye, the one with the largest number of frames is found; its duration is the maximum duration at a non-center position.
The largest pupil area over all frames of the third video sub-segment is taken as the maximum pupil area of the sub-segment, and the smallest as its minimum pupil area.
4) Take all of the above maximum durations as the attention time features.
5) Take all of the above maximum and minimum pupil areas as the attention behavior features.
4. Take the attention time features as the time features corresponding to the audio segment.
That is, the time features of audio segment i are the maximum durations determined above.
5. Take the attention behavior features as the behavior features corresponding to the audio segment.
That is, the behavior features of audio segment i are the pupil areas determined above.
S104, performing depression detection on the facial video based on the speech rate, the intonation, the semantics, the time features, the behavior features and the autonomic nerve signals to obtain a depression detection result.
Specifically:
S104-1, determine the corresponding weight values according to the speech rate, intonation, semantics, time features and behavior features of each audio segment.
For any audio segment, the weight value is calculated as follows:
1. Determine the emotion label of the third audio sub-segment according to the semantics, and obtain the corresponding emotion weight value W_i1 from the preset correspondence between emotion labels and emotion weights.
The correspondence between emotion labels and emotion weights is preset and may be an empirical value.
Semantics are important for judging depression: if the user's speech is full of negativity, the likelihood of depression increases. This step therefore uses the emotion weight value W_i1 to reflect the likelihood of depression at the semantic level.
2. Determine the average speech rate of the third audio sub-segment from the speech rate, and take |1 - (the average speech rate / the standard speech rate of the user acquired in advance)| as the speech-rate weight value W_i2.
The standard speech rate is acquired in advance by analyzing the user's speech while the user is unaware of being tested; that is, the user's speech rate in a natural state is taken as the standard speech rate.
Here |·| denotes the absolute value.
W_i2 = |1 - average speech rate of the third audio sub-segment / standard speech rate|.
Speech rate is important for judging depression. If the user's speech rate changes suddenly compared with the normal speech rate, the user's emotion is fluctuating: a faster speech rate indicates agitation, while a slower one indicates careless answering, unwillingness to answer, or other negative emotions. A significant emotional change in either direction increases the likelihood of depression. This step therefore uses the speech-rate weight value W_i2 to reflect the likelihood of depression at the emotional level.
3. Determine the average intonation of the third audio sub-segment from the intonation, and take |1 - (the average intonation / the standard intonation of the user collected in advance)| as the intonation weight value W_i3.
The standard intonation is collected in advance by analyzing the user's speech while the user is unaware of being tested; that is, the user's intonation in a natural state is taken as the standard intonation.
W_i3 = |1 - average intonation of the third audio sub-segment / standard intonation|.
Intonation is important for judging depression. If the user's intonation changes suddenly, the user's emotion is fluctuating: a raised intonation indicates agitation, while a lowered one indicates dejection. A significant emotional change in either direction increases the likelihood of depression. This step therefore uses the intonation weight value W_i3 to reflect the likelihood of depression at the emotional level.
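As a small numeric illustration of W_i2 and W_i3, the sketch below evaluates the |1 - average/standard| form described above. How the averages are measured (e.g., syllables per second for speech rate, mean pitch for intonation) is an assumption of this example; the text only specifies the quotient form.

```python
def deviation_weight(average: float, standard: float) -> float:
    """|1 - average/standard|: 0 when the user behaves exactly as in the natural
    state, growing as the measurement deviates in either direction."""
    if standard <= 0:
        raise ValueError("standard (baseline) value must be positive")
    return abs(1.0 - average / standard)

# Speech-rate weight W_i2: e.g. the user normally speaks ~4.0 syllables/s but
# answered this question at 2.8 syllables/s (slowed down).
w_i2 = deviation_weight(average=2.8, standard=4.0)      # |1 - 0.7| = 0.3

# Intonation weight W_i3: e.g. mean pitch 230 Hz vs. a 200 Hz baseline (raised).
w_i3 = deviation_weight(average=230.0, standard=200.0)  # |1 - 1.15| = 0.15

print(round(w_i2, 2), round(w_i3, 2))  # 0.3 0.15
```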
4. Determine a first time weight value according to the durations of the audio sub-segments, determine a second time weight value according to the attention time features, and take the quotient of the second time weight value and the first time weight value as the time weight value W_i4.
For example, time weight value W_i4 = second time weight value / first time weight value.
The first time weight value characterizes the relationship between the silence duration and the reply duration: the longer the silence lasts, the more it indicates that the user does not want to answer, does not know how to answer, or has other serious negative emotions, and the likelihood of depression increases.
The second time weight value characterizes how long the user keeps staring at the questioner during the question-answering process: the larger the value, the more it indicates that the user does not want to answer, does not know how to answer, or has other serious negative emotions, and the likelihood of depression increases.
The time weight value W_i4 thus reflects the likelihood of depression at the temporal level.
5. Calculate the behavior weight value W_i5 from the attention behavior features.
The behavior weight value W_i5 characterizes the change of the pupil area during questioning and the pupil area while the user keeps staring at the questioner: the larger the value, the more it indicates that the user does not want to answer, does not know how to answer, or has other serious negative emotions, and the likelihood of depression increases.
The behavior weight value W_i5 thus reflects the likelihood of depression at the behavioral level.
S104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment.
For example, total weight value of the user = max{W_i1, W_i2, W_i3} × W_i4 × W_i5.
S104-3, determining the emotion coefficient based on the autonomic nerve signals.
Specifically:
1. Form the autonomic nerve signals into a signal set.
Each element in the signal set corresponds to an autonomic nerve signal value acquired at one moment, and the elements are arranged in order of acquisition time from earliest to latest.
For example, the signal set S contains 5 elements: S = {S_0, S_1, S_2, S_3, S_4}.
2. Determine the difference between every two adjacent elements in the signal set to form the signal difference set.
Each difference is the value of the later element minus the value of the earlier element.
For example, the signal difference set ΔS contains 4 elements: ΔS = {S_1 - S_0, S_2 - S_1, S_3 - S_2, S_4 - S_3}.
If S_1 - S_0 is denoted a_0, S_2 - S_1 is denoted a_1, S_3 - S_2 is denoted a_2 and S_4 - S_3 is denoted a_3, then ΔS = {a_0, a_1, a_2, a_3}.
That is, any element a_j of ΔS has the value S_{j+1} - S_j, i.e., a_j = S_{j+1} - S_j.
3. Determine the standard deviation σ_Δ of all elements in the signal difference set.
That is, determine the standard deviation σ_Δ of all elements in ΔS.
For example, if ΔS = {a_0, a_1, a_2, a_3}, then σ_Δ is the standard deviation of a_0, a_1, a_2 and a_3 about their mean.
4. Determine the element with the largest value and the element with the smallest value in the signal difference set.
That is, determine max{a_0, a_1, a_2, a_3} and min{a_0, a_1, a_2, a_3},
where max{} is the maximum function and min{} is the minimum function.
5. Determine the element a_max with the largest value and the element a_min with the smallest value in the signal set, as well as the time t_max corresponding to the largest element and the time t_min corresponding to the smallest element.
That is, determine a_max = max{S_0, S_1, S_2, S_3, S_4} and a_min = min{S_0, S_1, S_2, S_3, S_4}.
The acquisition time corresponding to a_max is t_max, and the acquisition time corresponding to a_min is t_min.
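The intermediate quantities of steps 1-5 can be computed directly from a list of (time, value) samples, as in the sketch below. It covers only the quantities named in the text (the difference set, σ_Δ, the extreme values and their acquisition times); the closed-form combination of these quantities into the emotion coefficient I1 is not reproduced in the text above, so the sketch deliberately stops at the intermediates. Function and variable names are illustrative.

```python
import statistics
from typing import List, Tuple

def autonomic_statistics(samples: List[Tuple[float, float]]):
    """Compute the quantities of steps 1-5 from (time, signal value) samples
    ordered from earliest to latest acquisition time."""
    times = [t for t, _ in samples]
    values = [v for _, v in samples]                      # the signal set S

    # Step 2: differences between adjacent elements (later minus earlier).
    diffs = [b - a for a, b in zip(values, values[1:])]   # the set ΔS

    # Step 3: standard deviation of the signal difference set
    # (population std-dev; the text only says "standard deviation").
    sigma_delta = statistics.pstdev(diffs)

    # Step 4: extreme elements of the difference set.
    diff_max, diff_min = max(diffs), min(diffs)

    # Step 5: extreme signal values and the times at which they were acquired.
    a_max, a_min = max(values), min(values)
    t_max = times[values.index(a_max)]
    t_min = times[values.index(a_min)]

    return sigma_delta, diff_max, diff_min, a_max, a_min, t_max, t_min

# Example with five samples S_0..S_4 acquired one second apart.
stats = autonomic_statistics([(0, 1.0), (1, 1.4), (2, 0.9), (3, 2.1), (4, 1.2)])
print(stats)
```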
6. Determine the emotion coefficient I1 from the above quantities.
Clinically, depression manifests as a persistently low and depressed mood, progressing from initial gloominess to sorrow, pain, pessimism and world-weariness; patients feel that living through each day is a torment, become negative and avoidant, and may ultimately even develop suicide attempts and suicidal behavior.
Patients with depression do not actively interact with the outside world, i.e., they respond weakly to external stimuli. The autonomic nerve signal value reflects the user's response to the current emotional stimulus: the longer the interval between the maximum and the minimum response (i.e., the larger t_max - t_min), the greater the likelihood of depression; and the smaller the difference between the maximum and the minimum response (i.e., the smaller a_max - a_min), the less sensitive the user is to external stimuli and the greater the likelihood of depression.
In addition, patients with depression may also over-react to stimuli. σ_Δ characterizes the dispersion of the changes between successive autonomic nerve signal values: the larger σ_Δ, the stronger the mood swings and the greater the likelihood of depression. The difference between the largest and the smallest element of the signal difference set characterizes how far apart the maximum and minimum degrees of change are: the larger this difference, the more violent the reaction and the greater the likelihood of depression.
S104-4, carrying out depression detection on the facial video based on the emotion coefficient and the total weight value to obtain a depression detection result.
Specifically:
1. Identify the micro-expressions of the frames in the facial video.
Existing micro-expression recognition techniques are used in this step and are not described here.
2. Determine the degree of change between the micro-expressions of adjacent frames.
Existing micro-expression analysis methods are likewise used to determine the expression change between consecutive frames.
The degree of change may be represented in various ways: for example, as the number of micro-expression feature points that changed, as the ratio of the number of changed feature points to the total number of feature points, or as the average displacement of the changed feature points, where the displacement of a feature point is the difference between its positions in the two frames.
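As an illustration of the third representation (average displacement of the changed feature points), the following sketch compares the landmark coordinates of two consecutive frames. The landmark format (a list of (x, y) points from an existing micro-expression or facial-landmark detector) and the small movement tolerance are assumptions of this example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def degree_of_change(prev: List[Point], curr: List[Point], tol: float = 0.5) -> float:
    """Average displacement of the feature points that actually moved.

    A point counts as 'changed' when it moved by more than `tol` pixels between
    the two frames; the degree of change is the mean displacement of those
    points (0.0 if nothing moved).
    """
    if len(prev) != len(curr):
        raise ValueError("both frames must provide the same landmark points")
    displacements = [
        math.dist(p, q) for p, q in zip(prev, curr) if math.dist(p, q) > tol
    ]
    return sum(displacements) / len(displacements) if displacements else 0.0

# Example: three landmarks, only the mouth-corner point (last one) moved.
prev_pts = [(10.0, 10.0), (20.0, 10.0), (15.0, 22.0)]
curr_pts = [(10.1, 10.0), (20.0, 10.1), (15.0, 25.0)]
print(degree_of_change(prev_pts, curr_pts))  # 3.0
```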
3. Determine the maximum number of consecutive frames whose degree of change is not greater than a change threshold.
The change threshold is an empirical value and may be preset, or it may be learned from sample data.
The consecutive frames include all frames involved in degrees of change that are not greater than the change threshold; that is, since each degree of change is computed between two frames, both of those frames count as involved frames.
In this step, the degree of change between adjacent frames is calculated in sequence, and each degree of change is compared with the change threshold.
For example, consider the data shown in Table 1.
TABLE 1
Degree of change    Frames involved    Not greater than change threshold?
D_0                 F_0, F_1           yes
D_1                 F_1, F_2           no
D_2                 F_2, F_3           yes
D_3                 F_3, F_4           yes
D_4                 F_4, F_5           no
D_5                 F_5, F_6           yes
D_6                 F_6, F_7           yes
D_7                 F_7, F_8           yes
In the data shown in Table 1, there are three runs of consecutive frames whose degree of change is not greater than the change threshold: the first is D_0, corresponding to frames F_0 and F_1; the second is D_2 and D_3, corresponding to frames F_2, F_3 and F_4; and the third is D_5, D_6 and D_7, corresponding to frames F_5, F_6, F_7 and F_8.
The maximum number is therefore 4 (i.e., F_5, F_6, F_7 and F_8).
The maximum number of consecutive frames indicates the longest span over which the user shows no facial response. Because the frame rate of the video is fixed, this number of frames corresponds to a length of time, i.e., the maximum time for which the user does not respond while receiving emotional stimuli; the longer this time, the greater the likelihood of depression.
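The following sketch computes this maximum number of consecutive frames from the per-pair degrees of change and reproduces the Table 1 example; the variable names and threshold value are illustrative.

```python
from typing import List

def max_unresponsive_frames(change_degrees: List[float], threshold: float) -> int:
    """Longest run of frames linked by degrees of change <= threshold.

    change_degrees[j] is the degree of change between frame F_j and F_{j+1};
    a run of k consecutive qualifying degrees involves k + 1 frames.
    """
    best_run = current_run = 0
    for d in change_degrees:
        if d <= threshold:
            current_run += 1
            best_run = max(best_run, current_run)
        else:
            current_run = 0
    return best_run + 1 if best_run else 0

# Degrees of change D_0..D_7 matching the Table 1 example: D_1 and D_4 exceed
# the threshold, and the longest qualifying run is D_5, D_6, D_7 -> frames F_5..F_8.
degrees = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.2, 0.1]
print(max_unresponsive_frames(degrees, threshold=0.5))  # 4
```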
4. Determine the detection value = maximum number × total weight value × I1.
Wherein I1 is an emotion coefficient.
5. If the detection value is greater than the depression threshold, it is determined that depression is detected.
The depression threshold is an empirical value and may be preset, or it may be learned from sample data.
The method of this embodiment acquires audio, facial video and autonomic nerve signals while the user is being questioned; extracts speech rate, intonation and semantics from the audio; determines time features and behavior features based on the facial video and the audio; and performs depression detection on the facial video based on the speech rate, intonation, semantics, time features, behavior features and autonomic nerve signals to obtain a depression detection result, thereby realizing automatic detection of depression.
In order that the above-described aspects may be better understood, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
Furthermore, it should be noted that in the description of the present specification, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to a specific feature, structure, material, or characteristic described in connection with the embodiment or example being included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art upon learning the basic inventive concepts. Therefore, the appended claims should be construed to include preferred embodiments and all such variations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, the present invention should also include such modifications and variations provided that they come within the scope of the following claims and their equivalents.

Claims (6)

1. A computer-implemented method of depression detection based on audio analysis, the method comprising:
S101, acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned;
S102, extracting speech rate, intonation and semantics from the audio;
S103, determining time features and behavior features based on the facial video and the audio;
the audio and facial video are collected simultaneously;
the step S103 includes:
S103-1, segmenting the audio according to the questions to obtain a plurality of audio segments; each audio segment corresponds to one question and its answer;
S103-2, determining the acquisition time period corresponding to each audio segment, and taking the video within that acquisition time period in the facial video as the video segment corresponding to the audio segment;
S103-3, determining the corresponding time features and behavior features according to each audio segment and its corresponding video segment;
each audio segment is composed of three audio sub-segments, wherein the first audio sub-segment is the question audio, the second audio sub-segment is the silence audio after the question, and the third audio sub-segment is the user's reply audio to the question;
the S103-3 comprises:
for any one of the audio segments,
determining the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in said audio segment;
Determining a first video sub-segment corresponding to the first audio sub-segment, a second video sub-segment corresponding to the second audio sub-segment and a third video sub-segment corresponding to the third audio sub-segment in the video segments corresponding to any audio segment;
identifying pupil positions and pupil areas in the first video sub-segment, the second video sub-segment and the third video sub-segment respectively, and determining attention time features and attention behavior features according to the recognition results;
taking the attention time feature as the time feature corresponding to said audio segment;
taking the attention behavior feature as the behavior feature corresponding to said audio segment;
the questioner is positioned right in front of the user, and the eyes of the questioner and the eyes of the user are positioned at the same height;
the determining of the attention time feature and the attention behavior feature according to the recognition results comprises the following steps:
determining, from the pupil positions in the first video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the first video sub-segment as well as the maximum pupil area and the minimum pupil area within the frame segment in which the pupil remains at the center of the eye;
determining, from the pupil positions in the second video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the second video sub-segment;
determining, from the pupil positions in the third video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the third video sub-segment as well as the maximum pupil area and the minimum pupil area within the frame segment in which the pupil remains at the center of the eye;
taking all of the above durations as the attention time features;
taking all of the above pupil areas as the attention behavior features;
S104, performing depression detection on the facial video based on the speech rate, the intonation, the semantics, the time features, the behavior features and the autonomic nerve signals to obtain a depression detection result;
the S104 includes:
S104-1, determining corresponding weight values according to the speech rate, intonation, semantics, time features and behavior features of each audio segment;
S104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment;
S104-3, determining an emotion coefficient based on the autonomic nerve signals;
S104-4, performing depression detection on the facial video based on the emotion coefficient and the total weight value to obtain a depression detection result.
2. The method according to claim 1, wherein S104-1 comprises:
for any one of the audio segments,
determining an emotion label for the third audio sub-segment according to the semantics, and obtaining the corresponding emotion weight value W_i1 according to a preset correspondence between emotion labels and emotion weights;
determining the average speech rate of the third audio sub-segment according to the speech rate, and taking |1 - (the average speech rate / a standard speech rate of the user acquired in advance)| as the speech-rate weight value W_i2;
determining the average intonation of the third audio sub-segment according to the intonation, and taking |1 - (the average intonation / a standard intonation of the user collected in advance)| as the intonation weight value W_i3;
determining a first time weight value according to the durations of the audio sub-segments, determining a second time weight value according to the attention time features, and taking the quotient of the second time weight value and the first time weight value as the time weight value W_i4;
calculating the behavior weight value W_i5 from the attention behavior features.
3. The method according to claim 2, wherein the determining of a first time weight value according to the durations and of a second time weight value according to the attention time features comprises:
the first time weight value is determined by the following formula:
the second time weight value is determined by the following formula:
4. A method according to claim 3, wherein S104-2 comprises:
total weight value of the user = max{W_i1, W_i2, W_i3} × W_i4 × W_i5.
5. The method according to claim 1, wherein S104-3 comprises:
forming the autonomic nerve signals into a signal set; wherein each element in the signal set corresponds to an autonomic nerve signal value acquired at one moment, and the elements in the signal set are arranged in order of acquisition time from earliest to latest;
determining the difference between every two adjacent elements in the signal set to form a signal difference set, wherein each difference is the value of the later element minus the value of the earlier element;
determining the standard deviation σ_Δ of all elements in the signal difference set;
determining the element with the largest value and the element with the smallest value in the signal difference set;
determining the element a_max with the largest value and the element a_min with the smallest value in the signal set, as well as the time t_max corresponding to the largest element and the time t_min corresponding to the smallest element;
determining the emotion coefficient I1 from the above quantities.
6. The method according to claim 1, wherein S104-4 comprises:
identifying micro-expressions of frames in the facial video;
determining the degree of change between the micro expressions of each frame;
determining a maximum number of consecutive frames with a degree of change not greater than a change threshold;
determining a detection value = maximum number × total weight value × I1; wherein I1 is an emotion coefficient;
if the detection value is greater than the depression threshold, it is determined that depression is detected.
CN202111523401.1A 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis Active CN114190942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111523401.1A CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111523401.1A CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Publications (2)

Publication Number Publication Date
CN114190942A CN114190942A (en) 2022-03-18
CN114190942B true CN114190942B (en) 2023-10-03

Family

ID=80653418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111523401.1A Active CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Country Status (1)

Country Link
CN (1) CN114190942B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115715680A (en) * 2022-12-01 2023-02-28 杭州市第七人民医院 Anxiety discrimination method and device based on connective tissue potential

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130050817A (en) * 2011-11-08 2013-05-16 가천대학교 산학협력단 Depression diagnosis method using hrv based on neuro-fuzzy network
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130050817A (en) * 2011-11-08 2013-05-16 가천대학교 산학협력단 Depression diagnosis method using hrv based on neuro-fuzzy network
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network

Also Published As

Publication number Publication date
CN114190942A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
Krejtz et al. Discerning ambient/focal attention with coefficient K
JP6426105B2 (en) System and method for detecting blink suppression as a marker of engagement and sensory stimulus saliency
Dawson et al. Cochlear implants in children, adolescents, and prelinguistically deafened adults: speech perception
Lewis et al. The influence of listener experience and academic training on ratings of nasality
Yeung et al. The new age of play audiometry: prospective validation testing of an iPad-based play audiometer
WO2020119355A1 (en) Method for evaluating multi-modal emotional understanding capability of patient with autism spectrum disorder
Higgins et al. Longitudinal changes in children’s speech and voice physiology after cochlear implantation
Gygi et al. The incongruency advantage for environmental sounds presented in natural auditory scenes.
Siegel et al. A little bit louder now: negative affect increases perceived loudness.
US10959661B2 (en) Quantification of bulbar function
Leung et al. Affective speech prosody perception and production in stroke patients with left-hemispheric damage and healthy controls
Patel Acoustic characteristics of the question-statement contrast in severe dysarthria due to cerebral palsy
Nittrouer et al. Verbal working memory in older adults: The roles of phonological capacities and processing speed
Thibodeaux et al. What do youth tennis athletes say to themselves? Observed and self-reported self-talk on the court
McAllister Byun et al. Direction of attentional focus in biofeedback treatment for/r/misarticulation
Brown et al. Effects of long-term musical training on cortical auditory evoked potentials
Preston et al. Remediating residual rhotic errors with traditional and ultrasound-enhanced treatment: A single-case experimental study
Snow et al. Subject review: Conceptual and methodological challenges in discourse assessment with TBI speakers: Towards an understanding
CN114190942B (en) Method for computer-implemented depression detection based on audio analysis
Xu et al. Developmental phonagnosia: Neural correlates and a behavioral marker
WO2021035067A1 (en) Measuring language proficiency from electroencephelography data
Nittrouer et al. Weighting of acoustic cues to a manner distinction by children with and without hearing loss
Gong et al. Towards an Automated Screening Tool for Developmental Speech and Language Impairments.
Van Ingelghem et al. An auditory temporal processing deficit in children with dyslexia
Wisler et al. The effects of symptom onset location on automatic amyotrophic lateral sclerosis detection using the correlation structure of articulatory movements

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 301, 3rd Floor, No. 8 Sijiqing Road, Haidian District, Beijing, 100195

Applicant after: WOMIN HIGH-NEW SCIENCE & TECHNOLOGY (BEIJING) CO.,LTD.

Address before: 100086 West, 14th floor, block B, Haidian culture and art building, 28a Zhongguancun Street, Haidian District, Beijing

Applicant before: WOMIN HIGH-NEW SCIENCE & TECHNOLOGY (BEIJING) CO.,LTD.

GR01 Patent grant
GR01 Patent grant