CN114190942A - Method for detecting depression based on audio analysis - Google Patents

Method for detecting depression based on audio analysis Download PDF

Info

Publication number
CN114190942A
Authority
CN
China
Prior art keywords
determining
audio
segment
video
weight value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111523401.1A
Other languages
Chinese (zh)
Other versions
CN114190942B (en)
Inventor
齐中祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Womin High New Science & Technology Beijing Co ltd
Original Assignee
Womin High New Science & Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Womin High New Science & Technology Beijing Co ltd
Priority to CN202111523401.1A
Publication of CN114190942A publication Critical patent/CN114190942A/en
Application granted granted Critical
Publication of CN114190942B publication Critical patent/CN114190942B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/165: Evaluating the state of mind, e.g. depression, anxiety

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Engineering & Computer Science (AREA)
  • Educational Technology (AREA)
  • Biomedical Technology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Physics & Mathematics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Developmental Disabilities (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention relates to a depression detection method based on audio analysis, which comprises the following steps: acquiring the audio, facial video and autonomic nervous signals of a user while the user is being questioned; extracting the speech rate, intonation and semantics from the audio; determining time characteristics and behavior characteristics based on the facial video and the audio; and performing depression detection on the facial video based on the speech rate, intonation, semantics, time characteristics, behavior characteristics and autonomic nervous signals to obtain a depression detection result. The method provided by the invention can detect whether the user suffers from depression from the audio, facial video and autonomic nervous signals captured while the user is being questioned, thereby realizing automatic detection of depression.

Description

Method for detecting depression based on audio analysis
Technical Field
The invention relates to the technical field of psychological evaluation, in particular to a depression detection method based on audio analysis.
Background
At present, depression is the second most burdensome human disease after cardiovascular disease, and about 800,000 people die by suicide because of depression every year; the onset of depression has also begun to trend toward younger ages (university students, and even primary and secondary school students). However, the recognition and treatment of depression in China remain inadequate: the recognition rate in hospitals at or above the prefecture level is below 20%, and fewer than 10% of patients receive the relevant drug treatment. Detecting depression is therefore vital to the medical prevention of the disease.
Disclosure of Invention
Technical problem to be solved
In view of the above-mentioned shortcomings and drawbacks of the prior art, the present invention provides a method for depression detection based on audio analysis.
(II) technical scheme
To achieve the above purpose, the main technical scheme adopted by the invention is as follows:
a method of depression detection based on audio analysis, the method comprising:
S101, acquiring the audio, facial video and autonomic nervous signals of a user while the user is being questioned;
S102, extracting the speech rate, intonation and semantics from the audio;
s103, determining time characteristics and behavior characteristics based on the facial video and the audio;
and S104, carrying out depression detection on the facial video based on the speech rate, the tone, the semantics, the time characteristic, the behavior characteristic and the autonomic nerve signal to obtain a depression detection result.
Optionally, the audio and facial video are captured simultaneously;
the S103 includes:
s103-1, segmenting the audio according to the question to obtain a plurality of audio segments; each audio segment corresponds to a question and answer of a question;
s103-2, determining a collection time period corresponding to each audio segment, and taking the video in the collection time period as a video segment corresponding to each audio segment in the face video;
s103-3, determining corresponding time characteristics and behavior characteristics according to each audio segment and the corresponding video segment.
Optionally, each audio segment is composed of 3 audio sub-segments, where a first audio sub-segment is a question audio, a second audio sub-segment is a silent audio after a question is asked, and a third audio sub-segment is a reply audio of a user according to the question;
the S103-3 comprises:
for any audio segment,
determining the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in said audio segment;
In the video segment corresponding to any audio segment, determining a first video sub-segment corresponding to the first audio sub-segment, a second video sub-segment corresponding to the second audio sub-segment and a third video sub-segment corresponding to the third audio sub-segment;
respectively identifying pupil positions and pupil areas in the first video subsegment, the second video subsegment and the third video subsegment, and determining attention time characteristics and attention behavior characteristics according to identification results;
taking the durations of the three audio sub-segments and the attention time characteristic as the time characteristic corresponding to said audio segment;
and taking the attention behavior characteristic as the behavior characteristic corresponding to said audio segment.
Optionally, the questioner is located directly in front of the user, and the eyes of the questioner and the eyes of the user are located at the same height;
the determining of the attention time characteristic and the attention behavior characteristic according to the recognition result comprises the following steps:
determining, based on the pupil positions in the first video sub-segment, the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a first other position that is not the eye center; determining the maximum pupil area and the minimum pupil area in the first video sub-segment; and determining the maximum pupil area and the minimum pupil area within the maximum-duration period during which the pupil position stays continuously at the center of the eye;
determining, based on the pupil positions in the second video sub-segment, the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a second other position that is not the eye center; and determining the maximum pupil area and the minimum pupil area in the second video sub-segment;
determining, based on the pupil positions in the third video sub-segment, the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a third other position that is not the eye center; determining the maximum pupil area and the minimum pupil area in the third video sub-segment; and determining the maximum pupil area and the minimum pupil area within the maximum-duration period during which the pupil position stays continuously at the center of the eye;
taking all of the above maximum durations as attention time characteristics;
taking all of the above pupil areas as attention behavior characteristics.
Optionally, the S104 includes:
s104-1, determining a corresponding weight value according to the speed, tone, semantics, time characteristics and behavior characteristics of each audio segment;
s104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment;
s104-3, determining an emotion coefficient based on the autonomic nerve signals;
and S104-4, carrying out depression detection on the facial video based on the emotion coefficients and the total weight value to obtain a depression detection result.
Optionally, the S104-1 includes:
for any audio segment,
determining the emotion label of the third audio sub-segment according to the semantics, and obtaining the corresponding emotion weight value W_i1 according to the preset correspondence between emotion labels and emotion weights;
determining the average speech rate of the third audio sub-segment according to the speech rate, and determining |1 - average speech rate / the user's pre-collected standard speech rate| as the speech-rate weight value W_i2;
determining the average intonation of the third audio sub-segment according to the intonation, and determining |1 - average intonation / the user's pre-collected standard intonation| as the intonation weight value W_i3;
determining a first time weight value according to the durations of the second and third audio sub-segments, determining a second time weight value according to the attention time characteristics, and determining the quotient of the second time weight value and the first time weight value as the time weight value W_i4;
calculating the behavior weight value W_i5 from the attention behavior characteristics according to a preset formula (given as an image in the original publication).
Optionally, the said according to
Figure BDA0003408603640000043
Determining a first time weight value and a second weight value according to the attention time characteristics, wherein the determining comprises the following steps:
determining a first time weight value by:
Figure BDA0003408603640000044
determining a second temporal weight value by:
Figure BDA0003408603640000045
optionally, the S104-2 includes:
the total weight value of the user = max{W_i1, W_i2, W_i3} * W_i4 * W_i5.
Optionally, the S104-3 includes:
forming the autonomic nervous signals into a signal set; each element in the signal set corresponds to the autonomic nervous signal value acquired at one moment, and the elements in the signal set are arranged in chronological order from the earliest acquisition moment to the latest;
determining the difference between every two adjacent elements in the signal set to form a signal difference set, wherein each difference is the value of the later element minus the value of the earlier element;
determining the standard deviation σ_Δ of all elements in the signal difference set;
determining the largest element (denoted Δ_max) and the smallest element (denoted Δ_min) in the signal difference set;
determining the element a_max with the largest value in the signal set, the element a_min with the smallest value, the moment t_max corresponding to the largest element and the moment t_min corresponding to the smallest element;
determining the emotion coefficient from σ_Δ, Δ_max, Δ_min, a_max, a_min, t_max and t_min according to a preset formula (given as an image in the original publication).
Optionally, the S104-4 includes:
identifying the micro-expression of each frame in the facial video;
determining the degree of change between the micro-expressions of adjacent frames;
determining the maximum number of consecutive frames for which the degree of change is not greater than a change threshold;
determining a detection value as the maximum number * the total weight value * I1, wherein I1 is the emotion coefficient;
determining that depression is detected if the detection value is greater than a depression threshold.
(III) advantageous effects
The audio, facial video and autonomic nervous signals of a user are acquired while the user is being questioned; the speech rate, intonation and semantics are extracted from the audio; time characteristics and behavior characteristics are determined based on the facial video and the audio; and depression detection is performed on the facial video based on the speech rate, intonation, semantics, time characteristics, behavior characteristics and autonomic nervous signals to obtain a depression detection result. The method provided by the invention can detect whether the user suffers from depression from the audio, facial video and autonomic nervous signals captured while the user is being questioned, thereby realizing automatic detection of depression.
Drawings
Fig. 1 is a flowchart illustrating a method for depression detection based on audio analysis according to an embodiment of the present invention.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
At present, depression is the second most burdensome human disease after cardiovascular disease, and about 800,000 people die by suicide because of depression every year; the onset of depression has also begun to trend toward younger ages (university students, and even primary and secondary school students). However, the recognition and treatment of depression in China remain inadequate: the recognition rate in hospitals at or above the prefecture level is below 20%, and fewer than 10% of patients receive the relevant drug treatment. Detecting depression is therefore vital to the medical prevention of the disease.
Based on this, the invention provides a method for depression detection based on audio analysis, the method comprising: acquiring the audio, facial video and autonomic nervous signals of a user while the user is being questioned; extracting the speech rate, intonation and semantics from the audio; determining time characteristics and behavior characteristics based on the facial video and the audio; and performing depression detection on the facial video based on the speech rate, intonation, semantics, time characteristics, behavior characteristics and autonomic nervous signals to obtain a depression detection result, thereby realizing automatic detection of depression.
In a specific implementation, questions are put to the user, and the method shown in fig. 1 is used to detect whether the user shows a tendency toward depression, so as to determine whether the user suffers from depression.
Referring to fig. 1, the implementation process of the method for detecting depression based on audio analysis provided in this embodiment is as follows:
S101, acquiring the audio, facial video and autonomic nervous signals of the user while the user is being questioned.
In step S101, questions are put to the user one at a time; after the user answers a question, the next question is asked, and so on until all the questions have been answered.
During questioning, the questioner is located directly in front of the user, and the questioner's eyes are on the same horizontal plane (i.e., at the same height) as the user's eyes.
Throughout the questioning process, the user's audio, facial video and autonomic nervous signals are acquired in real time.
That is, the audio, the facial video and the autonomic nervous signals are acquired simultaneously.
S102, extracting the speed, tone and semantics of the voice frequency.
The step can be realized by adopting the existing speech speed, tone and semantic extraction scheme, and the detailed explanation is not provided here.
S103, determining time characteristics and behavior characteristics based on the face video and the audio.
Specifically:
s103-1, segmenting the audio according to the question to obtain a plurality of audio segments.
Each audio segment corresponds to a question and answer to a question.
That is, the audio is segmented by question and corresponding response, each question and corresponding response corresponding to an audio segment. There are as many questions as there are audio segments.
For any audio segment, because it corresponds to a question and its answer, it includes 3 stages of contents, namely, the content of the question stage, the content of the user thinking the silent stage after the question, and the content of the user answering stage. Namely, each audio segment is composed of 3 audio sub-segments, wherein the first audio sub-segment is a question audio, the second audio sub-segment is a silent audio after the question is asked, and the third audio sub-segment is a reply audio of the user according to the question.
S103-2, determining a collection time period corresponding to each audio segment, and taking the video in the collection time period as a corresponding video segment of each audio segment in the face video.
Because the audio segment and the video segment are recorded simultaneously, the step obtains the video segment corresponding to each audio segment.
That is, one audio segment and its corresponding video segment are audio and video for the question-and-answer process of the same question.
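As an illustration of this audio-to-video alignment, the sketch below slices out the video frames that fall inside one audio segment's acquisition time window. It is a minimal, non-authoritative example: the patent does not prescribe an implementation, and the frame-record structure, frame rate and function name are assumptions.

from typing import Dict, List, Tuple

def video_segment_for_audio(frames: List[Dict],
                            audio_span: Tuple[float, float],
                            fps: float) -> List[Dict]:
    """Return the video frames whose timestamps fall inside the
    acquisition time window [start, end] of one audio segment.
    `frames` holds per-frame records captured simultaneously with the
    audio; frame k is assumed to carry the timestamp k / fps."""
    start, end = audio_span
    first = int(start * fps)
    last = int(end * fps)
    return frames[first:last + 1]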
S103-3, determining corresponding time characteristics and behavior characteristics according to each audio segment and the corresponding video segment.
Specifically, for any audio segment (e.g., audio segment i of the i-th question, which corresponds to video segment i),
1. Determining the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in the audio segment.
That is, determining the duration of the questioning audio, the duration of the silent audio and the duration of the reply audio in audio segment i.
2. In the video segment corresponding to any audio segment, a first video sub-segment corresponding to the first audio sub-segment, a second video sub-segment corresponding to the second audio sub-segment and a third video sub-segment corresponding to the third audio sub-segment are determined.
Namely, of the video segment i, a first video sub-segment corresponding to the questioning audio (i.e., video in the questioning stage), a second video sub-segment corresponding to the silent audio (i.e., video in the silent stage), and a third video sub-segment corresponding to the reply audio (i.e., video in the reply stage).
By performing these steps, the audio segment and the video segment of each question-answering process are obtained.
Each question-answering process is divided into a question stage, a silent thinking stage and a reply stage.
From one audio segment, 3 audio sub-segments are obtained, corresponding respectively to the question stage, the silent thinking stage and the reply stage. The first audio sub-segment and the first video sub-segment thus cover the question stage, the second audio sub-segment and the second video sub-segment cover the silent thinking stage, and the third audio sub-segment and the third video sub-segment cover the reply stage.
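To make this three-stage structure concrete, here is a minimal sketch of one question-answer segment and its three durations; the dataclass fields and the example timestamps are illustrative assumptions, not part of the patent.

from dataclasses import dataclass

@dataclass
class QASegment:
    """One question-answer audio segment, described by the boundary
    timestamps (in seconds) of its three sub-segments."""
    question_start: float   # questioner starts speaking
    question_end: float     # questioner stops; silence begins
    reply_start: float      # user starts speaking
    reply_end: float        # user stops speaking

    def durations(self) -> tuple:
        """Durations of the question, silence and reply sub-segments."""
        t_question = self.question_end - self.question_start
        t_silence = self.reply_start - self.question_end
        t_reply = self.reply_end - self.reply_start
        return t_question, t_silence, t_reply

# Example: a 4 s question, 2.5 s of silent thinking, then a 6 s reply.
seg = QASegment(0.0, 4.0, 6.5, 12.5)
print(seg.durations())   # (4.0, 2.5, 6.0)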
3. Identifying the pupil position and pupil area in the first video sub-segment, the second video sub-segment and the third video sub-segment respectively, and determining the attention time characteristics and attention behavior characteristics from the recognition results.
Existing schemes are used to identify the pupil position and pupil area; they are not described in detail here.
The process of determining the attention time characteristic and the attention behavior characteristic according to the recognition result may be:
1) Based on the pupil positions in the first video sub-segment, determine the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a first other position that is not the eye center; determine the maximum pupil area and the minimum pupil area in the first video sub-segment; and determine the maximum pupil area and the minimum pupil area within the maximum-duration period during which the pupil position stays continuously at the center of the eye.
That is to say:
The pupil position and pupil area in each frame of the first video sub-segment are determined.
The pupil positions of consecutive frames are compared: if they differ, the pupil position has changed between the two frames; if they are the same, it has not. By comparing the pupil positions of all frames in the first video sub-segment, all frame sections with an unchanged pupil position are obtained. Among these, the sections in which the pupil position is at the center of the eye (i.e., the user is looking directly at the questioner) are identified, and the section containing the largest number of frames is selected; its duration is the maximum eye-center gaze duration of the first video sub-segment, and the maximum and minimum pupil areas within it are recorded.
Among the frame sections in which the pupil position is not at the center of the eye, the section containing the largest number of frames is selected; its duration is the maximum non-center gaze duration of the first video sub-segment.
The maximum and minimum pupil areas over all frames of the first video sub-segment are also recorded.
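The run-finding logic described above can be sketched as follows. This is only an illustrative reading of the text: the per-frame record format, the tolerance used to decide whether the pupil position changed, and the two caller-supplied predicates are assumptions, since the patent defers pupil detection and position comparison to existing schemes.

def gaze_features(frames, fps, is_center, same_position, tol=2.0):
    """From per-frame (pupil_position, pupil_area) records, compute:
    the duration of the longest run of frames with the pupil at the eye
    center plus the max/min pupil area inside that run, the duration of
    the longest run at one unchanged non-center position, and the
    max/min pupil area over all frames.
    `is_center(pos)` and `same_position(p, q, tol)` are caller-supplied
    predicates standing in for the existing pupil-analysis scheme."""
    # Split the frame sequence into runs of unchanged pupil position.
    runs, current = [], [0]
    for k in range(1, len(frames)):
        if same_position(frames[k][0], frames[k - 1][0], tol):
            current.append(k)
        else:
            runs.append(current)
            current = [k]
    runs.append(current)

    center_runs = [r for r in runs if is_center(frames[r[0]][0])]
    other_runs = [r for r in runs if not is_center(frames[r[0]][0])]
    best_center = max(center_runs, key=len, default=[])
    best_other = max(other_runs, key=len, default=[])

    areas = [area for _, area in frames]
    center_areas = [frames[k][1] for k in best_center] or [0.0]
    return {
        "t_center": len(best_center) / fps,   # longest direct-gaze run
        "t_other": len(best_other) / fps,     # longest fixed non-center run
        "area_max": max(areas),
        "area_min": min(areas),
        "center_area_max": max(center_areas),
        "center_area_min": min(center_areas),
    }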
2) Based on the pupil positions in the second video sub-segment, determine the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a second other position that is not the eye center; and determine the maximum pupil area and the minimum pupil area in the second video sub-segment.
That is, the pupil position and pupil area in each frame of the second video sub-segment are determined, and the pupil positions of consecutive frames are compared as before to obtain all frame sections with an unchanged pupil position. Among the sections in which the pupil position is at the center of the eye (i.e., the user is looking directly at the questioner), the section containing the largest number of frames gives the maximum eye-center gaze duration of the second video sub-segment, together with the maximum and minimum pupil areas within that section. Among the sections in which the pupil position is not at the center of the eye, the section containing the largest number of frames gives the maximum non-center gaze duration.
3) Based on the pupil positions in the third video sub-segment, determine the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a third other position that is not the eye center; determine the maximum pupil area and the minimum pupil area in the third video sub-segment; and determine the maximum pupil area and the minimum pupil area within the maximum-duration period during which the pupil position stays continuously at the center of the eye.
That is, the pupil position and pupil area in each frame of the third video sub-segment are determined, and the pupil positions of consecutive frames are compared as before to obtain all frame sections with an unchanged pupil position. Among the sections in which the pupil position is at the center of the eye, the section containing the largest number of frames gives the maximum eye-center gaze duration of the third video sub-segment, together with the maximum and minimum pupil areas within that section. Among the sections in which the pupil position is not at the center of the eye, the section containing the largest number of frames gives the maximum non-center gaze duration. The maximum and minimum pupil areas over all frames of the third video sub-segment are also recorded.
4) All of the maximum gaze durations obtained above for the three video sub-segments are taken as attention time characteristics.
5) All of the pupil-area values obtained above are taken as attention behavior characteristics.
4. The durations of the three audio sub-segments, together with the attention time characteristics, are taken as the time characteristics corresponding to the audio segment.
That is, the time characteristics of audio segment i are its question, silence and reply durations plus its attention time characteristics.
5. The attention behavior characteristics are taken as the behavior characteristics corresponding to the audio segment.
That is, the behavior characteristics of audio segment i are its attention behavior characteristics (the pupil-area values above).
And S104, carrying out depression detection on the facial video based on the speed of speech, the tone of speech, the semantics, the time characteristic, the behavior characteristic and the autonomic nerve signals to obtain a depression detection result.
Specifically:
and S104-1, determining a corresponding weight value according to the speech speed, the tone, the semantics, the time characteristic and the behavior characteristic of each audio segment.
For any audio segment, the weighted value is calculated as follows:
1. Determining the emotion label of the third audio sub-segment according to the semantics, and obtaining the corresponding emotion weight value W_i1 according to the preset correspondence between emotion labels and emotion weights.
The correspondence between emotion labels and emotion weights is preset and can be an empirical value.
Semantics are important for judging depression: if the user's speech is filled with negative energy, the likelihood of depression increases. This step therefore uses the emotion weight value W_i1 to reflect, at the semantic level, the possibility of depression.
2. Determining the average speech rate of the third audio sub-segment according to the speech rate, and determining |1 - average speech rate / the user's pre-collected standard speech rate| as the speech-rate weight value W_i2.
The standard speech rate is collected in advance by analyzing the user's speech in a natural, unselfconscious state; that is, the user's speech rate in the natural state is taken as the standard speech rate.
Here | | denotes the absolute value.
W_i2 = |1 - average speech rate of the third audio sub-segment / standard speech rate|.
Speech rate is important for judging depression: a sudden departure from the user's normal speech rate indicates emotional fluctuation. A faster speech rate suggests agitation; a slower one suggests an unwillingness to answer seriously, a failure to respond, or other negative emotions. The more significant the change in mood (whether agitation or slowing), the greater the likelihood of depression. This step therefore uses the speech-rate weight value W_i2 to reflect, at the emotional level, the possibility of depression.
3. Determining the average intonation of the third audio sub-segment according to the intonation, and determining |1 - average intonation / the user's pre-collected standard intonation| as the intonation weight value W_i3.
The standard intonation is likewise collected from the user's speech in a natural state; that is, the user's intonation in the natural state is taken as the standard intonation.
W_i3 = |1 - average intonation of the third audio sub-segment / standard intonation|.
Intonation is important for judging depression: a sudden change in the user's intonation indicates emotional fluctuation. Rising intonation suggests agitation, while falling intonation suggests dejection. The more significant the change in mood (whether agitation or dejection), the greater the likelihood of depression. This step therefore uses the intonation weight value W_i3 to reflect, at the emotional level, the possibility of depression.
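Both W_i2 and W_i3 are relative deviations of the reply from the user's pre-collected baseline. A minimal sketch of that computation follows; the function name and the example numbers are illustrative assumptions.

def deviation_weight(average_value: float, standard_value: float) -> float:
    """|1 - average / standard|: the relative deviation of the observed
    average (speech rate or intonation of the reply audio) from the
    user's pre-collected baseline value."""
    return abs(1.0 - average_value / standard_value)

# Example: baseline speech rate 4.0 syllables/s, reply average 2.8 syllables/s
# -> W_i2 = |1 - 2.8 / 4.0| = 0.3; intonation expressed as a ratio to baseline.
w_i2 = deviation_weight(2.8, 4.0)
w_i3 = deviation_weight(1.15, 1.0)
print(round(w_i2, 3), round(w_i3, 3))   # 0.3 0.15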
4. Determining a first time weight value according to the durations of the second and third audio sub-segments (the silence duration and the reply duration), determining a second time weight value according to the attention time characteristics, and determining the quotient of the second time weight value and the first time weight value as the time weight value W_i4; that is, W_i4 = second time weight value / first time weight value. The specific formulas of the two time weight values are given as images in the original publication.
The first time weight value characterizes the relation between the silence duration and the reply duration: the longer the silence, the more the user is unwilling to answer, does not know how to answer, or shows other serious negative emotions such as apathy, and the greater the possibility of depression.
The second time weight value characterizes how long the user keeps gazing at the questioner during the question-answering process: the larger the value, the more serious the negative emotions such as unwillingness to answer, not knowing how to answer, or apathy, and the greater the possibility of depression.
The time weight value W_i4 can therefore reflect, at the temporal level, the possibility of depression.
5. Calculating the behavior weight value W_i5 from the attention behavior characteristics; the specific formula is given as an image in the original publication.
The behavior weight value W_i5 characterizes the change in pupil area during the question-answering process and during the periods of sustained gaze at the questioner: the larger the change, the more serious the negative emotions such as unwillingness to answer, not knowing how to answer, or apathy, and the greater the possibility of depression.
The behavior weight value W_i5 can therefore reflect, at the behavioral level, the possibility of depression.
And S104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment.
For example, the total weight value of the user = max{W_i1, W_i2, W_i3} * W_i4 * W_i5.
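A sketch of combining the per-segment weights exactly as the product above is written; W_i1, W_i4 and W_i5 appear only as pre-computed inputs because their own formulas are given as images in the patent, the example numbers are invented, and how the per-segment values are aggregated across all questions is not spelled out in the text.

def segment_weight(w_i1: float, w_i2: float, w_i3: float,
                   w_i4: float, w_i5: float) -> float:
    """Weight contribution of one audio segment, as the formula above is
    written: max{W_i1, W_i2, W_i3} * W_i4 * W_i5."""
    return max(w_i1, w_i2, w_i3) * w_i4 * w_i5

print(segment_weight(0.6, 0.3, 0.15, 1.2, 0.8))   # 0.576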
And S104-3, determining the emotion coefficient based on the autonomic nervous signals.
Specifically:
1. the autonomic nerve signals are formed into a signal set.
Each element in the signal set corresponds to the autonomic nervous signal value acquired at one moment, and the elements in the signal set are arranged in chronological order from the earliest acquisition moment to the latest.
For example, the signal set S contains 5 elements: S = {S_0, S_1, S_2, S_3, S_4}.
2. Determining the difference between every two adjacent elements in the signal set to form a signal difference set.
Each difference is the value of the later element minus the value of the earlier element.
For example, the signal difference set ΔS contains 4 elements: ΔS = {S_1 - S_0, S_2 - S_1, S_3 - S_2, S_4 - S_3}.
If S_1 - S_0 is denoted a_0, S_2 - S_1 is denoted a_1, S_3 - S_2 is denoted a_2, and S_4 - S_3 is denoted a_3, then ΔS = {a_0, a_1, a_2, a_3}.
That is, any element a_j of the set ΔS has the value a_j = S_(j+1) - S_j.
3. Determining the standard deviation σ_Δ of all elements in the signal difference set.
That is, determining the standard deviation σ_Δ of all elements in the set ΔS.
For example, if ΔS = {a_0, a_1, a_2, a_3}, then σ_Δ is the standard deviation of a_0, a_1, a_2 and a_3.
4. Determining the largest element in the signal difference set (denoted Δ_max) and the smallest element (denoted Δ_min).
That is, determining Δ_max = max{ΔS} and Δ_min = min{ΔS}, where max{} is the maximum function and min{} is the minimum function.
5. Determining the element a_max with the largest value in the signal set, the element a_min with the smallest value, the acquisition moment t_max corresponding to the largest element, and the acquisition moment t_min corresponding to the smallest element.
That is, determining a_max = max{S_0, S_1, S_2, S_3, S_4} and a_min = min{S_0, S_1, S_2, S_3, S_4}.
a_max corresponds to acquisition moment t_max, and a_min corresponds to acquisition moment t_min.
6. Determining the emotion coefficient I1 from σ_Δ, Δ_max, Δ_min, a_max, a_min, t_max and t_min; the specific formula is given as an image in the original publication.
The clinical manifestations of depression include a persistently bad mood and an inability to feel happy: the mood stays low and dull for a long time, progressing from initial gloominess to eventual deep sorrow; the patient feels that every day drags on, is pessimistic and avoidant, and may even have suicidal ideation and behavior.
Patients with depression do not actively interact emotionally with the outside world; that is, they respond more slowly to external stimuli. The autonomic nervous signal value reflects the user's response to the current emotional stimulus, so the longer the interval between the maximum response and the minimum response (i.e., the value of t_max - t_min), the more likely depression is present. Likewise, the smaller the difference between the maximum response and the minimum response over the same period (i.e., the value of a_max - a_min), the less sensitive the user is to external stimuli and the more likely depression is present.
In addition, patients with depression may also over-react to stimuli. σ_Δ characterizes the dispersion of the changes in the autonomic nervous signal values between successive moments: the larger σ_Δ, the more pronounced the mood swings and the more likely depression is present.
Δ_max - Δ_min characterizes the gap between the maximum and minimum degrees of change; a larger gap indicates a more drastic reaction and a higher probability of depression.
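The statistics named above can be computed directly from the time-ordered signal values; the NumPy sketch below derives them. Because the formula that finally combines them into I1 is given only as an image in the patent, the function returns the raw statistics instead of guessing at that combination; the example sampling times are assumptions.

import numpy as np

def emotion_signal_stats(values, times):
    """Compute sigma_delta, delta_max, delta_min, a_max, a_min, t_max and
    t_min from autonomic nervous signal values ordered by acquisition time."""
    s = np.asarray(values, dtype=float)
    t = np.asarray(times, dtype=float)
    diffs = np.diff(s)                     # signal difference set: S_(j+1) - S_j
    return {
        "sigma_delta": float(np.std(diffs)),
        "delta_max": float(diffs.max()),
        "delta_min": float(diffs.min()),
        "a_max": float(s.max()),
        "a_min": float(s.min()),
        "t_max": float(t[np.argmax(s)]),   # moment of the strongest response
        "t_min": float(t[np.argmin(s)]),   # moment of the weakest response
    }

# Example: five samples taken one second apart.
stats = emotion_signal_stats([0.52, 0.61, 0.58, 0.49, 0.55],
                             [0.0, 1.0, 2.0, 3.0, 4.0])
print(stats["a_max"], stats["t_max"])   # 0.61 1.0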
And S104-4, carrying out depression detection on the face video based on the emotional coefficients and the total weight value to obtain a depression detection result.
Specifically:
1. micro-expressions of frames in the facial video are identified.
The existing micro expression recognition scheme is adopted in the step, and details are not repeated here.
2. And determining the change degree between the micro-expressions of each frame.
The existing micro-expression analysis method is also adopted, and the expression change between the previous frame and the next frame is determined by the method.
The degree of change can be measured in various ways: for example, as the number of micro-expression feature points that changed; as the number of changed feature points divided by the total number of feature points; or as the average displacement of the changed feature points, where the displacement of a feature point is the difference between its positions in the two consecutive frames.
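As an illustration of the second option above (changed feature points divided by the total), here is a small sketch; the landmark arrays and the movement tolerance are assumptions, since the patent defers micro-expression recognition to existing schemes.

import numpy as np

def change_degree(prev_points, curr_points, move_tol=1.5):
    """Fraction of micro-expression feature points that moved between two
    frames: a point counts as changed if it moved more than `move_tol`
    pixels from the previous frame to the current one."""
    prev = np.asarray(prev_points, dtype=float)   # shape (N, 2) landmark coordinates
    curr = np.asarray(curr_points, dtype=float)   # shape (N, 2)
    displacement = np.linalg.norm(curr - prev, axis=1)
    return float(np.count_nonzero(displacement > move_tol)) / len(prev)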
3. A maximum number of consecutive frames for which the degree of change is not greater than the change threshold is determined.
Wherein the variation threshold is an empirical value and can be set in advance. Or training through sample data.
In addition, the consecutive frames include all frames involved in a degree of change that is not greater than the change threshold; that is, since each degree of change is derived from the difference between two frames, both of those frames are counted.
In this step, the variation of two adjacent frames is calculated in sequence. And determining the relationship between each degree of change and the change threshold value respectively.
For example, consider the data summarized in Table 1 (the original table is given as an image; the relation of each degree of change to the change threshold is reproduced below):
TABLE 1
Degree of change   Frames compared   Not greater than change threshold
D0                 F0, F1            yes
D1                 F1, F2            no
D2                 F2, F3            yes
D3                 F3, F4            yes
D4                 F4, F5            no
D5                 F5, F6            yes
D6                 F6, F7            yes
D7                 F7, F8            yes
In the data shown in Table 1, there are 3 sections of consecutive frames whose degree of change is not greater than the change threshold: the first section is D0, corresponding to frames F0 and F1; the second is D2 and D3, corresponding to frames F2, F3 and F4; and the third is D5, D6 and D7, corresponding to frames F5, F6, F7 and F8.
The maximum number is therefore 4 (i.e., frames F5, F6, F7 and F8).
The maximum number of consecutive frames characterizes the longest stretch during which the user's expression shows no response. Since the frames are ordered in time and the video has a fixed frame rate, this number corresponds to the maximum time for which the user shows no response to an emotional stimulus; the longer this time, the greater the likelihood of depression.
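The counting rule of Table 1 (a run of k qualifying degrees of change spans k + 1 frames) can be sketched as follows; the numeric change-degree values are invented purely so that the yes/no pattern of Table 1 is reproduced.

def max_unresponsive_frames(change_degrees, change_threshold):
    """Longest stretch of consecutive frames whose pairwise degree of
    change stays at or below the threshold; a run of k consecutive
    qualifying degrees D_j involves k + 1 frames."""
    best = run = 0
    for d in change_degrees:
        if d <= change_threshold:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best + 1 if best else 0

# Degrees D0..D7 chosen so that D0, D2, D3, D5, D6 and D7 are within the
# threshold while D1 and D4 exceed it, matching the Table 1 example.
degrees = [0.1, 0.6, 0.2, 0.15, 0.7, 0.1, 0.05, 0.2]
print(max_unresponsive_frames(degrees, change_threshold=0.3))   # 4 (frames F5-F8)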
4. Determining the detection value as the maximum number * the total weight value * I1, where I1 is the emotion coefficient.
5. If the detection value is greater than the depression threshold, it is determined that depression is detected.
The depression threshold is an empirical value that may be set in advance or learned from sample data.
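The final combination above is read here as a product of the maximum number, the total weight value and the emotion coefficient; this is an assumed reading, not a confirmed formula, and the threshold and example numbers are illustrative.

def detect_depression(max_frames: int, total_weight: float,
                      emotion_coefficient: float,
                      depression_threshold: float) -> bool:
    """Assumed reading of the final step: detection value =
    maximum number of unresponsive frames * total weight value * I1,
    compared against an empirically chosen depression threshold."""
    detection_value = max_frames * total_weight * emotion_coefficient
    return detection_value > depression_threshold

print(detect_depression(4, 0.576, 1.3, depression_threshold=2.5))   # True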
The method of this embodiment acquires the audio, facial video and autonomic nervous signals of the user while the user is being questioned; extracts the speech rate, intonation and semantics from the audio; determines time characteristics and behavior characteristics based on the facial video and the audio; and performs depression detection on the facial video based on the speech rate, intonation, semantics, time characteristics, behavior characteristics and autonomic nervous signals to obtain a depression detection result, thereby realizing automatic detection of depression.
In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions.
Furthermore, it should be noted that in the description of the present specification, the description of the term "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims (10)

1. A method of depression detection based on audio analysis, the method comprising:
S101, acquiring the audio, facial video and autonomic nervous signals of a user while the user is being questioned;
S102, extracting the speech rate, intonation and semantics from the audio;
s103, determining time characteristics and behavior characteristics based on the facial video and the audio;
and S104, carrying out depression detection on the facial video based on the speech rate, the tone, the semantics, the time characteristic, the behavior characteristic and the autonomic nerve signal to obtain a depression detection result.
2. The method of claim 1, wherein the audio and facial video are captured simultaneously;
the S103 includes:
s103-1, segmenting the audio according to the question to obtain a plurality of audio segments; each audio segment corresponds to a question and answer of a question;
s103-2, determining a collection time period corresponding to each audio segment, and taking the video in the collection time period as a video segment corresponding to each audio segment in the face video;
s103-3, determining corresponding time characteristics and behavior characteristics according to each audio segment and the corresponding video segment.
3. The method of claim 2, wherein each audio segment is composed of 3 audio sub-segments, wherein a first audio sub-segment is a question audio, a second audio sub-segment is a silent audio after a question, and a third audio sub-segment is a reply audio of a user according to the question;
the S103-3 comprises:
for any audio segment,
determining the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in said audio segment;
In the video segment corresponding to any audio segment, determining a first video sub-segment corresponding to the first audio sub-segment, a second video sub-segment corresponding to the second audio sub-segment and a third video sub-segment corresponding to the third audio sub-segment;
respectively identifying pupil positions and pupil areas in the first video subsegment, the second video subsegment and the third video subsegment, and determining attention time characteristics and attention behavior characteristics according to identification results;
taking the durations of the three audio sub-segments and the attention time characteristic as the time characteristic corresponding to said audio segment;
and taking the attention behavior characteristic as the behavior characteristic corresponding to said audio segment.
4. The method of claim 3, wherein the questioner is located directly in front of the user and the eyes of the questioner are at the same height as the eyes of the user;
the determining of the attention time characteristic and the attention behavior characteristic according to the recognition result comprises the following steps:
determining, based on the pupil positions in the first video sub-segment, the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a first other position that is not the eye center; determining the maximum pupil area and the minimum pupil area in the first video sub-segment; and determining the maximum pupil area and the minimum pupil area within the maximum-duration period during which the pupil position stays continuously at the center of the eye;
determining, based on the pupil positions in the second video sub-segment, the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a second other position that is not the eye center; and determining the maximum pupil area and the minimum pupil area in the second video sub-segment;
determining, based on the pupil positions in the third video sub-segment, the maximum duration for which the pupil position stays continuously at the center of the eye and the maximum duration for which the pupil position stays continuously at a third other position that is not the eye center; determining the maximum pupil area and the minimum pupil area in the third video sub-segment; and determining the maximum pupil area and the minimum pupil area within the maximum-duration period during which the pupil position stays continuously at the center of the eye;
taking all of the above maximum durations as attention time characteristics;
taking all of the above pupil areas as attention behavior characteristics.
5. The method according to claim 4, wherein the S104 comprises:
s104-1, determining a corresponding weight value according to the speed, tone, semantics, time characteristics and behavior characteristics of each audio segment;
s104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment;
s104-3, determining an emotion coefficient based on the autonomic nerve signals;
and S104-4, carrying out depression detection on the facial video based on the emotion coefficients and the total weight value to obtain a depression detection result.
6. The method of claim 5, wherein the S104-1 comprises:
for any audio segment,
determining the emotion label of the third audio sub-segment according to the semantics, and obtaining the corresponding emotion weight value W_i1 according to the preset correspondence between emotion labels and emotion weights;
determining the average speech rate of the third audio sub-segment according to the speech rate, and determining |1 - average speech rate / the user's pre-collected standard speech rate| as the speech-rate weight value W_i2;
determining the average intonation of the third audio sub-segment according to the intonation, and determining |1 - average intonation / the user's pre-collected standard intonation| as the intonation weight value W_i3;
determining a first time weight value according to the durations of the second and third audio sub-segments, determining a second time weight value according to the attention time characteristics, and determining the quotient of the second time weight value and the first time weight value as the time weight value W_i4;
calculating the behavior weight value W_i5 from the attention behavior characteristics according to a preset formula (given as an image in the original publication).
7. The method of claim 6, wherein determining the first time weight value according to the durations of the second and third audio sub-segments, and determining the second time weight value according to the attention time characteristics, comprises:
determining the first time weight value by a first preset formula and the second time weight value by a second preset formula (both formulas are given as images in the original publication).
8. the method of claim 7, wherein the S104-2 comprises:
the total weight value of the user = max{W_i1, W_i2, W_i3} * W_i4 * W_i5.
9. The method according to claim 5, wherein the S104-3 comprises:
forming the autonomic nervous signals into a signal set; each element in the signal set corresponds to the autonomic nervous signal value acquired at one moment, and the elements in the signal set are arranged in chronological order from the earliest acquisition moment to the latest;
determining the difference between every two adjacent elements in the signal set to form a signal difference set, wherein each difference is the value of the later element minus the value of the earlier element;
determining the standard deviation σ_Δ of all elements in the signal difference set;
determining the largest element (denoted Δ_max) and the smallest element (denoted Δ_min) in the signal difference set;
determining the element a_max with the largest value in the signal set, the element a_min with the smallest value, the moment t_max corresponding to the largest element and the moment t_min corresponding to the smallest element;
determining the emotion coefficient from σ_Δ, Δ_max, Δ_min, a_max, a_min, t_max and t_min according to a preset formula (given as an image in the original publication).
10. The method according to claim 5, wherein the S104-4 comprises:
identifying the micro-expression of each frame in the facial video;
determining the degree of change between the micro-expressions of adjacent frames;
determining the maximum number of consecutive frames for which the degree of change is not greater than a change threshold;
determining a detection value as the maximum number * the total weight value * I1, wherein I1 is the emotion coefficient;
determining that depression is detected if the detection value is greater than a depression threshold.
CN202111523401.1A 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis Active CN114190942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111523401.1A CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111523401.1A CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Publications (2)

Publication Number Publication Date
CN114190942A true CN114190942A (en) 2022-03-18
CN114190942B CN114190942B (en) 2023-10-03

Family

ID=80653418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111523401.1A Active CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Country Status (1)

Country Link
CN (1) CN114190942B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115715680A (en) * 2022-12-01 2023-02-28 杭州市第七人民医院 Anxiety discrimination method and device based on connective tissue potential

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101366348B1 (en) * 2011-11-08 2014-02-24 가천대학교 산학협력단 Depression diagnosis method using hrv based on neuro-fuzzy network
CN112818892B (en) * 2021-02-10 2023-04-07 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115715680A (en) * 2022-12-01 2023-02-28 杭州市第七人民医院 Anxiety discrimination method and device based on connective tissue potential

Also Published As

Publication number Publication date
CN114190942B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
Bull Posture & Gesture: Posture & Gesture
Van Petten et al. Time course of word identification and semantic integration in spoken language.
US20200365275A1 (en) System and method for assessing physiological state
Feldstein et al. A chronography of conversation: In defense of an objective approach
Higgins et al. Longitudinal changes in children’s speech and voice physiology after cochlear implantation
Gibbon Lingual activity in two speech‐disordered children's attempts to produce velar and alveolar stop consonants: evidence from electropalatographic (EPG) data
Zhao et al. Automatic detection of expressed emotion in Parkinson's disease
CN105022929A (en) Cognition accuracy analysis method for personality trait value test
Nittrouer et al. Verbal working memory in older adults: The roles of phonological capacities and processing speed
JP2006071936A (en) Dialogue agent
Oshrat et al. Speech prosody as a biosignal for physical pain detection
Shakow et al. The use of the tautophone (“verbal summator”) as an auditory apperceptive test for the study of personality
WO2021035067A1 (en) Measuring language proficiency from electroencephelography data
CN114190942A (en) Method for detecting depression based on audio analysis
Gong et al. Towards an Automated Screening Tool for Developmental Speech and Language Impairments.
Fell et al. visiBabble for reinforcement of early vocalization
Hudenko et al. Listeners prefer the laughs of children with autism to those of typically developing children
Nittrouer et al. Weighting of acoustic cues to a manner distinction by children with and without hearing loss
Papesh et al. Your effort is showing! Pupil dilation reveals memory heuristics
CN114209322A (en) Method for detecting depression based on video analysis
Begum et al. Survey on Artificial Intelligence-based Depression Detection using Clinical Interview Data
Feng et al. I-vector Based Within Speaker Voice Quality Identification on connected speech
Kamiloğlu et al. Not All Laughs Are the Same: Tickling Induces a Unique Type of Spontaneous Laughter
Freeman et al. Using simplified regulated breathing with an adolescent stutterer: Application of effective intervention in a residential context
CN114300120A (en) Method for depression detection based on text analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 301, 3rd Floor, No. 8 Sijiqing Road, Haidian District, Beijing, 100195

Applicant after: WOMIN HIGH-NEW SCIENCE & TECHNOLOGY (BEIJING) CO.,LTD.

Address before: 100086 West, 14th floor, block B, Haidian culture and art building, 28a Zhongguancun Street, Haidian District, Beijing

Applicant before: WOMIN HIGH-NEW SCIENCE & TECHNOLOGY (BEIJING) CO.,LTD.

GR01 Patent grant