CN114190942B - Method for computer-implemented depression detection based on audio analysis - Google Patents


Info

Publication number
CN114190942B
Authority
CN
China
Prior art keywords
audio
determining
segment
video
time
Prior art date
Legal status
Active
Application number
CN202111523401.1A
Other languages
Chinese (zh)
Other versions
CN114190942A (en)
Inventor
齐中祥
Current Assignee
Womin High New Science & Technology Beijing Co ltd
Original Assignee
Womin High New Science & Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Womin High New Science & Technology Beijing Co ltd
Priority to CN202111523401.1A
Publication of CN114190942A
Application granted
Publication of CN114190942B

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B 5/165 Evaluating the state of mind, e.g. depression, anxiety

Abstract

The invention relates to a depression detection method based on audio analysis, which comprises the following steps: acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned; extracting speech rate, intonation and semantics from the audio; determining time features and behavior features based on the facial video and the audio; and performing depression detection on the facial video based on the speech rate, intonation, semantics, time features, behavior features and autonomic nerve signals to obtain a depression detection result. The method detects whether the user suffers from depression from the audio, facial video and autonomic nerve signals recorded while the user is questioned, thereby realizing automatic detection of depression.

Description

Method for computer-implemented depression detection based on audio analysis
Technical Field
The invention relates to the technical field of psychological assessment, and in particular to a computer-implemented method for detecting depression based on audio analysis.
Background
Currently, depression is the second most prevalent human disease, second only to cardiovascular disease, and its onset has begun to trend toward younger ages. Detection of depression is therefore critical to the medical treatment and prevention of depression.
Disclosure of Invention
(I) Technical problem to be solved
In view of the above-described shortcomings and drawbacks of the prior art, the present invention provides a computer-implemented method of depression detection based on audio analysis.
(II) Technical solution
In order to achieve the above purpose, the main technical solution adopted by the invention is as follows:
A method of depression detection based on audio analysis, the method comprising:
S101, acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned;
S102, extracting speech rate, intonation and semantics from the audio;
S103, determining time features and behavior features based on the facial video and the audio;
S104, performing depression detection on the facial video based on the speech rate, the intonation, the semantics, the time features, the behavior features and the autonomic nerve signals to obtain a depression detection result.
Optionally, the audio and facial video are collected simultaneously;
the step S103 includes:
S103-1, segmenting the audio according to the questions to obtain a plurality of audio segments; each audio segment corresponds to one question and its answer;
S103-2, determining the acquisition time period corresponding to each audio segment, and taking the video within that acquisition time period in the facial video as the video segment corresponding to the audio segment;
S103-3, determining the corresponding time features and behavior features according to each audio segment and its corresponding video segment.
Optionally, each audio segment is composed of three audio sub-segments, wherein the first audio sub-segment is the question audio, the second audio sub-segment is the silence audio after the question, and the third audio sub-segment is the user's reply audio to the question;
the S103-3 comprises:
for any one of the audio segments,
determining the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in said audio segment;
determining a first video sub-segment corresponding to the first audio sub-segment, a second video sub-segment corresponding to the second audio sub-segment and a third video sub-segment corresponding to the third audio sub-segment in the video segment corresponding to said audio segment;
identifying pupil positions and pupil areas in the first video sub-segment, the second video sub-segment and the third video sub-segment respectively, and determining attention time features and attention behavior features according to the recognition results;
taking the attention time feature as the time feature corresponding to said audio segment;
and taking the attention behavior feature as the behavior feature corresponding to said audio segment.
Optionally, the questioner is located directly in front of the user, and the eyes of the questioner and the eyes of the user are located at the same height;
the determining of the attention time feature and the attention behavior feature according to the recognition results comprises the following steps:
determining, from the pupil positions in the first video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the first video sub-segment as well as the maximum pupil area and the minimum pupil area within the frame segment in which the pupil remains at the center of the eye;
determining, from the pupil positions in the second video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the second video sub-segment;
determining, from the pupil positions in the third video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the third video sub-segment as well as the maximum pupil area and the minimum pupil area within the frame segment in which the pupil remains at the center of the eye;
taking all of the above durations as the attention time features;
and taking all of the above pupil areas as the attention behavior features.
Optionally, the S104 includes:
S104-1, determining corresponding weight values according to the speech rate, intonation, semantics, time features and behavior features of each audio segment;
S104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment;
S104-3, determining an emotion coefficient based on the autonomic nerve signals;
S104-4, performing depression detection on the facial video based on the emotion coefficient and the total weight value to obtain a depression detection result.
Optionally, the step S104-1 includes:
for any one of the audio segments,
determining an emotion label for the third audio sub-segment according to the semantics, and obtaining the corresponding emotion weight value W_i1 according to a preset correspondence between emotion labels and emotion weights;
determining the average speech rate of the third audio sub-segment according to the speech rate, and taking |1 - (the average speech rate / a standard speech rate of the user acquired in advance)| as the speech-rate weight value W_i2;
determining the average intonation of the third audio sub-segment according to the intonation, and taking |1 - (the average intonation / a standard intonation of the user collected in advance)| as the intonation weight value W_i3;
determining a first time weight value according to the durations of the audio sub-segments, determining a second time weight value according to the attention time features, and taking the quotient of the second time weight value and the first time weight value as the time weight value W_i4;
calculating the behavior weight value W_i5 from the attention behavior features.
Optionally, the determining of a first time weight value according to the durations and of a second time weight value according to the attention time features comprises:
the first time weight value is determined by the following formula:
the second time weight value is determined by the following formula:
optionally, the step S104-2 includes:
total weight value of the user = max { W i1 ,W i2 ,W i3 }*W i4 *W i5
Optionally, the step S104-3 includes:
forming the autonomic nerve signals into a signal set; wherein each element in the signal set corresponds to an autonomic nerve signal value acquired at one moment, and the elements in the signal set are arranged in order of acquisition time from earliest to latest;
determining the difference between every two adjacent elements in the signal set to form a signal difference set, wherein each difference is the value of the later element minus the value of the earlier element;
determining the standard deviation σ_Δ of all elements in the signal difference set;
determining the element with the largest value and the element with the smallest value in the signal difference set;
determining the element a_max with the largest value and the element a_min with the smallest value in the signal set, as well as the time t_max corresponding to the largest element and the time t_min corresponding to the smallest element;
determining the emotion coefficient I1 from the above quantities.
Optionally, the step S104-4 includes:
identifying micro-expressions of frames in the facial video;
determining the degree of change between the micro expressions of each frame;
determining a maximum number of consecutive frames with a degree of change not greater than a change threshold;
determining a detection value = maximum number × total weight value × I1; wherein I1 is an emotion coefficient;
if the detection value is greater than the depression threshold, it is determined that depression is detected.
(III) Beneficial effects
The method acquires audio, facial video and autonomic nerve signals of a user while the user is being questioned; extracts speech rate, intonation and semantics from the audio; determines time features and behavior features based on the facial video and the audio; and performs depression detection on the facial video based on the speech rate, intonation, semantics, time features, behavior features and autonomic nerve signals to obtain a depression detection result. The method provided by the invention thus detects whether the user suffers from depression from the audio, facial video and autonomic nerve signals recorded while the user is questioned, thereby realizing automatic detection of depression.
Drawings
Fig. 1 is a flowchart of a method for detecting depression based on audio analysis according to an embodiment of the present invention.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.
Currently, depression is the second most prevalent human disease, second only to cardiovascular disease; about 800,000 people die by suicide each year due to depression, and its onset is trending toward younger ages (university students, and even primary and secondary school students). However, medical treatment and prevention of depression in China still suffers from a low recognition rate: the recognition rate in hospitals at or above the prefecture-city level is less than 20%, and less than 10% of patients receive relevant drug treatment. Detection of depression is therefore crucial to the medical treatment and prevention of depression.
Based on this, the present invention provides a method of depression detection based on audio analysis, the method comprising: acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned; extracting speech rate, intonation and semantics from the audio; determining time features and behavior features based on the facial video and the audio; and performing depression detection on the facial video based on the speech rate, intonation, semantics, time features, behavior features and autonomic nerve signals, so that a depression detection result is obtained and automatic depression detection is realized.
Specifically, questions are put to the user, and the user is checked for a tendency toward depression by the method shown in Fig. 1 to determine whether the user suffers from depression.
Referring to Fig. 1, the implementation procedure of the method for detecting depression based on audio analysis provided in this embodiment is as follows:
S101, acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned.
In step S101, a question is put to the user; after the user answers it, the next question is asked, and so on until all the questions have been answered.
In asking questions, the questioner is located directly in front of the user, and the eyes of the questioner are on the same level (i.e., at the same height) as the eyes of the user.
During the questioning process, the audio, facial video and autonomic nerve signals of the user are acquired in real time.
That is, audio, facial video and autonomic nerve signals are acquired simultaneously.
S102, extracting speech rate, intonation and semantics from the audio.
This step uses existing speech-rate, intonation and semantic extraction techniques and is not described in detail here.
S103, determining time features and behavior features based on the facial video and the audio.
Specifically:
s103-1, segmenting the audio according to the questions to obtain a plurality of audio segments.
Each audio segment corresponds to one question and its answer.
That is, the audio is split by question and its corresponding answer, with one audio segment per question; there are as many audio segments as there are questions.
Since each audio segment corresponds to one question and its answer, it contains three phases: the question phase, the silence phase in which the user thinks after the question, and the reply phase in which the user answers. That is, each audio segment is composed of three audio sub-segments: the first audio sub-segment is the question audio, the second audio sub-segment is the silence audio after the question, and the third audio sub-segment is the user's reply audio to the question.
S103-2, determining the acquisition time period corresponding to each audio segment, and taking the video within that acquisition time period in the facial video as the video segment corresponding to the audio segment.
Because the audio and the video are recorded simultaneously, the video segment corresponding to each audio segment is obtained in this step.
That is, an audio segment and its corresponding video segment are the audio and video of the question-answering process for the same question.
S103-3, determining the corresponding time features and behavior features according to each audio segment and its corresponding video segment.
Specifically, for any audio segment (say audio segment i, which corresponds to video segment i):
1. Determine the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in the audio segment.
That is, determine the duration of the question audio, the duration of the silence audio and the duration of the reply audio in audio segment i.
2. Determine, in the video segment corresponding to the audio segment, the first video sub-segment corresponding to the first audio sub-segment, the second video sub-segment corresponding to the second audio sub-segment and the third video sub-segment corresponding to the third audio sub-segment.
That is, determine, in video segment i, the first video sub-segment corresponding to the question audio (i.e., the video of the question phase), the second video sub-segment corresponding to the silence audio (i.e., the video of the silence phase), and the third video sub-segment corresponding to the reply audio (i.e., the video of the reply phase).
By executing the above steps, the audio segment and the video segment of each question-answering process are obtained.
Each question-answering process is divided into a question phase, a thinking-silence phase and a reply phase.
Within each audio segment, three audio sub-segments are obtained, corresponding to the question phase, the thinking-silence phase and the reply phase respectively. At this point, the first audio sub-segment and first video sub-segment of the question phase, the second audio sub-segment and second video sub-segment of the thinking-silence phase, and the third audio sub-segment and third video sub-segment of the reply phase have all been obtained.
3. Identify the pupil positions and pupil areas in the first video sub-segment, the second video sub-segment and the third video sub-segment respectively, and determine the attention time features and attention behavior features according to the recognition results.
Pupil position and pupil area are identified using existing techniques and are not described here.
The attention time features and attention behavior features may be determined from the recognition results as follows:
1) From the pupil positions in the first video sub-segment, determine the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye; also determine the maximum and minimum pupil areas within the frame segment in which the pupil remains at the center of the eye, and the maximum and minimum pupil areas over the whole first video sub-segment.
That is,
the pupil position and pupil area in each frame of the first video sub-segment are determined.
The pupil positions of consecutive frames are compared to decide whether the pupil position changes between them. By comparing the pupil positions of the frames in the first video sub-segment, all frame segments in which the pupil position stays unchanged are obtained. Among these, the frame segments in which the pupil position is at the center of the eye (i.e., the frame segments in which the user looks directly at the questioner) are identified, and the one containing the largest number of frames is found; its duration is the maximum duration at the eye center, and the maximum and minimum pupil areas within that frame segment are recorded.
Among the frame segments in which the pupil position is not at the center of the eye, the one with the largest number of frames is found; its duration is the maximum duration at a non-center position.
The largest pupil area over all frames of the first video sub-segment is taken as the maximum pupil area of the sub-segment, and the smallest as its minimum pupil area.
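The following sketch shows one way this per-sub-segment computation could be implemented, given per-frame pupil detections from an existing pupil-recognition step. It is a hedged example rather than the patent's own code: the frame representation (a quantised position label such as "center" plus a pupil area) and all function and field names are assumptions. The same routine can be applied to the second and third video sub-segments.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Tuple

@dataclass
class PupilFrame:
    position: str   # quantised pupil position, e.g. "center" when looking at the questioner
    area: float     # pupil area detected in this frame (e.g. in pixels)

def attention_features(frames: List[PupilFrame], fps: float):
    """Compute the per-sub-segment attention quantities described above.

    Returns:
      t_center    : longest time (s) the pupil stays continuously at the eye center
      t_other     : longest time (s) the pupil stays continuously at one and the
                    same non-center position
      area_center : (max, min) pupil area within that longest center-gaze run
      area_all    : (max, min) pupil area over the whole sub-segment
    """
    # Group consecutive frames that share the same pupil position ("runs").
    runs: List[Tuple[str, List[PupilFrame]]] = [
        (pos, list(group)) for pos, group in groupby(frames, key=lambda f: f.position)
    ]

    def longest_run(predicate) -> Optional[List[PupilFrame]]:
        candidates = [g for pos, g in runs if predicate(pos)]
        return max(candidates, key=len) if candidates else None

    center_run = longest_run(lambda pos: pos == "center")
    other_run = longest_run(lambda pos: pos != "center")

    t_center = len(center_run) / fps if center_run else 0.0
    t_other = len(other_run) / fps if other_run else 0.0

    area_center = (
        (max(f.area for f in center_run), min(f.area for f in center_run))
        if center_run else (0.0, 0.0)
    )
    area_all = (max(f.area for f in frames), min(f.area for f in frames))
    return t_center, t_other, area_center, area_all
```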
2) From the pupil positions in the second video sub-segment, determine the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determine the maximum and minimum pupil areas in the second video sub-segment.
Pupil position and pupil area in each frame of the second video sub-segment are determined.
The pupil positions of consecutive frames are compared to decide whether the pupil position changes between them. By comparing the pupil positions of the frames in the second video sub-segment, all frame segments in which the pupil position stays unchanged are obtained. Among these, the frame segments in which the pupil position is at the center of the eye (i.e., the frame segments in which the user looks directly at the questioner) are identified, and the one containing the largest number of frames is found; its duration is the maximum duration at the eye center, and the maximum and minimum pupil areas within that frame segment are recorded.
Among the frame segments in which the pupil position is not at the center of the eye, the one with the largest number of frames is found; its duration is the maximum duration at a non-center position.
3) From the pupil positions in the third video sub-segment, determine the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye; also determine the maximum and minimum pupil areas within the frame segment in which the pupil remains at the center of the eye, and the maximum and minimum pupil areas over the whole third video sub-segment.
That is,
the pupil position and pupil area in each frame of the third video sub-segment are determined.
The pupil positions of consecutive frames are compared to decide whether the pupil position changes between them. By comparing the pupil positions of the frames in the third video sub-segment, all frame segments in which the pupil position stays unchanged are obtained. Among these, the frame segments in which the pupil position is at the center of the eye (i.e., the frame segments in which the user looks directly at the questioner) are identified, and the one containing the largest number of frames is found; its duration is the maximum duration at the eye center, and the maximum and minimum pupil areas within that frame segment are recorded.
Among the frame segments in which the pupil position is not at the center of the eye, the one with the largest number of frames is found; its duration is the maximum duration at a non-center position.
The largest pupil area over all frames of the third video sub-segment is taken as the maximum pupil area of the sub-segment, and the smallest as its minimum pupil area.
4) Take all of the above maximum durations as the attention time features.
5) Take all of the above maximum and minimum pupil areas as the attention behavior features.
4. Take the attention time features as the time features corresponding to the audio segment.
That is, the time features of audio segment i are the maximum durations determined above.
5. Take the attention behavior features as the behavior features corresponding to the audio segment.
That is, the behavior features of audio segment i are the pupil areas determined above.
S104, performing depression detection on the facial video based on the speech rate, the intonation, the semantics, the time features, the behavior features and the autonomic nerve signals to obtain a depression detection result.
Specifically:
S104-1, determine the corresponding weight values according to the speech rate, intonation, semantics, time features and behavior features of each audio segment.
For any audio segment, the weight value is calculated as follows:
1. Determine the emotion label of the third audio sub-segment according to the semantics, and obtain the corresponding emotion weight value W_i1 from the preset correspondence between emotion labels and emotion weights.
The correspondence between emotion labels and emotion weights is preset and may be an empirical value.
Semantics are important for judging depression: if the user's speech is full of negativity, the likelihood of depression increases. This step therefore uses the emotion weight value W_i1 to reflect the likelihood of depression at the semantic level.
2. Determine the average speech rate of the third audio sub-segment from the speech rate, and take |1 - (the average speech rate / the standard speech rate of the user acquired in advance)| as the speech-rate weight value W_i2.
The standard speech rate is acquired in advance by analyzing the user's speech while the user is unaware of being tested; that is, the user's speech rate in a natural state is taken as the standard speech rate.
Here |·| denotes the absolute value.
W_i2 = |1 - average speech rate of the third audio sub-segment / standard speech rate|.
Speech rate is important for judging depression. If the user's speech rate changes suddenly compared with the normal speech rate, the user's emotion is fluctuating: a faster speech rate indicates agitation, while a slower one indicates careless answering, unwillingness to answer, or other negative emotions. A significant emotional change in either direction increases the likelihood of depression. This step therefore uses the speech-rate weight value W_i2 to reflect the likelihood of depression at the emotional level.
3. Determine the average intonation of the third audio sub-segment from the intonation, and take |1 - (the average intonation / the standard intonation of the user collected in advance)| as the intonation weight value W_i3.
The standard intonation is collected in advance by analyzing the user's speech while the user is unaware of being tested; that is, the user's intonation in a natural state is taken as the standard intonation.
W_i3 = |1 - average intonation of the third audio sub-segment / standard intonation|.
Intonation is important for judging depression. If the user's intonation changes suddenly, the user's emotion is fluctuating: a raised intonation indicates agitation, while a lowered one indicates dejection. A significant emotional change in either direction increases the likelihood of depression. This step therefore uses the intonation weight value W_i3 to reflect the likelihood of depression at the emotional level.
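As a small numeric illustration of W_i2 and W_i3, the sketch below evaluates the |1 - average/standard| form described above. How the averages are measured (e.g., syllables per second for speech rate, mean pitch for intonation) is an assumption of this example; the text only specifies the quotient form.

```python
def deviation_weight(average: float, standard: float) -> float:
    """|1 - average/standard|: 0 when the user behaves exactly as in the natural
    state, growing as the measurement deviates in either direction."""
    if standard <= 0:
        raise ValueError("standard (baseline) value must be positive")
    return abs(1.0 - average / standard)

# Speech-rate weight W_i2: e.g. the user normally speaks ~4.0 syllables/s but
# answered this question at 2.8 syllables/s (slowed down).
w_i2 = deviation_weight(average=2.8, standard=4.0)      # |1 - 0.7| = 0.3

# Intonation weight W_i3: e.g. mean pitch 230 Hz vs. a 200 Hz baseline (raised).
w_i3 = deviation_weight(average=230.0, standard=200.0)  # |1 - 1.15| = 0.15

print(round(w_i2, 2), round(w_i3, 2))  # 0.3 0.15
```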
4. Determine a first time weight value according to the durations of the audio sub-segments, determine a second time weight value according to the attention time features, and take the quotient of the second time weight value and the first time weight value as the time weight value W_i4.
For example, time weight value W_i4 = second time weight value / first time weight value.
The first time weight value characterizes the relationship between the silence duration and the reply duration: the longer the silence lasts, the more it indicates that the user does not want to answer, does not know how to answer, or has other serious negative emotions, and the likelihood of depression increases.
The second time weight value characterizes how long the user keeps staring at the questioner during the question-answering process: the larger the value, the more it indicates that the user does not want to answer, does not know how to answer, or has other serious negative emotions, and the likelihood of depression increases.
The time weight value W_i4 thus reflects the likelihood of depression at the temporal level.
5. Calculate the behavior weight value W_i5 from the attention behavior features.
The behavior weight value W_i5 characterizes the change of the pupil area during questioning and the pupil area while the user keeps staring at the questioner: the larger the value, the more it indicates that the user does not want to answer, does not know how to answer, or has other serious negative emotions, and the likelihood of depression increases.
The behavior weight value W_i5 thus reflects the likelihood of depression at the behavioral level.
S104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment.
For example, total weight value of the user = max{W_i1, W_i2, W_i3} × W_i4 × W_i5.
S104-3, determining the emotion coefficient based on the autonomic nerve signals.
Specifically:
1. Form the autonomic nerve signals into a signal set.
Each element in the signal set corresponds to an autonomic nerve signal value acquired at one moment, and the elements are arranged in order of acquisition time from earliest to latest.
For example, the signal set S contains 5 elements: S = {S_0, S_1, S_2, S_3, S_4}.
2. Determine the difference between every two adjacent elements in the signal set to form the signal difference set.
Each difference is the value of the later element minus the value of the earlier element.
For example, the signal difference set ΔS contains 4 elements: ΔS = {S_1 - S_0, S_2 - S_1, S_3 - S_2, S_4 - S_3}.
If S_1 - S_0 is denoted a_0, S_2 - S_1 is denoted a_1, S_3 - S_2 is denoted a_2 and S_4 - S_3 is denoted a_3, then ΔS = {a_0, a_1, a_2, a_3}.
That is, any element a_j of ΔS has the value S_{j+1} - S_j, i.e., a_j = S_{j+1} - S_j.
3. Determine the standard deviation σ_Δ of all elements in the signal difference set.
That is, determine the standard deviation σ_Δ of all elements in ΔS.
For example, if ΔS = {a_0, a_1, a_2, a_3}, then σ_Δ is the standard deviation of a_0, a_1, a_2 and a_3 about their mean.
4. Determine the element with the largest value and the element with the smallest value in the signal difference set.
That is, determine max{a_0, a_1, a_2, a_3} and min{a_0, a_1, a_2, a_3},
where max{} is the maximum function and min{} is the minimum function.
5. Determine the element a_max with the largest value and the element a_min with the smallest value in the signal set, as well as the time t_max corresponding to the largest element and the time t_min corresponding to the smallest element.
That is, determine a_max = max{S_0, S_1, S_2, S_3, S_4} and a_min = min{S_0, S_1, S_2, S_3, S_4}.
The acquisition time corresponding to a_max is t_max, and the acquisition time corresponding to a_min is t_min.
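The intermediate quantities of steps 1-5 can be computed directly from a list of (time, value) samples, as in the sketch below. It covers only the quantities named in the text (the difference set, σ_Δ, the extreme values and their acquisition times); the closed-form combination of these quantities into the emotion coefficient I1 is not reproduced in the text above, so the sketch deliberately stops at the intermediates. Function and variable names are illustrative.

```python
import statistics
from typing import List, Tuple

def autonomic_statistics(samples: List[Tuple[float, float]]):
    """Compute the quantities of steps 1-5 from (time, signal value) samples
    ordered from earliest to latest acquisition time."""
    times = [t for t, _ in samples]
    values = [v for _, v in samples]                      # the signal set S

    # Step 2: differences between adjacent elements (later minus earlier).
    diffs = [b - a for a, b in zip(values, values[1:])]   # the set ΔS

    # Step 3: standard deviation of the signal difference set
    # (population std-dev; the text only says "standard deviation").
    sigma_delta = statistics.pstdev(diffs)

    # Step 4: extreme elements of the difference set.
    diff_max, diff_min = max(diffs), min(diffs)

    # Step 5: extreme signal values and the times at which they were acquired.
    a_max, a_min = max(values), min(values)
    t_max = times[values.index(a_max)]
    t_min = times[values.index(a_min)]

    return sigma_delta, diff_max, diff_min, a_max, a_min, t_max, t_min

# Example with five samples S_0..S_4 acquired one second apart.
stats = autonomic_statistics([(0, 1.0), (1, 1.4), (2, 0.9), (3, 2.1), (4, 1.2)])
print(stats)
```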
6. Determine the emotion coefficient I1 from the above quantities.
Clinically, depression manifests as a persistently low and depressed mood, progressing from initial gloominess to sorrow, pain, pessimism and world-weariness; patients feel that living through each day is a torment, become negative and avoidant, and may ultimately even develop suicide attempts and suicidal behavior.
Patients with depression do not actively interact with the outside world, i.e., they respond weakly to external stimuli. The autonomic nerve signal value reflects the user's response to the current emotional stimulus: the longer the interval between the maximum and the minimum response (i.e., the larger t_max - t_min), the greater the likelihood of depression; and the smaller the difference between the maximum and the minimum response (i.e., the smaller a_max - a_min), the less sensitive the user is to external stimuli and the greater the likelihood of depression.
In addition, patients with depression may also over-react to stimuli. σ_Δ characterizes the dispersion of the changes between successive autonomic nerve signal values: the larger σ_Δ, the stronger the mood swings and the greater the likelihood of depression. The difference between the largest and the smallest element of the signal difference set characterizes how far apart the maximum and minimum degrees of change are: the larger this difference, the more violent the reaction and the greater the likelihood of depression.
S104-4, carrying out depression detection on the facial video based on the emotion coefficient and the total weight value to obtain a depression detection result.
Specifically:
1. Identify the micro-expressions of the frames in the facial video.
Existing micro-expression recognition techniques are used in this step and are not described here.
2. Determine the degree of change between the micro-expressions of adjacent frames.
Existing micro-expression analysis methods are likewise used to determine the expression change between consecutive frames.
The degree of change may be represented in various ways: for example, as the number of micro-expression feature points that changed, as the ratio of the number of changed feature points to the total number of feature points, or as the average displacement of the changed feature points, where the displacement of a feature point is the difference between its positions in the two frames.
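As an illustration of the third representation (average displacement of the changed feature points), the following sketch compares the landmark coordinates of two consecutive frames. The landmark format (a list of (x, y) points from an existing micro-expression or facial-landmark detector) and the small movement tolerance are assumptions of this example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def degree_of_change(prev: List[Point], curr: List[Point], tol: float = 0.5) -> float:
    """Average displacement of the feature points that actually moved.

    A point counts as 'changed' when it moved by more than `tol` pixels between
    the two frames; the degree of change is the mean displacement of those
    points (0.0 if nothing moved).
    """
    if len(prev) != len(curr):
        raise ValueError("both frames must provide the same landmark points")
    displacements = [
        math.dist(p, q) for p, q in zip(prev, curr) if math.dist(p, q) > tol
    ]
    return sum(displacements) / len(displacements) if displacements else 0.0

# Example: three landmarks, only the mouth-corner point (last one) moved.
prev_pts = [(10.0, 10.0), (20.0, 10.0), (15.0, 22.0)]
curr_pts = [(10.1, 10.0), (20.0, 10.1), (15.0, 25.0)]
print(degree_of_change(prev_pts, curr_pts))  # 3.0
```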
3. Determine the maximum number of consecutive frames whose degree of change is not greater than a change threshold.
The change threshold is an empirical value and may be preset, or it may be learned from sample data.
The consecutive frames include all frames involved in degrees of change that are not greater than the change threshold; that is, since each degree of change is computed between two frames, both of those frames count as involved frames.
In this step, the degree of change between adjacent frames is calculated in sequence, and each degree of change is compared with the change threshold.
For example, consider the data shown in Table 1.
TABLE 1
Degree of change    Frames involved    Not greater than change threshold?
D_0                 F_0, F_1           yes
D_1                 F_1, F_2           no
D_2                 F_2, F_3           yes
D_3                 F_3, F_4           yes
D_4                 F_4, F_5           no
D_5                 F_5, F_6           yes
D_6                 F_6, F_7           yes
D_7                 F_7, F_8           yes
In the data shown in Table 1, there are three runs of consecutive frames whose degree of change is not greater than the change threshold: the first is D_0, corresponding to frames F_0 and F_1; the second is D_2 and D_3, corresponding to frames F_2, F_3 and F_4; and the third is D_5, D_6 and D_7, corresponding to frames F_5, F_6, F_7 and F_8.
The maximum number is therefore 4 (i.e., F_5, F_6, F_7 and F_8).
The maximum number of consecutive frames indicates the longest span over which the user shows no facial response. Because the frame rate of the video is fixed, this number of frames corresponds to a length of time, i.e., the maximum time for which the user does not respond while receiving emotional stimuli; the longer this time, the greater the likelihood of depression.
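The following sketch computes this maximum number of consecutive frames from the per-pair degrees of change and reproduces the Table 1 example; the variable names and threshold value are illustrative.

```python
from typing import List

def max_unresponsive_frames(change_degrees: List[float], threshold: float) -> int:
    """Longest run of frames linked by degrees of change <= threshold.

    change_degrees[j] is the degree of change between frame F_j and F_{j+1};
    a run of k consecutive qualifying degrees involves k + 1 frames.
    """
    best_run = current_run = 0
    for d in change_degrees:
        if d <= threshold:
            current_run += 1
            best_run = max(best_run, current_run)
        else:
            current_run = 0
    return best_run + 1 if best_run else 0

# Degrees of change D_0..D_7 matching the Table 1 example: D_1 and D_4 exceed
# the threshold, and the longest qualifying run is D_5, D_6, D_7 -> frames F_5..F_8.
degrees = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.2, 0.1]
print(max_unresponsive_frames(degrees, threshold=0.5))  # 4
```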
4. Determine the detection value = maximum number × total weight value × I1.
Wherein I1 is an emotion coefficient.
5. If the detection value is greater than the depression threshold, it is determined that depression is detected.
The depression threshold is an empirical value and may be preset, or it may be learned from sample data.
The method of this embodiment acquires audio, facial video and autonomic nerve signals while the user is being questioned; extracts speech rate, intonation and semantics from the audio; determines time features and behavior features based on the facial video and the audio; and performs depression detection on the facial video based on the speech rate, intonation, semantics, time features, behavior features and autonomic nerve signals to obtain a depression detection result, thereby realizing automatic detection of depression.
In order that the above-described aspects may be better understood, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
Furthermore, it should be noted that in the description of the present specification, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to a specific feature, structure, material, or characteristic described in connection with the embodiment or example being included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art upon learning the basic inventive concepts. Therefore, the appended claims should be construed to include preferred embodiments and all such variations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, the present invention should also include such modifications and variations provided that they come within the scope of the following claims and their equivalents.

Claims (6)

1. A computer-implemented method of depression detection based on audio analysis, the method comprising:
S101, acquiring audio, facial video and autonomic nerve signals of a user while the user is being questioned;
S102, extracting speech rate, intonation and semantics from the audio;
S103, determining time features and behavior features based on the facial video and the audio;
the audio and facial video are collected simultaneously;
the step S103 includes:
S103-1, segmenting the audio according to the questions to obtain a plurality of audio segments; each audio segment corresponds to one question and its answer;
S103-2, determining the acquisition time period corresponding to each audio segment, and taking the video within that acquisition time period in the facial video as the video segment corresponding to the audio segment;
S103-3, determining the corresponding time features and behavior features according to each audio segment and its corresponding video segment;
each audio segment is composed of three audio sub-segments, wherein the first audio sub-segment is the question audio, the second audio sub-segment is the silence audio after the question, and the third audio sub-segment is the user's reply audio to the question;
the S103-3 comprises:
for any one of the audio segments,
determining the duration of the first audio sub-segment, the duration of the second audio sub-segment and the duration of the third audio sub-segment in said audio segment;
Determining a first video sub-segment corresponding to the first audio sub-segment, a second video sub-segment corresponding to the second audio sub-segment and a third video sub-segment corresponding to the third audio sub-segment in the video segments corresponding to any audio segment;
identifying pupil positions and pupil areas in the first video sub-segment, the second video sub-segment and the third video sub-segment respectively, and determining attention time features and attention behavior features according to the recognition results;
taking the attention time feature as the time feature corresponding to said audio segment;
taking the attention behavior feature as the behavior feature corresponding to said audio segment;
the questioner is positioned right in front of the user, and the eyes of the questioner and the eyes of the user are positioned at the same height;
the determining of the attention time feature and the attention behavior feature according to the recognition results comprises the following steps:
determining, from the pupil positions in the first video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the first video sub-segment as well as the maximum pupil area and the minimum pupil area within the frame segment in which the pupil remains at the center of the eye;
determining, from the pupil positions in the second video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the second video sub-segment;
determining, from the pupil positions in the third video sub-segment, the maximum duration for which the pupil position remains continuously at the center of the eye and the maximum duration for which the pupil position remains continuously at one and the same other position that is not the center of the eye, and determining the maximum pupil area and the minimum pupil area in the third video sub-segment as well as the maximum pupil area and the minimum pupil area within the frame segment in which the pupil remains at the center of the eye;
taking all of the above durations as the attention time features;
taking all of the above pupil areas as the attention behavior features;
S104, performing depression detection on the facial video based on the speech rate, the intonation, the semantics, the time features, the behavior features and the autonomic nerve signals to obtain a depression detection result;
the S104 includes:
S104-1, determining corresponding weight values according to the speech rate, intonation, semantics, time features and behavior features of each audio segment;
S104-2, determining the total weight value of the user according to the weight value corresponding to each audio segment;
S104-3, determining an emotion coefficient based on the autonomic nerve signals;
S104-4, performing depression detection on the facial video based on the emotion coefficient and the total weight value to obtain a depression detection result.
2. The method according to claim 1, wherein S104-1 comprises:
for any one of the audio segments,
determining an emotion label for the third audio sub-segment according to the semantics, and obtaining the corresponding emotion weight value W_i1 according to a preset correspondence between emotion labels and emotion weights;
determining the average speech rate of the third audio sub-segment according to the speech rate, and taking |1 - (the average speech rate / a standard speech rate of the user acquired in advance)| as the speech-rate weight value W_i2;
determining the average intonation of the third audio sub-segment according to the intonation, and taking |1 - (the average intonation / a standard intonation of the user collected in advance)| as the intonation weight value W_i3;
determining a first time weight value according to the durations of the audio sub-segments, determining a second time weight value according to the attention time features, and taking the quotient of the second time weight value and the first time weight value as the time weight value W_i4;
calculating the behavior weight value W_i5 from the attention behavior features.
3. The method according to claim 2, wherein the determining of a first time weight value according to the durations and of a second time weight value according to the attention time features comprises:
the first time weight value is determined by the following formula:
the second time weight value is determined by the following formula:
4. A method according to claim 3, wherein S104-2 comprises:
total weight value of the user = max{W_i1, W_i2, W_i3} × W_i4 × W_i5.
5. The method according to claim 1, wherein S104-3 comprises:
forming the autonomic nerve signals into a signal set; wherein each element in the signal set corresponds to an autonomic nerve signal value acquired at one moment, and the elements in the signal set are arranged in order of acquisition time from earliest to latest;
determining the difference between every two adjacent elements in the signal set to form a signal difference set, wherein each difference is the value of the later element minus the value of the earlier element;
determining the standard deviation σ_Δ of all elements in the signal difference set;
determining the element with the largest value and the element with the smallest value in the signal difference set;
determining the element a_max with the largest value and the element a_min with the smallest value in the signal set, as well as the time t_max corresponding to the largest element and the time t_min corresponding to the smallest element;
determining the emotion coefficient I1 from the above quantities.
6. The method according to claim 1, wherein S104-4 comprises:
identifying micro-expressions of frames in the facial video;
determining the degree of change between the micro expressions of each frame;
determining a maximum number of consecutive frames with a degree of change not greater than a change threshold;
determining a detection value = maximum number × total weight value × I1; wherein I1 is an emotion coefficient;
if the detection value is greater than the depression threshold, it is determined that depression is detected.
CN202111523401.1A 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis Active CN114190942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111523401.1A CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111523401.1A CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Publications (2)

Publication Number Publication Date
CN114190942A CN114190942A (en) 2022-03-18
CN114190942B true CN114190942B (en) 2023-10-03

Family

ID=80653418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111523401.1A Active CN114190942B (en) 2021-12-13 2021-12-13 Method for computer-implemented depression detection based on audio analysis

Country Status (1)

Country Link
CN (1) CN114190942B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115715680A (en) * 2022-12-01 2023-02-28 杭州市第七人民医院 Anxiety discrimination method and device based on connective tissue potential

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130050817A (en) * 2011-11-08 2013-05-16 가천대학교 산학협력단 Depression diagnosis method using hrv based on neuro-fuzzy network
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130050817A (en) * 2011-11-08 2013-05-16 가천대학교 산학협력단 Depression diagnosis method using hrv based on neuro-fuzzy network
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network

Also Published As

Publication number Publication date
CN114190942A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
Krejtz et al. Discerning ambient/focal attention with coefficient K
JP6426105B2 (en) System and method for detecting blink suppression as a marker of engagement and sensory stimulus saliency
Dawson et al. Cochlear implants in children, adolescents, and prelinguistically deafened adults: speech perception
Lewis et al. The influence of listener experience and academic training on ratings of nasality
Yeung et al. The new age of play audiometry: prospective validation testing of an iPad-based play audiometer
WO2020119355A1 (en) Method for evaluating multi-modal emotional understanding capability of patient with autism spectrum disorder
Higgins et al. Longitudinal changes in children’s speech and voice physiology after cochlear implantation
Gygi et al. The incongruency advantage for environmental sounds presented in natural auditory scenes.
Siegel et al. A little bit louder now: negative affect increases perceived loudness.
US10959661B2 (en) Quantification of bulbar function
Leung et al. Affective speech prosody perception and production in stroke patients with left-hemispheric damage and healthy controls
Patel Acoustic characteristics of the question-statement contrast in severe dysarthria due to cerebral palsy
Nittrouer et al. Verbal working memory in older adults: The roles of phonological capacities and processing speed
Thibodeaux et al. What do youth tennis athletes say to themselves? Observed and self-reported self-talk on the court
McAllister Byun et al. Direction of attentional focus in biofeedback treatment for/r/misarticulation
Brown et al. Effects of long-term musical training on cortical auditory evoked potentials
Preston et al. Remediating residual rhotic errors with traditional and ultrasound-enhanced treatment: A single-case experimental study
Snow et al. Subject review: Conceptual and methodological challenges in discourse assessment with TBI speakers: Towards an understanding
CN114190942B (en) Method for computer-implemented depression detection based on audio analysis
Xu et al. Developmental phonagnosia: Neural correlates and a behavioral marker
WO2021035067A1 (en) Measuring language proficiency from electroencephelography data
Nittrouer et al. Weighting of acoustic cues to a manner distinction by children with and without hearing loss
Gong et al. Towards an Automated Screening Tool for Developmental Speech and Language Impairments.
Van Ingelghem et al. An auditory temporal processing deficit in children with dyslexia
Wisler et al. The effects of symptom onset location on automatic amyotrophic lateral sclerosis detection using the correlation structure of articulatory movements

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 301, 3rd Floor, No. 8 Sijiqing Road, Haidian District, Beijing, 100195

Applicant after: WOMIN HIGH-NEW SCIENCE & TECHNOLOGY (BEIJING) CO.,LTD.

Address before: 100086 West, 14th floor, block B, Haidian culture and art building, 28a Zhongguancun Street, Haidian District, Beijing

Applicant before: WOMIN HIGH-NEW SCIENCE & TECHNOLOGY (BEIJING) CO.,LTD.

GR01 Patent grant
GR01 Patent grant