CN109171769A - Voice and facial feature extraction method and system applied to depression detection - Google Patents

Voice and facial feature extraction method and system applied to depression detection Download PDF

Info

Publication number
CN109171769A
CN109171769A
Authority
CN
China
Prior art keywords
data
facial
obtains
voice
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810762032.3A
Other languages
Chinese (zh)
Inventor
郭威彤
杨鸿武
甘振业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University
Priority to CN201810762032.3A
Publication of CN109171769A
Legal status: Pending


Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/165 Evaluating the state of mind, e.g. depression, anxiety

Abstract

The present invention discloses a voice and facial feature extraction method and system applied to depression detection. Feature extraction is performed on audio data according to an energy information method to obtain spectral parameters and acoustic parameters; these parameters are input into a first deep neural network model to obtain speech depth feature data. Static feature extraction is performed on the video images to obtain frame images; the frame images are input into a second deep neural network model to obtain facial feature data. Dynamic feature extraction is performed on the video images to obtain optical flow images; the optical flow images are input into a third deep neural network model to obtain facial motion feature data. The facial feature data and the motion feature data are input into the third deep neural network model to obtain facial depth feature data. The speech depth feature data and the facial depth feature data are input into a fourth neural network model to obtain fused data. The method or system of the present invention can improve the precision of depression screening results and the efficiency of depression detection.

Description

Voice and facial feature extraction method and system applied to depression detection
Technical field
The present invention relates to the field of feature extraction, and in particular to a voice and facial feature extraction method and system applied to depression detection.
Background technique
Because depressive disorder causes enormous social harm and economic loss, scholars and related institutions in many countries have carried out research on it and are actively seeking effective diagnosis and treatment schemes. At present, the differentiation and diagnosis of depression proceed mainly along three lines: 1) diagnosis by subjective means, such as the Hamilton Depression Rating Scale (HAMD), the Beck Depression Inventory (BDI), the Patient Health Questionnaire depression self-rating scale (PHQ-9), together with the subjective judgment of clinicians, which inevitably introduces a degree of subjective bias; 2) reliance on biological information: biotechnologies based on electroencephalography (EEG), functional magnetic resonance imaging (fMRI) and the like have been applied to depression detection; for example, the gamma band of the EEG of depressed subjects shows sustained enhancement, and depressed subjects show increased asymmetry in prefrontal cortex activation levels; 3) use of psychologically relevant behavioral information, identifying depression from abnormal behaviors such as voice, facial expression and body posture. For example, differences in voice attributes can effectively reflect a person's depressive state, and changes in the vocal tract characteristics of depressed patients are related to the physiological signs of depression; the processing of facial expression information is one of the objective indicators of depression detection, as depressed patients have difficulty processing positive emotions but show stronger attention and sensitivity to sad emotions; and body expression is also a very important visual clue for depression detection.
At present, depression recognition from audio-video signals mainly uses conventional methods: feature extraction first, then feature selection, and finally recognition with classification or regression algorithms. 1) Using the audio signal: analysis of the prosodic and acoustic features of speech has found that depressed patients show less prosodic variation than normal subjects; comparative analysis of the formants and spectral features of depressed patients and normal subjects has found that formants, power spectral density, mel-frequency cepstral coefficients (MFCC) and their differences, the Teager energy operator (TEO) and similar features are effective for depression recognition. 2) Using the video signal: video-based depression detection concentrates mainly on facial expression, portraying the face by extracting geometrical features and appearance-based features. Extracting the time series of the edges, corners, coordinates and orientations of the face to portray the variation and intensity of particular emotions has shown that the expressivity of depressed patients is reduced; describing texture changes by extracting facial region features enables classification of depressive face images. 3) Recognizing depression by linearly fusing audio and video features.
The most critical problem in detecting depression from audio and video is feature extraction. However, the audio-video features extracted at present are all hand-designed features, which can have certain nonlinear dependencies among one another, so these features are insufficient to characterize the high-level information of depression audio or video. Moreover, changes in voice and facial expression occur simultaneously and are highly correlated, changes in depressive affective state have no obvious event boundaries, and emotional expression varies from person to person, so simply splicing and concatenating voice and facial expression features loses some important information and degrades the screening results and detection efficiency for depression.
Summary of the invention
The object of the present invention is to provide a voice and facial feature extraction method and system applied to depression detection that improve the precision of depression screening results and the efficiency of depression detection.
To achieve the above object, the present invention provides the following solutions:
A voice and facial feature extraction method applied to depression detection, the method comprising:
randomly selecting a segment of audio-video data;
performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
performing static feature extraction on the video images in the audio-video data to obtain frame images;
inputting the frame images into a second deep neural network model to obtain facial feature data;
performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
Optionally, inputting the spectral parameters and the acoustic parameters into the first deep neural network model to obtain the speech depth feature data of the audio specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
Optionally, inputting the frame images into the second deep neural network model to obtain the facial feature data specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
Optionally, performing dynamic feature extraction on the video images in the audio-video data to obtain the optical flow images specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
Optionally, inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain the facial depth feature data of the video specifically comprises:
connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
To achieve the above object, the present invention further provides the following solutions:
A voice and facial feature extraction system applied to depression detection, the system comprising:
a selection module for randomly selecting a segment of audio-video data;
a first feature extraction module for performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
a speech depth feature data acquisition module for inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
a second feature extraction module for performing static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module for inputting the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module for performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
a facial motion feature data acquisition module for inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
a facial depth feature data acquisition module for inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
a fusion module for inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
Optionally, the speech depth feature data acquisition module specifically comprises:
a first input unit for inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
a second input unit for inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
a third input unit for inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
Optionally, the facial feature data acquisition module specifically comprises:
a facial feature data acquiring unit for inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
Optionally, the third feature extraction module specifically comprises:
an optical flow displacement acquiring unit for performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
an optical flow image acquisition unit for obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
Optionally, the facial depth feature data acquisition module specifically comprises:
a facial overall data acquiring unit for connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
a facial depth feature data acquisition unit for inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
According to the specific embodiments provided, the present invention discloses the following technical effects:
The present invention provides a voice and facial feature extraction method applied to depression detection. By establishing a depression audio-video database and extracting deep-learning-oriented audio-video bimodal fusion features, automatic depression detection under the audio-video bimodality based on deep learning is realized, improving the precision of depression screening results and the efficiency of depression detection.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings required in the embodiments are briefly described below. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative labor.
Fig. 1 is a flowchart of the voice and facial feature extraction method applied to depression detection according to an embodiment of the present invention;
Fig. 2 is a flowchart of the establishment of the emotion database according to an embodiment of the present invention;
Fig. 3 is a flowchart of the construction of the deep model system according to an embodiment of the present invention;
Fig. 4 is a flowchart of audio-video depth feature extraction according to an embodiment of the present invention;
Fig. 5 is a flowchart of the model-based audio-video bimodal fusion according to an embodiment of the present invention;
Fig. 6 is a structural diagram of the voice and facial feature extraction system applied to depression detection according to an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In order to make the above objects, features and advantages of the present invention clearer and more comprehensible, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of the voice and facial feature extraction method applied to depression detection according to an embodiment of the present invention. As shown in Fig. 1, a voice and facial feature extraction method applied to depression detection comprises:
Step 101: randomly selecting a segment of audio-video data, the audio-video data including audio-video data of normal subjects and audio-video data of depressed patients;
Step 102: performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
Step 103: inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
Step 104: performing static feature extraction on the video images in the audio-video data to obtain frame images;
Step 105: inputting the frame images into a second deep neural network model to obtain facial feature data;
Step 106: performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
Step 107: inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
Step 108: inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
Step 109: inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
Step 103 specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
Step 105 specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
Step 106 specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
Step 108 specifically comprises:
connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
Because the data involved in this project are sensitive and raise privacy concerns, some relevant datasets cannot be shared, and most existing datasets were collected from foreign subjects. Therefore, to allow research to continue in the future, the project team established an audio-video emotion database for depression. Fig. 2 is a flowchart of the establishment of the emotion database according to an embodiment of the present invention. The experiment was designed around key factors such as age, gender, depression level, emotion stimulation mode, speech mode and emotional valence, and recorded the audio and video data of subjects under different emotional valences through different emotion and vocalization induction modes. Subjects were recruited from the psychiatric department of a designated hospital. A 2 (subject type: depressed, normal) x 3 (emotional valence: positive, neutral, negative) mixed audio-video experimental paradigm was designed, mainly comprising 4 parts: watching film clips, picture description, text reading and verbal question answering. The aim is to induce, through different emotion induction and different speech modes, the changes in facial expression and voice of depressed patients that are the subject of this research. The present invention collected 300 males and 300 females, with 400 in the depressed group and 200 in the control group, aged between 18 and 55. All experiments were carried out in soundproof rooms of the hospital free of electromagnetic interference. A microphone and sound acquisition equipment collected the audio signal in mono at a sampling rate of 44.1 kHz and a sampling depth of 24 bits. A high-definition camera and a Kinect camera collected the video signal at a frame rate of 30 and a resolution of 800x600. The study required that the composition of the subjects show no statistically significant differences in age, educational background or gender (P > 0.05).
When analyzing the audio in the audio-video data, the present invention receives the speech signal of the speaker, judges silent segments according to energy information, performs feature extraction on the non-silent segments, and extracts the spectral parameters MFCC and the acoustic parameter logF0. The spectral and acoustic parameters are concatenated along the time axis and fed into a deep network as input features. The deep network is a DBN (deep belief network) built by stacking two layers of RBMs (restricted Boltzmann machines). The input features are trained through the DBN to extract a higher-level representation, i.e., high-level features. The high-level features are then fed into an LSTM (long short-term memory) deep network to extract high-level features at long and short durations. The features finally obtained are sent into another DBN stacked from RBMs for training, and the features output by this DBN are the audio-based depth features.
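By way of illustration only, a minimal Python sketch of this audio front end is given below, assuming the librosa library is available; the frame length, energy threshold, MFCC order and pitch search range are illustrative choices that the embodiment does not fix.

import numpy as np
import librosa

def extract_audio_features(wav_path, sr=44100, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr, mono=True)

    # Energy-based silence judgment: keep frames whose short-time energy
    # exceeds a fraction of the mean energy (hypothetical threshold).
    frame_len, hop = 2048, 512
    energy = np.array([np.sum(y[i:i + frame_len] ** 2)
                       for i in range(0, len(y) - frame_len, hop)])
    voiced = energy > 0.1 * energy.mean()

    # Spectral parameters: per-frame MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop)

    # Acoustic parameter: log fundamental frequency (logF0); unvoiced
    # frames are filled with the median voiced pitch before the log.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                            frame_length=frame_len, hop_length=hop)
    logf0 = np.log(np.where(np.isnan(f0), np.nanmedian(f0), f0))

    # Concatenate spectral and acoustic parameters along the time axis
    # and drop the silent frames before feeding the first DBN.
    n = min(mfcc.shape[1], len(logf0), len(voiced))
    feats = np.vstack([mfcc[:, :n], logf0[None, :n]])
    return feats[:, voiced[:n]]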
When analyzing the video in the audio-video data, video analysis and speech analysis are two independent steps. Video analysis is divided into two stages: one is static feature extraction and the other is dynamic feature extraction, and the deep networks used in both are CNNs (convolutional neural networks). In static feature extraction, a single image is taken as input and sent to a pre-trained CNN. The pre-trained CNN is trained in advance on a public dataset and comprises three convolutional layers, two max-pooling layers and two fully connected layers. The original picture is sent into the trained CNN model, and discriminative facial features are output from the network via the backpropagation (BP) algorithm. In dynamic feature extraction, optical flow maps are taken as the input of the deep model, and the output is the features of facial motion changes. The optical flow maps are obtained by computing the optical flow displacements between 10 consecutive frames, using a curvature change method and the gray-value-constancy assumption. Next, the facial features and motion features extracted in the two stages are connected, and two fully connected layers are constructed to jointly fine-tune the concatenated facial features and motion features. The numbers of hidden units of the two fully connected layers decrease layer by layer (512 in the first layer, 256 in the second), and the facial features and motion features are concatenated in each layer. Finally, the output of the fully connected layers is used as the input of an LSTM (long short-term memory) network for training, and the output of the LSTM network is the video-based facial depth features.
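By way of illustration only, a minimal Python sketch of the dynamic-feature input is given below, assuming OpenCV; the dense Farneback flow used here stands in for the curvature change method and gray-value-constancy computation named above, whose exact formulation the embodiment does not spell out.

import cv2
import numpy as np

def stacked_flow(frames):
    # Stack the optical flow displacements over 10 consecutive frames
    # into a single (H, W, 18) tensor serving as the motion-CNN input.
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for f in frames[1:10]:
        nxt = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        # Farneback dense flow: an (H, W, 2) displacement field per frame pair.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev = nxt
    return np.concatenate(flows, axis=2)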
After the depth features of the audio and the depth features of the video have been obtained, a 2-layer DBN (deep belief network) is first trained for each: the input of the audio DBN is the audio depth features and its output is the depression detection result from the audio signal; the input of the video DBN is the video depth features and its output is the depression detection result from the video signal. Then, the two detection results are fed as input signals into another 2-layer DBN for final fusion, and the output of this DBN is the final depression detection result from the audio-video signal.
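By way of illustration only, the model-level fusion stage can be sketched in Python with PyTorch as follows; plain two-layer sigmoid networks stand in for the two-layer DBNs (RBM pre-training is omitted), and all layer widths are illustrative.

import torch
import torch.nn as nn

def two_layer_net(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Sigmoid(),
                         nn.Linear(hidden, out_dim), nn.Sigmoid())

audio_net = two_layer_net(256, 128, 2)   # audio depth features -> audio result
video_net = two_layer_net(512, 128, 2)   # facial depth features -> video result
fusion_net = two_layer_net(4, 16, 2)     # both results -> final fused decision

def fused_prediction(audio_feat, video_feat):
    a = audio_net(audio_feat)            # per-modality detection result
    v = video_net(video_feat)
    return fusion_net(torch.cat([a, v], dim=-1))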
The present invention is built on the distinguishing characteristics of depressed patients in voice and facial expression and on a depression emotion database established with a designed experimental paradigm, and focuses on solving the problems of deep modeling and multimodal fusion of speech features and video features. Voice and facial expression both change over time, and their changes occur synchronously; these factors determine that complex relationships exist between features within the audio signal, within the video signal, and across the audio-video signal. The present invention learns representations over the time and space domains and realizes deep-learning-oriented audio-video depth feature extraction.
In extracting the multimodal audio-video features, the depth features of the voice and the depth features of the facial expression are first extracted from the audio modality and the video modality respectively, and the depth features of the two modalities are then fused to generate a new feature for depression detection. In this process, different deep learning model structures are involved for different modalities, and different feature dimensions exist. At the same time, considering that the audio information and the video information occur simultaneously, feature associations and synergies necessarily exist between the audio modality and the video modality; the present invention therefore exploits these factors to construct multimodal information fusion based on deep models.
In order to realize deep-learning-based audio-video multimodal depression detection, an audio-video signal recognition platform based on deep learning must first be built. In step B of Fig. 1, different deep learning models (RBM, DBN, CNN and LSTM) are first established for the two modalities respectively, and bimodal fusion and recognition are then carried out with an RBM-DBN. Here, the CNN uses the models AlexNet/VGG16 pre-trained on ImageNet as the deep framework; acoustic features are modeled with RBMs and DBNs; the LSTM models long-term and short-term temporal changes and synchronizes the audio and the video; finally, a CNN-LSTM and an RBM-DBN-LSTM are established to extract the video and audio features. Fig. 3 is a flowchart of the construction of the deep model system according to an embodiment of the present invention, and the detailed process is shown in Fig. 3.
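By way of illustration only, the CNN-LSTM video branch can be sketched in Python as follows, assuming torchvision's ImageNet-pretrained VGG16 as the frame encoder (the embodiment names AlexNet/VGG16); the LSTM width is an illustrative choice.

import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)

def video_depth_features(clip):            # clip: (T, 3, 224, 224) frames
    with torch.no_grad():
        per_frame = encoder(clip)          # (T, 512) static frame features
    out, _ = lstm(per_frame.unsqueeze(0))  # integrate along the time axis
    return out[0, -1]                      # (256,) facial depth feature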
Video data contains both spatial and temporal information, so a two-stream CNN feature extraction framework is used to investigate feature extraction in the spatial dimension and the temporal dimension of the video data separately. In spatial feature extraction, the CNN is pre-trained on the ImageNet data, each frame picture of the video is then extracted as the input of the CNN, and the deep structure is modified according to the backpropagation (BP) algorithm and the loss function, so as to extract the depth features of static expression. In temporal feature extraction, the emphasis is on the input of the network: the optical flows of several consecutive frames are stacked as the input of the CNN, and an LSTM is used to integrate the activations of the last CNN layer along the time axis to obtain the motion features of the face. Finally, the features of the spatial dimension and the features of the temporal dimension are fully connected to obtain the high-level features of facial expression based on deep learning. In extracting the audio depth features, the present invention considers not only that the model can generate a high-level representation reflecting the raw speech waveform but also that the model can capture short-term and long-term temporal changes. Therefore, the present invention constructs a serially connected RBM-DBN-LSTM deep model to extract the depth features of speech. In the RBM-DBN-LSTM model, Gibbs sampling and the contrastive divergence (CD) algorithm are used to extract speech high-level features, and the spatial feature information lost along the time axis is supplemented by the LSTM. The whole network is optimized using the binary cross-entropy loss function and stochastic gradient descent (SGD). Fig. 4 is a flowchart of audio-video depth feature extraction according to an embodiment of the present invention, as shown in Fig. 4.
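By way of illustration only, a single contrastive-divergence (CD-1) update for one RBM layer, the building block of the DBNs above, can be sketched in Python as follows; the learning rate and the use of CD-1 rather than CD-k are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b_v, b_h, v0, lr=0.01):
    # Positive phase: hidden probabilities and samples given the data batch v0.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step (reconstruct visibles, resample hiddens).
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Gradient approximation: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h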
After the depth features of the audio and the video have been extracted separately, the model-based deep fusion strategy is adopted: the DBN for the audio depth features and the DBN for the video depth features are first trained separately on the extracted audio-video depth features, and a combined DBN is then retrained. Finally, the modules are cascaded, and the fused multimodal model is tested and fine-tuned on public depression databases and on the database designed by the present invention, finally establishing an automatic audio-video depression detection system. Fig. 5 is a flowchart of the model-based audio-video bimodal fusion according to an embodiment of the present invention.
Fig. 6 is a structural diagram of the voice and facial feature extraction system applied to depression detection according to an embodiment of the present invention. As shown in Fig. 6, a voice and facial feature extraction system applied to depression detection comprises:
a selection module 601 for randomly selecting a segment of audio-video data, the audio-video data including audio-video data of normal subjects and audio-video data of depressed patients;
a first feature extraction module 602 for performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
a speech depth feature data acquisition module 603 for inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
a second feature extraction module 604 for performing static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module 605 for inputting the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module 606 for performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
a facial motion feature data acquisition module 607 for inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
a facial depth feature data acquisition module 608 for inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
a fusion module 609 for inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
The speech depth feature data acquisition module 603 specifically comprises:
a first input unit for inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
a second input unit for inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
a third input unit for inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
The facial feature data acquisition module 605 specifically comprises:
a facial feature data acquiring unit for inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
The third feature extraction module 606 specifically comprises:
an optical flow displacement acquiring unit for performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
an optical flow image acquisition unit for obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
The facial depth feature data acquisition module 608 specifically comprises:
a facial overall data acquiring unit for connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
a facial depth feature data acquisition unit for inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
The embodiments in this specification are described in a progressive manner; each embodiment highlights its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. For the system disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and relevant details can be found in the description of the method.
Specific examples are used herein to illustrate the principle and implementation of the present invention. The above embodiments are only intended to help understand the method of the present invention and its core idea. At the same time, for those skilled in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In conclusion, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A voice and facial feature extraction method applied to depression detection, characterized in that the method comprises:
randomly selecting a segment of audio-video data;
performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
performing static feature extraction on the video images in the audio-video data to obtain frame images;
inputting the frame images into a second deep neural network model to obtain facial feature data;
performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
2. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the spectral parameters and the acoustic parameters into the first deep neural network model to obtain the speech depth feature data of the audio specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
3. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the frame images into the second deep neural network model to obtain the facial feature data specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
4. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that performing dynamic feature extraction on the video images in the audio-video data to obtain the optical flow images specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
5. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain the facial depth feature data of the video specifically comprises:
connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
6. A voice and facial feature extraction system applied to depression detection, characterized in that the system comprises:
a selection module for randomly selecting a segment of audio-video data;
a first feature extraction module for performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
a speech depth feature data acquisition module for inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
a second feature extraction module for performing static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module for inputting the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module for performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
a facial motion feature data acquisition module for inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
a facial depth feature data acquisition module for inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
a fusion module for inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
7. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the speech depth feature data acquisition module specifically comprises:
a first input unit for inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
a second input unit for inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
a third input unit for inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
8. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the facial feature data acquisition module specifically comprises:
a facial feature data acquiring unit for inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
9. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the third feature extraction module specifically comprises:
an optical flow displacement acquiring unit for performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
an optical flow image acquisition unit for obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
10. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the facial depth feature data acquisition module specifically comprises:
a facial overall data acquiring unit for connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
a facial depth feature data acquisition unit for inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
CN201810762032.3A 2018-07-12 2018-07-12 Voice and facial feature extraction method and system applied to depression detection Pending CN109171769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810762032.3A CN109171769A (en) 2018-07-12 2018-07-12 Voice and facial feature extraction method and system applied to depression detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810762032.3A CN109171769A (en) 2018-07-12 2018-07-12 Voice and facial feature extraction method and system applied to depression detection

Publications (1)

Publication Number Publication Date
CN109171769A true CN109171769A (en) 2019-01-11

Family

ID=64936032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810762032.3A Pending CN109171769A (en) 2018-07-12 2018-07-12 Voice and facial feature extraction method and system applied to depression detection

Country Status (1)

Country Link
CN (1) CN109171769A (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784287A (en) * 2019-01-22 2019-05-21 中国科学院自动化研究所 Information processing method, system and device based on contextual signals and a prefrontal cortex-like network
CN110123343A (en) * 2019-04-19 2019-08-16 西北师范大学 Depression detection device based on speech analysis
CN110675953A (en) * 2019-09-23 2020-01-10 湖南检信智能科技有限公司 Method for screening and identifying mental patients by using artificial intelligence and big data
CN111292765A (en) * 2019-11-21 2020-06-16 台州学院 Bimodal emotion recognition method fusing multiple deep learning models
CN111297350A (en) * 2020-02-27 2020-06-19 福州大学 Three-heart beat multi-model comprehensive decision-making electrocardiogram feature classification method integrating source end influence
CN111357011A (en) * 2019-01-31 2020-06-30 深圳市大疆创新科技有限公司 Environment sensing method and device, control method and device and vehicle
CN111462841A (en) * 2020-03-12 2020-07-28 华南理工大学 Depression intelligent diagnosis device and system based on knowledge graph
CN111462773A (en) * 2020-03-26 2020-07-28 心图熵动科技(苏州)有限责任公司 Suicide risk prediction model generation method and prediction system
CN111553899A (en) * 2020-04-28 2020-08-18 湘潭大学 Audio and video based Parkinson non-contact intelligent detection method and system
CN112120716A (en) * 2020-09-02 2020-12-25 中国人民解放军军事科学院国防科技创新研究院 Wearable multi-mode emotional state monitoring device
CN112307947A (en) * 2020-10-29 2021-02-02 北京沃东天骏信息技术有限公司 Method and apparatus for generating information
CN112472088A (en) * 2020-10-22 2021-03-12 深圳大学 Emotional state evaluation method and device, intelligent terminal and storage medium
US10956809B1 (en) 2019-11-21 2021-03-23 Wang Lian Artificial intelligence brain
CN112687390A (en) * 2021-03-12 2021-04-20 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN112768070A (en) * 2021-01-06 2021-05-07 万佳安智慧生活技术(深圳)有限公司 Mental health evaluation method and system based on dialogue communication
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN113392918A (en) * 2021-06-24 2021-09-14 哈尔滨理工大学 Depressive disorder related factor identification method based on multi-source information fusion
CN113397563A (en) * 2021-07-22 2021-09-17 北京脑陆科技有限公司 Training method, device, terminal and medium for depression classification model
CN113485261A (en) * 2021-06-29 2021-10-08 西北师范大学 CAEs-ACNN-based soft measurement modeling method
CN113705328A (en) * 2021-07-06 2021-11-26 合肥工业大学 Depression detection method and system based on facial feature points and facial movement units
CN113812948A (en) * 2021-09-08 2021-12-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Dequantization anxiety and depression psychological detection method and device
CN115715680A (en) * 2022-12-01 2023-02-28 杭州市第七人民医院 Anxiety discrimination method and device based on connective tissue potential
CN117079772A (en) * 2023-07-24 2023-11-17 广东智正科技有限公司 Intelligent correction system and terminal based on mental evaluation analysis of community correction object
CN117137488A (en) * 2023-10-27 2023-12-01 吉林大学 Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images
US11963771B2 (en) 2021-02-19 2024-04-23 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method based on audio-video
CN111357011B (en) * 2019-01-31 2024-04-30 深圳市大疆创新科技有限公司 Environment sensing method and device, control method and device and vehicle

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133481A (en) * 2017-05-22 2017-09-05 西北工业大学 The estimation of multi-modal depression and sorting technique based on DCNN DNN and PV SVM

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133481A (en) * 2017-05-22 2017-09-05 西北工业大学 The estimation of multi-modal depression and sorting technique based on DCNN DNN and PV SVM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨雨浓: "基于深度学习的人脸表情识别方法研究", 《中国优秀博士学位论文全文数据库信息科技辑,2018年03期,I138-26》 *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784287A (en) * 2019-01-22 2019-05-21 中国科学院自动化研究所 Information processing method, system and device based on contextual signals and a prefrontal cortex-like network
US10915815B1 (en) 2019-01-22 2021-02-09 Institute Of Automation, Chinese Academy Of Sciences Information processing method, system and device based on contextual signals and prefrontal cortex-like network
CN111357011B (en) * 2019-01-31 2024-04-30 深圳市大疆创新科技有限公司 Environment sensing method and device, control method and device and vehicle
CN111357011A (en) * 2019-01-31 2020-06-30 深圳市大疆创新科技有限公司 Environment sensing method and device, control method and device and vehicle
CN110123343B (en) * 2019-04-19 2023-10-03 西北师范大学 Depression detection device based on speech analysis
CN110123343A (en) * 2019-04-19 2019-08-16 西北师范大学 Depression detection device based on speech analysis
CN110675953A (en) * 2019-09-23 2020-01-10 湖南检信智能科技有限公司 Method for screening and identifying mental patients by using artificial intelligence and big data
WO2021101439A1 (en) * 2019-11-21 2021-05-27 Lian Wang Artificial intelligence brain
CN112825014A (en) * 2019-11-21 2021-05-21 王炼 Artificial intelligence brain
US10956809B1 (en) 2019-11-21 2021-03-23 Wang Lian Artificial intelligence brain
CN111292765A (en) * 2019-11-21 2020-06-16 台州学院 Bimodal emotion recognition method fusing multiple deep learning models
CN111297350B (en) * 2020-02-27 2021-08-31 福州大学 Three-heart beat multi-model comprehensive decision-making electrocardiogram feature classification method integrating source end influence
CN111297350A (en) * 2020-02-27 2020-06-19 福州大学 Three-heart beat multi-model comprehensive decision-making electrocardiogram feature classification method integrating source end influence
CN111462841A (en) * 2020-03-12 2020-07-28 华南理工大学 Depression intelligent diagnosis device and system based on knowledge graph
CN111462773A (en) * 2020-03-26 2020-07-28 心图熵动科技(苏州)有限责任公司 Suicide risk prediction model generation method and prediction system
CN111553899A (en) * 2020-04-28 2020-08-18 湘潭大学 Audio and video based Parkinson non-contact intelligent detection method and system
CN112120716A (en) * 2020-09-02 2020-12-25 中国人民解放军军事科学院国防科技创新研究院 Wearable multi-mode emotional state monitoring device
CN112472088A (en) * 2020-10-22 2021-03-12 深圳大学 Emotional state evaluation method and device, intelligent terminal and storage medium
CN112472088B (en) * 2020-10-22 2022-11-29 深圳大学 Emotional state evaluation method and device, intelligent terminal and storage medium
CN112307947A (en) * 2020-10-29 2021-02-02 北京沃东天骏信息技术有限公司 Method and apparatus for generating information
CN112768070A (en) * 2021-01-06 2021-05-07 万佳安智慧生活技术(深圳)有限公司 Mental health evaluation method and system based on dialogue communication
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
US11963771B2 (en) 2021-02-19 2024-04-23 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method based on audio-video
CN112687390B (en) * 2021-03-12 2021-06-18 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN112687390A (en) * 2021-03-12 2021-04-20 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN113392918A (en) * 2021-06-24 2021-09-14 哈尔滨理工大学 Depressive disorder related factor identification method based on multi-source information fusion
CN113485261A (en) * 2021-06-29 2021-10-08 西北师范大学 CAEs-ACNN-based soft measurement modeling method
CN113485261B (en) * 2021-06-29 2022-06-28 西北师范大学 CAEs-ACNN-based soft measurement modeling method
CN113705328A (en) * 2021-07-06 2021-11-26 合肥工业大学 Depression detection method and system based on facial feature points and facial movement units
CN113397563A (en) * 2021-07-22 2021-09-17 北京脑陆科技有限公司 Training method, device, terminal and medium for depression classification model
CN113812948A (en) * 2021-09-08 2021-12-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Dequantization anxiety and depression psychological detection method and device
CN115715680A (en) * 2022-12-01 2023-02-28 杭州市第七人民医院 Anxiety discrimination method and device based on connective tissue potential
CN117079772A (en) * 2023-07-24 2023-11-17 广东智正科技有限公司 Intelligent correction system and terminal based on mental evaluation analysis of community correction object
CN117137488B (en) * 2023-10-27 2024-01-26 吉林大学 Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images
CN117137488A (en) * 2023-10-27 2023-12-01 吉林大学 Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images

Similar Documents

Publication Publication Date Title
CN109171769A (en) Voice and facial feature extraction method and system applied to depression detection
CN110507335B (en) Multi-mode information based criminal psychological health state assessment method and system
CN110556129B (en) Bimodal emotion recognition model training method and bimodal emotion recognition method
Bachorowski Vocal expression and perception of emotion
Narayanan et al. Behavioral signal processing: Deriving human behavioral informatics from speech and language
Cen et al. A real-time speech emotion recognition system and its application in online learning
CN106073706B (en) Customized information and audio data analysis method and system oriented to the Mini-Mental State Examination
Khan et al. Emotion Based Signal Enhancement Through Multisensory Integration Using Machine Learning.
Sinha Recognizing complex patterns
WO2015158017A1 (en) Intelligent interaction and psychological comfort robot service system
CN111081371A (en) Virtual reality-based early autism screening and evaluating system and method
Rituerto-González et al. Data augmentation for speaker identification under stress conditions to combat gender-based violence
Caponetti et al. Biologically inspired emotion recognition from speech
CN110348409A (en) Method and apparatus for generating facial images based on voiceprints
Fang et al. Combining acoustic signals and medical records to improve pathological voice classification
Upadhyay et al. SmHeSol (IoT-BC): smart healthcare solution for future development using speech feature extraction integration approach with IoT and blockchain
Li et al. Global-local-feature-fused driver speech emotion detection for intelligent cockpit in automated driving
Cristani et al. Generative modeling and classification of dialogs by a low-level turn-taking feature
Cowie et al. Piecing together the emotion jigsaw
Degila et al. The UCD system for the 2018 FEMH voice data challenge
Gavrilescu et al. Feedforward neural network-based architecture for predicting emotions from speech
US20220015687A1 (en) Method for Screening Psychiatric Disorder Based On Conversation and Apparatus Therefor
Gupta et al. REDE-Detecting human emotions using CNN and RASA
Massaro The McGurk effect: Auditory visual speech perception’s piltdown man
Feather et al. Auditory texture synthesis from task-optimized convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190111