CN109171769A - Voice and facial feature extraction method and system applied to depression detection - Google Patents

Voice and facial feature extraction method and system applied to depression detection Download PDF

Info

Publication number
CN109171769A
CN109171769A
Authority
CN
China
Prior art keywords
data
facial
obtains
voice
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810762032.3A
Other languages
Chinese (zh)
Inventor
郭威彤
杨鸿武
甘振业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University
Priority to CN201810762032.3A
Publication of CN109171769A
Legal status: Pending


Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/165 Evaluating the state of mind, e.g. depression, anxiety

Abstract

The present invention discloses a voice and facial feature extraction method and system applied to depression detection. Feature extraction is performed on audio data according to an energy information method to obtain spectral parameters and acoustic parameters; these parameters are input into a first deep neural network model to obtain speech depth feature data. Static feature extraction is performed on the video images to obtain frame images; the frame images are input into a second deep neural network model to obtain facial feature data. Dynamic feature extraction is performed on the video images to obtain optical flow images; the optical flow images are input into a third deep neural network model to obtain facial motion feature data. The facial feature data and the motion feature data are input into the third deep neural network model to obtain facial depth feature data. The speech depth feature data and the facial depth feature data are input into a fourth neural network model to obtain fused data. The method or system of the present invention can improve the precision of depression screening results and the efficiency of depression detection.

Description

Voice and facial feature extraction method and system applied to depression detection
Technical field
The present invention relates to the field of feature extraction, and in particular to a voice and facial feature extraction method and system applied to depression detection.
Background technique
Because depressive disorder causes enormous social harm and economic loss, scholars and related institutions in many countries have carried out research on it and are actively seeking effective diagnosis and treatment schemes. At present, the differentiation and diagnosis of depression proceed mainly along three lines: 1) diagnosis by subjective means, such as the Hamilton Depression Rating Scale (HAMD), the Beck Depression Inventory (BDI), the Patient Health Questionnaire depression self-rating scale (PHQ-9), together with the subjective judgment of clinicians, which inevitably introduces a degree of subjective bias; 2) reliance on biological information: biotechnologies based on electroencephalography (EEG), functional magnetic resonance imaging (fMRI) and the like have been applied to depression detection; for example, the gamma band of the EEG of depressed subjects shows sustained enhancement, and depressed subjects show increased asymmetry in prefrontal cortex activation levels; 3) use of psychologically relevant behavioral information, identifying depression from abnormal behaviors such as voice, facial expression and body posture. For example, differences in voice attributes can effectively reflect a person's depressive state, and changes in the vocal tract characteristics of depressed patients are related to the physiological signs of depression; the processing of facial expression information is one of the objective indicators of depression detection, as depressed patients have difficulty processing positive emotions but show stronger attention and sensitivity to sad emotions; and body expression is also a very important visual clue for depression detection.
At present, depression recognition from audio-video signals mainly uses conventional methods: feature extraction first, then feature selection, and finally recognition with classification or regression algorithms. 1) Using the audio signal: analysis of the prosodic and acoustic features of speech has found that depressed patients show less prosodic variation than normal subjects; comparative analysis of the formants and spectral features of depressed patients and normal subjects has found that formants, power spectral density, mel-frequency cepstral coefficients (MFCC) and their differences, the Teager energy operator (TEO) and similar features are effective for depression recognition. 2) Using the video signal: video-based depression detection concentrates mainly on facial expression, portraying the face by extracting geometrical features and appearance-based features. Extracting the time series of the edges, corners, coordinates and orientations of the face to portray the variation and intensity of particular emotions has shown that the expressivity of depressed patients is reduced; describing texture changes by extracting facial region features enables classification of depressive face images. 3) Recognizing depression by linearly fusing audio and video features.
The most critical problem in detecting depression from audio and video is feature extraction. However, the audio-video features extracted at present are all hand-designed features, which can have certain nonlinear dependencies among one another, so these features are insufficient to characterize the high-level information of depression audio or video. Moreover, changes in voice and facial expression occur simultaneously and are highly correlated, changes in depressive affective state have no obvious event boundaries, and emotional expression varies from person to person, so simply splicing and concatenating voice and facial expression features loses some important information and degrades the screening results and detection efficiency for depression.
Summary of the invention
The object of the present invention is to provide a voice and facial feature extraction method and system applied to depression detection that improve the precision of depression screening results and the efficiency of depression detection.
To achieve the above object, the present invention provides the following solutions:
A voice and facial feature extraction method applied to depression detection, the method comprising:
randomly selecting a segment of audio-video data;
performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
performing static feature extraction on the video images in the audio-video data to obtain frame images;
inputting the frame images into a second deep neural network model to obtain facial feature data;
performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
Optionally, inputting the spectral parameters and the acoustic parameters into the first deep neural network model to obtain the speech depth feature data of the audio specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
Optionally, inputting the frame images into the second deep neural network model to obtain the facial feature data specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
Optionally, performing dynamic feature extraction on the video images in the audio-video data to obtain the optical flow images specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
Optionally, inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain the facial depth feature data of the video specifically comprises:
connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
To achieve the above object, the present invention further provides the following solutions:
A voice and facial feature extraction system applied to depression detection, the system comprising:
a selection module for randomly selecting a segment of audio-video data;
a first feature extraction module for performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
a speech depth feature data acquisition module for inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
a second feature extraction module for performing static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module for inputting the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module for performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
a facial motion feature data acquisition module for inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
a facial depth feature data acquisition module for inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
a fusion module for inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
Optionally, the speech depth feature data acquisition module specifically comprises:
a first input unit for inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
a second input unit for inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
a third input unit for inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
Optionally, the facial feature data acquisition module specifically comprises:
a facial feature data acquiring unit for inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
Optionally, the third feature extraction module specifically comprises:
an optical flow displacement acquiring unit for performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
an optical flow image acquisition unit for obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
Optionally, the facial depth feature data acquisition module specifically comprises:
a facial overall data acquiring unit for connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
a facial depth feature data acquisition unit for inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
According to the specific embodiments provided, the present invention discloses the following technical effects:
The present invention provides a voice and facial feature extraction method applied to depression detection. By establishing a depression audio-video database and extracting deep-learning-oriented audio-video bimodal fusion features, automatic depression detection under the audio-video bimodality based on deep learning is realized, improving the precision of depression screening results and the efficiency of depression detection.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings required in the embodiments are briefly described below. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative labor.
Fig. 1 is a flowchart of the voice and facial feature extraction method applied to depression detection according to an embodiment of the present invention;
Fig. 2 is a flowchart of the establishment of the emotion database according to an embodiment of the present invention;
Fig. 3 is a flowchart of the construction of the deep model system according to an embodiment of the present invention;
Fig. 4 is a flowchart of audio-video depth feature extraction according to an embodiment of the present invention;
Fig. 5 is a flowchart of the model-based audio-video bimodal fusion according to an embodiment of the present invention;
Fig. 6 is a structural diagram of the voice and facial feature extraction system applied to depression detection according to an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In order to make the above objects, features and advantages of the present invention clearer and more comprehensible, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of the voice and facial feature extraction method applied to depression detection according to an embodiment of the present invention. As shown in Fig. 1, a voice and facial feature extraction method applied to depression detection comprises:
Step 101: randomly selecting a segment of audio-video data, the audio-video data including audio-video data of normal subjects and audio-video data of depressed patients;
Step 102: performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
Step 103: inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
Step 104: performing static feature extraction on the video images in the audio-video data to obtain frame images;
Step 105: inputting the frame images into a second deep neural network model to obtain facial feature data;
Step 106: performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
Step 107: inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
Step 108: inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
Step 109: inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
Step 103 specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
Step 105 specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
Step 106 specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
Step 108 specifically comprises:
connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
Because the data involved in this project are sensitive and raise privacy concerns, some relevant datasets cannot be shared, and most existing datasets were collected from foreign subjects. Therefore, to allow research to continue in the future, the project team established an audio-video emotion database for depression. Fig. 2 is a flowchart of the establishment of the emotion database according to an embodiment of the present invention. The experiment was designed around key factors such as age, gender, depression level, emotion stimulation mode, speech mode and emotional valence, and recorded the audio and video data of subjects under different emotional valences through different emotion and vocalization induction modes. Subjects were recruited from the psychiatric department of a designated hospital. A 2 (subject type: depressed, normal) x 3 (emotional valence: positive, neutral, negative) mixed audio-video experimental paradigm was designed, mainly comprising 4 parts: watching film clips, picture description, text reading and verbal question answering. The aim is to induce, through different emotion induction and different speech modes, the changes in facial expression and voice of depressed patients that are the subject of this research. The present invention collected 300 males and 300 females, with 400 in the depressed group and 200 in the control group, aged between 18 and 55. All experiments were carried out in soundproof rooms of the hospital free of electromagnetic interference. A microphone and sound acquisition equipment collected the audio signal in mono at a sampling rate of 44.1 kHz and a sampling depth of 24 bits. A high-definition camera and a Kinect camera collected the video signal at a frame rate of 30 and a resolution of 800x600. The study required that the composition of the subjects show no statistically significant differences in age, educational background or gender (P > 0.05).
When analyzing the audio in the audio-video data, the present invention receives the speech signal of the speaker, judges silent segments according to energy information, performs feature extraction on the non-silent segments, and extracts the spectral parameters MFCC and the acoustic parameter logF0. The spectral and acoustic parameters are concatenated along the time axis and fed into a deep network as input features. The deep network is a DBN (deep belief network) built by stacking two layers of RBMs (restricted Boltzmann machines). The input features are trained through the DBN to extract a higher-level representation, i.e., high-level features. The high-level features are then fed into an LSTM (long short-term memory) deep network to extract high-level features at long and short durations. The features finally obtained are sent into another DBN stacked from RBMs for training, and the features output by this DBN are the audio-based depth features.
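By way of illustration only, a minimal Python sketch of this audio front end is given below, assuming the librosa library is available; the frame length, energy threshold, MFCC order and pitch search range are illustrative choices that the embodiment does not fix.

import numpy as np
import librosa

def extract_audio_features(wav_path, sr=44100, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr, mono=True)

    # Energy-based silence judgment: keep frames whose short-time energy
    # exceeds a fraction of the mean energy (hypothetical threshold).
    frame_len, hop = 2048, 512
    energy = np.array([np.sum(y[i:i + frame_len] ** 2)
                       for i in range(0, len(y) - frame_len, hop)])
    voiced = energy > 0.1 * energy.mean()

    # Spectral parameters: per-frame MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop)

    # Acoustic parameter: log fundamental frequency (logF0); unvoiced
    # frames are filled with the median voiced pitch before the log.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                            frame_length=frame_len, hop_length=hop)
    logf0 = np.log(np.where(np.isnan(f0), np.nanmedian(f0), f0))

    # Concatenate spectral and acoustic parameters along the time axis
    # and drop the silent frames before feeding the first DBN.
    n = min(mfcc.shape[1], len(logf0), len(voiced))
    feats = np.vstack([mfcc[:, :n], logf0[None, :n]])
    return feats[:, voiced[:n]]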
When analyzing the video in the audio-video data, video analysis and speech analysis are two independent steps. Video analysis is divided into two stages: one is static feature extraction and the other is dynamic feature extraction, and the deep networks used in both are CNNs (convolutional neural networks). In static feature extraction, a single image is taken as input and sent to a pre-trained CNN. The pre-trained CNN is trained in advance on a public dataset and comprises three convolutional layers, two max-pooling layers and two fully connected layers. The original picture is sent into the trained CNN model, and discriminative facial features are output from the network via the backpropagation (BP) algorithm. In dynamic feature extraction, optical flow maps are taken as the input of the deep model, and the output is the features of facial motion changes. The optical flow maps are obtained by computing the optical flow displacements between 10 consecutive frames, using a curvature change method and the gray-value-constancy assumption. Next, the facial features and motion features extracted in the two stages are connected, and two fully connected layers are constructed to jointly fine-tune the concatenated facial features and motion features. The numbers of hidden units of the two fully connected layers decrease layer by layer (512 in the first layer, 256 in the second), and the facial features and motion features are concatenated in each layer. Finally, the output of the fully connected layers is used as the input of an LSTM (long short-term memory) network for training, and the output of the LSTM network is the video-based facial depth features.
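By way of illustration only, a minimal Python sketch of the dynamic-feature input is given below, assuming OpenCV; the dense Farneback flow used here stands in for the curvature change method and gray-value-constancy computation named above, whose exact formulation the embodiment does not spell out.

import cv2
import numpy as np

def stacked_flow(frames):
    # Stack the optical flow displacements over 10 consecutive frames
    # into a single (H, W, 18) tensor serving as the motion-CNN input.
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for f in frames[1:10]:
        nxt = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        # Farneback dense flow: an (H, W, 2) displacement field per frame pair.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev = nxt
    return np.concatenate(flows, axis=2)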
After the depth features of the audio and the depth features of the video have been obtained, a 2-layer DBN (deep belief network) is first trained for each: the input of the audio DBN is the audio depth features and its output is the depression detection result from the audio signal; the input of the video DBN is the video depth features and its output is the depression detection result from the video signal. Then, the two detection results are fed as input signals into another 2-layer DBN for final fusion, and the output of this DBN is the final depression detection result from the audio-video signal.
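By way of illustration only, the model-level fusion stage can be sketched in Python with PyTorch as follows; plain two-layer sigmoid networks stand in for the two-layer DBNs (RBM pre-training is omitted), and all layer widths are illustrative.

import torch
import torch.nn as nn

def two_layer_net(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Sigmoid(),
                         nn.Linear(hidden, out_dim), nn.Sigmoid())

audio_net = two_layer_net(256, 128, 2)   # audio depth features -> audio result
video_net = two_layer_net(512, 128, 2)   # facial depth features -> video result
fusion_net = two_layer_net(4, 16, 2)     # both results -> final fused decision

def fused_prediction(audio_feat, video_feat):
    a = audio_net(audio_feat)            # per-modality detection result
    v = video_net(video_feat)
    return fusion_net(torch.cat([a, v], dim=-1))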
The present invention is built on the distinguishing characteristics of depressed patients in voice and facial expression and on a depression emotion database established with a designed experimental paradigm, and focuses on solving the problems of deep modeling and multimodal fusion of speech features and video features. Voice and facial expression both change over time, and their changes occur synchronously; these factors determine that complex relationships exist between features within the audio signal, within the video signal, and across the audio-video signal. The present invention learns representations over the time and space domains and realizes deep-learning-oriented audio-video depth feature extraction.
In extracting the multimodal audio-video features, the depth features of the voice and the depth features of the facial expression are first extracted from the audio modality and the video modality respectively, and the depth features of the two modalities are then fused to generate a new feature for depression detection. In this process, different deep learning model structures are involved for different modalities, and different feature dimensions exist. At the same time, considering that the audio information and the video information occur simultaneously, feature associations and synergies necessarily exist between the audio modality and the video modality; the present invention therefore exploits these factors to construct multimodal information fusion based on deep models.
In order to realize deep-learning-based audio-video multimodal depression detection, an audio-video signal recognition platform based on deep learning must first be built. In step B of Fig. 1, different deep learning models (RBM, DBN, CNN and LSTM) are first established for the two modalities respectively, and bimodal fusion and recognition are then carried out with an RBM-DBN. Here, the CNN uses the models AlexNet/VGG16 pre-trained on ImageNet as the deep framework; acoustic features are modeled with RBMs and DBNs; the LSTM models long-term and short-term temporal changes and synchronizes the audio and the video; finally, a CNN-LSTM and an RBM-DBN-LSTM are established to extract the video and audio features. Fig. 3 is a flowchart of the construction of the deep model system according to an embodiment of the present invention, and the detailed process is shown in Fig. 3.
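By way of illustration only, the CNN-LSTM video branch can be sketched in Python as follows, assuming torchvision's ImageNet-pretrained VGG16 as the frame encoder (the embodiment names AlexNet/VGG16); the LSTM width is an illustrative choice.

import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)

def video_depth_features(clip):            # clip: (T, 3, 224, 224) frames
    with torch.no_grad():
        per_frame = encoder(clip)          # (T, 512) static frame features
    out, _ = lstm(per_frame.unsqueeze(0))  # integrate along the time axis
    return out[0, -1]                      # (256,) facial depth feature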
Video data contains both spatial and temporal information, so a two-stream CNN feature extraction framework is used to investigate feature extraction in the spatial dimension and the temporal dimension of the video data separately. In spatial feature extraction, the CNN is pre-trained on the ImageNet data, each frame picture of the video is then extracted as the input of the CNN, and the deep structure is modified according to the backpropagation (BP) algorithm and the loss function, so as to extract the depth features of static expression. In temporal feature extraction, the emphasis is on the input of the network: the optical flows of several consecutive frames are stacked as the input of the CNN, and an LSTM is used to integrate the activations of the last CNN layer along the time axis to obtain the motion features of the face. Finally, the features of the spatial dimension and the features of the temporal dimension are fully connected to obtain the high-level features of facial expression based on deep learning. In extracting the audio depth features, the present invention considers not only that the model can generate a high-level representation reflecting the raw speech waveform but also that the model can capture short-term and long-term temporal changes. Therefore, the present invention constructs a serially connected RBM-DBN-LSTM deep model to extract the depth features of speech. In the RBM-DBN-LSTM model, Gibbs sampling and the contrastive divergence (CD) algorithm are used to extract speech high-level features, and the spatial feature information lost along the time axis is supplemented by the LSTM. The whole network is optimized using the binary cross-entropy loss function and stochastic gradient descent (SGD). Fig. 4 is a flowchart of audio-video depth feature extraction according to an embodiment of the present invention, as shown in Fig. 4.
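By way of illustration only, a single contrastive-divergence (CD-1) update for one RBM layer, the building block of the DBNs above, can be sketched in Python as follows; the learning rate and the use of CD-1 rather than CD-k are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b_v, b_h, v0, lr=0.01):
    # Positive phase: hidden probabilities and samples given the data batch v0.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step (reconstruct visibles, resample hiddens).
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Gradient approximation: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h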
After the depth features of the audio and the video have been extracted separately, the model-based deep fusion strategy is adopted: the DBN for the audio depth features and the DBN for the video depth features are first trained separately on the extracted audio-video depth features, and a combined DBN is then retrained. Finally, the modules are cascaded, and the fused multimodal model is tested and fine-tuned on public depression databases and on the database designed by the present invention, finally establishing an automatic audio-video depression detection system. Fig. 5 is a flowchart of the model-based audio-video bimodal fusion according to an embodiment of the present invention.
Fig. 6 is a structural diagram of the voice and facial feature extraction system applied to depression detection according to an embodiment of the present invention. As shown in Fig. 6, a voice and facial feature extraction system applied to depression detection comprises:
a selection module 601 for randomly selecting a segment of audio-video data, the audio-video data including audio-video data of normal subjects and audio-video data of depressed patients;
a first feature extraction module 602 for performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
a speech depth feature data acquisition module 603 for inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
a second feature extraction module 604 for performing static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module 605 for inputting the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module 606 for performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
a facial motion feature data acquisition module 607 for inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
a facial depth feature data acquisition module 608 for inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
a fusion module 609 for inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
The speech depth feature data acquisition module 603 specifically comprises:
a first input unit for inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
a second input unit for inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
a third input unit for inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
The facial feature data acquisition module 605 specifically comprises:
a facial feature data acquiring unit for inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
The third feature extraction module 606 specifically comprises:
an optical flow displacement acquiring unit for performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
an optical flow image acquisition unit for obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
The facial depth feature data acquisition module 608 specifically comprises:
a facial overall data acquiring unit for connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
a facial depth feature data acquisition unit for inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
The embodiments in this specification are described in a progressive manner; each embodiment highlights its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. For the system disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and relevant details can be found in the description of the method.
Specific examples are used herein to illustrate the principle and implementation of the present invention. The above embodiments are only intended to help understand the method of the present invention and its core idea. At the same time, for those skilled in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In conclusion, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A voice and facial feature extraction method applied to depression detection, characterized in that the method comprises:
randomly selecting a segment of audio-video data;
performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
performing static feature extraction on the video images in the audio-video data to obtain frame images;
inputting the frame images into a second deep neural network model to obtain facial feature data;
performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
2. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the spectral parameters and the acoustic parameters into the first deep neural network model to obtain the speech depth feature data of the audio specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
3. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the frame images into the second deep neural network model to obtain the facial feature data specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
4. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that performing dynamic feature extraction on the video images in the audio-video data to obtain the optical flow images specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
5. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain the facial depth feature data of the video specifically comprises:
connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
6. A voice and facial feature extraction system applied to depression detection, characterized in that the system comprises:
a selection module for randomly selecting a segment of audio-video data;
a first feature extraction module for performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
a speech depth feature data acquisition module for inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech depth feature data of the audio;
a second feature extraction module for performing static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module for inputting the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module for performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
a facial motion feature data acquisition module for inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
a facial depth feature data acquisition module for inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
a fusion module for inputting the speech depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
7. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the speech depth feature data acquisition module specifically comprises:
a first input unit for inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
a second input unit for inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
a third input unit for inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech depth feature data.
8. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the facial feature data acquisition module specifically comprises:
a facial feature data acquiring unit for inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
9. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the third feature extraction module specifically comprises:
an optical flow displacement acquiring unit for performing dynamic feature extraction on the video images in the audio-video data to obtain inter-frame optical flow displacements;
an optical flow image acquisition unit for obtaining the optical flow images from the optical flow displacements using a curvature change method and a gray-value-constancy assumption.
10. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the facial depth feature data acquisition module specifically comprises:
a facial overall data acquiring unit for connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
a facial depth feature data acquisition unit for inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
CN201810762032.3A 2018-07-12 2018-07-12 Voice and facial feature extraction method and system applied to depression detection Pending CN109171769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810762032.3A CN109171769A (en) 2018-07-12 2018-07-12 Voice and facial feature extraction method and system applied to depression detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810762032.3A CN109171769A (en) 2018-07-12 2018-07-12 Voice and facial feature extraction method and system applied to depression detection

Publications (1)

Publication Number Publication Date
CN109171769A true CN109171769A (en) 2019-01-11

Family

ID=64936032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810762032.3A Pending CN109171769A (en) 2018-07-12 2018-07-12 Voice and facial feature extraction method and system applied to depression detection

Country Status (1)

Country Link
CN (1) CN109171769A (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784287A (en) * 2019-01-22 2019-05-21 中国科学院自动化研究所 Information processing method, system and device based on contextual signals and a prefrontal cortex-like network
CN110123343A (en) * 2019-04-19 2019-08-16 西北师范大学 Depression detection device based on speech analysis
CN110675953A (en) * 2019-09-23 2020-01-10 湖南检信智能科技有限公司 Method for screening and identifying mental patients by using artificial intelligence and big data
CN111292765A (en) * 2019-11-21 2020-06-16 台州学院 Bimodal emotion recognition method fusing multiple deep learning models
CN111297350A (en) * 2020-02-27 2020-06-19 福州大学 Three-heart beat multi-model comprehensive decision-making electrocardiogram feature classification method integrating source end influence
CN111357011A (en) * 2019-01-31 2020-06-30 深圳市大疆创新科技有限公司 Environment sensing method and device, control method and device and vehicle
CN111462841A (en) * 2020-03-12 2020-07-28 华南理工大学 Depression intelligent diagnosis device and system based on knowledge graph
CN111462773A (en) * 2020-03-26 2020-07-28 心图熵动科技(苏州)有限责任公司 Suicide risk prediction model generation method and prediction system
CN111553899A (en) * 2020-04-28 2020-08-18 湘潭大学 Audio and video based Parkinson non-contact intelligent detection method and system
CN112120716A (en) * 2020-09-02 2020-12-25 中国人民解放军军事科学院国防科技创新研究院 Wearable multi-mode emotional state monitoring device
CN112307947A (en) * 2020-10-29 2021-02-02 北京沃东天骏信息技术有限公司 Method and apparatus for generating information
CN112472088A (en) * 2020-10-22 2021-03-12 深圳大学 Emotional state evaluation method and device, intelligent terminal and storage medium
US10956809B1 (en) 2019-11-21 2021-03-23 Wang Lian Artificial intelligence brain
CN112687390A (en) * 2021-03-12 2021-04-20 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN112768070A (en) * 2021-01-06 2021-05-07 万佳安智慧生活技术(深圳)有限公司 Mental health evaluation method and system based on dialogue communication
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
CN113392918A (en) * 2021-06-24 2021-09-14 哈尔滨理工大学 Depressive disorder related factor identification method based on multi-source information fusion
CN113397563A (en) * 2021-07-22 2021-09-17 北京脑陆科技有限公司 Training method, device, terminal and medium for depression classification model
CN113485261A (en) * 2021-06-29 2021-10-08 西北师范大学 CAEs-ACNN-based soft measurement modeling method
CN113705328A (en) * 2021-07-06 2021-11-26 合肥工业大学 Depression detection method and system based on facial feature points and facial movement units
CN113812948A (en) * 2021-09-08 2021-12-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Dequantization anxiety and depression psychological detection method and device
CN115715680A (en) * 2022-12-01 2023-02-28 杭州市第七人民医院 Anxiety discrimination method and device based on connective tissue potential
CN117079772A (en) * 2023-07-24 2023-11-17 广东智正科技有限公司 Intelligent correction system and terminal based on mental evaluation analysis of community correction object
CN117137488A (en) * 2023-10-27 2023-12-01 吉林大学 Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images
US11963771B2 (en) 2021-02-19 2024-04-23 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method based on audio-video
CN111357011B (en) * 2019-01-31 2024-04-30 深圳市大疆创新科技有限公司 Environment sensing method and device, control method and device and vehicle

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133481A (en) * 2017-05-22 2017-09-05 西北工业大学 The estimation of multi-modal depression and sorting technique based on DCNN DNN and PV SVM

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133481A (en) * 2017-05-22 2017-09-05 西北工业大学 The estimation of multi-modal depression and sorting technique based on DCNN DNN and PV SVM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨雨浓: "基于深度学习的人脸表情识别方法研究", 《中国优秀博士学位论文全文数据库信息科技辑,2018年03期,I138-26》 *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784287A (en) * 2019-01-22 2019-05-21 中国科学院自动化研究所 Information processing method, system and device based on contextual signals and a prefrontal cortex-like network
US10915815B1 (en) 2019-01-22 2021-02-09 Institute Of Automation, Chinese Academy Of Sciences Information processing method, system and device based on contextual signals and prefrontal cortex-like network
CN111357011B (en) * 2019-01-31 2024-04-30 深圳市大疆创新科技有限公司 Environment sensing method and device, control method and device and vehicle
CN111357011A (en) * 2019-01-31 2020-06-30 深圳市大疆创新科技有限公司 Environment sensing method and device, control method and device and vehicle
CN110123343B (en) * 2019-04-19 2023-10-03 西北师范大学 Depression detection device based on speech analysis
CN110123343A (en) * 2019-04-19 2019-08-16 西北师范大学 Depression detection device based on speech analysis
CN110675953A (en) * 2019-09-23 2020-01-10 湖南检信智能科技有限公司 Method for screening and identifying mental patients by using artificial intelligence and big data
WO2021101439A1 (en) * 2019-11-21 2021-05-27 Lian Wang Artificial intelligence brain
CN112825014A (en) * 2019-11-21 2021-05-21 王炼 Artificial intelligence brain
US10956809B1 (en) 2019-11-21 2021-03-23 Wang Lian Artificial intelligence brain
CN111292765A (en) * 2019-11-21 2020-06-16 台州学院 Bimodal emotion recognition method fusing multiple deep learning models
CN111297350B (en) * 2020-02-27 2021-08-31 福州大学 Three-heart beat multi-model comprehensive decision-making electrocardiogram feature classification method integrating source end influence
CN111297350A (en) * 2020-02-27 2020-06-19 福州大学 Three-heart beat multi-model comprehensive decision-making electrocardiogram feature classification method integrating source end influence
CN111462841A (en) * 2020-03-12 2020-07-28 华南理工大学 Depression intelligent diagnosis device and system based on knowledge graph
CN111462773A (en) * 2020-03-26 2020-07-28 心图熵动科技(苏州)有限责任公司 Suicide risk prediction model generation method and prediction system
CN111553899A (en) * 2020-04-28 2020-08-18 湘潭大学 Audio and video based Parkinson non-contact intelligent detection method and system
CN112120716A (en) * 2020-09-02 2020-12-25 中国人民解放军军事科学院国防科技创新研究院 Wearable multi-mode emotional state monitoring device
CN112472088A (en) * 2020-10-22 2021-03-12 深圳大学 Emotional state evaluation method and device, intelligent terminal and storage medium
CN112472088B (en) * 2020-10-22 2022-11-29 深圳大学 Emotional state evaluation method and device, intelligent terminal and storage medium
CN112307947A (en) * 2020-10-29 2021-02-02 北京沃东天骏信息技术有限公司 Method and apparatus for generating information
CN112768070A (en) * 2021-01-06 2021-05-07 万佳安智慧生活技术(深圳)有限公司 Mental health evaluation method and system based on dialogue communication
CN112818892A (en) * 2021-02-10 2021-05-18 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on time convolution neural network
US11963771B2 (en) 2021-02-19 2024-04-23 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method based on audio-video
CN112687390B (en) * 2021-03-12 2021-06-18 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN112687390A (en) * 2021-03-12 2021-04-20 中国科学院自动化研究所 Depression state detection method and device based on hybrid network and lp norm pooling
CN113392918A (en) * 2021-06-24 2021-09-14 哈尔滨理工大学 Depressive disorder related factor identification method based on multi-source information fusion
CN113485261A (en) * 2021-06-29 2021-10-08 西北师范大学 CAEs-ACNN-based soft measurement modeling method
CN113485261B (en) * 2021-06-29 2022-06-28 西北师范大学 CAEs-ACNN-based soft measurement modeling method
CN113705328A (en) * 2021-07-06 2021-11-26 合肥工业大学 Depression detection method and system based on facial feature points and facial movement units
CN113397563A (en) * 2021-07-22 2021-09-17 北京脑陆科技有限公司 Training method, device, terminal and medium for depression classification model
CN113812948A (en) * 2021-09-08 2021-12-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Dequantization anxiety and depression psychological detection method and device
CN115715680A (en) * 2022-12-01 2023-02-28 杭州市第七人民医院 Anxiety discrimination method and device based on connective tissue potential
CN117079772A (en) * 2023-07-24 2023-11-17 广东智正科技有限公司 Intelligent correction system and terminal based on mental evaluation analysis of community correction object
CN117137488B (en) * 2023-10-27 2024-01-26 吉林大学 Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images
CN117137488A (en) * 2023-10-27 2023-12-01 吉林大学 Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images

Similar Documents

Publication Publication Date Title
CN109171769A (en) Voice and facial feature extraction method and system applied to depression detection
CN110507335B (en) Multi-mode information based criminal psychological health state assessment method and system
CN110556129B (en) Bimodal emotion recognition model training method and bimodal emotion recognition method
Bachorowski Vocal expression and perception of emotion
Narayanan et al. Behavioral signal processing: Deriving human behavioral informatics from speech and language
Cen et al. A real-time speech emotion recognition system and its application in online learning
CN106073706B (en) Customized information and audio data analysis method and system oriented to the Mini-Mental State Examination
Khan et al. Emotion Based Signal Enhancement Through Multisensory Integration Using Machine Learning.
Sinha Recognizing complex patterns
WO2015158017A1 (en) Intelligent interaction and psychological comfort robot service system
CN111081371A (en) Virtual reality-based early autism screening and evaluating system and method
Rituerto-González et al. Data augmentation for speaker identification under stress conditions to combat gender-based violence
Caponetti et al. Biologically inspired emotion recognition from speech
CN110348409A (en) Method and apparatus for generating facial images based on voiceprints
Fang et al. Combining acoustic signals and medical records to improve pathological voice classification
Upadhyay et al. SmHeSol (IoT-BC): smart healthcare solution for future development using speech feature extraction integration approach with IoT and blockchain
Li et al. Global-local-feature-fused driver speech emotion detection for intelligent cockpit in automated driving
Cristani et al. Generative modeling and classification of dialogs by a low-level turn-taking feature
Cowie et al. Piecing together the emotion jigsaw
Degila et al. The UCD system for the 2018 FEMH voice data challenge
Gavrilescu et al. Feedforward neural network-based architecture for predicting emotions from speech
US20220015687A1 (en) Method for Screening Psychiatric Disorder Based On Conversation and Apparatus Therefor
Gupta et al. REDE-Detecting human emotions using CNN and RASA
Massaro The McGurk effect: Auditory visual speech perception’s piltdown man
Feather et al. Auditory texture synthesis from task-optimized convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190111