CN109171769A - Voice and facial feature extraction method and system for depression detection - Google Patents
Voice and facial feature extraction method and system for depression detection — Download PDF — Info
- Publication number
- CN109171769A (application CN201810762032.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- facial
- obtains
- voice
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times ; Devices for evaluating the psychological state
- A61B5/165—Evaluating the state of mind, e.g. depression, anxiety
Abstract
The present invention discloses a voice and facial feature extraction method and system for depression detection. Audio data is subjected to feature extraction according to an energy-information method to obtain spectral parameters and acoustic parameters; these parameters are input into a first deep neural network model to obtain speech deep-feature data. Static features are extracted from the video images to obtain frame images; the frame images are input into a second deep neural network model to obtain facial feature data. Dynamic features are extracted from the video images to obtain optical-flow images; the optical-flow images are input into a third deep neural network model to obtain facial motion feature data. The facial feature data and the motion feature data are input into the third deep neural network model to obtain facial deep-feature data. Finally, the speech deep-feature data and the facial deep-feature data are input into a fourth neural network model to obtain fused data. The method and system of the invention improve both the accuracy of depression screening results and the efficiency of depression detection.
Description
Technical field
The present invention relates to the field of feature extraction, and more particularly to a voice and facial feature extraction method and system for depression detection.
Background technique
Because depressive disorder causes enormous social harm and economic loss, scholars and institutions in many countries have carried out research on it and are actively seeking effective diagnosis and treatment schemes. At present, the differentiation and diagnosis of depression proceed mainly along three lines: 1) subjective assessment, e.g. the Hamilton Depression Rating Scale (HAMD), the Beck Depression Inventory (BDI), the Patient Health Questionnaire depression module (PHQ-9), and the clinician's subjective judgment — an approach that inevitably carries a degree of subjective bias; 2) biological information: biotechnologies such as electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) have been applied to depression detection — for example, the gamma band of the EEG of depressed subjects shows sustained enhancement, and depressed subjects show increased asymmetry of prefrontal-cortex activation; 3) psychologically relevant behavioral information: depression is identified from abnormal behavioral features such as voice, facial expression, and body posture. For example, differences in voice attributes can effectively reflect a person's depressive state, and changes in the vocal-tract characteristics of depressed patients are related to the physiological signs of their depression; the processing of facial-expression information serves as one objective indicator for depression detection — depressed patients have difficulty processing positive emotion but show heightened attention and sensitivity to sad emotion; and body expression is also a very important visual cue for depression detection.
At present, depression recognition from audio-visual signals mostly follows the conventional pipeline: feature extraction first, then feature selection, and finally recognition with a classification or regression algorithm. 1) Using the audio signal: analysis of the prosodic and acoustic features of speech has found that depressed patients show less prosodic variation than healthy speakers; comparative analysis of the formants and spectral features of depressed patients and healthy subjects has found that formants, power spectral density, Mel-frequency cepstral coefficients (MFCCs) and their deltas, the Teager energy operator (TEO), and similar features are effective for depression recognition. 2) Using the video signal: video-based depression detection concentrates mainly on facial expression, characterized by extracting geometrical features and appearance-based algorithm features. Extracting the temporal sequences of facial edges, corners, coordinates, and orientations characterizes the variation and intensity of particular emotions and shows that the expressivity of depressed patients is reduced; extracting facial-region features describes texture variation and supports classification of depressive face images. 3) Audio and video features are linearly fused to recognize depression.
The most critical problem in detecting depression from audio and video is feature extraction. However, the audio-visual features extracted to date are all hand-designed; they can exhibit nonlinear dependencies on one another and are therefore insufficient to characterize the high-level information of depressive audio or video. Moreover, changes in voice and facial expression occur simultaneously and are highly correlated, the variation of depressive affective states has no obvious event boundary, and emotional behavior varies from person to person. Simply splicing and concatenating voice and facial-expression features therefore loses important information and degrades both the screening results and the detection efficiency for depression.
Summary of the invention
The object of the present invention is to provide a voice and facial feature extraction method and system for depression detection that improves the accuracy of depression screening results and the efficiency of depression detection.
To achieve the above object, the present invention provides the following scheme:
A voice and facial feature extraction method for depression detection, the method comprising:
randomly selecting a segment of audio-video data;
performing feature extraction on the audio data in the audio-video data according to an energy-information method to obtain spectral parameters and acoustic parameters;
inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech deep-feature data of the audio;
performing static feature extraction on the video images in the audio-video data to obtain frame images;
inputting the frame images into a second deep neural network model to obtain facial feature data;
performing dynamic feature extraction on the video images in the audio-video data to obtain optical-flow images;
inputting the optical-flow images into a third deep neural network model to obtain facial motion feature data;
inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial deep-feature data of the video;
inputting the speech deep-feature data and the facial deep-feature data into a fourth neural network model to obtain fused data.
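The claimed data flow can be sketched end-to-end. The following Python sketch is illustrative only: the feature dimensions, the `make_net` stand-ins, and all weights are hypothetical placeholders for the four trained deep neural network models, chosen merely to exercise the concatenation and routing of features described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(in_dim, out_dim, seed):
    # Hypothetical stand-in for a trained network: a fixed random projection
    # followed by a nonlinearity, so the data flow can be exercised.
    w = np.random.default_rng(seed).standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: np.tanh(x @ w)

audio_net  = make_net(26, 64, 1)   # 1st DNN: spectral + acoustic -> speech deep features
frame_net  = make_net(128, 64, 2)  # 2nd DNN: frame-image features -> facial features
flow_net   = make_net(128, 64, 3)  # 3rd DNN: optical-flow features -> motion features
face_net   = make_net(128, 64, 4)  # 3rd DNN (2nd stage): static + motion -> facial deep features
fusion_net = make_net(128, 32, 5)  # 4th DNN: speech + facial deep features -> fused data

def extract_fused_features(spectral, acoustic, frame_feat, flow_feat):
    speech_deep = audio_net(np.concatenate([spectral, acoustic]))
    facial      = frame_net(frame_feat)
    motion      = flow_net(flow_feat)
    facial_deep = face_net(np.concatenate([facial, motion]))
    return fusion_net(np.concatenate([speech_deep, facial_deep]))

fused = extract_fused_features(rng.standard_normal(13),   # e.g. 13 MFCCs
                               rng.standard_normal(13),   # e.g. acoustic parameters
                               rng.standard_normal(128),  # per-frame image features
                               rng.standard_normal(128))  # optical-flow features
print(fused.shape)  # (32,)
```

The actual models (DBNs, CNNs, LSTMs) and their dimensions are described in the embodiments below; this sketch only fixes the wiring between the claimed steps.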
Optionally, inputting the spectral parameters and the acoustic parameters into the first deep neural network model to obtain the speech deep-feature data of the audio specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech deep-feature data.
Optionally, inputting the frame images into the second deep neural network model to obtain facial feature data specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the back-propagation (BP) algorithm.
Optionally, performing dynamic feature extraction on the video images in the audio-video data to obtain optical-flow images specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain frame-by-frame optical-flow displacements;
applying the curvature-variation method and the gray-value-constancy assumption to the optical-flow displacements to obtain the optical-flow images.
Optionally, inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain the facial deep-feature data of the video specifically comprises:
connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial deep-feature data.
To achieve the above object, the present invention further provides the following scheme:
A voice and facial feature extraction system for depression detection, the system comprising:
a selection module for randomly selecting a segment of audio-video data;
a first feature extraction module for performing feature extraction on the audio data in the audio-video data according to an energy-information method to obtain spectral parameters and acoustic parameters;
a speech deep-feature data acquisition module for inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech deep-feature data of the audio;
a second feature extraction module for performing static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module for inputting the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module for performing dynamic feature extraction on the video images in the audio-video data to obtain optical-flow images;
a facial motion feature data acquisition module for inputting the optical-flow images into a third deep neural network model to obtain facial motion feature data;
a facial deep-feature data acquisition module for inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial deep-feature data of the video;
a fusion module for inputting the speech deep-feature data and the facial deep-feature data into a fourth neural network model to obtain fused data.
Optionally, the speech deep-feature data acquisition module specifically comprises:
a first input unit for inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
a second input unit for inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
a third input unit for inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech deep-feature data.
Optionally, the facial feature data acquisition module specifically comprises:
a facial feature data acquisition unit for inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the back-propagation (BP) algorithm.
Optionally, the third feature extraction module specifically comprises:
an optical-flow displacement acquisition unit for performing dynamic feature extraction on the video images in the audio-video data to obtain frame-by-frame optical-flow displacements;
an optical-flow image acquisition unit for applying the curvature-variation method and the gray-value-constancy assumption to the optical-flow displacements to obtain the optical-flow images.
Optionally, the facial deep-feature data acquisition module specifically comprises:
a facial overall data acquisition unit for connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
a facial deep-feature data acquisition unit for inputting the facial overall data into a second long short-term memory network model to obtain the facial deep-feature data.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:
The present invention provides a voice and facial feature extraction method for depression detection. By establishing a depression audio-video database and extracting deep-learning-oriented audio-video bimodal fusion features, automatic depression detection from the audio-video bimodality based on deep learning is realized, improving both the accuracy of depression screening results and the efficiency of depression detection.
Detailed description of the invention
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings required by the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the voice and facial feature extraction method for depression detection according to an embodiment of the present invention;
Fig. 2 is a flow chart of the establishment of the affective database according to an embodiment of the present invention;
Fig. 3 is a flow chart of the construction of the deep-model system according to an embodiment of the present invention;
Fig. 4 is a flow chart of audio-video deep-feature extraction according to an embodiment of the present invention;
Fig. 5 is a flow chart of model-based audio-video bimodal fusion according to an embodiment of the present invention;
Fig. 6 is a structural diagram of the voice and facial feature extraction system for depression detection according to an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
To make the above objects, features, and advantages of the present invention clearer and more comprehensible, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of the voice and facial feature extraction method for depression detection according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step 101: randomly selecting a segment of audio-video data, the audio-video data including audio-video data of healthy subjects and audio-video data of depressed patients;
Step 102: performing feature extraction on the audio data in the audio-video data according to an energy-information method to obtain spectral parameters and acoustic parameters;
Step 103: inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain speech deep-feature data of the audio;
Step 104: performing static feature extraction on the video images in the audio-video data to obtain frame images;
Step 105: inputting the frame images into a second deep neural network model to obtain facial feature data;
Step 106: performing dynamic feature extraction on the video images in the audio-video data to obtain optical-flow images;
Step 107: inputting the optical-flow images into a third deep neural network model to obtain facial motion feature data;
Step 108: inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial deep-feature data of the video;
Step 109: inputting the speech deep-feature data and the facial deep-feature data into a fourth neural network model to obtain fused data.
Step 103 specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain speech high-level features;
inputting the speech high-level features into a first long short-term memory network model to obtain long-duration and short-duration high-level features;
inputting the long-duration and short-duration high-level features into a second deep belief network to obtain the speech deep-feature data.
Step 105 specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the back-propagation (BP) algorithm.
Step 106 specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain frame-by-frame optical-flow displacements;
applying the curvature-variation method and the gray-value-constancy assumption to the optical-flow displacements to obtain the optical-flow images.
Step 108 specifically comprises:
connecting the facial feature data with the facial motion feature data through fully connected layers to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial deep-feature data.
Because the data involved in this project are sensitive and privacy-related, some relevant data sets cannot be shared, and most existing data sets were collected from subjects abroad. Therefore, to enable continued research, the project team established an audio-video affective database for depression. Fig. 2 is a flow chart of the establishment of the affective database according to an embodiment of the present invention. The experiment is designed around key factors such as age, gender, depression level, emotional stimulation mode, speech mode, and emotional valence, and records the subjects' audio and video data under different emotional valences through different emotion- and speech-induction modes. Subjects were recruited from a designated top-tier psychiatric hospital. A 2 (subject type: depressed, healthy) × 3 (emotional valence: positive, neutral, negative) mixed audio-video experimental paradigm was designed, comprising four parts: watching video clips, picture description, text reading, and free speech response. Through different emotion inductions and different speech-elicitation modes, the paradigm serves the research purpose of revealing the changes in facial expression and voice of depressed patients. The present invention collected 300 male and 300 female subjects, of whom 400 formed the depressed group and 200 the control group, aged 18 to 55. All experiments were carried out in a soundproofed, electromagnetically shielded room of the partner hospital; a microphone and a sound-acquisition card captured the audio signal (monophonic, 44.1 kHz sample rate, 24-bit sample depth), and a high-definition camera and a Kinect camera captured the video signal (30 fps, 800×600 resolution). The study required that the groups show no statistically significant difference (P > 0.05) in age, education, or gender.
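The 2 × 3 mixed paradigm described above can be enumerated programmatically. A minimal sketch follows; the per-subject session plan shown (every valence crossed with every task) is an illustrative assumption, as the patent does not specify how the four task parts are distributed across valences.

```python
from itertools import product

# The two crossed factors of the described experimental paradigm.
subject_types = ["depressed", "healthy"]
valences = ["positive", "neutral", "negative"]

# The four parts of each recording session.
tasks = ["video clip viewing", "picture description", "text reading", "speech response"]

# All between/within-subject conditions: 2 subject types x 3 valences = 6.
conditions = list(product(subject_types, valences))

# Hypothetical per-subject session plan: every valence paired with every task.
session_plan = [(v, t) for v in valences for t in tasks]

print(len(conditions), len(session_plan))  # 6 12
```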
When analyzing the audio in the audio-video data, the present invention receives the speaker's speech signal, determines the silent segments according to the energy information, performs feature extraction on the non-silent segments, and extracts the spectral parameter MFCC and the acoustic parameter log F0. The spectral and acoustic parameters are concatenated along the time axis and fed into the deep network as input features. The deep network consists of a DBN (deep belief network) stacked from two layers of RBMs (restricted Boltzmann machines). The input features are trained through the DBN to extract a higher-level representation, i.e. high-level features. The high-level features are then fed into an LSTM (long short-term memory) deep network to extract high-level features at long and short durations. The resulting features are finally sent into another DBN stacked from RBMs for training; the features output by this DBN are the audio-based deep features.
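The energy-based silence gating described in this paragraph can be sketched in Python with NumPy. The frame length, hop, and relative energy threshold below are illustrative assumptions rather than values from the patent, and the MFCC/log F0 extraction that follows the gating is left out.

```python
import numpy as np

def drop_silent_frames(signal, sr=44100, frame_ms=25, hop_ms=10, rel_thresh=0.05):
    """Energy-based silence gating: frames whose short-time energy falls below
    a fraction of the peak frame energy are discarded before spectral and
    acoustic parameters are computed. Threshold is an illustrative assumption."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame + 1, hop)]
    energies = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    keep = energies >= rel_thresh * energies.max()
    return [f for f, k in zip(frames, keep) if k]

# Half a second of silence followed by one second of a 220 Hz tone.
sr = 8000
t = np.arange(sr) / sr
sig = np.concatenate([np.zeros(sr // 2), np.sin(2 * np.pi * 220 * t)])
voiced = drop_silent_frames(sig, sr=sr)
```

Only the voiced frames survive, so downstream MFCC and log F0 extraction operates on the non-silent segments alone, as the text describes.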
When analyzing the video in the audio-video data, video analysis and speech analysis are two independent steps. Video analysis is divided into two stages: static feature extraction and dynamic feature extraction, both using a CNN (convolutional neural network) as the deep network. In static feature extraction, a single image is taken as input and sent to a pre-trained CNN; this CNN is trained in advance on a public data set and comprises three convolutional layers, two max-pooling layers, and two fully connected layers. The original picture is sent into the trained CNN model, and through the back-propagation (BP) algorithm the network outputs discriminative facial features. In dynamic feature extraction, optical-flow images serve as the input of the deep model, and the output is the features of facial motion change. The optical-flow displacements between 10 consecutive frames are computed, and the optical-flow images are obtained using the curvature-variation method and the gray-value-constancy assumption. Next, the facial features and motion features extracted in the two stages are connected, and two fully connected layers are constructed to jointly fine-tune the spliced facial and motion features. The numbers of hidden units of the two fully connected layers decrease layer by layer (512 in the first layer, 256 in the second), and the facial and motion features are concatenated at each layer. Finally, the output of the fully connected layers serves as the input to an LSTM (long short-term memory) network; the output of the trained LSTM network is the video-based facial deep features.
After the deep features of the audio and of the video have been obtained, a 2-layer DBN (deep belief network) is first trained for each: the input of the audio DBN is the audio deep features, and its output is the depression detection result from the audio signal; the input of the video DBN is the video deep features, and its output is the depression detection result from the video signal. The two detection results are then fed, as input signals, into another 2-layer DBN for final fusion; the output of this DBN is the final depression detection result from the audio-video signal.
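The decision-level fusion described here — two per-modality scores re-fed into a small fusion network — can be sketched with scalar stand-ins. The logistic units and the fusion weights are illustrative assumptions; the patent's actual per-modality and fusion networks are 2-layer DBNs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Stand-ins for the outputs of the audio DBN and the video DBN,
# each interpreted as a depression score in (0, 1).
audio_score = sigmoid(rng.standard_normal())
video_score = sigmoid(rng.standard_normal())

# Hypothetical fusion weights: the two scores are re-fed into one more
# small network whose output is the final audio-video detection result.
w = np.array([0.6, 0.4])
b = -0.5
fused_score = sigmoid(np.dot(w, [audio_score, video_score]) + b)
```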
The present invention builds on the distinctive characteristics of depressed patients in voice and facial expression, designs an experimental paradigm, and establishes a depression affective database on that basis; its emphasis is on solving the deep modeling and multimodal fusion of speech features and video features. Voice and facial expression both change over time, and these changes occur synchronously; these factors determine that complicated relationships exist among the features within the audio signal, the video signal, and the combined audio-video signal. The present invention learns representations over both the temporal and spatial domains and realizes deep-learning-oriented audio-video deep-feature extraction.
In extracting the multimodal audio-video features, the deep features of speech and of facial expression are first extracted from the audio modality and the video modality respectively, and the deep features of the two modalities are then fused to generate a new feature for depression detection. In this process, different deep-learning model structures are involved for the different modalities, with different feature dimensions. Meanwhile, since the audio information and the video information occur simultaneously, feature associations and synergies necessarily exist between the audio and video modalities; the present invention exploits these factors to construct multimodal information fusion based on deep models.
To realize audio-video multimodal depression detection based on deep learning, an audio-video signal recognition platform based on deep learning must first be built. Step B of Fig. 1 establishes different deep-learning models for the two modalities (RBM, DBN, CNN, and LSTM) and then performs bimodal fusion and recognition with an RBM-DBN. The CNN uses an AlexNet/VGG16 model pre-trained on ImageNet as the deep framework; acoustic-feature modeling uses RBM and DBN, with the LSTM capturing long- and short-term temporal changes and synchronizing audio and video. Finally, a CNN-LSTM and an RBM-DBN-LSTM are established to extract the video and audio features respectively. Fig. 3 is a flow chart of the construction of the deep-model system according to an embodiment of the present invention; the detailed process is shown in Fig. 3.
Because video data contain both spatial and temporal information, a two-stream CNN feature-extraction framework is adopted to investigate spatial-dimension and temporal-dimension feature extraction separately. In the spatial-dimension feature extraction, the CNN is pre-trained on the ImageNet data; each frame picture in the video is then taken as the CNN input, and the deep structure is modified according to the back-propagation (BP) algorithm and the loss function to extract the deep features of static expression. In the temporal-dimension feature extraction, emphasis is placed on the network input: the optical flow of several consecutive frames is stacked as the CNN input, and an LSTM integrates the activations of the last CNN layer along the time axis to obtain the facial motion features. Finally, the spatial-dimension features and the temporal-dimension features are fully connected to obtain deep-learning-based high-level features of facial expression. In extracting the audio deep features, the present invention considers not only that the model should generate a high-level representation reflecting the raw speech waveform, but also that the model should capture short-term and long-term temporal changes. Therefore, the present invention constructs a serially connected RBM-DBN-LSTM deep model to extract the deep features of speech. In the RBM-DBN-LSTM model, speech high-level features are extracted using Gibbs sampling and the contrastive divergence (CD) algorithm, and the LSTM supplements the spatial-feature information lost along the time axis. The whole network is optimized with the binary cross-entropy loss function and stochastic gradient descent (SGD). Fig. 4 is a flow chart of audio-video deep-feature extraction according to an embodiment of the present invention, as shown in Fig. 4.
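The contrastive divergence (CD) training mentioned for the RBM layers can be illustrated with a one-step (CD-1) update for a Bernoulli RBM in NumPy. The layer sizes, learning rate, and synthetic binary data below are assumptions for demonstration, not values from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_h, b_v, rng, lr=0.01):
    """One CD-1 update for a Bernoulli RBM: Gibbs-sample the hidden layer,
    reconstruct the visible layer, and nudge the parameters toward the data
    statistics and away from the reconstruction statistics."""
    p_h0 = sigmoid(v0 @ W + b_h)                       # hidden probabilities
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sampled hidden states
    p_v1 = sigmoid(h0 @ W.T + b_v)                     # reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_h += lr * (p_h0 - p_h1)
    b_v += lr * (v0 - p_v1)
    return np.mean((v0 - p_v1) ** 2)                   # reconstruction error

rng = np.random.default_rng(0)
n_v, n_h = 16, 8                                       # assumed layer sizes
W = rng.standard_normal((n_v, n_h)) * 0.1
b_h, b_v = np.zeros(n_h), np.zeros(n_v)
data = (rng.random((200, n_v)) < 0.3).astype(float)    # synthetic binary "features"

# A few epochs of CD-1; reconstruction error typically shrinks as W adapts.
errs = [np.mean([cd1_step(v, W, b_h, b_v, rng) for v in data]) for _ in range(5)]
```

Stacking such RBMs layer by layer yields the DBNs used throughout the described model, and the LSTM is then trained on the DBN outputs as the text explains.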
After the deep features of the audio and of the video have been extracted separately, the model-based deep fusion strategy is adopted: the DBN network for the audio deep features and the DBN network for the video deep features are first trained respectively on the extracted audio-video deep features, and the two are then combined and retrained with a DBN. Finally, the modules are cascaded, and the fused multimodal model is tested and fine-tuned on public depression databases and on the database designed by the present invention, finally establishing an automatic audio-video depression detection system. Fig. 5 is a flow chart of model-based audio-video bimodal fusion according to an embodiment of the present invention.
Fig. 6 is a structural diagram of a voice and facial feature extraction system applied to depression detection according to an embodiment of the present invention. As shown in Fig. 6, the voice and facial feature extraction system applied to depression detection comprises:
a selection module 601, configured to randomly select a segment of audio-video data, the audio-video data comprising audio-video data of normal subjects and audio-video data of depression patients;
a first feature extraction module 602, configured to perform feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
a voice depth feature data acquisition module 603, configured to input the spectral parameters and the acoustic parameters into a first deep neural network model to obtain voice depth feature data of the audio;
a second feature extraction module 604, configured to perform static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module 605, configured to input the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module 606, configured to perform dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
a facial motion feature data acquisition module 607, configured to input the optical flow images into a third deep neural network model to obtain facial motion feature data;
a facial depth feature data acquisition module 608, configured to input the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
a fusion module 609, configured to input the voice depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
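The first feature extraction module 602 can be sketched as follows. This excerpt does not disclose the exact parameter set of the "energy information method", so short-time energy, spectral centroid, and zero-crossing rate are used here as illustrative stand-ins, with hypothetical frame sizes (a sketch under stated assumptions, not the patented implementation):

```python
import numpy as np

def extract_audio_parameters(signal, sr=16000, frame_len=400, hop=160):
    """Frame-level spectral/acoustic parameter extraction (illustrative).

    Returns one row of [short-time energy, spectral centroid (Hz),
    zero-crossing rate] per 25 ms frame with a 10 ms hop at 16 kHz.
    """
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        energy = float(np.sum(frame ** 2))                        # short-time energy
        centroid = float(np.sum(freqs * spectrum) /
                         (np.sum(spectrum) + 1e-10))              # spectral centroid
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))  # zero-crossing rate
        rows.append([energy, centroid, zcr])
    return np.asarray(rows)  # shape: (num_frames, 3)
```

The resulting per-frame rows would then be fed to the first deep neural network model (module 603).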
The voice depth feature data acquisition module 603 specifically includes:
a first input unit, configured to input the spectral parameters and the acoustic parameters into a first deep belief network to obtain voice high-level features;
a second input unit, configured to input the voice high-level features into a first long short-term memory (LSTM) network model to obtain long-duration high-level features and short-duration high-level features;
a third input unit, configured to input the long-duration high-level features and the short-duration high-level features into a second deep belief network to obtain the voice depth feature data.
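The DBN → LSTM → DBN cascade of module 603 can be sketched with plain numpy. Only the inference path is shown; DBN pre-training (stacked RBMs), LSTM training, and the split into long-duration and short-duration features are omitted, and all layer sizes are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_forward(x, weights, biases):
    """Inference pass through a stack of sigmoid layers, i.e. the
    forward path of a (pre-trained) deep belief network."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h

def lstm_step(x, h, c, Wx, Wh, b):
    """One step of a standard LSTM cell; gates packed as [i, f, o, g]."""
    n = h.shape[0]
    z = Wx @ x + Wh @ h + b
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:])
    c = f * c + i * g          # cell state update
    h = o * np.tanh(c)         # hidden state output
    return h, c

# Cascade: first DBN -> first LSTM over frames -> second DBN.
rng = np.random.default_rng(0)
T, d_in, d_hid, d_out = 10, 3, 8, 4
frames = rng.normal(size=(T, d_in))                         # per-frame parameters
high = dbn_forward(frames, [rng.normal(scale=0.1, size=(d_in, d_hid))],
                   [np.zeros(d_hid)])                       # voice high-level features
Wx = rng.normal(scale=0.1, size=(4 * d_hid, d_hid))
Wh = rng.normal(scale=0.1, size=(4 * d_hid, d_hid))
h, c = np.zeros(d_hid), np.zeros(d_hid)
for t in range(T):                                          # temporal modelling
    h, c = lstm_step(high[t], h, c, Wx, Wh, np.zeros(4 * d_hid))
voice_depth = dbn_forward(h, [rng.normal(scale=0.1, size=(d_hid, d_out))],
                          [np.zeros(d_out)])                # voice depth features
```

For simplicity this sketch feeds only the final LSTM hidden state to the second DBN, whereas the module above distinguishes long-duration and short-duration high-level features.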
The facial feature data acquisition module 605 specifically includes:
a facial feature data acquiring unit, configured to input the frame images into a convolutional neural network model and obtain the facial feature data through the backpropagation (BP) algorithm.
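The per-frame convolutional network of module 605 can be illustrated with a single convolution filter, ReLU, and max pooling in numpy; training by backpropagation is omitted here, and the kernel is just a random placeholder rather than a learned filter:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation: the forward pass of one CNN filter."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/cols are cropped."""
    H, W = x.shape[0] // size * size, x.shape[1] // size * size
    x = x[:H, :W].reshape(H // size, size, W // size, size)
    return x.max(axis=(1, 3))

frame = np.random.default_rng(0).normal(size=(8, 8))    # one grayscale frame
kernel = np.random.default_rng(1).normal(size=(3, 3))   # placeholder filter
features = max_pool(np.maximum(conv2d(frame, kernel), 0.0))  # shape (3, 3)
```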
The third feature extraction module 606 specifically includes:
an optical flow displacement acquiring unit, configured to perform dynamic feature extraction on the video images in the audio-video data to obtain frame-to-frame optical flow displacements over multiple frames;
an optical flow image acquisition unit, configured to obtain the optical flow images from the optical flow displacements using the curvature change method and the gray-value constancy assumption.
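The gray-value constancy assumption used by module 606 states that a moving pixel keeps its intensity, which linearizes to Ix·u + Iy·v + It = 0 at every pixel. A minimal least-squares sketch for a single global displacement follows; the curvature-change term named above is omitted, and a real dense flow method would add smoothness constraints and per-window estimates:

```python
import numpy as np

def global_flow(frame1, frame2):
    """Least-squares estimate of one global (u, v) displacement between
    two grayscale frames under the gray-value constancy assumption:
        Ix * u + Iy * v + It = 0   at every pixel."""
    Ix = np.gradient(frame1, axis=1)       # spatial gradient, x
    Iy = np.gradient(frame1, axis=0)       # spatial gradient, y
    It = frame2 - frame1                   # temporal gradient
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

Shifting a horizontal ramp image one pixel to the right, for example, recovers u ≈ 1 and v ≈ 0.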
The facial depth feature data acquisition module 608 specifically includes:
a facial overall data acquiring unit, configured to connect the facial feature data and the facial motion feature data through a fully connected layer to obtain facial overall data;
a facial depth feature data acquiring unit, configured to input the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
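The fusion step of module 608, concatenation of the two feature streams followed by a fully connected layer whose output sequence then goes to the second LSTM, can be sketched as follows. The dimensions are hypothetical and the ReLU activation is an assumption, as the excerpt does not specify one:

```python
import numpy as np

def fuse_facial_features(static_feats, motion_feats, W, b):
    """Concatenate per-frame static (appearance) and motion (optical flow)
    features and apply one fully connected ReLU layer, yielding the
    per-frame 'facial overall data' to be fed to the second LSTM."""
    x = np.concatenate([static_feats, motion_feats], axis=-1)
    return np.maximum(x @ W + b, 0.0)

rng = np.random.default_rng(0)
T, d_static, d_motion, d_out = 5, 4, 3, 6
overall = fuse_facial_features(
    rng.normal(size=(T, d_static)),                  # CNN frame features
    rng.normal(size=(T, d_motion)),                  # optical-flow features
    rng.normal(scale=0.1, size=(d_static + d_motion, d_out)),
    np.zeros(d_out))                                 # shape: (T, d_out)
```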
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to each other. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively simple, and the relevant details can be found in the description of the method.
Specific examples are used herein to illustrate the principles and implementations of the present invention; the above embodiments are merely intended to help understand the method of the present invention and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present invention, make changes to the specific implementation and application scope. In conclusion, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A voice and facial feature extraction method applied to depression detection, characterized in that the method comprises:
randomly selecting a segment of audio-video data;
performing feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
inputting the spectral parameters and the acoustic parameters into a first deep neural network model to obtain voice depth feature data of the audio;
performing static feature extraction on the video images in the audio-video data to obtain frame images;
inputting the frame images into a second deep neural network model to obtain facial feature data;
performing dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
inputting the optical flow images into a third deep neural network model to obtain facial motion feature data;
inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
inputting the voice depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
2. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the spectral parameters and the acoustic parameters into the first deep neural network model to obtain the voice depth feature data of the audio specifically comprises:
inputting the spectral parameters and the acoustic parameters into a first deep belief network to obtain voice high-level features;
inputting the voice high-level features into a first long short-term memory network model to obtain long-duration high-level features and short-duration high-level features;
inputting the long-duration high-level features and the short-duration high-level features into a second deep belief network to obtain the voice depth feature data.
3. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the frame images into the second deep neural network model to obtain the facial feature data specifically comprises:
inputting the frame images into a convolutional neural network model and obtaining the facial feature data through the backpropagation (BP) algorithm.
4. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that performing dynamic feature extraction on the video images in the audio-video data to obtain the optical flow images specifically comprises:
performing dynamic feature extraction on the video images in the audio-video data to obtain frame-to-frame optical flow displacements over multiple frames;
obtaining the optical flow images from the optical flow displacements using the curvature change method and the gray-value constancy assumption.
5. The voice and facial feature extraction method applied to depression detection according to claim 1, characterized in that inputting the facial feature data and the facial motion feature data into the third deep neural network model to obtain the facial depth feature data of the video specifically comprises:
connecting the facial feature data and the facial motion feature data through a fully connected layer to obtain facial overall data;
inputting the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
6. A voice and facial feature extraction system applied to depression detection, characterized in that the system comprises:
a selection module, configured to randomly select a segment of audio-video data;
a first feature extraction module, configured to perform feature extraction on the audio data in the audio-video data according to an energy information method to obtain spectral parameters and acoustic parameters;
a voice depth feature data acquisition module, configured to input the spectral parameters and the acoustic parameters into a first deep neural network model to obtain voice depth feature data of the audio;
a second feature extraction module, configured to perform static feature extraction on the video images in the audio-video data to obtain frame images;
a facial feature data acquisition module, configured to input the frame images into a second deep neural network model to obtain facial feature data;
a third feature extraction module, configured to perform dynamic feature extraction on the video images in the audio-video data to obtain optical flow images;
a facial motion feature data acquisition module, configured to input the optical flow images into a third deep neural network model to obtain facial motion feature data;
a facial depth feature data acquisition module, configured to input the facial feature data and the facial motion feature data into the third deep neural network model to obtain facial depth feature data of the video;
a fusion module, configured to input the voice depth feature data and the facial depth feature data into a fourth neural network model to obtain fused data.
7. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the voice depth feature data acquisition module specifically comprises:
a first input unit, configured to input the spectral parameters and the acoustic parameters into a first deep belief network to obtain voice high-level features;
a second input unit, configured to input the voice high-level features into a first long short-term memory network model to obtain long-duration high-level features and short-duration high-level features;
a third input unit, configured to input the long-duration high-level features and the short-duration high-level features into a second deep belief network to obtain the voice depth feature data.
8. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the facial feature data acquisition module specifically comprises:
a facial feature data acquiring unit, configured to input the frame images into a convolutional neural network model and obtain the facial feature data through the backpropagation (BP) algorithm.
9. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the third feature extraction module specifically comprises:
an optical flow displacement acquiring unit, configured to perform dynamic feature extraction on the video images in the audio-video data to obtain frame-to-frame optical flow displacements over multiple frames;
an optical flow image acquisition unit, configured to obtain the optical flow images from the optical flow displacements using the curvature change method and the gray-value constancy assumption.
10. The voice and facial feature extraction system applied to depression detection according to claim 6, characterized in that the facial depth feature data acquisition module specifically comprises:
a facial overall data acquiring unit, configured to connect the facial feature data and the facial motion feature data through a fully connected layer to obtain facial overall data;
a facial depth feature data acquiring unit, configured to input the facial overall data into a second long short-term memory network model to obtain the facial depth feature data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810762032.3A CN109171769A (en) | 2018-07-12 | 2018-07-12 | Voice and facial feature extraction method and system applied to depression detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109171769A true CN109171769A (en) | 2019-01-11 |
Family
ID=64936032
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810762032.3A Pending CN109171769A (en) | 2018-07-12 | 2018-07-12 | Voice and facial feature extraction method and system applied to depression detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109171769A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133481A (en) * | 2017-05-22 | 2017-09-05 | 西北工业大学 | The estimation of multi-modal depression and sorting technique based on DCNN DNN and PV SVM |
Non-Patent Citations (1)
Title |
---|
YANG, Yunong: "Research on Facial Expression Recognition Methods Based on Deep Learning", China Doctoral Dissertations Full-text Database, Information Science and Technology, 2018, No. 03, I138-26 * |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109784287A (en) * | 2019-01-22 | 2019-05-21 | 中国科学院自动化研究所 | Information processing method, system, device based on scene class signal forehead leaf network |
US10915815B1 (en) | 2019-01-22 | 2021-02-09 | Institute Of Automation, Chinese Academy Of Sciences | Information processing method, system and device based on contextual signals and prefrontal cortex-like network |
CN111357011B (en) * | 2019-01-31 | 2024-04-30 | 深圳市大疆创新科技有限公司 | Environment sensing method and device, control method and device and vehicle |
CN111357011A (en) * | 2019-01-31 | 2020-06-30 | 深圳市大疆创新科技有限公司 | Environment sensing method and device, control method and device and vehicle |
CN110123343B (en) * | 2019-04-19 | 2023-10-03 | 西北师范大学 | Depression detection device based on speech analysis |
CN110123343A (en) * | 2019-04-19 | 2019-08-16 | 西北师范大学 | Depression detection device based on speech analysis |
CN110675953A (en) * | 2019-09-23 | 2020-01-10 | 湖南检信智能科技有限公司 | Method for screening and identifying mental patients by using artificial intelligence and big data |
WO2021101439A1 (en) * | 2019-11-21 | 2021-05-27 | Lian Wang | Artificial intelligence brain |
CN112825014A (en) * | 2019-11-21 | 2021-05-21 | 王炼 | Artificial intelligence brain |
US10956809B1 (en) | 2019-11-21 | 2021-03-23 | Wang Lian | Artificial intelligence brain |
CN111292765A (en) * | 2019-11-21 | 2020-06-16 | 台州学院 | Bimodal emotion recognition method fusing multiple deep learning models |
CN111297350B (en) * | 2020-02-27 | 2021-08-31 | 福州大学 | Three-heart beat multi-model comprehensive decision-making electrocardiogram feature classification method integrating source end influence |
CN111297350A (en) * | 2020-02-27 | 2020-06-19 | 福州大学 | Three-heart beat multi-model comprehensive decision-making electrocardiogram feature classification method integrating source end influence |
CN111462841A (en) * | 2020-03-12 | 2020-07-28 | 华南理工大学 | Depression intelligent diagnosis device and system based on knowledge graph |
CN111462773A (en) * | 2020-03-26 | 2020-07-28 | 心图熵动科技(苏州)有限责任公司 | Suicide risk prediction model generation method and prediction system |
CN111553899A (en) * | 2020-04-28 | 2020-08-18 | 湘潭大学 | Audio and video based Parkinson non-contact intelligent detection method and system |
CN112120716A (en) * | 2020-09-02 | 2020-12-25 | 中国人民解放军军事科学院国防科技创新研究院 | Wearable multi-mode emotional state monitoring device |
CN112472088A (en) * | 2020-10-22 | 2021-03-12 | 深圳大学 | Emotional state evaluation method and device, intelligent terminal and storage medium |
CN112472088B (en) * | 2020-10-22 | 2022-11-29 | 深圳大学 | Emotional state evaluation method and device, intelligent terminal and storage medium |
CN112307947A (en) * | 2020-10-29 | 2021-02-02 | 北京沃东天骏信息技术有限公司 | Method and apparatus for generating information |
CN112768070A (en) * | 2021-01-06 | 2021-05-07 | 万佳安智慧生活技术(深圳)有限公司 | Mental health evaluation method and system based on dialogue communication |
CN112818892A (en) * | 2021-02-10 | 2021-05-18 | 杭州医典智能科技有限公司 | Multi-modal depression detection method and system based on time convolution neural network |
US11963771B2 (en) | 2021-02-19 | 2024-04-23 | Institute Of Automation, Chinese Academy Of Sciences | Automatic depression detection method based on audio-video |
CN112687390B (en) * | 2021-03-12 | 2021-06-18 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN112687390A (en) * | 2021-03-12 | 2021-04-20 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN113392918A (en) * | 2021-06-24 | 2021-09-14 | 哈尔滨理工大学 | Depressive disorder related factor identification method based on multi-source information fusion |
CN113485261A (en) * | 2021-06-29 | 2021-10-08 | 西北师范大学 | CAEs-ACNN-based soft measurement modeling method |
CN113485261B (en) * | 2021-06-29 | 2022-06-28 | 西北师范大学 | CAEs-ACNN-based soft measurement modeling method |
CN113705328A (en) * | 2021-07-06 | 2021-11-26 | 合肥工业大学 | Depression detection method and system based on facial feature points and facial movement units |
CN113397563A (en) * | 2021-07-22 | 2021-09-17 | 北京脑陆科技有限公司 | Training method, device, terminal and medium for depression classification model |
CN113812948A (en) * | 2021-09-08 | 2021-12-21 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Dequantization anxiety and depression psychological detection method and device |
CN115715680A (en) * | 2022-12-01 | 2023-02-28 | 杭州市第七人民医院 | Anxiety discrimination method and device based on connective tissue potential |
CN117079772A (en) * | 2023-07-24 | 2023-11-17 | 广东智正科技有限公司 | Intelligent correction system and terminal based on mental evaluation analysis of community correction object |
CN117137488B (en) * | 2023-10-27 | 2024-01-26 | 吉林大学 | Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images |
CN117137488A (en) * | 2023-10-27 | 2023-12-01 | 吉林大学 | Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109171769A (en) | Voice and facial feature extraction method and system applied to depression detection | |
CN110507335B (en) | Multi-mode information based criminal psychological health state assessment method and system | |
CN110556129B (en) | Bimodal emotion recognition model training method and bimodal emotion recognition method | |
Bachorowski | Vocal expression and perception of emotion | |
Narayanan et al. | Behavioral signal processing: Deriving human behavioral informatics from speech and language | |
Cen et al. | A real-time speech emotion recognition system and its application in online learning | |
CN106073706B (en) | A kind of customized information and audio data analysis method and system towards Mini-mental Status Examination | |
Khan et al. | Emotion Based Signal Enhancement Through Multisensory Integration Using Machine Learning. | |
Sinha | Recognizing complex patterns | |
WO2015158017A1 (en) | Intelligent interaction and psychological comfort robot service system | |
CN111081371A (en) | Virtual reality-based early autism screening and evaluating system and method | |
Rituerto-González et al. | Data augmentation for speaker identification under stress conditions to combat gender-based violence | |
Caponetti et al. | Biologically inspired emotion recognition from speech | |
CN110348409A (en) | A kind of method and apparatus that facial image is generated based on vocal print | |
Fang et al. | Combining acoustic signals and medical records to improve pathological voice classification | |
Upadhyay et al. | SmHeSol (IoT-BC): smart healthcare solution for future development using speech feature extraction integration approach with IoT and blockchain | |
Li et al. | Global-local-feature-fused driver speech emotion detection for intelligent cockpit in automated driving | |
Cristani et al. | Generative modeling and classification of dialogs by a low-level turn-taking feature | |
Cowie et al. | Piecing together the emotion jigsaw | |
Degila et al. | The UCD system for the 2018 FEMH voice data challenge | |
Gavrilescu et al. | Feedforward neural network-based architecture for predicting emotions from speech | |
US20220015687A1 (en) | Method for Screening Psychiatric Disorder Based On Conversation and Apparatus Therefor | |
Gupta et al. | REDE-Detecting human emotions using CNN and RASA | |
Massaro | The McGurk effect: Auditory visual speech perception’s piltdown man | |
Feather et al. | Auditory texture synthesis from task-optimized convolutional neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190111 |