
CN112307947A - Method and apparatus for generating information

Info

Publication number: CN112307947A
Application number: CN202011179663.6A
Authority: CN
Inventors: 谢佩; 赵俊
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Tuoxian Technology Co Ltd
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2021-02-02

Abstract

Embodiments of the present disclosure disclose a method and an apparatus for generating information. One embodiment of the method comprises: sampling a first preset number of image frames from a video to be evaluated; ordering the image frames by timestamp to obtain an image frame sequence; determining key points of the face of an evaluation object from each image frame; normalizing the coordinates of the key points and, based on the normalized coordinates, extracting the following features from the image frame: facial movement, gaze direction of both eyes, the distance between the nose-tip key point and the upper-lip key point, the distance between the left and right mouth-corner key points, the area of the mouth region, and the offset of each key point in the image frame relative to the corresponding key point in a reference frame, wherein the reference frame is the image frame with the smallest timestamp in the image frame sequence; encoding the extracted features to obtain a feature vector of the image frame; and inputting the feature vector of each image frame into a pre-trained depression degree prediction model to estimate a depression index of the evaluation object.

Description

Method and apparatus for generating information

Technical Field

Embodiments of the present disclosure relate to the technical field of artificial intelligence, in particular to intelligent healthcare, and more specifically to a method and an apparatus for generating information.

Background

Depression is a typical mood disorder characterized mainly by significant and persistent low mood. Depending on its severity, it affects patients' lives and work to varying degrees and, in serious cases, can lead to suicide and cause social harm. According to data published by the World Health Organization (WHO), more than 350 million people worldwide had suffered from depression by 2019, including more than 95 million in China, making depression one of the three leading diseases threatening human health.

With the development of artificial intelligence technology, the severity of depression can be evaluated by collecting video or audio data of patients with depression.

In the related art, video-based depression detection methods encode a video image sequence to train a machine learning model that identifies the emotional state of an evaluation object, and then estimate the degree of depression of the evaluation object from statistics of the identified emotional states.

Disclosure of Invention

Embodiments of the present disclosure propose methods and apparatuses for generating information.

In a first aspect, an embodiment of the present disclosure provides a method for generating information, the method including: sampling a first preset number of image frames from a video to be evaluated, wherein the image frames include face images of an evaluation object; ordering the image frames by timestamp to obtain an image frame sequence; for each image frame in the image frame sequence, performing the following feature extraction steps to determine a feature vector of the image frame: determining key points of the face of the evaluation object from the image frame; normalizing the coordinates of the key points and, based on the normalized coordinates, extracting the following features from the image frame: facial movement, gaze direction of both eyes, the distance between the nose-tip key point and the upper-lip key point, the distance between the left and right mouth-corner key points, the area of the mouth region, and the offset of each key point in the image frame relative to the corresponding key point in a reference frame, wherein the reference frame is the image frame with the smallest timestamp in the image frame sequence; and encoding the features extracted from the image frame to obtain the feature vector of the image frame; and inputting the feature vector of each image frame into a pre-trained depression degree prediction model to estimate a depression index of the evaluation object, wherein the depression index represents the degree of depression of the evaluation object.

In some embodiments, the video to be evaluated is obtained via the following steps: presenting pre-constructed interactive questions to the evaluation object; and capturing, in real time, video of the evaluation object while the evaluation object answers the interactive questions, and determining the captured video as the video to be evaluated.

In some embodiments, sampling a first preset number of image frames from a video to be evaluated includes: determining a second preset number of video clips from the video to be evaluated based on the length of time the evaluation object takes to answer each interactive question, wherein each video clip corresponds to one interactive question; extracting a third preset number of image frames from each video clip to obtain an image frame set corresponding to each video clip; and ordering the image frames in each image frame set by timestamp to obtain an image frame sequence corresponding to each video clip. Inputting the feature vector of each image frame into the pre-trained depression degree prediction model to estimate the depression index of the evaluation object then includes: encoding the feature vectors of the image frames in the same image frame sequence to obtain a feature vector sequence of that image frame sequence; and inputting the feature vector sequence into the pre-trained depression degree prediction model to estimate the depression index of the evaluation object.

In some embodiments, before normalizing the coordinates of the key points, the method further comprises: adjusting, by affine transformation, the orientation of the face image of the evaluation object in the image frame to be consistent with the normal vector of the image frame.

In some embodiments, the depression degree prediction model is a time convolutional neural network based on a self-attention mechanism, and comprises an input layer, a hidden layer, an attention layer and an output layer, wherein the input layer is used for receiving the feature vector of each image frame and extracting feature sequences from the feature vectors; the hidden layer is used for encoding the feature sequences and outputting the encoded feature sequences; the attention layer is used for acquiring the encoded feature sequences output by the hidden layer, weighting them based on an attention mechanism, and determining their weighted sum; and the output layer performs logistic regression on the weighted sum to estimate the depression index.

In some embodiments, the time convolutional neural network based on the self-attention mechanism is trained by: constructing a first initial time convolutional neural network, wherein the first initial time convolutional neural network includes an input layer, a hidden layer, an attention layer and an output layer, the output layer of the first initial time convolutional neural network is a fully connected layer, and the fully connected layer estimates a depression classification result based on the weighted sum; inputting first sample feature vectors labeled with sample depression classification results into the first initial time convolutional neural network, taking the sample depression classification results as expected output, and training the first initial time convolutional neural network until the accuracy of the depression classification results it estimates meets a preset accuracy threshold, to obtain a trained first time convolutional neural network; replacing the fully connected layer in the first time convolutional neural network with a logistic regression layer to obtain a second initial time convolutional neural network, wherein the logistic regression layer estimates a depression index based on the weighted sum; and inputting second sample feature vectors labeled with sample depression indices into the second initial time convolutional neural network, taking the sample depression indices as expected output, and training the second initial time convolutional neural network to obtain a trained second time convolutional neural network, which is determined as the time convolutional neural network based on the self-attention mechanism.

In a second aspect, an embodiment of the present disclosure provides an apparatus for generating information, the apparatus including: an image sampling unit configured to sample a first preset number of image frames from a video to be evaluated, wherein the image frames include face images of an evaluation object; a sequence generating unit configured to order the image frames by timestamp to obtain an image frame sequence; a feature extraction unit configured to perform, for each image frame in the image frame sequence, the following feature extraction steps to determine a feature vector of the image frame: determining key points of the face of the evaluation object from the image frame; normalizing the coordinates of the key points and, based on the normalized coordinates, extracting the following features from the image frame: facial movement, gaze direction of both eyes, the distance between the nose-tip key point and the upper-lip key point, the distance between the left and right mouth-corner key points, the area of the mouth region, and the offset of each key point in the image frame relative to the corresponding key point in a reference frame, wherein the reference frame is the image frame with the smallest timestamp in the image frame sequence; and encoding the features extracted from the image frame to obtain the feature vector of the image frame; and an information prediction unit configured to input the feature vector of each image frame into a pre-trained depression degree prediction model to estimate a depression index of the evaluation object, wherein the depression index represents the degree of depression of the evaluation object.

In some embodiments, the apparatus further comprises a video capture unit configured to: present pre-constructed interactive questions to the evaluation object; and capture, in real time, video of the evaluation object while the evaluation object answers the interactive questions, and determine the captured video as the video to be evaluated.

In some embodiments, the image sampling unit further comprises a video extraction module configured to: determining a second preset number of video clips from the video to be evaluated based on the time length of the evaluation object for answering each interactive question, wherein each video clip corresponds to one interactive question; respectively extracting a third preset number of image frames from each video clip to obtain an image frame set corresponding to each video clip; the sequence generation unit is further configured to: respectively sequencing the image frames in each image frame set based on the time stamps of the image frames to obtain an image frame sequence corresponding to each video clip; and the feature extraction unit comprises a feature encoding module configured to: coding the feature vector of each image frame in the same image frame sequence to obtain the feature vector sequence of the image frame sequence; the information prediction unit is further configured to: and inputting the characteristic vector sequence into a pre-trained depression degree prediction model, and estimating the depression index of the evaluation object.

In some embodiments, the feature extraction unit further comprises an affine transformation module configured to: adjust, by affine transformation, the orientation of the face image of the evaluation object in the image frame to be consistent with the normal vector of the image frame.

In some embodiments, the depression degree prediction model is a time convolutional neural network based on a self-attention mechanism, and includes an input layer, a hidden layer, an attention layer and an output layer, wherein the input layer is used for receiving the feature vector of each image frame and extracting feature sequences from the feature vectors; the hidden layer is used for encoding the feature sequences and outputting the encoded feature sequences; the attention layer is used for acquiring the encoded feature sequences output by the hidden layer, weighting them based on an attention mechanism, and determining their weighted sum; and the output layer performs logistic regression on the weighted sum to estimate the depression index.

In some embodiments, the apparatus further comprises a model training module configured to train the time convolutional neural network based on the self-attention mechanism by: constructing a first initial time convolutional neural network, wherein the first initial time convolutional neural network includes an input layer, a hidden layer, an attention layer and an output layer, the output layer of the first initial time convolutional neural network is a fully connected layer, and the fully connected layer estimates a depression classification result based on the weighted sum; inputting first sample feature vectors labeled with sample depression classification results into the first initial time convolutional neural network, taking the sample depression classification results as expected output, and training the first initial time convolutional neural network until the accuracy of the depression classification results it estimates meets a preset accuracy threshold, to obtain a trained first time convolutional neural network; replacing the fully connected layer in the first time convolutional neural network with a logistic regression layer to obtain a second initial time convolutional neural network, wherein the logistic regression layer estimates a depression index based on the weighted sum; and inputting second sample feature vectors labeled with sample depression indices into the second initial time convolutional neural network, taking the sample depression indices as expected output, and training the second initial time convolutional neural network to obtain a trained second time convolutional neural network, which is determined as the time convolutional neural network based on the self-attention mechanism.

According to the method and the apparatus for generating information provided by the embodiments of the present disclosure, image frames including the face image of the evaluation object are sampled from the video to be evaluated; after the key points in each image frame are normalized, features of multiple dimensions related to the degree of depression are extracted from the image frame and assembled into a feature vector; the feature vector is then input into a pre-trained depression degree prediction model to estimate the depression index of the evaluation object, which represents the degree of depression of the evaluation object. Because more depression-related feature dimensions are extracted from the video to be evaluated, and because these features are not affected by environmental factors, the accuracy and stability of estimating the degree of depression with a machine learning model are improved.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which some embodiments of the present disclosure may be applied;

FIG. 2 is a flow diagram for one embodiment of a method for generating information, according to the present disclosure;

FIG. 3 is a schematic diagram of a scenario of a flow of the method for generating information shown in FIG. 2;

FIG. 4 is a flow diagram of yet another embodiment of a method for generating information according to the present disclosure;

FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating information according to the present disclosure;

FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary system architecture 100 of a method for generating information or an apparatus for generating information to which embodiments of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like; for example, a video to be evaluated of the evaluation object may be sent to the server, and an estimated depression index of the evaluation object may be received from the server.

The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be electronic devices with communication functions, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented, for example, as multiple pieces of software or software modules that provide distributed services, or as a single piece of software or software module. No particular limitation is imposed here.

The server 105 may be a server providing various services, such as a background data server processing a video to be evaluated uploaded by the terminal devices 101, 102, 103. The background data server may perform processing such as sampling and feature extraction on the received video to be evaluated, and feed back a processing result (e.g., an estimated depression index of the evaluation object) to the terminal device.

It should be noted that the method for generating information provided by the embodiments of the present disclosure may be executed by the terminal devices 101, 102, and 103, or may be executed by the server 105. Accordingly, the apparatus for generating information may be provided in the terminal devices 101, 102, 103, or in the server 105. No particular limitation is imposed here.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented, for example, as multiple pieces of software or software modules that provide distributed services, or as a single piece of software or software module. No particular limitation is imposed here.

With continued reference to fig. 2, fig. 2 illustrates a flow 200 of one embodiment of a method for generating information in accordance with the present disclosure. The method for generating information comprises the following steps:

Step 201, a first preset number of image frames are sampled from a video to be evaluated, wherein the image frames include a face image of an evaluation object.

Research shows that the facial expression features of depression patients, such as how quickly their expressions or gaze change, are related to the severity of depression; that is, a patient's degree of depression can be predicted from facial expression features. The core idea of predicting the degree of depression from videos or images is therefore to extract features related to the degree of depression from the images and then predict the degree of depression of the evaluation object based on the extracted features.

In this embodiment, an executing subject (e.g., the server 105 shown in fig. 1) may receive a video to be evaluated from a terminal device (e.g., a smartphone as described in fig. 1) via a network, and then extract a first preset number of image frames from the video to be evaluated based on actual needs, where the image frames include a facial image of an evaluation object, so that the executing subject extracts features related to the degree of depression from the image frames.

In a specific example, the executing subject may also be a terminal device operated by a doctor (such as a laptop computer shown in fig. 1). The camera of the executing subject may capture, in real time, video of the doctor's consultation with a depression patient, and the executing subject then samples the video directly to obtain a preset number of image frames.

Step 202, sequencing each image frame based on the time stamp of each image frame to obtain an image frame sequence.
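Steps 201 and 202 can be illustrated with a minimal Python sketch. The disclosure does not specify a sampling strategy; evenly spaced sampling is an assumption made here, and the function names `sample_indices` and `order_by_timestamp` are hypothetical.

```python
from typing import Any, List, Tuple


def sample_indices(total_frames: int, k: int) -> List[int]:
    """Pick k frame indices spread evenly over a video (assumed strategy)."""
    if k <= 0 or total_frames <= 0:
        return []
    k = min(k, total_frames)  # cannot sample more frames than exist
    step = total_frames / k
    # Take the frame nearest the middle of each of the k equal-width bins.
    return [int(step * i + step / 2) for i in range(k)]


def order_by_timestamp(frames: List[Tuple[float, Any]]) -> List[Tuple[float, Any]]:
    """Sort (timestamp, frame) pairs so the earliest frame comes first."""
    return sorted(frames, key=lambda pair: pair[0])
```

After ordering, the first element of the sequence is the frame with the smallest timestamp, which is the frame the method later uses as the reference frame.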

Based on the image frame sequence obtained in step 202, the executing subject performs the following steps 203, 204, and 205 for each image frame in the image frame sequence to determine a feature vector of each image frame.

Step 203, key points of the face of the evaluation object are determined from the image frame.

As an example, the executing subject may identify the face region in an image frame using the open-source framework OpenFace, and then determine the pixel coordinates of 68 key points of the face of the evaluation object.

It should be noted that identifying facial key points in an image frame is a mature technique in the field of computer vision; for example, the executing subject may also perform step 203 using a convolutional neural network or a recurrent neural network, which is not limited in the present application.

Step 204, the coordinates of the key points are normalized, and features are extracted from the image frame based on the normalized coordinates of the key points.

In this embodiment, the features related to the degree of depression extracted from the image frame include: facial movement, the gaze direction of both eyes, the distance between the nose-tip key point and the upper-lip key point, the distance between the left and right mouth-corner key points, the area of the mouth region, and the offset of each key point in the image frame relative to the corresponding key point in a reference frame, wherein the reference frame is the image frame with the smallest timestamp in the image frame sequence.
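The geometric features in this list (the two distances, the mouth area and the key point offsets) follow directly from key point coordinates. The sketch below is illustrative only: the helper names are hypothetical, and computing the mouth area with the shoelace formula over an ordered lip contour is an assumption, since the disclosure does not say how the area is obtained.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def distance(a: Point, b: Point) -> float:
    """Euclidean distance, e.g. nose tip to upper lip, or mouth corner to mouth corner."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def polygon_area(contour: Sequence[Point]) -> float:
    """Shoelace formula; contour points must be listed in order around the polygon."""
    twice_area = 0.0
    for i, (x1, y1) in enumerate(contour):
        x2, y2 = contour[(i + 1) % len(contour)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def keypoint_offsets(frame: Sequence[Point], reference: Sequence[Point]) -> List[float]:
    """Flattened (dx, dy) offsets of each key point relative to the reference frame."""
    if len(frame) != len(reference):
        raise ValueError("frame and reference must have the same number of key points")
    offsets: List[float] = []
    for (x, y), (rx, ry) in zip(frame, reference):
        offsets.extend((x - rx, y - ry))
    return offsets
```

With 68 key points per frame, `keypoint_offsets` returns 136 values, one horizontal and one vertical offset per key point.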

In methods for predicting the degree of depression, the more depression-related feature dimensions take part in the prediction, the higher the prediction accuracy. In the related art, the depression-related features extracted from an image generally include only facial key points, gaze directions and the like, so the feature dimensionality is small, and environmental factors (such as illumination) at the time the video to be evaluated is captured have a large influence on prediction accuracy. To address these problems, this embodiment mines deeper, higher-order features related to the degree of depression from the image frames based on the normalized key points, thereby expanding the feature dimensions that take part in the prediction and avoiding the adverse effect of environmental factors on the prediction result.

In this embodiment, facial movement corresponds to the 44 facial Action Units (AUs) defined in terms of facial anatomy, for example AU4 for brow lowering (frowning) and AU9 for nose wrinkling.

As an example, the executing subject may normalize the key points of the image frame by the following steps: translating the nose-tip key point to the origin; applying a rotation so that the inner-canthus key points of the two eyes have the same vertical coordinate; applying a similarity transformation centered on the nose tip so that the distance between the inner-canthus key points of the evaluation object is normalized to 1; and applying the same coordinate transformation to the pixel coordinates of the other key points to obtain the normalized key point coordinates. On this basis, the executing subject may extract the above features using OpenFace.
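The normalization steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the disclosure; the function name and its index parameters are hypothetical. In the widely used 68-point landmark layout the nose tip and inner eye corners are usually indices 30, 39 and 42 (0-based), but the disclosure does not state indices, so they are passed in explicitly.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalize_keypoints(
    points: Sequence[Point],
    nose_tip: int,
    inner_canthus_a: int,
    inner_canthus_b: int,
) -> List[Point]:
    """Translate, rotate and scale key points as described in step 204."""
    # 1. Translate so the nose tip sits at the origin.
    nx, ny = points[nose_tip]
    shifted = [(x - nx, y - ny) for x, y in points]

    # 2. Rotate about the origin so both inner canthi share the same y value.
    ax, ay = shifted[inner_canthus_a]
    bx, by = shifted[inner_canthus_b]
    angle = math.atan2(by - ay, bx - ax)
    cos_t, sin_t = math.cos(-angle), math.sin(-angle)
    rotated = [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in shifted]

    # 3. Scale about the origin (the nose tip) so the canthus distance becomes 1.
    span = math.hypot(bx - ax, by - ay)
    if span == 0:
        raise ValueError("inner canthus key points coincide")
    return [(x / span, y / span) for x, y in rotated]
```

Because the rotation and scaling are both centered on the origin, the nose tip stays at (0, 0) throughout, which matches the requirement that the similarity transformation be centered on the nose tip.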

In some optional implementations of this embodiment, before normalizing the coordinates of the keypoints, the method further includes: the orientation of the face image of the evaluation object in the image frame is adjusted to be consistent with the normal vector of the image frame by adopting affine transformation. In this way, the absence of key points due to the face orientation of the evaluation target can be avoided.

Step 205, encoding the features extracted from the image frame to obtain a feature vector of the image frame.

As an example, the executing subject may encode each of the extracted features separately to obtain multiple sub-vectors or scalars, and then combine the sub-vectors and scalars into a feature vector that represents the features related to the degree of depression in the image frame. For example, the extracted facial movement features are encoded as a 12-dimensional vector; the gaze direction of both eyes is encoded as a 6-dimensional vector; the distance between the nose-tip key point and the upper-lip key point, the distance between the left and right mouth-corner key points, and the area of the mouth region are scalars; and the offsets of the key points in the image frame relative to the corresponding key points in the reference frame are encoded as a 136-dimensional vector. The sub-vectors and scalars are then combined into a 157-dimensional feature vector (12 + 6 + 3 + 136 = 157).
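The dimension bookkeeping in this example can be checked with a short sketch. Simple concatenation in the order the features are listed is an assumption, since the disclosure only says the sub-vectors and scalars are combined; `build_feature_vector` is a hypothetical name.

```python
from typing import List, Sequence


def build_feature_vector(
    action_units: Sequence[float],   # 12 values for facial movement
    gaze: Sequence[float],           # 6 values for the gaze direction of both eyes
    nose_lip_distance: float,
    mouth_width: float,
    mouth_area: float,
    offsets: Sequence[float],        # 136 values: (dx, dy) for 68 key points
) -> List[float]:
    """Concatenate per-frame features into one 157-dimensional vector."""
    if len(action_units) != 12 or len(gaze) != 6 or len(offsets) != 136:
        raise ValueError("unexpected sub-vector length")
    return [
        *action_units,
        *gaze,
        nose_lip_distance,
        mouth_width,
        mouth_area,
        *offsets,
    ]
```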

Step 206, the feature vector of each image frame is input into a pre-trained depression degree prediction model to estimate a depression index of the evaluation object, wherein the depression index represents the degree of depression of the evaluation object.

In the present embodiment, the executing subject may predict the degree of depression of the evaluation object using a Long Short-Term Memory network (LSTM), a Temporal Convolutional Network (TCN), or a Convolutional Neural Network (CNN).

In one specific example, the executing subject constructs an initial CNN model and then obtains depression patient data from the public emotion recognition datasets DAIC-WOZ and EMOTI-W, including, for example, sample videos of patients and diagnoses of their degree of depression. The degree of depression of each patient is represented by a depression index: the higher the depression index, the more severe the depression. The sample videos are input into OpenFace to extract sample feature vectors, and each sample feature vector is labeled with a sample depression index. The labeled sample feature vectors are input into the initial CNN with the labeled depression index as the expected output, and the initial CNN is trained by a machine learning method, adjusting its parameters, to obtain the trained CNN. The executing subject then inputs the feature vector of each image frame obtained in step 205 into the trained CNN to predict the depression index of the evaluation object.

With continued reference to fig. 3, fig. 3 is a schematic view of a scenario of the flow of the method shown in fig. 2. In fig. 3, a user may send a video to be evaluated to an executing subject 302 through a smartphone 301, where the executing subject 302 may be a server or a terminal device. After receiving the video 303 to be evaluated sent by the user, the executing subject samples a first preset number of image frames from the video 303, constructs an image frame sequence 304 in time order, extracts features related to the degree of depression from each image frame 305, and generates a feature vector 306; the feature vector of each image frame is then input into a pre-trained depression degree prediction model 307 to estimate a depression index 308 of the evaluation object.

In some optional implementations of the foregoing embodiment, the depression degree prediction model is a time convolution neural network based on a self-attention mechanism, and includes an input layer, a hidden layer, an attention layer, and an output layer, where the input layer is configured to receive feature vectors of each image frame and extract a feature sequence from the feature vectors of each image frame; the hidden layer is used for coding the characteristic sequence and outputting the coded characteristic sequence; the attention layer is used for acquiring the coded feature sequences output by the hidden layers, weighting the coded feature sequences output by the hidden layers based on an attention mechanism and determining the weighted sum of the coded feature sequences output by the hidden layers; and the output layer performs logistic regression on the weighted sum to estimate the depression index.

The core of the time convolutional neural network is causal convolution and dilated convolution. Dilated convolution requires a dilation factor to be set, and the dilation factor together with the number of hidden layers determines how far back along the time axis the model output can draw information, that is, the size of the receptive field. The original time convolutional neural network uses only the hidden-layer output at the last time point as the encoding of the feature vectors of the whole image frame sequence. If the image frame sequence is long and the receptive field is too small, some information from the first half of the sequence is lost and the accuracy of the prediction result is reduced.
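How the dilation factor and the number of hidden layers set the receptive field can be made concrete with a small calculation. The sketch assumes one causal dilated convolution per hidden layer with a shared kernel size, which the disclosure does not specify; architectures with several convolutions per layer have a larger receptive field. The function names are hypothetical.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Time steps visible to the last output of a stack of causal dilated convolutions.

    Each layer with dilation d extends the receptive field by (kernel_size - 1) * d.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


def covers_sequence(kernel_size: int, dilations: Sequence[int], seq_len: int) -> bool:
    """True if the last time step can see the whole input sequence."""
    return receptive_field(kernel_size, dilations) >= seq_len
```

For example, with kernel size 2 and dilations 1, 2, 4, 8 the last output sees 16 time steps, so a sequence of 30 frames would not be fully covered, which is the information-loss situation described above.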

In the time convolutional neural network based on the self-attention mechanism in this implementation, the attention layer is arranged after the hidden layer. The self-attention mechanism weights the encoded feature sequences output by the hidden layer at all time points to obtain a final weighted sum, which avoids the information loss caused by the limited receptive field of the time convolutional neural network. In addition, the attention layer can assign higher weights to the feature sequences that are more strongly correlated with the degree of depression, so as to capture the important depression-related features and improve prediction accuracy.

As an example, let the first preset number be m, and assume that the hidden layer of the time convolutional neural network based on the self-attention mechanism produces outputs at n time points. After the feature vectors of the image frames pass through the input layer and the hidden layer, n encoded feature sequences $(h_1, h_2, \dots, h_n)$ are obtained. As shown in equation (1), the attention layer weights $(h_1, h_2, \dots, h_n)$ with the learned self-attention vector $w$ to obtain the weighted sum $e$.

$$e = [h_1, \dots, h_n] \cdot \operatorname{softmax}\left([w^{T}h_1, \dots, w^{T}h_n]^{T}\right) \tag{1}$$
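Equation (1) can be sketched in NumPy as follows. The sketch assumes that the encoded feature sequences are stacked as the columns of a matrix `H` of shape (d, n) and that `w` has shape (d,); the function names and these shapes are illustrative assumptions.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    ex = np.exp(z)
    return ex / ex.sum()


def attention_pool(H: np.ndarray, w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Weighted sum of the columns of H as in equation (1).

    H: (d, n) matrix whose columns are h_1 ... h_n.
    w: (d,) self-attention vector (learned in practice; given here).
    Returns (e, alpha): the weighted sum and the attention weights.
    """
    scores = H.T @ w         # (n,) = [w^T h_1, ..., w^T h_n]
    alpha = softmax(scores)  # (n,) weights that sum to 1
    e = H @ alpha            # (d,) weighted sum of the columns
    return e, alpha
```

With `w` set to the zero vector, all weights are equal and `e` reduces to the mean of the hidden outputs; a learned `w` shifts the weight toward the time points it scores highest.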

Then, based on the learned regression coefficient u and the weighted sum e obtained for the m feature vectors, the output layer performs logistic regression to obtain the final depression index D, as shown in formula (2).

In this implementation, the purpose of training the first initial time convolutional neural network is to determine the parameters of the input layer and the hidden layer and to make the attention layer learn the self-attention vector. Then, the input layer, the hidden layer and the attention layer of the trained first time convolutional neural network are retained, and the fully connected layer in the output layer is replaced with a logistic regression layer to obtain a second initial time convolutional neural network. The second initial time convolutional neural network is trained so that the logistic regression layer learns the regression coefficients. The trained second time convolutional neural network can then predict the depression index of the evaluation object based on the input feature vectors.

In one specific example, the execution subject may pre-construct a first initial time convolutional neural network and then obtain first sample videos and the corresponding diagnosis results from the public emotion recognition data sets DAIC-WOZ and EMOTI-W. The execution subject then extracts first sample feature vectors from the first sample videos through steps 201 to 205 in the above embodiment and labels them with the diagnosis results (for example, a patient with depression may be labeled 1 and a subject without depression may be labeled 0). The labeled first sample feature vectors are input into the first initial time convolutional neural network to obtain its classification results, and the classification results are compared with the label values of the first sample feature vectors to obtain the prediction accuracy of the first initial time convolutional neural network. When the accuracy of the first initial time convolutional neural network reaches an accuracy threshold (for example, 80%), the training is complete and the first time convolutional neural network is obtained.

Next, the second stage of training is carried out. The fully connected layer in the first time convolutional neural network is replaced with a logistic regression layer to obtain a second initial time convolutional neural network. Second sample videos and the corresponding depression degree diagnosis results are obtained from the public emotion recognition data sets DAIC-WOZ and EMOTI-W. The execution subject extracts second sample feature vectors from the second sample videos through steps 201 to 205 in the above embodiment and labels them with the degree of depression to obtain labeled second sample feature vectors. The execution subject then inputs the labeled second sample feature vectors into the second initial time convolutional neural network to obtain an estimated depression index, compares the estimated depression index with the labeled sample depression index, and adjusts the regression coefficients of the logistic regression layer by back propagation until the loss function converges. The result is the trained second time convolutional neural network, that is, the depression degree prediction model.

With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating information is shown. The flow 400 of the method for generating information comprises the steps of:

Step 401, presenting pre-constructed interactive questions to the evaluation object.

In this embodiment, the pre-constructed interactive questions may be built from an emotion corpus and are used to elicit three kinds of emotion from the evaluation object: positive, neutral, and negative.

Step 402, collecting, in real time, a video of the evaluation object answering the interactive questions, and determining the video as the video to be evaluated.

As an example, the execution subject may be a terminal device with an interactive component, such as a tablet computer with a camera. When the evaluation object undergoes depression degree detection, the interactive questions may be presented on the screen of the tablet computer while the camera is turned on synchronously. Thanks to the interactive questions presented in step 401, the execution subject can capture the facial features of the evaluation object under various emotions, so that the video to be evaluated contains a richer and more comprehensive range of emotional changes of the evaluation object. This ensures that the depression-related features subsequently extracted from the video to be evaluated are more targeted.

Step 403, determining a second preset number of video segments from the video to be evaluated based on the time length of the evaluation object for answering each interactive question, wherein each video segment corresponds to one interactive question.

In this embodiment, since the interactive questions are drawn from three types of emotional corpus, the evaluation object responds differently to each interactive question, while the responses within the video clip corresponding to a single interactive question are more consistent and more closely related.
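One way to derive the clip boundaries of step 403 from the answer durations is sketched below. It assumes that the answers follow one another without gaps and that durations are given in seconds; the function name `split_by_answer_durations` and the `start_s` parameter are hypothetical.

```python
def split_by_answer_durations(
    durations_s: list[float], start_s: float = 0.0
) -> list[tuple[float, float]]:
    """Return one (start, end) interval per interactive question.

    Assumes the answers are recorded back to back, so each clip starts
    where the previous one ends.
    """
    segments = []
    t = start_s
    for d in durations_s:
        segments.append((t, t + d))
        t += d
    return segments


# Hypothetical answer durations for three interactive questions.
print(split_by_answer_durations([12.0, 8.5, 20.0]))
# [(0.0, 12.0), (12.0, 20.5), (20.5, 40.5)]
```

The number of returned intervals plays the role of the second preset number, since each clip corresponds to exactly one interactive question.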

Step 404, extracting a third preset number of image frames from each video segment, respectively, to obtain an image frame set corresponding to each video segment.

Based on the video segments extracted in step 403, the continuity and correlation between the image frames in each image frame set and the difference between different image frame sets can be improved.

Step 405, based on the time stamps of the image frames, the image frames in each image frame set are sorted respectively to obtain an image frame sequence corresponding to each video clip.
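Steps 404 and 405 can be sketched as follows. The disclosure does not fix a sampling strategy, so evenly spaced sampling within each clip is an assumption made here, and the names `sample_timestamps` and `to_sequence` are hypothetical. Frames are represented as (timestamp, image) pairs.

```python
import numpy as np


def sample_timestamps(start_s: float, end_s: float, count: int) -> list[float]:
    """Pick `count` evenly spaced timestamps inside one video clip.

    Each timestamp is the midpoint of one of `count` equal sub-intervals.
    """
    edges = np.linspace(start_s, end_s, count + 1)
    return ((edges[:-1] + edges[1:]) / 2).tolist()


def to_sequence(frames: list[tuple[float, object]]) -> list[tuple[float, object]]:
    """Sort (timestamp, image) pairs by timestamp to form a frame sequence."""
    return sorted(frames, key=lambda frame: frame[0])


# Hypothetical clip from 0 s to 10 s with a third preset number of 5.
print(sample_timestamps(0.0, 10.0, 5))  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

After sorting, the first element of each sequence is the frame with the minimum timestamp, which is the reference frame used for the offset features in step 407.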

Then, for each image frame in each image frame sequence, the following steps 406, 407, and 408 are respectively performed to determine the feature vector of each image frame in each image frame sequence. Steps 406, 407, and 408 respectively correspond to steps 203, 204, and 205, which are not described herein again.

Step 406, determining key points of the face of the evaluation object from the image frame.

Step 407, normalizing the coordinates of the key points, and extracting the following features from the image frame based on the normalized coordinates of the key points: facial motion, gaze direction of both eyes, the distance between the nose tip key point and the upper lip key point, the distance between the left and right mouth corner key points, the area of the mouth region, and the offset of each key point in the image frame relative to the corresponding key point in a reference frame, wherein the reference frame is the image frame with the minimum timestamp in the image frame sequence.
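The geometric part of step 407 can be sketched as follows. The sketch covers only the two distances, the mouth area, and the key point offsets; facial motion and gaze direction are omitted. The bounding-box normalization, the function names, and the key point index dictionary `idx` are assumptions for illustration and are not specified in this disclosure.

```python
import numpy as np


def normalize(pts: np.ndarray) -> np.ndarray:
    """Map (K, 2) key points into the unit box of their bounding box (assumed scheme)."""
    mins = pts.min(axis=0)
    span = pts.max(axis=0) - mins
    span = np.where(span == 0, 1.0, span)  # avoid division by zero
    return (pts - mins) / span


def polygon_area(poly: np.ndarray) -> float:
    """Area of a simple polygon with (M, 2) ordered vertices (shoelace formula)."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))


def geometric_features(pts: np.ndarray, ref_pts: np.ndarray, idx: dict) -> np.ndarray:
    """Distances, mouth area and per-key-point offsets for one frame.

    pts, ref_pts: normalized (K, 2) key points of the frame and the reference frame.
    idx: hypothetical mapping from landmark names to key point indices.
    """
    nose_lip = np.linalg.norm(pts[idx["nose_tip"]] - pts[idx["upper_lip"]])
    mouth_width = np.linalg.norm(pts[idx["mouth_left"]] - pts[idx["mouth_right"]])
    mouth_area = polygon_area(pts[idx["mouth_contour"]])
    offsets = (pts - ref_pts).ravel()  # 2K values relative to the reference frame
    return np.concatenate([[nose_lip, mouth_width, mouth_area], offsets])
```

For the reference frame itself, the offset part of the returned vector is all zeros, since the frame is compared with its own key points.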

Step 408, encoding the features extracted from the image frame to obtain a feature vector of the image frame.

Step 409, encoding the feature vector of each image frame in the same image frame sequence to obtain a feature vector sequence of the image frame sequence.

Because each video clip corresponds to one interactive question, the feature vectors of the image frames in the same image frame sequence are strongly correlated and coherent. The resulting feature vector sequence therefore represents not only the features related to the degree of depression but also how those features change over time, which further expands the feature dimension.

As an example, if the video to be evaluated covers 5 interactive questions, the execution subject may obtain 5 image frame sequences through steps 401 to 405. Assuming that the third preset number is 10, the execution subject may extract 10 feature vectors from each image frame sequence through steps 406 to 408. Assuming further that the dimension of each feature vector is 157, the 10 feature vectors may be encoded into a feature vector sequence of dimension 1570, so that 5 feature vector sequences of dimension 1570 are finally obtained.
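The dimensions in this example are consistent with simple concatenation of the per-frame feature vectors. The sketch below assumes concatenation as the encoding, which the disclosure does not state explicitly; the function name `encode_sequence` and the random placeholder data are hypothetical.

```python
import numpy as np


def encode_sequence(frame_vectors: list[np.ndarray]) -> np.ndarray:
    """Concatenate the per-frame feature vectors of one image frame sequence."""
    return np.concatenate(frame_vectors)


# Placeholder data: 5 sequences, 10 frames each, 157 features per frame.
rng = np.random.default_rng(0)
sequences = [[rng.random(157) for _ in range(10)] for _ in range(5)]
encoded = [encode_sequence(seq) for seq in sequences]

print(len(encoded), encoded[0].shape)  # 5 (1570,)
```

Because the frames are sorted by timestamp before concatenation, the position of each 157-dimensional block inside the 1570-dimensional vector preserves the temporal order of the frames.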

Step 410, inputting the feature vector sequence into a pre-trained depression degree prediction model, and estimating the depression index of the evaluation object.

This step is similar to step 206; the only difference is that, in this embodiment, the execution subject inputs the feature vector sequence into the depression degree prediction model. The details are not described herein again.

As can be seen from fig. 4, compared with the flow 200 shown in fig. 2, the flow 400 of the method for generating information in this embodiment highlights the steps of collecting, in real time, a video of the evaluation object answering the interactive questions, extracting the clip corresponding to each interactive question from the video, and encoding the feature vectors of the image frames in each video clip into a feature vector sequence. By having the evaluation object interact with the interactive questions, the facial features of the evaluation object under different emotional stimuli are captured, which makes the video to be evaluated more targeted and, in turn, makes the depression-related features subsequently extracted from it more targeted. The feature vector sequence generated on this basis better represents the emotional characteristics of the evaluation object when answering the interactive questions, which further improves the accuracy of depression degree prediction.

With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating information, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.

As shown in fig. 5, the apparatus 500 for generating information of the present embodiment includes: an image sampling unit 501 configured to sample a first preset number of image frames including a face image of an evaluation object from a video to be evaluated; a sequence generating unit 502 configured to order each image frame based on the time stamp of each image frame, resulting in an image frame sequence; the feature extraction unit 503 is configured to perform the following feature extraction steps for each image frame in the image frame sequence, respectively, to determine a feature vector of each image frame: determining key points of the face of an evaluation object from the image frame; the coordinates of the key points are normalized, and the following features are extracted from the image frame based on the normalized coordinates of the key points: face movement, two-eye gaze direction, distance between key points of nose tip and upper lip, distance between key points of left and right mouth corners, area of mouth region and offset of each key point in the image frame relative to each key point in a reference frame, wherein the reference frame is the image frame with the minimum timestamp in the image frame sequence; coding the extracted features in the image frame to obtain a feature vector of the image frame; and an information prediction unit 504 configured to input the feature vector of each image frame into a pre-trained depression degree prediction model, and estimate a depression index of the evaluation subject, wherein the depression index is used for representing the depression degree of the evaluation subject.

In this embodiment, the apparatus further comprises a video capture unit configured to: present pre-constructed interactive questions to the evaluation object; and collect, in real time, a video of the evaluation object answering the interactive questions, and determine the video as the video to be evaluated.

In this embodiment, the image sampling unit 501 further includes a video extraction module configured to: determining a second preset number of video clips from the video to be evaluated based on the time length of the evaluation object for answering each interactive question, wherein each video clip corresponds to one interactive question; respectively extracting a third preset number of image frames from each video clip to obtain an image frame set corresponding to each video clip; the sequence generation unit 502 is further configured to: respectively sequencing the image frames in each image frame set based on the time stamps of the image frames to obtain an image frame sequence corresponding to each video clip; and, the feature extraction unit 503 comprises a feature encoding module configured to: coding the feature vector of each image frame in the same image frame sequence to obtain the feature vector sequence of the image frame sequence; the information prediction unit 504 is further configured to: and inputting the characteristic vector sequence into a pre-trained depression degree prediction model, and estimating the depression index of the evaluation object.

In the present embodiment, the feature extraction unit 503 further includes an affine transformation module configured to: the orientation of the face image of the evaluation object in the image frame is adjusted to be consistent with the normal vector of the image frame by adopting affine transformation.

In this embodiment, the depression degree prediction model is a time convolution neural network based on a self-attention mechanism, and includes an input layer, a hidden layer, an attention layer, and an output layer, where the input layer is configured to receive feature vectors of each image frame and extract a feature sequence from the feature vectors of each image frame; the hidden layer is used for coding the characteristic sequence and outputting the coded characteristic sequence; the attention layer is used for acquiring the coded feature sequences output by the hidden layers, weighting the coded feature sequences output by the hidden layers based on an attention mechanism and determining the weighted sum of the coded feature sequences output by the hidden layers; and the output layer performs logistic regression on the weighted sum to estimate the depression index.

In this embodiment, the apparatus further includes a model training module configured to train the time convolutional neural network based on the self-attention mechanism by: constructing a first initial time convolutional neural network, wherein the first initial time convolutional neural network comprises an input layer, a hidden layer, an attention layer and an output layer, the output layer of the first initial time convolutional neural network is a fully connected layer, and the fully connected layer estimates a depression classification result based on the weighted sum; inputting a first sample feature vector labeled with a sample depression classification result into the first initial time convolutional neural network, taking the sample depression classification result as the expected output, and training the first initial time convolutional neural network until the accuracy of the depression classification result estimated by the first initial time convolutional neural network meets a preset accuracy threshold, to obtain a trained first time convolutional neural network; updating the fully connected layer in the first time convolutional neural network to a logistic regression layer to obtain a second initial time convolutional neural network, wherein the logistic regression layer estimates a depression index based on the weighted sum; inputting a second sample feature vector labeled with a sample depression index into the second initial time convolutional neural network, taking the sample depression index as the expected output, and training the second initial time convolutional neural network to obtain a trained second time convolutional neural network; and determining the trained second time convolutional neural network as the time convolutional neural network based on the self-attention mechanism.

Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the use range of the embodiments of the present disclosure.

As shown in fig. 6, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: sampling a first preset number of image frames from a video to be evaluated, wherein the image frames comprise face images of an evaluation object; sequencing each image frame based on the time stamp of each image frame to obtain an image frame sequence; for each image frame in the image frame sequence, respectively executing the following characteristic extraction steps to determine a characteristic vector of each image frame: determining key points of the face of an evaluation object from the image frame; the coordinates of the key points are normalized, and the following features are extracted from the image frame based on the normalized coordinates of the key points: face movement, two-eye gaze direction, distance between key points of nose tip and upper lip, distance between key points of left and right mouth corners, area of mouth region and offset of each key point in the image frame relative to each key point in a reference frame, wherein the reference frame is the image frame with the minimum timestamp in the image frame sequence; coding the extracted features in the image frame to obtain a feature vector of the image frame; and inputting the feature vector of each image frame into a pre-trained depression degree prediction model, and estimating a depression index of the evaluation object, wherein the depression index is used for representing the depression degree of the evaluation object.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an image sampling unit, a sequence generating unit, a feature extraction unit, and an information prediction unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the image sampling unit may also be described as "a unit that samples a first preset number of image frames from the video to be evaluated".

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, a technical solution formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure is also encompassed.

Claims

1. A method for generating information, comprising:

sampling a first preset number of image frames from a video to be evaluated, wherein the image frames comprise face images of an evaluation object;

sequencing each image frame based on the time stamp of each image frame to obtain an image frame sequence;

for each image frame in the image frame sequence, respectively executing the following feature extraction steps to determine the feature vector of each image frame: determining key points of the face of the evaluation object from the image frame; normalizing the coordinates of the key points, and extracting the following features from the image frame based on the normalized coordinates of the key points: facial motion, two-eye gaze direction, distance between key points of nose tip and upper lip, distance between key points of left and right mouth corners, area of mouth region, and offset of each key point in the image frame relative to each key point in a reference frame, wherein the reference frame is the image frame with the minimum timestamp in the image frame sequence; coding the extracted features in the image frame to obtain a feature vector of the image frame;

inputting the feature vector of each image frame into a pre-trained depression degree prediction model, and estimating a depression index of the evaluation subject, wherein the depression index is used for representing the depression degree of the evaluation subject.

2. The method of claim 1, wherein the video to be evaluated is obtained via:

presenting pre-constructed interactive questions to the evaluation object;

and collecting videos of the evaluation object when the evaluation object answers the interactive questions in real time, and determining the videos as the videos to be evaluated.

3. The method of claim 2, wherein sampling a first preset number of image frames from the video to be evaluated comprises:

determining a second preset number of video clips from the video to be evaluated based on the time length of the evaluation object for answering each interactive question, wherein each video clip corresponds to one interactive question; respectively extracting a third preset number of image frames from each video clip to obtain an image frame set corresponding to each video clip; respectively sequencing the image frames in each image frame set based on the time stamps of the image frames to obtain an image frame sequence corresponding to each video clip; and,

inputting the feature vector of each image frame into a pre-trained depression degree prediction model, and estimating a depression index of the evaluation object, comprises: coding the feature vector of each image frame in the same image frame sequence to obtain the feature vector sequence of the image frame sequence; inputting the feature vector sequence into the pre-trained depression degree prediction model.

4. The method of claim 1, prior to normalizing the coordinates of the keypoints, further comprising:

adjusting, using an affine transformation, an orientation of a face image of the evaluation object in the image frame to coincide with a normal vector of the image frame.

5. The method according to one of claims 1 to 4, wherein the depressive-degree prediction model is a time-convolutional neural network based on a self-attention mechanism, comprising an input layer, a hidden layer, an attention layer, and an output layer, wherein,

the input layer is used for receiving the feature vector of each image frame and extracting a feature sequence from the feature vector of each image frame;

the hidden layer is used for coding the characteristic sequence and outputting the coded characteristic sequence;

the attention layer is used for acquiring the coded feature sequences output by the hidden layers, weighting the coded feature sequences output by the hidden layers based on a self-attention mechanism, and determining a weighted sum of the coded feature sequences output by the hidden layers;

and the output layer performs logistic regression on the weighted sum to estimate the depression index.

6. The method of claim 5, wherein the time-convolutional neural network based on the self-attention mechanism is trained by:

constructing a first initial time convolution neural network, wherein the first initial time convolution neural network comprises an input layer, a hidden layer, an attention layer and an output layer, the output layer of the first initial time convolution neural network is a fully connected layer, and the fully connected layer estimates a depression classification result based on the weighted sum;

inputting a first sample feature vector marked with a sample depression classification result into the first initial time convolutional neural network, taking the sample depression classification result as expected output, training the first initial time convolutional neural network until the accuracy of the depression classification result estimated by the first initial time convolutional neural network meets a preset accuracy threshold, and obtaining a trained first time convolutional neural network;

updating a full-link layer in the first time convolution neural network to be a logistic regression layer to obtain a second initial time convolution neural network, wherein the logistic regression layer estimates a depression index based on the weighted sum;

inputting a second sample feature vector marked with a sample depression index into the second initial time convolution neural network, taking the sample depression index as an expected output, training the second initial time convolution neural network to obtain a trained second time convolution neural network,

determining the trained second time convolution neural network as a time convolution neural network based on a self-attention mechanism.
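
The two-stage procedure above (train with a classification head, then swap the head and train again on depression indices) can be illustrated with a toy model. Everything here is an assumption made for the example: `TinyNet` is a one-unit stand-in whose `body` plays the role of the input, hidden, and attention layers, both heads are a single sigmoid unit (in the claim the first head is a fully connected classification layer), a fixed number of epochs stands in for training until the accuracy threshold is met, and the sample data and `ACC_THRESHOLD` are invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """Toy stand-in: `body` is the shared part, `head` is the output layer."""
    def __init__(self, dim):
        self.body = [0.1] * dim
        self.head = [0.1, 0.0]  # [weight, bias]

    def pooled(self, x):
        return math.tanh(sum(w * xi for w, xi in zip(self.body, x)))

    def forward(self, x):
        weight, bias = self.head
        return sigmoid(weight * self.pooled(x) + bias)

def train(model, samples, grad_z, epochs=200, lr=0.5):
    """Plain SGD; grad_z(p, target) is dLoss/dz at the output unit."""
    for _ in range(epochs):
        for x, target in samples:
            h = model.pooled(x)
            weight, bias = model.head
            g = grad_z(sigmoid(weight * h + bias), target)
            model.head = [weight - lr * g * h, bias - lr * g]
            scale = g * weight * (1.0 - h * h)  # backpropagate through tanh
            model.body = [w - lr * scale * xi for w, xi in zip(model.body, x)]

def accuracy(model, samples):
    hits = sum((model.forward(x) >= 0.5) == (label == 1) for x, label in samples)
    return hits / len(samples)

ACC_THRESHOLD = 0.9                               # hypothetical preset threshold
cls_samples = [([1.0], 1), ([-1.0], 0)]           # labeled with a class
idx_samples = [([1.0], 0.8), ([-1.0], 0.2)]       # labeled with an index

model = TinyNet(dim=1)

# Stage 1: classification head, cross-entropy gradient (p - y).
train(model, cls_samples, lambda p, y: p - y)
stage1_accuracy = accuracy(model, cls_samples)

# Stage 2: keep the trained body, replace the head, fit the index
# with a squared-error gradient through the sigmoid.
if stage1_accuracy >= ACC_THRESHOLD:
    model.head = [0.0, 0.0]
    train(model, idx_samples, lambda p, t: 2.0 * (p - t) * p * (1.0 - p))
```

The point of the design is that the second stage starts from a body already shaped by the classification task, so only the new head has to be learned from scratch.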

7. An apparatus for generating information, comprising:

an image sampling unit configured to sample a first preset number of image frames from a video to be evaluated, wherein the image frames comprise face images of an evaluation object;

a sequence generating unit configured to order each image frame based on a time stamp of the image frame, resulting in an image frame sequence;

a feature extraction unit configured to perform, for each image frame in the image frame sequence, the following feature extraction steps to determine a feature vector of each image frame: determining key points of the face of the evaluation object from the image frame; normalizing the coordinates of the key points, and extracting the following features from the image frame based on the normalized coordinates of the key points: facial motion, two-eye gaze direction, distance between key points of nose tip and upper lip, distance between key points of left and right mouth corners, area of mouth region, and offset of each key point in the image frame relative to each key point in a reference frame, wherein the reference frame is the image frame with the minimum timestamp in the image frame sequence; coding the extracted features in the image frame to obtain a feature vector of the image frame;

and an information prediction unit configured to input the feature vector of each image frame into a pre-trained depression degree prediction model and estimate a depression index of the evaluation object, wherein the depression index is used for representing the depression degree of the evaluation object.
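
The geometric part of the feature extraction step can be sketched as follows. The sketch covers only the distance, area, and offset features; facial motion and gaze direction need additional models and are omitted. The key-point names, the four-point mouth outline, and the bounding-box normalization are assumptions, since the claim does not fix a landmark scheme or a normalization method.

```python
import math

MOUTH_OUTLINE = ["mouth_left", "upper_lip", "mouth_right", "lower_lip"]  # assumed order

def normalize(points):
    """Scale key points into [0, 1] using their bounding box (assumed scheme)."""
    xs = [p[0] for p in points.values()]
    ys = [p[1] for p in points.values()]
    min_x, min_y = min(xs), min(ys)
    width = (max(xs) - min_x) or 1.0
    height = (max(ys) - min_y) or 1.0
    return {name: ((x - min_x) / width, (y - min_y) / height)
            for name, (x, y) in points.items()}

def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def frame_features(points, reference):
    """Encode one frame as a flat feature vector.
    `points` and `reference` are dicts of normalized key points;
    `reference` belongs to the frame with the minimum timestamp."""
    nose_to_lip = math.dist(points["nose_tip"], points["upper_lip"])
    mouth_width = math.dist(points["mouth_left"], points["mouth_right"])
    mouth_area = polygon_area([points[name] for name in MOUTH_OUTLINE])
    offsets = [cur - ref
               for name in sorted(points)
               for cur, ref in zip(points[name], reference[name])]
    return [nose_to_lip, mouth_width, mouth_area] + offsets

# Hypothetical, already normalized key points of the reference frame.
ref_points = {
    "nose_tip": (0.5, 0.4), "upper_lip": (0.5, 0.6), "lower_lip": (0.5, 0.8),
    "mouth_left": (0.3, 0.7), "mouth_right": (0.7, 0.7),
}
features = frame_features(ref_points, ref_points)
```

For the reference frame itself all offsets are zero; for later frames they record how far each key point has moved since the start of the sequence.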

8. The apparatus of claim 7, wherein the apparatus further comprises a video capture unit configured to:

presenting a pre-constructed interactive question to the evaluation object;

9. The apparatus of claim 8, wherein the image sampling unit further comprises a video extraction module configured to:

determining a second preset number of video clips from the video to be evaluated based on the length of time the evaluation object takes to answer each interactive question, wherein each video clip corresponds to one interactive question; respectively extracting a third preset number of image frames from each video clip to obtain an image frame set corresponding to each video clip;

the sequence generation unit is further configured to: based on the time stamp of each image frame, sequencing the image frames in each image frame set respectively to obtain an image frame sequence corresponding to each video clip; and

the feature extraction unit comprises a feature encoding module configured to: encoding the feature vector of each image frame in the same image frame sequence to obtain the feature vector sequence of the image frame sequence;

the information prediction unit is further configured to: inputting the feature vector sequence into the pre-trained depression degree prediction model, and estimating the depression index of the evaluation object.
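
The clip handling recited above can be sketched as follows, assuming that frames arrive as `(timestamp, frame)` pairs and that the answers to the interactive questions follow one another without gaps. The function names, the even-spacing sampling rule, and the example data are hypothetical; the claim only requires one clip per question, a preset number of frames per clip, and ordering by timestamp.

```python
def split_into_clips(frames, answer_durations):
    """Cut the frame list into one clip per question.
    `answer_durations` holds the answer time in seconds for each question, in order."""
    clips, start = [], 0.0
    for duration in answer_durations:
        end = start + duration
        clips.append([f for f in frames if start <= f[0] < end])
        start = end
    return clips

def sample_sorted(clip, count):
    """Pick `count` evenly spaced frames from a clip and order them by timestamp."""
    if len(clip) <= count:
        picked = list(clip)
    else:
        step = len(clip) / count
        picked = [clip[int(i * step)] for i in range(count)]
    return sorted(picked, key=lambda f: f[0])

# Twelve frames at 2 frames per second; two questions answered in 2 s and 4 s.
frames = [(t * 0.5, f"frame{t}") for t in range(12)]
clips = split_into_clips(frames, [2.0, 4.0])
sequences = [sample_sorted(clip, 2) for clip in clips]
timestamps = [[f[0] for f in seq] for seq in sequences]
```

Each resulting sequence would then be passed through the per-frame feature extraction to form the feature vector sequence for that question.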

10. The apparatus of claim 7, the feature extraction unit further comprising an affine transformation module configured to:

11. The apparatus according to one of claims 7 to 10, wherein the depression degree prediction model is a temporal convolutional neural network based on a self-attention mechanism, including an input layer, a hidden layer, an attention layer, and an output layer, wherein,

12. The apparatus of claim 11, wherein the apparatus further comprises a model training module configured to train the temporal convolutional neural network based on a self-attention mechanism via:

13. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.

14. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.