CN117592014A - Multi-modal fusion-based large five personality characteristic prediction method - Google Patents
- Publication number: CN117592014A (application number CN202410082720.0A)
- Authority: CN (China)
- Prior art keywords: sequence, feature, audio, fusion, facial expression
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/00—Pattern recognition; G06F18/20—Analysing; G06F18/27—Regression, e.g. linear or logistic regression
- G06F18/253—Fusion techniques of extracted features
- G06F40/20—Natural language analysis
- G06N3/0442—Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0455—Auto-encoder networks; encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06V10/454—Integrating biologically inspired filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- G06V40/168—Human faces: feature extraction; face representation
- G06V40/174—Facial expression recognition
- G06V40/70—Multimodal biometrics, e.g. combining information from different biometric modalities
- G10L15/26—Speech to text systems
- G10L25/03—Speech or voice analysis characterised by the type of extracted parameters
- G10L25/18—Extracted parameters being spectral information of each sub-band
- G10L25/24—Extracted parameters being the cepstrum
- G10L25/57—Specially adapted for processing of video signals
- G10L25/63—Specially adapted for estimating an emotional state
- G10L25/66—Specially adapted for extracting parameters related to health condition
- G16H50/30—ICT specially adapted for calculating health indices; for individual health risk assessment
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multimodal-fusion-based Big Five personality trait prediction method in the field of affective computing, comprising the following steps: capturing an image sequence to be processed containing the subject's face from a target dialogue video, and extracting an audio file containing the subject's dialogue information from the same video; extracting a facial expression feature sequence and a head pose feature sequence from the image sequence using a trained facial expression prediction network and a trained head pose estimation network, respectively; extracting an audio feature sequence from the audio file and text features from its transcription; performing multimodal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text features to obtain target fusion features, and training the whole network with a label-distribution-based loss function; and performing weighted regression on the target fusion features with a trained multilayer perceptron to obtain a quantitative prediction for each Big Five personality dimension of the subject.
Description
Technical Field
The invention relates to the field of affective computing, and in particular to a multimodal-fusion-based Big Five personality trait prediction method.
Background
The five-factor (Big Five) model of personality has wide application value in clinical psychology, health psychology, developmental psychology, occupational and industrial psychology, management, and related fields. Research has found that extraversion, neuroticism, and agreeableness are all related to mental health; extraversion and openness are two important factors in occupational and industrial psychology; and conscientiousness is closely related to personnel selection. Big Five assessment currently relies mainly on questionnaires such as the NEO-PI-R and NEO-FFI, but such scales are highly subjective: respondents may not answer truthfully, and inaccurate results can mislead medical diagnosis or corporate personnel selection, wasting human and financial resources.
A person's character is normally revealed only after extended observation, yet recruitment, team optimization, and talent evaluation all require understanding a person in a short time. Traditional Big Five questionnaires can assess personality traits quickly, but respondents may deliberately misreport their answers, so their true traits cannot be obtained accurately. Moreover, because questionnaire content is rarely updated, frequent respondents can memorize the items and become "trained", choosing options that make them look better or more socially desirable, which distorts the test results.
Disclosure of Invention
To solve these technical problems in the prior art, an embodiment of the invention provides a multimodal-fusion-based Big Five personality trait prediction method. The technical scheme is as follows:
In a first aspect, an embodiment of the invention provides a multimodal-fusion-based Big Five personality trait prediction method, comprising: capturing an image sequence to be processed containing the subject's face from a target dialogue video, and extracting an audio file containing the subject's dialogue information from the same video, the target dialogue video being a video of the subject participating in a dialogue; extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network, respectively; extracting an audio feature sequence and text information from the audio file, and extracting text features from the text information; performing multimodal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text features to obtain target fusion features; and regressing the target fusion features with a trained multilayer perceptron to obtain a quantitative prediction for each Big Five personality dimension of the subject.
Further, capturing the image sequence to be processed containing the subject's face from the target dialogue video comprises: uniformly sampling images containing the subject's face from the target dialogue video at a preset time interval to obtain the image sequence to be processed.
Further, extracting the facial expression feature sequence and the head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network comprises: performing face detection on the image sequence to be processed to obtain a facial image sequence; and extracting the facial expression feature sequence and the head pose feature sequence from the facial image sequence using the trained facial expression prediction network and the trained head pose estimation network, respectively; the facial expression prediction network comprises a VGGFace model, and the head pose estimation network comprises a ShuffleNet V2 model.
Further, extracting the audio feature sequence and text information from the audio file, and extracting the text features from the text information, comprises: obtaining serialized initial audio features from the audio file with an audio analysis tool, the initial audio features including the short-time average zero-crossing rate, short-time energy, energy entropy, spectral centroid, and mel-frequency cepstral coefficients; passing the initial audio features through a fully connected layer to obtain the audio feature sequence; transcribing the audio file to obtain the text information; and extracting features from the text information with a trained BERT model to obtain the text features, the text features including deep bidirectional language features.
Further, performing multimodal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text features to obtain the target fusion features comprises: performing weighted fusion of the facial expression feature sequence, the head pose feature sequence, and the audio feature sequence, then further extracting features along the time dimension with a bidirectional long short-term memory (BiLSTM) recurrent neural network to obtain a preliminary fusion feature; and performing weighted fusion of the preliminary fusion feature and the text features to obtain the target fusion features.
Further, regressing the target fusion features with the trained multilayer perceptron to obtain the quantitative prediction for each Big Five personality dimension of the subject comprises:

$$(S_1, S_2, S_3, S_4, S_5) = \sigma\big(\mathrm{MLP}_2(\mathrm{ReLU}(\mathrm{MLP}_1(w_m X_m + w_n X_n)))\big)$$

where $S_1, S_2, S_3, S_4, S_5$ respectively represent the tendencies along the five personality dimensions, $\mathrm{MLP}_1$ is the first fully connected layer, $\mathrm{MLP}_2$ is the second fully connected layer, $\mathrm{ReLU}$ is the first activation function, $\sigma$ is the second activation function, $w_m$ is the weight of the preliminary fusion feature, $X_m$ is the preliminary fusion feature, $w_n$ is the weight of the text feature, and $X_n$ is the text feature.
Further, the training loss function of the multilayer perceptron comprises:

$$L = \frac{1}{N}\sum_{i=1}^{N}\frac{(\hat{y}_i - y_i)^2}{\mathrm{var}(y_i)}$$

where $N$ represents the number of samples in a batch during training, $\mathrm{var}(y_i)$ represents the variance of the $i$-th subject's trait labels, $\hat{y}_i$ represents the predicted value of the $i$-th subject's trait, and $y_i$ represents the true value of the $i$-th subject's trait.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects: by combining facial expression features, head pose features, audio features, and text features, the invention can comprehensively and quantitatively evaluate the subject's Big Five personality, objectively and conveniently scoring the tendency along each Big Five dimension, thereby alleviating the distortion and strong subjectivity of prediction results in existing personality prediction methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of the multimodal-fusion-based Big Five personality trait prediction method provided by an embodiment of the invention;
fig. 2 is a schematic structural diagram of a modified VGGFace model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a modified ShuffleNet V2 model according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings.
In the embodiments of the invention, words such as "exemplary" and "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs; rather, such words are intended to present concepts in a concrete fashion. Furthermore, in the embodiments of the invention, "and/or" may mean both items, or either one of the two.
Example 1
Fig. 1 is a flowchart of the multimodal-fusion-based Big Five personality trait prediction method according to an embodiment of the invention. As shown in fig. 1, the method comprises the following steps:
step S102, intercepting a to-be-processed image sequence containing the face of an image testee from a target dialogue video, and extracting an audio file containing the dialogue information of the testee from the target dialogue video; the target dialogue video is a video of a person to be tested participating in a dialogue.
Preferably, the target dialogue video comprises a video of a conversation with the subject or a video of the subject introducing himself or herself.
Step S104, extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network, respectively.
Step S106, extracting the audio feature sequence and the text information of the audio file, and extracting the text features in the text information.
Step S108, performing multimodal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text features to obtain target fusion features.
Step S110, regressing the target fusion features with a trained multilayer perceptron to obtain the quantitative prediction for each Big Five personality dimension of the subject.
Specifically, step S102 comprises: uniformly sampling images containing the subject's face from the target dialogue video at a preset time interval to obtain the image sequence to be processed. For example, the preset time interval is 1 s.
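The uniform sampling in step S102 reduces to choosing frame indices at a fixed stride; the sketch below shows only this index computation (the function name and signature are illustrative, and actual frame decoding would be done with a video library):

```python
def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 1.0) -> list:
    """Indices of frames sampled uniformly every `interval_s` seconds."""
    step = max(1, round(fps * interval_s))  # frames between consecutive samples
    return list(range(0, total_frames, step))

# a 10-second clip at 25 fps sampled every 1 s yields 10 frame indices
indices = sample_frame_indices(total_frames=250, fps=25.0, interval_s=1.0)
```

With `interval_s = 1.0` this matches the 1 s example interval given above.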
Specifically, step S104 further includes the following steps:
step S1041, performing face detection on the image sequence to be processed to obtain a face image sequence.
Preferably, face detection is performed on the image sequence to be processed with a lightweight face detection network, the face regions are cropped out, and all face images are resized to a uniform size.
Step S1042, extracting the facial expression feature sequence and the head pose feature sequence from the facial image sequence using the trained facial expression prediction network and the trained head pose estimation network; the facial expression prediction network comprises a VGGFace model, and the head pose estimation network comprises a ShuffleNet V2 model.
Preferably, the facial expression prediction network used in the embodiment of the invention is modified from the VGGFace model; the modified structure is shown in fig. 2. The modified network retains the pretrained parameters of the corresponding modules of the original network and participates in the final network training.
Preferably, the head pose estimation network used in the embodiment of the invention is modified from the ShuffleNet V2 model; the modified structure is shown in fig. 3. This model does not participate in the final network training.
Specifically, step S106 further includes the following steps:
step S1061, obtaining a serialized initial audio feature from an audio file using an audio analysis tool; the initial audio features include: short-time average zero-crossing rate, short-time energy, energy entropy, spectrum center and mel-frequency cepstral coefficient.
Step S1062, passing the initial audio features through a fully connected layer to obtain the audio feature sequence. In the embodiment of the invention, this fully connected layer participates in the final network training.
In step S1063, text information of the audio file is acquired based on the audio transcription.
Step S1064, extracting features of the text information based on the trained BERT model to obtain text features; text features include deep bi-directional language features. In the embodiment of the invention, the BERT model does not participate in the final network training.
Specifically, step S108 further includes the steps of:
and S1081, carrying out weighted fusion on the facial expression feature sequence, the head posture feature sequence and the audio feature sequence, and then further extracting features in the time dimension by utilizing a two-way long-short-term memory recurrent neural network to obtain primary fusion features.
Step S1082, performing weighted fusion of the preliminary fusion feature and the text features to obtain the target fusion features.
Specifically, the weight parameters of the modalities are produced by fully connected networks, where $X_i$, $X_j$, and $X_k$ are the facial expression feature sequence, the head pose feature sequence, and the audio feature sequence, respectively, and $w_i$, $w_j$, and $w_k$ are their respective weights; $X_m$ is the preliminary fusion feature and $w_m$ is its weight; $X_n$ is the text feature and $w_n$ is its weight; MLP denotes a fully connected network.
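The patent text does not reproduce the weight formula itself, only that a fully connected network is involved; one common realization consistent with the symbols above is to normalize per-modality scalar scores with a softmax and take a weighted sum of equal-length feature vectors. The sketch below makes that assumption explicit:

```python
import math
from typing import List

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_fusion(features: List[List[float]], scores: List[float]) -> List[float]:
    """Weighted sum of equal-length feature vectors; `scores` stand in for the
    per-modality outputs of the fully connected weighting network."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# three modality vectors with equal scores reduce to a plain average
fused = weighted_fusion([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.0, 0.0, 0.0])
```

Because the weights are explicit scalars, they also double as the per-modality influence values that make the model interpretable.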
Specifically, step S110 comprises:

$$(S_1, S_2, S_3, S_4, S_5) = \sigma\big(\mathrm{MLP}_2(\mathrm{ReLU}(\mathrm{MLP}_1(w_m X_m + w_n X_n)))\big)$$

where $S_1, \dots, S_5$ represent the tendencies along the five personality dimensions, each taking a value between 0 and 1, a larger value indicating a stronger tendency; $\mathrm{MLP}_1$ is the first fully connected layer, $\mathrm{MLP}_2$ is the second fully connected layer, $\mathrm{ReLU}$ is the first activation function, and $\sigma$ is the second activation function.
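The regression head of step S110 (two fully connected layers with a ReLU between them and a sigmoid on the output, matching the symbols described above) can be sketched in pure Python; the weight matrices below are hypothetical toy values, not trained parameters:

```python
import math
from typing import List

def linear(x: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    # one fully connected layer: W @ x + b
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp_head(x, W1, b1, W2, b2) -> List[float]:
    """sigma(MLP2(ReLU(MLP1(x)))): five scores in (0, 1), one per Big Five dimension."""
    h = [max(0.0, v) for v in linear(x, W1, b1)]                     # ReLU
    return [1.0 / (1.0 + math.exp(-v)) for v in linear(h, W2, b2)]   # sigmoid

# toy dimensions: a 2-d fused feature -> 3 hidden units -> 5 trait scores
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.2, 0.1, 0.0]] * 5
b2 = [0.0] * 5
scores = mlp_head([1.0, 2.0], W1, b1, W2, b2)
```

The sigmoid guarantees each score lies strictly between 0 and 1, as the tendency values above require.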
Steps S108 and S110 are iterated until the model converges. Specifically, after the quantitative prediction for each Big Five dimension of the subject is obtained, the models of step S108 and step S110 are trained iteratively with the following label-distribution-based loss function, so that the model progressively learns the multimodal fusion feature information:

$$L = \frac{1}{N}\sum_{i=1}^{N}\frac{(\hat{y}_i - y_i)^2}{\mathrm{var}(y_i)}$$

where $N$ represents the number of samples in a batch during training, $\mathrm{var}(y_i)$ represents the variance of the $i$-th subject's trait labels, $\hat{y}_i$ represents the predicted value of the $i$-th subject's trait, and $y_i$ represents the true value of the $i$-th subject's trait.
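A loss of this shape, squared error scaled by the variance of each subject's trait labels, can be sketched as follows (this is one plausible reading of the label-distribution loss, since the formula image is not reproduced in the text; the epsilon is an added safeguard):

```python
from typing import List

def variance_weighted_mse(preds: List[float], targets: List[float],
                          variances: List[float]) -> float:
    """Mean over the batch of (prediction - target)^2 / var, per subject."""
    eps = 1e-8  # avoids division by zero when a subject's labels have no spread
    n = len(preds)
    return sum((p - y) ** 2 / (v + eps)
               for p, y, v in zip(preds, targets, variances)) / n

# two subjects: predicted vs. true trait scores, with per-subject label variances
loss = variance_weighted_mse([0.6, 0.7], [0.5, 0.7], [0.01, 0.04])
```

Dividing by the label variance down-weights errors on subjects whose annotators disagreed, which is the usual motivation for label-distribution losses.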
Optionally, the method provided by the embodiment of the invention further includes: testing the trained model on the collected data and evaluating its performance.
As can be seen from the above description, the embodiment of the invention provides a Big Five personality trait prediction method based on multi-modal fusion. By fusing facial expression features, head pose features, audio features, and text features, the method can objectively and conveniently produce a quantitative evaluation of the subject's tendency along each Big Five dimension. It can be applied to fields such as personnel selection, recruitment interviews, and career planning, and alleviates the distortion and strong subjectivity of prediction results in existing personality prediction methods. In addition, because the weight values describe how strongly each modality influences the result, the model is interpretable.
It should also be appreciated that the memory in embodiments of the present invention may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable or optical fiber) or wirelessly (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more sets of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (7)
1. A Big Five personality trait prediction method based on multi-modal fusion, characterized by comprising the following steps:
intercepting, from a target dialogue video, an image sequence to be processed containing the face of a subject, and extracting, from the target dialogue video, an audio file containing the subject's dialogue information; the target dialogue video is a video of the subject participating in a dialogue;
respectively extracting a facial expression characteristic sequence and a head posture characteristic sequence from the image sequence to be processed by using a trained facial expression prediction network and a trained head posture estimation network;
extracting an audio feature sequence and text information of the audio file, and extracting text features in the text information;
carrying out multi-mode fusion on the facial expression feature sequence, the head gesture feature sequence, the audio feature sequence and the text feature to obtain a target fusion feature;
and performing regression on the target fusion feature based on a trained multi-layer perceptron to obtain a quantized prediction result for each dimension of the subject's Big Five personality.
2. The method of claim 1, wherein capturing a sequence of images to be processed including a face of a subject from a target dialog video, comprises:
uniformly capturing images containing the subject's face from the target dialogue video at a preset time interval to obtain the image sequence to be processed.
3. The method according to claim 1, wherein extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed respectively using a trained facial expression prediction network and a trained head pose estimation network, comprises:
performing face detection on the image sequence to be processed to obtain a face image sequence;
respectively extracting a facial expression characteristic sequence and a head posture characteristic sequence from the facial image sequence by using a trained facial expression prediction network and a trained head posture estimation network;
the facial expression prediction network comprises a VGGFace model; the head pose estimation network includes a ShuffleNet V2 model.
4. The method of claim 1, wherein extracting the sequence of audio features and text information of the audio file and extracting text features in the text information comprises:
obtaining serialized initial audio features from the audio file using an audio analysis tool; the initial audio features include: short-time average zero-crossing rate, short-time energy, energy entropy, spectral centroid, and Mel-frequency cepstral coefficients;
performing feature extraction on the initial audio features based on a full connection layer to obtain the audio feature sequence;
acquiring text information of the audio file based on audio transcription;
extracting features of the text information based on the trained BERT model to obtain the text features; the text features include deep bi-directional language features.
5. The method of claim 1, wherein multimodal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text feature to obtain a target fusion feature comprises:
performing weighted fusion on the facial expression feature sequence, the head pose feature sequence, and the audio feature sequence, then further extracting features in the time dimension using a bidirectional long short-term memory recurrent neural network to obtain a preliminary fusion feature;
and carrying out weighted fusion on the primary fusion characteristic and the text characteristic to obtain a target fusion characteristic.
6. The method of claim 5, wherein regressing the target fusion feature based on the trained multi-layer perceptron to obtain quantized prediction results for each dimension of the subject's Big Five personality comprises:
(S_1, S_2, S_3, S_4, S_5) = σ(MLP_2(ReLU(MLP_1(w_m·X_m + w_n·X_n))))

wherein S_1, S_2, S_3, S_4, and S_5 represent the tendency along the five personality dimensions; MLP_1 is the first fully connected layer, MLP_2 is the second fully connected layer, ReLU is the first activation function, σ is the second activation function, w_m is the weight of the preliminary fusion feature, X_m is the preliminary fusion feature, w_n is the weight of the text feature, and X_n is the text feature.
7. The method of claim 1, wherein the training loss function of the multi-layer perceptron comprises:
Loss = (1/N) Σ_{i=1}^{N} Var(y_i)·(ŷ_i − y_i)²

where N is the number of samples in a training batch, Var(y_i) is the variance of the i-th subject's trait labels, ŷ_i is the predicted value of the i-th subject's trait, and y_i is its true value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410082720.0A CN117592014A (en) | 2024-01-19 | 2024-01-19 | Multi-modal fusion-based large five personality characteristic prediction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117592014A true CN117592014A (en) | 2024-02-23 |
Family
ID=89920636
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111243626A (en) * | 2019-12-30 | 2020-06-05 | 清华大学 | Speaking video generation method and system |
CN113705725A (en) * | 2021-09-15 | 2021-11-26 | 中国矿业大学 | User personality characteristic prediction method and device based on multi-mode information fusion |
CN113724712A (en) * | 2021-08-10 | 2021-11-30 | 南京信息工程大学 | Bird sound identification method based on multi-feature fusion and combination model |
CN114463671A (en) * | 2021-12-29 | 2022-05-10 | 上海花事电子商务有限公司 | User personality identification method based on video data |
CN114841399A (en) * | 2022-03-24 | 2022-08-02 | 合肥工业大学 | Personality portrait generation method and system based on audio and video multi-mode feature fusion |
CN115577316A (en) * | 2022-09-28 | 2023-01-06 | 中国人民解放军国防科技大学 | User personality prediction method based on multi-mode data fusion and application |
KR20230128812A (en) * | 2022-02-28 | 2023-09-05 | 전남대학교산학협력단 | Cross-modal learning-based emotion inference system and method |
CN117237766A (en) * | 2023-07-12 | 2023-12-15 | 华中师范大学 | Classroom cognition input identification method and system based on multi-mode data |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||