CN117592014A - Multi-modal fusion-based Big Five personality trait prediction method - Google Patents

Multi-modal fusion-based Big Five personality trait prediction method

Info

Publication number
CN117592014A
Authority
CN
China
Prior art keywords
sequence
feature
audio
fusion
facial expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410082720.0A
Other languages
Chinese (zh)
Inventor
马惠敏
李欣
王荣全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN202410082720.0A priority Critical patent/CN117592014A/en
Publication of CN117592014A publication Critical patent/CN117592014A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/27Regression, e.g. linear or logistic regression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Databases & Information Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a Big Five personality trait prediction method based on multi-modal fusion, which relates to the technical field of affective computing and comprises the following steps: capturing an image sequence to be processed containing the subject's face from a target dialogue video, and extracting an audio file containing the subject's dialogue information from the same video; extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network, respectively; extracting an audio feature sequence from the audio file and text features from its transcription; performing multi-modal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence and the text features to obtain target fusion features, and training the whole network with a label-distribution-based loss function; and performing weighted regression on the target fusion features with a trained multi-layer perceptron to obtain quantitative predictions for each Big Five personality dimension of the subject.

Description

Multi-modal fusion-based Big Five personality trait prediction method
Technical Field
The invention relates to the technical field of affective computing, and in particular to a Big Five personality trait prediction method based on multi-modal fusion.
Background
The five-factor model of personality has wide application value in clinical psychology, health psychology, developmental psychology, occupational, managerial and industrial psychology, and other fields. Research has found that extraversion, neuroticism and agreeableness are all related to mental health; extraversion and openness are two important factors in occupational and industrial psychology; and conscientiousness is closely related to personnel selection. The current assessment of the Big Five personality relies mainly on Big Five questionnaires such as the NEO-PI-R and NEO-FFI, but such scales are highly subjective: a subject may not fill them out truthfully, and an incorrect result can affect medical diagnosis, corporate personnel selection and the like, causing losses of manpower and financial resources.
A person's character is usually only revealed after a period of observation, yet for recruitment, team optimization and talent evaluation, people typically need to be understood within a short time. Traditional Big Five questionnaires can indeed assess personality traits in a short time, but respondents may answer some items dishonestly, so their true personality traits cannot be obtained accurately. In addition, because questionnaire contents are rarely updated, a respondent who has filled out many such questionnaires can easily memorize them, producing a practice effect, and may deliberately choose options that make them look better or more in line with social expectations, thereby distorting the test results.
Disclosure of Invention
In order to solve the technical problems in the prior art, the embodiment of the invention provides a Big Five personality trait prediction method based on multi-modal fusion. The technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a method for predicting Big Five personality traits based on multi-modal fusion, where the method includes: capturing an image sequence to be processed containing the subject's face from a target dialogue video, and extracting an audio file containing the subject's dialogue information from the target dialogue video, the target dialogue video being a video of the subject participating in a dialogue; extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network, respectively; extracting an audio feature sequence and text information from the audio file, and extracting text features from the text information; performing multi-modal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence and the text features to obtain target fusion features; and performing regression on the target fusion features with a trained multi-layer perceptron to obtain quantitative predictions for each Big Five personality dimension of the subject.
Further, capturing an image sequence to be processed containing the subject's face from the target dialogue video comprises: uniformly capturing images containing the subject's face from the target dialogue video at a preset time interval to obtain the image sequence to be processed.
Further, extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network, respectively, comprises: performing face detection on the image sequence to be processed to obtain a face image sequence; and extracting a facial expression feature sequence and a head pose feature sequence from the face image sequence using the trained facial expression prediction network and the trained head pose estimation network, respectively; the facial expression prediction network comprises a VGGFace model, and the head pose estimation network comprises a ShuffleNet V2 model.
Further, extracting the audio feature sequence and text information from the audio file and extracting text features from the text information comprises: obtaining serialized initial audio features from the audio file using an audio analysis tool, the initial audio features including short-time average zero-crossing rate, short-time energy, energy entropy, spectral centroid and mel-frequency cepstral coefficients (MFCCs); performing feature extraction on the initial audio features through a fully connected layer to obtain the audio feature sequence; obtaining text information of the audio file through audio transcription; and extracting features from the text information with a trained BERT model to obtain the text features, which include deep bidirectional language features.
Further, performing multi-modal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence and the text features to obtain target fusion features comprises: performing weighted fusion of the facial expression feature sequence, the head pose feature sequence and the audio feature sequence, then further extracting features along the time dimension with a bidirectional long short-term memory (BiLSTM) recurrent neural network to obtain primary fusion features; and performing weighted fusion of the primary fusion features and the text features to obtain the target fusion features.
Further, performing regression on the target fusion features with the trained multi-layer perceptron to obtain quantitative predictions for each Big Five personality dimension of the subject comprises:
$$(S_1, S_2, S_3, S_4, S_5) = \sigma\big(\mathrm{MLP}_2\big(\mathrm{ReLU}\big(\mathrm{MLP}_1(w_m X_m + w_n X_n)\big)\big)\big)$$
where $S_1, S_2, S_3, S_4, S_5$ respectively denote the tendency scores of the five personality dimensions, $\mathrm{MLP}_1$ is the first fully connected layer, $\mathrm{MLP}_2$ is the second fully connected layer, $\mathrm{ReLU}$ is the first activation function, $\sigma$ is the second activation function, $w_m$ is the weight of the primary fusion feature, $X_m$ is the primary fusion feature, $w_n$ is the weight of the text feature, and $X_n$ is the text feature.
Further, the training loss function of the multi-layer perceptron comprises:
where $N$ denotes the number of samples in a training batch, $\mathrm{var}(y_i)$ denotes the variance of the $i$-th subject's personality traits, $\hat{y}_i$ denotes the predicted value of the $i$-th subject's traits, and $y_i$ denotes the true value of the $i$-th subject's traits.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects: by combining facial expression features, head pose features, audio features and text features, the invention can quantitatively evaluate the subject's Big Five personality in an all-round way, objectively and conveniently quantifying the subject's tendency on each Big Five dimension, and alleviates the technical problems of distorted prediction results and strong subjectivity in existing personality prediction methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a Big Five personality trait prediction method based on multi-modal fusion provided by an embodiment of the invention;
fig. 2 is a schematic structural diagram of a modified VGGFace model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a modified ShuffleNet V2 model according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings.
In embodiments of the invention, words such as "exemplary" and "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of these terms is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, "and/or" may mean both of two items, or optionally either one of the two.
Example 1
Fig. 1 is a flowchart of a Big Five personality trait prediction method based on multi-modal fusion according to an embodiment of the present invention. As shown in fig. 1, the method specifically includes the following steps:
Step S102, capturing an image sequence to be processed containing the subject's face from a target dialogue video, and extracting an audio file containing the subject's dialogue information from the target dialogue video; the target dialogue video is a video of the subject participating in a dialogue.
Preferably, the target dialogue video includes a video of a conversation with the subject or a video of the subject introducing themselves.
Step S104, extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network, respectively.
Step S106, extracting the audio feature sequence and the text information of the audio file, and extracting the text features in the text information.
Step S108, performing multi-modal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence and the text features to obtain target fusion features.
Step S110, performing regression on the target fusion features with a trained multi-layer perceptron to obtain quantitative predictions for each Big Five personality dimension of the subject.
Specifically, step S102 includes: uniformly capturing images containing the subject's face from the target dialogue video at a preset time interval to obtain the image sequence to be processed. For example, the preset time interval is 1 s.
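A minimal sketch of step S102 is given below, assuming OpenCV as the video backend (the patent does not name a specific tool) and using the 1 s sampling interval from the example above.

```python
# Sketch of step S102: uniformly sampling frames from the dialogue video.
import cv2

def sample_frames(video_path: str, interval_s: float = 1.0):
    """Return a list of BGR frames sampled every `interval_s` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back if FPS metadata is missing
    step = max(int(round(fps * interval_s)), 1)   # frames to skip between samples
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```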
Specifically, step S104 further includes the following steps:
step S1041, performing face detection on the image sequence to be processed to obtain a face image sequence.
Preferably, face detection is performed on the image sequence to be processed using a lightweight face detection network, the face regions are cropped out, and all face images are resized to a uniform size.
Step S1042, extracting a facial expression feature sequence and a head pose feature sequence from the face image sequence using a trained facial expression prediction network and a trained head pose estimation network, respectively; the facial expression prediction network comprises a VGGFace model, and the head pose estimation network comprises a ShuffleNet V2 model.
Preferably, the facial expression prediction network used in the embodiment of the invention is modified on the basis of a VGGFace model. The modified model structure is shown in fig. 2; the network retains the pretrained parameters of the corresponding modules of the original network and participates in the final training of the whole network.
Preferably, the head pose estimation network used in the embodiment of the invention is modified on the basis of the ShuffleNet V2 model. The modified model structure is shown in fig. 3; this model does not participate in the final network training.
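The following is a minimal sketch of step S104 under the assumption that the modified VGGFace and ShuffleNet V2 networks of Figs. 2 and 3 are already available as PyTorch modules (the names `expression_net` and `pose_net` are hypothetical); it only shows how the two per-frame feature sequences would be assembled at inference time.

```python
# Sketch of step S104: building the facial expression and head pose feature sequences.
# `expression_net` (modified VGGFace) and `pose_net` (modified ShuffleNet V2, frozen)
# are assumed to map a batch of face crops to feature vectors; their exact
# architectures follow Figs. 2 and 3 and are not reproduced here.
import torch

@torch.no_grad()  # inference-time sketch; during training the expression branch stays trainable
def extract_visual_sequences(face_crops, expression_net, pose_net, device="cpu"):
    """face_crops: tensor of shape (T, 3, H, W) holding the aligned face image sequence."""
    face_crops = face_crops.to(device)
    expr_seq = expression_net(face_crops)   # (T, D_expr) facial expression feature sequence
    pose_seq = pose_net(face_crops)         # (T, D_pose) head pose feature sequence
    return expr_seq, pose_seq
```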
Specifically, step S106 further includes the following steps:
step S1061, obtaining a serialized initial audio feature from an audio file using an audio analysis tool; the initial audio features include: short-time average zero-crossing rate, short-time energy, energy entropy, spectrum center and mel-frequency cepstral coefficient.
Step S1062, performing feature extraction on the initial audio features through a fully connected layer to obtain the audio feature sequence. In the embodiment of the invention, this fully connected layer participates in the final network training.
In step S1063, text information of the audio file is acquired based on the audio transcription.
Step S1064, extracting features of the text information based on the trained BERT model to obtain text features; text features include deep bi-directional language features. In the embodiment of the invention, the BERT model does not participate in the final network training.
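A minimal sketch of step S1064 with the Hugging Face transformers library follows; the specific checkpoint ("bert-base-chinese") and the use of the [CLS] embedding as the sentence-level text feature are assumptions, as the patent does not specify them.

```python
# Sketch of step S1064: sentence-level text features from a frozen BERT encoder.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()   # frozen: not trained further

@torch.no_grad()
def extract_text_features(transcript: str) -> torch.Tensor:
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True, max_length=512)
    outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]   # (1, 768) [CLS] embedding as the text feature
```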
Specifically, step S108 further includes the steps of:
and S1081, carrying out weighted fusion on the facial expression feature sequence, the head posture feature sequence and the audio feature sequence, and then further extracting features in the time dimension by utilizing a two-way long-short-term memory recurrent neural network to obtain primary fusion features.
Step S1082, performing weighted fusion of the primary fusion features and the text features to obtain the target fusion features.
Specifically, the weight parameters of the modalities are calculated as follows:
where $X_i$, $X_j$, $X_k$ are the facial expression feature sequence, the head pose feature sequence and the audio feature sequence, respectively, and $w_i$, $w_j$, $w_k$ are their corresponding weights; $X_m$ is the primary fusion feature and $w_m$ is its weight; $X_n$ is the text feature and $w_n$ is its weight; and MLP is a fully connected network.
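The following PyTorch sketch illustrates the two-stage weighted fusion of step S108. The patent states only that the weights come from a fully connected (MLP) network; deriving them from the concatenated modality features with a softmax, and the specific dimensions, are assumptions made for illustration.

```python
# Sketch of step S108: weighted modality fusion followed by a BiLSTM, then a second
# weighted fusion with the text feature (weight MLPs and dimensions are assumptions).
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, d_expr, d_pose, d_audio, d_text, d_model=128):
        super().__init__()
        # Project each modality sequence to a common dimension before weighting.
        self.proj = nn.ModuleDict({
            "expr": nn.Linear(d_expr, d_model),
            "pose": nn.Linear(d_pose, d_model),
            "audio": nn.Linear(d_audio, d_model),
        })
        self.weight_mlp = nn.Linear(3 * d_model, 3)    # -> w_i, w_j, w_k
        self.bilstm = nn.LSTM(d_model, d_model, batch_first=True, bidirectional=True)
        self.text_proj = nn.Linear(d_text, 2 * d_model)
        self.stage2_mlp = nn.Linear(4 * d_model, 2)    # -> w_m, w_n

    def forward(self, expr_seq, pose_seq, audio_seq, text_feat):
        # expr_seq/pose_seq/audio_seq: (B, T, d_*); text_feat: (B, d_text)
        e = self.proj["expr"](expr_seq)
        p = self.proj["pose"](pose_seq)
        a = self.proj["audio"](audio_seq)
        w = torch.softmax(self.weight_mlp(torch.cat([e, p, a], dim=-1)).mean(dim=1), dim=-1)
        fused = w[:, 0, None, None] * e + w[:, 1, None, None] * p + w[:, 2, None, None] * a
        _, (h, _) = self.bilstm(fused)                 # h: (2, B, d_model), both directions
        x_m = torch.cat([h[0], h[1]], dim=-1)          # primary fusion feature X_m
        x_n = self.text_proj(text_feat)                # text feature X_n
        w2 = torch.softmax(self.stage2_mlp(torch.cat([x_m, x_n], dim=-1)), dim=-1)
        return w2[:, 0, None] * x_m + w2[:, 1, None] * x_n   # target fusion feature
```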
Specifically, step S110 includes:
$$(S_1, S_2, S_3, S_4, S_5) = \sigma\big(\mathrm{MLP}_2\big(\mathrm{ReLU}\big(\mathrm{MLP}_1(w_m X_m + w_n X_n)\big)\big)\big)$$
where $S_1, S_2, S_3, S_4, S_5$ denote the tendency scores of the five personality dimensions, each taking a value between 0 and 1, with a larger value indicating a stronger tendency on that dimension; $\mathrm{MLP}_1$ is the first fully connected layer, $\mathrm{MLP}_2$ is the second fully connected layer, $\mathrm{ReLU}$ is the first activation function, and $\sigma$ is the second activation function.
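A minimal PyTorch sketch of the regression head described by the formula above follows; the hidden width (64) is an assumed value.

```python
# Sketch of step S110: MLP_1 -> ReLU -> MLP_2 -> sigmoid regression head.
import torch.nn as nn

class BigFiveRegressor(nn.Module):
    def __init__(self, d_fusion: int, d_hidden: int = 64):
        super().__init__()
        self.mlp1 = nn.Linear(d_fusion, d_hidden)   # MLP_1
        self.relu = nn.ReLU()                       # first activation function
        self.mlp2 = nn.Linear(d_hidden, 5)          # MLP_2, one output per personality dimension
        self.sigma = nn.Sigmoid()                   # second activation, maps scores into (0, 1)

    def forward(self, fusion_feat):
        return self.sigma(self.mlp2(self.relu(self.mlp1(fusion_feat))))   # (B, 5) scores S_1..S_5
```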
Steps S108 and S110 are iterated until the model converges. Specifically, after the quantitative predictions for each Big Five personality dimension of the subject are obtained, the models in steps S108 and S110 are trained iteratively with the following label-distribution-based loss function, so that the models gradually learn the multi-modal fusion feature information:
where $N$ denotes the number of samples in a training batch, $\mathrm{var}(y_i)$ denotes the variance of the $i$-th subject's personality traits, $\hat{y}_i$ denotes the predicted value of the $i$-th subject's traits, and $y_i$ denotes the true value of the $i$-th subject's traits.
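The exact form of the label-distribution loss is not reproduced here; the sketch below uses a variance-weighted squared error as one plausible form consistent with the variables listed above, purely as an illustrative stand-in rather than the patent's definitive loss.

```python
# Hedged sketch of a label-distribution-style loss (an assumption, not the patent's formula).
import torch

def label_distribution_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (N, 5) predicted and true Big Five scores for one training batch
    var = target.var(dim=1, keepdim=True) + 1e-8   # var(y_i): variance of the i-th subject's traits
    return ((pred - target) ** 2 / var).mean()     # averaged over the batch and dimensions
```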
Optionally, the method provided by the embodiment of the invention further includes: testing the model on collected data and evaluating its performance.
As can be seen from the above description, the embodiment of the invention provides a Big Five personality trait prediction method based on multi-modal fusion. By fusing facial expression features, head pose features, audio features and text features, the method can quantitatively evaluate the subject's Big Five personality in an all-round way, objectively and conveniently quantifying the subject's tendency on each Big Five dimension. It can be applied to fields such as corporate personnel selection, recruitment interviews and career planning, and alleviates the technical problems of distorted prediction results and strong subjectivity in existing personality prediction methods. In addition, in the method provided by the invention, the weight values describe how strongly each modality influences the result, so the model is interpretable.
It should also be appreciated that the memory in embodiments of the present invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as an external cache. By way of example but not limitation, many forms of random access memory (random access memory, RAM) are available, such as Static RAM (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced Synchronous Dynamic Random Access Memory (ESDRAM), synchronous Link DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any other combination. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) manner. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A Big Five personality trait prediction method based on multi-modal fusion, characterized by comprising the following steps:
capturing an image sequence to be processed containing the subject's face from a target dialogue video, and extracting an audio file containing the subject's dialogue information from the target dialogue video; the target dialogue video is a video of the subject participating in a dialogue;
extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network, respectively;
extracting an audio feature sequence and text information from the audio file, and extracting text features from the text information;
performing multi-modal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence and the text features to obtain target fusion features;
and performing regression on the target fusion features with a trained multi-layer perceptron to obtain quantitative predictions for each Big Five personality dimension of the subject.
2. The method of claim 1, wherein capturing an image sequence to be processed containing the subject's face from the target dialogue video comprises:
uniformly capturing images containing the subject's face from the target dialogue video at a preset time interval to obtain the image sequence to be processed.
3. The method of claim 1, wherein extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed using a trained facial expression prediction network and a trained head pose estimation network, respectively, comprises:
performing face detection on the image sequence to be processed to obtain a face image sequence;
extracting a facial expression feature sequence and a head pose feature sequence from the face image sequence using the trained facial expression prediction network and the trained head pose estimation network, respectively;
the facial expression prediction network comprises a VGGFace model; the head pose estimation network comprises a ShuffleNet V2 model.
4. The method of claim 1, wherein extracting the sequence of audio features and text information of the audio file and extracting text features in the text information comprises:
obtaining serialized initial audio features from the audio file using an audio analysis tool; the initial audio features include: short-time average zero-crossing rate, short-time energy, energy entropy, spectral centroid and mel-frequency cepstral coefficients (MFCCs);
performing feature extraction on the initial audio features based on a full connection layer to obtain the audio feature sequence;
acquiring text information of the audio file based on audio transcription;
extracting features of the text information based on the trained BERT model to obtain the text features; the text features include deep bi-directional language features.
5. The method of claim 1, wherein multimodal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text feature to obtain a target fusion feature comprises:
performing weighted fusion of the facial expression feature sequence, the head pose feature sequence and the audio feature sequence, then further extracting features along the time dimension with a bidirectional long short-term memory (BiLSTM) recurrent neural network to obtain primary fusion features;
and performing weighted fusion of the primary fusion features and the text features to obtain the target fusion features.
6. The method of claim 5, wherein performing regression on the target fusion features based on the trained multi-layer perceptron to obtain quantitative predictions for each Big Five personality dimension of the subject comprises:
$$(S_1, S_2, S_3, S_4, S_5) = \sigma\big(\mathrm{MLP}_2\big(\mathrm{ReLU}\big(\mathrm{MLP}_1(w_m X_m + w_n X_n)\big)\big)\big)$$
where $S_1, S_2, S_3, S_4, S_5$ denote the tendency scores of the five personality dimensions, $\mathrm{MLP}_1$ is the first fully connected layer, $\mathrm{MLP}_2$ is the second fully connected layer, $\mathrm{ReLU}$ is the first activation function, $\sigma$ is the second activation function, $w_m$ is the weight of the primary fusion feature, $X_m$ is the primary fusion feature, $w_n$ is the weight of the text feature, and $X_n$ is the text feature.
7. The method of claim 1, wherein the training loss function of the multi-layer perceptron comprises:
where $N$ denotes the number of samples in a training batch, $\mathrm{var}(y_i)$ denotes the variance of the $i$-th subject's personality traits, $\hat{y}_i$ denotes the predicted value of the $i$-th subject's traits, and $y_i$ denotes the true value of the $i$-th subject's traits.
CN202410082720.0A 2024-01-19 2024-01-19 Multi-modal fusion-based Big Five personality trait prediction method Pending CN117592014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410082720.0A CN117592014A (en) Multi-modal fusion-based Big Five personality trait prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410082720.0A CN117592014A (en) Multi-modal fusion-based Big Five personality trait prediction method

Publications (1)

Publication Number Publication Date
CN117592014A true CN117592014A (en) 2024-02-23

Family

ID=89920636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410082720.0A Pending CN117592014A (en) Multi-modal fusion-based Big Five personality trait prediction method

Country Status (1)

Country Link
CN (1) CN117592014A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN113705725A (en) * 2021-09-15 2021-11-26 中国矿业大学 User personality characteristic prediction method and device based on multi-mode information fusion
CN113724712A (en) * 2021-08-10 2021-11-30 南京信息工程大学 Bird sound identification method based on multi-feature fusion and combination model
CN114463671A (en) * 2021-12-29 2022-05-10 上海花事电子商务有限公司 User personality identification method based on video data
CN114841399A (en) * 2022-03-24 2022-08-02 合肥工业大学 Personality portrait generation method and system based on audio and video multi-mode feature fusion
CN115577316A (en) * 2022-09-28 2023-01-06 中国人民解放军国防科技大学 User personality prediction method based on multi-mode data fusion and application
KR20230128812A (en) * 2022-02-28 2023-09-05 전남대학교산학협력단 Cross-modal learning-based emotion inference system and method
CN117237766A (en) * 2023-07-12 2023-12-15 华中师范大学 Classroom cognition input identification method and system based on multi-mode data

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN113724712A (en) * 2021-08-10 2021-11-30 南京信息工程大学 Bird sound identification method based on multi-feature fusion and combination model
CN113705725A (en) * 2021-09-15 2021-11-26 中国矿业大学 User personality characteristic prediction method and device based on multi-mode information fusion
CN114463671A (en) * 2021-12-29 2022-05-10 上海花事电子商务有限公司 User personality identification method based on video data
KR20230128812A (en) * 2022-02-28 2023-09-05 전남대학교산학협력단 Cross-modal learning-based emotion inference system and method
CN114841399A (en) * 2022-03-24 2022-08-02 合肥工业大学 Personality portrait generation method and system based on audio and video multi-mode feature fusion
CN115577316A (en) * 2022-09-28 2023-01-06 中国人民解放军国防科技大学 User personality prediction method based on multi-mode data fusion and application
CN117237766A (en) * 2023-07-12 2023-12-15 华中师范大学 Classroom cognition input identification method and system based on multi-mode data

Similar Documents

Publication Publication Date Title
CN111461176B (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
Larsson et al. Alliance ruptures and repairs in psychotherapy in primary care
CN109935336B (en) Intelligent auxiliary diagnosis system for respiratory diseases of children
CN109460473B (en) Electronic medical record multi-label classification method based on symptom extraction and feature representation
DE112016004922B4 (en) Design and analysis system of the touch screen user button behavior mode and its identity recognition method
López-de-Ipiña et al. Feature selection for spontaneous speech analysis to aid in Alzheimer's disease diagnosis: A fractal dimension approach
Fergadiotis et al. Modelling confrontation naming and discourse performance in aphasia
CN110457432A (en) Interview methods of marking, device, equipment and storage medium
Al-Khassaweneh et al. A signal processing approach for the diagnosis of asthma from cough sounds
JP2010054568A (en) Emotional identification device, method and program
Callan et al. Self-organizing map for the classification of normal and disordered female voices
DE112020003909T5 (en) PROCEDURE FOR MULTIMODAL RETRIEVING RECOVERY AND CLUSTERS USING A DEEP CCA AND ACTIVE PAIRWISE QUERIES
Fergadiotis et al. Modeling confrontation naming and discourse informativeness using structural equation modeling
CN111292851A (en) Data classification method and device, computer equipment and storage medium
CN112836937A (en) Flood disaster loss evaluation method based on entropy weight and BP neural network technology
CN111210912A (en) Parkinson prediction method and device
Walker et al. Beyond percent correct: Measuring change in individual picture naming ability
Zhang et al. You never know what you are going to get: Large-scale assessment of therapists’ supportive counseling skill use.
Heard et al. Speech workload estimation for human-machine interaction
CN117592014A (en) Multi-modal fusion-based Big Five personality trait prediction method
Chen et al. IIFDD: Intra and inter-modal fusion for depression detection with multi-modal information from Internet of Medical Things
CN110782119A (en) Insurance agent selection method, device and equipment based on artificial intelligence
US11963771B2 (en) Automatic depression detection method based on audio-video
Fraser et al. Measuring cognitive status from speech in a smart home environment
Khavylo et al. Manifestation of Task’s Cognitive Complexity in Mimic Micromovements: Prognostic Model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination