CN117592014A

CN117592014A - Multi-modal fusion-based large five personality characteristic prediction method

Info

Publication number: CN117592014A
Application number: CN202410082720.0A
Authority: CN
Inventors: 马惠敏; 李欣; 王荣全
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2024-01-19
Filing date: 2024-01-19
Publication date: 2024-02-23

Abstract

The invention discloses a large five personality characteristic prediction method based on multi-mode fusion, which relates to the technical field of emotion calculation and comprises the following steps: intercepting an image sequence to be processed containing the face of a tested person from a target dialogue video, and extracting an audio file containing dialogue information of the tested person from the target dialogue video; respectively extracting a facial expression characteristic sequence and a head posture characteristic sequence from an image sequence to be processed by using a trained facial expression prediction network and a trained head posture estimation network; extracting an audio feature sequence of the audio file and text features of the audio transcription; carrying out multi-mode fusion on the facial expression feature sequence, the head gesture feature sequence, the audio feature sequence and the text feature to obtain target fusion features, and training the whole network by using a loss function based on label distribution; and carrying out weighted regression on the target fusion characteristics based on the trained multi-layer perceptron to obtain the quantization prediction result of each dimension of the large five personality of the tested person.

Description

Multi-modal fusion-based large five personality characteristic prediction method

Technical Field

The invention relates to the technical field of emotion calculation, in particular to a large five personality characteristic prediction method based on multi-mode fusion.

Background

The five-factor model of personality has wide application value in clinical psychology, health psychology, development psychology, occupation, management, industrial psychology and the like. As found by researches, the camber, the nervous matter, the humanity and the like are all related to mental health; camber and openness are two important relevant factors for professional and industrial psychology; the responsibility center is closely related to personnel selection. The current evaluation mode of the large five personality mainly uses a large five personality scale, such as NEO-PI-R, NEO-FFI, but the subjectivity of the scale is too strong, a tested person can not fill the scale faithfully, and an incorrect result can influence medical diagnosis, personnel selection of a company and the like, so that the loss of manpower and financial resources is caused.

The character of a person is usually observed after a certain time, but for recruitment, team optimization and talent evaluation, people are usually known in a short time. The traditional large five personality scales can actually evaluate the personality characteristics of a person in a short time, but the answering staff can also answer some problems in the personality characteristics of the person in a bad way, so that the personality characteristics of the person cannot be accurately obtained. In addition, since the meter contents are not usually updated frequently, a responder can easily memorize the meter contents by filling a lot of meter contents, thereby generating a trained effect, and attempt to answer options which make himself look better or more in line with social expectations, thereby causing distortion of test results.

Disclosure of Invention

In order to solve the technical problems in the prior art, the embodiment of the invention provides a large five personality characteristic prediction method based on multi-mode fusion. The technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides a method for predicting large five personality characteristics based on multi-modal fusion, where the method includes: intercepting an image sequence to be processed containing the face of a tested person from a target dialogue video, and extracting an audio file containing dialogue information of the tested person from the target dialogue video; the target dialogue video is a video of a testee participating in a dialogue; respectively extracting a facial expression characteristic sequence and a head posture characteristic sequence from the image sequence to be processed by using a trained facial expression prediction network and a trained head posture estimation network; extracting an audio feature sequence and text information of the audio file, and extracting text features in the text information; carrying out multi-mode fusion on the facial expression feature sequence, the head gesture feature sequence, the audio feature sequence and the text feature to obtain a target fusion feature; and carrying out regression on the target fusion characteristics based on the trained multi-layer perceptron to obtain a quantized prediction result of each dimension of the large five personality of the tested person.

Further, intercepting a to-be-processed image sequence containing the face of the tested person from the target dialogue video, which comprises the following steps: and uniformly intercepting images containing the faces of the testee from the target dialogue video according to a preset time interval to obtain an image sequence to be processed.

Further, the facial expression feature sequence and the head posture feature sequence are respectively extracted from the image sequence to be processed by using a trained facial expression prediction network and a trained head posture estimation network, and the method comprises the following steps: performing face detection on the image sequence to be processed to obtain a face image sequence; respectively extracting a facial expression characteristic sequence and a head posture characteristic sequence from the facial image sequence by using a trained facial expression prediction network and a trained head posture estimation network; the facial expression prediction network comprises a VGGFace model; the head pose estimation network includes a ShuffleNet V2 model.

Further, extracting the audio feature sequence and text information of the audio file, and extracting text features in the text information, including: obtaining a serialized initial audio feature from the audio file using an audio analysis tool; the initial audio features include: short-time average zero-crossing rate, short-time energy, energy entropy, spectrum center and mel cepstrum coefficient; performing feature extraction on the initial audio features based on a full connection layer to obtain the audio feature sequence; acquiring text information of the audio file based on audio transcription; extracting features of the text information based on the trained BERT model to obtain the text features; the text features include deep bi-directional language features.

Further, performing multi-modal fusion on the facial expression feature sequence, the head gesture feature sequence, the audio feature sequence and the text feature to obtain a target fusion feature, including: the facial expression feature sequence, the head gesture feature sequence and the audio feature sequence are subjected to weighted fusion, and then features in the time dimension are further extracted by using a two-way long-short-term memory recurrent neural network, so that primary fusion features are obtained; and carrying out weighted fusion on the primary fusion characteristic and the text characteristic to obtain a target fusion characteristic.

Further, regression is performed on the target fusion characteristics based on the trained multi-layer perceptron to obtain a quantized prediction result of each dimension of the large five personality of the tested person, including:

wherein S1, S2, S3, S4, S5 respectively represent different dimension trends of five personality, MLP ₁ Is the first full connection layer, MLP ₂ Is the second full connection layer, reLU is the first activation function, sigma is the second activation function, w _m Is the weight of the preliminary fusion feature, X _m Is the preliminary fusion feature, w _n Is the weight of the text feature, X _n Is the text feature.

Further, the training loss function of the multi-layer perceptron comprises:

where N represents the number of data in a batch during training, var (y _i ) Representing the variance of the characteristics of the ith person,predictive value, y, representing the character of the ith person _i Representing the true value of the i-th personal trait.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least: the invention combines the facial expression characteristics, the head posture characteristics, the audio characteristics and the text characteristics, can quantitatively evaluate the big five personality of the tested person in all directions, can objectively and conveniently quantitatively evaluate the tendency condition of each dimension of the big five personality of the tested person, and relieves the technical problems of distortion and strong subjectivity of the prediction result in the existing personality prediction method.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a large five personality characteristic prediction method based on multi-modal fusion provided by an embodiment of the invention;

fig. 2 is a schematic structural diagram of a modified VGGFace model according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a modified ShuffleNet V2 model according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is described below with reference to the accompanying drawings.

In embodiments of the invention, words such as "exemplary," "such as" and the like are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term use of an example is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, the meaning of "and/or" may be that of both, or may be that of either, optionally one of both.

Example 1

Fig. 1 is a flowchart of a large five personality characteristic prediction method based on multi-modal fusion according to an embodiment of the present invention. As shown in fig. 1, the method specifically includes the following steps:

step S102, intercepting a to-be-processed image sequence containing the face of an image testee from a target dialogue video, and extracting an audio file containing the dialogue information of the testee from the target dialogue video; the target dialogue video is a video of a person to be tested participating in a dialogue.

Preferably, the target dialogue video includes a video communicated with the subject or a video of the subject's self-presentation process.

Step S104, a trained facial expression prediction network and a trained head posture estimation network are utilized to respectively extract a facial expression characteristic sequence and a head posture characteristic sequence from the image sequence to be processed.

Step S106, extracting the audio feature sequence and the text information of the audio file, and extracting the text features in the text information.

And S108, carrying out multi-mode fusion on the facial expression feature sequence, the head gesture feature sequence, the audio feature sequence and the text feature to obtain target fusion features.

And step S110, carrying out regression on the target fusion characteristics based on the trained multi-layer perceptron to obtain the quantized prediction results of the five big personality dimensions of the tested person.

Specifically, step S102 includes: and uniformly intercepting images containing the faces of the testee from the target dialogue video according to a preset time interval to obtain an image sequence to be processed. For example, the preset time interval is 1s.

Specifically, step S104 further includes the following steps:

step S1041, performing face detection on the image sequence to be processed to obtain a face image sequence.

Preferably, the face detection is carried out on the image sequence to be processed based on a lightweight face detection network, face images are cut out, and the sizes of all face images are unified.

Step S1042, extracting a facial expression feature sequence and a head posture feature sequence from a facial image sequence by using a trained facial expression prediction network and a trained head posture estimation network; the facial expression prediction network comprises a VGGFace model; the head pose estimation network includes a ShuffleNet V2 model.

Preferably, the facial expression prediction network used in the embodiment of the invention is modified on the basis of a VGGFace model, the modified model structure is shown in fig. 2, and the network structure retains training parameters of the corresponding module of the original network and participates in the final training of the network.

Preferably, the head pose estimation network used in the embodiments of the present invention is modified on the basis of the ShuffleNet V2 model, the modified model structure is shown in fig. 3, and the model does not participate in the final network training.

Specifically, step S106 further includes the following steps:

step S1061, obtaining a serialized initial audio feature from an audio file using an audio analysis tool; the initial audio features include: short-time average zero-crossing rate, short-time energy, energy entropy, spectrum center and mel-frequency cepstral coefficient.

Step S1062, extracting features of the initial audio features based on the full connection layer to obtain an audio feature sequence. In the embodiment of the invention, the full connection layer participates in the final network training.

In step S1063, text information of the audio file is acquired based on the audio transcription.

Step S1064, extracting features of the text information based on the trained BERT model to obtain text features; text features include deep bi-directional language features. In the embodiment of the invention, the BERT model does not participate in the final network training.

Specifically, step S108 further includes the steps of:

and S1081, carrying out weighted fusion on the facial expression feature sequence, the head posture feature sequence and the audio feature sequence, and then further extracting features in the time dimension by utilizing a two-way long-short-term memory recurrent neural network to obtain primary fusion features.

And step S1082, carrying out weighted fusion on the primary fusion characteristics and the text characteristics to obtain target fusion characteristics.

Specifically, the specific calculation method of the weight parameter of each mode is as follows:

wherein X is _i ，X _j ，X _k Respectively a facial expression characteristic sequence, a head posture characteristic sequence and an audio characteristic sequence, w _i ，w _j ，w _k The weight of the facial expression feature sequence, the weight of the head gesture feature sequence and the weight of the audio feature sequence are respectively. X is X _m Is a preliminary fusion feature, w _m Is the weight of the preliminary fusion feature. X is X _n Is a text feature, w _n Is the weight of text features, and MLP is a full connection layer network.

Specifically, step S110 includes:

wherein S is ₁ ，S ₂ ，S ₃ ，S ₄ ，S ₅ Representing different dimension trends of five personality, wherein the value is between 0 and 1, and the greater the value is, the greater the trend degree of the personality is; MLP (Multi-layer Programming protocol) ₁ Is the first full connection layer, MLP ₂ Is the second fully connected layer, reLU is the first activation function, σ is the second activation function.

The steps S108 and S110 are iterated until the model converges. Specifically, after obtaining the quantized prediction result of each dimension of the large five personality of the tested person, iteratively training the models in step S108 and step S110 by using the following loss function based on label distribution, so that the models learn the multi-modal fusion feature information step by step:

where N represents the number of data in a batch during training, var (y _i ) Representing the variance of the characteristics of the ith person,representing the ith personal compartmentPredicted value of the feature, y _i Representing the true value of the i-th personal trait.

Optionally, the method provided by the embodiment of the invention further includes: the model is used to test on the collected data and evaluate the model performance.

As can be seen from the above description, the embodiment of the invention provides a large five personality characteristic prediction method based on multi-mode fusion, which fuses facial expression characteristics, head posture characteristics, audio characteristics and text characteristics, can quantitatively evaluate the large five personality of a tested person in all directions, can quantitatively evaluate the tendency condition of each dimension of the large five personality of the tested person objectively and conveniently, can be applied to the fields of personnel selection, recruitment interviews, professional planning and the like of companies, and alleviates the technical problems of distortion and strong subjectivity of prediction results in the existing personality prediction method. In addition, the method provided by the invention has the advantages that the influence degree of the mode on the result can be described by the weight value, so that the model has interpretation.

It should also be appreciated that the memory in embodiments of the present invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as an external cache. By way of example but not limitation, many forms of random access memory (random access memory, RAM) are available, such as Static RAM (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced Synchronous Dynamic Random Access Memory (ESDRAM), synchronous Link DRAM (SLDRAM), and direct memory bus RAM (DR RAM).

The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any other combination. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more sets of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.

It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. The large five personality characteristic prediction method based on the multi-mode fusion is characterized by comprising the following steps of:

intercepting an image sequence to be processed containing the face of a tested person from a target dialogue video, and extracting an audio file containing dialogue information of the tested person from the target dialogue video; the target dialogue video is a video of a testee participating in a dialogue;

respectively extracting a facial expression characteristic sequence and a head posture characteristic sequence from the image sequence to be processed by using a trained facial expression prediction network and a trained head posture estimation network;

extracting an audio feature sequence and text information of the audio file, and extracting text features in the text information;

carrying out multi-mode fusion on the facial expression feature sequence, the head gesture feature sequence, the audio feature sequence and the text feature to obtain a target fusion feature;

and carrying out regression on the target fusion characteristics based on the trained multi-layer perceptron to obtain a quantized prediction result of each dimension of the large five personality of the tested person.

2. The method of claim 1, wherein capturing a sequence of images to be processed including a face of a subject from a target dialog video, comprises:

and uniformly intercepting images containing the faces of the testee from the target dialogue video according to a preset time interval to obtain an image sequence to be processed.

3. The method according to claim 1, wherein extracting a facial expression feature sequence and a head pose feature sequence from the image sequence to be processed respectively using a trained facial expression prediction network and a trained head pose estimation network, comprises:

performing face detection on the image sequence to be processed to obtain a face image sequence;

respectively extracting a facial expression characteristic sequence and a head posture characteristic sequence from the facial image sequence by using a trained facial expression prediction network and a trained head posture estimation network;

the facial expression prediction network comprises a VGGFace model; the head pose estimation network includes a ShuffleNet V2 model.

4. The method of claim 1, wherein extracting the sequence of audio features and text information of the audio file and extracting text features in the text information comprises:

obtaining a serialized initial audio feature from the audio file using an audio analysis tool; the initial audio features include: short-time average zero-crossing rate, short-time energy, energy entropy, spectrum center and mel cepstrum coefficient;

performing feature extraction on the initial audio features based on a full connection layer to obtain the audio feature sequence;

acquiring text information of the audio file based on audio transcription;

extracting features of the text information based on the trained BERT model to obtain the text features; the text features include deep bi-directional language features.

5. The method of claim 1, wherein multimodal fusion of the facial expression feature sequence, the head pose feature sequence, the audio feature sequence, and the text feature to obtain a target fusion feature comprises:

the facial expression feature sequence, the head gesture feature sequence and the audio feature sequence are subjected to weighted fusion, and then features in the time dimension are further extracted by using a two-way long-short-term memory recurrent neural network, so that primary fusion features are obtained;

and carrying out weighted fusion on the primary fusion characteristic and the text characteristic to obtain a target fusion characteristic.

6. The method of claim 5, wherein regressing the target fusion features based on the trained multi-layer perceptron to obtain quantized prediction results for each dimension of the subject's large five personality comprises:

；

wherein S is ₁ ，S ₂ ，S ₃ ，S ₄ ，S ₅ Representing different dimension trends of five personality, MLP ₁ Is the first full connection layer, MLP ₂ Is the second full connection layer, reLU is the first activation function, sigma is the second activation function, w _m Is the weight of the preliminary fusion feature, X _m Is the preliminary fusion feature, w _n Is the weight of the text feature, X _n Is the text feature.

7. The method of claim 1, wherein the training loss function of the multi-layer perceptron comprises:

；