CN111259976B - Personality detection method based on multi-modal alignment and multi-vector characterization - Google Patents

Personality detection method based on multi-modal alignment and multi-vector characterization

Info

Publication number
CN111259976B
Authority
CN
China
Prior art keywords
voice
personality
video
text
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010070066.3A
Other languages
Chinese (zh)
Other versions
CN111259976A (en)
Inventor
陈承勃 (Chen Chengbo)
权小军 (Quan Xiaojun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010070066.3A
Publication of CN111259976A
Application granted
Publication of CN111259976B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a personality detection method based on multi-modal alignment and multi-vector characterization. The method resamples the voice and video modal data for each epoch; inputs the resulting samples, together with their text modal data, into an intra-modal characterization module for independent encoding to obtain a voice sequence, a video sequence and a text sequence; inputs the three sequences into an inter-modal alignment characterization module, which aligns and interacts them pairwise and then splices them to obtain enhanced voice, video and text characterizations; splices all the voice characterizations into a voice vector, all the video characterizations into a video vector and all the text characterizations into a text vector, and converts these vectors into at least two personality vectors through a convolutional neural network; and finally linearizes the personality vectors and maps them through a sigmoid function to obtain the prediction probabilities of at least two classes of personality characteristics. By letting the three modalities interact, the method enhances each modality's characterization, improves the discriminative power of the model, and yields more accurate predictions.

Description

Personality detection method based on multi-modal alignment and multi-vector characterization
Technical Field
The invention relates to the field of data processing, and in particular to a personality detection method based on multi-modal alignment and multi-vector characterization.
Background
Some prior work predicts personality traits using data from the two modalities of voice and video. Specifically, the original video is randomly sampled to obtain a certain number of frames of pictures and of the voice spectrum. For each frame, video features are extracted with a residual network, and MFCC features of the speech spectrum are extracted via the Fourier transform. The per-frame video features and the audio MFCC features are spliced and input into a multi-layer bidirectional LSTM network, which jointly encodes the video and audio features. Each frame's encoded vector is then passed through a linear layer and regressed with a sigmoid function. Finally, average pooling yields a 5-dimensional vector whose entries are the scores on the five personality classes. Other work builds models on data from the three modalities of speech, text and video. Specifically, for speech, that work feeds the raw audio signal directly into the neural network instead of using MFCC features extracted by the Fourier transform; a convolutional neural network converts the audio signal into a 64-dimensional vector. For text, a convolutional neural network is likewise used to encode a 64-dimensional vector. For video, one frame is randomly sampled from the video, input into a convolutional neural network, and encoded into a 64-dimensional vector. The convolutional neural networks of the three modalities have different structures and parameters. Finally, the vectors of the three modalities are spliced into a 192-dimensional vector (three 64-dimensional vectors concatenated), and regression prediction is performed for each of the five classes after a linear transformation.
These prior techniques mainly consider the speech and video modalities while ignoring the specific content of what is said, which limits the expressive power of the model. In general, we cannot accurately judge a speaker's emotion and personality from intonation and expression alone; intonation, speaking content and expression all reflect an individual's personality. Taking the speaker's specific content into account, especially characteristic word choices, greatly enriches the available information about the speaker and allows more accurate judgment of personality characteristics. Furthermore, in the prior art the encodings of different modalities are independent of each other, which also limits the model's expressiveness. In addition, the prior art samples each training example only once before training, and the entire training process reuses only the few frames of video and audio obtained from that single sampling, so the data volume remains insufficient. Finally, the prior art learns only one vector representation per sample and performs all five regression tasks on it, which cannot distinguish the five personality classes well: a single vector representation cannot effectively and comprehensively characterize an individual across the five classes, whereas representing each class with its own vector characterizes the individual more comprehensively.
Disclosure of Invention
The main object of the invention is to provide a personality detection method based on multi-modal alignment and multi-vector characterization that aims to overcome the above problems.
In order to achieve the above purpose, the personality detection method based on multi-modal alignment and multi-vector characterization provided by the invention comprises the following steps:
s10, resampling voice and video modal data according to each epoch to generate a plurality of samples with differences;
s20, inputting a plurality of samples and text modal data thereof into an intra-modal characterization module, wherein the intra-modal characterization module respectively and independently encodes three modal data of audio, video and text to obtain a voice sequence, a video sequence and a text sequence;
s30, inputting the voice sequence, the video sequence and the text sequence into an inter-mode pair Ji Biaozheng module, and respectively aligning and interacting the voice sequence, the video sequence and the text sequence in pairs by the inter-mode pair Ji Biaozheng module and then splicing to obtain enhanced voice characterization, video characterization and text characterization;
s40, splicing all the voice tokens into voice vectors, splicing all the video tokens into video vectors, splicing all the text tokens into text vectors, and respectively converting the voice vectors, the video vectors and the text vectors into at least two personality vectors by using a convolutional neural network;
s50, linearizing at least two types of personality vectors respectively, and mapping through a sigmoid function to obtain the prediction probability of at least two types of personality characteristics.
Preferably, the S20 includes:
the intra-modal characterization module extracts the Mel-frequency cepstral coefficients and the corresponding Fbank features of the audio in the sample through the Fourier transform, inputs them into a multi-layer bidirectional LSTM network for encoding so as to capture voice intonation change features, encodes the captured features into a voice sequence, and outputs the voice sequence;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of the video features, inputs these vectors into a multi-layer bidirectional LSTM network, encodes the learned expression and motion changes into a video sequence, and outputs the video sequence;
the intra-modal characterization module encodes the text in the sample through a BERT model based on the Transformer structure to obtain a text sequence with deep semantic information.
Preferably, the personality vectors are 5 classes of personality vectors, which include:
the openness personality vector, used to extract an individual's imagination, aesthetic sensitivity, emotional richness, unconventionality, creativity and intellect characteristics;
the conscientiousness personality vector, used to extract the competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, caution and restraint displayed by the individual;
the extraversion personality vector, used to extract the enthusiasm, sociability, assertiveness, activity, adventurousness and optimism shown by the individual;
the agreeableness personality vector, used to extract the individual's trust, altruism, straightforwardness, compliance, modesty and empathy characteristics;
the neuroticism personality vector, used to extract the individual's emotional-instability characteristics, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, and the convolution kernel of each group of convolutions has a size of 1.
Preferably, after S50, the method further includes:
s60, carrying out weighted average on at least two types of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining the comprehensive personality probability through a sigmoid function.
Preferably, the text modal data is the vector characterization collected from the video subtitle text, encoded by a BERT model based on the Transformer structure, where the BERT model has been pre-trained on an English text dataset so that it encodes semantic information.
Preferably, the inter-modal alignment characterization module adopts an attention mechanism to align and interact the voice sequence, the video sequence and the text sequence pairwise.
Preferably, the inter-modal alignment characterization module aligns the text sequence to the voice sequence using text-to-speech (text2audio) attention to enhance the voice characterization, and aligns the voice sequence to the video sequence using speech-to-video (audio2video) attention to enhance the video characterization.
Preferably, in S10 the voice and video modal data are synchronously resampled for each epoch.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, the convolution kernel of each group having a size of 1; the method for encoding the captured voice intonation variation features into the voice sequence comprises: encoding the voice intonation variation features of each frame into a high-dimensional voice vector output as the voice sequence; the method for the multi-layer bidirectional LSTM network to encode the learned expression and motion changes into the video sequence comprises: the multi-layer bidirectional LSTM network learns the expression and motion changes in each frame of picture, converts them into picture features, and encodes the picture features into a high-dimensional picture vector for output.
Preferably, the multi-layer bidirectional LSTM network has been trained on a large-scale audio dataset so that it can extract audio features; the convolutional neural network has been pre-trained on the ImageNet task so that it can extract picture features.
According to the invention, the voice and video modalities of each sample are resampled once per training epoch; mutual attention between modalities is used for alignment, enhancing the characterization of each modality; and in the modality fusion module each individual is mapped into at least two classes of personality vector characterizations, corresponding respectively to the individual's scores on at least two classes of personality characteristics. The invention has the following three advantages:
1. The data is fully utilized: resampling achieves a data-augmentation effect and improves the robustness of the model. For each individual, the invention resamples the voice and video modalities before each epoch begins, so the training samples of different epochs differ subtly for the same individual. This makes full use of every frame of video and audio. In the prior art, sampling is performed only once before training and the whole training process uses only that single sampling result, so the data is not fully utilized.
2. The invention uses the attention mechanism to let the different modalities interact fully, which greatly strengthens the characterization ability of each modality. The mutual interaction and alignment between modalities promote the characterizations of all modalities and improve the expressive power of the model.
3. Characterizing each class of personality characteristics with its own vector depicts each class more accurately and thus depicts the individual's personality more comprehensively. The multiple vector characterizations are then used separately for the classification and prediction of the personality characteristics.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the structure of the model of the invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and rear … …) are included in the embodiments of the present invention, the directional indications are merely used to explain the relative positional relationship, movement conditions, etc. between the components in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indications are correspondingly changed.
In addition, if there is a description of "first", "second", etc. in the embodiments of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but it is necessary to base that the technical solutions can be realized by those skilled in the art, and when the technical solutions are contradictory or cannot be realized, the combination of the technical solutions should be considered to be absent and not within the scope of protection claimed in the present invention.
As shown in fig. 1-2, the personality detection method based on multi-modal alignment and multi-vector characterization provided by the invention comprises the following steps:
s10, resampling voice and video modal data according to each epoch to generate a plurality of samples with differences;
s20, inputting a plurality of samples and text modal data thereof into an intra-modal characterization module, wherein the intra-modal characterization module respectively and independently encodes three modal data of audio, video and text to obtain a voice sequence, a video sequence and a text sequence;
s30, inputting the voice sequence, the video sequence and the text sequence into an inter-mode pair Ji Biaozheng module, and respectively aligning and interacting the voice sequence, the video sequence and the text sequence in pairs by the inter-mode pair Ji Biaozheng module and then splicing to obtain enhanced voice characterization, video characterization and text characterization;
s40, splicing all the voice tokens into voice vectors, splicing all the video tokens into video vectors, splicing all the text tokens into text vectors, and respectively converting the voice vectors, the video vectors and the text vectors into at least two personality vectors by using a convolutional neural network;
s50, linearizing at least two types of personality vectors respectively, and mapping through a sigmoid function to obtain the prediction probability of at least two types of personality characteristics.
According to the above technical scheme, mutual attention between modalities is used for alignment, enhancing the characterization of each modality; in the modality fusion module, each individual is mapped into 5 vector characterizations, corresponding respectively to the individual's scores on the 5 classes of personality characteristics.
Preferably, the S20 includes:
the intra-modal characterization module extracts the Mel-frequency cepstral coefficients and the corresponding Fbank features of the audio in the sample through the Fourier transform, inputs them into a multi-layer bidirectional LSTM network for encoding so as to capture voice intonation change features, encodes the captured features into a voice sequence, and outputs the voice sequence;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of the video features, inputs these vectors into a multi-layer bidirectional LSTM network, encodes the learned expression and motion changes into a video sequence, and outputs the video sequence;
the intra-modal characterization module encodes the text in the sample through a BERT model based on the Transformer structure to obtain a text sequence with deep semantic information.
In the embodiment of the invention, the intra-modal characterization module is responsible for independently encoding the three modalities of voice, text and video to obtain a characterization of each modality. For the resampled picture sequence, a convolutional neural network with a residual structure encodes each picture into a high-dimensional vector; these vectors are then input into the multi-layer bidirectional LSTM network, which learns the changes in expression and motion.
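As an illustration of this video branch, the following is a minimal PyTorch sketch rather than the patent's exact network: a ResNet-18 pre-trained on ImageNet stands in for the residual CNN, and the hidden size and layer count of the bidirectional LSTM are assumed values.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VideoEncoder(nn.Module):
    """Residual CNN per frame + multi-layer bidirectional LSTM over the frame sequence."""
    def __init__(self, hidden: int = 128, layers: int = 2):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")                # ImageNet pre-training
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.rnn = nn.LSTM(512, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, 224, 224) resampled picture sequence
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)           # (b*t, 512) high-dimensional vectors
        seq, _ = self.rnn(feats.view(b, t, -1))                     # (b, t, 2*hidden) video sequence
        return seq
```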
Preferably, the personality vectors are 5 classes of personality vectors, which include:
the openness personality vector, used to extract an individual's imagination, aesthetic sensitivity, emotional richness, unconventionality, creativity and intellect characteristics;
the conscientiousness personality vector, used to extract the competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, caution and restraint displayed by the individual;
the extraversion personality vector, used to extract the enthusiasm, sociability, assertiveness, activity, adventurousness and optimism shown by the individual;
the agreeableness personality vector, used to extract the individual's trust, altruism, straightforwardness, compliance, modesty and empathy characteristics;
the neuroticism personality vector, used to extract the individual's emotional-instability characteristics, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
Preferably, after S50, the method further includes:
s60, carrying out weighted average on at least two types of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining the comprehensive personality probability through a sigmoid function.
In the embodiment of the invention, two tasks are predicted. The main task is to predict the individual's scores on the 5 classes of personality. Specifically, the 5 vectors obtained from the previous module are passed through 5 separate linear layers and then mapped by a sigmoid function into numbers in [0, 1], representing the individual's scores on the 5 corresponding classes of personality characteristics. The auxiliary task is to predict the probability that an individual with these personality characteristics would be invited to an interview: the 5 vectors obtained from the previous module are weighted-averaged into a new vector representing the individual's overall personality, and this vector is passed through 1 linear layer and a sigmoid function to obtain the probability. The weights are learned by the model. The 6 probabilities obtained by this module serve as the final output of the model.
Preferably, the text modal data is the vector characterization collected from the video subtitle text, encoded by a BERT model based on the Transformer structure, where the BERT model has been pre-trained on an English text dataset so that it encodes semantic information.
In the embodiment of the invention, the BERT model is a model pre-trained on a large-scale English text dataset using current state-of-the-art techniques.
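A minimal sketch of this text branch, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint as stand-ins for the pre-trained BERT encoder described above:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")   # pre-trained on English text

def encode_subtitles(subtitle: str) -> torch.Tensor:
    """Encode the video's subtitle text into a sequence of deep semantic vectors."""
    tokens = tokenizer(subtitle, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return bert(**tokens).last_hidden_state          # (1, seq_len, 768) text sequence
```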
Preferably, the inter-modal alignment characterization module adopts an attention mechanism to align and interact the voice sequence, the video sequence and the text sequence pairwise.
Preferably, the inter-modal alignment characterization module aligns the text sequence to the voice sequence using text-to-speech (text2audio) attention to enhance the voice characterization, and aligns the voice sequence to the video sequence using speech-to-video (audio2video) attention to enhance the video characterization.
Preferably, in S10 the voice and video modal data are synchronously resampled for each epoch.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, the convolution kernel of each group having a size of 1; the method for encoding the captured voice intonation variation features into the voice sequence comprises: encoding the voice intonation variation features of each frame into a high-dimensional voice vector output as the voice sequence; the method for the multi-layer bidirectional LSTM network to encode the learned expression and motion changes into the video sequence comprises: the multi-layer bidirectional LSTM network learns the expression and motion changes in each frame of picture, converts them into picture features, and encodes the picture features into a high-dimensional picture vector for output.
Preferably, the multi-layer bidirectional LSTM network has been trained on a large-scale audio dataset so that it can extract audio features; the convolutional neural network has been pre-trained on the ImageNet task so that it can extract picture features.
Actual operation example:
Data of the three modalities of voice, text and video are collected for the personality detection task, and a resampling module, an intra-modal characterization module, an inter-modal alignment characterization module, a modality fusion module and a prediction module are set up. The resampling module samples the voice and video of an input sample to obtain a certain number of frames of spectrum and of pictures to input into the network; the intra-modal characterization module independently encodes the data of each modality to obtain each modality's characterization; the inter-modal alignment characterization module learns the connections between different modalities and enriches each modality's characterization with information aligned from the other modalities; the modality fusion module fuses the characterizations learned by the three modalities, obtaining one final vector characterization for each class of personality, i.e., 5 vector characterizations in total. The prediction module performs two prediction tasks: an auxiliary task, predicting whether the individual would be invited to an interview, and the main task, predicting the individual's scores on the five classes of personality characteristics, which serve as the final output of the model.
The Big Five personality vectors comprise the openness, conscientiousness, extraversion, agreeableness and neuroticism personality vectors. The openness personality vector extracts characteristics such as an individual's imagination, aesthetic sensitivity, emotional richness, unconventionality, creativity and intellect; the conscientiousness personality vector extracts characteristics such as the competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, caution and restraint displayed by the individual; the extraversion personality vector extracts characteristics such as the enthusiasm, sociability, assertiveness, activity, adventurousness and optimism shown by the individual; the agreeableness personality vector extracts characteristics such as the individual's trust, altruism, straightforwardness, compliance, modesty and empathy; the neuroticism personality vector extracts the individual's emotional-instability characteristics, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability, i.e., the inability to keep emotions stable.
1. Resampling Module
This module is mainly responsible for randomly sampling the voice and video of a sample. Unlike other methods, the invention resamples the same sample once per epoch. In this way, a plurality of subtly different samples can be generated from one sample, achieving a data-augmentation effect and improving the robustness of the model. During sampling, to ensure that voice and video stay strictly aligned, the two are sampled synchronously, i.e., the audio and the picture at the same moment are sampled together. The sampled audio and pictures, together with the sample's text, are input into the In-Modality Module for encoding.
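The synchronous sampling can be sketched as follows, under assumed tensor layouts (the patent does not fix frame counts): the same randomly chosen time indices select both the picture frames and the audio windows aligned with them, and fresh indices are drawn each epoch.

```python
from typing import Optional, Tuple
import torch

def resample_synchronized(frames: torch.Tensor,
                          audio_windows: torch.Tensor,
                          num_samples: int,
                          generator: Optional[torch.Generator] = None
                          ) -> Tuple[torch.Tensor, torch.Tensor]:
    """Sample the same time indices from video frames and their aligned audio windows."""
    total = frames.shape[0]
    assert audio_windows.shape[0] == total, "modalities must share one timeline"
    idx = torch.randperm(total, generator=generator)[:num_samples].sort().values
    return frames[idx], audio_windows[idx]

# Called once at the start of every epoch, so each epoch trains on a slightly
# different subset of the raw recording (the data-augmentation effect).
```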
2. In-Modality Module (intra-modal characterization module)
This module is responsible for independently encoding the data of the three modalities of voice, text and video. For the resampled audio sequence, it extracts MFCC (Mel Frequency Cepstral Coefficients) and Fbank (filter-bank) features using the Fourier transform and inputs them into a multi-layer bidirectional LSTM network for encoding; the network captures the characteristics of voice intonation change and finally encodes each frame into a high-dimensional vector. The bidirectional LSTM network has been pre-trained on a large-scale audio dataset and has the capability of extracting audio features.
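A sketch of this audio feature pipeline under assumed settings (16 kHz audio, 40 coefficients, torchaudio as the feature extractor); the log-mel spectrogram stands in for the Fbank features:

```python
import torch
import torch.nn as nn
import torchaudio

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)
fbank = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=40)

def audio_features(wave: torch.Tensor) -> torch.Tensor:
    # wave: (batch, samples); both transforms return (batch, n_feats, time)
    feats = torch.cat([mfcc(wave), fbank(wave).log1p()], dim=1)
    return feats.transpose(1, 2)                     # (batch, time, 80) for the LSTM

audio_rnn = nn.LSTM(80, 128, num_layers=2, batch_first=True, bidirectional=True)
seq, _ = audio_rnn(audio_features(torch.randn(2, 16000)))  # (2, time, 256) voice sequence
```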
3. Cross-Modality Alignment Module (inter-modal alignment characterization module)
This module receives the vector characterization sequences encoded independently within each modality, namely the voice sequence, the text sequence and the picture sequence, and lets the characterizations of the three modalities interact with one another to strengthen their respective encodings. The interaction used by the invention is the attention mechanism. For example, using the two attentions audio2text and video2text, the voice and the video are each aligned to the text; the aligned voice and video characterizations yield a text characterization containing voice and video information, which is spliced with the original text characterization to obtain the enhanced text characterization. Similarly, a voice characterization enhanced with text and video, and a video characterization enhanced with voice and text, can be obtained.
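One pairwise alignment step might look like the following sketch; multi-head attention and the 256-dimensional size are assumptions, since the patent only specifies an attention mechanism. The query modality attends over the key modality, and the attended summary is spliced onto the original characterization.

```python
import torch
import torch.nn as nn

class CrossModalAlign(nn.Module):
    """Align one modality's sequence to another and splice the result on."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_seq: torch.Tensor, key_seq: torch.Tensor) -> torch.Tensor:
        # e.g. audio2text: query_seq is the text sequence, key_seq the voice sequence
        aligned, _ = self.attn(query_seq, key_seq, key_seq)
        return torch.cat([query_seq, aligned], dim=-1)   # enhanced characterization

text = torch.randn(2, 30, 256); speech = torch.randn(2, 50, 256)
enhanced_text = CrossModalAlign()(text, speech)          # (2, 30, 512)
```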
4. Modality Fusion Module (modal fusion module)
This module is responsible for fusing the per-frame vector characterizations of the three modalities obtained from the previous module; the fusion is performed by splicing, giving a new vector. These vectors are then converted into 5 vectors using a convolutional neural network (specifically, one layer of one-dimensional convolution with kernel size 1 and 5 groups), each vector representing the individual's characteristics on one class of personality.
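The following sketch shows one way to realize the grouped kernel-size-1 convolution described above. Repeating the fused input once per group, so that the 5 groups act as 5 independent projections, is our assumption about the wiring, which the patent does not spell out; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Concatenated per-frame tokens -> one vector per personality class."""
    def __init__(self, fused_dim: int = 768, trait_dim: int = 128, traits: int = 5):
        super().__init__()
        self.traits = traits
        self.conv = nn.Conv1d(fused_dim * traits, trait_dim * traits,
                              kernel_size=1, groups=traits)        # one group per class

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, time, fused_dim) spliced speech/video/text tokens
        x = fused.transpose(1, 2).repeat(1, self.traits, 1)        # (b, 5*fused_dim, time)
        x = self.conv(x).mean(dim=-1)                              # pool over time: (b, 5*trait_dim)
        return x.view(-1, self.traits, x.shape[1] // self.traits)  # (b, 5, trait_dim)
```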
5. Prediction Module
The invention uses an auxiliary task to promote the learning of the original personality detection task, so two tasks are predicted. The main task is to predict the individual's scores on the 5 classes of personality. Specifically, the 5 vectors obtained from the previous module are passed through 5 separate linear layers and then mapped by a sigmoid function into numbers in [0, 1], representing the individual's scores on the 5 corresponding classes of personality characteristics. The auxiliary task is to predict the probability that an individual with these personality characteristics would be invited to an interview: the 5 vectors obtained from the previous module are weighted-averaged into a new vector representing the individual's overall personality, and this vector is passed through 1 linear layer and a sigmoid function to obtain the probability. The weights are learned by the model. The 6 probabilities obtained by this module serve as the final output of the model.
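A sketch of the two prediction heads, under the same assumed dimensions as the fusion sketch above: five independent linear layers with a sigmoid give the five class scores, and a learned weighted average of the five vectors feeds one extra linear layer for the interview probability.

```python
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    def __init__(self, trait_dim: int = 128, traits: int = 5):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(trait_dim, 1) for _ in range(traits))
        self.trait_weights = nn.Parameter(torch.ones(traits))  # weights learned by the model
        self.interview_head = nn.Linear(trait_dim, 1)

    def forward(self, trait_vecs: torch.Tensor):
        # trait_vecs: (batch, 5, trait_dim) from the fusion module
        scores = torch.sigmoid(torch.cat(
            [head(trait_vecs[:, i]) for i, head in enumerate(self.heads)], dim=1))  # (b, 5)
        w = torch.softmax(self.trait_weights, dim=0)
        overall = torch.sigmoid(self.interview_head((trait_vecs * w[:, None]).sum(dim=1)))
        return scores, overall       # the six probabilities output by the model
```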
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the description of the present invention and the accompanying drawings or direct/indirect application in other related technical fields are included in the scope of the invention.

Claims (7)

1. A personality detection method based on multi-modal alignment and multi-vector characterization is characterized by comprising the following steps:
s10, resampling voice and video modal data according to each epoch to generate a plurality of samples with differences;
s20, inputting a plurality of samples and text mode data thereof into a mode internal representation module for independent coding to obtain a voice sequence, a video sequence and a text sequence; comprising the following steps:
the intra-mode characterization module extracts the Mel frequency cepstrum coefficient and the response Fbank characteristic of the audio in the sample through Fourier transformation, inputs the Mel frequency cepstrum coefficient and the response Fbank characteristic into a multi-layer bidirectional LSTM network for encoding so as to capture voice intonation change characteristics, encodes the captured voice intonation change characteristics into a voice sequence, and outputs the voice sequence;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of the video features, inputs these vectors into a multi-layer bidirectional LSTM network, encodes the learned expression and motion changes into a video sequence, and outputs the video sequence;
the intra-modal characterization module encodes the text in the sample through a BERT model based on the Transformer structure to obtain a text sequence with deep semantic information;
S30, inputting the voice sequence, the video sequence and the text sequence into an inter-modal alignment characterization module, which aligns and interacts them pairwise and then splices them to obtain enhanced voice, video and text characterizations;
the inter-modal alignment characterization module aligns the text sequence to the voice sequence using text-to-speech (text2audio) attention to enhance the voice characterization, and aligns the voice sequence to the video sequence using speech-to-video (audio2video) attention to enhance the video characterization;
S40, splicing all the voice characterizations into a voice vector, all the video characterizations into a video vector and all the text characterizations into a text vector, and converting the voice, video and text vectors into at least two personality vectors through a convolutional neural network;
s50, linearizing at least two types of personality vectors respectively, and mapping through a sigmoid function to obtain the prediction probability of at least two types of personality characteristics;
the personality vectors are 5 classes of personality vectors, which include:
the openness personality vector, used to extract an individual's imagination, aesthetic sensitivity, emotional richness, unconventionality, creativity and intellect characteristics;
the conscientiousness personality vector, used to extract the competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, caution and restraint displayed by the individual;
the extraversion personality vector, used to extract the enthusiasm, sociability, assertiveness, activity, adventurousness and optimism shown by the individual;
the agreeableness personality vector, used to extract the individual's trust, altruism, straightforwardness, compliance, modesty and empathy characteristics;
the neuroticism personality vector, used to extract the individual's emotional-instability characteristics, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
2. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein after S50 the method further comprises:
s60, carrying out weighted average on at least two types of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining the comprehensive personality probability through a sigmoid function.
3. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the text modal data is the vector characterization collected from the video subtitle text and is encoded by a BERT model based on the Transformer structure, the BERT model having been pre-trained on an English text dataset so that it encodes semantic information.
4. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein the inter-modal alignment characterization module employs an attention mechanism to align and interact the voice sequence, the video sequence and the text sequence pairwise.
5. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein in S10 the voice and video modal data are synchronously resampled for each epoch.
6. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein the convolutional neural network comprises 5 groups of one-dimensional convolutions, the convolution kernel of each group having a size of 1; the method for encoding the captured voice intonation variation features into the voice sequence comprises: encoding the voice intonation variation features of each frame into a high-dimensional voice vector output as the voice sequence; and the method for the multi-layer bidirectional LSTM network to encode the learned expression and motion changes into the video sequence comprises: the multi-layer bidirectional LSTM network learns the expression and motion changes in each frame of picture, converts them into picture features, and encodes the picture features into a high-dimensional picture vector for output.
7. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein the multi-layer bidirectional LSTM network has been trained on a large-scale audio dataset so that it can extract audio features, and the convolutional neural network has been pre-trained on the ImageNet task so that it can extract picture features.
CN202010070066.3A 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization Active CN111259976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010070066.3A CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010070066.3A CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Publications (2)

Publication Number Publication Date
CN111259976A (en) 2020-06-09
CN111259976B (en) 2023-05-23

Family

ID=70954332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010070066.3A Active CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Country Status (1)

Country Link
CN (1) CN111259976B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832651B (en) * 2020-07-14 2023-04-07 清华大学 Video multi-mode emotion inference method and device
JP2022076949A (en) * 2020-11-10 2022-05-20 富士通株式会社 Inference program and method of inferring
CN112650861A (en) * 2020-12-29 2021-04-13 中山大学 Personality prediction method, system and device based on task layering
CN112951258B (en) * 2021-04-23 2024-05-17 中国科学技术大学 Audio/video voice enhancement processing method and device
CN113705725B (en) * 2021-09-15 2022-03-25 中国矿业大学 User personality characteristic prediction method and device based on multi-mode information fusion
CN115269845B (en) * 2022-08-01 2023-06-23 安徽大学 Network alignment method and system based on social network user personality
CN115146743B (en) * 2022-08-31 2022-12-16 平安银行股份有限公司 Character recognition model training method, character recognition method, device and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video


Also Published As

Publication number Publication date
CN111259976A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111259976B (en) Personality detection method based on multi-modal alignment and multi-vector characterization
Zhang et al. Spontaneous speech emotion recognition using multiscale deep convolutional LSTM
CN110728997B (en) Multi-modal depression detection system based on context awareness
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN103366618B (en) Scene device for Chinese learning training based on artificial intelligence and virtual reality
CN115329779B (en) Multi-person dialogue emotion recognition method
CN107972028A (en) Man-machine interaction method, device and electronic equipment
CN116863038A (en) Method for generating digital human voice and facial animation by text
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
Klaylat et al. Enhancement of an Arabic speech emotion recognition system
Fellbaum et al. Principles of electronic speech processing with applications for people with disabilities
Zhang Voice keyword retrieval method using attention mechanism and multimodal information fusion
Zhang Ideological and political empowering English teaching: ideological education based on artificial intelligence in classroom emotion recognition
Qadri et al. A critical insight into multi-languages speech emotion databases
US11587561B2 (en) Communication system and method of extracting emotion data during translations
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis
Brahme et al. Effect of various visual speech units on language identification using visual speech recognition
Campr et al. Automatic fingersign to speech translator
Schuller et al. Speech communication and multimodal interfaces
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
Varchavskaia et al. Characterizing and processing robot-directed speech
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant