CN111259976A - Personality detection method based on multi-mode alignment and multi-vector representation - Google Patents

Personality detection method based on multi-mode alignment and multi-vector representation

Info

Publication number
CN111259976A
Authority
CN
China
Prior art keywords
voice
video
personality
text
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010070066.3A
Other languages
Chinese (zh)
Other versions
CN111259976B (en)
Inventor
陈承勃
权小军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat-sen University
Priority to CN202010070066.3A
Publication of CN111259976A
Application granted
Publication of CN111259976B
Legal status: Active


Classifications

    • G06F18/24 Pattern recognition; classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/2431 Classification techniques relating to the number of classes; multiple classes
    • G06N3/044 Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/08 Neural networks; learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a personality detection method based on multi-modal alignment and multi-vector representation. The method resamples the voice and video modal data at each epoch; inputs the resulting samples and their text modal data into an intra-modal characterization module for independent encoding, obtaining a voice sequence, a video sequence and a text sequence; inputs the three sequences into an inter-modality alignment characterization module, which aligns and interacts them pairwise and then splices them to obtain enhanced voice, video and text characterizations; splices all voice characterizations into a voice vector, all video characterizations into a video vector and all text characterizations into a text vector, and converts these vectors into at least two classes of personality vectors with a convolutional neural network; finally, after each class of personality vector is linearized, a sigmoid function maps it to the predicted probability of the corresponding personality trait. By letting the data of the 3 modalities interact pairwise, the method enhances the modal representations, improves the discrimination capability of the model, and obtains more accurate prediction results.

Description

Personality detection method based on multi-mode alignment and multi-vector representation
Technical Field
The invention relates to the field of data processing, in particular to a personality detection method based on multi-modal alignment and multi-vector representation.
Background
Some existing methods predict personality traits from two modalities, speech and video. Specifically, the original video is randomly sampled to obtain a certain number of frames of video and of speech spectrum. For each frame, the video is characterized by a residual network, and MFCC features of the speech spectrum are extracted by Fourier transform. The per-frame video features and audio MFCC features are spliced and input into a multi-layer bidirectional LSTM network for joint encoding. The vector encoded for each frame is then input into a linear layer, and regression is performed with a sigmoid function. Finally, average pooling yields a 5-dimensional vector representing the scores of the five personality classes. Other methods model three modalities: speech, text and video. Specifically, for speech, that approach inputs the original audio signal directly into the neural network rather than using MFCC features extracted by Fourier transform; a convolutional neural network converts the audio signal into a 64-dimensional vector. For text, a convolutional neural network of the same kind encodes it into a 64-dimensional vector. For video, one frame of image is randomly sampled from the video, input into a convolutional neural network and encoded into a 64-dimensional vector. The convolutional neural network structures and parameters used for the three modalities are different. Finally, the vectors of the three modalities are spliced into a 192-dimensional vector, which after a linear transformation is used for regression prediction on the five personality classes.
These prior art techniques mainly consider the speech and video modalities but ignore the specific content of what is said, which limits the performance of the model. In general, the speaker's tone of voice and facial expression alone are not enough to accurately judge emotion and personality; in fact, tone of voice, speech content, and expression and gesture together reflect personality traits. If the speaker's actual words, and in particular the speaker's specific choice of words, are taken into account, the available information is greatly enriched and the speaker's personality can be judged more accurately. Moreover, in the prior art the encodings of the different modalities are independent of one another, which limits the representation capability of the model. Further, the prior art samples each training sample only once before training, and the whole training process reuses only the few frames of video and audio obtained from that single sampling, leaving most of the data unused and aggravating the shortage of training data. Finally, the prior art learns only one vector characterization per sample and uses that single characterization for all 5 regression tasks, so the 5 personality classes are not well distinguished: one vector representation cannot effectively and comprehensively describe an individual's characteristics across the 5 personality classes, whereas one vector representation per personality class can describe the individual's personality more comprehensively.
Disclosure of Invention
The invention mainly aims to provide a personality detection method based on multi-modal alignment and multi-vector representation that overcomes the above problems.
In order to achieve the above object, the invention provides a personality detection method based on multi-modal alignment and multi-vector characterization, which comprises the following steps:
s10, resampling the voice and video modal data according to each epoch, and generating a plurality of samples with difference;
s20, inputting a plurality of samples and text modal data thereof into an intra-modal characterization module, and independently coding the audio modal data, the video modal data and the text modal data by the intra-modal characterization module to obtain a voice sequence, a video sequence and a text sequence;
s30, inputting the voice sequence, the video sequence and the text sequence into an inter-modality alignment characterization module, and splicing the voice sequence, the video sequence and the text sequence after aligning and interacting pairwise respectively by the inter-modality alignment characterization module to obtain an enhanced voice characterization, a video characterization and a text characterization;
s40, splicing all voice representations into voice vectors, splicing all video representations into video vectors, splicing all text representations into text vectors, and converting the voice vectors, the video vectors and the text vectors into at least two types of personality vectors by using a convolutional neural network;
s50, after the at least two types of personality vectors are respectively linearized, the predictive probability of the characteristics of the at least two types of personality is obtained through sigmoid function mapping.
Preferably, S20 includes:
the intra-modal characterization module extracts the Mel frequency cepstral coefficients (MFCC) and the corresponding Fbank (filter bank) features of the audio in a sample through Fourier transform, inputs these features into a multi-layer bidirectional LSTM network to capture the features of voice intonation change, and encodes the captured features into a voice sequence for output;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of video features, inputs these vectors into a multi-layer bidirectional LSTM network, and encodes the learned expression and action changes into a video sequence for output;
the intra-modal characterization module encodes the text in the sample through a Bert model based on the Transformer structure to obtain a text sequence with deep semantic information.
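As a concrete illustration of the speech branch, the following is a minimal sketch in Python/PyTorch; the feature dimension, hidden size and layer count are illustrative assumptions, not values fixed by the invention.

    import torch
    import torch.nn as nn

    class SpeechEncoder(nn.Module):
        # Encodes per-frame acoustic features (e.g. concatenated MFCC + Fbank
        # values) into a voice sequence with a multi-layer bidirectional LSTM.
        def __init__(self, feat_dim=80, hidden=256, layers=2):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                bidirectional=True, batch_first=True)

        def forward(self, feats):          # feats: (batch, frames, feat_dim)
            seq, _ = self.lstm(feats)      # seq: (batch, frames, 2 * hidden)
            return seq                     # one high-dimensional vector per frame

    voice_seq = SpeechEncoder()(torch.randn(2, 100, 80))  # shape (2, 100, 512)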
Preferably, the personality vectors are the 5 classes of personality vectors, which include:
the openness personality vector, used for extracting the individual's traits of imagination, aesthetic sensitivity, rich emotion, variety-seeking, creativity and intellect;
the conscientiousness personality vector, used for extracting the individual's traits of competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation and restraint;
the extraversion personality vector, used for extracting the individual's traits of enthusiasm, sociability, assertiveness, activity, excitement-seeking and optimism;
the agreeableness personality vector, used for extracting the individual's traits of trust, altruism, frankness, compliance, modesty and empathy;
the neuroticism personality vector, used for extracting emotional traits the individual has difficulty keeping in balance, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
Preferably, the convolutional neural network comprises 5 groups of single-layer one-dimensional convolutions, and the convolution kernel of each group has size 1.
Preferably, S50 is followed by:
S60, carrying out a weighted average over the at least two classes of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining a comprehensive personality probability through a sigmoid function.
Preferably, the text modal data is obtained as the vector representation of the video subtitle text and encoded by a Bert model based on the Transformer structure, where the Bert model has been pre-trained on an English text data set and is thus capable of encoding semantic information.
Preferably, the inter-modality alignment characterization module aligns and interacts the voice sequence, the video sequence and the text sequence pairwise using an attention mechanism.
Preferably, the inter-modality alignment characterization module uses text-to-speech (text2audio) attention to align the text sequence to the voice sequence and enhance the voice characterization, and uses speech-to-video (audio2video) attention to align the voice sequence to the video sequence and enhance the video characterization.
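A minimal sketch of such pairwise alignment follows, assuming a standard scaled dot-product attention; the splicing of the attended context onto the original representation follows the description above, while all layer sizes are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalAttention(nn.Module):
        # Generic "source2target" attention: the target sequence attends over
        # the source sequence, and the attended context is spliced onto the
        # original target representation to enhance it.
        def __init__(self, src_dim, tgt_dim, att_dim=128):
            super().__init__()
            self.q = nn.Linear(tgt_dim, att_dim)
            self.k = nn.Linear(src_dim, att_dim)
            self.v = nn.Linear(src_dim, att_dim)
            self.scale = att_dim ** 0.5

        def forward(self, tgt, src):       # tgt: (B, Lt, Dt), src: (B, Ls, Ds)
            scores = self.q(tgt) @ self.k(src).transpose(1, 2) / self.scale
            ctx = F.softmax(scores, dim=-1) @ self.v(src)   # (B, Lt, att_dim)
            return torch.cat([tgt, ctx], dim=-1)            # enhanced target

    # text2audio attention: the voice sequence (target) attends over the text
    # sequence (source), yielding an enhanced voice characterization.
    audio, text = torch.randn(2, 100, 512), torch.randn(2, 20, 768)
    enhanced_audio = CrossModalAttention(src_dim=768, tgt_dim=512)(audio, text)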
Preferably, in S10 the voice and video modal data are resampled synchronously at each epoch, i.e., the audio and the pictures are sampled at the same time instants.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, and the convolution kernel of each group has size 1; encoding the captured features of voice intonation change into a voice sequence comprises: encoding the intonation-change features of each frame of voice into a high-dimensional voice vector and outputting the sequence of these vectors as the voice sequence; encoding the learned expression and action changes into a video sequence by the multi-layer bidirectional LSTM network comprises: the multi-layer bidirectional LSTM network learns the expression and action changes in each frame of picture, converts them into picture features, and encodes each frame into a high-dimensional picture vector for output.
Preferably, the multi-layer bidirectional LSTM network is pre-trained on a large-scale audio data set and is thus capable of extracting audio features; the convolutional neural network is pre-trained on the ImageNet task and is thus capable of extracting picture features.
The method samples the voice and video modalities of each sample once per training epoch; mutual attention between the modalities is used for alignment to enhance the characterization of each modality; and in the modality fusion module each individual is mapped into at least two classes of personality vector representations, corresponding respectively to the individual's scores on the at least two classes of personality traits. The invention has the following three advantages:
1. The data is fully utilized: resampling acts as data augmentation and improves the robustness of the model. For each individual, the invention resamples its voice and video modalities before the start of each epoch, so the training samples differ slightly from epoch to epoch and every frame of video and audio data is put to use. In the prior art, sampling is performed only once before training and the whole training process uses only that one sampling result, so the data is under-utilized.
2. The invention uses an attention mechanism to let the different modalities interact fully, which greatly strengthens the characterization of each modality. Mutual interaction and alignment between the modalities improve the inter-modal representations and the representation capability of the model.
3. Each of the multiple personality classes is represented by its own vector, so each personality trait of the individual can be described more accurately and the individual's personality can be described more comprehensively. The multiple vector representations are then used separately to predict the personality traits.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the structures shown in these drawings without creative effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the model structure of the present invention.
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back … …) are involved in the embodiment of the present invention, the directional indications are only used to explain the relative positional relationship between the components, the movement situation, and the like in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indications are changed accordingly.
In addition, if there is a description of "first", "second", etc. in an embodiment of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
As shown in fig. 1-2, the personality detection method based on multi-modal alignment and multi-vector characterization provided by the present invention includes the following steps:
s10, resampling the voice and video modal data according to each epoch, and generating a plurality of samples with difference;
s20, inputting a plurality of samples and text modal data thereof into an intra-modal characterization module, and independently coding the audio modal data, the video modal data and the text modal data by the intra-modal characterization module to obtain a voice sequence, a video sequence and a text sequence;
s30, inputting the voice sequence, the video sequence and the text sequence into an inter-modality alignment characterization module, and splicing the voice sequence, the video sequence and the text sequence after aligning and interacting pairwise respectively by the inter-modality alignment characterization module to obtain an enhanced voice characterization, a video characterization and a text characterization;
s40, splicing all voice representations into voice vectors, splicing all video representations into video vectors, splicing all text representations into text vectors, and converting the voice vectors, the video vectors and the text vectors into at least two types of personality vectors by using a convolutional neural network;
s50, after the at least two types of personality vectors are respectively linearized, the predictive probability of the characteristics of the at least two types of personality is obtained through sigmoid function mapping.
According to the above technical scheme, mutual attention between the modalities is used for alignment to enhance the characterization of each modality, and in the modality fusion module each individual is mapped into 5 vector representations corresponding respectively to the individual's scores on the 5 classes of personality traits.
Preferably, S20 includes:
the intra-modal characterization module extracts the Mel frequency cepstral coefficients (MFCC) and the corresponding Fbank (filter bank) features of the audio in a sample through Fourier transform, inputs these features into a multi-layer bidirectional LSTM network to capture the features of voice intonation change, and encodes the captured features into a voice sequence for output;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of video features, inputs these vectors into a multi-layer bidirectional LSTM network, and encodes the learned expression and action changes into a video sequence for output;
the intra-modal characterization module encodes the text in the sample through a Bert model based on the Transformer structure to obtain a text sequence with deep semantic information.
In the embodiment of the invention, the intra-modal characterization module is responsible for independently encoding the data of the three modalities, namely voice, text and video, to obtain the characterization of each modality. For the resampled picture sequence, each picture is encoded by a convolutional neural network with a residual structure to obtain a high-dimensional vector, which is input into the multi-layer bidirectional LSTM network to learn the changes of expression and action.
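A minimal sketch of this video branch, assuming a ResNet-18 backbone from torchvision; the actual residual network and all sizes are illustrative, since the patent only requires a residual-structure CNN pre-trained on ImageNet.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class VideoEncoder(nn.Module):
        # Encodes each sampled picture with a residual CNN, then models the
        # change of expression and action across frames with a BiLSTM.
        def __init__(self, hidden=256):
            super().__init__()
            backbone = resnet18(weights=None)   # ImageNet weights in practice
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
            self.lstm = nn.LSTM(512, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)

        def forward(self, frames):              # frames: (B, T, 3, 224, 224)
            b, t = frames.shape[:2]
            feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (B*T, 512)
            seq, _ = self.lstm(feats.view(b, t, -1))
            return seq                          # (B, T, 2 * hidden)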
Preferably, the personality vectors are the 5 classes of personality vectors, which include:
the openness personality vector, used for extracting the individual's traits of imagination, aesthetic sensitivity, rich emotion, variety-seeking, creativity and intellect;
the conscientiousness personality vector, used for extracting the individual's traits of competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation and restraint;
the extraversion personality vector, used for extracting the individual's traits of enthusiasm, sociability, assertiveness, activity, excitement-seeking and optimism;
the agreeableness personality vector, used for extracting the individual's traits of trust, altruism, frankness, compliance, modesty and empathy;
the neuroticism personality vector, used for extracting emotional traits the individual has difficulty keeping in balance, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
Preferably, S50 is followed by:
S60, carrying out a weighted average over the at least two classes of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining a comprehensive personality probability through a sigmoid function.
In the embodiment of the invention, predictions are made for two tasks. The main task is to predict the individual's scores on the 5 personality classes: the 5 vectors obtained from the previous module each pass through their own linear layer and are mapped by a sigmoid function into numbers in [0, 1], representing the individual's scores on the 5 corresponding classes of personality traits. The auxiliary task is to predict the probability that an individual with this personality is hired in the interview: the 5 vectors from the previous module are weighted-averaged into a new vector representing the individual's comprehensive personality, which then passes through 1 linear layer and a sigmoid function to obtain the probability; the weights are learned by the model. The 6 probabilities obtained by this module are the final output of the model.
Preferably, the text modal data is obtained as the vector representation of the video subtitle text and encoded by a Bert model based on the Transformer structure, where the Bert model has been pre-trained on an English text data set and is thus capable of encoding semantic information.
In the embodiment of the invention, the Bert model is pre-trained on a large-scale English text data set.
Preferably, the inter-modality alignment characterization module aligns and interacts the voice sequence, the video sequence and the text sequence pairwise using an attention mechanism.
Preferably, the inter-modality alignment characterization module uses text-to-speech (text2audio) attention to align the text sequence to the voice sequence and enhance the voice characterization, and uses speech-to-video (audio2video) attention to align the voice sequence to the video sequence and enhance the video characterization.
Preferably, in S10 the voice and video modal data are resampled synchronously at each epoch, i.e., the audio and the pictures are sampled at the same time instants.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, and the convolution kernel of each group has size 1; encoding the captured features of voice intonation change into a voice sequence comprises: encoding the intonation-change features of each frame of voice into a high-dimensional voice vector and outputting the sequence of these vectors as the voice sequence; encoding the learned expression and action changes into a video sequence by the multi-layer bidirectional LSTM network comprises: the multi-layer bidirectional LSTM network learns the expression and action changes in each frame of picture, converts them into picture features, and encodes each frame into a high-dimensional picture vector for output.
Preferably, the multi-layer bidirectional LSTM network is pre-trained on a large-scale audio data set and is thus capable of extracting audio features; the convolutional neural network is pre-trained on the ImageNet task and is thus capable of extracting picture features.
An actual operation example:
the method comprises the steps of collecting log files of three modes of voice, text and video to perform personality detection, and setting a resampling module, an intra-mode characterization module, an inter-mode alignment characterization module, a mode fusion module and a prediction module. The resampling module is responsible for sampling voice and video of an input sample to obtain a frequency spectrum and a picture input network with a certain frame number; the intra-modal representation module is responsible for independently coding data of each modal to obtain the representation of each modal; the alignment representation module among the modes is responsible for learning the mutual relation among different modes, and the representation of the mode is enriched by utilizing the information aligned with other modes; the mode fusion module is responsible for fusing the learned representations of the three modes, and a final vector representation is obtained for each type of personality, namely 5 vector representations in total. The prediction module performs two prediction tasks, namely an auxiliary task, namely predicting whether the individual is hired in the interview, and predicting scores of the individual in the five types of personality characteristics respectively to serve as final output of the model.
The 5 classes of personality vectors comprise the openness personality vector, the conscientiousness personality vector, the extraversion personality vector, the agreeableness personality vector and the neuroticism personality vector. The openness personality vector extracts the individual's traits of imagination, aesthetic sensitivity, rich emotion, variety-seeking, creativity and intellect; the conscientiousness personality vector extracts the individual's traits of competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation and restraint; the extraversion personality vector extracts the individual's traits of enthusiasm, sociability, assertiveness, activity, excitement-seeking and optimism; the agreeableness personality vector extracts the individual's traits of trust, altruism, frankness, compliance, modesty and empathy; the neuroticism personality vector extracts the emotional traits the individual has difficulty keeping in balance, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability, i.e., the lack of ability to keep the emotions stable.
1. Resampling Module
This module is responsible for randomly sampling the voice and video of a sample. Unlike other methods, the invention resamples the same sample once at every epoch. In this way, one sample can generate a plurality of samples with a certain degree of difference, achieving the effect of data augmentation and improving the robustness of the model. During sampling, to ensure that the voice and the video are strictly aligned, they are sampled synchronously, i.e., the audio and the picture at the same moment are sampled together. The sampled audio and pictures, together with the sample's text, are input into the In-Modality Module for separate encoding.
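A minimal sketch of this synchronized per-epoch resampling; the frame count n is an illustrative assumption.

    import random

    def resample_synchronized(audio_frames, video_frames, n=32):
        # Called once per epoch. The SAME randomly chosen time indices are
        # used for both streams, so audio and pictures stay strictly aligned.
        assert len(audio_frames) == len(video_frames)
        idx = sorted(random.sample(range(len(video_frames)), n))
        return ([audio_frames[i] for i in idx],
                [video_frames[i] for i in idx])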
2. In-Modality Module (intra-modal characterization module)
This module independently encodes the data of the three modalities: voice, text and video. For the resampled audio sequence, the MFCC (Mel frequency cepstral coefficients) and Fbank (filter bank) features are extracted by Fourier transform and input into a multi-layer bidirectional LSTM network for encoding, which captures the features of voice intonation change and finally encodes each frame into a high-dimensional vector; notably, the bidirectional LSTM network is pre-trained on a large-scale audio data set and therefore has the ability to extract audio features. For the resampled picture sequence, each picture is encoded by a convolutional neural network with a residual structure into a high-dimensional vector, which is input into the multi-layer bidirectional LSTM network to learn the changes of expression and action; each frame is finally encoded into a high-dimensional vector. This deep convolutional neural network is pre-trained on the ImageNet task and has the ability to extract features from pictures. The subtitle text sequence of the sample is encoded with a currently advanced Bert model based on the Transformer structure; the Bert model is pre-trained on a large-scale English text data set and has a strong ability to encode semantic information. After the text sequence is encoded by the Bert model, a vector sequence with deep semantic information is obtained.
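For the acoustic side, the MFCC and Fbank extraction might look as follows using torchaudio; this is a sketch under the assumption of 16 kHz mono audio, the coefficient counts are illustrative, and in practice the window/hop settings of the two transforms must be matched before the features can be concatenated frame by frame.

    import torchaudio

    def acoustic_features(wav, sr=16000):
        # Fourier-transform-based per-frame features: MFCCs plus the
        # corresponding log-mel filter bank (Fbank) energies.
        mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=13)(wav)
        fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=40,
                                                  sample_frequency=sr)
        return mfcc.squeeze(0).T, fbank    # (frames, 13) and (frames, 40)

    # wav, sr = torchaudio.load("sample.wav")   # hypothetical input clip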
3. Cross-Modality Alignment Module (inter-modality alignment characterization module)
This module receives the vector representation sequences encoded independently within the modalities, namely the voice sequence, the text sequence and the picture sequence, and lets the representations of the three modalities interact with one another to strengthen their respective encodings. The interaction mechanism used by the invention is attention. For example, with the two attentions audio2text and video2text, the voice is aligned with the text and the video is aligned with the text respectively; the related representations of voice and video are used to obtain a text representation containing voice and video information, which is spliced with the original text representation to obtain the enhanced text representation. Similarly, voice representations enhanced with text information and video representations enhanced with voice and text information can be obtained.
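Reusing the CrossModalAttention sketch given earlier, the text-enhancement example above might be written as follows; all sequence lengths and dimensions are illustrative.

    import torch

    B = 2
    audio = torch.randn(B, 100, 512)   # BiLSTM voice sequence
    video = torch.randn(B, 32, 512)    # BiLSTM picture sequence
    text = torch.randn(B, 20, 768)     # Bert text sequence

    # audio2text then video2text: each step splices an attended context onto
    # the running text representation, yielding the enhanced text characterization.
    a2t = CrossModalAttention(src_dim=512, tgt_dim=768)
    v2t = CrossModalAttention(src_dim=512, tgt_dim=768 + 128)
    enhanced_text = v2t(a2t(text, audio), video)   # (B, 20, 768 + 2 * 128)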
4. Modalities Fusion Module (modality fusion module)
This module fuses the per-frame vector representations of the three modalities obtained from the previous module; the fusion is done by splicing them into a new vector. A convolutional neural network then transforms these vector representations into 5 vectors, each representing one personality class of the individual (in detail, the network is a one-dimensional convolution with kernel size 1 and 5 groups).
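A minimal sketch of this fusion step; the per-frame dimension, the per-personality vector size and the mean pooling over frames are illustrative assumptions, since the patent specifies only the 5 groups of kernel-size-1 one-dimensional convolutions.

    import torch
    import torch.nn as nn

    class ModalityFusion(nn.Module):
        # Maps the spliced per-frame representations to 5 personality vectors
        # with a grouped 1-D convolution; groups=5 keeps the five personality
        # channels independent of one another.
        def __init__(self, frame_dim=1024, per_trait=64, n_traits=5):
            super().__init__()
            self.n_traits, self.per_trait = n_traits, per_trait
            self.conv = nn.Conv1d(n_traits * frame_dim, n_traits * per_trait,
                                  kernel_size=1, groups=n_traits)

        def forward(self, fused):              # fused: (B, T, frame_dim)
            x = fused.transpose(1, 2)          # (B, frame_dim, T)
            x = x.repeat(1, self.n_traits, 1)  # one copy per personality group
            x = self.conv(x).mean(dim=-1)      # pool over frames: (B, 5*per_trait)
            return x.view(-1, self.n_traits, self.per_trait)  # (B, 5, 64)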
5. Prediction Module
The invention uses an auxiliary task to facilitate learning of the original personality detection task, so predictions are made for both tasks. The main task is to predict the individual's scores on the 5 personality classes: the 5 vectors obtained from the previous module each pass through their own linear layer and are mapped by a sigmoid function into numbers in [0, 1], representing the individual's scores on the 5 corresponding classes of personality traits. The auxiliary task is to predict the probability that an individual with this personality is hired in the interview: the 5 vectors from the previous module are weighted-averaged into a new vector representing the individual's comprehensive personality, which then passes through 1 linear layer and a sigmoid function to obtain the probability; the weights are learned by the model. The 6 probabilities obtained by this module are the final output of the model.
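A minimal sketch of the two prediction tasks; the softmax normalization of the learned weights is our assumption, since the patent states only that the weights are learned by the model.

    import torch
    import torch.nn as nn

    class PredictionHeads(nn.Module):
        # Main task: one linear layer + sigmoid per personality vector, five
        # scores in [0, 1]. Auxiliary task: learned weighted average of the
        # five vectors, then a linear layer + sigmoid for the hiring probability.
        def __init__(self, dim=64, n_traits=5):
            super().__init__()
            self.trait_fc = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_traits)])
            self.weights = nn.Parameter(torch.ones(n_traits))  # learned weights
            self.aux_fc = nn.Linear(dim, 1)

        def forward(self, p):                  # p: (B, 5, dim) personality vectors
            traits = torch.sigmoid(torch.cat(
                [fc(p[:, i]) for i, fc in enumerate(self.trait_fc)], dim=1))
            w = torch.softmax(self.weights, dim=0).view(1, -1, 1)
            hired = torch.sigmoid(self.aux_fc((w * p).sum(dim=1)))
            return traits, hired               # (B, 5) scores and (B, 1) probability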
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A personality detection method based on multi-modal alignment and multi-vector characterization, characterized by comprising the following steps:
S10, resampling the voice and video modal data at each epoch to generate a plurality of samples that differ from one another;
S20, inputting the samples and their text modal data into an intra-modal characterization module for independent encoding to obtain a voice sequence, a video sequence and a text sequence;
S30, inputting the voice, video and text sequences into an inter-modality alignment characterization module, which aligns and interacts them pairwise and then splices them to obtain enhanced voice, video and text characterizations;
S40, splicing all voice characterizations into a voice vector, all video characterizations into a video vector and all text characterizations into a text vector, and converting these vectors into at least two classes of personality vectors with a convolutional neural network;
S50, after the at least two classes of personality vectors are respectively linearized, mapping them through a sigmoid function to obtain the predicted probabilities of the corresponding personality traits.
2. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein S20 includes:
the intra-modal characterization module extracts the Mel frequency cepstral coefficients (MFCC) and the corresponding Fbank (filter bank) features of the audio in a sample through Fourier transform, inputs these features into a multi-layer bidirectional LSTM network to capture the features of voice intonation change, and encodes the captured features into a voice sequence for output;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of video features, inputs these vectors into a multi-layer bidirectional LSTM network, and encodes the learned expression and action changes into a video sequence for output;
the intra-modal characterization module encodes the text in the sample through a Bert model based on the Transformer structure to obtain a text sequence with deep semantic information.
3. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the personality vectors are the 5 classes of personality vectors, the 5 classes of personality vectors comprising:
the openness personality vector, used for extracting the individual's traits of imagination, aesthetic sensitivity, rich emotion, variety-seeking, creativity and intellect;
the conscientiousness personality vector, used for extracting the individual's traits of competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation and restraint;
the extraversion personality vector, used for extracting the individual's traits of enthusiasm, sociability, assertiveness, activity, excitement-seeking and optimism;
the agreeableness personality vector, used for extracting the individual's traits of trust, altruism, frankness, compliance, modesty and empathy;
the neuroticism personality vector, used for extracting emotional traits the individual has difficulty keeping in balance, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
4. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein S50 is followed by:
S60, carrying out a weighted average over the at least two classes of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining a comprehensive personality probability through a sigmoid function.
5. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the text modal data is obtained as the vector representation of the video subtitle text and encoded by a Bert model based on the Transformer structure, the Bert model having been pre-trained on an English text data set and thus being capable of encoding semantic information.
6. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the inter-modality alignment characterization module aligns and interacts the voice sequence, the video sequence and the text sequence pairwise using an attention mechanism.
7. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the inter-modality alignment characterization module uses text-to-speech (text2audio) attention to align the text sequence to the voice sequence and enhance the voice characterization, and uses speech-to-video (audio2video) attention to align the voice sequence to the video sequence and enhance the video characterization.
8. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein in S10 the voice and video modal data are resampled synchronously at each epoch, i.e., the audio and the pictures are sampled at the same time instants.
9. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 3, wherein the convolutional neural network comprises 5 groups of one-dimensional convolutions, and the convolution kernel of each group has size 1; encoding the captured features of voice intonation change into a voice sequence comprises: encoding the intonation-change features of each frame of voice into a high-dimensional voice vector and outputting the sequence of these vectors as the voice sequence; encoding the learned expression and action changes into a video sequence by the multi-layer bidirectional LSTM network comprises: the multi-layer bidirectional LSTM network learns the expression and action changes in each frame of picture, converts them into picture features, and encodes each frame into a high-dimensional picture vector for output.
10. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 2, wherein the multi-layer bidirectional LSTM network is pre-trained on a large-scale audio data set and is thus capable of extracting audio features; and the convolutional neural network is pre-trained on the ImageNet task and is thus capable of extracting picture features.
CN202010070066.3A 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization Active CN111259976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010070066.3A CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010070066.3A CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Publications (2)

Publication Number Publication Date
CN111259976A (en) 2020-06-09
CN111259976B CN111259976B (en) 2023-05-23

Family

ID=70954332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010070066.3A Active CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Country Status (1)

Country Link
CN (1) CN111259976B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832651A (en) * 2020-07-14 2020-10-27 清华大学 Video multi-mode emotion inference method and device
CN112650861A (en) * 2020-12-29 2021-04-13 中山大学 Personality prediction method, system and device based on task layering
CN112951258A (en) * 2021-04-23 2021-06-11 中国科学技术大学 Audio and video voice enhancement processing method and model
CN113705725A (en) * 2021-09-15 2021-11-26 中国矿业大学 User personality characteristic prediction method and device based on multi-mode information fusion
US20220147758A1 (en) * 2020-11-10 2022-05-12 Fujitsu Limited Computer-readable recording medium storing inference program and method of inferring
CN115146743A (en) * 2022-08-31 2022-10-04 平安银行股份有限公司 Character recognition model training method, character recognition method, device and system
CN115269845A (en) * 2022-08-01 2022-11-01 安徽大学 Network alignment method and system based on social network user personality
CN112951258B (en) * 2021-04-23 2024-05-17 中国科学技术大学 Audio/video voice enhancement processing method and device


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832651A (en) * 2020-07-14 2020-10-27 清华大学 Video multi-mode emotion inference method and device
CN111832651B (en) * 2020-07-14 2023-04-07 清华大学 Video multi-mode emotion inference method and device
US20220147758A1 (en) * 2020-11-10 2022-05-12 Fujitsu Limited Computer-readable recording medium storing inference program and method of inferring
CN112650861A (en) * 2020-12-29 2021-04-13 中山大学 Personality prediction method, system and device based on task layering
CN112951258A (en) * 2021-04-23 2021-06-11 中国科学技术大学 Audio and video voice enhancement processing method and model
CN112951258B (en) * 2021-04-23 2024-05-17 中国科学技术大学 Audio/video voice enhancement processing method and device
CN113705725A (en) * 2021-09-15 2021-11-26 中国矿业大学 User personality characteristic prediction method and device based on multi-mode information fusion
CN115269845A (en) * 2022-08-01 2022-11-01 安徽大学 Network alignment method and system based on social network user personality
CN115269845B (en) * 2022-08-01 2023-06-23 安徽大学 Network alignment method and system based on social network user personality
CN115146743A (en) * 2022-08-31 2022-10-04 平安银行股份有限公司 Character recognition model training method, character recognition method, device and system

Also Published As

Publication number Publication date
CN111259976B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN111259976B (en) Personality detection method based on multi-modal alignment and multi-vector characterization
CN110097894B (en) End-to-end speech emotion recognition method and system
CN110728997B (en) Multi-modal depression detection system based on context awareness
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN103366618B (en) Scene device for Chinese learning training based on artificial intelligence and virtual reality
CN112069484A (en) Multi-mode interactive information acquisition method and system
CN115329779B (en) Multi-person dialogue emotion recognition method
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
Fellbaum et al. Principles of electronic speech processing with applications for people with disabilities
Zhang Ideological and political empowering English teaching: ideological education based on artificial intelligence in classroom emotion recognition
Deschamps-Berger et al. Exploring attention mechanisms for multimodal emotion recognition in an emergency call center corpus
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
Cole et al. A prototype voice-response questionnaire for the U.S. census.
CN116959417A (en) Method, apparatus, device, medium, and program product for detecting dialog rounds
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN112700796B (en) Voice emotion recognition method based on interactive attention model
Brahme et al. Effect of various visual speech units on language identification using visual speech recognition
Mean Foong et al. V2s: Voice to sign language translation system for malaysian deaf people
CN114420159A (en) Audio evaluation method and device and non-transient storage medium
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system
Sajid et al. Multimodal Emotion Recognition using Deep Convolution and Recurrent Network
CN116108856B (en) Emotion recognition method and system based on long and short loop cognition and latent emotion display interaction
CN114610861B (en) End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder
Aguirre-Peralta et al. Speech to Text Recognition for Videogame Controlling with Convolutional Neural Networks.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant