CN111259976B - Personality detection method based on multi-modal alignment and multi-vector characterization - Google Patents

Personality detection method based on multi-modal alignment and multi-vector characterization

Info

Publication number
CN111259976B
Authority
CN
China
Prior art keywords
voice
personality
video
text
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010070066.3A
Other languages
Chinese (zh)
Other versions
CN111259976A (en)
Inventor
陈承勃 (Chen Chengbo)
权小军 (Quan Xiaojun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010070066.3A
Publication of CN111259976A
Application granted
Publication of CN111259976B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a personality detection method based on multi-modal alignment and multi-vector characterization. The method resamples the voice and video modal data for each epoch; inputs the resulting samples, together with their text modal data, into an intra-modal characterization module for independent encoding to obtain a voice sequence, a video sequence and a text sequence; inputs the three sequences into an inter-modal alignment characterization module, which aligns and interacts them pairwise and then splices them to obtain enhanced voice, video and text characterizations; splices all the voice characterizations into a voice vector, all the video characterizations into a video vector and all the text characterizations into a text vector, and converts these vectors into at least two personality vectors through a convolutional neural network; and finally linearizes the personality vectors and maps them through a sigmoid function to obtain the prediction probabilities of at least two classes of personality characteristics. By letting the three modalities interact, the method enhances each modality's characterization, improves the discriminative power of the model, and yields more accurate predictions.

Description

Personality detection method based on multi-modal alignment and multi-vector characterization
Technical Field
The invention relates to the field of data processing, and in particular to a personality detection method based on multi-modal alignment and multi-vector characterization.
Background
Some prior work predicts personality traits using data from the two modalities of voice and video. Specifically, the original video is randomly sampled to obtain a certain number of frames of pictures and of the voice spectrum. For each frame, video features are extracted with a residual network, and MFCC features of the speech spectrum are extracted via the Fourier transform. The per-frame video features and the audio MFCC features are spliced and input into a multi-layer bidirectional LSTM network, which jointly encodes the video and audio features. Each frame's encoded vector is then passed through a linear layer and regressed with a sigmoid function. Finally, average pooling yields a 5-dimensional vector whose entries are the scores on the five personality classes. Other work builds models on data from the three modalities of speech, text and video. Specifically, for speech, that work feeds the raw audio signal directly into the neural network instead of using MFCC features extracted by the Fourier transform; a convolutional neural network converts the audio signal into a 64-dimensional vector. For text, a convolutional neural network is likewise used to encode a 64-dimensional vector. For video, one frame is randomly sampled from the video, input into a convolutional neural network, and encoded into a 64-dimensional vector. The convolutional neural networks of the three modalities have different structures and parameters. Finally, the vectors of the three modalities are spliced into a 192-dimensional vector (three 64-dimensional vectors concatenated), and regression prediction is performed for each of the five classes after a linear transformation.
These prior techniques mainly consider the speech and video modalities while ignoring the specific content of what is said, which limits the expressive power of the model. In general, we cannot accurately judge a speaker's emotion and personality from intonation and expression alone; intonation, speaking content and expression all reflect an individual's personality. Taking the speaker's specific content into account, especially characteristic word choices, greatly enriches the available information about the speaker and allows more accurate judgment of personality characteristics. Furthermore, in the prior art the encodings of different modalities are independent of each other, which also limits the model's expressiveness. In addition, the prior art samples each training example only once before training, and the entire training process reuses only the few frames of video and audio obtained from that single sampling, so the data volume remains insufficient. Finally, the prior art learns only one vector representation per sample and performs all five regression tasks on it, which cannot distinguish the five personality classes well: a single vector representation cannot effectively and comprehensively characterize an individual across the five classes, whereas representing each class with its own vector characterizes the individual more comprehensively.
Disclosure of Invention
The main object of the invention is to provide a personality detection method based on multi-modal alignment and multi-vector characterization that aims to overcome the above problems.
In order to achieve the above purpose, the personality detection method based on multi-modal alignment and multi-vector characterization provided by the invention comprises the following steps:
s10, resampling voice and video modal data according to each epoch to generate a plurality of samples with differences;
s20, inputting a plurality of samples and text modal data thereof into an intra-modal characterization module, wherein the intra-modal characterization module respectively and independently encodes three modal data of audio, video and text to obtain a voice sequence, a video sequence and a text sequence;
s30, inputting the voice sequence, the video sequence and the text sequence into an inter-mode pair Ji Biaozheng module, and respectively aligning and interacting the voice sequence, the video sequence and the text sequence in pairs by the inter-mode pair Ji Biaozheng module and then splicing to obtain enhanced voice characterization, video characterization and text characterization;
s40, splicing all the voice tokens into voice vectors, splicing all the video tokens into video vectors, splicing all the text tokens into text vectors, and respectively converting the voice vectors, the video vectors and the text vectors into at least two personality vectors by using a convolutional neural network;
s50, linearizing at least two types of personality vectors respectively, and mapping through a sigmoid function to obtain the prediction probability of at least two types of personality characteristics.
Preferably, the S20 includes:
the intra-modal characterization module extracts the Mel-frequency cepstral coefficients and the corresponding Fbank features of the audio in the sample through the Fourier transform, inputs them into a multi-layer bidirectional LSTM network for encoding so as to capture voice intonation change features, encodes the captured features into a voice sequence, and outputs the voice sequence;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of the video features, inputs these vectors into a multi-layer bidirectional LSTM network, encodes the learned expression and motion changes into a video sequence, and outputs the video sequence;
the intra-modal characterization module encodes the text in the sample through a BERT model based on the Transformer structure to obtain a text sequence with deep semantic information.
Preferably, the personality vectors are 5 classes of personality vectors, which include:
the openness personality vector, used to extract an individual's imagination, aesthetic sensitivity, emotional richness, unconventionality, creativity and intellect characteristics;
the conscientiousness personality vector, used to extract the competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, caution and restraint displayed by the individual;
the extraversion personality vector, used to extract the enthusiasm, sociability, assertiveness, activity, adventurousness and optimism shown by the individual;
the agreeableness personality vector, used to extract the individual's trust, altruism, straightforwardness, compliance, modesty and empathy characteristics;
the neuroticism personality vector, used to extract the individual's emotional-instability characteristics, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, and the convolution kernel of each group of convolutions has a size of 1.
Preferably, after S50, the method further includes:
s60, carrying out weighted average on at least two types of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining the comprehensive personality probability through a sigmoid function.
Preferably, the text modal data is the vector characterization collected from the video subtitle text, encoded by a BERT model based on the Transformer structure, where the BERT model has been pre-trained on an English text dataset so that it encodes semantic information.
Preferably, the inter-modal alignment characterization module adopts an attention mechanism to align and interact the voice sequence, the video sequence and the text sequence pairwise.
Preferably, the inter-modal alignment characterization module aligns the text sequence to the voice sequence using text-to-speech (text2audio) attention to enhance the voice characterization, and aligns the voice sequence to the video sequence using speech-to-video (audio2video) attention to enhance the video characterization.
Preferably, in S10 the voice and video modal data are synchronously resampled for each epoch.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, the convolution kernel of each group having a size of 1; the method for encoding the captured voice intonation variation features into the voice sequence comprises: encoding the voice intonation variation features of each frame into a high-dimensional voice vector output as the voice sequence; the method for the multi-layer bidirectional LSTM network to encode the learned expression and motion changes into the video sequence comprises: the multi-layer bidirectional LSTM network learns the expression and motion changes in each frame of picture, converts them into picture features, and encodes the picture features into a high-dimensional picture vector for output.
Preferably, the multi-layer bidirectional LSTM network has been trained on a large-scale audio dataset so that it can extract audio features; the convolutional neural network has been pre-trained on the ImageNet task so that it can extract picture features.
According to the invention, the voice and video modalities of each sample are resampled once per training epoch; mutual attention between modalities is used for alignment, enhancing the characterization of each modality; and in the modality fusion module each individual is mapped into at least two classes of personality vector characterizations, corresponding respectively to the individual's scores on at least two classes of personality characteristics. The invention has the following three advantages:
1. The data is fully utilized: resampling achieves a data-augmentation effect and improves the robustness of the model. For each individual, the invention resamples the voice and video modalities before each epoch begins, so the training samples of different epochs differ subtly for the same individual. This makes full use of every frame of video and audio. In the prior art, sampling is performed only once before training and the whole training process uses only that single sampling result, so the data is not fully utilized.
2. The invention uses the attention mechanism to let the different modalities interact fully, which greatly strengthens the characterization ability of each modality. The mutual interaction and alignment between modalities promote the characterizations of all modalities and improve the expressive power of the model.
3. Characterizing each class of personality characteristics with its own vector depicts each class more accurately and thus depicts the individual's personality more comprehensively. The multiple vector characterizations are then used separately for the classification and prediction of the personality characteristics.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the structure of the model of the invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and rear … …) are included in the embodiments of the present invention, the directional indications are merely used to explain the relative positional relationship, movement conditions, etc. between the components in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indications are correspondingly changed.
In addition, if there is a description of "first", "second", etc. in the embodiments of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but it is necessary to base that the technical solutions can be realized by those skilled in the art, and when the technical solutions are contradictory or cannot be realized, the combination of the technical solutions should be considered to be absent and not within the scope of protection claimed in the present invention.
As shown in fig. 1-2, the personality detection method based on multi-modal alignment and multi-vector characterization provided by the invention comprises the following steps:
s10, resampling voice and video modal data according to each epoch to generate a plurality of samples with differences;
s20, inputting a plurality of samples and text modal data thereof into an intra-modal characterization module, wherein the intra-modal characterization module respectively and independently encodes three modal data of audio, video and text to obtain a voice sequence, a video sequence and a text sequence;
s30, inputting the voice sequence, the video sequence and the text sequence into an inter-mode pair Ji Biaozheng module, and respectively aligning and interacting the voice sequence, the video sequence and the text sequence in pairs by the inter-mode pair Ji Biaozheng module and then splicing to obtain enhanced voice characterization, video characterization and text characterization;
s40, splicing all the voice tokens into voice vectors, splicing all the video tokens into video vectors, splicing all the text tokens into text vectors, and respectively converting the voice vectors, the video vectors and the text vectors into at least two personality vectors by using a convolutional neural network;
s50, linearizing at least two types of personality vectors respectively, and mapping through a sigmoid function to obtain the prediction probability of at least two types of personality characteristics.
According to the above technical scheme, mutual attention between modalities is used for alignment, enhancing the characterization of each modality; in the modality fusion module, each individual is mapped into 5 vector characterizations, corresponding respectively to the individual's scores on the 5 classes of personality characteristics.
Preferably, the S20 includes:
the intra-modal characterization module extracts the Mel-frequency cepstral coefficients and the corresponding Fbank features of the audio in the sample through the Fourier transform, inputs them into a multi-layer bidirectional LSTM network for encoding so as to capture voice intonation change features, encodes the captured features into a voice sequence, and outputs the voice sequence;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of the video features, inputs these vectors into a multi-layer bidirectional LSTM network, encodes the learned expression and motion changes into a video sequence, and outputs the video sequence;
the intra-modal characterization module encodes the text in the sample through a BERT model based on the Transformer structure to obtain a text sequence with deep semantic information.
In the embodiment of the invention, the intra-modal characterization module is responsible for independently encoding the three modalities of voice, text and video to obtain a characterization of each modality. For the resampled picture sequence, a convolutional neural network with a residual structure encodes each picture into a high-dimensional vector; these vectors are then input into the multi-layer bidirectional LSTM network, which learns the changes in expression and motion.
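As an illustration of this video branch, the following is a minimal PyTorch sketch rather than the patent's exact network: a ResNet-18 pre-trained on ImageNet stands in for the residual CNN, and the hidden size and layer count of the bidirectional LSTM are assumed values.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VideoEncoder(nn.Module):
    """Residual CNN per frame + multi-layer bidirectional LSTM over the frame sequence."""
    def __init__(self, hidden: int = 128, layers: int = 2):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")                # ImageNet pre-training
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.rnn = nn.LSTM(512, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, 224, 224) resampled picture sequence
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)           # (b*t, 512) high-dimensional vectors
        seq, _ = self.rnn(feats.view(b, t, -1))                     # (b, t, 2*hidden) video sequence
        return seq
```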
Preferably, the personality vectors are 5 classes of personality vectors, which include:
the openness personality vector, used to extract an individual's imagination, aesthetic sensitivity, emotional richness, unconventionality, creativity and intellect characteristics;
the conscientiousness personality vector, used to extract the competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, caution and restraint displayed by the individual;
the extraversion personality vector, used to extract the enthusiasm, sociability, assertiveness, activity, adventurousness and optimism shown by the individual;
the agreeableness personality vector, used to extract the individual's trust, altruism, straightforwardness, compliance, modesty and empathy characteristics;
the neuroticism personality vector, used to extract the individual's emotional-instability characteristics, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
Preferably, after S50, the method further includes:
s60, carrying out weighted average on at least two types of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining the comprehensive personality probability through a sigmoid function.
In the embodiment of the invention, two tasks are predicted. The main task is to predict the individual's scores on the 5 classes of personality. Specifically, the 5 vectors obtained from the previous module are passed through 5 separate linear layers and then mapped by a sigmoid function into numbers in [0, 1], representing the individual's scores on the 5 corresponding classes of personality characteristics. The auxiliary task is to predict the probability that an individual with these personality characteristics would be invited to an interview: the 5 vectors obtained from the previous module are weighted-averaged into a new vector representing the individual's overall personality, and this vector is passed through 1 linear layer and a sigmoid function to obtain the probability. The weights are learned by the model. The 6 probabilities obtained by this module serve as the final output of the model.
Preferably, the text modal data is the vector characterization collected from the video subtitle text, encoded by a BERT model based on the Transformer structure, where the BERT model has been pre-trained on an English text dataset so that it encodes semantic information.
In the embodiment of the invention, the BERT model is a model pre-trained on a large-scale English text dataset using current state-of-the-art techniques.
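A minimal sketch of this text branch, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint as stand-ins for the pre-trained BERT encoder described above:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")   # pre-trained on English text

def encode_subtitles(subtitle: str) -> torch.Tensor:
    """Encode the video's subtitle text into a sequence of deep semantic vectors."""
    tokens = tokenizer(subtitle, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return bert(**tokens).last_hidden_state          # (1, seq_len, 768) text sequence
```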
Preferably, the inter-modal alignment characterization module adopts an attention mechanism to align and interact the voice sequence, the video sequence and the text sequence pairwise.
Preferably, the inter-modal alignment characterization module aligns the text sequence to the voice sequence using text-to-speech (text2audio) attention to enhance the voice characterization, and aligns the voice sequence to the video sequence using speech-to-video (audio2video) attention to enhance the video characterization.
Preferably, in S10 the voice and video modal data are synchronously resampled for each epoch.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, the convolution kernel of each group having a size of 1; the method for encoding the captured voice intonation variation features into the voice sequence comprises: encoding the voice intonation variation features of each frame into a high-dimensional voice vector output as the voice sequence; the method for the multi-layer bidirectional LSTM network to encode the learned expression and motion changes into the video sequence comprises: the multi-layer bidirectional LSTM network learns the expression and motion changes in each frame of picture, converts them into picture features, and encodes the picture features into a high-dimensional picture vector for output.
Preferably, the multi-layer bidirectional LSTM network has been trained on a large-scale audio dataset so that it can extract audio features; the convolutional neural network has been pre-trained on the ImageNet task so that it can extract picture features.
Actual operation example:
Data of the three modalities of voice, text and video are collected for the personality detection task, and a resampling module, an intra-modal characterization module, an inter-modal alignment characterization module, a modality fusion module and a prediction module are set up. The resampling module samples the voice and video of an input sample to obtain a certain number of frames of spectrum and of pictures to input into the network; the intra-modal characterization module independently encodes the data of each modality to obtain each modality's characterization; the inter-modal alignment characterization module learns the connections between different modalities and enriches each modality's characterization with information aligned from the other modalities; the modality fusion module fuses the characterizations learned by the three modalities, obtaining one final vector characterization for each class of personality, i.e., 5 vector characterizations in total. The prediction module performs two prediction tasks: an auxiliary task, predicting whether the individual would be invited to an interview, and the main task, predicting the individual's scores on the five classes of personality characteristics, which serve as the final output of the model.
The Big Five personality vectors comprise the openness, conscientiousness, extraversion, agreeableness and neuroticism personality vectors. The openness personality vector extracts characteristics such as an individual's imagination, aesthetic sensitivity, emotional richness, unconventionality, creativity and intellect; the conscientiousness personality vector extracts characteristics such as the competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, caution and restraint displayed by the individual; the extraversion personality vector extracts characteristics such as the enthusiasm, sociability, assertiveness, activity, adventurousness and optimism shown by the individual; the agreeableness personality vector extracts characteristics such as the individual's trust, altruism, straightforwardness, compliance, modesty and empathy; the neuroticism personality vector extracts the individual's emotional-instability characteristics, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability, i.e., the inability to keep emotions stable.
1. Resampling Module
This module is mainly responsible for randomly sampling the voice and video of a sample. Unlike other methods, the invention resamples the same sample once per epoch. In this way, a plurality of subtly different samples can be generated from one sample, achieving a data-augmentation effect and improving the robustness of the model. During sampling, to ensure that voice and video stay strictly aligned, the two are sampled synchronously, i.e., the audio and the picture at the same moment are sampled together. The sampled audio and pictures, together with the sample's text, are input into the In-Modality Module for encoding.
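The synchronous sampling can be sketched as follows, under assumed tensor layouts (the patent does not fix frame counts): the same randomly chosen time indices select both the picture frames and the audio windows aligned with them, and fresh indices are drawn each epoch.

```python
from typing import Optional, Tuple
import torch

def resample_synchronized(frames: torch.Tensor,
                          audio_windows: torch.Tensor,
                          num_samples: int,
                          generator: Optional[torch.Generator] = None
                          ) -> Tuple[torch.Tensor, torch.Tensor]:
    """Sample the same time indices from video frames and their aligned audio windows."""
    total = frames.shape[0]
    assert audio_windows.shape[0] == total, "modalities must share one timeline"
    idx = torch.randperm(total, generator=generator)[:num_samples].sort().values
    return frames[idx], audio_windows[idx]

# Called once at the start of every epoch, so each epoch trains on a slightly
# different subset of the raw recording (the data-augmentation effect).
```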
2. In-Modality Module (intra-modal characterization module)
This module is responsible for independently encoding the data of the three modalities of voice, text and video. For the resampled audio sequence, it extracts MFCC (Mel Frequency Cepstral Coefficients) and Fbank (filter-bank) features using the Fourier transform and inputs them into a multi-layer bidirectional LSTM network for encoding; the network captures the characteristics of voice intonation change and finally encodes each frame into a high-dimensional vector. The bidirectional LSTM network has been pre-trained on a large-scale audio dataset and has the capability of extracting audio features.
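A sketch of this audio feature pipeline under assumed settings (16 kHz audio, 40 coefficients, torchaudio as the feature extractor); the log-mel spectrogram stands in for the Fbank features:

```python
import torch
import torch.nn as nn
import torchaudio

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)
fbank = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=40)

def audio_features(wave: torch.Tensor) -> torch.Tensor:
    # wave: (batch, samples); both transforms return (batch, n_feats, time)
    feats = torch.cat([mfcc(wave), fbank(wave).log1p()], dim=1)
    return feats.transpose(1, 2)                     # (batch, time, 80) for the LSTM

audio_rnn = nn.LSTM(80, 128, num_layers=2, batch_first=True, bidirectional=True)
seq, _ = audio_rnn(audio_features(torch.randn(2, 16000)))  # (2, time, 256) voice sequence
```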
3. Cross-Modality Alignment Module (inter-modal alignment characterization module)
This module receives the vector characterization sequences encoded independently within each modality, namely the voice sequence, the text sequence and the picture sequence, and lets the characterizations of the three modalities interact with one another to strengthen their respective encodings. The interaction used by the invention is the attention mechanism. For example, using the two attentions audio2text and video2text, the voice and the video are each aligned to the text; the aligned voice and video characterizations yield a text characterization containing voice and video information, which is spliced with the original text characterization to obtain the enhanced text characterization. Similarly, a voice characterization enhanced with text and video, and a video characterization enhanced with voice and text, can be obtained.
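One pairwise alignment step might look like the following sketch; multi-head attention and the 256-dimensional size are assumptions, since the patent only specifies an attention mechanism. The query modality attends over the key modality, and the attended summary is spliced onto the original characterization.

```python
import torch
import torch.nn as nn

class CrossModalAlign(nn.Module):
    """Align one modality's sequence to another and splice the result on."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_seq: torch.Tensor, key_seq: torch.Tensor) -> torch.Tensor:
        # e.g. audio2text: query_seq is the text sequence, key_seq the voice sequence
        aligned, _ = self.attn(query_seq, key_seq, key_seq)
        return torch.cat([query_seq, aligned], dim=-1)   # enhanced characterization

text = torch.randn(2, 30, 256); speech = torch.randn(2, 50, 256)
enhanced_text = CrossModalAlign()(text, speech)          # (2, 30, 512)
```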
4. Modality Fusion Module (modal fusion module)
This module is responsible for fusing the per-frame vector characterizations of the three modalities obtained from the previous module; the fusion is performed by splicing, giving a new vector. These vectors are then converted into 5 vectors using a convolutional neural network (specifically, one layer of one-dimensional convolution with kernel size 1 and 5 groups), each vector representing the individual's characteristics on one class of personality.
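The following sketch shows one way to realize the grouped kernel-size-1 convolution described above. Repeating the fused input once per group, so that the 5 groups act as 5 independent projections, is our assumption about the wiring, which the patent does not spell out; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Concatenated per-frame tokens -> one vector per personality class."""
    def __init__(self, fused_dim: int = 768, trait_dim: int = 128, traits: int = 5):
        super().__init__()
        self.traits = traits
        self.conv = nn.Conv1d(fused_dim * traits, trait_dim * traits,
                              kernel_size=1, groups=traits)        # one group per class

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, time, fused_dim) spliced speech/video/text tokens
        x = fused.transpose(1, 2).repeat(1, self.traits, 1)        # (b, 5*fused_dim, time)
        x = self.conv(x).mean(dim=-1)                              # pool over time: (b, 5*trait_dim)
        return x.view(-1, self.traits, x.shape[1] // self.traits)  # (b, 5, trait_dim)
```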
5. Prediction Module
The invention uses an auxiliary task to promote the learning of the original personality detection task, so two tasks are predicted. The main task is to predict the individual's scores on the 5 classes of personality. Specifically, the 5 vectors obtained from the previous module are passed through 5 separate linear layers and then mapped by a sigmoid function into numbers in [0, 1], representing the individual's scores on the 5 corresponding classes of personality characteristics. The auxiliary task is to predict the probability that an individual with these personality characteristics would be invited to an interview: the 5 vectors obtained from the previous module are weighted-averaged into a new vector representing the individual's overall personality, and this vector is passed through 1 linear layer and a sigmoid function to obtain the probability. The weights are learned by the model. The 6 probabilities obtained by this module serve as the final output of the model.
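A sketch of the two prediction heads, under the same assumed dimensions as the fusion sketch above: five independent linear layers with a sigmoid give the five class scores, and a learned weighted average of the five vectors feeds one extra linear layer for the interview probability.

```python
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    def __init__(self, trait_dim: int = 128, traits: int = 5):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(trait_dim, 1) for _ in range(traits))
        self.trait_weights = nn.Parameter(torch.ones(traits))  # weights learned by the model
        self.interview_head = nn.Linear(trait_dim, 1)

    def forward(self, trait_vecs: torch.Tensor):
        # trait_vecs: (batch, 5, trait_dim) from the fusion module
        scores = torch.sigmoid(torch.cat(
            [head(trait_vecs[:, i]) for i, head in enumerate(self.heads)], dim=1))  # (b, 5)
        w = torch.softmax(self.trait_weights, dim=0)
        overall = torch.sigmoid(self.interview_head((trait_vecs * w[:, None]).sum(dim=1)))
        return scores, overall       # the six probabilities output by the model
```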
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the description of the present invention and the accompanying drawings or direct/indirect application in other related technical fields are included in the scope of the invention.

Claims (7)

1. A personality detection method based on multi-modal alignment and multi-vector characterization is characterized by comprising the following steps:
s10, resampling voice and video modal data according to each epoch to generate a plurality of samples with differences;
s20, inputting a plurality of samples and text mode data thereof into a mode internal representation module for independent coding to obtain a voice sequence, a video sequence and a text sequence; comprising the following steps:
the intra-mode characterization module extracts the Mel frequency cepstrum coefficient and the response Fbank characteristic of the audio in the sample through Fourier transformation, inputs the Mel frequency cepstrum coefficient and the response Fbank characteristic into a multi-layer bidirectional LSTM network for encoding so as to capture voice intonation change characteristics, encodes the captured voice intonation change characteristics into a voice sequence, and outputs the voice sequence;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of the video features, inputs these vectors into a multi-layer bidirectional LSTM network, encodes the learned expression and motion changes into a video sequence, and outputs the video sequence;
the intra-modal characterization module encodes the text in the sample through a BERT model based on the Transformer structure to obtain a text sequence with deep semantic information;
S30, inputting the voice sequence, the video sequence and the text sequence into an inter-modal alignment characterization module, which aligns and interacts them pairwise and then splices them to obtain enhanced voice, video and text characterizations;
the inter-modal alignment characterization module aligns the text sequence to the voice sequence using text-to-speech (text2audio) attention to enhance the voice characterization, and aligns the voice sequence to the video sequence using speech-to-video (audio2video) attention to enhance the video characterization;
S40, splicing all the voice characterizations into a voice vector, all the video characterizations into a video vector and all the text characterizations into a text vector, and converting the voice, video and text vectors into at least two personality vectors through a convolutional neural network;
s50, linearizing at least two types of personality vectors respectively, and mapping through a sigmoid function to obtain the prediction probability of at least two types of personality characteristics;
the personality vectors are 5 classes of personality vectors, which include:
the openness personality vector, used to extract an individual's imagination, aesthetic sensitivity, emotional richness, unconventionality, creativity and intellect characteristics;
the conscientiousness personality vector, used to extract the competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, caution and restraint displayed by the individual;
the extraversion personality vector, used to extract the enthusiasm, sociability, assertiveness, activity, adventurousness and optimism shown by the individual;
the agreeableness personality vector, used to extract the individual's trust, altruism, straightforwardness, compliance, modesty and empathy characteristics;
the neuroticism personality vector, used to extract the individual's emotional-instability characteristics, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
2. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein after S50 the method further comprises:
s60, carrying out weighted average on at least two types of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining the comprehensive personality probability through a sigmoid function.
3. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the text modal data is the vector characterization collected from the video subtitle text and is encoded by a BERT model based on the Transformer structure, the BERT model having been pre-trained on an English text dataset so that it encodes semantic information.
4. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein the inter-modal alignment characterization module employs an attention mechanism to align and interact the voice sequence, the video sequence and the text sequence pairwise.
5. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein in S10 the voice and video modal data are synchronously resampled for each epoch.
6. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein the convolutional neural network comprises 5 groups of one-dimensional convolutions, the convolution kernel of each group having a size of 1; the method for encoding the captured voice intonation variation features into the voice sequence comprises: encoding the voice intonation variation features of each frame into a high-dimensional voice vector output as the voice sequence; and the method for the multi-layer bidirectional LSTM network to encode the learned expression and motion changes into the video sequence comprises: the multi-layer bidirectional LSTM network learns the expression and motion changes in each frame of picture, converts them into picture features, and encodes the picture features into a high-dimensional picture vector for output.
7. The personality detection method based on multi-modal alignment and multi-vector characterization of claim 1, wherein the multi-layer bidirectional LSTM network has been trained on a large-scale audio dataset so that it can extract audio features, and the convolutional neural network has been pre-trained on the ImageNet task so that it can extract picture features.
CN202010070066.3A 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization Active CN111259976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010070066.3A CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010070066.3A CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Publications (2)

Publication Number Publication Date
CN111259976A (en) 2020-06-09
CN111259976B (en) 2023-05-23

Family

ID=70954332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010070066.3A Active CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Country Status (1)

Country Link
CN (1) CN111259976B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832651B (en) * 2020-07-14 2023-04-07 清华大学 Video multi-mode emotion inference method and device
JP2022076949A (en) * 2020-11-10 2022-05-20 富士通株式会社 Inference program and method of inferring
CN112650861A (en) * 2020-12-29 2021-04-13 中山大学 Personality prediction method, system and device based on task layering
CN112951258B (en) * 2021-04-23 2024-05-17 中国科学技术大学 Audio/video voice enhancement processing method and device
CN113705725B (en) * 2021-09-15 2022-03-25 中国矿业大学 User personality characteristic prediction method and device based on multi-mode information fusion
CN115269845B (en) * 2022-08-01 2023-06-23 安徽大学 Network alignment method and system based on social network user personality
CN115146743B (en) * 2022-08-31 2022-12-16 平安银行股份有限公司 Character recognition model training method, character recognition method, device and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video


Also Published As

Publication number Publication date
CN111259976A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111259976B (en) Personality detection method based on multi-modal alignment and multi-vector characterization
Zhang et al. Spontaneous speech emotion recognition using multiscale deep convolutional LSTM
CN110728997B (en) Multi-modal depression detection system based on context awareness
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN103366618B (en) Scene device for Chinese learning training based on artificial intelligence and virtual reality
CN115329779B (en) Multi-person dialogue emotion recognition method
CN107972028A (en) Man-machine interaction method, device and electronic equipment
CN116863038A (en) Method for generating digital human voice and facial animation by text
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
Klaylat et al. Enhancement of an Arabic speech emotion recognition system
Fellbaum et al. Principles of electronic speech processing with applications for people with disabilities
Zhang Voice keyword retrieval method using attention mechanism and multimodal information fusion
Zhang Ideological and political empowering English teaching: ideological education based on artificial intelligence in classroom emotion recognition
Qadri et al. A critical insight into multi-languages speech emotion databases
US11587561B2 (en) Communication system and method of extracting emotion data during translations
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis
Brahme et al. Effect of various visual speech units on language identification using visual speech recognition
Campr et al. Automatic fingersign to speech translator
Schuller et al. Speech communication and multimodal interfaces
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
Varchavskaia et al. Characterizing and processing robot-directed speech
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant