CN111259976A - Personality detection method based on multi-mode alignment and multi-vector representation - Google Patents

Personality detection method based on multi-mode alignment and multi-vector representation

Info

Publication number
CN111259976A
Authority
CN
China
Prior art keywords
voice
video
personality
text
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010070066.3A
Other languages
Chinese (zh)
Other versions
CN111259976B (en)
Inventor
陈承勃
权小军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat-sen University
Priority to CN202010070066.3A
Publication of CN111259976A
Application granted
Publication of CN111259976B
Legal status: Active


Classifications

    • G06F18/24 Pattern recognition; classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/2431 Classification techniques relating to the number of classes; multiple classes
    • G06N3/044 Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/08 Neural networks; learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a personality detection method based on multi-modal alignment and multi-vector representation. The method resamples the voice and video modal data at each epoch; inputs the resulting samples and their text modal data into an intra-modal characterization module for independent encoding, obtaining a voice sequence, a video sequence and a text sequence; inputs the three sequences into an inter-modality alignment characterization module, which aligns and interacts them pairwise and then splices them to obtain enhanced voice, video and text characterizations; splices all voice characterizations into a voice vector, all video characterizations into a video vector and all text characterizations into a text vector, and converts these vectors into at least two classes of personality vectors with a convolutional neural network; finally, after each class of personality vector is linearized, a sigmoid function maps it to the predicted probability of the corresponding personality trait. By letting the data of the 3 modalities interact pairwise, the method enhances the modal representations, improves the discrimination capability of the model, and obtains more accurate prediction results.

Description

Personality detection method based on multi-mode alignment and multi-vector representation
Technical Field
The invention relates to the field of data processing, in particular to a personality detection method based on multi-modal alignment and multi-vector representation.
Background
Some existing methods predict personality traits from two modalities, speech and video. Specifically, the original video is randomly sampled to obtain a certain number of frames of video and of speech spectrum. For each frame, the video is characterized by a residual network, and MFCC features of the speech spectrum are extracted by Fourier transform. The per-frame video features and audio MFCC features are spliced and input into a multi-layer bidirectional LSTM network for joint encoding. The vector encoded for each frame is then input into a linear layer, and regression is performed with a sigmoid function. Finally, average pooling yields a 5-dimensional vector representing the scores of the five personality classes. Other methods model three modalities: speech, text and video. Specifically, for speech, that approach inputs the original audio signal directly into the neural network rather than using MFCC features extracted by Fourier transform; a convolutional neural network converts the audio signal into a 64-dimensional vector. For text, a convolutional neural network of the same kind encodes it into a 64-dimensional vector. For video, one frame of image is randomly sampled from the video, input into a convolutional neural network and encoded into a 64-dimensional vector. The convolutional neural network structures and parameters used for the three modalities are different. Finally, the vectors of the three modalities are spliced into a 192-dimensional vector, which after a linear transformation is used for regression prediction on the five personality classes.
These prior art techniques mainly consider the speech and video modalities but ignore the specific content of what is said, which limits the performance of the model. In general, the speaker's tone of voice and facial expression alone are not enough to accurately judge emotion and personality; in fact, tone of voice, speech content, and expression and gesture together reflect personality traits. If the speaker's actual words, and in particular the speaker's specific choice of words, are taken into account, the available information is greatly enriched and the speaker's personality can be judged more accurately. Moreover, in the prior art the encodings of the different modalities are independent of one another, which limits the representation capability of the model. Further, the prior art samples each training sample only once before training, and the whole training process reuses only the few frames of video and audio obtained from that single sampling, leaving most of the data unused and aggravating the shortage of training data. Finally, the prior art learns only one vector characterization per sample and uses that single characterization for all 5 regression tasks, so the 5 personality classes are not well distinguished: one vector representation cannot effectively and comprehensively describe an individual's characteristics across the 5 personality classes, whereas one vector representation per personality class can describe the individual's personality more comprehensively.
Disclosure of Invention
The invention mainly aims to provide a personality detection method based on multi-modal alignment and multi-vector representation that overcomes the above problems.
In order to achieve the above object, the invention provides a personality detection method based on multi-modal alignment and multi-vector characterization, which comprises the following steps:
s10, resampling the voice and video modal data according to each epoch, and generating a plurality of samples with difference;
s20, inputting a plurality of samples and text modal data thereof into an intra-modal characterization module, and independently coding the audio modal data, the video modal data and the text modal data by the intra-modal characterization module to obtain a voice sequence, a video sequence and a text sequence;
s30, inputting the voice sequence, the video sequence and the text sequence into an inter-modality alignment characterization module, and splicing the voice sequence, the video sequence and the text sequence after aligning and interacting pairwise respectively by the inter-modality alignment characterization module to obtain an enhanced voice characterization, a video characterization and a text characterization;
s40, splicing all voice representations into voice vectors, splicing all video representations into video vectors, splicing all text representations into text vectors, and converting the voice vectors, the video vectors and the text vectors into at least two types of personality vectors by using a convolutional neural network;
s50, after the at least two types of personality vectors are respectively linearized, the predictive probability of the characteristics of the at least two types of personality is obtained through sigmoid function mapping.
Preferably, S20 includes:
the intra-modal characterization module extracts the Mel frequency cepstral coefficients (MFCC) and the corresponding Fbank (filter bank) features of the audio in a sample through Fourier transform, inputs these features into a multi-layer bidirectional LSTM network to capture the features of voice intonation change, and encodes the captured features into a voice sequence for output;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of video features, inputs these vectors into a multi-layer bidirectional LSTM network, and encodes the learned expression and action changes into a video sequence for output;
the intra-modal characterization module encodes the text in the sample through a Bert model based on the Transformer structure to obtain a text sequence with deep semantic information.
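As a concrete illustration of the speech branch, the following is a minimal sketch in Python/PyTorch; the feature dimension, hidden size and layer count are illustrative assumptions, not values fixed by the invention.

    import torch
    import torch.nn as nn

    class SpeechEncoder(nn.Module):
        # Encodes per-frame acoustic features (e.g. concatenated MFCC + Fbank
        # values) into a voice sequence with a multi-layer bidirectional LSTM.
        def __init__(self, feat_dim=80, hidden=256, layers=2):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                bidirectional=True, batch_first=True)

        def forward(self, feats):          # feats: (batch, frames, feat_dim)
            seq, _ = self.lstm(feats)      # seq: (batch, frames, 2 * hidden)
            return seq                     # one high-dimensional vector per frame

    voice_seq = SpeechEncoder()(torch.randn(2, 100, 80))  # shape (2, 100, 512)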
Preferably, the personality vectors are the 5 classes of personality vectors, which include:
the openness personality vector, used for extracting the individual's traits of imagination, aesthetic sensitivity, rich emotion, variety-seeking, creativity and intellect;
the conscientiousness personality vector, used for extracting the individual's traits of competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation and restraint;
the extraversion personality vector, used for extracting the individual's traits of enthusiasm, sociability, assertiveness, activity, excitement-seeking and optimism;
the agreeableness personality vector, used for extracting the individual's traits of trust, altruism, frankness, compliance, modesty and empathy;
the neuroticism personality vector, used for extracting emotional traits the individual has difficulty keeping in balance, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
Preferably, the convolutional neural network comprises 5 groups of single-layer one-dimensional convolutions, and the convolution kernel of each group has size 1.
Preferably, S50 is followed by:
S60, carrying out a weighted average over the at least two classes of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining a comprehensive personality probability through a sigmoid function.
Preferably, the text modal data is obtained as the vector representation of the video subtitle text and encoded by a Bert model based on the Transformer structure, where the Bert model has been pre-trained on an English text data set and is thus capable of encoding semantic information.
Preferably, the inter-modality alignment characterization module aligns and interacts the voice sequence, the video sequence and the text sequence pairwise using an attention mechanism.
Preferably, the inter-modality alignment characterization module uses text-to-speech (text2audio) attention to align the text sequence to the voice sequence and enhance the voice characterization, and uses speech-to-video (audio2video) attention to align the voice sequence to the video sequence and enhance the video characterization.
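A minimal sketch of such pairwise alignment follows, assuming a standard scaled dot-product attention; the splicing of the attended context onto the original representation follows the description above, while all layer sizes are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalAttention(nn.Module):
        # Generic "source2target" attention: the target sequence attends over
        # the source sequence, and the attended context is spliced onto the
        # original target representation to enhance it.
        def __init__(self, src_dim, tgt_dim, att_dim=128):
            super().__init__()
            self.q = nn.Linear(tgt_dim, att_dim)
            self.k = nn.Linear(src_dim, att_dim)
            self.v = nn.Linear(src_dim, att_dim)
            self.scale = att_dim ** 0.5

        def forward(self, tgt, src):       # tgt: (B, Lt, Dt), src: (B, Ls, Ds)
            scores = self.q(tgt) @ self.k(src).transpose(1, 2) / self.scale
            ctx = F.softmax(scores, dim=-1) @ self.v(src)   # (B, Lt, att_dim)
            return torch.cat([tgt, ctx], dim=-1)            # enhanced target

    # text2audio attention: the voice sequence (target) attends over the text
    # sequence (source), yielding an enhanced voice characterization.
    audio, text = torch.randn(2, 100, 512), torch.randn(2, 20, 768)
    enhanced_audio = CrossModalAttention(src_dim=768, tgt_dim=512)(audio, text)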
Preferably, in S10 the voice and video modal data are resampled synchronously at each epoch, i.e., the audio and the pictures are sampled at the same time instants.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, and the convolution kernel of each group has size 1; encoding the captured features of voice intonation change into a voice sequence comprises: encoding the intonation-change features of each frame of voice into a high-dimensional voice vector and outputting the sequence of these vectors as the voice sequence; encoding the learned expression and action changes into a video sequence by the multi-layer bidirectional LSTM network comprises: the multi-layer bidirectional LSTM network learns the expression and action changes in each frame of picture, converts them into picture features, and encodes each frame into a high-dimensional picture vector for output.
Preferably, the multi-layer bidirectional LSTM network is pre-trained on a large-scale audio data set and is thus capable of extracting audio features; the convolutional neural network is pre-trained on the ImageNet task and is thus capable of extracting picture features.
The method samples the voice and video modalities of each sample once per training epoch; mutual attention between the modalities is used for alignment to enhance the characterization of each modality; and in the modality fusion module each individual is mapped into at least two classes of personality vector representations, corresponding respectively to the individual's scores on the at least two classes of personality traits. The invention has the following three advantages:
1. The data is fully utilized: resampling acts as data augmentation and improves the robustness of the model. For each individual, the invention resamples its voice and video modalities before the start of each epoch, so the training samples differ slightly from epoch to epoch and every frame of video and audio data is put to use. In the prior art, sampling is performed only once before training and the whole training process uses only that one sampling result, so the data is under-utilized.
2. The invention uses an attention mechanism to let the different modalities interact fully, which greatly strengthens the characterization of each modality. Mutual interaction and alignment between the modalities improve the inter-modal representations and the representation capability of the model.
3. Each of the multiple personality classes is represented by its own vector, so each personality trait of the individual can be described more accurately and the individual's personality can be described more comprehensively. The multiple vector representations are then used separately to predict the personality traits.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the structures shown in these drawings without creative effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the model structure of the present invention.
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back … …) are involved in the embodiment of the present invention, the directional indications are only used to explain the relative positional relationship between the components, the movement situation, and the like in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indications are changed accordingly.
In addition, if there is a description of "first", "second", etc. in an embodiment of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
As shown in fig. 1-2, the personality detection method based on multi-modal alignment and multi-vector characterization provided by the present invention includes the following steps:
s10, resampling the voice and video modal data according to each epoch, and generating a plurality of samples with difference;
s20, inputting a plurality of samples and text modal data thereof into an intra-modal characterization module, and independently coding the audio modal data, the video modal data and the text modal data by the intra-modal characterization module to obtain a voice sequence, a video sequence and a text sequence;
s30, inputting the voice sequence, the video sequence and the text sequence into an inter-modality alignment characterization module, and splicing the voice sequence, the video sequence and the text sequence after aligning and interacting pairwise respectively by the inter-modality alignment characterization module to obtain an enhanced voice characterization, a video characterization and a text characterization;
s40, splicing all voice representations into voice vectors, splicing all video representations into video vectors, splicing all text representations into text vectors, and converting the voice vectors, the video vectors and the text vectors into at least two types of personality vectors by using a convolutional neural network;
s50, after the at least two types of personality vectors are respectively linearized, the predictive probability of the characteristics of the at least two types of personality is obtained through sigmoid function mapping.
According to the above technical scheme, mutual attention between the modalities is used for alignment to enhance the characterization of each modality, and in the modality fusion module each individual is mapped into 5 vector representations corresponding respectively to the individual's scores on the 5 classes of personality traits.
Preferably, S20 includes:
the intra-modal characterization module extracts the Mel frequency cepstral coefficients (MFCC) and the corresponding Fbank (filter bank) features of the audio in a sample through Fourier transform, inputs these features into a multi-layer bidirectional LSTM network to capture the features of voice intonation change, and encodes the captured features into a voice sequence for output;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of video features, inputs these vectors into a multi-layer bidirectional LSTM network, and encodes the learned expression and action changes into a video sequence for output;
the intra-modal characterization module encodes the text in the sample through a Bert model based on the Transformer structure to obtain a text sequence with deep semantic information.
In the embodiment of the invention, the intra-modal characterization module is responsible for independently encoding the data of the three modalities, namely voice, text and video, to obtain the characterization of each modality. For the resampled picture sequence, each picture is encoded by a convolutional neural network with a residual structure to obtain a high-dimensional vector, which is input into the multi-layer bidirectional LSTM network to learn the changes of expression and action.
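A minimal sketch of this video branch, assuming a ResNet-18 backbone from torchvision; the actual residual network and all sizes are illustrative, since the patent only requires a residual-structure CNN pre-trained on ImageNet.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class VideoEncoder(nn.Module):
        # Encodes each sampled picture with a residual CNN, then models the
        # change of expression and action across frames with a BiLSTM.
        def __init__(self, hidden=256):
            super().__init__()
            backbone = resnet18(weights=None)   # ImageNet weights in practice
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
            self.lstm = nn.LSTM(512, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)

        def forward(self, frames):              # frames: (B, T, 3, 224, 224)
            b, t = frames.shape[:2]
            feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (B*T, 512)
            seq, _ = self.lstm(feats.view(b, t, -1))
            return seq                          # (B, T, 2 * hidden)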
Preferably, the personality vectors are the 5 classes of personality vectors, which include:
the openness personality vector, used for extracting the individual's traits of imagination, aesthetic sensitivity, rich emotion, variety-seeking, creativity and intellect;
the conscientiousness personality vector, used for extracting the individual's traits of competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation and restraint;
the extraversion personality vector, used for extracting the individual's traits of enthusiasm, sociability, assertiveness, activity, excitement-seeking and optimism;
the agreeableness personality vector, used for extracting the individual's traits of trust, altruism, frankness, compliance, modesty and empathy;
the neuroticism personality vector, used for extracting emotional traits the individual has difficulty keeping in balance, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
Preferably, S50 is followed by:
S60, carrying out a weighted average over the at least two classes of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining a comprehensive personality probability through a sigmoid function.
In the embodiment of the invention, predictions are made for two tasks. The main task is to predict the individual's scores on the 5 personality classes: the 5 vectors obtained from the previous module each pass through their own linear layer and are mapped by a sigmoid function into numbers in [0, 1], representing the individual's scores on the 5 corresponding classes of personality traits. The auxiliary task is to predict the probability that an individual with this personality is hired in the interview: the 5 vectors from the previous module are weighted-averaged into a new vector representing the individual's comprehensive personality, which then passes through 1 linear layer and a sigmoid function to obtain the probability; the weights are learned by the model. The 6 probabilities obtained by this module are the final output of the model.
Preferably, the text modal data is obtained as the vector representation of the video subtitle text and encoded by a Bert model based on the Transformer structure, where the Bert model has been pre-trained on an English text data set and is thus capable of encoding semantic information.
In the embodiment of the invention, the Bert model is pre-trained on a large-scale English text data set.
Preferably, the inter-modality alignment characterization module aligns and interacts the voice sequence, the video sequence and the text sequence pairwise using an attention mechanism.
Preferably, the inter-modality alignment characterization module uses text-to-speech (text2audio) attention to align the text sequence to the voice sequence and enhance the voice characterization, and uses speech-to-video (audio2video) attention to align the voice sequence to the video sequence and enhance the video characterization.
Preferably, in S10 the voice and video modal data are resampled synchronously at each epoch, i.e., the audio and the pictures are sampled at the same time instants.
Preferably, the convolutional neural network comprises 5 groups of one-dimensional convolutions, and the convolution kernel of each group has size 1; encoding the captured features of voice intonation change into a voice sequence comprises: encoding the intonation-change features of each frame of voice into a high-dimensional voice vector and outputting the sequence of these vectors as the voice sequence; encoding the learned expression and action changes into a video sequence by the multi-layer bidirectional LSTM network comprises: the multi-layer bidirectional LSTM network learns the expression and action changes in each frame of picture, converts them into picture features, and encodes each frame into a high-dimensional picture vector for output.
Preferably, the multi-layer bidirectional LSTM network is pre-trained on a large-scale audio data set and is thus capable of extracting audio features; the convolutional neural network is pre-trained on the ImageNet task and is thus capable of extracting picture features.
An actual operation example:
the method comprises the steps of collecting log files of three modes of voice, text and video to perform personality detection, and setting a resampling module, an intra-mode characterization module, an inter-mode alignment characterization module, a mode fusion module and a prediction module. The resampling module is responsible for sampling voice and video of an input sample to obtain a frequency spectrum and a picture input network with a certain frame number; the intra-modal representation module is responsible for independently coding data of each modal to obtain the representation of each modal; the alignment representation module among the modes is responsible for learning the mutual relation among different modes, and the representation of the mode is enriched by utilizing the information aligned with other modes; the mode fusion module is responsible for fusing the learned representations of the three modes, and a final vector representation is obtained for each type of personality, namely 5 vector representations in total. The prediction module performs two prediction tasks, namely an auxiliary task, namely predicting whether the individual is hired in the interview, and predicting scores of the individual in the five types of personality characteristics respectively to serve as final output of the model.
The 5 classes of personality vectors comprise the openness personality vector, the conscientiousness personality vector, the extraversion personality vector, the agreeableness personality vector and the neuroticism personality vector. The openness personality vector extracts the individual's traits of imagination, aesthetic sensitivity, rich emotion, variety-seeking, creativity and intellect; the conscientiousness personality vector extracts the individual's traits of competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation and restraint; the extraversion personality vector extracts the individual's traits of enthusiasm, sociability, assertiveness, activity, excitement-seeking and optimism; the agreeableness personality vector extracts the individual's traits of trust, altruism, frankness, compliance, modesty and empathy; the neuroticism personality vector extracts the emotional traits the individual has difficulty keeping in balance, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability, i.e., the lack of ability to keep the emotions stable.
1. Resampling Module
This module is responsible for randomly sampling the voice and video of a sample. Unlike other methods, the invention resamples the same sample once at every epoch. In this way, one sample can generate a plurality of samples with a certain degree of difference, achieving the effect of data augmentation and improving the robustness of the model. During sampling, to ensure that the voice and the video are strictly aligned, they are sampled synchronously, i.e., the audio and the picture at the same moment are sampled together. The sampled audio and pictures, together with the sample's text, are input into the In-Modality Module for separate encoding.
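A minimal sketch of this synchronized per-epoch resampling; the frame count n is an illustrative assumption.

    import random

    def resample_synchronized(audio_frames, video_frames, n=32):
        # Called once per epoch. The SAME randomly chosen time indices are
        # used for both streams, so audio and pictures stay strictly aligned.
        assert len(audio_frames) == len(video_frames)
        idx = sorted(random.sample(range(len(video_frames)), n))
        return ([audio_frames[i] for i in idx],
                [video_frames[i] for i in idx])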
2. In-Modality Module (intra-modal characterization module)
This module independently encodes the data of the three modalities: voice, text and video. For the resampled audio sequence, the MFCC (Mel frequency cepstral coefficients) and Fbank (filter bank) features are extracted by Fourier transform and input into a multi-layer bidirectional LSTM network for encoding, which captures the features of voice intonation change and finally encodes each frame into a high-dimensional vector; notably, the bidirectional LSTM network is pre-trained on a large-scale audio data set and therefore has the ability to extract audio features. For the resampled picture sequence, each picture is encoded by a convolutional neural network with a residual structure into a high-dimensional vector, which is input into the multi-layer bidirectional LSTM network to learn the changes of expression and action; each frame is finally encoded into a high-dimensional vector. This deep convolutional neural network is pre-trained on the ImageNet task and has the ability to extract features from pictures. The subtitle text sequence of the sample is encoded with a currently advanced Bert model based on the Transformer structure; the Bert model is pre-trained on a large-scale English text data set and has a strong ability to encode semantic information. After the text sequence is encoded by the Bert model, a vector sequence with deep semantic information is obtained.
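For the acoustic side, the MFCC and Fbank extraction might look as follows using torchaudio; this is a sketch under the assumption of 16 kHz mono audio, the coefficient counts are illustrative, and in practice the window/hop settings of the two transforms must be matched before the features can be concatenated frame by frame.

    import torchaudio

    def acoustic_features(wav, sr=16000):
        # Fourier-transform-based per-frame features: MFCCs plus the
        # corresponding log-mel filter bank (Fbank) energies.
        mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=13)(wav)
        fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=40,
                                                  sample_frequency=sr)
        return mfcc.squeeze(0).T, fbank    # (frames, 13) and (frames, 40)

    # wav, sr = torchaudio.load("sample.wav")   # hypothetical input clip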
3. Cross-Modality Alignment Module (inter-modality alignment characterization module)
This module receives the vector representation sequences encoded independently within the modalities, namely the voice sequence, the text sequence and the picture sequence, and lets the representations of the three modalities interact with one another to strengthen their respective encodings. The interaction mechanism used by the invention is attention. For example, with the two attentions audio2text and video2text, the voice is aligned with the text and the video is aligned with the text respectively; the related representations of voice and video are used to obtain a text representation containing voice and video information, which is spliced with the original text representation to obtain the enhanced text representation. Similarly, voice representations enhanced with text information and video representations enhanced with voice and text information can be obtained.
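Reusing the CrossModalAttention sketch given earlier, the text-enhancement example above might be written as follows; all sequence lengths and dimensions are illustrative.

    import torch

    B = 2
    audio = torch.randn(B, 100, 512)   # BiLSTM voice sequence
    video = torch.randn(B, 32, 512)    # BiLSTM picture sequence
    text = torch.randn(B, 20, 768)     # Bert text sequence

    # audio2text then video2text: each step splices an attended context onto
    # the running text representation, yielding the enhanced text characterization.
    a2t = CrossModalAttention(src_dim=512, tgt_dim=768)
    v2t = CrossModalAttention(src_dim=512, tgt_dim=768 + 128)
    enhanced_text = v2t(a2t(text, audio), video)   # (B, 20, 768 + 2 * 128)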
4. Modalities Fusion Module (modality fusion module)
This module fuses the per-frame vector representations of the three modalities obtained from the previous module; the fusion is done by splicing them into a new vector. A convolutional neural network then transforms these vector representations into 5 vectors, each representing one personality class of the individual (in detail, the network is a one-dimensional convolution with kernel size 1 and 5 groups).
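A minimal sketch of this fusion step; the per-frame dimension, the per-personality vector size and the mean pooling over frames are illustrative assumptions, since the patent specifies only the 5 groups of kernel-size-1 one-dimensional convolutions.

    import torch
    import torch.nn as nn

    class ModalityFusion(nn.Module):
        # Maps the spliced per-frame representations to 5 personality vectors
        # with a grouped 1-D convolution; groups=5 keeps the five personality
        # channels independent of one another.
        def __init__(self, frame_dim=1024, per_trait=64, n_traits=5):
            super().__init__()
            self.n_traits, self.per_trait = n_traits, per_trait
            self.conv = nn.Conv1d(n_traits * frame_dim, n_traits * per_trait,
                                  kernel_size=1, groups=n_traits)

        def forward(self, fused):              # fused: (B, T, frame_dim)
            x = fused.transpose(1, 2)          # (B, frame_dim, T)
            x = x.repeat(1, self.n_traits, 1)  # one copy per personality group
            x = self.conv(x).mean(dim=-1)      # pool over frames: (B, 5*per_trait)
            return x.view(-1, self.n_traits, self.per_trait)  # (B, 5, 64)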
5. Prediction Module
The invention uses an auxiliary task to facilitate learning of the original personality detection task, so predictions are made for both tasks. The main task is to predict the individual's scores on the 5 personality classes: the 5 vectors obtained from the previous module each pass through their own linear layer and are mapped by a sigmoid function into numbers in [0, 1], representing the individual's scores on the 5 corresponding classes of personality traits. The auxiliary task is to predict the probability that an individual with this personality is hired in the interview: the 5 vectors from the previous module are weighted-averaged into a new vector representing the individual's comprehensive personality, which then passes through 1 linear layer and a sigmoid function to obtain the probability; the weights are learned by the model. The 6 probabilities obtained by this module are the final output of the model.
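A minimal sketch of the two prediction tasks; the softmax normalization of the learned weights is our assumption, since the patent states only that the weights are learned by the model.

    import torch
    import torch.nn as nn

    class PredictionHeads(nn.Module):
        # Main task: one linear layer + sigmoid per personality vector, five
        # scores in [0, 1]. Auxiliary task: learned weighted average of the
        # five vectors, then a linear layer + sigmoid for the hiring probability.
        def __init__(self, dim=64, n_traits=5):
            super().__init__()
            self.trait_fc = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_traits)])
            self.weights = nn.Parameter(torch.ones(n_traits))  # learned weights
            self.aux_fc = nn.Linear(dim, 1)

        def forward(self, p):                  # p: (B, 5, dim) personality vectors
            traits = torch.sigmoid(torch.cat(
                [fc(p[:, i]) for i, fc in enumerate(self.trait_fc)], dim=1))
            w = torch.softmax(self.weights, dim=0).view(1, -1, 1)
            hired = torch.sigmoid(self.aux_fc((w * p).sum(dim=1)))
            return traits, hired               # (B, 5) scores and (B, 1) probability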
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A personality detection method based on multi-modal alignment and multi-vector characterization, characterized by comprising the following steps:
S10, resampling the voice and video modal data at each epoch to generate a plurality of samples that differ from one another;
S20, inputting the samples and their text modal data into an intra-modal characterization module for independent encoding to obtain a voice sequence, a video sequence and a text sequence;
S30, inputting the voice, video and text sequences into an inter-modality alignment characterization module, which aligns and interacts them pairwise and then splices them to obtain enhanced voice, video and text characterizations;
S40, splicing all voice characterizations into a voice vector, all video characterizations into a video vector and all text characterizations into a text vector, and converting these vectors into at least two classes of personality vectors with a convolutional neural network;
S50, after the at least two classes of personality vectors are respectively linearized, mapping them through a sigmoid function to obtain the predicted probabilities of the corresponding personality traits.
2. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein S20 includes:
the intra-modal characterization module extracts the Mel frequency cepstral coefficients (MFCC) and the corresponding Fbank (filter bank) features of the audio in a sample through Fourier transform, inputs these features into a multi-layer bidirectional LSTM network to capture the features of voice intonation change, and encodes the captured features into a voice sequence for output;
the intra-modal characterization module encodes the video in the sample through a convolutional neural network with a residual structure to obtain high-dimensional vectors of video features, inputs these vectors into a multi-layer bidirectional LSTM network, and encodes the learned expression and action changes into a video sequence for output;
the intra-modal characterization module encodes the text in the sample through a Bert model based on the Transformer structure to obtain a text sequence with deep semantic information.
3. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the personality vectors are the 5 classes of personality vectors, the 5 classes of personality vectors comprising:
the openness personality vector, used for extracting the individual's traits of imagination, aesthetic sensitivity, rich emotion, variety-seeking, creativity and intellect;
the conscientiousness personality vector, used for extracting the individual's traits of competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation and restraint;
the extraversion personality vector, used for extracting the individual's traits of enthusiasm, sociability, assertiveness, activity, excitement-seeking and optimism;
the agreeableness personality vector, used for extracting the individual's traits of trust, altruism, frankness, compliance, modesty and empathy;
the neuroticism personality vector, used for extracting emotional traits the individual has difficulty keeping in balance, such as anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
4. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein S50 is followed by:
S60, carrying out a weighted average over the at least two classes of personality vectors to obtain a comprehensive personality vector, linearizing the comprehensive personality vector, and obtaining a comprehensive personality probability through a sigmoid function.
5. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the text modal data is obtained as the vector representation of the video subtitle text and encoded by a Bert model based on the Transformer structure, the Bert model having been pre-trained on an English text data set and thus being capable of encoding semantic information.
6. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the inter-modality alignment characterization module aligns and interacts the voice sequence, the video sequence and the text sequence pairwise using an attention mechanism.
7. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein the inter-modality alignment characterization module uses text-to-speech (text2audio) attention to align the text sequence to the voice sequence and enhance the voice characterization, and uses speech-to-video (audio2video) attention to align the voice sequence to the video sequence and enhance the video characterization.
8. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 1, wherein in S10 the voice and video modal data are resampled synchronously at each epoch, i.e., the audio and the pictures are sampled at the same time instants.
9. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 3, wherein the convolutional neural network comprises 5 groups of one-dimensional convolutions, and the convolution kernel of each group has size 1; encoding the captured features of voice intonation change into a voice sequence comprises: encoding the intonation-change features of each frame of voice into a high-dimensional voice vector and outputting the sequence of these vectors as the voice sequence; encoding the learned expression and action changes into a video sequence by the multi-layer bidirectional LSTM network comprises: the multi-layer bidirectional LSTM network learns the expression and action changes in each frame of picture, converts them into picture features, and encodes each frame into a high-dimensional picture vector for output.
10. The personality detection method based on multi-modal alignment and multi-vector characterization according to claim 2, wherein the multi-layer bidirectional LSTM network is pre-trained on a large-scale audio data set and is thus capable of extracting audio features; and the convolutional neural network is pre-trained on the ImageNet task and is thus capable of extracting picture features.
CN202010070066.3A 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization Active CN111259976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010070066.3A CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010070066.3A CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Publications (2)

Publication Number Publication Date
CN111259976A (en) 2020-06-09
CN111259976B CN111259976B (en) 2023-05-23

Family

ID=70954332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010070066.3A Active CN111259976B (en) 2020-01-21 2020-01-21 Personality detection method based on multi-modal alignment and multi-vector characterization

Country Status (1)

Country Link
CN (1) CN111259976B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832651A (en) * 2020-07-14 2020-10-27 清华大学 Video multi-mode emotion inference method and device
CN112650861A (en) * 2020-12-29 2021-04-13 中山大学 Personality prediction method, system and device based on task layering
CN112951258A (en) * 2021-04-23 2021-06-11 中国科学技术大学 Audio and video voice enhancement processing method and model
CN113705725A (en) * 2021-09-15 2021-11-26 中国矿业大学 User personality characteristic prediction method and device based on multi-mode information fusion
US20220147758A1 (en) * 2020-11-10 2022-05-12 Fujitsu Limited Computer-readable recording medium storing inference program and method of inferring
CN115146743A (en) * 2022-08-31 2022-10-04 平安银行股份有限公司 Character recognition model training method, character recognition method, device and system
CN115269845A (en) * 2022-08-01 2022-11-01 安徽大学 Network alignment method and system based on social network user personality
CN112951258B (en) * 2021-04-23 2024-05-17 中国科学技术大学 Audio/video voice enhancement processing method and device


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832651A (en) * 2020-07-14 2020-10-27 清华大学 Video multi-mode emotion inference method and device
CN111832651B (en) * 2020-07-14 2023-04-07 清华大学 Video multi-mode emotion inference method and device
US20220147758A1 (en) * 2020-11-10 2022-05-12 Fujitsu Limited Computer-readable recording medium storing inference program and method of inferring
CN112650861A (en) * 2020-12-29 2021-04-13 中山大学 Personality prediction method, system and device based on task layering
CN112951258A (en) * 2021-04-23 2021-06-11 中国科学技术大学 Audio and video voice enhancement processing method and model
CN112951258B (en) * 2021-04-23 2024-05-17 中国科学技术大学 Audio/video voice enhancement processing method and device
CN113705725A (en) * 2021-09-15 2021-11-26 中国矿业大学 User personality characteristic prediction method and device based on multi-mode information fusion
CN115269845A (en) * 2022-08-01 2022-11-01 安徽大学 Network alignment method and system based on social network user personality
CN115269845B (en) * 2022-08-01 2023-06-23 安徽大学 Network alignment method and system based on social network user personality
CN115146743A (en) * 2022-08-31 2022-10-04 平安银行股份有限公司 Character recognition model training method, character recognition method, device and system

Also Published As

Publication number Publication date
CN111259976B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN111259976B (en) Personality detection method based on multi-modal alignment and multi-vector characterization
CN110097894B (en) End-to-end speech emotion recognition method and system
CN110728997B (en) Multi-modal depression detection system based on context awareness
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN103366618B (en) Scene device for Chinese learning training based on artificial intelligence and virtual reality
CN112069484A (en) Multi-mode interactive information acquisition method and system
CN115329779B (en) Multi-person dialogue emotion recognition method
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
Fellbaum et al. Principles of electronic speech processing with applications for people with disabilities
Zhang Ideological and political empowering English teaching: ideological education based on artificial intelligence in classroom emotion recognition
Deschamps-Berger et al. Exploring attention mechanisms for multimodal emotion recognition in an emergency call center corpus
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
Cole et al. A prototype voice-response questionnaire for the U.S. census.
CN116959417A (en) Method, apparatus, device, medium, and program product for detecting dialog rounds
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN112700796B (en) Voice emotion recognition method based on interactive attention model
Brahme et al. Effect of various visual speech units on language identification using visual speech recognition
Mean Foong et al. V2s: Voice to sign language translation system for malaysian deaf people
CN114420159A (en) Audio evaluation method and device and non-transient storage medium
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system
Sajid et al. Multimodal Emotion Recognition using Deep Convolution and Recurrent Network
CN116108856B (en) Emotion recognition method and system based on long and short loop cognition and latent emotion display interaction
CN114610861B (en) End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder
Aguirre-Peralta et al. Speech to Text Recognition for Videogame Controlling with Convolutional Neural Networks.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant