CN110096575B - Psychological portrait method facing microblog user - Google Patents


Info

Publication number
CN110096575B
Authority
CN
China
Prior art keywords
microblog
user
text
representation
personality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910375599.XA
Other languages
Chinese (zh)
Other versions
CN110096575A (en)
Inventor
赵忠华
吴俊杰
赵志云
袁石
王禄恒
左源
付培国
万欣欣
李欣
王涵菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
National Computer Network and Information Security Management Center
Original Assignee
Beihang University
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University, National Computer Network and Information Security Management Center
Publication of CN110096575A
Application granted
Publication of CN110096575B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00: ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/70: ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to mental therapies, e.g. psychological therapy or autogenous training

Abstract

The invention discloses a psychological portrait method for microblog users, comprising the following steps: step one, selecting sample users on a microblog platform and obtaining a personality trait score for each sample user by questionnaire according to a set psychological scale; step two, obtaining a text representation of each sample user from the text the user has posted on the microblog platform, and a behavior representation from the user's behavior information; step three, constructing a personality trait prediction model from the correspondence between the sample users' personality trait scores and their text and behavior representations; and step four, obtaining the text representation and behavior representation of a user to be assessed and deriving that user's personality traits from the prediction model. The invention makes it possible to analyze the personality traits of microblog users and provides technical support for building their psychological portraits.

Description

Psychological portrait method facing microblog user
Technical Field
The present invention relates to the field of data modeling. More specifically, the invention relates to a psychological portrait modeling method for microblog users.
Background
The wide adoption of social media, together with data mining techniques built on social media data, creates opportunities for analyzing, portraying, tracking and monitoring the psychological characteristics of Internet users. On the one hand, microblogs are currently the most popular and open social media and cover most Internet users in China; on the other hand, data mining techniques centered on social media provide technical support for moving traditional psychological analysis methods from physical space to cyberspace. Using the behavioral traces that individuals leave online, as represented by microblogs, to analyze the psychological characteristics of Internet users, and applying machine learning modeling and computational psychology to portray, track and monitor those characteristics accurately, is therefore of significant research value. In addition, in industry, psychological portraits of users built from social media big data are widely applied to targeted product placement, user fraud monitoring, sentiment analysis, opinion mining and similar tasks, and bring corresponding benefits to companies.
Disclosure of Invention
An object of the present invention is to solve at least the above problems and to provide at least the advantages described later.
Another object of the invention is to provide a psychological portrait modeling method for microblog users, which extracts users' text features and behavior features from the data they publish on microblogs, combines the two to build a prediction model for the Big Five personality traits and the Dark Triad traits, analyzes the personality traits of microblog users, and provides technical support for their psychological portraits.
To achieve these objects and other advantages in accordance with the purpose of the invention, there is provided a psychological portrait modeling method for microblog users, including:
step one, selecting sample users on a microblog platform, and obtaining the personality trait scores of the sample users by questionnaire according to a set psychological scale; administering a personality questionnaire to the sample users on the microblog platform according to the Big Five personality model and the Dark Triad model, and taking the score on each dimension of the two models as the personality traits of each sample user;
step two, acquiring the microblog data of the sample users, and obtaining the text representation of each sample user from the microblog texts posted by the user; obtaining the behavior representation of each sample user from the user's basic information and behavior information; weighting the text representation and the behavior representation of each sample user to form the representation features of the sample user;
step three, constructing a personality trait prediction model from the correspondence between the personality trait scores of the sample users and their text and behavior representations;
and step four, predicting the personality traits of a microblog user with the prediction model obtained in step three to form the psychological portrait of the user.
Preferably, the psychological scale in step one is based on the Big Five personality model and the Dark Triad model and is scored on a 1-5 point Likert scale.
Preferably, step one further comprises verifying the reliability and validity of the questionnaires with SPSS and Amos software, and deleting questionnaires that fail the check; for reliability, the alpha reliability coefficient is used; for validity, content validity and construct validity checks are used.
Preferably, step two further includes cleaning the microblog data of the sample users, removing from forwarded and comment microblogs the content that is unrelated to the text posted by the user.
Preferably, the method for acquiring the text representation in step two includes:
converting each word obtained by segmenting the microblog text into an N-dimensional representation vector according to the LIWC dictionary, and summing these vectors so that each sample user is represented, from the text perspective, by an N-dimensional vector, which is the user's text representation.
Preferably, the method further comprises expanding the text feature corpus of the LIWC dictionary according to the word usage habits found on microblogs.
Preferably, the method for acquiring the behavior representation in step two includes:
determining indicators related to the personality traits of microblog users;
and calculating the value of each user on each indicator, normalizing the values and concatenating them into a P-dimensional vector, which is the behavior representation of the sample user.
Preferably, the indicators related to the personality traits of microblog users comprise numerical indicators and vector indicators; the numerical indicators comprise the number of accounts the user follows, the number of microblogs the user has posted, and the number of followers the user has; the vector indicators comprise emoticon usage preference and activity level.
Preferably, the method for weighting the text representation and behavior representation of the sample user in step two includes:
first, constructing a vector dimension-reduction model and a vector dimension-raising model to bring the text representation vectors and behavior representation vectors to a common dimension;
and second, weighting the text representation and the behavior representation with a vector weighting algorithm, based on the dimension-unified representation vectors, to obtain the representation features of the sample user.
Preferably, the method for building the prediction model in step three includes:
constructing a classification model for each single dimension: for each dimension of the set psychological scale, dividing the scores into three intervals bounded by the mean of all sample users on that dimension plus or minus a certain range of the standard deviation, where the range may be 0.1-1.0 standard deviations, and assigning a category label to each interval;
and then building a prediction model that maps the text representation and behavior representation to the category label using a classification algorithm, the classification algorithms including support vector machines and naive Bayes.
The invention provides at least the following beneficial effects: first, it enables accurate portrayal, tracking and monitoring of the psychological characteristics of Internet users, which is of significant research value; second, psychological portraits of users built from social media big data can be widely applied to targeted product placement, user fraud monitoring, sentiment analysis, opinion mining and similar tasks.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
Fig. 1 is an overall block diagram of the present invention.
Detailed Description
The present invention is described in further detail below to enable those skilled in the art to practice the invention with reference to the description.
Based on the established psychological theories of the Big Five personality traits and the Dark Triad, the invention uses a questionnaire to obtain the scores of a subset of active microblog users on every dimension of the Big Five and the Dark Triad, obtains the public data of those users, extracts and weights their text and behavior representations, and builds a Big Five and Dark Triad prediction model for microblog users. The model is then used to predict the personality traits of microblog users whose traits are unknown, so that a psychological model of microblog users can be drawn from microblog data.
A psychological portrait modeling method for microblog users is provided, comprising the following steps:
step one, selecting sample users on a microblog platform, and obtaining the personality trait scores of the sample users by questionnaire according to a set psychological scale; administering a personality questionnaire to the sample users on the microblog platform according to the Big Five personality model and the Dark Triad model, wherein the Big Five model comprises the five dimensions of openness, extraversion, agreeableness, conscientiousness and neuroticism, and the Dark Triad model comprises the three dimensions of narcissism, Machiavellianism and psychopathy; taking the score on each dimension of the Big Five model and the Dark Triad model as the personality traits of each sample user; generating the microblog user personality questionnaire from the International Personality Item Pool (IPIP) Big Five scale and the Dark Triad scale proposed by Jones, D.N. & Paulhus, D.L.; distributing the questionnaire to the sample users mainly by online private message and collecting the responses; and, according to the questionnaire content, scoring each user on every dimension of the Big Five and the Dark Triad using a 1-5 point Likert scale, and taking the dimension scores as the personality traits of each user.
Step two, acquiring the microblog data of the sample users, and obtaining the text representation of each sample user from the microblog texts posted by the user; obtaining the behavior representation of each sample user from the user's basic information and behavior information; weighting the text representation and the behavior representation of each sample user to form the representation features of the sample user;
step three, constructing a personality trait prediction model from the correspondence between the personality trait scores of the sample users and their text and behavior representations;
and step four, predicting the personality traits of a microblog user with the prediction model obtained in step three to form the psychological portrait of the user.
In one embodiment, the psychological scale in step one is based on the Big Five personality model and the Dark Triad model and is scored on a 1-5 point Likert scale.
The psychological portrait modeling method for microblog users further comprises collecting users of various categories on the microblog platform: according to the user classification information of the microblog platform, K active microblog users are acquired in each category to form the sample users, where K can be any natural number between 10 and 200.
In one embodiment, step one further includes verifying the reliability and validity of the questionnaires and deleting those that fail: the returned questionnaires are processed with SPSS and Amos software to verify their reliability and validity, and only the questionnaires that pass are retained; for reliability, the alpha reliability coefficient can be used; for validity, content validity and construct validity checks can be used.
In one embodiment, step two further includes cleaning the microblog data of the sample users, removing from forwarded and comment microblogs the content that is unrelated to the text posted by the user.
First, the microblog data of the sample users is acquired through the API (application programming interface) of the microblog platform; it comprises each user's basic information, behavior information and posted microblog text.
Microblog texts fall into three types: original, forwarded and comment. Unlike original microblogs, forwarded and comment microblogs contain text that was not written by the user. Specifically, a forwarded microblog generally has two parts, the content the user wrote when forwarding and the content of the original microblog, such as "6666// @ recall special small vest 6666"; a comment microblog generally contains the word "reply", such as "reply small black man what you are about, hahaha". Content unrelated to the text posted by the user lowers the precision of the personality trait prediction model, so it must be removed before the users' feature representations are computed. Regular expressions are used to remove this content from forwarded and comment microblogs.
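The cleaning step can be illustrated with a minimal sketch. The patterns below are assumptions, not the patent's actual expressions: they treat everything from the forwarding marker `//@` onward as quoted content, and a leading `回复@name:` prefix (the Chinese form of "reply") as the comment header. The function name `clean_post` is hypothetical.

```python
import re

# Assumed pattern: "//@" (optionally with spaces) starts the quoted original post.
REPOST_TAIL = re.compile(r"//\s*@.*$", re.S)
# Assumed pattern: a comment starts with "回复@<name>" and an optional colon.
REPLY_HEAD = re.compile(r"^\s*回复\s*@[^:：\s]+\s*[:：]?\s*")


def clean_post(text: str) -> str:
    """Keep only the part of a microblog that the user wrote."""
    text = REPOST_TAIL.sub("", text)   # drop the forwarded original content
    text = REPLY_HEAD.sub("", text)    # drop the "reply @name:" header
    return text.strip()


if __name__ == "__main__":
    for raw in ["6666//@someone:6666", "回复@小黑:哈哈哈", "an original post"]:
        print(repr(clean_post(raw)))
```

Real platform data may use other markers, so the patterns would need to be checked against actual samples before use.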
In one embodiment, the method for acquiring the text representation in step two includes:
converting each sample user into an N-dimensional vector from the text perspective according to the LIWC dictionary, thereby calculating the text representation of each user.
For the cleaned text, the semantic features of the text are mined in combination with the LIWC dictionary, which is widely used in psychology, and the feature representation of each user is calculated from the text perspective.
In one embodiment, the method further comprises expanding the text feature corpus of the LIWC dictionary according to the word usage habits found on microblogs.
Considering that microblog vocabulary is colloquial and diverse, the method uses word2vec and a BERT pre-trained model to capture the semantic features of each word in the microblog text, and uses these features to expand the LIWC dictionary, which overcomes the dictionary's limited size and makes it fit the wording habits of the microblog platform better. During expansion, following the idea of minimizing intra-group distance and maximizing inter-group distance, an improved cosine similarity function is used to calculate the similarity between word vectors and to group the words better. The specific steps are as follows:
1) Performing word segmentation and part-of-speech tagging on each cleaned microblog text with a Chinese word segmentation tool. The invention uses jieba, a simple, efficient and widely used tool, for Chinese word segmentation and part-of-speech tagging.
2) Learning the segmented words with word2vec and the BERT pre-trained model, and converting each word into an N-dimensional representation vector, where N is preferably 300.
3) Looking up the representation vectors of all words in the LIWC dictionary among the learned word vectors. LIWC words for which no vector is found are discarded.
4) Following the idea of minimizing intra-group distance and maximizing inter-group distance, an improved cosine similarity function is used to calculate the semantic similarity between each LIWC word that has a representation vector and every word not contained in the LIWC dictionary. The similarity is calculated as follows:
Figure GDA0003280646490000051
where the term shown in
Figure GDA0003280646490000052
denotes the direct cosine similarity between the i-th word vector in the LIWC dictionary and the j-th word vector not in the LIWC dictionary, and number_LIWC denotes the number of words in the LIWC dictionary that have word vectors.
5) For each LIWC word that has a representation vector, selecting the M words with the highest similarity, where M is preferably 30, and combining all selected words with the LIWC words that have representation vectors into a text feature corpus.
6) Calculating the text representation of each user according to the text feature corpus, so that each user is converted into an N-dimensional vector from the text perspective. The text representation is calculated as follows:
Figure GDA0003280646490000061
where w_text_i denotes the text representation of the i-th user, the term shown in
Figure GDA0003280646490000062
denotes the number of times word j appears in the text posted by the i-th user, and w_j denotes the representation vector of word j calculated by word2vec and the BERT pre-trained model.
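Steps 3) to 6) can be sketched as follows. This is an illustration under stated assumptions, not the patented computation: plain cosine similarity stands in for the improved similarity function, whose formula is available only as an image, and the text representation is assumed to be the count-weighted sum of word vectors, which matches the symbol definitions given for the formula. The names `expand_corpus` and `text_representation` are hypothetical, and the word vectors are expected to come from a separately trained model.

```python
import math
from collections import Counter


def cosine(u, v):
    """Plain cosine similarity (stand-in for the improved function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_corpus(vectors, liwc_words, m=30):
    """Steps 3-5: keep LIWC words that have a vector, add each one's top-m neighbours."""
    liwc_set = set(liwc_words)
    seeds = [w for w in liwc_words if w in vectors]      # LIWC words without a vector are dropped
    others = [w for w in vectors if w not in liwc_set]   # candidate words outside LIWC
    corpus = set(seeds)
    for seed in seeds:
        ranked = sorted(others, key=lambda w: cosine(vectors[seed], vectors[w]), reverse=True)
        corpus.update(ranked[:m])
    return corpus


def text_representation(tokens, vectors, corpus, n_dims):
    """Step 6 (assumed form): sum over corpus words of count * word vector."""
    counts = Counter(t for t in tokens if t in corpus)
    rep = [0.0] * n_dims
    for word, n in counts.items():
        for k, x in enumerate(vectors[word]):
            rep[k] += n * x
    return rep
```

With N = 300 and M = 30 as preferred in the text, `vectors` would map each segmented word to a 300-dimensional list and `m` would be left at its default.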
In one embodiment, the method for acquiring the behavior representation in step two includes:
determining indicators related to the personality traits of microblog users;
and calculating the value of each user on each indicator, normalizing the values by column and concatenating them into a P-dimensional vector, which is the behavior representation of the sample user.
In one embodiment, the indicators related to the personality traits of microblog users comprise numerical indicators and vector indicators; the numerical indicators comprise the number of accounts the user follows, the number of microblogs the user has posted, and the number of followers the user has; the vector indicators comprise emoticon usage preference and activity level. Specifically, the behavior representation of each user is calculated from the basic user information and behavior information acquired from the microblog platform. The specific steps are as follows:
1) Through extensive interviews, literature research, data analysis and other work, the invention determines the following five indicators related to the personality traits of microblog users:
I. Number of accounts followed: contained in the basic information; the result is a numerical value;
II. Number of microblogs posted: contained in the basic information; the result is a numerical value;
III. Number of followers: contained in the basic information; the result is a numerical value;
IV. Emoticon usage preference: the numbers of times positive, negative and neutral emoticons are used; the result is a three-dimensional numerical vector;
V. Activity level: the shortest posting interval, the longest posting interval, the highest daily posting count, the lowest daily posting count, the highest monthly posting count and the lowest monthly posting count; the result is a six-dimensional numerical vector;
2) According to the indicators related to the personality traits of the sample users, calculating the value of each user on each indicator, normalizing the values by column and concatenating them into a P-dimensional vector (P equals the sum of the dimensions of the selected indicators), which forms the behavior representation of the user.
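A minimal sketch of step 2) follows. With all five indicators selected, P = 3 + 3 + 6 = 12. The normalization method is not specified in the text; min-max scaling per column is assumed here, and the field names (`following`, `posts`, `followers`, `emoticons`, `activity`) and the function names are hypothetical.

```python
def column_normalize(rows):
    """Min-max scale each column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(x - lo) / sp if sp else 0.0 for x, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


def behavior_matrix(users):
    """Concatenate the five indicators per user, then normalize by column."""
    rows = []
    for u in users:
        rows.append(
            [u["following"], u["posts"], u["followers"]]  # three numerical indicators
            + list(u["emoticons"])                        # positive, negative, neutral counts
            + list(u["activity"])                         # six activity statistics
        )
    return column_normalize(rows)
```

Each row of the returned matrix is one user's P-dimensional behavior representation. Normalizing across users rather than within a user keeps indicators with very different scales, such as follower counts and posting intervals, comparable.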
In one embodiment, step two further includes weighting the text representation and behavior representation vectors to form the representation features of the sample user, where the weighting method includes:
first, constructing a vector dimension-reduction model and a vector dimension-raising model to bring the text representation vectors and behavior representation vectors to a common dimension;
and second, weighting the text representation and the behavior representation with a vector weighting algorithm, based on the dimension-unified representation vectors, to obtain the representation features of the sample user.
Specifically, because the text representation and the behavior representation calculated above differ in dimensionality, this step first uses neural network techniques to construct a vector dimension-reduction model and a vector dimension-raising model, under the principle that the distance between any two vectors remains unchanged before and after transformation, so that the text and behavior representation vectors share the same dimensionality; then, based on the dimension-unified vectors, the text representation and the behavior representation are weighted with a vector weighting algorithm to obtain the final user feature representation. Because microblog data is sparse, the method can adjust the weight to control the relative importance of the text and behavior representations, which mitigates inaccurate user feature representations caused by missing text or behavior data. The specific steps are as follows:
1) Using neural network techniques, constructing a vector dimension-reduction model and a vector dimension-raising model under the principle that the distance between any two vectors remains unchanged before and after transformation. The dimension-reduction model reduces high-dimensional vectors to T dimensions, and the dimension-raising model expands low-dimensional vectors to T dimensions, where T is preferably 100.
2) Reducing the N-dimensional text representation of each microblog user to T dimensions with the dimension-reduction model, and expanding the P-dimensional behavior representation of each microblog user to T dimensions with the dimension-raising model.
3) Weighting the text representation and the behavior representation, now of the same dimension, with the vector weighting algorithm to obtain the feature representation of the user. The vector weighting algorithm is calculated as follows:
Figure GDA0003280646490000071
where w_user_i denotes the feature representation vector of the i-th user, the term shown in
Figure GDA0003280646490000072
denotes the text representation vector of the i-th user after dimension reduction, the term shown in
Figure GDA0003280646490000073
denotes the behavior representation vector of the i-th user after dimension raising, and α denotes the weight; preferably, α is 0.5.
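The weighting in step 3) can be sketched as below. The formula itself is available only as an image, so the form used here is an assumption: a convex combination in which α weights the text vector and 1 - α weights the behavior vector. Both inputs are assumed to have already been mapped to T dimensions by the models of steps 1) and 2), which are not reproduced. The function name `combine` is hypothetical.

```python
def combine(text_vec, behavior_vec, alpha=0.5):
    """Assumed weighting: alpha * text + (1 - alpha) * behavior, element-wise."""
    if len(text_vec) != len(behavior_vec):
        raise ValueError("both vectors must already have the same dimension T")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * b for t, b in zip(text_vec, behavior_vec)]


if __name__ == "__main__":
    print(combine([1.0, 0.0], [0.0, 1.0]))             # equal weight on both views
    print(combine([1.0, 0.0], [0.0, 1.0], alpha=1.0))  # text only, e.g. no behavior data
```

Under this reading, moving α toward 1 or 0 is how the method would lean on whichever representation is available when the other is sparse or missing.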
In one embodiment, the method for building the Big Five and Dark Triad prediction model in step three comprises:
constructing a classification model for each single dimension: for each dimension of the set psychological scale, dividing the scores into three intervals bounded by the mean of all sample users on that dimension plus or minus a certain range of the standard deviation, and assigning a category label to each interval;
and then building a prediction model that maps the text representation and behavior representation to the category label using a classification algorithm, the classification algorithms including support vector machines and naive Bayes.
Specifically, for the sample users whose personality traits were collected in step one, a prediction model of the Big Five and Dark Triad traits of microblog users is constructed from the user feature data calculated in step two. The method converts the prediction of the Big Five and Dark Triad traits into a classification problem. For each dimension of the Big Five and the Dark Triad, users are divided into three classes bounded by the mean of all users on that dimension plus or minus a certain range of the standard deviation, the range being 0.1-1.0 standard deviations.
A prediction model that maps the text representation and behavior representation to the category label is then built using a classification algorithm, the classification algorithms including support vector machines and naive Bayes.
Under these rules, each microblog user has a category label (A, B or C) on every dimension of the Big Five and the Dark Triad. A classification model is constructed for each dimension. The Big Five comprises five dimensions and the Dark Triad three, eight in total. The invention therefore constructs eight classification models, one for predicting the user's category label on each of the eight dimensions.
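The labelling rule for one dimension can be sketched as follows. Several details are assumptions: the text does not say which label goes with which interval, so A is used here for scores below the lower bound, B for the middle interval and C for scores above the upper bound; the population standard deviation is used; and `k` is the multiple of the standard deviation (0.1 to 1.0 in the text). The function name `label_dimension` is hypothetical. The classifier that is then trained on these labels is not shown.

```python
from statistics import mean, pstdev


def label_dimension(scores, k=0.5):
    """Assign A/B/C per user from mean - k*std and mean + k*std on one trait dimension."""
    mu = mean(scores)
    sigma = pstdev(scores)            # assumption: population standard deviation
    low, high = mu - k * sigma, mu + k * sigma
    labels = []
    for s in scores:
        if s < low:
            labels.append("A")        # assumed: below the lower bound
        elif s > high:
            labels.append("C")        # assumed: above the upper bound
        else:
            labels.append("B")        # assumed: middle interval
    return labels


if __name__ == "__main__":
    print(label_dimension([1.0, 3.0, 3.0, 5.0], k=0.5))
```

Running this once per dimension yields the eight label columns on which the eight classifiers would be trained.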
The specific operating steps of the invention are as follows:
Step one: a personality questionnaire is administered to selected sample users on the microblog platform according to the Big Five personality model and the Dark Triad model, and the dimension scores under the two models are taken as the personality traits of each sample user;
K popular microblog users in each category are selected as sample users according to the user classification information of the microblog platform, where K may be any natural number between 10 and 200. The personality questionnaire is administered to the selected sample users. The personality traits in the invention comprise the Big Five and the Dark Triad: the Big Five model comprises the five dimensions of openness, extraversion, agreeableness, conscientiousness and neuroticism, and the Dark Triad model comprises the three dimensions of narcissism, Machiavellianism and psychopathy. The personality traits of the sample users are measured according to these two models. The specific steps are as follows:
1) generating a personality questionnaire for microblog users from the International Personality Item Pool (IPIP) Big Five scale and the Dark Triad scale proposed by Jones, D.N. & Paulhus, D.L.;
2) sending the questionnaire to the sample users by online private message, and collecting the responses;
3) processing the returned questionnaires with SPSS and Amos software, verifying their reliability and validity, and retaining the questionnaires that pass; reliability can be checked with the Alpha (Cronbach's alpha) reliability coefficient, and validity can be checked by either content validity or construct validity;
4) for the questionnaires that pass the reliability and validity tests, scoring each user on each dimension of the Big Five and the Dark Triad with a 1-5 point Likert scale, and taking the dimension scores as the personality traits of that user.
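The reliability check in step 3) and the scoring in step 4) can be illustrated with a short sketch. The patent performs the check in SPSS and Amos; the Python below is only an illustrative stand-in that computes Cronbach's alpha for one scale dimension. The function names and the sample responses are hypothetical, and using the mean of the item scores as the dimension score is an assumption, since the text does not say how item scores are aggregated.

```python
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one scale dimension.

    responses: one row per respondent, one 1-5 Likert score per item.
    """
    k = len(responses[0])  # number of items in the dimension
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    item_vars = [variance(col) for col in zip(*responses)]  # per-item sample variance
    total_var = variance([sum(row) for row in responses])   # variance of total scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def dimension_score(row):
    """Assumed aggregation: mean of the item scores of one respondent."""
    return mean(row)


# Hypothetical responses of five users to three items of one dimension.
sample = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1], [3, 3, 4]]
alpha = cronbach_alpha(sample)
scores = [dimension_score(row) for row in sample]
```

A questionnaire batch would be retained only if `alpha` reaches whatever threshold the study sets; the patent does not state one.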
Step two: microblog data of the sample users are acquired, and the text representation of each sample user is obtained from the microblog text the user posted; the behavior representation of each sample user is obtained from the user's basic information and behavior information; the text representation and the behavior representation are weighted to form the feature representation of the sample user;
first, the microblog data of the sample users are acquired through the API (application programming interface) of the microblog platform; the data comprise each user's basic information, behavior information and posted microblog text;
microblog posts are of three types: original posts, reposts and comments. Unlike original posts, reposts and comments contain text that was not written by the user. Specifically, a repost generally consists of two parts, the content the user wrote when reposting and the content of the original post, such as 6666// @ recall special small vest 6666, and a comment generally contains the word 'reply', such as 'reply' small black man 'what you are about haha, and haha'. Content unrelated to the text the user wrote reduces the accuracy of the personality trait prediction model, so it must be removed before the user's feature representation is computed. Regular expressions are used to remove such content from reposts and comments;
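The cleaning step can be sketched with two regular expressions. The patterns below are assumptions about the post format: a repost is taken to be the user's own text followed by `//@name:` and the quoted post, and a comment is taken to begin with `回复@name:` (回复 means "reply"). The patent does not give its actual expressions, and `clean_post` is a hypothetical helper name.

```python
import re

# Assumed repost marker: the user's own text, then "//@name:" and the quoted post.
REPOST_TAIL = re.compile(r"//\s*@.*$", re.DOTALL)
# Assumed reply prefix: "回复@name:" with an ASCII or full-width colon.
REPLY_PREFIX = re.compile(r"^\s*回复\s*@[^:：]+[:：]\s*")


def clean_post(text):
    """Keep only the text written by the posting user."""
    text = REPOST_TAIL.sub("", text)   # drop the quoted original post
    text = REPLY_PREFIX.sub("", text)  # drop the "reply @name:" prefix
    return text.strip()
```

Original posts pass through unchanged, so the same function can be applied to every post before word segmentation.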
For the cleaned text, the semantic features of the text are mined in combination with the LIWC dictionary, which is widely used in psychology, and the feature representation of each user is computed from the text perspective. Considering the colloquial and diverse vocabulary of microblogs, the method uses word2vec and a BERT pre-trained model to capture the semantic features of each word in the microblog text, and uses these features to expand the LIWC dictionary, which overcomes the limited size of the LIWC dictionary and makes it fit the wording habits of the microblog platform better. In expanding the LIWC dictionary, an improved cosine similarity function, based on the idea of minimizing intra-group distance and maximizing inter-group distance, is proposed to compute the similarity between word vectors and to group the words more accurately. The specific steps are as follows:
1) Word segmentation and part-of-speech tagging are performed on each cleaned microblog text with a Chinese word segmentation tool. The invention uses the simple, efficient and widely used jieba tool for Chinese word segmentation and part-of-speech tagging.
2) For the segmented text, the words are learned with word2vec and the BERT pre-trained model, and each word is converted into an N-dimensional representation vector, where N is preferably 300.
3) The representation vectors of all words in the LIWC dictionary are looked up among the learned word vectors. LIWC dictionary words for which no vector is found are discarded.
4) Based on the idea of minimizing intra-group distance and maximizing inter-group distance, an improved cosine similarity function is proposed and used to compute the semantic similarity between each LIWC dictionary word that has a representation vector and every word not contained in the LIWC dictionary. The similarity is calculated as follows:
Figure GDA0003280646490000101
where
Figure GDA0003280646490000102
denotes the direct cosine similarity between the i-th word vector in the LIWC dictionary and the j-th word vector not in the LIWC dictionary, and number_LIWC denotes the number of words in the LIWC dictionary that have word vectors.
5) For each LIWC dictionary word that has a representation vector, the M most similar words are selected, with M preferably 30; all selected words, together with the LIWC dictionary words that have representation vectors, form the text feature corpus.
6) The text representation of each user is computed from the text feature corpus, converting each user into an N-dimensional vector from the text perspective. The text representation is calculated as follows:
Figure GDA0003280646490000103
where w_text_i denotes the text representation of the i-th user,
Figure GDA0003280646490000104
denotes the number of times word j appears in the text posted by the i-th user, and w_j denotes the representation vector of word j computed by word2vec and the BERT pre-trained model.
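Steps 3) to 6) can be sketched as follows. Two points are assumptions rather than the patent's method: the formula images are not reproduced in this text, so the sketch uses plain cosine similarity instead of the improved similarity function, and it reads the text representation as a count-weighted sum of word vectors, which matches the variable definitions above and the "vector summation" wording of claim 1. The function names and the `vectors` mapping (word to representation vector) are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Plain cosine similarity (not the patent's improved variant)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_lexicon(seed_words, vectors, m=30):
    """Steps 3)-5): keep seed words that have vectors, add each one's top-m neighbours."""
    seeds = [w for w in seed_words if w in vectors]  # words without vectors are discarded
    candidates = [w for w in vectors if w not in seed_words]
    corpus = set(seeds)
    for s in seeds:
        ranked = sorted(candidates, key=lambda w: cosine(vectors[s], vectors[w]), reverse=True)
        corpus.update(ranked[:m])
    return corpus


def text_representation(tokens, corpus, vectors):
    """Step 6): count-weighted sum of the vectors of corpus words in a user's posts."""
    counts = Counter(t for t in tokens if t in corpus)
    dim = len(next(iter(vectors.values())))
    rep = [0.0] * dim
    for word, n in counts.items():
        for d, value in enumerate(vectors[word]):
            rep[d] += n * value
    return rep
```

Here `tokens` would be the segmented words of all cleaned posts of one user, and `seed_words` the LIWC dictionary entries.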
Next, the behavior representation of each user is computed from the basic user information and behavior information acquired from the microblog platform. The specific steps are as follows:
1) through extensive interviews, literature research and data analysis, the invention identifies the following five indicators related to the personality traits of microblog users:
I. Following count of the microblog user: contained in the basic information; the result is a numerical value;
II. Number of posts of the microblog user: contained in the basic information; the result is a numerical value;
III. Follower count of the microblog user: contained in the basic information; the result is a numerical value;
IV. Emoticon usage preference: the numbers of times positive, negative and neutral emoticons are used; the result is a three-dimensional numerical vector;
V. Activity level: the shortest posting interval, the longest posting interval, the highest daily post count, the lowest daily post count, the highest monthly post count and the lowest monthly post count; the result is a six-dimensional numerical vector;
2) the value of each user on each indicator is computed, and the values are normalized column by column and concatenated into a P-dimensional vector (P equals the sum of the dimensions of the selected indicators), which forms the behavior representation of the user.
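The normalization and concatenation in step 2) can be sketched as follows. The text says only that values are normalized by column, so min-max scaling is an assumption, as are the dictionary keys and function names. With all five indicators selected, P = 3 + 3 + 6 = 12.

```python
def min_max_columns(rows):
    """Normalize each column to [0, 1]; a constant column becomes all zeros."""
    normalized = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        normalized.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*normalized)]


def behavior_matrix(users):
    """Concatenate the five indicators into one P-dimensional row per user."""
    raw = [
        [u["following"], u["posts"], u["followers"], *u["emoticons"], *u["activity"]]
        for u in users
    ]
    return min_max_columns(raw)


# Hypothetical values for two users.
users = [
    {"following": 10, "posts": 100, "followers": 50,
     "emoticons": [3, 1, 0], "activity": [1, 30, 5, 0, 40, 2]},
    {"following": 20, "posts": 300, "followers": 50,
     "emoticons": [1, 1, 4], "activity": [2, 60, 9, 0, 80, 4]},
]
matrix = behavior_matrix(users)
```

Normalizing over the columns of the whole sample keeps indicators with large raw ranges, such as follower counts, from dominating the vector.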
Step three: a personality trait prediction model is constructed from the correspondence between the sample users' personality trait scores and their text and behavior representations;
the text representation and the behavior representation computed in the preceding steps differ in dimension. Therefore, neural networks are first used to construct a vector dimension-reduction model and a vector dimension-raising model, both subject to the principle that the distance between any two vectors remains unchanged by the transformation, so that the text and behavior representation vectors are brought to the same dimension. Then, on the dimension-unified vectors, a vector weighting algorithm weights the text representation and the behavior representation to obtain the final user feature representation. Because microblog data are sparse, adjusting the weight controls the relative importance of the two representations, which mitigates inaccurate user feature representations caused by missing text or behavior data. The specific steps are as follows:
1) Using neural networks, a vector dimension-reduction model and a vector dimension-raising model are constructed, each subject to the principle that the distance between any two vectors remains unchanged by the transformation. The dimension-reduction model reduces a high-dimensional vector to T dimensions, and the dimension-raising model expands a low-dimensional vector to T dimensions, where T is preferably 100.
2) The N-dimensional text representation of each microblog user is reduced to T dimensions with the dimension-reduction model, and the P-dimensional behavior representation of each microblog user is expanded to T dimensions with the dimension-raising model.
3) The text representation and the behavior representation, now of the same dimension, are weighted with the vector weighting algorithm to obtain the feature representation of the user. The vector weighting algorithm is calculated as follows:
Figure GDA0003280646490000111
where w_user_i denotes the feature representation vector of the i-th user,
Figure GDA0003280646490000112
denotes the text representation vector of the i-th user after dimension reduction,
Figure GDA0003280646490000113
denotes the behavior representation vector of the i-th user after dimension raising, and alpha denotes the weight, preferably alpha = 0.5.
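The weighting in step 3) can be sketched as follows. The formula image is not reproduced in this text, so the convex combination alpha * text + (1 - alpha) * behavior is an assumption that is merely consistent with the variable definitions above; `weight_representations` is a hypothetical name, and both inputs are taken to be T-dimensional already.

```python
def weight_representations(text_vec, behavior_vec, alpha=0.5):
    """Assumed weighting: alpha * text + (1 - alpha) * behavior, element by element."""
    if len(text_vec) != len(behavior_vec):
        raise ValueError("both representations must already share the same dimension T")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1 - alpha) * b for t, b in zip(text_vec, behavior_vec)]


# Hypothetical 2-dimensional vectors (T would preferably be 100).
user_vec = weight_representations([1.0, 2.0], [3.0, 4.0], alpha=0.5)
```

Under this reading, raising alpha toward 1 relies more on the text representation, which is one way to handle a user whose behavior data are missing.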
For the sample users whose personality traits were collected in step one, a prediction model of the Big Five personality and Dark Triad traits of microblog users is constructed from the user feature representations calculated in step two. The method converts the prediction of a microblog user's Big Five and Dark Triad traits into a classification problem. For each dimension of the Big Five and the Dark Triad, all users are divided into three classes by adding and subtracting a multiple of the standard deviation to and from the mean score of all users on that dimension, where the multiple is 0.5. The specific class labels and partition rules are shown in Table 1.
Table 1. Class label partition rules
Figure GDA0003280646490000121
With the above rules, each microblog user has a class label (A, B or C) on each dimension of the Big Five and the Dark Triad. A classification model is constructed for each dimension. The Big Five comprises five dimensions and the Dark Triad three, for a total of eight. The invention therefore constructs eight classification models, one for predicting the user's class label on each of the eight dimensions.
A prediction model relating the text representation, the behavior representation and the class label is then established with a classification algorithm, where the classification algorithms include the support vector machine and naive Bayes.
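The three-way partition on one dimension can be sketched as follows. Table 1 is an image that is not reproduced in this text, so the mapping of labels to intervals (A above the upper threshold, B in between, C below the lower threshold), the use of the population standard deviation, and the placement of boundary values in class B are all assumptions; `assign_labels` is a hypothetical name.

```python
from statistics import mean, pstdev


def assign_labels(scores, k=0.5):
    """Label each score A, B or C using the thresholds mean +/- k * standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    upper, lower = mu + k * sigma, mu - k * sigma
    labels = []
    for s in scores:
        if s > upper:
            labels.append("A")  # assumed: high scorers
        elif s < lower:
            labels.append("C")  # assumed: low scorers
        else:
            labels.append("B")  # assumed: scores near the mean
    return labels


# Hypothetical scores of five users on one dimension.
labels = assign_labels([1, 2, 3, 4, 5], k=0.5)
```

Running the function once per dimension yields the eight label columns used as training targets.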
Step four: the personality traits of a microblog user are predicted with the prediction model obtained in step three to form the user's psychological portrait; class labels on the eight dimensions are obtained with the support vector machine or naive Bayes algorithm from the user's text representation and behavior representation;
for a microblog user whose personality traits are unknown, step two is first carried out to obtain the user's feature representation from the collected microblog data; the prediction models of the Big Five and Dark Triad traits built in step three are then used to predict the user's class label on each of the eight dimensions, and the eight class labels are taken as the user's personality traits.
The invention is based on the established Big Five and Dark Triad theories of personality psychology, and can be extended to other existing or future psychological personality theories and combinations thereof without departing from the general concept defined by the scope of protection of the invention.
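Training one classifier per dimension and predicting a portrait can be sketched as follows. The patent names the support vector machine and naive Bayes; to stay self-contained, this sketch implements a small Gaussian naive Bayes with the standard library, which is an illustrative choice and not necessarily the variant the inventors used. The class and function names are hypothetical, and the variance smoothing constant is an assumption.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes for dense numeric feature vectors."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.stats = {}
        for label, rows in groups.items():
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # Small constant keeps variances positive (assumed smoothing value).
            variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-6
                         for c, m in zip(cols, means)]
            self.stats[label] = (math.log(len(rows) / len(X)), means, variances)
        return self

    def predict(self, x):
        def log_posterior(label):
            prior, means, variances = self.stats[label]
            return prior + sum(
                -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                for xi, m, v in zip(x, means, variances)
            )
        return max(self.stats, key=log_posterior)


def train_models(features, labels_by_dimension):
    """One classifier per personality dimension (eight in the patent)."""
    return {dim: GaussianNaiveBayes().fit(features, labels)
            for dim, labels in labels_by_dimension.items()}


def predict_portrait(models, feature_vector):
    """Map each dimension to its predicted class label for one user."""
    return {dim: model.predict(feature_vector) for dim, model in models.items()}
```

For example, `train_models(X, {"openness": [...], "narcissism": [...]})` followed by `predict_portrait(models, x_new)` returns one label per dimension; in the full method the dictionary would hold all eight dimensions.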
While embodiments of the invention have been disclosed above, it is not limited to the applications listed in the description and the embodiments, which are fully applicable in all kinds of fields of application of the invention, and further modifications may readily be effected by those skilled in the art, so that the invention is not limited to the specific details without departing from the general concept defined by the claims and the scope of equivalents.

Claims (6)

1. A psychological portrait method for microblog users, characterized by comprising the following steps:
step one, selecting sample users on a microblog platform, and obtaining personality trait scores of the sample users by questionnaire according to a set psychological scale;
step two, obtaining a text representation of each sample user from the text information of the sample user on the microblog platform, and obtaining a behavior representation of each sample user from the behavior information of the sample user; weighting the text representation and the behavior representation of the sample user to obtain the feature representation of the sample user,
the method for acquiring the text representation comprises the following steps:
converting each word after the microblog text word segmentation into an N-dimensional characterization vector according to an LIWC dictionary, converting each sample user into an N-dimensional vector from a text angle through vector summation, and calculating the text characterization of each user;
the behavior characterization obtaining method comprises the following steps:
determining indexes related to the personality characteristics of the microblog users;
calculating the numerical value of each user on each index, summing the numerical values to form a P-dimensional vector, and calculating the behavior characterization of the sample user;
the weighting method of the characteristic features of the sample user comprises the following steps:
firstly, constructing a vector dimension-reduction model and a vector dimension-raising model to bring the text representation and behavior representation vectors to the same dimension;
secondly, based on the dimension-unified representation vectors, weighting the text representation and the behavior representation with a vector weighting algorithm to obtain the feature representation of the sample user;
step three, constructing a personality trait prediction model from the correspondence between the personality trait scores of the sample users and the text representation and behavior representation;
the modeling method of the personality characteristic prediction model comprises the following steps:
firstly, constructing a single-dimension classification model: for each dimension of the set psychological scale, three intervals are defined by adding and subtracting a set multiple of the standard deviation to and from the mean of all sample users on that dimension, and each interval is given a class label;
then establishing a prediction model of text characterization, behavior characterization and class labels according to a classification algorithm, wherein the classification algorithm comprises a support vector machine and naive Bayes;
and step four, acquiring the text representation and the behavior representation of the user to be detected, and acquiring the personality characteristics of the user to be detected according to the personality characteristic prediction model.
2. The psychological portrait method for microblog users according to claim 1, wherein in step one the psychological scale is the Big Five personality model and the Dark Triad model, scored with a 1-5 point Likert scale.
3. The psychological portrait method for microblog users according to claim 2, wherein step one further comprises verifying the reliability and validity of the questionnaire with SPSS and Amos software, and discarding questionnaires that do not pass the verification; reliability is checked with the Alpha reliability coefficient; validity is checked by content validity and construct validity.
4. The psychological portrait method for microblog users according to claim 3, wherein step two further comprises a step of cleaning the microblog texts of the sample users by removing, from reposts and comments, content unrelated to the text posted by the user.
5. The psychological portrait method for microblog users according to claim 4, further comprising a step of expanding the LIWC dictionary according to the wording habits of microblog language.
6. The psychological portrait method for microblog users according to claim 5, wherein the indicators related to the personality traits of microblog users comprise numerical indicators and vector indicators; the numerical indicators comprise the following count, the number of posts and the follower count of the microblog user; the vector indicators comprise emoticon usage preference and activity level.
CN201910375599.XA 2019-03-25 2019-05-07 Psychological portrait method facing microblog user Active CN110096575B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910228086 2019-03-25
CN2019102280866 2019-03-25

Publications (2)

Publication Number Publication Date
CN110096575A CN110096575A (en) 2019-08-06
CN110096575B true CN110096575B (en) 2022-02-01

Family

ID=67447174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910375599.XA Active CN110096575B (en) 2019-03-25 2019-05-07 Psychological portrait method facing microblog user

Country Status (1)

Country Link
CN (1) CN110096575B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569405A (en) * 2019-08-26 2019-12-13 中电科大数据研究院有限公司 method for extracting government affair official document ontology concept based on BERT
CN110825842B (en) * 2019-10-10 2022-07-29 北京航空航天大学 Text viewpoint mining method based on different personality characteristics
CN110825824B (en) * 2019-10-16 2023-06-13 天津大学 User relation portrait method based on semantic visual/non-visual user character representation
CN110826322A (en) * 2019-10-22 2020-02-21 中电科大数据研究院有限公司 Method for discovering new words, predicting parts of speech and marking
CN111223235A (en) * 2019-12-27 2020-06-02 合肥美的智能科技有限公司 Commodity putting method of unmanned cabinet, unmanned cabinet and control device of unmanned cabinet
CN111581335B (en) * 2020-05-14 2023-11-24 腾讯科技(深圳)有限公司 Text representation method and device
CN113457122A (en) * 2021-06-28 2021-10-01 华东师范大学 User image drawing method based on VR emergency environment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902566A (en) * 2012-12-26 2014-07-02 中国科学院心理研究所 Personality prediction method based on microblog user behaviors
CN106202047A (en) * 2016-07-15 2016-12-07 国家计算机网络与信息安全管理中心 A kind of character personality depicting method based on microblogging text
CN106649267A (en) * 2016-11-30 2017-05-10 北京邮电大学 Method and system for mining user's large five personality via text topic
CN106874260A (en) * 2017-03-14 2017-06-20 山东师范大学 A kind of network social intercourse text big data processing method and system based on user-oriented dictionary
CN108388876A (en) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 A kind of image-recognizing method, device and relevant device
CN108399575A (en) * 2018-01-24 2018-08-14 大连理工大学 A kind of five-factor model personality prediction technique based on social media text

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7630986B1 (en) * 1999-10-27 2009-12-08 Pinpoint, Incorporated Secure data interchange
US9015084B2 (en) * 2011-10-20 2015-04-21 Gil Thieberger Estimating affective response to a token instance of interest
CN105183876A (en) * 2015-09-21 2015-12-23 清华大学 Psychological pressure value predicting method and system based on microblog
CN105528459B (en) * 2016-01-08 2020-07-14 腾讯科技(深圳)有限公司 Information processing method, server and terminal
CN105701210A (en) * 2016-01-13 2016-06-22 福建师范大学 Microblog theme emotion analysis method based on mixed characteristic calculation
CN105718579B (en) * 2016-01-22 2018-12-18 浙江大学 A kind of information-pushing method excavated based on internet log and User Activity identifies
CN108734338A (en) * 2018-04-24 2018-11-02 阿里巴巴集团控股有限公司 Credit risk forecast method and device based on LSTM models
CN108616545B (en) * 2018-06-26 2021-06-29 中国科学院信息工程研究所 Method and system for detecting network internal threat and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Burst prediction from Weibo: A crowd-sensing and tweet-centric method; Kun Yuan et al.; 2016 13th International Conference on Service Systems and Service Management; 2016-06-26; pp. 1-6 *
Text-based recognition model of depressive emotional tendency (基于文本的抑郁情感倾向识别模型); Shi Zhiwei (施志伟) et al.; 《计算机系统应用》; 2017-12-15; Vol. 26, No. 12; pp. 155-159 *

Also Published As

Publication number Publication date
CN110096575A (en) 2019-08-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant