CN108108354B - Microblog user gender prediction method based on deep learning - Google Patents

Microblog user gender prediction method based on deep learning

Info

Publication number
CN108108354B
Authority
CN
China
Prior art keywords
microblog
layer
word
term memory
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711380014.0A
Other languages
Chinese (zh)
Other versions
CN108108354A (en)
Inventor
张春霞
冉昇
武嘉玉
冯丽霞
牛振东
黄达友
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Publication of CN108108354A publication Critical patent/CN108108354A/en
Application granted granted Critical
Publication of CN108108354B publication Critical patent/CN108108354B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Marketing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Primary Health Care (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a microblog user gender prediction method based on deep learning, and belongs to the field of Web mining and intelligent information processing. The prediction method comprises the following steps: collecting microblog information; preprocessing microblog texts; constructing word vectors of microblog text words; constructing feature vectors of microblog text sentences with a convolutional neural network-based microblog text representation method; and predicting or classifying the gender of microblog users with a method based on a long short-term memory network model. The convolutional neural network-based microblog text representation method requires no manually constructed microblog text features and realizes semantic modeling of microblog texts. The microblog user gender prediction method based on the long short-term memory network can extract semantic sequence dependency features from microblog texts. The method accurately extracts microblog text features, improves the performance of microblog user gender identification, and has wide application prospects in the fields of information recommendation and product marketing.

Description

Microblog user gender prediction method based on deep learning
Technical Field
The invention relates to the field of Web mining and intelligent information processing, in particular to a microblog user gender prediction method based on deep learning.
Background
Microblog user gender prediction is an important research topic in constructing user identity portraits. User identity portrait construction refers to identifying various identity attributes of a user, including gender, age, education level and the like. User identity portrait construction technology can be widely applied in fields such as computer forensics, network public opinion monitoring, and commodity marketing.
Currently, user gender prediction mainly adopts classification methods to identify the gender of a user. Mikros, in "Authorship Attribution and Gender Identification in Greek Blogs" (Methods and Applications of Quantitative Linguistics, 2012), constructed features based on high-frequency words and characters and then used a support vector machine classifier to identify the gender of blog authors. Ansari et al., in "Gender Classification of Blog Authors" (Special Issue of International Journal of Sustainable Development and Green Economics, 2013), extracted part-of-speech features and then used a Bayesian classifier to identify the gender of blog authors. Wang Jing et al., in "Research on gender classification methods for Chinese microblog users" (Journal of Chinese Information Processing, 2014), first built two classifiers based on user information and microblog texts respectively, and then integrated the two classifiers with a Bayesian rule to identify the gender of microblog authors.
Conventional microblog user gender identification methods mainly suffer from the following problems: microblog text features must be constructed manually, and existing microblog text representations mainly adopt a vector space model or a bag-of-words model, which yields sparse, high-dimensional feature vectors.
To address these problems, an efficient microblog user gender identification technology is urgently needed to support the construction of microblog user identity portraits.
Disclosure of Invention
The invention aims to provide a microblog user gender prediction method based on deep learning, aiming at solving the problems in the microblog user gender recognition method. A microblog user gender prediction method based on deep learning comprises a microblog text representation method based on a convolutional neural network and a microblog user gender prediction or classification method based on a Long Short Term Memory network (LSTM). The microblog text representation method based on the convolutional neural network can automatically extract microblog text features. According to the microblog user gender prediction method based on the long-short term memory network, the semantic sequence dependency relationship in the microblog text can be obtained, and therefore the gender of the microblog user can be predicted more accurately.
The purpose of the invention is realized by the following technical scheme.
A microblog user gender prediction method based on deep learning comprises the following steps:
step 1, microblog information acquisition: acquiring a microblog text of a user on a microblog platform by using a web crawler, and storing the microblog text in a computer;
the method comprises the steps of collecting microblog texts of a plurality of microblog users with different genders, and storing the microblog text of each user into an extensible markup language file named by a user ID. In addition, the gender attributes of all microblog users are stored in a file.
Step 2, microblog text preprocessing: performing text extraction, lemmatization, and stop word and punctuation filtering on the microblog texts collected in step 1;
The extensible markup language files collected in step 1 are preprocessed to obtain the microblog text of each microblog user. In addition, the microblog text is lemmatized with the NLTK (Natural Language Toolkit), and stop words and punctuation marks in the microblog text are filtered out.
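As an illustration of this preprocessing step, a minimal Python sketch using NLTK is given below; it assumes English-language tweets, assumes the standard NLTK resources (punkt, wordnet, stopwords) have been downloaded, and uses an illustrative function name that is not part of the patent:

import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Assumes nltk.download('punkt'), nltk.download('wordnet') and
# nltk.download('stopwords') have been run beforehand.
LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))

def preprocess_microblog_text(text):
    """Tokenize a microblog sentence, lemmatize it, and drop stop words and punctuation."""
    tokens = word_tokenize(text.lower())
    kept = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        if all(ch in string.punctuation for ch in token):
            continue
        kept.append(LEMMATIZER.lemmatize(token))
    return kept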
Step 3, constructing word vectors of microblog text words: and taking the microblog text as input, and mapping all words in the microblog text sentences into word vectors through an input mapping layer of a microblog text representation model convolutional neural network.
For each word of a sentence in the microblog text, a k-dimensional vector of the current word is obtained with a word vector model, where k is a positive integer. The word vector model is either Word2Vec from Google or GloVe from Stanford University. If the current word is not contained in the word vector set constructed by the word vector model, a k-dimensional vector of the current word is generated by a random method.
For a sentence w_1 w_2 w_3 … w_m of the microblog text, where w_i represents a word, 1 ≤ i ≤ m, and m is a positive integer, let the word vector of w_1 be <x_11, x_12, …, x_1n> with n a positive integer, the word vector of w_2 be <x_21, x_22, …, x_2n>, …, and the word vector of w_m be <x_m1, x_m2, …, x_mn>. The initial feature vector of the sentence is then constructed as:

[ x_11  x_12  …  x_1n ]
[ x_21  x_22  …  x_2n ]
[  …     …        …   ]
[ x_m1  x_m2  …  x_mn ]
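This input mapping can be sketched as follows, assuming pre-trained vectors (Word2Vec or GloVe) loaded through gensim's KeyedVectors; the 100-dimensional setting, the uniform range of the random fallback, and the file path are illustrative assumptions, not values fixed by the patent:

import numpy as np
from gensim.models import KeyedVectors

def sentence_matrix(words, vectors, k=100, seed=0):
    """Map a tokenized sentence of m words to an m x k matrix of word vectors.

    Words missing from the pre-trained vocabulary receive a random k-dimensional
    vector, as described in step 3.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for w in words:
        if w in vectors:                    # known word: use its pre-trained vector
            rows.append(np.asarray(vectors[w], dtype=np.float32))
        else:                               # out-of-vocabulary word: random vector
            rows.append(rng.uniform(-0.25, 0.25, size=k).astype(np.float32))
    return np.vstack(rows)

# vectors = KeyedVectors.load_word2vec_format('word_vectors.txt')  # hypothetical path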
and 4, constructing a feature vector of the microblog text sentence by adopting a convolutional neural network-based microblog text representation method.
The convolutional neural network includes the input mapping layer of step 3, as well as convolutional and pooling layers.
And 4.1, performing a convolution operation on the word vectors generated in step 3 through a convolution layer of the convolutional neural network of the microblog text representation model to generate a feature map (Feature Map) of the microblog text sentence.
For a convolution kernel with a window length of h, the convolution operation is performed on h consecutive words, i.e.

c_i = f(w · v_{i:i+h-1} + b)

where w and b are parameters, v_{i:i+h-1} denotes the concatenation of the word vectors from the i-th word to the (i+h-1)-th word, and the function f denotes the activation function.
For example, the activation function may be the ReLU function f(x) = max{0, x}; that is, f(x) is the greater of 0 and x, where x is the input to the activation function.
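A plain numpy sketch of this convolution for a single kernel is given below; the variable names mirror the formula above and are illustrative:

import numpy as np

def relu(x):
    """ReLU activation: f(x) = max{0, x}."""
    return np.maximum(0.0, x)

def convolve_sentence(X, w, b, h):
    """Apply one convolution kernel of window length h to a sentence matrix.

    X is the m x k matrix of word vectors (one row per word), w is a kernel of
    length h * k, and b is a scalar bias.  Returns (c_1, ..., c_{m-h+1}) with
    c_i = relu(w . v_{i:i+h-1} + b).
    """
    m, k = X.shape
    features = []
    for i in range(m - h + 1):
        v = X[i:i + h].reshape(-1)    # concatenation of h consecutive word vectors
        features.append(relu(np.dot(w, v) + b))
    return np.array(features)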
Step 4.2, extracting the salient features of the microblog text sentences through the pooling layer of the convolutional neural network in the microblog text representation model, and generating the feature vectors of the microblog text sentences;
The pooling layer realizes feature selection for the feature vectors of the microblog text sentences through a pooling operation. The pooling operation integrates a max pooling operation and an average pooling operation.
Let the feature map of the microblog text sentence generated in step 4.1 be:

[ y_11  y_12  …  y_1s ]
[ y_21  y_22  …  y_2s ]
[  …     …        …   ]
[ y_r1  y_r2  …  y_rs ]

where y_ij represents the result of the convolution operation of the j-th convolution kernel on the word vectors from the i-th word to the (i+h-1)-th word, h is the window length of the convolution kernel, and r and s are positive integers. The average pooling operation is:

( (1/s) Σ_{j=1..s} y_1j, (1/s) Σ_{j=1..s} y_2j, …, (1/s) Σ_{j=1..s} y_rj )

The max pooling operation is:

( max{y_11, y_12, …, y_1s}, max{y_21, y_22, …, y_2s}, …, max{y_r1, y_r2, …, y_rs} )
The integrated result of the max pooling operation and the average pooling operation is:

( max{y_11, …, y_1s}, …, max{y_r1, …, y_rs}, (1/s) Σ_{j=1..s} y_1j, …, (1/s) Σ_{j=1..s} y_rj )
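The combined pooling step can be sketched as below; following the description above, max pooling and average pooling are applied row-wise to the feature map, and reading "integrated" as the concatenation of the two pooled vectors is an assumption of this sketch:

import numpy as np

def integrated_pooling(Y):
    """Row-wise max pooling and average pooling of a feature map, concatenated.

    Y is the r x s feature map with Y[i, j] the output of the j-th kernel at the
    i-th word window.  Returns a vector of length 2r.
    """
    max_pooled = Y.max(axis=1)     # max pooling over each row
    avg_pooled = Y.mean(axis=1)    # average pooling over each row
    return np.concatenate([max_pooled, avg_pooled])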
and 5, predicting the gender of the microblog user by adopting a method based on a long-short term memory network model.
The long-short term memory network model comprises a sequence generation layer, a bidirectional long-short term memory network layer and a classification layer.
And 5.1, using the feature vector of the microblog text sentence generated in the step 3 as input, regenerating the feature vector of the microblog text sentence through a sequence generation layer in a gender prediction method based on the long-short term memory network model, and using the feature vector as the input of the bidirectional long-short term memory network layer in the step 5.2.
The sequence generation layer sequentially comprises a first convolution layer, a second pooling layer, a third convolution layer and a fourth pooling layer. (1) In the first convolutional layer, convolution is performed using 64 convolution kernels with a window length of 2 and a step size of 1. (2) In the second pooling layer, pooling is performed using a pooling window having a window length of 2 and a step size of 1. (3) In the third convolutional layer, 64 convolutional kernels with a window length of 3 and a step size of 1 are used for convolution. (4) And in the fourth pooling layer, pooling is carried out by using a pooling window with the window length of 3 and the step length of 1, so as to generate the feature vector of the microblog text sentence.
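The sequence generation layer can be sketched in Keras as follows; mapping "window length" to kernel_size/pool_size, the use of max pooling (the pooling type is not specified above), and the activation and input shape are assumptions of this sketch:

from tensorflow import keras
from tensorflow.keras import layers

def build_sequence_generation_layer(max_words=200, embedding_dim=100):
    """Conv(64, window 2) -> Pool(window 2) -> Conv(64, window 3) -> Pool(window 3), all stride 1."""
    inputs = keras.Input(shape=(max_words, embedding_dim))
    x = layers.Conv1D(64, kernel_size=2, strides=1, activation='relu')(inputs)
    x = layers.MaxPooling1D(pool_size=2, strides=1)(x)
    x = layers.Conv1D(64, kernel_size=3, strides=1, activation='relu')(x)
    x = layers.MaxPooling1D(pool_size=3, strides=1)(x)
    return keras.Model(inputs, x, name='sequence_generation_layer')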
And 5.2, taking the feature vector of the microblog text sentence generated in the step 5.1 as the input of a bidirectional long-short term memory network layer in the gender prediction method based on the long-short term memory network model, and regenerating the feature vector of the microblog text sentence by the bidirectional long-short term memory network layer by capturing the semantic sequence dependency relationship in the microblog text sentence.
The input of the bidirectional long short-term memory network layer is the feature vector sequence v_1, v_2, …, v_t of all sentences of the microblog text generated in step 5.1, where t is a positive integer. The feature vector sequence v_1, v_2, …, v_t can be regarded as a time series, with vector v_i as the input state of time step i; the bidirectional long short-term memory network layer generates an output state for each time step.
If the feature vector sequence v_1, v_2, …, v_n is input into a long short-term memory network layer in the order v_1, v_2, …, v_n, that layer is called a forward long short-term memory network layer. If the feature vector sequence v_1, v_2, …, v_n is input into a long short-term memory network layer in the order v_n, v_{n-1}, …, v_2, v_1, that layer is called a reverse long short-term memory network. Let the feature vector sequence v_1, v_2, …, v_n be input into the first long short-term memory network layer in the order v_1, v_2, …, v_n, and let the output vector sequence be w_1, w_2, …, w_n. The vector sequence w_1, w_2, …, w_n is then input into the second long short-term memory network layer in the order w_n, w_{n-1}, …, w_2, w_1; the two layers together are called a bidirectional long short-term memory network.
Further, for the output vector sequence of the second long short-term memory network layer in the bidirectional long short-term memory network, the output state at the last time step is taken as the output state of the bidirectional long short-term memory network.
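The two-layer arrangement described above can be sketched in Keras as follows: a first LSTM reads the sequence forward and returns its full output sequence, and a second LSTM consumes that sequence in reverse order, with its final output state used as the sentence feature. The 32 units match the example value of q in step 5.3, and using go_backwards for the reversed second pass is an implementation assumption:

from tensorflow import keras
from tensorflow.keras import layers

def build_bidirectional_lstm_layer(steps, features, units=32):
    """First LSTM reads v_1..v_n forward; second LSTM reads its outputs w_n..w_1 in reverse."""
    inputs = keras.Input(shape=(steps, features))
    w = layers.LSTM(units, return_sequences=True)(inputs)   # forward layer, outputs w_1..w_n
    u = layers.LSTM(units, go_backwards=True)(w)            # reverse layer, returns its last output state
    return keras.Model(inputs, u, name='bidirectional_lstm_layer')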
And 5.3, combining the feature vectors of the microblog text sentences constructed in the step 4 and the step 5.2.
Let the feature vector of the microblog text sentence constructed in step 4 be <a_1, a_2, …, a_p>, where p is a positive integer and is a parameter set by the bidirectional long short-term memory network layer. For example, p may take the value 70. Let the feature vector of the microblog text sentence constructed in step 5.2 be <b_1, b_2, …, b_q>, where q is a positive integer. For example, q may take the value 32. The two feature vectors are merged into <a_1, a_2, …, a_p, b_1, b_2, …, b_q>, which serves as the input vector of the classification layer in step 5.4.
And 5.4, entering a classification layer in the gender prediction model based on the long-term and short-term memory network. The classification layer is composed of a fully connected neural network. And the classification layer inputs the feature vectors of the microblog text sentences constructed in the step 5.3 and outputs the feature vectors as gender classifications of microblog users, wherein the gender classifications include male and female categories.
The fully-connected neural network is formed by connecting a plurality of neurons. A single neuron receives a vector as input, computes a weighted sum, and applies an activation function to obtain its output.
The activation function is the ReLU function f(x) = max{0, x}; that is, f(x) is the greater of 0 and x, where x is the input to the activation function.
The fully-connected neural network can be constructed by connecting a plurality of neurons in layers, so that the output of each neuron in the upper layer is used as an input of each neuron in the lower layer.
For predicting the gender of the microblog user, the output vector of the fully-connected neural network is <p_1, p_2>, where p_1 represents the probability that the predicted result is female and p_2 the probability that it is male. If p_1 > p_2, the gender prediction result for the microblog user is female; otherwise, it is male.
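The classification layer can be sketched as follows: the merged feature vector <a_1, …, a_p, b_1, …, b_q> is passed through a fully-connected ReLU layer and a two-way output giving <p_1, p_2>; the hidden-layer size and the use of softmax for the two output probabilities are assumptions of this sketch:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_classification_layer(input_dim):
    """Fully-connected classifier producing <p1 (female), p2 (male)>."""
    inputs = keras.Input(shape=(input_dim,))
    hidden = layers.Dense(64, activation='relu')(inputs)      # hidden size 64 is an assumption
    outputs = layers.Dense(2, activation='softmax')(hidden)   # <p1, p2>
    return keras.Model(inputs, outputs, name='classification_layer')

def predict_gender(model, feature_vector):
    """Return 'female' if p1 > p2, otherwise 'male'."""
    p1, p2 = model.predict(feature_vector[np.newaxis, :], verbose=0)[0]
    return 'female' if p1 > p2 else 'male'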
Thus, the whole process of the method is completed.
Advantageous effects
The invention addresses the problems of existing microblog user gender identification methods: microblog text features must be constructed manually, and conventional microblog text representations, which mainly adopt a vector space model or a bag-of-words model, suffer from sparse, high-dimensional feature vectors. To this end, a microblog user gender prediction method based on deep learning is provided. The method comprises a microblog text representation method based on a convolutional neural network and a microblog user gender prediction method based on a long short-term memory network, and improves the performance of microblog user gender identification. The concrete aspects are as follows:
(1) The microblog text representation method based on the convolutional neural network automatically constructs the feature vectors of the words and sentences of microblog texts without manually constructing microblog text features, and realizes semantic modeling of microblog texts.
(2) According to the microblog user gender prediction method based on the long short-term memory network, on one hand, the long short-term memory network can extract semantic sequence dependency relations in microblog text sentences and capture implicit features of microblog texts. On the other hand, compared with the traditional recurrent neural network, the long short-term memory network effectively avoids the vanishing gradient problem, in which gradient values become very small during backpropagation when the input sequence is too long, making it difficult for the model to converge. Therefore, the microblog user gender prediction method based on the long short-term memory network improves the performance of microblog user gender identification.
(3) According to the invention, the microblog text representation based on the convolutional neural network and the microblog text representation based on the long short-term memory network are combined into the feature representation of the microblog text, so that both the local features and the semantic dependency features of the microblog text are extracted. In addition, the fully-connected neural network, which has strong fitting capability, is used as the classification layer, effectively solving the microblog user gender prediction problem.
Drawings
Fig. 1 is a schematic flow chart of a microblog user gender prediction method based on deep learning according to an embodiment of the invention.
Detailed Description
According to the technical scheme, the following describes a preferred embodiment of the invention in detail with reference to the accompanying drawings and examples.
Example 1
Step 1, microblog information acquisition: acquiring a microblog text of a user on a microblog platform by using a web crawler, and storing the microblog text in a computer;
the method comprises the steps of collecting microblog texts of a plurality of microblog users with different genders, and storing the microblog text of each user into an extensible markup language file named by a user ID. In addition, the gender attributes of all microblog users are stored in a file.
For example, for the microblog platform Twitter, a web crawler is used to collect the Twitter text of a microblog user, namely the microblog text. The microblog text of the user with ID "1a4a60942a15426c9a7ec3764e7d0ede" is saved to the file "1a4a60942a15426c9a7ec3764e7d0ede.xml" in the form:
[The XML file layout is shown as an image in the original publication.]
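Since the XML layout itself appears only as an image in the source, the sketch below uses a hypothetical structure consistent with the surrounding description (a single user element carrying the user ID, with one child element per collected tweet); the element and attribute names are assumptions, not the patent's schema:

import xml.etree.ElementTree as ET

def save_user_tweets(user_id, tweets, path):
    """Write one user's collected microblog texts to an XML file named by the user ID."""
    root = ET.Element('user', id=user_id)
    for tweet in tweets:
        ET.SubElement(root, 'text').text = tweet
    ET.ElementTree(root).write(path, encoding='utf-8', xml_declaration=True)

# Example call with the user ID from the text above (hypothetical tweet list):
# save_user_tweets('1a4a60942a15426c9a7ec3764e7d0ede', tweets,
#                  '1a4a60942a15426c9a7ec3764e7d0ede.xml')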
Step 2, microblog text preprocessing: performing text extraction, lemmatization, and stop word and punctuation filtering on the microblog texts collected in step 1;
The extensible markup language files collected in step 1 are preprocessed to obtain the microblog text of each microblog user. In addition, the microblog text is lemmatized with the NLTK (Natural Language Toolkit), and stop words and punctuation marks in the microblog text are filtered out.
For example, the file "1a4a60942a15426c9a7ec3764e7d0ede.xml" collected in step 1 is preprocessed to obtain the microblog text "@Michael_J_Parry can't com" on the microblog text but the microblog girl in me a bit registration of reacted". Lemmatization is then performed on the microblog text, with the following result: "Michael_J_Parry can not comment on the transpositional bit in me be bit restriction on 'f' nd".
Step 3, constructing word vectors of microblog text words: and taking the microblog text as input, and mapping all words in the microblog text sentences into word vectors through an input mapping layer of a microblog text representation model convolutional neural network.
For each word of a sentence in the microblog text, a k-dimensional vector of the current word is obtained with a word vector model, where k is a positive integer. The word vector model is either Word2Vec from Google or GloVe from Stanford University. If the current word is not contained in the word vector set constructed by the word vector model, a k-dimensional vector of the current word is generated by a random method.
For a sentence w_1 w_2 w_3 … w_m of the microblog text, where w_i represents a word, 1 ≤ i ≤ m, and m is a positive integer, let the word vector of w_1 be <x_11, x_12, …, x_1n> with n a positive integer, the word vector of w_2 be <x_21, x_22, …, x_2n>, …, and the word vector of w_m be <x_m1, x_m2, …, x_mn>. The initial feature vector of the sentence is then constructed as:

[ x_11  x_12  …  x_1n ]
[ x_21  x_22  …  x_2n ]
[  …     …        …   ]
[ x_m1  x_m2  …  x_mn ]
for example, for The sentence "The quick brown fox jumps over The lazy dog", 100-dimensional vectors for each word are generated by a word vector model and a stochastic method, and are stacked to form a 100 × 9 matrix.
For example, the 100-dimensional word vector for the word "dog" is of the form: <0.50779, -1.0274, 0.48136, -0.09417, 0.44837, -0.52291, 0.51498, -0.038927, 0.35867, -0.065994, -0.82882, 0.76179, -3.803, -0.010576, 0.21654, 0.59712, 0.37424, -0.022629, -0.010331, -0.33966, …, 0.67659, -0.071224, 0.17458, -0.033406, 0.73152>.
And 4, constructing a feature vector of the microblog text sentence by adopting a convolutional neural network-based microblog text representation method.
The convolutional neural network includes the input mapping layer of step 3, as well as convolutional and pooling layers.
And 4.1, performing a convolution operation on the word vectors generated in step 3 through a convolution layer of the convolutional neural network of the microblog text representation model to generate a feature map (Feature Map) of the microblog text sentence.
For a convolution kernel with a window length of h, the convolution operation is performed on h consecutive words, i.e.

c_i = f(w · v_{i:i+h-1} + b)

where w and b are parameters, v_{i:i+h-1} denotes the concatenation of the word vectors from the i-th word to the (i+h-1)-th word, and the function f denotes the activation function.
For example, the activation function may be the ReLU function f(x) = max{0, x}; that is, f(x) is the greater of 0 and x, where x is the input to the activation function.
For example, a convolution kernel of 3 × 100 means that the window length of the convolution kernel is 3 and the convolution operation is performed on word vectors of dimension 100. Assuming that the maximum number of words in a sentence is 200, and selecting 32 such 3 × 100 convolution kernels with a step size of 1, a feature map with dimensions of 32 × 198 can be generated, expressed as:

( y_ij ), 1 ≤ i ≤ 198, 1 ≤ j ≤ 32

where y_ij represents the result of the convolution operation of the j-th convolution kernel on the word vectors from the i-th word to the (i+2)-th word.
Step 4.2, extracting the salient features of the microblog text sentences through the pooling layer of the convolutional neural network in the microblog text representation model, and generating the feature vectors of the microblog text sentences;
The pooling layer realizes feature selection for the feature vectors of the microblog text sentences through a pooling operation. The pooling operation integrates a max pooling operation and an average pooling operation.
Let the feature map of the microblog text sentence generated in step 4.1 be:

[ y_11  y_12  …  y_1s ]
[ y_21  y_22  …  y_2s ]
[  …     …        …   ]
[ y_r1  y_r2  …  y_rs ]

where y_ij represents the result of the convolution operation of the j-th convolution kernel on the word vectors from the i-th word to the (i+h-1)-th word, h is the window length of the convolution kernel, and r and s are positive integers. The average pooling operation is:

( (1/s) Σ_{j=1..s} y_1j, (1/s) Σ_{j=1..s} y_2j, …, (1/s) Σ_{j=1..s} y_rj )

The max pooling operation is:

( max{y_11, y_12, …, y_1s}, max{y_21, y_22, …, y_2s}, …, max{y_r1, y_r2, …, y_rs} )
The integrated result of the max pooling operation and the average pooling operation is:

( max{y_11, …, y_1s}, …, max{y_r1, …, y_rs}, (1/s) Σ_{j=1..s} y_1j, …, (1/s) Σ_{j=1..s} y_rj )
and 5, predicting the gender of the microblog user by adopting a method based on a long-short term memory network model.
The long-short term memory network model comprises a sequence generation layer, a bidirectional long-short term memory network layer and a classification layer.
And 5.1, using the feature vector of the microblog text sentence generated in the step 3 as input, regenerating the feature vector of the microblog text sentence through a sequence generation layer in a gender prediction method based on the long-short term memory network model, and using the feature vector as the input of the bidirectional long-short term memory network layer in the step 5.2.
The sequence generation layer sequentially comprises a first convolution layer, a second pooling layer, a third convolution layer and a fourth pooling layer. (1) In the first convolutional layer, convolution is performed using 64 convolution kernels with a window length of 2 and a step size of 1. (2) In the second pooling layer, pooling is performed using a pooling window having a window length of 2 and a step size of 1. (3) In the third convolutional layer, 64 convolutional kernels with a window length of 3 and a step size of 1 are used for convolution. (4) And in the fourth pooling layer, pooling is carried out by using a pooling window with the window length of 3 and the step length of 1, so as to generate the feature vector of the microblog text sentence.
And 5.2, taking the feature vector of the microblog text sentence generated in the step 5.1 as the input of a bidirectional long-short term memory network layer in the gender prediction method based on the long-short term memory network model, and regenerating the feature vector of the microblog text sentence by the bidirectional long-short term memory network layer by capturing the semantic sequence dependency relationship in the microblog text sentence.
The input of the bidirectional long short-term memory network layer is the feature vector sequence v_1, v_2, …, v_t of all sentences of the microblog text generated in step 5.1, where t is a positive integer. The feature vector sequence v_1, v_2, …, v_t can be regarded as a time series, with vector v_i as the input state of time step i; the bidirectional long short-term memory network layer generates an output state for each time step.
If the feature vector sequence v_1, v_2, …, v_n is input into a long short-term memory network layer in the order v_1, v_2, …, v_n, that layer is called a forward long short-term memory network layer. If the feature vector sequence v_1, v_2, …, v_n is input into a long short-term memory network layer in the order v_n, v_{n-1}, …, v_2, v_1, that layer is called a reverse long short-term memory network. Let the feature vector sequence v_1, v_2, …, v_n be input into the first long short-term memory network layer in the order v_1, v_2, …, v_n, and let the output vector sequence be w_1, w_2, …, w_n. The vector sequence w_1, w_2, …, w_n is then input into the second long short-term memory network layer in the order w_n, w_{n-1}, …, w_2, w_1; the two layers together are called a bidirectional long short-term memory network.
Further, for the output vector sequence of the second long short-term memory network layer in the bidirectional long short-term memory network, the output state at the last time step is taken as the output state of the bidirectional long short-term memory network.
And 5.3, combining the feature vectors of the microblog text sentences constructed in the step 4 and the step 5.2.
Let the feature vector of the microblog text sentence constructed in step 4 be <a_1, a_2, …, a_p>, where p is a positive integer and is a parameter set by the bidirectional long short-term memory network layer. For example, p may take the value 70. Let the feature vector of the microblog text sentence constructed in step 5.2 be <b_1, b_2, …, b_q>, where q is a positive integer. For example, q may take the value 32. The two feature vectors are merged into <a_1, a_2, …, a_p, b_1, b_2, …, b_q>, which serves as the input vector of the classification layer in step 5.4.
And 5.4, entering a classification layer in the gender prediction model based on the long-term and short-term memory network. The classification layer is composed of a fully connected neural network. And the classification layer inputs the feature vectors of the microblog text sentences constructed in the step 5.3 and outputs the feature vectors as gender classifications of microblog users, wherein the gender classifications include male and female categories.
The fully-connected neural network is formed by connecting a plurality of neurons. A single neuron receives a vector as input, computes a weighted sum, and applies an activation function to obtain its output.
The activation function may be the ReLU function f(x) = max{0, x}; that is, f(x) is the greater of 0 and x, where x is the input to the activation function.
The fully-connected neural network can be constructed by connecting a plurality of neurons in layers, so that the output of each neuron in the upper layer is used as an input of each neuron in the lower layer.
For predicting the gender of the microblog user, the output vector of the fully-connected neural network is <p_1, p_2>, where p_1 represents the probability that the predicted result is female and p_2 the probability that it is male. If p_1 > p_2, the gender prediction result for the microblog user is female; otherwise, it is male.
Thus, the whole process of the method is completed.
To illustrate the gender prediction performance for microblog users, two methods are compared on the same training set and test set under the same conditions. The first method combines the microblog text representation based on the convolutional neural network with a microblog user gender prediction method based on logistic regression. The second method is the microblog user gender prediction method based on deep learning of the invention. The evaluation index is accuracy (Accuracy), computed as:

Accuracy = N_1 / N_2

where N_1 is the number of microblog users whose gender is correctly identified, and N_2 is the total number of microblog users whose gender is identified.
The microblog user gender prediction results are as follows: the accuracy of the first method is about 63%, and that of the method of the invention is about 71%. The experiments show the effectiveness of the proposed microblog user gender prediction method based on deep learning.
While the foregoing describes the preferred embodiment of the present invention, the invention is not limited to the embodiment and the drawings disclosed herein. Equivalents and modifications made without departing from the spirit of the disclosure are considered to be within the scope of the invention.

Claims (6)

1. A microblog user gender prediction method based on deep learning is characterized by comprising the following steps: the method comprises the following steps:
step 1, microblog information acquisition: aiming at a Twitter webpage, collecting Twitter texts of microblog users, namely microblog texts by using a web crawler, and storing the microblog texts into a local computer;
step 2, microblog text preprocessing: performing text extraction, lemmatization, and stop word and punctuation filtering on the microblog texts acquired in the microblog information acquisition step 1;
step 3, vectorization representation of microblog text words: the method comprises the following steps of taking a microblog text as an input, mapping all words in a microblog text sentence into word vectors through an input mapping layer of a microblog text representation model convolutional neural network, and specifically comprises the following steps:
for each word of a sentence in the microblog text, acquiring a k-dimensional vector of the current word by using a word vector model; if the current word is not contained in the word vector set constructed by the word vector model, generating a k-dimensional vector of the current word by a random method;
step 4, constructing feature vector representation of the microblog text sentence by adopting a convolutional neural network-based microblog text representation method, which specifically comprises the following steps:
step 4.1, carrying out convolution operation on the word vectors generated in the step 3 through a microblog text representation model convolution neural network to generate a feature map representation of a microblog text sentence;
step 4.2, extracting significant features of microblog text sentences through a pooling layer of a microblog text representation model convolutional neural network, and generating feature vector representations of the microblog text sentences;
step 5, adopting a gender classification model based on a long-short term memory network to predict the gender of the microblog user, specifically comprising the following steps:
step 5.1, using the feature vector representation of the microblog text sentences generated in the step 3 as input, and regenerating the feature vector representation of the microblog text sentences by adopting a sequence generation layer in a gender classification model based on a long-short term memory network as input of a bidirectional long-short term memory network layer in the step 5.2;
the long-short term memory network model comprises a sequence generation layer, a bidirectional long-short term memory network layer and a classification layer; the sequence generation layer sequentially comprises a convolution layer, a pooling layer, a convolution layer and a pooling layer;
step 5.2, representing the feature vectors of the microblog text sentences generated in the step 5.1 as input of a bidirectional long-short term memory network layer in the gender classification model based on the long-short term memory network, wherein the bidirectional long-short term memory network layer constructs the feature vectors of the microblog text sentences by capturing semantic sequence dependency relations in the microblog text sentences;
step 5.3, combining the feature vectors of the microblog text sentences constructed in the step 4 and the step 5.2;
step 5.4, entering a classification layer in the gender classification model based on the long-term and short-term memory network, wherein the classification layer is formed by a fully-connected neural network;
the classification layer inputs the feature vectors of the microblog text sentences constructed in the step 5.3 and outputs the feature vectors as gender classifications of microblog users, wherein the gender classifications include male and female categories;
the fully-connected neural network is formed by connecting a plurality of neural elements of the neural network, and a single neural element receives a vector as input, sums and applies an activation function to obtain the output of the single neural element;
the fully-connected neural network can be constructed by connecting a plurality of neurons in a layered manner so that the output of each neuron on the upper layer is used as the input of each neuron on the lower layer;
for predicting the gender of the microblog user, the output vector of the fully-connected neural network is (p_0, p_1), where p_0 represents the probability that the predicted result is female and p_1 represents the probability that the predicted result is male.
2. The microblog user gender prediction method based on deep learning according to claim 1, characterized by comprising the following steps: the step 1 is realized by the following processes:
the method comprises the steps of collecting microblog texts of a plurality of microblog users with different genders, storing the microblog text of each user into an extensible markup language file named by a user ID, and simultaneously storing gender attributes of all the microblog users into one file.
3. The microblog user gender prediction method based on deep learning according to claim 2, characterized in that: the step 2 is realized by the following processes:
preprocessing the extensible markup language file acquired in the step 1 to obtain a microblog text of each microblog user;
in addition, performing lemmatization on the microblog text with the NLTK tool, and filtering out stop words and punctuation marks in the microblog text;
wherein NLTK denotes the Natural Language Toolkit.
4. The microblog user gender prediction method based on deep learning according to claim 3, wherein the microblog user gender prediction method comprises the following steps: k in step 3 is a positive integer; the word vector model includes Word2Vec of Google or GloVe of Stanford University;
for a sentence w_1, w_2, w_3, …, w_m of the microblog text, where w_i represents a word, let the word vector of w_1 be (x_11, x_12, …, x_1n), the word vector of w_2 be (x_21, x_22, …, x_2n), …, and the word vector of w_m be (x_m1, x_m2, …, x_mn); a vector representation of the sentence is then constructed as:

[ x_11  x_12  …  x_1n ]
[ x_21  x_22  …  x_2n ]
[  …     …        …   ]
[ x_m1  x_m2  …  x_mn ]

wherein the value range of the subscript i in w_i is 1 ≤ i ≤ n.
5. The microblog user gender prediction method based on deep learning according to claim 4, wherein the microblog user gender prediction method comprises the following steps: step 4.1 is specifically: for a convolution kernel with a window length of h, the convolution operation is performed on h consecutive words, i.e.

c_i = f(w · v_{i:i+h-1} + b)

where w and b are parameters, v_{i:i+h-1} denotes the concatenation of the word vectors from the i-th word to the (i+h-1)-th word, and the function f denotes the activation function;
in step 4.2, the pooling layer realizes feature selection for the feature vectors of the microblog text sentences through a pooling operation, and the pooling operation integrates a max pooling operation and an average pooling operation;
let the feature map of the microblog text sentence generated in step 4.1 be:

[ y_11  y_12  …  y_1s ]
[ y_21  y_22  …  y_2s ]
[  …     …        …   ]
[ y_r1  y_r2  …  y_rs ]

where y_ij represents the result of the convolution operation of the j-th convolution kernel on the word vectors from the i-th word to the (i+h-1)-th word, and h is the window length of the convolution kernel; the average pooling operation is:

( (1/s) Σ_{j=1..s} y_1j, (1/s) Σ_{j=1..s} y_2j, …, (1/s) Σ_{j=1..s} y_rj )

the max pooling operation is:

( max{y_11, y_12, …, y_1s}, max{y_21, y_22, …, y_2s}, …, max{y_r1, y_r2, …, y_rs} )

and the integrated result of the max pooling operation and the average pooling operation is:
( max{y_11, …, y_1s}, …, max{y_r1, …, y_rs}, (1/s) Σ_{j=1..s} y_1j, …, (1/s) Σ_{j=1..s} y_rj )
6. the microblog user gender prediction method based on deep learning according to claim 5, wherein the microblog user gender prediction method comprises the following steps:
the sequence generation layer in the step 5.1 sequentially comprises a first convolution layer, a second pooling layer, a third convolution layer and a fourth pooling layer;
(1) in the first convolution layer, performing convolution by using 64 convolution kernels with the window length of 2 and the step length of 1;
(2) in the second layer of the pooling layer, pooling is carried out by using a pooling window with the window length of 2 and the step length of 1;
(3) in the third convolutional layer, performing convolution by using 64 convolution kernels with the window length of 3 and the step length of 1;
(4) in the fourth pooling layer, pooling is carried out by using a pooling window with the window length of 3 and the step length of 1, and a feature vector representation of a microblog text sentence is generated;
the input of the bidirectional long short-term memory network layer in step 5.2 is the feature vector sequence v_1, v_2, …, v_n of all sentences of the microblog text generated in step 5.1; the feature vector sequence v_1, v_2, …, v_n can be regarded as a time series, with vector v_i as the input state of time step i, and the bidirectional long short-term memory network layer generates an output state for each time step;
if the feature vector sequence v_1, v_2, …, v_n is input into a long short-term memory network layer in the order v_1, v_2, …, v_n, that layer is called a forward long short-term memory network layer;
if the feature vector sequence v_1, v_2, …, v_n is input into a long short-term memory network layer in the order v_n, v_{n-1}, …, v_2, v_1, that layer is called a reverse long short-term memory network;
if the feature vector sequence v_1, v_2, …, v_n is input into the first long short-term memory network layer in the order v_1, v_2, …, v_n, let the output vector sequence be t_1, t_2, …, t_n;
further, the vector sequence t_1, t_2, …, t_n is input into the second long short-term memory network layer in the order t_n, t_{n-1}, …, t_2, t_1, and the output vector sequence is set as u_1, u_2, …, u_n; this arrangement is called a bidirectional long short-term memory network;
further, with the bidirectional long short-term memory network, for the output vector sequence of the second long short-term memory network layer, the output state u_n at the last time step is taken as the output state of the bidirectional long short-term memory network;
step 5.3, specifically:
let the feature vector of the microblog text sentence constructed in step 4 be (a_1, a_2, …, a_p);
where p is a parameter set by the bidirectional long short-term memory network layer; let the feature vector of the microblog text sentence constructed in step 5.2 be (b_1, b_2, …, b_q);
the two feature vectors are merged into (a_1, a_2, …, a_p, b_1, b_2, …, b_q), which serves as the input vector of the classification layer in step 5.4;
in step 5.4, a single neuron in the fully-connected neural network receives a vector as input, sums, and applies an activation function to obtain the output of the single neuron, specifically: the activation function is the ReLU function f(x) = max{0, x}; that is, f(x) is the greater of 0 and x, where x is the input to the activation function.
CN201711380014.0A 2017-06-18 2017-12-20 Microblog user gender prediction method based on deep learning Active CN108108354B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2017104608043 2017-06-18
CN201710460804 2017-06-18

Publications (2)

Publication Number Publication Date
CN108108354A CN108108354A (en) 2018-06-01
CN108108354B true CN108108354B (en) 2021-04-06

Family

ID=62211311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711380014.0A Active CN108108354B (en) 2017-06-18 2017-12-20 Microblog user gender prediction method based on deep learning

Country Status (1)

Country Link
CN (1) CN108108354B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191668B (en) * 2018-11-15 2023-04-28 零氪科技(北京)有限公司 Method for identifying disease content in medical record text
CN109697288B (en) * 2018-12-25 2020-09-15 北京理工大学 Instance alignment method based on deep learning
CN109918649B (en) * 2019-02-01 2023-08-11 杭州师范大学 Suicide risk identification method based on microblog text
CN110196945B (en) * 2019-05-27 2021-10-01 北京理工大学 Microblog user age prediction method based on LSTM and LeNet fusion
CN110275953B (en) * 2019-06-21 2021-11-30 四川大学 Personality classification method and apparatus
CN112200197A (en) * 2020-11-10 2021-01-08 天津大学 Rumor detection method based on deep learning and multi-mode
CN112487406B (en) * 2020-12-02 2022-05-31 中国电子科技集团公司第三十研究所 Network behavior analysis method based on machine learning
CN115186095B (en) * 2022-09-13 2022-12-13 广州趣丸网络科技有限公司 Juvenile text recognition method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295507A (en) * 2016-07-25 2017-01-04 华南理工大学 A kind of gender identification method based on integrated convolutional neural networks
CN106570148A (en) * 2016-10-27 2017-04-19 浙江大学 Convolutional neutral network-based attribute extraction method
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
CN106611055A (en) * 2016-12-27 2017-05-03 大连理工大学 Chinese hedge scope detection method based on stacked neural network
CN106845373A (en) * 2017-01-04 2017-06-13 天津大学 Towards pedestrian's attribute forecast method of monitor video

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10783900B2 (en) * 2014-10-03 2020-09-22 Google Llc Convolutional, long short-term memory, fully connected deep neural networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295507A (en) * 2016-07-25 2017-01-04 华南理工大学 A kind of gender identification method based on integrated convolutional neural networks
CN106570148A (en) * 2016-10-27 2017-04-19 浙江大学 Convolutional neutral network-based attribute extraction method
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
CN106611055A (en) * 2016-12-27 2017-05-03 大连理工大学 Chinese hedge scope detection method based on stacked neural network
CN106845373A (en) * 2017-01-04 2017-06-13 天津大学 Towards pedestrian's attribute forecast method of monitor video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"News Text Classification Based on Event Convolutional Features"; 夏从零 et al.; Application Research of Computers; 2016-06-22; Vol. 34, No. 4; full text *
"Research on Semi-supervised Gender Classification Methods Based on Multiple Types of Text"; 戴斌 et al.; Journal of Shanxi University (Natural Science Edition); 2017-02-15; Vol. 40, No. 1; pp. 15-17 *

Also Published As

Publication number Publication date
CN108108354A (en) 2018-06-01

Similar Documents

Publication Publication Date Title
CN108108354B (en) Microblog user gender prediction method based on deep learning
CN111291181B (en) Representation learning for input classification via topic sparse self-encoder and entity embedding
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
Sun et al. Sentiment analysis for Chinese microblog based on deep neural networks with convolutional extension features
CN111966917B (en) Event detection and summarization method based on pre-training language model
US11941366B2 (en) Context-based multi-turn dialogue method and storage medium
Xiao et al. Semantic relation classification via hierarchical recurrent neural network with attention
CN106970910B (en) Keyword extraction method and device based on graph model
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
US20120253792A1 (en) Sentiment Classification Based on Supervised Latent N-Gram Analysis
CN107247702A (en) A kind of text emotion analysis and processing method and system
CN107818084B (en) Emotion analysis method fused with comment matching diagram
CN110750640A (en) Text data classification method and device based on neural network model and storage medium
CN107122349A (en) A kind of feature word of text extracting method based on word2vec LDA models
CN112966091B (en) Knowledge map recommendation system fusing entity information and heat
Subramanian et al. A survey on sentiment analysis
CN111241271B (en) Text emotion classification method and device and electronic equipment
CN112347761B (en) BERT-based drug relation extraction method
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
Huang et al. Location prediction for tweets
CN110705279A (en) Vocabulary selection method and device and computer readable storage medium
Chaudhuri Visual and text sentiment analysis through hierarchical deep learning networks
Af'idah et al. Long short term memory convolutional neural network for Indonesian sentiment analysis towards touristic destination reviews
CN114036938B (en) News classification method for extracting text features by combining topic information and word vectors
CN109858035A (en) A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant