CN112232079A - Microblog comment data classification method and system - Google Patents
- Publication number
- CN112232079A CN112232079A CN202011102758.8A CN202011102758A CN112232079A CN 112232079 A CN112232079 A CN 112232079A CN 202011102758 A CN202011102758 A CN 202011102758A CN 112232079 A CN112232079 A CN 112232079A
- Authority
- CN
- China
- Prior art keywords
- model
- classified
- training
- lstm
- trained
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
Abstract
The invention discloses a microblog comment data classification method and system. The method comprises the following steps: performing word segmentation on the microblog comment data to be classified to obtain a word-segmented text to be classified; processing the word-segmented text with a trained Word2Vec model to obtain word vectors to be classified; performing weight calculation on the word vectors with the TF-IDF algorithm to obtain multi-dimensional word vectors to be classified; and inputting the multi-dimensional word vectors into a trained Multi-LSTM model to obtain the classification result of the microblog comment data. The trained Multi-LSTM model is formed by sequentially connecting a trained first Bi-LSTM layer, a trained second Bi-LSTM layer, a trained LSTM layer and a trained fully connected layer. The method and system can classify microblog texts accurately and quickly and intercept harmful microblog comments, thereby better maintaining the internet environment.
Description
Technical Field
The invention relates to the field of classification of computer natural language processing texts, in particular to a microblog comment data classification method and system.
Background
With the rapid development of modern network technology, online commenting has become ubiquitous, but it also gives some people a channel to spread negative opinions, which then propagate across the internet like a virus. The instant and convenient propagation of the microblog further fuels online public opinion: many comments published on microblogs in the form of pictures or text carry a large amount of harmful information and diffuse quickly, wasting network resources to a great extent. Text classification for internet public opinion analysis is therefore a hot problem in urgent need of a solution.
Text classification can effectively organize and manage the huge and complex text data on microblogs, so it is widely applied in microblog public opinion analysis. The text classification methods in common use fall into two families: methods based on traditional machine learning and methods based on artificial neural networks. Most text representations used with traditional machine learning are high-dimensional sparse vectors with weak feature-expression capability and low accuracy, and they cannot represent contextual semantics or word order. With the rise of artificial neural networks, their strong data-processing capability has been widely applied to text classification, but a plain recurrent neural network has no long-term memory and cannot retain earlier or later content, so classification accuracy remains low. The long short-term memory network (LSTM) adds a memory cell to the hidden-layer neurons of the recurrent neural network, so the information flowing through the whole network over the time sequence is controlled: information is remembered or forgotten through controllable gates during propagation, which improves classification accuracy. An LSTM learns the relationship between a text and its context in one direction only: given the preceding text as input it can infer what follows, but given the following text it struggles to infer what precedes. The existing microblog comment classification methods therefore still suffer from low classification accuracy and low classification efficiency.
Disclosure of Invention
Based on the above, it is necessary to provide a microblog comment data classification method and system to accurately and quickly classify microblog texts, so as to intercept microblog comments, and further better maintain the internet environment.
In order to achieve the purpose, the invention provides the following scheme:
a microblog comment data classification method comprises the following steps:
acquiring microblog comment data to be classified;
performing word segmentation processing on the microblog comment data to be classified to obtain word segmentation texts to be classified;
processing the Word segmentation text to be classified by adopting a trained Word2Vec model to obtain a Word vector to be classified;
adopting TF-IDF algorithm to carry out weight calculation on the word vector to be classified to obtain a multi-dimensional word vector to be classified;
inputting the multi-dimensional word vectors to be classified into a trained Multi-LSTM model to obtain the classification result of the microblog comment data to be classified; the trained Multi-LSTM model is formed by sequentially connecting a trained first Bi-LSTM layer, a trained second Bi-LSTM layer, a trained LSTM layer and a trained fully connected layer; the classification result is positive public opinion or negative public opinion.
Optionally, after the obtaining of the microblog comment data to be classified, the method further includes:
judging the data type of the microblog comment data to be classified;
and when the microblog comment data to be classified is in a picture form, extracting text data in the microblog comment data to be classified by adopting an OCR image recognition technology.
Optionally, the word segmentation processing is performed on the microblog comment data to be classified to obtain word segmentation texts to be classified, and the word segmentation processing specifically includes:
and carrying out word segmentation processing on the microblog comment data to be classified by adopting Python-based crust word segmentation to obtain word segmentation texts to be classified.
Optionally, the method for determining the trained Multi-LSTM model includes:
acquiring a training set of a microblog corpus;
performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text;
processing the training Word segmentation text by adopting a trained Word2Vec model to obtain a training Word vector;
performing weight calculation on the training word vector by adopting a TF-IDF algorithm to obtain a multi-dimensional training word vector;
dividing the multi-dimensional training word vector to obtain a training sample and a test sample;
constructing a Multi-LSTM model based on Python's Keras; the Multi-LSTM model is formed by sequentially connecting a first Bi-LSTM layer, a second Bi-LSTM layer, an LSTM layer and a fully connected layer; the model parameters of the Multi-LSTM model are initially unknown; the model parameters comprise a loss function, an optimizer, a learning rate and an activation function;
taking the training samples as the input of the first Bi-LSTM layer in the Multi-LSTM model, taking the class labels corresponding to the training samples as the output of the fully connected layer, and learning the model parameters of the first Bi-LSTM layer, the second Bi-LSTM layer, the LSTM layer and the fully connected layer bidirectionally in sequence to obtain a microblog comment classification model;
and inputting the test sample into the microblog comment classification model for classification accuracy verification, and determining the microblog comment classification model with the classification accuracy reaching a set standard as a trained Multi-LSTM model.
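As an illustrative sketch of the four-layer structure described above (two Bi-LSTM layers, an LSTM layer and a fully connected layer), the model can be built with Python's Keras, as the embodiment suggests. The layer widths, SGD learning rate and two-class softmax output below are assumptions for illustration; the patent leaves the loss function, optimizer, learning rate and activation function as parameters to be determined in training.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_multi_lstm(seq_len=5, vec_dim=100, num_classes=2):
    """Sketch: first Bi-LSTM -> second Bi-LSTM -> LSTM -> fully connected layer."""
    model = models.Sequential([
        layers.Input(shape=(seq_len, vec_dim)),            # 5 keywords x 100 dims = 500-dim input
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # first Bi-LSTM layer
        layers.Bidirectional(layers.LSTM(32, return_sequences=True)),  # second Bi-LSTM layer
        layers.LSTM(16),                                   # LSTM layer
        layers.Dense(num_classes, activation="softmax"),   # fully connected output layer
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_multi_lstm()
```

Calling `model.fit` with the multi-dimensional training word vectors and their class labels would then learn the parameters, and evaluating on the held-out test samples corresponds to the accuracy verification step above.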
Optionally, the method for determining the trained Word2Vec model includes:
acquiring a training set of a microblog corpus;
performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text;
constructing a Skip-gram model; the Skip-gram model comprises an input layer, a hidden layer and an output layer which are connected in sequence;
and taking the training Word segmentation text as the input of the Skip-gram model, taking Word probability distribution as the output of the Skip-gram model, training by adopting a gradient descent method to obtain a trained Skip-gram model, and determining the trained Skip-gram model as a trained Word2Vec model.
The invention also provides a microblog comment data classification system, which comprises:
the data acquisition module is used for acquiring microblog comment data to be classified;
the word segmentation module is used for carrying out word segmentation processing on the microblog comment data to be classified to obtain word segmentation texts to be classified;
the Word vector determining module is used for processing the Word segmentation text to be classified by adopting a trained Word2Vec model to obtain a Word vector to be classified;
the multidimensional word vector calculation module is used for performing weight calculation on the word vector to be classified by adopting a TF-IDF algorithm to obtain a multidimensional word vector to be classified;
the classification module is used for inputting the multi-dimensional word vectors to be classified into a trained Multi-LSTM model to obtain the classification result of the microblog comment data to be classified; the trained Multi-LSTM model is formed by sequentially connecting a trained first Bi-LSTM layer, a trained second Bi-LSTM layer, a trained LSTM layer and a trained fully connected layer; the classification result is positive public opinion or negative public opinion.
Optionally, the microblog comment data classification system further includes:
the judging module is used for judging the data type of the microblog comment data to be classified;
and the text extraction module is used for extracting text data in the microblog comment data to be classified by adopting an OCR image recognition technology when the microblog comment data to be classified is in a picture form.
Optionally, the word segmentation module specifically includes:
and the first word segmentation unit is used for performing word segmentation processing on the microblog comment data to be classified by adopting Python-based crust word segmentation to obtain word segmentation texts to be classified.
Optionally, the microblog comment data classification system further includes a first training module, configured to determine a trained Multi-LSTM model; the first training module comprises:
the first training data acquisition unit is used for acquiring a training set of a microblog corpus;
the second word segmentation unit is used for performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text;
the training Word vector determining unit is used for processing the training Word segmentation text by adopting a trained Word2Vec model to obtain a training Word vector;
the multidimensional training word vector calculation unit is used for performing weight calculation on the training word vector by adopting a TF-IDF algorithm to obtain a multidimensional training word vector;
the dividing unit is used for dividing the multi-dimensional training word vector to obtain a training sample and a test sample;
the first model building unit is used for building a Multi-LSTM model based on Python's Keras; the Multi-LSTM model is formed by sequentially connecting a first Bi-LSTM layer, a second Bi-LSTM layer, an LSTM layer and a fully connected layer; the model parameters of the Multi-LSTM model are initially unknown; the model parameters comprise a loss function, an optimizer, a learning rate and an activation function;
the first training unit is used for taking the training samples as the input of the first Bi-LSTM layer in the Multi-LSTM model, taking the class labels corresponding to the training samples as the output of the fully connected layer, and learning the model parameters of the first Bi-LSTM layer, the second Bi-LSTM layer, the LSTM layer and the fully connected layer bidirectionally in sequence to obtain a microblog comment classification model;
and the verification unit is used for inputting the test sample into the microblog comment classification model for verifying the classification accuracy, and determining the microblog comment classification model with the classification accuracy reaching the set standard as a trained Multi-LSTM model.
Optionally, the microblog comment data classification system further includes a second training module, configured to determine a trained Word2Vec model; the second training module comprises:
the second training data acquisition unit is used for acquiring a training set of the microblog corpus;
the third word segmentation unit is used for carrying out word segmentation processing on the microblog corpus training set to obtain a training word segmentation text;
the second model building unit is used for building a Skip-gram model; the Skip-gram model comprises an input layer, a hidden layer and an output layer which are connected in sequence;
and the second training unit is used for taking the training Word segmentation text as the input of the Skip-gram model, taking Word probability distribution as the output of the Skip-gram model, training by adopting a gradient descent method to obtain a trained Skip-gram model, and determining the trained Skip-gram model as a trained Word2Vec model.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a microblog comment data classification method and system, wherein a trained Word2Vec model is adopted to process Word segmentation texts to be classified, and a TF-IDF algorithm is adopted to perform weight calculation on Word vectors to be classified to obtain multi-dimensional Word vectors to be classified; inputting the Multi-dimensional word vectors to be classified into a trained Multi-LSTM model to obtain classification results of microblog comment data to be classified; the well-trained Multi-LSTM model is formed by sequentially connecting a well-trained first Bi-LSTM layer, a well-trained second Bi-LSTM layer, a well-trained LSTM layer and a well-trained full-connected layer. According to the invention, the Word2Vec model is combined with the TF-IDF algorithm, so that the feature dimension can be reduced, and the classification efficiency is improved; the classification method has the advantages that the classification is carried out by adopting the Multi-LSTM model formed by the first Bi-LSTM layer, the second Bi-LSTM layer, the LSTM layer and the full connection layer which are sequentially connected, and the classification accuracy can be improved, so that microblog texts can be accurately and rapidly classified, the interception of microblog comments is realized, and the Internet environment is better maintained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
Fig. 1 is a flowchart of a microblog comment data classification method according to an embodiment of the present invention;
fig. 2 is a diagram of a specific implementation process of the microblog comment data classification method according to the embodiment of the invention;
FIG. 3 is a flowchart of training based on the Word2Vec model according to the embodiment of the present invention;
FIG. 4 is a flowchart of a TF-IDF based algorithm according to an embodiment of the present invention;
FIG. 5 is a flow chart of model building based on the Multi-LSTM algorithm according to an embodiment of the present invention;
fig. 6 is a structural diagram of a microblog comment data classification system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a microblog comment data classification method and system, which are used for rapidly and accurately classifying and intercepting microblog comments so as to maintain the internet environment.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The text classification algorithm based on Word2Vec and Multi-LSTM proposed in this embodiment mainly comprises three parts: (1) acquiring training and test data and preprocessing the text; (2) a feature extraction method combining the Word2Vec algorithm with the TF-IDF algorithm; (3) a Multi-LSTM based classification method. The specific conception of each part is as follows:
(1) Most text classification training sets are drawn from Wikipedia, but that corpus is large and not well targeted, so here the microblog corpus is crawled directly in order to better perform public opinion analysis on microblog comments. The most practical way to obtain microblog text data is a web crawler. Crawlers are written in many languages, such as Python, PHP and Java; the Python language is simple and easy to understand, so Python is used for crawling the content. Because the PC side of Sina Weibo rejects crawlers, a user's individual microblogs can only be crawled by calling the API of the Sina Weibo mobile client.
The obtained training samples are preprocessed by word segmentation: the microblog corpus text is segmented with the Python-based jieba library, whose segmentation strategy is as follows:
Build a directed acyclic graph (DAG) from the prefix dictionary: DAG = {key: [i, j, ...], ...}, where key is the start position of a word in the sentence and the list holds the possible end positions of words beginning at that position.
Traverse the sentence from right to left with a memoized dynamic programming algorithm, computing from the last position the log word frequency of each candidate segmentation, and select the segmentation with the maximum total log frequency as the best split.
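The two segmentation steps above can be sketched in plain Python. The tiny word-frequency dictionary and the sample sentence are hypothetical; jieba's real implementation uses its own large dictionary, but the DAG construction and the right-to-left log-frequency dynamic programming follow the same shape.

```python
import math

# Hypothetical word-frequency dictionary (jieba ships its own, much larger one).
freq = {"微": 5, "博": 5, "微博": 50, "评": 4, "论": 4, "评论": 60,
        "数": 6, "据": 3, "数据": 40}
total = sum(freq.values())

def segment(sentence):
    n = len(sentence)
    # Step 1: DAG[k] lists every end position j such that sentence[k:j+1] is a word;
    # if no dictionary word starts at k, fall back to the single character.
    dag = {k: [j for j in range(k, n) if sentence[k:j + 1] in freq] or [k]
           for k in range(n)}
    # Step 2: right-to-left dynamic programming over log word frequencies.
    route = {n: (0.0, 0)}
    for k in range(n - 1, -1, -1):
        route[k] = max(
            (math.log(freq.get(sentence[k:j + 1], 1)) - math.log(total)
             + route[j + 1][0], j)
            for j in dag[k])
    # Follow the best route to read off the maximal segmentation.
    words, k = [], 0
    while k < n:
        j = route[k][1]
        words.append(sentence[k:j + 1])
        k = j + 1
    return words

print(segment("微博评论数据"))  # -> ['微博', '评论', '数据']
```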
(2) Word2Vec is a word-vector generation model that uses a neural network to model text and its contextual semantic relationships. Each word corresponds to a dense multi-dimensional vector, which greatly reduces the dimensionality of the word representation, and semantically similar words can be grouped by computing Euclidean distance or cosine similarity, avoiding the traditional methods' neglect of context and greatly improving accuracy. Since microblog comments are mostly short words or sentences, the dimension of a single word vector is set to 100. To extract text features more accurately and quickly, the TF-IDF algorithm is used to compute the weights of the feature words and rank each sample's keywords; the top 5 words by weight are extracted, vectors are zero-padded when fewer than 5 words are available, and the resulting 500-dimensional word vector is used as the input vector. TF-IDF is a weighting statistic combining term frequency (TF) and inverse document frequency (IDF). TF is the frequency with which a given word appears in a document: the higher the frequency, the more important the word is to that document. IDF is the logarithm of the total number of documents divided by the number of documents containing the word. The principle of the TF-IDF algorithm is to use IDF to balance TF so that the weights are not overly biased toward high-frequency words and informative low-frequency words are also selected: if a word has a high TF within a text and also a high IDF across the text collection, the word is considered important to that text, so TF-IDF is well suited to representing the features of a text.
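As a sketch of the weighting just described, TF-IDF can be computed directly. The tokenized documents below are hypothetical, and the smoothed logarithmic IDF is one common variant (libraries such as scikit-learn smooth slightly differently).

```python
import math

# Hypothetical word-segmented documents.
docs = [
    ["weibo", "comment", "good", "weibo"],
    ["weibo", "comment", "bad"],
    ["news", "report", "good"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)            # term frequency within this document
    df = sum(1 for d in docs if term in d)     # number of documents containing the term
    idf = math.log(len(docs) / (1 + df)) + 1   # smoothed inverse document frequency
    return tf * idf

# Rank the words of the first document by weight and keep the top keywords.
doc = docs[0]
weights = {w: tf_idf(w, doc, docs) for w in set(doc)}
top = sorted(weights, key=weights.get, reverse=True)
print(top[0])  # -> weibo (it repeats in the document, so it gets the highest weight)
```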
The Word2Vec-TF-IDF combination compensates for the TF-IDF algorithm's neglect of contextual semantic relationships, while the weight statistics screen out the central words that best embody the features of the text, finally reducing the feature dimension.
(3) The feature data of the training set obtained by the statistics above are used to train the Multi-LSTM network model. Text classification mostly uses recurrent neural networks, but a plain recurrent network has no long-term memory and low accuracy; the long short-term memory network improves on this, but an LSTM can only infer the following text from the preceding text, and if the following text is input it struggles to infer what precedes, which limits text classification. To address this, this embodiment adopts a multi-layer long short-term memory network: the data from the input layer pass through Bi-LSTM layers that learn in both the forward and backward directions, the output hidden states are concatenated as the input of the next layer, and four layers are used in total: a first Bi-LSTM layer, a second Bi-LSTM layer, an LSTM layer and a fully connected layer. Once the network structure is fixed, it is trained with a stochastic gradient descent algorithm; the network model relaxes the conditional-independence assumption between attributes and therefore fits the data and structure better. By constructing this more complex network model, the fitting capability of the network is improved, more parameters are learned and the understanding of a post's semantics is deepened, so both the speed and the accuracy of text classification improve greatly. A "discrimination threshold" slider can also be provided: sliding it changes the threshold, which sets the strictness of the rule for judging whether a microblog is harmful; the closer the threshold is to 1, the stricter the rule, and vice versa, so the best discrimination rule can be found.
In order to distinguish microblog attributes more intuitively and accurately, word-frequency statistics are computed over the keywords, and high-frequency words are displayed visually as a word cloud in the word-cloud display column.
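A minimal sketch of the keyword frequency statistics behind the word-cloud display, assuming hypothetical segmented keywords; rendering the cloud itself (e.g. with the third-party wordcloud package) is omitted.

```python
from collections import Counter

# Hypothetical segmented keywords collected from classified comments.
keywords = ["weibo", "rumor", "weibo", "spam", "rumor", "weibo", "news"]

counts = Counter(keywords)
top_words = counts.most_common(2)   # high-frequency words to feed the word cloud
print(top_words)                    # -> [('weibo', 3), ('rumor', 2)]
```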
Based on the above thought, the microblog comment data classification method provided by the embodiment is described in detail below.
In the microblog comment data classification method provided by this embodiment, the texts in the microblog corpus are preprocessed to obtain word-segmented texts, word vectors are obtained with a Word2Vec model, weights are computed with the TF-IDF algorithm, and the resulting multi-dimensional word vectors are used to train the Multi-LSTM network, finally improving the classification accuracy of microblog comment texts.
Fig. 1 is a flowchart of a microblog comment data classification method provided by an embodiment of the present invention. Referring to fig. 1, the classification method according to the embodiment includes:
step 101: and acquiring microblog comment data to be classified.
An API (application program interface) of the Sina Weibo mobile client is called to crawl a single microblog of the user to be classified, obtaining the microblog comment data.
Step 102: and performing word segmentation processing on the microblog comment data to be classified to obtain word segmentation texts to be classified.
In this embodiment, the microblog comment data to be classified may be subjected to word segmentation processing using the Python-based jieba word segmentation tool to obtain the word segmentation text to be classified, where the specific word segmentation strategy is as follows:
Step1: construct a directed acyclic graph. A directed acyclic graph (DAG) is built from the prefix dictionary (dict): DAG = {key: list[i, j, …], …}, where key is the starting position of a word in the sentence and list stores the possible end positions of the words starting at that position.
Step2: traverse the text from right to left with a dynamic programming algorithm, compute the log probability of each word combination starting from the last word, select the segmentation combination with the maximum log probability, and finally split the text into individual words.
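The two-step strategy above can be sketched in pure Python with a toy prefix dictionary. The words and frequency values below are hypothetical, not jieba's real dictionary data; in practice one would simply call `jieba.cut`:

```python
import math

# Toy prefix dictionary: word -> frequency (hypothetical values)
FREQ = {"微": 5, "微博": 20, "博": 4, "评": 3, "评论": 18, "论": 2}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """Step1: DAG = {key: [i, j, ...]}, where key is a start position and the
    list holds the possible end positions of words beginning there."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = [i for i in range(k, n) if sentence[k:i + 1] in FREQ]
        dag[k] = ends or [k]  # an unknown single character maps to itself
    return dag

def segment(sentence):
    """Step2: right-to-left dynamic programming over log probabilities,
    choosing the segmentation with the maximum total log probability."""
    dag = build_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, 0)}
    logtotal = math.log(TOTAL)
    for k in range(n - 1, -1, -1):
        route[k] = max(
            (math.log(FREQ.get(sentence[k:i + 1], 1)) - logtotal + route[i + 1][0], i)
            for i in dag[k]
        )
    words, k = [], 0
    while k < n:
        i = route[k][1]
        words.append(sentence[k:i + 1])
        k = i + 1
    return words

print(segment("微博评论"))  # -> ['微博', '评论']
```

The DP prefers longer dictionary words when their log probability beats the sum of their parts, which is why "微博" and "评论" win over single characters here.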
In addition, in this step, before word segmentation is performed on the microblog comment data to be classified, the data type of the data is determined: when the microblog comment data to be classified is in picture form, OCR image recognition technology is used to extract the text data from it; when the microblog comment data to be classified is in text form, word segmentation is performed directly, as shown in fig. 2.
Step 103: and processing the Word segmentation text to be classified by adopting a trained Word2Vec model to obtain a Word vector to be classified.
The determination method of the trained Word2Vec model comprises the following steps:
(1) and acquiring a training set of the microblog corpus.
The microblog corpus data is crawled with Python crawler technology. Several basic modules are available for a Python crawler; the requests module is the simplest, so it is chosen as the base module, and the pandas module is used to download and save the crawled data in csv file format. The Python Sina Weibo web crawler specifically comprises the following steps:
Step1: set url to 'https://m.weibo.cn/api/comments/show?id=' + ID + '&page=' + page;
Step2: set the request header with the User-Agent configuration of a personal-computer client and the Cookie login information;
Step3: send the request with the get method of the requests module, placing the url, User-Agent, and Cookie in the request;
Step4: with the disguised browser header, send the request to the Weibo web server and obtain the JSON file of the response content;
Step5: parse the content to obtain the user id and text data;
Step6: use the pandas module to download and save the comment data locally in csv format.
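The six steps above can be sketched as follows. The endpoint path and the response field names (`data`, `user`, `text`) are assumptions about the Weibo mobile API, and the stdlib csv module stands in for pandas to keep the sketch dependency-free; the actual network call requires a valid cookie and is therefore not executed here:

```python
import csv

def build_comment_url(weibo_id: str, page: int) -> str:
    """Step1: assemble the mobile-API url from the microblog ID and page number."""
    return "https://m.weibo.cn/api/comments/show?id=" + weibo_id + "&page=" + str(page)

def build_headers(cookie: str) -> dict:
    """Step2: request header with a desktop User-Agent and Cookie login info."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Cookie": cookie,
    }

def fetch_comments(weibo_id: str, page: int, cookie: str):
    """Steps3-5: send a GET request disguised as a browser and parse the JSON
    response into (user id, text) pairs. Needs network access, so it is only
    defined, not called, in this sketch."""
    import requests  # third-party module; imported lazily
    resp = requests.get(build_comment_url(weibo_id, page),
                        headers=build_headers(cookie), timeout=10)
    data = resp.json().get("data", [])
    return [(c["user"]["id"], c["text"]) for c in data]

def save_csv(rows, path="comments.csv"):
    """Step6: save (user_id, text) rows locally in csv format."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([("user_id", "text"), *rows])

print(build_comment_url("4551234567890123", 1))
```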
(2) And performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text, wherein the specific process is the same as the step 102.
(3) Train the word vectors. The word vectors corresponding to the text are obtained with a Word2Vec model; the resulting word vectors capture the semantic relations of the context while reducing the feature dimensionality of the vectors, compensating for the TF-IDF algorithm's neglect of contextual semantics. In this embodiment, the Skip-gram variant of the Word2Vec model is selected. First, a Skip-gram model comprising an input layer, a hidden layer, and an output layer connected in sequence is constructed; then the training word segmentation text is used as the input of the Skip-gram model and the word probability distribution as its output, the model is trained with a gradient descent method, and the trained Skip-gram model is determined as the trained Word2Vec model. The word vectors can be determined from the word probability distributions. The classification flow of the Word2Vec model is shown in fig. 3. The method specifically comprises the following steps:
Training the training word segmentation text obtained in step (2) with the Skip-gram algorithm yields the word vector corresponding to each single-word text; the principle is to maximize the average log probability. The Skip-gram model consists of an input layer, a hidden layer, and an output layer, and uses the current input word $w_i$ to predict its context $S_{w_i}=(w_{i-m},\ldots,w_{i-1},w_{i+1},\ldots,w_{i+m})$, where $m$ is the size of the context window of the current word. The calculation formula of the training objective function is shown in (1):

$$L_{\text{Skip-gram}}=\frac{1}{T}\sum_{i=1}^{T}\sum_{-k\le r\le k,\ r\ne 0}\log p(w_{i+r}\mid w_i) \quad (1)$$

In the formula: $L_{\text{Skip-gram}}$ is the objective function corresponding to the word $w_i$, $p(w_{i+r}\mid w_i)$ is the word probability distribution, $k$ is the interval range to which the context of word $w_i$ belongs, $e(w_{i+r})$ is the word vector corresponding to $w_{i+r}$, and $r$ is the distance between $w_{i+r}$ and $w_i$. Where

$$p(w_{i+r}\mid w_i)=\frac{\exp\!\big(e(w_{i+r})^{\top}e(w_i)\big)}{\sum_{w_k}\exp\!\big(e(w_k)^{\top}e(w_i)\big)} \quad (2)$$

In equation (2): $e(w_i)$ is the word vector corresponding to $w_i$, and $e(w_k)$ is the corresponding word vector in the interval range to which the context belongs.
After the Word2Vec model is trained, a probability distribution is output. The word vectors are all initialized randomly, and the vector parameters are updated iteratively by gradient descent until the objective function converges, minimizing the training error between the objective function and the samples. The input-to-hidden-layer parameters are trained with the word vectors as input; by computing Euclidean distance or cosine similarity, words with similar semantics end up close in the vector space, and words with similar contextual semantics obtain similar representations.
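As a minimal pure-Python illustration of equations (1) and (2), the sketch below computes the softmax probability $p(w_{i+r}\mid w_i)$ and the average log-probability objective over a toy sentence. The 2-dimensional word vectors are hypothetical; in practice they are learned by gradient descent as described:

```python
import math

def softmax_prob(e, center, target, vocab):
    """p(w_target | w_center) per equation (2): softmax over inner products
    of the word vectors e(w)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    num = math.exp(dot(e[target], e[center]))
    den = sum(math.exp(dot(e[w], e[center])) for w in vocab)
    return num / den

def skipgram_objective(e, sentence, k):
    """Average log probability per equation (1): each word w_i predicts its
    context within a window of size k."""
    total, count = 0.0, 0
    for i, wi in enumerate(sentence):
        for r in range(-k, k + 1):
            j = i + r
            if r == 0 or j < 0 or j >= len(sentence):
                continue
            total += math.log(softmax_prob(e, wi, sentence[j], e.keys()))
            count += 1
    return total / count

# Toy 2-d word vectors (hypothetical; normally learned during training)
e = {"微博": [1.0, 0.0], "评论": [0.8, 0.2], "分类": [0.0, 1.0]}
probs = [softmax_prob(e, "微博", w, e) for w in e]
print(round(sum(probs), 6))  # -> 1.0: the distribution sums to one
```

Training maximizes `skipgram_objective`, which pushes the vectors of co-occurring words toward each other, the geometric effect measured by the cosine similarity mentioned above.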
Step 104: and performing weight calculation on the word vector to be classified by adopting a TF-IDF algorithm to obtain the multi-dimensional word vector to be classified.
Text feature extraction performs weight calculation on the text with the TF-IDF algorithm and outputs a feature set. TF-IDF is a weighted feature statistical method combining term frequency (TF) and inverse document frequency (IDF). TF is the frequency with which a given word occurs in a document; the higher the frequency, the more important the word is to that document. The calculation formula is shown in (3):

$$TF_{i,j}=\frac{n_{i,j}}{\sum_{d} n_{d,j}} \quad (3)$$

In the formula: $w_i$ is the currently input word, $n_{i,j}$ is the number of occurrences of $w_i$ in the text $t_j$, and $\sum_{d} n_{d,j}$ is the total number of occurrences of all words $d$ in the text $t_j$.
IDF is the logarithm of the inverse of the proportion of documents in the text set T that contain the word $w_i$, and its calculation formula is shown in (4):

$$IDF_i=\log\frac{|T|}{|\{j: w_i\in t_j\}|} \quad (4)$$

In the formula: $|T|$ is the total number of documents, and $|\{j: w_i\in t_j\}|$ is the number of texts containing the word $w_i$. The inverse document frequency prevents words such as 'I' and 'he', which occur frequently but contribute little to text classification, from obtaining excessively high weights.
TF-IDF is the product of TF and IDF, and the normalized calculation formula is shown in (5):

$$TFIDF_{i,j}=TF_{i,j}\times IDF_i \quad (5)$$

Equation (5) shows that the importance of the word $w_i$ to the text $t_j$ is proportional to its frequency of occurrence in that text and inversely proportional to its frequency of occurrence across the whole text library.
The TF-IDF algorithm flow chart is shown in FIG. 4; the algorithm finally outputs the feature set X. The principle of the TF-IDF algorithm is that IDF balances TF, so that word weights are not biased toward high-frequency words and low-frequency words are also selected. That is, if a word has a high TF within a text and also a high IDF across the text set, the word is considered highly important in that text and suitable for representing its features. This makes the feature nodes more diverse and reduces feature dimensionality while increasing the expressive capacity of such feature nodes in the network structure.
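A minimal sketch of formulas (3) to (5) on a toy segmented corpus (the documents and words below are hypothetical):

```python
import math

def tf(word, doc):
    """Equation (3): occurrences of word in doc over total words in doc."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Equation (4): log of the total document count over the number of
    documents containing the word."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    """Equation (5): product of TF and IDF."""
    return tf(word, doc) * idf(word, docs)

# Toy segmented corpus: a word appearing in every document gets weight 0,
# while a distinctive word keeps a positive weight
docs = [["微博", "评论", "正面"], ["微博", "评论", "负面"], ["微博", "转发"]]
print(tf_idf("微博", docs[0], docs))             # -> 0.0
print(round(tf_idf("正面", docs[0], docs), 4))   # -> 0.3662
```

This is exactly the balancing described above: the ubiquitous word "微博" has IDF log(3/3) = 0 and is suppressed, while the rarer "正面" is promoted.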
Step 105: inputting the Multi-dimensional word vectors to be classified into a trained Multi-LSTM model to obtain classification results of microblog comment data to be classified; the well-trained Multi-LSTM model is formed by sequentially connecting a well-trained first Bi-LSTM layer, a well-trained second Bi-LSTM layer, a well-trained LSTM layer and a well-trained full-connection layer. And when the classification result is negative public opinion, intercepting the negative public opinion so as to maintain the network environment safety.
In step 105, the method for determining the trained Multi-LSTM model comprises:
(1) acquiring a training set of a microblog corpus; performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text; processing the training Word segmentation text by adopting a trained Word2Vec model to obtain a training Word vector; performing weight calculation on the training word vector by adopting a TF-IDF algorithm to obtain a multi-dimensional training word vector; and dividing the multi-dimensional training word vector to obtain a training sample and a test sample.
(2) Constructing a Multi-LSTM model based on Keras of Python; the Multi-LSTM model is formed by sequentially connecting a first Bi-LSTM layer, a second Bi-LSTM layer, an LSTM layer and a full-connection layer; the model parameters of the Multi-LSTM model are unknown; the model parameters include a loss function, an optimizer, a learning rate, and an activation function.
(3) And taking the training sample as the input of a first Bi-LSTM layer in the Multi-LSTM model, taking a class label corresponding to the training sample as the output of a full connection layer, and sequentially performing bidirectional learning on model parameters of the first Bi-LSTM layer, the second Bi-LSTM layer, the LSTM layer and the full connection layer to obtain a microblog comment classification model.
(4) And inputting the test sample into the microblog comment classification model for classification accuracy verification, and determining the microblog comment classification model with the classification accuracy reaching a set standard as a trained Multi-LSTM model.
In practical application, the specific determination method of the well-trained Multi-LSTM model is as follows:
A multi-layer long short-term memory network (Multi-LSTM) structure is adopted for text discrimination. The multi-layer network is composed of two Bi-LSTM layers, one LSTM layer, and one fully-connected layer; the Multi-LSTM network model construction flow is shown in figure 5. Bi-LSTM can learn the text content not only in the forward direction but also in reverse; the LSTM memorizes one-way long- and short-term information and learns from past information; the fully-connected network has strong fitting capability, learns a high-order representation of the LSTM output, and finally serves as the basis for the output decision. Combining the LSTM, Bi-LSTM, and fully-connected layers builds a more complex model and improves the fitting capability of the network, so that more parameters are learned, the semantic understanding of articles is deepened, and public opinion analysis of microblog comments is performed better. The specific construction steps of the Multi-LSTM network model are as follows:
step1 firstly, a Word vector classification dictionary embedding is built, a dictionary in a trained Word2Vec model is assigned to a neural network Word vector dictionary, input Word segmentation can be expressed in a vector form, the model fuses the obtained Word vectors and the Word vectors, and the problem of misclassification caused by vector feature sparsity is effectively solved, wherein the parameter output _ dim is set to be 150.
Step2: then construct the two-layer Bi-LSTM classification model: the output of the embedding layer is fed to the first Bi-LSTM layer for bidirectional training and learning. This layer has 128 hidden neural units, and its dropout parameter is set to 0.5, meaning 50% of the neural units are suppressed, which effectively prevents overfitting of the neural network model. The second Bi-LSTM layer is set to 64 hidden neurons, with the dropout parameter likewise 0.5.
Step3: next construct the LSTM layer, which learns a high-order representation of the second Bi-LSTM layer's output. The LSTM is set to 32 hidden units with a dropout parameter of 0.3, i.e. 30% of the network parameters are suppressed.
Step4: finally construct the fully-connected layer, whose input is the output of the LSTM layer and whose output neurons correspond to the classes; its dropout parameter is set to 0.5, i.e. 50% of the network parameters are suppressed. The activation function is softmax, and the finally obtained text vectors are classified by the classifier.
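Assuming a TensorFlow 2.x Keras environment, Steps1 to 4 can be sketched as follows; `vocab_size` and `num_classes` are placeholder values, and the optimizer and loss follow the parameter choices discussed in this embodiment (RMSProp, mean squared error):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

vocab_size, num_classes = 10000, 2  # hypothetical values

model = Sequential([
    # Step1: embedding layer with output_dim = 150 (its weights would be
    # initialised from the trained Word2Vec dictionary)
    Embedding(input_dim=vocab_size, output_dim=150),
    # Step2: two Bi-LSTM layers with 128 and 64 hidden units, dropout 0.5
    Bidirectional(LSTM(128, return_sequences=True, dropout=0.5)),
    Bidirectional(LSTM(64, return_sequences=True, dropout=0.5)),
    # Step3: one-way LSTM with 32 hidden units, dropout 0.3
    LSTM(32, dropout=0.3),
    # Step4: fully-connected output with softmax over the classes
    Dropout(0.5),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["accuracy"])
```

This is a sketch of the architecture only; training would call `model.fit` on the multi-dimensional training word vectors and their class labels.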
In order to improve the classification accuracy, various parameters contained in the model need to be considered, and as shown in fig. 5, the LSTM parameter selection mainly considers a loss function, an optimizer selection, a learning rate setting and an activation function selection.
Loss function: to bring the predicted values close to the actual data, the loss function of the given model must be continually reduced during training. Commonly used loss functions include ordinary least squares, the Adaboost loss, and the mean squared error. Verification showed that the mean squared error gives the highest accuracy.
Optimizer: because the LSTM model's parameters are complex, a strong optimizer is needed to obtain the optimal parameters; candidates include Adam, Momentum, RMSProp, and others. Verification found that training the model with the RMSProp optimizer works well.
Learning rate: the learning rate is the step size of each iteration in the gradient descent algorithm, and its setting is very important: too small a learning rate easily falls into local optima and takes too long, while too large a learning rate can prevent the model from converging. The learning rate is usually set between 0.0001 and 1; verification showed that a learning rate of 0.001 trains the model best.
Activation function: the raw input of a neural network is linear, so an activation function must be applied to introduce nonlinearity and improve the expressive capability of the network model. Common activation functions include sigmoid, relu, and softmax; verification showed that the sigmoid activation function gives higher accuracy. The calculation formula is shown in (6).
$$f^{(t)}=\sigma\big(W_f h^{(t-1)}+U_f x^{(t)}+b_f\big) \quad (6)$$

where $\sigma$ is the sigmoid activation function and $W_f$, $U_f$, $b_f$ are all coefficients of a linear relationship.
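A one-dimensional numeric sketch of equation (6); the coefficient values are hypothetical scalars for illustration only:

```python
import math

def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def gate(W_f, h_prev, U_f, x_t, b_f):
    """Equation (6) in one dimension:
    f(t) = sigma(W_f * h(t-1) + U_f * x(t) + b_f)."""
    return sigmoid(W_f * h_prev + U_f * x_t + b_f)

print(sigmoid(0.0))                              # -> 0.5, the midpoint of (0, 1)
print(round(gate(0.5, 0.2, 0.3, 1.0, -0.4), 4))  # -> 0.5 (the argument is 0 here)
```

The sigmoid squashes any linear combination of the previous hidden state and the current input into (0, 1), which is what lets the gate smoothly scale how much information passes through.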
The invention also provides a microblog comment data classification system, and FIG. 6 is a structure diagram of the microblog comment data classification system provided by the embodiment of the invention.
Referring to fig. 6, the microblog comment data classification system in the embodiment includes:
the data obtaining module 201 is configured to obtain microblog comment data to be classified.
And the word segmentation module 202 is configured to perform word segmentation processing on the microblog comment data to be classified to obtain word segmentation texts to be classified.
And the Word vector determining module 203 is used for processing the Word segmentation text to be classified by adopting the trained Word2Vec model to obtain a Word vector to be classified.
And the multidimensional word vector calculation module 204 is configured to perform weight calculation on the word vector to be classified by using a TF-IDF algorithm to obtain a multidimensional word vector to be classified.
The classification module 205 is configured to input the Multi-dimensional word vector to be classified into a trained Multi-LSTM model, so as to obtain a classification result of microblog comment data to be classified; the well-trained Multi-LSTM model is formed by sequentially connecting a well-trained first Bi-LSTM layer, a well-trained second Bi-LSTM layer, a well-trained LSTM layer and a well-trained full-connection layer; the classification result is positive public opinion or negative public opinion.
As an optional implementation manner, the microblog comment data classification system further includes:
and the judging module is used for judging the data type of the microblog comment data to be classified.
And the text extraction module is used for extracting text data in the microblog comment data to be classified by adopting an OCR image recognition technology when the microblog comment data to be classified is in a picture form.
As an optional implementation manner, the word segmentation module 202 specifically includes:
And the first word segmentation unit is used for performing word segmentation processing on the microblog comment data to be classified by adopting Python-based jieba word segmentation to obtain word segmentation texts to be classified.
As an optional implementation manner, the microblog comment data classification system further comprises a first training module, configured to determine a trained Multi-LSTM model; the first training module comprises:
and the first training data acquisition unit is used for acquiring a training set of the microblog corpus.
And the second word segmentation unit is used for performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text.
And the training Word vector determining unit is used for processing the training Word segmentation text by adopting a trained Word2Vec model to obtain a training Word vector.
And the multidimensional training word vector calculation unit is used for performing weight calculation on the training word vector by adopting a TF-IDF algorithm to obtain a multidimensional training word vector.
And the dividing unit is used for dividing the multi-dimensional training word vector to obtain a training sample and a test sample.
The first model building unit is used for building a Multi-LSTM model based on Keras of Python; the Multi-LSTM model is formed by sequentially connecting a first Bi-LSTM layer, a second Bi-LSTM layer, an LSTM layer and a full-connection layer; the model parameters of the Multi-LSTM model are unknown; the model parameters include a loss function, an optimizer, a learning rate, and an activation function.
And the first training unit is used for taking the training sample as the input of a first Bi-LSTM layer in the Multi-LSTM model, taking the class label corresponding to the training sample as the output of a full connection layer, and sequentially performing bidirectional learning on the model parameters of the first Bi-LSTM layer, the second Bi-LSTM layer, the LSTM layer and the full connection layer to obtain a microblog comment classification model.
And the verification unit is used for inputting the test sample into the microblog comment classification model for verifying the classification accuracy, and determining the microblog comment classification model with the classification accuracy reaching the set standard as a trained Multi-LSTM model.
As an optional implementation manner, the microblog comment data classification system further comprises a second training module, which is used for determining a trained Word2Vec model; the second training module comprises:
and the second training data acquisition unit is used for acquiring a training set of the microblog corpus.
And the third word segmentation unit is used for carrying out word segmentation on the microblog corpus training set to obtain a training word segmentation text.
The second model building unit is used for building a Skip-gram model; the Skip-gram model comprises an input layer, a hidden layer and an output layer which are connected in sequence.
And the second training unit is used for taking the training Word segmentation text as the input of the Skip-gram model, taking Word probability distribution as the output of the Skip-gram model, training by adopting a gradient descent method to obtain a trained Skip-gram model, and determining the trained Skip-gram model as a trained Word2Vec model.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (10)
1. A microblog comment data classification method is characterized by comprising the following steps:
acquiring microblog comment data to be classified;
performing word segmentation processing on the microblog comment data to be classified to obtain word segmentation texts to be classified;
processing the Word segmentation text to be classified by adopting a trained Word2Vec model to obtain a Word vector to be classified;
adopting TF-IDF algorithm to carry out weight calculation on the word vector to be classified to obtain a multi-dimensional word vector to be classified;
inputting the Multi-dimensional word vectors to be classified into a trained Multi-LSTM model to obtain classification results of microblog comment data to be classified; the well-trained Multi-LSTM model is formed by sequentially connecting a well-trained first Bi-LSTM layer, a well-trained second Bi-LSTM layer, a well-trained LSTM layer and a well-trained full-connection layer; the classification result is positive public opinion or negative public opinion.
2. The microblog comment data classification method according to claim 1, further comprising, after the obtaining of microblog comment data to be classified:
judging the data type of the microblog comment data to be classified;
and when the microblog comment data to be classified is in a picture form, extracting text data in the microblog comment data to be classified by adopting an OCR image recognition technology.
3. The microblog comment data classification method according to claim 1, wherein the segmenting process is performed on the microblog comment data to be classified to obtain segmented text to be classified, and specifically comprises:
and carrying out word segmentation processing on the microblog comment data to be classified by adopting Python-based jieba word segmentation to obtain word segmentation texts to be classified.
4. The microblog comment data classification method according to claim 1, wherein the trained Multi-LSTM model is determined by:
acquiring a training set of a microblog corpus;
performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text;
processing the training Word segmentation text by adopting a trained Word2Vec model to obtain a training Word vector;
performing weight calculation on the training word vector by adopting a TF-IDF algorithm to obtain a multi-dimensional training word vector;
dividing the multi-dimensional training word vector to obtain a training sample and a test sample;
constructing a Multi-LSTM model based on Keras of Python; the Multi-LSTM model is formed by sequentially connecting a first Bi-LSTM layer, a second Bi-LSTM layer, an LSTM layer and a full-connection layer; the model parameters of the Multi-LSTM model are unknown; the model parameters comprise a loss function, an optimizer, a learning rate and an activation function;
taking the training sample as the input of a first Bi-LSTM layer in the Multi-LSTM model, taking a class label corresponding to the training sample as the output of a full connection layer, and sequentially performing bidirectional learning on model parameters of the first Bi-LSTM layer, the second Bi-LSTM layer, the LSTM layer and the full connection layer to obtain a microblog comment classification model;
and inputting the test sample into the microblog comment classification model for classification accuracy verification, and determining the microblog comment classification model with the classification accuracy reaching a set standard as a trained Multi-LSTM model.
5. The microblog comment data classification method according to claim 1, wherein the determination method of the trained Word2Vec model is as follows:
acquiring a training set of a microblog corpus;
performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text;
constructing a Skip-gram model; the Skip-gram model comprises an input layer, a hidden layer and an output layer which are connected in sequence;
and taking the training Word segmentation text as the input of the Skip-gram model, taking Word probability distribution as the output of the Skip-gram model, training by adopting a gradient descent method to obtain a trained Skip-gram model, and determining the trained Skip-gram model as a trained Word2Vec model.
6. A microblog comment data classification system is characterized by comprising:
the data acquisition module is used for acquiring microblog comment data to be classified;
the word segmentation module is used for carrying out word segmentation processing on the microblog comment data to be classified to obtain word segmentation texts to be classified;
the Word vector determining module is used for processing the Word segmentation text to be classified by adopting a trained Word2Vec model to obtain a Word vector to be classified;
the multidimensional word vector calculation module is used for performing weight calculation on the word vector to be classified by adopting a TF-IDF algorithm to obtain a multidimensional word vector to be classified;
the classification module is used for inputting the Multi-dimensional word vectors to be classified into a trained Multi-LSTM model to obtain classification results of microblog comment data to be classified; the well-trained Multi-LSTM model is formed by sequentially connecting a well-trained first Bi-LSTM layer, a well-trained second Bi-LSTM layer, a well-trained LSTM layer and a well-trained full-connection layer; the classification result is positive public opinion or negative public opinion.
7. The microblog comment data classification system according to claim 6, further comprising:
the judging module is used for judging the data type of the microblog comment data to be classified;
and the text extraction module is used for extracting text data in the microblog comment data to be classified by adopting an OCR image recognition technology when the microblog comment data to be classified is in a picture form.
8. The microblog comment data classification system according to claim 6, wherein the word segmentation module specifically comprises:
and the first word segmentation unit is used for performing word segmentation processing on the microblog comment data to be classified by adopting Python-based jieba word segmentation to obtain word segmentation texts to be classified.
9. The microblog comment data classification system of claim 6 further comprising a first training module for determining a trained Multi-LSTM model; the first training module comprises:
the first training data acquisition unit is used for acquiring a training set of a microblog corpus;
the second word segmentation unit is used for performing word segmentation processing on the microblog corpus training set to obtain a training word segmentation text;
the training Word vector determining unit is used for processing the training Word segmentation text by adopting a trained Word2Vec model to obtain a training Word vector;
the multidimensional training word vector calculation unit is used for performing weight calculation on the training word vector by adopting a TF-IDF algorithm to obtain a multidimensional training word vector;
the dividing unit is used for dividing the multi-dimensional training word vector to obtain a training sample and a test sample;
the first model building unit is used for building a Multi-LSTM model based on Keras of Python; the Multi-LSTM model is formed by sequentially connecting a first Bi-LSTM layer, a second Bi-LSTM layer, an LSTM layer and a full-connection layer; the model parameters of the Multi-LSTM model are unknown; the model parameters comprise a loss function, an optimizer, a learning rate and an activation function;
the first training unit is used for taking the training sample as the input of a first Bi-LSTM layer in the Multi-LSTM model, taking a class label corresponding to the training sample as the output of a full connection layer, and sequentially performing bidirectional learning on model parameters of the first Bi-LSTM layer, the second Bi-LSTM layer, the LSTM layer and the full connection layer to obtain a microblog comment classification model;
and the verification unit is used for inputting the test sample into the microblog comment classification model for verifying the classification accuracy, and determining the microblog comment classification model with the classification accuracy reaching the set standard as a trained Multi-LSTM model.
10. The microblog comment data classification system of claim 6, further comprising a second training module for determining a trained Word2Vec model; the second training module comprises:
the second training data acquisition unit is used for acquiring a training set of the microblog corpus;
the third word segmentation unit is used for carrying out word segmentation processing on the microblog corpus training set to obtain a training word segmentation text;
the second model building unit is used for building a Skip-gram model; the Skip-gram model comprises an input layer, a hidden layer and an output layer which are connected in sequence;
and the second training unit is used for taking the training Word segmentation text as the input of the Skip-gram model, taking Word probability distribution as the output of the Skip-gram model, training by adopting a gradient descent method to obtain a trained Skip-gram model, and determining the trained Skip-gram model as a trained Word2Vec model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011102758.8A CN112232079B (en) | 2020-10-15 | 2020-10-15 | Microblog comment data classification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011102758.8A CN112232079B (en) | 2020-10-15 | 2020-10-15 | Microblog comment data classification method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112232079A true CN112232079A (en) | 2021-01-15 |
CN112232079B CN112232079B (en) | 2022-12-02 |
Family
ID=74113623
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011102758.8A Active CN112232079B (en) | 2020-10-15 | 2020-10-15 | Microblog comment data classification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112232079B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112836509A (en) * | 2021-02-22 | 2021-05-25 | 西安交通大学 | Expert system knowledge base construction method and system |
CN112989827A (en) * | 2021-05-20 | 2021-06-18 | 江苏数兑科技有限公司 | Text data set quality evaluation method based on multi-source heterogeneous characteristics |
CN114925687A (en) * | 2022-05-17 | 2022-08-19 | 西安交通大学 | Chinese composition scoring method and system based on dynamic word vector representation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180357531A1 (en) * | 2015-11-27 | 2018-12-13 | Devanathan GIRIDHARI | Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof |
CN110222178A (en) * | 2019-05-24 | 2019-09-10 | 新华三大数据技术有限公司 | Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing |
CN110427616A (en) * | 2019-07-19 | 2019-11-08 | 山东科技大学 | A kind of text emotion analysis method based on deep learning |
2020-10-15: Application CN202011102758.8A filed in China (CN); granted as patent CN112232079B, status Active
Non-Patent Citations (1)
Title |
---|
NIU XUEYING; ZHAO ENYING: "Research on Microblog Text Classification Based on Word2Vec", Computer Systems & Applications, vol. 28, no. 8, 15 August 2019 (2019-08-15) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112232079B (en) | Microblog comment data classification method and system | |
CN108363753B (en) | Comment text emotion classification model training and emotion classification method, device and equipment | |
Xu et al. | Investigation on the Chinese text sentiment analysis based on convolutional neural networks in deep learning. | |
CN105183833B (en) | Microblog text recommendation method and device based on user model | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN110046250A (en) | Three embedded convolutional neural networks model and its more classification methods of text | |
CN110602113A (en) | Hierarchical phishing website detection method based on deep learning | |
CN111522908A (en) | Multi-label text classification method based on BiGRU and attention mechanism | |
CN110046223B (en) | Film evaluation emotion analysis method based on improved convolutional neural network model | |
Zhang | Research on text classification method based on LSTM neural network model | |
CN112989052B (en) | Chinese news long text classification method based on combination-convolution neural network | |
CN109710804B (en) | Teaching video image knowledge point dimension reduction analysis method | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN116756303A (en) | Automatic generation method and system for multi-topic text abstract | |
CN111507093A (en) | Text attack method and device based on similar dictionary and storage medium | |
CN115062727B (en) | Graph node classification method and system based on multi-order hypergraph convolutional network | |
CN111859979A (en) | Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium | |
WO2023155304A1 (en) | Keyword recommendation model training method and apparatus, keyword recommendation method and apparatus, device, and medium | |
CN114742071B (en) | Cross-language ideas object recognition analysis method based on graph neural network | |
CN115934951A (en) | Network hot topic user emotion prediction method | |
CN115146021A (en) | Training method and device for text retrieval matching model, electronic equipment and medium | |
Li | Text recognition and classification of english teaching content based on SVM | |
CN117951375A (en) | Project recommendation method based on multi-task training project attribute diagram | |
CN113535928A (en) | Service discovery method and system of long-term and short-term memory network based on attention mechanism | |
Wang et al. | W-RNN: News text classification based on a Weighted RNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||