CN110263166A

CN110263166A - Public opinion text classification method based on deep learning

Info

Publication number: CN110263166A
Application number: CN201910525459.6A
Authority: CN
Inventors: 肖翔; 黄泓; 周家木
Original assignee: Beijing Sea - Induced Star Map Technology Co Ltd
Current assignee: Beijing Sea - Induced Star Map Technology Co Ltd
Priority date: 2019-06-18
Filing date: 2019-06-18
Publication date: 2019-09-20

Abstract

The present invention provides a public opinion text classification method based on deep learning, comprising the following steps: 1. crawl enterprise public opinion text from the internet via Baidu, obtaining a small number of positive samples and a large number of unlabeled samples; 2. construct an initial training data set using the PU-learning technique; 3. train three deep models, fasttext, CNN and RNN, on the data set from step 2 using multi-model collaborative training; 4. classify the test data set with the CNN trained on the data set expanded in step 3. By crawling data in a targeted way to construct the positive sample data, this patent yields higher-quality positive samples. Furthermore, negative samples that are farther from the positive samples, and therefore more reliable, can be obtained from the unlabeled samples. The problems and events that business personnel care about are identified from public opinion data with higher accuracy and pushed as timely alerts, which substantially improves the working efficiency of business personnel.

Description

Public opinion text classification method based on deep learning

Technical field

The present invention relates to a public opinion text classification method, and in particular to a deep-learning-based public opinion text classification method that yields higher-quality positive samples, obtains more reliable negative samples that are farther from the positive samples, and achieves high accuracy and high working efficiency.

Background art

At present, the classification of corporate news public opinion text data is still at the stage of manual processing combined with simple rules, which is inefficient and cannot guarantee classification quality.

Summary of the invention

To solve the above problems, the present invention provides a deep-learning-based public opinion text classification method that yields higher-quality positive samples, obtains more reliable negative samples that are farther from the positive samples, and achieves high accuracy and high working efficiency.

The public opinion text classification method based on deep learning comprises the following steps: 1. crawl enterprise public opinion text from the internet via Baidu, obtaining a small number of positive samples and a large number of unlabeled samples; 2. construct the initial training data set using the PU-learning technique; 3. train three deep models, fasttext, CNN and RNN, on the data set from step 2 and apply multi-model collaborative training: each of the three models classifies the unlabeled sample data; if all three classifiers judge a sample to be positive and its sentiment is negative, it is judged a positive sample and added to the positive sample set; if all three classifiers judge a sample to be negative and its sentiment is positive, it is judged a negative sample and added to the negative sample set; other cases are left unprocessed for the time being; 4. classify the test data set with the CNN trained on the data expanded in step 3; if the classification accuracy is below the threshold, iterate the operation in step 3; otherwise the process ends.

The specific method is as follows:

(1) Data crawling and preprocessing

In the news public opinion event classification scenario, data are not as ideal as one might expect. Because of factors such as the high cost of data labeling, it is difficult to accumulate abundant positive and negative samples; how to obtain a large number of accurately labeled positive and negative samples therefore has a great influence on classification quality.

In this patent, we use keyword combinations (for example, LeEco + capital chain, that is, various combinations of the names of enterprises that have had funding problems with terms describing funding problems) to crawl news about enterprises with funding problems, and label these as funding-problem positive sample data. At the same time, we use "good" enterprises that have not had funding problems, such as Tencent and Alibaba, as keywords to crawl related news as unlabeled samples (these cannot be called negative samples, since some of the news may still involve funding problems; they should be treated as unknown samples, also called unlabeled samples). In this way we obtain a small number of positive samples (web crawling plus partial manual labeling for confirmation) and a large number of unlabeled samples.
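The keyword-combination step can be sketched as follows. This is a minimal illustration rather than the patent's crawler: the function name `build_queries`, the second example term, and the use of a space as the joiner are assumptions, and the actual search request and page download are omitted.

```python
from itertools import product


def build_queries(company_names, problem_terms, joiner=" "):
    """Combine every company name with every funding-problem term.

    Both lists and the joiner are illustrative assumptions; the source only
    gives "LeEco + capital chain" as an example combination.
    """
    return [f"{name}{joiner}{term}" for name, term in product(company_names, problem_terms)]


# Enterprises known to have had funding problems -> queries for positive samples.
positive_queries = build_queries(["LeEco"], ["capital chain", "funding shortfall"])

# "Good" enterprises are searched by name only -> queries for unlabeled samples.
unlabeled_queries = ["Tencent", "Alibaba"]

print(positive_queries)
print(unlabeled_queries)
```

News returned for the first group would still go through the partial manual confirmation described above before being treated as positive samples.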

(2) Training set construction

Positive and unlabeled learning (PU-learning) is used to iteratively find, from the large unlabeled sample set obtained in (1), the samples whose cosine distance from the positive samples is as large as possible. These are treated as relatively reliable negative samples and, together with the positive samples, form the training set.
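The idea of ranking unlabeled samples by cosine distance from the positive samples can be sketched as below. The sketch assumes each text has already been turned into a numeric vector and measures distance to the centroid of the positive vectors; the source does not say how texts are vectorized or how distance to the positive set is aggregated, so both choices, and the function names, are assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def farthest_from_positives(positives, unlabeled, k):
    """Return the k unlabeled vectors with the largest cosine distance
    from the centroid of the positive vectors (centroid is an assumption)."""
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    ranked = sorted(unlabeled, key=lambda x: cosine_distance(x, centroid), reverse=True)
    return ranked[:k]


# Toy 2-D vectors standing in for text features.
positives = [[1.0, 0.0], [0.9, 0.1]]
unlabeled = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
candidates = farthest_from_positives(positives, unlabeled, k=2)
print(candidates)
```

In this toy data the vector pointing in nearly the same direction as the positives is left out, while the two pointing elsewhere become candidate negative samples.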

The application scenario of PU-learning is one in which we can clearly identify positive samples but cannot identify negative samples with certainty, because a candidate may in fact be a positive sample that we simply have not yet confirmed. We call these uncertain samples the unlabeled samples U and build the model from them together with the positive samples P.

The PU-learning calculation process is mainly divided into two stages:

First stage: select a reliable negative set RN from the unlabeled examples, as follows:

A. Randomly select a subset S of the positive examples in P and add it to U. The two data sets are now P-S and U+S, defined as ps and us respectively. Train a binary classification model on ps and us, in which label 0 marks the data of us and label 1 marks the data of ps;

B. Apply this classifier to the unlabeled data U to classify the unlabeled sample set and compute the probability that each sample belongs to the negative class. Set a threshold a; if a sample's probability is greater than a, it is considered a relatively reliable negative sample.

Second stage: using the positive examples P and the reliable negative examples RN, train a traditional machine learning classification model to predict new samples.
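The two stages can be sketched as follows. The source does not specify the binary model, so a toy nearest-centroid scorer stands in for it here; `train_centroid_classifier`, `select_reliable_negatives`, `spy_fraction`, the default threshold of 0.8 for a, and the fixed random seed are all assumptions made for illustration.

```python
import math
import random


def train_centroid_classifier(samples, labels):
    """Toy stand-in for the binary model: returns a function giving a
    score in (0, 1) that plays the role of P(label 0) for a sample."""
    def centroid(points):
        return [sum(p[i] for p in points) / len(points) for i in range(len(points[0]))]

    c_pos = centroid([s for s, y in zip(samples, labels) if y == 1])
    c_neg = centroid([s for s, y in zip(samples, labels) if y == 0])

    def prob_negative(x):
        # Closer to the label-0 centroid than to the label-1 centroid -> higher score.
        margin = math.dist(x, c_pos) - math.dist(x, c_neg)
        return 1.0 / (1.0 + math.exp(-margin))

    return prob_negative


def select_reliable_negatives(P, U, spy_fraction=0.25, threshold=0.8, seed=0):
    """First stage: move a random subset S of P into U, train on ps vs us,
    and keep the samples of U scored above the threshold a."""
    rng = random.Random(seed)
    n_spies = max(1, int(len(P) * spy_fraction))
    spy_idx = set(rng.sample(range(len(P)), n_spies))
    ps = [p for i, p in enumerate(P) if i not in spy_idx]  # P - S, label 1
    us = list(U) + [P[i] for i in sorted(spy_idx)]         # U + S, label 0
    prob_negative = train_centroid_classifier(ps + us, [1] * len(ps) + [0] * len(us))
    return [x for x in U if prob_negative(x) > threshold]


# Toy 2-D features: positives cluster near (1, 1).
P = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0)]
U = [(1.0, 1.1), (-5.0, -5.0), (-4.8, -5.2), (0.9, 0.9)]

RN = select_reliable_negatives(P, U)

# Second stage: train the final model on P (label 1) and RN (label 0).
final_model = train_centroid_classifier(P + RN, [1] * len(P) + [0] * len(RN))
print(RN)
```

Unlabeled points that sit close to the positive cluster are scored low and stay out of RN, which matches the intent that only relatively reliable negatives enter the training set.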

(3) Multi-model collaborative training

It is mainly divided into three steps:

A. Classify the unlabeled data with each of the three classifier models, fasttext, cnn and rnn. If all three models judge a sample to be in the positive class (a funding problem exists), add it directly to the training set as a positive sample; if all three judge it to be in the negative class (no funding problem exists), treat it as a negative sample; if two classifiers judge it positive and one judges it negative, retain the sample for manual labeling; if two classifiers judge it negative and one judges it positive, leave it unprocessed and continue to treat it as unlabeled data.

B. After the operation in A, update the training set, continue training the three models, and compute the classification accuracy on the test set;

C. Iterate the operations in A and B until the accuracy on the test set reaches the threshold, then stop iterating and save the model.
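The voting rule in step A can be sketched as a small routing function. The three models themselves are not shown; their outputs are represented as 0/1 votes, and the function names and action labels are assumptions introduced for illustration.

```python
def route_sample(votes):
    """Map three 0/1 votes (1 = positive class) to an action from step A."""
    if len(votes) != 3:
        raise ValueError("expected one vote from each of the three models")
    positives = sum(votes)
    if positives == 3:
        return "add_positive"    # all three agree: funding problem exists
    if positives == 0:
        return "add_negative"    # all three agree: no funding problem
    if positives == 2:
        return "manual_review"   # two positive, one negative: label by hand
    return "keep_unlabeled"      # two negative, one positive: leave as is


def apply_votes(samples, votes_per_sample):
    """Group samples by the action chosen for their three votes."""
    buckets = {"add_positive": [], "add_negative": [], "manual_review": [], "keep_unlabeled": []}
    for sample, votes in zip(samples, votes_per_sample):
        buckets[route_sample(votes)].append(sample)
    return buckets


# Votes are ordered (fasttext, cnn, rnn); 1 = funding problem, 0 = none.
buckets = apply_votes(
    ["news_a", "news_b", "news_c", "news_d"],
    [(1, 1, 1), (0, 0, 0), (1, 0, 1), (0, 1, 0)],
)
print(buckets)
```

Only unanimous decisions change the training set automatically, so a single model's error cannot add a wrongly labeled sample.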

(4) Model classification

Using the updated training data obtained in (3), the deep convolutional neural network (CNN) trained in (3) classifies the test data set. If the classification accuracy is below the threshold (0.8), the operation in (3) continues; otherwise the process ends.
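The stopping rule can be sketched as a loop around one round of collaborative training. Here `run_round` is a hypothetical callable that performs one round of (3) and returns the CNN's accuracy on the test set; `max_rounds` is a safety guard added for the sketch and is not part of the source.

```python
def iterate_until_accurate(run_round, threshold=0.8, max_rounds=10):
    """Repeat run_round() until it reports accuracy >= threshold.

    run_round: hypothetical callable that expands the training set, retrains,
    and returns the CNN test accuracy. max_rounds is an added guard.
    """
    history = []
    for _ in range(max_rounds):
        accuracy = run_round()
        history.append(accuracy)
        if accuracy >= threshold:
            break  # accuracy is no longer below the threshold: stop
    return history


# Stub standing in for real training rounds.
fake_accuracies = iter([0.60, 0.75, 0.82, 0.90])
history = iterate_until_accurate(lambda: next(fake_accuracies))
print(history)
```

The returned history makes it easy to check how many rounds were needed before the 0.8 threshold was met.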

By crawling data in a targeted way to construct the positive sample data, this patent yields higher-quality positive samples. Combined with PU-learning, it can obtain from the unlabeled samples negative samples that are farther from the positive samples and therefore more reliable. Combining PU-learning with multi-model collaborative training achieves good results in the situation, common in industry, of a small number of positive samples and a large amount of unlabeled sample data. The problems and events that business personnel care about are identified from public opinion data with higher accuracy and pushed as timely alerts, which substantially improves their working efficiency, and analysis of the recognition results helps business personnel take risk management measures.

Brief description of the drawings

Fig. 1 is a schematic diagram of the workflow of this patent

Fig. 2 is a model architecture diagram of the character-level convolutional neural network (char-CNN) of this patent

Detailed description of the embodiments

As shown in Fig. 1 and Fig. 2, the public opinion text classification method based on deep learning comprises the following steps: 1. crawl enterprise public opinion text from the internet via Baidu, obtaining a small number of positive samples and a large number of unlabeled samples; 2. construct the initial training data set using the PU-learning technique; 3. train three deep models, fasttext, CNN and RNN, on the data set from step 2 and apply multi-model collaborative training: each of the three models classifies the unlabeled sample data; if all three classifiers judge a sample to be positive and its sentiment is negative, it is judged a positive sample and added to the positive sample set; if all three classifiers judge a sample to be negative and its sentiment is positive, it is judged a negative sample and added to the negative sample set; other cases are left unprocessed for the time being; 4. classify the test data set with the CNN trained on the data set expanded in step 3; if the classification accuracy is below the threshold, iterate the operation in step 3; otherwise the process ends.

The specific method is as follows:

(1) Data crawling and preprocessing

(2) Training set construction

The PU-learning calculation process is mainly divided into two stages:

(3) Multi-model collaborative training

It is mainly divided into three steps:

(4) Model classification

The embodiments described above merely illustrate preferred embodiments of the present invention and do not limit its scope. Without departing from the spirit of the design of the present invention, all variations and improvements made to the technical solution of the present invention by persons of ordinary skill in the art shall fall within the scope of protection defined by the claims of the present invention.

Claims

1. A public opinion text classification method based on deep learning, comprising the following steps:

1) crawling enterprise public opinion text from the internet via Baidu, obtaining a small number of positive samples and a large number of unlabeled samples;

2) constructing an initial training data set using the PU-learning technique;

3) training three deep models, fasttext, CNN and RNN, on the data set from 2) and applying multi-model collaborative training, in which each of the three models classifies the unlabeled sample data; if all three classifiers judge a sample to be positive and its sentiment is negative, it is judged a positive sample and added to the positive sample set; if all three classifiers judge a sample to be negative and its sentiment is positive, it is judged a negative sample and added to the negative sample set; other cases are left unprocessed for the time being;

4) classifying the test data set with the CNN trained on the data set expanded in 3); if the classification accuracy is below the threshold, iterating the operation in 3); otherwise the process ends;

the specific method being as follows:

(1) Data crawling and preprocessing

In the news public opinion event classification scenario, data are not as ideal as one might expect. Because of factors such as the high cost of data labeling, it is difficult to accumulate abundant positive and negative samples; how to obtain a large number of accurately labeled positive and negative samples therefore has a great influence on classification quality;

In this patent, keyword combinations are used to crawl news about enterprises with funding problems, which are labeled as funding-problem positive sample data; at the same time, "good" enterprises that have not had funding problems are used as keywords to crawl related news as unlabeled samples; in this way a small number of positive samples and a large number of unlabeled samples are obtained;

(2) Training set construction

Positive and unlabeled learning is used to iteratively find, from the large unlabeled sample set obtained in (1), the samples whose cosine distance from the positive samples is as large as possible; these are treated as relatively reliable negative samples and, together with the positive samples, form the training set;

The application scenario of PU-learning is one in which positive samples can be clearly identified but negative samples cannot be identified with certainty, because a candidate may in fact be a positive sample that has simply not yet been confirmed; these uncertain samples are called the unlabeled samples U, and the model is built from them together with the positive samples P;

The PU-learning calculation process is mainly divided into two stages:

A. randomly select a subset S of the positive examples in P and add it to U; the two data sets are now P-S and U+S, defined as ps and us respectively; train a binary classification model on ps and us, in which label 0 marks the data of us and label 1 marks the data of ps;

B. apply this classifier to the unlabeled data U to classify the unlabeled sample set and compute the probability that each sample belongs to the negative class; set a threshold a; if a sample's probability is greater than a, it is considered a relatively reliable negative sample;

Second stage: using the positive examples P and the reliable negative examples RN, train a traditional machine learning classification model to predict new samples;

(3) Multi-model collaborative training

It is mainly divided into three steps:

A. classify the unlabeled data with each of the three classifier models, fasttext, cnn and rnn; if all three models judge a sample to be in the positive class, add it directly to the training set as a positive sample; if all three judge it to be in the negative class, treat it as a negative sample; if two classifiers judge it positive and one judges it negative, retain the sample for manual labeling; if two classifiers judge it negative and one judges it positive, leave it unprocessed and continue to treat it as unlabeled data;

B. after the operation in A, update the training set, continue training the three models, and compute the classification accuracy on the test set;

C. iterate the operations in A and B until the accuracy on the test set reaches the threshold, then stop iterating and save the model;

(4) Model classification

Using the updated training data obtained in (3), the deep convolutional neural network (CNN) trained in (3) classifies the test data set; if the classification accuracy is below the threshold (0.8), the operation in (3) continues; otherwise the process ends.