CN112989052B - Chinese news long text classification method based on combination-convolution neural network - Google Patents
- Publication number
- CN112989052B (application number CN202110419616.2A)
- Authority
- CN
- China
- Prior art keywords
- chinese news
- news text
- text
- chinese
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Chinese news text classification method based on a combined-convolutional neural network, comprising the following steps: S1, acquiring a Chinese news text data set and preprocessing the data set; S2, constructing a vocabulary table based on the preprocessed data set, and standardizing the Chinese news texts in the preprocessed data set through the vocabulary table to obtain a text feature representation of the Chinese news texts; S3, constructing a combined-convolutional neural network model, training the model on the standardized data set, and classifying Chinese news texts with the trained model. The method enables accurate and effective classification of Chinese news texts.
Description
Technical Field
The invention relates to the technical field of Chinese news text classification, in particular to a Chinese news text classification method based on a combined-convolutional neural network.
Background
The Internet and big data industries are developing rapidly, and news is one of the important means by which people follow social developments and obtain information. Since the late 1990s, a growing number of news websites have been built and a wide variety of mobile news apps have appeared, generating massive amounts of news data. To acquire and manage valuable news data efficiently, news text classification has become a popular research field worldwide. News text classification benefits the management of text information, the orderly organization of news, and the mining of news data.
With global economic integration, Chinese has become one of the most widely used languages in the world and holds an important place among the world's languages. However, there has been little work on Chinese news text classification, especially for long Chinese texts. On the one hand, few corpora are available for research on Chinese text classification; on the other hand, Chinese is considerably more complex than Western languages, and features are difficult to extract with traditional methods. These are the reasons why Chinese news text classification has developed slowly.
Text classification is one of the fundamental problems of natural language processing, and solving it opens the door to many other natural language processing tasks, such as information retrieval, machine translation, and automatic summarization. Common machine learning algorithms for news text classification include naive Bayes (NB), k-nearest neighbors (KNN), decision trees (DT), neural networks (NNs), maximum entropy models (ME), and support vector machines (SVM).
In 2003, Bengio et al. first applied distributed representations of words to statistical language models, and neural language models began to attract widespread attention. In 2008, Collobert et al. proposed using neural networks to represent text vocabularies as tensor data, so that similar words are mapped to nearby positions in the vector space and the meaning of a word is determined by the words in its context; however, this way of sharing word embeddings can only exploit low-level information in a matrix. In 2013, Mikolov et al. proposed two models: the continuous bag-of-words model (CBOW) and the continuous Skip-gram model. CBOW takes the word vectors of the context of a target word as input and outputs the word vector of that target word. The continuous Skip-gram model predicts in the opposite direction: it takes the vector of the center word as input and predicts the word vectors of its context. The continuous Skip-gram model handles uncommon words better, but training takes too long when the data volume is large. To train efficiently on dictionaries with millions of entries and data sets with billions of words, Google open-sourced word2vec, a tool for computing word vectors. The tool essentially maps words to a low-dimensional space, and the resulting low-dimensional word embedding vectors are fed into a classifier. Moreover, the word vectors (word embeddings) trained by word2vec measure the similarity between words well. In the same year, Barakat et al. noted in a published paper that multilayer neural networks have strong feature learning ability and that, through training, they can represent the true meaning of the original data more accurately.
The convolutional neural network model was originally invented for computer vision; it was later shown by Meek to be effective for NLP, and in particular very effective for semantic analysis. Subsequently, LeCun et al. proposed a character-level convolutional neural network model and applied it to semantic analysis and topic classification tasks on different classification data sets. For Chinese text classification, however, this method is very slow in both training and inference, because the term sets and word N-grams for Chinese text classification are much larger than those for English. Moreover, character-level feature processing discards the semantic information of words; in Chinese, words and characters have many overlapping meanings, so this feature extraction approach is deficient.
Therefore, it is necessary to provide a method for classifying Chinese news texts based on a combined-convolutional neural network.
Disclosure of Invention
The invention aims to provide a Chinese news text classification method based on a combined-convolutional neural network, which is used for solving the problems in the prior art and realizing accurate and effective classification of Chinese news texts.
To achieve the above purpose, the invention provides a Chinese news text classification method based on a combined-convolutional neural network, comprising the following steps:
S1, acquiring a Chinese news text data set, and preprocessing the data set;
S2, constructing a vocabulary table based on the preprocessed data set, and standardizing the Chinese news texts in the preprocessed data set through the vocabulary table to obtain a text feature representation of the Chinese news texts;
S3, constructing a combined-convolutional neural network model, training the model on the standardized data set, and classifying Chinese news texts with the trained model.
Preferably, in S1, the method for preprocessing the data set includes:
S1.1, constructing a data index: setting the sequence length of the Chinese news texts based on big data visualization analysis, and constructing a data index based on that sequence length;
S1.2, data integration: converting the Chinese news texts into a binary data stream.
Preferably, in S2, the method for constructing a vocabulary table based on the preprocessed data set includes: building a vocabulary table for Chinese news text classification by removing stop words and counting word frequencies, wherein the vocabulary table comprises words and the index number corresponding to each word.
Preferably, in S2, standardizing the Chinese news texts in the preprocessed data set through the vocabulary table specifically includes: data standardization of the Chinese news text content and data standardization of the Chinese news text labels.
Preferably, the specific method for data standardization of the Chinese news text content includes: first, traversing the index sequence of the vocabulary table to obtain the words that occur in the Chinese news text and their corresponding index numbers;
second, converting each word in the Chinese news text into a word id by a dictionary lookup, and representing the Chinese news text as a vector of word ids, thereby completing the data standardization of the Chinese news text content.
Preferably, the specific method for data standardization of the Chinese news text labels includes: using One-Hot encoding, setting the position of the label index corresponding to each Chinese news text to 1 and all remaining positions to 0, thereby representing the text labels as vectors and completing the data standardization of the Chinese news text labels.
Preferably, in S3, the combined-convolutional neural network model is a six-layer model comprising an Embedding layer, a convolutional layer, a pooling layer, a first hidden layer, a second hidden layer, and a fully connected layer, connected in sequence, wherein:
the Embedding layer receives the input Chinese news text data, uses word2vec to map the words in the Chinese news text to real-valued vectors, and embeds these vectors into the Chinese news text to obtain the word vector representation of the text, which serves as the input of the convolutional layer;
the convolutional layer extracts feature vectors of the Chinese news text using a plurality of convolution kernels of different sizes;
the pooling layer performs a max pooling operation on the output of the convolutional layer;
the first hidden layer combines the feature vectors extracted by the convolution kernels of different sizes in the convolutional layer;
the second hidden layer performs nonlinear dimensionality reduction;
Dropout is added to the fully connected layer, which is further connected to a Softmax layer, and the Softmax layer performs classification prediction on the input Chinese news text.
Preferably, in S3, the combined-convolutional neural network model is trained by minimizing a loss function, wherein the loss function employs multi-class cross entropy.
The invention discloses the following technical effects:
(1) The invention builds a term set suitable for Chinese text classification by constructing a data index, and uses it for the classification of long news texts. Meanwhile, by optimizing the structure of the classical convolutional neural network model, a combined-convolutional neural network model is proposed that extracts text features automatically and improves the classification of Chinese news texts. In addition, word vector features trained with the word2vec bag-of-words model are used as the raw input, and multiple groups of experiments compare the proposed model with traditional news text classification methods; the classification accuracy of the combined-convolutional neural network on Chinese news texts reaches 93.69%. In further experiments, the influence of a heavily imbalanced sample data set is removed, which further improves the accuracy of the method.
(2) The invention provides a supervised combined-convolutional neural network model that improves the structure of the classical convolutional neural network model by convolving separately and then recombining the results. This adds convolution operations without deepening the network, achieves a better text classification result, solves the problem of slow training of Chinese text classifiers, and strengthens the extraction of local text features.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a Chinese news text classification method based on a combined-convolutional neural network according to the present invention;
FIG. 2 is a statistical chart of the frequency of occurrence of text length in an embodiment of the present invention;
FIG. 3 is a diagram illustrating a cumulative distribution function of text length according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a combined-convolutional neural network model in an embodiment of the present invention;
FIG. 5 is a schematic diagram of training accuracy and validation accuracy of a combined-convolutional neural network model in an embodiment of the present invention;
FIG. 6 is a diagram illustrating training loss and validation loss of a combined-convolutional neural network model in an embodiment of the present invention;
FIG. 7 is a diagram illustrating a confusion matrix of classification results according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to FIG. 1, the present embodiment provides a Chinese news text classification method based on a combined-convolutional neural network, including the following steps:
S1, acquiring a Chinese news text data set, and preprocessing the data set;
The data set used in this embodiment is THUCnews, which was generated by filtering historical data from the RSS subscription channels of Sina News and contains 836075 news documents (2.04 GB), all in UTF-8 plain text format. Based on the original Sina News classification system, the documents are reorganized into 14 categories: science, stock, sports, entertainment, politics, society, education, finance, home, games, real estate, fashion, lottery, constellations.
The method for preprocessing the data set comprises the following steps: based on big data visualization analysis, the text sequence length is set, a data index is constructed based on the text sequence length, and the text information is converted into a binary data stream so that data can be read and written in batches.
To make constructing the overall data index easier, this embodiment performs a big data visualization analysis of THUCnews to determine the optimal text sequence length, which also serves as the padding length for sentences in the later model. According to the statistics, the average number of words per news item is 941. The histogram in FIG. 2 shows that most texts are shorter than 2000 words, and the cumulative distribution function of text length in FIG. 3 shows that the 90% quantile corresponds to a text length of 1857. Based on this analysis, the text length read in this embodiment is set to 2000.
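The length analysis above can be illustrated with a minimal sketch. It assumes a nearest-rank definition of the quantile and uses made-up text lengths; the function names `length_quantile` and `coverage` are hypothetical and do not come from the patent.

```python
import math

def length_quantile(lengths, q):
    """Smallest length L such that at least a fraction q of the texts have length <= L."""
    ordered = sorted(lengths)
    # Nearest-rank definition of the quantile (an assumption of this sketch).
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def coverage(lengths, seq_len):
    """Fraction of texts that fit into seq_len without truncation."""
    return sum(1 for n in lengths if n <= seq_len) / len(lengths)

# Made-up text lengths for illustration only; the embodiment measures THUCnews.
lengths = [60, 120, 300, 450, 800, 941, 1500, 1857, 1900, 2500]
median_len = length_quantile(lengths, 0.5)
covered = coverage(lengths, 2000)
```

Choosing a sequence length slightly above the 90% quantile, as the embodiment does with 2000 versus 1857, keeps most texts intact while bounding the padded input size.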
Because more than 800,000 text files are processed, reading them takes a long time. The Python standard module pickle is therefore used to store complex data types and convert the text information into binary data streams. Binary files load very quickly, more than 50 times faster than the text files. The serialized data is stored on the hard disk, where it is easy to read during experiments, and the original data can be recovered by deserializing it. To avoid memory overflow, a fixed number of files are integrated and stored at a time.
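A minimal sketch of this chunked serialization with the standard pickle module follows. The chunk size, file naming scheme, and the (label, text) record layout are assumptions made for illustration; the patent does not specify them.

```python
import os
import pickle
import tempfile

def dump_in_chunks(records, out_dir, chunk_size):
    """Serialize records into numbered pickle files holding chunk_size records each."""
    paths = []
    for start in range(0, len(records), chunk_size):
        # Hypothetical naming scheme: chunk_00000.pkl, chunk_00001.pkl, ...
        path = os.path.join(out_dir, f"chunk_{start // chunk_size:05d}.pkl")
        with open(path, "wb") as fh:
            pickle.dump(records[start:start + chunk_size], fh,
                        protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths

def load_chunks(paths):
    """Deserialize the chunk files and yield the original records in order."""
    for path in paths:
        with open(path, "rb") as fh:
            for record in pickle.load(fh):
                yield record

# Toy (label, text) records standing in for news documents.
records = [("sports", "text a"), ("finance", "text b"), ("games", "text c")]
with tempfile.TemporaryDirectory() as tmp:
    chunk_paths = dump_in_chunks(records, tmp, chunk_size=2)
    restored = list(load_chunks(chunk_paths))
```

Writing a bounded number of records per file means only one chunk needs to be held in memory at a time, which is the stated motivation for storing the files in groups.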
S2, constructing a vocabulary table based on the preprocessed data set, and carrying out standardization processing on the Chinese news text in the preprocessed data set through the vocabulary table to obtain text characteristic representation of the Chinese news text;
The method for constructing the vocabulary table specifically comprises: building a vocabulary table for Chinese news text classification by removing stop words and counting word frequencies, wherein the vocabulary table comprises words and the index number corresponding to each word.
The vocabulary table prepares for the standardization of the Chinese news text data. First, stop words are removed from the Chinese news texts. Stop words are excluded from the vocabulary table because they occur with very high frequency yet carry little meaning; keeping a large number of them in the vocabulary table would waste resources. The more keywords the vocabulary contains, the better the feature extraction, so the vocabulary should leave more room for keywords.
In this embodiment, the vocabulary table excludes the 20 most frequently used stop words in the Chinese news texts, including: "what", "is", "i", "having", "and", "just", "all", "one", "on", "also", "to", "about", "go", "you", "about", "this".
The number of Chinese characters is large, and it is difficult to give an exact figure. According to the statistics of Beijing national security information equipment company, its Chinese character library contains 91251 Chinese characters, but only a few thousand are in common use; these are divided into a table of common characters and a table of secondary common characters, totaling about 2500 to 7000 characters, and the statistics for simplified and traditional forms differ little. Therefore, in this embodiment, the frequencies of all words in the Chinese news texts are counted, and the 7000 most frequent words are used as the vocabulary.
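The vocabulary construction described above can be sketched as follows. This is an illustration under stated assumptions: the texts are assumed to be tokenized already, index 0 is reserved for padding, and the name `build_vocab` is hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_docs, stop_words, max_size):
    """Map the max_size most frequent non-stop-word tokens to index numbers."""
    counts = Counter(
        tok for doc in tokenized_docs for tok in doc if tok not in stop_words
    )
    # most_common returns tokens in order of decreasing frequency.
    words = [word for word, _ in counts.most_common(max_size)]
    # Index numbers start at 1; 0 is reserved for padding (an assumption).
    return {word: idx for idx, word in enumerate(words, start=1)}

# Toy tokenized documents; the embodiment would use max_size=7000.
docs = [["股市", "上涨", "的"], ["股市", "的", "比赛"], ["股市", "上涨"]]
word_to_id = build_vocab(docs, stop_words={"的"}, max_size=2)
```

Capping the vocabulary at a fixed size bounds the Embedding table, and removing stop words first keeps high-frequency function words from occupying those slots.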
The Chinese news texts in the preprocessed data set are standardized through the vocabulary table, i.e., converted into a standard form that a computer can process. The standardization specifically includes:
1) Data standardization of the Chinese news text content:
first, traversing the index sequence of the vocabulary table to obtain the words that occur in the Chinese news text and their corresponding index numbers;
second, converting each word in the Chinese news text into a word id by a dictionary lookup; specifically, the mapping from words to word ids is implemented with a list comprehension and an anonymous lambda function. The word ids are then embedded into the Chinese news text to obtain a vectorized representation of the text, completing the data standardization of the Chinese news text content.
2) Data standardization of the Chinese news text labels: using One-Hot encoding, which is widely used for categorical data, the position of the label index corresponding to each Chinese news text is set to 1 and all remaining positions are set to 0, thereby representing the text labels as vectors and completing the data standardization of the Chinese news text labels.
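The two standardization steps can be sketched together. Several details here are assumptions that the patent does not state: words missing from the vocabulary are dropped, sequences are right-padded with id 0, and the category order is arbitrary. The names `encode_text` and `one_hot` are hypothetical.

```python
def encode_text(tokens, word_to_id, seq_len, pad_id=0):
    """Convert tokens to word ids, then truncate or right-pad to seq_len."""
    to_id = lambda tok: word_to_id[tok]  # dictionary lookup as a lambda
    # Tokens absent from the vocabulary are skipped (an assumption).
    ids = [to_id(tok) for tok in tokens if tok in word_to_id][:seq_len]
    return ids + [pad_id] * (seq_len - len(ids))

def one_hot(label, categories):
    """Return a vector with 1 at the label's index and 0 elsewhere."""
    vec = [0] * len(categories)
    vec[categories.index(label)] = 1
    return vec

# Toy vocabulary and a three-category subset for illustration.
word_to_id = {"股市": 1, "上涨": 2, "比赛": 3}
categories = ["science", "stock", "sports"]
ids = encode_text(["股市", "今天", "上涨"], word_to_id, seq_len=5)
label_vec = one_hot("stock", categories)
```

In the embodiment the sequence length would be 2000 and the label vector would have 14 entries, one per category.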
S3, constructing a combined-convolution neural network model, training the combined-convolution neural network model based on the data set after standardization processing, and completing Chinese news text classification through the trained combined-convolution neural network model.
In this embodiment, the combined-convolutional neural network model is a six-layer model, as shown in FIG. 4, specifically:
The first layer is an Embedding layer, which receives the input data. Because the input of news classification is text data, and text data can only be fed to the network after being converted into real-valued vectors, the Embedding layer uses word2vec to map the words in the Chinese news text to real-valued vectors and embeds them into the Chinese news text, yielding the word vector representation of the text, which serves as the input of the convolutional layer. In other words, the Chinese news text standardized in step S2 undergoes a second vector mapping.
The second and third layers are the convolutional layer and the pooling layer, respectively. Compared with the classical convolutional neural network model, the combined-convolutional neural network model mainly improves how the convolution and pooling operations are performed. Classical convolutional neural network models use either single-layer or multi-layer convolution. With single-layer convolution, the local text feature information extracted by one convolution kernel is limited and incomplete; with multi-layer convolution, the text features extracted by stacked convolution operations are often too abstract to express the real meaning of the text. Therefore, to extract more complete local text block features, the convolutional layer of the combined-convolutional neural network model extracts text features with three convolution kernels of different sizes. Meanwhile, to extract the main features and reduce the number of feature parameters, max pooling, which downsamples its input, is applied separately to each output of the convolutional layer, so that more, and more important, text features are extracted without deepening the neural network.
The fourth and fifth layers are intermediate hidden layers, namely the first hidden layer and the second hidden layer; a classical convolutional neural network does not have these two hidden layers. Since the output of the third layer is the result of three pooling operations, the first hidden layer is used to combine the feature vectors extracted by the different convolution kernels. In the combined-convolutional neural network model of this embodiment, the number of convolution kernels of each size is set to a large value, so the vector output by the first hidden layer after combining the feature vectors has too many dimensions; a second hidden layer is therefore added for dimensionality reduction.
The sixth layer is a fully connected layer. First, a Dropout layer is added to the fully connected layer to prevent overfitting and improve the generalization ability of the model; second, the model uses ReLU as the activation function, which increases the nonlinearity of the neural network model and mitigates the vanishing gradient problem; finally, Softmax is used to perform classification prediction on the news text.
The working principle of the combined-convolution neural network model is as follows:
The Embedding layer is a dictionary lookup that maps integer indices to dense vectors. The layer receives integers as input, looks up the vectors associated with those integers in an internal dictionary, and returns them as output. The word vector mapping in this layer uses Google's word vector computing tool word2vec to embed the input data as words, yielding the word vectors that are input to the convolutional layer.
The vectorized Chinese news text is mapped to k-dimensional word vectors. Suppose that $x_i \in \mathbb{R}^k$ is the vector representation of the i-th word, so a sentence of length n is given by equation (1):

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n \tag{1}$$

where $\oplus$ denotes the concatenation operation and $x_{1:n}$ denotes the word vector matrix in the 1st through n-th windows of the input.
The convolutional layer performs convolution operations on consecutive windows of width k using convolution kernels of different sizes, where a convolution kernel is $w \in \mathbb{R}^{h \times k}$. The heights h of the three convolution kernels in this embodiment are set to 3, 5, and 7, respectively, and there are r convolution kernels of each size, with r set to 256. The weight matrix $w$ extracts features from a text block of h words $x_{i:i+h-1}$; one extracted feature $c_i$ is given by equation (2):

$$c_i = f(w \cdot x_{i:i+h-1} + b) \tag{2}$$

where $f$ is a nonlinear activation function and $b \in \mathbb{R}$ is the bias term. Applying the convolution operation to the word vectors of a complete news text $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ yields a feature map $c$, as shown in equation (3):

$$c = [c_1, c_2, \ldots, c_{n-h+1}] \tag{3}$$

where $c \in \mathbb{R}^{n-h+1}$. To extract the main features while reducing the feature parameters and the amount of computation, max pooling takes the maximum value in each feature map as the most important feature extracted by that convolution kernel from the text vectors, giving a feature vector of dimension r. With $\hat{c}$ denoting the result of max pooling, the pooling operation is shown in equation (4):

$$\hat{c} = \max\{c\} \tag{4}$$

The above is the feature extraction process for convolution kernels of one size. In the combined-convolutional neural network model of this embodiment, multiple features are obtained with convolution kernels of different sizes, so the max pooling results of the different convolution kernels are concatenated to obtain a feature vector $z$, as shown in equation (5):

$$z = \hat{c}^{(3)} \oplus \hat{c}^{(5)} \oplus \hat{c}^{(7)} \tag{5}$$

where $\hat{c}^{(3)}$, $\hat{c}^{(5)}$, and $\hat{c}^{(7)}$ denote the feature vectors output after max pooling for the convolution kernels of heights 3, 5, and 7, respectively.
Then, a hidden layer is added for nonlinear dimensionality reduction, producing a feature vector $z' \in \mathbb{R}^d$, where d is the number of hidden layer neuron nodes; in this embodiment, d is set to 128.
Finally, the features are passed to the fully connected layer, the probability distribution over the 14 category labels is output through the Softmax layer, and the category with the maximum probability is taken to obtain the label value of the predicted category $\hat{y}$, as shown in equation (6):

$$\hat{y} = \arg\max \, \mathrm{softmax}(W z' + b_o) \tag{6}$$

where $W \in \mathbb{R}^{m \times d}$ is the weight matrix, m is the number of categories, and $b_o \in \mathbb{R}^m$ is the bias term. To speed up convergence, mini-batch gradient descent is adopted, and the batch size is set to 64 in this embodiment. In addition, a Dropout layer and the ReLU activation function are introduced at the fully connected layer.
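The forward pass described by equations (1) to (6) can be sketched in plain Python. This is an illustrative sketch and not the patented implementation: the toy dimensions, the random weights, the helper names, and the use of ReLU both as the convolution nonlinearity and in the hidden layer are assumptions, and the input matrix stands in for the output of the Embedding layer.

```python
import math
import random

def conv_max_pool(x, kernels, biases):
    """Slide each (h x k) kernel over the (n x k) text matrix, apply ReLU, keep the max."""
    n, k = len(x), len(x[0])
    pooled = []
    for w, b in zip(kernels, biases):
        h = len(w)
        feature_map = [
            max(0.0, b + sum(w[a][j] * x[i + a][j] for a in range(h) for j in range(k)))
            for i in range(n - h + 1)
        ]
        pooled.append(max(feature_map))  # max pooling over the whole feature map
    return pooled

def dense(v, weights, bias):
    """Affine map producing one output per row of weights."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + b for row, b in zip(weights, bias)]

def softmax(z):
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, params):
    """Return (class probabilities, predicted class index) for one text matrix x."""
    combined = []
    for h in (3, 5, 7):  # kernel heights used in the embodiment
        combined += conv_max_pool(x, params["conv_w"][h], params["conv_b"][h])
    # Hidden layer for nonlinear dimensionality reduction (ReLU is an assumption).
    hidden = [max(0.0, v) for v in dense(combined, params["hid_w"], params["hid_b"])]
    probs = softmax(dense(hidden, params["out_w"], params["out_b"]))
    return probs, probs.index(max(probs))

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes for illustration; the embodiment uses n = 2000, r = 256, d = 128, m = 14.
rng = random.Random(0)
n, k, r, d, m = 12, 4, 2, 3, 14
params = {
    "conv_w": {h: [rand_matrix(rng, h, k) for _ in range(r)] for h in (3, 5, 7)},
    "conv_b": {h: [0.0] * r for h in (3, 5, 7)},
    "hid_w": rand_matrix(rng, d, 3 * r), "hid_b": [0.0] * d,
    "out_w": rand_matrix(rng, m, d), "out_b": [0.0] * m,
}
x = rand_matrix(rng, n, k)  # stands in for the embedded news text
probs, predicted = forward(x, params)
```

The three kernel heights run side by side on the same input and their pooled outputs are concatenated, so the network gains convolution operations without gaining depth, which is the structural point the description makes.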
In deep learning, it is important to divide the data reasonably into a training set, a validation set, and a test set. In this embodiment, the amount of data approaches one million samples; in this case, more sample data should go to the training set, and the validation and test sets need not be large. The ratio of the training set, validation set, and test set is therefore set to 82:6:12: 686075 Chinese news samples are used for training, 50000 validation samples are used for model validation and tuning, and 100000 test samples are used to evaluate the classification performance of the model.
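The split sizes can be checked with a short sketch. The random shuffle with a fixed seed is an assumption, since the text does not say how samples were assigned to each subset, and `split_indices` is a hypothetical helper name.

```python
import random

def split_indices(n_total, n_val, n_test, seed=0):
    """Shuffle sample indices and cut them into train, validation, and test parts."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(836075, n_val=50000, n_test=100000)
# Percentages rounded to whole numbers, for comparison with the stated 82:6:12.
ratios = [round(100 * len(part) / 836075) for part in (train, val, test)]
```

The three subset sizes add up to the 836075 documents in the data set, and the rounded percentages reproduce the stated ratio.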
The validation set is used to check the accuracy and loss of the model and to find the iteration at which the model starts to overfit. A pair of accuracy and loss values is output every 100 iterations, and the accuracy and loss curves are plotted, as shown in FIG. 5 and FIG. 6. The network was trained for 20000 iterations in total, and overfitting begins at around the 10000th iteration: the training accuracy and training loss become relatively stable, while the validation accuracy no longer improves and the validation loss no longer decreases. Stopping the iterative training at that point therefore both reduces the computational load and avoids overfitting.
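The embodiment locates the onset of overfitting by inspecting the plotted curves. As a hedged illustration of the same idea, the sketch below picks a stopping iteration automatically from validation losses recorded every 100 iterations; the `patience` rule and the function name are assumptions and are not part of the patent.

```python
def best_checkpoint(val_losses, eval_every=100, patience=3):
    """Return the iteration with the lowest validation loss seen before
    `patience` consecutive evaluations pass without improvement."""
    best_loss = float("inf")
    best_iter = 0
    stale = 0
    for step, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_iter, stale = loss, step * eval_every, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation loss has stopped decreasing
    return best_iter

# Made-up validation losses, one value per 100 training iterations.
val_losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.66, 0.5]
stop_at = best_checkpoint(val_losses, eval_every=100, patience=3)
```

A larger `patience` tolerates longer plateaus before stopping, at the cost of extra training iterations.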
Meanwhile, a Dropout layer, a regularization method, is added to the fully connected layer of the neural network to reduce overfitting. Dropout is an important method for preventing overfitting and improving performance in convolutional neural networks: in each training batch, the output value of each hidden layer node is set to zero with probability 1-p. Reducing the interaction among feature detectors (hidden layer nodes) in this way effectively reduces overfitting and provides a degree of regularization.
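A minimal sketch of Dropout on a list of activations follows, reading p as the probability of keeping a node. Rescaling the kept values by 1/p ("inverted dropout") is an assumption of this sketch; the text itself only says that outputs are zeroed with probability 1-p.

```python
import random

def dropout(activations, keep_prob, rng, training=True):
    """Zero each activation with probability 1 - keep_prob during training."""
    if not training:
        return list(activations)  # Dropout is disabled at inference time
    # Kept values are scaled by 1 / keep_prob so the expected output is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
unchanged = dropout(hidden, keep_prob=0.5, rng=rng, training=False)
```

Because a different random subset of nodes is silenced in every batch, no node can rely on a specific partner being present, which is the reduced interaction among feature detectors mentioned above.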
The process of training the combined-convolutional neural network model based on the term set includes:
the combined-convolutional neural network model is trained by minimizing a loss function on the training set; the loss function is the multi-class cross entropy, i.e., the logarithmic loss function, as shown in equation (7):

$$L = -\frac{1}{N}\sum_{j=1}^{N}\sum_{t=1}^{M} y_{jt}\,\log p_{jt} \qquad (7)$$

where $L$ is the loss function and $Y$ is the output variable (the labels); $N$ is the number of instances and $M$ is the number of categories; $y_{jt}$ is a binary indicator of whether category $t$ is the true category of input instance $x_j$; and $p_{jt}$ is the probability that the $j$-th of the $N$ instances is predicted as the $t$-th category.

The loss value measures the distance between the probability distribution output by the network and the true probability distribution of the labels, so training the network drives its output as close as possible to the true labels. The optimizer uses the Adam optimization algorithm, which introduces second-moment (squared-gradient) correction and computes an adaptive learning rate for each parameter in its search for the optimum point.

Model training runs for 10,000 iterations in total and takes about 20 minutes to complete. For this reason, the model saving and loading mechanism of TensorFlow is used: a previously trained model is loaded and training continues from it, which saves a large amount of time in the experiments.
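As a worked illustration of the multi-class cross entropy referred to as equation (7), the sketch below evaluates the standard form of that loss on two toy predictions. The function name, the `eps` guard against log(0) and the example probabilities are assumptions for illustration only.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean multi-class cross entropy over N instances.

    y_true: one-hot rows (binary indicators); y_prob: predicted probabilities.
    eps guards against log(0); it is an implementation detail, not from the text.
    """
    n = len(y_true)
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(labels, probs))
    return total / n

y_true = [[1, 0, 0], [0, 1, 0]]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]
loss_confident = cross_entropy(y_true, confident)
loss_uncertain = cross_entropy(y_true, uncertain)
print(loss_confident < loss_uncertain)  # True
```

The loss is smaller when the predicted distribution puts more mass on the true category, which is the sense in which it measures the distance to the true labels.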
This embodiment verifies the accuracy and effectiveness of the Chinese news text classification method based on the combined-convolutional neural network through experiments.
The experimental environment and platform are set up as follows:
(1) Hardware: Windows 10 system, Intel(R) Core(TM) i7-8750H CPU at 2.20 GHz, 8 GB of memory.
(2) Software and dependent libraries: Python 3.7, Jupyter Notebook, tensorflow_gpu-1.13.1, sklearn, etc.
During the experiments, the adjustable parameters of the combined-convolutional neural network model are set as shown in Table 1. Data is loaded in batches for training with a batch size of 64, and the fully connected layer has 128 hidden neurons.
To verify the effectiveness of the combined-convolutional neural network model of the present invention, this embodiment runs classification experiments with different models on several sets of Chinese news texts and compares the results with conventional, representative classification algorithms. The overall average precision (Precision), recall (Recall) and F1 value (F-Measure) over all categories are used to evaluate the classification performance of the different models and serve as the performance indicators of the classifiers.
(1) To verify the classification performance of the combined-convolutional neural network model, several baselines are selected for comparison: the combined-convolutional neural network is compared with classical convolutional neural networks and with traditional machine learning methods. The classical convolutional neural networks comprise a single-layer convolutional neural network (CNN-1) and a multi-layer convolutional neural network (CNN-3); the traditional machine learning methods comprise Naive Bayes (NB), k-nearest neighbors (KNN) and the support vector machine (SVM).
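The three indicators can be computed per category and then averaged. The sketch below assumes that "overall average" means an unweighted (macro) average over categories, which the text does not state explicitly; the function name and the toy labels are likewise illustrative.

```python
def macro_prf(y_true, y_pred):
    """Per-class precision, recall and F1, averaged with equal class weight."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

# Toy labels: the rare "lottery" sample is absorbed by the frequent "stock" class.
y_true = ["sports", "sports", "stock", "stock", "lottery"]
y_pred = ["sports", "stock", "stock", "stock", "stock"]
precision, recall, f1 = macro_prf(y_true, y_pred)
print(precision, recall, round(f1, 4))
```

With a macro average, a category that is never predicted correctly pulls all three scores down regardless of how few samples it has.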
(2) To further test the effectiveness of the model and to reduce the influence of sample imbalance on the classification results, the data set is balanced. The original proportions of the news categories are: "constellation" 0.45%, "lottery" 0.9%, "fashion" 1.6%, "real estate" 2.4%, "game" 2.9%, "home" 3.9%, "finance" 4.4%, "education" 5.0%, "society" 6.1%, "current politics" 7.5%, "entertainment" 11.1%, "sports" 15.7%, "stock" 18.5% and "science" 19.5%. The "constellation", "lottery" and "fashion" categories have too few samples, together less than 3% of the total, whereas the "science", "stock" and "sports" categories have too many, these three categories alone exceeding 50% of the total. As a result, the former categories are classified poorly, and, as the confusion matrix in Fig. 7 shows, some of their samples are classified into the latter. Each row of the confusion matrix represents the true category of the data and each column represents the predicted category. After random partitioning and balancing, the data set contains 65,000 samples in 10 categories: 5,000 × 10 for the training set, 500 × 10 for the validation set and 1,000 × 10 for the test set. The classification results of the combined-convolutional neural network model of the invention on the different data sets are then compared.
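A balanced data set of this shape can be drawn by sampling the same number of texts from each category. The sketch below is scaled down (50/5/10 per class instead of 5,000/500/1,000), and the rule of skipping categories that are too small is an assumption: the text does not say how the 10 categories were chosen.

```python
import random
from collections import defaultdict

def balanced_subsets(samples, per_class, seed=0):
    """Draw the same number of samples from each class for every subset.

    samples: iterable of (text, label); per_class: e.g. {"train": n, "val": n}.
    Classes with fewer samples than the sum of per_class are skipped.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)
    rng = random.Random(seed)
    need = sum(per_class.values())
    subsets = {name: [] for name in per_class}
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) < need:
            continue  # assumed rule: too few samples to fill all subsets
        picked = rng.sample(texts, need)
        start = 0
        for name, n in per_class.items():
            subsets[name] += [(t, label) for t in picked[start:start + n]]
            start += n
    return subsets

# Toy imbalanced corpus: three classes with very different sizes.
corpus = [(f"doc{i}", "science") for i in range(400)]
corpus += [(f"doc{i}", "sports") for i in range(400, 600)]
corpus += [(f"doc{i}", "constellation") for i in range(600, 620)]
sizes = {"train": 50, "val": 5, "test": 10}
subsets = balanced_subsets(corpus, sizes)
print({name: len(rows) for name, rows in subsets.items()})
```

Because each class contributes disjoint slices of one random draw, no text appears in more than one subset.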
In the experiments, all feature construction methods take the pre-trained word vectors as input. The classification results of the different classification models are shown in Table 2:
As can be seen from the comparison in Table 2:
First, when word vectors pre-trained with the word2vec bag-of-words model are used for feature construction and as model input, every classification model achieves an accuracy above 80% on the same data set, which shows that the word vectors describe the text features well.
Second, both the single-layer and the multi-layer convolutional neural networks classify better than the three traditional machine learning algorithms, which shows that a convolutional neural network model can learn more classification features and has an advantage over traditional machine learning models.
Third, the CNN-3 model with several convolutional layers classifies worse than the CNN-1 model with a single convolutional layer, which shows that simply deepening the convolutional layers of a classical convolutional neural network model does not bring the expected gain.
Fourth, the combined-convolutional neural network model reaches an accuracy of 93.69% in classifying Chinese news text. Compared with NB, KNN and SVM, the classification accuracy is improved by 11.82%, 8.21% and 6.34% respectively, and compared with the classical CNN-1 model it is improved by 1.19%. At the same time, its recall and F1 value are also better than those of the comparison models. This shows that the word-vector convolution-recombination approach extracts more comprehensive local text-block feature information and improves the text classification effect.
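The idea of convolving the word vectors with kernels of several sizes and recombining the pooled results can be sketched without any framework. Everything specific below is an assumption for illustration: the kernel heights (2, 3, 4), the ReLU activation, the random weights and the toy dimensions are not taken from the patent.

```python
import random

def conv_max_pool(embedded, kernel, bias=0.0):
    """Slide one kernel (h x d weights) over the word vectors, apply ReLU,
    and keep the maximum response (max-over-time pooling)."""
    h = len(kernel)
    responses = []
    for start in range(len(embedded) - h + 1):
        window = embedded[start:start + h]
        s = bias + sum(w * x for krow, xrow in zip(kernel, window)
                       for w, x in zip(krow, xrow))
        responses.append(max(s, 0.0))
    return max(responses)

def combined_features(embedded, kernels_by_size):
    """Concatenate the pooled features from kernels of several sizes."""
    features = []
    for size in sorted(kernels_by_size):
        features += [conv_max_pool(embedded, k) for k in kernels_by_size[size]]
    return features

rng = random.Random(0)
seq_len, dim, n_filters = 12, 8, 4           # toy sizes, not the patent's
embedded = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
kernels_by_size = {
    h: [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)]
        for _ in range(n_filters)]
    for h in (2, 3, 4)                        # assumed kernel heights
}
features = combined_features(embedded, kernels_by_size)
print(len(features))  # 3 sizes x 4 filters = 12
```

Each kernel height looks at text blocks of a different length, so the concatenated vector carries local features at several granularities before it reaches the hidden layers.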
This embodiment further designs classification experiments on different data sets. Analysis of the confusion matrix of the classification results shows that categories with a small share of samples are often misclassified into categories with a large share. The data set is therefore re-partitioned in the experiment, and the same model is used to compare the classification results on the different data sets, as shown in Table 3:
As can be seen from Table 3, the same combined-convolutional neural network model achieves an accuracy of 95.57% on the balanced data set. Compared with the imbalanced data set, the classification effect is better: the accuracy is improved by 1.88%, the recall by 1.76% and the F1 value by 1.72%. This shows that re-processing the imbalanced data into a balanced data set largely resolves the problems caused by extreme sample proportions and prevents categories with few samples from being misclassified into categories with many samples. A heavily imbalanced data set therefore has a large influence on the classification results, and balancing the data set further improves the accuracy of news classification.
The above-described embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the invention fall within the scope of protection defined by the claims.
Claims (2)
1. A Chinese news long text classification method based on a combined-convolutional neural network is characterized by comprising the following steps:
S1, acquiring a Chinese news text data set and preprocessing the data set;
S2, constructing a vocabulary table based on the preprocessed data set, and standardizing the Chinese news texts in the preprocessed data set by means of the vocabulary table to obtain a text feature representation of the Chinese news texts;
S3, constructing a combined-convolutional neural network model, training the combined-convolutional neural network model on the standardized data set, and completing Chinese news text classification with the trained combined-convolutional neural network model; the combined-convolutional neural network model is a six-layer model comprising an Embedding layer, a convolutional layer, a pooling layer, a first hidden layer, a second hidden layer and a fully connected layer connected in sequence; wherein:
the Embedding layer is used for receiving the input Chinese news text data, using word2vec to map the words in the Chinese news text into real-valued vectors, and embedding the real-valued vectors into the Chinese news text to obtain the word vector representation of the Chinese news text, which serves as the input of the convolutional layer; that is, a second vector mapping is performed on the Chinese news text standardized in step S2;
the convolutional layer respectively extracts the characteristic vectors of the Chinese news text by adopting a plurality of convolutional kernels with different sizes;
the pooling layer is used for performing maximum pooling operation on the output of the convolutional layer;
the first hidden layer is used for combining feature vectors extracted by convolution kernels with different sizes in different convolution layers;
the second hidden layer is used for nonlinear dimensionality reduction;
Dropout is added in the fully connected layer; the fully connected layer is also connected to a Softmax layer, through which the category of the input Chinese news text is predicted;
in S1, preprocessing the data set comprises:
S1.1, constructing a data index: setting the sequence length of the Chinese news text based on big data visualization analysis, and constructing a data index based on the sequence length of the Chinese news text;
S1.2, data integration: converting the Chinese news text into a binary data stream;
in S2, constructing the vocabulary table based on the preprocessed data set comprises: building a vocabulary table for Chinese news text classification through stop-word removal and word-frequency statistics, the vocabulary table comprising words and the index number corresponding to each word;
in S2, standardizing the Chinese news texts in the preprocessed data set by means of the vocabulary table specifically comprises: standardizing the data of the Chinese news text content and standardizing the data of the Chinese news text labels;
the specific method for standardizing the data of the Chinese news text content is: first, traversing the index sequence of the vocabulary table to obtain the corresponding words in the Chinese news text and the index numbers corresponding to those words;
second, converting each word in the Chinese news text into a word id using a dictionary method, and representing the words in the Chinese news text as vectors based on the word ids, thereby completing the data standardization of the Chinese news text content; specifically, the mapping between words and word ids is implemented with a list comprehension and a lambda anonymous function, and the word ids are embedded into the Chinese news text to obtain the vectorized representation of the Chinese news text;
the specific method for standardizing the data of the Chinese news text labels is: using one-hot encoding, setting the label index corresponding to each Chinese news text to 1 and all remaining label indexes to 0, thereby obtaining the vectorized representation of the text labels and completing the data standardization of the Chinese news text labels.
2. The Chinese news long text classification method based on the combined-convolutional neural network according to claim 1, wherein in S3 the combined-convolutional neural network model is trained by minimizing a loss function, the loss function being multi-class cross entropy.
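The standardization steps recited in claim 1 (vocabulary table, word ids via a dictionary with a list comprehension and lambda functions, one-hot labels) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `<PAD>` entry at index 0, the padding scheme, the vocabulary size and the assumption that texts are already segmented into words are all hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_texts, stop_words, max_size=5000):
    """Map the most frequent non-stop words to index numbers; 0 is padding."""
    counts = Counter(w for text in tokenized_texts for w in text
                     if w not in stop_words)
    words = ["<PAD>"] + [w for w, _ in counts.most_common(max_size - 1)]
    return {w: i for i, w in enumerate(words)}

def texts_to_ids(tokenized_texts, vocab, seq_len):
    """Replace words by word ids, drop unknown words, pad or truncate."""
    to_ids = lambda text: [vocab[w] for w in text if w in vocab]
    pad = lambda ids: (ids + [0] * seq_len)[:seq_len]
    return [pad(to_ids(text)) for text in tokenized_texts]

def one_hot(labels, categories):
    """Set the index of each text's label to 1 and all other indexes to 0."""
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if i == index[lab] else 0 for i in range(len(categories))]
            for lab in labels]

# Toy, already segmented texts; "的" is treated as a stop word.
texts = [["股票", "市场", "的", "上涨"], ["球队", "的", "比赛"]]
vocab = build_vocab(texts, stop_words={"的"})
ids = texts_to_ids(texts, vocab, seq_len=4)
labels = one_hot(["stock", "sports"], categories=["sports", "stock"])
print(ids, labels)
```

The fixed `seq_len` corresponds to the sequence length set in step S1.1; every text becomes an id sequence of that length, ready for the Embedding layer.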
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110419616.2A CN112989052B (en) | 2021-04-19 | 2021-04-19 | Chinese news long text classification method based on combination-convolution neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110419616.2A CN112989052B (en) | 2021-04-19 | 2021-04-19 | Chinese news long text classification method based on combination-convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112989052A (en) | 2021-06-18 |
CN112989052B (en) | 2022-03-08 |
Family
ID=76341131
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110419616.2A Active CN112989052B (en) | 2021-04-19 | 2021-04-19 | Chinese news long text classification method based on combination-convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112989052B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114638558B (en) * | 2022-05-19 | 2022-08-23 | 天津市普迅电力信息技术有限公司 | Data set classification method for operation accident analysis of comprehensive energy system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595602A (en) * | 2018-04-20 | 2018-09-28 | 昆明理工大学 | The question sentence file classification method combined with depth model based on shallow Model |
CN109840279A (en) * | 2019-01-10 | 2019-06-04 | 山东亿云信息技术有限公司 | File classification method based on convolution loop neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10963652B2 (en) * | 2018-12-11 | 2021-03-30 | Salesforce.Com, Inc. | Structured text translation |
Non-Patent Citations (1)
Title |
---|
"基于特征融合分段卷积神经网络的情感分析";周泳东等;《计算机工程与设计》;20190604;第3009-3013页 * |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111221944B (en) | Text intention recognition method, device, equipment and storage medium | |
CN112417153B (en) | Text classification method, apparatus, terminal device and readable storage medium | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN112784532B (en) | Multi-head attention memory system for short text sentiment classification | |
CN113315789B (en) | Web attack detection method and system based on multi-level combined network | |
CN113553510B (en) | Text information recommendation method and device and readable medium | |
CN112115716A (en) | Service discovery method, system and equipment based on multi-dimensional word vector context matching | |
CN110263343A (en) | The keyword abstraction method and system of phrase-based vector | |
CN116756303A (en) | Automatic generation method and system for multi-topic text abstract | |
CN114579741B (en) | GCN-RN aspect emotion analysis method and system for fusing syntax information | |
CN112989052B (en) | Chinese news long text classification method based on combination-convolution neural network | |
CN114610838A (en) | Text emotion analysis method, device and equipment and storage medium | |
CN112527959B (en) | News classification method based on pooling convolution embedding and attention distribution neural network | |
CN112966507A (en) | Method, device, equipment and storage medium for constructing recognition model and identifying attack | |
CN110348497B (en) | Text representation method constructed based on WT-GloVe word vector | |
Yildiz | A comparative study of author gender identification | |
CN112232079A (en) | Microblog comment data classification method and system | |
Liu et al. | Chinese news text classification and its application based on combined-convolutional neural network | |
CN111881667A (en) | Sensitive text auditing method | |
Tian et al. | Chinese short text multi-classification based on word and part-of-speech tagging embedding | |
Vikas et al. | User gender classification based on Twitter Profile Using machine learning | |
CN114386425B (en) | Big data system establishing method for processing natural language text content | |
CN113051886B (en) | Test question duplicate checking method, device, storage medium and equipment | |
CN115358340A (en) | Credit credit collection short message distinguishing method, system, equipment and storage medium | |
CN113987536A (en) | Method and device for determining security level of field in data table, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20210618. Assignee: Beijing Zhongke Chaocai Information Consulting Co.,Ltd. Assignor: Beijing University of Civil Engineering and Architecture. Contract record no.: X2023980034081. Denomination of invention: A Chinese News Long Text Classification Method Based on Combination Convolutional Neural Network. Granted publication date: 20220308. License type: Common License. Record date: 20230327 |