CN112989052B - Chinese news long text classification method based on combination-convolution neural network - Google Patents

Chinese news long text classification method based on combination-convolution neural network

Info

Publication number
CN112989052B
CN112989052B CN202110419616.2A CN202110419616A CN112989052B CN 112989052 B CN112989052 B CN 112989052B CN 202110419616 A CN202110419616 A CN 202110419616A CN 112989052 B CN112989052 B CN 112989052B
Authority
CN
China
Prior art keywords
chinese news
news text
text
chinese
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110419616.2A
Other languages
Chinese (zh)
Other versions
CN112989052A (en
Inventor
张昱
刘开峰
高凯龙
王艳歌
苏仡琳
李继涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture
Priority to CN202110419616.2A
Publication of CN112989052A
Application granted
Publication of CN112989052B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese news text classification method based on a combined-convolutional neural network, comprising the following steps: S1, acquiring a Chinese news text data set and preprocessing the data set; S2, constructing a vocabulary based on the preprocessed data set, and standardizing the Chinese news texts in the preprocessed data set by means of the vocabulary to obtain a text feature representation of each Chinese news text; S3, constructing a combined-convolutional neural network model, training the model on the standardized data set, and classifying Chinese news texts with the trained model. The method achieves accurate and effective classification of Chinese news texts.

Description

Chinese news long text classification method based on combination-convolution neural network
Technical Field
The invention relates to the technical field of Chinese news text classification, in particular to a Chinese news text classification method based on a combined-convolutional neural network.
Background
With the rapid growth of the internet and the big data industry, news has become one of the main ways for people to follow social developments and obtain information. Since the late 1990s, a growing number of news websites and a wide variety of mobile news apps have been generating massive amounts of news data. To acquire and manage valuable news data efficiently, news text classification has become an active research field worldwide. News text classification supports the management of text information, the orderly organization of news, and the mining of news data.
With global economic integration, Chinese is one of the most widely used languages in the world and occupies an important position among the world's languages. However, there is little research on classifying Chinese news text, especially long Chinese text. On the one hand, few corpora are available for Chinese text classification research; on the other hand, Chinese is considerably more complex to process than Western languages, and features are difficult to extract with traditional methods. These are the reasons why Chinese news text classification has developed slowly.
Text classification is one of the fundamental problems of natural language processing, and solving it benefits many other natural language processing tasks such as information retrieval, machine translation, and automatic summarization. Common machine learning algorithms for news text classification include naive Bayes (NB), k-nearest neighbors (KNN), decision trees (DT), neural networks (NNs), maximum entropy models (ME), and support vector machines (SVM).
In 2003, Bengio et al. first applied distributed word representations to statistical language models, and neural language models began to attract wide attention. In 2008, Collobert et al. proposed using neural networks to represent vocabulary as tensor data, so that similar words are mapped to nearby positions in vector space and the meaning of a word is determined by its context words; however, this way of sharing word embeddings can only cooperate with low-level information in a matrix. In 2013, Mikolov et al. proposed two models: the continuous bag-of-words model (CBOW) and the continuous Skip-gram model. CBOW takes the word vectors of the context of a target word as input and outputs the word vector of that target word. The continuous Skip-gram model predicts in the opposite direction: it takes the vector of the center word as input and predicts the word vectors of its context. The continuous Skip-gram model handles uncommon words better, but training takes too long when the data volume is large. To train efficiently on vocabularies with millions of entries and data sets with billions of words, Google open-sourced word2vec, a tool for computing word vectors. The tool essentially maps words to a low-dimensional space, and the resulting low-dimensional word embedding vectors are fed into a classifier. The word vectors (word embeddings) obtained by training word2vec also measure the similarity between words well. In the same year, Barakat et al. noted in a published paper that multilayer neural networks have a strong feature learning ability and that, through training, they can capture the underlying meaning of the raw data more accurately.
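The difference between the two prediction directions can be made concrete with a small sketch. The function name `training_pairs`, the window size, and the toy sentence are illustrative assumptions and are not taken from the patent or from the word2vec tool itself; the sketch only shows which side of each training pair is the input and which is the target.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, skipgram_pairs) for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: the context words are the input, the center word is the target.
        cbow.append((context, center))
        # Skip-gram: the center word is the input, each context word is a target.
        skipgram.extend((center, ctx) for ctx in context)
    return cbow, skipgram


# Toy sentence (hypothetical); real input would be segmented Chinese text.
tokens = ["beijing", "hosts", "winter", "olympics"]
cbow_pairs, skipgram_pairs = training_pairs(tokens, window=1)
```

Skip-gram produces one training pair per (center, context) combination, which is one reason it tends to be slower to train on large corpora than CBOW.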
The convolutional neural network model was originally invented for computer vision; it was later shown by Meek to be effective for NLP, particularly in semantic analysis. Subsequently, LeCun et al. proposed a character-level convolutional neural network model and evaluated it on several classification data sets for semantic analysis and topic classification tasks. For Chinese text classification, however, this method is very slow in both training and inference, because the term set and the word N-grams for Chinese text are much larger than those for English text. Moreover, character-level feature processing discards the semantic information of words; in Chinese, the meanings of words and characters overlap in many ways, so this kind of feature extraction is deficient.
Therefore, it is necessary to provide a method for classifying Chinese news texts based on a combined-convolutional neural network.
Disclosure of Invention
The invention aims to provide a Chinese news text classification method based on a combined-convolutional neural network, which is used for solving the problems in the prior art and realizing accurate and effective classification of Chinese news texts.
To achieve this purpose, the invention provides a Chinese news text classification method based on a combined-convolutional neural network, comprising the following steps:
S1, acquiring a Chinese news text data set, and preprocessing the data set;
S2, constructing a vocabulary based on the preprocessed data set, and standardizing the Chinese news texts in the preprocessed data set by means of the vocabulary to obtain a text feature representation of each Chinese news text;
S3, constructing a combined-convolutional neural network model, training the model on the standardized data set, and classifying Chinese news texts with the trained model.
Preferably, in S1, the method for preprocessing the data set includes:
S1.1, constructing a data index: setting the sequence length of the Chinese news text based on big data visualization analysis, and constructing a data index based on that sequence length;
S1.2, data integration: converting the Chinese news text into a binary data stream.
Preferably, in S2, the method for constructing a vocabulary based on the preprocessed data set includes: building a vocabulary for Chinese news text classification by removing stop words and counting word frequencies, wherein the vocabulary comprises words and the index number corresponding to each word.
Preferably, in S2, standardizing the Chinese news texts in the preprocessed data set by means of the vocabulary specifically includes: data standardization of the Chinese news text content and data standardization of the Chinese news text labels.
Preferably, the data standardization of the Chinese news text content specifically includes: first, traversing the index sequence of the vocabulary to obtain the words that occur in the Chinese news text and their corresponding index numbers;
and second, converting each word in the Chinese news text into a word id by dictionary lookup, and vectorizing the Chinese news text based on the word ids, thereby completing the data standardization of the Chinese news text content.
Preferably, the data standardization of the Chinese news text labels specifically includes: using one-hot encoding, setting the entry at the label index of each Chinese news text to 1 and all other entries to 0, thereby vectorizing the text labels and completing the data standardization of the Chinese news text labels.
Preferably, in S3, the combined-convolutional neural network model is a six-layer model comprising an Embedding layer, a convolutional layer, a pooling layer, a first hidden layer, a second hidden layer, and a fully connected layer, connected in sequence, wherein:
the Embedding layer receives the input Chinese news text data and uses word2vec to map the words in the Chinese news text to real-valued vectors, yielding a word vector representation of the Chinese news text that serves as the input of the convolutional layer;
the convolutional layer extracts feature vectors of the Chinese news text using a plurality of convolution kernels of different sizes;
the pooling layer performs a max pooling operation on the output of the convolutional layer;
the first hidden layer combines the feature vectors extracted by the convolution kernels of different sizes;
the second hidden layer performs nonlinear dimensionality reduction;
Dropout is added to the fully connected layer, which is further connected to a Softmax layer, and the Softmax layer performs classification prediction on the input Chinese news text.
Preferably, in S3, the combined-convolutional neural network model is trained by minimizing a loss function, wherein the loss function is the multi-class cross entropy.
The invention discloses the following technical effects:
(1) The invention uses a data index construction method to build a term set suited to Chinese text classification and applies it to the classification of long news texts. By optimizing the structure of the classical convolutional neural network model, it provides a combined-convolutional neural network model that extracts text features automatically and improves the classification of Chinese news texts. In addition, word vector features trained with the word2vec bag-of-words model are used as the raw input, and several groups of experiments compare the proposed model with traditional news text classification methods; the classification accuracy of the combined-convolutional neural network on Chinese news text reaches 93.69%. In further experiments, the influence of a heavily imbalanced sample data set is removed, which further improves the accuracy of the method.
(2) The invention provides a supervised combined-convolutional neural network model that improves on the structure of the classical convolutional neural network model by convolving separately and then recombining. This adds convolution operations without deepening the network, yields a better text classification result, solves the problem of slow training of Chinese text classifiers, and strengthens the extraction of local text features.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a Chinese news text classification method based on a combined-convolutional neural network according to the present invention;
FIG. 2 is a statistical chart of the frequency of occurrence of text length in an embodiment of the present invention;
FIG. 3 is a diagram illustrating a cumulative distribution function of text length according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a combinatorial-convolutional neural network model in an embodiment of the present invention;
FIG. 5 is a schematic diagram of training accuracy and verification accuracy of a combined-convolutional neural network model in an embodiment of the present invention;
FIG. 6 is a diagram illustrating training loss and validation loss of a combinatorial-convolutional neural network model in an embodiment of the present invention;
FIG. 7 is a diagram illustrating a confusion matrix of classification results according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to FIG. 1, the present embodiment provides a Chinese news text classification method based on a combined-convolutional neural network, including the following steps:
S1, acquiring a Chinese news text data set, and preprocessing the data set;
The data set used in this embodiment is THUCNews, which was generated by filtering historical data from the RSS subscription channels of Sina News. It contains 836,075 news documents (2.04 GB), all in UTF-8 plain text format. Based on the original Sina news classification system, the documents are reorganized into 14 categories: science, stock, sports, entertainment, politics, society, education, finance, home, games, real estate, fashion, lottery, and horoscope.
The data set is preprocessed as follows: the text sequence length is set based on big data visualization analysis, a data index is constructed based on that length, and the text information is converted into binary data streams so that data can be read and written in batches.
To make the construction of the data index easier, this embodiment performs a big data visualization analysis of THUCNews to determine the optimal text sequence length, which also serves as the sentence padding length in the later model. Statistics show that the average length of a news item is 941 words. The histogram in FIG. 2 shows that most texts are shorter than 2000, and the cumulative distribution function of text length in FIG. 3 shows that the 90% quantile corresponds to a text length of 1857. Based on this analysis, the text length read in this embodiment is set to 2000.
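As a rough illustration of how a sequence length can be derived from a length distribution and then applied, the sketch below computes an empirical quantile and pads or truncates a sequence of ids. The helper names, the nearest-rank quantile definition, the padding id 0, and the toy lengths are all assumptions; the patent only reports the resulting figures (a 90% quantile of 1857 and a chosen length of 2000).

```python
def length_quantile(lengths, percent):
    """Smallest length L such that at least `percent` % of texts have length <= L."""
    ordered = sorted(lengths)
    # Nearest-rank definition, computed with integer arithmetic (ceiling division).
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def pad_or_truncate(ids, seq_len, pad_id=0):
    """Cut sequences longer than seq_len and right-pad shorter ones with pad_id."""
    return ids[:seq_len] + [pad_id] * max(0, seq_len - len(ids))


# Toy text lengths (hypothetical), not the THUCNews statistics.
lengths = [120, 300, 450, 800, 941, 1200, 1500, 1857, 1900, 5000]
q90 = length_quantile(lengths, 90)
fixed = pad_or_truncate([7, 8, 9], seq_len=5)
```

Rounding the quantile up to a convenient value, as the embodiment does when it moves from 1857 to 2000, keeps roughly nine in ten texts intact while bounding the input size.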
Because more than 800,000 text files are processed and reading them takes a long time, the Python standard library module pickle is used to store the complex data types and convert the text information into binary data streams. Binary files load very quickly, more than 50 times faster than text files. The serialized data is stored on the hard disk, which makes it convenient to read during experiments; the original data is recovered by deserialization. To avoid running out of memory, a fixed number of files are combined and stored at a time.
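A minimal sketch of this batching idea with the standard `pickle` module is given below. The file naming scheme, the batch size, and the record layout `(label, text)` are assumptions made for illustration; the patent does not specify them.

```python
import pickle
import tempfile
from pathlib import Path


def dump_in_batches(records, out_dir, batch_size):
    """Serialize records to numbered pickle files, batch_size records per file."""
    out_dir = Path(out_dir)
    paths = []
    for start in range(0, len(records), batch_size):
        path = out_dir / f"batch_{start // batch_size:05d}.pkl"
        with path.open("wb") as fh:
            pickle.dump(records[start:start + batch_size], fh,
                        protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths


def load_batches(paths):
    """Deserialize one file at a time so only one batch is in memory at once."""
    for path in paths:
        with Path(path).open("rb") as fh:
            yield from pickle.load(fh)


# Toy records (hypothetical): (label, text) pairs.
records = [("sports", "text a"), ("finance", "text b"), ("games", "text c")]
with tempfile.TemporaryDirectory() as tmp:
    paths = dump_in_batches(records, tmp, batch_size=2)
    n_files = len(paths)
    restored = list(load_batches(paths))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.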
S2, constructing a vocabulary table based on the preprocessed data set, and carrying out standardization processing on the Chinese news text in the preprocessed data set through the vocabulary table to obtain text characteristic representation of the Chinese news text;
The vocabulary is constructed as follows: a vocabulary for Chinese news text classification is built by removing stop words and counting word frequencies, and it comprises words and the index number corresponding to each word.
The vocabulary is prepared for the standardization of the Chinese news text data. First, stop words are removed from the Chinese news text. Stop words are removed because they occur very frequently but carry little meaning; if many of them were kept in the vocabulary, resources would be wasted. Every additional keyword improves feature extraction, so the vocabulary should leave more room for keywords.
In this embodiment, the vocabulary excludes the 20 most frequently used stop words in Chinese news text, including: "what", "is", "i", "having", "and", "just", "all", "one", "on", "also", "to", "about", "go", "you", "about", "this".
The number of Chinese characters is large, and an exact figure is hard to give. According to statistics from Beijing Guoan Information Equipment Company, its Chinese character library contains 91,251 characters, of which only a few thousand are in common use. These are divided into a table of commonly used characters and a table of secondary commonly used characters, totaling about 2500 to 7000 characters, and the counts for simplified and traditional forms do not differ much. Therefore, in this embodiment, the frequencies of all words in the Chinese news texts are counted, and the 7000 most frequent words are used as the vocabulary.
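The vocabulary construction described above can be sketched as follows. The function name `build_vocab`, the reserved `<PAD>` entry at index 0, the alphabetical tie-breaking, and the toy English tokens are assumptions for illustration; the embodiment itself uses a vocabulary size of 7000 and its own stop word list.

```python
from collections import Counter


def build_vocab(tokenized_docs, stop_words, max_size, pad_token="<PAD>"):
    """Map the most frequent non-stop words to index numbers.

    Index 0 is reserved for a padding token (an assumption of this sketch).
    """
    counts = Counter(tok for doc in tokenized_docs
                     for tok in doc if tok not in stop_words)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [pad_token] + [word for word, _ in ranked[:max_size - 1]]
    return {word: index for index, word in enumerate(words)}


# Toy corpus (hypothetical); real input would be segmented Chinese news text.
docs = [["the", "market", "rose"],
        ["the", "team", "won", "the", "match"],
        ["market", "fell"]]
vocab = build_vocab(docs, stop_words={"the"}, max_size=4)
```

Capping the vocabulary size keeps the embedding table small and drops rare words, which is the same trade-off the embodiment makes by keeping only the most frequent entries.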
The Chinese news texts in the preprocessed data set are standardized by means of the vocabulary, converting them into a standard form that a computer can process. The standardization specifically comprises:
1) Data standardization of the Chinese news text content:
first, the index sequence of the vocabulary is traversed to obtain the words that occur in the Chinese news text and their corresponding index numbers;
second, each word in the Chinese news text is converted into a word id by dictionary lookup; specifically, the mapping between words and word ids is implemented with a list comprehension and a lambda (anonymous) function. The word ids then replace the words in the Chinese news text, giving a vectorized representation of the text and completing the data standardization of the Chinese news text content.
2) Data standardization of the Chinese news text labels: one-hot encoding, which is widely used for categorical data, sets the entry at the label index of each Chinese news text to 1 and all other entries to 0, thereby vectorizing the text labels and completing the data standardization of the Chinese news text labels.
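The two standardization steps can be sketched with a dictionary lookup and a one-hot vector. The helper names, the choice to drop out-of-vocabulary words, the padding id 0, and the toy vocabulary and categories are assumptions of this sketch rather than details stated in the patent.

```python
def encode_text(tokens, vocab, seq_len, pad_id=0):
    """Convert words to word ids via dictionary lookup, then pad or truncate."""
    # Words missing from the vocabulary are dropped here (an assumption).
    ids = [vocab[tok] for tok in tokens if tok in vocab]
    return ids[:seq_len] + [pad_id] * max(0, seq_len - len(ids))


def one_hot(label, categories):
    """Return a vector with 1 at the label's index and 0 elsewhere."""
    vector = [0] * len(categories)
    vector[categories.index(label)] = 1
    return vector


# Toy vocabulary and label set (hypothetical).
vocab = {"<PAD>": 0, "market": 1, "rose": 2}
categories = ["finance", "sports", "games"]

x = encode_text(["the", "market", "rose"], vocab, seq_len=5)
y = one_hot("sports", categories)
```

After this step every document is a fixed-length list of integers and every label is a fixed-length 0/1 vector, which is the form the network expects.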
S3, constructing a combined-convolution neural network model, training the combined-convolution neural network model based on the data set after standardization processing, and completing Chinese news text classification through the trained combined-convolution neural network model.
In this embodiment, the combined-convolutional neural network model is a six-layer model, as shown in FIG. 4. Specifically:
The first layer is an Embedding layer, which receives the input data. Because the input for news classification is text, which must be converted into real-valued vectors before it can be fed to the network, the Embedding layer uses word2vec to map the words in the Chinese news text to real-valued vectors, yielding a word vector representation of the Chinese news text that serves as the input of the convolutional layer. In other words, the Chinese news text standardized in step S2 undergoes a second vector mapping.
The second and third layers are the convolutional layer and the pooling layer, respectively. Compared with the classical convolutional neural network model, the combined-convolutional neural network model mainly improves how the convolution and pooling operations are performed. Classical convolutional neural networks use either single-layer or multilayer convolution. With single-layer convolution, the local text feature information extracted by one convolution kernel is limited and incomplete; with multilayer convolution, the text features extracted by stacked convolution operations are often too abstract to express the real meaning of the text. Therefore, to extract more complete local text block features, the convolutional layer of the combined-convolutional neural network model extracts text features separately with convolution kernels of three different sizes. Meanwhile, to extract the main features and reduce the number of feature parameters, max pooling is applied separately to each output of the convolutional layer, exploiting the downsampling property of the max pooling layer, so that more, and more important, text features are extracted without deepening the neural network.
The fourth and fifth layers are intermediate hidden layers, namely the first hidden layer and the second hidden layer; a classical convolutional neural network does not have these two hidden layers. Because the output of the third layer is the result of three pooling operations, the first hidden layer is used to combine the feature vectors extracted by the different convolution kernels. In the combined-convolutional neural network model of this embodiment, the number of convolution kernels of each size is set to a large value, so the vector produced by combining the feature vectors in the first hidden layer has too many dimensions; a second hidden layer is therefore added for dimensionality reduction.
The sixth layer is a fully connected layer. First, a Dropout layer is added to the fully connected layer to prevent overfitting and improve the generalization ability of the model. Second, the model uses ReLU as the activation function, which increases the nonlinearity of the neural network model and mitigates the vanishing gradient problem. Finally, Softmax is used to perform classification prediction on the news text.
The working principle of the combined-convolution neural network model is as follows:
The Embedding layer is a dictionary lookup that maps integer indices to dense vectors. The layer receives integers as input, looks up the vectors associated with those integers in an internal dictionary, and returns them as output. The word vector mapping in this layer uses Google's word vector computing tool word2vec to embed the input data as words, producing the word vectors that are fed to the convolutional layer.
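The lookup behaviour of an embedding layer can be shown in a few lines. In this sketch the table is filled with random numbers purely as a stand-in; in the embodiment the vectors come from word2vec. The function names and the toy sizes are assumptions.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a vocab_size x dim table of dense vectors.

    Random values are placeholders for trained word2vec vectors.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(ids, table):
    """Look up the dense vector for every integer index in the input."""
    return [table[i] for i in ids]


table = make_embedding_table(vocab_size=5, dim=3)
sentence = embed([1, 4, 4, 0], table)  # an n x k matrix with n=4, k=3
```

The output is an n x k matrix with one row per word, which is the shape consumed by the convolutional layer.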
After mapping, the vectorized Chinese news text is represented by $k$-dimensional word vectors. Suppose $x_i \in \mathbb{R}^{k}$ is the vector representation of the $i$-th word; a sentence of length $n$ is then given by equation (1):
$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n \tag{1}$$
where $\oplus$ denotes the concatenation operation and $x_{1:n}$ denotes the word vector matrix formed by the 1st through $n$-th words of the input.
The convolutional layer performs convolution over consecutive windows of width $k$ using convolution kernels of different sizes, where a convolution kernel is $w \in \mathbb{R}^{h \times k}$. In this embodiment the heights $h$ of the three kernel sizes are set to 3, 5, and 7, respectively, and there are $r$ convolution kernels of each size, with $r$ set to 256. The weight matrix $w$ extracts features from a text block of $h$ words, $x_{i:i+h-1}$; one extracted feature $c_i$ is given by equation (2):
$$c_i = f(w \cdot x_{i:i+h-1} + b) \tag{2}$$
where $f$ is a nonlinear activation function and $b \in \mathbb{R}$ is the bias term. Applying the convolution operation to the word vectors of a complete news text, $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$, yields the feature map $c$ given by equation (3):
$$c = [c_1, c_2, \ldots, c_{n-h+1}] \tag{3}$$
where $c \in \mathbb{R}^{n-h+1}$. To extract the main features while reducing the number of feature parameters and the amount of computation, max pooling takes the maximum value of each feature map as the most important feature extracted by that convolution kernel from the text vectors, giving a feature vector of dimension $r$. With $\hat{c}$ denoting the result of the max pooling operation, this is given by equation (4):
$$\hat{c} = \max\{c\} \tag{4}$$
The above is the feature extraction process for convolution kernels of one size. The combined-convolutional neural network model of this embodiment obtains multiple features using convolution kernels of several different sizes, so the max pooling results of the different kernels are concatenated into a feature vector $z$, as given by equation (5):
$$z = z^{(3)} \oplus z^{(5)} \oplus z^{(7)} \tag{5}$$
where $\oplus$ denotes concatenation, and $z^{(3)}$, $z^{(5)}$, and $z^{(7)}$ denote the feature vectors output after max pooling for the convolution kernels of height 3, 5, and 7, respectively.
Then a hidden layer is added for nonlinear dimensionality reduction, producing a feature vector $v \in \mathbb{R}^{d}$, where $d$ is the number of neuron nodes in the hidden layer; in this embodiment, $d$ is set to 128.
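The convolution, max pooling, and concatenation steps described above can be sketched in plain Python on toy sizes. The sketch assumes ReLU as the activation function $f$, random kernel weights, zero biases, and very small dimensions (the embodiment uses a sequence length of 2000 and 256 kernels per size); the function names are hypothetical, and the dimensionality-reducing hidden layer is left out.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def conv_feature_map(sentence, kernel, bias):
    """Slide an h x k kernel over an n x k sentence matrix.

    Returns the n - h + 1 features c_i = relu(sum(kernel * window_i) + bias).
    """
    h = len(kernel)
    feature_map = []
    for i in range(len(sentence) - h + 1):
        total = bias
        for a in range(h):
            for b in range(len(kernel[a])):
                total += kernel[a][b] * sentence[i + a][b]
        feature_map.append(relu(total))
    return feature_map


def combined_features(sentence, kernels_by_height):
    """Max-pool every feature map and concatenate the results across kernel sizes."""
    features = []
    for h in sorted(kernels_by_height):
        for kernel, bias in kernels_by_height[h]:
            features.append(max(conv_feature_map(sentence, kernel, bias)))
    return features


rng = random.Random(0)
n, k, r = 10, 4, 2  # toy sizes for illustration only
sentence = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
kernels_by_height = {
    h: [([[rng.uniform(-1, 1) for _ in range(k)] for _ in range(h)], 0.0)
        for _ in range(r)]
    for h in (3, 5, 7)
}
z = combined_features(sentence, kernels_by_height)  # length 3 * r
```

Because each kernel size is applied to the same input in parallel rather than stacked, the network gains additional convolution operations without becoming deeper, which is the design point the embodiment emphasizes.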
Finally, the features are passed to the fully connected layer, and a Softmax layer outputs the probability distribution over the 14 category labels. The category with the maximum probability is taken as the predicted label value
Figure 466676DEST_PATH_IMAGE024
, as shown in formula (6):
Figure 187508DEST_PATH_IMAGE025
where
Figure 493855DEST_PATH_IMAGE026
, m is the number of categories, and
Figure 591124DEST_PATH_IMAGE027
is the bias term. To speed up convergence, mini-batch gradient descent is used, with the batch size set to 64 in this embodiment. In addition, a Dropout layer and the ReLU activation function are applied at the fully connected layer.
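The prediction step described around formula (6), a linear layer followed by Softmax and selection of the most probable category, can be sketched as below. The function names, the toy weights and the three-category example are hypothetical; only the Softmax-then-argmax structure is taken from the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, weights, bias):
    """Return (predicted label index, class probabilities).

    weights holds one row per category; bias holds one entry per category.
    """
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs

# Toy example with 3 categories and 2 features (illustrative values only).
label, probs = predict([1.0, 0.0],
                       [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
                       [0.0, 0.0, 0.0])
print(label)
```

In the embodiment the same structure would have 14 output rows, one per news category.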
In deep learning it is important to divide the data sensibly into a training set, a validation set and a test set. In this embodiment the data volume rises to the million-sample scale, so most of the samples should go to the training set, and the validation and test sets need not be large. The ratio of training, validation and test sets is therefore set to approximately 82:6:12: 686075 Chinese news samples are used for training, 50000 validation samples are used for model validation and tuning, and 100000 test samples are used to evaluate the classification performance of the model.
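As a quick arithmetic check, the three sample counts given above can be converted to percentage shares; the variable names are illustrative.

```python
# Sample counts stated in the text.
sizes = {"train": 686075, "validation": 50000, "test": 100000}

total = sum(sizes.values())
shares = {name: round(100 * count / total, 2) for name, count in sizes.items()}

print(total)
print(shares)
```

Rounded to whole percentages, the shares match the stated 82:6:12 ratio.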
The validation set is used to monitor the accuracy and loss of the model and to find the iteration at which the model starts to overfit. A pair of accuracy and loss values is output every 100 iterations, and the accuracy and loss curves are plotted as shown in fig. 5 and fig. 6. The network is trained for 20000 iterations in total, and overfitting begins at around the 10000th iteration: the training accuracy and training loss are relatively stable, while the validation accuracy no longer improves and the validation loss no longer decreases. Stopping training at that point therefore both reduces the computational load and avoids overfitting the model.
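One way to locate the iteration at which validation loss stops improving is a simple patience rule over the values logged every 100 iterations. The sketch below is an assumption about how such a check could be written; the `patience` parameter and the sample loss values are hypothetical and do not come from the source, which reads the point off the plotted curves.

```python
def best_checkpoint(val_losses, eval_every=100, patience=5):
    """Return the iteration count of the lowest validation loss seen
    before `patience` consecutive evaluations fail to improve on it."""
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx = loss, i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` evaluations: stop looking
    return (best_idx + 1) * eval_every

# Hypothetical validation losses, one value per 100 iterations.
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.56, 0.6, 0.58, 0.7]
print(best_checkpoint(losses, eval_every=100, patience=3))
```

Training can then be stopped at, or the model restored from, the returned iteration.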
In addition, a Dropout layer is added to the fully connected layer of the neural network as a regularization method to reduce overfitting. Dropout is an important technique for preventing overfitting and improving performance in convolutional neural networks: in each training batch, the output value of each hidden-layer node is set to zero with probability 1-p. Reducing the interaction among feature detectors (hidden-layer nodes) in this way effectively reduces overfitting and provides a degree of regularization.
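A minimal sketch of the Dropout step follows, assuming p is the keep probability so that each node output is zeroed with probability 1-p. Rescaling the kept outputs by 1/p (so-called inverted dropout) is an added assumption; the text itself only describes the zeroing.

```python
import random

def dropout(values, keep_prob, rng, training=True):
    """Zero each value with probability 1 - keep_prob during training.

    Kept values are rescaled by 1 / keep_prob (an assumption, see lead-in)
    so that the expected output matches the input. At inference time the
    values pass through unchanged.
    """
    if not training:
        return list(values)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
hidden = [0.2, 1.5, 0.7, 0.9]
print(dropout(hidden, keep_prob=0.5, rng=rng))
print(dropout(hidden, keep_prob=0.5, rng=rng, training=False))
```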
The process of training the combined-convolutional neural network model based on the term set includes:
The combined-convolutional neural network model is trained by minimizing a loss function on the training set. The loss function is the multi-class cross entropy, i.e. the logarithmic loss, as shown in formula (7):
Figure 551514DEST_PATH_IMAGE028
where L is the loss function and Y is the output variable;
Figure 443247DEST_PATH_IMAGE029
is a binary indicator of whether category m is the true category of the input instance
Figure 236891DEST_PATH_IMAGE030
;
Figure 137851DEST_PATH_IMAGE031
denotes the probability that the jth of the N instances is predicted as the tth category. The loss value measures the distance between the probability distribution output by the network and the true probability distribution of the labels, and training drives the output as close as possible to the true labels. The optimizer uses the Adam algorithm, which applies a second-moment gradient correction and computes an adaptive learning rate for each parameter in its search for an optimum. Model training runs for 10000 iterations in total and takes about 20 minutes. For this reason, TensorFlow's model saving and loading mechanism is used: a previously trained model is loaded and training continues from it, which saves a great deal of time in the experiments.
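The multi-class cross entropy over one-hot labels can be sketched as below. Averaging over the N instances and clipping probabilities with a small `eps` are assumptions made for this illustration, since the source formula is only available as an image.

```python
import math

def cross_entropy(one_hot_labels, predicted_probs, eps=1e-12):
    """Mean multi-class log loss over N instances.

    one_hot_labels[j][m] is 1 if category m is the true category of instance j.
    predicted_probs[j][m] is the predicted probability of category m.
    """
    n = len(one_hot_labels)
    total = 0.0
    for y, p in zip(one_hot_labels, predicted_probs):
        total -= sum(y_m * math.log(max(p_m, eps)) for y_m, p_m in zip(y, p))
    return total / n

labels = [[1, 0, 0], [0, 1, 0]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(cross_entropy(labels, probs))
```

The loss is zero only when every true category receives probability 1, and it grows as probability mass moves to the wrong categories.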
This embodiment verifies the accuracy and effectiveness of the Chinese news text classification method based on the combined-convolutional neural network through experiments.
The experimental environment and platform are set up as follows:
(1) Hardware: Windows 10, Intel(R) Core(TM) i7-8750H CPU at 2.20 GHz, 8 GB of memory.
(2) Software and dependent libraries: Python 3.7, Jupyter Notebook, tensorflow_gpu-1.13.1, sklearn, etc.
In the experiments, the tunable parameters of the combined-convolutional neural network model are set as shown in table 1. Data is loaded in batches for training, with a batch size of 64, and the fully connected layer has 128 hidden neurons.
Figure 684238DEST_PATH_IMAGE033
To verify the effectiveness of the combined-convolutional neural network model of the present invention, this embodiment performs classification experiments on several Chinese news text data sets with different models and compares the results with conventional, representative classification algorithms. The overall average Precision, Recall and F1 value (F-Measure) over all categories are used to evaluate the classification performance of the different models and serve as the performance indicators of the classifier.
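The three indicators can be computed from a confusion matrix whose rows are true categories and whose columns are predicted categories. The sketch below assumes that "overall average" means an unweighted (macro) average over categories, which the source does not state explicitly; the 2×2 matrix is a made-up example.

```python
def macro_metrics(confusion):
    """Return macro-averaged (precision, recall, F1).

    confusion[r][c] counts samples of true category r predicted as category c.
    """
    k = len(confusion)
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(k))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

precision, recall, f1 = macro_metrics([[8, 2], [1, 9]])
print(precision, recall, f1)
```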
(1) In order to verify the classification performance of the combined-convolutional neural network model, a plurality of benchmarks are selected for comparison, and the combined-convolutional neural network is compared with a classical convolutional neural network and a traditional machine learning method. The classical convolutional neural network comprises a single-layer convolutional neural network (CNN-1) and a multi-layer convolutional neural network (CNN-3), and the traditional machine learning method comprises Naive Bayes (NB), nearest neighbor (KNN) and a Support Vector Machine (SVM).
(2) To further test the effectiveness of the model and reduce the influence of class imbalance on the classification results, the data set is balanced. The original proportions of the news categories "constellation", "lottery", "fashion", "real estate", "game", "home", "finance", "education", "society", "current politics", "entertainment", "sports", "stock" and "science" are, respectively, 0.45%, 0.9%, 1.6%, 2.4%, 2.9%, 3.9%, 4.4%, 5.0%, 6.1%, 7.5%, 11.1%, 15.7%, 18.5% and 19.5%. The "constellation", "lottery" and "fashion" categories have too few samples, together less than 3% of the total, while the "science", "stock" and "sports" categories have too many: these three categories alone exceed 50% of the total. As a result, the former categories are classified poorly, and some of their samples are assigned to the latter, as the confusion matrix in fig. 7 shows. Each row of the confusion matrix represents the true category of the data and each column represents the predicted category. After random partitioning and balancing, the data set contains 65000 samples in 10 categories: 5000 × 10 for training, 500 × 10 for validation and 1000 × 10 for testing. The classification results of the combined-convolutional neural network model of the invention on the different data sets are then compared.
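A balanced subset of this kind can be built by drawing the same number of samples at random from each category. The sketch below is only one possible procedure: skipping categories with fewer than `per_class` samples is an assumption, since the source does not say how the 10 categories were selected, and the toy data is invented.

```python
import random
from collections import defaultdict

def balanced_subset(samples, per_class, rng):
    """Draw `per_class` random samples from every category that has enough.

    samples is a list of (text, label) pairs. Categories with fewer than
    `per_class` samples are skipped (an assumption, see lead-in).
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    subset = []
    for label, items in sorted(by_label.items()):
        if len(items) < per_class:
            continue
        subset.extend(rng.sample(items, per_class))
    rng.shuffle(subset)
    return subset

# Toy data: three categories with 10, 3 and 8 samples.
data = ([("a%d" % i, "a") for i in range(10)]
        + [("b%d" % i, "b") for i in range(3)]
        + [("c%d" % i, "c") for i in range(8)])
subset = balanced_subset(data, per_class=5, rng=random.Random(0))
print(len(subset))
```

With `per_class` set to 6500 and the result split 5000/500/1000 per category, this would reproduce the sizes quoted above.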
In the experiments, all methods use pre-trained word vectors as input for feature construction, and the classification results of the different models are shown in table 2:
Figure 12452DEST_PATH_IMAGE035
As can be seen from table 2: First, with word vectors pre-trained by the word2vec bag-of-words model and used for feature construction as model input, every classification model reaches an accuracy above 80% on the same data set, which shows that word vectors describe text features well. Second, both the single-layer and the multi-layer convolutional neural networks outperform the three traditional machine learning algorithms, which shows that a convolutional neural network model learns more classification features and has an advantage over traditional machine learning models. Third, the CNN-3 model with multiple convolutional layers performs worse than the CNN-1 model with a single convolutional layer, which shows that deepening the convolutional layers of a classical convolutional neural network does not achieve the expected effect. Fourth, the combined-convolutional neural network model reaches an accuracy of 93.69% on Chinese news text classification, an improvement of 11.82%, 8.21% and 6.34% over NB, KNN and SVM respectively, and of 1.19% over the classical CNN-1 model. Its recall and F1 value are also better than those of the comparison models, which shows that recombining word-vector convolutions extracts more comprehensive local text-block features and improves text classification.
This embodiment further designs classification experiments on different data sets. Analysis of the confusion matrix of the classification results shows that categories with a small share of the samples are often misclassified into categories with a large share. The data set is therefore re-partitioned, and the same model is used to compare classification results on the different data sets, as shown in table 3:
Figure 27812DEST_PATH_IMAGE037
From table 3, the same combined-convolutional neural network model reaches an accuracy of 95.57% on the balanced data set. Compared with the unbalanced data set, the classification performance is better: accuracy improves by 1.88%, recall by 1.76% and the F1 value by 1.72%. This shows that rebuilding a balanced data set from the unbalanced one largely resolves the problems caused by extreme class proportions and prevents categories with a small share of the samples from being misclassified into categories with a large share. An overly unbalanced data set thus has a large influence on the classification results, and balancing the data set further improves the accuracy of news classification.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims (2)

1. A Chinese news long text classification method based on a combined-convolutional neural network is characterized by comprising the following steps:
S1, acquiring a Chinese news text data set, and preprocessing the data set;
S2, constructing a vocabulary table based on the preprocessed data set, and carrying out standardization processing on the Chinese news text in the preprocessed data set through the vocabulary table to obtain text characteristic representation of the Chinese news text;
S3, constructing a combined-convolutional neural network model, training the combined-convolutional neural network model based on the data set after standardization processing, and completing Chinese news text classification through the trained combined-convolutional neural network model; the combined-convolutional neural network model is a six-layer model and comprises an Embedding layer, a convolutional layer, a pooling layer, a first hidden layer, a second hidden layer and a full-connection layer which are sequentially connected; wherein,
the Embedding layer is used for receiving input Chinese news text data, adopting word2vec to map words in the Chinese news text into real number vectors and then Embedding the real number vectors into the Chinese news text to obtain word vector representation of the Chinese news text, and taking the word vector representation as the input of the convolutional layer, namely performing secondary vector mapping on the Chinese news text standardized in the step S2;
the convolutional layer respectively extracts the characteristic vectors of the Chinese news text by adopting a plurality of convolutional kernels with different sizes;
the pooling layer is used for performing maximum pooling operation on the output of the convolutional layer;
the first hidden layer is used for combining feature vectors extracted by convolution kernels with different sizes in different convolution layers;
the second hidden layer is used for nonlinear dimensionality reduction;
dropout is added in the full connection layer, the full connection layer is also connected with a Softmax layer, and the input Chinese news text is classified and predicted through the Softmax layer;
in S1, preprocessing the data set includes:
S1.1, constructing a data index: setting the sequence length of the Chinese news text based on the big data visualization analysis, and constructing a data index based on the sequence length of the Chinese news text;
S1.2, data integration: converting the Chinese news text into a binary data stream;
in S2, constructing the vocabulary table based on the preprocessed data set includes: making a vocabulary table for classifying Chinese news texts by removing stop words and word frequency statistics, wherein the vocabulary table comprises vocabularies and index numbers corresponding to the vocabularies;
in S2, the method for normalizing the Chinese news text in the preprocessed data set by using the vocabulary table specifically includes: standardizing data of Chinese news text contents and data of Chinese news text labels;
the specific method for standardizing the data of the Chinese news text content comprises: first, traversing the index sequence of the vocabulary table to obtain the vocabularies in the Chinese news text and the index numbers corresponding to the vocabularies;
second, converting each vocabulary in the Chinese news text into a word id by a dictionary method, and representing the vocabularies in the Chinese news text as vectors based on the word ids, thereby completing the data standardization of the Chinese news text content; specifically, the mapping between vocabularies and word ids is implemented with a list comprehension and an anonymous lambda function, and the word ids are embedded into the Chinese news text to obtain a vectorized representation of the Chinese news text;
the specific method for standardizing the data of the Chinese news text label comprises: using One-Hot encoding, setting the label index corresponding to each Chinese news text to 1 and all other label indexes to 0, so that the text labels are represented as vectors and the data standardization of the Chinese news text labels is completed.
2. The method for Chinese news long-text classification based on the combined-convolutional neural network as claimed in claim 1, wherein in the S3, the combined-convolutional neural network model is trained by minimizing a loss function, wherein the loss function adopts multi-class cross entropy.
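For illustration only, the content and label standardization recited in claim 1 (word-to-id mapping with a dictionary, a list comprehension and a lambda, plus One-Hot label encoding) might look like the sketch below. The toy vocabulary, the `<PAD>` entry, the padding to a fixed sequence length and the dropping of out-of-vocabulary words are hypothetical choices, not details taken from the claims.

```python
# Toy vocabulary; index 0 is reserved for padding (hypothetical choice).
vocab = ["<PAD>", "股票", "上涨", "比赛", "球队"]
word_to_id = dict(zip(vocab, range(len(vocab))))

# Map segmented words to ids; out-of-vocabulary words are dropped here.
to_ids = lambda words: [word_to_id[w] for w in words if w in word_to_id]

def pad(ids, seq_len):
    """Truncate or right-pad an id sequence to a fixed length."""
    return (ids + [0] * seq_len)[:seq_len]

def one_hot(label, categories):
    """1 at the index of the text's label, 0 at every other index."""
    return [1 if c == label else 0 for c in categories]

categories = ["财经", "体育"]
print(pad(to_ids(["股票", "上涨", "未知"]), 4))
print(one_hot("体育", categories))
```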
CN202110419616.2A 2021-04-19 2021-04-19 Chinese news long text classification method based on combination-convolution neural network Active CN112989052B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110419616.2A CN112989052B (en) 2021-04-19 2021-04-19 Chinese news long text classification method based on combination-convolution neural network

Publications (2)

Publication Number Publication Date
CN112989052A CN112989052A (en) 2021-06-18
CN112989052B true CN112989052B (en) 2022-03-08

Family

ID=76341131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110419616.2A Active CN112989052B (en) 2021-04-19 2021-04-19 Chinese news long text classification method based on combination-convolution neural network

Country Status (1)

Country Link
CN (1) CN112989052B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638558B (en) * 2022-05-19 2022-08-23 天津市普迅电力信息技术有限公司 Data set classification method for operation accident analysis of comprehensive energy system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595602A (en) * 2018-04-20 2018-09-28 昆明理工大学 The question sentence file classification method combined with depth model based on shallow Model
CN109840279A (en) * 2019-01-10 2019-06-04 山东亿云信息技术有限公司 File classification method based on convolution loop neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10963652B2 (en) * 2018-12-11 2021-03-30 Salesforce.Com, Inc. Structured text translation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于特征融合分段卷积神经网络的情感分析";周泳东等;《计算机工程与设计》;20190604;第3009-3013页 *

Also Published As

Publication number Publication date
CN112989052A (en) 2021-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210618

Assignee: Beijing Zhongke Chaocai Information Consulting Co.,Ltd.

Assignor: Beijing University of Civil Engineering and Architecture

Contract record no.: X2023980034081

Denomination of invention: A Chinese News Long Text Classification Method Based on Combination Convolutional Neural Network

Granted publication date: 20220308

License type: Common License

Record date: 20230327
