CN111966828B

CN111966828B - Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Info

Publication number: CN111966828B
Application number: CN202010729459.0A
Authority: CN
Inventors: 蔡世民; 陈明仁; 戴礼灿
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-07-27
Filing date: 2020-07-27
Publication date: 2022-05-03
Anticipated expiration: 2040-07-27
Also published as: CN111966828A

Abstract

The invention discloses a newspaper and magazine news classification method based on a text context structure and attribute information superposition network, and belongs to the field of information processing. The invention uses a text vector representation method to convert the text with indefinite length into the vector with fixed length, thereby avoiding the loss and redundancy of text information; from the perspective of training data, weighted random sampling is adopted, and the composition of a training sample is optimized through the possibility that a weight adjusting sample is selected; from the aspect of feature extraction, the text optimizes the feature extraction process by considering not only the context result information but also the text attribute information. The method and the device not only improve the extraction mode of the text features, but also additionally incorporate the attribute features into the feature construction process. The text with indefinite length is converted into the vector with fixed length by using a text vector representation method, so that the loss and redundancy of text information are avoided, and the extraction mode of text characteristics is optimized; additionally, the characteristic information of news is added, and the source of the characteristics is enriched.

Description

Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Technical Field

The invention belongs to the field of information processing, and relates to a news classification method and a news classification system based on a text context structure information and attribute information overlay network.

Background

Definition of key terms:

a neural network: is a mathematical or computational model that mimics the structure and function of a biological neural network and is used to estimate or approximate functions. Neural networks are computed from a large number of artificial neuron connections. In most cases, the artificial neural network can change the internal structure on the basis of external information, and is an adaptive system.

Text characterization: the method is a machine learning technology for mapping a high-level cognitive abstract entity of text into a vector on a real number field in the field of natural language processing so as to facilitate subsequent computer processing.

Weighted random sampling: the sampling technology for determining the sampling probability of the sample according to the sample weight can effectively solve the problem of unbalanced distribution of the sample classes from the sampling level.

Newspapers and periodicals are a kind of transmission medium for transmitting literal information by using paper. The main functions of the system comprise explanation, propaganda and image maintenance, for example, the 'daily newspaper of people' is the image of a maintenance country, the 'liberation military newspaper' is the image of a maintenance army, and the 'enterprise newspaper' is the image of a maintenance enterprise.

Generally, a newsstand has several news items. For a certain news item in a newspaper, whether the news item can become the head edition news of the present day of the newspaper is related to the information content of the news item. In conjunction with current natural language processing techniques, it is still difficult to directly quantify the amount of information in a piece of textual news. Therefore, the problem of two classification black boxes (hereinafter, abbreviated as ' news classification problem of newspapers and periodicals ') of ' whether a certain news is the headline news ' is solved by using a neural network which is also the black box ', and the problem is a direct and efficient choice.

With the success of AlexNet, the study of neural networks has entered a new stage. In the current text classification technology, a neural network is mainly used as a technical means, so that the structural information of the text is fully mined, and classification is performed based on the characteristic information. In the field of text classification, algorithms such as TextCNN, TextRNN, FastText, TextRCNN, etc. are proposed in succession, performing feature extraction on texts from the perspective of convolutional neural networks and recurrent neural networks, respectively, and these algorithms perform excellently on multiple test data sets.

The prior art has the following disadvantages:

although algorithms such as TextCNN, TextRNN, FastText, TextRCNN and the like are excellent in performance on a plurality of classified test data sets of open texts, the algorithms cannot effectively solve the classification problem of news of newspapers due to the particularity of the news of the newspapers. Specifically, the following points are included: firstly, the length of news in a newspaper is uncertain, and the length of news is not directly related to the importance degree of news, and the above technical means mostly limit the input length of text, and require truncation or filling operations on the input text, which may cause the loss or redundancy of extracted features. Meanwhile, due to the particularity of news of the newspapers, the data of two categories of news classification problems of the newspapers are obviously biased, namely, the number of the first-edition news is far less than that of the non-first-edition news. The classification method directly trained by using the biased data is also biased, namely the classification method has a higher probability of classifying news into non-top news. Finally, for news in newspapers, although the main reason for the possibility of being the top news is the amount of information contained in the context of the news text, some news with too long or too short text cannot be easily found in the top news due to the limitations of top layout and post layout. Most of the prior technical means only consider text context structure information, but ignore text attribute information such as 'title length', 'text length' and the like, so that the characteristics are lost.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a newspaper and magazine news classification method based on a text context structure and an attribute information superposition network. In the invention, from the input point of view, the text with indefinite length is converted into the vector with fixed length by using a text vector representation method, thereby avoiding the loss and redundancy of text information; from the perspective of training data, weighted random sampling is adopted, and the composition of training samples is optimized through the possibility that weight adjusting samples are selected; from the aspect of feature extraction, the text optimizes the feature extraction process by considering not only the context result information but also the text attribute information.

The invention discloses a newspaper and periodical news classification method based on a text context structure and attribute information superposition network, which specifically comprises the following steps:

step 1: acquiring data;

acquiring text information and attribute information of news of a certain newspaper from a database, wherein the text information represents text content of the news, and the attribute information comprises 8 attribute information, specifically: the total number of versions of a certain newspaper and a certain journal on the same day, the number of words of a title of the news, the number of words of a text of the news, the maximum number of words of a news title of a certain newspaper and a certain journal on the same day, the minimum number of words of a news title of a certain newspaper and a certain journal on the same day, the maximum number of words of a text of the news of a certain newspaper and a certain journal on the same day, the minimum number of words of a news title of a certain newspaper and a era number;

step 2: generating a text feature vector;

carrying out vector representation on news text information, converting each news text into a text feature vector with low dimension and high information content respectively, and storing the text feature vector into a database;

and step 3: generating attribute feature vectors;

carrying out vector splicing on the news attribute information, splicing all the attribute information into an attribute feature vector, and finally storing the result into a database;

and 4, step 4: dividing a data set;

dividing the news data in the database into a training set, a verification set and a test set at random according to a certain proportion, wherein the specific proportion is 6: 2: 2;

and 5: sampling;

weighting the first edition news and the non-first edition news in the training set, and obtaining a training sample set with relatively balanced quantities of the first edition news and the non-first edition news by adopting a weighted random sampling mode;

step 6: training a network model;

training a composite neural network by using a text feature vector and an attribute feature vector corresponding to news in a training sample set and a category corresponding to the news;

and 7: predicting;

and inputting the text feature vector and the attribute feature vector of the concentrated news into the composite neural network obtained by training, wherein the output of the network is a prediction result of whether the news is the first edition news.

Compared with the prior art, the invention has the beneficial effects that:

1. the method and the device not only improve the extraction mode of the text features, but also additionally incorporate the attribute features into the feature construction process. In the step 2, a text vector representation method is used for converting the text with the indefinite length into the vector with the fixed length, so that the loss and redundancy of text information are avoided, and the extraction mode of text features is optimized; in step 3, the characteristic information of news is additionally added, so that the source of the characteristics is enriched. The features of the present invention are more efficiently and variously constructed than in the prior art.

2. The present invention uses the technique of weighted random sampling to train the model using the sample set resulting from the sampling. Unlike other related art application fields, the classification problem of news in newspapers and periodicals faces the objective fact that the proportion of top news and non-top news is severely unbalanced. In the step 5, a training sample set is obtained by adopting a weighted random sampling method, and the strict control on the form of the training data is realized on the premise of ensuring the authenticity of the data.

3. The invention uses the idea of solving the black box problem by using a black box method, and solves the problem of end-to-end and point-to-point. For the classification problem of news of newspapers and periodicals, under the condition that the existing algorithm and index can not be directly used for directly measuring the importance degree of news, a network model named as composite is provided in step 6 for solving the classification problem end to end. Compared with the prior art, the invention simulates the thinking way of the human brain by using the neural network.

Drawings

Fig. 1 is a flowchart of a news classification method according to the present invention.

Fig. 2 is a schematic structural diagram of a text vector representation method.

Fig. 3 is a schematic diagram illustrating the effect of the weighted random sampling algorithm.

Fig. 4 is a schematic diagram of a composite neural network structure.

FIG. 5 is a classification result of the present invention when handling the "daily news report for people" news classification problem.

Detailed description of the preferred embodiments

For the purpose of making the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings.

Fig. 1 visually represents the steps of the news classification method proposed by the present invention. Specifically, the method comprises the steps of data acquisition, text feature vector generation, attribute feature vector generation, data set division, weighted random sampling, composite model training and final classification prediction.

Fig. 2 visually shows the method for converting text into vectors according to the present invention, and the principle is as follows:

simultaneous training of word vectors and textVector quantity; let text d_iThe corresponding code vector is p_iThe coding vector corresponding to the word t in the text is w_t(ii) a Can construct words t in the text d_iThe vector at the j-th occurrence of (a) is of the form:

t in the formula is the number of unilateral context words considered by the algorithm; setting words t in text d_iS times, then use as follows

To represent the vector of the word t:

n represents a text d_iTotal number of words t in the text d_iSum vector of

Substituting into the neural network model of the text vector guarantee method, the following outputs can be obtained:

in the above formula, W is a hidden layer in the neural network model, and b is an offset, so that the following loss function is constructed:

the distance in the formula is a distance function between vectors, can be a second-order Euclidean distance, and W can be obtained by optimizing the loss function_bestMatrix sum b_bestBiasing; in the text d_iCorresponding vector p_iAs input, the vector table of low dimension can be obtainedSign for

In the form:

fig. 3 illustrates the sampling effect using a weighted random sampling method, as follows:

for class C_jIn words, is provided with

The samples belong to class C_jThen this

Sample d of any one of the samples_iWeight of (1)_iCan be expressed in the following form:

c in the formula represents a defined classification category set, and a weighted random sampling method is used, wherein Weight is set from Weight of samples to { Weight ═ Weight }₁,weight₂,…,weight_nD ═ D from sample set₁,d₂,…,d_nThe way of selecting m samples in (1) is as follows:

phi is the element Weight in the set Weight_iE.g. Weight, selecting uniformly distributed random numbers u between 0 and 1_iAnd calculating k using the following formula_i：

Let set K ═ K-_i-where i ═ 1,2, …, n; pressing the set K by K_iSorting, selecting maximum m elements to form a Sample set, where the Sample can be expressed as the following formula:

Sample＝{d_l}

Where l meets that k_l≥k_m-th

in the above formula k_m-thRepresenting the value of the mth largest element in the set K.

Fig. 4 shows the structure of a composite neural network, for which the calculation process gives the following analysis:

the input to the composite neural network is sample news text

Text feature vector of

And attribute feature vector

The model also characterizes the vector separately for the text using 2 different parts

Performing dimension reduction operation and attribute feature vector

Carrying out normalization operation;

further, the composite neural network includes: implementation of

Vector dimension reduction part and implementation

Partial, classified, fully connected networks with vector normalization; implementation of

Partial sum implementation of vector dimension reduction

Output of vector normalized partThe network is fully connected with the input classification to realize final classification;

for implementation in a composite neural network model

For the vector dimension reduction part, 3 layers of fully connected neural networks are provided; in a first layer of fully-connected neural networks, the input is a text characterization vector

The weight matrix is W₁Offset is b₁The activation function is ReLu (X), and the output is expressed as follows:

the second and third layers of fully-connected network are similar to the first layer of neural network, and the input of the second layer of fully-connected network is the output H of the first layer of fully-connected network⁽¹⁾The weight matrix is W₂Offset is b₂The activation function is ReLu (X); in the third layer of fully-connected neural network, the input is the output H of the second layer of fully-connected network⁽²⁾The weight matrix is W₃Offset is b₃The activation function is ReLu (X); h⁽²⁾And H⁽³⁾In the form:

for implementation in a composite neural network model

For the vector normalization part, a set of attribute vectors corresponding to Sample set Sample is set

Setting attribute vectors

Has the dimension of

Then sample S_iCorresponding normalized attribute vector

The value of the j-th item of (1)

Can be expressed in the following form:

is obtained by

And

then, the idea based on superposition will

And

splicing to realize feature fusion, and recording the superposed result as

Expressed in the following form:

for a classified fully-connected network, the input is a hybrid vector

The weight matrix is W₄Offset coefficient of b₄The activation function is Softmax (X), then the output

Expressed in the following form:

output vector

As a one-dimensional 2-element vector, i.e.

The value of the first column of the vector represents news S_iThe value in the second column represents news S as a probability of top news_iProbability of non-headline news.

To demonstrate the effectiveness of this patent in solving the news classification problem, the "daily news" news is used herein as an example. In the process of processing data, the original data are divided into 4 times according to the administration time to form corresponding 4 sub-data sets, wherein the original data are distinguished from different styles of different leader core administration periods.

Verification experiments are carried out on the 4 news subdata sets, and XGBoost, Random Forest and SVM classification methods based on Doc2Vec vectors, a fastText text classification method based on deep learning, a TextCNN text classification method, a TextRNN text classification method and a TextRCNN text classification method are evaluated respectively, and the performances of a newspaper and periodical news classification method based on a text context structure and attribute information overlay network on the head edition news classification problem of the Renminbi newspaper are provided.

FIG. 5 illustrates the accuracy, correctness, recall, and F of the classification of 4 stages of the "daily Rens for people" news data using the various text classification methods mentioned above₁The value is obtained.

From the sub-graph a in fig. 5, we can see that the news classification method proposed by this patent is considerably improved compared with other classification methods. Specifically, in the first sub data set Stage1, compared with the Doc2Vec vector-based XGBoost, Random Forest, SVM classification method, fastText classification method, TextCNN text classification method, TextRNN text classification method, and TextRCNN text classification method, the news classification method provided by the present patent is respectively improved by 10.31%, 3.01%, 14.32%, 32.55%, 21.32%, 23.93%, and 22.10%. In the second sub data set Stage2, compared with the XGBoost, SVM classification method, fastText classification method, TextCNN text classification method, TextRNN text classification method, and TextRCNN text classification method based on Doc2Vec vector, the news classification method proposed by the present patent is respectively improved by 4.88%, 11.00%, 16.38%, 5.57%, 5.71%, and 0.18%. In the third subdata set Stage3, compared with the XGBoost, Random Forest, SVM classification method, fastText classification method, TextCNN text classification method, TextRNN text classification method and TextRCNN text classification method based on Doc2Vec vector, the news classification method provided by the patent is respectively improved by 9.72%, 0.15%, 13.48%, 17.33%, 17.01%, 17.67% and 18.73%. In the fourth sub data set Stage4, compared with the XGBoost, Random Forest, SVM classification method, fastText classification method, TextCNN text classification method, TextRNN text classification method, TextRCNN text classification method based on Doc2Vec vector, the news classification method provided by the patent is respectively 5.71%, 1.09%, 14.10%, 3.47%, 3.28%, 5.62%, 1.30%.

Since the proportion of the first edition news and the non-first edition news in the people daily newspaper is not balanced, the accuracy rate of the classification result is far from enough. In FIG. 5, sub-graphs b, c, d are derived from the accuracy, recall and F of the predicted results₁The values depict the actual performance of each text classification method in processing the classification problem of the first edition news of the people's daily newspaper.

From the sub-graphs b and c in fig. 5, we can see that the news classification method proposed by the present patent has relatively higher accuracy and recall rate on each sub-data set or no order of magnitude difference from the best result. Specifically, in terms of accuracy, the method is better than the XGBoost, SVM classification method, fastText classification method, TextCNN text classification method, TextRNN text classification method, TextRCNN text classification method based on Doc2Vec vector in all stages, and is better than the Random Forest classification method based on Doc2Vec vector in some specific stages (for example, Stage 3). In terms of recall, the method of Random Forest classification based on Doc2Vec vector is superior at all stages, and the method of TextRCNN text classification is superior at some specific stages (e.g., Stage 2).

From fig. 5, sub-graph d, we can see that the news classification method proposed by this patent is in F of classification result compared with other algorithms₁There is a considerable improvement in the value. Specifically, in the first sub data set Stage1, compared with the Doc2Vec vector-based XGBoost, Random Forest, SVM classification method, fastText classification method, TextCNN text classification method, TextRNN text classification method, and TextRCNN text classification method, the news classification method provided by the present patent is respectively improved by 3.21%, 42.53%, 4.06, 25.25%, 23.45%, 22.25%, and 20.43%. In the second subdata set Stage2, compared with the Doc2Vec vector-based XGBoost, Random Forest, SVM classification method, fastText classification method, TextCNN text classification method, TextRNN text classification method, and TextRCNN text classification method, the news classification method provided by the present patent is respectively improved by 2.70%, 42.99, 3.93%, 8.88%, 5.53%, 3.73%, and 3.33%. In the third subdata set Stage3, compared with the XGBoost, Random Forest, SVM classification method, fastText classification method, TextCNN text classification method, TextRNN text classification method, TextRCNN text classification method based on Doc2Vec vector, the news classification method provided by the patent is respectively improved by 7.76%, 46.93%, 8.43%, 12.80%, 12.27%, 12.49%, 12.54%. In the fourth subdata set Stage4, compared with an XGboost, a Random Forest, an SVM classification method, a fastText text classification method, a TextCNN text classification method and a TextRNN text classification method based on a Doc2Vec vector, the news classification method provided by the patent is respectively improved by 9.38%, 46.85%, 14.04% and 1.2%8％，0.69％，3.76％。

Claims

1. A news classification method of newspapers and periodicals based on a text context structure and attribute information superposition network specifically comprises the following steps:

step 1: acquiring data;

acquiring text information and attribute information of news of a certain newspaper from a database, wherein the text information represents text content of the news, and the attribute information comprises 7 attribute information, specifically: the total edition number of a certain newspaper and periodical on the same day, the title word number of the news, the text word number of the news, the maximum word number of the news title of the certain newspaper and periodical on the same day, the maximum word number of the news text of the certain newspaper and periodical on the same day, the minimum word number of the news title of the certain newspaper and periodical on the same day and the era number;

step 2: generating a text feature vector;

and step 3: generating attribute feature vectors;

and 4, step 4: dividing a data set;

dividing news data in a database into a training set, a verification set and a test set at random;

and 5: sampling;

weighting the first edition news and the non-first edition news in the training set, and obtaining a training sample set by adopting a weighted random sampling mode;

for class C_jIn words, is provided with

The samples belong to class C_jThen this

SampleSample d of any one of the texts_iWeight of (1)_iCan be expressed in the following form:

c in the formula represents a defined classification category set, a weighted random sampling method is used, and Weight is set Weight ═ Weight according to samples₁,weight₂,…,weight_nFrom a set of samples D ═ D₁,d₂,…,d_nThe way of selecting m samples in (1) is as follows:

Let set K ═ K-_iWhere i ═ 1,2, …, n; pressing the set K by K_iSorting, selecting maximum m elements to form a Sample set, where the Sample can be expressed as the following formula:

Sample＝{d_l}

wherein l satisfies k_l≥k_m-th

In the above formula k_m-thA value representing the mth largest element in the set K;

step 6: training a network model;

training a composite neural network by using a text feature vector and an attribute feature vector corresponding to news in a training sample set and a category corresponding to the news; the method comprises the following steps:

the composite neural network includes: implementation of

Vector dimension reduction part and implementation

Partial sum implementation of vector dimension reduction

The outputs of the vector normalization parts are jointly input into a classification full-connection network to realize final classification;

for implementation in a composite neural network model

For the vector dimension reduction part, 3 layers of fully-connected neural networks are provided; in a first layer of fully-connected neural networks, the input is a text characterization vector

for implementation in a composite neural network model

For the part of vector normalization, a set of attribute vectors corresponding to Sample in the Sample set is assumed

Setting attribute vectors

Has the dimension of

Then sample S_iCorresponding normalized attribute vector

The value of the j-th item of (1)

Can be expressed in the following form:

in obtaining

And

then, the idea based on superposition will

And

splicing to realize feature fusion, and recording the superposed result as

Expressed in the following form:

for a classified fully-connected network, the input is a hybrid vector

Expressed in the following form:

output vector

Is a one-dimensional 2-element vector, i.e.

The value of the first column of the vector represents news S_iThe value in the second column represents news S as a probability of top news_iProbability of being non-headline news;

and 7: predicting;

2. The method for classifying news of newspapers and periodicals based on the text context structure and the attribute information overlay network as claimed in claim 1, wherein the specific method for generating the text feature vector in the step 2 is as follows:

training word vectors and text vectors simultaneously; let text d_iThe corresponding code vector is p_iThe coded vector corresponding to the word t in the text is w_t(ii) a Can construct words t in the text d_iThe vector at the j-th occurrence of (a) is of the form:

t in the formula is the unilateral context word number considered by the algorithm; setting words t in text d_iS times, then use as follows

To represent the vector of the word t:

n represents a text d_iTotal number of words t in the text d_iSum vector of

in the above formula, W is a hidden layer in the neural network model, and b is an offset, so the following loss function is constructed:

in the formula, distance is a distance function between vectors and is a second-order Euclidean distance, and W can be obtained by optimizing the loss function_bestMatrix sum b_bestBiasing; in the text d_iCorresponding vector p_iAs input, it can be characterized by its low-dimensional vector

In the form of: