CN111966828B - Newspaper and magazine news classification method based on text context structure and attribute information superposition network


Info

Publication number
CN111966828B
CN111966828B (application number CN202010729459.0A)
Authority
CN
China
Prior art keywords
text
news
vector
attribute
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010729459.0A
Other languages
Chinese (zh)
Other versions
CN111966828A (en)
Inventor
蔡世民
陈明仁
戴礼灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010729459.0A priority Critical patent/CN111966828B/en
Publication of CN111966828A publication Critical patent/CN111966828A/en
Application granted granted Critical
Publication of CN111966828B publication Critical patent/CN111966828B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9532 - Query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data; Database structures therefor; File system structures therefor
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a newspaper and magazine news classification method based on a text context structure and attribute information superposition network, belonging to the field of information processing. From the input perspective, a text vector representation method converts texts of indefinite length into vectors of fixed length, avoiding loss and redundancy of text information; from the perspective of the training data, weighted random sampling is adopted, optimizing the composition of the training samples by adjusting, through weights, the probability that each sample is selected; from the perspective of feature extraction, the method considers not only the context structure information of the text but also its attribute information, optimizing the feature extraction process. The invention thereby both improves the way text features are extracted and additionally incorporates attribute features into the feature construction process, enriching the sources of the features.

Description

Newspaper and magazine news classification method based on text context structure and attribute information superposition network
Technical Field
The invention belongs to the field of information processing, and relates to a news classification method and system based on a superposition network of text context structure information and attribute information.
Background
Definition of key terms:
Neural network: a mathematical or computational model that mimics the structure and function of biological neural networks and is used to estimate or approximate functions. A neural network is composed of a large number of interconnected artificial neurons. In most cases, an artificial neural network can change its internal structure on the basis of external information; it is an adaptive system.
Text characterization: a machine learning technique in the field of natural language processing that maps the high-level cognitive abstraction of a text into a vector over the real number field, so as to facilitate subsequent computer processing.
Weighted random sampling: a sampling technique that determines each sample's selection probability from its weight; it can effectively mitigate imbalanced class distributions at the sampling level.
Newspapers and periodicals are a transmission medium that conveys written information on paper. Their main functions include explanation, publicity and image maintenance; for example, the People's Daily maintains the image of the country, the Liberation Army Daily maintains the image of the army, and an enterprise newspaper maintains the image of the enterprise.
Generally, a newspaper carries several news items each day. Whether a given news item becomes that day's first-edition (front-page) news is related to the amount of information it carries. Even with current natural language processing techniques, it is still difficult to directly quantify the amount of information in a piece of text news. Therefore, solving the binary, black-box classification problem of "whether a given news item is first-edition news" (hereinafter abbreviated as the "newspaper news classification problem") with a neural network, which is itself a black box, is a direct and efficient choice.
With the success of AlexNet, the study of neural networks entered a new stage. Current text classification technology mainly uses neural networks as its technical means, fully mining the structural information of the text and classifying on the basis of that feature information. In the field of text classification, algorithms such as TextCNN, TextRNN, FastText and TextRCNN have been proposed in succession, performing feature extraction on texts from the perspectives of convolutional and recurrent neural networks respectively; these algorithms perform excellently on multiple test data sets.
The prior art has the following disadvantages:
Although algorithms such as TextCNN, TextRNN, FastText and TextRCNN perform excellently on many open text classification test data sets, they cannot effectively solve the newspaper news classification problem, owing to the particularities of newspaper news. Specifically: First, the length of a news item in a newspaper is indefinite, and its length is not directly related to its importance; the above techniques mostly limit the input length of the text and require truncation or padding of the input, which may cause loss or redundancy in the extracted features. Second, because of the particularity of newspaper news, the two classes of the newspaper news classification problem are obviously imbalanced: the number of first-edition news items is far smaller than the number of non-first-edition items, and a classifier trained directly on such biased data is itself biased, i.e. it classifies news as non-first-edition with excessive probability. Finally, although the main determinant of whether a news item can become first-edition news is the amount of information contained in its text, news whose text is too long or too short is rarely found on the front page because of layout constraints; most prior techniques consider only the text's context structure information and ignore attribute information such as title length and text length, so these features are lost.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a newspaper and magazine news classification method based on a text context structure and attribute information superposition network. From the input perspective, a text vector representation method converts texts of indefinite length into vectors of fixed length, avoiding loss and redundancy of text information; from the perspective of the training data, weighted random sampling is adopted, optimizing the composition of the training samples by adjusting, through weights, the probability that each sample is selected; from the perspective of feature extraction, the method considers not only the context structure information of the text but also its attribute information, optimizing the feature extraction process.
The invention discloses a newspaper and periodical news classification method based on a text context structure and attribute information superposition network, which specifically comprises the following steps:
step 1: acquiring data;
acquiring the text information and attribute information of the news of a certain newspaper from a database, wherein the text information is the text content of the news and the attribute information comprises 8 items, specifically: the total number of editions of the newspaper on that day, the number of words in the title of the news item, the number of words in the text of the news item, the maximum number of words among that day's news titles, the minimum number of words among that day's news titles, the maximum number of words among that day's news texts, the minimum number of words among that day's news texts, and the era number;
step 2: generating a text feature vector;
carrying out vector representation on news text information, converting each news text into a text feature vector with low dimension and high information content respectively, and storing the text feature vector into a database;
step 3: generating attribute feature vectors;
carrying out vector splicing on the news attribute information, splicing all the attribute information into an attribute feature vector, and finally storing the result into a database;
step 4: dividing a data set;
dividing the news data in the database at random into a training set, a verification set and a test set, in the specific proportion 6:2:2;
step 5: sampling;
weighting the first edition news and the non-first edition news in the training set, and obtaining a training sample set with relatively balanced quantities of the first edition news and the non-first edition news by adopting a weighted random sampling mode;
step 6: training a network model;
training a composite neural network by using a text feature vector and an attribute feature vector corresponding to news in a training sample set and a category corresponding to the news;
step 7: predicting;
inputting the text feature vector and the attribute feature vector of each news item in the test set into the trained composite neural network, the output of which is the prediction of whether the news item is first-edition news.
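To make steps 1 to 7 concrete, the following is a minimal sketch of the data-preparation part of the pipeline (steps 1, 3 and 4); the field names and helper functions are hypothetical, not the patent's own.

```python
import random

def attribute_vector(news, day_stats):
    # Step 3: splice the 8 attribute items of step 1 into one attribute feature vector.
    # `news` and `day_stats` are hypothetical dicts carrying the fields named in step 1.
    return [
        day_stats["total_editions"],   # total number of editions that day
        len(news["title"].split()),    # word count of this item's title
        len(news["body"].split()),     # word count of this item's text
        day_stats["max_title_words"],  # longest title of the day
        day_stats["min_title_words"],  # shortest title of the day
        day_stats["max_body_words"],   # longest text of the day
        day_stats["min_body_words"],   # shortest text of the day
        day_stats["era"],              # era number
    ]

def split_dataset(items, ratios=(0.6, 0.2, 0.2)):
    # Step 4: random 6:2:2 split into training, verification and test sets.
    items = items[:]
    random.shuffle(items)
    n = len(items)
    a = int(ratios[0] * n)
    b = a + int(ratios[1] * n)
    return items[:a], items[a:b], items[b:]
```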
Compared with the prior art, the invention has the beneficial effects that:
1. The invention not only improves the way text features are extracted but also additionally incorporates attribute features into the feature construction process. In step 2, a text vector representation method converts texts of indefinite length into vectors of fixed length, avoiding loss and redundancy of text information and optimizing the extraction of text features; in step 3, the attribute information of the news is additionally added, enriching the sources of the features. The features of the present invention are thus constructed more efficiently and diversely than in the prior art.
2. The invention uses the technique of weighted random sampling and trains the model on the resulting sample set. Unlike other application fields of the related art, the newspaper news classification problem faces the objective fact that the proportion of first-edition to non-first-edition news is severely imbalanced. In step 5, a training sample set is obtained by weighted random sampling, realizing strict control over the composition of the training data while preserving its authenticity.
3. The invention applies the idea of solving a black-box problem with a black-box method, solving the problem end to end. For the newspaper news classification problem, where no existing algorithm or index can directly measure the importance of a news item, a network model named composite is proposed in step 6 to solve the classification problem end to end. Compared with the prior art, the invention simulates the thinking of the human brain by means of a neural network.
Drawings
Fig. 1 is a flowchart of a news classification method according to the present invention.
Fig. 2 is a schematic structural diagram of a text vector representation method.
Fig. 3 is a schematic diagram illustrating the effect of the weighted random sampling algorithm.
Fig. 4 is a schematic diagram of a composite neural network structure.
Fig. 5 shows the classification results of the present invention on the People's Daily news classification problem.
Detailed description of the preferred embodiments
For the purpose of making the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings.
Fig. 1 visually represents the steps of the news classification method proposed by the present invention. Specifically, the method comprises the steps of data acquisition, text feature vector generation, attribute feature vector generation, data set division, weighted random sampling, composite model training and final classification prediction.
Fig. 2 visually shows the method used by the invention to convert text into vectors; the principle is as follows:

Word vectors and text vectors are trained simultaneously. Let the coding vector corresponding to text $d_i$ be $p_i$, and let the coding vector corresponding to word $t$ in the text be $w_t$. The vector for the $j$-th occurrence of word $t$ in text $d_i$ is constructed from the text vector and the surrounding context word vectors:

$$x_t^{(j)} = p_i + \sum_{k=-T,\,k\neq 0}^{T} w_{t+k}$$

where $T$ is the number of unilateral context words considered by the algorithm. Supposing word $t$ occurs $S$ times in text $d_i$, the mean vector $\bar{x}_t$ is used to represent the vector of word $t$:

$$\bar{x}_t = \frac{1}{S}\sum_{j=1}^{S} x_t^{(j)}$$

With $n$ representing the total number of words $t$ in text $d_i$, the sum vector $\sum_{t=1}^{n}\bar{x}_t$ of text $d_i$ is substituted into the neural network model of the text vector representation method, and the following output is obtained:

$$\hat{p}_i = W\Big(\sum_{t=1}^{n}\bar{x}_t\Big) + b$$

In the above formula, $W$ is the hidden layer of the neural network model and $b$ is an offset; the following loss function is then constructed:

$$\mathrm{Loss} = \sum_{i}\mathrm{distance}(\hat{p}_i,\,p_i)$$

where $\mathrm{distance}$ is a distance function between vectors, which can be the second-order Euclidean distance. By optimizing the loss function, the matrix $W_{best}$ and the offset $b_{best}$ are obtained. Taking the vector $p_i$ corresponding to text $d_i$ as input, its low-dimensional vector characterization $\tilde{p}_i$ is obtained in the form:

$$\tilde{p}_i = W_{best}\,p_i + b_{best}$$
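As described, the scheme resembles paragraph-vector (Doc2Vec-style) joint training of word and text vectors followed by a learned linear compression. The following is a minimal sketch of that reading; the use of gensim for the paragraph vectors and the randomly initialized stand-in for $(W_{best}, b_{best})$ are assumptions of this sketch, not the patent's implementation.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["stock", "market", "rises"], ["new", "policy", "announced"]]
docs = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

# Jointly train word vectors and text (paragraph) vectors; `window` plays the
# role of T, the number of unilateral context words.
model = Doc2Vec(docs, vector_size=64, window=3, min_count=1, epochs=40)

# p_i: the coding vector of each text d_i.
p = np.stack([model.dv[i] for i in range(len(texts))])

# Stand-in for the optimized linear map (W_best, b_best); in the patent it is
# obtained by minimizing the distance-based loss above, not drawn at random.
rng = np.random.default_rng(0)
W_best = rng.normal(size=(16, p.shape[1]))
b_best = np.zeros(16)

p_tilde = p @ W_best.T + b_best   # low-dimensional text feature vectors
print(p_tilde.shape)              # (2, 16)
```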
fig. 3 illustrates the sampling effect using a weighted random sampling method, as follows:
for class CjIn words, is provided with
Figure BDA0002602639240000047
The samples belong to class CjThen this
Figure BDA0002602639240000048
Sample d of any one of the samplesiWeight of (1)iCan be expressed in the following form:
Figure BDA0002602639240000051
c in the formula represents a defined classification category set, and a weighted random sampling method is used, wherein Weight is set from Weight of samples to { Weight ═ Weight }1,weight2,…,weightnD ═ D from sample set1,d2,…,dnThe way of selecting m samples in (1) is as follows:
phi is the element Weight in the set WeightiE.g. Weight, selecting uniformly distributed random numbers u between 0 and 1iAnd calculating k using the following formulai
Figure BDA0002602639240000052
Let set K ═ K-i-where i ═ 1,2, …, n; pressing the set K by KiSorting, selecting maximum m elements to form a Sample set, where the Sample can be expressed as the following formula:
Sample={dl}
Where l meets that kl≥km-th
in the above formula km-thRepresenting the value of the mth largest element in the set K.
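The key formula $k_i = u_i^{1/\mathrm{weight}_i}$ is the classical A-Res weighted sampling scheme of Efraimidis and Spirakis. The following is a minimal sketch of the sampling step; the inverse-class-frequency weight matches the reconstruction above, and the helper names are this sketch's own.

```python
import random
from collections import Counter

def class_weights(labels):
    # weight_i = n / (|C| * n_{C_j}): samples of rare classes get larger weights.
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return [n / (c * counts[y]) for y in labels]

def weighted_random_sample(samples, weights, m):
    # A-Res: draw u_i ~ U(0, 1), key each sample by k_i = u_i ** (1 / weight_i),
    # and keep the m samples with the largest keys.
    keyed = [(random.random() ** (1.0 / w), d) for d, w in zip(samples, weights)]
    keyed.sort(key=lambda kd: kd[0], reverse=True)
    return [d for _, d in keyed[:m]]

labels = ["top"] * 10 + ["non-top"] * 90     # severely imbalanced classes
data = list(range(100))
balanced = weighted_random_sample(data, class_weights(labels), 40)
```

Because a rare-class sample's large weight pushes its key toward 1, the selected set is far more balanced between the two classes than the raw data.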
Fig. 4 shows the structure of the composite neural network; its calculation process is analyzed as follows:

The inputs of the composite neural network are the text feature vector $\tilde{p}_i$ and the attribute feature vector $a_i$ of the sample news text $S_i$. The model uses 2 different parts: one performs a dimension-reduction operation on the text characterization vector $\tilde{p}_i$, and the other performs a normalization operation on the attribute feature vector $a_i$.

Further, the composite neural network comprises: a vector dimension-reduction part acting on $\tilde{p}_i$, a vector normalization part acting on $a_i$, and a classification fully connected network. The outputs of the vector dimension-reduction part and of the vector normalization part are jointly input into the classification fully connected network to realize the final classification.

The vector dimension-reduction part consists of 3 layers of fully connected neural networks. In the first fully connected layer, the input is the text characterization vector $\tilde{p}_i$, the weight matrix is $W_1$, the offset is $b_1$, and the activation function is $\mathrm{ReLU}(X)$; the output is expressed as follows:

$$H^{(1)} = \mathrm{ReLU}(W_1\,\tilde{p}_i + b_1)$$

The second and third fully connected layers are similar to the first. The input of the second layer is the output $H^{(1)}$ of the first layer, with weight matrix $W_2$, offset $b_2$ and activation function $\mathrm{ReLU}(X)$; the input of the third layer is the output $H^{(2)}$ of the second layer, with weight matrix $W_3$, offset $b_3$ and activation function $\mathrm{ReLU}(X)$. $H^{(2)}$ and $H^{(3)}$ take the form:

$$H^{(2)} = \mathrm{ReLU}(W_2\,H^{(1)} + b_2)$$

$$H^{(3)} = \mathrm{ReLU}(W_3\,H^{(2)} + b_3)$$

For the vector normalization part, let $A$ be the set of attribute vectors corresponding to the sample set $\mathrm{Sample}$, each attribute vector $a_i$ having dimension $m_a$. Then the value $\hat{a}_{i,j}$ of the $j$-th item of the normalized attribute vector $\hat{a}_i$ corresponding to sample $S_i$ can be expressed in the following (min-max) form:

$$\hat{a}_{i,j} = \frac{a_{i,j} - \min_{k} a_{k,j}}{\max_{k} a_{k,j} - \min_{k} a_{k,j}}$$

After $H^{(3)}$ and $\hat{a}_i$ are obtained, the two are spliced together based on the idea of superposition to realize feature fusion; the superposed result, denoted $h_i$, is expressed in the following form:

$$h_i = \big[\,H^{(3)};\ \hat{a}_i\,\big]$$

For the classification fully connected network, the input is the hybrid vector $h_i$, the weight matrix is $W_4$, the offset coefficient is $b_4$, and the activation function is $\mathrm{Softmax}(X)$; the output $\hat{y}_i$ is expressed in the following form:

$$\hat{y}_i = \mathrm{Softmax}(W_4\,h_i + b_4)$$

The output vector $\hat{y}_i$ is a one-dimensional 2-element vector, i.e. $\hat{y}_i = (\hat{y}_{i,1}, \hat{y}_{i,2})$: the value of the first column represents the probability that news $S_i$ is first-edition news, and the value of the second column represents the probability that news $S_i$ is non-first-edition news.
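Taken together, the network is a 3-layer ReLU perceptron over the text vector, min-max normalization of the attribute vector, splicing, and a softmax classification layer. The following is a minimal PyTorch sketch of this structure; the layer widths and batch-wise normalization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CompositeNet(nn.Module):
    def __init__(self, text_dim=16, attr_dim=8, hidden=(64, 32, 16), n_classes=2):
        super().__init__()
        h1, h2, h3 = hidden
        # Dimension-reduction part: 3 fully connected ReLU layers on the text vector.
        self.reduce = nn.Sequential(
            nn.Linear(text_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, h3), nn.ReLU(),
        )
        # Classification part: one fully connected layer over the spliced vector.
        self.classify = nn.Linear(h3 + attr_dim, n_classes)

    def forward(self, text_vec, attr_vec):
        # Min-max normalize each attribute column over the batch (the sample set).
        lo = attr_vec.min(dim=0).values
        hi = attr_vec.max(dim=0).values
        attr_norm = (attr_vec - lo) / (hi - lo + 1e-8)
        # Splice (superpose) the reduced text features with the normalized attributes.
        mixed = torch.cat([self.reduce(text_vec), attr_norm], dim=-1)
        return torch.softmax(self.classify(mixed), dim=-1)  # [P(first-edition), P(other)]

net = CompositeNet()
probs = net(torch.randn(4, 16), torch.rand(4, 8))
print(probs.shape)  # torch.Size([4, 2])
```

For training, one would normally feed the pre-softmax logits to `nn.CrossEntropyLoss` rather than apply softmax inside `forward`; the explicit softmax here mirrors the description above.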
To demonstrate the effectiveness of this patent in solving the news classification problem, People's Daily news is taken as an example. In processing the data, the original data were divided into 4 periods according to term of office, forming 4 corresponding sub-data sets, so as to distinguish the differing styles of different leadership periods.
Verification experiments were carried out on the 4 news sub-data sets, respectively evaluating the Doc2Vec-vector-based XGBoost, Random Forest and SVM classification methods and the deep-learning-based fastText, TextCNN, TextRNN and TextRCNN text classification methods against the performance of the proposed newspaper and periodical news classification method based on the text context structure and attribute information superposition network on the first-edition news classification problem of the People's Daily.
Fig. 5 shows the accuracy, precision, recall and F1 values obtained when classifying the 4 stages of the People's Daily news data using the various text classification methods mentioned above.
From sub-graph a in Fig. 5, we can see that the news classification method proposed in this patent achieves a considerable improvement in accuracy over the other classification methods. Specifically, on the first sub-data set, Stage1, compared with the Doc2Vec-based XGBoost, Random Forest and SVM classifiers and the fastText, TextCNN, TextRNN and TextRCNN text classification methods, the proposed method improves by 10.31%, 3.01%, 14.32%, 32.55%, 21.32%, 23.93% and 22.10%, respectively. On Stage2, compared with the Doc2Vec-based XGBoost and SVM classifiers and the fastText, TextCNN, TextRNN and TextRCNN methods, it improves by 4.88%, 11.00%, 16.38%, 5.57%, 5.71% and 0.18%, respectively. On Stage3, compared with the Doc2Vec-based XGBoost, Random Forest and SVM classifiers and the fastText, TextCNN, TextRNN and TextRCNN methods, it improves by 9.72%, 0.15%, 13.48%, 17.33%, 17.01%, 17.67% and 18.73%, respectively. On Stage4, compared with the same seven methods, it improves by 5.71%, 1.09%, 14.10%, 3.47%, 3.28%, 5.62% and 1.30%, respectively.
Since the proportion of first-edition to non-first-edition news in the People's Daily is imbalanced, the accuracy of the classification result alone is far from sufficient. Sub-graphs b, c and d of Fig. 5 therefore depict, through the precision, recall and F1 values of the predicted results, the actual performance of each text classification method on the first-edition news classification problem of the People's Daily.
From sub-graphs b and c in Fig. 5, we can see that the proposed news classification method attains comparatively higher precision and recall on each sub-data set, or is of the same order of magnitude as the best result. Specifically, in terms of precision it is better than the Doc2Vec-based XGBoost and SVM classifiers and the fastText, TextCNN, TextRNN and TextRCNN methods at all stages, and better than the Doc2Vec-based Random Forest classifier at some specific stages (e.g. Stage3). In terms of recall it is better than the Doc2Vec-based Random Forest classifier at all stages and better than the TextRCNN method at some specific stages (e.g. Stage2).
From sub-graph d in Fig. 5, we can see that the proposed news classification method achieves a considerable improvement in the F1 value of the classification result compared with the other algorithms. Specifically, on Stage1, compared with the Doc2Vec-based XGBoost, Random Forest and SVM classifiers and the fastText, TextCNN, TextRNN and TextRCNN text classification methods, the proposed method improves by 3.21%, 42.53%, 4.06%, 25.25%, 23.45%, 22.25% and 20.43%, respectively. On Stage2, compared with the same seven methods, it improves by 2.70%, 42.99%, 3.93%, 8.88%, 5.53%, 3.73% and 3.33%, respectively. On Stage3, compared with the same seven methods, it improves by 7.76%, 46.93%, 8.43%, 12.80%, 12.27%, 12.49% and 12.54%, respectively. On Stage4, compared with the Doc2Vec-based XGBoost, Random Forest and SVM classifiers and the fastText, TextCNN and TextRNN methods, it improves by 9.38%, 46.85%, 14.04%, 1.28%, 0.69% and 3.76%, respectively.
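Because the two classes are severely imbalanced, the comparison above rests on precision, recall and F1 rather than accuracy alone. A short sketch of how such figures are computed (the label arrays are placeholders, not the experimental data):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 0, 0, 1, 0, 0, 1]   # 1 = first-edition news, 0 = non-first-edition
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]   # a classifier's predictions

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```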

Claims (2)

1. A news classification method of newspapers and periodicals based on a text context structure and attribute information superposition network specifically comprises the following steps:
step 1: acquiring data;
acquiring the text information and attribute information of the news of a certain newspaper from a database, wherein the text information is the text content of the news and the attribute information comprises 7 items, specifically: the total number of editions of the newspaper on that day, the number of words in the title of the news item, the number of words in the text of the news item, the maximum number of words among that day's news titles, the maximum number of words among that day's news texts, the minimum number of words among that day's news titles, and the era number;
step 2: generating a text feature vector;
carrying out vector representation on news text information, converting each news text into a text feature vector with low dimension and high information content respectively, and storing the text feature vector into a database;
step 3: generating attribute feature vectors;
carrying out vector splicing on the news attribute information, splicing all the attribute information into an attribute feature vector, and finally storing the result into a database;
step 4: dividing a data set;
dividing news data in a database into a training set, a verification set and a test set at random;
step 5: sampling;
weighting the first edition news and the non-first edition news in the training set, and obtaining a training sample set by adopting a weighted random sampling mode;
for class $C_j$, suppose $n_{C_j}$ samples belong to class $C_j$; then the weight $\mathrm{weight}_i$ of any one sample $d_i$ among these $n_{C_j}$ samples can be expressed in the following form:

$$\mathrm{weight}_i = \frac{n}{|C|\cdot n_{C_j}}$$

wherein $C$ denotes the defined set of classification categories and $n$ the total number of samples; using the weighted random sampling method, $m$ samples are selected from the sample set $D = \{d_1, d_2, \dots, d_n\}$ according to the sample weight set $\mathrm{Weight} = \{\mathrm{weight}_1, \mathrm{weight}_2, \dots, \mathrm{weight}_n\}$ as follows:

for each element $\mathrm{weight}_i \in \mathrm{Weight}$, a uniformly distributed random number $u_i$ between 0 and 1 is selected and $k_i$ is calculated using the following formula:

$$k_i = u_i^{1/\mathrm{weight}_i}$$

let the set $K = \{k_i\}$, where $i = 1, 2, \dots, n$; the set $K$ is sorted by $k_i$ and the $m$ largest elements are selected to form the sample set, and $\mathrm{Sample}$ can be expressed as:

$$\mathrm{Sample} = \{d_l \mid k_l \ge k_{m\text{-}th}\}$$

wherein $k_{m\text{-}th}$ represents the value of the $m$-th largest element in the set $K$;
step 6: training a network model;
training a composite neural network by using a text feature vector and an attribute feature vector corresponding to news in a training sample set and a category corresponding to the news; the method comprises the following steps:
the composite neural network comprises: a vector dimension-reduction part acting on the text feature vector $\tilde{p}_i$, a vector normalization part acting on the attribute feature vector $a_i$, and a classification fully connected network; the outputs of the vector dimension-reduction part and of the vector normalization part are jointly input into the classification fully connected network to realize the final classification;

the vector dimension-reduction part consists of 3 layers of fully connected neural networks; in the first fully connected layer, the input is the text characterization vector $\tilde{p}_i$, the weight matrix is $W_1$, the offset is $b_1$, the activation function is $\mathrm{ReLU}(X)$, and the output is expressed as follows:

$$H^{(1)} = \mathrm{ReLU}(W_1\,\tilde{p}_i + b_1)$$

the second and third fully connected layers are similar to the first: the input of the second layer is the output $H^{(1)}$ of the first layer, with weight matrix $W_2$, offset $b_2$ and activation function $\mathrm{ReLU}(X)$; the input of the third layer is the output $H^{(2)}$ of the second layer, with weight matrix $W_3$, offset $b_3$ and activation function $\mathrm{ReLU}(X)$; $H^{(2)}$ and $H^{(3)}$ take the form:

$$H^{(2)} = \mathrm{ReLU}(W_2\,H^{(1)} + b_2)$$

$$H^{(3)} = \mathrm{ReLU}(W_3\,H^{(2)} + b_3)$$

for the vector normalization part, let $A$ be the set of attribute vectors corresponding to the sample set $\mathrm{Sample}$, each attribute vector $a_i$ having dimension $m_a$; then the value $\hat{a}_{i,j}$ of the $j$-th item of the normalized attribute vector $\hat{a}_i$ corresponding to sample $S_i$ can be expressed in the following form:

$$\hat{a}_{i,j} = \frac{a_{i,j} - \min_{k} a_{k,j}}{\max_{k} a_{k,j} - \min_{k} a_{k,j}}$$

after $H^{(3)}$ and $\hat{a}_i$ are obtained, the two are spliced based on the idea of superposition to realize feature fusion, and the superposed result, denoted $h_i$, is expressed in the following form:

$$h_i = \big[\,H^{(3)};\ \hat{a}_i\,\big]$$

for the classification fully connected network, the input is the hybrid vector $h_i$, the weight matrix is $W_4$, the offset coefficient is $b_4$, and the activation function is $\mathrm{Softmax}(X)$; the output $\hat{y}_i$ is expressed in the following form:

$$\hat{y}_i = \mathrm{Softmax}(W_4\,h_i + b_4)$$

the output vector $\hat{y}_i$ is a one-dimensional 2-element vector, i.e. $\hat{y}_i = (\hat{y}_{i,1}, \hat{y}_{i,2})$; the value of the first column represents the probability that news $S_i$ is first-edition news, and the value of the second column represents the probability that news $S_i$ is non-first-edition news;
step 7: predicting;
inputting the text feature vector and the attribute feature vector of each news item in the test set into the trained composite neural network, the output of which is the prediction of whether the news item is first-edition news.
2. The method for classifying newspaper and periodical news based on the text context structure and attribute information superposition network as claimed in claim 1, wherein the specific method for generating the text feature vector in step 2 is as follows:
training word vectors and text vectors simultaneously; let the coding vector corresponding to text $d_i$ be $p_i$, and let the coding vector corresponding to word $t$ in the text be $w_t$; the vector for the $j$-th occurrence of word $t$ in text $d_i$ is constructed from the text vector and the surrounding context word vectors:

$$x_t^{(j)} = p_i + \sum_{k=-T,\,k\neq 0}^{T} w_{t+k}$$

wherein $T$ is the number of unilateral context words considered by the algorithm; supposing word $t$ occurs $S$ times in text $d_i$, the mean vector $\bar{x}_t$ is used to represent the vector of word $t$:

$$\bar{x}_t = \frac{1}{S}\sum_{j=1}^{S} x_t^{(j)}$$

with $n$ representing the total number of words $t$ in text $d_i$, the sum vector $\sum_{t=1}^{n}\bar{x}_t$ of text $d_i$ is substituted into the neural network model of the text vector representation method, and the following output can be obtained:

$$\hat{p}_i = W\Big(\sum_{t=1}^{n}\bar{x}_t\Big) + b$$

in the above formula, $W$ is a hidden layer in the neural network model and $b$ is an offset, so the following loss function is constructed:

$$\mathrm{Loss} = \sum_{i}\mathrm{distance}(\hat{p}_i,\,p_i)$$

in the formula, $\mathrm{distance}$ is a distance function between vectors and is the second-order Euclidean distance; by optimizing the loss function, the matrix $W_{best}$ and the offset $b_{best}$ can be obtained; taking the vector $p_i$ corresponding to text $d_i$ as input, its low-dimensional vector characterization $\tilde{p}_i$ can be obtained in the form:

$$\tilde{p}_i = W_{best}\,p_i + b_{best}$$
CN202010729459.0A 2020-07-27 2020-07-27 Newspaper and magazine news classification method based on text context structure and attribute information superposition network Active CN111966828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010729459.0A CN111966828B (en) 2020-07-27 2020-07-27 Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010729459.0A CN111966828B (en) 2020-07-27 2020-07-27 Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Publications (2)

Publication Number Publication Date
CN111966828A CN111966828A (en) 2020-11-20
CN111966828B true CN111966828B (en) 2022-05-03

Family

ID=73364052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010729459.0A Active CN111966828B (en) 2020-07-27 2020-07-27 Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Country Status (1)

Country Link
CN (1) CN111966828B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11538210B1 (en) * 2021-11-22 2022-12-27 Adobe Inc. Text importance spatial layout

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918501A (en) * 2019-01-18 2019-06-21 平安科技(深圳)有限公司 Method, apparatus, equipment and the storage medium of news article classification
WO2020092834A1 (en) * 2018-11-02 2020-05-07 Valve Corporation Classification and moderation of text
CN111125354A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Text classification method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125354A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Text classification method and device
WO2020092834A1 (en) * 2018-11-02 2020-05-07 Valve Corporation Classification and moderation of text
CN109918501A (en) * 2019-01-18 2019-06-21 平安科技(深圳)有限公司 Method, apparatus, equipment and the storage medium of news article classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Weibo short text classification based on Word2vec; Zhang Qian et al.; Technology Research (《技术研究》); 2017-12-31; full text *

Also Published As

Publication number Publication date
CN111966828A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN108363753B (en) Comment text emotion classification model training and emotion classification method, device and equipment
CN105279495B (en) A kind of video presentation method summarized based on deep learning and text
CN105631468B (en) A kind of picture based on RNN describes automatic generation method
CN103544963B (en) A kind of speech-emotion recognition method based on core semi-supervised discrimination and analysis
CN109284506A (en) A kind of user comment sentiment analysis system and method based on attention convolutional neural networks
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111966917A (en) Event detection and summarization method based on pre-training language model
CN111931061B (en) Label mapping method and device, computer equipment and storage medium
CN112528163B (en) Social platform user occupation prediction method based on graph convolution network
CN111275401A (en) Intelligent interviewing method and system based on position relation
CN109446423B (en) System and method for judging sentiment of news and texts
CN111709575A (en) Academic achievement prediction method based on C-LSTM
CN111597340A (en) Text classification method and device and readable storage medium
CN113688635B (en) Class case recommendation method based on semantic similarity
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN112800225B (en) Microblog comment emotion classification method and system
CN112256866A (en) Text fine-grained emotion analysis method based on deep learning
CN114841151B (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN113946677A (en) Event identification and classification method based on bidirectional cyclic neural network and attention mechanism
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
CN115048511A (en) Bert-based passport layout analysis method
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN111966828B (en) Newspaper and magazine news classification method based on text context structure and attribute information superposition network
CN112950414B (en) Legal text representation method based on decoupling legal elements
CN113886562A (en) AI resume screening method, system, equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant