CN111966828A - Newspaper and magazine news classification method based on text context structure and attribute information superposition network - Google Patents

Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Info

Publication number
CN111966828A
Authority
CN
China
Prior art keywords
text
news
vector
weight
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010729459.0A
Other languages
Chinese (zh)
Other versions
CN111966828B (en)
Inventor
蔡世民
陈明仁
戴礼灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010729459.0A
Publication of CN111966828A
Application granted
Publication of CN111966828B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a newspaper and magazine news classification method based on a superposition network of text context structure and attribute information, and belongs to the field of information processing. The invention uses a text vector representation method to convert texts of indefinite length into vectors of fixed length, thereby avoiding the loss and redundancy of text information. From the perspective of the training data, weighted random sampling is adopted, and the composition of the training samples is optimized by adjusting, through weights, the probability that each sample is selected. From the perspective of feature extraction, the invention considers not only the context structure information of the text but also its attribute information, optimizing the feature extraction process. The invention thus both improves the way text features are extracted and additionally incorporates attribute features into the feature construction process, enriching the sources of the features.

Description

Newspaper and magazine news classification method based on text context structure and attribute information superposition network
Technical Field
The invention belongs to the field of information processing, and relates to a news classification method and system based on a superposition network of text context structure information and attribute information.
Background
Definition of key terms:
Neural network: a mathematical or computational model that mimics the structure and function of biological neural networks and is used to estimate or approximate functions. A neural network performs its computation through a large number of connected artificial neurons. In most cases, an artificial neural network can change its internal structure on the basis of external information, making it an adaptive system.
Text characterization: a machine learning technique in natural language processing that maps the high-level cognitive abstraction of a text into a vector over the real numbers, to facilitate subsequent processing by computer.
Weighted random sampling: a sampling technique that determines each sample's selection probability from its weight; it can effectively alleviate imbalanced class distributions at the sampling level.
Newspapers and periodicals are a transmission medium that conveys written material on paper. Their main functions include explanation, publicity, and the maintenance of an image: for example, the People's Daily maintains the image of the country, the PLA Daily maintains the image of the army, and an enterprise newspaper maintains the image of its enterprise.
Generally, a newspaper carries several news items per issue. Whether a given news item becomes that day's front-page news is related to its information content. With current natural language processing techniques, it is still difficult to directly quantify the amount of information in a piece of news text. Therefore, using a neural network, itself a black box, to solve the binary black-box classification problem of "whether a certain news item is front-page news" (hereinafter abbreviated as the "newspaper news classification problem") is a direct and efficient choice.
With the success of AlexNet, the study of neural networks entered a new phase. Current text classification technology mainly uses neural networks as its technical means, fully mining the structural information of the text and classifying on the basis of this feature information. In the field of text classification, algorithms such as TextCNN, TextRNN, FastText and TextRCNN have been proposed in succession, extracting text features from the perspectives of convolutional and recurrent neural networks respectively, and these algorithms perform excellently on multiple test data sets.
The prior art has the following disadvantages:
Although algorithms such as TextCNN, TextRNN, FastText and TextRCNN perform excellently on many public text classification test data sets, they cannot effectively solve the newspaper news classification problem because of its particularities. Specifically: First, the length of newspaper news is variable, and length is not directly related to the importance of a news item; the above techniques mostly limit the input length of the text and require truncation or padding of the input, which can cause loss or redundancy in the extracted features. Second, the two categories of the newspaper news classification problem are obviously imbalanced: the number of front-page news items is far smaller than the number of non-front-page items. A classifier trained directly on such biased data is itself biased, i.e., it classifies news as non-front-page with excessively high probability. Finally, although the main factor in whether a news item can become front-page news is the amount of information contained in its text, items that are too long or too short can hardly become front-page news because of front-page layout and typesetting constraints. Most prior techniques consider only text context structure information and ignore text attribute information such as title length and text length, so these features are lost.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a newspaper and magazine news classification method based on a superposition network of text context structure and attribute information. On the input side, a text vector representation method converts texts of indefinite length into vectors of fixed length, avoiding loss and redundancy of text information. On the training data side, weighted random sampling is adopted, and the composition of the training samples is optimized by adjusting, through weights, the probability that each sample is selected. On the feature extraction side, the invention considers not only the context structure information of the text but also its attribute information, optimizing the feature extraction process.
The invention discloses a newspaper and periodical news classification method based on a text context structure and attribute information superposition network, which specifically comprises the following steps:
step 1: acquiring data;
acquiring the text information and attribute information of the news of a certain newspaper from a database, wherein the text information is the text content of the news, and the attribute information comprises 8 attributes, specifically: the total number of pages (versions) of the newspaper on that day, the word count of the news title, the word count of the news text, the maximum word count among the newspaper's news titles on that day, the minimum word count among the newspaper's news titles on that day, the maximum word count among the newspaper's news texts on that day, the minimum word count among the newspaper's news texts on that day, and the era number;
step 2: generating a text feature vector;
carrying out vector representation on the news text information, converting each news text into a low-dimensional, information-rich text feature vector, and storing the text feature vectors in the database;
and step 3: generating attribute feature vectors;
carrying out vector splicing on the news attribute information, splicing all the attribute information into an attribute feature vector, and finally storing the result into a database;
and 4, step 4: dividing a data set;
dividing the news data in the database at random into a training set, a verification set and a test set in the ratio 6:2:2;
and 5: sampling;
weighting the front-page news and non-front-page news in the training set, and obtaining, by weighted random sampling, a training sample set in which the numbers of front-page and non-front-page news items are relatively balanced;
step 6: training a network model;
training a composite neural network by using a text feature vector and an attribute feature vector corresponding to news in a training sample set and a category corresponding to the news;
and 7: predicting;
inputting the text feature vector and attribute feature vector of each news item in the test set into the trained composite neural network; the output of the network is the prediction of whether the news item is front-page news.
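For concreteness, the following is a minimal sketch of how steps 4 to 7 could fit together; the synthetic data, dimensions, and sampling size are illustrative assumptions rather than values fixed by the invention.

```python
# A hypothetical end-to-end sketch of steps 4-7 on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for steps 1-3: 1000 news items, 128-dim text feature vectors,
# 8 attribute values each, ~10% labeled as front-page news.
n = 1000
text_vecs = rng.normal(size=(n, 128))
attr_vecs = rng.integers(1, 100, size=(n, 8)).astype(float)
labels = (rng.random(n) < 0.1).astype(int)

# Step 4: random 6:2:2 split into training, verification and test sets.
idx = rng.permutation(n)
train, val, test = np.split(idx, [int(0.6 * n), int(0.8 * n)])

# Step 5: weighted random sampling; weights inversely proportional to
# class size, keys k_i = u_i ** (1 / w_i), keep the m largest keys.
counts = np.bincount(labels[train], minlength=2)
weights = 1.0 / counts[labels[train]]
keys = rng.random(len(train)) ** (1.0 / weights)
sample = train[np.argsort(keys)[-400:]]

# Steps 6-7 would train the composite network on `sample` and predict on
# `test`; see the PyTorch sketch later in this document.
print(labels[sample].mean())  # front-page share rises above the raw ~10%
```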
Compared with the prior art, the invention has the beneficial effects that:
1. The invention not only improves the way text features are extracted but also additionally incorporates attribute features into the feature construction process. In step 2, a text vector representation method converts texts of indefinite length into vectors of fixed length, avoiding loss and redundancy of text information and optimizing the extraction of text features; in step 3, the attribute information of the news is additionally added, enriching the sources of the features. Compared with the prior art, the features of the invention are constructed more efficiently and from more varied sources.
2. The invention uses weighted random sampling and trains the model on the sample set resulting from the sampling. Unlike other applications of the related art, the newspaper news classification problem faces the objective fact that the proportions of front-page and non-front-page news are severely imbalanced. In step 5, the training sample set is obtained by weighted random sampling, which strictly controls the composition of the training data while preserving its authenticity.
3. The invention applies the idea of solving a black-box problem with a black-box method, and solves the problem end to end. For the newspaper news classification problem, where no existing algorithm or index can directly measure the importance of a news item, a network model named composite is proposed in step 6 to solve the classification problem end to end. Compared with the prior art, the invention simulates the way the human brain thinks by using a neural network.
Drawings
Fig. 1 is a flowchart of a news classification method according to the present invention.
Fig. 2 is a schematic structural diagram of a text vector representation method.
Fig. 3 is a schematic diagram illustrating the effect of the weighted random sampling algorithm.
Fig. 4 is a schematic diagram of a composite neural network structure.
FIG. 5 shows the classification results of the present invention on the People's Daily news classification problem.
Detailed description of the preferred embodiments
For the purpose of making the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings.
Fig. 1 visually represents the steps of the news classification method proposed by the present invention. Specifically, the method comprises the steps of data acquisition, text feature vector generation, attribute feature vector generation, data set division, weighted random sampling, composite model training and final classification prediction.
Fig. 2 visually shows the method for converting text into vectors according to the present invention. The principle is as follows:

Word vectors and text vectors are trained simultaneously. Let the code vector corresponding to text $d_i$ be $p_i$, and let the code vector corresponding to word $t$ in the text be $w_t$. The vector for the $j$-th occurrence of word $t$ in text $d_i$ can be constructed from its context in the form:

$$w_t^{(j)} = \sum_{k=-T,\, k \neq 0}^{T} w_{t+k}$$

where $T$ is the unilateral (one-sided) number of context words considered by the algorithm. If word $t$ appears $S$ times in text $d_i$, the following average

$$\bar{w}_t = \frac{1}{S} \sum_{j=1}^{S} w_t^{(j)}$$

is used to represent the vector of word $t$. Let $n$ denote the total number of words $t$ in text $d_i$. Substituting the sum vector of text $d_i$,

$$v_i = \frac{1}{n} \sum_{t \in d_i} \bar{w}_t,$$

into the neural network model of the text vector representation method, the following output can be obtained:

$$o_i = W v_i + b$$

where $W$ is the hidden layer of the neural network model and $b$ is the bias; the following loss function is therefore constructed:

$$L = \sum_i \mathrm{distance}(o_i, p_i)$$

where distance is a distance function between vectors and can be the second-order (squared) Euclidean distance. Optimizing this loss function yields the matrix $W_{best}$ and the bias $b_{best}$. Taking the vector $p_i$ corresponding to text $d_i$ as input, the text can then be characterized by its low-dimensional vector $\tilde{p}_i$ in the form:

$$\tilde{p}_i = W_{best}\, p_i + b_{best}$$
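The scheme above trains word vectors and text vectors jointly, in the spirit of paragraph-vector models. As a hedged illustration of step 2, the snippet below uses gensim's Doc2Vec, a closely related off-the-shelf method, as a stand-in; the toy corpus and dimensions are assumptions, not the patent's reference implementation.

```python
# Hypothetical stand-in for step 2 using gensim's Doc2Vec (paragraph
# vectors), which also trains word and text vectors simultaneously.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

news_texts = [
    ["economy", "grows", "in", "the", "first", "quarter"],
    ["new", "railway", "line", "opens", "to", "traffic"],
]
docs = [TaggedDocument(words=w, tags=[i]) for i, w in enumerate(news_texts)]

# window plays the role of the unilateral context word number T;
# vector_size is the fixed length of the resulting text feature vector.
model = Doc2Vec(docs, vector_size=64, window=5, min_count=1, epochs=40)
text_vec = model.dv[0]  # fixed-length feature vector of the first news text
print(text_vec.shape)   # (64,)
```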
fig. 3 illustrates the sampling effect using a weighted random sampling method, as follows:
for class CjIn words, is provided with
Figure BDA0002602639240000047
The samples belong to class CjThen this
Figure BDA0002602639240000048
Sample d of any one of the samplesiWeight of (1)iCan be expressed in the following form:
Figure BDA0002602639240000051
c in the formula represents a defined classification category set, and a weighted random sampling method is used, wherein Weight is set from Weight of samples to { Weight ═ Weight }1,weight2,…,weightnD ═ D from sample set1,d2,…,dnThe way of selecting m samples in (1) is as follows:
phi is the element Weight in the set WeightiE.g. Weight, selecting uniformly distributed random numbers u between 0 and 1iAnd calculating k using the following formulai
Figure BDA0002602639240000052
Let set K ═ K-iWhere i ═ 1,2, …, n; pressing the set K by KiSorting, selecting maximum m elements to form a Sample set, where the Sample can be expressed as the following formula:
Sample={dl}
Where l meets that kl≥km-th
in the above formula km-thRepresenting the value of the mth largest element in the set K.
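The selection rule above is the standard key-based weighted random sampling scheme: draw $u_i$ uniformly, compute $k_i = u_i^{1/weight_i}$, and keep the $m$ largest keys. A minimal sketch follows; the function name and toy data are illustrative assumptions.

```python
# A minimal sketch of the weighted random sampling step described above.
import random

def weighted_sample(samples, weights, m):
    """Select m items; selection probability grows with weight."""
    keyed = [(random.random() ** (1.0 / w), d) for d, w in zip(samples, weights)]
    keyed.sort(key=lambda kw: kw[0], reverse=True)  # m largest keys win
    return [d for _, d in keyed[:m]]

# Toy usage: class "pos" is rare, so its samples get larger weights
# (inversely proportional to class size, as in the formula above).
data = ["pos"] * 10 + ["neg"] * 90
weights = [1 / 10 if d == "pos" else 1 / 90 for d in data]
picked = weighted_sample(data, weights, 20)
print(picked.count("pos"), picked.count("neg"))  # roughly balanced
```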
Fig. 4 shows the structure of the composite neural network. Its computation is analyzed as follows:

The input to the composite neural network is the text feature vector $T_i$ and the attribute feature vector $A_i$ of a sample news text $S_i$. The model uses 2 different parts: one performs a dimension-reduction operation on the text feature vector $T_i$, and the other performs a normalization operation on the attribute feature vector $A_i$.

Further, the composite neural network comprises: a part realizing dimension reduction of the vector $T_i$, a part realizing normalization of the vector $A_i$, and a classification fully-connected network. The outputs of the dimension-reduction part and the normalization part are jointly input into the classification fully-connected network to realize the final classification.

The part of the composite neural network model realizing dimension reduction of $T_i$ consists of 3 layers of fully-connected neural networks. In the first fully-connected layer, the input is the text feature vector $T_i$, the weight matrix is $W_1$, the bias is $b_1$, and the activation function is $\mathrm{ReLU}(X)$; the output is expressed as follows:

$$H^{(1)} = \mathrm{ReLU}(W_1 T_i + b_1)$$

The second and third fully-connected layers are similar to the first. The input of the second layer is the output $H^{(1)}$ of the first layer, with weight matrix $W_2$, bias $b_2$ and activation function $\mathrm{ReLU}(X)$; the input of the third layer is the output $H^{(2)}$ of the second layer, with weight matrix $W_3$, bias $b_3$ and activation function $\mathrm{ReLU}(X)$. $H^{(2)}$ and $H^{(3)}$ take the form:

$$H^{(2)} = \mathrm{ReLU}(W_2 H^{(1)} + b_2)$$

$$H^{(3)} = \mathrm{ReLU}(W_3 H^{(2)} + b_3)$$

For the part of the composite neural network model realizing normalization of $A_i$, let $\{A_1, A_2, \ldots\}$ be the set of attribute vectors corresponding to the sample set Sample, each attribute vector $A_i$ having dimension $q$. Then the $j$-th component $\tilde{A}_i[j]$ of the normalized attribute vector $\tilde{A}_i$ corresponding to sample $S_i$ can be expressed in the following form:

$$\tilde{A}_i[j] = \frac{A_i[j] - \min_k A_k[j]}{\max_k A_k[j] - \min_k A_k[j]}$$

After $H^{(3)}$ and $\tilde{A}_i$ are obtained, they are spliced together, based on the idea of superposition, to realize feature fusion. The superposed result, denoted $F_i$, is expressed in the following form:

$$F_i = [H^{(3)};\, \tilde{A}_i]$$

For the classification fully-connected network, the input is the hybrid vector $F_i$, the weight matrix is $W_4$, the bias coefficient is $b_4$, and the activation function is $\mathrm{Softmax}(X)$; the output $O_i$ is then expressed in the following form:

$$O_i = \mathrm{Softmax}(W_4 F_i + b_4)$$

The output vector $O_i$ is a one-dimensional 2-element vector, i.e. $O_i = (o_{i,1}, o_{i,2})$; the value in the first column represents the probability that news $S_i$ is front-page news, and the value in the second column represents the probability that it is non-front-page news.
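To make the data flow concrete, here is a sketch of the composite network's forward pass in PyTorch; the layer widths and the CompositeNet name are illustrative assumptions, since the patent does not fix the dimensions.

```python
# A sketch of the composite network's forward pass (assumed layer sizes).
import torch
import torch.nn as nn

class CompositeNet(nn.Module):
    def __init__(self, text_dim=128, attr_dim=8, hidden=64, reduced=16):
        super().__init__()
        # Dimension-reduction part: 3 fully connected ReLU layers for T_i.
        self.reduce = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, reduced), nn.ReLU(),
        )
        # Classification part: one linear layer + softmax over 2 classes.
        self.classify = nn.Linear(reduced + attr_dim, 2)

    def forward(self, text_vec, attr_vec):
        h3 = self.reduce(text_vec)
        # Normalization part: min-max normalize attributes over the batch.
        lo = attr_vec.min(dim=0).values
        hi = attr_vec.max(dim=0).values
        attr_norm = (attr_vec - lo) / (hi - lo + 1e-8)
        # Superposition (splicing) of the two parts, then classification.
        fused = torch.cat([h3, attr_norm], dim=1)
        return torch.softmax(self.classify(fused), dim=1)

net = CompositeNet()
out = net(torch.randn(4, 128), torch.rand(4, 8) * 100)
print(out.shape)  # (4, 2): [P(front-page), P(non-front-page)] per sample
```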
To demonstrate the effectiveness of this patent in solving the news classification problem, the People's Daily news is used here as an example. In processing the data, the original data are divided into 4 periods according to administration eras, forming 4 corresponding sub-datasets, so as to distinguish the different styles of different leadership periods.
Verification experiments are carried out on the 4 news sub-datasets, evaluating the XGBoost, Random Forest and SVM classification methods based on Doc2Vec vectors, the deep-learning-based fastText, TextCNN, TextRNN and TextRCNN text classification methods, and the proposed newspaper and magazine news classification method based on a text context structure and attribute information superposition network, on the front-page news classification problem of the People's Daily.
FIG. 5 shows the accuracy, precision, recall and F1 values of the classification over the 4 stages of the People's Daily news data using each of the text classification methods mentioned above.
From sub-graph a in FIG. 5 we can see that the news classification method proposed by this patent is a considerable improvement in accuracy over the other classification methods. Specifically, on the first sub-dataset (Stage 1), compared with the Doc2Vec-based XGBoost, Random Forest and SVM classification methods and the fastText, TextCNN, TextRNN and TextRCNN text classification methods, the proposed method improves by 10.31%, 3.01%, 14.32%, 32.55%, 21.32%, 23.93% and 22.10% respectively. On the second sub-dataset (Stage 2), compared with the Doc2Vec-based XGBoost and SVM classification methods and the fastText, TextCNN, TextRNN and TextRCNN methods, the improvements are 4.88%, 11.00%, 16.38%, 5.57%, 5.71% and 0.18% respectively. On the third sub-dataset (Stage 3), compared with the Doc2Vec-based XGBoost, Random Forest and SVM classification methods and the fastText, TextCNN, TextRNN and TextRCNN methods, the improvements are 9.72%, 0.15%, 13.48%, 17.33%, 17.01%, 17.67% and 18.73% respectively. On the fourth sub-dataset (Stage 4), compared with the same seven methods, the improvements are 5.71%, 1.09%, 14.10%, 3.47%, 3.28%, 5.62% and 1.30% respectively.
Because the proportions of front-page and non-front-page news in the People's Daily are imbalanced, considering only the accuracy of the classification results is far from sufficient. Sub-graphs b, c and d of FIG. 5 use the precision, recall and F1 values of the predicted results to characterize the actual performance of each text classification method on the front-page news classification problem of the People's Daily.
From sub-graphs b and c in FIG. 5 we can see that the news classification method proposed by this patent achieves relatively high precision and recall on each sub-dataset, or at worst is not an order of magnitude away from the best result. Specifically, in precision it is better at all stages than the Doc2Vec-based XGBoost and SVM classification methods and the fastText, TextCNN, TextRNN and TextRCNN text classification methods, and better at some specific stages (e.g., Stage 3) than the Doc2Vec-based Random Forest classification method. In recall, the Doc2Vec-based Random Forest classification method is superior at all stages, and the TextRCNN text classification method at some specific stages (e.g., Stage 2).
From sub-graph d in FIG. 5 we can see that, compared with the other algorithms, the news classification method proposed by this patent is a considerable improvement in the F1 value of the classification results. Specifically, on the first sub-dataset (Stage 1), compared with the Doc2Vec-based XGBoost, Random Forest and SVM classification methods and the fastText, TextCNN, TextRNN and TextRCNN text classification methods, the proposed method improves by 3.21%, 42.53%, 4.06%, 25.25%, 23.45%, 22.25% and 20.43% respectively. On the second sub-dataset (Stage 2), compared with the Doc2Vec-based XGBoost, Random Forest and SVM classification methods and the fastText, TextCNN, TextRNN and TextRCNN methods, the improvements are 2.70%, 42.99%, 3.93%, 8.88%, 5.53%, 3.73% and 3.33% respectively. On the third sub-dataset (Stage 3), compared with the same seven methods, the improvements are 7.76%, 46.93%, 8.43%, 12.80%, 12.27%, 12.49% and 12.54% respectively. On the fourth sub-dataset (Stage 4), compared with the Doc2Vec-based XGBoost, Random Forest and SVM classification methods and the fastText, TextCNN and TextRNN methods, the improvements are 9.38%, 46.85%, 14.04%, 1.28%, 0.69% and 3.76% respectively.

Claims (3)

1. A newspaper and periodical news classification method based on a text context structure and attribute information superposition network, specifically comprising the following steps:
step 1: acquiring data;
acquiring the text information and attribute information of the news of a certain newspaper from a database, wherein the text information is the text content of the news, and the attribute information comprises 8 attributes, specifically: the total number of pages (versions) of the newspaper on that day, the word count of the news title, the word count of the news text, the maximum word count among the newspaper's news titles on that day, the minimum word count among the newspaper's news titles on that day, the maximum word count among the newspaper's news texts on that day, the minimum word count among the newspaper's news texts on that day, and the era number;
step 2: generating a text feature vector;
carrying out vector representation on the news text information, converting each news text into a low-dimensional, information-rich text feature vector, and storing the text feature vectors in the database;
and step 3: generating attribute feature vectors;
carrying out vector splicing on the news attribute information, splicing all the attribute information into an attribute feature vector, and finally storing the result into a database;
and 4, step 4: dividing a data set;
dividing news data in a database into a training set, a verification set and a test set at random;
and 5: sampling;
weighting the front-page news and non-front-page news in the training set, and obtaining a training sample set by weighted random sampling;
for class $C_j$, suppose $n_j$ samples belong to class $C_j$; then the weight $weight_i$ of any one sample $d_i$ among these $n_j$ samples can be expressed in the following form:

$$weight_i = \frac{1}{|C| \cdot n_j}$$

where $C$ is the defined set of classification categories; using the weighted random sampling method, $m$ samples are selected from the sample set $D = \{d_1, d_2, \ldots, d_n\}$ according to the weight set $Weight = \{weight_1, weight_2, \ldots, weight_n\}$ as follows:

for each element $weight_i \in Weight$, select a uniformly distributed random number $u_i$ between 0 and 1 and calculate $k_i$ using the following formula:

$$k_i = u_i^{1 / weight_i}$$

let the set $K = \{k_i \mid i = 1, 2, \ldots, n\}$; sort the set $K$ by $k_i$ and select the largest $m$ elements to form the sample set Sample, which can be expressed by the following formula:

$$Sample = \{d_l \mid k_l \geq k_{m\text{-}th}\}$$

where $k_{m\text{-}th}$ is the value of the $m$-th largest element in the set $K$;
step 6: training a network model;
training a composite neural network by using a text feature vector and an attribute feature vector corresponding to news in a training sample set and a category corresponding to the news;
and 7: predicting;
inputting the text feature vector and attribute feature vector of each news item in the test set into the trained composite neural network; the output of the network is the prediction of whether the news item is front-page news.
2. The method for classifying news of newspapers and periodicals based on the text context structure and attribute information superposition network as claimed in claim 1, wherein the specific method for generating the text feature vector in step 2 is as follows:
word vectors and text vectors are trained simultaneously; let the code vector corresponding to text $d_i$ be $p_i$, and let the code vector corresponding to word $t$ in the text be $w_t$; the vector for the $j$-th occurrence of word $t$ in text $d_i$ can be constructed from its context in the form:

$$w_t^{(j)} = \sum_{k=-T,\, k \neq 0}^{T} w_{t+k}$$

where $T$ is the unilateral number of context words considered by the algorithm; if word $t$ appears $S$ times in text $d_i$, the average

$$\bar{w}_t = \frac{1}{S} \sum_{j=1}^{S} w_t^{(j)}$$

is used to represent the vector of word $t$; let $n$ denote the total number of words $t$ in text $d_i$; substituting the sum vector of text $d_i$,

$$v_i = \frac{1}{n} \sum_{t \in d_i} \bar{w}_t,$$

into the neural network model of the text vector representation method, the following output can be obtained:

$$o_i = W v_i + b$$

where $W$ is the hidden layer of the neural network model and $b$ is the bias, so the following loss function is constructed:

$$L = \sum_i \mathrm{distance}(o_i, p_i)$$

where distance is a distance function between vectors and can be the second-order Euclidean distance; optimizing this loss function yields the matrix $W_{best}$ and the bias $b_{best}$; taking the vector $p_i$ corresponding to text $d_i$ as input, the text can be characterized by its low-dimensional vector $\tilde{p}_i$ in the form:

$$\tilde{p}_i = W_{best}\, p_i + b_{best}$$
3. The method for classifying news of newspapers and periodicals based on the text context structure and attribute information superposition network as claimed in claim 1, wherein the specific method of step 6 is as follows:
the input to the composite neural network is the text feature vector $T_i$ and the attribute feature vector $A_i$ of a sample news text $S_i$; the composite neural network comprises: a part realizing dimension reduction of the vector $T_i$, a part realizing normalization of the vector $A_i$, and a classification fully-connected network; the outputs of the dimension-reduction part and the normalization part are jointly input into the classification fully-connected network to realize the final classification;

the part of the composite neural network model realizing dimension reduction of $T_i$ consists of 3 layers of fully-connected neural networks; in the first fully-connected layer, the input is the text feature vector $T_i$, the weight matrix is $W_1$, the bias is $b_1$, and the activation function is $\mathrm{ReLU}(X)$; the output is expressed as follows:

$$H^{(1)} = \mathrm{ReLU}(W_1 T_i + b_1)$$

the second and third fully-connected layers are similar to the first; the input of the second layer is the output $H^{(1)}$ of the first layer, with weight matrix $W_2$, bias $b_2$ and activation function $\mathrm{ReLU}(X)$; the input of the third layer is the output $H^{(2)}$ of the second layer, with weight matrix $W_3$, bias $b_3$ and activation function $\mathrm{ReLU}(X)$; $H^{(2)}$ and $H^{(3)}$ take the form:

$$H^{(2)} = \mathrm{ReLU}(W_2 H^{(1)} + b_2)$$

$$H^{(3)} = \mathrm{ReLU}(W_3 H^{(2)} + b_3)$$

for the part of the composite neural network model realizing normalization of $A_i$, let $\{A_1, A_2, \ldots\}$ be the set of attribute vectors corresponding to the sample set Sample, each attribute vector $A_i$ having dimension $q$; then the $j$-th component $\tilde{A}_i[j]$ of the normalized attribute vector $\tilde{A}_i$ corresponding to sample $S_i$ can be expressed in the following form:

$$\tilde{A}_i[j] = \frac{A_i[j] - \min_k A_k[j]}{\max_k A_k[j] - \min_k A_k[j]}$$

after $H^{(3)}$ and $\tilde{A}_i$ are obtained, they are spliced together, based on the idea of superposition, to realize feature fusion; the superposed result, denoted $F_i$, is expressed in the following form:

$$F_i = [H^{(3)};\, \tilde{A}_i]$$

for the classification fully-connected network, the input is the hybrid vector $F_i$, the weight matrix is $W_4$, the bias coefficient is $b_4$, and the activation function is $\mathrm{Softmax}(X)$; the output $O_i$ is then expressed in the following form:

$$O_i = \mathrm{Softmax}(W_4 F_i + b_4)$$

the output vector $O_i$ is a one-dimensional 2-element vector, i.e. $O_i = (o_{i,1}, o_{i,2})$; the value in the first column represents the probability that news $S_i$ is front-page news, and the value in the second column represents the probability that it is non-front-page news.
CN202010729459.0A 2020-07-27 2020-07-27 Newspaper and magazine news classification method based on text context structure and attribute information superposition network Active CN111966828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010729459.0A CN111966828B (en) 2020-07-27 2020-07-27 Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010729459.0A CN111966828B (en) 2020-07-27 2020-07-27 Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Publications (2)

Publication Number Publication Date
CN111966828A true CN111966828A (en) 2020-11-20
CN111966828B CN111966828B (en) 2022-05-03

Family

ID=73364052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010729459.0A Active CN111966828B (en) 2020-07-27 2020-07-27 Newspaper and magazine news classification method based on text context structure and attribute information superposition network

Country Status (1)

Country Link
CN (1) CN111966828B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11538210B1 (en) * 2021-11-22 2022-12-27 Adobe Inc. Text importance spatial layout

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918501A (en) * 2019-01-18 2019-06-21 平安科技(深圳)有限公司 Method, apparatus, equipment and the storage medium of news article classification
WO2020092834A1 (en) * 2018-11-02 2020-05-07 Valve Corporation Classification and moderation of text
CN111125354A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Text classification method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125354A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Text classification method and device
WO2020092834A1 (en) * 2018-11-02 2020-05-07 Valve Corporation Classification and moderation of text
CN109918501A (en) * 2019-01-18 2019-06-21 平安科技(深圳)有限公司 Method, apparatus, equipment and the storage medium of news article classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张谦 et al., "Research on Short Text Classification of Microblogs Based on Word2vec" (基于Word2vec的微博短文本分类研究), 《技术研究》 (Technology Research) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11538210B1 (en) * 2021-11-22 2022-12-27 Adobe Inc. Text importance spatial layout

Also Published As

Publication number Publication date
CN111966828B (en) 2022-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant