CN114722896A - News topic discovery method fusing neighbor topic map

News topic discovery method fusing neighbor topic map

Info

Publication number
CN114722896A
CN114722896A (application CN202210211576.7A)
Authority
CN
China
Prior art keywords
news
topic
title
data
document
Prior art date
Legal status
Pending
Application number
CN202210211576.7A
Other languages
Chinese (zh)
Inventor
余正涛
卢天旭
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202210211576.7A
Publication of CN114722896A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a news topic discovery method fusing a neighbor title graph, and belongs to the field of natural language processing. The method comprises the following steps: constructing a news topic data set; encoding the news titles with a Bert model to enhance the title representations; constructing an association relation over similar titles as a news neighbor title graph; learning the representation of the neighbor title graph through several graph convolution layers so as to integrate the structural associations among titles; fusing, through a fusion factor, the global title features learned by the graph convolution network with the local features of the news documents learned by a deep network; and finally using a guide module to steer the two modules to optimize their parameters in a unified direction. By fusing the neighbor relations among titles with the representations of the news documents, the method obtains high-quality news representations, clusters topic clusters from them, and thereby provides support for subsequent tasks.

Description

News topic discovery method fusing a neighbor title graph
Technical Field
The invention relates to a news topic discovery method fusing a neighbor title graph, and belongs to the technical field of natural language processing.
Background
News has its own particularity: every news document and title contains case elements, and under different topics describing the same case there are many news documents and titles with similar elements. People can easily distinguish news under these different topics by eye, but when the learned representations are of low quality it is difficult for a computer to reach human-level identification accuracy. Sun et al. add a time factor to the similarity calculation in an improved Single-Pass incremental text clustering algorithm and organize news information at topic granularity, realizing web news topic discovery; Hu et al. incorporate prior knowledge into a Dirichlet process mixture model (DPMM) built on the LDA topic model to improve topic discovery performance; Li et al. propose an LDA-based hierarchical classification model as a feature extraction technique that extracts latent topics to reduce the influence of data sparsity, and construct corpus-related topic feature vectors to train a classification model that is more robust for rare classes. However, when existing topic discovery methods model the news documents under the same case, the captured topic information and topic words are classified under the same topic because of their high similarity, and news documents under different topics of the same case cannot be well distinguished. How to account for both the connections among news documents under the same case topic and the differences among news documents with similar case elements is one of the problems to be solved in the topic discovery task for domain news.
Disclosure of Invention
The invention provides a news topic discovery method fusing a neighbor title graph. The method constructs association relations over similar titles; to avoid the bias of relying on titles alone and the influence of noisy data, it adds document features into the title encoding process; and it introduces a guide module so that the two parts of the model update their iterative parameters in the same direction, thereby improving performance on the news topic discovery task.
The technical scheme of the invention is as follows: the news topic discovery method fusing a neighbor title graph comprises the following specific steps:
Step1, crawl public-opinion news on trending legal cases in recent years from major news websites such as Baidu News and Sina News, and select 17889 news items related to more than ten cases that drew high public attention, such as a certain rights-protection case, to construct a news topic data set. The crawled news is analyzed so that each news item belongs to exactly one case topic, each item is manually labeled with the case topic it relates to, and after data screening and preprocessing the data are stored as a json-format file.
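A minimal Python sketch of how one labeled record of the data set might be stored in the json-format file; the field names here are illustrative assumptions rather than the patent's actual schema:

import json

# One manually labeled news item; every item belongs to exactly one case topic.
record = {
    "title": "...",       # crawled news title
    "document": "...",    # news body after removing special symbols and links
    "case_topic": "...",  # manually assigned case-topic label
}

# The full data set is a list of such records stored as one json file.
with open("news_topics.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)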
Step2, construct a neighbor title graph by introducing the association relations among titles into the topic discovery process, and extract the global title features through a graph convolution network; to avoid the influence of noisy data, extract the local features of the news documents with a deep network and add them into the title encoding process, so as to better cluster topic news.
The specific steps of Step1 are as follows:
Step1.1, crawl key news of recent years from major news websites and public platforms with a crawler, and select 17889 news items on more than ten case topics that drew high public attention, such as a certain rights-protection case;
Step1.2, the data screening and preprocessing process comprises manually checking the relevance between news items and case topics, removing items unrelated to any case topic and duplicate items, and removing special symbols, links, and the like;
Step1.3, obtain the news topic data set by manual labeling: analyze the crawled news so that each item belongs to exactly one case topic, and manually label the case topic each item relates to.
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, encode the title part of the news topic data set; after training of the Bert pre-training model is completed, the title representations are obtained and used to construct the neighbor title graph;
Step2.2, construct the news neighbor title graph with a K-nearest-neighbor algorithm to extract the global features of the news titles;
Step2.3, extract the local features of the documents in the news topic data set, learning an effective data representation with a deep neural network autoencoder;
Step2.4, the constructed neighbor title graph contains global title structure information; a graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network; by connecting the autoencoder and the graph convolution network layer by layer through a fusion factor, the local document features are effectively fused into the global title features;
and Step2.5, perform joint clustering optimization training over Step2.3 and Step2.4; after training stabilizes, take the cluster distribution finally output by the graph convolution network as the final result of news topic discovery.
As a preferred embodiment of the present invention, the step2.1 specifically includes:
the Bert model is a stack of Transformer models, and its training is divided into two tasks: first, 15% of the words are randomly selected for prediction, of which 80% are masked with the MASK symbol, 10% are replaced by random words, and the rest are left unchanged, so that the model learns to predict words from context and gains a certain error-correction ability; second, predicting whether two sentences are coherent text, so that after training the Bert model can produce word representations and sentence representations of news titles;
specifically, let N be the number of title samples in the news topic set, Title = {title_1, title_2, …, title_N}; each news title has length S, and E = {e_1, e_2, …, e_S} is the set of words in each title; the title word vectors are input into the Bert model for encoding to obtain the vector representation of each title; after all title word vectors are encoded by the Bert model, the title vector representation fused with semantic information is finally obtained as T = {T_1, T_2, …, T_N}.
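The patent does not publish reference code, so the following is a minimal Python sketch of this title-encoding step using the HuggingFace transformers package; the bert-base-chinese checkpoint and the mean-pooling strategy are assumptions made for illustration, since the text only states that a pre-trained Chinese Bert model yields the title representations T:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

titles = ["新闻标题一", "新闻标题二"]  # Title = {title_1, ..., title_N}

with torch.no_grad():
    batch = tokenizer(titles, padding=True, truncation=True,
                      max_length=32, return_tensors="pt")
    out = bert(**batch)
    # Mean-pool the token vectors of each title into one 768-dimensional
    # title vector; the pooling choice is an assumption, not the patent's.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    T = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # shape (N, 768)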
As a preferred embodiment of the present invention, step2.2 specifically includes:
let the title data be T ∈ R^(N×a), where each row T_i represents the i-th title sample, N is the number of samples, and a is the dimension; for each title sample, first find the top K neighbors with the highest similarity as neighbor nodes and connect them by edges to form the neighbor title graph; the similarity matrix S between any two news titles, an N×N matrix, is computed with the vector dot product:

s_ij = T_i · T_j^T    (1)

for any two title nodes t_i and t_j, let w_ij be the weight between the nodes; if an edge connects the nodes then w_ij > 0, and if no edge connects them then w_ij = 0; because the constructed neighbor title graph is an undirected weighted graph, w_ij = w_ji; the degree of any node in the graph is the sum of the weights of all edges connected to it:

d_i = Σ_j w_ij    (2)

by computing the degree of each node, a node degree matrix D ∈ R^(N×N) with values only on the main diagonal is obtained:

D = diag(d_1, d_2, …, d_N)    (3)

the value on the main diagonal of row i is the degree of node i; computing the weights between all nodes yields an N×N adjacency matrix M whose element in row i, column j is the weight w_ij, with w_ij = s_ij.
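A minimal numpy sketch of the neighbor-title-graph construction just described: dot-product similarities (Eq. 1), top-K neighbors per title, a symmetric weighted adjacency matrix M, and the node degree matrix D (Eqs. 2-3); treating T as a dense matrix and symmetrizing with an elementwise maximum are implementation assumptions:

import numpy as np

def build_neighbor_title_graph(T: np.ndarray, k: int = 10):
    """T: (N, a) matrix of title representations; returns (M, D)."""
    S = T @ T.T                        # s_ij = T_i . T_j^T  (Eq. 1)
    np.fill_diagonal(S, -np.inf)       # exclude self when picking neighbors
    M = np.zeros_like(S)
    for i in range(S.shape[0]):
        nbrs = np.argsort(S[i])[-k:]   # top-K most similar titles
        M[i, nbrs] = S[i, nbrs]        # connected edges carry w_ij = s_ij
    M = np.maximum(M, M.T)             # undirected graph, so w_ij = w_ji
    D = np.diag(M.sum(axis=1))         # d_i = sum_j w_ij  (Eqs. 2-3)
    return M, D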
As a preferred embodiment of the present invention, the step2.3 specifically includes:
a document feature extraction module extracts the local features of the documents in the news topic data set, using a deep neural network autoencoder to learn an effective data representation;
the autoencoder is a representation model that takes the input data itself as the reference, without label supervision, to extract features and reduce dimensionality; it maps the input into a feature space and then maps it back to the input space to reconstruct the data; assume the encoder has L layers, and let H^(l) denote the representation learned at the l-th layer of the encoder:

H^(l) = σ(W_enc^(l) H^(l-1) + b_enc^(l))    (4)

where σ is the relu function, W_enc^(l) is the transformation matrix of the l-th layer of the encoder, b_enc^(l) is the bias, and H^(0) is the original document data X;
the decoder part maps the features back to the input space, producing a reconstruction X̂ of the original data:

H^(l) = σ(W_dec^(l) H^(l-1) + b_dec^(l))    (5)

where W_dec^(l) is the transformation matrix of the l-th layer of the decoder and b_dec^(l) is the bias, and the reconstructed data X̂ is the output of the last decoder layer;
the loss function of the document feature extraction module is

L_res = (1/2N) ||X - X̂||_F^2    (6)

and the network parameters are continuously optimized during training by minimizing the reconstruction error with a gradient descent algorithm.
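A minimal PyTorch sketch of the document autoencoder of Eqs. (4)-(6); the "input-768-2000-10" layer sizes are taken from the parameter settings given later in the description, while the PyTorch realization itself is an assumption:

import torch
import torch.nn as nn

class DocumentAutoencoder(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        dims = [in_dim, 768, 2000, 10]   # encoder sizes from the description
        enc, dec = [], []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(d_in, d_out), nn.ReLU()]  # H^(l) = relu(W_enc H^(l-1) + b_enc)
        rev = dims[::-1]
        for d_in, d_out in zip(rev[:-1], rev[1:]):
            dec += [nn.Linear(d_in, d_out), nn.ReLU()]  # mirrored decoder layers
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        h = self.encoder(x)        # bottleneck representation H^(L)
        x_hat = self.decoder(h)    # reconstruction of the original data
        return h, x_hat

x = torch.rand(8, 5000)                     # toy document vectors X
model = DocumentAutoencoder(in_dim=5000)
h, x_hat = model(x)
loss_res = ((x - x_hat) ** 2).sum() / (2 * x.shape[0])  # reconstruction loss (Eq. 6)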
As a preferred embodiment of the present invention, step2.4 specifically includes:
extracting the global features of the titles:
the constructed neighbor title graph contains the global title structure information; a graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network; the representation extracted by the l-th layer of the graph convolution network is obtained through the convolution operation

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (7)

where D̃^(-1/2) M̃ D̃^(-1/2) is the normalized Laplacian operator, M̃ = M + I with I the identity diagonal matrix added to the adjacency matrix M, D̃ is the corresponding node degree matrix, and W^(l) is the weight matrix of the layer; propagating the representation Ũ^(l-1) learned by the previous layer of the graph convolution network to the next layer yields the new representation U^(l);
so that the news topic data features learned by the graph convolution network carry both the global title features and the local document features, the two representations U^(l-1) and H^(l-1) are combined through the fusion factor to obtain a more comprehensive data representation:

Ũ^(l-1) = α U^(l-1) + (1 - α) H^(l-1)    (8)

where α is the weight coefficient balancing the two representations; by connecting the autoencoder and the graph convolution network layer by layer through the fusion factor, the local document features are effectively fused into the global title features; after the two representations are fused, Ũ^(l-1) is input into the graph convolution network to obtain the representation U^(l):

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (9)

proceeding layer by layer in the same way yields the representation U^(L) output by the last graph convolution layer; the output of the network is connected to a softmax multi-classifier, and the final output is the distribution U:

U = softmax(D̃^(-1/2) M̃ D̃^(-1/2) U^(L) W^(L))    (10)

the result U is a probability distribution whose element u_ij represents the probability that news sample i belongs to cluster center j.
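A minimal PyTorch sketch of one fused graph-convolution step, Eqs. (7)-(10): the previous layer's representation U^(l-1) is mixed with the autoencoder representation H^(l-1) through the fusion factor α (Eq. 8) and then propagated over the normalized neighbor title graph; dense matrices and a shared feature dimension per layer are assumptions of this sketch:

import torch

def fused_gcn_layer(U_prev, H_prev, M, W, alpha=0.5):
    """One layer of Eqs. (8)-(9); U_prev and H_prev must share their shape."""
    n = M.shape[0]
    M_tilde = M + torch.eye(n)                       # add self-loops: M + I
    d = M_tilde.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))             # D~^(-1/2)
    A_norm = D_inv_sqrt @ M_tilde @ D_inv_sqrt       # normalized propagation operator
    U_tilde = alpha * U_prev + (1 - alpha) * H_prev  # fusion factor (Eq. 8)
    return torch.relu(A_norm @ U_tilde @ W)          # convolution step (Eq. 9)

# At the last layer, a softmax classifier replaces relu to give the
# cluster distribution U of Eq. (10):
#   U = torch.softmax(A_norm @ U_L @ W_last, dim=1)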
As a preferred embodiment of the present invention, Step2.5 specifically includes:
the guide module unifies the document feature extraction module and the title global feature extraction module into one framework and performs end-to-end clustering optimization training on both simultaneously; the document feature extraction module extracts the local features of the documents in the news topic data set, using a deep neural network autoencoder to learn an effective data representation; the title global feature extraction module extracts, with a graph convolution network, the structural features contained in the constructed neighbor title graph, and integrates the local document features extracted by the autoencoder into the graph convolution network; by connecting the autoencoder and the graph convolution network layer by layer through the fusion factor, the local document features are effectively fused into the global title features;
for the i-th sample and the j-th cluster, a Student's t-distribution with one degree of freedom is used as the kernel function to measure the distance between the autoencoder representation h_i and the cluster center μ_j:

q_ij = (1 + ||h_i - μ_j||^2)^(-1) / Σ_j' (1 + ||h_i - μ_j'||^2)^(-1)    (11)

where h_i is the i-th row of H^(L) and μ_j is a cluster center initialized by the K-means algorithm; q_ij is regarded as the probability that document sample i is assigned to cluster j, and Q is the distribution of all document samples over the clusters;
to iterate the clustering result with a high-confidence distribution and improve clustering accuracy, a target distribution P is constructed to assist model training:

p_ij = (q_ij^2 / f_j) / Σ_j' (q_ij'^2 / f_j'), with f_j = Σ_i q_ij    (12)

in the target distribution P, each cluster assignment in the document sample distribution Q is squared and then normalized, yielding a higher-confidence cluster distribution that forces samples within a cluster closer to the cluster center, maximizes the distance between clusters, and makes the distribution sharper; the first loss function of the guide module is the KL divergence loss between the distribution Q and the target distribution P:

L_1 = KL(P||Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (13)

by minimizing this loss function to update the parameters, the target distribution P makes the autoencoder learn sample document cluster representations closer to the cluster centers;
to keep the title global feature extraction module and the document local feature extraction module consistent during training iterations, the two modules need to be unified under the same target distribution, so the target distribution P is also used to guide the sample distribution U containing the global title features output by the graph convolution network; the second loss function of the guide module is the KL divergence loss between the distribution U and the target distribution P:

L_2 = KL(P||U) = Σ_i Σ_j p_ij log(p_ij / u_ij)    (14)

the cluster distributions of the two different representations are unified in the same loss function through different weight parameters of the guide module, and the overall loss function of the model is

L = L_res + β_1 L_1 + β_2 L_2    (15)

where β_1 and β_2 are the weight parameters balancing the first and second loss functions; after the whole model is trained to stability, the cluster distribution U finally output by the graph convolution network is taken as the final result of news topic discovery.
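A minimal PyTorch sketch of the guide-module quantities of Eqs. (11)-(15): the Student's t soft assignments Q, the sharpened target distribution P, and the two KL-divergence losses added to the reconstruction loss; the β values shown are placeholders, since the patent does not disclose them:

import torch
import torch.nn.functional as F

def soft_assignment(h, mu):
    """q_ij from a Student's t kernel with one degree of freedom (Eq. 11)."""
    dist2 = ((h.unsqueeze(1) - mu.unsqueeze(0)) ** 2).sum(-1)
    q = 1.0 / (1.0 + dist2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """p_ij: square each assignment, divide by cluster frequency, renormalize (Eq. 12)."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def overall_loss(q, u, loss_res, beta1=0.1, beta2=0.01):
    p = target_distribution(q).detach()                # target is held fixed per step
    l1 = F.kl_div(q.log(), p, reduction="batchmean")   # KL(P || Q)  (Eq. 13)
    l2 = F.kl_div(u.log(), p, reduction="batchmean")   # KL(P || U)  (Eq. 14)
    return loss_res + beta1 * l1 + beta2 * l2          # Eq. (15)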
Furthermore, the titles in the news topic data set are represented by a Bert model pre-trained on a Chinese corpus together with its vocabulary; the Bert model comprises 12 layers of Transformer networks, each layer with 12 attention heads, about 110M model parameters, and a hidden-layer dimension of 768. The autoencoder in the document feature extraction module has dimensions 'input-768-2000-10', and the graph convolution layers in the title global feature extraction module have the same sizes as the autoencoder; the number of neighbors K in the neighbor title graph is 10; the initial cluster centers of the topic clusters are obtained by 20 initializations of the K-means algorithm; and the balance parameter α in the fusion factor is set to 0.5. The model is trained for 200 epochs with a learning rate of 1e-3 and the Adam optimizer.
The invention has the beneficial effects that:
(1) For news topic discovery, addressing how to consider the connections among news documents under the same case topic and how to obtain high-quality representations of news documents and titles, a method is provided that performs topic modeling by combining the representations of news titles and news documents, and a topic model fusing a neighbor title association graph is designed to improve the accuracy of the topic discovery task;
(2) the proposed fusion factor lets the learned news topic data features carry both the global title features and the local document features, improving the representation effect of the model;
(3) the guide module unifies the title global feature extraction module and the document feature extraction module into the same framework for simultaneous end-to-end clustering optimization training, improving the cohesion of the topic clusters.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in FIG. 1, a news topic discovery method fusing a neighbor title graph comprises the following specific steps:
Step1, crawl public-opinion news on trending legal cases in recent years from major news websites such as Baidu News and Sina News, and select 17889 news items related to more than ten cases that drew high public attention, such as a certain rights-protection case, to construct a news topic data set. The crawled news is analyzed so that each news item belongs to exactly one case topic, each item is manually labeled with the case topic it relates to, and after data screening and preprocessing the data are stored as a json-format file.
Step1.1, crawl key news of recent years from major news websites and public platforms with a crawler, and select 17889 news items on more than ten case topics that drew high public attention, such as a certain rights-protection case;
Step1.2, the data screening and preprocessing process comprises manually checking the relevance between news items and case topics, removing items unrelated to any case topic and duplicate items, and removing special symbols, links, and the like;
Step1.3, obtain the news topic data set by manual labeling: analyze the crawled news so that each item belongs to exactly one case topic, and manually label the case topic each item relates to. The scale of the experimental data set is shown in Table 1:
table 1 experimental data set statistics
Step2, construct a neighbor title graph by introducing the association relations among titles into the topic discovery process, and extract the global title features through a graph convolution network; to avoid the influence of noisy data, extract the local features of the news documents with a deep network and add them into the title encoding process, so as to better cluster topic news.
Step2.1, a title encoding module encodes the title part of the news topic data set; after training of the Bert pre-training model is completed, the title representations are obtained and used to construct the neighbor title graph. The Bert model is a stack of Transformer models, and its training is divided into two tasks: first, 15% of the words are randomly selected for prediction, of which 80% are masked with the MASK symbol, 10% are replaced by random words, and the rest are left unchanged, so that the model learns to predict words from context and gains a certain error-correction ability; second, predicting whether two sentences are coherent text. After training, the Bert model can therefore produce word representations and sentence representations of news titles.
Specifically, let N be the number of titles in the news topic set, Title = {title_1, title_2, …, title_N}; each news title has length S, and E = {e_1, e_2, …, e_S} is the set of words in each title. The title word vectors are input into the Bert model for encoding to obtain the vector representation of each title; after all title word vectors are encoded by the Bert model, the title vector representation fused with semantic information is finally obtained as T = {T_1, T_2, …, T_N}.
Step2.2, a neighbor title graph construction module builds the news neighbor title graph with a K-nearest-neighbor algorithm to extract the global features of the news titles. Let the title data be T ∈ R^(N×a), where each row T_i represents the i-th title sample, N is the number of samples, and a is the dimension. For each title sample, its top K most similar neighbors are first found as neighbor nodes and connected by edges to form the neighbor title graph. The similarity matrix S between any two news titles, an N×N matrix, is computed with the vector dot product:

s_ij = T_i · T_j^T    (1)

For any two title nodes t_i and t_j, let w_ij be the weight between the nodes. If an edge connects the nodes then w_ij > 0, and if no edge connects them then w_ij = 0. Because the constructed neighbor title graph is an undirected weighted graph, w_ij = w_ji. The degree of any node in the graph is the sum of the weights of all edges connected to it:

d_i = Σ_j w_ij    (2)

By computing the degree of each node, a node degree matrix D ∈ R^(N×N) with values only on the main diagonal is obtained:

D = diag(d_1, d_2, …, d_N)    (3)

The value on the main diagonal of row i is the degree of node i. Computing the weights between all nodes yields an N×N adjacency matrix M whose element in row i, column j is the weight w_ij, with w_ij = s_ij.
Step2.3, the document feature extraction module extracts the local features of the documents in the news topic data set; the invention uses a deep neural network autoencoder to learn an effective data representation. The autoencoder is a representation model that takes the input data itself as the reference, without label supervision, to extract features and reduce dimensionality. It maps the input into a feature space and then maps it back to the input space to reconstruct the data. Assume the encoder has L layers, and let H^(l) denote the representation learned at the l-th layer of the encoder:

H^(l) = σ(W_enc^(l) H^(l-1) + b_enc^(l))    (4)

where σ is the relu function, W_enc^(l) is the transformation matrix of the l-th layer of the encoder, b_enc^(l) is the bias, and H^(0) is the original document data X.

The decoder part maps the features back to the input space, producing a reconstruction X̂ of the original data:

H^(l) = σ(W_dec^(l) H^(l-1) + b_dec^(l))    (5)

where W_dec^(l) is the transformation matrix of the l-th layer of the decoder and b_dec^(l) is the bias; the reconstructed data X̂ is the output of the last decoder layer.

The loss function of the document feature extraction module is

L_res = (1/2N) ||X - X̂||_F^2    (6)

and the network parameters are continuously optimized during training by minimizing the reconstruction error with a gradient descent algorithm.
Step2.4, the constructed neighbor title graph contains a large amount of global title structure information. A graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network, so that the model can extract two different kinds of features of the data at the same time. The representation extracted by the l-th layer of the graph convolution network is obtained through the convolution operation

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (7)

where D̃^(-1/2) M̃ D̃^(-1/2) is the normalized Laplacian operator, M̃ = M + I with I the identity diagonal matrix added to the adjacency matrix M, D̃ is the corresponding node degree matrix, and W^(l) is the weight matrix of the layer. Propagating the representation Ũ^(l-1) learned by the previous layer of the graph convolution network to the next layer yields the new representation U^(l).

So that the news topic data features learned by the graph convolution network carry both the global title features and the local document features, the two representations U^(l-1) and H^(l-1) are combined through the fusion factor to obtain a more comprehensive data representation:

Ũ^(l-1) = α U^(l-1) + (1 - α) H^(l-1)    (8)

where α is the weight coefficient balancing the two representations; by connecting the autoencoder and the graph convolution network layer by layer through the fusion factor, the local document features can be effectively fused into the global title features. After the two representations are fused, Ũ^(l-1) is input into the graph convolution network to obtain the representation U^(l):

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (9)

Proceeding layer by layer in the same way yields the representation U^(L) output by the last graph convolution layer. The output of the network is connected to a softmax multi-classifier, and the final output is the distribution U:

U = softmax(D̃^(-1/2) M̃ D̃^(-1/2) U^(L) W^(L))    (10)

The result U obtained by the model is a probability distribution whose element u_ij represents the probability that news sample i belongs to cluster center j.
Step2.5, the guide module unifies the document feature extraction module and the title global feature extraction module into one framework and performs end-to-end clustering optimization training on both simultaneously.

For the i-th sample and the j-th cluster, a Student's t-distribution with one degree of freedom is used as the kernel function to measure the distance between the autoencoder representation h_i and the cluster center μ_j:

q_ij = (1 + ||h_i - μ_j||^2)^(-1) / Σ_j' (1 + ||h_i - μ_j'||^2)^(-1)    (11)

where h_i is the i-th row of H^(L) and μ_j is a cluster center initialized by the K-means algorithm. We regard q_ij as the probability that document sample i is assigned to cluster j, and Q as the distribution of all document samples over the clusters.

To iterate the clustering result with a high-confidence distribution and improve clustering accuracy, a target distribution P is constructed to assist model training:

p_ij = (q_ij^2 / f_j) / Σ_j' (q_ij'^2 / f_j'), with f_j = Σ_i q_ij    (12)

In the target distribution P, each cluster assignment in the document sample distribution Q is squared and then normalized, yielding a higher-confidence cluster distribution that forces samples within a cluster closer to the cluster center, maximizes the distance between clusters, and makes the distribution sharper. The first loss function of the guide module is the KL divergence loss between the distribution Q and the target distribution P:

L_1 = KL(P||Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (13)

By minimizing this loss function to update the parameters, the target distribution P makes the autoencoder learn sample document cluster representations closer to the cluster centers.

To keep the title global feature extraction module and the document local feature extraction module consistent during training iterations, the two modules need to be unified under the same target distribution, so the target distribution P can also be used to guide the sample distribution U containing the global title features output by the graph convolution network. The second loss function of the guide module is the KL divergence loss between the distribution U and the target distribution P:

L_2 = KL(P||U) = Σ_i Σ_j p_ij log(p_ij / u_ij)    (14)

The cluster distributions of the two different representations can thus be unified in the same loss function through different weight parameters of the guide module, and the overall loss function of the model is

L = L_res + β_1 L_1 + β_2 L_2    (15)

where β_1 and β_2 are the weight parameters balancing the first and second loss functions. After the whole model is trained to stability, the cluster distribution U finally output by the graph convolution network can be taken as the final result of news topic discovery.
Step2.6, the titles in the news topic data set are represented by a Bert model pre-trained on a Chinese corpus together with its vocabulary; the Bert model comprises 12 layers of Transformer networks, each layer with 12 attention heads, about 110M model parameters, and a hidden-layer dimension of 768. The autoencoder in the document feature extraction module has dimensions 'input-768-2000-10', and the graph convolution layers in the title global feature extraction module have the same sizes as the autoencoder; the number of neighbors K in the neighbor title graph is 10; the initial cluster centers of the topic clusters are obtained by 20 initializations of the K-means algorithm; and the balance parameter α in the fusion factor is set to 0.5. The model is trained for 200 epochs with a learning rate of 1e-3 and the Adam optimizer.
To illustrate the effect of the invention, 3 groups of comparative experiments were set up: the first group verifies the improvement in topic discovery performance, the second verifies the effectiveness of the model components, and the third verifies the influence of different fusion-factor weight coefficients on model effectiveness.
(1) Topic discovery performance enhancement verification
For the baseline comparison, the news topic data set constructed in Step1 is used as the input to each model, and 5 models are selected as reference models: the classical K-means algorithm, AE+K-means, DEC, DCN, and IDEC. The experimental results are shown in Table 2.
TABLE 2 comparison of Performance of baseline models
As can be seen from Table 2, the proposed method outperforms the other reference models; compared with the IDEC baseline, Accuracy (ACC) improves by 7.06%, Normalized Mutual Information (NMI) by 6.15%, and the Adjusted Rand Index (ARI) by 8.26%. This is because the baseline methods usually emphasize only the extraction of local document features in the news topic discovery task, while news documents under different topics of the same case contain much similar case element information that the baselines cannot distinguish well. The proposed model extracts the association relations between neighboring titles with the graph convolution network and fuses them with the local document features to enhance the title representations, achieving a better topic modeling effect. This also demonstrates that topic modeling combining titles and documents by fusing a neighbor title graph is effective. The model achieves the best results on all three performance indexes, which shows the effectiveness of the invention.
(2) Model validation
To verify the effectiveness of each module of the model, the model is decomposed into two sub-models, a title-global-feature module with the guide module and a document-feature module with the guide module; the three evaluation indexes are kept unchanged, and the best results are shown in bold. The test results are shown in Table 3:
TABLE 3 simplified model Performance analysis
As Table 3 shows, the full model, which combines title features and document features, clearly improves the modeling effect. Removing the title feature part and modeling only with the local document features and the guide module performs worst: although the documents contain abundant case element information, news documents of different topics under the same case share many similarities and contain more noisy data, so data divided into the same topic cluster easily turn out to belong to different topics of one case, or to cases of the same type that are not the same case. Modeling with only the global title features and the guide module performs better than with document features alone, because the model extracts the structural relations between neighboring titles; however, owing to the limited length of a title, the case topic information it covers is limited and title information bias easily occurs. Therefore, on the basis of the association relations between news items, introducing document representations to enhance the title representations and avoid bias better realizes news topic discovery, which further verifies the effectiveness of the invention.
(3) Verification of influence of different fusion factor weight coefficients on model effectiveness
To verify whether adjusting the weight coefficient α of the fusion factor improves model performance, the following experiment was performed: several values of α were compared at a step of 0.2, and the best group of results is shown in bold. The test results are shown in Table 4:
TABLE 4 analysis of the influence of different fusion factor weight coefficients on the effectiveness of the model
Table 4 shows that the model achieves the best effect when α = 0.5, and performance degrades when α is larger or smaller than 0.5. Since α is the balance weight of the fusion factor, it balances the global title features against the local document features. When α is too large, the weight of the local document features is weakened: the model learns only the association relations of the neighbor title graph, lacks the content information of the documents, easily suffers title information bias, the graph convolution network tends to over-smooth, and the model loses the autoencoder reconstruction loss, so the accuracy of news topic discovery drops. When α is too small, the weight of the global title features is weakened: the representations learned by the model come almost entirely from the documents and similar elements cannot be well distinguished, so accuracy also drops. Setting the fusion factor weight α to 0.5 therefore fuses the two features well, again proving the effectiveness of the invention.
The experimental data above prove that the method fuses the neighbor title graph and performs topic modeling by combining the representations of news titles and documents: it constructs association relations over similar titles and, to avoid title-only bias and the influence of noisy data, adds document features into the title encoding process, while the guide module makes the two parts of the model update their iterative parameters in the same direction; the method can thus represent news effectively and improve the clustering accuracy of the news topic discovery task. Experiments show that the method achieves the best effect compared with several baseline models. For the news topic discovery task, the proposed method fusing a neighbor title graph is effective in improving domain news topic discovery performance.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (8)

1. A news topic discovery method fusing a neighbor title graph, characterized by comprising the following specific steps:
Step1, crawl public-opinion news on trending legal cases with a crawler and select related news to construct a news topic data set; analyze the crawled news so that each item belongs to exactly one case topic, manually label the case topic each item relates to, and screen and preprocess the data;
Step2, construct a neighbor title graph by introducing the association relations among titles into the topic discovery process, and extract the global title features through a graph convolution network; to avoid the influence of noisy data, extract the local features of the news documents with a deep network and add them into the title encoding process, so as to better cluster topic news.
2. The news topic discovery method fusing a neighbor title graph according to claim 1, wherein the specific steps of Step1 are as follows:
Step1.1, crawl key news of recent years from major news websites and public platforms with a crawler;
Step1.2, screen and preprocess the crawled data; the screening and preprocessing process comprises manually checking the relevance between news items and case topics, removing items unrelated to any case topic and duplicate items, and removing special symbols and links;
Step1.3, obtain the news topic data set by manual labeling: analyze the crawled news so that each item belongs to exactly one case topic, and manually label the case topic each item relates to.
3. The news topic discovery method fusing a neighbor title graph according to claim 1, wherein the specific steps of Step2 are as follows:
Step2.1, encode the title part of the news topic data set; after training of the Bert pre-training model is completed, the title representations are obtained and used to construct the neighbor title graph;
Step2.2, construct the news neighbor title graph with a K-nearest-neighbor algorithm to extract the global features of the news titles;
Step2.3, extract the local features of the documents in the news topic data set, learning an effective data representation with a deep neural network autoencoder;
Step2.4, the constructed neighbor title graph contains global title structure information; a graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network; by connecting the autoencoder and the graph convolution network layer by layer through a fusion factor, the local document features are effectively fused into the global title features;
and Step2.5, perform joint clustering optimization training over Step2.3 and Step2.4; after training stabilizes, take the cluster distribution finally output by the graph convolution network as the final result of news topic discovery.
4. The news topic discovery method fusing a neighbor title graph according to claim 3, wherein Step2.1 specifically comprises:
the Bert model is a stack of Transformer models, and its training is divided into two tasks: first, 15% of the words are randomly selected for prediction, of which 80% are masked with the MASK symbol, 10% are replaced by random words, and the rest are left unchanged, so that the model learns to predict words from context and gains a certain error-correction ability; second, predicting whether two sentences are coherent text, so that after training the Bert model can produce word representations and sentence representations of news titles;
specifically, let N be the number of titles in the news topic set, Title = {title_1, title_2, …, title_N}; each news title has length S, and E = {e_1, e_2, …, e_S} is the set of words in each title; the title word vectors are input into the Bert model for encoding to obtain the vector representation of each title; after all title word vectors are encoded by the Bert model, the title vector representation fused with semantic information is finally obtained as T = {T_1, T_2, …, T_N}.
5. The news topic discovery method fusing a neighbor title graph according to claim 3, wherein Step2.2 specifically comprises:
let the title data be T ∈ R^(N×a), where each row T_i represents the i-th title sample, N is the number of samples, and a is the dimension; for each title sample, first find the top K neighbors with the highest similarity as neighbor nodes and connect them by edges to form the neighbor title graph; the similarity matrix S between any two news titles, an N×N matrix, is computed with the vector dot product:

s_ij = T_i · T_j^T    (1)

for any two title nodes t_i and t_j, let w_ij be the weight between the nodes; if an edge connects the nodes then w_ij > 0, and if no edge connects them then w_ij = 0; because the constructed neighbor title graph is an undirected weighted graph, w_ij = w_ji; the degree of any node in the graph is the sum of the weights of all edges connected to it:

d_i = Σ_j w_ij    (2)

by computing the degree of each node, a node degree matrix D ∈ R^(N×N) with values only on the main diagonal is obtained:

D = diag(d_1, d_2, …, d_N)    (3)

the value on the main diagonal of row i is the degree of node i; computing the weights between all nodes yields an N×N adjacency matrix M whose element in row i, column j is the weight w_ij, with w_ij = s_ij.
6. The news topic discovery method fusing a neighbor title graph according to claim 3, wherein Step2.3 specifically comprises:
a document feature extraction module extracts the local features of the documents in the news topic data set, using a deep neural network autoencoder to learn an effective data representation;
the autoencoder is a representation model that takes the input data itself as the reference, without label supervision, to extract features and reduce dimensionality; it maps the input into a feature space and then maps it back to the input space to reconstruct the data; assume the encoder has L layers, and let H^(l) denote the representation learned at the l-th layer of the encoder:

H^(l) = σ(W_enc^(l) H^(l-1) + b_enc^(l))    (4)

where σ is the relu function, W_enc^(l) is the transformation matrix of the l-th layer of the encoder, b_enc^(l) is the bias, and H^(0) is the original document data X;
the decoder part maps the features back to the input space, producing a reconstruction X̂ of the original data:

H^(l) = σ(W_dec^(l) H^(l-1) + b_dec^(l))    (5)

where W_dec^(l) is the transformation matrix of the l-th layer of the decoder and b_dec^(l) is the bias, and the reconstructed data X̂ is the output of the last decoder layer;
the loss function of the document feature extraction module is

L_res = (1/2N) ||X - X̂||_F^2    (6)

and the network parameters are continuously optimized during training by minimizing the reconstruction error with a gradient descent algorithm.
7. The news topic discovery method fusing a neighbor title graph according to claim 3, wherein Step2.4 specifically comprises:
extracting the global features of the titles:
the constructed neighbor title graph contains the global title structure information; a graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network; the representation extracted by the l-th layer of the graph convolution network is obtained through the convolution operation

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (7)

where D̃^(-1/2) M̃ D̃^(-1/2) is the normalized Laplacian operator, M̃ = M + I with I the identity diagonal matrix added to the adjacency matrix M, D̃ is the corresponding node degree matrix, and W^(l) is the weight matrix of the layer; propagating the representation Ũ^(l-1) learned by the previous layer of the graph convolution network to the next layer yields the new representation U^(l);
so that the news topic data features learned by the graph convolution network carry both the global title features and the local document features, the two representations U^(l-1) and H^(l-1) are combined through the fusion factor to obtain a more comprehensive data representation:

Ũ^(l-1) = α U^(l-1) + (1 - α) H^(l-1)    (8)

where α is the weight coefficient balancing the two representations; by connecting the autoencoder and the graph convolution network layer by layer through the fusion factor, the local document features are effectively fused into the global title features; after the two representations are fused, Ũ^(l-1) is input into the graph convolution network to obtain the representation U^(l):

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (9)

proceeding layer by layer in the same way yields the representation U^(L) output by the last graph convolution layer; the output of the network is connected to a softmax multi-classifier, and the final output is the distribution U:

U = softmax(D̃^(-1/2) M̃ D̃^(-1/2) U^(L) W^(L))    (10)

the result U is a probability distribution whose element u_ij represents the probability that news sample i belongs to cluster center j.
8. the method for discovering news topics by fusing neighbor topic maps according to claim 3, wherein: the Step2.5 specifically comprises:
unifying the document feature extraction module and the title global feature extraction module into a framework through the guide module and simultaneously carrying out end-to-end clustering optimization training; the system comprises a document feature extraction module, a data processing module and a data processing module, wherein the document feature extraction module is used for extracting local features of documents in a news topic data set and learning effective data representation by using a deep neural network self-encoder; the title global feature extraction module is used for: the constructed neighboring header map contains global title structure information, the structure features in the neighboring header map are extracted by using a graph convolution network, and the document local features extracted by a self-encoder are integrated into the graph convolution network; the local features of the document are effectively fused into the global features of the title by connecting the self-encoder and the graph convolution network layer by layer through fusion factors;
for the ith sample and the jth cluster, the representation h of the self-encoder is scaled as a kernel function by referring to the student-t distribution with the degree of freedom of 1iAnd cluster heart muiThe distance between them;
Figure RE-FDA0003685366120000046
wherein h isiIs represented by H(L)Row i of (1), muiIs a cluster center after initialization of a K-means algorithm, and q is calculatedijThe probability that the document sample i is distributed to the cluster j is regarded, and Q is the distribution that all the document samples are distributed to the clusters;
in order to obtain a high-confidence-degree distribution iteration clustering result and improve the clustering accuracy, a target distribution P is constructed to assist model training;
Figure RE-FDA0003685366120000051
in the target distribution P, each cluster distribution in the document sample distribution Q is subjected to square first and then normalization processing, so that cluster distribution with higher confidence is obtained, samples in the clusters are forced to be closer to the cluster center, the distance between the clusters is maximized, and the distribution is clearer. One of the loss functions of the guiding module is the KL divergence loss between the distribution Q and the target distribution P;
Figure RE-FDA0003685366120000052
updating parameters through a minimization loss function, and enabling a target distribution P to enable a self-encoder to learn a sample document cluster representation closer to a cluster center;
in order to enable the title global feature extraction module and the document local feature extraction module to be consistent in the training iteration process, the two modules need to be unified in the same target distribution, so that the target distribution P is used for guiding the sample distribution U containing the title global features output by the convolution network, and the second loss function of the guiding module is the KL divergence loss between the distribution U and the target distribution P;
Figure RE-FDA0003685366120000053
the two clusters represented by different expressions are distributed and unified in the same loss function through different weight parameters of the guide module, and the overall loss function of the model is
Figure RE-FDA0003685366120000054
Figure RE-FDA0003685366120000055
Beta is a weight parameter of a balance loss function I and a loss function II; after the whole model is trained to be stable, the clustering distribution H finally output by the graph convolution network(l)As a final result of news topic discovery.
CN202210211576.7A 2022-03-05 2022-03-05 News topic discovery method fusing neighbor topic map Pending CN114722896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210211576.7A CN114722896A (en) 2022-03-05 2022-03-05 News topic discovery method fusing neighbor topic map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210211576.7A CN114722896A (en) 2022-03-05 2022-03-05 News topic discovery method fusing neighbor topic map

Publications (1)

Publication Number Publication Date
CN114722896A true CN114722896A (en) 2022-07-08

Family

ID=82236036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210211576.7A Pending CN114722896A (en) 2022-03-05 2022-03-05 News topic discovery method fusing neighbor topic map

Country Status (1)

Country Link
CN (1) CN114722896A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422063A (en) * 2023-12-18 2024-01-19 四川省大数据技术服务中心 Big data processing method applying intelligent auxiliary decision and intelligent auxiliary decision system
CN117422063B (en) * 2023-12-18 2024-02-23 四川省大数据技术服务中心 Big data processing method applying intelligent auxiliary decision and intelligent auxiliary decision system

Similar Documents

Publication Publication Date Title
CN110717047B (en) Web service classification method based on graph convolution neural network
Cao et al. Deep neural networks for learning graph representations
CN111125358B (en) Text classification method based on hypergraph
WO2018218708A1 (en) Deep-learning-based public opinion hotspot category classification method
CN109189925A (en) Term vector model based on mutual information and based on the file classification method of CNN
CN108363695B (en) User comment attribute extraction method based on bidirectional dependency syntax tree representation
CN111127146B (en) Information recommendation method and system based on convolutional neural network and noise reduction self-encoder
CN110516074B (en) Website theme classification method and device based on deep learning
CN113220886A (en) Text classification method, text classification model training method and related equipment
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN112784532B (en) Multi-head attention memory system for short text sentiment classification
CN110825850B (en) Natural language theme classification method and device
JP6738769B2 (en) Sentence pair classification device, sentence pair classification learning device, method, and program
CN112364638A (en) Personality identification method based on social text
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN109614611B (en) Emotion analysis method for fusion generation of non-antagonistic network and convolutional neural network
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
CN112464674A (en) Word-level text intention recognition method
CN113127737A (en) Personalized search method and search system integrating attention mechanism
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN115496072A (en) Relation extraction method based on comparison learning
CN111680684A (en) Method, device and storage medium for recognizing spine text based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination