CN114722896A - News topic discovery method fusing neighbor topic map

News topic discovery method fusing neighbor topic map

Info

Publication number
CN114722896A
CN114722896A (application CN202210211576.7A)
Authority
CN
China
Prior art keywords
news
topic
title
data
document
Prior art date
Legal status
Pending
Application number
CN202210211576.7A
Other languages
Chinese (zh)
Inventor
余正涛
卢天旭
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202210211576.7A
Publication of CN114722896A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a news topic discovery method fusing a neighbor title graph, and belongs to the field of natural language processing. The method comprises the following steps: constructing a news topic data set; encoding the news titles with a Bert model to enhance the title representations; constructing an association relation over similar titles as a news neighbor title graph; learning the representation of the neighbor title graph through several graph convolution layers so as to integrate the structural associations among titles; fusing, through a fusion factor, the global title features learned by the graph convolution network with the local features of the news documents learned by a deep network; and finally using a guide module to steer the two modules to optimize their parameters in a unified direction. By fusing the neighbor relations among titles with the representations of the news documents, the method obtains high-quality news representations, clusters topic clusters from them, and thereby provides support for subsequent tasks.

Description

News topic discovery method fusing a neighbor title graph
Technical Field
The invention relates to a news topic discovery method fusing a neighbor title graph, and belongs to the technical field of natural language processing.
Background
News has its own particularity: every news document and title contains case elements, and under different topics describing the same case there are many news documents and titles with similar elements. People can easily distinguish news under these different topics by eye, but when the learned representations are of low quality it is difficult for a computer to reach human-level identification accuracy. Sun et al. add a time factor to the similarity calculation in an improved Single-Pass incremental text clustering algorithm and organize news information at topic granularity, realizing web news topic discovery; Hu et al. incorporate prior knowledge into a Dirichlet process mixture model (DPMM) built on the LDA topic model to improve topic discovery performance; Li et al. propose an LDA-based hierarchical classification model as a feature extraction technique that extracts latent topics to reduce the influence of data sparsity, and construct corpus-related topic feature vectors to train a classification model that is more robust for rare classes. However, when existing topic discovery methods model the news documents under the same case, the captured topic information and topic words are classified under the same topic because of their high similarity, and news documents under different topics of the same case cannot be well distinguished. How to account for both the connections among news documents under the same case topic and the differences among news documents with similar case elements is one of the problems to be solved in the topic discovery task for domain news.
Disclosure of Invention
The invention provides a news topic discovery method fusing a neighbor title graph. The method constructs association relations over similar titles; to avoid the bias of relying on titles alone and the influence of noisy data, it adds document features into the title encoding process; and it introduces a guide module so that the two parts of the model update their iterative parameters in the same direction, thereby improving performance on the news topic discovery task.
The technical scheme of the invention is as follows: the news topic discovery method fusing a neighbor title graph comprises the following specific steps:
Step1, crawl public-opinion news on trending legal cases in recent years from major news websites such as Baidu News and Sina News, and select 17889 news items related to more than ten cases that drew high public attention, such as a certain rights-protection case, to construct a news topic data set. The crawled news is analyzed so that each news item belongs to exactly one case topic, each item is manually labeled with the case topic it relates to, and after data screening and preprocessing the data are stored as a json-format file.
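A minimal Python sketch of how one labeled record of the data set might be stored in the json-format file; the field names here are illustrative assumptions rather than the patent's actual schema:

import json

# One manually labeled news item; every item belongs to exactly one case topic.
record = {
    "title": "...",       # crawled news title
    "document": "...",    # news body after removing special symbols and links
    "case_topic": "...",  # manually assigned case-topic label
}

# The full data set is a list of such records stored as one json file.
with open("news_topics.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)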
Step2, construct a neighbor title graph by introducing the association relations among titles into the topic discovery process, and extract the global title features through a graph convolution network; to avoid the influence of noisy data, extract the local features of the news documents with a deep network and add them into the title encoding process, so as to better cluster topic news.
The specific steps of Step1 are as follows:
Step1.1, crawl key news of recent years from major news websites and public platforms with a crawler, and select 17889 news items on more than ten case topics that drew high public attention, such as a certain rights-protection case;
Step1.2, the data screening and preprocessing process comprises manually checking the relevance between news items and case topics, removing items unrelated to any case topic and duplicate items, and removing special symbols, links, and the like;
Step1.3, obtain the news topic data set by manual labeling: analyze the crawled news so that each item belongs to exactly one case topic, and manually label the case topic each item relates to.
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, encode the title part of the news topic data set; after training of the Bert pre-training model is completed, the title representations are obtained and used to construct the neighbor title graph;
Step2.2, construct the news neighbor title graph with a K-nearest-neighbor algorithm to extract the global features of the news titles;
Step2.3, extract the local features of the documents in the news topic data set, learning an effective data representation with a deep neural network autoencoder;
Step2.4, the constructed neighbor title graph contains global title structure information; a graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network; by connecting the autoencoder and the graph convolution network layer by layer through a fusion factor, the local document features are effectively fused into the global title features;
and Step2.5, perform joint clustering optimization training over Step2.3 and Step2.4; after training stabilizes, take the cluster distribution finally output by the graph convolution network as the final result of news topic discovery.
As a preferred embodiment of the present invention, the step2.1 specifically includes:
the Bert model is a stack of Transformer models, and its training is divided into two tasks: first, 15% of the words are randomly selected for prediction, of which 80% are masked with the MASK symbol, 10% are replaced by random words, and the rest are left unchanged, so that the model learns to predict words from context and gains a certain error-correction ability; second, predicting whether two sentences are coherent text, so that after training the Bert model can produce word representations and sentence representations of news titles;
specifically, let N be the number of title samples in the news topic set, Title = {title_1, title_2, …, title_N}; each news title has length S, and E = {e_1, e_2, …, e_S} is the set of words in each title; the title word vectors are input into the Bert model for encoding to obtain the vector representation of each title; after all title word vectors are encoded by the Bert model, the title vector representation fused with semantic information is finally obtained as T = {T_1, T_2, …, T_N}.
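The patent does not publish reference code, so the following is a minimal Python sketch of this title-encoding step using the HuggingFace transformers package; the bert-base-chinese checkpoint and the mean-pooling strategy are assumptions made for illustration, since the text only states that a pre-trained Chinese Bert model yields the title representations T:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

titles = ["新闻标题一", "新闻标题二"]  # Title = {title_1, ..., title_N}

with torch.no_grad():
    batch = tokenizer(titles, padding=True, truncation=True,
                      max_length=32, return_tensors="pt")
    out = bert(**batch)
    # Mean-pool the token vectors of each title into one 768-dimensional
    # title vector; the pooling choice is an assumption, not the patent's.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    T = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # shape (N, 768)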
As a preferred embodiment of the present invention, step2.2 specifically includes:
let the title data be T ∈ R^(N×a), where each row T_i represents the i-th title sample, N is the number of samples, and a is the dimension; for each title sample, first find the top K neighbors with the highest similarity as neighbor nodes and connect them by edges to form the neighbor title graph; the similarity matrix S between any two news titles, an N×N matrix, is computed with the vector dot product:

s_ij = T_i · T_j^T    (1)

for any two title nodes t_i and t_j, let w_ij be the weight between the nodes; if an edge connects the nodes then w_ij > 0, and if no edge connects them then w_ij = 0; because the constructed neighbor title graph is an undirected weighted graph, w_ij = w_ji; the degree of any node in the graph is the sum of the weights of all edges connected to it:

d_i = Σ_j w_ij    (2)

by computing the degree of each node, a node degree matrix D ∈ R^(N×N) with values only on the main diagonal is obtained:

D = diag(d_1, d_2, …, d_N)    (3)

the value on the main diagonal of row i is the degree of node i; computing the weights between all nodes yields an N×N adjacency matrix M whose element in row i, column j is the weight w_ij, with w_ij = s_ij.
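A minimal numpy sketch of the neighbor-title-graph construction just described: dot-product similarities (Eq. 1), top-K neighbors per title, a symmetric weighted adjacency matrix M, and the node degree matrix D (Eqs. 2-3); treating T as a dense matrix and symmetrizing with an elementwise maximum are implementation assumptions:

import numpy as np

def build_neighbor_title_graph(T: np.ndarray, k: int = 10):
    """T: (N, a) matrix of title representations; returns (M, D)."""
    S = T @ T.T                        # s_ij = T_i . T_j^T  (Eq. 1)
    np.fill_diagonal(S, -np.inf)       # exclude self when picking neighbors
    M = np.zeros_like(S)
    for i in range(S.shape[0]):
        nbrs = np.argsort(S[i])[-k:]   # top-K most similar titles
        M[i, nbrs] = S[i, nbrs]        # connected edges carry w_ij = s_ij
    M = np.maximum(M, M.T)             # undirected graph, so w_ij = w_ji
    D = np.diag(M.sum(axis=1))         # d_i = sum_j w_ij  (Eqs. 2-3)
    return M, D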
As a preferred embodiment of the present invention, the step2.3 specifically includes:
a document feature extraction module extracts the local features of the documents in the news topic data set, using a deep neural network autoencoder to learn an effective data representation;
the autoencoder is a representation model that takes the input data itself as the reference, without label supervision, to extract features and reduce dimensionality; it maps the input into a feature space and then maps it back to the input space to reconstruct the data; assume the encoder has L layers, and let H^(l) denote the representation learned at the l-th layer of the encoder:

H^(l) = σ(W_enc^(l) H^(l-1) + b_enc^(l))    (4)

where σ is the relu function, W_enc^(l) is the transformation matrix of the l-th layer of the encoder, b_enc^(l) is the bias, and H^(0) is the original document data X;
the decoder part maps the features back to the input space, producing a reconstruction X̂ of the original data:

H^(l) = σ(W_dec^(l) H^(l-1) + b_dec^(l))    (5)

where W_dec^(l) is the transformation matrix of the l-th layer of the decoder and b_dec^(l) is the bias, and the reconstructed data X̂ is the output of the last decoder layer;
the loss function of the document feature extraction module is

L_res = (1/2N) ||X - X̂||_F^2    (6)

and the network parameters are continuously optimized during training by minimizing the reconstruction error with a gradient descent algorithm.
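A minimal PyTorch sketch of the document autoencoder of Eqs. (4)-(6); the "input-768-2000-10" layer sizes are taken from the parameter settings given later in the description, while the PyTorch realization itself is an assumption:

import torch
import torch.nn as nn

class DocumentAutoencoder(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        dims = [in_dim, 768, 2000, 10]   # encoder sizes from the description
        enc, dec = [], []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(d_in, d_out), nn.ReLU()]  # H^(l) = relu(W_enc H^(l-1) + b_enc)
        rev = dims[::-1]
        for d_in, d_out in zip(rev[:-1], rev[1:]):
            dec += [nn.Linear(d_in, d_out), nn.ReLU()]  # mirrored decoder layers
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        h = self.encoder(x)        # bottleneck representation H^(L)
        x_hat = self.decoder(h)    # reconstruction of the original data
        return h, x_hat

x = torch.rand(8, 5000)                     # toy document vectors X
model = DocumentAutoencoder(in_dim=5000)
h, x_hat = model(x)
loss_res = ((x - x_hat) ** 2).sum() / (2 * x.shape[0])  # reconstruction loss (Eq. 6)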
As a preferred embodiment of the present invention, step2.4 specifically includes:
extracting the global features of the titles:
the constructed neighbor title graph contains the global title structure information; a graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network; the representation extracted by the l-th layer of the graph convolution network is obtained through the convolution operation

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (7)

where D̃^(-1/2) M̃ D̃^(-1/2) is the normalized Laplacian operator, M̃ = M + I with I the identity diagonal matrix added to the adjacency matrix M, D̃ is the corresponding node degree matrix, and W^(l) is the weight matrix of the layer; propagating the representation Ũ^(l-1) learned by the previous layer of the graph convolution network to the next layer yields the new representation U^(l);
so that the news topic data features learned by the graph convolution network carry both the global title features and the local document features, the two representations U^(l-1) and H^(l-1) are combined through the fusion factor to obtain a more comprehensive data representation:

Ũ^(l-1) = α U^(l-1) + (1 - α) H^(l-1)    (8)

where α is the weight coefficient balancing the two representations; by connecting the autoencoder and the graph convolution network layer by layer through the fusion factor, the local document features are effectively fused into the global title features; after the two representations are fused, Ũ^(l-1) is input into the graph convolution network to obtain the representation U^(l):

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (9)

proceeding layer by layer in the same way yields the representation U^(L) output by the last graph convolution layer; the output of the network is connected to a softmax multi-classifier, and the final output is the distribution U:

U = softmax(D̃^(-1/2) M̃ D̃^(-1/2) U^(L) W^(L))    (10)

the result U is a probability distribution whose element u_ij represents the probability that news sample i belongs to cluster center j.
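A minimal PyTorch sketch of one fused graph-convolution step, Eqs. (7)-(10): the previous layer's representation U^(l-1) is mixed with the autoencoder representation H^(l-1) through the fusion factor α (Eq. 8) and then propagated over the normalized neighbor title graph; dense matrices and a shared feature dimension per layer are assumptions of this sketch:

import torch

def fused_gcn_layer(U_prev, H_prev, M, W, alpha=0.5):
    """One layer of Eqs. (8)-(9); U_prev and H_prev must share their shape."""
    n = M.shape[0]
    M_tilde = M + torch.eye(n)                       # add self-loops: M + I
    d = M_tilde.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))             # D~^(-1/2)
    A_norm = D_inv_sqrt @ M_tilde @ D_inv_sqrt       # normalized propagation operator
    U_tilde = alpha * U_prev + (1 - alpha) * H_prev  # fusion factor (Eq. 8)
    return torch.relu(A_norm @ U_tilde @ W)          # convolution step (Eq. 9)

# At the last layer, a softmax classifier replaces relu to give the
# cluster distribution U of Eq. (10):
#   U = torch.softmax(A_norm @ U_L @ W_last, dim=1)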
As a preferred embodiment of the present invention, Step2.5 specifically includes:
the guide module unifies the document feature extraction module and the title global feature extraction module into one framework and performs end-to-end clustering optimization training on both simultaneously; the document feature extraction module extracts the local features of the documents in the news topic data set, using a deep neural network autoencoder to learn an effective data representation; the title global feature extraction module extracts, with a graph convolution network, the structural features contained in the constructed neighbor title graph, and integrates the local document features extracted by the autoencoder into the graph convolution network; by connecting the autoencoder and the graph convolution network layer by layer through the fusion factor, the local document features are effectively fused into the global title features;
for the i-th sample and the j-th cluster, a Student's t-distribution with one degree of freedom is used as the kernel function to measure the distance between the autoencoder representation h_i and the cluster center μ_j:

q_ij = (1 + ||h_i - μ_j||^2)^(-1) / Σ_j' (1 + ||h_i - μ_j'||^2)^(-1)    (11)

where h_i is the i-th row of H^(L) and μ_j is a cluster center initialized by the K-means algorithm; q_ij is regarded as the probability that document sample i is assigned to cluster j, and Q is the distribution of all document samples over the clusters;
to iterate the clustering result with a high-confidence distribution and improve clustering accuracy, a target distribution P is constructed to assist model training:

p_ij = (q_ij^2 / f_j) / Σ_j' (q_ij'^2 / f_j'), with f_j = Σ_i q_ij    (12)

in the target distribution P, each cluster assignment in the document sample distribution Q is squared and then normalized, yielding a higher-confidence cluster distribution that forces samples within a cluster closer to the cluster center, maximizes the distance between clusters, and makes the distribution sharper; the first loss function of the guide module is the KL divergence loss between the distribution Q and the target distribution P:

L_1 = KL(P||Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (13)

by minimizing this loss function to update the parameters, the target distribution P makes the autoencoder learn sample document cluster representations closer to the cluster centers;
to keep the title global feature extraction module and the document local feature extraction module consistent during training iterations, the two modules need to be unified under the same target distribution, so the target distribution P is also used to guide the sample distribution U containing the global title features output by the graph convolution network; the second loss function of the guide module is the KL divergence loss between the distribution U and the target distribution P:

L_2 = KL(P||U) = Σ_i Σ_j p_ij log(p_ij / u_ij)    (14)

the cluster distributions of the two different representations are unified in the same loss function through different weight parameters of the guide module, and the overall loss function of the model is

L = L_res + β_1 L_1 + β_2 L_2    (15)

where β_1 and β_2 are the weight parameters balancing the first and second loss functions; after the whole model is trained to stability, the cluster distribution U finally output by the graph convolution network is taken as the final result of news topic discovery.
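A minimal PyTorch sketch of the guide-module quantities of Eqs. (11)-(15): the Student's t soft assignments Q, the sharpened target distribution P, and the two KL-divergence losses added to the reconstruction loss; the β values shown are placeholders, since the patent does not disclose them:

import torch
import torch.nn.functional as F

def soft_assignment(h, mu):
    """q_ij from a Student's t kernel with one degree of freedom (Eq. 11)."""
    dist2 = ((h.unsqueeze(1) - mu.unsqueeze(0)) ** 2).sum(-1)
    q = 1.0 / (1.0 + dist2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """p_ij: square each assignment, divide by cluster frequency, renormalize (Eq. 12)."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def overall_loss(q, u, loss_res, beta1=0.1, beta2=0.01):
    p = target_distribution(q).detach()                # target is held fixed per step
    l1 = F.kl_div(q.log(), p, reduction="batchmean")   # KL(P || Q)  (Eq. 13)
    l2 = F.kl_div(u.log(), p, reduction="batchmean")   # KL(P || U)  (Eq. 14)
    return loss_res + beta1 * l1 + beta2 * l2          # Eq. (15)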
Furthermore, the titles in the news topic data set are represented by a Bert model pre-trained on a Chinese corpus together with its vocabulary; the Bert model comprises 12 layers of Transformer networks, each layer with 12 attention heads, about 110M model parameters, and a hidden-layer dimension of 768. The autoencoder in the document feature extraction module has dimensions 'input-768-2000-10', and the graph convolution layers in the title global feature extraction module have the same sizes as the autoencoder; the number of neighbors K in the neighbor title graph is 10; the initial cluster centers of the topic clusters are obtained by 20 initializations of the K-means algorithm; and the balance parameter α in the fusion factor is set to 0.5. The model is trained for 200 epochs with a learning rate of 1e-3 and the Adam optimizer.
The invention has the beneficial effects that:
(1) For news topic discovery, addressing how to consider the connections among news documents under the same case topic and how to obtain high-quality representations of news documents and titles, a method is provided that performs topic modeling by combining the representations of news titles and news documents, and a topic model fusing a neighbor title association graph is designed to improve the accuracy of the topic discovery task;
(2) the proposed fusion factor lets the learned news topic data features carry both the global title features and the local document features, improving the representation effect of the model;
(3) the guide module unifies the title global feature extraction module and the document feature extraction module into the same framework for simultaneous end-to-end clustering optimization training, improving the cohesion of the topic clusters.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in FIG. 1, a news topic discovery method fusing a neighbor title graph comprises the following specific steps:
Step1, crawl public-opinion news on trending legal cases in recent years from major news websites such as Baidu News and Sina News, and select 17889 news items related to more than ten cases that drew high public attention, such as a certain rights-protection case, to construct a news topic data set. The crawled news is analyzed so that each news item belongs to exactly one case topic, each item is manually labeled with the case topic it relates to, and after data screening and preprocessing the data are stored as a json-format file.
Step1.1, crawl key news of recent years from major news websites and public platforms with a crawler, and select 17889 news items on more than ten case topics that drew high public attention, such as a certain rights-protection case;
Step1.2, the data screening and preprocessing process comprises manually checking the relevance between news items and case topics, removing items unrelated to any case topic and duplicate items, and removing special symbols, links, and the like;
Step1.3, obtain the news topic data set by manual labeling: analyze the crawled news so that each item belongs to exactly one case topic, and manually label the case topic each item relates to. The scale of the experimental data set is shown in Table 1:
table 1 experimental data set statistics
Step2, construct a neighbor title graph by introducing the association relations among titles into the topic discovery process, and extract the global title features through a graph convolution network; to avoid the influence of noisy data, extract the local features of the news documents with a deep network and add them into the title encoding process, so as to better cluster topic news.
Step2.1, a title encoding module encodes the title part of the news topic data set; after training of the Bert pre-training model is completed, the title representations are obtained and used to construct the neighbor title graph. The Bert model is a stack of Transformer models, and its training is divided into two tasks: first, 15% of the words are randomly selected for prediction, of which 80% are masked with the MASK symbol, 10% are replaced by random words, and the rest are left unchanged, so that the model learns to predict words from context and gains a certain error-correction ability; second, predicting whether two sentences are coherent text. After training, the Bert model can therefore produce word representations and sentence representations of news titles.
Specifically, let N be the number of titles in the news topic set, Title = {title_1, title_2, …, title_N}; each news title has length S, and E = {e_1, e_2, …, e_S} is the set of words in each title. The title word vectors are input into the Bert model for encoding to obtain the vector representation of each title; after all title word vectors are encoded by the Bert model, the title vector representation fused with semantic information is finally obtained as T = {T_1, T_2, …, T_N}.
Step2.2, a neighbor title graph construction module builds the news neighbor title graph with a K-nearest-neighbor algorithm to extract the global features of the news titles. Let the title data be T ∈ R^(N×a), where each row T_i represents the i-th title sample, N is the number of samples, and a is the dimension. For each title sample, its top K most similar neighbors are first found as neighbor nodes and connected by edges to form the neighbor title graph. The similarity matrix S between any two news titles, an N×N matrix, is computed with the vector dot product:

s_ij = T_i · T_j^T    (1)

For any two title nodes t_i and t_j, let w_ij be the weight between the nodes. If an edge connects the nodes then w_ij > 0, and if no edge connects them then w_ij = 0. Because the constructed neighbor title graph is an undirected weighted graph, w_ij = w_ji. The degree of any node in the graph is the sum of the weights of all edges connected to it:

d_i = Σ_j w_ij    (2)

By computing the degree of each node, a node degree matrix D ∈ R^(N×N) with values only on the main diagonal is obtained:

D = diag(d_1, d_2, …, d_N)    (3)

The value on the main diagonal of row i is the degree of node i. Computing the weights between all nodes yields an N×N adjacency matrix M whose element in row i, column j is the weight w_ij, with w_ij = s_ij.
Step2.3, the document feature extraction module extracts the local features of the documents in the news topic data set; the invention uses a deep neural network autoencoder to learn an effective data representation. The autoencoder is a representation model that takes the input data itself as the reference, without label supervision, to extract features and reduce dimensionality. It maps the input into a feature space and then maps it back to the input space to reconstruct the data. Assume the encoder has L layers, and let H^(l) denote the representation learned at the l-th layer of the encoder:

H^(l) = σ(W_enc^(l) H^(l-1) + b_enc^(l))    (4)

where σ is the relu function, W_enc^(l) is the transformation matrix of the l-th layer of the encoder, b_enc^(l) is the bias, and H^(0) is the original document data X.

The decoder part maps the features back to the input space, producing a reconstruction X̂ of the original data:

H^(l) = σ(W_dec^(l) H^(l-1) + b_dec^(l))    (5)

where W_dec^(l) is the transformation matrix of the l-th layer of the decoder and b_dec^(l) is the bias; the reconstructed data X̂ is the output of the last decoder layer.

The loss function of the document feature extraction module is

L_res = (1/2N) ||X - X̂||_F^2    (6)

and the network parameters are continuously optimized during training by minimizing the reconstruction error with a gradient descent algorithm.
Step2.4, the constructed neighbor title graph contains a large amount of global title structure information. A graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network, so that the model can extract two different kinds of features of the data at the same time. The representation extracted by the l-th layer of the graph convolution network is obtained through the convolution operation

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (7)

where D̃^(-1/2) M̃ D̃^(-1/2) is the normalized Laplacian operator, M̃ = M + I with I the identity diagonal matrix added to the adjacency matrix M, D̃ is the corresponding node degree matrix, and W^(l) is the weight matrix of the layer. Propagating the representation Ũ^(l-1) learned by the previous layer of the graph convolution network to the next layer yields the new representation U^(l).

So that the news topic data features learned by the graph convolution network carry both the global title features and the local document features, the two representations U^(l-1) and H^(l-1) are combined through the fusion factor to obtain a more comprehensive data representation:

Ũ^(l-1) = α U^(l-1) + (1 - α) H^(l-1)    (8)

where α is the weight coefficient balancing the two representations; by connecting the autoencoder and the graph convolution network layer by layer through the fusion factor, the local document features can be effectively fused into the global title features. After the two representations are fused, Ũ^(l-1) is input into the graph convolution network to obtain the representation U^(l):

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (9)

Proceeding layer by layer in the same way yields the representation U^(L) output by the last graph convolution layer. The output of the network is connected to a softmax multi-classifier, and the final output is the distribution U:

U = softmax(D̃^(-1/2) M̃ D̃^(-1/2) U^(L) W^(L))    (10)

The result U obtained by the model is a probability distribution whose element u_ij represents the probability that news sample i belongs to cluster center j.
Step2.5, the guide module unifies the document feature extraction module and the title global feature extraction module into one framework and performs end-to-end clustering optimization training on both simultaneously.

For the i-th sample and the j-th cluster, a Student's t-distribution with one degree of freedom is used as the kernel function to measure the distance between the autoencoder representation h_i and the cluster center μ_j:

q_ij = (1 + ||h_i - μ_j||^2)^(-1) / Σ_j' (1 + ||h_i - μ_j'||^2)^(-1)    (11)

where h_i is the i-th row of H^(L) and μ_j is a cluster center initialized by the K-means algorithm. We regard q_ij as the probability that document sample i is assigned to cluster j, and Q as the distribution of all document samples over the clusters.

To iterate the clustering result with a high-confidence distribution and improve clustering accuracy, a target distribution P is constructed to assist model training:

p_ij = (q_ij^2 / f_j) / Σ_j' (q_ij'^2 / f_j'), with f_j = Σ_i q_ij    (12)

In the target distribution P, each cluster assignment in the document sample distribution Q is squared and then normalized, yielding a higher-confidence cluster distribution that forces samples within a cluster closer to the cluster center, maximizes the distance between clusters, and makes the distribution sharper. The first loss function of the guide module is the KL divergence loss between the distribution Q and the target distribution P:

L_1 = KL(P||Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (13)

By minimizing this loss function to update the parameters, the target distribution P makes the autoencoder learn sample document cluster representations closer to the cluster centers.

To keep the title global feature extraction module and the document local feature extraction module consistent during training iterations, the two modules need to be unified under the same target distribution, so the target distribution P can also be used to guide the sample distribution U containing the global title features output by the graph convolution network. The second loss function of the guide module is the KL divergence loss between the distribution U and the target distribution P:

L_2 = KL(P||U) = Σ_i Σ_j p_ij log(p_ij / u_ij)    (14)

The cluster distributions of the two different representations can thus be unified in the same loss function through different weight parameters of the guide module, and the overall loss function of the model is

L = L_res + β_1 L_1 + β_2 L_2    (15)

where β_1 and β_2 are the weight parameters balancing the first and second loss functions. After the whole model is trained to stability, the cluster distribution U finally output by the graph convolution network can be taken as the final result of news topic discovery.
Step2.6, the titles in the news topic data set are represented by a Bert model pre-trained on a Chinese corpus together with its vocabulary; the Bert model comprises 12 layers of Transformer networks, each layer with 12 attention heads, about 110M model parameters, and a hidden-layer dimension of 768. The autoencoder in the document feature extraction module has dimensions 'input-768-2000-10', and the graph convolution layers in the title global feature extraction module have the same sizes as the autoencoder; the number of neighbors K in the neighbor title graph is 10; the initial cluster centers of the topic clusters are obtained by 20 initializations of the K-means algorithm; and the balance parameter α in the fusion factor is set to 0.5. The model is trained for 200 epochs with a learning rate of 1e-3 and the Adam optimizer.
To illustrate the effect of the invention, 3 groups of comparative experiments were set up: the first group verifies the improvement in topic discovery performance, the second verifies the effectiveness of the model components, and the third verifies the influence of different fusion-factor weight coefficients on model effectiveness.
(1) Topic discovery performance enhancement verification
For the baseline comparison, the news topic data set constructed in Step1 is used as the input to each model, and 5 models are selected as reference models: the classical K-means algorithm, AE+K-means, DEC, DCN, and IDEC. The experimental results are shown in Table 2.
TABLE 2 comparison of Performance of baseline models
As can be seen from Table 2, the proposed method outperforms the other reference models; compared with the IDEC baseline, Accuracy (ACC) improves by 7.06%, Normalized Mutual Information (NMI) by 6.15%, and the Adjusted Rand Index (ARI) by 8.26%. This is because the baseline methods usually emphasize only the extraction of local document features in the news topic discovery task, while news documents under different topics of the same case contain much similar case element information that the baselines cannot distinguish well. The proposed model extracts the association relations between neighboring titles with the graph convolution network and fuses them with the local document features to enhance the title representations, achieving a better topic modeling effect. This also demonstrates that topic modeling combining titles and documents by fusing a neighbor title graph is effective. The model achieves the best results on all three performance indexes, which shows the effectiveness of the invention.
(2) Model validation
To verify the effectiveness of each module of the model, the model is decomposed into two sub-models, a title-global-feature module with the guide module and a document-feature module with the guide module; the three evaluation indexes are kept unchanged, and the best results are shown in bold. The test results are shown in Table 3:
TABLE 3 simplified model Performance analysis
As Table 3 shows, the full model, which combines title features and document features, clearly improves the modeling effect. Removing the title feature part and modeling only with the local document features and the guide module performs worst: although the documents contain abundant case element information, news documents of different topics under the same case share many similarities and contain more noisy data, so data divided into the same topic cluster easily turn out to belong to different topics of one case, or to cases of the same type that are not the same case. Modeling with only the global title features and the guide module performs better than with document features alone, because the model extracts the structural relations between neighboring titles; however, owing to the limited length of a title, the case topic information it covers is limited and title information bias easily occurs. Therefore, on the basis of the association relations between news items, introducing document representations to enhance the title representations and avoid bias better realizes news topic discovery, which further verifies the effectiveness of the invention.
(3) Verification of influence of different fusion factor weight coefficients on model effectiveness
To verify whether adjusting the weight coefficient α of the fusion factor improves model performance, the following experiment was performed: several values of α were compared at a step of 0.2, and the best group of results is shown in bold. The test results are shown in Table 4:
TABLE 4 analysis of the influence of different fusion factor weight coefficients on the effectiveness of the model
Table 4 shows that the model achieves the best effect when α = 0.5, and performance degrades when α is larger or smaller than 0.5. Since α is the balance weight of the fusion factor, it balances the global title features against the local document features. When α is too large, the weight of the local document features is weakened: the model learns only the association relations of the neighbor title graph, lacks the content information of the documents, easily suffers title information bias, the graph convolution network tends to over-smooth, and the model loses the autoencoder reconstruction loss, so the accuracy of news topic discovery drops. When α is too small, the weight of the global title features is weakened: the representations learned by the model come almost entirely from the documents and similar elements cannot be well distinguished, so accuracy also drops. Setting the fusion factor weight α to 0.5 therefore fuses the two features well, again proving the effectiveness of the invention.
The experimental data above prove that the method fuses the neighbor title graph and performs topic modeling by combining the representations of news titles and documents: it constructs association relations over similar titles and, to avoid title-only bias and the influence of noisy data, adds document features into the title encoding process, while the guide module makes the two parts of the model update their iterative parameters in the same direction; the method can thus represent news effectively and improve the clustering accuracy of the news topic discovery task. Experiments show that the method achieves the best effect compared with several baseline models. For the news topic discovery task, the proposed method fusing a neighbor title graph is effective in improving domain news topic discovery performance.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (8)

1. A news topic discovery method fusing a neighbor title graph, characterized by comprising the following specific steps:
Step1, crawl public-opinion news on trending legal cases with a crawler and select related news to construct a news topic data set; analyze the crawled news so that each item belongs to exactly one case topic, manually label the case topic each item relates to, and screen and preprocess the data;
Step2, construct a neighbor title graph by introducing the association relations among titles into the topic discovery process, and extract the global title features through a graph convolution network; to avoid the influence of noisy data, extract the local features of the news documents with a deep network and add them into the title encoding process, so as to better cluster topic news.
2. The news topic discovery method fusing a neighbor title graph according to claim 1, wherein the specific steps of Step1 are as follows:
Step1.1, crawl key news of recent years from major news websites and public platforms with a crawler;
Step1.2, screen and preprocess the crawled data; the screening and preprocessing process comprises manually checking the relevance between news items and case topics, removing items unrelated to any case topic and duplicate items, and removing special symbols and links;
Step1.3, obtain the news topic data set by manual labeling: analyze the crawled news so that each item belongs to exactly one case topic, and manually label the case topic each item relates to.
3. The news topic discovery method fusing a neighbor title graph according to claim 1, wherein the specific steps of Step2 are as follows:
Step2.1, encode the title part of the news topic data set; after training of the Bert pre-training model is completed, the title representations are obtained and used to construct the neighbor title graph;
Step2.2, construct the news neighbor title graph with a K-nearest-neighbor algorithm to extract the global features of the news titles;
Step2.3, extract the local features of the documents in the news topic data set, learning an effective data representation with a deep neural network autoencoder;
Step2.4, the constructed neighbor title graph contains global title structure information; a graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network; by connecting the autoencoder and the graph convolution network layer by layer through a fusion factor, the local document features are effectively fused into the global title features;
and Step2.5, perform joint clustering optimization training over Step2.3 and Step2.4; after training stabilizes, take the cluster distribution finally output by the graph convolution network as the final result of news topic discovery.
4. The news topic discovery method fusing a neighbor title graph according to claim 3, wherein Step2.1 specifically comprises:
the Bert model is a stack of Transformer models, and its training is divided into two tasks: first, 15% of the words are randomly selected for prediction, of which 80% are masked with the MASK symbol, 10% are replaced by random words, and the rest are left unchanged, so that the model learns to predict words from context and gains a certain error-correction ability; second, predicting whether two sentences are coherent text, so that after training the Bert model can produce word representations and sentence representations of news titles;
specifically, let N be the number of titles in the news topic set, Title = {title_1, title_2, …, title_N}; each news title has length S, and E = {e_1, e_2, …, e_S} is the set of words in each title; the title word vectors are input into the Bert model for encoding to obtain the vector representation of each title; after all title word vectors are encoded by the Bert model, the title vector representation fused with semantic information is finally obtained as T = {T_1, T_2, …, T_N}.
5. The news topic discovery method fusing a neighbor title graph according to claim 3, wherein Step2.2 specifically comprises:
let the title data be T ∈ R^(N×a), where each row T_i represents the i-th title sample, N is the number of samples, and a is the dimension; for each title sample, first find the top K neighbors with the highest similarity as neighbor nodes and connect them by edges to form the neighbor title graph; the similarity matrix S between any two news titles, an N×N matrix, is computed with the vector dot product:

s_ij = T_i · T_j^T    (1)

for any two title nodes t_i and t_j, let w_ij be the weight between the nodes; if an edge connects the nodes then w_ij > 0, and if no edge connects them then w_ij = 0; because the constructed neighbor title graph is an undirected weighted graph, w_ij = w_ji; the degree of any node in the graph is the sum of the weights of all edges connected to it:

d_i = Σ_j w_ij    (2)

by computing the degree of each node, a node degree matrix D ∈ R^(N×N) with values only on the main diagonal is obtained:

D = diag(d_1, d_2, …, d_N)    (3)

the value on the main diagonal of row i is the degree of node i; computing the weights between all nodes yields an N×N adjacency matrix M whose element in row i, column j is the weight w_ij, with w_ij = s_ij.
6. The news topic discovery method fusing a neighbor title graph according to claim 3, wherein Step2.3 specifically comprises:
a document feature extraction module extracts the local features of the documents in the news topic data set, using a deep neural network autoencoder to learn an effective data representation;
the autoencoder is a representation model that takes the input data itself as the reference, without label supervision, to extract features and reduce dimensionality; it maps the input into a feature space and then maps it back to the input space to reconstruct the data; assume the encoder has L layers, and let H^(l) denote the representation learned at the l-th layer of the encoder:

H^(l) = σ(W_enc^(l) H^(l-1) + b_enc^(l))    (4)

where σ is the relu function, W_enc^(l) is the transformation matrix of the l-th layer of the encoder, b_enc^(l) is the bias, and H^(0) is the original document data X;
the decoder part maps the features back to the input space, producing a reconstruction X̂ of the original data:

H^(l) = σ(W_dec^(l) H^(l-1) + b_dec^(l))    (5)

where W_dec^(l) is the transformation matrix of the l-th layer of the decoder and b_dec^(l) is the bias, and the reconstructed data X̂ is the output of the last decoder layer;
the loss function of the document feature extraction module is

L_res = (1/2N) ||X - X̂||_F^2    (6)

and the network parameters are continuously optimized during training by minimizing the reconstruction error with a gradient descent algorithm.
7. The news topic discovery method fusing a neighbor title graph according to claim 3, wherein Step2.4 specifically comprises:
extracting the global features of the titles:
the constructed neighbor title graph contains the global title structure information; a graph convolution network extracts the structural features in the neighbor title graph, and the local document features extracted by the autoencoder are integrated into the graph convolution network; the representation extracted by the l-th layer of the graph convolution network is obtained through the convolution operation

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (7)

where D̃^(-1/2) M̃ D̃^(-1/2) is the normalized Laplacian operator, M̃ = M + I with I the identity diagonal matrix added to the adjacency matrix M, D̃ is the corresponding node degree matrix, and W^(l) is the weight matrix of the layer; propagating the representation Ũ^(l-1) learned by the previous layer of the graph convolution network to the next layer yields the new representation U^(l);
so that the news topic data features learned by the graph convolution network carry both the global title features and the local document features, the two representations U^(l-1) and H^(l-1) are combined through the fusion factor to obtain a more comprehensive data representation:

Ũ^(l-1) = α U^(l-1) + (1 - α) H^(l-1)    (8)

where α is the weight coefficient balancing the two representations; by connecting the autoencoder and the graph convolution network layer by layer through the fusion factor, the local document features are effectively fused into the global title features; after the two representations are fused, Ũ^(l-1) is input into the graph convolution network to obtain the representation U^(l):

U^(l) = σ(D̃^(-1/2) M̃ D̃^(-1/2) Ũ^(l-1) W^(l))    (9)

proceeding layer by layer in the same way yields the representation U^(L) output by the last graph convolution layer; the output of the network is connected to a softmax multi-classifier, and the final output is the distribution U:

U = softmax(D̃^(-1/2) M̃ D̃^(-1/2) U^(L) W^(L))    (10)

the result U is a probability distribution whose element u_ij represents the probability that news sample i belongs to cluster center j.
8. the method for discovering news topics by fusing neighbor topic maps according to claim 3, wherein: the Step2.5 specifically comprises:
unifying the document feature extraction module and the title global feature extraction module into a framework through the guide module and simultaneously carrying out end-to-end clustering optimization training; the system comprises a document feature extraction module, a data processing module and a data processing module, wherein the document feature extraction module is used for extracting local features of documents in a news topic data set and learning effective data representation by using a deep neural network self-encoder; the title global feature extraction module is used for: the constructed neighboring header map contains global title structure information, the structure features in the neighboring header map are extracted by using a graph convolution network, and the document local features extracted by a self-encoder are integrated into the graph convolution network; the local features of the document are effectively fused into the global features of the title by connecting the self-encoder and the graph convolution network layer by layer through fusion factors;
for the ith sample and the jth cluster, the representation h of the self-encoder is scaled as a kernel function by referring to the student-t distribution with the degree of freedom of 1iAnd cluster heart muiThe distance between them;
Figure RE-FDA0003685366120000046
wherein h isiIs represented by H(L)Row i of (1), muiIs a cluster center after initialization of a K-means algorithm, and q is calculatedijThe probability that the document sample i is distributed to the cluster j is regarded, and Q is the distribution that all the document samples are distributed to the clusters;
in order to obtain a high-confidence-degree distribution iteration clustering result and improve the clustering accuracy, a target distribution P is constructed to assist model training;
Figure RE-FDA0003685366120000051
in the target distribution P, each cluster distribution in the document sample distribution Q is subjected to square first and then normalization processing, so that cluster distribution with higher confidence is obtained, samples in the clusters are forced to be closer to the cluster center, the distance between the clusters is maximized, and the distribution is clearer. One of the loss functions of the guiding module is the KL divergence loss between the distribution Q and the target distribution P;
Figure RE-FDA0003685366120000052
updating parameters through a minimization loss function, and enabling a target distribution P to enable a self-encoder to learn a sample document cluster representation closer to a cluster center;
in order to enable the title global feature extraction module and the document local feature extraction module to be consistent in the training iteration process, the two modules need to be unified in the same target distribution, so that the target distribution P is used for guiding the sample distribution U containing the title global features output by the convolution network, and the second loss function of the guiding module is the KL divergence loss between the distribution U and the target distribution P;
Figure RE-FDA0003685366120000053
the two clusters represented by different expressions are distributed and unified in the same loss function through different weight parameters of the guide module, and the overall loss function of the model is
Figure RE-FDA0003685366120000054
Figure RE-FDA0003685366120000055
Beta is a weight parameter of a balance loss function I and a loss function II; after the whole model is trained to be stable, the clustering distribution H finally output by the graph convolution network(l)As a final result of news topic discovery.
CN202210211576.7A 2022-03-05 2022-03-05 News topic discovery method fusing neighbor topic map Pending CN114722896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210211576.7A CN114722896A (en) 2022-03-05 2022-03-05 News topic discovery method fusing neighbor topic map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210211576.7A CN114722896A (en) 2022-03-05 2022-03-05 News topic discovery method fusing neighbor topic map

Publications (1)

Publication Number Publication Date
CN114722896A true CN114722896A (en) 2022-07-08

Family

ID=82236036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210211576.7A Pending CN114722896A (en) 2022-03-05 2022-03-05 News topic discovery method fusing neighbor topic map

Country Status (1)

Country Link
CN (1) CN114722896A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422063A (en) * 2023-12-18 2024-01-19 四川省大数据技术服务中心 Big data processing method applying intelligent auxiliary decision and intelligent auxiliary decision system
CN117422063B (en) * 2023-12-18 2024-02-23 四川省大数据技术服务中心 Big data processing method applying intelligent auxiliary decision and intelligent auxiliary decision system

Similar Documents

Publication Publication Date Title
CN110717047B (en) Web service classification method based on graph convolution neural network
Cao et al. Deep neural networks for learning graph representations
CN111125358B (en) Text classification method based on hypergraph
WO2018218708A1 (en) Deep-learning-based public opinion hotspot category classification method
CN109189925A (en) Term vector model based on mutual information and based on the file classification method of CNN
CN108363695B (en) User comment attribute extraction method based on bidirectional dependency syntax tree representation
CN111127146B (en) Information recommendation method and system based on convolutional neural network and noise reduction self-encoder
CN110516074B (en) Website theme classification method and device based on deep learning
CN113220886A (en) Text classification method, text classification model training method and related equipment
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN112784532B (en) Multi-head attention memory system for short text sentiment classification
CN110825850B (en) Natural language theme classification method and device
JP6738769B2 (en) Sentence pair classification device, sentence pair classification learning device, method, and program
CN112364638A (en) Personality identification method based on social text
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN109614611B (en) Emotion analysis method for fusion generation of non-antagonistic network and convolutional neural network
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
CN112464674A (en) Word-level text intention recognition method
CN113127737A (en) Personalized search method and search system integrating attention mechanism
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN115496072A (en) Relation extraction method based on comparison learning
CN111680684A (en) Method, device and storage medium for recognizing spine text based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination