CN111831820B - News and case correlation analysis method based on case element guidance and deep clustering - Google Patents

News and case correlation analysis method based on case element guidance and deep clustering

Info

Publication number
CN111831820B
Authority
CN
China
Prior art keywords
case
clustering
text
news
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010166279.6A
Other languages
Chinese (zh)
Other versions
CN111831820A (en)
Inventor
余正涛
李云龙
高盛祥
郭军军
相艳
线岩团
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202010166279.6A priority Critical patent/CN111831820B/en
Publication of CN111831820A publication Critical patent/CN111831820A/en
Application granted granted Critical
Publication of CN111831820B publication Critical patent/CN111831820B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/34 - Browsing; Visualisation therefor
    • G06F 16/345 - Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a news and case correlation analysis method based on case element guidance and deep clustering. First, important sentences are extracted to represent each text. Second, cases are characterized by their case elements, which are used to initialize the cluster centers and guide the cluster search. Finally, a convolutional autoencoder is selected to obtain the text representation, and the network is trained jointly with a reconstruction loss and a clustering loss so that each text's representation moves closer to its case; text representation and clustering are unified in a single framework, and the autoencoder parameters and clustering model parameters are updated alternately to realize text clustering. The method addresses the problem that current clustering algorithms lack effective guidance information for the news and case correlation analysis task, which causes clustering divergence and reduces the accuracy of the results; it gives full play to the guiding role of case elements in both the clustering process and text vectorization, and effectively improves the accuracy of the clustering results.

Description

News and case correlation analysis method based on case element guidance and deep clustering
Technical Field
The invention relates to a news and case correlation analysis method based on case element guidance and deep clustering, and belongs to the technical field of natural language processing.
Background
Public opinion analysis in the case domain builds on news texts related to cases. The purpose of news and case correlation analysis is to judge whether a news text is related to a given case; it is an important link in case-domain news public opinion analysis and is of great significance to it. News and case correlation analysis can be regarded as a text clustering process: news texts describing the same case are gathered into the same case cluster.
Current research on text clustering falls into two families of methods, statistical and deep-learning based. For the news and case correlation analysis task, however, the lack of effective guidance information means existing methods easily diverge during clustering, reducing the accuracy of the results.
Disclosure of Invention
The invention provides a news and case correlation analysis method based on case element guidance and deep clustering, to solve the problem that existing clustering methods lack effective guidance information for the news and case correlation analysis task, which easily causes clustering divergence and reduces the accuracy of the results.
The technical scheme of the invention is as follows: a news and case correlation analysis method based on case element guidance and deep clustering comprises the following steps:
Step1, compress the case-related news texts using several summarization techniques: extract summaries of each news text with multiple summarization methods, synthesize the summaries by voting, and extract the important sentences to represent the text, realizing text compression;
Step2, represent each case by the mean of its case element word vectors, obtaining a vectorized case representation;
Step3, pass the compressed news text data through a convolutional autoencoder to obtain the text vectorization; a Text-CNN model is used as the encoder, a deconvolution network forms the decoder, and mean squared error is used as the reconstruction loss of the convolutional autoencoder;
Step4, initialize the cluster centers with the vectorized case representations, unify text vectorization and clustering in the same framework, and alternately update the autoencoder parameters and the clustering model parameters to realize text clustering.
For a given set of news text vectors {H_i}, i = 1, 2, ..., N, where H_i is the vectorized representation of the i-th news document obtained by the convolutional autoencoder, the task is to divide the news texts of N different cases into k case clusters, i.e., C = {C_1, ..., C_r, ..., C_k}.
Further, the Step1 includes the specific steps of:
Step1.1, the multi-summary text compression task is first formalized as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_{1v}, L_{2v}, ..., L_{qv}, abbreviated L_{1v}:L_{qv}, where each summary contains v sentences and the summaries together contain o distinct sentences; the goal is to select z sentences from L_{1v}:L_{qv} as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_{iv} = f_i(S)   (1)
here, 7 summarization methods are used to summarize the news text: Lead, Luhn, LSA, LexRank, TextRank, SumBasic, and KL-Sum, so i ∈ [1,7], i.e., q = 7;
the z sentences appearing most frequently across the summaries are selected as the compressed text; when frequencies tie, sentences appearing earlier in the document are preferred. In addition, the news headline is treated as part of the news; being topical and factual, the headline information is also added to the compressed text set. A minimal sketch of this voting scheme is given below.
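The following Python sketch shows the voting-based compression under stated assumptions: the q summarizer functions (Lead, Luhn, LSA, LexRank, TextRank, SumBasic, KL-Sum or any others) are assumed to be supplied by the caller, each returning v sentences drawn from the input; only the voting and tie-breaking logic of Step1 is implemented here.

```python
from collections import Counter
from typing import Callable, List

def compress(sentences: List[str],
             summarizers: List[Callable[[List[str]], List[str]]],
             z: int, title: str) -> List[str]:
    """Vote across the q summaries; break frequency ties by document position."""
    votes = Counter()
    for summarize in summarizers:
        for sent in summarize(sentences):      # each summary contributes v sentences
            votes[sent] += 1
    # Rank by (frequency desc, original position asc) and keep the top z.
    ranked = sorted(votes, key=lambda s: (-votes[s], sentences.index(s)))
    # The headline is treated as part of the news and is always kept.
    return [title] + ranked[:z]
```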
Further, Step2 includes:
the case elements are structured displays of cases, and the cases can be characterized by the case elements. If Er={e1,e2,...emThe case element set of the r-th case includes m case elements in total, and each case element eiIt can be characterized as a d-dimensional word vector wiI.e. Er={w1,w2,...wm};
Mitchell et al. found that vector addition is a simple and effective semantic composition method. Borrowing this idea, a case is vectorized as the mean of its case element word vectors: let Cen_r ∈ R^d be the vectorized representation of the r-th case, computed as:
Cen_r = (1/m) Σ_{i=1}^{m} w_i   (2)
assuming there are k cases in total and using Cen to denote the set of case representations:
Cen = {Cen_1, ..., Cen_r, ..., Cen_k}   (3).
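A short sketch of this case characterization (Eqs. (2)-(3)), assuming a pre-trained word-vector lookup table `embed` (here 300-dimensional, matching the embodiment) is available; the function names are illustrative, not from the source:

```python
import numpy as np

def case_vector(elements, embed):
    """Cen_r: the mean of the word vectors of the m case elements (Eq. (2))."""
    return np.mean([embed[e] for e in elements], axis=0)

def case_set(all_cases, embed):
    """Cen = {Cen_1, ..., Cen_k}: one d-dimensional vector per case (Eq. (3))."""
    return np.stack([case_vector(elems, embed) for elems in all_cases])
```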
further, Step3 includes:
and constructing a word vector matrix for the compressed document, selecting a convolution self-encoder, and training a network by utilizing reconstruction loss and clustering loss.
The specific steps of Step3 are as follows:
Step3.1, let X be the compressed sentence set of a news text S, let x_i ∈ R^k be the word vector of the i-th word in X, and let the sentence set contain n words in total; the news text is then represented as:
x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n   (4)
where ⊕ is the concatenation operator; that is, the sentence set X is built into a document word matrix of dimension n × k;
the Text-CNN text classification model is adopted as the encoder; for an input single-channel document word matrix x ∈ R^{n×k}, the latent representation of the τ-th feature map is:
c_τ = σ(x * W_τ + b_τ)   (5)
where W_τ ∈ R^{a×k} is the τ-th convolution kernel, a is the height of the kernel, σ is the activation function, * denotes the 2-d convolution operation, and b_τ is the bias term of the τ-th convolution; since narrow convolution is used, c_τ ∈ R^{n−a+1};
max pooling is applied to c_τ to obtain h_τ ∈ R, namely:
h_τ = max(c_τ)   (6)
because the cluster centers are d-dimensional, d convolution kernels are needed to convolve the input document word matrix; max pooling is applied to each feature map, and finally the h_τ are concatenated to obtain the vectorized text representation H ∈ R^d, namely:
H = [h_1, h_2, ..., h_d]   (7)
the decoder portion is constructed using a deconvolution network, first, for each h separatelyτPerforming inverse pooling operation to reduce the data to gτ∈Rn-a+1(ii) a Secondly, for each gτPerforming deconvolution operation, and reconstructing a document word matrix, wherein the calculation method comprises the following steps:
Figure GDA0002680482060000033
here, σ is the activation function, and T denotes allCharacteristic diagram of (1), WTIs the transpose of the corresponding convolution kernel, is a 2d convolution operation, ξ is the bias term;
the minimum mean square error loss is used as the reconstruction loss of the convolution self-coding, and the calculation formula is as follows:
Figure GDA0002680482060000034
where θ is a parameter of the convolutional auto-encoder.
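A minimal PyTorch sketch of this encoder-decoder (Eqs. (5)-(9)), shown with a single kernel height a for clarity (the embodiment uses heights 3, 4, 5 with 100 feature maps each); layer choices beyond what the description states are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvTextAutoencoder(nn.Module):
    """Text-CNN encoder + deconvolution decoder, trained with MSE reconstruction."""
    def __init__(self, emb_dim=300, n_filters=300, kernel_height=3):
        super().__init__()
        # Encoder: d filters of size (a, k) slid over the n x k word matrix (Eq. 5).
        self.conv = nn.Conv2d(1, n_filters, (kernel_height, emb_dim))
        # Decoder: transposed convolution with a kernel of the same shape (Eq. 8).
        self.deconv = nn.ConvTranspose2d(n_filters, 1, (kernel_height, emb_dim))

    def forward(self, x):                                 # x: (batch, 1, n, k)
        c = torch.relu(self.conv(x)).squeeze(3)           # (batch, d, n-a+1), Eq. (5)
        h, idx = F.max_pool1d(c, c.size(2), return_indices=True)  # Eq. (6)
        H = h.squeeze(2)                                  # (batch, d): text vector, Eq. (7)
        g = F.max_unpool1d(h, idx, c.size(2))             # inverse pooling to (batch, d, n-a+1)
        x_hat = self.deconv(g.unsqueeze(3))               # reconstruction, Eq. (8); σ left as identity
        return H, x_hat

def reconstruction_loss(x, x_hat):                        # Eq. (9): mean squared error
    return F.mse_loss(x_hat, x)
```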
Further, in Step4 the text is clustered during the forward pass of the convolutional autoencoder.
Further, the cluster centers are iterated by updating each center as a combination of the previous cluster center and the newly assigned cluster center.
As a preferred embodiment of the present invention, the Step4 specifically comprises the following steps:
Step4.1, for a given set of news text vectors {H_i}, i = 1, 2, ..., N, where H_i is the vectorized representation of the i-th news document obtained by the convolutional autoencoder, the task is to divide the news texts of N different cases into k case clusters, i.e., C = {C_1, ..., C_r, ..., C_k}, where C_r is the r-th case cluster. k-means is one of the most widely used clustering algorithms; its loss function is:
Loss_c = min_{M, s_i} Σ_{i=1}^{N} ||H_i − M s_i||²   (10)
where M ∈ R^{d×k} is the cluster center matrix and s_i ∈ {0,1}^k, with components s_{r,i}, is the case cluster assignment indicator of each news text, with 1^T s_i = 1.
The update mode of the r-th case cluster partition is as follows:
C_r = C_r ∪ {H_i}  if s_{r,i} = 1   (11)
in the iterative updating process, each news text is assigned to the cluster whose center is closest; specifically, s_i is updated by the rule:
s_{r,i} = 1 if r = argmin_j ||H_i − M_j||², and s_{r,i} = 0 otherwise   (12)
where M_j denotes the j-th column of M.
The cluster center matrix M is initialized with the case representation set Cen, each column of M being a Cen_r. Considering that news reports cover different sides of a case, information from the news texts under a case is added into the case representation vector, making the case representation more reasonable. Specifically, during clustering the previous cluster center M_r^{t−1} and the currently newly assigned cluster center M̃_r^t are combined to update the cluster center and obtain a new case representation; the r-th case cluster center is updated as:
M_r^t = (1 − α_r^t) · M_r^{t−1} + α_r^t · M̃_r^t   (13)
where M̃_r^t is the mean vector of the news texts assigned to the r-th case cluster in round t, namely:
M̃_r^t = (1/N_r^t) Σ_{H_i ∈ C_r^t} H_i   (14)
and α_r^t is the weight coefficient of the r-th case cluster, computed as a function of N_r^t, the number of news texts assigned to the r-th case cluster in round t (Eq. (15)).
Training the network under the guidance of the autoencoder's reconstruction loss constrains the text representation, while training it under the guidance of the clustering loss pushes each text's representation closer to its case. The network is therefore trained jointly with a combination of the reconstruction loss and the clustering loss of the convolutional autoencoder, with the loss function defined as:
Loss = λ · Loss_c + (1 − λ) · Loss(θ)   (16)
where λ ∈ [0,1] is a hyperparameter balancing Loss_c and Loss(θ).
In the early clustering iterations, the autoencoder has not yet learned a good text representation, which harms the case representation and produces poor clustering results. With T rounds of joint training in total, the first J rounds update only the convolutional autoencoder parameters, with λ = 0 so that the loss is purely the reconstruction loss Loss(θ); the remaining T − J rounds add the clustering process to the forward pass, and the loss is the joint loss.
After iteratively updating over X = {X_1, X_2, ..., X_N} for these rounds, the news text set converges into distinct case clusters, giving the final clustering result. A condensed sketch of this loop is given below.
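A condensed PyTorch sketch of the guided clustering (Eqs. (10)-(16)), assuming `model` is the convolutional autoencoder sketched above, `docs` the batch of document word matrices, and `Cen` the k × d case vector matrix from Step2; the exact weight schedule of Eq. (15) is not given here, so the α_r^t used below is an assumed 1/(1 + N_r^t):

```python
import torch
import torch.nn.functional as F

def train(model, docs, Cen, T=25, J=5, lam=0.1, lr=0.01):
    M = torch.tensor(Cen.T, dtype=torch.float32)         # init centers: columns are Cen_r
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for t in range(T):
        H, x_hat = model(docs)                           # forward pass
        rec = F.mse_loss(x_hat, docs)                    # Eq. (9)
        assign = torch.cdist(H, M.T).argmin(dim=1)       # Eq. (12): nearest center
        clu = ((H - M.T[assign]) ** 2).sum(1).mean()     # Eq. (10)
        loss = rec if t < J else lam * clu + (1 - lam) * rec   # Eq. (16); λ = 0 early
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                            # Eqs. (13)-(14): center update
            for r in range(M.size(1)):
                members = H[assign == r]
                if len(members):
                    alpha = 1.0 / (1 + len(members))     # assumed form of Eq. (15)
                    M[:, r] = (1 - alpha) * M[:, r] + alpha * members.mean(0)
    return assign
```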
The invention has the beneficial effects that:
1. First, important sentences are extracted to represent each text; second, cases are characterized by their case elements, which initialize the cluster centers and guide the cluster search; finally, a convolutional autoencoder is selected to obtain the text representation, and the network is trained jointly with the reconstruction loss and the clustering loss so that each text's representation moves closer to its case; text representation and clustering are unified in the same framework, and autoencoder parameters and clustering model parameters are updated alternately to realize text clustering;
2. The method addresses the problem that current clustering algorithms lack effective guidance information for the news and case correlation analysis task, which causes clustering divergence and reduces the accuracy of the results; it gives full play to the guidance of case elements in the clustering process and in text vectorization, and effectively improves the accuracy of the clustering results.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1, the news and case correlation analysis method based on case element guidance and deep clustering specifically includes:
step1, collecting related case news documents and defining related case elements.
The case-related news documents collected and collated in Step1 are obtained by writing a web crawler to crawl relevant news texts.
The case elements defined in Step1 are determined by analyzing the composition of case elements in judgment documents on China Judgments Online (中国裁判文书网) and by considering the characteristics of case-related news texts.
Specifically, a total of 5970 news texts related to 6 hot cases are crawled, as shown in Table 1. Three elements, namely "party", "person involved", and "case description", are defined as the case elements, as shown in Table 2.
TABLE 1 case-related news text data set
[Table 1 is provided as an image in the original document.]
TABLE 2 case elements List
[Table 2 is provided as an image in the original document.]
Step2, compressing the case-related news text by using a plurality of summarization technologies;
step3, representing the case by using the mean value of the case element word vectors to obtain the vectorization representation of the case;
step4, the compressed news text data is passed through a convolution self-encoder to obtain a text vectorization representation; constructing a word vector matrix for the compressed document, selecting a convolution self-encoder, and jointly training a network by utilizing reconstruction loss and clustering loss;
step5, initializing a clustering center by using vectorization representation of cases, unifying text vectorization representation and clustering processes into the same frame, and alternately updating parameters of a self-encoder and parameters of a clustering model to realize text clustering.
Further, in Step2, several summarization methods are adopted to extract summaries of the news texts, a voting method is used to synthesize the summaries, and the important sentences are extracted to represent the text, realizing text compression.
Further, the Step2 includes the specific steps of:
Step2.1, the multi-summary text compression task is first formalized as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_{1v}, L_{2v}, ..., L_{qv}, abbreviated L_{1v}:L_{qv}, where each summary contains v sentences and the summaries together contain o distinct sentences; the goal is to select z sentences from L_{1v}:L_{qv} as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_{iv} = f_i(S)   (1)
here, 7 summarization methods are used to summarize the news text: Lead, Luhn, LSA, LexRank, TextRank, SumBasic, and KL-Sum, so i ∈ [1,7], i.e., q = 7;
the z sentences appearing most frequently across the summaries are selected as the compressed text; when frequencies tie, sentences appearing earlier in the document are preferred. In addition, the news headline is treated as part of the news; being topical and factual, the headline information is also added to the compressed text set.
Further, Step3 includes:
If E_r = {e_1, e_2, ..., e_m} is the case element set of the r-th case, containing m case elements in total, each case element e_i can be characterized as a d-dimensional word vector w_i, i.e., E_r = {w_1, w_2, ..., w_m};
then the case is vectorized as the mean of the word vectors of its case elements: let Cen_r ∈ R^d be the vectorized representation of the r-th case, computed as:
Cen_r = (1/m) Σ_{i=1}^{m} w_i   (2)
assuming there are k cases in total and using Cen to denote the set of case representations:
Cen = {Cen_1, ..., Cen_r, ..., Cen_k}   (3).
the specific steps of Step4 are as follows:
Step4.1, let X be the compressed sentence set of a news text S, let x_i ∈ R^k be the word vector of the i-th word in X, and let the sentence set contain n words in total; the news text is then represented as:
x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n   (4)
where ⊕ is the concatenation operator; that is, the sentence set X is built into a document word matrix of dimension n × k;
the Text-CNN text classification model is adopted as the encoder; for an input single-channel document word matrix x ∈ R^{n×k}, the latent representation of the τ-th feature map is:
c_τ = σ(x * W_τ + b_τ)   (5)
where W_τ ∈ R^{a×k} is the τ-th convolution kernel, a is the height of the kernel, σ is the activation function, * denotes the 2-d convolution operation, and b_τ is the bias term of the τ-th convolution; since narrow convolution is used, c_τ ∈ R^{n−a+1};
max pooling is applied to c_τ to obtain h_τ ∈ R, namely:
h_τ = max(c_τ)   (6)
because the cluster centers are d-dimensional, d convolution kernels are needed to convolve the input document word matrix; max pooling is applied to each feature map, and finally the h_τ are concatenated to obtain the vectorized text representation H ∈ R^d, namely:
H = [h_1, h_2, ..., h_d]   (7)
the decoder portion is constructed using a deconvolution network, first, for each h separatelyτPerforming inverse pooling operation to reduce the data to gτ∈Rn-a+1(ii) a Secondly, for each gτPerforming deconvolution operation, and reconstructing a document word matrix, wherein the calculation method comprises the following steps:
Figure GDA0002680482060000075
here, σ is the activation function, T denotes all the characteristic diagrams, WTIs the transpose of the corresponding convolution kernel, is a 2d convolution operation, ξ is the bias term;
the minimum mean square error loss is used as the reconstruction loss of the convolution self-coding, and the calculation formula is as follows:
Figure GDA0002680482060000076
where θ is a parameter of the convolutional auto-encoder.
Further, in Step5 the text is clustered during the forward pass of the convolutional autoencoder.
Further, the cluster centers are iterated by updating each center as a combination of the previous cluster center and the newly assigned cluster center.
Clustering performance is evaluated by comparing the clustering result against the text labels in the data set, with Accuracy (ACC) and Normalized Mutual Information (NMI) as evaluation metrics. Accuracy is defined as:
ACC = tr(s^T s′) / N
where s^T is the transpose of the clustering result matrix, s′ is the label matrix of the texts in the data set, tr is the trace of a matrix, and N is the total number of news texts.
Normalized mutual information (NMI) measures the similarity between two data distributions; for the clustering task, it measures the similarity between the cluster labels and the clustering result:
NMI(s, s′) = 2 · MI(s; s′) / (H(s) + H(s′))
where MI(·) is mutual information, H(·) is information entropy, and NMI ∈ [0,1], with larger values indicating better clustering. A sketch of both metrics follows.
To make the comparison convincing, six baseline document representations are selected: two based on the vector space model, one based on a topic model, and three based on distributed word-vector representations; each is clustered with k-means and compared against the proposed method. The feature dimension of the vector space model baselines is 2000; the dimensions of the remaining baselines are all 300. The distributed document representation methods use the same compressed texts as the method herein. Specifically: (1) TFIDF-1: each word in the document is a feature item, weighted by TF-IDF; (2) TFIDF-2: context words with window size 2 are feature items, weighted by TF-IDF; (3) LDA: the document representation is obtained from a topic model; (4) MeanWV (Mean Word Embedding): the average word vector of the document; (5) TWE (Topical Word Embedding): the document is represented by the average of concatenated topic and word vectors; (6) TopicVec: the document is represented by the concatenation of the document topic vector and the average word vector.
The invention adopts the following hyperparameter settings: (1) for the document compression module, the number of sentences extracted by each summarizer is set to 3, and the number of sentences synthesized from the multiple summaries is also 3; (2) for the convolutional autoencoder module, the input word vectors are 300-dimensional; three kernel heights are used, 3, 4, and 5, with 100 feature maps each; the optimizer is Adam with learning rate 0.01 and L2 regularization weight 0.00001; (3) for the clustering module, the case element embeddings are 300-dimensional; the number of iterations is set to 25, and the clustering loss is not used to optimize the network in the first 5 rounds; the hyperparameter balancing the clustering loss and the autoencoder loss is set to 0.1.
Table 3 compares the clustering performance of the proposed method against the baselines with 4, 5, and 6 cases. The experimental results show that the method outperforms the baselines on both the accuracy and the normalized mutual information metrics.
Table 3 comparison of experimental results for the methods herein and the baseline method
[Table 3 is provided as an image in the original document.]
From the experimental results in Table 3, the LDA-based text representation clusters relatively poorly. The main reason is a task mismatch: the goal here is to gather news texts of the same case into the same case cluster, so each case corresponds to a single topic, whereas LDA assumes each news text contains multiple topics, which makes the clustering results unsatisfactory. The clustering methods based on vector space text representations achieve good results: in case-related public opinion data, news texts of different cases differ considerably, and TF-IDF, which measures how representative a word is of a document, distinguishes documents well; TFIDF-2 in particular considers 2-gram features and captures part of the context information. The distributed document representation methods, which represent documents by word embeddings or topic embeddings, achieve performance close to TFIDF-2.
The method herein uses a convolutional autoencoder to extract features and compose the semantics of the text, so the text representation carries n-gram features; meanwhile, the clustering loss guides the model to learn a text representation better suited to the task, and initializing the cluster centers with case elements guides the clustering process. The method is superior to the baselines on both average metrics. For example, with 6 cases, the accuracy of the method improves over TFIDF-2 by 4.16% and the normalized mutual information by 9.20%.
While the present invention has been described in detail with reference to the embodiments, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. A news and case correlation analysis method based on case element guidance and deep clustering, characterized by comprising the following steps:
step1, compressing the case-related news text by using a plurality of summarization technologies;
step2, representing the case by using the mean value of the case element word vectors to obtain the vectorization representation of the case;
step3, the compressed news text data is passed through a convolution self-encoder to obtain a text vectorization representation;
step4, utilizing vectorization representation of cases to initialize clustering centers, unifying text vectorization representation and clustering processes into the same frame, and alternately updating self-encoder parameters and clustering model parameters to realize text clustering;
step3 comprises the following steps:
constructing a word vector matrix for the compressed document, selecting a convolution self-encoder, and training a network by utilizing reconstruction loss and clustering loss;
in Step4, the text is clustered during the forward pass of the convolutional autoencoder.
2. The news and case correlation analysis method based on case element guidance and deep clustering as claimed in claim 1, wherein: in Step1, several summarization methods are adopted to extract summaries of the news texts, a voting method is used to synthesize the summaries, and the important sentences are extracted to represent the text, realizing text compression.
3. The news and case correlation analysis method based on case element guidance and deep clustering as claimed in claim 1 or 2, wherein: the specific steps of Step1 are as follows:
Step1.1, the multi-summary text compression task is first formalized as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_{1v}, L_{2v}, ..., L_{qv}, abbreviated L_{1v}:L_{qv}, where each summary contains v sentences and the summaries together contain o distinct sentences; the goal is to select z sentences from L_{1v}:L_{qv} as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_{iv} = f_i(S)   (1)
here, 7 summarization methods are used to summarize the news text: Lead, Luhn, LSA, LexRank, TextRank, SumBasic, and KL-Sum, so i ∈ [1,7], i.e., q = 7;
and selecting z sentences with the highest frequency of occurrence in the plurality of abstracts as compressed texts.
4. The news and case correlation analysis method based on case element guidance and deep clustering of claim 1, wherein: the Step2 includes:
If E_r = {e_1, e_2, ..., e_m} is the case element set of the r-th case, containing m case elements in total, each case element e_i can be characterized as a d-dimensional word vector w_i, i.e., E_r = {w_1, w_2, ..., w_m};
then the case is vectorized as the mean of the word vectors of its case elements: let Cen_r ∈ R^d be the vectorized representation of the r-th case, computed as:
Cen_r = (1/m) Σ_{i=1}^{m} w_i   (2)
assuming there are k cases in total and using Cen to denote the set of case representations:
Cen = {Cen_1, ..., Cen_r, ..., Cen_k}   (3).
5. The news and case correlation analysis method based on case element guidance and deep clustering of claim 1, wherein: the cluster centers are iterated by updating each center as a combination of the previous cluster center and the currently newly assigned cluster center.
CN202010166279.6A 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering Active CN111831820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010166279.6A CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010166279.6A CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Publications (2)

Publication Number Publication Date
CN111831820A CN111831820A (en) 2020-10-27
CN111831820B true CN111831820B (en) 2022-07-19

Family

ID=72913341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010166279.6A Active CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Country Status (1)

Country Link
CN (1) CN111831820B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408266A (en) * 2020-12-02 2021-09-17 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN113191411B (en) * 2021-04-22 2023-02-07 杭州卓智力创信息技术有限公司 Electronic sound image file management method based on photo group
CN113158079B (en) * 2021-04-22 2022-06-17 昆明理工大学 Case public opinion timeline generation method based on difference case elements
CN115269768A (en) * 2021-04-29 2022-11-01 京东科技控股股份有限公司 Element text processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898180A (en) * 2018-06-28 2018-11-27 中国人民解放军国防科技大学 Depth clustering method for single-particle cryoelectron microscope images
CN109272992A (en) * 2018-11-27 2019-01-25 北京粉笔未来科技有限公司 A kind of spoken language assessment method, device and a kind of device for generating spoken appraisal model
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
CN110533545A (en) * 2019-07-12 2019-12-03 长春工业大学 Side community discovery algorithm based on the sparse self-encoding encoder of depth
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 News and case similarity calculation method based on asymmetric twin network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019008580A1 (en) * 2017-07-03 2019-01-10 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Method and system for enhancing a speech signal of a human speaker in a video using visual information
US10699697B2 (en) * 2018-03-29 2020-06-30 Tencent Technology (Shenzhen) Company Limited Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898180A (en) * 2018-06-28 2018-11-27 中国人民解放军国防科技大学 Depth clustering method for single-particle cryoelectron microscope images
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
CN109272992A (en) * 2018-11-27 2019-01-25 北京粉笔未来科技有限公司 A kind of spoken language assessment method, device and a kind of device for generating spoken appraisal model
CN110533545A (en) * 2019-07-12 2019-12-03 长春工业大学 Side community discovery algorithm based on the sparse self-encoding encoder of depth
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 News and case similarity calculation method based on asymmetric twin network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A deep convolutional auto-encoder with embedded clustering; A. Alqahtani et al.; 2018 25th IEEE International Conference on Image Processing; 2018-09-06; 4058-4062 *
Spatial Fuzzy Clustering and Deep Auto-encoder for Unsupervised Change Detection in Synthetic Aperture Radar Images; Y. Li et al.; IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium; 2018-11-05; 4479-4482 *
Unsupervised multi-manifold clustering by learning deep representation; Chen D et al.; Proceedings of the 31st AAAI Conference on Artificial Intelligence; 2017-04-01; 385-391 *
Deep convolutional self-encoding image clustering algorithm; Xie Juanying et al.; Journal of Frontiers of Computer Science and Technology; 2018-06-29; Vol. 13, No. 4; 586-595 *

Also Published As

Publication number Publication date
CN111831820A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN111831820B (en) News and case correlation analysis method based on case element guidance and deep clustering
Da'u et al. Recommendation system exploiting aspect-based opinion mining with deep learning method
Wu et al. Learning to extract coherent summary via deep reinforcement learning
US20170249387A1 (en) Methods and systems for investigation of compositions of ontological subjects and intelligent systems therefrom
CN105183833B (en) Microblog text recommendation method and device based on user model
CN109189925A (en) Term vector model based on mutual information and based on the file classification method of CNN
US20120253792A1 (en) Sentiment Classification Based on Supervised Latent N-Gram Analysis
CN106156023B (en) Semantic matching method, device and system
López-Sánchez et al. Hybridizing metric learning and case-based reasoning for adaptable clickbait detection
Li et al. Tourism review sentiment classification using a bidirectional recurrent neural network with an attention mechanism and topic-enriched word vectors
CN108319734A (en) A kind of product feature structure tree method for auto constructing based on linear combiner
Gupta et al. Text Categorization with Knowledge Transfer from Heterogeneous Data Sources.
Jiang et al. KSCB: A novel unsupervised method for text sentiment analysis
Karimi et al. Global least squares method based on tensor form to solve linear systems in Kronecker format
CN110705247A (en) Based on x2-C text similarity calculation method
Jiang et al. Semi-supervised unified latent factor learning with multi-view data
Simchoni et al. Integrating random effects in deep neural networks
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
Fitrianah et al. Extractive text summarization for scientific journal articles using long short-term memory and gated recurrent units
CN112883229B (en) Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
Ma et al. Clustering and integrating of heterogeneous microbiome data by joint symmetric nonnegative matrix factorization with laplacian regularization
CN112015760B (en) Automatic question-answering method and device based on candidate answer set reordering and storage medium
CN109902273A (en) The modeling method and device of keyword generation model
Chu et al. Refined SBERT: Representing sentence BERT in manifold space
CN113221531A (en) Multi-model dynamic collaborative semantic matching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant