CN111831820A - News and case correlation analysis method based on case element guidance and deep clustering

News and case correlation analysis method based on case element guidance and deep clustering

Info

Publication number
CN111831820A
Authority
CN
China
Prior art keywords
case
clustering
news
text
representation
Prior art date
Legal status
Granted
Application number
CN202010166279.6A
Other languages
Chinese (zh)
Other versions
CN111831820B (en)
Inventor
余正涛
李云龙
高盛祥
郭军军
相艳
线岩团
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202010166279.6A
Publication of CN111831820A
Application granted
Publication of CN111831820B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/34 - Browsing; Visualisation therefor
    • G06F 16/345 - Summarisation for human users

Abstract

The invention relates to a news and case correlation analysis method based on case element guidance and deep clustering. First, important sentences are extracted to represent each text. Second, cases are characterized by their case elements, which are used to initialize the cluster centers and to guide the clustering search process. Finally, a convolutional auto-encoder is used to obtain the text representation, and the network is jointly trained with a combination of reconstruction loss and clustering loss so that the text representation moves closer to its case; the text representation and the clustering process are unified in the same framework, and the auto-encoder parameters and the clustering model parameters are updated alternately to realize text clustering. The method addresses the problem that current clustering algorithms lack effective guidance information for the news and case correlation analysis task, which causes clustering divergence and reduces the accuracy of the results; it gives full play to the guiding role of case elements in the clustering process and in the vectorized text representation, and effectively improves the accuracy of the clustering results.

Description

News and case correlation analysis method based on case element guidance and deep clustering
Technical Field
The invention relates to a news and case correlation analysis method based on case element guidance and deep clustering, and belongs to the technical field of natural language processing.
Background
Public opinion analysis in the case domain is carried out on news texts related to a case. The purpose of news and case correlation analysis is to judge whether a news text is related to a case; it is an important link in case-domain public opinion analysis and is of great significance. News and case correlation analysis can be regarded as a text clustering process: news texts describing the same case are clustered into the same case cluster.
Currently, related research on text clustering can be divided into statistics-based and deep-learning-based methods. However, for the news and case correlation analysis task, existing methods lack effective guidance information, which easily causes clustering divergence and reduces the accuracy of the results.
Disclosure of Invention
The invention provides a news and case correlation analysis method based on case element guidance and deep clustering, to solve the problems that existing clustering methods lack effective guidance information for the news and case correlation analysis task, easily cause clustering divergence, and reduce the accuracy of the results.
The technical scheme of the invention is as follows: the news and case correlation analysis method based on case element guidance and deep clustering comprises the following steps:
Step1, compressing the case-related news texts using several summarization techniques: summaries of each news text are extracted with several summarization methods, the summaries are combined by a voting method, and the important information is extracted to represent the text, realizing text compression;
Step2, representing each case by the mean of its case element word vectors to obtain the vectorized case representation;
Step3, passing the compressed news text data through a convolutional auto-encoder to obtain the vectorized text representation, where an encoder based on the Text-CNN model is used, a deconvolution network forms the decoder, and the minimum mean square error loss serves as the reconstruction loss of the convolutional auto-encoder;
Step4, initializing the cluster centers with the vectorized case representations, unifying the vectorized text representation and the clustering process in the same framework, and alternately updating the auto-encoder parameters and the clustering model parameters to realize text clustering.
For a given set of news text vectors {H_i}, i = 1, 2, ..., N, where H_i is the vectorized representation of the i-th news document obtained through the convolutional auto-encoder, the task is to divide the N news texts into k case clusters, i.e., C = {C_1, ..., C_r, ..., C_k}.
Further, the Step1 includes the specific steps of:
Step1.1, the multi-summary text compression task is first formalized as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_1^v, L_2^v, ..., L_q^v, abbreviated L_1^v : L_q^v, where each summary contains v sentences and the summaries contain o distinct sentences in total; the goal is to select z sentences from L_1^v : L_q^v as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_i^v = f_i(S)    (1)
Here, seven summarization methods are used to summarize the news text, namely Lead, Luhn, LSA, LexRank, TextRank, SumBasic and KL-Sum, so that i ∈ [1,7] and q = 7;
the z sentences that occur most frequently across the summaries are selected as the compressed text; when frequencies are equal, the sentence appearing earlier in the document is preferred. In addition, the news headline is regarded as part of the news and is topical and factual, so the headline information is also added to the compressed text.
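As an illustration of the voting step, the following is a minimal sketch; the summarizer callables, the sentence list, and all names are placeholders rather than the patent's implementation (in practice, libraries such as sumy provide Luhn, LSA, LexRank, TextRank, SumBasic and KL-Sum summarizers):

```python
from collections import Counter

def compress(sentences, title, summarizers, z=3):
    """Combine q extractive summaries by voting: keep the z sentences that
    occur most often across the summaries, breaking ties by the earlier
    position in the document; the headline is always added (assumption:
    prepended)."""
    votes = Counter()
    for summarize in summarizers:            # each f_i(S) returns v sentences
        for sent in summarize(sentences):
            votes[sent] += 1
    # sort by (frequency descending, position in the document ascending)
    ranked = sorted(votes, key=lambda s: (-votes[s], sentences.index(s)))
    return [title] + ranked[:z]

# toy usage with two trivial stand-in "summarizers"
doc = ["sentence 1", "sentence 2", "sentence 3", "sentence 4"]
lead3 = lambda S: S[:3]       # Lead-style summary
tail3 = lambda S: S[-3:]
print(compress(doc, "headline", [lead3, tail3]))
```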
Further, Step2 includes:
Case elements are a structured presentation of a case, and a case can be characterized by its case elements. Let E_r = {e_1, e_2, ..., e_m} be the case element set of the r-th case, containing m case elements in total; each case element e_i can be characterized as a d-dimensional word vector w_i, i.e., E_r = {w_1, w_2, ..., w_m};
Mitchel et al found that vector addition is a simple and effective semantic combination method. By taking the idea as a reference, the case is vectorized by the mean value of the word vectors of the case elements: let Cen ber∈RdFor the vectorized representation of the r-th case, the calculation method is as follows:
Cen_r = (1/m) Σ_{i=1}^{m} w_i    (2)
Assuming there are k cases in total and denoting the set of case representations by Cen:
Cen = {Cen_1, ..., Cen_r, ..., Cen_k}    (3).
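A minimal numpy sketch of equations (2)-(3); `word_vec` (an element-to-vector lookup) and `case_elements` are hypothetical names introduced for illustration:

```python
import numpy as np

def init_centers(case_elements, word_vec):
    """Eq. (2): Cen_r is the mean of the r-th case's element word vectors;
    stacking the k rows gives the set Cen of eq. (3)."""
    return np.stack([np.mean([word_vec[e] for e in elems], axis=0)
                     for elems in case_elements])      # shape (k, d)

# toy usage: 2 cases characterized by 3 case elements, d = 3
word_vec = {"place": np.array([1., 0., 0.]),
            "person": np.array([0., 1., 0.]),
            "event": np.array([0., 0., 1.])}
cen = init_centers([["place", "person"], ["person", "event"]], word_vec)
print(cen.shape)   # (2, 3): one representation per case
```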
Further, Step3 includes:
constructing a word vector matrix for the compressed document, selecting a convolutional auto-encoder, and training the network with the reconstruction loss and the clustering loss.
The specific steps of Step3 are as follows:
Step3.1, let X be the compressed sentence set of a news text S, and let x_i ∈ R^k be the word vector of the i-th of the n words in X; the news text is then represented as:
X = x_1 ⊕ x_2 ⊕ ... ⊕ x_n    (4)
where ⊕ denotes concatenation; the sentence set X is thus constructed as a document word matrix of dimension n × k;
A Text-CNN text classification model is adopted as the encoder. For an input single-channel document word matrix x ∈ R^{n×k}, the latent representation of the τ-th feature map is:
c_τ = σ(x * W_τ + b_τ)    (5)
where W_τ ∈ R^{a×k} is the τ-th convolution kernel, a is the height of the convolution kernel, σ is the activation function, * denotes the 2-D convolution operation, and b_τ is the bias term of the τ-th convolution; since narrow convolution is used, c_τ ∈ R^{n−a+1};
Max pooling is applied to c_τ to obtain h_τ ∈ R, namely:
h_τ = max(c_τ)    (6)
Since the cluster centers have dimension d, d convolution kernels are needed to convolve the input document word matrix; max pooling is applied to each feature map, and the h_τ are finally concatenated to obtain the vectorized text representation H ∈ R^d, namely:
H = h_1 ⊕ h_2 ⊕ ... ⊕ h_d    (7)
The decoder is constructed with a deconvolution network: first, each h_τ is unpooled back to g_τ ∈ R^{n−a+1}; then each g_τ is deconvolved to reconstruct the document word matrix:
x̂ = σ( Σ_{τ∈T} g_τ * W_τ^T + ξ )    (8)
where σ is the activation function, T denotes the set of all feature maps, W_τ^T is the transpose of the corresponding convolution kernel, * is the 2-D convolution operation, and ξ is the bias term;
The minimum mean square error loss is used as the reconstruction loss of the convolutional auto-encoder:
Loss_n(θ) = (1/N) Σ_{i=1}^{N} || x̂_i − x_i ||²    (9)
where θ denotes the parameters of the convolutional auto-encoder.
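One way to realize equations (5)-(9) is sketched below in PyTorch; the document length, the tanh output activation, and the averaging of the three kernel-height branches in the decoder are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Sketch of the convolutional auto-encoder: a Text-CNN encoder
    (narrow convolution + max pooling, eqs. 5-7) and a deconvolution
    decoder (inverse pooling + transposed convolution, eq. 8)."""
    def __init__(self, n_words=60, emb_dim=300, heights=(3, 4, 5), n_filters=100):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_filters, (a, emb_dim)) for a in heights])      # eq. (5)
        self.pools = nn.ModuleList(
            [nn.MaxPool2d((n_words - a + 1, 1), return_indices=True)
             for a in heights])                                            # eq. (6)
        self.unpools = nn.ModuleList(
            [nn.MaxUnpool2d((n_words - a + 1, 1)) for a in heights])       # inverse pooling
        self.deconvs = nn.ModuleList(
            [nn.ConvTranspose2d(n_filters, 1, (a, emb_dim)) for a in heights])  # eq. (8)

    def forward(self, x):          # x: (batch, 1, n, k) document word matrix
        feats, idxs = [], []
        for conv, pool in zip(self.convs, self.pools):
            c = torch.relu(conv(x))       # c_tau: (batch, F, n-a+1, 1)
            h, idx = pool(c)              # h_tau: (batch, F, 1, 1)
            feats.append(h)
            idxs.append(idx)
        H = torch.cat([h.flatten(1) for h in feats], dim=1)   # eq. (7): (batch, d)
        recons = [torch.tanh(deconv(unpool(h, idx)))          # g_tau, then eq. (8)
                  for h, idx, unpool, deconv
                  in zip(feats, idxs, self.unpools, self.deconvs)]
        x_hat = torch.stack(recons).mean(0)   # merge branch reconstructions (assumption)
        return H, x_hat

model = ConvAutoEncoder()
x = torch.randn(8, 1, 60, 300)                     # batch of compressed documents
H, x_hat = model(x)
recon_loss = nn.functional.mse_loss(x_hat, x)      # eq. (9)
```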
Further, in Step4, the texts are clustered during the forward computation of the convolutional auto-encoder.
Further, the cluster centers are updated iteratively by combining the previous cluster center with the newly assigned cluster center of the current round.
As a preferred embodiment of the present invention, the Step4 specifically comprises the following steps:
Step4.1, for a given set of news text vectors {H_i}, i = 1, 2, ..., N, where H_i is the vectorized representation of the i-th news document obtained by the convolutional auto-encoder, the task is to divide the N news texts into k case clusters, i.e., C = {C_1, ..., C_r, ..., C_k}, where C_r is the r-th case cluster. k-means is one of the most widely used clustering algorithms; its loss function is:
Loss_c = Σ_{i=1}^{N} || H_i − M s_i ||²    (10)
where M ∈ R^{d×k} is the cluster-center matrix and s_i ∈ {0,1}^k is the case cluster assignment indicator of the i-th news text, with 1^T s_i = 1.
The r-th case cluster partition is updated as:
C_r = C_r ∪ {H_i}, if s_{r,i} = 1    (11)
During the iterative update, each news text is assigned to the cluster whose center is nearest; specifically, s_i is updated by the rule:
s_{r,i} = 1 if r = argmin_j || H_i − M_j ||², and s_{r,i} = 0 otherwise    (12)
where M_j denotes the j-th column of M.
The cluster-center matrix M is initialized with the case vectorization set Cen, each column of M being a Cen_r. Considering that news reports cover different sides of a case, the information of the news texts under a case is also merged into the case characterization vector, making the case characterization more reasonable. Specifically, during clustering, the previous cluster center Cen_r^{t−1} and the newly assigned cluster center H̄_r^t of the current round are combined to update the cluster center and obtain the new case characterization; the r-th case cluster center is updated as:
Cen_r^t = (1 − α_r^t) · Cen_r^{t−1} + α_r^t · H̄_r^t    (13)
where H̄_r^t is the mean vector of the news texts assigned to the r-th case cluster in round t, namely:
H̄_r^t = (1 / n_r^t) Σ_{H_i ∈ C_r^t} H_i    (14)
and α_r^t is the weight coefficient of the r-th case cluster, computed by equation (15) from n_r^t, the number of news texts assigned to the r-th case cluster in round t (the explicit form of (15) survives only as an image in the source).
Training the network under the guidance of the auto-encoder reconstruction loss constrains the text representation, while training under the guidance of the clustering loss pushes the text representation closer to its case. Therefore, the network is jointly trained with a combination of the reconstruction loss and the clustering loss of the convolutional auto-encoder; the loss function is defined as follows:
Loss = λ · Loss_c + (1 − λ) · Loss_n(θ)    (16)
where λ ∈ [0,1] is a hyper-parameter balancing Loss_c and Loss_n(θ).
In the early stage of the clustering iteration, the auto-encoder has not yet learned a good text representation, which would distort the case characterization and produce poor clustering results. Suppose the joint training runs for T rounds in total: in the first J rounds only the parameters of the convolutional auto-encoder are updated, with λ = 0, so the loss reduces to the reconstruction loss Loss_n(θ); in the remaining T − J rounds the clustering process is added to the forward computation, and the joint loss is used.
After iteratively updating X = {X_1, X_2, ..., X_N} with the proposed method for the given number of rounds, the news text set converges into different case clusters, yielding the final clustering result.
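Putting the pieces together, a minimal joint-training loop under equation (16) with J warm-up rounds at λ = 0 might look as follows; it reuses the `ConvAutoEncoder` and `cluster_round` sketches above, and the single-batch setup with T = 25, J = 5 and λ = 0.1 (matching the experimental settings described later) is illustrative:

```python
import torch
import torch.nn as nn

model = ConvAutoEncoder()                    # from the sketch above
x = torch.randn(8, 1, 60, 300)               # compressed document word matrices
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-5)
T, J, lam = 25, 5, 0.1
centers = torch.rand(6, 300)                 # in practice: init_centers(...), eqs. (2)-(3)

for t in range(T):
    H, x_hat = model(x)
    recon = nn.functional.mse_loss(x_hat, x)               # Loss_n(theta), eq. (9)
    if t < J:
        loss = recon                                       # warm-up rounds: lambda = 0
    else:
        a, c = cluster_round(H.detach().numpy(), centers.numpy())
        centers = torch.as_tensor(c, dtype=torch.float32)
        assign = torch.as_tensor(a, dtype=torch.long)
        cluster = ((H - centers[assign]) ** 2).sum(1).mean()   # Loss_c, eq. (10)
        loss = lam * cluster + (1 - lam) * recon               # eq. (16)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```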
The invention has the beneficial effects that:
1. First, important sentences are extracted to represent each text; second, cases are characterized by their case elements, which are used to initialize the cluster centers and to guide the clustering search process; finally, a convolutional auto-encoder is used to obtain the text representation, the network is jointly trained with reconstruction loss and clustering loss so that the text representation moves closer to its case, the text representation and the clustering process are unified in the same framework, and the auto-encoder parameters and the clustering model parameters are updated alternately to realize text clustering;
2. The method addresses the problem that current clustering algorithms lack effective guidance information for the news and case correlation analysis task, which causes clustering divergence and reduces result accuracy; it gives full play to the guiding role of case elements in the clustering process and in the vectorized text representation, and effectively improves the accuracy of the clustering results.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1, the news and case correlation analysis method based on case element guidance and deep clustering specifically includes:
step1, collecting related case news documents and defining related case elements.
In Step1, the case-related news documents are collected and organized by writing web crawlers to crawl the related news texts.
The case elements in Step1 are defined by analyzing the composition of case elements in the judgment documents on China Judgments Online (中国裁判文书网) and by considering the characteristics of case-related news texts.
Specifically, a total of 5970 news texts related to 6 hot cases were crawled, as shown in Table 1. Three elements, namely "case location, persons involved in the case, and case description", are defined as the case elements, as shown in Table 2.
TABLE 1 Case-related news text data set
(The table is provided as an image in the original and is not reproduced here.)
TABLE 2 Case element list
(The table is provided as an image in the original and is not reproduced here.)
Step2, compressing the case-related news text by using a plurality of summarization technologies;
step3, representing the case by using the mean value of the case element word vectors to obtain the vectorization representation of the case;
step4, the compressed news text data is passed through a convolution self-encoder to obtain a text vectorization representation; constructing a word vector matrix for the compressed document, selecting a convolution self-encoder, and jointly training a network by utilizing reconstruction loss and clustering loss;
step5, utilizing the vectorization representation of the case to initialize a clustering center, unifying the text vectorization representation and the clustering process into the same frame, and alternately updating the self-encoder parameters and the clustering model parameters to realize text clustering.
Further, in Step2, several summarization methods are adopted to extract summaries of the news texts, a voting method is used to combine the summaries, and the important information is extracted to represent the text, realizing text compression.
Further, the Step2 includes the specific steps of:
Step2.1, the multi-summary text compression task is first formalized as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_1^v, L_2^v, ..., L_q^v, abbreviated L_1^v : L_q^v, where each summary contains v sentences and the summaries contain o distinct sentences in total; the goal is to select z sentences from L_1^v : L_q^v as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_i^v = f_i(S)    (1)
Here, seven summarization methods are used to summarize the news text, namely Lead, Luhn, LSA, LexRank, TextRank, SumBasic and KL-Sum, so that i ∈ [1,7] and q = 7;
the z sentences that occur most frequently across the summaries are selected as the compressed text; when frequencies are equal, the sentence appearing earlier in the document is preferred. In addition, the news headline is regarded as part of the news and is topical and factual, so the headline information is also added to the compressed text.
Further, Step3 includes:
if Er={e1,e2,...emThe case element set of the r-th case includes m case elements in total, and each case element eiIt can be characterized as a d-dimensional word vector wiI.e. Er={w1,w2,...wm};
The case is then vectorized as the mean of its case element word vectors: let Cen_r ∈ R^d be the vectorized representation of the r-th case, computed as follows:
Cen_r = (1/m) Σ_{i=1}^{m} w_i    (2)
Assuming there are k cases in total and denoting the set of case representations by Cen:
Cen = {Cen_1, ..., Cen_r, ..., Cen_k}    (3).
the specific steps of Step4 are as follows:
Step4.1, let X be the compressed sentence set of a news text S, and let x_i ∈ R^k be the word vector of the i-th of the n words in X; the news text is then represented as:
X = x_1 ⊕ x_2 ⊕ ... ⊕ x_n    (4)
where ⊕ denotes concatenation; the sentence set X is thus constructed as a document word matrix of dimension n × k;
A Text-CNN text classification model is adopted as the encoder. For an input single-channel document word matrix x ∈ R^{n×k}, the latent representation of the τ-th feature map is:
c_τ = σ(x * W_τ + b_τ)    (5)
where W_τ ∈ R^{a×k} is the τ-th convolution kernel, a is the height of the convolution kernel, σ is the activation function, * denotes the 2-D convolution operation, and b_τ is the bias term of the τ-th convolution; since narrow convolution is used, c_τ ∈ R^{n−a+1};
Max pooling is applied to c_τ to obtain h_τ ∈ R, namely:
h_τ = max(c_τ)    (6)
Since the cluster centers have dimension d, d convolution kernels are needed to convolve the input document word matrix; max pooling is applied to each feature map, and the h_τ are finally concatenated to obtain the vectorized text representation H ∈ R^d, namely:
H = h_1 ⊕ h_2 ⊕ ... ⊕ h_d    (7)
The decoder is constructed with a deconvolution network: first, each h_τ is unpooled back to g_τ ∈ R^{n−a+1}; then each g_τ is deconvolved to reconstruct the document word matrix:
x̂ = σ( Σ_{τ∈T} g_τ * W_τ^T + ξ )    (8)
where σ is the activation function, T denotes the set of all feature maps, W_τ^T is the transpose of the corresponding convolution kernel, * is the 2-D convolution operation, and ξ is the bias term;
The minimum mean square error loss is used as the reconstruction loss of the convolutional auto-encoder:
Loss_n(θ) = (1/N) Σ_{i=1}^{N} || x̂_i − x_i ||²    (9)
where θ denotes the parameters of the convolutional auto-encoder.
Further, in Step5, the texts are clustered during the forward computation of the convolutional auto-encoder.
Further, the cluster centers are updated iteratively by combining the previous cluster center with the newly assigned cluster center of the current round.
The clustering performance is evaluated by comparing the clustering result with the labels of the texts in the data set; accuracy (ACC) and normalized mutual information (NMI) are selected as the evaluation indexes, where accuracy is defined as:
ACC = tr(ŝ^T s) / N
where ŝ^T is the transpose of the clustering-result matrix ŝ, s is the label matrix of the texts in the data set, tr(·) is the trace of a matrix, and N is the total number of news texts.
Normalized mutual information (NMI) measures the similarity between two distributions; for the clustering task, it measures the similarity between the true labels and the clustering result:
NMI(Y, C) = MI(Y, C) / sqrt( H(Y) · H(C) )
where MI(·,·) is the mutual information, H(·) is the information entropy, and NMI ∈ [0,1]; the larger the value, the better the clustering effect.
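For reference, both indexes can be computed with standard tooling: a sketch using scipy's Hungarian solver for the best cluster-to-label mapping in ACC, and scikit-learn's NMI (whose default normalization may differ from the formula above):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(labels, preds):
    """ACC: the fraction of texts correctly grouped under the best
    one-to-one mapping between clusters and true labels."""
    k = max(labels.max(), preds.max()) + 1
    count = np.zeros((k, k), dtype=int)
    for y, c in zip(labels, preds):
        count[c, y] += 1                        # contingency matrix
    row, col = linear_sum_assignment(-count)    # maximize matched counts
    return count[row, col].sum() / len(labels)

labels = np.array([0, 0, 1, 1, 2, 2])           # true case labels
preds = np.array([1, 1, 0, 0, 2, 2])            # clustering result
print(clustering_accuracy(labels, preds))            # 1.0
print(normalized_mutual_info_score(labels, preds))   # 1.0
```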
To make the comparison more convincing, 2 vector space models, 1 topic model and 3 word-vector-based distributed representation methods are selected to characterize the documents, each combined with the k-means clustering algorithm and compared with the proposed method. The feature dimension of the vector-space-model baselines is 2000, and the dimensions of the remaining baseline methods are all 300. The distributed document representation methods use the same compressed text as this method. Specifically: (1) TFIDF-1: each word in the document is a feature item, weighted by TF-IDF; (2) TFIDF-2: context words with a window size of 2 are the feature items, weighted by TF-IDF; (3) LDA: a topic model is used to obtain the document representation; (4) MeanWV (Mean Word Embedding): the average word vector of the document; (5) TWE (Topical Word Embedding): the concatenation of the average topic vector and the average word vector represents the document; (6) TopicVec: the concatenation of the document topic vector and the average word vector represents the document.
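As an example of how the vector-space baselines are constructed, a TFIDF-1-style pipeline with k-means might look as follows; the corpus is a placeholder and the 2000-feature cap follows the dimension stated above (for Chinese text, a tokenizer would be supplied to the vectorizer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["compressed text of news 1 ...",        # placeholder corpus
        "compressed text of news 2 ...",
        "compressed text of news 3 ..."]
tfidf = TfidfVectorizer(max_features=2000)      # TFIDF-1: words as feature items
X = tfidf.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                               # case-cluster assignment per text
```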
The following hyper-parameter settings are adopted for the invention: (1) for the document compression module, the number of sentences extracted by each summarizer is set to 3, and the number of sentences in the combined summary is also set to 3; (2) for the convolutional auto-encoding module, the dimension of the input word vectors is 300; three convolution kernel heights are used, namely 3, 4 and 5, with 100 kernels of each height; the optimizer is Adam, the learning rate is 0.01, and the L2 regularization weight is 0.00001; (3) for the clustering module, the embedding dimension of the case elements is 300; the number of iteration rounds is set to 25, and the clustering loss is not used to optimize the network in the first 5 rounds; the hyper-parameter balancing the clustering loss and the auto-encoder loss is set to 0.1.
Table 3 compares the clustering effect of this method and the baseline methods under 4, 5 and 6 cases. The experimental results show that this method outperforms the baselines on both accuracy and normalized mutual information.
Table 3 Comparison of experimental results of the method herein and the baseline methods
(The table is provided as an image in the original and is not reproduced here.)
As can be seen from the experimental results in Table 3, the LDA-based text characterization yields a relatively poor clustering effect, mainly because LDA does not fit this task well: the aim is to cluster news texts of the same case into the same case cluster, so each case corresponds to a single topic, whereas LDA assumes that a news text contains multiple topics; the clustering result is therefore not ideal. The clustering methods based on vector-space text representations achieve good results: for case-related public opinion data, news texts of different cases differ considerably, and TF-IDF measures how representative a word is of a document, so it distinguishes documents well; TFIDF-2 in particular considers 2-gram features and captures part of the context information. The distributed document representation methods, which represent documents with word embeddings or topic embeddings, achieve effects close to TFIDF-2.
This method uses a convolutional auto-encoder to extract features and compose the semantics of the text, so the text representation carries n-gram features; at the same time, the clustering loss provides guidance, allowing the model to learn a task-relevant text representation. In addition, the case elements are used to initialize the cluster centers, guiding the clustering process. This method outperforms the baselines on the averages of both indexes; for example, under 6 cases, its accuracy is 4.16% higher than TFIDF-2 and its normalized mutual information is 9.20% higher.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the invention is not limited to those embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (7)

1. The news and case correlation analysis method based on case element guidance and deep clustering, characterized by comprising the following steps:
step1, compressing the case-related news text by using a plurality of summarization technologies;
step2, representing the case by using the mean value of the case element word vectors to obtain the vectorization representation of the case;
step3, the compressed news text data is passed through a convolution self-encoder to obtain a text vectorization representation;
step4, utilizing the vectorization representation of the case to initialize a clustering center, unifying the text vectorization representation and the clustering process into the same frame, and alternately updating the self-encoder parameters and the clustering model parameters to realize text clustering.
2. The news and case correlation analysis method based on case element guidance and deep clustering of claim 1, wherein: in Step1, several summarization methods are adopted to extract summaries of the news text, a voting method is used to combine the summaries, and the important information is extracted to represent the text, realizing text compression.
3. The news and case correlation analysis method based on case element guidance and deep clustering according to claim 1 or 2, characterized in that: the specific steps of Step1 are as follows:
Step1.1, firstly formalizing the multi-summary text compression task as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_1^v, L_2^v, ..., L_q^v, abbreviated L_1^v : L_q^v, where each summary contains v sentences and the summaries contain o distinct sentences in total; the goal is to select z sentences from L_1^v : L_q^v as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_i^v = f_i(S)    (1)
here, seven summarization methods are used to summarize the news text, namely Lead, Luhn, LSA, LexRank, TextRank, SumBasic and KL-Sum, so that i ∈ [1,7] and q = 7;
and selecting the z sentences that occur most frequently across the summaries as the compressed text.
4. The news and case correlation analysis method based on case element guidance and deep clustering of claim 1, wherein: the Step2 includes:
if Er={e1,e2,...emThe case element set of the r-th case includes m case elements in total, and each case element eiIt can be characterized as a d-dimensional word vector wiI.e. Er={w1,w2,...wm};
Then the case is vectorized with the mean of the word vectors of the case elements: let Cen ber∈RdFor the vectorized representation of the r-th case, the calculation method is as follows:
Figure FDA0002407580500000011
assuming there are a total of k cases, using Cen to represent the set of cases, then:
Cen={Cen1,...,Cenr,...,Cenk} (3)。
5. the news and case correlation analysis method based on case element guidance and deep clustering of claim 1, wherein: step3 comprises the following steps:
constructing a word vector matrix for the compressed document, selecting a convolutional auto-encoder, and training the network with the reconstruction loss and the clustering loss.
6. The news and case correlation analysis method based on case element guidance and deep clustering of claim 1, wherein: in Step4, the texts are clustered during the forward computation of the convolutional auto-encoder.
7. The news and case correlation analysis method based on case element guidance and deep clustering of claim 6, wherein: the cluster centers are updated iteratively by combining the previous cluster center with the newly assigned cluster center of the current round.
CN202010166279.6A 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering Active CN111831820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010166279.6A CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010166279.6A CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Publications (2)

Publication Number Publication Date
CN111831820A (en) 2020-10-27
CN111831820B (en) 2022-07-19

Family

ID=72913341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010166279.6A Active CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Country Status (1)

Country Link
CN (1) CN111831820B (en)

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN113158079A (en) * 2021-04-22 2021-07-23 昆明理工大学 Case public opinion timeline generation method based on difference case elements
CN113191411A (en) * 2021-04-22 2021-07-30 杭州卓智力创信息技术有限公司 Electronic sound image file management method based on photo group
WO2022228127A1 (en) * 2021-04-29 2022-11-03 京东科技控股股份有限公司 Element text processing method and apparatus, electronic device, and storage medium

Citations (7)

Publication number Priority date Publication date Assignee Title
CN108898180A (en) * 2018-06-28 2018-11-27 中国人民解放军国防科技大学 Depth clustering method for single-particle cryoelectron microscope images
US20190005976A1 (en) * 2017-07-03 2019-01-03 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Method and system for enhancing a speech signal of a human speaker in a video using visual information
CN109272992A (en) * 2018-11-27 2019-01-25 北京粉笔未来科技有限公司 A kind of spoken language assessment method, device and a kind of device for generating spoken appraisal model
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
US20190304437A1 (en) * 2018-03-29 2019-10-03 Tencent Technology (Shenzhen) Company Limited Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition
CN110533545A (en) * 2019-07-12 2019-12-03 长春工业大学 Side community discovery algorithm based on the sparse self-encoding encoder of depth
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 News and case similarity calculation method based on asymmetric twin network

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
US20190005976A1 (en) * 2017-07-03 2019-01-03 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Method and system for enhancing a speech signal of a human speaker in a video using visual information
US20190304437A1 (en) * 2018-03-29 2019-10-03 Tencent Technology (Shenzhen) Company Limited Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition
CN108898180A (en) * 2018-06-28 2018-11-27 中国人民解放军国防科技大学 Depth clustering method for single-particle cryoelectron microscope images
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
CN109272992A (en) * 2018-11-27 2019-01-25 北京粉笔未来科技有限公司 A kind of spoken language assessment method, device and a kind of device for generating spoken appraisal model
CN110533545A (en) * 2019-07-12 2019-12-03 长春工业大学 Side community discovery algorithm based on the sparse self-encoding encoder of depth
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 News and case similarity calculation method based on asymmetric twin network

Non-Patent Citations (4)

Title
A. ALQAHTANI 等: "A deep convolutional auto-encoder with embedded clustering", 《2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING》 *
CHEN D 等: "Unsupervised multi-manifold clustering by learning deep representation", 《PROCEEDINGS OF THE 31ST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》 *
Y. LI 等: "Spatial Fuzzy Clustering and Deep Auto-encoder for Unsupervised Change Detection in Synthetic Aperture Radar Images", 《IGARSS 2018 - 2018 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM》 *
谢娟英 等: "深度卷积自编码图像聚类算法", 《计算机科学与探索》 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN113158079A (en) * 2021-04-22 2021-07-23 昆明理工大学 Case public opinion timeline generation method based on difference case elements
CN113191411A (en) * 2021-04-22 2021-07-30 杭州卓智力创信息技术有限公司 Electronic sound image file management method based on photo group
CN113158079B (en) * 2021-04-22 2022-06-17 昆明理工大学 Case public opinion timeline generation method based on difference case elements
CN113191411B (en) * 2021-04-22 2023-02-07 杭州卓智力创信息技术有限公司 Electronic sound image file management method based on photo group
WO2022228127A1 (en) * 2021-04-29 2022-11-03 京东科技控股股份有限公司 Element text processing method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN111831820B (en) 2022-07-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant