CN111831820B - News and case correlation analysis method based on case element guidance and deep clustering - Google Patents

News and case correlation analysis method based on case element guidance and deep clustering

Info

Publication number
CN111831820B
Authority
CN
China
Prior art keywords
case
clustering
text
news
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010166279.6A
Other languages
Chinese (zh)
Other versions
CN111831820A (en)
Inventor
余正涛
李云龙
高盛祥
郭军军
相艳
线岩团
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202010166279.6A priority Critical patent/CN111831820B/en
Publication of CN111831820A publication Critical patent/CN111831820A/en
Application granted granted Critical
Publication of CN111831820B publication Critical patent/CN111831820B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/34 - Browsing; Visualisation therefor
    • G06F 16/345 - Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a news and case correlation analysis method based on case element guidance and deep clustering. First, important sentences are extracted to represent each text. Second, cases are characterized by their case elements, which are used to initialize the cluster centers and guide the cluster search. Finally, a convolutional autoencoder is selected to obtain the text representation, and the network is trained jointly with a reconstruction loss and a clustering loss so that each text's representation moves closer to its case; text representation and clustering are unified in a single framework, and the autoencoder parameters and clustering model parameters are updated alternately to realize text clustering. The method addresses the problem that current clustering algorithms lack effective guidance information for the news and case correlation analysis task, which causes clustering divergence and reduces the accuracy of the results; it gives full play to the guiding role of case elements in both the clustering process and text vectorization, and effectively improves the accuracy of the clustering results.

Description

News and case correlation analysis method based on case element guidance and deep clustering
Technical Field
The invention relates to a news and case correlation analysis method based on case element guidance and deep clustering, and belongs to the technical field of natural language processing.
Background
Public opinion analysis in the case domain builds on news texts related to cases. The purpose of news and case correlation analysis is to judge whether a news text is related to a given case; it is an important link in case-domain news public opinion analysis and is of great significance to it. News and case correlation analysis can be regarded as a text clustering process: news texts describing the same case are gathered into the same case cluster.
Current research on text clustering falls into two families of methods, statistical and deep-learning based. For the news and case correlation analysis task, however, the lack of effective guidance information means existing methods easily diverge during clustering, reducing the accuracy of the results.
Disclosure of Invention
The invention provides a news and case correlation analysis method based on case element guidance and deep clustering, to solve the problem that existing clustering methods lack effective guidance information for the news and case correlation analysis task, which easily causes clustering divergence and reduces the accuracy of the results.
The technical scheme of the invention is as follows: a news and case correlation analysis method based on case element guidance and deep clustering comprises the following steps:
Step1, compress the case-related news texts using several summarization techniques: extract summaries of each news text with multiple summarization methods, synthesize the summaries by voting, and extract the important sentences to represent the text, realizing text compression;
Step2, represent each case by the mean of its case element word vectors, obtaining a vectorized case representation;
Step3, pass the compressed news text data through a convolutional autoencoder to obtain the text vectorization; a Text-CNN model is used as the encoder, a deconvolution network forms the decoder, and mean squared error is used as the reconstruction loss of the convolutional autoencoder;
Step4, initialize the cluster centers with the vectorized case representations, unify text vectorization and clustering in the same framework, and alternately update the autoencoder parameters and the clustering model parameters to realize text clustering.
For a given set of news text vectors {H_i}, i = 1, 2, ..., N, where H_i is the vectorized representation of the i-th news document obtained by the convolutional autoencoder, the task is to divide the news texts of N different cases into k case clusters, i.e., C = {C_1, ..., C_r, ..., C_k}.
Further, the Step1 includes the specific steps of:
Step1.1, the multi-summary text compression task is first formalized as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_{1v}, L_{2v}, ..., L_{qv}, abbreviated L_{1v}:L_{qv}, where each summary contains v sentences and the summaries together contain o distinct sentences; the goal is to select z sentences from L_{1v}:L_{qv} as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_{iv} = f_i(S)   (1)
here, 7 summarization methods are used to summarize the news text: Lead, Luhn, LSA, LexRank, TextRank, SumBasic, and KL-Sum, so i ∈ [1,7], i.e., q = 7;
the z sentences appearing most frequently across the summaries are selected as the compressed text; when frequencies tie, sentences appearing earlier in the document are preferred. In addition, the news headline is treated as part of the news; being topical and factual, the headline information is also added to the compressed text set. A minimal sketch of this voting scheme is given below.
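The following Python sketch shows the voting-based compression under stated assumptions: the q summarizer functions (Lead, Luhn, LSA, LexRank, TextRank, SumBasic, KL-Sum or any others) are assumed to be supplied by the caller, each returning v sentences drawn from the input; only the voting and tie-breaking logic of Step1 is implemented here.

```python
from collections import Counter
from typing import Callable, List

def compress(sentences: List[str],
             summarizers: List[Callable[[List[str]], List[str]]],
             z: int, title: str) -> List[str]:
    """Vote across the q summaries; break frequency ties by document position."""
    votes = Counter()
    for summarize in summarizers:
        for sent in summarize(sentences):      # each summary contributes v sentences
            votes[sent] += 1
    # Rank by (frequency desc, original position asc) and keep the top z.
    ranked = sorted(votes, key=lambda s: (-votes[s], sentences.index(s)))
    # The headline is treated as part of the news and is always kept.
    return [title] + ranked[:z]
```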
Further, Step2 includes:
the case elements are structured displays of cases, and the cases can be characterized by the case elements. If Er={e1,e2,...emThe case element set of the r-th case includes m case elements in total, and each case element eiIt can be characterized as a d-dimensional word vector wiI.e. Er={w1,w2,...wm};
Mitchell et al. found that vector addition is a simple and effective semantic composition method. Borrowing this idea, a case is vectorized as the mean of its case element word vectors: let Cen_r ∈ R^d be the vectorized representation of the r-th case, computed as:
Cen_r = (1/m) Σ_{i=1}^{m} w_i   (2)
assuming there are k cases in total and using Cen to denote the set of case representations:
Cen = {Cen_1, ..., Cen_r, ..., Cen_k}   (3).
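A short sketch of this case characterization (Eqs. (2)-(3)), assuming a pre-trained word-vector lookup table `embed` (here 300-dimensional, matching the embodiment) is available; the function names are illustrative, not from the source:

```python
import numpy as np

def case_vector(elements, embed):
    """Cen_r: the mean of the word vectors of the m case elements (Eq. (2))."""
    return np.mean([embed[e] for e in elements], axis=0)

def case_set(all_cases, embed):
    """Cen = {Cen_1, ..., Cen_k}: one d-dimensional vector per case (Eq. (3))."""
    return np.stack([case_vector(elems, embed) for elems in all_cases])
```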
further, Step3 includes:
and constructing a word vector matrix for the compressed document, selecting a convolution self-encoder, and training a network by utilizing reconstruction loss and clustering loss.
The specific steps of Step3 are as follows:
Step3.1, let X be the compressed sentence set of a news text S, let x_i ∈ R^k be the word vector of the i-th word in X, and let the sentence set contain n words in total; the news text is then represented as:
x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n   (4)
where ⊕ is the concatenation operator; that is, the sentence set X is built into a document word matrix of dimension n × k;
the Text-CNN text classification model is adopted as the encoder; for an input single-channel document word matrix x ∈ R^{n×k}, the latent representation of the τ-th feature map is:
c_τ = σ(x * W_τ + b_τ)   (5)
where W_τ ∈ R^{a×k} is the τ-th convolution kernel, a is the height of the kernel, σ is the activation function, * denotes the 2-d convolution operation, and b_τ is the bias term of the τ-th convolution; since narrow convolution is used, c_τ ∈ R^{n−a+1};
max pooling is applied to c_τ to obtain h_τ ∈ R, namely:
h_τ = max(c_τ)   (6)
because the cluster centers are d-dimensional, d convolution kernels are needed to convolve the input document word matrix; max pooling is applied to each feature map, and finally the h_τ are concatenated to obtain the vectorized text representation H ∈ R^d, namely:
H = [h_1, h_2, ..., h_d]   (7)
the decoder portion is constructed using a deconvolution network, first, for each h separatelyτPerforming inverse pooling operation to reduce the data to gτ∈Rn-a+1(ii) a Secondly, for each gτPerforming deconvolution operation, and reconstructing a document word matrix, wherein the calculation method comprises the following steps:
Figure GDA0002680482060000033
here, σ is the activation function, and T denotes allCharacteristic diagram of (1), WTIs the transpose of the corresponding convolution kernel, is a 2d convolution operation, ξ is the bias term;
the minimum mean square error loss is used as the reconstruction loss of the convolution self-coding, and the calculation formula is as follows:
Figure GDA0002680482060000034
where θ is a parameter of the convolutional auto-encoder.
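A minimal PyTorch sketch of this encoder-decoder (Eqs. (5)-(9)), shown with a single kernel height a for clarity (the embodiment uses heights 3, 4, 5 with 100 feature maps each); layer choices beyond what the description states are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvTextAutoencoder(nn.Module):
    """Text-CNN encoder + deconvolution decoder, trained with MSE reconstruction."""
    def __init__(self, emb_dim=300, n_filters=300, kernel_height=3):
        super().__init__()
        # Encoder: d filters of size (a, k) slid over the n x k word matrix (Eq. 5).
        self.conv = nn.Conv2d(1, n_filters, (kernel_height, emb_dim))
        # Decoder: transposed convolution with a kernel of the same shape (Eq. 8).
        self.deconv = nn.ConvTranspose2d(n_filters, 1, (kernel_height, emb_dim))

    def forward(self, x):                                 # x: (batch, 1, n, k)
        c = torch.relu(self.conv(x)).squeeze(3)           # (batch, d, n-a+1), Eq. (5)
        h, idx = F.max_pool1d(c, c.size(2), return_indices=True)  # Eq. (6)
        H = h.squeeze(2)                                  # (batch, d): text vector, Eq. (7)
        g = F.max_unpool1d(h, idx, c.size(2))             # inverse pooling to (batch, d, n-a+1)
        x_hat = self.deconv(g.unsqueeze(3))               # reconstruction, Eq. (8); σ left as identity
        return H, x_hat

def reconstruction_loss(x, x_hat):                        # Eq. (9): mean squared error
    return F.mse_loss(x_hat, x)
```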
Further, in Step4 the text is clustered during the forward pass of the convolutional autoencoder.
Further, the cluster centers are iterated by updating each center as a combination of the previous cluster center and the newly assigned cluster center.
As a preferred embodiment of the present invention, the Step4 specifically comprises the following steps:
Step4.1, for a given set of news text vectors {H_i}, i = 1, 2, ..., N, where H_i is the vectorized representation of the i-th news document obtained by the convolutional autoencoder, the task is to divide the news texts of N different cases into k case clusters, i.e., C = {C_1, ..., C_r, ..., C_k}, where C_r is the r-th case cluster. k-means is one of the most widely used clustering algorithms; its loss function is:
Loss_c = min_{M, s_i} Σ_{i=1}^{N} ||H_i − M s_i||²   (10)
where M ∈ R^{d×k} is the cluster center matrix and s_i ∈ {0,1}^k, with components s_{r,i}, is the case cluster assignment indicator of each news text, with 1^T s_i = 1.
The update mode of the r-th case cluster partition is as follows:
C_r = C_r ∪ {H_i}  if s_{r,i} = 1   (11)
in the iterative updating process, each news text is assigned to the cluster whose center is closest; specifically, s_i is updated by the rule:
s_{r,i} = 1 if r = argmin_j ||H_i − M_j||², and s_{r,i} = 0 otherwise   (12)
where M_j denotes the j-th column of M.
The cluster center matrix M is initialized with the case representation set Cen, each column of M being a Cen_r. Considering that news reports cover different sides of a case, information from the news texts under a case is added into the case representation vector, making the case representation more reasonable. Specifically, during clustering the previous cluster center M_r^{t−1} and the currently newly assigned cluster center M̃_r^t are combined to update the cluster center and obtain a new case representation; the r-th case cluster center is updated as:
M_r^t = (1 − α_r^t) · M_r^{t−1} + α_r^t · M̃_r^t   (13)
where M̃_r^t is the mean vector of the news texts assigned to the r-th case cluster in round t, namely:
M̃_r^t = (1/N_r^t) Σ_{H_i ∈ C_r^t} H_i   (14)
and α_r^t is the weight coefficient of the r-th case cluster, computed as a function of N_r^t, the number of news texts assigned to the r-th case cluster in round t (Eq. (15)).
Training the network under the guidance of the autoencoder's reconstruction loss constrains the text representation, while training it under the guidance of the clustering loss pushes each text's representation closer to its case. The network is therefore trained jointly with a combination of the reconstruction loss and the clustering loss of the convolutional autoencoder, with the loss function defined as:
Loss = λ · Loss_c + (1 − λ) · Loss(θ)   (16)
where λ ∈ [0,1] is a hyperparameter balancing Loss_c and Loss(θ).
In the early clustering iterations, the autoencoder has not yet learned a good text representation, which harms the case representation and produces poor clustering results. With T rounds of joint training in total, the first J rounds update only the convolutional autoencoder parameters, with λ = 0 so that the loss is purely the reconstruction loss Loss(θ); the remaining T − J rounds add the clustering process to the forward pass, and the loss is the joint loss.
After iteratively updating over X = {X_1, X_2, ..., X_N} for these rounds, the news text set converges into distinct case clusters, giving the final clustering result. A condensed sketch of this loop is given below.
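A condensed PyTorch sketch of the guided clustering (Eqs. (10)-(16)), assuming `model` is the convolutional autoencoder sketched above, `docs` the batch of document word matrices, and `Cen` the k × d case vector matrix from Step2; the exact weight schedule of Eq. (15) is not given here, so the α_r^t used below is an assumed 1/(1 + N_r^t):

```python
import torch
import torch.nn.functional as F

def train(model, docs, Cen, T=25, J=5, lam=0.1, lr=0.01):
    M = torch.tensor(Cen.T, dtype=torch.float32)         # init centers: columns are Cen_r
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for t in range(T):
        H, x_hat = model(docs)                           # forward pass
        rec = F.mse_loss(x_hat, docs)                    # Eq. (9)
        assign = torch.cdist(H, M.T).argmin(dim=1)       # Eq. (12): nearest center
        clu = ((H - M.T[assign]) ** 2).sum(1).mean()     # Eq. (10)
        loss = rec if t < J else lam * clu + (1 - lam) * rec   # Eq. (16); λ = 0 early
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                            # Eqs. (13)-(14): center update
            for r in range(M.size(1)):
                members = H[assign == r]
                if len(members):
                    alpha = 1.0 / (1 + len(members))     # assumed form of Eq. (15)
                    M[:, r] = (1 - alpha) * M[:, r] + alpha * members.mean(0)
    return assign
```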
The invention has the beneficial effects that:
1. First, important sentences are extracted to represent each text; second, cases are characterized by their case elements, which initialize the cluster centers and guide the cluster search; finally, a convolutional autoencoder is selected to obtain the text representation, and the network is trained jointly with the reconstruction loss and the clustering loss so that each text's representation moves closer to its case; text representation and clustering are unified in the same framework, and autoencoder parameters and clustering model parameters are updated alternately to realize text clustering;
2. The method addresses the problem that current clustering algorithms lack effective guidance information for the news and case correlation analysis task, which causes clustering divergence and reduces the accuracy of the results; it gives full play to the guidance of case elements in the clustering process and in text vectorization, and effectively improves the accuracy of the clustering results.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1, the news and case correlation analysis method based on case element guidance and deep clustering specifically includes:
step1, collecting related case news documents and defining related case elements.
The case-related news documents collected and collated in Step1 are obtained by writing a web crawler to crawl relevant news texts.
The case elements defined in Step1 are determined by analyzing the composition of case elements in judgment documents on China Judgments Online (中国裁判文书网) and by considering the characteristics of case-related news texts.
Specifically, a total of 5970 news texts related to 6 hot cases are crawled, as shown in Table 1. Three elements, namely "party", "person involved", and "case description", are defined as the case elements, as shown in Table 2.
TABLE 1 case-related news text data set
[Table 1 is provided as an image in the original document.]
TABLE 2 case elements List
[Table 2 is provided as an image in the original document.]
Step2, compressing the case-related news text by using a plurality of summarization technologies;
step3, representing the case by using the mean value of the case element word vectors to obtain the vectorization representation of the case;
step4, the compressed news text data is passed through a convolution self-encoder to obtain a text vectorization representation; constructing a word vector matrix for the compressed document, selecting a convolution self-encoder, and jointly training a network by utilizing reconstruction loss and clustering loss;
step5, initializing a clustering center by using vectorization representation of cases, unifying text vectorization representation and clustering processes into the same frame, and alternately updating parameters of a self-encoder and parameters of a clustering model to realize text clustering.
Further, in Step2, several summarization methods are adopted to extract summaries of the news texts, a voting method is used to synthesize the summaries, and the important sentences are extracted to represent the text, realizing text compression.
Further, the Step2 includes the specific steps of:
Step2.1, the multi-summary text compression task is first formalized as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_{1v}, L_{2v}, ..., L_{qv}, abbreviated L_{1v}:L_{qv}, where each summary contains v sentences and the summaries together contain o distinct sentences; the goal is to select z sentences from L_{1v}:L_{qv} as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_{iv} = f_i(S)   (1)
here, 7 summarization methods are used to summarize the news text: Lead, Luhn, LSA, LexRank, TextRank, SumBasic, and KL-Sum, so i ∈ [1,7], i.e., q = 7;
the z sentences appearing most frequently across the summaries are selected as the compressed text; when frequencies tie, sentences appearing earlier in the document are preferred. In addition, the news headline is treated as part of the news; being topical and factual, the headline information is also added to the compressed text set.
Further, Step3 includes:
If E_r = {e_1, e_2, ..., e_m} is the case element set of the r-th case, containing m case elements in total, each case element e_i can be characterized as a d-dimensional word vector w_i, i.e., E_r = {w_1, w_2, ..., w_m};
then the case is vectorized as the mean of the word vectors of its case elements: let Cen_r ∈ R^d be the vectorized representation of the r-th case, computed as:
Cen_r = (1/m) Σ_{i=1}^{m} w_i   (2)
assuming there are k cases in total and using Cen to denote the set of case representations:
Cen = {Cen_1, ..., Cen_r, ..., Cen_k}   (3).
the specific steps of Step4 are as follows:
Step4.1, let X be the compressed sentence set of a news text S, let x_i ∈ R^k be the word vector of the i-th word in X, and let the sentence set contain n words in total; the news text is then represented as:
x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n   (4)
where ⊕ is the concatenation operator; that is, the sentence set X is built into a document word matrix of dimension n × k;
the Text-CNN text classification model is adopted as the encoder; for an input single-channel document word matrix x ∈ R^{n×k}, the latent representation of the τ-th feature map is:
c_τ = σ(x * W_τ + b_τ)   (5)
where W_τ ∈ R^{a×k} is the τ-th convolution kernel, a is the height of the kernel, σ is the activation function, * denotes the 2-d convolution operation, and b_τ is the bias term of the τ-th convolution; since narrow convolution is used, c_τ ∈ R^{n−a+1};
max pooling is applied to c_τ to obtain h_τ ∈ R, namely:
h_τ = max(c_τ)   (6)
because the cluster centers are d-dimensional, d convolution kernels are needed to convolve the input document word matrix; max pooling is applied to each feature map, and finally the h_τ are concatenated to obtain the vectorized text representation H ∈ R^d, namely:
H = [h_1, h_2, ..., h_d]   (7)
the decoder portion is constructed using a deconvolution network, first, for each h separatelyτPerforming inverse pooling operation to reduce the data to gτ∈Rn-a+1(ii) a Secondly, for each gτPerforming deconvolution operation, and reconstructing a document word matrix, wherein the calculation method comprises the following steps:
Figure GDA0002680482060000075
here, σ is the activation function, T denotes all the characteristic diagrams, WTIs the transpose of the corresponding convolution kernel, is a 2d convolution operation, ξ is the bias term;
the minimum mean square error loss is used as the reconstruction loss of the convolution self-coding, and the calculation formula is as follows:
Figure GDA0002680482060000076
where θ is a parameter of the convolutional auto-encoder.
Further, in Step5 the text is clustered during the forward pass of the convolutional autoencoder.
Further, the cluster centers are iterated by updating each center as a combination of the previous cluster center and the newly assigned cluster center.
Clustering performance is evaluated by comparing the clustering result against the text labels in the data set, with Accuracy (ACC) and Normalized Mutual Information (NMI) as evaluation metrics. Accuracy is defined as:
ACC = tr(s^T s′) / N
where s^T is the transpose of the clustering result matrix, s′ is the label matrix of the texts in the data set, tr is the trace of a matrix, and N is the total number of news texts.
Normalized mutual information (NMI) measures the similarity between two data distributions; for the clustering task, it measures the similarity between the cluster labels and the clustering result:
NMI(s, s′) = 2 · MI(s; s′) / (H(s) + H(s′))
where MI(·) is mutual information, H(·) is information entropy, and NMI ∈ [0,1], with larger values indicating better clustering. A sketch of both metrics follows.
To make the comparison convincing, six baseline document representations are selected: two based on the vector space model, one based on a topic model, and three based on distributed word-vector representations; each is clustered with k-means and compared against the proposed method. The feature dimension of the vector space model baselines is 2000; the dimensions of the remaining baselines are all 300. The distributed document representation methods use the same compressed texts as the method herein. Specifically: (1) TFIDF-1: each word in the document is a feature item, weighted by TF-IDF; (2) TFIDF-2: context words with window size 2 are feature items, weighted by TF-IDF; (3) LDA: the document representation is obtained from a topic model; (4) MeanWV (Mean Word Embedding): the average word vector of the document; (5) TWE (Topical Word Embedding): the document is represented by the average of concatenated topic and word vectors; (6) TopicVec: the document is represented by the concatenation of the document topic vector and the average word vector.
The invention adopts the following hyperparameter settings: (1) for the document compression module, the number of sentences extracted by each summarizer is set to 3, and the number of sentences synthesized from the multiple summaries is also 3; (2) for the convolutional autoencoder module, the input word vectors are 300-dimensional; three kernel heights are used, 3, 4, and 5, with 100 feature maps each; the optimizer is Adam with learning rate 0.01 and L2 regularization weight 0.00001; (3) for the clustering module, the case element embeddings are 300-dimensional; the number of iterations is set to 25, and the clustering loss is not used to optimize the network in the first 5 rounds; the hyperparameter balancing the clustering loss and the autoencoder loss is set to 0.1.
Table 3 compares the clustering performance of the proposed method against the baselines with 4, 5, and 6 cases. The experimental results show that the method outperforms the baselines on both the accuracy and the normalized mutual information metrics.
Table 3 comparison of experimental results for the methods herein and the baseline method
[Table 3 is provided as an image in the original document.]
From the experimental results in Table 3, the LDA-based text representation clusters relatively poorly. The main reason is a task mismatch: the goal here is to gather news texts of the same case into the same case cluster, so each case corresponds to a single topic, whereas LDA assumes each news text contains multiple topics, which makes the clustering results unsatisfactory. The clustering methods based on vector space text representations achieve good results: in case-related public opinion data, news texts of different cases differ considerably, and TF-IDF, which measures how representative a word is of a document, distinguishes documents well; TFIDF-2 in particular considers 2-gram features and captures part of the context information. The distributed document representation methods, which represent documents by word embeddings or topic embeddings, achieve performance close to TFIDF-2.
The method herein uses a convolutional autoencoder to extract features and compose the semantics of the text, so the text representation carries n-gram features; meanwhile, the clustering loss guides the model to learn a text representation better suited to the task, and initializing the cluster centers with case elements guides the clustering process. The method is superior to the baselines on both average metrics. For example, with 6 cases, the accuracy of the method improves over TFIDF-2 by 4.16% and the normalized mutual information by 9.20%.
While the present invention has been described in detail with reference to the embodiments, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. A news and case correlation analysis method based on case element guidance and deep clustering, characterized by comprising the following steps:
step1, compressing the case-related news text by using a plurality of summarization technologies;
step2, representing the case by using the mean value of the case element word vectors to obtain the vectorization representation of the case;
step3, the compressed news text data is passed through a convolution self-encoder to obtain a text vectorization representation;
step4, utilizing vectorization representation of cases to initialize clustering centers, unifying text vectorization representation and clustering processes into the same frame, and alternately updating self-encoder parameters and clustering model parameters to realize text clustering;
step3 comprises the following steps:
constructing a word vector matrix for the compressed document, selecting a convolution self-encoder, and training a network by utilizing reconstruction loss and clustering loss;
in Step4, the text is clustered during the forward pass of the convolutional autoencoder.
2. The news and case correlation analysis method based on case element guidance and deep clustering as claimed in claim 1, wherein: in Step1, several summarization methods are adopted to extract summaries of the news texts, a voting method is used to synthesize the summaries, and the important sentences are extracted to represent the text, realizing text compression.
3. The news and case correlation analysis method based on case element guidance and deep clustering as claimed in claim 1 or 2, wherein: the specific steps of Step1 are as follows:
Step1.1, the multi-summary text compression task is first formalized as follows: let a news text be S = {S_1, S_2, ..., S_p}, containing p sentences in total, and let the summaries generated by q methods be L_{1v}, L_{2v}, ..., L_{qv}, abbreviated L_{1v}:L_{qv}, where each summary contains v sentences and the summaries together contain o distinct sentences; the goal is to select z sentences from L_{1v}:L_{qv} as the compressed text;
defining the i-th summarization method as f_i(·), then:
L_{iv} = f_i(S)   (1)
here, 7 summarization methods are used to summarize the news text: Lead, Luhn, LSA, LexRank, TextRank, SumBasic, and KL-Sum, so i ∈ [1,7], i.e., q = 7;
and selecting z sentences with the highest frequency of occurrence in the plurality of abstracts as compressed texts.
4. The news and case correlation analysis method based on case element guidance and deep clustering of claim 1, wherein: the Step2 includes:
If E_r = {e_1, e_2, ..., e_m} is the case element set of the r-th case, containing m case elements in total, each case element e_i can be characterized as a d-dimensional word vector w_i, i.e., E_r = {w_1, w_2, ..., w_m};
then the case is vectorized as the mean of the word vectors of its case elements: let Cen_r ∈ R^d be the vectorized representation of the r-th case, computed as:
Cen_r = (1/m) Σ_{i=1}^{m} w_i   (2)
assuming there are k cases in total and using Cen to denote the set of case representations:
Cen = {Cen_1, ..., Cen_r, ..., Cen_k}   (3).
5. The news and case correlation analysis method based on case element guidance and deep clustering of claim 1, wherein: the cluster centers are iterated by updating each center as a combination of the previous cluster center and the currently newly assigned cluster center.
CN202010166279.6A 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering Active CN111831820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010166279.6A CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010166279.6A CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Publications (2)

Publication Number Publication Date
CN111831820A CN111831820A (en) 2020-10-27
CN111831820B true CN111831820B (en) 2022-07-19

Family

ID=72913341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010166279.6A Active CN111831820B (en) 2020-03-11 2020-03-11 News and case correlation analysis method based on case element guidance and deep clustering

Country Status (1)

Country Link
CN (1) CN111831820B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408266A (en) * 2020-12-02 2021-09-17 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN113191411B (en) * 2021-04-22 2023-02-07 杭州卓智力创信息技术有限公司 Electronic sound image file management method based on photo group
CN113158079B (en) * 2021-04-22 2022-06-17 昆明理工大学 Case public opinion timeline generation method based on difference case elements
CN115269768A (en) * 2021-04-29 2022-11-01 京东科技控股股份有限公司 Element text processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898180A (en) * 2018-06-28 2018-11-27 中国人民解放军国防科技大学 Depth clustering method for single-particle cryoelectron microscope images
CN109272992A (en) * 2018-11-27 2019-01-25 北京粉笔未来科技有限公司 A kind of spoken language assessment method, device and a kind of device for generating spoken appraisal model
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
CN110533545A (en) * 2019-07-12 2019-12-03 长春工业大学 Side community discovery algorithm based on the sparse self-encoding encoder of depth
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 News and case similarity calculation method based on asymmetric twin network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019008580A1 (en) * 2017-07-03 2019-01-10 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Method and system for enhancing a speech signal of a human speaker in a video using visual information
US10699697B2 (en) * 2018-03-29 2020-06-30 Tencent Technology (Shenzhen) Company Limited Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898180A (en) * 2018-06-28 2018-11-27 中国人民解放军国防科技大学 Depth clustering method for single-particle cryoelectron microscope images
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
CN109272992A (en) * 2018-11-27 2019-01-25 北京粉笔未来科技有限公司 A kind of spoken language assessment method, device and a kind of device for generating spoken appraisal model
CN110533545A (en) * 2019-07-12 2019-12-03 长春工业大学 Side community discovery algorithm based on the sparse self-encoding encoder of depth
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 News and case similarity calculation method based on asymmetric twin network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A deep convolutional auto-encoder with embedded clustering; A. Alqahtani et al.; 2018 25th IEEE International Conference on Image Processing; 2018-09-06; 4058-4062 *
Spatial Fuzzy Clustering and Deep Auto-encoder for Unsupervised Change Detection in Synthetic Aperture Radar Images; Y. Li et al.; IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium; 2018-11-05; 4479-4482 *
Unsupervised multi-manifold clustering by learning deep representation; Chen D et al.; Proceedings of the 31st AAAI Conference on Artificial Intelligence; 2017-04-01; 385-391 *
Deep convolutional self-encoding image clustering algorithm; Xie Juanying et al.; Journal of Frontiers of Computer Science and Technology; 2018-06-29; Vol. 13, No. 4; 586-595 *

Also Published As

Publication number Publication date
CN111831820A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN111831820B (en) News and case correlation analysis method based on case element guidance and deep clustering
Da'u et al. Recommendation system exploiting aspect-based opinion mining with deep learning method
Wu et al. Learning to extract coherent summary via deep reinforcement learning
US20170249387A1 (en) Methods and systems for investigation of compositions of ontological subjects and intelligent systems therefrom
CN105183833B (en) Microblog text recommendation method and device based on user model
CN109189925A (en) Term vector model based on mutual information and based on the file classification method of CNN
US20120253792A1 (en) Sentiment Classification Based on Supervised Latent N-Gram Analysis
CN106156023B (en) Semantic matching method, device and system
López-Sánchez et al. Hybridizing metric learning and case-based reasoning for adaptable clickbait detection
Li et al. Tourism review sentiment classification using a bidirectional recurrent neural network with an attention mechanism and topic-enriched word vectors
CN108319734A (en) A kind of product feature structure tree method for auto constructing based on linear combiner
Gupta et al. Text Categorization with Knowledge Transfer from Heterogeneous Data Sources.
Jiang et al. KSCB: A novel unsupervised method for text sentiment analysis
Karimi et al. Global least squares method based on tensor form to solve linear systems in Kronecker format
CN110705247A (en) Based on x2-C text similarity calculation method
Jiang et al. Semi-supervised unified latent factor learning with multi-view data
Simchoni et al. Integrating random effects in deep neural networks
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
Fitrianah et al. Extractive text summarization for scientific journal articles using long short-term memory and gated recurrent units
CN112883229B (en) Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
Ma et al. Clustering and integrating of heterogeneous microbiome data by joint symmetric nonnegative matrix factorization with laplacian regularization
CN112015760B (en) Automatic question-answering method and device based on candidate answer set reordering and storage medium
CN109902273A (en) The modeling method and device of keyword generation model
Chu et al. Refined SBERT: Representing sentence BERT in manifold space
CN113221531A (en) Multi-model dynamic collaborative semantic matching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant