CN112667810A - Document clustering device, electronic apparatus, and storage medium - Google Patents


Info

Publication number: CN112667810A (application number CN202011572311.7A)
Authority: CN (China)
Prior art keywords: clustered, documents, document, clustering, similarity
Legal status: Pending
Application number: CN202011572311.7A
Other languages: Chinese (zh)
Inventor: 柴玲
Current and original assignee: Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority application: CN202011572311.7A
PCT application: PCT/CN2021/082726 (published as WO2022134343A1)
Publication: CN112667810A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/35 — Clustering; Classification
    • G06F 18/00 — Pattern recognition
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/258 — Heading extraction; Automatic titling; Numbering
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates

Abstract

The application relates to the technical field of artificial intelligence, and in particular to a document clustering method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: obtaining N documents to be clustered, where N is an integer greater than 1; determining the co-citation similarity between any two of the N documents to be clustered; performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, where the M clusters contain K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N; and performing a second clustering on the remaining (N - K) documents to be clustered, so as to fuse the (N - K) documents to be clustered into the M clusters. The method and apparatus help improve the precision of document clustering.

Description

Document clustering device, electronic apparatus, and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a document clustering method and apparatus, an electronic device, and a storage medium.
Background
At present, similarity based on citation relationships measures the topical similarity between documents well. To supplement it, the text similarity between documents is often introduced, so that topical similarity is measured comprehensively, that is, several indicators are placed in the same space to measure the similarity of document topics. Once the similarity between document topics has been measured, the documents can be clustered with a single clustering algorithm or a community detection algorithm.
However, after text similarity is introduced, the resulting clustering network becomes very dense, which coarsens the clustering granularity and reduces the precision of document clustering.
Disclosure of Invention
The embodiments of the application provide a document clustering method and apparatus, an electronic device, and a storage medium, which improve clustering precision through two rounds of clustering.
In a first aspect, an embodiment of the present application provides a document clustering method, including:
obtaining N documents to be clustered, where N is an integer greater than 1;
determining the co-citation similarity between any two of the N documents to be clustered;
performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, where the M clusters contain K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
and performing a second clustering on the remaining (N - K) documents to be clustered, so as to fuse the (N - K) documents to be clustered into the M clusters.
In a second aspect, an embodiment of the present application provides a document clustering apparatus, including:
an acquisition unit, configured to acquire N documents to be clustered, where N is an integer greater than 1;
a processing unit, configured to determine the co-citation similarity between any two of the N documents to be clustered;
perform a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, where the M clusters contain K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
and perform a second clustering on the remaining (N - K) documents to be clustered, so as to fuse the (N - K) documents to be clustered into the M clusters.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor coupled to a memory, the memory configured to store a computer program, the processor configured to execute the computer program stored in the memory to cause the electronic device to perform the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that causes a computer to perform the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.
The embodiment of the application has the following beneficial effects:
It can be seen that, in the embodiments of the application, the N documents to be clustered are first clustered according to the co-citation similarity between documents to obtain M clusters; because co-citation similarity yields comparatively accurate clusters, the precision of the M clusters is high. Then, without changing the clustering granularity, a second clustering is performed on the remaining documents that fall outside the M clusters, fusing them into the M clusters; thus the clustering granularity is not coarsened, and the clustering precision is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the application, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic flow chart of a document clustering method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a naming method for cluster topics provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a constructed undirected graph according to an embodiment of the present application;
fig. 4 is a block diagram illustrating functional units of a document clustering device according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a document clustering device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of a document clustering method according to an embodiment of the present application. The method is applied to a document clustering apparatus and comprises the following steps:
101: The document clustering apparatus obtains N documents to be clustered, where N is an integer greater than 1.
Illustratively, a document to be clustered may be a medical document, a patent document, an academic document, or the like. The application does not limit the type of the documents to be clustered.
102: The document clustering apparatus determines the co-citation similarity between any two of the N documents to be clustered.
Illustratively, a first number is determined of documents among the N documents to be clustered that cite a first document to be clustered, and a second number is determined of documents that cite a second document to be clustered; then a third number is determined of documents among the N documents to be clustered that cite both the first and the second document to be clustered; finally, the co-citation similarity between the first and the second document to be clustered is determined from the first, second, and third numbers, where the first and second documents to be clustered are any two of the N documents to be clustered. The co-citation similarity between the first and the second document to be clustered can thus be represented by formula (1):
sim(a1, a2) = |X ∩ Y| / sqrt(|X| · |Y|)   formula (1)

where a1 is the first document to be clustered, a2 is the second document to be clustered, sim(a1, a2) is the co-citation similarity between the first and the second document to be clustered, X is the set of documents citing a1, Y is the set of documents citing a2, |X ∩ Y| is the third number, |X| is the first number, and |Y| is the second number.
It should be understood that if no document among the N documents to be clustered cites both the first and the second document to be clustered, the co-citation similarity between them is 0, that is, there is no co-citation relationship between the two documents.
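As a concrete reading of step 102, the following minimal Python sketch computes the co-citation similarity from the three counts defined above. The cosine-style normalization over the two citing sets is an assumption (formula (1) is garbled in the source, which fixes only the three counts), and the `cited_by` mapping is an illustrative data shape, not from the patent.

```python
from math import sqrt

def cocitation_similarity(cited_by: dict, a1: str, a2: str) -> float:
    """Co-citation similarity between documents a1 and a2.

    cited_by[d] is the set of documents (among the N documents to be
    clustered) that cite document d.
    """
    x = cited_by.get(a1, set())   # documents citing a1 -> |X| is the first number
    y = cited_by.get(a2, set())   # documents citing a2 -> |Y| is the second number
    both = x & y                  # documents citing both -> |X ∩ Y|, the third number
    if not both:                  # no co-citation relationship between the two
        return 0.0
    return len(both) / sqrt(len(x) * len(y))
```

If nothing cites both documents, the function returns 0, matching the remark above that such a pair has no co-citation relationship.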
103: the literature clustering device carries out first clustering on N literatures to be clustered according to the common introduced similarity between any two literatures to be clustered to obtain M clustering clusters, wherein the M clustering clusters correspond to K literatures to be clustered, M is an integer larger than or equal to 1, and K is an integer smaller than or equal to N.
Exemplarily, the community networks corresponding to the N documents to be clustered can be determined according to the co-introduced similarity between any two documents; and then, carrying out first clustering on the N documents to be clustered according to the common introduced similarity between any two documents to be clustered, a community detection algorithm and the community network to obtain M clustering clusters. In the application, the common quoted similarity between any two documents to be clustered is taken as the weight between the two documents (namely the weight of each edge in the community network); then, taking each document as an independent community, merging some communities in the N communities based on the principle of modularity minimization to obtain M communities, namely M clustering clusters, wherein the M clustering clusters totally contain K documents to be clustered, namely, the remaining (N-K) documents to be clustered are not clustered.
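The first clustering step can be sketched as follows. This is a deliberately simplified stand-in: the patent uses modularity-based community detection on the weighted co-citation network, while the sketch below merely groups documents connected by positive co-citation edges (connected components via union-find), which is enough to show how singletons remain as the (N - K) unclustered documents. Function and variable names are illustrative.

```python
def first_clustering(sim: dict, docs: list) -> list:
    """Group documents joined by positive co-citation similarity.

    sim maps (doc_a, doc_b) pairs to their co-citation similarity; docs is
    the list of N document ids. Returns clusters of size >= 2; documents in
    no cluster correspond to the remaining (N - K) documents.
    """
    parent = {d: d for d in docs}

    def find(d):
        # union-find with path halving
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), w in sim.items():
        if w > 0:                       # only co-cited pairs are linked
            parent[find(a)] = find(b)

    groups = {}
    for d in docs:
        groups.setdefault(find(d), set()).add(d)
    return [g for g in groups.values() if len(g) > 1]
```

A real implementation would replace the component grouping with a community detection algorithm (e.g. modularity optimization over the weighted graph), but the input/output shape is the same.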
104: and the literature clustering device carries out secondary clustering on the rest (N-K) literatures to be clustered so as to fuse the (N-K) literatures to be clustered into M clustering clusters.
Exemplarily, constructing a feature vector for each document to be clustered in the N documents to be clustered; and secondly clustering the (N-K) documents to be clustered according to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the feature vector of each document to be clustered in the K documents to be clustered so as to fuse the (N-K) documents to be clustered into the M clustering clusters.
Illustratively, the title and the abstract of each document to be clustered in the N documents to be clustered are subjected to phrase extraction, and the extracted phrases of the N documents to be clustered are subjected to de-duplication to obtain P phrases, wherein P is an integer greater than 1, and the phrases of each document can be extracted from the topic and the abstract of each document by using a language processing tool kit stanford NLP. Then, determining the word frequency-inverse text frequency (TF-IDF) of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered; and finally, constructing a feature vector for each document to be clustered in the N documents to be clustered according to the TD-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered.
Specifically, under the condition that the document i to be clustered comprises the jth phrase in the P phrases, setting the value of the jth dimension in the feature vector of the document i to be clustered as the TD-IDF of the jth phrase relative to the document i to be clustered; and under the condition that the document i to be clustered does not include the jth phrase, setting the value of the jth dimension in the feature vector of the document i to be clustered as a preset value, for example, 0. And if the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, the jth phrase is any one of the P phrases, and the value of j is an integer from 1 to P.
The feature vector of each document can therefore be represented by formula (2):

V_i = (W_1, …, W_j, …, W_P)   formula (2)

where V_i is the feature vector of document i to be clustered, W_1 through W_P are the values of its dimensions 1 through P, and W_j is the TF-IDF of phrase j relative to document i (or the preset value when document i does not contain phrase j).
In one embodiment of the application, the TF-IDF of each phrase relative to each of the N documents to be clustered can be represented by formula (3):

TF-IDF_{j,i} = (n_{j,i} / n_i) × log(N / N_j)   formula (3)

where n_{j,i} is the number of occurrences of phrase j among the phrases extracted from the title and abstract of document i, n_i is the total number of phrases extracted from the title and abstract of document i, N is the number of documents to be clustered, and N_j is the number of documents among the N documents to be clustered that contain phrase j.
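Formulas (2) and (3) together can be sketched in a few lines of Python. The list-of-phrase-lists input shape is an assumption made for illustration: `doc_phrases[i]` stands for the phrases extracted from the title and abstract of document i, with duplicates kept so that term frequency is meaningful.

```python
from math import log

def build_feature_vectors(doc_phrases: list) -> list:
    """Per formulas (2)-(3): dimension j of document i's vector is the
    TF-IDF of phrase j relative to document i, or 0 if the phrase is absent."""
    n_docs = len(doc_phrases)
    # the P deduplicated phrases, in a fixed order
    vocab = sorted({p for doc in doc_phrases for p in doc})
    # N_j: number of documents containing phrase j
    df = {p: sum(1 for doc in doc_phrases if p in doc) for p in vocab}
    vectors = []
    for doc in doc_phrases:
        total = len(doc)  # total phrases extracted from this document (n_i)
        vec = []
        for p in vocab:
            if p in doc and total:
                tf = doc.count(p) / total            # n_{j,i} / n_i
                vec.append(tf * log(n_docs / df[p])) # × log(N / N_j)
            else:
                vec.append(0.0)                      # preset value for absent phrases
        vectors.append(vec)
    return vectors
```

Note that a phrase appearing in every document gets IDF log(N/N) = 0, so its dimension is 0 everywhere, which is the expected TF-IDF behavior.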
Further, after the feature vector of each document to be clustered has been constructed, the similarity between the feature vector of document q to be clustered and the feature vector of document e to be clustered can be determined, where document q is any one of the (N - K) documents and document e is any one of the K documents. Traversing the K documents yields K similarities between document q and the K clustered documents. Then, H documents are selected from the K documents in descending order of similarity, where H is a preset integer greater than 1. Finally, the number of these H documents belonging to each of the M clusters is determined, and document q is fused into the target cluster, namely the cluster among the M clusters that contains the largest number of the H documents.
For example, suppose 10 documents are selected from the K clustered documents, i.e., H = 10, and the clusters are cluster 1, cluster 2, and cluster 3. If 3 of the 10 documents come from cluster 1, 4 from cluster 2, and 3 from cluster 3, then cluster 2 contains the largest number of the 10 documents, so document q is fused into cluster 2.
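The fusion step above is a nearest-neighbor majority vote, which can be sketched as follows. Cosine similarity between feature vectors is an assumption (the patent does not fix the vector similarity measure), and the `clustered` input shape pairing each vector with a cluster id is illustrative.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def fuse_document(q_vec: list, clustered: list, h: int) -> int:
    """Second-clustering step for one remaining document q: rank the
    already-clustered documents by similarity to q, keep the top H, and
    return the cluster id that occurs most often among them (majority
    vote, as in the H = 10 example above)."""
    ranked = sorted(clustered, key=lambda dc: cosine(q_vec, dc[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:h])
    return votes.most_common(1)[0][0]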
It can be seen that, in the embodiments of the application, the N documents to be clustered are first clustered according to the co-citation similarity between documents to obtain M clusters; because co-citation similarity yields comparatively accurate clusters, the precision of the M clusters is high. Then, without changing the clustering granularity, a second clustering is performed on the remaining documents that fall outside the M clusters, fusing them into the M clusters; thus the clustering granularity is not coarsened, and the clustering precision is improved.
In an embodiment of the application, after the N documents to be clustered have been clustered, each cluster can be given a topic name, which facilitates subsequent categorization of the clusters. The following takes cluster i as an example to explain the topic-naming process, where cluster i is any one of the M clusters; the other clusters are named in a similar way and are not described again.
Referring to fig. 2, fig. 2 is a schematic flow chart of a cluster topic naming method provided in an embodiment of the present application. The method is applied to a document clustering apparatus and comprises the following steps:
201: Determine the target documents to be clustered in cluster i according to the co-citation similarity between any two documents to be clustered in cluster i.
Illustratively, the score of each document to be clustered in cluster i is determined according to the co-citation similarity between any two documents in cluster i, where the score of a document indicates its importance, i.e., its quality. The target documents of cluster i are then determined in descending order of score; for example, a preset proportion of documents can be selected from each cluster in descending order of score as the target documents. If a cluster contains 100 documents and the preset proportion is 10%, the top ten documents by score are selected as that cluster's target documents.
Specifically, an undirected graph corresponding to cluster i is determined according to the co-citation similarity between any two documents in cluster i: documents with a co-citation relationship are connected, and documents without one are not. The score of each node (i.e., each document to be clustered) in the undirected graph is then determined from the undirected graph and the PageRank algorithm, giving the score of each document in cluster i; that is, the score of each document is determined from the paths between it and the other documents. Specifically, the score of each document is determined from the co-citation similarities between it and the other documents, a preset parameter, and the number of intervening documents along each path.
For example, suppose cluster i contains document A, document B, document C, and document D to be clustered, where document D is not co-cited with the other documents but is a document fused in at the second clustering; the undirected graph shown in fig. 3 can thus be established. The scores of documents A, B, C, and D are determined from the PageRank algorithm and the undirected graph. Illustratively, the score of document A is the sum of the contribution along the path from A to B and the contribution along the path from A to C, and can be represented by formula (4):
S = 1 · γ · sim(A, B) + 1 · γ² · sim(A, C)   formula (4)

where S is the score of document A to be clustered, sim(A, B) is the co-citation similarity between documents A and B, sim(A, C) is the co-citation similarity between documents A and C, and γ is a preset parameter with 0 ≤ γ ≤ 1.
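Formula (4) decays each path contribution by γ raised to the path length, which can be sketched directly. The `paths` shape, mapping each reachable document to a (path length in edges, co-citation similarity) pair, and the default value of γ are both assumptions made for illustration.

```python
def document_score(paths: dict, gamma: float = 0.85) -> float:
    """Score of one document per formula (4): each reachable document at
    distance d in the undirected co-citation graph contributes gamma**d
    times the co-citation similarity along that path.

    paths maps each other document to (distance, similarity); e.g. for
    document A in fig. 3, B is at distance 1 and C at distance 2.
    """
    return sum((gamma ** dist) * sim for dist, sim in paths.values())
```

With γ < 1, documents further away in the co-citation graph contribute less to the score, so well-connected documents score highest and are picked as target documents.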
202: and determining L phrases belonging to the cluster i in the P phrases, wherein L is an integer less than or equal to P.
203: and determining a theme corresponding to the clustering cluster i according to the target document to be clustered and the L phrases.
Illustratively, word embedding can be performed on the titles of the target documents in cluster i to obtain a first feature vector corresponding to cluster i. The embedding can be produced by a BioBERT model trained with medical-domain documents as its corpus, so that the model handles medical-domain language accurately and extracts document semantics precisely; the BioBERT model can be trained in a supervised manner, which is not described further.
It should be understood that when there is a single target document, the feature vector obtained by embedding its title serves as the first feature vector; when there are multiple target documents, the title of each is embedded to obtain a feature vector per document, and the first feature vector is obtained by averaging these feature vectors position-wise.
Further, word embedding is performed on each of the L phrases to obtain the second feature vector of each phrase (again via the above BioBERT model, not described further); then each word of each phrase is embedded to obtain a third feature vector per word, and the fourth feature vector of each phrase is determined from the third feature vectors of its words, that is, the third feature vectors of the words in a phrase are averaged position-wise, and the resulting vector serves as the phrase's fourth feature vector. For example, the four words in the phrase "lung cancer survival rate" are each embedded to obtain four feature vectors, and averaging these four vectors position-wise yields the fourth feature vector of the phrase.
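The position-wise averaging used for both the first and the fourth feature vectors is the same operation, sketched below. The word vectors would in practice come from an embedding model such as BioBERT; the plain lists here are placeholders.

```python
def average_vectors(word_vecs: list) -> list:
    """Position-wise (element-wise) average of equally sized vectors,
    e.g. the third feature vectors of a phrase's words, yielding the
    phrase's fourth feature vector."""
    dim = len(word_vecs[0])
    n = len(word_vecs)
    return [sum(v[k] for v in word_vecs) / n for k in range(dim)]
```

The same helper averages the per-title vectors of multiple target documents into the cluster's first feature vector.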
Finally, the topic of each cluster is determined from the first feature vector of cluster i, the second feature vector of each phrase, the fourth feature vector of each phrase, and the TF-IDF of each phrase relative to cluster i, where the TF-IDF of a phrase relative to cluster i is the average of its TF-IDF values relative to each document to be clustered in cluster i.
Illustratively, a first similarity is determined between the first feature vector of cluster i and the second feature vector of each phrase; a second similarity is determined between the first feature vector of cluster i and the fourth feature vector of each phrase; finally, a third similarity between cluster i and each phrase is determined from that phrase's first similarity, second similarity, and TF-IDF relative to cluster i. For example, the first similarity, the second similarity, and the TF-IDF can be weighted to obtain the third similarity.
Illustratively, the similarities can be cosine similarities between vectors, so the third similarity can be expressed by formula (5):

sim(phr, cluster) = β · cossim(vec1, vec2) + (1 − β) · cossim(vec1, vec4) + (1 − β) · TF-IDF   formula (5)

where sim(phr, cluster) is the third similarity between cluster i and each phrase, cossim is the cosine-similarity operation, vec1 is the first feature vector of cluster i, vec2 is the second feature vector of the phrase, vec4 is the fourth feature vector of the phrase, and β is a preset parameter with 0 ≤ β ≤ 1.
Then a fourth similarity between any two of the L phrases is determined from their second feature vectors. The fourth similarity can also be a cosine similarity, and can therefore be represented by formula (6):

sim(phr1, phr2) = cossim(vec2_1, vec2_2)   formula (6)

where phr1 and phr2 are any two of the L phrases, sim(phr1, phr2) is the fourth similarity between them, vec2_1 is the second feature vector of phr1, and vec2_2 is the second feature vector of phr2.
Finally, the topic of cluster i is determined from the third similarity between cluster i and each phrase and the fourth similarity between any two phrases.
Illustratively, the phrase with the largest third similarity is taken as a target phrase and moved from the L phrases to a target phrase set. Then the maximal marginal relevance (MMR) score of each remaining phrase is determined from its third similarity with cluster i and its fourth similarity with each phrase in the target phrase set; for example, for each remaining phrase, a fifth similarity is obtained with respect to each target phrase in the set from that phrase's third similarity with cluster i and its fourth similarity with the target phrase, and the largest fifth similarity is taken as the phrase's MMR score. The phrase with the largest MMR score among the remaining phrases is then moved from the remaining phrases to the target phrase set. The MMR scores of the remaining phrases are then recomputed, the highest-scoring phrase is moved to the target phrase set, and so on iteratively until the number of target phrases in the set reaches a preset number; the iteration then stops, and the target phrases in the set serve as the topic of cluster i.
Illustratively, the MMR score of each remaining phrase can be represented by formula (7):

MMR_i = α · sim(phr_i, cluster) − (1 − α) · max_{phr_j ∈ K} sim(phr_i, phr_j)   formula (7)

where PHR denotes the candidate phrase set of cluster i, K is the target phrase set, phr_i ∈ PHR \ K denotes the i-th remaining phrase, MMR_i is the MMR score of the i-th phrase, phr_j ∈ K denotes the j-th phrase in the target phrase set, sim(phr_i, cluster) is the third similarity between the i-th phrase and cluster i, sim(phr_i, phr_j) is the fourth similarity between the i-th and j-th phrases, the max means that the maximum value over the phrases in the target phrase set is taken, and α is a preset parameter. After traversing the remaining phrases, the MMR score of each of them is obtained.
For example, suppose the candidate phrase set of a certain cluster includes phrase A, phrase B, phrase C, phrase D and phrase E, and the third similarity between phrase A and the cluster is the largest. Phrase A is first taken as a target phrase and moved from the L phrases to the target phrase set; the remaining phrases are then phrase B, phrase C, phrase D and phrase E. Next, the MMR score of each remaining phrase is calculated: the third similarity between each phrase and the cluster and the fourth similarity between each phrase and phrase A are substituted into equation (7) to obtain the MMR scores of phrase B, phrase C, phrase D and phrase E, respectively. Assuming that the MMR score of phrase B is the largest, phrase B is moved from the candidate set to the target phrase set, and the remaining phrases are phrase C, phrase D and phrase E. Then, for each remaining phrase, the third similarity between the phrase and the cluster and the fourth similarity between the phrase and phrase A are substituted into equation (7) to obtain a fifth similarity corresponding to phrase A, the third similarity between the phrase and the cluster and the fourth similarity between the phrase and phrase B are substituted into equation (7) to obtain a fifth similarity corresponding to phrase B, and the larger of the two is taken as the MMR score of the phrase. Determining the MMR score of each remaining phrase in turn gives the MMR scores of phrase C, phrase D and phrase E. Assuming that the MMR score of phrase C is the largest, phrase C is moved to the target phrase set. If the preset number is three, the target phrase set now contains three phrases, so the iteration stops, and phrase A, phrase B and phrase C are taken as the subject of the cluster.
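The iterative MMR selection described above can be sketched in Python as follows. The phrase names, the similarity values and the value of α are illustrative assumptions, not values from the embodiment; sim_cluster holds the third similarities and sim_phrase the fourth similarities.

```python
# Sketch of the MMR-based target phrase selection; the data below
# (phrases A-E, similarity numbers, alpha) is assumed for illustration.

def mmr_select(candidates, sim_cluster, sim_phrase, preset_number, alpha=0.7):
    """Iteratively move phrases into the target phrase set.

    candidates  : the candidate phrase set PHR of the cluster
    sim_cluster : phrase -> third similarity to the cluster
    sim_phrase  : frozenset({phr_i, phr_j}) -> fourth similarity
    """
    remaining = list(candidates)
    # Seed the target set with the phrase most similar to the cluster.
    seed = max(remaining, key=lambda p: sim_cluster[p])
    targets = [seed]
    remaining.remove(seed)
    while remaining and len(targets) < preset_number:
        def mmr_score(p):
            # One fifth similarity per target phrase; the maximum is the
            # MMR score of p, as in equation (7).
            return max(alpha * sim_cluster[p]
                       - (1 - alpha) * sim_phrase[frozenset((p, t))]
                       for t in targets)
        best = max(remaining, key=mmr_score)
        remaining.remove(best)
        targets.append(best)
    return targets

phrases = ["A", "B", "C", "D", "E"]
sim_cluster = {"A": 0.9, "B": 0.6, "C": 0.5, "D": 0.4, "E": 0.3}
pair_sims = {("A", "B"): 0.1, ("A", "C"): 0.2, ("A", "D"): 0.8,
             ("A", "E"): 0.7, ("B", "C"): 0.3, ("B", "D"): 0.2,
             ("B", "E"): 0.1, ("C", "D"): 0.4, ("C", "E"): 0.5,
             ("D", "E"): 0.6}
sim_phrase = {frozenset(k): v for k, v in pair_sims.items()}
subject = mmr_select(phrases, sim_cluster, sim_phrase, preset_number=3)
```

With these assumed values the selection proceeds exactly as in the worked example: phrase A is seeded first, then phrase B and phrase C win the two MMR rounds.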
It can be seen that, in the process of naming the theme of a cluster, the similarity between phrases is considered in addition to the relationship between each phrase and the cluster itself, so as to avoid selecting repeated and redundant phrases as the subject of the cluster. In addition, each phrase is segmented, and the second similarity between each phrase and the first feature vector of the medical document cluster is determined with the word as the granularity. This mainly avoids the following problem: some long phrases are unrelated to the subject of the cluster but may frequently contain words that are related to the subject, so that when semantic features are extracted for such a long phrase as a whole, the high-frequency words cause its semantic features to appear related to the subject of the cluster, the long phrase is mistakenly taken as the subject of the cluster, and the accuracy of the extracted subject is low. By segmenting each phrase and starting from each word without considering its context, words that appear frequently but are unrelated to the subject are classified as general words; the second similarity obtained for them is smaller, so the third similarity obtained after weighting is also smaller, such phrases are not taken as the subject of the cluster, and the finally extracted subject is relatively more accurate.
In an embodiment of the present application, the document clustering method of the present application may also be applied to the field of medical technology. For example, in a case that the documents to which the present application relates are medical documents, N medical documents to be clustered may be obtained from a medical database, for example, read from the PubMed database. The N medical documents are then clustered into a plurality of clustering clusters, and each clustering cluster may be named, so that a doctor can clearly and quickly locate the medical documents to be searched, promoting the advance of medical technology.
Referring to fig. 4, fig. 4 is a block diagram illustrating functional units of a document clustering device according to an embodiment of the present application. The document clustering device 400 includes: an acquisition unit 401 and a processing unit 402, wherein:
an obtaining unit 401, configured to obtain N documents to be clustered, where N is an integer greater than 1;
the processing unit 402 is configured to determine a co-citation similarity between any two documents to be clustered in the N documents to be clustered; perform first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clustering clusters, wherein the M clustering clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N; and perform second clustering on the remaining (N-K) documents to be clustered so as to fuse the (N-K) documents to be clustered into the M clustering clusters.
In some possible embodiments, in determining the co-citation similarity between any two documents to be clustered in the N documents to be clustered, the processing unit 402 is specifically configured to:
determining a first number of documents, among the N documents to be clustered, that cite a first document to be clustered;
determining a second number of documents, among the N documents to be clustered, that cite a second document to be clustered;
determining a third number of documents, among the N documents to be clustered, that cite both the first document to be clustered and the second document to be clustered;
determining the co-citation similarity between the first document to be clustered and the second document to be clustered according to the first number, the second number and the third number;
wherein the first document to be clustered and the second document to be clustered are any two documents to be clustered in the N documents to be clustered.
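The counting above can be sketched as follows. The final Jaccard-style combination of the three numbers is an assumption, since this embodiment only states that the similarity is determined from them without fixing a formula:

```python
# Sketch of the co-citation similarity between two documents; the
# Jaccard-style ratio at the end is an assumed combination rule.

def cocitation_similarity(citing_sets, doc_a, doc_b):
    """citing_sets maps each document to the set of documents (within the
    N documents to be clustered) that cite it."""
    first_number = len(citing_sets[doc_a])                        # cite doc_a
    second_number = len(citing_sets[doc_b])                       # cite doc_b
    third_number = len(citing_sets[doc_a] & citing_sets[doc_b])   # cite both
    denominator = first_number + second_number - third_number
    return third_number / denominator if denominator else 0.0

citing = {"doc1": {"p", "q", "r"}, "doc2": {"q", "r", "s"}}
score = cocitation_similarity(citing, "doc1", "doc2")  # 2 / (3 + 3 - 2) = 0.5
```

The measure is symmetric in the two documents, which matches its use here as a pairwise similarity between any two of the N documents.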
In some possible embodiments, in performing the second clustering on the remaining (N-K) documents to be clustered to fuse the (N-K) documents to be clustered to the M clustering clusters, the processing unit 402 is specifically configured to:
constructing a feature vector for each document to be clustered in the N documents to be clustered;
and performing second clustering on the (N-K) documents to be clustered according to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the feature vector of each document to be clustered in the K documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clustering clusters.
In some possible embodiments, in constructing the feature vector for each document to be clustered in the N documents to be clustered, the processing unit 402 is specifically configured to:
performing phrase extraction and de-duplication on the title and the abstract of each document to be clustered in the N documents to be clustered to obtain P phrases, wherein P is an integer greater than 1;
determining the term frequency-inverse document frequency (TF-IDF) of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered;
and constructing a feature vector for each document to be clustered in the N documents to be clustered according to the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered.
In some possible embodiments, in terms of constructing a feature vector for each document to be clustered from the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered, the processing unit 402 is specifically configured to:
setting the value of the jth dimension in the feature vector of the document i to be clustered as TF-IDF of the jth phrase relative to the document i to be clustered under the condition that the jth phrase in the P phrases is included in the document i to be clustered, and setting the value of the jth dimension in the feature vector of the document i to be clustered as a preset value under the condition that the jth phrase is not included in the document i to be clustered;
the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is an integer from 1 to P.
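The feature-vector construction above can be sketched as follows. The IDF variant log(N / document frequency) and the use of 0.0 as the preset value for absent phrases are assumptions:

```python
import math

# Sketch of the P-dimensional feature vector construction; the IDF
# variant and the preset value 0.0 for absent phrases are assumed.

def build_feature_vectors(doc_phrases, phrases):
    """doc_phrases: one list of extracted phrases per document;
    phrases: the P deduplicated phrases shared by all documents."""
    n = len(doc_phrases)
    doc_freq = {p: sum(p in doc for doc in doc_phrases) for p in phrases}
    vectors = []
    for doc in doc_phrases:
        vec = []
        for p in phrases:
            if p in doc:
                tf = doc.count(p) / len(doc)
                idf = math.log(n / doc_freq[p])
                vec.append(tf * idf)      # jth dimension = TF-IDF
            else:
                vec.append(0.0)           # jth dimension = preset value
        vectors.append(vec)
    return vectors

vectors = build_feature_vectors([["a", "b"], ["a", "a"]], ["a", "b"])
```

Note that with this IDF variant a phrase occurring in every document contributes 0 to every vector, which is consistent with such a phrase carrying no discriminative information between clusters.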
In some possible embodiments, in terms of performing second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered to fuse the (N-K) documents to be clustered to the M clustering clusters, the processing unit 402 is specifically configured to:
determining similarity between a feature vector of a document q to be clustered and a feature vector of a document e to be clustered, wherein the document q to be clustered is any one of the (N-K) documents to be clustered, and the document e to be clustered is any one of the K documents to be clustered;
selecting H documents to be clustered from the K documents to be clustered in descending order of the similarity, wherein H is an integer greater than 1;
determining the number of the H documents to be clustered which respectively belong to each clustering cluster in the M clustering clusters;
and fusing the documents q to be clustered to a target clustering cluster, wherein the target clustering cluster is a clustering cluster containing the largest number of the H documents to be clustered in the M clustering clusters.
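The fusion step above can be sketched as follows, assuming cosine similarity between feature vectors and a plain majority vote over the H most similar clustered documents (the embodiment does not specify the similarity measure or a tie-breaking rule):

```python
# Sketch of fusing one unclustered document into an existing cluster via
# its H most similar clustered documents; cosine similarity and the
# majority-vote tie-breaking are assumptions.

def fuse_document(q_vector, clustered, h):
    """clustered: list of (feature_vector, cluster_id) pairs for the K
    documents already placed by the first clustering."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = sum(a * a for a in u) ** 0.5
        norm_v = sum(b * b for b in v) ** 0.5
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
    # Rank the K clustered documents by similarity, descending.
    ranked = sorted(clustered, key=lambda cv: cosine(q_vector, cv[0]),
                    reverse=True)
    top_clusters = [cluster_id for _, cluster_id in ranked[:h]]
    # Target cluster: the one containing the most of the H documents.
    return max(set(top_clusters), key=top_clusters.count)

target = fuse_document([1.0, 0.0],
                       [([1.0, 0.1], "c1"), ([0.9, 0.0], "c1"),
                        ([0.0, 1.0], "c2")],
                       h=2)
```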
In some possible embodiments, in terms of performing first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clustering clusters, the processing unit 402 is specifically configured to:
constructing a community network of the N documents to be clustered according to the co-citation similarity between any two documents to be clustered;
and performing first clustering on the N documents to be clustered according to the community network, the co-citation similarity between any two documents to be clustered, and a community detection algorithm to obtain the M clustering clusters.
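The embodiment does not name a particular community detection algorithm, so the sketch below uses similarity-weighted label propagation as a stand-in: each document starts as its own community and repeatedly adopts the label with the largest co-citation-weighted support among its neighbors in the community network.

```python
import random

# Stand-in for the community detection step: similarity-weighted label
# propagation over the community network. The specific algorithm is an
# assumption; the embodiment only requires some community detection.

def detect_communities(similarity, documents, rounds=20, seed=0):
    """similarity: (doc_a, doc_b) -> co-citation similarity (each unordered
    pair listed once); returns a community label per document."""
    rng = random.Random(seed)
    labels = {d: d for d in documents}
    neighbors = {d: [] for d in documents}
    for (a, b), w in similarity.items():
        if w > 0:                      # only linked documents share an edge
            neighbors[a].append((b, w))
            neighbors[b].append((a, w))
    for _ in range(rounds):
        order = list(documents)
        rng.shuffle(order)             # asynchronous updates, random order
        for d in order:
            if not neighbors[d]:
                continue
            support = {}
            for m, w in neighbors[d]:
                support[labels[m]] = support.get(labels[m], 0.0) + w
            labels[d] = max(support, key=support.get)
    return labels

labels = detect_communities({("d1", "d2"): 0.8, ("d3", "d4"): 0.6},
                            ["d1", "d2", "d3", "d4"])
```

In practice a modularity-based method such as Louvain could replace this sketch; either way, the connected groups that emerge become the M clustering clusters.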
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a transceiver 501, a processor 502, and a memory 503, which are connected to each other by a bus 504. The memory 503 is used to store computer programs and data, and may transmit the stored data to the processor 502.
The processor 502 is configured to read the computer program in the memory 503 to perform the following operations:
obtaining N documents to be clustered, wherein N is an integer greater than 1;
determining a co-citation similarity between any two documents to be clustered in the N documents to be clustered;
performing first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clustering clusters, wherein the M clustering clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
and performing second clustering on the remaining (N-K) documents to be clustered so as to fuse the (N-K) documents to be clustered into the M clustering clusters.
In some possible embodiments, in determining the co-citation similarity between any two documents to be clustered in the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:
determining a first number of documents, among the N documents to be clustered, that cite a first document to be clustered;
determining a second number of documents, among the N documents to be clustered, that cite a second document to be clustered;
determining a third number of documents, among the N documents to be clustered, that cite both the first document to be clustered and the second document to be clustered;
determining the co-citation similarity between the first document to be clustered and the second document to be clustered according to the first number, the second number and the third number;
wherein the first document to be clustered and the second document to be clustered are any two documents to be clustered in the N documents to be clustered.
In some possible embodiments, in the second clustering of the remaining (N-K) documents to be clustered to fuse the (N-K) documents to be clustered to the M clusters, the processor 502 is specifically configured to perform the following operations:
constructing a feature vector for each document to be clustered in the N documents to be clustered;
and performing second clustering on the (N-K) documents to be clustered according to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the feature vector of each document to be clustered in the K documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clustering clusters.
In some possible embodiments, in constructing the feature vector for each document to be clustered in the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:
performing phrase extraction and de-duplication on the title and the abstract of each document to be clustered in the N documents to be clustered to obtain P phrases, wherein P is an integer greater than 1;
determining the term frequency-inverse document frequency (TF-IDF) of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered;
and constructing a feature vector for each document to be clustered in the N documents to be clustered according to the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered.
In some possible embodiments, in terms of constructing a feature vector for each document to be clustered from the TF-IDF of each phrase of the P phrases relative to each document to be clustered of the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:
setting the value of the jth dimension in the feature vector of the document i to be clustered as TF-IDF of the jth phrase relative to the document i to be clustered under the condition that the jth phrase in the P phrases is included in the document i to be clustered, and setting the value of the jth dimension in the feature vector of the document i to be clustered as a preset value under the condition that the jth phrase is not included in the document i to be clustered;
the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is an integer from 1 to P.
In some possible embodiments, in terms of performing the second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered to fuse the (N-K) documents to be clustered into the M clustering clusters, the processor 502 is specifically configured to perform the following operations:
determining similarity between a feature vector of a document q to be clustered and a feature vector of a document e to be clustered, wherein the document q to be clustered is any one of the (N-K) documents to be clustered, and the document e to be clustered is any one of the K documents to be clustered;
selecting H documents to be clustered from the K documents to be clustered in descending order of the similarity, wherein H is an integer greater than 1;
determining the number of the H documents to be clustered which respectively belong to each clustering cluster in the M clustering clusters;
and fusing the documents q to be clustered to a target clustering cluster, wherein the target clustering cluster is a clustering cluster containing the largest number of the H documents to be clustered in the M clustering clusters.
In some possible embodiments, in terms of performing first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clustering clusters, the processor 502 is specifically configured to perform the following operations:
constructing a community network of the N documents to be clustered according to the co-citation similarity between any two documents to be clustered;
and performing first clustering on the N documents to be clustered according to the community network, the co-citation similarity between any two documents to be clustered, and a community detection algorithm to obtain the M clustering clusters.
Specifically, the transceiver 501 may be the obtaining unit 401 of the document clustering device 400 in the embodiment shown in fig. 4, and the processor 502 may be the processing unit 402 of the document clustering device 400 in the embodiment shown in fig. 4.
It should be understood that the document clustering device in the present application may include a smart phone (e.g., an Android phone, an iOS phone, a Windows phone, etc.), a tablet computer, a palm computer, a notebook computer, a mobile Internet device (MID), a wearable device, or the like. The above document clustering devices are merely examples, not an exhaustive list; in practical applications, the document clustering device may further include an intelligent vehicle-mounted terminal, computer equipment, and the like.
Embodiments of the present application also provide a computer storage medium, which stores a computer program, where the computer program is executed by a processor to implement part or all of the steps of any one of the document clustering methods as described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the document clustering methods as described in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, and the program may be stored in a computer-readable memory, which may include: a flash memory disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method for clustering documents, comprising:
obtaining N documents to be clustered, wherein N is an integer greater than 1;
determining a co-citation similarity between any two documents to be clustered in the N documents to be clustered;
performing first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clustering clusters, wherein the M clustering clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
and performing second clustering on the remaining (N-K) documents to be clustered so as to fuse the (N-K) documents to be clustered into the M clustering clusters.
2. The method according to claim 1, wherein the determining the co-citation similarity between any two documents to be clustered in the N documents to be clustered comprises:
determining a first number of documents, among the N documents to be clustered, that cite a first document to be clustered;
determining a second number of documents, among the N documents to be clustered, that cite a second document to be clustered;
determining a third number of documents, among the N documents to be clustered, that cite both the first document to be clustered and the second document to be clustered;
determining the co-citation similarity between the first document to be clustered and the second document to be clustered according to the first number, the second number and the third number;
wherein the first document to be clustered and the second document to be clustered are any two documents to be clustered in the N documents to be clustered.
3. The method according to claim 1 or 2, wherein the second clustering of the remaining (N-K) documents to be clustered to fuse the (N-K) documents to be clustered to the M clusters comprises:
constructing a feature vector for each document to be clustered in the N documents to be clustered;
and performing second clustering on the (N-K) documents to be clustered according to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the feature vector of each document to be clustered in the K documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clustering clusters.
4. The method according to claim 3, wherein the constructing a feature vector for each document to be clustered in the N documents to be clustered comprises:
performing phrase extraction and de-duplication on the title and the abstract of each document to be clustered in the N documents to be clustered to obtain P phrases, wherein P is an integer greater than 1;
determining the term frequency-inverse document frequency (TF-IDF) of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered;
and constructing a feature vector for each document to be clustered in the N documents to be clustered according to the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered.
5. The method according to claim 4, wherein the constructing a feature vector for each document to be clustered according to the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered comprises:
setting the value of the jth dimension in the feature vector of the document i to be clustered as the TF-IDF of the jth phrase relative to the document i to be clustered under the condition that the document i to be clustered comprises the jth phrase in the P phrases, and setting the value of the jth dimension in the feature vector of the document i to be clustered as a preset value under the condition that the document i to be clustered does not comprise the jth phrase;
the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is an integer from 1 to P.
6. The method according to claim 4 or 5, wherein the second clustering of the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered to fuse the (N-K) documents to be clustered to the M clusters comprises:
determining similarity between a feature vector of a document q to be clustered and a feature vector of a document e to be clustered, wherein the document q to be clustered is any one of the (N-K) documents to be clustered, and the document e to be clustered is any one of the K documents to be clustered;
selecting H documents to be clustered from the K documents to be clustered in descending order of the similarity, wherein H is an integer greater than 1;
determining the number of the H documents to be clustered which respectively belong to each clustering cluster in the M clustering clusters;
and fusing the documents q to be clustered to a target clustering cluster, wherein the target clustering cluster is a clustering cluster containing the largest number of the H documents to be clustered in the M clustering clusters.
7. The method according to any one of claims 1 to 6, wherein the performing first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clustering clusters comprises:
constructing a community network of the N documents to be clustered according to the co-citation similarity between any two documents to be clustered;
and performing first clustering on the N documents to be clustered according to the community network, the co-citation similarity between any two documents to be clustered, and a community detection algorithm to obtain the M clustering clusters.
8. A document clustering apparatus, comprising:
the device comprises an acquisition unit, a clustering unit and a clustering unit, wherein the acquisition unit is used for acquiring N documents to be clustered, and N is an integer greater than 1;
the processing unit is used for determining a co-citation similarity between any two documents to be clustered in the N documents to be clustered;
performing first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clustering clusters, wherein the M clustering clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
and performing second clustering on the remaining (N-K) documents to be clustered so as to fuse the (N-K) documents to be clustered into the M clustering clusters.
9. An electronic device, comprising: a processor coupled to the memory, and a memory for storing a computer program, the processor being configured to execute the computer program stored in the memory to cause the electronic device to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-7.
CN202011572311.7A 2020-12-25 2020-12-25 Document clustering device, electronic apparatus, and storage medium Pending CN112667810A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011572311.7A CN112667810A (en) 2020-12-25 2020-12-25 Document clustering device, electronic apparatus, and storage medium
PCT/CN2021/082726 WO2022134343A1 (en) 2020-12-25 2021-03-24 Document clustering method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN112667810A true CN112667810A (en) 2021-04-16

Family

ID=75410135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011572311.7A Pending CN112667810A (en) 2020-12-25 2020-12-25 Document clustering device, electronic apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN112667810A (en)
WO (1) WO2022134343A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6457028B1 (en) * 1998-03-18 2002-09-24 Xerox Corporation Method and apparatus for finding related collections of linked documents using co-citation analysis
US8566360B2 (en) * 2010-05-28 2013-10-22 Drexel University System and method for automatically generating systematic reviews of a scientific field
CN103455622B (en) * 2013-09-12 2017-02-15 广东电子工业研究院有限公司 Automatic document dimensional clustering method
CN108509481B (en) * 2018-01-18 2019-08-27 天津大学 Draw the study frontier visual analysis method of cluster altogether based on document
CN111898366B (en) * 2020-07-29 2022-08-09 平安科技(深圳)有限公司 Document subject word aggregation method and device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
WO2022134343A1 (en) 2022-06-30

Similar Documents

Publication Publication Date Title
CN108509474B (en) Synonym expansion method and device for search information
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
Yu et al. Learning term embeddings for hypernymy identification
CN110019732B (en) Intelligent question answering method and related device
CN112270178B (en) Medical literature cluster theme determination method and device, electronic equipment and storage medium
US20170193086A1 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN111931490B (en) Text error correction method, device and storage medium
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
CN108304377B (en) Extraction method of long-tail words and related device
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN111949802A (en) Construction method, device and equipment of knowledge graph in medical field and storage medium
CN112559684A (en) Keyword extraction and information retrieval method
CN111291177A (en) Information processing method and device and computer storage medium
CN111950303B (en) Medical text translation method, device and storage medium
CN111967261B (en) Cancer stage information processing method, device and storage medium
CN112307190B (en) Medical literature ordering method, device, electronic equipment and storage medium
CN113807073B (en) Text content anomaly detection method, device and storage medium
CN110162769B (en) Text theme output method and device, storage medium and electronic device
CN108846142A (en) A kind of Text Clustering Method, device, equipment and readable storage medium storing program for executing
Iwata et al. Unsupervised group matching with application to cross-lingual topic matching without alignment information
CN112287217B (en) Medical document retrieval method, medical document retrieval device, electronic equipment and storage medium
CN112667810A (en) Document clustering device, electronic apparatus, and storage medium
CN116167369A (en) Text keyword extraction method and device
CN111460808B (en) Synonymous text recognition and content recommendation method and device and electronic equipment
CN114818727A (en) Key sentence extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination