WO2022134343A1 - Document clustering method and apparatus, electronic device, and storage medium - Google Patents
- Publication number: WO2022134343A1 (application PCT/CN2021/082726)
- Authority: WIPO (PCT)
- Prior art keywords: clustered, documents, document, clusters, phrase
- Prior art date
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
- G06F40/258 — Natural language analysis; Heading extraction; Automatic titling; Numbering
- G06F40/279 — Natural language analysis; Recognition of textual entities
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
Definitions
- the present application relates to the technical field of artificial intelligence, and in particular, to a document clustering method, apparatus, electronic device and storage medium.
- In general, similarity based on citation relationships can be used to measure the similarity between the topics of documents.
- To supplement this, text similarity between documents is introduced.
- The similarity between document topics is then measured comprehensively, that is, multiple indicators are placed in the same space to measure the similarity of document topics; after the similarity between document topics is measured, a single clustering algorithm or community detection algorithm can be used to cluster the documents.
- The inventor realized that after text similarity is introduced, the resulting clustering network becomes very dense, which coarsens the clustering granularity and reduces the clustering accuracy of documents.
- the embodiments of the present application provide a document clustering method, apparatus, electronic device, and storage medium, which improve the clustering accuracy by clustering twice.
- an embodiment of the present application provides a document clustering method, including:
- acquiring N documents to be clustered, where N is an integer greater than 1; and determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered;
- performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters together correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
- a second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
- an embodiment of the present application provides a document clustering device, including:
- the acquisition unit is used to acquire N documents to be clustered, where N is an integer greater than 1;
- a processing unit configured to determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered;
- perform a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters together correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
- a second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
- an embodiment of the present application provides an electronic device, including: a processor, the processor is connected to a memory, the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory , so that the electronic device performs the following methods:
- acquiring N documents to be clustered, where N is an integer greater than 1; and determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered;
- performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters together correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
- a second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
- an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program causes a computer to execute the following method:
- acquiring N documents to be clustered, where N is an integer greater than 1; and determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered;
- performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters together correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
- a second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
- An embodiment of the present application provides a computer program product, the computer program product including a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method described in the first aspect.
- In the embodiments of the present application, the N documents to be clustered are clustered for the first time according to the co-citation similarity to obtain M clusters; because the co-citation similarity can be used to obtain more accurate clusters, the accuracy of the M clusters is high. Then, without changing the clustering granularity, the remaining documents that are not in the M clusters are clustered a second time and fused into the M clusters, so the clustering granularity is not coarsened and the clustering accuracy is improved.
- FIG. 1 is a schematic flowchart of a document clustering method provided in an embodiment of the present application.
- FIG. 2 is a schematic flowchart of a clustering theme naming method provided by an embodiment of the present application
- FIG. 3 is a schematic diagram of constructing an undirected graph according to an embodiment of the present application.
- FIG. 4 is a block diagram of functional unit composition of a document clustering apparatus provided in an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of a document clustering apparatus provided by an embodiment of the present application.
- the technical solution of the present application relates to the field of artificial intelligence technology, and can be applied to scenarios such as digital medical care to promote the construction of smart cities.
- the data involved in this application such as documents to be clustered and/or clustering results, may be stored in a database, or may be stored in a blockchain, which is not limited in this application.
- FIG. 1 is a schematic flowchart of a document clustering method provided by an embodiment of the present application. The method is applied to a document clustering device. The method includes the following steps:
- the document clustering device obtains N documents to be clustered, where N is an integer greater than 1.
- the documents to be clustered may be medical documents, patent documents, academic documents, and so on. This application does not limit the types of documents to be clustered.
- the document clustering apparatus determines the co-citation similarity between any two documents to be clustered among the N documents to be clustered.
- Specifically, a first quantity is determined: the number of the N documents to be clustered that cite the first document to be clustered; a second quantity is determined: the number of the N documents to be clustered that cite the second document to be clustered; then a third quantity is determined: the number of the N documents to be clustered that simultaneously cite the first document to be clustered and the second document to be clustered. Finally, the co-citation similarity between the first document to be clustered and the second document to be clustered is determined according to the first quantity, the second quantity and the third quantity. The first document to be clustered and the second document to be clustered are any two of the N documents to be clustered. Therefore, the co-citation similarity between the first document to be clustered and the second document to be clustered can be expressed by formula (1):
- a1 is the first document to be clustered;
- a2 is the second document to be clustered;
- sim(a1, a2) is the co-citation similarity between the first document to be clustered and the second document to be clustered;
- X is the set of documents citing a1;
- Y is the set of documents citing a2;
- |X ∩ Y| is the third quantity;
- |X| is the first quantity;
- |Y| is the second quantity.
- If none of the N documents to be clustered cites both the first document to be clustered and the second document to be clustered simultaneously, the co-citation similarity between the first document to be clustered and the second document to be clustered is 0, that is, there is no co-citation relationship between the two documents.
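As a sketch, the three quantities can be computed from citing sets in Python. The exact normalization of formula (1) is not reproduced in the text above, so the cosine-style normalization |X∩Y|/√(|X|·|Y|) used below is an assumption; it relies only on the first, second and third quantities that formula (1) is said to use.

```python
import math

def cocitation_similarity(citing_a1, citing_a2):
    """Co-citation similarity between two documents a1 and a2.

    citing_a1: set X of documents that cite a1 (first quantity |X|)
    citing_a2: set Y of documents that cite a2 (second quantity |Y|)
    The intersection |X & Y| is the third quantity. The cosine-style
    normalization below is an assumption; the patent text only names
    the three quantities that enter formula (1).
    """
    both = len(citing_a1 & citing_a2)  # third quantity
    if both == 0:
        return 0.0                     # no co-citation relationship
    return both / math.sqrt(len(citing_a1) * len(citing_a2))

# Documents d1..d4 cite a1 and/or a2:
X = {"d1", "d2", "d3"}   # documents citing a1
Y = {"d2", "d3", "d4"}   # documents citing a2
print(cocitation_similarity(X, Y))  # 2 / sqrt(9) ≈ 0.667
```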
- The document clustering device performs the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, and obtains M clusters, wherein the M clusters together correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N.
- Specifically, the community network corresponding to the N documents to be clustered can be determined according to the co-citation similarity between any two documents; then, the first clustering is performed on the N documents to be clustered according to a community detection algorithm and the community network, to obtain M clusters.
- For example, the co-citation similarity between any two documents to be clustered is used as the weight between the two documents (that is, the weight of each edge in the community network); then, each document is regarded as an independent community, and based on the principle of modularity optimization, some of the N communities are merged to obtain M communities, that is, M clusters. The M clusters contain a total of K documents to be clustered; that is to say, there remain (N-K) documents to be clustered that have not been clustered.
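The first clustering step can be sketched with networkx; its greedy modularity maximization is used here as a stand-in for the community detection procedure, since the embodiment does not name a specific algorithm. The documents, edge weights and similarity values are illustrative.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Build the community network: one node per document, one weighted edge
# per pair of documents with nonzero co-citation similarity.
G = nx.Graph()
similarities = {
    (0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.7,   # densely co-cited group
    (3, 4): 0.9, (3, 5): 0.8, (4, 5): 0.7,   # second group
    (2, 3): 0.05,                            # weak cross link
}
for (u, v), w in similarities.items():
    G.add_edge(u, v, weight=w)

# Merge single-document communities by modularity optimization; each
# resulting community is one cluster of the first clustering.
clusters = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in clusters])
```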
- the document clustering device performs a second clustering on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into M clusters.
- Specifically, a feature vector is constructed for each of the N documents to be clustered; then, according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, the second clustering is performed on the (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
- Phrase extraction is performed on the title and abstract of each of the N documents to be clustered, and the phrases extracted from the N documents are deduplicated to obtain P phrases, where P is an integer greater than 1. The phrase extraction can be implemented with the stanford NLP language processing toolkit, which extracts each document's phrases from its title and abstract.
- A feature vector is then constructed for each of the N documents to be clustered according to the P phrases.
- If the document i to be clustered contains the jth phrase, the value of the jth dimension in the feature vector of document i is set to the TF-IDF of the jth phrase relative to document i; if document i does not contain the jth phrase, the value of the jth dimension in the feature vector of document i is set to a preset value, for example 0.
- Document i is any one of the N documents to be clustered, so i is an integer from 1 to N; the jth phrase is any one of the P phrases, so j is an integer from 1 to P.
- Vi is the feature vector of the document i to be clustered;
- W1, ..., Wj, ..., WP are the values of the feature vector of document i from dimension 1 to dimension P;
- TF-IDF of phrase j is the TF-IDF of phrase j relative to document i to be clustered.
- the TF-IDF of each phrase relative to each document to be clustered in the N documents to be clustered can be expressed by formula (3):
- N is the number of documents to be clustered
- N j is the number of documents containing phrase j in the N documents to be clustered.
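The feature-vector construction can be sketched as follows; formula (3) only names N and N_j explicitly, so the standard choices TF = raw phrase count and IDF = log(N/N_j) below are assumptions, and the documents and phrases are illustrative.

```python
import math

def tfidf_vectors(docs_phrases, vocab):
    """Build a P-dimensional TF-IDF feature vector per document.

    docs_phrases: list of phrase lists, one per document (N documents)
    vocab: the P deduplicated phrases
    Dimensions for phrases absent from a document get the preset value 0.
    TF is a raw phrase count and IDF is log(N / N_j), standard choices;
    the patent's formula (3) only names N and N_j explicitly.
    """
    n = len(docs_phrases)
    # N_j: number of documents containing phrase j
    df = {p: sum(p in doc for doc in docs_phrases) for p in vocab}
    vectors = []
    for doc in docs_phrases:
        vec = []
        for p in vocab:
            if p in doc and df[p] > 0:
                tf = doc.count(p)
                vec.append(tf * math.log(n / df[p]))
            else:
                vec.append(0.0)  # preset value for missing phrases
        vectors.append(vec)
    return vectors

docs = [["lung cancer", "survival rate"],
        ["lung cancer", "chemotherapy"],
        ["gene expression"]]
vocab = ["lung cancer", "survival rate", "chemotherapy", "gene expression"]
vecs = tfidf_vectors(docs, vocab)
```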
- Specifically, the similarity between the feature vector of the document q to be clustered and the feature vector of the document e to be clustered can be determined, where document q is any one of the (N-K) documents to be clustered and document e is any one of the K documents to be clustered; the K documents are traversed, so that K similarities between document q and the K clustered documents are obtained. Then, H documents are selected from the K documents in descending order of similarity, where H is a preset integer greater than 1. Finally, the number of the H documents that belong to each of the M clusters is determined, and document q is fused into the target cluster, where the target cluster is the cluster containing the largest number of the H documents.
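The fusion of a leftover document can be sketched as a k-nearest-neighbour majority vote; cosine similarity between feature vectors is an assumption (the embodiment does not fix the similarity measure), and the vectors, cluster ids and H are illustrative.

```python
import math
from collections import Counter

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def fuse_document(q_vec, clustered, h):
    """Fuse one leftover document q into an existing cluster.

    clustered: list of (feature_vector, cluster_id) for the K documents
    already placed by the first clustering. The H most similar clustered
    documents vote; q goes to the cluster holding the most of them.
    """
    ranked = sorted(clustered, key=lambda ce: cosine(q_vec, ce[0]), reverse=True)
    votes = Counter(cluster_id for _, cluster_id in ranked[:h])
    return votes.most_common(1)[0][0]  # the target cluster

clustered = [([1.0, 0.0], "M1"), ([0.9, 0.1], "M1"), ([0.0, 1.0], "M2")]
print(fuse_document([1.0, 0.05], clustered, h=2))  # prints M1
```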
- In this way, the N documents to be clustered are clustered according to the co-citation similarity between documents to obtain M clusters; because the co-citation similarity can be used to obtain more accurate clusters, the accuracy of the M clusters is high. Then, without changing the clustering granularity, the remaining documents that are not in the M clusters are clustered a second time and fused into the M clusters, so the clustering granularity is not coarsened and the clustering accuracy is improved.
- In one embodiment, each cluster may be given a topic name, so as to facilitate subsequent classification of each cluster.
- The following takes the naming of a cluster i as an example to illustrate the naming process, where cluster i is any one of the M clusters; the other clusters are named similarly and will not be described again.
- FIG. 2 is a schematic flowchart of a method for naming a document subject provided by an embodiment of the present application. The method is applied to a document clustering device. The method includes the following steps:
- First, the score of each document to be clustered in cluster i is determined, where the score indicates the importance, that is, the quality, of each document to be clustered. Then, the target documents to be clustered in cluster i are determined in descending order of score.
- For example, a preset proportion of documents may be selected from each cluster as the target documents in descending order of score. If a cluster contains 100 documents and the preset ratio is 10%, the top ten documents by score are selected from the 100 documents as that cluster's target documents to be clustered.
- Specifically, the undirected graph corresponding to cluster i is constructed: documents to be clustered that have a co-citation relationship are connected by an edge, and documents without a co-citation relationship are not connected. The score of each node in the undirected graph (that is, each document to be clustered) is then determined according to the undirected graph and the pagerank algorithm, which yields the score of each document to be clustered in cluster i; that is, the score of each document is determined according to the paths between it and the other documents to be clustered.
- In other words, the score of each document to be clustered is determined according to the co-citation similarity between it and the other documents to be clustered, a preset parameter, and the number of intermediate documents to be clustered.
- For example, cluster i includes document A, document B, document C and document D to be clustered, where document D has no co-citation relationship with the other documents and is a document fused in at the second clustering. An undirected graph as shown in Figure 3 can therefore be established, and according to the pagerank algorithm and the undirected graph, the scores corresponding to documents A, B, C and D can be determined respectively.
- For example, the score corresponding to document A is the sum of the score between document A and document B and the score between document A and document C.
- the score corresponding to the document A to be clustered can be expressed by formula (4):
- sim(A, B) is the co-citation similarity between document A to be clustered and document B to be clustered;
- sim(A, C) is the co-citation similarity between document A to be clustered and document C to be clustered;
- λ is a preset parameter, 0 < λ < 1.
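The scoring step can be sketched as a weighted PageRank power iteration over the co-citation graph; the damping factor stands in for the preset parameter, and the graph mirrors the A/B/C/D example above with illustrative similarity values.

```python
def pagerank(edges, nodes, damping=0.85, iters=100):
    """Minimal weighted PageRank over an undirected co-citation graph.

    edges: {(u, v): weight} with weight = co-citation similarity
    The damping factor stands in for the patent's preset parameter.
    Documents fused at the second clustering (no co-citation edges,
    like document D in Figure 3) keep only the baseline score.
    """
    # symmetric adjacency with weights
    adj = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # mass flowing to n, proportional to edge weights
            incoming = sum(
                score[m] * adj[m][n] / sum(adj[m].values())
                for m in adj if n in adj[m]
            )
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new
    return score

edges = {("A", "B"): 0.8, ("A", "C"): 0.6, ("B", "C"): 0.4}
scores = pagerank(edges, nodes=["A", "B", "C", "D"])
# A sits on the strongest paths, so it scores highest; the isolated
# document D receives only the baseline (1 - damping) / N mass.
```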
- word embedding may be performed on the title of the target document to be clustered in cluster i to obtain the first feature vector corresponding to cluster i.
- The word embedding of the titles of the target documents to be clustered in cluster i can be implemented with a trained Biobert model.
- The Biobert model is obtained by training with documents in the medical field as the training corpus, so its language processing in the medical field is more accurate and it can extract the semantics of documents more precisely.
- The Biobert model can be obtained through supervised training, which will not be repeated here.
- If cluster i has a single target document, the feature vector obtained by word embedding its title is used as the first feature vector; if there are multiple target documents, word embedding can be performed on the title of each target document to obtain a feature vector for each target document, from which the first feature vector is obtained.
- Word embedding is performed on each of the L phrases to obtain the second feature vector of each phrase; the word embedding of each phrase can also be implemented with the above Biobert model and will not be described again. Then, word embedding is performed on each word of each phrase to obtain the third feature vector corresponding to each word; the fourth feature vector corresponding to each phrase is determined from the third feature vectors of its words, that is, the third feature vectors of the words in each phrase are averaged element-wise, and the resulting vector is used as the fourth feature vector of that phrase.
- For example, the four words in the phrase "lung cancer survival rate" are each word-embedded to obtain four feature vectors, and these four feature vectors are averaged element-wise to obtain the fourth feature vector corresponding to the phrase.
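The element-wise averaging can be sketched directly; the 2-dimensional word vectors below are purely illustrative stand-ins for embeddings that a model such as Biobert would produce.

```python
def phrase_vector(word_vectors):
    """Fourth feature vector of a phrase: the element-wise average of
    the (third) feature vectors of its words. A real system would take
    these word vectors from a trained embedding model; the toy 2-d
    vectors below are purely illustrative."""
    dim = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / len(word_vectors)
            for i in range(dim)]

# "lung cancer survival rate" -> four word vectors -> one phrase vector
words = {"lung": [0.2, 0.8], "cancer": [0.4, 0.6],
         "survival": [0.6, 0.2], "rate": [0.8, 0.0]}
print(phrase_vector([words[w] for w in ["lung", "cancer", "survival", "rate"]]))
# ≈ [0.5, 0.4]
```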
- According to the second feature vector of each phrase, the fourth feature vector of each phrase, and the TF-IDF of each phrase relative to cluster i, the topic corresponding to each cluster is determined, where the TF-IDF of each phrase relative to cluster i is the average of its TF-IDF relative to each document to be clustered in cluster i.
- The first similarity is determined between the first feature vector of cluster i and the second feature vector of each phrase, and the second similarity is determined between the first feature vector of cluster i and the fourth feature vector of each phrase.
- The first similarity, the second similarity and the TF-IDF may then be weighted to obtain the third similarity between cluster i and each phrase.
- the above similarity may be the cosine similarity between vectors. Therefore, the third similarity can be expressed by formula (5):
- sim(phr, cluster) = μ·cos_sim(vec1, vec2) + (1-μ)·cos_sim(vec1, vec4) + (1-μ)·TF-IDF    Formula (5)
- sim(phr, cluster) is the third similarity between cluster i and each phrase;
- cos_sim is the cosine similarity operation;
- vec1 is the first feature vector corresponding to cluster i;
- vec2 is the second feature vector corresponding to each phrase;
- vec4 is the fourth feature vector corresponding to each phrase;
- μ is a preset parameter, 0 < μ < 1.
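Formula (5) can be sketched directly; as written above, the last two terms share the (1-μ) weight, and the vectors and μ below are illustrative.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def third_similarity(vec1, vec2, vec4, tfidf, mu=0.6):
    """Formula (5): weighted combination of the phrase/cluster signals.

    vec1: first feature vector of cluster i (title embedding)
    vec2: second feature vector of the phrase (phrase embedding)
    vec4: fourth feature vector of the phrase (averaged word embeddings)
    As written in the text, the last two terms share the (1 - mu) weight.
    """
    return (mu * cos_sim(vec1, vec2)
            + (1 - mu) * cos_sim(vec1, vec4)
            + (1 - mu) * tfidf)
```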
- the fourth degree of similarity between any two phrases in the L phrases is determined.
- the fourth similarity may also be a cosine similarity. Therefore, the fourth similarity may be expressed by formula (6):
- phr1 and phr2 are any two phrases among the L phrases;
- sim(phr1, phr2) is the fourth similarity between the two phrases;
- vec21 is the second feature vector of phr1;
- vec22 is the second feature vector of phr2.
- the topic corresponding to cluster i is determined.
- First, the phrase with the highest third similarity is taken as a target phrase, and the target phrase is moved from the L phrases to the target phrase set.
- Then, the Maximum Marginal Relevance (MMR) score of each remaining phrase is determined according to the third similarity between each remaining phrase and cluster i and the fourth similarity between each remaining phrase and each target phrase in the target phrase set. For example, for each remaining phrase, each target phrase in the target phrase set yields a fifth similarity, and the maximum fifth similarity is used as that phrase's MMR score. The phrase with the largest MMR score is then moved from the remaining phrases to the target phrase set. The MMR scores of the remaining phrases are determined again, and the phrase with the largest MMR score is again moved to the target phrase set; this iterates until the number of target phrases in the target phrase set reaches a preset number, at which point the iteration stops and the target phrases in the target phrase set are used as the topic of cluster i.
- the MMR score of each phrase in the remaining phrases can be represented by formula (7):
- PHR represents the candidate phrase set corresponding to cluster i;
- K is the target phrase set;
- phr_i ∈ PHR\K represents the ith phrase among the remaining phrases;
- MMR_i is the MMR score of the ith phrase;
- phr_j ∈ K represents the jth phrase in the target phrase set;
- sim(phr_i, cluster) is the third similarity between the ith phrase and cluster i;
- argmax takes the maximum value, that is, after traversing the phrases in the target phrase set, the maximum value is taken as the MMR score of the ith phrase;
- θ is a preset parameter.
- For example, the candidate phrase set of a cluster includes phrases A, B, C, D and E, and phrase A has the largest third similarity with the cluster. Phrase A is therefore first taken as a target phrase and moved from the L phrases to the target phrase set, so the remaining phrases are B, C, D and E. Then, the MMR score of each remaining phrase is calculated, that is, the third similarity between each phrase and the cluster and the fourth similarity between each phrase and phrase A are substituted into formula (7), yielding the MMR scores of phrases B, C, D and E respectively.
- Assuming that phrase B has the largest MMR score, phrase B is moved from the remaining phrases to the target phrase set, leaving phrases C, D and E. Next, for each remaining phrase, the third similarity between the phrase and the cluster and the fourth similarity with phrase A are substituted into formula (7) to obtain one similarity, and the third similarity and the fourth similarity with phrase B are substituted into formula (7) to obtain another similarity; the largest of these similarities is used as the phrase's MMR score. Determining the MMR scores of the remaining phrases in turn gives the scores of phrases C, D and E.
- Assuming that phrase C has the largest MMR score, phrase C is moved to the target phrase set. If the preset number is three, the target phrase set now contains three phrases, so the iteration stops and phrases A, B and C are used as the topic of cluster i.
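The iterative selection can be sketched as follows. The MMR scoring here takes, for each remaining phrase, the maximum over target phrases of θ·(third similarity) − (1−θ)·(fourth similarity), which follows the textual description; the exact form of formula (7) and the similarity values below are assumptions.

```python
def select_topic_phrases(cluster_sim, phrase_sim, theta, preset_count):
    """Iterative MMR-style selection of topic phrases for one cluster.

    cluster_sim: {phrase: third similarity to the cluster}
    phrase_sim:  {(p, q): fourth similarity between two phrases}
    theta:       preset trade-off parameter of formula (7)
    Starts from the phrase most similar to the cluster, then repeatedly
    moves the remaining phrase with the largest MMR score into the
    target set until preset_count phrases are chosen.
    """
    def pair(p, q):
        return phrase_sim.get((p, q), phrase_sim.get((q, p), 0.0))

    remaining = set(cluster_sim)
    first = max(remaining, key=lambda p: cluster_sim[p])
    targets = [first]
    remaining.remove(first)
    while remaining and len(targets) < preset_count:
        def mmr(p):
            # relevance to the cluster minus redundancy with chosen phrases
            return max(theta * cluster_sim[p] - (1 - theta) * pair(p, t)
                       for t in targets)
        best = max(remaining, key=mmr)
        targets.append(best)
        remaining.remove(best)
    return targets

cluster_sim = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.65, "E": 0.3}
phrase_sim = {("B", "A"): 0.1, ("C", "A"): 0.2, ("D", "A"): 0.9,
              ("E", "A"): 0.1, ("C", "B"): 0.1, ("D", "B"): 0.1,
              ("E", "B"): 0.1, ("D", "C"): 0.1, ("E", "C"): 0.1}
print(select_topic_phrases(cluster_sim, phrase_sim, theta=0.7, preset_count=3))
# ['A', 'B', 'C']  (D is relevant but too redundant with A)
```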
- In addition, each phrase is divided into words, and the similarity between each phrase and the first feature vector of the cluster is also determined at word granularity.
- the document clustering method of the present application can also be applied to the field of medical technology.
- medical documents can be clustered by the document clustering method of the present application.
- For example, N medical documents to be clustered can be obtained from a medical database, for example read from the public medicine (PUBMED) database; the N medical documents are then clustered into multiple clusters, and each cluster can be named, so that doctors can clearly and quickly consult the medical documents they need to find, promoting the development and progress of medical technology.
- FIG. 4 is a block diagram of functional units of a document clustering apparatus provided by an embodiment of the present application.
- The document clustering apparatus 400 includes an acquisition unit 401 and a processing unit 402, wherein:
- Obtaining unit 401 configured to obtain N documents to be clustered, where N is an integer greater than 1;
- The processing unit 402 is configured to: determine the co-citation similarity between any two documents to be clustered among the N documents to be clustered; perform the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters together correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N; and perform a second clustering on the remaining (N-K) documents to be clustered, to fuse the (N-K) documents to be clustered into the M clusters.
- In determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered, the processing unit 402 is specifically configured to:
- first document to be clustered and the second document to be clustered are any two documents to be clustered among the N documents to be clustered.
- In performing the second clustering on the remaining (N-K) documents to be clustered so as to fuse the (N-K) documents to be clustered into the M clusters, the processing unit 402 is specifically configured to:
- perform the second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
- In constructing a feature vector for each of the N documents to be clustered, the processing unit 402 is specifically configured to:
- the processing unit 402 is specifically used for:
- if the document i to be clustered includes the jth phrase among the P phrases, set the value of the jth dimension in the feature vector of document i to the TF-IDF of the jth phrase relative to document i; if document i does not include the jth phrase, set the value of the jth dimension in the feature vector of document i to the preset value;
- the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is an integer from 1 to P.
- the processing unit 402 is specifically configured to:
- fuse the document q to be clustered into a target cluster, wherein the target cluster is the cluster among the M clusters that contains the largest number of the H documents to be clustered.
- the N documents to be clustered are clustered for the first time to obtain M clusters
- the processing unit 402 is specifically configured to:
- perform the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered and a community detection algorithm, to obtain M clusters.
- FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
- the electronic device includes a processor and memory.
- the electronic device may further include a transceiver.
- the electronic device 500 includes a transceiver 501, a processor 502 and a memory 503, which are connected by a bus 504.
- the memory 503 is used to store computer programs and data, and can transmit the data stored by the memory 503 to the processor 502 .
- the processor 502 is used to read the computer program in the memory 503 to perform the following operations:
- N is an integer greater than 1
- the first clustering is performed on the N documents to be clustered to obtain M clusters, wherein the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
- a second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
- in determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:
- the co-citation similarity between the first document to be clustered and the second document to be clustered is determined according to the first quantity, the second quantity and the third quantity;
- the first document to be clustered and the second document to be clustered are any two documents to be clustered among the N documents to be clustered.
- the second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters
- the processor 502 is specifically configured to perform the following operations:
- the (N-K) documents to be clustered are clustered a second time according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
- the processor 502 in constructing a feature vector for each of the N documents to be clustered, is specifically configured to perform the following operations:
- the processor 502 is specifically configured to perform the following operations:
- the document i to be clustered includes the jth phrase among the P phrases
- the value of the jth dimension in the feature vector of the document i to be clustered is set to a preset value when the document i to be clustered does not include the jth phrase
- the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is an integer from 1 to P.
- the second clustering is performed on the (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, and the processor 502 is specifically configured to perform the following operations:
- the document q to be clustered is fused into a target cluster, wherein the target cluster is the cluster, among the M clusters, that contains the largest number of the H documents to be clustered.
- the processor 502 is specifically configured to perform the following operations:
- the first clustering is performed on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered and a community detection algorithm, to obtain the M clusters.
- the transceiver 501 may be the transceiver unit 401 of the document clustering apparatus 400 of the embodiment described in FIG. 4
- the processor 502 may be the processing unit 402 of the document clustering apparatus 400 of the embodiment described in FIG. 4.
- the document clustering device in this application may include smart phones (such as Android phones, iOS phones, Windows Phone phones, etc.), tablet computers, handheld computers, notebook computers, mobile internet devices (MID), or wearable devices, etc.
- the above-mentioned document clustering apparatus is only an example rather than an exhaustive list; it includes but is not limited to the above-mentioned electronic devices. In practical applications, the document clustering apparatus may further include intelligent vehicle-mounted terminals, computer equipment, and the like.
- Embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements some or all of the steps of any document clustering method described in the foregoing method embodiments.
- the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
- Embodiments of the present application further provide a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps of any document clustering method described in the foregoing method embodiments.
- the disclosed apparatus may be implemented in other manners.
- the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
- the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
- in essence, the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product, and the computer software product is stored in a memory.
- a computer device which may be a personal computer, a server, or a network device, etc.
- the aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), removable hard disk, magnetic disk, or optical disk.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A document clustering method, a document clustering apparatus (400), an electronic device (500), and a storage medium. The document clustering method comprises: the document clustering apparatus (400) obtains N documents to be clustered, N being an integer greater than 1 (101); the document clustering apparatus (400) determines a co-citation similarity between any two documents among the N documents (102); according to the co-citation similarity between the any two documents, the document clustering apparatus (400) performs first clustering on the N documents to obtain M clusters, the M clusters corresponding to K documents to be clustered, M being an integer greater than or equal to 1, and K being an integer less than or equal to N (103); and the document clustering apparatus (400) performs second clustering on the remaining (N-K) documents so as to fuse the (N-K) documents into the M clusters (104). The method helps to increase the accuracy of document clustering.
Description
This application claims the priority of the Chinese patent application No. 202011572311.7, entitled "Document Clustering, Apparatus, Electronic Equipment and Storage Medium", filed with the China Patent Office on December 25, 2020, the entire contents of which are incorporated herein by reference.
The present application relates to the technical field of artificial intelligence, and in particular to a document clustering method and apparatus, an electronic device, and a storage medium.
The inventor found that similarity based on citation relationships can generally measure the similarity between document topics fairly well. However, to supplement the similarity between documents, text similarity between documents has been introduced as an additional measure of topic similarity, i.e., multiple indicators are placed in the same space to measure the similarity of document topics; after the topic similarities are measured, a single clustering algorithm or community detection algorithm can be used to cluster the documents.
However, the inventor realized that introducing text similarity makes the resulting clustering network very dense, which coarsens the clustering granularity and reduces the clustering accuracy for documents.
SUMMARY OF THE INVENTION
The embodiments of the present application provide a document clustering method and apparatus, an electronic device, and a storage medium, which improve clustering accuracy by clustering twice.
In a first aspect, an embodiment of the present application provides a document clustering method, including:
obtaining N documents to be clustered, where N is an integer greater than 1;
determining the co-citation similarity between any two of the N documents to be clustered;
performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, where the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
performing a second clustering on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
In a second aspect, an embodiment of the present application provides a document clustering apparatus, including:
an obtaining unit, configured to obtain N documents to be clustered, where N is an integer greater than 1;
a processing unit, configured to determine the co-citation similarity between any two of the N documents to be clustered;
perform a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, where the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N; and
perform a second clustering on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor connected to a memory, where the memory is used to store a computer program and the processor is used to execute the computer program stored in the memory, so that the electronic device performs the following method:
obtaining N documents to be clustered, where N is an integer greater than 1;
determining the co-citation similarity between any two of the N documents to be clustered;
performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, where the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
performing a second clustering on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program causes a computer to execute the following method:
obtaining N documents to be clustered, where N is an integer greater than 1;
determining the co-citation similarity between any two of the N documents to be clustered;
performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, where the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
performing a second clustering on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
In a fifth aspect, an embodiment of the present application provides a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method described in the first aspect.
By implementing the embodiments of the present application, the N documents to be clustered are first clustered according to the co-citation similarity between documents to obtain M clusters; since co-citation similarity yields relatively accurate clusters, the M clusters have high accuracy. Then, without changing the clustering granularity, the remaining documents not in the M clusters are clustered a second time and fused into the M clusters, so the clustering granularity is not coarsened and the clustering accuracy is improved.
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a document clustering method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a cluster topic naming method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of constructing an undirected graph provided by an embodiment of the present application;
FIG. 4 is a block diagram of the functional units of a document clustering apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a document clustering apparatus provided by an embodiment of the present application.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.
The terms "first", "second", "third" and "fourth" in the description, claims and drawings of the present application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or other steps or units inherent to such a process, method, product or device.
Reference herein to an "embodiment" means that a particular feature, result or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The technical solution of the present application relates to the field of artificial intelligence technology and can be applied to scenarios such as digital healthcare to promote the construction of smart cities. Optionally, the data involved in this application, such as the documents to be clustered and/or the clustering results, may be stored in a database or in a blockchain, which is not limited in this application.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a document clustering method provided by an embodiment of the present application. The method is applied to a document clustering apparatus and includes the following steps:
101: The document clustering apparatus obtains N documents to be clustered, where N is an integer greater than 1.
For example, the documents to be clustered may be medical documents, patent documents, academic documents, and so on; the present application does not limit the type of documents to be clustered.
102: The document clustering apparatus determines the co-citation similarity between any two of the N documents to be clustered.
For example, a first quantity of documents among the N documents to be clustered that cite the first document to be clustered is determined, and a second quantity of documents that cite the second document to be clustered is determined; then a third quantity of documents among the N documents that cite both the first and the second document to be clustered is determined. Finally, the co-citation similarity between the first and the second document to be clustered is determined according to the first quantity, the second quantity and the third quantity, where the first and the second document to be clustered are any two of the N documents to be clustered. The co-citation similarity between the first and the second document to be clustered can therefore be expressed by formula (1):
sim(a1, a2) = |X ∩ Y| / √(|X| · |Y|)    (1)

where a1 is the first document to be clustered, a2 is the second document to be clustered, sim(a1, a2) is the co-citation similarity between the first and the second document to be clustered, X is the set of documents citing a1, and Y is the set of documents citing a2; |X ∩ Y| is the third quantity, |X| is the first quantity, and |Y| is the second quantity.
It should be understood that if none of the N documents to be clustered cites both the first document to be clustered and the second document to be clustered, the co-citation similarity between these two documents is 0; that is, no co-citation relationship exists between them.
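The computation of step 102 can be sketched in Python. Formula (1) combines the three quantities named above; the cosine-style normalization used below is a common choice for co-citation similarity and is an assumption of this sketch, not confirmed by the text:

```python
import math

def co_citation_similarity(citers_a, citers_b):
    # citers_a / citers_b: the sets X and Y of documents that cite a1 and a2.
    # |X ∩ Y| is the third quantity, |X| the first, |Y| the second.
    x, y = set(citers_a), set(citers_b)
    if not x or not y:
        return 0.0
    # Cosine-style normalization (assumed): shared citers over the
    # geometric mean of the two citation counts.
    return len(x & y) / math.sqrt(len(x) * len(y))
```

If no document cites both a1 and a2, the intersection is empty and the similarity is 0, matching the no-co-citation case described above.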
103: The document clustering apparatus performs a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, where the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N.
For example, the community network corresponding to the N documents to be clustered can be determined according to the co-citation similarity between any two documents; then, the first clustering is performed on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, a community detection algorithm, and the community network, to obtain M clusters. In this application, the co-citation similarity between any two documents to be clustered is used as the weight of the edge between them in the community network; each document starts as an independent community, and, based on the principle of modularity minimization, some of the N communities are merged to obtain M communities, i.e., M clusters, which together contain K documents to be clustered. In other words, (N-K) documents to be clustered remain unclustered.
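The text does not fix a particular community detection algorithm, so the following is a minimal stand-in using weighted label propagation over the co-citation network: every document starts in its own community, and documents with no co-citation edges remain singletons, playing the role of the (N-K) documents left for the second clustering. All names are illustrative:

```python
import random
from collections import defaultdict

def label_propagation(weights, nodes, iters=20, seed=0):
    # weights: maps frozenset({u, v}) -> co-citation similarity (edge weight).
    # Returns a list of communities (sets of nodes).
    rng = random.Random(seed)
    label = {n: n for n in nodes}          # each document starts as its own community
    neigh = defaultdict(dict)
    for pair, w in weights.items():
        u, v = tuple(pair)
        neigh[u][v] = w
        neigh[v][u] = w
    order = list(nodes)
    for _ in range(iters):
        rng.shuffle(order)
        changed = False
        for n in order:
            if not neigh[n]:
                continue                   # isolated documents stay singletons
            score = defaultdict(float)
            for m, w in neigh[n].items():
                score[label[m]] += w       # weighted vote of neighbours' labels
            best = max(score, key=score.get)
            if best != label[n]:
                label[n], changed = best, True
        if not changed:
            break
    clusters = defaultdict(set)
    for n, l in label.items():
        clusters[l].add(n)
    return list(clusters.values())
```

Label propagation is only a simple illustration; a modularity-based method such as Louvain would fit the merging-of-communities description more closely.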
104: The document clustering apparatus performs a second clustering on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
For example, a feature vector is constructed for each of the N documents to be clustered; then, according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, a second clustering is performed on the (N-K) documents to be clustered, so as to fuse them into the M clusters.
For example, phrase extraction is performed on the title and abstract of each of the N documents to be clustered, and the phrases extracted from the N documents are deduplicated to obtain P phrases, where P is an integer greater than 1; the phrases can be extracted from the title and abstract of each document with the Stanford NLP language processing toolkit. Then, the term frequency-inverse document frequency (TF-IDF) of each of the P phrases relative to each of the N documents to be clustered is determined; finally, a feature vector is constructed for each of the N documents to be clustered according to the TF-IDF of each of the P phrases relative to each document.
Specifically, when document i to be clustered contains the jth phrase of the P phrases, the value of the jth dimension of the feature vector of document i is set to the TF-IDF of the jth phrase relative to document i; when document i does not contain the jth phrase, the value of the jth dimension of its feature vector is set to a preset value, for example, 0. Document i to be clustered is any one of the N documents to be clustered, so i is an integer from 1 to N; the jth phrase is any one of the P phrases, so j is an integer from 1 to P.
Therefore, the feature vector of each document can be expressed by formula (2):
V_i = (W_1, ..., W_j, ..., W_P)    (2)

where V_i is the feature vector of document i to be clustered, W_1, ..., W_j, ..., W_P are the values of its dimensions 1 through P, and W_j is the TF-IDF of phrase j relative to document i (or the preset value if document i does not contain phrase j).
In one embodiment of the present application, the TF-IDF of each phrase relative to each of the N documents to be clustered can be expressed by formula (3):
TF-IDF_{i,j} = (n_{i,j} / n_i) × log(N / N_j)    (3)

where n_{i,j} is the number of occurrences of phrase j extracted from the title and abstract of document i to be clustered, n_i is the total number of phrases extracted from the title and abstract of document i, N is the number of documents to be clustered, and N_j is the number of the N documents to be clustered that contain phrase j.
Further, after the feature vector of each document to be clustered is constructed, the similarity between the feature vector of document q to be clustered and the feature vector of document e to be clustered can be determined, where document q is any one of the (N-K) documents to be clustered and document e is any one of the K documents to be clustered; by traversing the K documents to be clustered, K similarities between document q and the K clustered documents are obtained. Then, in descending order of these K similarities, H documents are selected from the K documents to be clustered, where H is a preset integer greater than 1. Finally, the number of the H documents belonging to each of the M clusters is determined, and document q is fused into a target cluster, where the target cluster is the cluster, among the M clusters, that contains the largest number of the H documents.
For example, suppose 10 documents are selected from the K documents to be clustered, i.e., H = 10, and the clusters are cluster 1, cluster 2 and cluster 3; of the 10 selected documents, 3 come from cluster 1, 4 from cluster 2 and 3 from cluster 3. Cluster 2 thus contains the largest number of the 10 documents, so document q to be clustered is fused into cluster 2.
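The fusion step above can be sketched as follows. Cosine similarity between feature vectors is an assumption of this sketch; the text speaks only of "similarity" without fixing the measure:

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two feature vectors (assumed measure).
    num = sum(a * b for a, b in zip(u, v))
    du = math.sqrt(sum(a * a for a in u))
    dv = math.sqrt(sum(b * b for b in v))
    return num / (du * dv) if du and dv else 0.0

def fuse(doc_vec, clustered, h):
    # clustered: list of (feature_vector, cluster_id) for the K already-clustered
    # documents. Pick the H most similar and return the cluster holding most of them.
    top = sorted(clustered, key=lambda item: cosine(doc_vec, item[0]), reverse=True)[:h]
    votes = Counter(cid for _, cid in top)
    return votes.most_common(1)[0][0]
```

Because document q is assigned to an existing cluster rather than starting a new one, the number of clusters M, and hence the clustering granularity, is unchanged.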
可以看出，在本申请实施例中，首先根据文献之间的共被引相似度对N篇待聚类文献进行聚类，得到M个聚类簇，由于使用共被引相似度能够得到比较精确的聚类簇，所以该M个聚类簇的精度较高；然后，在不改变聚类粒度的情况下，将剩余未处于该M个聚类簇的文献进行第二次聚类，融合到该M个聚类簇，从而不会降低聚类粒度，提高了聚类精度。It can be seen that in this embodiment of the present application, the N documents to be clustered are first clustered according to the co-citation similarity between documents to obtain M clusters; since co-citation similarity yields relatively precise clusters, the M clusters have high accuracy. Then, without changing the clustering granularity, the remaining documents that are not in the M clusters are clustered a second time and fused into the M clusters, so the clustering granularity is not reduced and the clustering accuracy is improved.
在本申请的一个实施方式中，在对N篇待聚类文献聚类之后，可以对每个聚类簇进行主题命名，便于后续对每个聚类簇分类。下面以对聚类簇i进行命名为例说明对聚类簇进行主题命名的过程，该聚类簇i为该M个聚类簇中的任意一个聚类簇，其他聚类簇的命名方式与此类似，不再叙述。In an embodiment of the present application, after the N documents to be clustered are clustered, each cluster may be given a topic name to facilitate subsequent classification of the clusters. The following takes the naming of cluster i as an example to illustrate the topic-naming process, where cluster i is any one of the M clusters; the other clusters are named similarly and are not described again.
参阅图2,图2为本申请实施例提供的一种文献主题命名方法的流程示意图。该方法应用于文献聚类装置。该方法包括以下步骤:Referring to FIG. 2, FIG. 2 is a schematic flowchart of a method for naming a document subject provided by an embodiment of the present application. The method is applied to a document clustering device. The method includes the following steps:
201:根据聚类簇i中任意两篇待聚类文献之间的共被引相似度确定聚类簇i中的目标待聚类文献。201: Determine a target document to be clustered in cluster i according to the co-citation similarity between any two documents to be clustered in cluster i.
示例性的，根据聚类簇i中任意两篇待聚类文献之间的共被引相似度，确定聚类簇i中每篇待聚类文献的评分，其中，每篇待聚类文献的评分用于表示每篇待聚类文献的重要性程度，即待聚类文献的质量；然后，根据评分从大到小的顺序确定聚类簇i中的目标待聚类文献。示例性的，可根据评分从大到小的顺序从每个聚类簇中选取预设比例的文献作为目标待聚类文献。比如，某个聚类簇中的文献的数量为100个，预设比例为10%，则按照评分从大到小的顺序从这100篇文献中选出前十篇文献作为这个聚类簇的目标待聚类文献。Exemplarily, according to the co-citation similarity between any two documents to be clustered in cluster i, a score is determined for each document to be clustered in cluster i, where the score indicates the degree of importance, i.e. the quality, of the document. The target documents to be clustered in cluster i are then determined in descending order of score. Exemplarily, a preset proportion of documents may be selected from each cluster, in descending order of score, as the target documents. For example, if a cluster contains 100 documents and the preset proportion is 10%, the top ten documents by score are selected from the 100 documents as the target documents of that cluster.
具体来说，根据聚类簇i中任意两篇待聚类文献之间的共被引相似度，确定聚类簇i对应的无向图，即将有共被引关系的待聚类文献进行连接，无共被引关系的待聚类文献不进行连接；根据聚类簇i对应的无向图以及pagerank算法确定该无向图中每个节点（即每篇待聚类文献）的评分，可得到聚类簇i中每篇待聚类文献的评分，即根据每篇待聚类文献与其他待聚类文献之间的路径确定该篇待聚类文献的评分。具体的，根据每篇待聚类文献与其他待聚类文献之间的共被引相似度、预设参数以及中间相隔的待聚类文献的数量，确定每篇待聚类文献的评分。Specifically, according to the co-citation similarity between any two documents to be clustered in cluster i, the undirected graph corresponding to cluster i is determined: documents with a co-citation relationship are connected, and documents without one are not. The score of each node (i.e. each document to be clustered) in the undirected graph is then determined from the graph and the PageRank algorithm, giving the score of each document in cluster i; that is, the score of a document is determined from its paths to the other documents. Specifically, the score of each document is determined from its co-citation similarity to the other documents, a preset parameter, and the number of intermediate documents on the path.
举例来说，聚类簇i包括待聚类文献A、待聚类文献B、待聚类文献C以及待聚类文献D，其中，待聚类文献D是跟其他待聚类文献没有共被引关系，而是最后融合进来的待聚类文献，因此，可建立如图3所示的无向图。根据pagerank算法以及该无向图可分别确定出待聚类文献A、待聚类文献B、待聚类文献C以及待聚类文献D对应的评分。示例性的，待聚类文献A对应的评分为待聚类文献A到待聚类文献B之间的评分，以及待聚类文献A到待聚类文献C之间的评分之和。示例性的，待聚类文献A对应的评分可通过公式(4)表示：For example, cluster i includes documents A, B, C and D to be clustered, where document D has no co-citation relationship with the other documents and was fused in last. Therefore, an undirected graph as shown in FIG. 3 can be established. From the PageRank algorithm and this undirected graph, the scores of documents A, B, C and D can be determined respectively. Exemplarily, the score of document A is the sum of the score from document A to document B and the score from document A to document C. Exemplarily, the score of document A can be expressed by formula (4):
S = 1*γ*sim(A,B) + 1*γ²*sim(A,C)　公式(4) (Formula (4))
S为待聚类文献A对应的评分，sim(A,B)表示待聚类文献A与待聚类文献B之间的共被引相似度，sim(A,C)表示待聚类文献A与待聚类文献C之间的共被引相似度，γ为预设参数，0<γ<1。Here, S is the score of document A to be clustered, sim(A,B) denotes the co-citation similarity between documents A and B, sim(A,C) denotes the co-citation similarity between documents A and C, and γ is a preset parameter with 0<γ<1.
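The path-decayed scoring of formula (4) can be sketched in Python. This is an illustrative reading, not the patent's implementation: it assumes the graph of FIG. 3 has edges A–B and B–C (so that B is one hop and C two hops from A, matching the γ and γ² terms), and that a co-citation similarity value is available even for indirectly connected pairs such as (A, C); all names are illustrative.

```python
from collections import deque

def doc_scores(edges, sim, gamma=0.5):
    """Score each document by summing gamma**d * sim(u, v) over every
    other document v reachable in the undirected co-citation graph,
    where d is the shortest-path distance from u to v."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    scores = {}
    for u in graph:
        # BFS for shortest-path distances from u.
        dist = {u: 0}
        queue = deque([u])
        while queue:
            node = queue.popleft()
            for nb in graph[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        scores[u] = sum(gamma ** d * sim.get(frozenset((u, v)), 0.0)
                        for v, d in dist.items() if v != u)
    return scores
```

With γ=0.5, sim(A,B)=0.8 and sim(A,C)=0.6, document A scores 0.5·0.8 + 0.25·0.6 = 0.55, as formula (4) prescribes.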
202:确定P个短语中属于聚类簇i中的L个短语,L为小于等于P的整数。202 : Determine L phrases belonging to cluster i among the P phrases, where L is an integer less than or equal to P.
203:根据目标待聚类文献以及L个短语,确定聚类簇i对应的主题。203: Determine the topic corresponding to the cluster i according to the target documents to be clustered and the L phrases.
示例性的，可对聚类簇i中的目标待聚类文献的标题进行词嵌入，得到聚类簇i对应的第一特征向量。其中，对聚类簇i中的目标待聚类文献的标题进行词嵌入可通过完成训练的Biobert模型实现，该Biobert模型是通过医疗领域的文献作为训练语料进行训练得到的，因此该Biobert模型对医学领域的语言处理会更加精确，提取出的文献语义也更加准确，其中，对Biobert模型进行训练可通过有监督的方式进行，不再赘述。Exemplarily, word embedding may be performed on the titles of the target documents to be clustered in cluster i to obtain the first feature vector corresponding to cluster i. The word embedding may be implemented by a trained Biobert model. Since the Biobert model is trained with medical-domain documents as the corpus, its language processing in the medical field is more accurate and it extracts document semantics more precisely. The Biobert model may be trained in a supervised manner, which is not described again here.
应理解，在该目标待聚类文献的数量为一个的情况下，则将该目标待聚类文献的标题进行词嵌入得到的特征向量作为该第一特征向量；在该目标待聚类文献的数量为多个的情况下，则可对每篇目标待聚类文献的标题进行词嵌入，得到每篇目标待聚类文献对应的特征向量，然后，将多篇目标待聚类文献对应的多个特征向量按位取平均值后，得到该第一特征向量。It should be understood that when there is only one target document, the feature vector obtained by word embedding of its title is used as the first feature vector; when there are multiple target documents, word embedding is performed on the title of each target document to obtain its feature vector, and the multiple feature vectors are averaged element-wise to obtain the first feature vector.
进一步地，对L个短语中的每个短语进行词嵌入，得到每个短语的第二特征向量，其中，对每个短语进行词嵌入也可通过上述的Biobert模型实现，不再叙述；然后，对每个短语的每个单词进行词嵌入，得到每个单词对应的第三特征向量；根据每个单词对应的第三特征向量，确定每个短语对应的第四特征向量，即将每个短语中的每个单词对应的第三特征向量按位求均值，并将按位求均值得到的特征向量作为每个短语对应的第四特征向量。举例来说，将短语“lung cancer survival rate”中的四个单词分别进行词嵌入，得到四个特征向量，并将该四个特征向量按位求均值，得到该短语对应的第四特征向量。Further, word embedding is performed on each of the L phrases to obtain the second feature vector of each phrase; this can also be implemented by the above Biobert model and is not described again. Then, word embedding is performed on each word of each phrase to obtain the third feature vector of each word, and the fourth feature vector of each phrase is determined from the third feature vectors of its words: the third feature vectors of the words in a phrase are averaged element-wise, and the resulting vector is used as the fourth feature vector of the phrase. For example, the four words in the phrase "lung cancer survival rate" are embedded separately to obtain four feature vectors, which are averaged element-wise to obtain the fourth feature vector of the phrase.
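The element-wise averaging of per-word embeddings into a phrase's fourth feature vector can be sketched as follows. This is a minimal illustration: the patent obtains word vectors from a trained Biobert model, while here `embed` is any word-to-vector lookup, and all names are illustrative.

```python
import numpy as np

def phrase_word_vector(phrase, embed):
    """Element-wise average of per-word embeddings: the 'fourth feature
    vector' of a phrase. `embed` maps each word to its embedding."""
    vectors = np.stack([embed[w] for w in phrase.split()])
    return vectors.mean(axis=0)
```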
最后，根据聚类簇i对应的第一特征向量、每个短语对应的第二特征向量、每个短语对应的第四特征向量以及每个短语相对于该聚类簇i的TF-IDF，确定每个聚类簇对应的主题，其中，每个短语相对于该聚类簇i的TF-IDF为每个短语相对于该聚类簇i中的每篇待聚类文献的TF-IDF的平均值。Finally, the topic of each cluster is determined from the first feature vector of cluster i, the second feature vector of each phrase, the fourth feature vector of each phrase, and the TF-IDF of each phrase with respect to cluster i, where the TF-IDF of a phrase with respect to cluster i is the average of its TF-IDF with respect to each document to be clustered in cluster i.
示例性的，确定聚类簇i对应的第一特征向量与每个短语对应的第二特征向量之间的第一相似度；确定聚类簇i对应的第一特征向量与每个短语对应的第四特征向量之间的第二相似度；最后，根据每个短语对应的第一相似度、第二相似度以及相对于聚类簇i的TF-IDF，确定聚类簇i与每个短语之间的第三相似度。比如，可以对该第一相似度、第二相似度以及TF-IDF进行加权处理，得到该第三相似度。Exemplarily, the first similarity between the first feature vector of cluster i and the second feature vector of each phrase is determined; the second similarity between the first feature vector of cluster i and the fourth feature vector of each phrase is determined; finally, the third similarity between cluster i and each phrase is determined from the phrase's first similarity, second similarity and TF-IDF with respect to cluster i. For example, the first similarity, the second similarity and the TF-IDF may be weighted to obtain the third similarity.
示例性的,上述的相似度可以为向量之间的余弦相似度。因此,第三相似度可以通过公式(5)表示:Exemplarily, the above similarity may be the cosine similarity between vectors. Therefore, the third similarity can be expressed by formula (5):
sim(phr,cluster) = β*cos_sim(vec1,vec2) + (1−β)*cos_sim(vec1,vec4) + (1−β)*TF-IDF　公式(5) (Formula (5))
其中，sim(phr,cluster)为聚类簇i与每个短语之间的第三相似度，cos_sim为求余弦相似度操作，vec1为聚类簇i对应的第一特征向量，vec2为每个短语对应的第二特征向量，vec4为每个短语对应的第四特征向量，β为预设参数，0≤β≤1。Here, sim(phr,cluster) is the third similarity between cluster i and each phrase, cos_sim is the cosine-similarity operation, vec1 is the first feature vector of cluster i, vec2 is the second feature vector of each phrase, vec4 is the fourth feature vector of each phrase, and β is a preset parameter with 0≤β≤1.
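Formula (5) can be sketched directly in Python. This is a minimal illustration with illustrative names; it assumes dense vectors for vec1, vec2 and vec4 and takes the phrase's cluster-level TF-IDF as a precomputed scalar.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def third_similarity(vec1, vec2, vec4, tfidf, beta=0.5):
    """Formula (5): weighted combination of the phrase-level similarity,
    the word-level similarity and the phrase's TF-IDF for the cluster."""
    return (beta * cos_sim(vec1, vec2)
            + (1 - beta) * cos_sim(vec1, vec4)
            + (1 - beta) * tfidf)
```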
然后,根据每个短语的第二特征向量,确定该L个短语中任意两个短语之间的第四相似度。示例性的,该第四相似度也可以为余弦相似度,因此,第四相似度可以通过公式(6)表示:Then, according to the second feature vector of each phrase, the fourth degree of similarity between any two phrases in the L phrases is determined. Exemplarily, the fourth similarity may also be a cosine similarity. Therefore, the fourth similarity may be expressed by formula (6):
sim(phr1,phr2) = cos_sim(vec21,vec22)　公式(6) (Formula (6))
其中，phr1和phr2为L个短语中的任意两个短语，sim(phr1,phr2)为这两个短语之间的第四相似度，vec21为phr1对应的第二特征向量，vec22为phr2对应的第二特征向量。Here, phr1 and phr2 are any two of the L phrases, sim(phr1,phr2) is the fourth similarity between them, vec21 is the second feature vector of phr1, and vec22 is the second feature vector of phr2.
最后,根据聚类簇i与每个短语之间的第三相似度以及该任意两个短语之间的第四相似度,确定聚类簇i对应的主题。Finally, according to the third degree of similarity between cluster i and each phrase and the fourth degree of similarity between any two phrases, the topic corresponding to cluster i is determined.
示例性的，将第三相似度最大的短语作为一个目标短语，并将该目标短语从该L个短语中移动到目标短语集；然后，根据L个短语中的剩余短语中每个短语与聚类簇i之间的第三相似度，以及与该目标短语集中每个目标短语之间的第四相似度，确定剩余短语中每个短语对应的最大边界相关（Maximal Marginal Relevance，MMR）分值，比如，可根据剩余短语中每个短语与聚类簇i之间的第三相似度，以及与该目标短语集中每个目标短语之间的第四相似度，得到与该目标短语集中每个目标短语对应的第五相似度，并将最大的第五相似度作为剩余短语中每个短语的MMR分值；然后，将剩余短语中的MMR分值最大的短语从剩余短语中移动到目标短语集。最后，再次确定剩余短语中每个短语对应的MMR分值，并将剩余短语中MMR分值最大的短语移动到目标短语集，依次迭代，直至该目标短语集中的目标短语的数量达到预设数量，停止迭代，并将该目标短语集中的目标短语作为聚类簇i的主题。Exemplarily, the phrase with the largest third similarity is taken as a target phrase and moved from the L phrases to a target phrase set. Then, the Maximal Marginal Relevance (MMR) score of each remaining phrase is determined from its third similarity to cluster i and its fourth similarity to each target phrase in the target phrase set. For example, for each remaining phrase, a fifth similarity may be obtained with respect to each target phrase from the phrase's third similarity to cluster i and its fourth similarity to that target phrase, and the largest fifth similarity is used as the phrase's MMR score; the phrase with the largest MMR score is then moved from the remaining phrases to the target phrase set. Finally, the MMR scores of the remaining phrases are determined again and the phrase with the largest MMR score is moved to the target phrase set, iterating in this way until the number of target phrases in the target phrase set reaches a preset number, at which point the iteration stops and the target phrases in the target phrase set are used as the topic of cluster i.
示例性的,剩余短语中每个短语的MMR分值可通过公式(7)表示:Exemplarily, the MMR score of each phrase in the remaining phrases can be represented by formula (7):
MMR_i = max_{phr_j∈K} [α*sim(phr_i,cluster) − (1−α)*sim(phr_i,phr_j)]　公式(7) (Formula (7))

其中，PHR表示聚类簇i对应的候选短语集，K为目标短语集，phr_i∈PHR\K表示剩余短语中的第i个短语，MMR_i为第i个短语的MMR分值，phr_j∈K表示目标短语集中的第j个短语，sim(phr_i,cluster)为第i个短语与聚类簇i之间的第三相似度，sim(phr_i,phr_j)为第i个短语与第j个短语之间的第四相似度，max表示最大化取值，即在遍历目标短语集中的短语之后，将最大值作为第i个短语的MMR分值，α为预设参数。最后，在遍历剩余短语中每个短语之后，可得到剩余短语中每个短语的MMR分值。Here, PHR denotes the candidate phrase set of cluster i, K is the target phrase set, phr_i∈PHR\K denotes the i-th phrase among the remaining phrases, MMR_i is the MMR score of the i-th phrase, phr_j∈K denotes the j-th phrase in the target phrase set, sim(phr_i,cluster) is the third similarity between the i-th phrase and cluster i, sim(phr_i,phr_j) is the fourth similarity between the i-th and j-th phrases, max takes the maximum value, i.e. after traversing the phrases in the target phrase set, the maximum value is used as the MMR score of the i-th phrase, and α is a preset parameter. Finally, after traversing the remaining phrases, the MMR score of each remaining phrase is obtained.
举例说明，某个聚类簇的候选短语集包括短语A、短语B、短语C、短语D以及短语E，并且短语A与该聚类簇之间的第三相似度最大，则先将短语A作为一个目标短语，并将该短语A从候选短语集中移动到目标短语集，此时剩余短语包括短语B、短语C、短语D以及短语E；然后，计算剩余短语中每个短语的MMR分值，即将每个短语与该聚类簇之间的第三相似度以及与短语A之间的第四相似度代入到上述公式(7)，分别得到短语B、短语C、短语D以及短语E对应的MMR分值；假设短语B的MMR分值最大，则将短语B从剩余短语中移动到目标短语集，此时剩余短语包括短语C、短语D以及短语E。最后，将剩余短语中每个短语与该聚类簇之间的第三相似度以及与短语A之间的第四相似度代入到上述公式(7)，得到与短语A对应的一个相似度，并将该短语与该聚类簇之间的第三相似度以及与短语B之间的第四相似度代入到上述公式(7)，得到与短语B对应的一个相似度，将这两个相似度中最大的相似度作为这个短语的MMR分值。依次确定剩余短语中每个短语的MMR分值，则可得到短语C、短语D和短语E的MMR分值。假设短语C的MMR分值最大，则将短语C移动到目标短语集。如预设数量为三个短语，这时目标短语集中已经有了三个短语，停止迭代，将短语A、短语B和短语C作为聚类簇i的主题。For example, the candidate phrase set of a cluster includes phrases A, B, C, D and E, and phrase A has the largest third similarity to the cluster. Phrase A is therefore taken as a target phrase first and moved from the candidate phrase set to the target phrase set, leaving phrases B, C, D and E as the remaining phrases. Then the MMR score of each remaining phrase is computed: each phrase's third similarity to the cluster and its fourth similarity to phrase A are substituted into formula (7), giving the MMR scores of phrases B, C, D and E. Assuming phrase B has the largest MMR score, phrase B is moved from the remaining phrases to the target phrase set, leaving phrases C, D and E. Next, for each remaining phrase, its third similarity to the cluster and its fourth similarity to phrase A are substituted into formula (7) to obtain a similarity corresponding to phrase A, and its third similarity to the cluster and its fourth similarity to phrase B are substituted into formula (7) to obtain a similarity corresponding to phrase B; the larger of these two similarities is used as the phrase's MMR score. Determining the MMR scores of the remaining phrases in this way gives the MMR scores of phrases C, D and E. Assuming phrase C has the largest MMR score, phrase C is moved to the target phrase set. If the preset number is three phrases, the target phrase set now contains three phrases, so the iteration stops, and phrases A, B and C are used as the topics of cluster i.
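The iterative selection above can be sketched in Python. Note this follows one reading of the patent's description — for each remaining phrase, compute a candidate score per already-selected target phrase and keep the maximum as the MMR score — which differs slightly from the textbook MMR formulation; all names are illustrative.

```python
def mmr_select(phrases, sim_cluster, sim_pair, n, alpha=0.7):
    """Select n topic phrases. sim_cluster maps phrase -> third similarity
    to the cluster; sim_pair maps frozenset({p, q}) -> fourth similarity
    between two phrases."""
    remaining = set(phrases)
    # Seed with the phrase most similar to the cluster (largest third similarity).
    seed = max(remaining, key=lambda p: sim_cluster[p])
    selected = [seed]
    remaining.remove(seed)
    while remaining and len(selected) < n:
        def mmr(p):
            # Maximum over the already-selected target phrases.
            return max(alpha * sim_cluster[p]
                       - (1 - alpha) * sim_pair[frozenset((p, s))]
                       for s in selected)
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the pairwise penalty is subtracted, a phrase that is very similar to an already-selected phrase (a near-duplicate) scores low and is skipped, which is the redundancy-avoidance property discussed below.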
可以看出，在对聚类簇进行主题命名的过程中，除了考虑短语与聚类簇本身的关系之外，还考虑短语之间的相似度，从而避免选取出重复冗余的短语作为聚类簇的主题。另外，将每个短语分词，以单词为粒度确定每个短语和医疗文献簇的第一特征向量之间的第二相似度。主要避免一些短语比较长，其本身和聚类簇的主题不相关，但是由于短语较长可能会频繁包含一些与主题相关的单词，这样在对这些长短语进行语义特征提取的过程中，可能会受这些高频词汇的影响，使这些长短语的语义特征与聚类簇的主题相关，会误将这些长短语作为聚类簇的主题，导致抽取出的聚类簇的主题精度比较低。而通过对每个短语分词，从每个单词本身出发，不考虑单词的上下文语境，这样就会将一些本身不与主题相关但频繁出现的单词归类为通用词汇，在进行第二相似度计算的过程中，得到的第二相似度比较小，这样在加权之后，得到的第三相似度也会相对较小，从而不会将这样的短语作为聚类簇的主题，进而使最终抽取出的主题相对更加精确。It can be seen that in the topic-naming process, besides the relationship between each phrase and the cluster itself, the similarity between phrases is also considered, so that repetitive and redundant phrases are not selected as the topics of the cluster. In addition, each phrase is segmented into words, and the second similarity between each phrase and the first feature vector of the medical document cluster is determined at word granularity. This mainly guards against long phrases that are themselves unrelated to the topic of the cluster but, being long, frequently contain topic-related words: during semantic feature extraction from such long phrases, these high-frequency words may make the phrases' semantic features appear related to the cluster topic, so the long phrases would be mistaken for cluster topics and the accuracy of the extracted topics would be low. By segmenting each phrase and starting from each word itself, without considering the word's context, frequently occurring words that are not themselves topic-related are treated as general vocabulary. The resulting second similarity is then relatively small, so after weighting the third similarity is also relatively small; such phrases are therefore not chosen as cluster topics, and the finally extracted topics are relatively more accurate.
在本申请的一个实施方式中，本申请的文献聚类方法还可以应用到医疗技术领域，比如，可以通过本申请的文献聚类方法对医学文献进行聚类，例如，在本申请所涉及的文献为医学文献的情况下，则可以从医学数据库中获取N篇待聚类医学文献，比如，从公共医疗（public medicine，PUBMED）数据库中读取N篇待聚类文献，然后，对N篇医学文献进行聚类，将N篇医学文献聚类成多个聚类簇，针对每个聚类簇可以进行命名，这样医生就可以清楚、快速地查阅自己需要查找的医学文献，推动医疗科技的进步。In an embodiment of the present application, the document clustering method of the present application can also be applied to the field of medical technology. For example, in the case where the documents involved in the present application are medical documents, N medical documents to be clustered can be obtained from a medical database, for example, read from the public medicine (PUBMED) database. The N medical documents are then clustered into multiple clusters, and each cluster can be named, so that doctors can clearly and quickly consult the medical documents they need to find, promoting the progress of medical technology.
参阅图4，图4为本申请实施例提供的一种文献聚类装置的功能单元组成框图。文献聚类装置400包括：获取单元401和处理单元402，其中：Referring to FIG. 4, FIG. 4 is a block diagram of the functional units of a document clustering apparatus provided by an embodiment of the present application. The document clustering apparatus 400 includes: an acquisition unit 401 and a processing unit 402, wherein:
获取单元401,用于获取N篇待聚类文献,N为大于1的整数;Obtaining unit 401, configured to obtain N documents to be clustered, where N is an integer greater than 1;
处理单元402,用于确定所述N篇待聚类文献中任意两篇待聚类文献之间的共被引相似度;根据所述任意两篇待聚类文献之间的共被引相似度,对所述N篇待聚类文献进行第一次聚类,得到M个聚类簇,其中,所述M个聚类簇对应K篇待聚类文献,M为大于或等于1的整数,K为小于或等于N的整数;对剩余的(N-K)篇待聚类文献进行第二次聚类,以将所述(N-K)篇待聚类文献融合到所述M个聚类簇。The processing unit 402 is configured to determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered; according to the co-citation similarity between any two documents to be clustered , perform the first clustering on the N documents to be clustered to obtain M clusters, wherein the M clusters correspond to the K documents to be clustered, and M is an integer greater than or equal to 1, K is an integer less than or equal to N; the second clustering is performed on the remaining (N-K) documents to be clustered to fuse the (N-K) documents to be clustered into the M clusters.
在一些可能的实施方式中,在确定所述N篇待聚类文献中任意两篇待聚类文献之间的共被引相似度方面,处理单元402,具体用于:In some possible implementations, in determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered, the processing unit 402 is specifically configured to:
确定所述N篇待聚类文献中引用了第一篇待聚类文献的第一数量;determining the first number of the N documents to be clustered citing the first document to be clustered;
确定所述N篇待聚类文献中引用了第二篇待聚类文献的第二数量;determining the second number of the N documents to be clustered that cite the second document to be clustered;
确定所述N篇待聚类文献中同时引用了所述第一篇待聚类文献以及所述第二篇待聚类文献的第三数量；Determine the third number of documents among the N documents to be clustered that cite both the first document to be clustered and the second document to be clustered;
根据所述第一数量、所述第二数量以及所述第三数量，确定所述第一篇待聚类文献与所述第二篇待聚类文献之间的共被引相似度；Determine the co-citation similarity between the first document to be clustered and the second document to be clustered according to the first quantity, the second quantity and the third quantity;
其中,所述第一篇待聚类文献和所述第二篇待聚类文献为所述N篇待聚类文献中的任意两篇待聚类文献。Wherein, the first document to be clustered and the second document to be clustered are any two documents to be clustered among the N documents to be clustered.
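The co-citation similarity can be sketched from the three counts above. The patent does not state the exact combination of the first, second and third quantities, so the cosine-style normalization used here is an assumption (one common choice for co-citation analysis); all names are illustrative.

```python
import math

def cocitation_similarity(n1, n2, n12):
    """Co-citation similarity between two documents: n1 documents cite the
    first, n2 cite the second, n12 cite both. Cosine-style normalization
    (an assumed choice; the patent leaves the formula unspecified)."""
    if n1 == 0 or n2 == 0:
        return 0.0
    return n12 / math.sqrt(n1 * n2)
```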
在一些可能的实施方式中,在对剩余的(N-K)篇待聚类文献进行第二次聚类,以将所述(N-K)篇待聚类文献融合到所述M个聚类簇方面,处理单元402,具体用于:In some possible implementations, the second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, The processing unit 402 is specifically used for:
为所述N篇待聚类文献中的每篇待聚类文献构造特征向量;Construct a feature vector for each document to be clustered in the N documents to be clustered;
根据所述(N-K)篇待聚类文献中的每篇待聚类文献的特征向量以及所述K篇待聚类文献中的每篇待聚类文献的特征向量，对所述(N-K)篇待聚类文献进行第二次聚类，以将所述(N-K)篇待聚类文献融合到所述M个聚类簇。Perform a second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
在一些可能的实施方式中,在为所述N篇待聚类文献中的每篇待聚类文献构造特征向量方面,处理单元402,具体用于:In some possible implementations, in constructing a feature vector for each of the N documents to be clustered, the processing unit 402 is specifically configured to:
对所述N篇待聚类文献中的每篇待聚类文献的标题和摘要进行短语提取并去重,得到P个短语,其中,P为大于1的整数;Perform phrase extraction and deduplication on the title and abstract of each document to be clustered in the N documents to be clustered to obtain P phrases, where P is an integer greater than 1;
确定所述P个短语中的每个短语相对于所述N篇待聚类文献中的每篇待聚类文献的词频-逆文本频率TF-IDF;Determine the word frequency-inverse text frequency TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered;
根据所述P个短语中的每个短语相对于所述N篇待聚类文献中的每篇待聚类文献的TF-IDF，为所述N篇待聚类文献中的每篇待聚类文献构造特征向量。Construct a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases with respect to each of the N documents to be clustered.
在一些可能的实施方式中,在根据所述P个短语中的每个短语相对于所述N篇待聚类文献中的每篇待聚类文献的TF-IDF,为所述N篇待聚类文献中的每篇待聚类文献构造特征向量方面,处理单元402,具体用于:In some possible implementations, according to the TF-IDF of each of the P phrases relative to each of the N documents to be clustered, for the N documents to be clustered In terms of constructing a feature vector for each document to be clustered in the class document, the processing unit 402 is specifically used for:
在待聚类文献i中包括所述P个短语中的第j个短语的情况下，将所述待聚类文献i的特征向量中第j个维度的取值设置为所述第j个短语相对于所述待聚类文献i的TF-IDF；在所述待聚类文献i中不包括所述第j个短语的情况下，将所述待聚类文献i的特征向量中第j个维度的取值设置为预设值；In the case that document i to be clustered contains the j-th phrase among the P phrases, the value of the j-th dimension of the feature vector of document i is set to the TF-IDF of the j-th phrase with respect to document i; in the case that document i does not contain the j-th phrase, the value of the j-th dimension of the feature vector of document i is set to a preset value;
其中,所述待聚类文献i为所述N篇待聚类文献中的任意一篇文献,i的取值为从1到N的整数,j的取值为从1到P的整数。Wherein, the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is an integer from 1 to P.
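The P-dimensional feature-vector construction just described can be sketched as follows — a minimal illustration with illustrative names, assuming the per-document TF-IDF values have already been computed and using 0.0 as the preset value for absent phrases.

```python
def build_feature_vector(doc_phrases, tfidf_of_doc, all_phrases, default=0.0):
    """P-dimensional feature vector for one document: dimension j holds
    the TF-IDF of phrase j for this document if the phrase occurs in it,
    otherwise the preset default value."""
    return [tfidf_of_doc[p] if p in doc_phrases else default
            for p in all_phrases]
```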
在一些可能的实施方式中,在根据所述(N-K)篇待聚类文献中的每篇待聚类文献的特征向量以及所述K篇待聚类文献中的每篇待聚类文献的特征向量,对所述(N-K)篇待聚类文献进行第二次聚类,以将所述(N-K)篇待聚类文献融合到所述M个聚类簇方面,处理单元402,具体用于:In some possible implementations, according to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the characteristics of each document to be clustered in the K documents to be clustered vector, the second clustering is performed on the (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, the processing unit 402 is specifically used for :
确定待聚类文献q的特征向量和待聚类文献e的特征向量之间的相似度,其中,所述待聚类文献q为所述(N-K)篇待聚类文献中的任意一篇文献,所述待聚类文献e为所述K篇待聚类文献中的任意一篇文献;Determine the similarity between the feature vector of the document q to be clustered and the feature vector of the document e to be clustered, wherein the document q to be clustered is any one of the (N-K) documents to be clustered , the document e to be clustered is any one of the K documents to be clustered;
按照相似度从大到小的顺序,从所述K篇待聚类文献中选取H篇待聚类文献,H为大于1的整数;Select H documents to be clustered from the K documents to be clustered in descending order of similarity, where H is an integer greater than 1;
确定所述H篇待聚类文献分别属于所述M个聚类簇中的每个聚类簇的数量;Determine the number of each of the H documents to be clustered that belong to each of the M clusters;
将所述待聚类文献q融合到目标聚类簇,其中,所述目标聚类簇为所述M个聚类簇中包含所述H篇待聚类文献数量最多的聚类簇。The document q to be clustered is fused into a target cluster, wherein the target cluster is the cluster that contains the most number of the H documents to be clustered among the M clusters.
在一些可能的实施方式中,在根据所述任意两篇待聚类文献之间的共被引相似度,对所述N篇待聚类文献进行第一次聚类,得到M个聚类簇方面,处理单元402,具体用于:In some possible implementations, according to the co-citation similarity between any two documents to be clustered, the N documents to be clustered are clustered for the first time to obtain M clusters In one aspect, the processing unit 402 is specifically configured to:
根据所述任意两篇待聚类文献之间的共被引相似度,构建所述N篇待聚类文献的社区网络;According to the co-citation similarity between the arbitrary two documents to be clustered, construct a community network of the N documents to be clustered;
根据所述社区网络、所述任意两篇待聚类文献之间的共被引相似度以及社团检测算法,对所述N篇待聚类文献进行第一次聚类,得到M个聚类簇。According to the community network, the co-citation similarity between any two documents to be clustered, and the community detection algorithm, perform the first clustering on the N documents to be clustered to obtain M clusters .
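The first clustering stage — building the community network from pairwise co-citation similarities and partitioning it — can be sketched as follows. Important caveat: the patent uses a community-detection algorithm over the weighted network; as a minimal dependency-free stand-in, this sketch links documents whose similarity exceeds a threshold and takes connected components, which only approximates community detection. All names and the threshold are illustrative.

```python
def first_stage_clusters(similarities, threshold=0.3):
    """similarities maps a pair (u, v) to its co-citation similarity.
    Connect pairs above the threshold, then return the connected
    components of the resulting undirected graph as clusters."""
    graph = {}
    for (u, v), s in similarities.items():
        graph.setdefault(u, set())
        graph.setdefault(v, set())
        if s > threshold:
            graph[u].add(v)
            graph[v].add(u)
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal to collect one component.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters
```

A modularity-based method (e.g. Louvain or greedy modularity maximization) would be a closer match to the community-detection step the patent names.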
参阅图5，图5为本申请实施例提供的一种电子设备的结构示意图。该电子设备包括处理器和存储器。可选的，该电子设备还可包括收发器。例如，如图5所示，电子设备500包括收发器501、处理器502和存储器503。它们之间通过总线504连接。存储器503用于存储计算机程序和数据，并可以将存储器503存储的数据传输给处理器502。Referring to FIG. 5, FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device includes a processor and a memory. Optionally, the electronic device may further include a transceiver. For example, as shown in FIG. 5, the electronic device 500 includes a transceiver 501, a processor 502 and a memory 503, which are connected via a bus 504. The memory 503 is used to store computer programs and data, and can transmit the data stored in the memory 503 to the processor 502.
处理器502用于读取存储器503中的计算机程序执行以下操作:The processor 502 is used to read the computer program in the memory 503 to perform the following operations:
获取N篇待聚类文献,N为大于1的整数;Obtain N documents to be clustered, where N is an integer greater than 1;
确定所述N篇待聚类文献中任意两篇待聚类文献之间的共被引相似度;Determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered;
根据所述任意两篇待聚类文献之间的共被引相似度,对所述N篇待聚类文献进行第一次聚类,得到M个聚类簇,其中,所述M个聚类簇对应K篇待聚类文献,M为大于或等于1的整数,K为小于或等于N的整数;According to the co-citation similarity between any two documents to be clustered, the first clustering is performed on the N documents to be clustered, and M clusters are obtained, wherein the M clusters are The cluster corresponds to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;
对剩余的(N-K)篇待聚类文献进行第二次聚类,以将所述(N-K)篇待聚类文献融合到所述M个聚类簇。A second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
在一些可能的实施方式中,在确定所述N篇待聚类文献中任意两篇待聚类文献之间的共被引相似度方面,处理器502具体用于执行以下操作:In some possible implementations, in determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:
确定所述N篇待聚类文献中引用了第一篇待聚类文献的第一数量;determining the first number of the N documents to be clustered citing the first document to be clustered;
确定所述N篇待聚类文献中引用了第二篇待聚类文献的第二数量;determining the second number of the N documents to be clustered that cite the second document to be clustered;
确定所述N篇待聚类文献中同时引用了所述第一篇待聚类文献以及所述第二篇待聚类文献的第三数量；Determine the third number of documents among the N documents to be clustered that cite both the first document to be clustered and the second document to be clustered;
根据所述第一数量、所述第二数量以及所述第三数量，确定所述第一篇待聚类文献与所述第二篇待聚类文献之间的共被引相似度；Determine the co-citation similarity between the first document to be clustered and the second document to be clustered according to the first quantity, the second quantity and the third quantity;
其中，所述第一篇待聚类文献和所述第二篇待聚类文献为所述N篇待聚类文献中的任意两篇待聚类文献。Wherein, the first document to be clustered and the second document to be clustered are any two documents to be clustered among the N documents to be clustered.
在一些可能的实施方式中,在对剩余的(N-K)篇待聚类文献进行第二次聚类,以将所述(N-K)篇待聚类文献融合到所述M个聚类簇方面,处理器502具体用于执行以下操作:In some possible implementations, the second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, The processor 502 is specifically configured to perform the following operations:
constructing a feature vector for each of the N documents to be clustered;
performing the second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters.
In some possible implementations, with respect to constructing a feature vector for each of the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:
performing phrase extraction and deduplication on the title and abstract of each of the N documents to be clustered, to obtain P phrases, where P is an integer greater than 1;
determining the term frequency-inverse document frequency (TF-IDF) of each of the P phrases with respect to each of the N documents to be clustered;
constructing a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases with respect to each of the N documents to be clustered.
In some possible implementations, with respect to constructing a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases with respect to each of the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:
in a case where a document i to be clustered includes the j-th phrase among the P phrases, setting the value of the j-th dimension of the feature vector of the document i to the TF-IDF of the j-th phrase with respect to the document i; in a case where the document i does not include the j-th phrase, setting the value of the j-th dimension of the feature vector of the document i to a preset value;
wherein the document i to be clustered is any one of the N documents to be clustered, i is an integer from 1 to N, and j is an integer from 1 to P.
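A minimal sketch of this vector construction, assuming the standard TF-IDF definition tf(j, i) * log(N / df(j)) and a preset value of 0 for absent phrases; the specification leaves both of these choices open:

```python
import math
from collections import Counter

def build_feature_vectors(docs):
    """docs: list of phrase lists, one per document (after phrase extraction).

    Returns (phrases, vectors): the P deduplicated phrases and one
    P-dimensional TF-IDF vector per document.
    """
    n = len(docs)
    # The P deduplicated phrases, in first-seen order.
    phrases = list(dict.fromkeys(p for doc in docs for p in doc))
    # Document frequency of each phrase.
    df = Counter(p for doc in docs for p in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = []
        for p in phrases:
            if p in tf:
                # TF-IDF of phrase j with respect to document i.
                vec.append(tf[p] / len(doc) * math.log(n / df[p]))
            else:
                vec.append(0.0)  # assumed preset value for absent phrases
        vectors.append(vec)
    return phrases, vectors

docs = [["lung", "cancer", "cancer"], ["cancer", "therapy"], ["graph", "network"]]
phrases, vectors = build_feature_vectors(docs)
```

Each document ends up with a vector of length P = 5; dimension j is nonzero only when the document contains phrase j.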
In some possible implementations, with respect to performing the second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters, the processor 502 is specifically configured to perform the following operations:
determining the similarity between the feature vector of a document q to be clustered and the feature vector of a document e to be clustered, wherein the document q is any one of the (N-K) documents to be clustered, and the document e is any one of the K documents to be clustered;
selecting H documents to be clustered from the K documents to be clustered in descending order of similarity, where H is an integer greater than 1;
determining, for each of the M clusters, the number of the H documents to be clustered that belong to that cluster;
merging the document q into a target cluster, wherein the target cluster is the cluster among the M clusters that contains the largest number of the H documents to be clustered.
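This assignment is in effect a k-nearest-neighbours majority vote over the already-clustered K documents. A minimal sketch, assuming cosine similarity between feature vectors (the specification does not name the similarity measure):

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_to_cluster(q_vec, clustered, h):
    """clustered: list of (feature_vector, cluster_id) for the K documents.

    Selects the H most similar clustered documents and returns the cluster
    id that the largest number of them belong to.
    """
    ranked = sorted(clustered, key=lambda item: cosine(q_vec, item[0]),
                    reverse=True)
    votes = Counter(cluster_id for _, cluster_id in ranked[:h])
    return votes.most_common(1)[0][0]

clustered = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.0, 1.0], "B")]
print(assign_to_cluster([1.0, 0.2], clustered, h=2))  # prints "A"
```

The two nearest clustered documents both belong to cluster "A", so the document q is merged into "A".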
In some possible implementations, with respect to performing the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clusters, the processor 502 is specifically configured to perform the following operations:
constructing a community network of the N documents to be clustered according to the co-citation similarity between any two documents to be clustered;
performing the first clustering on the N documents to be clustered according to the community network, the co-citation similarity between any two documents to be clustered, and a community detection algorithm, to obtain M clusters.
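The specification names a community detection algorithm without fixing one; Louvain or greedy modularity maximization are typical choices on such similarity networks. As a minimal, dependency-free stand-in, the sketch below builds the community network (an edge wherever the co-citation similarity exceeds a threshold) and groups documents by connected component, leaving isolated documents for the second clustering pass:

```python
def first_clustering(similarity, n, threshold=0.5):
    """similarity: dict mapping (i, j) pairs (i < j) to co-citation similarity.

    Builds the community network and returns the M clusters; documents with
    no edge stay unclustered and are handled by the second clustering.
    """
    adj = {i: set() for i in range(n)}
    for (i, j), sim in similarity.items():
        if sim > threshold:
            adj[i].add(j)
            adj[j].add(i)
    clusters, seen = [], set()
    for start in range(n):
        if start in seen or not adj[start]:
            continue  # isolated documents are left for the second pass
        # Walk over one community of the network.
        component, queue = set(), [start]
        while queue:
            node = queue.pop()
            if node in component:
                continue
            component.add(node)
            queue.extend(adj[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

similarity = {(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.7, (0, 4): 0.1}
print(first_clustering(similarity, n=6))
```

Here M = 2 clusters are found ({0, 1, 2} and {3, 4}) and document 5, having no strong co-citation link, remains for the second clustering.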
Specifically, the transceiver 501 may be the transceiver unit 401 of the document clustering apparatus 400 of the embodiment described in FIG. 4, and the processor 502 may be the processing unit 402 of the document clustering apparatus 400 of the embodiment described in FIG. 4.
It should be understood that the document clustering apparatus in the present application may include a smartphone (such as an Android phone, an iOS phone, or a Windows Phone), a tablet computer, a palmtop computer, a notebook computer, a mobile Internet device (MID), a wearable device, and the like. The above document clustering apparatuses are merely examples rather than an exhaustive list; the apparatus includes but is not limited to the above electronic devices. In practical applications, the document clustering apparatus may further include an intelligent vehicle-mounted terminal, computer equipment, and the like.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program is executed by a processor to implement some or all of the steps of any document clustering method described in the foregoing method embodiments.
Optionally, a storage medium involved in the present application, such as the computer-readable storage medium, may be non-volatile or volatile.
Embodiments of the present application further provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps of any document clustering method described in the foregoing method embodiments.
It should be noted that, for brevity of description, the foregoing method embodiments are each expressed as a series of action combinations. However, those skilled in the art should understand that the present application is not limited by the described order of actions, because, according to the present application, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by instructing relevant hardware through a program. The program may be stored in a computer-readable memory, and the memory may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may, based on the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (20)
- A document clustering method, comprising: acquiring N documents to be clustered, where N is an integer greater than 1; determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered; performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N; and performing a second clustering on the remaining (N-K) documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters.
- The method according to claim 1, wherein determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered comprises: determining a first number of documents among the N documents to be clustered that cite a first document to be clustered; determining a second number of documents among the N documents to be clustered that cite a second document to be clustered; determining a third number of documents among the N documents to be clustered that cite both the first document to be clustered and the second document to be clustered; and determining the co-citation similarity between the first document to be clustered and the second document to be clustered according to the first number, the second number, and the third number; wherein the first document to be clustered and the second document to be clustered are any two of the N documents to be clustered.
- The method according to claim 1 or 2, wherein performing the second clustering on the remaining (N-K) documents to be clustered so as to merge the (N-K) documents to be clustered into the M clusters comprises: constructing a feature vector for each of the N documents to be clustered; and performing the second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters.
- The method according to claim 3, wherein constructing a feature vector for each of the N documents to be clustered comprises: performing phrase extraction and deduplication on the title and abstract of each of the N documents to be clustered, to obtain P phrases, where P is an integer greater than 1; determining the term frequency-inverse document frequency (TF-IDF) of each of the P phrases with respect to each of the N documents to be clustered; and constructing a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases with respect to each of the N documents to be clustered.
- The method according to claim 4, wherein constructing a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases with respect to each of the N documents to be clustered comprises: in a case where a document i to be clustered includes the j-th phrase among the P phrases, setting the value of the j-th dimension of the feature vector of the document i to the TF-IDF of the j-th phrase with respect to the document i; and in a case where the document i does not include the j-th phrase, setting the value of the j-th dimension of the feature vector of the document i to a preset value; wherein the document i to be clustered is any one of the N documents to be clustered, i is an integer from 1 to N, and j is an integer from 1 to P.
- The method according to claim 4 or 5, wherein performing the second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters, comprises: determining the similarity between the feature vector of a document q to be clustered and the feature vector of a document e to be clustered, wherein the document q is any one of the (N-K) documents to be clustered, and the document e is any one of the K documents to be clustered; selecting H documents to be clustered from the K documents to be clustered in descending order of similarity, where H is an integer greater than 1; determining, for each of the M clusters, the number of the H documents to be clustered that belong to that cluster; and merging the document q into a target cluster, wherein the target cluster is the cluster among the M clusters that contains the largest number of the H documents to be clustered.
- The method according to claim 1, wherein performing the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clusters comprises: constructing a community network of the N documents to be clustered according to the co-citation similarity between any two documents to be clustered; and performing the first clustering on the N documents to be clustered according to the community network, the co-citation similarity between any two documents to be clustered, and a community detection algorithm, to obtain M clusters.
- A document clustering apparatus, comprising: an acquisition unit, configured to acquire N documents to be clustered, where N is an integer greater than 1; and a processing unit, configured to: determine the co-citation similarity between any two documents to be clustered among the N documents to be clustered; perform a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N; and perform a second clustering on the remaining (N-K) documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters.
- An electronic device, comprising a processor and a memory, wherein the processor is connected to the memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the following method: acquiring N documents to be clustered, where N is an integer greater than 1; determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered; performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N; and performing a second clustering on the remaining (N-K) documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters.
- The electronic device according to claim 9, wherein determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered comprises: determining a first number of documents among the N documents to be clustered that cite a first document to be clustered; determining a second number of documents among the N documents to be clustered that cite a second document to be clustered; determining a third number of documents among the N documents to be clustered that cite both the first document to be clustered and the second document to be clustered; and determining the co-citation similarity between the first document to be clustered and the second document to be clustered according to the first number, the second number, and the third number; wherein the first document to be clustered and the second document to be clustered are any two of the N documents to be clustered.
- The electronic device according to claim 9 or 10, wherein performing the second clustering on the remaining (N-K) documents to be clustered so as to merge the (N-K) documents to be clustered into the M clusters comprises: constructing a feature vector for each of the N documents to be clustered; and performing the second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters.
- The electronic device according to claim 11, wherein constructing a feature vector for each of the N documents to be clustered comprises: performing phrase extraction and deduplication on the title and abstract of each of the N documents to be clustered, to obtain P phrases, where P is an integer greater than 1; determining the term frequency-inverse document frequency (TF-IDF) of each of the P phrases with respect to each of the N documents to be clustered; and constructing a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases with respect to each of the N documents to be clustered.
- The electronic device according to claim 12, wherein constructing a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases with respect to each of the N documents to be clustered comprises: in a case where a document i to be clustered includes the j-th phrase among the P phrases, setting the value of the j-th dimension of the feature vector of the document i to the TF-IDF of the j-th phrase with respect to the document i; and in a case where the document i does not include the j-th phrase, setting the value of the j-th dimension of the feature vector of the document i to a preset value; wherein the document i to be clustered is any one of the N documents to be clustered, i is an integer from 1 to N, and j is an integer from 1 to P.
- The electronic device according to claim 9, wherein performing the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered to obtain M clusters comprises: constructing a community network of the N documents to be clustered according to the co-citation similarity between any two documents to be clustered; and performing the first clustering on the N documents to be clustered according to the community network, the co-citation similarity between any two documents to be clustered, and a community detection algorithm, to obtain M clusters.
- A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method: acquiring N documents to be clustered, where N is an integer greater than 1; determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered; performing a first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, wherein the M clusters correspond to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N; and performing a second clustering on the remaining (N-K) documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters.
- The computer-readable storage medium according to claim 15, wherein determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered comprises: determining a first number of documents among the N documents to be clustered that cite a first document to be clustered; determining a second number of documents among the N documents to be clustered that cite a second document to be clustered; determining a third number of documents among the N documents to be clustered that cite both the first document to be clustered and the second document to be clustered; and determining the co-citation similarity between the first document to be clustered and the second document to be clustered according to the first number, the second number, and the third number; wherein the first document to be clustered and the second document to be clustered are any two of the N documents to be clustered.
- The computer-readable storage medium according to claim 15 or 16, wherein performing the second clustering on the remaining (N-K) documents to be clustered so as to merge the (N-K) documents to be clustered into the M clusters comprises: constructing a feature vector for each of the N documents to be clustered; and performing the second clustering on the (N-K) documents to be clustered according to the feature vector of each of the (N-K) documents to be clustered and the feature vector of each of the K documents to be clustered, so as to merge the (N-K) documents to be clustered into the M clusters.
- The computer-readable storage medium according to claim 17, wherein constructing a feature vector for each of the N documents to be clustered comprises: performing phrase extraction and deduplication on the title and abstract of each of the N documents to be clustered, to obtain P phrases, where P is an integer greater than 1; determining the term frequency-inverse document frequency (TF-IDF) of each of the P phrases with respect to each of the N documents to be clustered; and constructing a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases with respect to each of the N documents to be clustered.
- The computer-readable storage medium according to claim 18, wherein constructing a feature vector for each of the N documents to be clustered according to the TF-IDF of each of the P phrases relative to each of the N documents to be clustered comprises: when document i to be clustered includes the j-th phrase among the P phrases, setting the value of the j-th dimension of the feature vector of document i to the TF-IDF of the j-th phrase relative to document i; and when document i does not include the j-th phrase, setting the value of the j-th dimension of the feature vector of document i to a preset value; wherein document i is any one of the N documents to be clustered, i is an integer from 1 to N, and j is an integer from 1 to P.
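A minimal sketch of this feature-vector construction, assuming the standard TF-IDF formula and a preset value of 0.0 (the claims fix neither the TF-IDF variant nor the preset value):

```python
import math
from collections import Counter

def build_feature_vectors(docs_phrases, preset=0.0):
    """Build a P-dimensional TF-IDF feature vector per document.

    `docs_phrases` is a list of phrase lists, one per document (the
    phrases extracted from each title and abstract). Dimension j holds
    the TF-IDF of phrase j for documents containing it, and `preset`
    otherwise, as the claim requires. The tf * log(N/df) formula is an
    assumed standard variant.
    """
    n = len(docs_phrases)
    # P deduplicated phrases across all documents, in a fixed order
    vocab = sorted({p for doc in docs_phrases for p in doc})
    # Document frequency of each phrase
    df = Counter(p for doc in docs_phrases for p in set(doc))
    vectors = []
    for doc in docs_phrases:
        tf = Counter(doc)
        total = len(doc) or 1
        vec = [
            (tf[p] / total) * math.log(n / df[p]) if p in tf else preset
            for p in vocab
        ]
        vectors.append(vec)
    return vocab, vectors
```

Note that a phrase appearing in every document gets idf = log(N/N) = 0, so its dimension carries no discriminating weight; this is the usual TF-IDF behavior, not something the claim mandates.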
- The computer-readable storage medium according to claim 15, wherein performing the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, comprises: constructing a community network of the N documents to be clustered according to the co-citation similarity between any two documents to be clustered; and performing the first clustering on the N documents to be clustered according to the community network, the co-citation similarity between any two documents to be clustered, and a community detection algorithm, to obtain the M clusters.
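A runnable sketch of this first pass, using weighted label propagation as one concrete community detection algorithm; the claim does not name a specific one (Louvain is another common choice), and the similarity threshold for drawing an edge is also an assumption:

```python
import random

def label_propagation(similarity, threshold=0.0, seed=0, max_iter=100):
    """First-pass clustering via label propagation on the community network.

    `similarity` maps frozenset({a, b}) -> co-citation similarity between
    documents a and b. An edge joins a and b when their similarity exceeds
    `threshold`, which builds the community network; label propagation on
    the weighted edges then yields the M clusters.
    """
    # Build the community network: weighted adjacency lists
    adj = {}
    for pair, w in similarity.items():
        if w > threshold:
            a, b = tuple(pair)
            adj.setdefault(a, {})[b] = w
            adj.setdefault(b, {})[a] = w
    # Every node starts in its own community
    labels = {n: n for n in adj}
    rng = random.Random(seed)
    nodes = list(adj)
    for _ in range(max_iter):
        changed = False
        rng.shuffle(nodes)
        for n in nodes:
            # Adopt the label with the largest total edge weight among neighbors
            weight = {}
            for m, w in adj[n].items():
                weight[labels[m]] = weight.get(labels[m], 0.0) + w
            best = max(weight, key=weight.get)
            if labels[n] != best:
                labels[n], changed = best, True
        if not changed:
            break
    # Group documents sharing a label into one cluster
    clusters = {}
    for n, lab in labels.items():
        clusters.setdefault(lab, set()).add(n)
    return list(clusters.values())
```

Documents with no edge above the threshold are simply absent from the result here; a production variant would place each such isolated document in a singleton cluster.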
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011572311.7 | 2020-12-25 | ||
CN202011572311.7A CN112667810B (en) | 2020-12-25 | 2020-12-25 | Document clustering method and apparatus, electronic device, and storage medium
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022134343A1 (en) | 2022-06-30 |
Family
ID=75410135
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/082726 WO2022134343A1 (en) | 2020-12-25 | 2021-03-24 | Document clustering method and apparatus, electronic device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112667810B (en) |
WO (1) | WO2022134343A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6457028B1 (en) * | 1998-03-18 | 2002-09-24 | Xerox Corporation | Method and apparatus for finding related collections of linked documents using co-citation analysis |
US20110295903A1 (en) * | 2010-05-28 | 2011-12-01 | Drexel University | System and method for automatically generating systematic reviews of a scientific field |
CN103455622A (en) * | 2013-09-12 | 2013-12-18 | 广东电子工业研究院有限公司 | Automatic document dimensional clustering method |
CN108509481A (en) * | 2018-01-18 | 2018-09-07 | 天津大学 | Draw the study frontier visual analysis method of cluster altogether based on document |
CN111898366A (en) * | 2020-07-29 | 2020-11-06 | 平安科技(深圳)有限公司 | Document subject word aggregation method and device, computer equipment and readable storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8543576B1 (en) * | 2012-05-23 | 2013-09-24 | Google Inc. | Classification of clustered documents based on similarity scores |
CN110083703A (en) * | 2019-04-28 | 2019-08-02 | 浙江财经大学 | A kind of document clustering method based on citation network and text similarity network |
EP3882786A4 (en) * | 2019-05-17 | 2022-03-23 | Aixs, Inc. | Cluster analysis method, cluster analysis system, and cluster analysis program |
CN111581162B (en) * | 2020-05-06 | 2022-09-06 | 上海海事大学 | Ontology-based clustering method for mass literature data |
2020
- 2020-12-25 CN CN202011572311.7A patent/CN112667810B/en active Active
2021
- 2021-03-24 WO PCT/CN2021/082726 patent/WO2022134343A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
WU, FENGHUI: "Improvement of K-means Algorithm Based on Co-Citation Analysis", JOURNAL OF THE CHINA SOCIETY FOR SCIENTIFIC AND TECHNICAL INFORMATION, vol. 31, no. 1, 31 January 2012 (2012-01-31), CN , pages 82 - 94, XP009537875, ISSN: 1000-0135 * |
Also Published As
Publication number | Publication date |
---|---|
CN112667810B (en) | 2024-07-23 |
CN112667810A (en) | 2021-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108509474B (en) | Synonym expansion method and device for search information | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
JP6231668B2 (en) | Keyword expansion method and system and classification corpus annotation method and system | |
CN108804641A (en) | A kind of computational methods of text similarity, device, equipment and storage medium | |
CN107992477A (en) | Text subject determines method, apparatus and electronic equipment | |
US8001139B2 (en) | Using a bipartite graph to model and derive image and text associations | |
CN104750798B (en) | Recommendation method and device for application program | |
CN104239373B (en) | Add tagged method and device for document | |
CN108717407B (en) | Entity vector determination method and device, and information retrieval method and device | |
WO2020114100A1 (en) | Information processing method and apparatus, and computer storage medium | |
US20180046721A1 (en) | Systems and Methods for Automatic Customization of Content Filtering | |
WO2021189920A1 (en) | Medical text cluster subject matter determination method and apparatus, electronic device, and storage medium | |
CN112329460B (en) | Text topic clustering method, device, equipment and storage medium | |
CN111159359A (en) | Document retrieval method, document retrieval device and computer-readable storage medium | |
Jin et al. | Entity linking at the tail: sparse signals, unknown entities, and phrase models | |
CN111737997A (en) | Text similarity determination method, text similarity determination equipment and storage medium | |
CN112988980B (en) | Target product query method and device, computer equipment and storage medium | |
WO2018121198A1 (en) | Topic based intelligent electronic file searching | |
CN115658851B (en) | Medical literature retrieval method, system, storage medium and terminal based on theme | |
CN112307190B (en) | Medical literature ordering method, device, electronic equipment and storage medium | |
US20150169740A1 (en) | Similar image retrieval | |
CN114330335B (en) | Keyword extraction method, device, equipment and storage medium | |
CN109614478B (en) | Word vector model construction method, keyword matching method and device | |
CN112287217B (en) | Medical document retrieval method, medical document retrieval device, electronic equipment and storage medium | |
CN109635004A (en) | A kind of object factory providing method, device and the equipment of database |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21908355; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 21908355; Country of ref document: EP; Kind code of ref document: A1 |