WO2022134343A1

WO2022134343A1 - Document clustering method and apparatus, electronic device, and storage medium

Info

Publication number: WO2022134343A1
Application number: PCT/CN2021/082726
Authority: WO
Inventors: 柴玲
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-25
Filing date: 2021-03-24
Publication date: 2022-06-30
Also published as: CN112667810B; CN112667810A

Abstract

A document clustering method, a document clustering apparatus (400), an electronic device (500), and a storage medium. The document clustering method comprises: the document clustering apparatus (400) obtains N documents to be clustered, N being an integer greater than 1 (101); the document clustering apparatus (400) determines a co-citation similarity between any two documents among the N documents (102); according to the co-citation similarity between the any two documents, the document clustering apparatus (400) performs first clustering on the N documents to obtain M clusters, the M clusters corresponding to K documents to be clustered, M being an integer greater than or equal to 1, and K being an integer less than or equal to N (103); and the document clustering apparatus (400) performs second clustering on the remaining (N-K) documents so as to fuse the (N-K) documents into the M clusters (104). The method helps to increase the accuracy of document clustering.

Description

Document clustering method, device, electronic device and storage medium

This application claims the priority of the Chinese patent application with the application number 202011572311.7 and the invention title "Document Clustering, Apparatus, Electronic Equipment and Storage Medium", which was filed with the China Patent Office on December 25, 2020, the entire contents of which are incorporated by reference in this application.

technical field

The present application relates to the technical field of artificial intelligence, and in particular, to a document clustering method, apparatus, electronic device and storage medium.

Background technique

The inventor found that, at present, the similarity based on the citation relationship can generally be used to measure the similarity between the topics of the documents. However, in order to supplement the similarity between the documents, the text similarity between the documents is introduced. Comprehensively measure the similarity between document topics, that is, put multiple indicators in the same space to measure the similarity of document topics; after measuring the similarity between document topics, a single clustering algorithm or community detection algorithm can be used to measure the similarity between document topics. Clustering of multiple articles.

However, the inventor realizes that after the introduction of text similarity, the formed clustering network will be very dense, which will make the clustering granularity thicker and reduce the clustering accuracy of documents.

SUMMARY OF THE INVENTION

The embodiments of the present application provide a document clustering method, apparatus, electronic device, and storage medium, which improve the clustering accuracy by clustering twice.

In a first aspect, an embodiment of the present application provides a document clustering method, including:

Obtain N documents to be clustered, where N is an integer greater than 1;

Determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered;

According to the co-citation similarity between any two documents to be clustered, the first clustering is performed on the N documents to be clustered, and M clusters are obtained, wherein the M clusters are The cluster corresponds to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;

A second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.

In a second aspect, an embodiment of the present application provides a document clustering device, including:

The acquisition unit is used to acquire N documents to be clustered, where N is an integer greater than 1;

a processing unit, configured to determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered;

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, the processor is connected to a memory, the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory , so that the electronic device performs the following methods:

Obtain N documents to be clustered, where N is an integer greater than 1;

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program causes a computer to execute the following method:

Obtain N documents to be clustered, where N is an integer greater than 1;

In a fifth aspect, an embodiment of the present application provides a computer program product, the computer program product includes a non-transitory computer-readable storage medium storing a computer program, the computer is operable to cause the computer to execute as described in the first aspect Methods.

To implement the embodiment of the present application, firstly, according to the co-citation similarity between the documents, N cluster documents are clustered to obtain M clusters, because the co-citation similarity can be used to obtain more accurate clusters. , so the accuracy of the M clusters is high; then, without changing the clustering granularity, the remaining documents that are not in the M clusters are clustered a second time, and fused into the M clusters. Clustering clusters, so that the clustering granularity will not be reduced, and the clustering accuracy will be improved.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings that are used in the description of the embodiments. Obviously, the drawings in the following description are some embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

1 is a schematic flowchart of a document clustering method provided in an embodiment of the present application;

2 is a schematic flowchart of a clustering theme naming method provided by an embodiment of the present application;

3 is a schematic diagram of constructing an undirected graph according to an embodiment of the present application;

4 is a block diagram of functional unit composition of a document clustering apparatus provided in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a document clustering apparatus provided by an embodiment of the present application.

Detailed ways

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

The terms "first", "second", "third" and "fourth" in the description and claims of the present application and the drawings are used to distinguish different objects, rather than to describe a specific order . Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes For other steps or units inherent to these processes, methods, products or devices.

Reference herein to an "embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor a separate or alternative embodiment that is mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

The technical solution of the present application relates to the field of artificial intelligence technology, and can be applied to scenarios such as digital medical care to promote the construction of smart cities. Optionally, the data involved in this application, such as documents to be clustered and/or clustering results, may be stored in a database, or may be stored in a blockchain, which is not limited in this application.

Referring to FIG. 1 , FIG. 1 is a schematic flowchart of a document clustering method provided by an embodiment of the present application. The method is applied to a document clustering device. The method includes the following steps:

101: The document clustering device obtains N documents to be clustered, where N is an integer greater than 1.

Exemplarily, the documents to be clustered may be medical documents, patent documents, academic documents, and so on. This application does not limit the types of documents to be clustered.

102: The document clustering apparatus determines the co-citation similarity between any two documents to be clustered among the N documents to be clustered.

Exemplarily, determining the first number of the N documents to be clustered citing the first document to be clustered, and determining the second number of the N documents to be clustered citing the second document to be clustered; Determine the third quantity of the N documents to be clustered that simultaneously cite the first document to be clustered and the two documents to be clustered; finally, determine the first quantity according to the first quantity, the second quantity and the third quantity Co-citation similarity between the second document to be clustered and the second document to be clustered; the first document to be clustered and the second document to be clustered are any two of the N documents to be clustered articles to be clustered. Therefore, the co-citation similarity between the first document to be clustered and the second document to be clustered can be expressed by formula (1):

Among them, a ₁ is the first document to be clustered, a ₂ is the second document to be clustered, sim(a ₁ , a ₂ ) is the difference between the first document to be clustered and the second document to be clustered The co-citation similarity of , X is the set of documents citing a ₁ , Y is the set of documents citing a ₂ , then X∩Y is the third quantity, 丨X丨 is the first quantity, and 丨Y丨 is second quantity.

It should be understood that if the N documents to be clustered do not cite the documents to be clustered of the first document to be clustered and the second document to be clustered at the same time, the first document to be clustered and the second document to be clustered are not. The co-citation similarity between the clustered documents is 0, that is, there is no co-citation relationship between the two documents.

103: The document clustering device performs the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, and obtains M clusters, wherein the M clusters are A cluster corresponds to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N.

Exemplarily, the community network corresponding to the N documents to be clustered can be determined according to the co-citation similarity between any two documents; then, according to the co-citation similarity between any two documents to be clustered, The community detection algorithm and the community network perform the first clustering on the N documents to be clustered to obtain M clusters. In this application, the co-citation similarity between any two documents to be clustered is used as the weight between the two documents (that is, the weight of each edge in the community network); then, each document is regarded as an independent Community, based on the principle of minimizing modularity, merge some of the N communities to obtain M communities, that is, to obtain M clusters, and the M clusters contain a total of K articles to be clustered. Documents, that is to say, there are still (N-K) documents to be clustered that have not been clustered.

104: The document clustering device performs a second clustering on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into M clusters.

Exemplarily, a feature vector is constructed for each document to be clustered in the N documents to be clustered; according to the feature vector of each document to be clustered in (N-K) documents to be clustered and K documents to be clustered The second clustering is performed on the (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.

Exemplarily, phrase extraction is performed on the title and abstract of each document to be clustered in the N documents to be clustered, and the phrases extracted from the N documents to be clustered are deduplicated to obtain P phrases, Among them, P is an integer greater than 1, and the phrase extraction of the title and abstract of the document can extract the phrase of each document from the subject and abstract of each document through the language processing toolkit stanford NLP. Then, determine the term frequency-inverse document frequency (TF-IDF) of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered; finally, according to With respect to the TD-IDF of each of the N documents to be clustered, each phrase in the P phrases constructs a feature vector for each of the N documents to be clustered.

Specifically, when the document i to be clustered includes the jth phrase among the P phrases, the value of the jth dimension in the feature vector of the document i to be clustered is set as The jth phrase is relative to the TD-IDF of the to-be-clustered document i to be clustered; if the j-th phrase is not included in the to-be-clustered document i to be clustered, then the to-be-clustered document i is not included in the TD-IDF. The value of the jth dimension in the feature vector of the cluster document i is set to a preset value, for example, it can be set to 0. The document i to be clustered is any one of the N documents to be clustered, then the value of i is an integer from 1 to N, and the jth phrase is any one of the P phrases phrase, j is an integer from 1 to P.

Therefore, the feature vector of each document can be expressed by formula (2):

in,

Among them, V _i is the feature vector of the document i to be clustered, W ₁ , ..., W _j , ..., W _p , is the value of the feature vector of the document i to be clustered from 1 dimension to P dimension,

is the TF-IDF of phrase j relative to document i to be clustered.

In an embodiment of the present application, the TF-IDF of each phrase relative to each document to be clustered in the N documents to be clustered can be expressed by formula (3):

in,

is the number of phrases j extracted from the titles and abstracts of document i to be clustered,

is the total number of phrases extracted from the titles and abstracts of document i to be clustered, N is the number of documents to be clustered, and N _j is the number of documents containing phrase j in the N documents to be clustered.

Further, after constructing the feature vector of each document to be clustered, the similarity between the feature vector of the document to be clustered q and the feature vector of the document to be clustered e can be determined, wherein the document to be clustered q is any one of the (N-K) documents to be clustered, and the document to be clustered e is any one of the K documents to be clustered; traverse the K documents to be clustered, so that the to-be-clustered document can be obtained The similarity between the document q and the K documents to be clustered is obtained, and K similarities are obtained; then, according to the order of the K similarities from large to small, H documents to be clustered are selected from the K documents to be clustered document, H is a preset integer greater than 1; finally, determine the number of the H documents to be clustered that belong to each of the M clusters, and fuse the document q to be clustered into the target cluster. cluster, wherein the target cluster is the cluster that contains the most number of H documents to be clustered among the M clusters.

For example, 10 documents to be clustered are selected from the K documents to be clustered, that is, H=10, and the clusters include cluster 1, cluster 2 and cluster 3, wherein the Among the 10 documents to be clustered, 3 are from cluster 1, 4 are from cluster 2, and 3 are from cluster 3, then it is determined that cluster 2 contains the 10 to be clustered The largest number of documents. Therefore, the document q to be clustered is fused into cluster 2.

It can be seen that, in the embodiment of the present application, firstly, N cluster documents are clustered according to the co-citation similarity between the documents, and M clusters are obtained, because the co-citation similarity can be used to compare Therefore, the accuracy of the M clusters is high; then, without changing the clustering granularity, the remaining documents that are not in the M clusters are clustered a second time, and the fusion To the M clusters, the clustering granularity will not be reduced, and the clustering accuracy will be improved.

In an embodiment of the present application, after the N documents to be clustered are clustered, each cluster cluster may be named a subject, so as to facilitate subsequent classification of each cluster cluster. The following provides an example of naming a cluster i to illustrate the process of naming a cluster. The cluster is any one of the M clusters, and the naming methods of other clusters are the same as This is similar and will not be described again.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a method for naming a document subject provided by an embodiment of the present application. The method is applied to a document clustering device. The method includes the following steps:

201: Determine a target document to be clustered in cluster i according to the co-citation similarity between any two documents to be clustered in cluster i.

Exemplarily, according to the co-citation similarity between any two documents to be clustered in cluster i, the score of each document to be clustered in cluster i is determined, wherein the score of each document to be clustered is determined. The score is used to indicate the degree of importance of each document to be clustered, that is, the quality of the document to be clustered; then, the target document to be clustered in cluster i is determined according to the order of scores from large to small. Exemplarily, a preset proportion of documents may be selected from each cluster as the target documents to be clustered according to the descending order of scores. For example, if the number of documents in a certain cluster is 100, and the preset ratio is 10%, then the top ten documents are selected from the 100 documents in the order of the scores from the largest to the smallest as the cluster. The target document to be clustered.

Specifically, according to the co-citation similarity between any two documents to be clustered in cluster i, the undirected graph corresponding to cluster i is determined, that is, the documents to be clustered that have a co-citation relationship are connected. , the documents to be clustered without co-citation relationship are not connected; according to the undirected graph corresponding to the cluster i and the pagerank algorithm to determine the score of each node (that is, each document to be clustered) in the undirected graph, it can be The score of each document to be clustered in cluster i is obtained, that is, the score of each document to be clustered is determined according to the path between each document to be clustered and other documents to be clustered. Specifically, the score of each to-be-clustered document is determined according to the co-citation similarity between each to-be-clustered document and other to-be-clustered documents, preset parameters, and the number of intermediate to-be-clustered documents.

For example, cluster i includes document A to be clustered, document B to be clustered, document C to be clustered, and document D to be clustered, wherein document D to be clustered is not shared with other documents to be clustered. The citation relationship is the last fused document to be clustered. Therefore, an undirected graph as shown in Figure 3 can be established. According to the pagerank algorithm and the undirected graph, the scores corresponding to document A to be clustered, document B to be clustered, document C to be clustered, and document D to be clustered can be determined respectively. Exemplarily, the score corresponding to the document A to be clustered is the score between the document A to be clustered and the document B to be clustered, and the sum of the score between the document A to be clustered and the document C to be clustered. Exemplarily, the score corresponding to the document A to be clustered can be expressed by formula (4):

S=1*γ*sim(A,B)+1*γ ² *(A,C) Formula (4)

S is the score corresponding to document A to be clustered, sim(A, B) is the co-citation similarity between document A to be clustered and document B to be clustered, sim(A, C) is the document to be clustered The co-citation similarity between A and the document C to be clustered, γ is a preset parameter, 0<γ<1.

202 : Determine L phrases belonging to cluster i among the P phrases, where L is an integer less than or equal to P.

203: Determine the topic corresponding to the cluster i according to the target documents to be clustered and the L phrases.

Exemplarily, word embedding may be performed on the title of the target document to be clustered in cluster i to obtain the first feature vector corresponding to cluster i. Among them, the word embedding of the title of the target document to be clustered in the cluster i can be realized by the Biobert model that has completed the training. The Biobert model is obtained by training the documents in the medical field as the training corpus. Language processing in the medical field will be more accurate, and the semantics of documents can be extracted more accurately. Among them, the training of the Biobert model can be obtained through supervised training, and will not be repeated here.

It should be understood that when the number of the target document to be clustered is one, the feature vector obtained by word embedding of the title of the target document to be clustered is used as the first feature vector; When the number is multiple, word embedding can be performed on the title of each target document to be clustered to obtain the feature vector corresponding to each target document to be clustered. After the eigenvectors are averaged bit by bit, the first eigenvector is obtained.

Further, word embedding is performed on each of the L phrases to obtain the second feature vector of each phrase, wherein the word embedding for each phrase can also be implemented by the above-mentioned Biobert model, and will not be described again; then, Perform word embedding on each word of each phrase to obtain the third feature vector corresponding to each word; according to the third feature vector corresponding to each word, determine the fourth feature vector corresponding to each phrase, that is, in each phrase The third feature vector corresponding to each word of , and the feature vector obtained by the bitwise averaging is used as the fourth feature vector corresponding to each phrase. For example, the four words in the phrase "lung cancer survival rate" are word-embedded respectively to obtain four feature vectors, and the four feature vectors are averaged bit by bit to obtain the second feature vector corresponding to the phrase. .

Finally, according to the first feature vector corresponding to cluster i, the second feature vector corresponding to each phrase, the fourth feature vector corresponding to each phrase, and the TF-IDF of each phrase relative to the cluster i, determine The topic corresponding to each cluster, wherein the TF-IDF of each phrase relative to the cluster i is the average of the TF-IDF of each phrase relative to each document to be clustered in the cluster i value.

Exemplarily, determine the first similarity between the first feature vector corresponding to cluster i and the second feature vector corresponding to each phrase; determine the first feature vector corresponding to cluster i and each phrase corresponding to the first similarity. The second similarity between the fourth feature vectors; finally, according to the first similarity corresponding to each phrase, the second similarity and the TF-IDF relative to the cluster i, determine the cluster i and each phrase The third similarity between. For example, the first similarity, the second similarity and the TF-IDF may be weighted to obtain the third similarity.

Exemplarily, the above similarity may be the cosine similarity between vectors. Therefore, the third similarity can be expressed by formula (5):

sim(phr,cluster)=β*cos _sim (vec ₁ ,vec ₂ )+(1-β)*cos _sim (vec ₁ ,vec ₄ )+(1-β)*TF-IDF formula (5)

Among them, sim(phr,cluster) is the third similarity between cluster i and each phrase, cos _sim is the cosine similarity operation, vec ₁ is the first feature vector corresponding to cluster i, vec ₂ is the second feature vector corresponding to each phrase, vec ₄ is the fourth feature vector corresponding to each phrase, β is a preset parameter, 0≤β≤1.

Then, according to the second feature vector of each phrase, the fourth degree of similarity between any two phrases in the L phrases is determined. Exemplarily, the fourth similarity may also be a cosine similarity. Therefore, the fourth similarity may be expressed by formula (6):

sim(phr ₁ ,phr ₂ )=cos _sim (vec ₂₁ ,vec ₂₂ ) Formula (6)

phr ₁ and phr ₂ are any two phrases among the L phrases, sim(phr ₁ ,phr ₂ ) is the fourth similarity between the two phrases, vec ₂₁ is the second feature vector of phr ₁ , vec ₂₂ is the second eigenvector corresponding to phr ₂ .

Finally, according to the third degree of similarity between cluster i and each phrase and the fourth degree of similarity between any two phrases, the topic corresponding to cluster i is determined.

Exemplarily, take the phrase with the third highest similarity as a target phrase, and move the target phrase from the L phrases to the target phrase set; The third similarity between clusters i, and the second similarity with each target phrase in the target phrase set, determine the Maximum Marginal Relevance (MMR) score corresponding to each phrase in the remaining phrases, For example, according to the third degree of similarity between each phrase in the remaining phrases and cluster i, and the second degree of similarity between each phrase in the target phrase set and each target phrase in the target phrase set, we can obtain each target phrase in the target phrase set. The fifth similarity corresponding to the phrase, and the maximum fifth similarity is used as the MMR score of each phrase in the remaining phrases; then, the phrase with the largest MMR score in the remaining phrases is moved from the remaining phrases to the target phrase set. Finally, determine the MMR score corresponding to each phrase in the remaining phrases again, and move the phrase with the largest MMR score in the remaining phrases to the target phrase set, and iterate sequentially until the number of target phrases in the target phrase set reaches the preset number , stop the iteration, and take the target phrase in the target phrase set as the topic of cluster i.

Exemplarily, the MMR score of each phrase in the remaining phrases can be represented by formula (7):

Among them, PHR represents the candidate phrase set corresponding to cluster i, K is the target phrase set, phr _i ∈ PHR\K represents the ith phrase in the remaining phrases, MMR _i is the MMR score of the ith phrase, phr _j ∈K represents the jth phrase in the target phrase set, sim(phr _i ,cluster) is the third similarity between the ith phrase and cluster i,

is the fourth similarity between the ith phrase and the jth phrase, argmax represents the maximum value, that is, after traversing the phrases in the target phrase set, the maximum value is taken as the MMR score of the ith phrase, α is Preset parameters. Finally, after traversing each phrase in the remaining phrases, the MMR score of each phrase in the remaining phrases can be obtained.

For example, the candidate phrase set of a cluster includes phrase A, phrase B, phrase C, phrase D, and phrase E, and the third similarity between phrase A and the cluster is the largest, then phrase A is first As a target phrase, and move the phrase A from the L phrases to the target phrase set, the remaining phrases include phrase B, phrase C, phrase D and phrase E; then, calculate the MMR score of each phrase in the remaining phrases , that is, the third similarity between each phrase and the cluster and the second similarity between each phrase and phrase A are substituted into the above formula (6), and the corresponding phrases B, C, D and E are obtained respectively. Assuming that the MMR score of phrase B is the largest, then move phrase B from the candidate set to the target phrase set, then the remaining phrases include phrase C, phrase D, and phrase E. Finally, substitute the third similarity between each phrase in the remaining phrases and the cluster and the second similarity with phrase A into the above formula (6) to obtain a similarity corresponding to phrase A, Substitute the third similarity between the phrase and the cluster i and the second similarity between the phrase B into the above formula (6) to obtain a similarity corresponding to the phrase B. The largest similarity among the similarities is used as the MMR score of this phrase. The MMR scores of each phrase in the remaining phrases are sequentially determined, and then the MMR scores of phrase C, phrase D, and phrase E can be obtained. Assuming that phrase C has the largest MMR score, move phrase C to the target phrase set. If the preset number is three phrases, there are already three phrases in the target phrase set at this time, stop the iteration, and use phrase A, phrase B, and phrase C as the topics of cluster i.

It can be seen that in the process of naming clusters, in addition to considering the relationship between phrases and the clusters themselves, the similarity between phrases is also considered, so as to avoid selecting repetitive and redundant phrases as clusters main body of the cluster. In addition, each phrase is divided into words, and the second similarity between each phrase and the first feature vector of the medical document cluster is determined with the granularity of words. Mainly avoid some long phrases, which are not related to the topic of the cluster, but because the phrases are long, they may frequently contain some topic-related words, so in the process of semantic feature extraction for these long phrases, it may be Affected by these high-frequency words, the semantic features of these long phrases are related to the topics of the clusters, and these long phrases will be mistakenly regarded as the topics of the clusters, resulting in low topic accuracy of the extracted clusters. By segmenting each phrase, starting from each word itself, regardless of the context of the word, some words that are not related to the topic but frequently appear are classified as general vocabulary, and the second similarity is carried out. In the process of calculation, the obtained second similarity is relatively small, so after weighting, the obtained third similarity will be relatively small, so that such phrases will not be used as the subject of clusters, so that the final extracted The subject is relatively more precise.

In an embodiment of the present application, the document clustering method of the present application can also be applied to the field of medical technology. For example, medical documents can be clustered by the document clustering method of the present application. If the documents are medical documents, N medical documents to be clustered can be obtained from the medical database, for example, N documents to be clustered are read from the public medicine (PUBMED) database, and then Medical documents are clustered, N medical documents are clustered into multiple clusters, and each cluster can be named, so that doctors can clearly and quickly consult the medical documents they need to find, and promote the development of medical technology. progress.

Referring to FIG. 4 , FIG. 4 is a block diagram of functional units of a document clustering apparatus provided by an embodiment of the present application. The document clustering 400 includes: an acquisition unit 401 and a processing unit 402, wherein:

Obtaining unit 401, configured to obtain N documents to be clustered, where N is an integer greater than 1;

The processing unit 402 is configured to determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered; according to the co-citation similarity between any two documents to be clustered , perform the first clustering on the N documents to be clustered to obtain M clusters, wherein the M clusters correspond to the K documents to be clustered, and M is an integer greater than or equal to 1, K is an integer less than or equal to N; the second clustering is performed on the remaining (N-K) documents to be clustered to fuse the (N-K) documents to be clustered into the M clusters.

In some possible implementations, in determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered, the processing unit 402 is specifically configured to:

determining the first number of the N documents to be clustered citing the first document to be clustered;

determining the second number of the N documents to be clustered that cite the second document to be clustered;

Determine the third number of the N documents to be clustered that simultaneously cite the first document to be clustered and the two documents to be clustered;

determining the co-citation similarity between the first document to be clustered and the two documents to be clustered according to the first quantity, the second quantity and the third quantity;

Wherein, the first document to be clustered and the second document to be clustered are any two documents to be clustered among the N documents to be clustered.

In some possible implementations, the second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, The processing unit 402 is specifically used for:

Construct a feature vector for each document to be clustered in the N documents to be clustered;

According to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the feature vector of each document to be clustered in the K documents to be clustered, for the (N-K) documents The documents to be clustered are clustered a second time, so as to fuse the (N-K) documents to be clustered into the M clusters.

In some possible implementations, in constructing a feature vector for each of the N documents to be clustered, the processing unit 402 is specifically configured to:

Perform phrase extraction and deduplication on the title and abstract of each document to be clustered in the N documents to be clustered to obtain P phrases, where P is an integer greater than 1;

Determine the word frequency-inverse text frequency TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered;

According to the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered, for each document to be clustered in the N documents to be clustered Document construction eigenvectors.

In some possible implementations, according to the TF-IDF of each of the P phrases relative to each of the N documents to be clustered, for the N documents to be clustered In terms of constructing a feature vector for each document to be clustered in the class document, the processing unit 402 is specifically used for:

In the case that the document i to be clustered includes the jth phrase among the P phrases, set the value of the jth dimension in the feature vector of the document i to be clustered as the jth phrase With respect to the TF-IDF of the document i to be clustered, in the case that the document i to be clustered does not include the jth phrase, the jth phrase in the feature vector of the document i to be clustered is The value of the dimension is set to the default value;

Wherein, the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is an integer from 1 to P.

In some possible implementations, according to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the characteristics of each document to be clustered in the K documents to be clustered vector, the second clustering is performed on the (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, the processing unit 402 is specifically used for :

Determine the similarity between the feature vector of the document q to be clustered and the feature vector of the document e to be clustered, wherein the document q to be clustered is any one of the (N-K) documents to be clustered , the document e to be clustered is any one of the K documents to be clustered;

Select H documents to be clustered from the K documents to be clustered in descending order of similarity, where H is an integer greater than 1;

Determine the number of each of the H documents to be clustered that belong to each of the M clusters;

The document q to be clustered is fused into a target cluster, wherein the target cluster is the cluster that contains the most number of the H documents to be clustered among the M clusters.

In some possible implementations, according to the co-citation similarity between any two documents to be clustered, the N documents to be clustered are clustered for the first time to obtain M clusters In one aspect, the processing unit 402 is specifically configured to:

According to the co-citation similarity between the arbitrary two documents to be clustered, construct a community network of the N documents to be clustered;

According to the community network, the co-citation similarity between any two documents to be clustered, and the community detection algorithm, perform the first clustering on the N documents to be clustered to obtain M clusters .

Referring to FIG. 5 , FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device includes a processor and memory. Optionally, the electronic device may further include a transceiver. For example, as shown in FIG. 5 , the electronic device 500 includes a transceiver 501 , a processor 502 and a memory 503 . They are connected by bus 504 . The memory 503 is used to store computer programs and data, and can transmit the data stored by the memory 503 to the processor 502 .

The processor 502 is used to read the computer program in the memory 503 to perform the following operations:

Obtain N documents to be clustered, where N is an integer greater than 1;

In some possible implementations, in determining the co-citation similarity between any two documents to be clustered among the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:

According to the first quantity, the second quantity and the third quantity, determining the co-citation similarity between the first document to be clustered and the two documents to be clustered;

In some possible implementations, the second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, The processor 502 is specifically configured to perform the following operations:

In some possible implementations, in constructing a feature vector for each of the N documents to be clustered, the processor 502 is specifically configured to perform the following operations:

In some possible implementations, according to the TF-IDF of each of the P phrases relative to each of the N documents to be clustered, for the N documents to be clustered In terms of constructing a feature vector for each document to be clustered in the class document, the processor 502 is specifically configured to perform the following operations:

In some possible implementations, according to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the characteristics of each document to be clustered in the K documents to be clustered vector, the second clustering is performed on the (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, and the processor 502 is specifically configured to execute Do the following:

In some possible implementations, according to the co-citation similarity between any two documents to be clustered, the N documents to be clustered are clustered for the first time to obtain M clusters In one aspect, the processor 502 is specifically configured to perform the following operations:

Specifically, the transceiver 501 may be the transceiver unit 401 of the document clustering apparatus 400 of the embodiment described in FIG. 4 , and the processor 502 may be the processing unit 402 of the document clustering apparatus 400 of the embodiment described in FIG. 4 . .

It should be understood that the document clustering device in this application can include smart phones (such as Android mobile phones, iOS mobile phones, Windows Phone mobile phones, etc.), tablet computers, handheld computers, notebook computers, mobile Internet Devices MID (Mobile Internet Devices, referred to as: MID). ) or wearable devices, etc. The above-mentioned document clustering apparatus is only an example, rather than an exhaustive list, including but not limited to the above-mentioned electronic equipment. In practical applications, the above-mentioned document clustering apparatus may further include: intelligent vehicle-mounted terminals, computer equipment, and the like.

Embodiments of the present application further provide a computer (readable) storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is executed by a processor to implement any one of the documents described in the foregoing method embodiments Some or all of the steps of the clustering method.

Optionally, the storage medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.

Embodiments of the present application further provide a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the methods described in the foregoing method embodiments Some or all of the steps of any document clustering method.

It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described action sequence. Because in accordance with the present application, certain steps may be performed in other orders or concurrently. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present application.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative, for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation, for example, multiple units or components may be combined or Integration into another system, or some features can be ignored, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.

The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product in essence, or the part that contributes to the prior art, or all or part of the technical solution, and the computer software product is stored in a memory, Several instructions are included to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned memory includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.

Those skilled in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by instructing relevant hardware through a program, and the program can be stored in a computer-readable memory, and the memory can include: a flash disk , Read-only memory (English: Read-Only Memory, referred to as: ROM), random access device (English: Random Access Memory, referred to as: RAM), magnetic disk or optical disk, etc.

The embodiments of the present application have been introduced in detail above, and the principles and implementations of the present application are described in this paper by using specific examples. The descriptions of the above embodiments are only used to help understand the methods and core ideas of the present application; at the same time, for Persons of ordinary skill in the art, based on the idea of the present application, will have changes in the specific implementation manner and application scope. In summary, the contents of this specification should not be construed as limitations on the present application.

Claims

A document clustering method, including:

Obtain N documents to be clustered, where N is an integer greater than 1;

Determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered;

According to the co-citation similarity between any two documents to be clustered, the first clustering is performed on the N documents to be clustered, and M clusters are obtained, wherein the M clusters are The cluster corresponds to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;

A second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
The method according to claim 1, wherein the determining the co-citation similarity between any two documents to be clustered in the N documents to be clustered comprises:

determining the first number of the N documents to be clustered citing the first document to be clustered;

determining the second number of the N documents to be clustered that cite the second document to be clustered;

Determine the third number of the N documents to be clustered that simultaneously cite the first document to be clustered and the two documents to be clustered;

determining the co-citation similarity between the first document to be clustered and the two documents to be clustered according to the first quantity, the second quantity and the third quantity;

Wherein, the first document to be clustered and the second document to be clustered are any two documents to be clustered among the N documents to be clustered.
The method according to claim 1 or 2, wherein the second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, including:

Construct a feature vector for each document to be clustered in the N documents to be clustered;

According to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the feature vector of each document to be clustered in the K documents to be clustered, for the (N-K) documents The documents to be clustered are clustered a second time, so as to fuse the (N-K) documents to be clustered into the M clusters.
The method according to claim 3, wherein the constructing a feature vector for each document to be clustered in the N documents to be clustered comprises:

Perform phrase extraction and deduplication on the title and abstract of each document to be clustered in the N documents to be clustered to obtain P phrases, where P is an integer greater than 1;

Determine the word frequency-inverse text frequency TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered;

According to the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered, for each document to be clustered in the N documents to be clustered Document construction eigenvectors.
The method according to claim 4, wherein the TF-IDF of each phrase in the P phrases relative to the TF-IDF of each document to be clustered in the N documents to be clustered is the A feature vector is constructed for each document to be clustered in the N documents to be clustered, including:

In the case that the document i to be clustered includes the jth phrase among the P phrases, set the value of the jth dimension in the feature vector of the document i to be clustered as the The j phrases are relative to the TF-IDF of the to-be-clustered document i to be clustered, and if the j-th phrase is not included in the to-be-clustered document i to be clustered, the to-be-clustered document i The value of the jth dimension in the feature vector of the document i to be clustered is set as a preset value;

Wherein, the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is from 1 to P Integer.
The method according to claim 4 or 5, wherein the feature vector of each document to be clustered in the (N-K) documents to be clustered and each of the K documents to be clustered The feature vector of the documents to be clustered, the second clustering is performed on the (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters, including:

Determine the similarity between the feature vector of the document q to be clustered and the feature vector of the document e to be clustered, wherein the document q to be clustered is any one of the (N-K) documents to be clustered , the document e to be clustered is any one of the K documents to be clustered;

Select H documents to be clustered from the K documents to be clustered in descending order of similarity, where H is an integer greater than 1;

Determine the number of each of the H documents to be clustered that belong to each of the M clusters;

The document q to be clustered is fused into a target cluster, wherein the target cluster is the cluster that contains the most number of the H documents to be clustered among the M clusters.
The method according to claim 1, wherein the first clustering is performed on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, to obtain M clusters, including:

According to the co-citation similarity between the arbitrary two documents to be clustered, construct a community network of the N documents to be clustered;

According to the community network, the co-citation similarity between any two documents to be clustered, and the community detection algorithm, perform the first clustering on the N documents to be clustered to obtain M clusters .
A document clustering device, comprising:

The acquisition unit is used to acquire N documents to be clustered, where N is an integer greater than 1;

a processing unit, configured to determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered;

According to the co-citation similarity between any two documents to be clustered, the first clustering is performed on the N documents to be clustered, and M clusters are obtained, wherein the M clusters are The cluster corresponds to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;

A second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
An electronic device, comprising: a processor and a memory, the processor is connected to the memory, the memory is used for storing a computer program, the processor is used for executing the computer program stored in the memory, so that the The electronic device performs the following methods:

Obtain N documents to be clustered, where N is an integer greater than 1;

Determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered;

According to the co-citation similarity between any two documents to be clustered, the first clustering is performed on the N documents to be clustered, and M clusters are obtained, wherein the M clusters are The cluster corresponds to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;

A second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
The electronic device according to claim 9, wherein performing the determining of the co-citation similarity between any two documents to be clustered among the N documents to be clustered comprises:

determining the first number of the N documents to be clustered citing the first document to be clustered;

determining the second number of the N documents to be clustered that cite the second document to be clustered;

Determine the third number of the N documents to be clustered that simultaneously cite the first document to be clustered and the two documents to be clustered;

determining the co-citation similarity between the first document to be clustered and the two documents to be clustered according to the first quantity, the second quantity and the third quantity;

Wherein, the first document to be clustered and the second document to be clustered are any two documents to be clustered among the N documents to be clustered.
The electronic device according to claim 9 or 10, wherein the second clustering of the remaining (N-K) documents to be clustered is performed to fuse the (N-K) documents to be clustered into all the documents to be clustered. The M clusters, including:

Construct a feature vector for each document to be clustered in the N documents to be clustered;

According to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the feature vector of each document to be clustered in the K documents to be clustered, for the (N-K) documents The documents to be clustered are clustered a second time, so as to fuse the (N-K) documents to be clustered into the M clusters.
The electronic device according to claim 11, wherein performing the constructing a feature vector for each of the N documents to be clustered for each document to be clustered comprises:

Perform phrase extraction and deduplication on the title and abstract of each document to be clustered in the N documents to be clustered to obtain P phrases, where P is an integer greater than 1;

Determine the word frequency-inverse text frequency TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered;

According to the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered, for each document to be clustered in the N documents to be clustered Document construction eigenvectors.
The electronic device according to claim 12, wherein, executing the TF-IDF according to each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered, is: A feature vector is constructed for each document to be clustered in the N documents to be clustered, including:

In the case that the document i to be clustered includes the jth phrase among the P phrases, set the value of the jth dimension in the feature vector of the document i to be clustered as the The j phrases are relative to the TF-IDF of the to-be-clustered document i to be clustered, and if the j-th phrase is not included in the to-be-clustered document i to be clustered, the to-be-clustered document i The value of the jth dimension in the feature vector of the document i to be clustered is set as a preset value;

Wherein, the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is from 1 to P Integer.
The electronic device according to claim 9, wherein performing the first clustering on the N documents to be clustered according to the co-citation similarity between any two documents to be clustered, Get M clusters, including:

According to the co-citation similarity between the arbitrary two documents to be clustered, construct a community network of the N documents to be clustered;

According to the community network, the co-citation similarity between any two documents to be clustered, and the community detection algorithm, perform the first clustering on the N documents to be clustered to obtain M clusters .
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method:

Obtain N documents to be clustered, where N is an integer greater than 1;

Determine the co-citation similarity between any two documents to be clustered in the N documents to be clustered;

According to the co-citation similarity between any two documents to be clustered, the first clustering is performed on the N documents to be clustered, and M clusters are obtained, wherein the M clusters are The cluster corresponds to K documents to be clustered, M is an integer greater than or equal to 1, and K is an integer less than or equal to N;

A second clustering is performed on the remaining (N-K) documents to be clustered, so as to fuse the (N-K) documents to be clustered into the M clusters.
The computer-readable storage medium according to claim 15, wherein performing the determining of the co-citation similarity between any two documents to be clustered among the N documents to be clustered comprises:

determining the first number of the N documents to be clustered citing the first document to be clustered;

determining the second number of the N documents to be clustered that cite the second document to be clustered;

Determine the third number of the N documents to be clustered that simultaneously cite the first document to be clustered and the two documents to be clustered;

determining the co-citation similarity between the first document to be clustered and the two documents to be clustered according to the first quantity, the second quantity and the third quantity;

Wherein, the first document to be clustered and the second document to be clustered are any two documents to be clustered among the N documents to be clustered.
The computer-readable storage medium according to claim 15 or 16, wherein the second clustering of the remaining (N-K) documents to be clustered is performed to group the (N-K) documents to be clustered fused to the M clusters, including:

Construct a feature vector for each document to be clustered in the N documents to be clustered;

According to the feature vector of each document to be clustered in the (N-K) documents to be clustered and the feature vector of each document to be clustered in the K documents to be clustered, for the (N-K) documents The documents to be clustered are clustered a second time, so as to fuse the (N-K) documents to be clustered into the M clusters.
The computer-readable storage medium according to claim 17, wherein performing the constructing a feature vector for each of the N documents to be clustered for each document to be clustered comprises:

Perform phrase extraction and deduplication on the title and abstract of each document to be clustered in the N documents to be clustered to obtain P phrases, where P is an integer greater than 1;

Determine the word frequency-inverse text frequency TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered;

According to the TF-IDF of each phrase in the P phrases relative to each document to be clustered in the N documents to be clustered, for each document to be clustered in the N documents to be clustered Document construction eigenvectors.
The computer-readable storage medium according to claim 18, wherein the performing the method according to the TF- IDF, constructs a feature vector for each document to be clustered in the N documents to be clustered, including:

In the case that the document i to be clustered includes the jth phrase among the P phrases, set the value of the jth dimension in the feature vector of the document i to be clustered as the The j phrases are relative to the TF-IDF of the to-be-clustered document i to be clustered, and if the j-th phrase is not included in the to-be-clustered document i to be clustered, the to-be-clustered document i The value of the jth dimension in the feature vector of the document i to be clustered is set as a preset value;

Wherein, the document i to be clustered is any one of the N documents to be clustered, the value of i is an integer from 1 to N, and the value of j is from 1 to P Integer.
The computer-readable storage medium according to claim 15, wherein the performing the first time on the N documents to be clustered according to the co-citation similarity between the any two documents to be clustered Clustering to get M clusters, including:

According to the co-citation similarity between the arbitrary two documents to be clustered, construct a community network of the N documents to be clustered;

According to the community network, the co-citation similarity between any two documents to be clustered, and the community detection algorithm, perform the first clustering on the N documents to be clustered to obtain M clusters .