WO2020234930A1 - Cluster analysis method, cluster analysis system, and cluster analysis program - Google Patents
- Publication number
- WO2020234930A1 (application PCT/JP2019/019725)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cluster
- document
- inter
- clusters
- similarity
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- the present invention relates to a cluster analysis method, a cluster analysis system, and a cluster analysis program that classify a plurality of documents into clusters according to their contents and generate display data indicating relationships between clusters according to a time series.
- a cluster analysis method has been proposed (for example, "Patent Document 1") in which morphological analysis is performed on technical documents retrieved by a concept search, a weight is given to each word obtained by the morphological analysis, each technical document is vectorized, and technical documents whose vectors point in similar directions are grouped into one cluster.
- an object of the present invention is to provide a cluster analysis method, a cluster analysis system, and a cluster analysis program that classify a large number of documents, especially a huge number of documents, into clusters composed of similar documents and make it possible to grasp the relationships between clusters across sets, such as the time-series relationships of clusters.
- the present invention is a cluster analysis method in which a computer classifies a plurality of documents into clusters according to their contents, the method including the following steps.
- a first set extraction step of extracting a first set from the plurality of documents under a first condition.
- a first inter-document similarity calculation step of calculating the inter-document similarity between the contents of one document included in the first set and the contents of another document included in the first set.
- a first cluster classification step of classifying, within the first set, each document into a plurality of clusters based on the inter-document similarity calculated in the first similarity calculation step.
- a second set extraction step of extracting a second set from the plurality of documents under a second condition different from the first condition.
- a second inter-document similarity calculation step of calculating the inter-document similarity between the contents of one document included in the second set and the contents of another document included in the second set.
- a second cluster classification step of classifying, within the second set, each document into a plurality of clusters based on the inter-document similarity calculated in the second similarity calculation step.
- an inter-cluster similarity calculation step of calculating the inter-cluster similarity between the clusters classified in the first cluster classification step and the clusters classified in the second cluster classification step.
- a cluster association step of generating, based on the inter-cluster similarity calculated in the inter-cluster similarity calculation step, association information in which related clusters are associated with each other across the first set and the second set.
- according to the present invention, a large number of documents, especially a huge number of documents, can be classified into groups of documents (clusters) composed of similar documents, and the relationships between clusters across sets, such as the time-series relationships of clusters, can be grasped.
- FIG. 1 is an overall block diagram of the cluster analysis system according to one embodiment of the present invention.
- FIG. 2 is a display example of the cluster analysis result displayed on the output unit of the information terminal.
- FIG. 3 is an explanatory diagram of the display data.
- FIG. 4 is an explanatory diagram showing the relationships between clusters across sets.
- FIG. 5 is an explanatory diagram showing an example of a time-series map of the clusters.
- FIG. 6 is a flowchart showing the cluster analysis control routine executed by the server of the cluster analysis system according to one embodiment of the present invention.
- FIG. 1 is an overall configuration diagram showing a cluster analysis system according to an embodiment of the present invention, and the configuration of the present embodiment will be described based on the same diagram.
- in the cluster analysis system 1, the document database 2 (hereinafter, the database is referred to as “DB”), the information terminal 3, and the server 4 are connected via the communication network N.
- the communication network N is, for example, the Internet, an intranet, a VPN (Virtual Private Network), or the like, and is a communication network capable of bidirectionally transmitting information by using a wired or wireless communication means.
- for simplicity of explanation, one document DB 2 and one information terminal 3 are connected to one server 4, but the server 4 can be connected to a plurality of document DBs 2 and a plurality of information terminals 3.
- the document DB 2 is a database that stores information on documents such as academic papers, patent documents, magazines, books, and newspaper articles, and the stored documents are open to limited or non-limited persons.
- the document DB 2 will be described as an example of a document DB that stores information on medical literature.
- the information in the medical literature includes bibliographic matters such as the author's name, the publication date (time information), and the author's institution; content such as the title, abstract, and text of the paper; citation and cited-by information such as the number of citations and the names of the documents concerned; and publication information such as the name of the society in which the document was published, the name of the journal, or the name of the publisher.
- the information terminal 3 is, for example, a personal computer (hereinafter referred to as "PC") or a mobile terminal such as a smartphone, a tablet PC, or a mobile phone, and has an output unit 10 and an input unit 11.
- the output unit 10 is a device such as a display or a printer, and can visually display the display data generated by the server 4.
- the input unit 11 is a device such as a keyboard or a mouse, and can input and operate information.
- the output unit 10 and the input unit 11 may be integrated to form, for example, a touch panel.
- a person (user) who uses the information terminal 3 can check the display data generated by the server 4 on the output unit 10 and can issue various instructions to the server 4 via the input unit 11.
- the server 4 is composed of one or more servers (computers) that classify a plurality of documents into clusters according to their contents and generate display data indicating the relationship between the documents.
- the server 4 has various arithmetic units and storage units, for example, a document storage unit 20, a set extraction unit 21, an inter-document similarity calculation unit 22, a cluster classification unit 23, an index calculation unit 24, a network storage unit 25, an inter-cluster similarity calculation unit 26, a cluster association unit 27, and a display data generation unit 28.
- the document storage unit 20 is a storage unit that is connected to the document DB 2 via the communication network N and acquires and stores necessary document information from the document DB 2.
- the medical literature is acquired from the document DB2 and stored.
- the document storage unit 20 also has a function of automatically updating the document in the document storage unit 20 in synchronization with the update such as addition or deletion of the document in the document DB 2.
- the set extraction unit 21 has a function of extracting a set from the document storage unit 20 under the condition using time information.
- the set extraction unit 21 can extract a set narrowed down to medical literature published in a predetermined period (for example, a predetermined year) by using the publication date of the document.
- as the condition for extracting the set, time information is not the only option; other conditions may be used instead of it or added to it.
- conditions such as medical literature on a specific disease, medical literature presented at a specific academic society, etc. may be used or added, or a plurality of these conditions may be used.
- the set extraction unit 21 extracts the document that meets the conditions again based on the updated information.
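The set extraction described above can be sketched as a simple filter. This is only an illustrative sketch: the record layout (a dict with `title`, `published`, and `disease` keys), the function name `extract_set`, and the sample records are assumptions made for the example and do not come from the source.

```python
from datetime import date

def extract_set(documents, year, predicate=None):
    """Return the documents published in `year`, optionally narrowed by an extra condition."""
    selected = [d for d in documents if d["published"].year == year]
    if predicate is not None:
        # Additional condition, e.g. literature on a specific disease.
        selected = [d for d in selected if predicate(d)]
    return selected

# Hypothetical sample records, for illustration only.
documents = [
    {"title": "A", "published": date(2018, 3, 1), "disease": "asthma"},
    {"title": "B", "published": date(2017, 6, 9), "disease": "asthma"},
    {"title": "C", "published": date(2018, 11, 20), "disease": "diabetes"},
]

set_2018 = extract_set(documents, 2018)
asthma_2018 = extract_set(documents, 2018, lambda d: d["disease"] == "asthma")
```

Calling the function once per year yields one set per year, and re-running it after the document store is updated yields the refreshed set.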
- the inter-document similarity calculation unit 22 has a function of calculating the similarity between the contents of one document and the contents of another document for the documents in the set extracted by the set extraction unit 21.
- TF-IDF and cosine similarity can be used to calculate the similarity. That is, the inter-document similarity calculation unit 22 extracts the words used in the contents of each document, weights each word by the product of its frequency of occurrence in the document (TF: Term Frequency) and its rarity with respect to its use in other documents (IDF: Inverse Document Frequency), and vectorizes the document.
- the inter-document similarity calculation unit 22 calculates the value of the cosine (cos) between the vectorized documents as the value of the similarity between the documents.
- for example, the similarity between the first document and the second document is 0.856, and the similarity between the first document and the third document is 0.732. The similarity is expressed as a value between 0 and 1, and the closer it is to 1, the more similar the documents are.
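The TF-IDF weighting and cosine calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions: whitespace tokenization stands in for the word extraction, the smoothed IDF formula is one common variant chosen for the example (the source does not specify one), and the function names and sample sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Turn token lists into sparse TF-IDF vectors (dicts of term -> weight)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))                            # TF
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)   # smoothed IDF (assumed variant)
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical sample texts, for illustration only.
docs = [
    "insulin therapy for diabetes",
    "insulin dosage in diabetes care",
    "asthma inhaler trial",
]
vectors = tfidf_vectors([d.split() for d in docs])
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing two words
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing no words
```

Because all weights are non-negative, the cosine stays between 0 and 1, which matches the value range stated above.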
- the cluster classification unit 23 generates, based on the similarity calculated by the inter-document similarity calculation unit 22, a network in which the documents are connected by lines (hereinafter referred to as “edges”), and classifies similar documents into clusters (document groups).
- the algorithm for clustering is not particularly limited, but for example, it is possible to use an algorithm (the so-called Girvan-Newman algorithm) that identifies clusters by iterative calculation so that the connectivity between nodes is maintained as much as possible even when edges are cut.
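One way to picture the Girvan-Newman approach named above is the following sketch, which repeatedly removes the edge with the highest betweenness until the network splits. The assumptions are loud ones: the network is treated as unweighted, the adjacency-dict input format is invented for the example, and stopping at a requested number of clusters `k` is a simplification chosen here rather than something the source specifies.

```python
from collections import deque

def components(adj):
    """Connected components of an undirected graph given as {node: set(neighbors)}."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {n: 0.0 for n in adj}   # number of shortest paths from s
        sigma[s] = 1.0
        preds = {n: [] for n in adj}
        queue = deque([s])
        while queue:                     # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):        # accumulate path shares back toward s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return bet

def girvan_newman(adj, k):
    """Remove highest-betweenness edges until at least k clusters remain (k is an assumed stop rule)."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    while len(components(adj)) < k:
        bet = edge_betweenness(adj)
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by a single bridge edge (2-3).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
clusters = girvan_newman(graph, 2)
```

In the toy graph the bridge carries every shortest path between the two triangles, so it is removed first and each triangle becomes one cluster.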
- the index calculation unit 24 has a function of calculating a centrality index indicating the centrality of each document in the network generated by the cluster classification unit 23.
- the algorithm for calculating the centrality index is not particularly limited, and for example, eigenvector centrality, PageRank, betweenness centrality, degree centrality, and the like can be used.
- eigenvector centrality is used.
- the eigenvector centrality is expressed as the probability of passing through a given document (hereinafter referred to as a "node") on the network when starting from any node in the network and repeatedly following edges.
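Eigenvector centrality is commonly computed by power iteration, and a minimal sketch is given below. The input format (a dict mapping each node to its neighbors and edge weights), the shifted update `x + A x` used to stabilize convergence, and the normalization so that scores sum to 1 are all choices made for this example, not details taken from the source.

```python
def eigenvector_centrality(weights, iterations=100, tol=1e-9):
    """Power iteration on a weighted undirected graph {node: {neighbor: weight}}.

    Scores are normalized to sum to 1 so they can be read as relative shares.
    """
    nodes = list(weights)
    x = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Shifted update (x + A x): same leading eigenvector, steadier convergence.
        nxt = {n: x[n] + sum(w * x[m] for m, w in weights[n].items()) for n in nodes}
        total = sum(nxt.values())
        nxt = {n: v / total for n, v in nxt.items()}
        if max(abs(nxt[n] - x[n]) for n in nodes) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical star-shaped network: one hub document linked to three others.
star = {
    "hub": {"a": 1.0, "b": 1.0, "c": 1.0},
    "a": {"hub": 1.0},
    "b": {"hub": 1.0},
    "c": {"hub": 1.0},
}
scores = eigenvector_centrality(star)
most_central = max(scores, key=scores.get)
```

A document connected to many well-connected documents receives a high score, which is what the node size in the network display is meant to convey.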
- the network storage unit 25 is a storage unit that stores network information after clustering for each set of documents extracted by the set extraction unit 21. For example, when the set extraction unit 21 generates a set for each year based on the publication year of the document, the network information for each year is stored in the network storage unit 25. Each network information stored here is converted into network display data by the display data generation unit 28, and can be displayed by the output unit 10 of the information terminal 3.
- FIG. 2 is a display example of one network as a cluster analysis result displayed on the output unit of the information terminal
- FIG. 3 is an explanatory diagram of the network. The display of the network in one set will be described based on these figures.
- in the network of one set, each document in the set is shown with a representation according to its centrality index and a representation according to the type of its cluster, and the relationship between documents is shown with a representation according to the degree of similarity between them.
- one document (node) on the network is represented by one circle
- the centrality index is represented by the size of the circle
- the type of cluster is represented by color.
- the magnitude of similarity is expressed by the thickness of the edge.
- nodes 30a to 30j (hereinafter collectively referred to as “node 30”) are displayed; the four nodes 30a to 30d on the upper left belong to the first cluster, and the six nodes 30e to 30j belong to the second cluster.
- the first cluster and the second cluster can be shown in different colors. In FIG. 3, the difference in color is shown by the difference in hatching.
- the size of the node 30 indicates the size of the centrality, and in FIG. 3, it can be seen that the node 30a and the node 30e are documents with high centrality. Further, the thickness of the edge 32 connecting the nodes 30 indicates the magnitude of the similarity between documents connected by the edge 32. Therefore, in FIG. 3, since the edge 32 between the node 30a and the node 30c and between the node 30e and the node 30h is thick, it can be seen that the document-to-document similarity between these nodes is high.
- the network storage unit 25 stores network information that is the basis of such a network display for each set.
- the inter-cluster similarity calculation unit 26 has a function of calculating the inter-cluster similarity between clusters of a plurality of sets stored in the network storage unit 25.
- TF-IDF and cosine similarity can be used as in the inter-document similarity calculation unit 22. That is, the inter-cluster similarity calculation unit 26 extracts the words used in the contents of the documents in each cluster of each set, weights each word by the product of its frequency of occurrence in the cluster (TF: Term Frequency) and its rarity with respect to its use in other clusters (IDF: Inverse Document Frequency), and vectorizes each cluster. Then, the inter-cluster similarity calculation unit 26 calculates the value of the cosine (cos) between a cluster vectorized in the first set and a cluster vectorized in the second set as the value of the inter-cluster similarity.
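The cluster-level calculation can be sketched by treating each cluster as one bag of words. As before, this is a hedged illustration: the smoothed IDF variant, the input format (cluster id mapped to token lists), the function names, and the toy clusters are assumptions introduced for the example.

```python
import math
from collections import Counter

def cluster_vectors(clusters):
    """clusters: {cluster_id: [token list per document]} -> {cluster_id: {term: weight}}."""
    bags = {cid: Counter(t for doc in docs for t in doc) for cid, docs in clusters.items()}
    n = len(bags)
    # Cluster frequency: in how many clusters each term appears.
    cf = Counter(t for bag in bags.values() for t in bag)
    vectors = {}
    for cid, bag in bags.items():
        total = sum(bag.values())
        vectors[cid] = {
            t: (c / total) * (math.log((1 + n) / (1 + cf[t])) + 1.0)  # TF x smoothed IDF (assumed variant)
            for t, c in bag.items()
        }
    return vectors

def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def inter_cluster_similarity(first, second):
    """Cosine similarity for every (cluster in first set, cluster in second set) pair."""
    merged = {("first", k): v for k, v in first.items()}
    merged.update({("second", k): v for k, v in second.items()})
    vecs = cluster_vectors(merged)
    return {
        (a, b): cosine(vecs[("first", a)], vecs[("second", b)])
        for a in first for b in second
    }

# Hypothetical clusters from two yearly sets, for illustration only.
newer = {"A": [["insulin", "diabetes"], ["insulin", "therapy"]], "B": [["asthma", "inhaler"]]}
older = {"X": [["diabetes", "insulin", "care"]], "Y": [["inhaler", "asthma", "trial"]]}
sims = inter_cluster_similarity(newer, older)
```

Here the rarity term is computed over the clusters of both sets together; computing it per set would be an equally defensible reading of the description.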
- the cluster association unit 27 has a function of generating cluster association information, assuming that clusters having a cluster-to-cluster similarity equal to or greater than a predetermined threshold value are related clusters. That is, the cluster association unit 27 associates related clusters across the set.
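The threshold-based association can be sketched in a few lines. The function name, the shape of the input (a dict keyed by cluster pairs), the threshold of 0.5, and the similarity values are all illustrative assumptions and not figures from the source.

```python
def associate_clusters(similarities, threshold):
    """Keep only cluster pairs whose similarity meets the threshold, strongest first."""
    links = [(a, b, s) for (a, b), s in similarities.items() if s >= threshold]
    return sorted(links, key=lambda link: link[2], reverse=True)

# Illustrative similarity values between clusters of two sets.
similarities = {("A", "X"): 0.82, ("A", "Y"): 0.10, ("B", "X"): 0.55, ("B", "Y"): 0.91}
association = associate_clusters(similarities, threshold=0.5)
```

Pairs below the threshold are dropped entirely, so only the links worth displaying are carried forward.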
- the display data generation unit 28 can generate network display data based on the network information stored in the network storage unit 25 described above, and also has a function of generating time-series display data showing the relationships between clusters across the sets associated by the cluster association unit 27.
- FIG. 4 shows the relationship between clusters across sets
- FIG. 5 shows a display example of time-series display data.
- FIG. 4 shows the network of the set shown in FIG. 3 as an example of the network representing the set of medical literature published in 2018.
- FIG. 4 also shows, in time series, the networks representing the sets of medical literature published in 2017 and 2016.
- as shown by the solid and dotted lines extending between the sets in FIG. 4, the inter-cluster similarity calculation unit 26 calculates the similarity between clusters across sets based on the similarity between the documents in the clusters of the 2018 set and the documents in the clusters of the 2017 set. In addition, the inter-cluster similarity calculation unit 26 can calculate the similarity between clusters in time series by performing the same processing on the 2017 set and the 2016 set.
- the time-series display in FIG. 5 is a chronological arrangement of the major clusters belonging to the collection of medical literature published in each year from 2014 to 2018.
- the cluster is indicated by a circle, and the size of the circle represents the number of documents belonging to the cluster, and the number in the circle indicates the number of documents.
- FIG. 5 shows the cluster associations with the latest year, 2018, as the reference.
- four clusters 40a to 40d, which have a large number of documents, are displayed, and the relationship with the past clusters is shown by lines (edges 50 and 51) with reference to these clusters.
- each cluster is shown in a different color, but in FIG. 5, the difference in color is shown by the difference in hatching.
- the thicknesses of the edges 50 and 51 indicate the high degree of similarity between clusters, and the display data generation unit 28 generates display data so as to display only the degree of similarity equal to or higher than a predetermined threshold value.
- there are two types of edges: a main edge 50, which connects the reference cluster to the cluster with the highest similarity to it, and a sub-edge 51, which connects it to the other clusters with the second-highest and lower similarities. Clusters connected by the main edge 50 are shown in the same color (hatching) as clusters with the same attribute.
- the sub-edge 51 connects clusters having different attributes.
- the attributes of the cluster correspond to, for example, a research theme in the medical literature.
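The distinction between main edges and sub-edges can be sketched as a ranking per reference cluster. The tuple layout, the output dict keys, and the sample values are hypothetical and serve only to illustrate the rule described above.

```python
def classify_edges(links):
    """links: iterable of (reference_cluster, past_cluster, similarity) already above the threshold.

    The most similar past cluster for each reference cluster gets a main edge; the rest get sub-edges.
    """
    by_reference = {}
    for ref, past, sim in links:
        by_reference.setdefault(ref, []).append((sim, past))
    edges = []
    for ref, candidates in by_reference.items():
        candidates.sort(reverse=True)  # highest similarity first
        for rank, (sim, past) in enumerate(candidates):
            edges.append({
                "from": ref,
                "to": past,
                "similarity": sim,
                "kind": "main" if rank == 0 else "sub",
            })
    return edges

# Illustrative links between a reference-year cluster set and the previous year.
links = [("A", "X", 0.82), ("A", "Y", 0.61), ("B", "Y", 0.91)]
edges = classify_edges(links)
main_edges = [(e["from"], e["to"]) for e in edges if e["kind"] == "main"]
```

Clusters joined by a main edge would then inherit the same color in the display, while sub-edges link clusters of different attributes.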
- assuming that FIG. 5, which shows the clusters of each year connected by the edges 50 and 51, is time-series display data for medical literature, the following can be inferred.
- the attribute of the cluster 40a, which has the largest number of documents in 2018, also ranked first in number of documents in 2017 and 2016 (clusters 41a and 42a) but ranked second in 2015 and 2014 (clusters 43a and 44a), and its number of documents increased rapidly from 2015 to 2016. It can therefore be inferred that the research theme of the cluster 40a had been attracting attention for some time and that a particularly notable event occurred between 2015 and 2016.
- the cluster 40c, which has the third largest number of documents in 2018, has ranked third in number of documents every year since 2014, but its number of documents is on the rise, so it can be inferred that it is a research theme that may develop in the future.
- the cluster 40d, which has the fourth largest number of documents in 2018, has an attribute that first appeared in 2017 and is a relatively new research theme. Furthermore, it can be inferred that the clusters 42d, 43d, and 44d, which have the fourth largest number of documents from 2016 to 2014, were integrated in 2017 into the cluster 41b, which has the second largest number of documents.
- the display data generation unit 28 transmits the generated network display data and time-series display data to the information terminal 3 connected to the server 4 via the communication network N.
- in response to information input to the server 4, the corresponding network display data as shown in FIGS. 2 and 3 and time-series display data as shown in FIG. 5 are output to the output unit 10 of the information terminal 3.
- FIG. 6 shows a flowchart of a cluster analysis routine that generates time-series display data executed by the server 4 of the cluster analysis system 1.
- the cluster analysis method of the present embodiment will be described in detail with reference to the same flowchart.
- in step S1, the set extraction unit 21 extracts a set of documents that meet the conditions from the document storage unit 20. For example, when the time-series display of FIG. 5 described above is requested, first, a set of medical literature published in 2018 (first set) is extracted.
- in step S2, the inter-document similarity calculation unit 22 calculates the inter-document similarity between the documents constituting the set extracted in step S1.
- in step S3, the cluster classification unit 23 generates a network between documents based on the similarity calculated in step S2, and classifies the documents so that similar documents form a cluster.
- in step S4, the index calculation unit 24 calculates a centrality index indicating the centrality of each document in the network generated in step S3.
- the network information related to the set extracted in step S1 is thus generated and stored in the network storage unit 25.
- in step S5, the inter-cluster similarity calculation unit 26 determines whether or not the network storage unit 25 stores networks for the sets that meet the conditions. If the determination result is false (No), the process returns to step S1. For example, in the case of the time-series display of FIG. 5 described above, if a network has not been generated for each yearly set from 2014 to 2018, the process returns to step S1 and the set for a year not yet processed is extracted. Then, the above steps S2 to S4 are executed to generate a network.
- if the determination result in step S5 is true (Yes), that is, if networks have been generated for the sets that meet the conditions, the process proceeds to step S6.
- in step S6, the inter-cluster similarity calculation unit 26 calculates the inter-cluster similarity between clusters of the plurality of sets stored in the network storage unit 25. For example, in the case of the time-series display of FIG. 5, the inter-cluster similarity between the clusters of the 2018 and 2017 sets is calculated, followed by the inter-cluster similarity between the clusters of the 2017 and 2016 sets, the 2016 and 2015 sets, and the 2015 and 2014 sets.
- in step S7, the cluster association unit 27 generates cluster association information, treating clusters whose inter-cluster similarity is equal to or greater than a predetermined threshold value as related clusters. For example, in the case of the time-series display of FIG. 5, clusters of adjacent years whose inter-cluster similarity is equal to or greater than the predetermined threshold value are connected by the edges 50 and 51.
- in step S8, the display data generation unit 28 generates time-series display data as shown in FIG. 5, transmits it to the information terminal 3, and ends the routine.
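The control flow of steps S1 to S8 can be sketched as a small driver that takes the individual stages as parameters. Everything concrete here is an assumption made for illustration: the function name `run_cluster_analysis`, the injected callables, and the toy stand-ins used in the example (grouping by first letter, exact-match similarity) are not the method of the source, only placeholders that make the loop runnable.

```python
def run_cluster_analysis(years, extract, cluster, similarity, threshold):
    """Sketch of the routine: build one network per year, then link clusters of adjacent years."""
    networks = {}
    for year in years:                       # S1 to S5: repeat until every set has a network
        if year not in networks:
            networks[year] = cluster(extract(year))
    links = []
    ordered = sorted(years, reverse=True)    # newest set first
    for newer, older in zip(ordered, ordered[1:]):               # S6: adjacent years
        for (a, b), s in similarity(networks[newer], networks[older]).items():
            if s >= threshold:                                   # S7: threshold association
                links.append((newer, a, older, b, s))
    return networks, links                   # input for S8 (display data generation)

# Toy stand-ins, for illustration only.
corpus = {2018: ["x1", "x2", "y1"], 2017: ["x3", "y2"], 2016: ["y3"]}

def toy_extract(year):
    return corpus[year]

def toy_cluster(items):
    groups = {}
    for item in items:
        groups.setdefault(item[0], []).append(item)  # group by first letter
    return groups

def toy_similarity(newer, older):
    return {(a, b): 1.0 if a == b else 0.0 for a in newer for b in older}

networks, links = run_cluster_analysis([2018, 2017, 2016], toy_extract, toy_cluster, toy_similarity, 0.5)
```

Passing the stages in as callables keeps the loop independent of the particular similarity measure or clustering algorithm chosen.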
- a plurality of sets having different temporal conditions are extracted, a network is formed in each set based on the similarity between documents, and clusters of similar documents are formed. Then, the similarity between clusters is calculated and the clusters are associated across the sets. This makes it possible to show the transition of the clusters over time.
- by targeting only clusters whose inter-cluster similarity is equal to or higher than a predetermined threshold value, cluster association can remove unnecessary information, reduce the processing load on the server 4, and reduce the amount of information provided to the information terminal 3.
- by generating time-series display data showing the relationships between clusters across the associated sets as shown in FIG. 5, it is possible to obtain a bird's-eye view of the transition of clusters.
- a large number of documents can be classified into clusters composed of similar documents, and the time-series relationship of each cluster can be grasped, so the progression of the clusters over time can be understood.
- the display data generation unit 28 expresses the time series display by a circle as shown in FIG. 5, the number of documents is expressed by the size of a circle, and the similarity between clusters is expressed by the thickness of the edge.
- the expression of the time series display is not limited to this, and may be expressed by other expressions.
- the time-series relationship of each cluster across the set can be grasped by using the time information as the condition for extracting the set.
- the conditions for extracting are not limited to time information.
- for example, by extracting sets based on a disease or a drug, the relationship between clusters related to the disease or drug can be visualized.
- the relationship between clusters related to a specific technology can be visualized by extracting a set based on the technical field.
- 1 Cluster analysis system
- 2 Document DB
- 3 Information terminal
- 4 Server
- 10 Output unit
- 11 Input unit
- 20 Document storage unit
- 21 Set extraction unit
- 22 Inter-document similarity calculation unit
- 23 Cluster classification unit
- 24 Index calculation unit
- 25 Network storage unit
- 26 Inter-cluster similarity calculation unit
- 27 Cluster association unit
- 28 Display data generation unit
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Library & Information Science (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
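2 Document DB
3 Information terminal
4 Server
10 Output unit
11 Input unit
20 Document storage unit
21 Set extraction unit
22 Inter-document similarity calculation unit
23 Cluster classification unit
24 Index calculation unit
25 Network storage unit
26 Inter-cluster similarity calculation unit
27 Cluster association unit
28 Display data generation unit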
Claims (9)
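- A cluster analysis method in which a computer classifies a plurality of documents into clusters according to their contents, the method comprising:
a first set extraction step of extracting a first set from the plurality of documents under a first condition;
a first inter-document similarity calculation step of calculating an inter-document similarity between the contents of one document included in the first set and the contents of another document included in the first set;
a first cluster classification step of classifying, within the first set, each document into a plurality of clusters based on the inter-document similarity calculated in the first similarity calculation step;
a second set extraction step of extracting a second set from the plurality of documents under a second condition different from the first condition;
a second inter-document similarity calculation step of calculating an inter-document similarity between the contents of one document included in the second set and the contents of another document included in the second set;
a second cluster classification step of classifying, within the second set, each document into a plurality of clusters based on the inter-document similarity calculated in the second similarity calculation step;
an inter-cluster similarity calculation step of calculating an inter-cluster similarity between the clusters classified in the first cluster classification step and the clusters classified in the second cluster classification step; and
a cluster association step of generating, based on the inter-cluster similarity calculated in the inter-cluster similarity calculation step, association information in which related clusters are linked to each other across the first set and the second set.
- The cluster analysis method according to claim 1, wherein time information is associated with the plurality of documents, and the first condition and the second condition include conditions using the time information.
- The cluster analysis method according to claim 1 or 2, wherein, in the cluster association step, clusters whose inter-cluster similarity calculated in the inter-cluster similarity calculation step is equal to or greater than a predetermined threshold value are linked to each other.
- The cluster analysis method according to any one of claims 1 to 3, further comprising a display data generation step of generating display data indicating the relationships between clusters across the sets associated in the cluster association step.
- The cluster analysis method according to claim 4, wherein, in the display data generation step, the display data is generated such that the clusters of the first set and the clusters of the second set are arranged in chronological order and related clusters are connected by lines across the first set and the second set.
- The cluster analysis method according to claim 5, wherein, in the display data generation step, the display data is generated such that each cluster is represented by a circle, the number of documents belonging to the cluster is represented by the size of the circle, and the inter-cluster similarity is represented by the thickness of the line.
- A cluster analysis system that classifies a plurality of documents into clusters according to their contents, the system comprising:
a set extraction unit that extracts a first set from the plurality of documents under a first condition and extracts a second set under a second condition different from the first condition;
an inter-document similarity calculation unit that calculates an inter-document similarity between the contents of one document included in the first set and the contents of another document included in the first set, and calculates an inter-document similarity between the contents of one document included in the second set and the contents of another document included in the second set;
a cluster classification unit that classifies, within the first set, each document into a plurality of clusters based on the inter-document similarity calculated by the inter-document similarity calculation unit, and classifies, within the second set, each document into a plurality of clusters based on the similarity calculated by the inter-document similarity calculation unit;
an inter-cluster similarity calculation unit that calculates an inter-cluster similarity between the clusters in the first set and the clusters in the second set; and
a cluster association unit that generates, based on the inter-cluster similarity calculated by the inter-cluster similarity calculation unit, association information in which related clusters are linked to each other across the first set and the second set.
- A cluster analysis program that causes a computer to classify a plurality of documents into clusters according to their contents, the program causing the computer to execute:
a first set extraction step of extracting a first set from the plurality of documents under a first condition;
a first inter-document similarity calculation step of calculating an inter-document similarity between the contents of one document included in the first set and the contents of another document included in the first set;
a first cluster classification step of classifying, within the first set, each document into a plurality of clusters based on the inter-document similarity calculated in the first similarity calculation step;
a second set extraction step of extracting a second set from the plurality of documents under a second condition different from the first condition;
a second inter-document similarity calculation step of calculating an inter-document similarity between the contents of one document included in the second set and the contents of another document included in the second set;
a second cluster classification step of classifying, within the second set, each document into a plurality of clusters based on the inter-document similarity calculated in the second similarity calculation step;
an inter-cluster similarity calculation step of calculating an inter-cluster similarity between the clusters classified in the first cluster classification step and the clusters classified in the second cluster classification step; and
a cluster association step of generating, based on the inter-cluster similarity calculated in the inter-cluster similarity calculation step, association information in which related clusters are linked to each other across the first set and the second set.
- A cluster analysis method in which a computer classifies a plurality of documents into clusters according to their contents, the method comprising:
an inter-cluster similarity calculation step of calculating an inter-cluster similarity between clusters classified within a first set extracted from the plurality of documents and clusters classified within a second set, different from the first set, extracted from the plurality of documents; and
a cluster association step of generating, based on the inter-cluster similarity calculated in the inter-cluster similarity calculation step, association information in which related clusters are linked to each other across the first set and the second set.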
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021520502A JPWO2020234930A1 (ja) | 2019-05-17 | 2019-05-17 | |
PCT/JP2019/019725 WO2020234930A1 (ja) | 2019-05-17 | 2019-05-17 | Cluster analysis method, cluster analysis system, and cluster analysis program |
US17/595,151 US11989222B2 (en) | 2019-05-17 | 2019-05-17 | Cluster analysis method, cluster analysis system, and cluster analysis program |
JP2024005074A JP2024041946A (ja) | 2019-05-17 | 2024-01-17 | Cluster analysis method, cluster analysis system, and cluster analysis program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2019/019725 WO2020234930A1 (ja) | 2019-05-17 | 2019-05-17 | Cluster analysis method, cluster analysis system, and cluster analysis program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020234930A1 (ja) | 2020-11-26 |
Family
ID=73459205
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2019/019725 WO2020234930A1 (ja) | Cluster analysis method, cluster analysis system, and cluster analysis program | 2019-05-17 | 2019-05-17 |
Country Status (3)
Country | Link |
---|---|
US (1) | US11989222B2 (ja) |
JP (2) | JPWO2020234930A1 (ja) |
WO (1) | WO2020234930A1 (ja) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115098690B (zh) * | 2022-08-24 | 2023-02-24 | 中信天津金融科技服务有限公司 | Multi-data document classification method and system based on cluster analysis |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011086032A (ja) * | 2009-10-14 | 2011-04-28 | Hitachi Solutions Ltd | Changing topic extraction device or changing topic extraction method |
JP2012073899A (ja) * | 2010-09-29 | 2012-04-12 | Teikoku Databank Ltd | Business relationship map generation system and program |
US9836183B1 (en) * | 2016-09-14 | 2017-12-05 | Quid, Inc. | Summarized network graph for semantic similarity graphs of large corpora |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10154150A (ja) | 1996-11-25 | 1998-06-09 | Nippon Telegr & Teleph Corp <Ntt> | Information trend presentation method and device therefor |
JP2000242652A (ja) | 1999-02-18 | 2000-09-08 | Nippon Telegr & Teleph Corp <Ntt> | Information trend search method and device, and recording medium storing an information trend search program |
JP2005092443A (ja) | 2003-09-16 | 2005-04-07 | Mitsubishi Research Institute Inc | Cluster analysis device and cluster analysis method |
JP2015004996A (ja) * | 2012-02-14 | 2015-01-08 | International Business Machines Corporation | Device for clustering a plurality of documents |
US20160314184A1 (en) * | 2015-04-27 | 2016-10-27 | Google Inc. | Classifying documents by cluster |
US11023774B2 (en) * | 2018-01-12 | 2021-06-01 | Thomson Reuters Enterprise Centre Gmbh | Clustering and tagging engine for use in product support systems |
2019
- 2019-05-17 WO PCT/JP2019/019725 patent/WO2020234930A1/ja active Application Filing
- 2019-05-17 JP JP2021520502A patent/JPWO2020234930A1/ja active Pending
- 2019-05-17 US US17/595,151 patent/US11989222B2/en active Active
2024
- 2024-01-17 JP JP2024005074A patent/JP2024041946A/ja active Pending
Non-Patent Citations (1)
Title |
---|
SANO, TSUKASA ET AL.: "Integrating crosslanguage hierarchies and its application to relevant document extraction", IPSJ SIG TECHNICAL REPORT, vol. 2007, no. 47, 25 May 2007 (2007-05-25), pages 55 - 60 * |
Also Published As
Publication number | Publication date |
---|---|
US20220222287A1 (en) | 2022-07-14 |
JP2024041946A (ja) | 2024-03-27 |
JPWO2020234930A1 (ja) | 2020-11-26 |
US11989222B2 (en) | 2024-05-21 |
Similar Documents
Publication | Title |
---|---|
Kalra et al. | Importance of Text Data Preprocessing & Implementation in RapidMiner. |
EP2866421B1 (en) | Method and apparatus for identifying a same user in multiple social networks |
Gualberto et al. | The answer is in the text: Multi-stage methods for phishing detection based on feature engineering |
CN109214454B (zh) | Emotion community classification method for microblogs |
Marshakova-Shaikevich | Bibliometric maps of field of science |
JP2024041946A (ja) | Cluster analysis method, cluster analysis system, and cluster analysis program |
CN112052308A (zh) | Abstract text extraction method and apparatus, storage medium, and electronic device |
US20240111943A1 (en) | Summary creation method, summary creation system, and summary creation program |
JP6621514B1 (ja) | Summary creation device, summary creation method, and program |
US20230119422A1 (en) | Cluster analysis method, cluster analysis system, and cluster analysis program |
Hulliyah et al. | A Benchmark of Modeling for Sentiment Analysis of The Indonesian Presidential Election in 2019 |
Swaraj et al. | A fast approach to identify trending articles in hot topics from XML based big bibliographic datasets |
Kaur et al. | Development of human face literature database using text mining approach: phase I |
Kumara et al. | Ontology learning with complex data type for Web service clustering |
JP2011248740A (ja) | Data output device, data output method, and data output program |
Boeva et al. | Evolutionary Clustering Techniques for Expertise Mining Scenarios. |
Sharmila et al. | Non-Class Element based Iterative Text Clustering Algorithm for Improved Clustering Accuracy using Semantic Ontology |
Patrick | Textual prediction of attitudes towards mental health |
Luling et al. | COVID-19 literature mining and analysis research |
Hada et al. | A novel recommendation system for vaccines using hybrid machine learning model |
Silveira et al. | Ranking keyphrases from semantic and syntactic features of textual terms |
da Silva Jacinto et al. | Automatic and semantic pre-selection of features using ontology for data mining on data sets related to cancer |
Senthilkumar et al. | A unified approach to detect the record duplication using bat algorithm and fuzzy classifier for health informatics |
JP6745686B2 (ja) | Name aggregation processing method |
Reddy et al. | High-performance intelligent Models for Faster Ailments Extraction Over the Big Healthcare Data |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19929671; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2021520502; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.02.2022) |
122 | EP: PCT application non-entry in European phase | Ref document number: 19929671; Country of ref document: EP; Kind code of ref document: A1 |