KR20150057497A - Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach - Google Patents
- Publication number
- KR20150057497A (application KR1020130140913A)
- Authority
- KR
- South Korea
- Prior art keywords
- cluster
- word
- topic
- similarity
- document
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
The present invention relates to a method for detecting topics in big data such as online documents, and more particularly, to detecting topics by means of the medoid-based clusters proposed herein, organized into a topic tree.
The present invention was derived from research conducted as part of the Basic Research Program (Mid-career Researcher Program, Core Research - Individual) of the Ministry of Science, ICT and Future Planning and the National Research Foundation of Korea [grant number: 2013R1A2A2A01017030, title: A Study on a Semantic Text Cuboid-Based Large-Scale Text Mining Framework].
The modern Internet holds a vast amount of information on a wide range of subjects. To obtain desired information, a user formulates a query and retrieves results through a search engine. However, it is not easy to build a suitable query each time, nor to quickly find the desired information among a large number of results. To date, the most common way to organize large document collections is to classify and index them hierarchically by topic. Korean Patent No. 10-0902673, entitled "Document Clustering Based Document Retrieval Service Providing Method and System," is an example of a system that builds such a conventional topic tree. In that prior art, a document retrieval service based on title clustering is provided: clusters are formed on the basis of titles, and the data are classified by mapping the clusters onto topics.
This clustering technique has the disadvantage that it supports only a flat (linear) form of topic keyword search and maps clusters onto the categories of an existing directory structure, so it cannot easily and efficiently classify the large volume of web documents on the Internet.
Most information systems require considerable effort to build document indexes through such a hierarchically classified topic tree. In addition, because the classified topic tree treats each document as independent information, even documents with identical or highly similar contents are handled separately, so additional effort is required to improve the efficiency of searching and browsing.
SUMMARY OF THE INVENTION. The present invention has been made to solve the problems of the prior art described above. Its object is to extract the main keywords in real time from an incoming document set and to automatically compute their hierarchical relationships, thereby automatically identifying the hierarchical structure of topics.
According to an aspect of the present invention, there is provided a hierarchical-tree-based topic search method for online text documents, the method including: calculating the similarity between the documents included in a cluster; setting the document whose summed similarity (distance) to the others is smallest as the medoid; calculating the similarity between a newly received document and the medoid; merging a received document whose similarity value is below a threshold into the cluster to which the medoid belongs; extracting topic words from the cluster to generate a topic tree of the cluster; and generating the topic tree so as to build the hierarchical relationships between the topic words using probability information computed for the topic words within the cluster.
In this case, when the received document is merged into the cluster, the method may further include recalculating the similarity between the documents included in the merged cluster and resetting the medoid.
The method may further include generating the received document as an independent cluster if the similarity value between the received document and the medoid is greater than or equal to the threshold.
In addition, the topic tree may include a hierarchical relationship of the topic words extracted from the generated cluster or the independent cluster.
The topic word extraction may include extracting topic words using at least one of: the frequency of a word appearing in the cluster, the frequency of documents containing the word, and the number of clusters containing the word.
In addition, the extraction of the topic words may use the following CTF-CDF-ICF expression (reconstructed here from the variable definitions, as the original equation images are not reproduced):

CTF(t, c) = tc(t, c) / Σ_t′ tc(t′, c),  CDF(t, c) = dc(t, c) / D,  ICF(t) = log(C / cc(t))

(where tc(t, c) is the occurrence frequency of the word t in cluster c and the denominator of CTF is the sum of the occurrences of all words in c; dc(t, c) is the number of documents in cluster c containing t and D is the total number of documents in cluster c; cc(t) is the number of clusters in which t appears and C is the total number of clusters. The word with the largest combined CTF·CDF·ICF value is extracted as the topic word.) In addition, the probability information may be calculated using the following equation.
P(t_a | t_b) ≥ (probability threshold)

(where P(t_a | t_b) is the probability that the word t_a occurs in the set of documents in which the word t_b occurs, W is the list of topic words extracted by CTF-CDF-ICF, and d is a constant; the condition is restated from the variable definitions, as the original equation images are not reproduced.) In addition,
when the condition of the above equation is satisfied, t_a is set as the parent of t_b and a hierarchical topic tree containing the relation is constructed.

A hierarchical-tree-based topic search system for online text documents according to an embodiment of the present invention includes: a similarity calculation unit for calculating the similarity between the documents included in a cluster; a medoid setting unit for setting the document with the smallest summed similarity as the medoid; a medoid calculation unit for calculating the similarity between a newly received document and the medoid; a cluster merging unit for merging a received document whose similarity value is below the threshold into the cluster to which the medoid belongs; and a topic tree generation unit for extracting topic words from the cluster and generating a topic tree of the cluster, wherein the topic tree generation unit builds the hierarchical relationships between the topic words using probability information computed within the cluster.
In this case, when a received document is merged into the cluster, the cluster merging unit may recalculate the similarity between the documents included in the merged cluster and reset the medoid.
The cluster merging unit may further generate the received document as an independent cluster if the similarity value between the received document and the medoid is greater than or equal to the threshold.
In addition, the topic tree generation unit may include a hierarchical relationship of the topic words extracted from the generated cluster or the independent cluster.
The topic tree generation unit may extract topic words using at least one of: the frequency of a word appearing in the cluster, the frequency of documents containing the word, and the number of clusters containing the word.
In addition, the topic tree generation unit may use the following CTF-CDF-ICF expression (reconstructed here from the variable definitions, as the original equation images are not reproduced):

CTF(t, c) = tc(t, c) / Σ_t′ tc(t′, c),  CDF(t, c) = dc(t, c) / D,  ICF(t) = log(C / cc(t))

(where tc(t, c) is the occurrence frequency of the word t in cluster c and the denominator of CTF is the sum of the occurrences of all words in c; dc(t, c) is the number of documents in cluster c containing t and D is the total number of documents in cluster c; cc(t) is the number of clusters in which t appears and C is the total number of clusters. The word with the largest combined CTF·CDF·ICF value is extracted as the topic word.) In addition, the probability information may be calculated using the following equation.
P(t_a | t_b) ≥ (probability threshold)

(where P(t_a | t_b) is the probability that the word t_a occurs in the set of documents in which the word t_b occurs, W is the list of topic words extracted by CTF-CDF-ICF, and d is a constant.) In addition,
when the condition of the above equation is satisfied, t_a is set as the parent of t_b and a hierarchical topic tree containing the relation is constructed.

According to the hierarchical-tree-based topic search method for online text documents of the present invention, the set of main keywords present in a big-data collection of web documents can be extracted and organized into a hierarchical tree; the topic tree is generated by automatically extracting the main keywords from a large-scale document set and automatically computing their hierarchical relationships.
FIG. 1 is a diagram illustrating the topic tree generation process of a medoid-based cluster.
FIG. 2 is a diagram showing an example of setting a threshold value in a cluster.
FIG. 3 is a diagram illustrating a topic tree generation system based on medoid clusters.
FIG. 4 is a diagram showing a topic tree generated by a conventional topic tree generation technique.
FIG. 5 is a diagram showing a topic tree generated by the proposed topic tree generation technique.
Other objects and features of the present invention will become apparent from the following description of embodiments with reference to the accompanying drawings.
Preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear.
However, the present invention is not limited to or by these embodiments. Like reference symbols in the drawings denote like elements.
Hereinafter, a subject search method according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 shows the topic tree generation process of a medoid-based cluster.
Referring to FIG. 1, the similarity between the documents in a cluster is calculated (S100). Here, the similarity between documents is expressed as the distance between them: the shorter the distance, the greater the similarity. The present invention introduces the concept of a medoid for determining this similarity. The medoid is the document that best describes the characteristics of a cluster, and every cluster has one: it is the document whose total distance to the other documents in the cluster is smallest. To select the medoid, each document is vectorized, the distances between the vectors are computed, and the vector closest to all the others is chosen. The distance between document vectors is the ordinary Euclidean distance.
Here, (a1, b1, c1, d1) and (a2, b2, c2, d2) represent the word counts of each document. For example, suppose the total word set is {a, b, c, d}, document x in the cluster contains the words a, a, a, b, and document y contains c, c, d, d, d. To calculate the distance between the documents, each is expressed as an n-dimensional vector, where the number of dimensions equals the number of words in the vocabulary (here, 4). In other words, each word is treated as a dimension, and the number of occurrences of the word is taken as the vector component in that dimension. Document x, with three occurrences of a and one of b, becomes the vector (3, 1, 0, 0); document y, with two occurrences of c and three of d, becomes (0, 0, 2, 3).
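As a concrete sketch of this vectorization and distance computation (the vocabulary and word counts are taken from the example above; the function names are illustrative, not from the patent):

```python
from math import sqrt

def doc_vector(words, vocabulary):
    """Represent a document (a list of words) as occurrence counts over a fixed vocabulary."""
    return [words.count(w) for w in vocabulary]

def euclidean(u, v):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vocabulary = ["a", "b", "c", "d"]
x = doc_vector(["a", "a", "a", "b"], vocabulary)       # (3, 1, 0, 0)
y = doc_vector(["c", "c", "d", "d", "d"], vocabulary)  # (0, 0, 2, 3)
dist = euclidean(x, y)  # sqrt(9 + 1 + 4 + 9) = sqrt(23)
```

The vector components are raw occurrence counts, matching the example in the text; any distance over these vectors could be substituted, but the patent specifies the usual Euclidean distance.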
The medoid is the actual vector closest to the midpoint of the n-dimensional vectors; in other words, its summed distance to the other vectors is the minimum.
The similarity between the documents included in the cluster is calculated by the above process (S100).
In step S110, the document with the smallest summed similarity (distance) is set as the medoid.
In the prior art, a mean-based (centroid) approach using average values was used. However, the mean vector is susceptible to noise, because the average may be a virtual point that corresponds to no actual document, and erroneous clustering due to noise can cause a cluster to collapse. In contrast, since the medoid of the present invention is an actual document vector, the one closest to all the other vectors even if it is not their exact center, noise is less likely to be introduced during clustering.
New documents are received in real time, and a received document similar to an existing cluster is merged into that cluster, which enables more robust clustering. Accordingly, the similarity between the received document and the medoid is calculated (S120). If the similarity value between the received document and the cluster's medoid is less than the threshold (S130), the received document is merged into the cluster to which the medoid belongs (S140): a received document with a similarity value below the threshold is considered close to the medoid. The threshold is set by choosing one of the similarity values obtained when the mutual similarities between documents were calculated. The purpose of the threshold is to decide whether to merge a received document into an existing cluster or to let it become a new independent cluster, and thereby to judge how close or distant each received document is. A received document whose similarity value is at or above the threshold is made into an independent cluster (S150). This is because the distance between the received document and the cluster is judged to be large; creating the document as an independent cluster leaves room for it to be merged with documents received later and leads to more systematic clustering. In a merged cluster, the documents of the existing cluster and the new document coexist, so the similarities between them are recalculated and the medoid is reset. In this way, whenever a new document arrives, more robust clustering is achieved by repeating the process of merging it into a cluster, or creating an independent cluster, based on its similarity to the reset medoid.
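A minimal sketch of steps S100–S150, treating the similarity value as a distance (smaller means more similar, as in the text); the function names and the policy of choosing the nearest cluster first are illustrative assumptions:

```python
def medoid(cluster, dist):
    """Pick the actual document whose summed distance to all other
    documents in the cluster is smallest (steps S100-S110)."""
    return min(cluster, key=lambda d: sum(dist(d, e) for e in cluster))

def receive(clusters, doc, dist, threshold):
    """Steps S120-S150: merge the incoming document into the cluster whose
    medoid is nearest if that distance is below the threshold; otherwise
    create a new independent cluster. The medoid of a merged cluster is
    implicitly reset, since medoid() is recomputed on each call."""
    if clusters:
        nearest = min(clusters, key=lambda c: dist(medoid(c, dist), doc))
        if dist(medoid(nearest, dist), doc) < threshold:
            nearest.append(doc)  # S140: merge into the medoid's cluster
            return clusters
    clusters.append([doc])       # S150: new independent cluster
    return clusters
```

For example, with one-dimensional "documents" and `dist = lambda a, b: abs(a - b)`, a document near an existing cluster's medoid is merged, while a distant one starts a new cluster.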
FIG. 2 is a diagram showing an embodiment of setting the threshold value. Referring to FIG. 2, the similarities between the documents in the cluster are computed pairwise, and one of the resulting similarity values is chosen as the threshold.
Generally, the topic words of a cluster appear frequently in it; rarely appearing words may carry special meaning, or may simply be words of general meaning. The present invention proposes extracting topic words with three measures: CTF (Cluster Term Frequency), CDF (Cluster Document Frequency), and ICF (Inverse Cluster Frequency). These are used to extract the key words that best describe each cluster and to create the topic tree, the top-down hierarchy between words that is the ultimate goal of the present invention.
The cluster term frequency (CTF) is the number of times a given term appears in a cluster, normalized by the total occurrences of all terms in that cluster. It is calculated as follows (the formula is reconstructed from the variable definitions, as the original equation image is not reproduced):

CTF(t, c) = tc(t, c) / Σ_t′ tc(t′, c)

where tc(t, c) is the frequency of occurrence of the term t in cluster c, and the denominator is the sum of the occurrences of all terms in cluster c. The frequency of document appearance in a cluster, called the cluster document frequency (CDF), also needs to be considered: it is the proportion of the cluster's documents that contain the given term. The cluster document frequency is calculated as follows:

CDF(t, c) = dc(t, c) / D
where dc(t, c) is the number of documents in cluster c that contain the term t, and D is the total number of documents in cluster c.
Finally, the number of clusters in which a given term occurs, the inverse cluster frequency (ICF), should be considered: it shows how widely the term is used across the entire set of clusters. To discriminate terms, the inverse cluster frequency is normalized; it is calculated as follows (a logarithmic normalization is assumed here, as the original equation image is not reproduced):

ICF(t) = log(C / cc(t))
where cc(t) is the number of clusters in which the term t appears, and C is the total number of clusters.
As described above, the cluster term frequency and the cluster document frequency capture how often a term appears within a cluster, while the inverse cluster frequency captures in how many clusters the term is used. The present invention combines these three measures into CTF-CDF-ICF to strengthen the extraction of topic words that can represent and describe a cluster.
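A sketch of the combined CTF-CDF-ICF score under the reconstructions above, where a document is a list of words and a cluster is a list of documents; the logarithmic form of ICF is an assumption, since the original equation images are not reproduced:

```python
from math import log

def ctf(t, cluster):
    """Cluster term frequency: occurrences of t in the cluster over the
    total occurrences of all terms in the cluster."""
    return sum(doc.count(t) for doc in cluster) / sum(len(doc) for doc in cluster)

def cdf(t, cluster):
    """Cluster document frequency: share of the cluster's documents containing t."""
    return sum(1 for doc in cluster if t in doc) / len(cluster)

def icf(t, clusters):
    """Inverse cluster frequency: log(C / cc(t)), assumed normalization."""
    cc = sum(1 for c in clusters if any(t in doc for doc in c))
    return log(len(clusters) / cc)

def ctf_cdf_icf(t, cluster, clusters):
    """Combined score; the highest-scoring words become the topic words."""
    return ctf(t, cluster) * cdf(t, cluster) * icf(t, clusters)
```

A term concentrated in one cluster scores high; a term occurring in every cluster has ICF = log(1) = 0 and is suppressed, mirroring the role of IDF in TF-IDF.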
The candidate topic words are sorted by their CTF-CDF-ICF scores, and the top-ranked words are extracted as the topic words. The number of topic words extracted varies because the number of documents differs from cluster to cluster: a small number of words suffices to represent a small cluster, while a larger number is needed for a large one. Even for equally important words, the extracted result can vary with the frequency at which documents and words appear, so the scores are normalized to extract topic words under the same conditions. To construct a topic tree, a generic method for building a conceptual hierarchy between words can be used. This common method is as follows.
For two topic words t_a and t_b, if P(t_a | t_b) is at or above a threshold, t_a subsumes t_b. Here P(t_a | t_b) is the probability that t_a occurs in the set of documents in which t_b occurs; t_a is a relatively more general topic word than t_b, so in the topic tree t_a can be the parent word and t_b the child word. This method creates a hierarchical topic tree through subordinate relations between words. The existing method is effective and simple, but it does not grasp the meaning of the topic words and builds the topic tree without considering how well each word describes the cluster. Therefore, the present invention introduces the following quantities to take into account a topic word's ability to describe its cluster.
The method proposed by the present invention considers two words t_a and t_b, and places t_a above t_b when N(t_a) ≥ N(t_b) and P(t_a | t_b) satisfies the probability condition. Here N(t_a) and N(t_b) are measured by CTF-CDF-ICF and indicate how well each topic word describes the cluster, and P(t_a | t_b) is the probability that t_a occurs in the documents containing t_b; the probability condition was determined in the present invention through experiments on various data. In using CTF-CDF-ICF to construct the topic tree, it must be considered that the range of CTF-CDF-ICF values differs per cluster: each cluster contains a different number of words, so the ranges over which CTF-CDF-ICF is calculated differ accordingly. Therefore, CTF-CDF-ICF is normalized as follows.
N(w) = 1 + d · (s(w) − s_min) / (s_max − s_min)

where w is a topic word of the cluster, W is the list of topic words derived by CTF-CDF-ICF, s(w) is the CTF-CDF-ICF value of w, and s_max and s_min are the largest and smallest CTF-CDF-ICF values in the cluster (the equation is reconstructed from these definitions, as the original images are not reproduced). Through this equation, all CTF-CDF-ICF values in a cluster are mapped into the range from 1 to (1 + d), where d is a positive number that can be set as intended. The equation corrects for the differing calculation ranges of CTF-CDF-ICF across clusters and serves as an index of how well each topic word depicts its cluster under the same conditions: the larger the value, the better the word describes the cluster. When the normalized value of t_a is at least that of t_b and the probability condition holds, a topic tree is generated in which the upper layer subsumes the lower layer. A word in an upper layer describes the cluster better than one in a lower layer, so it can include one or more lower-layer words; a lower-layer word can in turn include one or more words of the layers below it, and this is repeated to create the topic tree. In the process, topic words that cannot be linked upward or downward may arise; these can form independent topic trees, so one or more topic trees can be created in one cluster. The topic tree construction proposed by the present invention thus builds the tree by identifying the top-to-bottom relations between the topic words and repeating the process (S160).

FIG. 3 illustrates the medoid-cluster-based topic tree generation system.
Referring to FIG. 3, the system calculates the similarity between the documents included in a cluster. This similarity is the same as that of step S100 described above, so a redundant description is omitted; likewise, each unit of the system performs the corresponding step of the method described above.
The extraction of topic words using CTF-CDF-ICF sorts the candidate words by the scores computed with CTF-CDF-ICF and extracts the top-ranked words as the topic words. For two extracted topic words t_a and t_b, t_a is placed above t_b when N(t_a) ≥ N(t_b) and the probability condition on P(t_a | t_b) is satisfied; the topic tree is then constructed by normalizing the CTF-CDF-ICF values in the following manner, considering that their ranges differ from cluster to cluster.
N(w) = 1 + d · (s(w) − s_min) / (s_max − s_min)

Here w is a topic word of the cluster and W is the list of topic words derived by CTF-CDF-ICF; s_max and s_min are the largest and smallest CTF-CDF-ICF values in the cluster. Through this equation, all CTF-CDF-ICF values in the cluster are mapped into the range from 1 to (1 + d), where d is a positive number that can be set as intended. The generation of the topic tree using this equation is the same as in step S160 described above.

To evaluate the proposed method for constructing a topic tree, two data sets were prepared, from Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/) and Google News US (news.google.com). The first, Dataset-1, contains 1000 documents randomly selected from Reuters-21578; the second, Dataset-2, is a collection of 1000 documents on business, elections, technology, entertainment, sports, science, and health gathered from August 12, 2012 through September 24, 2012.
The present invention automatically constructs a topic detection system that grasps the contents of documents and then generates the topic tree. The system can be written in Java, and term-processing modules such as stop-word elimination and morphological analysis can be handled by the R tm package. To evaluate the accuracy of the generated topic tree, each subordinate relation is checked for inclusion in the correct topic tree, and the accuracy is calculated accordingly (the original equation image is not reproduced).
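Since the accuracy equation itself is not reproduced, the following is only an assumed form consistent with the surrounding text: the share of parent-child relations in the generated tree that also appear in the correct (reference) tree:

```python
def relation_accuracy(generated_edges, reference_edges):
    """Accuracy of subordinate relations: fraction of (parent, child)
    pairs in the generated topic tree that appear in the correct tree.
    The exact formula is an assumption; the patent's equation image is
    not reproduced in the text."""
    if not generated_edges:
        return 0.0
    return sum(1 for e in generated_edges if e in reference_edges) / len(generated_edges)
```

Edges are represented as (parent, child) pairs, so a relation counts as correct only if both the words and their direction match the reference.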
(Table 1: accuracy of the subordinate relations per dataset; the table contents are not reproduced here.)
Table 1 compares the accuracy of the subordinate relations in the topic trees generated by the conventional method and by the method proposed in the present invention. Accuracy increased by 17.8% on Dataset-1 (Reuters-21578) and by 16.1% on Dataset-2 (Google News). This confirms that the proposed method not only generates the topic tree logically but also improves the accuracy of the subordinate relations.
FIG. 4 shows part of a topic tree generated by the conventional baseline system, and FIG. 5 shows part of a topic tree generated by the method proposed in the present invention. Both figures depict the main contents, but the proposed method produces a more accurate topic tree than the prior art. In FIG. 4, six topic trees are displayed, and a misleading relation appears between "win" and "britain"; in FIG. 5, the relations between the main words are correctly grasped and displayed within a single topic tree. The topic search method presented in the present invention therefore yields more accurate search results in these embodiments.
The hierarchical-tree-based topic search method for online text documents according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.
As described above, the present invention has been described with reference to particular embodiments, specific elements, and drawings, but it is not limited to the described embodiments, and various modifications and changes may be made by those skilled in the art to which the present invention pertains.

Accordingly, the spirit of the present invention should not be construed as limited to the embodiments described; the following claims and all equivalents thereof belong to the scope of the present invention.
Claims (16)
Calculating the similarity between the documents included in a cluster;
Setting the document having the smallest summed similarity as a medoid;
Calculating the similarity between a newly received document and the medoid;
Merging a received document whose similarity value with the medoid is below a threshold into the cluster to which the medoid belongs;
Extracting topic words from the cluster to generate a topic tree of the cluster;
And generating the topic tree so as to build the hierarchical relationships between the topic words using the probability information generated by the topic words in the cluster.
And, when the received document is merged into the cluster, calculating the similarity between the documents included in the merged cluster and resetting the medoid.
And generating the received document as an independent cluster if the similarity value between the received document and the medoid is greater than or equal to the threshold value.
Wherein the topic tree is a hierarchical relationship of the topic words extracted from the generated cluster or the independent clusters.
Wherein the extraction of the topic words uses at least one of the frequency of a word appearing in the cluster, the frequency of documents containing the word, and the number of clusters containing the word; a hierarchical-tree-based topic search method for online text documents.
Wherein the topic word is extracted using the following CTF-CDF-ICF expression.
CTF(t, c) = tc(t, c) / Σ_t′ tc(t′, c),  CDF(t, c) = dc(t, c) / D,  ICF(t) = log(C / cc(t))
(where tc(t, c) is the occurrence frequency of the word t in cluster c and the denominator of CTF is the sum of the occurrences of all words in c; dc(t, c) is the number of documents in cluster c containing t and D is the total number of documents in cluster c; cc(t) is the number of clusters in which t appears and C is the total number of clusters. The word with the largest combined value is extracted as the topic word.)
Wherein the probability information is calculated using the following equation.
P(t_a | t_b) ≥ (probability threshold)
(where P(t_a | t_b) is the probability that the word t_a occurs in the set of documents in which the word t_b occurs, W is the list of topic words extracted by CTF-CDF-ICF, and d is a constant.)
Wherein, when the condition of the above formula is satisfied, t_a is set as the parent of t_b; a hierarchical-tree-based topic search method for online text documents.
A similarity calculation unit for calculating the similarity between the documents included in a cluster;
A medoid setting unit for setting the document having the smallest summed similarity as a medoid;
A medoid calculation unit for calculating the similarity between a newly received document and the medoid;
A cluster merging unit for merging a received document whose similarity value is below the threshold into the cluster to which the medoid belongs;
A topic tree generation unit for extracting topic words from the cluster and generating a topic tree of the cluster;
Wherein the topic tree generation unit builds the hierarchical relationships between the topic words using the probability information generated by the topic words in the cluster.
Wherein the cluster merging unit comprises:
When a received document is merged into the cluster, calculating the similarity between the documents included in the merged cluster and resetting the medoid.
Wherein the cluster merging unit comprises:
And generating the received document as an independent cluster if the similarity value between the received document and the medoid is greater than or equal to the threshold value.
Wherein the topic tree generating unit comprises:
Wherein the hierarchical tree-based topic search system is a hierarchical relationship of the topic words extracted from the generated cluster or the independent clusters.
The topic tree generation unit,
Extracts topic words using at least one of the frequency of a word appearing in the cluster, the frequency of documents containing the word, and the number of clusters containing the word; a hierarchical-tree-based topic search system.
Wherein the topic tree generating unit comprises:
A hierarchical tree-based topic search system for online text documents, characterized by using the following CTF-CDF-ICF formula:
CTF(t, c) = tc(t, c) / Σ_t′ tc(t′, c),  CDF(t, c) = dc(t, c) / D,  ICF(t) = log(C / cc(t))
(where tc(t, c) is the occurrence frequency of the word t in cluster c and the denominator of CTF is the sum of the occurrences of all words in c; dc(t, c) is the number of documents in cluster c containing t and D is the total number of documents in cluster c; cc(t) is the number of clusters in which t appears and C is the total number of clusters. The word with the largest combined value is extracted as the topic word.)
Wherein the probability information is calculated using the following equation.
P(t_a | t_b) ≥ (probability threshold)
(where P(t_a | t_b) is the probability that the word t_a occurs in the set of documents in which the word t_b occurs, W is the list of topic words extracted by CTF-CDF-ICF, and d is a constant.)
In the formula when Is satisfied, Gt; And a hierarchical tree-based topic search system for hierarchically structuring the online text documents.
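The parent/child test described above resembles subsumption-style hierarchy building from document co-occurrence, and can be sketched as follows. The exact condition (P(x|y) ≥ d while P(y|x) < d) is reconstructed from a garbled machine translation and is therefore an assumption, as are the function names.

```python
def cond_prob(x, y, docs):
    """P(x | y): fraction of documents containing y that also contain x.

    Each document in `docs` is a set of words.
    """
    with_y = [doc for doc in docs if y in doc]
    if not with_y:
        return 0.0
    return sum(1 for doc in with_y if x in doc) / len(with_y)

def build_hierarchy(topic_words, docs, d=0.8):
    """Return {child: parent} edges between topic words.

    x becomes the parent of y when x almost always accompanies y
    (P(x|y) >= d) but not the reverse (P(y|x) < d).
    """
    parent = {}
    for x in topic_words:
        for y in topic_words:
            if x == y:
                continue
            if cond_prob(x, y, docs) >= d and cond_prob(y, x, docs) < d:
                parent[y] = x  # x subsumes y
    return parent
```

Intuitively, a broad word like "animal" co-occurs with nearly every document containing "cat", while "cat" appears in only some "animal" documents, so "animal" is placed above "cat" in the topic tree.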
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020130140913A KR20150057497A (en) | 2013-11-19 | 2013-11-19 | Method and System of Topic Detection for On-line Text Documents: A Topic Tree-based Approach |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20150057497A true KR20150057497A (en) | 2015-05-28 |
Family
ID=53392345
2013-11-19: KR application KR1020130140913A filed; status: not active (Application Discontinuation)
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101706300B1 (en) | 2015-10-13 | 2017-02-14 | 포항공과대학교 산학협력단 | Apparatus and method for generating word hierarchy of technology terms |
CN111581355A (en) * | 2020-05-13 | 2020-08-25 | 杭州安恒信息技术股份有限公司 | Method, device and computer storage medium for detecting subject of threat intelligence |
CN111581355B (en) * | 2020-05-13 | 2023-07-25 | 杭州安恒信息技术股份有限公司 | Threat information topic detection method, device and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
E902 | Notification of reason for refusal | ||
E601 | Decision to refuse application |