CN110245209A - Method for extracting milestone events from massive texts - Google Patents
- Publication number
- CN110245209A
- Authority
- CN
- China
- Prior art keywords
- file
- cluster
- event
- milestone
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for extracting milestone events from massive text, comprising the steps of: (1) extracting the file and folder hierarchy association information from the massive text and storing it as a tree structure; (2) concatenating each file's name with the path name of its folder to form the text of the current file, computing the tree distance of each file using the K-Means clustering algorithm, and grouping files that have the same hierarchical relationship together as initial clusters, thereby determining the initial cluster size of the K-Means clustering algorithm; (3) extracting milestone events and time nodes within each cluster, and screening the extraction results to form the milestone node list of each event. Because milestone events and event nodes are extracted within each cluster after clustering, the invention avoids the problem that the same event is extracted as several sub-events that cannot be merged, and it also improves the accuracy and completeness of extraction.
Description
Technical field
The invention belongs to the technical field of milestone event extraction and relates to a method for extracting milestone events from massive text.
Background art
Existing information extraction methods can already extract events and times from text. With massive data, however, several documents may describe the same event. If events and times are extracted directly from each document, the milestone node information of a single event may be scattered across multiple extracted events and cannot be aggregated, so complete milestone information for the event cannot be obtained.
Summary of the invention
The technical problem to be solved by the invention is to provide a method for extracting milestone events from massive text, so as to solve the problems in the prior art.
The technical solution adopted by the invention is a method for extracting milestone events from massive text, comprising the following steps:
(1) extract the file and folder hierarchy association information from the massive text, and store it as a tree structure in which file names and folder names are the nodes and hierarchical relationships are the edges;
(2) concatenate each file's name with the path name of its folder to form the text of the current file; using the K-Means clustering algorithm, compute the tree distance of each file, group files that have the same hierarchical relationship together as initial clusters, and at the same time determine the initial cluster size of the K-Means clustering algorithm;
(3) for the clustering result obtained in step (2), extract milestone events and time nodes within each cluster, and screen the extraction results to form the milestone node list of each event.
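Step (1) can be pictured with a small sketch. The Python below is an illustration only, not the patented implementation: the `Node` class, the `build_tree` and `file_levels` functions, and the sample paths are hypothetical names chosen for this example. It assumes each document is identified by a relative, slash-separated path, and it stores file and folder names as nodes with parent-child links as edges.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Node:
    """One file or folder name; children are keyed by name (the tree edges)."""
    name: str
    is_file: bool = False
    children: dict = field(default_factory=dict)


def build_tree(paths):
    """Insert every file path into a tree hanging off a synthetic root node."""
    root = Node(name="")
    for raw in paths:
        parts = PurePosixPath(raw).parts
        node = root
        for i, part in enumerate(parts):
            child = node.children.setdefault(part, Node(name=part))
            if i == len(parts) - 1:
                child.is_file = True  # the last path component is the file itself
            node = child
    return root


def file_levels(node, prefix="", depth=0):
    """Map each file path to its depth below the synthetic root."""
    levels = {}
    for name, child in node.children.items():
        path = f"{prefix}/{name}" if prefix else name
        if child.is_file:
            levels[path] = depth + 1
        levels.update(file_levels(child, path, depth + 1))
    return levels


# Hypothetical sample archive.
paths = [
    "archive/readme.txt",
    "archive/bridge/kickoff.txt",
    "archive/bridge/completion.txt",
    "archive/tunnel/survey.txt",
]
tree = build_tree(paths)
levels = file_levels(tree)
```

Walking the tree from the root gives each file's level in the hierarchy, which is the information the clustering in step (2) draws on.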
Beneficial effects of the invention: compared with the prior art, the invention first clusters the texts so that different texts describing the same event are gathered together. During clustering, the existing archival information of the files, namely the folder hierarchy, is introduced to improve the document clustering result. Milestone events and event nodes are then extracted within each cluster, which avoids the problem that the same event is extracted as several sub-events that cannot be merged, and also improves the accuracy and completeness of extraction.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawing and a specific embodiment.
Embodiment 1: as shown in Fig. 1, a method for extracting milestone events from massive text comprises the following steps:
(1) extract the file and folder hierarchy association information from the massive text, and store it as a tree structure in which file names and folder names are the nodes and hierarchical relationships are the edges;
(2) concatenate each file's name with the path name of its folder to form the text of the current file; using the K-Means clustering algorithm, compute the tree distance of each file (each file is a node of the tree and its path is a branch of the tree; for example, in a package of nested multi-level folders containing many files, the plain files stored directly under the first folder form the first level, and the other folders stored in the first folder can be descended into further), group files that have the same hierarchical relationship together as initial clusters, and at the same time determine the initial cluster size of the K-Means clustering algorithm (files at the same level form the same initial cluster);
(3) for the clustering result obtained in step (2), extract milestone events and time nodes within each cluster, and screen the extraction results to form the milestone node list of each event.
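The tree distance and initial clusters of step (2) can be sketched as follows. This is a hedged illustration, not the patent's own definition: the text does not give a formula for the tree distance, so the sketch assumes it is the number of edges between two file nodes through their lowest common folder. It also assumes that "the same hierarchical relationship" means sharing the same parent folder, and that the initial cluster size corresponds to the number of such groups. The function names and sample paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def tree_distance(path_a, path_b):
    """Count the edges between two file nodes in the folder tree.

    Relative paths are treated as hanging off one shared, implicit root.
    """
    a = PurePosixPath(path_a).parts
    b = PurePosixPath(path_b).parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Steps up from each file to the lowest common ancestor.
    return (len(a) - common) + (len(b) - common)


def initial_clusters(paths):
    """Group files by parent folder (assumed reading of 'same hierarchy')."""
    groups = defaultdict(list)
    for p in paths:
        groups[str(PurePosixPath(p).parent)].append(p)
    return dict(groups)


# Hypothetical sample archive.
paths = [
    "archive/readme.txt",
    "archive/bridge/kickoff.txt",
    "archive/bridge/completion.txt",
    "archive/tunnel/survey.txt",
]
groups = initial_clusters(paths)
k = len(groups)  # assumed initial number of clusters for K-Means
```

Under these assumptions, two files in the same folder are at distance 2, and files in sibling folders are at distance 4, so folder neighbours start out in the same cluster.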
For texts that have file and folder hierarchical relationships, the invention clusters the texts by topic and then extracts events and event time nodes from the texts of the same cluster. In massive text collections, some documents have been filed manually, and files under the same subfolder are already associated. Building on a conventional clustering algorithm, the file and folder hierarchy information is brought into text clustering, so that the clustering result depends not only on the semantic information of the words in the text but also on the text's file-level associations. Events and event time nodes are then extracted within each resulting document cluster to obtain the milestone events in the massive text.
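One way to let both word content and folder hierarchy influence the clusters is sketched below. The patent names K-Means but does not specify a feature representation, so everything here is an assumption for illustration: term-count vectors built from the spliced path, file name, and content; initial centroids taken as the mean of each parent-folder group; and plain Lloyd iterations. All function names and the sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def tokens(path, content):
    """Splice folder names, file name, and content into one token list."""
    spliced = " ".join(PurePosixPath(path).parts) + " " + content
    return re.findall(r"[a-z0-9]+", spliced.lower())


def vectorize(docs):
    """Turn {path: content} into {path: term-count vector} over one vocabulary."""
    counts = {p: Counter(tokens(p, c)) for p, c in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    return {p: [cnt[t] for t in vocab] for p, cnt in counts.items()}


def mean(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_seeded(docs, max_iter=20):
    """K-Means whose initial centroids come from parent-folder groups."""
    vecs = vectorize(docs)
    groups = defaultdict(list)
    for p in docs:
        groups[str(PurePosixPath(p).parent)].append(p)
    centroids = [mean([vecs[p] for p in members]) for members in groups.values()]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            p: min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))
            for p, v in vecs.items()
        }
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [vecs[p] for p, c in assignment.items() if c == i]
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = mean(members)
    return assignment


# Hypothetical sample documents.
docs = {
    "archive/bridge/kickoff.txt": "bridge project kickoff meeting held",
    "archive/bridge/done.txt": "bridge project construction completed",
    "archive/tunnel/survey.txt": "tunnel survey started",
}
clusters = kmeans_seeded(docs)
```

Because folder names are part of each vector and the seeds come from folders, documents filed together start close to the same centroid, while the content terms still allow a document to move if its wording fits another cluster better.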
By integrating the existing association information of files into the text clustering algorithm, the invention improves the clustering of massive documents and then extracts milestone events within each cluster of similar documents.
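The per-cluster extraction and screening can be illustrated with a minimal sketch. The patent does not describe the extraction technique, so this example swaps in a deliberately simple stand-in: a regular expression for `YYYY-MM-DD` dates, with the sentence containing the date treated as the event. Screening is assumed to mean dropping sentences without a valid date and removing duplicates. The function name and sample texts are hypothetical.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_milestones(cluster_texts):
    """Collect (date, sentence) pairs from all documents of one cluster."""
    seen = set()
    milestones = []
    for text in cluster_texts:
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            match = DATE_RE.search(sentence)
            if not match:
                continue  # screening: no time node in this sentence
            try:
                when = date(*map(int, match.groups()))
            except ValueError:
                continue  # screening: impossible calendar date
            key = (when, sentence.lower())
            if key not in seen:  # screening: same milestone in several documents
                seen.add(key)
                milestones.append((when, sentence))
    return sorted(milestones)  # chronological milestone node list


# Hypothetical documents that were clustered together.
cluster = [
    "Bridge project approved on 2018-03-01. Funding was discussed.",
    "Construction started on 2018-09-15. Bridge project approved on 2018-03-01.",
    "The bridge opened to traffic on 2019-06-20.",
]
timeline = extract_milestones(cluster)
```

Running the extraction over a whole cluster rather than over single documents is what lets the approval, construction, and opening dates land in one timeline even though they come from three different files.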
The above is merely a specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.
Claims (1)
1. A method for extracting milestone events from massive text, characterized in that the method comprises the following steps:
(1) extract the file and folder hierarchy association information from the massive text, and store it as a tree structure in which file names and folder names are the nodes and hierarchical relationships are the edges;
(2) concatenate each file's name with the path name of its folder to form the text of the current file; using the K-Means clustering algorithm, compute the tree distance of each file, group files that have the same hierarchical relationship together as initial clusters, and at the same time determine the initial cluster size of the K-Means clustering algorithm;
(3) for the clustering result obtained in step (2), extract milestone events and time nodes within each cluster, and screen the extraction results to form the milestone node list of each event.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910539127.3A CN110245209B (en) | 2019-06-20 | 2019-06-20 | Method for extracting milestone events from massive texts |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910539127.3A CN110245209B (en) | 2019-06-20 | 2019-06-20 | Method for extracting milestone events from massive texts |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110245209A (en) | 2019-09-17 |
CN110245209B CN110245209B (en) | 2022-09-23 |
Family
ID=67888433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910539127.3A Active CN110245209B (en) | 2019-06-20 | 2019-06-20 | Method for extracting milestone events from massive texts |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110245209B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101388013A (en) * | 2007-09-12 | 2009-03-18 | 日电(中国)有限公司 | Method and system for clustering network files |
US20090106255A1 (en) * | 2001-01-11 | 2009-04-23 | Attune Systems, Inc. | File Aggregation in a Switched File System |
CN104091054A (en) * | 2014-06-26 | 2014-10-08 | 中国科学院自动化研究所 | Mass disturbance warning method and system applied to short texts |
US20150088827A1 (en) * | 2013-09-26 | 2015-03-26 | Cygnus Broadband, Inc. | File block placement in a distributed file system network |
CN105117397A (en) * | 2015-06-18 | 2015-12-02 | 浙江大学 | Method for searching semantic association of medical documents based on ontology |
CN106484887A (en) * | 2016-10-18 | 2017-03-08 | 安徽天达网络科技有限公司 | A kind of document handling method based on internet |
CN108399213A (en) * | 2018-02-05 | 2018-08-14 | 中国科学院信息工程研究所 | A kind of clustering method and system of user oriented personal document |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090106255A1 (en) * | 2001-01-11 | 2009-04-23 | Attune Systems, Inc. | File Aggregation in a Switched File System |
CN101388013A (en) * | 2007-09-12 | 2009-03-18 | 日电(中国)有限公司 | Method and system for clustering network files |
US20150088827A1 (en) * | 2013-09-26 | 2015-03-26 | Cygnus Broadband, Inc. | File block placement in a distributed file system network |
CN104091054A (en) * | 2014-06-26 | 2014-10-08 | 中国科学院自动化研究所 | Mass disturbance warning method and system applied to short texts |
CN105117397A (en) * | 2015-06-18 | 2015-12-02 | 浙江大学 | Method for searching semantic association of medical documents based on ontology |
CN106484887A (en) * | 2016-10-18 | 2017-03-08 | 安徽天达网络科技有限公司 | A kind of document handling method based on internet |
CN108399213A (en) * | 2018-02-05 | 2018-08-14 | 中国科学院信息工程研究所 | A kind of clustering method and system of user oriented personal document |
Non-Patent Citations (1)
Title |
---|
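Qiu Shaogang (仇绍刚): "Research on Knowledge Acquisition Methods and System Integration Based on Meta-Search" (基于元搜索的知识获取方法与系统集成研究), China Master's Theses Full-text Database, Information Science and Technology series * |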
Also Published As
Publication number | Publication date |
---|---|
CN110245209B (en) | 2022-09-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101464905B (en) | Web page information extraction system and method | |
CN104331446B (en) | A kind of massive data processing method mapped based on internal memory | |
CN107609052A (en) | A kind of generation method and device of the domain knowledge collection of illustrative plates based on semantic triangle | |
US10929439B2 (en) | Taxonomic tree generation | |
CN111178079B (en) | Triplet extraction method and device | |
CN110232187A (en) | Enterprise name similarity recognition method, device, computer equipment and storage medium | |
CN108470040B (en) | Method and device for warehousing unstructured data | |
CN107608948A (en) | A kind of construction method and device of Text Information Extraction model | |
CN102810114A (en) | Personal computer resource management system based on body | |
CN107291858A (en) | Data indexing method based on character string suffix | |
KR100835290B1 (en) | System and method for classifying document | |
CN106933823A (en) | Method of data synchronization and device | |
CN106250552A (en) | Search engine results page is assembled WEB page | |
CN106649557A (en) | Semantic association mining method for defect report and mail list | |
CN108304377A (en) | A kind of extracting method and relevant apparatus of long-tail word | |
CN104750673B (en) | Text matches filter method and device | |
CN102799632B (en) | Method for acquiring and describing text information based on visual basic application (VBA) and tetrahedron data model | |
CN110781297B (en) | Classification method of multi-label scientific research papers based on hierarchical discriminant trees | |
CN106096014A (en) | The Text Clustering Method of mixing length text set based on DMR | |
CN110245209A (en) | A method of extracting milestone event from mass text | |
US8032521B2 (en) | Managing structured content stored as a binary large object (BLOB) | |
CN107133200A (en) | A kind of android system text string extracting and merging method | |
CN104376000A (en) | Webpage attribute determination method and webpage attribute determination device | |
CN104572730A (en) | Method and device for importing and exporting digital resources | |
JP2013206280A (en) | Deletion file detection program, deletion file detection method and deletion file detection device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||