CN110245209A - Method for extracting milestone events from massive texts - Google Patents

Method for extracting milestone events from massive texts

Info

Publication number
CN110245209A
Authority
CN
China
Prior art keywords
file
cluster
event
milestone
text
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910539127.3A
Other languages
Chinese (zh)
Other versions
CN110245209B (en)
Inventor
王鹏宇
吴漾
罗念华
孔庆波
缪新萍
李文科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Power Grid Co Ltd
Original Assignee
Guizhou Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-06-20
Filing date
2019-06-20
Publication date
2019-09-17
Application filed by Guizhou Power Grid Co Ltd
Priority to CN201910539127.3A
Publication of CN110245209A
Application granted
Publication of CN110245209B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/322: Trees
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses a method for extracting milestone events from massive texts, comprising the steps of: (1) extracting the file and folder hierarchy information from the massive texts and storing the data in a tree structure; (2) concatenating each file's name with its path name to form the text of that file, computing the tree distance of each file using the K-Means clustering algorithm, and grouping files that have the same hierarchical relationship together as the initial clusters, which determines the initial cluster size of the K-Means clustering algorithm; (3) extracting milestone events and time nodes within each cluster, and screening the extraction results to form the milestone node list of each event. Because the invention extracts milestone events and event nodes within each cluster after clustering, it avoids the problem that one event is extracted as several sub-events that cannot be merged, and it also improves the accuracy and completeness of the extraction.

Description

Method for extracting milestone events from massive texts
Technical field
The invention belongs to the technical field of milestone event extraction and relates to a method for extracting milestone events from massive texts.
Background art
Existing information extraction methods can already extract events and times from text. With massive data, however, the same event may be described in several documents. If events and times are extracted directly from each document, the milestone node information of one event may be scattered across several extracted events and cannot be aggregated, so complete milestone information for the event cannot be obtained.
Summary of the invention
The technical problem to be solved by the invention is to provide a method for extracting milestone events from massive texts that solves the problems in the prior art.
The technical solution adopted by the invention is a method for extracting milestone events from massive texts, comprising the following steps:
(1) extracting the file and folder hierarchy information from the massive texts, taking file names and folder names as nodes and hierarchical relationships as edges, and storing the data in a tree structure;
(2) concatenating each file's name with its path name to form the text of that file, and, using the K-Means clustering algorithm, computing the tree distance of each file, grouping files that have the same hierarchical relationship together as the initial clusters, and thereby determining the initial cluster size of the K-Means clustering algorithm;
(3) for the clustering result obtained in step (2), extracting milestone events and time nodes within each cluster, and screening the extraction results to form the milestone node list of each event.
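Step (1) and the tree distance used in step (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not define the tree distance, so the sketch assumes it is the number of edges between two files through their deepest shared folder. The names `build_tree` and `tree_distance` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def build_tree(paths):
    """Store folder and file names as nodes; each parent-child link is an edge."""
    tree = {}
    for path in paths:
        node = tree
        for part in PurePosixPath(path).parts:
            # Descend into the child node, creating it on first sight.
            node = node.setdefault(part, {})
    return tree


def tree_distance(path_a, path_b):
    """Count the edges between two files via their deepest shared folder (assumed metric)."""
    a = PurePosixPath(path_a).parts
    b = PurePosixPath(path_b).parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


# Hypothetical archive layout used only for illustration.
paths = [
    "proj/substation/plan.txt",
    "proj/substation/report.txt",
    "proj/line/survey.txt",
]
tree = build_tree(paths)
print(tree_distance(paths[0], paths[1]))  # files in the same folder: 2
print(tree_distance(paths[0], paths[2]))  # files in sibling folders: 4
```

Under this assumed metric, files in the same folder are always closer to each other than to files elsewhere in the tree, which is the property the method relies on when it seeds clusters from the folder hierarchy.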
Beneficial effects of the invention: compared with the prior art, the invention first clusters the texts so that different texts describing the same event are gathered together. During clustering it introduces the archiving information that already exists for the files, namely the file and folder hierarchy, which improves the document clustering result. Milestone events and event nodes are then extracted within each cluster, which avoids the problem that one event is extracted as several sub-events that cannot be merged and also improves the accuracy and completeness of the extraction.
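The per-cluster extraction can be pictured with the sketch below. The patent does not say how events and time nodes are recognized or how results are screened, so the ISO-style date pattern, the sentence splitting, and the name `extract_milestones` are all assumptions made for illustration, and screening is reduced here to dropping exact duplicates.

```python
import re

# Assumed date format for illustration: YYYY-MM-DD.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_milestones(cluster_docs):
    """Collect (date, sentence) pairs from all documents of one cluster,
    drop exact duplicates, and return them in chronological order."""
    seen = set()
    for text in cluster_docs:
        # Split after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            match = DATE.search(sentence)
            if match:
                seen.add((match.group(0), sentence.strip()))
    # ISO-style dates sort chronologically when sorted as strings.
    return sorted(seen)


# Hypothetical documents that describe the same event in one cluster.
cluster = [
    "Construction started on 2018-03-01. Weather was fine.",
    "The project was accepted on 2019-01-15.",
    "Construction started on 2018-03-01.",
]
for date, sentence in extract_milestones(cluster):
    print(date, sentence)
```

Because all documents of a cluster feed one list, a milestone mentioned in several documents appears once, and milestones mentioned in different documents end up on the same timeline.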
Brief description of the drawings
Fig. 1 is a flow diagram of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawing and a specific embodiment.
Embodiment 1: as shown in Fig. 1, a method for extracting milestone events from massive texts comprises the following steps:
(1) extracting the file and folder hierarchy information from the massive texts, taking file names and folder names as nodes and hierarchical relationships as edges, and storing the data in a tree structure;
(2) concatenating each file's name with its path name to form the text of that file, and, using the K-Means clustering algorithm, computing the tree distance of each file. Each file is a node of the tree and each path is a branch. For example, in a file package made of nested folders that holds many files, the plain files stored directly under the first folder form the first layer, and the other folders stored in the first folder can be traversed further downward in the same way. Files that have the same hierarchical relationship are grouped together as the initial clusters, which also determines the initial cluster size of the K-Means clustering algorithm (files at the same level form the same initial cluster);
(3) for the clustering result obtained in step (2), extracting milestone events and time nodes within each cluster, and screening the extraction results to form the milestone node list of each event.
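The concatenation and initial grouping in step (2) might look like the following sketch. It assumes that "the same hierarchical relationship" means sharing a parent folder and that the "initial cluster size" refers to the number of initial clusters K; the patent states neither explicitly. The function names `splice_text` and `initial_clusters` are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def splice_text(path):
    """Join the folder names and the file stem into one text string for the file."""
    p = PurePosixPath(path)
    return " ".join(list(p.parent.parts) + [p.stem])


def initial_clusters(paths):
    """Group files by parent folder; the group count is taken as the initial K (assumption)."""
    groups = defaultdict(list)
    for path in paths:
        groups[str(PurePosixPath(path).parent)].append(path)
    return dict(groups), len(groups)


# Hypothetical archive layout used only for illustration.
paths = [
    "proj/substation/plan.txt",
    "proj/substation/report.txt",
    "proj/line/survey.txt",
]
print(splice_text(paths[0]))  # proj substation plan
groups, k = initial_clusters(paths)
print(k)  # 2
```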
The invention clusters texts by topic using the hierarchical relationships of the files and folders in which the texts are stored, and then extracts events and event time nodes from the texts of each cluster. In massive text collections, some documents have been archived manually, so files under the same subfolder are already related. Building on a traditional clustering algorithm, the hierarchy information of files and folders is brought into the text clustering process, so that the clustering result depends not only on the semantic information of the words in the texts but also on the file-level relationships of the texts. Events and event time nodes are then extracted within each resulting document cluster to obtain the milestone events in the massive texts.
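One way to let both word content and folder hierarchy influence the clustering is sketched below. It is an assumption-laden illustration rather than the patented algorithm: each document becomes a bag-of-words vector extended with weighted one-hot flags for its parent folder, and a plain K-Means loop is seeded with one centroid per folder. The names `vectorize`, `kmeans_seeded`, and the `folder_weight` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def vectorize(docs, folder_weight=1.0):
    """Map {path: text} to vectors: word counts plus weighted parent-folder flags."""
    paths = sorted(docs)
    vocab = sorted({w for t in docs.values() for w in t.lower().split()})
    folders = sorted({str(PurePosixPath(p).parent) for p in paths})
    vectors = {}
    for p in paths:
        counts = Counter(docs[p].lower().split())
        words = [float(counts[w]) for w in vocab]
        parent = str(PurePosixPath(p).parent)
        flags = [folder_weight if parent == f else 0.0 for f in folders]
        vectors[p] = words + flags
    return vectors


def mean(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_seeded(vectors, max_iter=20):
    """Run K-Means with one seed centroid per parent folder (K = number of folders)."""
    by_folder = defaultdict(list)
    for p, v in vectors.items():
        by_folder[str(PurePosixPath(p).parent)].append(v)
    centroids = [mean(rows) for _, rows in sorted(by_folder.items())]
    labels = {}
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every document.
        new_labels = {
            p: min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for p, v in vectors.items()
        }
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each non-empty cluster's centroid.
        for k in range(len(centroids)):
            members = [vectors[p] for p, lab in labels.items() if lab == k]
            if members:
                centroids[k] = mean(members)
    return labels


# Hypothetical documents used only for illustration.
docs = {
    "a/x.txt": "substation construction started",
    "a/y.txt": "substation construction completed",
    "b/z.txt": "line survey approved",
}
labels = kmeans_seeded(vectorize(docs))
print(labels)
```

Raising `folder_weight` makes the archive structure dominate the result, while setting it to zero falls back to clustering on words alone with folder-based seeds.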
By integrating the existing file association information into the text clustering algorithm, the invention improves the clustering of massive documents and then extracts milestone events within each cluster of those documents.
The above is merely a specific embodiment of the invention, and the protection scope of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the invention shall therefore be determined by the protection scope of the claims.

Claims (1)

1. A method for extracting milestone events from massive texts, characterized in that the method comprises the following steps:
(1) extracting the file and folder hierarchy information from the massive texts, taking file names and folder names as nodes and hierarchical relationships as edges, and storing the data in a tree structure;
(2) concatenating each file's name with its path name to form the text of that file, and, using the K-Means clustering algorithm, computing the tree distance of each file, grouping files that have the same hierarchical relationship together as the initial clusters, and thereby determining the initial cluster size of the K-Means clustering algorithm;
(3) for the clustering result obtained in step (2), extracting milestone events and time nodes within each cluster, and screening the extraction results to form the milestone node list of each event.
CN201910539127.3A 2019-06-20 2019-06-20 Method for extracting milestone events from massive texts Active CN110245209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910539127.3A CN110245209B (en) 2019-06-20 2019-06-20 Method for extracting milestone events from massive texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910539127.3A CN110245209B (en) 2019-06-20 2019-06-20 Method for extracting milestone events from massive texts

Publications (2)

Publication Number Publication Date
CN110245209A true CN110245209A (en) 2019-09-17
CN110245209B CN110245209B (en) 2022-09-23

Family

ID=67888433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910539127.3A Active CN110245209B (en) 2019-06-20 2019-06-20 Method for extracting milestone events from massive texts

Country Status (1)

Country Link
CN (1) CN110245209B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101388013A (en) * 2007-09-12 2009-03-18 日电(中国)有限公司 Method and system for clustering network files
US20090106255A1 (en) * 2001-01-11 2009-04-23 Attune Systems, Inc. File Aggregation in a Switched File System
CN104091054A (en) * 2014-06-26 2014-10-08 中国科学院自动化研究所 Mass disturbance warning method and system applied to short texts
US20150088827A1 (en) * 2013-09-26 2015-03-26 Cygnus Broadband, Inc. File block placement in a distributed file system network
CN105117397A (en) * 2015-06-18 2015-12-02 浙江大学 Method for searching semantic association of medical documents based on ontology
CN106484887A (en) * 2016-10-18 2017-03-08 安徽天达网络科技有限公司 A kind of document handling method based on internet
CN108399213A (en) * 2018-02-05 2018-08-14 中国科学院信息工程研究所 A kind of clustering method and system of user oriented personal document

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
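仇绍刚: "Research on knowledge acquisition methods and system integration based on meta-search", China Master's Theses Full-text Database, Information Science and Technology *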

Also Published As

Publication number Publication date
CN110245209B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN101464905B (en) Web page information extraction system and method
CN104331446B (en) A kind of massive data processing method mapped based on internal memory
CN107609052A (en) A kind of generation method and device of the domain knowledge collection of illustrative plates based on semantic triangle
US10929439B2 (en) Taxonomic tree generation
CN111178079B (en) Triplet extraction method and device
CN110232187A (en) Enterprise name similarity recognition method, device, computer equipment and storage medium
CN108470040B (en) Method and device for warehousing unstructured data
CN107608948A (en) A kind of construction method and device of Text Information Extraction model
CN102810114A (en) Personal computer resource management system based on body
CN107291858A (en) Data indexing method based on character string suffix
KR100835290B1 (en) System and method for classifying document
CN106933823A (en) Method of data synchronization and device
CN106250552A (en) Search engine results page is assembled WEB page
CN106649557A (en) Semantic association mining method for defect report and mail list
CN108304377A (en) A kind of extracting method and relevant apparatus of long-tail word
CN104750673B (en) Text matches filter method and device
CN102799632B (en) Method for acquiring and describing text information based on visual basic application (VBA) and tetrahedron data model
CN110781297B (en) Classification method of multi-label scientific research papers based on hierarchical discriminant trees
CN106096014A (en) The Text Clustering Method of mixing length text set based on DMR
CN110245209A (en) A method of extracting milestone event from mass text
US8032521B2 (en) Managing structured content stored as a binary large object (BLOB)
CN107133200A (en) A kind of android system text string extracting and merging method
CN104376000A (en) Webpage attribute determination method and webpage attribute determination device
CN104572730A (en) Method and device for importing and exporting digital resources
JP2013206280A (en) Deletion file detection program, deletion file detection method and deletion file detection device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant