CN110245209A - Method for extracting milestone events from mass text

Info

Publication number: CN110245209A
Application number: CN201910539127.3A
Authority: CN
Inventors: 王鹏宇; 吴漾; 罗念华; 孔庆波; 缪新萍; 李文科
Original assignee: Guizhou Power Grid Co Ltd
Current assignee: Guizhou Power Grid Co Ltd
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2019-09-17
Anticipated expiration: 2039-06-20
Also published as: CN110245209B

Abstract

The invention discloses a method for extracting milestone events from mass text. The method comprises the following steps: (1) extracting the file and folder hierarchy information from the mass text and storing it in a tree structure; (2) splicing the file name and the path name of each file to form the text of the current file, computing the tree distance of each file, and clustering with the K-Means clustering algorithm, wherein files with the same hierarchical relationship are grouped together as initial clusters, which also determines the initial cluster size of the K-Means clustering algorithm; (3) extracting milestone events and time nodes within each cluster, and screening the extraction results to form the milestone node list of each event. Because milestone events and event nodes are extracted within each cluster only after clustering, the invention avoids the problem of one event being extracted as several sub-events that cannot be merged, and also improves the accuracy and completeness of extraction.

Description

A method for extracting milestone events from mass text

Technical field

The invention belongs to the technical field of milestone event extraction, and relates to a method for extracting milestone events from mass text.

Background art

Existing information extraction methods can already extract events and times from text. With mass data, however, several documents may describe the same event. If events and times are extracted from each document directly, the milestone node information of a single event may be scattered across several extracted events and cannot be aggregated, so the complete milestone information of the event cannot be obtained.

Summary of the invention

The technical problem to be solved by the invention is to provide a method for extracting milestone events from mass text, so as to solve the problems in the prior art.

The technical solution adopted by the invention is a method for extracting milestone events from mass text, the method comprising the following steps:

(1) extracting the file and folder hierarchy information from the mass text, taking file names and folder names as nodes and hierarchical relationships as edges, and storing the data in a tree structure;

(2) splicing the file name and the path name of each file to form the text of the current file, computing the tree distance of each file, and clustering with the K-Means clustering algorithm, wherein files with the same hierarchical relationship are grouped together as initial clusters, which at the same time determines the initial cluster size of the K-Means clustering algorithm;

(3) for the clustering result obtained in step (2), extracting milestone events and time nodes within each cluster, and screening the extraction results to form the milestone node list of each event.
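Steps (1) and (2) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `build_tree` and `initial_clusters`, the use of relative POSIX-style paths, and the reading of "same hierarchical relationship" as "stored in the same parent folder" are assumptions made only for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_tree(paths):
    """Store the hierarchy as a tree: folder/file names are nodes,
    parent-child relationships are edges (parent path -> child names).
    Hypothetical helper; assumes relative paths such as 'a/b/c.txt'."""
    children = defaultdict(set)
    for p in paths:
        parts = PurePosixPath(p).parts
        for depth in range(1, len(parts)):
            parent = "/".join(parts[:depth])
            children[parent].add(parts[depth])
    return {parent: sorted(kids) for parent, kids in children.items()}


def initial_clusters(paths):
    """Group files that share a parent folder (an assumed reading of
    'same hierarchical relationship') into one initial cluster."""
    groups = defaultdict(list)
    for p in paths:
        groups[str(PurePosixPath(p).parent)].append(p)
    return dict(groups)
```

Under these assumptions, the number of groups returned by `initial_clusters` could serve as the starting cluster count passed to a K-Means implementation.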

Beneficial effects of the invention: compared with the prior art, the invention first clusters the texts so that different texts describing the same event are gathered together. During clustering, the existing filing information of the files, namely the folder hierarchy information, is introduced to improve the clustering result of the documents. Milestone events and event nodes are then extracted within each cluster, which avoids the problem of one event being extracted as several sub-events that cannot be merged, and also improves the accuracy and completeness of extraction.

Brief description of the drawings

Fig. 1 is flow diagram of the invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawing and a specific embodiment.

Embodiment 1: as shown in Fig. 1, a method for extracting milestone events from mass text comprises the following steps:

(2) splicing the file name and the path name of each file to form the text of the current file, computing the tree distance of each file, and clustering with the K-Means clustering algorithm. Each file is a node of the tree and its path is a branch of the tree. For example, in a package of nested multi-level folders containing many files, the files stored directly under the first folder form the first level, and the other folders stored in the first folder can be drilled into further in the same way. Files with the same hierarchical relationship are grouped together as initial clusters, which at the same time determines the initial cluster size of the K-Means clustering algorithm (files at the same level form the same initial cluster);
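The source does not give a formula for the tree distance. One plausible reading, shown here purely as an assumption, is the number of edges on the path between two file nodes through their deepest common folder. The function name `tree_distance` is hypothetical, and both paths are assumed to be relative paths within the same folder tree.

```python
from pathlib import PurePosixPath


def tree_distance(path_a, path_b):
    """Edges between two file nodes via their deepest common ancestor.
    Assumed definition; the patent text does not specify the formula."""
    a = PurePosixPath(path_a).parts
    b = PurePosixPath(path_b).parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Steps up from each file to the common ancestor, added together.
    return (len(a) - common) + (len(b) - common)
```

With this definition, two files in the same folder are at distance 2, and the distance grows as the files sit in more distant branches.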

The invention clusters texts by topic using the file and folder hierarchy that already exists among the texts, and extracts events and event time nodes from the texts of the same cluster. In mass text, some documents have been filed manually, so the files under the same sub-folder are already associated with one another. On the basis of a traditional clustering algorithm, the hierarchy information of files and folders is brought into the text clustering process, so that the clustering result depends not only on the semantic information of the words in the text but also on the file-level association of the text. Events and event time nodes are then extracted within each cluster of documents that received the same clustering result, to obtain the milestone events in the mass text.
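One hypothetical way to make a distance depend on both word-level similarity and folder hierarchy is a weighted blend of the two, sketched below. The weight `alpha`, the token pattern, the cosine measure on the spliced path text, and all function names are assumptions, not details from the source. Note also that standard K-Means works with Euclidean distances to centroids, so a pairwise distance like this would need an adapted variant (for example a medoid-based one).

```python
import math
import re
from collections import Counter
from pathlib import PurePosixPath


def spliced_tokens(path):
    """Bag of words from the spliced text (folder names plus file name)."""
    return Counter(re.findall(r"[a-z0-9]+", path.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity of two token-count vectors."""
    dot = sum(count * b[tok] for tok, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def normalized_tree_distance(path_a, path_b):
    """Tree distance via the deepest common folder, scaled to [0, 1]."""
    a, b = PurePosixPath(path_a).parts, PurePosixPath(path_b).parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    total = len(a) + len(b)
    return (total - 2 * common) / total if total else 0.0


def blended_distance(path_a, path_b, alpha=0.5):
    """alpha weights the word-level part; (1 - alpha) the hierarchy part."""
    semantic = cosine_distance(spliced_tokens(path_a), spliced_tokens(path_b))
    structural = normalized_tree_distance(path_a, path_b)
    return alpha * semantic + (1 - alpha) * structural
```

In this sketch, files filed under the same sub-folder end up closer than files in sibling folders even when their names share no words, which mirrors the association the paragraph describes.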

By merging the existing association information of files into the text clustering algorithm, the invention improves the clustering of mass documents and then extracts milestone events within each cluster of similar documents.
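The source does not say how events and time nodes are extracted or screened within a cluster. The sketch below is therefore only a stand-in: it assumes ISO-style dates (`YYYY-MM-DD`), treats each sentence containing a date as a candidate milestone, and "screens" by dropping exact duplicates before sorting by date. The name `extract_milestones` and the regular expressions are hypothetical.

```python
import re
from datetime import date

# Assumed date format; real archives may use other formats.
DATE_RE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")


def extract_milestones(cluster_texts):
    """Return a date-ordered list of (date, sentence) pairs for one cluster."""
    seen = set()
    milestones = []
    for text in cluster_texts:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            match = DATE_RE.search(sentence)
            if not match:
                continue
            try:
                day = date(*(int(g) for g in match.groups()))
            except ValueError:
                continue  # skip impossible dates such as 2019-13-40
            description = sentence.strip()
            key = (day, description.lower())
            if key not in seen:  # screening: drop repeated descriptions
                seen.add(key)
                milestones.append((day, description))
    return sorted(milestones)
```

Because all documents of a cluster are processed together, a milestone mentioned in several documents appears once in the resulting list instead of being split across separate events.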

The above is only a specific embodiment of the invention, but the protection scope of the invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims

1. A method for extracting milestone events from mass text, characterized in that the method comprises the following steps: