CN105573976A - Rich ontology based multi-document mining disaster management method - Google Patents


Info

Publication number
CN105573976A
Authority
CN
China
Prior art keywords
statement
management method
concept
document
disaster management
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410521099.XA
Other languages
Chinese (zh)
Inventor
李千目
李涛
刘浩
徐建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Original Assignee
Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority to CN201410521099.XA
Publication of CN105573976A


Abstract

The present invention relates to a rich ontology based disaster management method. The management method is based on a rich ontology and comprises three steps: sentence mapping, submodular modeling and sentence screening. 1) Sentence mapping: a document of a given field is divided into a plurality of sentences, and the sentences are mapped onto the hierarchical structure of the corresponding ontology, wherein keywords specified for the hierarchical structure by an expert are used for the mapping. 2) Submodular modeling: a submodular function is applied in a greedy algorithm, and sentences are selected in sequence from a given sentence set. 3) Sentence screening: long sentences are extracted from the original document by using the greedy algorithm. Conventional methods based on single terms have relatively low mining efficiency; the rich ontology based multi-document mining disaster management method provided by the present invention is more competitive.

Description

A multi-document mining disaster management method based on a rich ontology
Technical field
The invention belongs to the technical field of data mining, and specifically relates to a disaster management method based on a rich ontology.
Background art
Disaster management is an emerging technical field in which strategic management processes are applied to protect critical human resources from the threat of disasters. In practice, large numbers of reports and other information about disasters are recorded as documents, and experts want to summarize from this information the development trend of a disaster, the operating status of public infrastructure, or the progress of rebuilding homes. The information that experts provide in this way can be of great help the next time people face a disaster.
However, the sheer volume of information makes manual processing infeasible. The growing maturity of data mining technology and its increasingly wide application in disaster management are gradually improving this situation. Nevertheless, conventional mining based on single terms is relatively inefficient, so the present invention proposes a more competitive multi-document mining disaster management method based on a rich ontology.
Summary of the invention
To address the low mining efficiency of methods based on single terms, the present invention provides a multi-document mining disaster management method based on a rich ontology. The method uses an ontological representation to mine the associations between sentences within disaster-related documents.
The technical solution that achieves the object of the invention is a multi-document mining disaster management method based on a rich ontology. The method is based on a rich ontology and comprises three steps: sentence mapping, submodular modeling and sentence screening.
1) Sentence mapping: a document of a given field is divided into a number of sentences, and the sentences are mapped onto the hierarchical structure of the corresponding ontology; keywords specified for this hierarchical structure by experts are used for the mapping.
2) Submodular modeling: a submodular function is applied in a greedy algorithm, and sentences are selected in sequence from a given sentence set.
3) Sentence screening: a greedy algorithm is used to extract long sentences from the original documents.
In the present invention, when a sentence is associated with only one concept, the sentence is mapped to that concept; when a sentence is associated with multiple concepts, the sentence is mapped to the least common ancestor (LCA) of those concepts.
In the present invention, the keyword overlap between a sentence and the keywords assigned to each concept is computed and used as the score that measures their degree of association; the concept with the highest score is then selected.
The submodular function of the present invention is defined as follows: let f be a non-decreasing function that satisfies $f(T \cup \{e\}) - f(T) \le f(S \cup \{e\}) - f(S)$, where $e \in E \setminus T$, S and T are subsets of E, and $S \subseteq T$. Given a document set D and a budget B, the submodular function is used to generate from D a summary that satisfies the budget B. Let the budget B be the total number of words; the quality of the summary generated so far is defined as:
Here $e_1$ and $e_2$ denote two sentences, $c_1$ and $c_2$ are the two concepts corresponding to $e_1$ and $e_2$ respectively, and $c_1 \to e_1$ means that sentence $e_1$ is associated with concept $c_1$.
A query q is mapped to a concept in the ontology hierarchy, and the quality function is then defined as:
The correlation between two concepts in the present invention is computed by the following formula:
$\mathrm{sim}(C_1, C_2) = \frac{2 \log P(C_0)}{\log P(C_1) + \log P(C_2)}$, where $C_1$ and $C_2$ are the concepts to be compared, $C_0$ is the lowest common parent node of $C_1$ and $C_2$ in the concept hierarchy, and $P(\cdot)$ denotes the probability that a randomly selected object belongs to the given concept.
The advantages of the present invention are: 1. Sentences are represented using the semantic concepts of a disaster ontology rather than a large vocabulary of terms, which helps users better decide whether a sentence should be included in the summary of that disaster. 2. A general framework is provided. The framework is based on the submodularity hidden in the sentence set of disaster-related documents and uses it to address different summarization problems; through the submodular property, the framework can handle multiple disaster-related measures.
Brief description of the drawings
Fig. 1 shows the framework of the method of the invention.
Detailed description of the embodiments
The method is described further below with reference to the accompanying drawing.
Fig. 1 shows the framework of the method, which consists of three parts: sentence mapping, submodular modeling and sentence screening. Sentence mapping means that, given an ontology, a mapping is established from sentences to the corresponding concepts of that ontology. Submodular modeling assigns each sentence a score that reflects its contribution to the resulting summary; the invention models multi-document summarization as a budgeted maximum coverage problem. Sentence screening selects the sentences with the highest contribution by means of a greedy algorithm. Each part is implemented as follows:
1) Sentence mapping
In the field of disaster management, the ontology provides rich conceptual and semantic information that helps with multi-document summarization. Sentence mapping first divides a document of a given field into a number of sentences and maps them onto the hierarchical structure of the corresponding ontology. Keywords specified for this hierarchical structure by experts are used for the mapping, and the whole mapping process follows two rules:
One, if a sentence is associated with only one concept, the sentence is mapped to that concept;
Two, if a sentence is associated with multiple concepts, the sentence is mapped to the least common ancestor (LCA, Least Common Ancestor) of those concepts. If this LCA is the most general concept of the ontology, the sentence is mapped to the original specific concept instead.
During this process, the keyword overlap between the sentence and the keywords assigned to each concept is computed and used as the score that measures their degree of association, and the concept with the highest score is then selected. Because the concepts of the ontology have been carefully selected and are meaningful, every sentence can be mapped, which yields an ontology hierarchy populated with rich instances.
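The two mapping rules can be illustrated with a minimal sketch. The toy ontology `PARENT`, the keyword table `KEYWORDS` and every function name below are hypothetical and do not come from the patent; the fallback used when the LCA is the root (choose the related concept with the highest keyword overlap) is one possible reading of the rule above, not a confirmed detail.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy ontology: child -> parent (None marks the root concept).
PARENT: Dict[str, Optional[str]] = {
    "Disaster": None,
    "Infrastructure": "Disaster",
    "Power": "Infrastructure",
    "Water": "Infrastructure",
    "Shelter": "Disaster",
}

# Hypothetical expert-assigned keywords for each specific concept.
KEYWORDS: Dict[str, Set[str]] = {
    "Power": {"power", "outage", "electricity"},
    "Water": {"water", "pipe", "supply"},
    "Shelter": {"shelter", "evacuation"},
}


def ancestors(concept: str) -> List[str]:
    """Path from a concept up to the root, starting with the concept itself."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(concepts: List[str]) -> str:
    """Deepest concept that is an ancestor of every concept in the list."""
    common = set(ancestors(concepts[0]))
    for c in concepts[1:]:
        common &= set(ancestors(c))
    # Walking upward from the first concept, the first shared node is the deepest.
    return next(a for a in ancestors(concepts[0]) if a in common)


def overlap(sentence: str, concept: str) -> int:
    """Keyword overlap score: number of concept keywords found in the sentence."""
    return len(set(sentence.lower().split()) & KEYWORDS[concept])


def map_sentence(sentence: str) -> Optional[str]:
    """Map a sentence to one ontology concept, or None if no keyword matches."""
    scores = {c: overlap(sentence, c) for c in KEYWORDS}
    related = [c for c, s in scores.items() if s > 0]
    if not related:
        return None
    if len(related) == 1:          # rule one: a single associated concept
        return related[0]
    target = lca(related)          # rule two: several associated concepts
    if PARENT[target] is None:     # LCA is the root: keep a specific concept
        return max(related, key=lambda c: scores[c])
    return target
```

For example, `map_sentence("power and water supply restored")` touches both `Power` and `Water`, so it is mapped to their shared parent `Infrastructure` in this toy hierarchy.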
2) Submodular modeling
1.1. Submodular function
When summarizing multiple documents, we apply a submodular function in a greedy algorithm and select sentences in sequence from a given sentence set.
Definition 1. Let f be a non-decreasing function that satisfies:
$f(T \cup \{e\}) - f(T) \le f(S \cup \{e\}) - f(S)$ (1)
where $e \in E \setminus T$, S and T are subsets of E, and $S \subseteq T$; then f is called a submodular function.
According to this definition, the increment in f obtained by adding an element to a larger set T is less than or equal to the increment obtained by adding the same element to a smaller set S.
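The diminishing-returns property can be checked on a small example. The sketch below uses a set-coverage function, a standard example of a non-decreasing submodular function; it is a stand-in chosen for illustration and is not the quality function of the invention. The table `COVERS` and the sentence ids are hypothetical.

```python
from typing import Dict, Set

# Hypothetical data: each sentence id covers a set of ontology concepts.
COVERS: Dict[str, Set[str]] = {
    "s1": {"Power", "Water"},
    "s2": {"Water", "Shelter"},
    "s3": {"Shelter"},
}


def f(selected: Set[str]) -> int:
    """Coverage function: number of distinct concepts covered by the selection."""
    covered: Set[str] = set()
    for sid in selected:
        covered |= COVERS[sid]
    return len(covered)


def gain(selected: Set[str], e: str) -> int:
    """Marginal gain f(selected + e) - f(selected)."""
    return f(selected | {e}) - f(selected)


S = {"s1"}            # smaller set
T = {"s1", "s3"}      # larger set, S is a subset of T
gain_small = gain(S, "s2")   # s2 contributes the new concept "Shelter"
gain_large = gain(T, "s2")   # "Shelter" is already covered by s3
```

Adding `s2` to the smaller set S yields a gain of 1, while adding it to the larger set T yields 0, which agrees with the inequality in Definition 1.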
The budgeted maximum coverage problem in the present invention is described as follows: given a set E, each element of E is assigned an influence factor and a cost factor, both of which are defined by the field to which the element belongs, together with a budget B. The goal is to find a subset of E that has the maximum influence without exceeding the budget B.
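Stated compactly, and assuming that the total influence of a subset is measured by a set function $f$ and that each element $e$ carries a cost $c(e)$ (both symbols are introduced here for illustration only), the problem can be written as:

```latex
\max_{S \subseteq E} \; f(S)
\quad \text{subject to} \quad
\sum_{e \in S} c(e) \le B
```

When the budget is a total word count, $c(e)$ would be the number of words in sentence $e$.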
1.2. Generic summarization
Let the budget B be the total number of words. Adding a candidate sentence improves the quality of the summary but also increases the cost. The quality of the summary generated so far is defined as:
(2)
In this function, $e_1$ and $e_2$ denote two sentences, and $c_1$ and $c_2$ are the two concepts corresponding to $e_1$ and $e_2$ respectively. $c_1 \to e_1$ means that sentence $e_1$ is associated with concept $c_1$. The correlation between two concepts is computed by the following formula:
$\mathrm{sim}(c_1, c_2) = \frac{2 \log P(c_0)}{\log P(c_1) + \log P(c_2)}$ (3)
where $c_1$ and $c_2$ are the concepts to be compared and $c_0$ is the lowest common parent node of $c_1$ and $c_2$ in the concept hierarchy. $P(\cdot)$ denotes the probability that a randomly selected object belongs to the given concept.
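The description of the lowest common parent node and of P() matches Lin's information-theoretic similarity, sim(c1, c2) = 2 log P(c0) / (log P(c1) + log P(c2)). The sketch below assumes that form. The hierarchy `PARENT`, the probabilities `P` and the function names are hypothetical illustration values, not data from the patent.

```python
import math
from typing import Dict, List, Optional

# Hypothetical concept hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "Disaster": None,
    "Infrastructure": "Disaster",
    "Power": "Infrastructure",
    "Water": "Infrastructure",
    "Shelter": "Disaster",
}

# Hypothetical P(c): probability that a random sentence belongs to concept c
# (a parent's probability includes that of its descendants).
P: Dict[str, float] = {
    "Disaster": 1.0,
    "Infrastructure": 0.5,
    "Power": 0.25,
    "Water": 0.25,
    "Shelter": 0.5,
}


def path_to_root(concept: str) -> List[str]:
    """Concepts on the way from `concept` up to the root, inclusive."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_parent(c1: str, c2: str) -> str:
    """Deepest node shared by the root paths of c1 and c2."""
    on_path_2 = set(path_to_root(c2))
    return next(c for c in path_to_root(c1) if c in on_path_2)


def similarity(c1: str, c2: str) -> float:
    """Assumed Lin-style similarity: 2 log P(c0) / (log P(c1) + log P(c2))."""
    c0 = lowest_common_parent(c1, c2)
    denominator = math.log(P[c1]) + math.log(P[c2])
    if denominator == 0.0:   # both concepts have probability 1 (the root)
        return 1.0
    return 2.0 * math.log(P[c0]) / denominator
```

With these values, two siblings under `Infrastructure` score 0.5, a concept compared with itself scores 1, and concepts whose only shared ancestor is the root score 0.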
Correspondingly, the quality improvement brought by adding a candidate sentence is defined as:
(4)
1.3. Query-based summarization
A query q is mapped to a concept in the ontology hierarchy, and the quality function is then defined as:
(5)
3) Sentence screening
We use a greedy algorithm to extract important sentences from the original documents. Given a document set D and a budget B, the algorithm uses the submodular function to generate from D a summary that satisfies B. At each step the algorithm selects a longer sentence for the resulting summary, because a long sentence is more likely to contain important information and to bring a larger quality improvement.
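A minimal sketch of greedy selection under a word budget follows. The quality function is passed in as a parameter, and the example uses a simple concept-coverage function as a stand-in rather than the quality function of the invention. Breaking ties in favour of the longer sentence is an assumption based on the preference for long sentences described above. All names and data are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def greedy_summary(sentences: Dict[str, str],
                   quality: Callable[[Set[str]], float],
                   budget: int) -> List[str]:
    """Pick sentence ids greedily until no affordable sentence adds quality."""
    selected: List[str] = []
    used_words = 0
    remaining = set(sentences)
    while remaining:
        chosen = set(selected)
        base = quality(chosen)
        best: Optional[str] = None
        best_key: Optional[Tuple[float, int]] = None
        for sid in sorted(remaining):            # sorted for determinism
            cost = len(sentences[sid].split())   # cost = word count
            if used_words + cost > budget:
                continue                         # would exceed the budget B
            gain = quality(chosen | {sid}) - base
            key = (gain, cost)                   # larger gain first, then longer
            if gain > 0 and (best_key is None or key > best_key):
                best, best_key = sid, key
        if best is None or best_key is None:
            break
        selected.append(best)
        used_words += best_key[1]
        remaining.remove(best)
    return selected


# Hypothetical example data.
SENTENCES: Dict[str, str] = {
    "s1": "power and water supply failed across the city",
    "s2": "water supply restored",
    "s3": "shelter opened for evacuation of residents near the coast",
}
CONCEPTS: Dict[str, Set[str]] = {
    "s1": {"Power", "Water"},
    "s2": {"Water"},
    "s3": {"Shelter"},
}


def coverage(selected: Set[str]) -> float:
    """Stand-in quality: number of distinct concepts covered."""
    covered: Set[str] = set()
    for sid in selected:
        covered |= CONCEPTS[sid]
    return float(len(covered))
```

With a budget of 20 words, `greedy_summary(SENTENCES, coverage, 20)` first takes `s1` (two new concepts) and then `s3`; `s2` is never added because it contributes no new concept once `s1` is in the summary.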

Claims (5)

1. A multi-document mining disaster management method based on a rich ontology, the management method being based on a rich ontology and comprising three steps, namely sentence mapping, submodular modeling and sentence screening, characterized in that:
1) Sentence mapping: a document of a given field is divided into a number of sentences, and the sentences are mapped onto the hierarchical structure of the corresponding ontology; keywords specified for this hierarchical structure by experts are used for the mapping;
2) Submodular modeling: a submodular function is applied in a greedy algorithm, and sentences are selected in sequence from a given sentence set;
3) Sentence screening: a greedy algorithm is used to extract long sentences from the original documents.
2. The multi-document mining disaster management method based on a rich ontology according to claim 1, characterized in that: when a sentence is associated with only one concept, the sentence is mapped to that concept; when a sentence is associated with multiple concepts, the sentence is mapped to the least common ancestor (LCA) of those concepts.
3. The multi-document mining disaster management method based on a rich ontology according to claim 2, characterized in that: the keyword overlap between the sentence and the keywords assigned to each concept is computed, the result is used as the score that measures their degree of association, and the concept with the highest score is then selected.
4. The multi-document mining disaster management method based on a rich ontology according to claim 1, characterized in that: the submodular function is defined as follows: let f be a non-decreasing function that satisfies $f(T \cup \{e\}) - f(T) \le f(S \cup \{e\}) - f(S)$, where $e \in E \setminus T$, S and T are subsets of E, and $S \subseteq T$; given a document set D and a budget B, the submodular function is used to generate from D a summary that satisfies the budget B; let the budget B be the total number of words; the quality of the summary generated so far is defined as:
where $e_1$ and $e_2$ denote two sentences, $c_1$ and $c_2$ are the two concepts corresponding to $e_1$ and $e_2$ respectively, and $c_1 \to e_1$ means that sentence $e_1$ is associated with concept $c_1$;
a query q is mapped to a concept in the ontology hierarchy, and the quality function is then defined as:
5. The multi-document mining disaster management method based on a rich ontology according to claim 4, characterized in that: the correlation between the two concepts is computed by the following formula:
$\mathrm{sim}(C_1, C_2) = \frac{2 \log P(C_0)}{\log P(C_1) + \log P(C_2)}$, where $C_1$ and $C_2$ are the concepts to be compared, $C_0$ is the lowest common parent node of $C_1$ and $C_2$ in the concept hierarchy, and $P(\cdot)$ denotes the probability that a randomly selected object belongs to the given concept.
CN201410521099.XA 2014-10-08 2014-10-08 Rich ontology based multi-document mining disaster management method Pending CN105573976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410521099.XA CN105573976A (en) 2014-10-08 2014-10-08 Rich ontology based multi-document mining disaster management method

Publications (1)

Publication Number Publication Date
CN105573976A true CN105573976A (en) 2016-05-11

Family

ID=55884128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410521099.XA Pending CN105573976A (en) 2014-10-08 2014-10-08 Rich ontology based multi-document mining disaster management method

Country Status (1)

Country Link
CN (1) CN105573976A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391460A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of industry security theme multi-document summary automatic generation method and device
CN111222347A (en) * 2020-04-15 2020-06-02 北京金山数字娱乐科技有限公司 Sentence translation model training method and device and sentence translation method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399932A (en) * 2013-08-06 2013-11-20 武汉大学 Situation identification method based on semantic social network entity analysis technique
CN103500208A (en) * 2013-09-30 2014-01-08 中国科学院自动化研究所 Deep layer data processing method and system combined with knowledge base

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KESHOU WU ET AL: "Ontology-enriched multi-document summarization in disaster management using submodular function", 《INFORMATION SCIENCES》 *
LEI LI ET AL: "Ontology-enriched Multi-Document Summarization in Disaster Management", 《PROCEEDINGS OF THE 33RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL》 *
LEI LI,TAO LI: "An Empirical Study of Ontology-Based Multi-Document Summarization in Disaster Management", 《IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391460A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of industry security theme multi-document summary automatic generation method and device
CN111222347A (en) * 2020-04-15 2020-06-02 北京金山数字娱乐科技有限公司 Sentence translation model training method and device and sentence translation method and device
CN111222347B (en) * 2020-04-15 2020-07-28 北京金山数字娱乐科技有限公司 Sentence translation model training method and device and sentence translation method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160511
