CN111597351A

CN111597351A - Visual document map construction method

Info

Publication number: CN111597351A
Application number: CN202010404729.0A
Authority: CN
Inventors: 许青青; 谢赟; 吴新野; 韩欣
Original assignee: Shanghai Datatom Information Technology Co ltd
Current assignee: Shanghai Datatom Information Technology Co ltd
Priority date: 2020-05-14
Filing date: 2020-05-14
Publication date: 2020-08-28

Abstract

The invention discloses a visual document map construction method comprising the following steps. S1: extracting keywords from the input text, splitting the input text into single sentences, segmenting each single sentence into words, and sequentially performing part-of-speech tagging, named entity recognition and dependency syntactic analysis on the resulting words. S2: formulating relation extraction rules and, based on those rules, extracting triple data from each single sentence obtained in S1, each triple consisting of two entity words and one relation word. S3: importing the triple data obtained in S2 into a graph database to form a document map. S4: performing graph mining operations on the data in the graph database and visualizing the document map based on those operations. The method extracts the key information of a text, maps the document into a visual graph based on semantic associations, and helps the user grasp the semantic information of the article efficiently.

Description

Visual document map construction method

Technical Field

The invention belongs to the technical field of knowledge graphs, and particularly relates to a visual document map construction method.

Background

In the era of the Internet information explosion, it is difficult for people to quickly and accurately grasp the subject matter of a document from unstructured text, and information overload is especially severe for text documents such as reports and treatises. It is therefore necessary to apply a form of "dimension reduction" to such texts. One common approach is to produce a text abstract, that is, to summarize the main content of the original article in concise wording; however, because the result is still natural language, readers cannot obtain the information they need intuitively and clearly. In 2012, Google proposed the concept of the knowledge graph. A knowledge graph is essentially a graph-based data structure consisting of nodes and edges: each node represents an entity, and each edge represents a relationship between entities. How to design a document map construction method, based on the knowledge graph concept, that converts text information into a graphical form and simplifies the displayed content so as to help users perceive and analyze the semantic information of an article is therefore a direction that those skilled in the art need to study.

Disclosure of Invention

The invention aims to provide a visual document map construction method, which can extract key information of a text, map the document into a visual map based on semantic association, simplify map complexity and help a user to efficiently master article information.

The technical scheme is as follows:

A visual document map construction method comprises the following steps. S1: extracting keywords from the input text, splitting the input text into single sentences, segmenting each single sentence into words, and sequentially performing part-of-speech tagging, named entity recognition and dependency syntactic analysis on the resulting words. S2: formulating relation extraction rules and, based on those rules, extracting triple data from each single sentence obtained in S1, each triple consisting of two entity words and one relation word. S3: importing the triple data obtained in S2 into a graph database to form a document map. S4: performing graph mining operations on the data in the graph database and visualizing the document map based on those operations.

By adopting this technical scheme, key information is first extracted from the document; the semantic associations among the entity words in the input text are then mined on the basis of that key information and converted into a node-edge-node graph form; finally, visualization of the map is implemented and optimized through graph mining calculations. Mapping abstract data to graphic elements helps the user perceive and analyze the semantic information of the article effectively. In this process, the text is split into single sentences at commas, periods, exclamation marks and question marks, and named entity recognition extracts terms with specific meaning in each sentence, such as persons, places and organizations.

Preferably, in the visual document map construction method: in step S1, keywords are extracted from the input text based on the TextRank algorithm.

By adopting this technical scheme, only the nouns in the text are retained for the TextRank weight calculation, and words whose weight does not exceed a set threshold are filtered out.

More preferably, in the visual document map construction method: introducing an external dictionary when carrying out named entity recognition in step S1; the external dictionary includes a person name, a place name, and an organization name.

By adopting this technical scheme, introducing external dictionaries of person names, place names, organization names and the like effectively improves the accuracy of named entity recognition.

Further preferably, in the visual document map construction method: the relation extraction rules in step S2 are constructed by combining dependency syntactic analysis with Chinese grammar rules, and include isA rules and non-isA rules.

By adopting this technical scheme: in the dependency syntax structure, the main labeled relations include the subject-verb relation, the verb-object relation, the fronted object, the verb-complement structure, the adverbial-head structure, the preposition-object relation, and so on. Combinations of several of these structures are mapped into relation syntax rules, which are then applied to entity relation extraction. The isA rules cover sentences connected by verbs of the "是" ("to be") family and comprise two types: rules in which the entities have a dependency relation with the relation word, and rules in which they do not. The non-isA rules cover sentences connected by verbs other than the "是" family and mainly comprise four types of syntactic structure: subject-verb-object, subject-verb-preposition-object, subject-verb-complement-object, and fronted preposition-object. The non-isA rules also include a rule for sentences without a subject: when a sentence has no subject but contains an entity that can establish a verb-object relation with a verb directly, or a relation such as preposition-object plus verb-object indirectly, the preceding sentence is traced back according to a Chinese heuristic rule, the subject of the core verb of the preceding sentence is taken as the subject of the current sentence, and the corresponding triple is extracted.

Further preferably, in the visual document map construction method, step S3 includes the following steps: S31: establishing semantic associations among the entity words; S32: establishing an association between the input text and each extracted entity word, storing the associations in a graph database, and giving each node a unique name; S33: assigning the document ID to each association obtained in S31 and S32.

By adopting this technical scheme: establishing semantic associations means that the entity words in the triples correspond to nodes in the graph database and the relation words correspond to edges, so that nodes connected by edges form a relation network. Entity words are marked with different Labels according to their types. By establishing the associations between entities and documents in S32, the sub-graph of each document can be obtained from the global graph. To avoid redundancy, node names stored in the graph database are unique; giving each association a document ID (that is, adding a fileID attribute) distinguishes the document to which each relationship between nodes belongs. The document ID is either a unique identifier generated automatically by the system or a pre-stored, manually set unique identifier.

Further preferably, in the visual document map construction method, step S4 includes the following steps: S41: calculating the number of associations between each node and the other nodes in the document map; S42: calculating the weight of each node based on a node weight formula; S43: merging the relations of nodes that share the same relation; S44: importing the data obtained in step S43 into a front end for visual display.

By adopting this technical scheme: the number of associations between each node and the other nodes in the document map, calculated in step S41, is the degree centrality of the node, which is the most direct measure of node centrality in network analysis. The higher the degree centrality of a node, the more cohesive the node is in the network. The map is displayed expanded around the top N nodes with the highest degree centrality. Meanwhile, the weight of each node is calculated in step S42 so that nodes can be displayed at different sizes according to their weights, highlighting the emphasis of the text content. Steps S41 and S42 reduce the complexity of the final graph network and make the map easier for the user to understand.

Still more preferably, in the visual document map construction method: the node weight formula in step S42 is

where W_ij is the final weight of the j-th node of the i-th document map; D_ij is the number of associations between the j-th node of the i-th document map and the other nodes in that map; T is the number of documents stored in the graph database; and N_j is the number of document maps containing node j.

By adopting the technical scheme: the weight of each node is calculated by the weight formula, so that the importance of different nodes in the same document can be highlighted, and the difference between documents containing certain same nodes can be highlighted.

Compared with the prior art, the method extracts the semantic relations among the entities in a document using relation extraction rules based on dependency syntax and forms a visual document map from them. The map reveals the subject matter of the article clearly and intuitively, highlights the key points of each document map and the differences between document maps, and reduces the complexity of the graph network, so that the user can grasp the information in the map efficiently.

Drawings

The invention will be described in further detail with reference to the following detailed description and accompanying drawings:

FIG. 1 is a schematic view of the workflow of Embodiment 1;

FIG. 2 is a schematic structural view of Embodiment 1;

FIG. 3 is a flowchart of entity relation extraction in Embodiment 1;

FIG. 4 is the visual document map formed after steps S1-S3;

FIG. 5 is the visual document map finally formed after step S4.

Detailed Description

In order to more clearly illustrate the technical solution of the present invention, the following will be further described with reference to various embodiments.

Embodiment 1 is shown in FIGS. 1-3:

Step S1, preprocessing the input text:

S11: extracting keywords from the input text based on the TextRank algorithm: only the nouns in the text are retained for the TextRank weight calculation, after which a threshold is set and some of the words are filtered out.

The specific process is as follows: a threshold T is set; if the TextRank weight of a word is greater than T, the word is kept as a keyword; otherwise it is filtered out.
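As an illustrative, non-limiting sketch of S11, the following Python code ranks nouns with a simplified TextRank and keeps those whose weight exceeds the threshold. The function name `textrank_keywords`, the co-occurrence window, the damping factor, the iteration count and the convention that noun tags start with "n" are all assumptions made for the example; the text above specifies only that nouns are ranked by TextRank and filtered by a threshold T.

```python
from collections import defaultdict


def textrank_keywords(tagged_words, threshold, window=3, damping=0.85, iterations=50):
    """Return {word: score} for nouns whose TextRank score exceeds `threshold`.

    tagged_words: list of (word, pos) pairs from an upstream segmenter/tagger.
    Only words whose tag starts with "n" (assumed noun convention) are ranked.
    """
    nouns = [word for word, pos in tagged_words if pos.startswith("n")]

    # Undirected co-occurrence graph: nouns within the same sliding window are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(nouns):
        for other in nouns[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return {}

    # Synchronous power iteration of the TextRank update rule.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            rank = sum(scores[v] / len(neighbors[v]) for v in neighbors[word])
            updated[word] = (1 - damping) + damping * rank
        scores = updated

    # Keep a word only if its weight is strictly greater than the threshold.
    return {word: score for word, score in scores.items() if score > threshold}
```

With this unnormalized update the scores average 1.0, so a threshold near 1.0 keeps the better-connected nouns; a practical threshold would be tuned on real documents.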

S12: splitting the input text into sentences: the purpose of sentence splitting is to allow the subsequent dependency syntactic analysis to operate on single sentences; the text is split at punctuation marks such as commas, periods, exclamation marks and question marks.
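A minimal sketch of the splitting in S12 is given below, assuming that both full-width (Chinese) and ASCII forms of the four punctuation marks act as delimiters. The name `split_single_sentences` is hypothetical, and a real implementation would need extra care for cases such as decimal numbers, which this sketch would split at the period.

```python
import re

# Full-width and ASCII commas, periods, exclamation marks and question marks.
_DELIMITERS = re.compile(r"[，。！？,.!?]+")


def split_single_sentences(text):
    """Split text into single sentences at the delimiters, dropping empty pieces."""
    return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]
```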

S13: performing word segmentation, part-of-speech tagging, named entity recognition and dependency syntactic analysis on each single sentence: named entity recognition extracts terms with specific meaning in the sentence, such as persons, places and organizations, and external dictionaries of person names, place names and organization names are introduced to improve its accuracy.

S21: establishing relation extraction rules: a number of relation extraction rules are established by analyzing the dependency relations and part-of-speech tags of the words in a single sentence.

S22: matching the relation rules: if a sentence contains an entity pair made up of the keywords or the persons, places and organizations obtained in step S1, the pair is checked against the relation extraction rules; if it satisfies one of them, the corresponding <entity 1, relation, entity 2> triple is output. Once relation extraction has been performed on all sentences of the text, the triple set of the text is obtained.
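The matching in S22 can be sketched for one rule, the subject-verb-object pattern, as follows. The `Token` structure, the dependency labels "subj" and "obj" and the part-of-speech tag "v" are placeholders assumed for the example; an actual parser would supply its own labels, and the full rule set of the invention is not reproduced here.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str   # part-of-speech tag; "v" marks a verb in this sketch
    head: int  # index of the governing token, -1 for the root
    rel: str   # dependency label relative to the head (assumed names)


def extract_svo(tokens, entities):
    """Return <entity 1, relation, entity 2> triples matching subject-verb-object."""
    triples = []
    for verb_index, verb in enumerate(tokens):
        if verb.pos != "v":
            continue
        subjects = [t.word for t in tokens if t.head == verb_index and t.rel == "subj"]
        objects = [t.word for t in tokens if t.head == verb_index and t.rel == "obj"]
        for subject in subjects:
            for obj in objects:
                # Both ends must be entities found in step S1.
                if subject in entities and obj in entities:
                    triples.append((subject, verb.word, obj))
    return triples
```

The verb that governs both entities becomes the relation word, which mirrors the observation that the verb connecting two entities generally expresses their semantic relation.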

The relation syntax rules are constructed mainly from dependency syntactic analysis combined with heuristic rules of Chinese grammar. Analysis of a large number of entity relation examples shows that entity relation triples tend to appear in certain fixed syntactic structures in which a verb plays the connecting role, so the verb connecting two entities can generally represent the semantic relation between them. In the dependency syntax structure, the main labeled relations are shown in Table 1 below; the structures related to verbs are the subject-verb relation, the verb-object relation, the fronted object, the verb-complement structure, the adverbial-head structure, the preposition-object relation, and so on. Combinations of several of these structures are mapped into relation syntax rules, which can be applied to entity relation extraction. The invention divides the relation rules into two categories: isA rules and non-isA rules.

TABLE 1

The isA rules cover sentences connected by verbs of the "是" ("to be") family and comprise two types: rules in which the entities have a dependency relation with the relation word, and rules in which they do not, as shown in Table 2 below:

TABLE 2

The non-isA rules cover sentences connected by verbs other than the "是" family and mainly include the rules shown in Table 3 below:

TABLE 3

Further, in addition to the rules shown in Table 3, the non-isA rules include a rule for sentences without a subject. When a sentence has no subject but contains an entity that can establish a verb-object relation with a verb directly, or a relation such as preposition-object plus verb-object indirectly, the preceding sentence is traced back according to a Chinese heuristic rule, the subject of the core verb of the preceding sentence is taken as the subject of the current sentence, and the corresponding triple is extracted.
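The subject-less rule can be illustrated with the sketch below, which assumes each single sentence has already been reduced to the subject, core verb and object found by the parser. The dictionary layout and the name `backfill_subjects` are hypothetical. Carrying an inherited subject forward across several consecutive subject-less sentences is a design choice of this sketch; the text above only describes looking at the immediately preceding sentence.

```python
def backfill_subjects(clauses):
    """Extract triples, borrowing a missing subject from the preceding sentence.

    clauses: list of dicts with keys "subject" (str or None), "verb" and
    "object", one per single sentence, in text order.
    """
    triples = []
    previous_subject = None
    for clause in clauses:
        # Use the sentence's own subject, else the one from the preceding sentence.
        subject = clause["subject"] or previous_subject
        if subject is not None and clause["object"] is not None:
            triples.append((subject, clause["verb"], clause["object"]))
        if subject is not None:
            previous_subject = subject
    return triples
```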

Step S3, importing the triple data extracted from the text into the graph database. For the data stored in the graph database, semantic associations between entities are established first, and associations between documents and the extracted entity words are established second. Node names stored in the graph database are unique, so no two nodes share the same name. The entity types and relationship types stored in the graph database are shown in Table 4 below:

TABLE 4

Further, because nodes are unique and may be shared by several documents, a fileID (document ID) attribute is added to each relationship to distinguish the document to which the relationship between nodes belongs.

The map obtained at this point is shown in FIG. 4.
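The storage scheme of step S3 can be sketched with an in-memory stand-in for the graph database. The class `DocumentGraph`, the relation name "contains" for document-to-entity links and the default label "Entity" are assumptions made for the example; only the "File" label, the unique node names and the fileID attribute on relationships come from the description.

```python
class DocumentGraph:
    """In-memory stand-in for a graph database with unique node names."""

    def __init__(self):
        self.nodes = {}  # node name -> label; dict keys guarantee unique names
        self.edges = []  # (source, relation, target, file_id)

    def add_node(self, name, label):
        # setdefault keeps the first label, so a repeated name adds no duplicate node.
        self.nodes.setdefault(name, label)

    def add_document(self, file_id, triples, labels):
        """Store one document's triples and link the document to its entities."""
        self.add_node(file_id, "File")
        for head, relation, tail in triples:
            for entity in (head, tail):
                self.add_node(entity, labels.get(entity, "Entity"))
                link = (file_id, "contains", entity, file_id)
                if link not in self.edges:
                    self.edges.append(link)
            # Every relationship carries the fileID of its source document.
            self.edges.append((head, relation, tail, file_id))

    def subgraph(self, file_id):
        """Return the edges belonging to one document."""
        return [edge for edge in self.edges if edge[3] == file_id]
```

Because the node "Zhang" would be stored once even if it appears in two documents, the fileID on each edge is what allows the sub-graph of a single document to be recovered from the global graph.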

Step S4, performing graph mining calculations on the data in the graph database and, on that basis, visualizing the document map. The visualization is optimized through a series of graph computations: calculating the degree centrality and the importance of the nodes in the document map, and merging identical relations. Specifically:

The degree centrality of the nodes in the graph database is calculated, the top N nodes with the highest degree centrality in each document map are selected, and only the association networks of these N nodes are displayed. Degree centrality counts the number of associations between a node and the other nodes and thereby measures the cohesion of the node; nodes with higher degree centrality are regarded as key nodes in the network.
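The degree centrality selection can be sketched as follows for the triples of one document map. Counting distinct neighbours (rather than parallel edges) and breaking ties alphabetically are assumptions of this sketch, as are the function names.

```python
from collections import defaultdict


def top_n_by_degree(triples, n):
    """Return (top-n node names, {node: degree}) for one document map."""
    neighbors = defaultdict(set)
    for head, _relation, tail in triples:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
    degree = {node: len(adjacent) for node, adjacent in neighbors.items()}
    # Highest degree first; ties are broken alphabetically for determinism.
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    return ranked[:n], degree


def visible_triples(triples, n):
    """Keep only the triples that touch one of the top-n nodes."""
    top, _ = top_n_by_degree(triples, n)
    keep = set(top)
    return [t for t in triples if t[0] in keep or t[2] in keep]
```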

The weight of each node is calculated to distinguish the importance of the nodes within each document map. In the map visualization, node size is set according to the calculated weight, so that nodes with higher weights are highlighted when the document map is displayed. The node weight calculation borrows the idea of TF-IDF text feature extraction: the more associations a node has with other nodes in a given document map, and the less often it appears in other document maps, the more important the node is, that is, the larger its weight. The node weight calculation formula is as follows:

W_ij represents the final weight of the j-th node of the i-th document map; D_ij represents the number of associations between the j-th node of the i-th document map and the other nodes in that map, that is, the degree centrality of the node; T represents the number of documents stored in the graph database, that is, the number of nodes labeled 'File'; N_j represents the number of document maps containing node j. For a given node j, N_j is the same across document maps while D_ij differs from map to map. This weight calculation therefore highlights both the importance of different nodes within the same document and the differences between documents that share certain nodes.
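The formula itself is not reproduced in this text, so the following sketch is only an assumption about its general shape: by analogy with TF-IDF, it takes D_ij as the "term frequency" part and log(T / N_j) as the "inverse document frequency" part, giving W_ij = D_ij * log(T / N_j). The actual formula of the invention may differ, for example in normalization or smoothing. The function name `node_weights` is hypothetical.

```python
import math


def node_weights(degrees_by_doc):
    """Compute TF-IDF-style node weights (assumed form, see lead-in).

    degrees_by_doc: {doc_id: {node: D_ij}}. Returns {doc_id: {node: W_ij}}.
    """
    total_docs = len(degrees_by_doc)  # T
    docs_containing = {}              # N_j
    for degrees in degrees_by_doc.values():
        for node in degrees:
            docs_containing[node] = docs_containing.get(node, 0) + 1
    return {
        doc_id: {
            node: d_ij * math.log(total_docs / docs_containing[node])
            for node, d_ij in degrees.items()
        }
        for doc_id, degrees in degrees_by_doc.items()
    }
```

Under this assumed form, a node that appears in every document map receives weight 0 regardless of its degree, which reflects the stated idea that nodes shared by many documents are less distinctive; a smoothed logarithm would avoid the hard zero.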

Relation merging. If a node has the same relation with several other nodes, those relations are merged. For example, the triples <Zhang, work, work 1> and <Zhang, work, work 2> are merged into <Zhang, work, [work 1, work 2]>. Merging relations reduces the number of edges in a complex relation network and thus simplifies the visual display of the map. The resulting map is shown in FIG. 5.
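Relation merging can be sketched as a grouping of triples by head node and relation word. The function name `merge_relations` and the choice to drop duplicate tail entities are assumptions of the example.

```python
from collections import defaultdict


def merge_relations(triples):
    """Merge triples sharing head and relation into (head, relation, [tails])."""
    grouped = defaultdict(list)
    for head, relation, tail in triples:
        if tail not in grouped[(head, relation)]:
            grouped[(head, relation)].append(tail)
    # Dict insertion order keeps the groups in order of first appearance.
    return [(head, relation, tails) for (head, relation), tails in grouped.items()]
```

Applied to the example in the text, two "work" triples for the same person collapse into a single entry whose tail is a list, so the front end can draw one edge instead of two.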

The visual document map formed by this scheme highlights the emphasis of each document, shows the differences among documents, and reduces network complexity, making the map easier for users to understand.

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. The protection scope of the present invention is subject to the protection scope of the claims.

Claims

1. A visual document map construction method is characterized by comprising the following steps:

S1: extracting keywords from the input text, splitting the input text into single sentences, segmenting each single sentence into words, and sequentially performing part-of-speech tagging, named entity recognition and dependency syntactic analysis on the resulting words;

S2: formulating relation extraction rules, and extracting triple data from each single sentence obtained in S1 based on the relation extraction rules; the triple data consists of two entity words and a relation word;

S3: importing the triple data obtained in step S2 into a graph database to form a document map;

S4: performing graph mining operations on the data in the graph database, and visualizing the document map based on the mining operations.

2. The visual document map construction method of claim 1, wherein: in step S1, keywords are extracted from the input text based on the TextRank algorithm.

3. The visual document map construction method of claim 2, wherein: introducing an external dictionary when carrying out named entity recognition in step S1; the external dictionary includes a person name, a place name, and an organization name.

4. The visual document map construction method of claim 3, wherein: the relation extraction rules in step S2 are constructed by combining dependency syntactic analysis with Chinese grammar rules, and include isA rules and non-isA rules.

5. The visual document map construction method according to claim 4, wherein the step S3 includes the steps of:

S31: establishing semantic associations among the entity words;

S32: establishing an association between the input text and each extracted entity word, storing the associations in a graph database, and giving each node a unique name;

S33: assigning the document ID to each association obtained in S31 and S32.

6. The visual document map construction method according to claim 5, wherein the step S4 includes the steps of:

S41: calculating the number of associations between each node and the other nodes in the document map;

S42: calculating the weight of each node based on a node weight formula;

S43: merging the relations of nodes that share the same relation;

S44: importing the data obtained in step S43 into a front end for visual display.

7. The visual document map construction method of claim 6, wherein: the node weight formula in step S42 is: