CN113282744B

CN113282744B - Literary work character relation visualization analysis method based on node influence measurement

Info

Publication number: CN113282744B
Application number: CN202110629883.2A
Authority: CN
Inventors: 马鸣声; 霍智勇; 施奎宇; 许波於
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2021-06-07
Filing date: 2021-06-07
Publication date: 2022-11-08
Anticipated expiration: 2041-06-07
Also published as: CN113282744A

Abstract

The invention discloses a method for visually analyzing character relationships in literary works based on node influence measurement. The method belongs to the technical field of information processing and comprises the following steps: preprocessing the text data of a novel; calculating character interactions from character node information and creating an interactive dictionary; constructing a social network graph and a centrality metric table; proposing a node influence evaluation criterion and calculating node influence; and performing literary analysis according to the influence evaluation criterion. Because a computer processes the large volume of text data, considerable human effort is saved. In addition, the mathematical content of quantitative analysis serves as the basis of the research, so the research content is quantified and the accuracy of qualitative judgments is improved. The research process is described in the form of data, so both the process and the results of the analysis can be verified. Consequently, regardless of a researcher's personal knowledge or background, applying the same experimental method to the same data yields the same result.

Description

Literary work character relationship visualization analysis method based on node influence measurement

Technical Field

The invention belongs to the technical field of information processing and relates to literary analysis methods, research on node influence in social networks, and visualization technology; in particular, it relates to a method for visually analyzing character relationships in literary works based on node influence measurement.

Background

Methods for discovering node influence, developed in social network influence analysis, are increasingly applied to real life and to online social platforms, but they are seldom applied to analyzing character relationships in literary works. Traditional literary research has long relied on qualitative methods, analyzing literary works or phenomena through the researcher's experience, insight, subjective judgment, and logical deduction. This approach inevitably has the following problems: (1) researchers depend on manual retrieval and collation of data, with conventional reading as the main way of acquiring it, which consumes considerable human effort; (2) scholars, intentionally or not, project their own aesthetic judgment, learning, personality, cultural background, and values onto the object of study, so conclusions are not sufficiently objective or comprehensive and easily lead to over-interpretation of a work's content and meaning; (3) researchers generally cannot substantiate their own views when explaining changes in literary form and can only make impressionistic judgments based on their limited reading; most of the information they use, such as 'plots' and images, is the kind the human brain recognizes easily, so most research on literary works is 'explanatory', and 'desirability' is ignored.

With the development of computer science, quantitative analysis has gradually been applied to literary research in China. One form of quantitative analysis compares the main characteristics of the research objects numerically to clarify their properties and interrelationships, and describes the research results in the form of data.

Chen et al., in 'Identifying influential nodes in complex networks', proposed a local centrality index that considers both the degree of a node and the degree information of its neighbor nodes; when the propagation rate is near the critical value, eigenvector centrality measures influence more effectively.

Node influence indexes based on global node attributes mainly examine global information about the network in which the nodes are located. These indexes reflect the topological characteristics of nodes well, but their time complexity is high, and most of them are unsuitable for large-scale networks. Betweenness centrality is defined as the number of times the shortest paths between pairs of nodes in the network pass through the current node, and it describes how often information passes through that node as it propagates in the social network. Removing nodes with high betweenness causes network congestion and hinders information transmission.

Eigenvector centrality is an important index of a node's global influence. It considers not only the number of neighbor nodes but also their importance, treating the influence of a single node as a linear combination of the influences of other nodes.

The degree centrality index is simple, intuitive, and easy to calculate, but it reflects only the local characteristics of a node.

The betweenness centrality index can identify nodes with a high information load, but it is unsuitable for large-scale networks. In 'network propagation node influence discovery method based on the betweenness centrality', president et al. found through analysis that the betweenness centrality algorithm measures the criticality and influence of a node in the whole social network relatively accurately, but that its accuracy still has room for improvement and that its relatively high time complexity is one of its shortcomings. Regarding the accuracy of the measurement result, the first shortcoming is that shortest paths of different lengths passing through the same node, and nodes at different positions within a shortest path, are all treated identically; the second is that the algorithm emphasizes the influence of a node in the global network while ignoring its influence in the local network.

The eigenvector centrality index reflects the importance of neighbor nodes, but it is a simple linear superposition that does not take the network structure into account.

Disclosure of Invention

Purpose of the invention: the invention aims to provide a method for visually analyzing character relationships in literary works based on node influence measurement. The invention jointly evaluates measurement indexes based on local node attributes and node influence indexes based on global node attributes, proposes a node influence evaluation criterion, gives a unified definition of influence evaluation, and addresses the problem that influence analysis and measurement models lack a node influence evaluation target.

Technical scheme: the invention relates to a method for visually analyzing character relationships in literary works based on node influence measurement, which comprises the following specific operation steps:

(1) Preprocessing the novel text data;

(2) Calculating the character interaction by utilizing the character node information, and creating an interactive dictionary;

(3) Constructing a social network graph and a centrality metric table by utilizing the interactive dictionary;

(4) Providing a node influence evaluation standard and calculating the node influence;

(5) Performing literary analysis according to the influence evaluation criterion.

Further, in the step (1), the specific operation steps of preprocessing the novel text data are as follows:

(1.1) reading text data and filtering special symbols;

(1.2) realizing 'name merging and symbolization';

(1.3) word segmentation;

(1.4) restoring the personal names.

Further, in step (2), the step of computing the character interaction by using the character node information includes the following specific operation steps:

(2.1) taking characters in the novel as nodes in the influence network, and respectively constructing a double-character marking tuple list and a three-character marking tuple list according to node information;

(2.2) calculating the difference value between the character index and other character indexes, traversing the character list, and searching character names appearing in the mark list;

(2.3) according to the search result, establishing a nested dictionary, wherein the character name is a key of the dictionary, the index array is a value of the dictionary, and the character dictionary is linked;

(2.4) providing a judgment standard for real interaction;

(2.5) selecting a distance threshold and an interaction threshold according to the criterion for judging real character interaction, and calculating the number of interactions;

(2.6) creating an interactive dictionary according to the calculation result; the keys of the interactive dictionary are character names, and the values are the characters who interact with the key character, together with the number of interactions.
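Steps (2.2) to (2.6) can be sketched roughly as follows. This is a minimal illustration, not the patented algorithm: the function names `build_index` and `interaction_dict` are hypothetical, and the counting rule (one interaction for each pair of occurrences whose token positions differ by at most the distance threshold) is an assumption, since the text does not spell out the exact rule.

```python
from collections import defaultdict
from itertools import combinations


def build_index(tokens, names):
    """Map each character name to the token positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        if tok in names:
            index[tok].append(pos)
    return dict(index)


def interaction_dict(tokens, names, distance_threshold, interaction_threshold):
    """Return {name: {other_name: interaction_count}} for character pairs.

    ASSUMPTION: an interaction is one pair of occurrences whose positions
    differ by at most distance_threshold; pairs with fewer than
    interaction_threshold interactions are dropped.
    """
    index = build_index(tokens, names)
    result = {name: {} for name in index}
    for a, b in combinations(sorted(index), 2):
        count = sum(
            1
            for i in index[a]
            for j in index[b]
            if abs(i - j) <= distance_threshold
        )
        if count >= interaction_threshold:
            result[a][b] = count
            result[b][a] = count
    return result
```

The nested-dictionary shape mirrors the description in step (2.6): each key is a character name and each value holds the interacting characters with their interaction counts.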

Further, in the step (3), the specific operation steps of constructing the social network diagram and the centrality metric table by using the interactive dictionary are as follows:

(3.1) constructing a social network topological structure G with the users in the social network as nodes and the connection relationships as edges; G = (V, E), where V is the set of all nodes in G and E is the set of all edges;

(3.2) detecting communities of the graph, and setting community partitions as attributes of the graph nodes;

(3.3) creating a Python script, taking a text file as input, and generating a graphic object as output;

(3.4) calculating three different centralities and obtaining a character centrality metric table.
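Steps (3.1) to (3.3) might look like the following sketch using the networkx library. The helper names are hypothetical, and the choice of greedy modularity maximization for community detection is an assumption made for illustration; the text does not say which community detection algorithm is used.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def build_graph(interactions):
    """Build an undirected weighted graph from {name: {other: count}}."""
    g = nx.Graph()
    g.add_nodes_from(interactions)
    # Each undirected pair appears twice in the dictionary; keep one copy.
    edges = [
        (a, b, w)
        for a, nbrs in interactions.items()
        for b, w in nbrs.items()
        if a < b
    ]
    g.add_weighted_edges_from(edges)
    return g


def annotate_communities(g):
    """Detect communities and store the partition as a node attribute.

    ASSUMPTION: greedy modularity maximization stands in for whatever
    community detection method the original implementation uses.
    """
    communities = greedy_modularity_communities(g, weight="weight")
    membership = {
        node: idx for idx, members in enumerate(communities) for node in members
    }
    nx.set_node_attributes(g, membership, "community")
    return membership


def export_for_gephi(g, path):
    """Write the graph as GEXF, a format that Gephi can open."""
    nx.write_gexf(g, path)
```

Storing the partition as a node attribute means the community assignment travels with the exported file and can be used for coloring nodes in a visualization tool.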

Further, in step (4), the node influence evaluation criterion is provided, and the specific operation steps of calculating the node influence are as follows:

(4.1) obtaining the character centrality metric table from the three different centralities, and calculating the comprehensive influence;

(4.2) visually presenting the text data in chart form using character centrality analysis and visualization technology.

Further, in step (5), performing literary analysis according to the influence evaluation criterion specifically includes: extracting the social network in the novel on the mathematical basis of graph theory; and, using social network centrality metrics, examining the comprehensive influence over the whole text and tracking the characters' centrality rankings section by section.

Beneficial effects: compared with the prior art, the invention applies character centrality analysis, uses quantitative analysis, and combines visualization technology to present the data in chart form. Its features are: (1) a computer processes the large volume of text data, saving considerable human effort; (2) the mathematical content of quantitative analysis serves as the basis of the research, so the research content is quantified and the accuracy of qualitative judgments is improved; (3) the research process is described in the form of data, so the process and results of the analysis can be checked; (4) regardless of a researcher's personal knowledge or background, applying the same experimental method to the same data yields the same result.

Drawings

FIG. 1 is a flow chart of the operation of the present invention;

FIG. 2 is a flow chart of the operation of the present invention for pre-processing novel text data;

FIG. 3 is a flowchart illustrating the operation of creating an interactive dictionary in accordance with the present invention;

FIG. 4 is a flow chart of the operation of the present invention to compute node influence;

FIG. 5 is a flowchart of the operation of the impact evaluation criteria of the present invention for literary analysis;

FIG. 6 is a diagram of an example of a social network in an embodiment of the present invention;

FIG. 7 is a comparison of "three centrality" measures for a character in chapters 1-30 of an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following drawings and specific embodiments; in the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in many ways different from those described herein, and similar modifications may be made by those skilled in the art without departing from the spirit of the present application, and the present application is therefore not limited to the specific implementations disclosed below.

The invention relates to a literary work character relationship visualization analysis method based on node influence measurement, which comprises the following specific operation steps:

(1) Preprocessing the novel text data;

(1.1) reading text data and filtering special symbols;

(1.2) realizing 'name merging and symbolization';

(1.3) segmenting words;

(1.4) restoring the personal names;

the text data preprocessing is characterized as follows: Chinese personal names take several forms, such as the formal name, childhood name, courtesy name, and art name, so the different name forms of each character need to be merged. Before stop words are removed, the names in the text are replaced with specific symbols, and after stop words are removed the names are restored. This prevents individual rare characters within a name from being mistaken for stop words and removed during filtering, which would make it impossible to identify accurately how often and where a character appears in the text. 'Name merging and symbolization' thus effectively improves identification accuracy and indexing efficiency.
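The preprocessing order in steps (1.1) to (1.4) can be illustrated with a small sketch. Everything named here is hypothetical: the alias table, the `PERSON<n>` placeholder format, the toy stop-word list, and the per-character `tokenize` function, which merely stands in for a real Chinese word segmenter.

```python
import re

# Hypothetical alias table: canonical name -> name forms to merge.
ALIASES = {
    "曹操": ["曹操", "孟德", "曹孟德"],
    "刘备": ["刘备", "玄德", "刘玄德"],
}
# Toy stop-word list; "德" shows how a character inside a name could be lost.
STOP_WORDS = {"的", "了", "德"}


def symbolize(text, aliases):
    """Replace every name form with a placeholder; return text and token map."""
    token_to_name = {}
    pairs = []
    for idx, (canonical, forms) in enumerate(aliases.items()):
        token = f"PERSON{idx}"
        token_to_name[token] = canonical
        pairs.extend((form, token) for form in forms)
    # Longest form first, so "曹孟德" is not split by the shorter "孟德".
    for form, token in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(form, f" {token} ")
    return text, token_to_name


def tokenize(text):
    """Toy segmenter: placeholders stay whole, other text splits per character."""
    tokens = []
    for chunk in text.split():
        if re.fullmatch(r"PERSON\d+", chunk):
            tokens.append(chunk)
        else:
            tokens.extend(chunk)
    return tokens


def preprocess(text, aliases, stop_words):
    text = re.sub(r"[^\u4e00-\u9fff]", "", text)         # (1.1) keep Chinese characters only
    text, token_to_name = symbolize(text, aliases)        # (1.2) merge and symbolize names
    tokens = tokenize(text)                               # (1.3) segment
    tokens = [t for t in tokens if t not in stop_words]   # stop-word filtering
    return [token_to_name.get(t, t) for t in tokens]      # (1.4) restore names
```

For the input `"玄德见了孟德。"`, the two courtesy names are merged into their canonical forms and survive stop-word filtering even though "德" is on the toy stop-word list.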

(2.1) taking characters in the novel as nodes in the influence network, and respectively constructing a double-character mark tuple list and a three-character mark tuple list according to node information;

(2.3) creating a nested dictionary according to the search result, wherein the character name is a key of the dictionary, the index array is a value of the dictionary, and the character dictionary is linked;

(2.4) providing a criterion for judging real interaction; co-occurrence between entities is a statistics-based form of information extraction; closely related characters often appear together in multiple passages of the text, so character relationships can be extracted by identifying the known entities (names) in the text and counting how often different entities co-occur; when the value is less than a threshold, it is assumed that there is some relationship between the two entities;

(2.5) selecting a distance threshold and an interaction threshold according to the criterion for judging real character interaction, and calculating the number of interactions;

specifically, the average distance between personal names is used as the basis for determining the distance threshold: the average distance between each node and the other nodes is calculated, and the mean of these values over all nodes is then taken; the calculation formula is shown as formula (1):

where g denotes the number of nodes, V denotes the set of nodes, and D(N_i, N_j) denotes the distance between two nodes;

in addition, considering that the probability of the same character occurring more than three times is low, the invention sets the interaction threshold to 3, thereby improving the accuracy of the interaction calculation;

(2.6) creating an interactive dictionary according to the calculation result of step (2.5); the keys of the interactive dictionary are character names, and the values are the characters who interact with the key character, together with the number of interactions.

generating a tuple list of the edges of the graph according to the dictionary, creating the graph by using nx.Graph(), adding nodes by using add_nodes_from, and adding edges by using g.add_weighted_edges_from;

(3.3) creating a Python script, taking a text file as input, and generating a graphic object as output; further processing the graph by using a Gephi tool to realize the visualization effect of the relationship network;

(3.4) calculating three different centralities and obtaining a character centrality metric table; the three centralities calculated are degree centrality, betweenness centrality, and eigenvector centrality;

specifically, calculating the degree centrality: the calculation formula adopted is shown as formula (2):

where C_D(N_i) denotes the degree centrality of node i, that is, the number of direct connections between node i and the other g-1 nodes j (i ≠ j, excluding any connection between i and itself; the values on the principal diagonal can therefore be ignored).

Calculating the betweenness centrality: the calculation formula adopted is shown as formula (3):

where V denotes the node set, σ(s, t) denotes the number of shortest (s, t) paths, and σ(s, t | v) denotes the number of those paths passing through some node v other than s and t;

calculating the eigenvector centrality: the calculation formula adopted is shown in formula (4):

where c denotes a proportionality constant and x_i denotes the importance of node i.
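For reference, the standard textbook definitions of the three measures, written with the symbols defined above, are given below. They are assumed, not confirmed, to correspond to formulas (2) to (4). The symbols x_{ij} (1 if nodes i and j are directly connected, 0 otherwise) and a_{ij} (the entry of the adjacency matrix) are notation introduced here for illustration.

```latex
\begin{align}
C_D(N_i) &= \sum_{j=1,\ j \neq i}^{g} x_{ij} \\
C_B(v)   &= \sum_{s,t \in V} \frac{\sigma(s,t \mid v)}{\sigma(s,t)} \\
x_i      &= c \sum_{j=1}^{g} a_{ij}\, x_j
\end{align}
```

Dividing the degree count by g - 1 gives the normalized form, the proportion of the other nodes to which node i is directly connected.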

Further, in step (4), the specific operation steps of providing the evaluation criterion of the node influence and calculating the node influence are as follows:

(4.1) calculating the three different centralities according to step (3.4) to obtain the character centrality metric table, and from it calculating the comprehensive influence;

specifically, the calculation formula is shown in formula (5):

where g denotes the number of nodes, V denotes the node set, and C_D(N_i), C_B(v), and C_E(N_i) denote the degree centrality, the betweenness centrality, and the eigenvector centrality of node i, respectively.

This calculation method jointly considers the influence of an individual in the network as a whole, the influence between individuals, the influence of an individual on the group, and the influence of the group on an individual. In addition, the comprehensive centrality value of node i is divided by g-1, the maximum possible number of connections to the other nodes, to obtain the proportion of network nodes directly connected to node i. This removes the effect of changes in network size on the comprehensive centrality, which is an advantage when analyzing character influence in large-scale social networks.

(4.2) visually presenting the text data in a chart form according to a character centrality analysis method and a visualization technology;

the relationship network graph clarifies how groups of characters are related, and centrality compares the importance of characters within a community; combining the relationship network graph with the character centrality metric table enriches the perspectives available for analyzing literary characters and makes the results more reliable.
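The centrality metric table of step (3.4) and a combined score for step (4.1) could be computed along these lines with networkx. The exact form of formula (5) is not given in the text, so the combination used here (the sum of the three measures divided by g - 1) is purely an illustrative assumption; `centrality_table` and `comprehensive_influence` are hypothetical names.

```python
import networkx as nx


def centrality_table(g):
    """Return {node: {"degree", "betweenness", "eigenvector"}} for graph g."""
    degree = nx.degree_centrality(g)
    between = nx.betweenness_centrality(g)
    eigen = nx.eigenvector_centrality(g, max_iter=1000)
    return {
        node: {
            "degree": degree[node],
            "betweenness": between[node],
            "eigenvector": eigen[node],
        }
        for node in g
    }


def comprehensive_influence(table):
    """Combine the three measures into one score per node.

    ASSUMPTION: sum of the three centralities divided by g - 1. This is a
    stand-in for formula (5), whose exact form is not reproduced in the text.
    """
    g_minus_1 = len(table) - 1
    return {
        node: (row["degree"] + row["betweenness"] + row["eigenvector"]) / g_minus_1
        for node, row in table.items()
    }
```

Sorting the nodes by the combined score in descending order gives one possible ranking chart; any other combination rule can be substituted by changing only `comprehensive_influence`.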

Further, in step (5), performing literary analysis according to the influence evaluation criterion specifically includes: extracting the social network in the novel on the mathematical basis of graph theory; and, using social network centrality metrics, examining the comprehensive influence over the whole text and tracking the characters' centrality rankings section by section. The research process is described in the form of data, and the data serve as the basis for the literary analysis, so the research content is quantified and the accuracy of qualitative judgments is improved.
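Tracking centrality rankings section by section can be sketched as below. For brevity the sketch ranks by normalized degree centrality only and assumes one interactive dictionary per section has already been built; both the function names and that simplification are assumptions, not part of the described method.

```python
def degree_centrality(interactions):
    """Normalized degree centrality from {name: {other: count}}."""
    g = len(interactions)
    if g < 2:
        return {name: 0.0 for name in interactions}
    return {name: len(nbrs) / (g - 1) for name, nbrs in interactions.items()}


def track_rankings(section_interactions):
    """Rank characters in each section, highest centrality first.

    Ties are broken alphabetically so the output is deterministic.
    """
    history = []
    for interactions in section_interactions:
        scores = degree_centrality(interactions)
        ranked = sorted(scores, key=lambda name: (-scores[name], name))
        history.append(ranked)
    return history
```

Comparing consecutive entries of the returned list shows how a character's rank rises or falls as the novel progresses.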

In order to make the objects, schemes and advantages of the present invention more clear, the following will explain the present invention in further detail by taking the experiment performed on the real data set as an example; it should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

Taking the text data of 'Romance of the Three Kingdoms' as an example, the concrete implementation steps of the method for visually analyzing character relationships in literary works based on node influence measurement are described below.

In this embodiment, the preprocessing method is applied to the text data of the Chinese novel 'Romance of the Three Kingdoms'. With reference to fig. 2, the process mainly includes special symbol filtering, name merging and symbolization, word segmentation, Chinese stop word filtering, and name restoration. It cleans the data by removing the designated useless symbols and keeping only the Chinese characters in the text, which facilitates the subsequent interaction calculation. Table 1 gives a partial statistical comparison of name frequencies before and after name merging and symbolization.

TABLE 1 Statistical comparison of name frequencies before and after name merging and symbolization (partial)

In this embodiment, the character interaction calculation method is used to output an interactive dictionary, and traverse the distance between every two nodes and the number of interactions; table 2 provides an algorithm description.

Table 2 Description of the character interaction calculation algorithm

In this embodiment, a distance calculation function distance_dic_f() is constructed, which traverses a given node and the other nodes and adds the distances to a dictionary in turn; an average distance calculation function ave_simple_distance() is constructed, which calculates the average distance between a given node and the other nodes and adds each node as a key and its average distance as a value to a dictionary in turn; an overall average distance calculation function average_ave_distance() is constructed, which takes the per-node average distances and calculates the mean distance over all nodes. Processing the text data of 'Romance of the Three Kingdoms' in this way gives a distance threshold of 23.

It should be understood that this result reflects the language features of 'Romance of the Three Kingdoms'; for other texts, the distance threshold determination method of the invention will yield a different value.
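The two averaging functions named above might be implemented as in the following sketch. Only the function names come from the text; the bodies, the argument, and the input shape (a nested dictionary of pairwise distances such as `distance_dic_f()` might produce) are assumptions.

```python
def ave_simple_distance(distances):
    """Average distance from each node to the other nodes.

    distances: {node: {other_node: distance}} (assumed input shape).
    Returns {node: mean distance to the others}; nodes with no recorded
    distances are skipped.
    """
    return {
        node: sum(others.values()) / len(others)
        for node, others in distances.items()
        if others
    }


def average_ave_distance(distances):
    """Mean of the per-node average distances, usable as a distance threshold."""
    per_node = ave_simple_distance(distances)
    return sum(per_node.values()) / len(per_node)
```

With three characters whose pairwise distances are 10, 20, and 30, the per-node averages are 20, 15, and 25, and the overall average is 20.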

In this embodiment, the method for constructing a social network graph is used to output a social network graph, as shown in fig. 6; table 3 provides a description of the graph creation algorithm.

Table 3 Description of the graph creation algorithm

In this embodiment, the node influence measurement method is used to calculate character centrality, calculate the comprehensive influence, and output a ranking chart; FIG. 7 compares the centrality measures of the characters in chapters 1-30. Taking Cao Cao as an example, the comprehensive influence is calculated as:

where C_D(N_i) = 0.65, C_B(v) = 0.3, C_E(N_i) = 0.95, and g = 70.

In this embodiment, the influence evaluation criterion rests on the mathematics of graph theory: the social network in the novel is extracted, and social network centrality metrics are applied to examine the comprehensive influence over the whole text and to track the characters' centrality rankings section by section.

Through algorithm design and interpretation of the graphs, the analysis shows the following. First, the core character of the whole text is Cao Cao. Second comes Liu Bei, who is not as powerful as Cao Cao but has powerful friends. Third, many characters in the novel are interconnected; even veterans from the first third of the novel are linked to important figures from the last quarter, such as Sima Yi and Jiang Wei. The evaluation criterion serves as data support for literary analysis: character structure data are examined at the numerical level and analyzed statistically by quantitative means. The resulting conclusions still need to be corroborated with traditional methods of literary analysis.

The method for visually analyzing character relationships in literary works based on node influence measurement provided by the embodiment of the present invention has been described in detail above. The description of the embodiment is intended only to help in understanding the method and core ideas of the present invention. A person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application; in summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A visualized analysis method for literary work character relationship based on node influence measurement is characterized by comprising the following specific operation steps of:

(1) Preprocessing the novel text data;

the specific operation steps of utilizing the character node information to calculate the character interaction and creating the interactive dictionary are as follows:

(2.3) creating a nested dictionary according to the search result, wherein the character name is a key of the dictionary, the index array is a value of the dictionary, the character dictionary is linked,

(2.4) providing a judgment standard for real interaction,

(2.6) creating an interactive dictionary according to the calculation result; wherein the keys of the interactive dictionary are character names, and the values are the characters who interact with the key character, together with the number of interactions;

(3) Constructing a social network graph and a centrality metric table by utilizing the interactive dictionary;

2. The method for visually analyzing the character relationship of the literary works based on the node influence metric as claimed in claim 1,

in the step (1), the specific operation steps of preprocessing the novel text data are as follows:

(1.1) reading text data and filtering special symbols;

(1.2) realizing 'name merging and symbolization';

(1.3) segmenting words;

(1.4) restoring the personal names.

3. The method for visually analyzing the character relationship of the literary works based on the node influence metric as claimed in claim 1, characterized in that,

in the step (3), the specific operation steps of constructing the social network diagram and the centrality metric table by using the interactive dictionary are as follows:

4. The method for visually analyzing the character relationship of the literary works based on the node influence metric as claimed in claim 1, characterized in that,

in the step (4), the node influence evaluation criterion is provided, and the specific operation steps of calculating the node influence are as follows:

5. The method for visually analyzing the character relationship of the literary works based on the node influence metric as claimed in claim 1, characterized in that,

in the step (5), performing literary analysis according to the influence evaluation criterion specifically includes: extracting the social network in the novel on the mathematical basis of graph theory; and, using social network centrality metrics, examining the comprehensive influence over the whole text and tracking the characters' centrality rankings section by section.