CN116150352A - Group division method and related device - Google Patents

Group division method and related device

Info

Publication number
CN116150352A
Authority
CN
China
Prior art keywords
target
target type
node
homogeneous
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210882017.9A
Other languages
Chinese (zh)
Inventor
刘振国
蒋宁
吴海英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202210882017.9A
Publication of CN116150352A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Marketing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a group division method and a related device. The method comprises the following steps: selecting, from a target knowledge graph, a knowledge graph subgraph for a target type node; using M edge weight generation algorithms to respectively calculate edge weights among a plurality of target type nodes in the knowledge graph subgraph, so as to generate M homogeneous graphs corresponding to the knowledge graph subgraph; applying N group division algorithms to each of the M homogeneous graphs to obtain N group division results corresponding to each homogeneous graph; scoring the N group division results corresponding to each of the M homogeneous graphs, and selecting the target edge weight generation algorithm and the target group division algorithm corresponding to the highest score; and generating a homogeneous graph of the target knowledge graph through the target edge weight generation algorithm, and performing group division on that homogeneous graph through the target group division algorithm. The method and the device can divide groups more accurately.

Description

Group division method and related device
Technical Field
The application relates to the field of knowledge graphs, and in particular to a group division method and a related device.
Background
Communities, associations, and other groups reflect the local characteristics of individual behavior in a network and the associations among individuals. Studying the groups in a network is vital to understanding the structure and function of the whole network, and it helps in analyzing and predicting the interactions among the network's elements. Group discovery can mine the hidden cluster structure in a network, is important for analyzing the structure and function of complex networks, and is applied in fields such as social networks, biological networks, and transaction networks.
A knowledge graph is essentially a large semantic network that aims to describe concepts, entities, and events in the objective world and the relationships among them. It takes entities and concepts as nodes and relationships as edges, and views the world from the perspective of relationships, so that massive and diverse data can be expressed, organized, and used more efficiently.
Group discovery is a classic problem in network science and has long received extensive attention from researchers. Most existing group discovery algorithms operate on homogeneous graphs, yet heterogeneous graphs make up a large proportion of the graph data in real business settings. Existing methods therefore convert the non-target nodes lying between target type nodes in a network (a heterogeneous graph) into edges and simply assign edge weights to construct a homogeneous graph, but this coarse-grained approach cannot achieve a good grouping effect.
Disclosure of Invention
The application provides a group division method and a related device, which at least address the problem of how to divide groups more accurately.
The technical scheme of the application is as follows:
according to a first aspect of embodiments of the present application, there is provided a population partitioning method, which may include the steps of:
selecting, from a target knowledge graph, a knowledge graph subgraph for a target type node; using M edge weight generation algorithms to respectively calculate edge weights among a plurality of target type nodes in the knowledge graph subgraph, so as to generate M homogeneous graphs corresponding to the knowledge graph subgraph, wherein M is an integer greater than 1; applying N group division algorithms to each of the M homogeneous graphs to obtain N group division results corresponding to each homogeneous graph, wherein N is an integer greater than 1; scoring the N group division results corresponding to each of the M homogeneous graphs, and selecting, from the M edge weight generation algorithms and the N group division algorithms, the target edge weight generation algorithm and the target group division algorithm corresponding to the highest score; and generating a homogeneous graph of the target knowledge graph through the target edge weight generation algorithm, and performing group division on the homogeneous graph of the target knowledge graph through the target group division algorithm.
According to a second aspect of embodiments of the present application, there is provided a population dividing apparatus, the apparatus may include:
a subgraph selection module configured to select, from a target knowledge graph, a knowledge graph subgraph for a target type node;
a homogeneous graph generation module configured to use M edge weight generation algorithms to respectively calculate edge weights among a plurality of target type nodes in the knowledge graph subgraph, so as to generate M homogeneous graphs corresponding to the knowledge graph subgraph, wherein M is an integer greater than 1;
a group division module configured to apply N group division algorithms to each of the M homogeneous graphs to obtain N group division results corresponding to each homogeneous graph, wherein N is an integer greater than 1;
a division result evaluation module configured to score the N group division results corresponding to each of the M homogeneous graphs, and to select, from the M edge weight generation algorithms and the N group division algorithms, the target edge weight generation algorithm and the target group division algorithm corresponding to the highest score;
a grouping module configured to generate a homogeneous graph of the target knowledge graph through the target edge weight generation algorithm, and to perform group division on the homogeneous graph of the target knowledge graph through the target group division algorithm.
According to a third aspect of embodiments of the present application, there is provided an electronic device, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the population division method as set forth in the first aspect above.
According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the population division method as described above in the first aspect.
According to a fifth aspect of embodiments of the present application, there is provided a computer program product, instructions in which are executed by at least one processor in an electronic device to perform the population division method as described in the first aspect above.
The technical scheme provided by the embodiment of the application at least brings the following beneficial effects:
Selecting a portion of the target knowledge graph as test data helps save actual engineering cost and avoids scrapping engineering work because of a flawed scheme. The test data are divided into groups using multiple edge weight generation algorithms and multiple group division algorithms, and the optimal edge weight generation algorithm and the optimal group division algorithm for the target knowledge graph are screened by scoring the group division results of the test data, thereby obtaining a better group division effect.
In addition, when the multiple edge weight generation algorithms convert the heterogeneous graph into a homogeneous graph, graph feature data in the target knowledge graph, such as path information, node degree, and node centrality, are taken into account. The generated homogeneous graph is therefore more interpretable, adapts to specific downstream tasks, improves their results, and greatly improves the recognition rate of abnormal groups.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application and do not constitute an undue limitation on the application.
FIG. 1 is a flow chart of a population partitioning method according to an embodiment of the present application;
FIG. 2 is a flow chart of a population partitioning method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a heterogeneous graph subgraph for user nodes according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a homogeneous graph for user nodes according to an embodiment of the present application;
FIG. 5 is a block diagram of a group partitioning apparatus according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a population dividing apparatus according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of embodiments of the present application defined by the claims and their equivalents. Various specific details are included to aid understanding, but are merely to be considered exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to written meanings, but are used by the inventors to achieve a clear and consistent understanding of the application. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present application are provided for illustration only and not for the purpose of limiting the application as defined by the claims and their equivalents.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
In the related art, network nodes with the same attribute or the same behavior pattern may first be found, and a homogeneous graph is obtained by constructing homogeneous edges between them. The homogeneous graph is then partitioned with a graph partitioning algorithm to obtain a plurality of group subgraphs. However, the weight of a homogeneous edge is determined from the weights corresponding to the shared attributes and/or shared behavior patterns of the target nodes at its two ends, so only the attribute information of the target nodes is considered, while the topology of the heterogeneous network and the importance of other types of nodes are ignored. Because global information is ignored, the edge weights in the homogeneous graph are poorly interpretable, which degrades the effect of downstream tasks.
In another related technology, a heterogeneous knowledge graph is constructed according to an application scenario and a data set, and a homogeneous network containing only entities is extracted from it. For each connected subgraph in the homogeneous network, node distance vectors are obtained with a network embedding technique, the connected subgraphs are clustered based on these vectors to generate groups, and each group is judged to be abnormal or not according to the proportion of abnormally labeled data to normally labeled data in it. Because this method relies on clustering, problems with the resulting clusters can lead to poor entity aggregation. Moreover, the edge weights among entities in the constructed homogeneous graph are set from the number of relations between the two corresponding entities, and the influence of the edge weights in the heterogeneous knowledge graph on the entities is not considered.
The inventor considers that most group discovery algorithms (group division algorithms) are extremely sensitive to the weight matrix of the graph, and that subtle differences in it can cause marked differences in the grouping effect. The application therefore provides a group division scheme that can improve the grouping of target nodes. For example, a knowledge graph subgraph for testing can be extracted from the existing knowledge graph and labeled. Weighted homogeneous graphs based on the target type nodes are then generated from the subgraph data according to an edge weight generation scheme (which may comprise multiple edge weight designs), and group division is performed on each weighted homogeneous graph with multiple group discovery algorithms. Each weighted homogeneous graph is analyzed and evaluated according to the division results, and the optimal combination of edge weight generation scheme and group discovery algorithm is selected according to the calculated evaluation scores, so as to divide all target type nodes in the existing knowledge graph into groups.
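The selection over M edge weight generation algorithms and N group division algorithms amounts to an exhaustive search over M × N combinations. A minimal sketch of that search follows, assuming each algorithm is supplied as a plain Python callable; the names `select_best_combination`, `weight_algos`, `division_algos`, and `score` are hypothetical, and the scoring function is left to the caller because this passage does not specify it.

```python
from itertools import product
from typing import Callable, Dict, Hashable, Optional, Tuple

Graph = Dict[Tuple[Hashable, Hashable], float]  # weighted homogeneous graph
Partition = Dict[Hashable, int]                 # node -> group id


def select_best_combination(
    subgraph: object,
    weight_algos: Dict[str, Callable[[object], Graph]],
    division_algos: Dict[str, Callable[[Graph], Partition]],
    score: Callable[[Graph, Partition], float],
) -> Optional[Tuple[str, str, float]]:
    """Try every (edge weight algorithm, group division algorithm) pair.

    Returns the names of the best pair and its score, or None if either
    dictionary is empty.
    """
    # Build the M homogeneous graphs once, one per edge weight algorithm.
    graphs = {name: build(subgraph) for name, build in weight_algos.items()}

    best: Optional[Tuple[str, str, float]] = None
    for w_name, d_name in product(weight_algos, division_algos):
        partition = division_algos[d_name](graphs[w_name])
        s = score(graphs[w_name], partition)
        if best is None or s > best[2]:
            best = (w_name, d_name, s)
    return best
```

The winning pair would then be applied to the full target knowledge graph rather than to the test subgraph.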
Hereinafter, the method and apparatus of the present application will be described in detail with reference to the accompanying drawings, according to various embodiments of the present application.
FIG. 1 is a flow chart of a group division method according to an embodiment of the present application. The group division method according to the embodiment of the application can be implemented in any electronic device with a data processing function. The electronic device may be, for example, at least one of a smart phone, a tablet personal computer (PC), a mobile phone, a video phone, an electronic book reader (e-book reader), a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a personal digital assistant (PDA), a portable multimedia player (PMP), a video player, a wearable device, and the like.
In step S101, a knowledge graph subgraph for the target type node is selected from the target knowledge graph. In the application, a portion of the existing knowledge graph is selected as test data to screen for the edge weight generation algorithm and the group division algorithm that yield the best grouping effect.
For example, with a user knowledge graph stored in a neo4j graph database as the target knowledge graph and the user node as the target type node, a subgraph can be extracted from the existing user knowledge graph through an apoc function as test data, then exported and saved as a csv file. For example, starting from the input nodes, the apoc function uses a breadth-first algorithm to extract a central network of a given range (such as an ego network) from the existing heterogeneous graph as the heterogeneous graph subgraph for testing.
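The breadth-first extraction of an ego network can be illustrated without a graph database. The sketch below is a plain Python stand-in rather than the apoc call itself; the function name `ego_network` and the adjacency-dictionary input are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set


def ego_network(
    adjacency: Dict[Hashable, Iterable[Hashable]],
    seed: Hashable,
    radius: int,
) -> Set[Hashable]:
    """Return every node within `radius` hops of `seed`, found breadth-first."""
    hops = {seed: 0}          # node -> hop distance from the seed
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == radius:
            continue          # do not expand beyond the requested range
        for neighbour in adjacency.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return set(hops)
```

With user nodes linked through intermediate nodes, a radius of 2 around a user node reaches the users that share an intermediate node with it.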
Each row of data in the subgraph file has, for example, the form: user node 1, edge weight of relationship 1, intermediate node, edge weight of relationship 2, user node 2. A row represents a path from user node 1 to user node 2 and includes the intermediate node and the weights of the edges.
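A row of this kind can be loaded into a small record type. The sketch below assumes a hypothetical csv header (`user_1`, `rel_1`, `weight_1`, `middle`, `rel_2`, `weight_2`, `user_2`); the actual column names and order in the exported file are not given in the text.

```python
import csv
import io
from typing import List, NamedTuple


class PathRow(NamedTuple):
    """One second-order path: user_1 -(weight_1)- middle -(weight_2)- user_2."""
    user_1: str
    weight_1: float
    middle: str
    weight_2: float
    user_2: str


def read_paths(text: str) -> List[PathRow]:
    """Parse csv text with the assumed header into PathRow records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        PathRow(
            row["user_1"],
            float(row["weight_1"]),
            row["middle"],
            float(row["weight_2"]),
            row["user_2"],
        )
        for row in reader
    ]
```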
In step S102, using M kinds of edge weight generation algorithms, edge weights between a plurality of target type nodes in the knowledge-graph sub-graph are calculated, respectively, to generate M homogeneous graphs corresponding to the knowledge-graph sub-graph, where M is an integer greater than 1.
In this application, the terms homogeneous graph, homogeneous map, and homogeneity map are used interchangeably. A plurality of edge weight generation algorithms may be employed to generate the weight matrix of the homogeneous graph, so as to obtain a plurality of different homogeneous graphs. In the process of converting a heterogeneous graph into a homogeneous graph, graph feature data in the heterogeneous graph, such as path information, node degree, and node centrality (i.e. symmetry), may be lost. To compensate for this loss, factors such as distance, step size, and node degree may be taken into account when generating the homogeneous graph. For example, at least two of a distance penalty method, a step attenuation method, a node degree penalty method, and a node similarity method may be employed to generate weight matrices for the homogeneous graph. The following description takes any two target type nodes in the subgraph as an example.
As an example, a distance penalty method may be utilized to calculate a first weight sum for all edges in a path from a first target type node to a second target type node in the plurality of target type nodes, and then divide the first weight sum by an order of the path to obtain an edge weight between the first target type node and the second target type node.
The distance penalty method penalizes the sum of the edge weights in a path according to the path length between two entities of the same type. The higher the order of the path between two entities in the knowledge graph, the weaker the connection between them and the smaller the edge weight.
For example, the sum of the weights of all the edges in the path between the first target type node to the second target type node (i.e., the sum of the first weights) may be divided by the order of the path to obtain the edge weight between the first target type node and the second target type node. Computing edge weights between the first target type node and the second target type node using distance penalty method may be implemented as follows in equation (1):
$$W_{ij} = \frac{\sum_{r \in R} w_r}{L(P_{ij})} \tag{1}$$
where $W_{ij}$ represents the calculated weight between the two target type entities $i$ and $j$ in the homogeneous graph, $R$ is the set of relations in the path $P_{ij}$ between entity $i$ and entity $j$, and $r \in R$. The function $L(\cdot)$ computes the path length of $P_{ij}$, and $w_r$ represents the weight of the relation $r$, i.e. the weight of the edge $r$.
In the distance penalty method, edge weights between the target type nodes may be calculated based on the number of steps/hops between the target type nodes as the path length between the target type nodes. For example, if the user node 1 is directly connected to the user node 2, the order is 1; the user node 1 is connected to the user node 2 via an intermediate node, the order is 2.
In general, in the existing user knowledge graph, since the shortest path between user nodes is a second-order path, that is, the order is 2, and more than one edge (relationship) of the user nodes and the intermediate nodes is considered, equation (1) may be modified into equation (2) as shown below:
$$W_{ij} = \frac{1}{2} \sum_{r \in R} w_r \tag{2}$$
The resulting $W_{ij}$ is one half of the sum of all edge weights from user node $i$ to user node $j$ in the subgraph csv file. In this manner, the weight matrix of the homogeneous graph corresponding to the subgraph can be constructed.
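The distance penalty can be sketched in a few lines: equation (1) for a single path, and equation (2) aggregated over the second-order rows of the subgraph file. The function names and the tuple layout `(user_i, weight_1, weight_2, user_j)` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Sequence, Tuple


def distance_penalty(edge_weights: Sequence[float]) -> float:
    """Equation (1): sum of the edge weights on a path divided by its order."""
    return sum(edge_weights) / len(edge_weights)


def second_order_weights(
    rows: Iterable[Tuple[str, float, float, str]],
) -> Dict[FrozenSet[str], float]:
    """Equation (2): half the sum of all edge weights between each user pair.

    Each row is (user_i, weight_1, weight_2, user_j) for one second-order path.
    Pairs are keyed by an unordered frozenset so the result is symmetric.
    """
    totals: Dict[FrozenSet[str], float] = defaultdict(float)
    for user_i, weight_1, weight_2, user_j in rows:
        totals[frozenset((user_i, user_j))] += weight_1 + weight_2
    return {pair: total / 2 for pair, total in totals.items()}
```

Keying pairs by a frozenset is a design choice of this sketch: it makes paths recorded in either direction contribute to the same entry of the weight matrix.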
Because the path data in the original knowledge graph are taken into account when the heterogeneous graph is converted into the homogeneous graph, the distance information between nodes in the heterogeneous graph is not lost, and the generated homogeneous graph is more interpretable.
As another example, an attenuation factor may be set for the edge weights in the path between two entities of the same type according to the step size of the edge, and the final weighted sum may be taken as the final weight. For example, applying a first attenuation factor to weights of all edges in a path between a third target type node to a fourth target type node of the plurality of target type nodes using a step attenuation method; an edge weight between the third target type node and the fourth target type node is determined based on a sum of the second weights of all edges to which the first attenuation factor is applied, wherein the first attenuation factor is determined based on a step size of the respective edge from the third target type node in the path.
To keep the edge weights in the homogeneous graph symmetric, a path-based bidirectional calculation may be used. That is, after the second weight sum is calculated, a second attenuation factor is applied to the weights of all edges in the path from the fourth target type node to the third target type node to determine a third weight sum of all edges to which the second attenuation factor is applied. The average of the second weight sum and the third weight sum is taken as the edge weight between the third target type node and the fourth target type node, wherein the second attenuation factor is determined based on the step size of the corresponding edge from the fourth target type node in the path.
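The bidirectional calculation can be sketched from the verbal description alone. The decay function used below, `0.5 ** (step - 1)`, is purely an assumption chosen for illustration and is not taken from the source's equation; it is exposed as a parameter so another factor can be substituted. The function names are likewise hypothetical.

```python
from typing import Callable, Sequence


def attenuated_sum(
    edge_weights: Sequence[float],
    steps: Sequence[int],
    decay: Callable[[int], float],
) -> float:
    """Sum of edge weights, each scaled by a factor that depends on its step."""
    return sum(w * decay(s) for w, s in zip(edge_weights, steps))


def bidirectional_weight(
    edge_weights: Sequence[float],
    steps_from_i: Sequence[int],
    steps_from_j: Sequence[int],
    decay: Callable[[int], float] = lambda step: 0.5 ** (step - 1),  # assumed factor
) -> float:
    """Average of the attenuated sums computed from node i and from node j."""
    forward = attenuated_sum(edge_weights, steps_from_i, decay)
    backward = attenuated_sum(edge_weights, steps_from_j, decay)
    return (forward + backward) / 2
```

Swapping the two step lists leaves the result unchanged, which is the symmetry the bidirectional averaging is meant to provide.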
For example, calculating the edge weight between the third target type node and the fourth target type node using the step attenuation method may be implemented according to the following equation (3):
Figure BDA0003764544540000071
where $W_{ij}$ represents the calculated weight between the two target type entities $i$ and $j$ in the homogeneous graph, $R$ is the set of relations in the path $P_{ij}$ or $P_{ji}$ between the two entities, and $r \in R$. The symbol shown in Figure BDA0003764544540000072 denotes the path information of the relation $r$ contained in the path from entity $i$ to entity $j$. The symbol shown in Figure BDA0003764544540000073 denotes the step size from the relation $r$ to entity $i$ in the path from entity $i$ to entity $j$, i.e. the step distance from the edge $r$ to the node $i$. The symbol shown in Figure BDA0003764544540000074 denotes the path information of the relation $r$ in the path from entity $j$ to entity $i$. The symbol shown in Figure BDA0003764544540000075 denotes the step size from the relation $r$ to entity $j$ in the path from entity $j$ to entity $i$, i.e. the step distance from the edge $r$ to the node $j$. $w_r$ represents the weight of the relation $r$. $L(P_{ij})$ represents the path length between entity $i$ and entity $j$, i.e. the inter-node distance.
Referring to FIG. 3, in sub-graph 301, $u_i$, $u_j$, and $u_k$ represent user nodes, and $d_1$, $d_2$, $d_3$, and $d_4$ represent nodes of other types. Relations r1, r2, and r3 exist between $u_i$ and $u_j$; r1 and r3 are at a distance of 2 from $u_i$, and r2 is at a distance of 1 from $u_i$. Therefore,
Figure BDA0003764544540000076
In equation (3), the attenuation factor is
Figure BDA0003764544540000077
or
Figure BDA0003764544540000078
This choice is made because the attenuation factor needs to be greater than 1, a fractional value is inconvenient to compute with, and a value greater than 2 attenuates excessively, making the calculation result too small.
According to the embodiments of the present application, because the function shown in Figure BDA0003764544540000081 is directional (i.e. Figure BDA0003764544540000082), a bidirectional calculation is chosen to make the final edge weights symmetric (i.e. the edge weight from $u_i$ to $u_j$ equals the edge weight from $u_j$ to $u_i$), so that the edge weight computed over the path from $u_i$ to $u_j$ is the same as that computed over the path from $u_j$ to $u_i$.
In general, in the existing user knowledge graph, the shortest path length between user nodes is fixed to 2, and thus, the above equation (3) can be modified to equation (4) shown below:
Figure BDA0003764544540000083
The edge weight $W_{ij}$ from user node $i$ to user node $j$ is thus obtained, and the weight matrix of the homogeneous graph corresponding to the subgraph is constructed in the manner described above.
In the step attenuation method, the sum of the second weights or the sum of the third weights may be used as the edge weights, or the average of the sum of the second weights and the sum of the third weights may be used as the edge weights.
Because the step data in the original knowledge graph are taken into account when the heterogeneous graph is converted into the homogeneous graph, the node centrality information in the heterogeneous graph is not lost, and the generated homogeneous graph is more interpretable.
As yet another example, several intermediate nodes may lie between two target type entities, and these nodes have different importance in the knowledge graph, so this importance needs to be reflected in the constructed homogeneous graph. A fourth weight sum of all edges in the path from a fifth target type node to a sixth target type node of the plurality of target type nodes may be determined, and the fourth weight sum may be divided by the sum of the node degrees of the other type nodes present in the path to obtain the edge weight between the fifth target type node and the sixth target type node, where the node degree of a node is the number of edges associated with that node.
For example, the node degree penalty may be utilized to divide the sum of the weights of all the edges in the path between the fifth target type node and the sixth target type node by the sum of the degrees of other types of nodes present in the path to obtain the edge weight between the fifth target type node and the sixth target type node.
For example, calculating the edge weight between the fifth and sixth target type nodes using the node degree penalty method may be implemented according to equation (5) below:
$$W_{ij} = \frac{\sum_{r \in R} w_r}{D(O)} \tag{5}$$
where W_ij denotes the computed weight between the two target type entities i and j in the homogeneous graph, D(O) denotes the sum of the degrees of the other types of entities on the path from entity i to entity j, r ∈ R denotes a relation r on the path P_ij between the two entities, R being the set of relations on that path, and w_r denotes the weight of relation r.
Referring to FIG. 3, the path from node u_i to node u_j includes node d2. Node d2 has only in-degree (an arrow pointing at a node counts toward its in-degree, and an arrow pointing away from a node counts toward its out-degree; for example, u_j has only out-degree, which is 3), and the in-degree of d2 is 3. The sum of degrees refers to the sum of the in-degree and the out-degree. In sub-graph 301, d2 has only in-degree, so only the in-degree needs to be counted.
In general, in the existing user knowledge graph, there is only one intermediate node middle_node between user nodes, and thus, equation (5) can be modified to equation (6) shown below:
$$W_{ij} = \frac{\sum_{r \in R} w_r}{D(\text{middle\_node})} \tag{6}$$
Since middle_node has only in-degree, D(middle_node) denotes the in-degree of the intermediate node.
In the process of converting the heterogeneous graph into the homogeneous graph, the node degree information in the original knowledge graph is taken into account, so that the generated homogeneous graph is more interpretable.
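The node degree penalty can be illustrated with a few lines of Python. This is a sketch under the description above (sum of relation weights on the path divided by the summed degree of the intermediate nodes); the function name and the sample numbers are hypothetical.

```python
def node_degree_penalty_weight(relation_weights, intermediate_degrees):
    """Edge weight between two target type nodes under the node degree penalty.

    relation_weights: weights w_r of the relations on the path between them.
    intermediate_degrees: degrees of the other-type nodes on that path.
    """
    total_degree = sum(intermediate_degrees)
    if total_degree == 0:
        raise ValueError("the path must contain at least one intermediate node with non-zero degree")
    return sum(relation_weights) / total_degree


# One intermediate node with in-degree 3 (as for d2 in FIG. 3),
# reached through two relations of weight 1.5 each (illustrative values).
w = node_degree_penalty_weight([1.5, 1.5], [3])
```

An intermediate node shared by many users has a high degree and therefore contributes a weaker link between any two of them.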
As another example, a bipartite graph of a knowledge graph sub-graph may be constructed for a plurality of target type nodes using a node similarity method, then a node similarity matrix in the bipartite graph is calculated, and values corresponding to elements in the node similarity matrix for the target type nodes are taken as edge weights between the target type nodes.
For example, depending on business requirements, the target type entities may be regarded as one type of node and all other types of entities may be regarded as a second type of node. Only the edges connecting target type entities to other types of entities are retained, thereby constructing a bipartite graph of the target type nodes. For the constructed bipartite graph, the SimRank++ algorithm is used to compute the node similarity matrix of the bipartite graph. The node similarity matrix in the bipartite graph can be calculated according to the following equation (7):
$$S_k = c P^{T} S_{k-1} P + I_n - \mathrm{Diag}\left(c P^{T} S_{k-1} P\right) \tag{7}$$
where the matrix P is the transition probability matrix between the nodes of the bipartite graph obtained by converting the existing knowledge graph subgraph according to business requirements. The final result S_k is the similarity matrix between the nodes generated after k rounds of iteration. The function Diag() returns the diagonal matrix of its input matrix; the term I_n - Diag(cP^T S_{k-1} P) is added so that the diagonal elements of S_k are set to 1, i.e., the similarity between any object and itself is 1.
In the actual calculation, two similarity matrices (a user node similarity matrix of size n×n and a similarity matrix for the other types of nodes of size m×m) are constructed in place of the similarity matrix n×m×n×m over all types of nodes, which saves memory and speeds up the calculation. In code, the user node similarity matrix and the similarity matrix for the other types of nodes can be constructed directly.
The similarity matrix of the target type nodes calculated by this algorithm is used as the weight matrix of the homogeneous graph.
In general, each line of data in the subgraph file, which has the form "user node 1 - edge weight of relation 1 - intermediate node - edge weight of relation 2 - user node 2", is converted into a csv file of the form "user node - edge weight - intermediate node"; that is, the subgraph is converted into a file with a bipartite graph structure. The similarity matrix between the user nodes, i.e., the weight matrix of the homogeneous graph, is then calculated according to equation (7).
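The iteration described for equation (7) can be sketched in pure Python as below. This is a plain matrix-form SimRank-style iteration on a small bipartite graph, not the full SimRank++ algorithm (the evidence and edge-weight refinements are omitted), and it uses a single matrix over all nodes for brevity rather than the two separate matrices mentioned above. The decay constant `c = 0.8`, the iteration count, and all function names are assumptions.

```python
def column_normalise(adj):
    """Turn an adjacency matrix into a column-stochastic transition matrix P."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def simrank_matrix(adj, c=0.8, iterations=5):
    """Iterate S_k = c * P^T S_{k-1} P, then force the diagonal to 1."""
    n = len(adj)
    p = column_normalise(adj)
    pt = [list(row) for row in zip(*p)]  # transpose of P
    s = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # S_0 = I
    for _ in range(iterations):
        m = matmul(matmul(pt, s), p)
        # Setting the diagonal to 1 is equivalent to adding I - Diag(c * m).
        s = [[1.0 if i == j else c * m[i][j] for j in range(n)] for i in range(n)]
    return s


# Nodes 0 and 1 are users, node 2 is the single intermediate node they share.
adj = [[0, 0, 1],
       [0, 0, 1],
       [1, 1, 0]]
sim = simrank_matrix(adj)
user_sim = [row[:2] for row in sim[:2]]  # user-user block used as edge weights
```

Only the user-user block is kept, since that block plays the role of the weight matrix of the homogeneous graph.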
The above listed edge weighting algorithm methods are merely exemplary and the present application is not limited thereto.
The edge weights finally obtained are in matrix form, and the rows and columns of the matrix correspond to the index values of the node list of the graph subgraph. For example, if the value in the ith row and jth column of the weight matrix is 2.45, the edge weight between the ith node and the jth node in the node list is 2.45. In the manner described above, sub-graph 301 shown in FIG. 3 may be constructed as a user-node-based homogeneous graph, as shown in FIG. 4. The homogeneous graph shown in FIG. 4 includes only the user nodes and the edges (relations) between them. Each edge has a corresponding edge weight, which reflects the strength of the connection between the users: the greater the edge weight, the stronger the connection between the two users.
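As a small illustration of how the weight matrix maps back onto the node list, the following sketch converts a symmetric weight matrix into a weighted edge list. The function name, the node names, and the assumption that a zero entry means "no edge" are all illustrative.

```python
def weight_matrix_to_edges(nodes, weights):
    """List (node_i, node_j, weight) for every non-zero entry above the diagonal.

    Assumes the matrix is symmetric and indexed in the same order as `nodes`.
    """
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if weights[i][j] > 0:
                edges.append((nodes[i], nodes[j], weights[i][j]))
    return edges


nodes = ["u1", "u2", "u3"]
weights = [[0.0, 2.45, 0.0],
           [2.45, 0.0, 1.0],
           [0.0, 1.0, 0.0]]
edges = weight_matrix_to_edges(nodes, weights)
```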
After the edge weight matrix of the homogeneous graph is obtained, it is combined with the test subgraph to generate the weighted homogeneous graph of target type nodes. The weights mentioned in this application refer to edge weights. The target type nodes serve as the entities of the homogeneous graph, and the edge weights in the edge weight matrix serve as the relations of the homogeneous graph, thereby generating the homogeneous graph corresponding to the extracted subgraph, i.e., a weighted homogeneous graph.
The edge weight generation algorithms take the original knowledge graph as a whole into account. The homogeneous graph reconstructed by an edge weight generation algorithm from the structural information of the existing knowledge graph is therefore more interpretable, which helps it fit the specific downstream task, improves the effect of the downstream task, and greatly improves the recognition rate of abnormal groups.
In step S103, N population division algorithms are applied to each of the M homogeneous maps to obtain N population division results corresponding to each homogeneous map, where N is an integer greater than 1. In the present application, the population division algorithms include population discovery algorithms.
In the present application, in step S102, a plurality of different homogeneous graphs are obtained by using different edge weight generating algorithms, and for each homogeneous graph, the homogeneous graph is clustered by using a different group partitioning algorithm, so as to obtain a corresponding group partitioning result.
The population division algorithms may include at least two of a modularity-based population division algorithm, an edge-cut-based population division algorithm, and a vector-similarity-based spectral clustering algorithm. For example, the generated homogeneous graph may be grouped using the Louvain algorithm, a bottom-up algorithm based on modularity. It may be grouped using the Girvan-Newman algorithm, a top-down algorithm based on edge cutting. It may also be grouped using the Spectral Clustering algorithm, a distance algorithm based on vector similarity.
The population division algorithms described above are merely exemplary, and other population division algorithms and population discovery algorithms may be used to group the homogeneous graphs.
In step S104, N population division results corresponding to each of the M homogeneous maps are scored, and a target edge weight generation algorithm and a target population division algorithm corresponding to the highest score are selected from the M edge weight generation algorithms and the N population division algorithms.
The obtained population division results may be evaluated, and the edge weight generation algorithm and the population division algorithm corresponding to the highest evaluation score may be selected from the plurality of edge weight generation algorithms and the plurality of population division algorithms described above.
A good grouping result requires high similarity within groups and low similarity between groups. The requirement for group division here is that the members of an abnormal group are placed in the same group, and that users linked to the abnormal group (i.e., users whose behavior is highly similar to that of the abnormal group) are also placed in that group. Therefore, several evaluation methods can be used together to compare and analyze the group division results and to select the optimal combination of algorithms.
An evaluation index without real label data and an evaluation index with real label data may be calculated for each group division result. The evaluation index without real label data is not determined from real annotation labels, whereas the evaluation index with real label data is determined from real annotation labels. A real annotation label indicates the group to which each of the plurality of target type nodes belongs. For example, the real labels of the nodes are generated by manually annotating the group to which each node belongs. The score of each group division result is then calculated from the evaluation index without real label data and the evaluation index with real label data.
Specifically, an evaluation index without real label data does not require the group labels of the user nodes when scoring the model, so the group label of each user node does not need to be annotated manually. An evaluation index with real label data requires the group labels of the user nodes when scoring the model, so the user nodes must be labeled manually, and the score is then computed from the manually annotated labels and the predicted labels. A label indicates which group a user node belongs to. In the group division method provided in the embodiments of the present application, manual labels indicate which group each user belongs to; for example, user 1 is a student, user 2 is a teacher, and user 3 is a principal.
As an example, the evaluation index without real label data and the evaluation index with real label data may be calculated for each of the obtained group division results, and the evaluation score of each group division result may be calculated from these two indexes. For example, the modularity of each group division result may be calculated with a modularity algorithm and used as the evaluation index without real label data, and the adjusted Rand index (ARI) of each group division result may be calculated from the pre-annotated group labels of the knowledge graph subgraph and used as the evaluation index with real label data. The final evaluation score may be obtained as a linear combination of the modularity and the adjusted Rand index.
For example, the group classification algorithm may be evaluated according to the modularity using the following equation (8) to determine the classification effect:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j) \tag{8}$$
where A_ij is the adjacency matrix corresponding to the weight matrix,
$$m = \frac{1}{2} \sum_{i,j} A_{ij}$$
is the total number of edges in the adjacency matrix, and the degree of node i is denoted k_i = Σ_j A_ij. The group in which node i is located is C_i ∈ {1, 2, …, q}. δ(u, v) is the Kronecker delta function: δ(u, v) = 1 when u = v, and 0 otherwise.
The modularity Q takes values in [-0.5, 1]. In general, a value between 0.3 and 0.7 indicates a good grouping effect, and the larger the modularity Q, the better the grouping result.
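The modularity can be computed directly from a weight matrix and a community assignment. The sketch below uses the standard weighted modularity definition with the symbols defined above (A_ij, k_i, m, C_i); it assumes that equation (8) has this standard form, and the function name and toy graph are illustrative.

```python
def modularity(adj, communities):
    """Standard weighted modularity Q for a symmetric adjacency/weight matrix.

    adj[i][j] is the edge weight A_ij; communities[i] is the group C_i of node i.
    """
    two_m = sum(sum(row) for row in adj)  # equals 2m
    if two_m == 0:
        return 0.0
    degree = [sum(row) for row in adj]    # k_i = sum_j A_ij
    q = 0.0
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:  # Kronecker delta
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Two disjoint edges: 0-1 and 2-3.
adj = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
q_split = modularity(adj, [0, 0, 1, 1])   # each edge is its own group
q_merged = modularity(adj, [0, 0, 0, 0])  # everything in one group
```

Placing each connected pair in its own group gives a higher Q than merging all nodes, which is the behavior the index is meant to reward.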
In addition, after the subgraph is acquired in step S101, the target type nodes in the subgraph may be labeled in groups manually according to expert knowledge, so as to form a final test subgraph. The ARI evaluation index of the clusters can be obtained according to the artificially marked real population partition labels, as shown in the following equation (9):
Figure BDA0003764544540000123
wherein n is ij =|P i ∩T j |
Figure BDA0003764544540000124
Figure BDA0003764544540000125
Figure BDA0003764544540000126
where P is the set of user groups predicted by the group discovery algorithm and T is the set of real, manually annotated user groups. |P| is the number of predicted clusters and |T| is the number of real clusters. n_ij is the number of elements in the intersection of predicted cluster i and real cluster j. ARI takes values in [-1, 1]. ARI indicates the degree of agreement between the predicted labels and the real labels, and a larger value indicates a better grouping effect.
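For reference, the adjusted Rand index can be computed from the contingency counts n_ij as sketched below. This follows the standard ARI definition, which equation (9) is assumed to match; the function name and the toy label lists are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(predicted, truth):
    """Standard ARI between predicted group labels and manually annotated labels."""
    n = len(predicted)
    if n < 2:
        return 1.0
    n_ij = Counter(zip(predicted, truth))  # sizes of P_i ∩ T_j
    pred_sizes = Counter(predicted)        # sizes of predicted clusters
    true_sizes = Counter(truth)            # sizes of real clusters
    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_p = sum(comb(c, 2) for c in pred_sizes.values())
    sum_t = sum(comb(c, 2) for c in true_sizes.values())
    expected = sum_p * sum_t / comb(n, 2)
    max_index = (sum_p + sum_t) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_ij - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # same grouping, renamed labels
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # groupings that cut across each other
```

The index depends only on which nodes are grouped together, not on the label names, so the predicted group identifiers do not need to match the annotated ones.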
Finally, two evaluation indexes can be considered in combination, for example, the final evaluation score can be determined using the following equation (10):
Score=(1-λ)×Q+λ×ARI (10)
where λ is a hyperparameter, the modularity Q is the evaluation index for group discovery, ARI is the evaluation index for clustering performance, and Score is the resulting final evaluation index. The larger the final evaluation index Score, the better the structure of the corresponding homogeneous graph meets the actual business requirement, namely population division.
The above-described grouping evaluation method is merely exemplary, and the present application is not limited thereto, and other evaluation methods may be employed.
The edge weight generation algorithm and the population division algorithm corresponding to the maximum evaluation index Score are retained for the large-scale graph data processing in the next step.
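The selection over all M × N combinations using equation (10) can be sketched as follows. The algorithm names are used only as dictionary keys, λ = 0.5 is an arbitrary choice, and the Q and ARI values are made-up placeholders, not results from the application.

```python
def combined_score(q, ari, lam=0.5):
    """Score = (1 - lambda) * Q + lambda * ARI, as in equation (10)."""
    return (1 - lam) * q + lam * ari


def select_best(results, lam=0.5):
    """Return the (edge weight algorithm, division algorithm) pair with the highest Score.

    results maps (edge_weight_algorithm, division_algorithm) -> (Q, ARI).
    """
    return max(results, key=lambda key: combined_score(*results[key], lam))


# Placeholder evaluation values, for illustration only.
results = {
    ("distance_penalty", "louvain"): (0.40, 0.50),
    ("node_degree_penalty", "louvain"): (0.55, 0.70),
    ("node_similarity", "girvan_newman"): (0.60, 0.30),
}
best = select_best(results, lam=0.5)
```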
The method can create a weighted homogeneous graph according to business requirements and select among different edge weight generation schemes according to those requirements, so that a more suitable edge weight generation scheme can be screened out for the specific downstream task, improving the effect of that task. For example, when determining the importance of the target type nodes, such as whether a node is a central node, the step attenuation method may be used to make the edge weights between the target type nodes more differentiated, and the PageRank algorithm may then be used to obtain the node importance scores.
In addition, by comprehensively considering multiple aspects of evaluation indexes and selecting an edge weight generation algorithm and a group division algorithm which are suitable for existing knowledge maps to be clustered, the recognition rate of abnormal groups can be greatly improved.
In addition, screening the edge weight generation scheme and the group discovery algorithm on test data can reduce the actual engineering cost. For example, if representative test data is selected, the scheme validated on it can be applied to big data, avoiding engineering rework caused by choosing the wrong scheme.
In step S105, a homogeneous map of the target knowledge map is generated by the target edge weight generation algorithm, and the homogeneous map of the target knowledge map is subjected to population division by the target population division algorithm.
The homogeneous map of the target knowledge graph is generated using the selected edge weight generation algorithm, and group division is performed on the homogeneous map of the target knowledge graph using the selected population division algorithm.
Specifically, the target knowledge graph may be divided into distributed graphs according to a node segmentation strategy. The distributed graphs are reconstructed into homogeneous graphs using the selected edge weight generation algorithm, and group division is performed on the homogeneous graphs of the distributed graphs in parallel using the selected population division algorithm, so as to obtain the group division result of the whole target knowledge graph. Parallel computation over the distributed graphs can increase the grouping speed for the original knowledge graph.
Using a distributed framework for large-scale graph data processing, the target knowledge graph is reconstructed into a homogeneous graph with the selected optimal edge weight generation algorithm, and the homogeneous graph is stored in a distributed file system according to the node segmentation strategy. Based on the obtained homogeneous graph, the whole graph is grouped using the selected population division algorithm with distributed graph parallel computation, and the groups are labeled by analyzing the characteristics of the user data in each group.
The method and the device can create a homogeneous graph with weights according to service requirements, and select different edge weight generation schemes according to the service requirements. For example, aiming at the situation that the graph neural network in the graph deep learning is mostly suitable for the homogeneous graph, the application can better adapt to the business logic of the graph neural network, and a better grouping effect is obtained.
According to the embodiments of the present application, a weighted homogeneous graph based on the target type nodes is obtained using the edge weight generation algorithm, and the recognition rate of abnormal groups can be improved with the help of the group discovery algorithm. In addition, whether a target type node has potentially abnormal behavior can be judged from its neighbors and the weights of the edges connecting it to them.
As an example, in a scenario where the groups are clubs, the group division method provided by the embodiments of the present application may be implemented as follows. A club user knowledge graph stored in a neo4j graph database is taken as the target knowledge graph, and club user nodes are taken as the target type nodes. After the target edge weight generation algorithm and the target population division algorithm are selected from the multiple edge weight generation algorithms and the multiple population division algorithms using test data, the club user knowledge graph is converted into the corresponding homogeneous graph by the target edge weight generation algorithm, and the club user nodes in the homogeneous graph are divided by the target population division algorithm, so that club user nodes having the same attribute or belonging to the same type are placed in the same club. Specifically, the clubs may include school clubs such as singing, dance, drawing, and swimming clubs. When a large number of students need to be divided, the knowledge graph containing the student information is processed with the target edge weight generation algorithm and the target population division algorithm, so that the students can be divided into the corresponding clubs efficiently. The above example is merely illustrative, and the present application is not limited thereto.
FIG. 2 is a flow chart of a population division method according to another embodiment of the present application. FIG. 2 takes an existing user knowledge graph as an example and illustrates the construction of a user-based homogeneous graph from the original heterogeneous graph.
Referring to FIG. 2, in step S201, a test heterogeneous-graph subgraph is extracted from the original heterogeneous graph. For example, the heterogeneous-graph subgraph may be extracted by a breadth-first algorithm.
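A breadth-first extraction of a test subgraph can be sketched as below. This is an in-memory illustration only: the adjacency-dictionary representation, the `max_nodes` cutoff, and the node names are assumptions, whereas the application extracts the subgraph from a graph database.

```python
from collections import deque


def bfs_subgraph(adjacency, start, max_nodes):
    """Collect up to max_nodes nodes in breadth-first order from `start`,
    then keep only the edges whose endpoints were both collected."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                visited.append(neighbour)
                queue.append(neighbour)
                if len(visited) == max_nodes:
                    break
    kept = set(visited)
    edges = [(a, b) for a in visited for b in adjacency.get(a, []) if b in kept]
    return visited, edges


# Users u1..u3 linked through intermediate nodes d1 and d2 (illustrative data).
adjacency = {
    "u1": ["d1"],
    "d1": ["u1", "u2"],
    "u2": ["d1", "d2"],
    "d2": ["u2", "u3"],
    "u3": ["d2"],
}
sub_nodes, sub_edges = bfs_subgraph(adjacency, "u1", max_nodes=3)
```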
In step S202, grouping and labeling are performed on the user nodes in the heterogeneous graph subgraph by manual mode according to expert knowledge, so as to form a final test subgraph.
In step S203, the hetero-graph subgraphs are constructed into corresponding homogeneity graphs using a plurality of edge weight generation algorithms.
For example, with the distance penalty method, the sum of the weights of all edges in the path is penalized according to the path length between the user entities, as shown in equation (1). With the step attenuation method, for each path between user entities, attenuation factors are set according to the step size between each edge and the user node, and the weighted sum of the edges is taken as the final edge weight, as shown in equation (3). With the node degree penalty method, the sum of the edge weights in the path between user entities is penalized by the total degree of the intermediate nodes present in the path, as shown in equation (5). With the node similarity method, the similarity calculated by a graph node similarity algorithm is used as the edge weight between the user nodes, as shown in equation (7). Each of these edge weight generation algorithms yields an edge weight matrix of the homogeneous graph.
In step S204, the obtained edge weight matrix is combined with the test subgraph to generate a user homogeneous graph with edge weight. For example, using the four edge weight generation algorithms above, four corresponding user homogeneity maps can be obtained.
In step S205, a plurality of group discovery algorithms are adopted to conduct group division on the user homogeneous graphs generated by the different edge weight generation algorithms, and the group division results are analyzed in a comparison mode to evaluate each user homogeneous graph.
For example, the user homogeneous graphs may be grouped using population discovery algorithms such as a bottom-up algorithm based on modularity, a top-down algorithm based on edge cutting, and a distance algorithm based on vector similarity.
In step S206, a final evaluation score of the group classification result is obtained according to the user-defined classification effect evaluation method.
The final evaluation score is obtained as a linear combination of the modularity and the adjusted Rand index, as shown in equation (10). The modularity is an evaluation index that requires no real label data, and the adjusted Rand index is an evaluation index that requires real label data.
In step S207, the optimal user homogeneity map is screened out according to the final evaluation score, and the corresponding edge weight generation algorithm and population discovery algorithm are obtained as the optimal combination.
The larger the final evaluation score is, the more the construction of the corresponding user homogeneity map meets the requirement of actual business-group division. And reserving an edge weight generation algorithm and a group discovery algorithm corresponding to the maximum final evaluation score for large-scale graph data processing in the next step.
In step S208, the original heterogeneous graph is reconstructed into a homogeneous graph by using the distributed framework of large-scale graph data processing and adopting the screened optimal edge weight generation algorithm, and the homogeneous graph is stored in the distributed file system according to the node segmentation strategy.
In fig. 2, KG represents a graph database storing existing user knowledge graphs.
In step S209, the whole graph is clustered by using the screened optimal group discovery algorithm and the distributed graph parallel computing method, and the clusters are labeled by analyzing the characteristics of the user data in each cluster.
For example, according to experimental results, among the four edge weight generation schemes, the node degree penalty method is the most suitable for constructing a weighted user homogeneous graph from an existing user knowledge graph and applying it to a bottom-up algorithm based on modularity; combining the two can effectively improve the grouping effect.
Fig. 5 is a block diagram of a group partitioning apparatus according to an embodiment of the present application.
Referring to FIG. 5, the group partitioning apparatus 500 may include a subgraph selection module 501, a homogeneous map generation module 502, a population division module 503, a division result evaluation module 504, and a grouping module 505. Each module in the group partitioning apparatus 500 may be implemented by one or more modules, and the names of the corresponding modules may vary according to the types of the modules. In various embodiments, some modules in the group partitioning apparatus 500 may be omitted, or additional modules may be included. Furthermore, modules/elements according to various embodiments of the present application may be combined to form a single entity, which can equivalently perform the functions of the respective modules/elements prior to combination.
The subgraph selection module 501 may select a knowledge graph subgraph for a target type node from the target knowledge graphs. For example, the subgraph selection module 501 may extract a graph subgraph from the existing knowledge graph through an apoc function, derive the graph subgraph, and store the graph subgraph as a csv format file as test data.
The homograph generation module 502 may use a plurality of edge weight generation algorithms to calculate edge weights between the target type nodes on the knowledge-graph sub-graph, respectively, to generate a plurality of homographs corresponding to the knowledge-graph sub-graph.
The homogeneous map generation module 502 may use M kinds of edge weight generation algorithms to calculate edge weights between a plurality of target type nodes in the knowledge-graph sub-graph, respectively, to generate M homogeneous maps corresponding to the knowledge-graph sub-graph, where M is an integer greater than 1.
For example, taking the calculation of the edge weight between two target type nodes as an example, the homogeneous map generation module 502 can use the distance penalty method to calculate the sum of the first weights of all edges in the path from a first target type node to a second target type node of the plurality of target type nodes, and divide the sum of the first weights by the order of the path to obtain the edge weight between the first target type node and the second target type node; for example, the edge weight may be calculated using equation (1).
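The distance penalty can be illustrated as below. This is a minimal sketch that assumes the "order of the path" means the number of edges on the path; equation (1) itself is not reproduced in this text, so the exact penalty may differ. The function name is hypothetical.

```python
def distance_penalty_weight(path_edge_weights):
    """Sum of the edge weights on the path divided by the path order.

    The order is taken to be the number of edges on the path (an assumption).
    """
    order = len(path_edge_weights)
    if order == 0:
        raise ValueError("the path must contain at least one edge")
    return sum(path_edge_weights) / order


# Two users connected through one intermediate node: a path with two edges.
w = distance_penalty_weight([1.0, 3.0])
```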
The homogeneity-map generation module 502 may apply a first attenuation factor to weights of all edges in a path between a third target type node to a fourth target type node of the plurality of target type nodes using a step-size attenuation method, determine edge weights between the third target type node and the fourth target type node based on a sum of second weights of all edges to which the first attenuation factor is applied, wherein the first attenuation factor is determined based on a step size of the respective edge from the third target type node in the path.
In view of node centrality, the homogeneity map generation module 502 may employ a bi-directional calculation method according to symmetry of edge weights, apply a second attenuation factor to weights of all edges in a path between a fourth target type node and a third target type node after calculating the second weight sum in the above manner to determine a third weight sum of all edges to which the second attenuation factor is applied, and take a mean value of the second weight sum and the third weight sum as an edge weight between the third target type node and the fourth target type node, where the second attenuation factor is determined based on a step size of the corresponding edge from the fourth target type node in the path, such as calculating the edge weight according to equation (3).
The homogeneity map generation module 502 may determine a fourth weight sum of all edges in a path between a fifth target type node to a sixth target type node of the plurality of target type nodes using a node degree penalty, dividing the fourth weight sum by a sum of node degrees of other types of nodes present in the path to obtain an edge weight between the fifth target type node and the sixth target type node, wherein the node degree is a number of edges associated with the corresponding node, such as calculating the edge weight according to equation (5).
The homogeneous graph generation module 502 may construct a bipartite graph of the knowledge graph subgraph for a plurality of target type nodes using a node similarity method; a node similarity matrix in the bipartite graph is calculated, and values corresponding to elements in the node similarity matrix for the target type nodes are taken as edge weights between the target type nodes, such as edge weights calculated according to equation (7).
After obtaining a plurality of homogeneous maps using different edge weight generation algorithms, the population classification module 503 may apply a plurality of population classification algorithms to each homogeneous map of the plurality of homogeneous maps to obtain a plurality of population classification results for each homogeneous map.
The population division module 503 may apply N population division algorithms to each of the M homogeneous maps to obtain N population division results corresponding to each homogeneous map, where N is an integer greater than 1.
The population division algorithm may include, for example, a modularity-based population division algorithm, a cut-edge-based population division algorithm, a vector similarity-based spectral clustering algorithm, and the like.
The division result evaluation module 504 may evaluate the obtained group division results and select the edge weight generation algorithm and the group division algorithm corresponding to the highest evaluation score from the plurality of edge weight generation algorithms and the plurality of group division algorithms described above.
The partition result evaluation module 504 may score N population partition results corresponding to each of the M homogeneous maps and select a target edge weight generation algorithm and a target population partition algorithm corresponding to the highest score from the M edge weight generation algorithms and the N population partition algorithms.
As an example, for each group division result, the division result evaluation module 504 may calculate an evaluation index without real label data and an evaluation index with real label data, and calculate the evaluation score of the group division result from these two indexes.
The division result evaluation module 504 may calculate the evaluation index without real label data and the evaluation index with real label data for each population division result. The evaluation index without real label data is not determined from real annotation labels, whereas the evaluation index with real label data is determined from real annotation labels. The score of each group division result is calculated from the two indexes.
For example, the division result evaluation module 504 may calculate the modularity of each group division result using a modularity algorithm as the evaluation index without real label data of that result, calculate the adjusted Rand index of each group division result from the pre-annotated group labels of the knowledge graph subgraph as the evaluation index with real label data of that result, and obtain the evaluation score of each group division result as a linear combination of its modularity and adjusted Rand index; for example, the comprehensive evaluation index of each group division result may be calculated according to equations (8) to (10) above.
The grouping module 505 may generate a homogeneous graph of the target knowledge graph using the selected edge weight generation algorithm (i.e., the target edge weight generation algorithm), and perform group division on the homogeneous graph of the target knowledge graph using the selected group division algorithm (i.e., the target group division algorithm).
In one implementation, the grouping module 505 may divide the target knowledge graph into distributed graphs according to a node segmentation strategy, reconstruct the distributed graphs into homogeneous graphs using the selected edge weight generation algorithm, perform group division on the homogeneous graphs of the distributed graphs in parallel using the selected group division algorithm, and finally obtain the group division result of the target knowledge graph using a merging algorithm.
For example, according to the optimal edge weight generation scheme screened out by the division result evaluation module 504, such as the node degree penalty method, the entire user knowledge graph is reconstructed from the original user knowledge graph using a distributed processing framework to generate a weighted user homogeneous graph, and the weighted user homogeneous graph is then clustered using the optimal group discovery algorithm, such as Louvain.
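The reconstruction of a weighted user homogeneous graph with the node degree penalty method can be illustrated by the single-machine sketch below. It rests on several assumptions that are not taken from the present application: only two-hop paths of the form user, other-type node, user are considered; the contributions of multiple paths between the same pair of users are summed; and the function name and data representation are hypothetical. For each path, the sum of the edge weights is divided by the node degree of the intermediate other-type node, where the node degree is the number of edges associated with that node.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def node_degree_penalty_weights(
    edges: Dict[Edge, float],
    target_nodes: Set[str],
) -> Dict[Edge, float]:
    """Build weighted edges between target type nodes from a heterogeneous graph.

    Assumption: only two-hop paths target -> other-type node -> target are used,
    and direct edges between two target type nodes are ignored.
    """
    neighbours: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbours[a][b] = w
        neighbours[b][a] = w

    weights: Dict[Edge, float] = defaultdict(float)
    for mid, adjacent in neighbours.items():
        if mid in target_nodes:
            continue
        degree = len(adjacent)  # number of edges associated with the other-type node
        targets = sorted(n for n in adjacent if n in target_nodes)
        for u, v in combinations(targets, 2):
            path_weight_sum = adjacent[u] + adjacent[v]
            # Assumption: contributions of several paths between u and v are summed.
            weights[(u, v)] += path_weight_sum / degree
    return dict(weights)
```

Under these assumptions, an other-type node shared by many users (for example, a public Wi-Fi identifier) contributes less weight per user pair than a node shared by only two users, which is the intended effect of penalizing by node degree.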
The group division process has been described in detail above with reference to fig. 1 to 4, and will not be described in detail here.
Fig. 6 is a schematic structural diagram of a group division device in a hardware operating environment according to an embodiment of the present application. Here, the group division device 600 may implement the above-described function of effective grouping.
As shown in fig. 6, the group division device 600 may include: a processing component 601, a communication bus 602, a network interface 603, an input/output interface 604, a memory 605, and a power supply component 606. The communication bus 602 enables communication among these components. The input/output interface 604 may include a video display (such as a liquid crystal display), a microphone and speaker, and a user interaction interface (such as a keyboard, mouse, or touch input device), and may optionally also include a standard wired interface and a wireless interface. The network interface 603 may optionally include a standard wired interface and a wireless interface (e.g., a wireless fidelity interface). The memory 605 may be a high-speed random access memory or a stable nonvolatile memory. The memory 605 may alternatively be a storage device separate from the processing component 601 described above.
Those skilled in the art will appreciate that the structure shown in fig. 6 does not limit the group division device 600, which may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components.
As shown in fig. 6, an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, a program related to a group division method, and a database may be included in the memory 605 as one storage medium.
In the group division device 600 shown in fig. 6, the network interface 603 is mainly used for data communication with external devices/terminals, and the input/output interface 604 is mainly used for data interaction with a user. The processing component 601 and the memory 605 may be provided in the group division device 600, and the group division device 600, through the processing component 601, calls the program for implementing the group division method of the present application stored in the memory 605 and various APIs provided by the operating system, thereby executing the group division method and the like provided by the embodiments of the present application.
The processing component 601 may include at least one processor, and the memory 605 stores a set of computer-executable instructions that, when executed by the at least one processor, perform the group division method according to an embodiment of the present application. Further, the processing component 601 can perform the group division process and the like described above. However, the above examples are merely exemplary, and the present application is not limited thereto.
By way of example, the group division device 600 may be a personal computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the group division device 600 is not necessarily a single electronic device, but may be any apparatus or collection of circuits capable of executing the above-described instructions (or instruction sets) individually or in combination. The group division device 600 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the group division device 600, the processing component 601 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processing component 601 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and so forth.
The processing component 601 may execute instructions or code stored in the memory 605, and the memory 605 may also store data. Instructions and data may also be transmitted and received over a network via the network interface 603, wherein the network interface 603 may employ any known transmission protocol.
The memory 605 may be integrated with the processor, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory 605 may include a stand-alone device, such as an external disk drive, a storage array, or any other storage device that may be used by a database system. The memory and the processor may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., such that the processor is able to read files stored in the memory.
According to embodiments of the present application, an electronic device may be provided. Fig. 7 is a block diagram of an electronic device 700 according to an embodiment of the present application, which may include at least one memory 702 and at least one processor 701, the at least one memory 702 storing a set of computer-executable instructions that, when executed by the at least one processor 701, perform the group division method according to an embodiment of the present application.
The processor 701 may include a Central Processing Unit (CPU), an audio processor, a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 701 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and so forth.
The memory 702, which is one storage medium, may include an operating system (e.g., a MAC operating system), a data storage module, a network communication module, a user interface module, a recommendation module, and a database.
The memory 702 may be integrated with the processor 701, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. In addition, the memory 702 may include a separate device, such as an external disk drive, storage array, or other storage device usable by any database system. The memory 702 and the processor 701 may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., such that the processor 701 is able to read files stored in the memory 702.
In addition, the electronic device 700 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 700 may be connected to each other via a bus and/or a network.
By way of example, the electronic device 700 may be a personal computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the electronic device 700 is not necessarily a single electronic device, but may be any apparatus or collection of circuits capable of executing the above-described instructions (or instruction set) individually or in combination. The electronic device 700 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is not limiting, and the electronic device 700 may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components.
According to an embodiment of the present application, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the group division method according to the present application. Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid-state drives (SSD), card memory (such as multimedia cards, Secure Digital (SD) cards, or Extreme Digital (xD) cards), magnetic tape, floppy disks, magneto-optical data storage, and any other device configured to store computer programs and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the programs. The computer programs in the computer-readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, or server. Further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
In accordance with embodiments of the present application, there may also be provided a computer program product comprising instructions executable by a processor of a computer device to perform the above-described group division method.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A group division method, the method comprising:
selecting, from a target knowledge graph, a knowledge graph subgraph for a target type node;
calculating edge weights among a plurality of target type nodes in the knowledge graph subgraph using each of M edge weight generation algorithms, so as to generate M homogeneous graphs corresponding to the knowledge graph subgraph, wherein M is an integer greater than 1;
applying N group division algorithms to each homogeneous graph of the M homogeneous graphs to obtain N group division results corresponding to each homogeneous graph, wherein N is an integer greater than 1;
scoring the N group division results corresponding to each homogeneous graph of the M homogeneous graphs, and selecting, from the M edge weight generation algorithms and the N group division algorithms, a target edge weight generation algorithm and a target group division algorithm corresponding to the highest score; and
generating a homogeneous graph of the target knowledge graph through the target edge weight generation algorithm, and performing group division on the homogeneous graph of the target knowledge graph through the target group division algorithm.
2. The method of claim 1, wherein the M edge weight generation algorithms include a distance penalty method,
and the calculating of the edge weights among the plurality of target type nodes in the knowledge graph subgraph comprises:
calculating, using the distance penalty method, a first weight sum of all edges in a path from a first target type node to a second target type node among the plurality of target type nodes; and
dividing the first weight sum by the order of the path to obtain the edge weight between the first target type node and the second target type node.
3. The method of claim 1, wherein the M edge weight generation algorithms include a step attenuation method,
and the calculating of the edge weights among the plurality of target type nodes in the knowledge graph subgraph comprises:
applying, using the step attenuation method, a first attenuation factor to the weights of all edges in a path from a third target type node to a fourth target type node among the plurality of target type nodes; and
determining the edge weight between the third target type node and the fourth target type node based on a second weight sum of all edges to which the first attenuation factor is applied, wherein the first attenuation factor is determined based on the step size of the corresponding edge from the third target type node in the path.
4. The method of claim 3, wherein determining the edge weight between the third target type node and the fourth target type node based on the second weight sum of all edges to which the first attenuation factor is applied comprises:
applying a second attenuation factor to the weights of all edges in a path from the fourth target type node to the third target type node to determine a third weight sum of all edges to which the second attenuation factor is applied; and
taking the average of the second weight sum and the third weight sum as the edge weight between the third target type node and the fourth target type node, wherein the second attenuation factor is determined based on the step size of the corresponding edge from the fourth target type node in the path.
5. The method of claim 1, wherein the M edge weight generation algorithms include a node degree penalty method,
and the calculating of the edge weights among the plurality of target type nodes in the knowledge graph subgraph comprises:
determining, using the node degree penalty method, a fourth weight sum of all edges in a path between a fifth target type node and a sixth target type node among the plurality of target type nodes; and
dividing the fourth weight sum by the sum of the node degrees of the nodes of other types in the path to obtain the edge weight between the fifth target type node and the sixth target type node, wherein the node degree is the number of edges associated with the corresponding node.
6. The method of claim 1, wherein the M edge weight generation algorithms include a node similarity method,
and the calculating of the edge weights among the plurality of target type nodes in the knowledge graph subgraph comprises:
constructing, using the node similarity method, a bipartite graph of the knowledge graph subgraph for the plurality of target type nodes; and
calculating a node similarity matrix in the bipartite graph, and taking the values of the elements of the node similarity matrix that correspond to the plurality of target type nodes as the edge weights among the plurality of target type nodes.
7. The method of claim 1, wherein scoring the N group division results corresponding to each homogeneous graph of the M homogeneous graphs comprises:
calculating an evaluation index without real label data and an evaluation index with real label data for each group division result, wherein the evaluation index without real label data is determined without using real annotation labels, the evaluation index with real label data is determined based on real annotation labels, and a real annotation label is a label annotating the group to which each of the plurality of target type nodes belongs; and
calculating the score of each group division result based on the evaluation index without real label data and the evaluation index with real label data.
8. The method of claim 7, wherein calculating the evaluation index without real label data and the evaluation index with real label data for each group division result comprises:
calculating the modularity of each group division result using a modularity algorithm, and taking the modularity as the evaluation index without real label data of that group division result; and
calculating an adjusted Rand index of each group division result based on the pre-labeled group labels of the knowledge graph subgraph, and taking the adjusted Rand index as the evaluation index with real label data of that group division result.
9. The method of claim 8, wherein calculating the score of each group division result based on the evaluation index without real label data and the evaluation index with real label data comprises:
for each group division result, obtaining the evaluation score of the group division result by linearly adding the modularity and the adjusted Rand index.
10. The method of claim 1, wherein generating the homogeneous graph of the target knowledge graph through the target edge weight generation algorithm and performing group division on the homogeneous graph of the target knowledge graph through the target group division algorithm comprises:
dividing the target knowledge graph into distributed graphs according to a node segmentation strategy;
reconstructing the distributed graphs into homogeneous graphs using the target edge weight generation algorithm; and
performing group division on the homogeneous graphs of the distributed graphs using the target group division algorithm, so as to obtain a group division result of the target knowledge graph.
11. A group division device, the device comprising:
a subgraph selection module configured to select, from a target knowledge graph, a knowledge graph subgraph for a target type node;
a homogeneous graph generation module configured to calculate edge weights among a plurality of target type nodes in the knowledge graph subgraph using each of M edge weight generation algorithms, so as to generate M homogeneous graphs corresponding to the knowledge graph subgraph, wherein M is an integer greater than 1;
a group division module configured to apply N group division algorithms to each homogeneous graph of the M homogeneous graphs to obtain N group division results corresponding to each homogeneous graph, wherein N is an integer greater than 1;
a division result evaluation module configured to score the N group division results corresponding to each homogeneous graph of the M homogeneous graphs, and select, from the M edge weight generation algorithms and the N group division algorithms, a target edge weight generation algorithm and a target group division algorithm corresponding to the highest score; and
a grouping module configured to generate a homogeneous graph of the target knowledge graph through the target edge weight generation algorithm, and perform group division on the homogeneous graph of the target knowledge graph through the target group division algorithm.
12. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the group division method of any one of claims 1-10.
13. A computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the group division method of any one of claims 1-10.
14. A computer program product comprising computer instructions which, when executed by a processor, implement the group division method of any one of claims 1-10.
CN202210882017.9A 2022-07-26 2022-07-26 Group division method and related device Pending CN116150352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210882017.9A CN116150352A (en) 2022-07-26 2022-07-26 Group division method and related device

Publications (1)

Publication Number Publication Date
CN116150352A true CN116150352A (en) 2023-05-23

Family

ID=86355037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210882017.9A Pending CN116150352A (en) 2022-07-26 2022-07-26 Group division method and related device

Country Status (1)

Country Link
CN (1) CN116150352A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117633563A (en) * 2024-01-24 2024-03-01 中国电子科技集团公司第十四研究所 Multi-target top-down hierarchical grouping method based on OPTICS algorithm
CN117633563B (en) * 2024-01-24 2024-05-10 中国电子科技集团公司第十四研究所 Multi-target top-down hierarchical grouping method based on OPTICS algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination