WO2022217712A1 - Data mining method, apparatus, computer device, and storage medium - Google Patents

Data mining method, apparatus, computer device, and storage medium

Info

Publication number
WO2022217712A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
nodes
data mining
centrality
determining
Prior art date
Application number
PCT/CN2021/097113
Other languages
English (en)
French (fr)
Inventor
任霁野
王媛
汪伟
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022217712A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the field of computer technology, in particular to big data technology, and concerns a data mining method, apparatus, computer device, and storage medium.
  • Knowledge graphs are a popular research direction in big data and artificial intelligence: they not only show the connections between things in visual form, but also draw on many technologies, such as graph theory, database technology, visualization, data mining, and deep learning.
  • In enterprises or institutions, a knowledge graph is generally deployed as a system that integrates technologies such as data mining, entity recognition, and entity association.
  • When knowledge graph technology is applied in a scenario that requires active data mining, the degree of automation of the whole process and the accuracy of the information become important measures of system performance. As for the degree of automation, enterprises or teams in different industries have their own solutions for their own business.
  • A knowledge graph in the social domain has continuous streaming data input, and the data collection itself is automated.
  • The business analysis model is mainly responsible for identifying attributes of entities and relationships between entities.
  • The present application provides a data mining method, apparatus, computer device, and storage medium that can effectively perform data mining based on different knowledge systems.
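  • The present application provides a data mining method, including:
  • obtaining a data mining result after the current data mining operation is performed;
  • generating a knowledge graph according to the data mining result;
  • determining the centrality of each of a plurality of nodes included in the knowledge graph, and sorting the nodes according to their centrality to obtain a sorting result;
  • performing the next data mining operation according to the sorting result.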
  • An embodiment of the present application provides a data mining apparatus, including:
  • a data mining module, configured to obtain the data mining result after the current data mining operation is performed;
  • a generating module, configured to generate a knowledge graph according to the data mining result;
  • a sorting module, configured to determine the centrality of each of a plurality of nodes included in the knowledge graph, and to sort the nodes according to their centrality to obtain a sorting result;
  • the data mining module being further configured to perform the next data mining operation according to the sorting result.
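  • An embodiment of the present application provides a computer device, including a processor and a memory connected to each other, where the memory is used to store a computer program that includes program instructions, and the processor is configured to invoke the program instructions to perform the following method:
  • obtaining a data mining result after the current data mining operation is performed; generating a knowledge graph according to the data mining result; determining the centrality of each of a plurality of nodes included in the knowledge graph, and sorting the nodes according to their centrality to obtain a sorting result; and performing the next data mining operation according to the sorting result.
  • An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following method:
  • obtaining a data mining result after the current data mining operation is performed; generating a knowledge graph according to the data mining result; determining the centrality of each of a plurality of nodes included in the knowledge graph, and sorting the nodes according to their centrality to obtain a sorting result; and performing the next data mining operation according to the sorting result.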
  • This application can effectively carry out data mining based on different knowledge systems. When the knowledge graph is applied to a field that is unfamiliar to the general public, this application, unlike the prior art, does not require personnel with a relevant professional background to take part in verifying the model recognition effect and formulating mining strategies, which can greatly improve the efficiency of data mining.
  • In addition, this application analyzes the centrality of nodes and then performs data mining according to that centrality, which also helps guarantee the quality of data mining.
  • FIG. 1a is a schematic diagram of a data mining process flow provided by an embodiment of the present application.
  • FIG. 1b is a schematic diagram of another data mining process provided by an embodiment of the present application.
  • FIG. 1c is a schematic diagram of a data mining scenario provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a data mining method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a knowledge graph provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another data mining method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a data mining apparatus provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Data mining, which is known under several names, is the process of extracting latent, useful information from massive data according to established business goals.
  • It draws on existing database management systems and other data sources: the query, retrieval, and reporting functions of the management system are combined with multi-dimensional analysis and statistical analysis methods to carry out online analytical processing, yielding statistical data for decision-making reference and uncovering hidden, previously unknown information of potential value.
  • Data mining is a multidisciplinary field that integrates research results from database technology, artificial intelligence, machine learning, pattern recognition, fuzzy mathematics, and mathematical statistics. It can be used to support business intelligence applications and decision analysis, and is currently widely used in finance, healthcare, and other industries. The development of data mining technology has important practical significance for all walks of life.
  • A simple data mining process combined with a knowledge graph is shown in FIG. 1a.
  • The data mining process shown in FIG. 1a includes the following steps.
  • Data mining: this is a business-driven or knowledge-driven data collection step, which usually requires professionals familiar with the business or knowledge domain to formulate data mining strategies.
  • Entity identification: this process analyzes data in the form of text, images, or sound through natural language processing, image recognition, and voiceprint recognition algorithms, and mines the target entities. An entity recognition model usually needs rich training data, such as corpora, and later tuning to achieve a good recognition effect.
  • Model effect inspection: this process tests the entity recognition model and the entity recognition strategy of step 2 according to the quality of the resulting knowledge graph. It usually requires the judgment of someone with some expertise in the field.
  • Model tuning: optimization measures for the entity recognition model are formulated according to the model effect found in the previous step, so that the optimized model can be used for later entity recognition. After data mining results are obtained, a more accurate knowledge graph can then be generated with the optimized model, and data mining can be performed according to that more accurate knowledge graph.
  • This application proposes a data mining strategy that uses the graph-theoretic concept of "centrality" to measure the importance of nodes in the network. It automatically calculates the importance of the nodes representing entities in the application field, ranks the nodes by importance, and then carries out data mining work according to the ranking result.
  • The ranking result (usually the name of the sorted node or the image of the node) can be sent to the data mining program so that the program performs data mining work based on it.
  • The centrality calculation and degree gain calculation processes are connected after "model tuning". In the centrality calculation step, degree centrality, indirect centrality, or a combination of the two can be chosen to analyze the centrality of existing nodes, according to the business scenario.
  • The degree centrality formula is suited to locating the most prominent or most influential nodes and driving the mining work from them, while the indirect centrality formula is suited to locating the nodes with the highest path traffic; when the two are combined, the centrality calculation results are weighted.
  • The computer device can traverse multiple nodes in the knowledge graph, such as all nodes, to calculate the centrality of each node, and then sort the nodes according to their centrality to obtain a sorting result, which can be a list of nodes.
  • The mining priority of each node can be determined according to the sorting result, and the data mining work performed according to those priorities. Since the degree gain cannot be calculated when data mining is performed for the first time, the first data mining task can be started by having the data mining program read the sorting result directly.
  • In later rounds, the degree gain of each node can be calculated and the data mining work performed according to it: for example, the mining priority is determined for each node whose degree gain is greater than 0, and data mining is then performed according to those priorities.
  • FIG. 2 is a schematic flowchart of a data mining method according to an embodiment of the present application.
  • the method can be applied to computer equipment.
  • the computer equipment may be a server or other equipment.
  • a server can be a single server or a cluster of servers.
  • the method may include the following steps:
  • The computer device may perform the current data mining operation and obtain the resulting data mining result.
  • The current data mining operation may be the first data mining operation or a later one.
  • If the current data mining operation is not the first, a data mining operation has already been performed before it.
  • Data mining results refer to the data obtained through data mining.
  • the computer device may perform knowledge extraction on the data mining result to obtain multiple triples, and perform knowledge fusion on the multiple triples to obtain the knowledge fusion result.
  • the computer equipment can perform knowledge processing on the knowledge fusion results to obtain a knowledge graph.
  • the process of knowledge extraction includes entity extraction, relation extraction and attribute extraction.
  • the process of knowledge fusion includes ontology matching and entity alignment.
  • Knowledge processing includes knowledge reasoning, knowledge discovery and quality assessment.
  • the entity extraction process may include the above entity identification process.
  • the entity recognition process may be implemented through an optimized entity recognition model obtained after model inspection and model tuning.
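  • As a rough illustration of the last stage of the step above, the following sketch assembles (head, relation, tail) triples into an adjacency map that can serve as a simple knowledge graph. The function name `build_graph`, the triple layout, and the sample triples are assumptions made for this example; knowledge fusion and knowledge processing are not modeled.

```python
from collections import defaultdict


def build_graph(triples):
    """Turn (head, relation, tail) triples into an undirected adjacency map."""
    adjacency = defaultdict(dict)
    for head, relation, tail in triples:
        # Store the relation as the edge attribute in both directions.
        adjacency[head][tail] = relation
        adjacency[tail][head] = relation
    return dict(adjacency)


# Hypothetical triples standing in for the output of knowledge extraction.
triples = [
    ("Company 1", "holding", "Company 2"),
    ("Company 1", "holding", "Company 3"),
    ("Executive A", "employed_by", "Company 2"),
]
graph = build_graph(triples)
print(sorted(graph["Company 2"]))  # ['Company 1', 'Executive A']
```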
  • S203 Determine the centrality of each node in the plurality of nodes included in the knowledge graph, and sort each node according to the centrality of each node to obtain a ranking result.
  • the multiple nodes may be all nodes included in the knowledge graph, or may be some nodes included in the knowledge graph.
  • each node included in the knowledge graph may be divided into each node cluster. Node clusters can be divided according to business objectives, which will not be described in detail here.
  • the multiple nodes may be all nodes included in the target node cluster, or may be some nodes included in the target node cluster.
  • the sorting result indicates each node after sorting.
  • The sorting may place the nodes in descending order of centrality, in ascending order of centrality, and so on.
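  • A minimal sketch of the sorting step, assuming the centrality values are already available as a mapping from node name to score. The function name `rank_nodes` and the tie-breaking by node name are choices made for this example, not something the application specifies.

```python
def rank_nodes(centrality, descending=True):
    """Return node names ordered by centrality; ties are broken by name."""
    sign = -1 if descending else 1
    return sorted(centrality, key=lambda node: (sign * centrality[node], node))


# Hypothetical centrality scores.
scores = {"Company 1": 1.0, "Company 2": 3.0, "Company 3": 2.0}
print(rank_nodes(scores))                    # ['Company 2', 'Company 3', 'Company 1']
print(rank_nodes(scores, descending=False))  # ['Company 1', 'Company 3', 'Company 2']
```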
  • the computer device may call the centrality algorithm to determine the centrality of each node in the plurality of nodes included in the knowledge graph, and sort each node according to the centrality of each node to obtain a sorting result.
  • The centrality algorithm may include an indirect centrality algorithm and/or a degree centrality algorithm, as well as other centrality algorithms. The two named algorithms are explained in turn below.
  • indirect centrality can also be called betweenness centrality.
  • the centrality determined by the indirect centrality algorithm can be called indirect centrality.
  • Indirect centrality characterizes how strongly a node acts as an intermediary. A node with high indirect centrality has a strong "mediation attribute" in the explored network structure, and the value of continuing to mine this node lies in finding other nodes that make use of its "mediation ability".
  • The computer device may invoke an indirect centrality algorithm to determine the centrality of each of the plurality of nodes included in the knowledge graph. Specifically, the computer device may determine the number of shortest paths on which each of the plurality of nodes lies and the number of shortest paths between the plurality of nodes, and then, according to these two numbers, determine the indirect centrality of each node as the centrality of that node.
  • the indirect centrality algorithm described is as follows:
  • BC is the indirect centrality.
  • dst(v) is the number of shortest paths from another node s to the target node t that pass through node v.
  • dst is the number of shortest paths from another node s to the target node t.
  • The target node t is one of the plurality of nodes, and the other nodes s are the nodes among the plurality of nodes other than the target node t.
  • The node v in Equation 1.1 is a node other than the target node t and the other node s.
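  • Equation 1.1 itself is not present in this text. Judging from the symbol definitions above, it is presumably the standard betweenness centrality formula, written here with dst as d_{st}; this reconstruction is an assumption rather than the formula as filed.

```latex
BC(v) = \sum_{s \neq v \neq t} \frac{d_{st}(v)}{d_{st}} \tag{1.1}
```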
  • The knowledge graph shown in FIG. 3 includes multiple nodes, which are company nodes or executive nodes. The attribute of a node is the company name or the person's name, and the attribute of an edge is a company-to-company relationship, a person-to-company relationship, or a person-to-person relationship.
  • the knowledge graph shown in FIG. 3 can be divided into node cluster 1 where company 1 node is located and node cluster 2 where company 5 node is located. Among them, Company 2, Company 3 and Company 4 are all subsidiaries of Company 1. Company 6, Company 7 and Company 8 are all subsidiaries of Company 5. In this example, high betweenness can be defined as holding more shares, whereas low betweenness can be defined as holding fewer shares.
  • The following describes how the computer device invokes the indirect centrality algorithm to determine the indirect centrality of each node in node cluster 1.
  • The computer device takes each node in node cluster 1 in turn as the target node t, and takes the nodes in node cluster 1 other than the target node as the other nodes s. To calculate the indirect centrality of these nodes, it is necessary to count the number of shortest paths from the other nodes s to the target node t (i.e., dst), and also to count, for each node in node cluster 1, the number of shortest paths on which that node is an intermediate node and the edge attributes include "holding" or "shareholder" (i.e., dst(v)). Based on these steps, the following two statistical tables can be compiled:
  • Table 1 enumerates the node pairs formed by the target node t and the other nodes s, together with the shortest path of each node pair and the intermediate nodes on that shortest path.
  • Table 2 lists, for each node, the number of shortest paths from the other nodes s to the target node t and the number of shortest paths between the nodes. Substituting these values into Equation 1.1 gives the indirect centrality of each node in node cluster 1; see the last column of Table 2.
  • Company 2 has the strongest intermediary role, that is, the highest indirect centrality, because of its complicated shareholding. Taking the company 2 node as an example, substituting its dst(v) and dst values into Equation 1.1 gives the indirect centrality of the company 2 node.
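  • The counting described above can be sketched in code. The application does not say how the shortest paths are enumerated; the sketch below uses Brandes' accumulation scheme, a common choice, and assumes an unweighted, undirected graph with each unordered node pair counted once. The graph is a hypothetical toy example, not the graph of FIG. 3, and the function name is illustrative.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalized betweenness for an unweighted, undirected graph.

    adj maps each node to a list of its neighbours.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair {s, t} was visited from both ends, so halve.
    return {v: b / 2.0 for v, b in bc.items()}


# Hypothetical toy graph: Company 2 links Company 4 to the rest.
adj = {
    "Company 1": ["Company 2", "Company 3"],
    "Company 2": ["Company 1", "Company 3", "Company 4"],
    "Company 3": ["Company 1", "Company 2"],
    "Company 4": ["Company 2"],
}
bc = betweenness_centrality(adj)
print(bc["Company 2"])  # 2.0
```

  • In this toy graph only Company 2 lies on shortest paths between other nodes (Company 1 to Company 4, and Company 3 to Company 4), so it is the only node with a non-zero score.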
  • The centrality determined by the degree centrality algorithm may be called degree centrality.
  • Degree centrality characterizes how many connections a node has.
  • It can carry two levels of meaning: it may characterize the degree centrality of the node itself, or it may characterize the degree centrality of the group of nodes to which the node belongs.
  • A node with high degree centrality has strong "relationship prosperity" in the explored network structure, and the value of continuing to mine this node lies in the other associated nodes that branch out from it in multiple directions or dimensions.
  • The computer device may invoke a degree centrality algorithm to determine the centrality of each of the plurality of nodes included in the knowledge graph. In one approach, the computer device determines the number of nodes of the target attribute connected to each of the plurality of nodes and, according to that number, determines the degree centrality of each node as the centrality of that node.
  • the target attribute can be any node attribute among multiple node attributes or a specified node attribute.
  • the degree centrality algorithm described is as follows:
  • The node v in Equation 1.2 may represent any of the plurality of nodes.
  • deg(v) in Equation 1.2 represents the number of nodes connected to node v that satisfy the specified condition.
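  • Equation 1.2 is not present in this text. Given the two definitions above, it presumably just equates the degree centrality of a node with its filtered neighbour count; the form below is an assumption.

```latex
DC(v) = \deg(v) \tag{1.2}
```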
  • the knowledge graph shown in FIG. 3 may also include an industry A node.
  • The degree centrality of each company node belonging to industry A in the knowledge graph shown in FIG. 3 can be calculated, and the extent of resources and the scale of relationships of the companies belonging to industry A can be determined from it.
  • The computer device may determine the number of nodes connected to each of the multiple nodes that satisfy the specified condition, and determine the degree centrality of each node according to that number. For example, the computer device may treat the nodes with a target attribute (such as a company or an executive) that are connected to a node as the connected nodes that meet the specified condition. As another example, the computer device may identify the edges with a specified attribute (such as employment or holding) that are connected to a node, and treat the nodes reached through those edges as the connected nodes that meet the specified condition.
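  • The two kinds of specified condition described above (filtering by the attribute of the neighbouring node, or by the attribute of the connecting edge) can be sketched as follows. The edge-list layout, the `node_type` mapping, and all names and data are assumptions made for this example.

```python
def degree_centrality(edges, node_type, node, target_type=None, relation=None):
    """Count distinct neighbours of `node` that satisfy the optional filters.

    edges is a list of (head, relation, tail) tuples; node_type maps a node
    name to a type label such as "company" or "executive".
    """
    neighbours = set()
    for head, rel, tail in edges:
        if node not in (head, tail):
            continue
        other = tail if head == node else head
        if target_type is not None and node_type.get(other) != target_type:
            continue  # neighbour does not have the target attribute
        if relation is not None and rel != relation:
            continue  # edge does not have the specified attribute
        neighbours.add(other)
    return len(neighbours)


# Hypothetical data.
edges = [
    ("Company 5", "holding", "Company 6"),
    ("Company 5", "holding", "Company 7"),
    ("Executive B", "employed_by", "Company 5"),
]
node_type = {
    "Company 5": "company",
    "Company 6": "company",
    "Company 7": "company",
    "Executive B": "executive",
}
print(degree_centrality(edges, node_type, "Company 5"))                          # 3
print(degree_centrality(edges, node_type, "Company 5", target_type="company"))   # 2
print(degree_centrality(edges, node_type, "Company 5", relation="employed_by"))  # 1
```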
  • the computer equipment can invoke the degree centrality algorithm to calculate the degree centrality of the company 1 node as 3, and calculate the degree centrality of the company 5 node as 5.
  • The degree centrality of the company 5 node is higher than that of the company 1 node, indicating that company 5 has more resources and a larger relationship scale than company 1.
  • In another approach to invoking the degree centrality algorithm, the computer device determines the number of nodes of the target attribute connected to each of the plurality of nodes and the number of shortest paths between the plurality of nodes, and from these determines the degree centrality of the plurality of nodes as the centrality of each node.
  • Specifically, the computer device may determine the degree centrality of each node according to the number of nodes of the target attribute connected to it, and then determine the degree centrality of the plurality of nodes according to the degree centrality of each node and the number of shortest paths between the plurality of nodes.
  • the degree centrality algorithm described here is as follows:
  • DC represents the degree centrality of the multiple nodes.
  • n* represents the node with the highest degree centrality among the multiple nodes, and DC(n*) is the degree centrality of n*.
  • DC(v_i) is the degree centrality of another node v_i among the multiple nodes.
  • (V-1)(V-2) represents the maximum possible connection situation, where V can be the number of target paths between the multiple nodes.
  • The target paths here may include the paths between n* and the nodes connected to n*.
  • V can thus be understood as the maximum number of connections of n*.
  • Alternatively, V may be the number of the multiple nodes.
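  • The formula that these symbols belong to (Equation 1.3) is not present in this text. The definitions match the usual group degree centralization measure, so it is presumably of the following form; this is a reconstruction under that assumption, with the sum running over the multiple nodes.

```latex
DC = \frac{\sum_{i} \left[ DC(n^{*}) - DC(v_i) \right]}{(V-1)(V-2)} \tag{1.3}
```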
  • the computer device determines the number of nodes of the target attribute connected to each node in the plurality of nodes, and determines the number of the plurality of nodes, according to the number of nodes of the target attribute connected to each node and the number of the plurality of nodes, The degree centrality of multiple nodes is determined as the centrality of each node.
  • The computer device can determine that n* in node cluster 2 is the company 5 node, and calculate the degree centrality of the company 5 node and of the other company nodes in node cluster 2.
  • the role of the executive node can be ignored, and only the connection between the company nodes can be considered.
  • the path from the industry A node to the company 5 node can be taken into account.
  • The degree centrality DC(n*) of the company 5 node is 3; the degree centrality DC(v_i) of the company 6, company 7, and company 8 nodes is 1 each; and the maximum number of connections V of the company 5 node is the sum of its out-degree and its in-degree from the industry node, which is 4.
  • the degree centrality of node cluster 2 can be used as the centrality of each company node in node cluster 2.
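  • The cluster-level calculation can be tried out with the values stated above (3 for the company 5 node, 1 for each of the other three company nodes, V = 4). The sketch assumes the cluster-level formula is the standard group degree centralization, that is, the summed gap to the highest-degree node divided by (V-1)(V-2); that form is an assumption, since the formula is not reproduced in this text, and the function name is illustrative.

```python
def group_degree_centrality(degrees, v_count):
    """Summed gap to the top-degree node, scaled by (V-1)(V-2)."""
    top = max(degrees.values())
    gap_sum = sum(top - d for d in degrees.values())
    return gap_sum / ((v_count - 1) * (v_count - 2))


# Degree values for node cluster 2 as stated in the text.
cluster_2 = {"Company 5": 3, "Company 6": 1, "Company 7": 1, "Company 8": 1}
print(group_degree_centrality(cluster_2, v_count=4))  # 1.0
```

  • Under this reading the gaps are 0 + 2 + 2 + 2 = 6 and the denominator is 3 × 2 = 6, so the cluster-level value is 1.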
  • The computer device may also obtain both the degree centrality and the indirect centrality of each of the plurality of nodes. It then multiplies the degree centrality of each node by a first weight to obtain a first weighted result for that node, multiplies the indirect centrality of each node by a second weight to obtain a second weighted result for that node, and adds the two weighted results of each node to obtain the centrality of that node.
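  • A minimal sketch of this weighted combination. The function name, the default weights of 0.5, and the sample values are assumptions; the application does not state how the two weights are chosen.

```python
def combine_centrality(degree, indirect, w_degree=0.5, w_indirect=0.5):
    """Weighted sum of degree centrality and indirect centrality per node."""
    return {
        node: w_degree * degree[node] + w_indirect * indirect[node]
        for node in degree
    }


# Hypothetical per-node values.
degree = {"Company 1": 2, "Company 2": 4}
indirect = {"Company 1": 1.0, "Company 2": 0.0}
combined = combine_centrality(degree, indirect)
print(combined)  # {'Company 1': 1.5, 'Company 2': 2.0}
```

  • Because the two measures are on different scales, a deployment would probably normalize them before weighting; that step is left out here.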
  • the computer device may determine the mining priority of each node according to the sorting result, and execute the next data mining operation according to the mining priority of each node. For example, the computer device may perform data mining operations on nodes with high priority first, and perform data mining operations on nodes with low priority later.
  • When the centrality of each node is the degree centrality of the multiple nodes, all the nodes have the same mining priority, and the computer device can perform the next data mining operation on them simultaneously.
  • In summary, the computer device obtains the data mining result after performing the current data mining operation, generates a knowledge graph according to that result, determines the centrality of each of the multiple nodes included in the knowledge graph, and sorts the nodes according to their centrality to obtain the sorting result, so that the next data mining operation is performed according to the sorting result.
  • This process can effectively carry out data mining for different knowledge systems; it can both improve the efficiency of data mining and guarantee its quality.
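  • The round-by-round loop summarized above can be sketched as follows. The three callables (`mine`, `build_graph`, `centrality`) are hypothetical stand-ins for the data mining program, the knowledge graph generation, and whichever centrality algorithm is chosen; the stubs below exist only to make the sketch runnable.

```python
def mining_loop(mine, build_graph, centrality, seeds, rounds=3):
    """Repeat: mine -> build graph -> score nodes -> rank -> mine again."""
    targets = list(seeds)
    history = []
    for _ in range(rounds):
        results = mine(targets)          # current data mining operation
        graph = build_graph(results)     # knowledge graph from the results
        scores = centrality(graph)       # centrality of each node
        # Highest centrality first; the ranking drives the next round.
        targets = sorted(scores, key=lambda n: (-scores[n], n))
        history.append(targets)
    return history


# Stub implementations, for illustration only.
def stub_mine(targets):
    return [("a", "b"), ("a", "c")]


def stub_build_graph(pairs):
    adj = {}
    for u, v in pairs:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def stub_centrality(adj):
    return {node: len(neighbours) for node, neighbours in adj.items()}


history = mining_loop(stub_mine, stub_build_graph, stub_centrality, ["a"], rounds=2)
print(history)  # [['a', 'b', 'c'], ['a', 'b', 'c']]
```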
  • FIG. 4 is a schematic flowchart of another data mining method provided by an embodiment of the present application.
  • the method can be applied in computer equipment.
  • the computer equipment may be a server or other equipment.
  • a server can be a single server or a cluster of servers.
  • the method may include the following steps:
  • S403. Determine the centrality of each node in the plurality of nodes included in the knowledge graph, and sort each node according to the centrality of each node to obtain a ranking result.
  • For steps S401 to S403, reference may be made to steps S201 to S203 in the embodiment of FIG. 2; details are not repeated here.
  • In step S404, the computer device determines whether to perform step S405 or step S407 according to the number of times the data mining operation has been performed.
  • In steps S405 to S406, when the number of times the data mining operation has been performed is less than or equal to a preset number, the computer device determines the mining priority of each node according to the sorting result, and performs the next data mining operation according to the mining priority of each node.
  • Otherwise, according to the degree gain of each node, the computer device determines, from the plurality of nodes, a target node whose degree gain is greater than or equal to a preset value.
  • When the number of times the data mining operation has been performed is greater than the preset number, the computer device may calculate the degree gain of each node and perform the next data mining operation according to the sorting result and the degree gain of each node.
  • The computer device may determine, from the plurality of nodes, the target nodes whose degree gain is greater than or equal to a preset value (e.g., 0) and, when there are several target nodes, determine the mining priority of each target node according to its degree gain, so that the data mining operation is performed according to those priorities.
  • The computer device may also determine the mining priority of each target node according to both the sorting result and the degree gain of the target node.
  • The preset number of times may be set to 1 or another value.
  • a node with a high degree gain has a high mining priority
  • a node with a low degree gain has a low mining priority.
  • the computer device may calculate the degree gain for each node upon determining that the number of times the data mining operation has been performed is greater than one.
  • According to the degree gain of each node, the computer device determines, from the plurality of nodes, the target nodes whose degree gain is greater than or equal to a preset value; when there are multiple target nodes, it determines the priority of each target node according to the sorting result and the degree gain of each target node, so that the data mining operation is performed according to those priorities.
  • To calculate the degree gain of each node, the computer device obtains the centrality of each node obtained after the previous data mining operation, and calculates the degree gain of each node from that centrality and the centrality of each node obtained after the current data mining operation.
  • the calculation method of the degree gain can be as follows:
  • D represents the degree gain.
  • i represents the number of data mining operations performed so far, including the current one.
  • v_i is the centrality of the node obtained after the current data mining operation, and v_{i-1} is the centrality of the node obtained after the previous data mining operation.
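  • The degree gain formula is not present in this text. The sketch below assumes the simplest form consistent with the definitions above, D = v_i - v_{i-1} (a relative change would fit them equally well), and combines it with the branching on the number of operations performed. Treating a node that did not exist in the previous round as having centrality 0, and all names and sample values, are further assumptions.

```python
def degree_gain(current, previous):
    """Gain per node, assumed here to be current minus previous centrality."""
    # Assumption: a node absent from the previous round had centrality 0.
    return {node: c - previous.get(node, 0.0) for node, c in current.items()}


def next_targets(run_count, preset_runs, current, previous, min_gain=0.0):
    """Order the nodes for the next data mining operation."""
    ranked = sorted(current, key=lambda n: (-current[n], n))
    if run_count <= preset_runs:
        # Early rounds: no usable gain yet, follow the centrality ranking.
        return ranked
    gain = degree_gain(current, previous)
    targets = [n for n in ranked if gain[n] >= min_gain]
    # Stable sort: higher gain first, centrality ranking kept among ties.
    return sorted(targets, key=lambda n: -gain[n])


# Hypothetical centrality values from two consecutive rounds.
previous = {"a": 3, "b": 2}
current = {"a": 3, "b": 4, "c": 1}
print(next_targets(1, 1, current, previous))  # ['b', 'a', 'c']
print(next_targets(2, 1, current, previous))  # ['b', 'c', 'a']
```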
  • In summary, the computer device obtains the data mining result after performing the current data mining operation, generates a knowledge graph according to that result, determines the centrality of each of the multiple nodes included in the knowledge graph, and sorts the nodes according to their centrality to obtain the sorting result, so that the next data mining operation is performed according to the sorting result. This process can improve the efficiency of data mining and ensure its quality.
  • This application relates to blockchain technology.
  • data mining results can be written into the blockchain, or different rounds of data mining operations can be performed based on the data stored in the blockchain.
  • FIG. 5 is a schematic structural diagram of a data mining apparatus according to an embodiment of the present application.
  • the apparatus can be applied to computer equipment, and specifically, the apparatus can include:
  • the data mining module 501 is configured to obtain the data mining result after the current data mining operation is performed.
  • the generating module 502 is configured to generate a knowledge graph according to the data mining result.
  • the sorting module 503 is configured to determine the centrality of each node in the plurality of nodes included in the knowledge graph, and sort each node according to the centrality of each node to obtain a sorting result.
  • the data mining module 501 is further configured to perform the next data mining operation according to the sorting result.
  • the sorting module 503 determines the centrality of each node in the plurality of nodes included in the knowledge graph specifically by: determining the number of shortest paths on which each node lies, and determining the number of shortest paths between the plurality of nodes; and determining, according to these two numbers, the indirect centrality of each node as the centrality of each node.
  • the sorting module 503 determines the centrality of each node in the plurality of nodes included in the knowledge graph specifically by: determining the number of nodes of the target attribute connected to each node in the plurality of nodes; and determining, according to that number, the degree centrality of each node as the centrality of each node.
  • the sorting module 503 calculates the centrality of each node in the knowledge graph specifically by: determining the number of nodes of the target attribute connected to each node in the plurality of nodes, and determining the number of target paths between the plurality of nodes; and determining, according to these two numbers, the degree centrality of the plurality of nodes as the centrality of each node.
  • the data mining module 501 performs the next data mining operation according to the sorting result specifically by: determining the number of times the data mining operation has been performed; when that number is less than or equal to a preset number of times, determining the mining priority for each node according to the sorting result; and performing the next data mining operation according to the mining priority of each node.
  • the data mining module 501 performs the next data mining operation according to the sorting result further by: when the number of times the data mining operation has been performed is greater than the preset number of times, calculating the degree gain of each node; determining, according to the degree gain of each node, the target nodes whose degree gain is greater than or equal to a preset value from the plurality of nodes; when there are multiple target nodes, determining the mining priority for each target node according to the sorting result and the degree gain of each target node; and performing the data mining operation according to the mining priority of each target node.
  • the data mining module 501 calculates the degree gain of each node specifically by: acquiring the centrality of each node obtained after the previous data mining operation; and calculating the degree gain of each node from that centrality and the centrality of each node obtained after the current data mining operation.
  • the data mining device can obtain the data mining result after the current data mining operation is performed, generate a knowledge graph according to the data mining result, determine the centrality of each node in the plurality of nodes included in the knowledge graph, and sort the nodes according to their centrality to obtain a sorting result, so as to perform the next data mining operation according to the sorting result.
  • FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • the computer device described in this embodiment may include: one or more processors 1000 and a memory 2000 .
  • the processor 1000 and the memory 2000 may be connected through a bus or the like.
  • the processor 1000 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 2000 can be a high-speed RAM memory, or a non-volatile memory, such as a disk memory.
  • the memory 2000 is used to store a set of program codes, and the processor 1000 can call the program codes stored in the memory 2000. Specifically:
  • the processor 1000 is configured to obtain a data mining result after the current data mining operation is performed; generate a knowledge graph according to the data mining result; determine the centrality of each node in the plurality of nodes included in the knowledge graph, and sort the nodes according to their centrality to obtain a sorting result; and perform the next data mining operation according to the sorting result.
  • the processor 1000 determines the centrality of each node in the plurality of nodes included in the knowledge graph specifically by: determining the number of shortest paths on which each node lies, and determining the number of shortest paths between the plurality of nodes; and determining, according to these two numbers, the indirect centrality of each node as the centrality of each node.
  • the processor 1000 determines the centrality of each node in the plurality of nodes included in the knowledge graph specifically by: determining the number of nodes of the target attribute connected to each node in the plurality of nodes; and determining, according to that number, the degree centrality of each node as the centrality of each node.
  • the processor 1000 determines the centrality of each node in the plurality of nodes included in the knowledge graph specifically by: determining the number of nodes of the target attribute connected to each node in the plurality of nodes, and determining the number of target paths between the plurality of nodes; and determining, according to these two numbers, the degree centrality of the plurality of nodes as the centrality of each node.
  • the processor 1000 performs the next data mining operation according to the sorting result specifically by: determining the number of times the data mining operation has been performed; when that number is less than or equal to a preset number of times, determining the mining priority for each node according to the sorting result; and performing the next data mining operation according to the mining priority of each node.
  • the processor 1000 performs the next data mining operation according to the sorting result further by: when the number of times the data mining operation has been performed is greater than the preset number of times, calculating the degree gain of each node; determining, according to the degree gain of each node, the target nodes whose degree gain is greater than or equal to a preset value from the plurality of nodes; when there are multiple target nodes, determining the mining priority for each target node according to the sorting result and the degree gain of each target node; and performing the data mining operation according to the mining priority of each target node.
  • the processor 1000 calculates the degree gain of each node specifically by: obtaining the centrality of each node obtained after the previous data mining operation; and calculating the degree gain of each node from that centrality and the centrality of each node obtained after the current data mining operation.
  • the processor 1000 described in the embodiments of the present application may execute the implementations described in the embodiments of FIG. 2 and FIG. 4, which will not be repeated here.
  • Embodiments of the present application further provide a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the steps of the methods in the foregoing embodiments can be implemented, or the functions of each module of the apparatus in the foregoing embodiments can be implemented, which will not be repeated here.
  • the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • Each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist physically alone, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • the computer-readable storage medium can be volatile or non-volatile.
  • the computer storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM) or a random access memory (Random Access Memory, RAM), and the like.
  • the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of blockchain nodes, and the like.
  • A blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

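A data mining method, apparatus, computer device, and storage medium, applied in the field of big data technology. The method includes: obtaining a data mining result after the current data mining operation is performed (S201); generating a knowledge graph according to the data mining result (S202); determining the centrality of each node among a plurality of nodes included in the knowledge graph, and sorting the nodes according to their centrality to obtain a sorting result (S203); and performing the next data mining operation according to the sorting result (S204). With this method, data mining can be carried out effectively across different knowledge systems. The method involves blockchain technology; for example, the data mining result can be written into a blockchain.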

Description

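Data mining method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 202110410056.4, entitled "Data mining method, apparatus, computer device, and storage medium", filed with the China Patent Office on April 16, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, in particular to big data technology, and concerns a data mining method, apparatus, computer device, and storage medium.
Background
Knowledge graphs are currently a popular research direction in big data and artificial intelligence. They not only present the relationships between things visually, but also draw on many technologies, such as graph theory, database technology, visualization, data mining, and deep learning.
In enterprises and institutions, knowledge graphs are generally deployed as systems that combine data mining, entity recognition, entity linking, and related technologies. When knowledge graph technology is applied in a scenario that requires actively mining data, the degree of automation of the whole process and the accuracy of the information become important measures of system performance. On the question of automation, different industries, enterprises, and teams have their own solutions for their businesses. For example, knowledge graphs in the social domain receive a continuous stream of input data, so data collection is itself automated, and the business analysis model is mainly responsible for labeling entity attributes and the relationships between entities.
However, the inventors realized that some knowledge graph applications must actively mine outward to expand, involve professional knowledge barriers, and are unfamiliar to the general public, for example knowledge organization in politics or in deep fields such as finance and biology. In these cases, personnel with a relevant professional background are often needed to verify model recognition results and to formulate mining strategies, and the differences between knowledge systems make the data mining process very difficult. How to carry out data mining effectively across different knowledge systems has therefore become an urgent problem.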
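Summary
Embodiments of the present application provide a data mining method, apparatus, computer device, and storage medium that allow data mining to be carried out effectively across different knowledge systems.
In a first aspect, an embodiment of the present application provides a data mining method, including:
obtaining a data mining result after the current data mining operation is performed;
generating a knowledge graph according to the data mining result;
determining the centrality of each node among a plurality of nodes included in the knowledge graph, and sorting the nodes according to the centrality of each node to obtain a sorting result;
performing the next data mining operation according to the sorting result.
In a second aspect, an embodiment of the present application provides a data mining apparatus, including:
a data mining module, configured to obtain a data mining result after the current data mining operation is performed;
a generating module, configured to generate a knowledge graph according to the data mining result;
a sorting module, configured to determine the centrality of each node among a plurality of nodes included in the knowledge graph, and to sort the nodes according to the centrality of each node to obtain a sorting result;
the data mining module being further configured to perform the next data mining operation according to the sorting result.
In a third aspect, an embodiment of the present application provides a computer device including a processor and a memory connected to each other, wherein the memory is used to store a computer program including program instructions, and the processor is configured to call the program instructions to execute the following method:
obtaining a data mining result after the current data mining operation is performed;
generating a knowledge graph according to the data mining result;
determining the centrality of each node among a plurality of nodes included in the knowledge graph, and sorting the nodes according to the centrality of each node to obtain a sorting result;
performing the next data mining operation according to the sorting result.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the following method:
obtaining a data mining result after the current data mining operation is performed;
generating a knowledge graph according to the data mining result;
determining the centrality of each node among a plurality of nodes included in the knowledge graph, and sorting the nodes according to the centrality of each node to obtain a sorting result;
performing the next data mining operation according to the sorting result.
The present application enables data mining to be carried out effectively across different knowledge systems. For knowledge graph applications in fields unfamiliar to the general public, it does not require, as the prior art does, personnel with a relevant professional background to verify model recognition results and formulate mining strategies, which considerably improves data mining efficiency. In addition, because mining is driven by an analysis of node centrality, the quality of the data mining is also better assured.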
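Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1a is a schematic diagram of a data mining flow provided by an embodiment of the present application;
FIG. 1b is a schematic diagram of another data mining flow provided by an embodiment of the present application;
FIG. 1c is a schematic diagram of a data mining scenario provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data mining method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a knowledge graph provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of another data mining method provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a data mining apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a computer device provided by an embodiment of the present application.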
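Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings.
Data mining is the process of extracting latent, useful information from massive data according to established business goals. At a shallow level, it uses the query, retrieval, and reporting functions of existing data source management systems, such as database management systems, combined with multidimensional and statistical analysis methods, to perform online analytical processing and produce statistical analysis data for decision making. At a deeper level, it can discover implicit, previously unknown, and potentially valuable information in data sources such as databases.
Data mining is a multidisciplinary field that draws on the latest research in database technology, artificial intelligence, machine learning, pattern recognition, fuzzy mathematics, and mathematical statistics. It can support business intelligence applications and decision analysis, and is widely used in industries such as finance and healthcare. The development of data mining technology has practical significance for all industries.
A simple data mining flow combined with a knowledge graph is shown in FIG. 1a. It includes the following steps.
1. Data mining. This is a business-driven or knowledge-driven data collection step, and usually requires professionals familiar with the business or knowledge domain to formulate the data mining strategy.
2. Entity recognition. This step analyzes data in text, image, or audio form using natural language processing, image recognition, or voiceprint recognition algorithms to find the target entities. An entity recognition model usually needs rich training data, such as corpora, and later tuning to achieve good recognition results.
3. Knowledge graph generation. Entities and relationships are presented as nodes and connecting lines. In one embodiment, generating the knowledge graph can include the entity recognition process.
4. Model effect review. This step checks the effect of the entity recognition model and entity recognition strategy of step 2 against the resulting knowledge graph. It usually requires judgment by people with professional experience in the field.
5. Model tuning. Optimization measures for the entity recognition model are formulated according to the model effect found in the previous step, so that the optimized model can be used for later entity recognition. After further data mining results are obtained, a more accurate knowledge graph can then be produced with the optimized model, and data mining can proceed from that more accurate graph.
6. Data mining. After model tuning, depending on the knowledge domain involved, professionals familiar with the business or knowledge domain may be needed to formulate the direction or strategy of a new round of data mining.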
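In unfamiliar or highly specialized research fields, the above process sometimes inevitably requires manual intervention to judge, from a professional standpoint, how rich a node cluster is and to define the next mining task. If the builder of the knowledge graph lacks expertise in the field, it is difficult to judge the richness of each round of mining and therefore difficult to set the direction of the next round. To address this, the present application proposes a data mining strategy that uses the graph-theoretic concept of "centrality", which measures the importance of a node in a network. Centrality is used to automatically estimate the importance of the nodes that represent entities in the application domain, the nodes are sorted by importance, and data mining is then carried out according to the sorting result. In one embodiment, the sorting result (usually the names or images of the sorted nodes) can be sent to a data mining program so that the program carries out data mining according to it. In one embodiment, referring to the data mining flow shown in FIG. 1b, the centrality calculation and the degree gain calculation follow "model tuning". In the centrality calculation step, degree centrality, indirect centrality, or a combination of the two can be chosen according to the business scenario to analyze the centrality of the existing nodes. The degree centrality formulas are suited to locating the nodes with the strongest topicality or the widest influence and advancing mining from them; the indirect centrality formulas are suited to locating the nodes with the highest path traffic. Alternatively, weights for the two centrality results can be set according to the actual application scenario and the results combined.
In one embodiment, the data mining strategy is described with reference to FIG. 1c. The computer device can traverse multiple nodes in the knowledge graph, for example all nodes, to calculate the centrality of each node, and then sort the nodes by centrality to obtain a sorting result, which here can be a node list. When the data mining operation is performed for the first time, the mining priority of each node can be determined from the sorting result, and mining proceeds according to those priorities. Because the degree gain cannot be calculated in the first round, the data mining program can read the sorting result directly to start the mining task. When the data mining operation is performed for the N-th time (N greater than 1), the degree gain of each node can be calculated and mining performed according to it; specifically, the mining priority of each node whose degree gain is greater than 0 can be determined, and mining then proceeds according to those priorities.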
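The iterative strategy described above (mine, build the graph, score nodes, sort, mine again) can be sketched as a small loop. This is only an illustration under stated assumptions: `mine` and `centrality_of` are hypothetical callbacks that stand in for the data mining program and the chosen centrality algorithm, and the graph is reduced to a set of edges.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def rank_nodes(centrality: Dict[Node, float]) -> List[Node]:
    """Sort nodes by centrality, highest first; repr() breaks ties deterministically."""
    return sorted(centrality, key=lambda n: (-centrality[n], repr(n)))


def mining_loop(
    seeds: Iterable[Node],
    mine: Callable[[List[Node]], Iterable[Edge]],              # hypothetical mining callback
    centrality_of: Callable[[Set[Edge]], Dict[Node, float]],   # hypothetical scoring callback
    rounds: int,
) -> List[List[Node]]:
    """Run `rounds` mining rounds and return the node ranking produced after each one."""
    edges: Set[Edge] = set()
    frontier = list(seeds)
    history: List[List[Node]] = []
    for _ in range(rounds):
        edges |= set(mine(frontier))           # mine, visiting nodes in priority order
        ranking = rank_nodes(centrality_of(edges))
        history.append(ranking)
        frontier = ranking                     # the ranking drives the next round
    return history


def degree_count(edges: Set[Edge]) -> Dict[Node, float]:
    """Toy scoring function: number of edges touching each node."""
    counts: Dict[Node, float] = {}
    for a, b in edges:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts
```

In this sketch the first round simply follows the seed order, and every later round follows the ranking from the previous round; the degree gain filter described for later rounds is left out to keep the loop minimal.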
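Please refer to FIG. 2, a schematic flowchart of a data mining method provided by an embodiment of the present application. The method can be applied to a computer device such as a server, which can be a single server or a server cluster. Specifically, the method can include the following steps:
S201. Obtain the data mining result after the current data mining operation is performed.
In this embodiment, the computer device can perform the current data mining operation and obtain its data mining result. The current operation can be the first data mining operation or a later one; if it is not the first, data mining operations have already been performed before it. The data mining result refers to the data obtained through data mining.
S202. Generate a knowledge graph according to the data mining result.
In this embodiment, the computer device can perform knowledge extraction on the data mining result to obtain multiple triples, and perform knowledge fusion on the triples to obtain a knowledge fusion result. After obtaining the knowledge fusion result, the computer device can perform knowledge processing on it to obtain the knowledge graph. Knowledge extraction includes entity extraction, relation extraction, and attribute extraction. Knowledge fusion includes ontology matching and entity alignment. Knowledge processing includes knowledge reasoning, knowledge discovery, and quality assessment. The entity extraction can include the entity recognition process described above. In one embodiment, entity recognition can be carried out by the optimized entity recognition model obtained after model review and model tuning.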
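As a rough illustration of how extracted triples become a graph, the sketch below turns (head, relation, tail) triples into an adjacency mapping. It assumes the triples have already been extracted and fused; the function name and the example entity names are made up for illustration and do not come from the application.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_graph(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each entity to its outgoing (relation, tail) edges."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
        graph.setdefault(tail, [])  # make sure tail-only entities appear as nodes
    return dict(graph)


# Hypothetical triples, in the spirit of the shareholding example in this document.
example = build_graph([
    ("Company 1", "holds shares in", "Company 2"),
    ("Executive 1", "works at", "Company 1"),
])
```

Keeping the relation on each edge makes it possible to filter later by edge attribute (for example, only shareholding edges) when computing centrality.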
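S203. Determine the centrality of each node among the plurality of nodes included in the knowledge graph, and sort the nodes according to the centrality of each node to obtain a sorting result.
The plurality of nodes can be all or some of the nodes in the knowledge graph. In one embodiment, the nodes of the knowledge graph can be divided into node clusters according to business goals, which is not elaborated here. Accordingly, the plurality of nodes can be all or some of the nodes in a target node cluster. The sorting result indicates the nodes in sorted order. The nodes can be sorted front to back in descending order of centrality, in ascending order of centrality, and so on.
In this embodiment, the computer device can call a centrality algorithm to determine the centrality of each node among the plurality of nodes, and sort the nodes according to their centrality to obtain the sorting result. The centrality algorithm can include an indirect centrality algorithm and/or a degree centrality algorithm, among others. The two algorithms are explained in turn below.
Indirect centrality is also called betweenness centrality. The centrality determined by the indirect centrality algorithm is called the indirect centrality and characterizes how strongly a node acts as an intermediary. A node with high indirect centrality has a strong "intermediary attribute" in the network structure explored so far, and the value of continuing to mine it lies in finding other nodes that make use of its "intermediary capacity".
In one embodiment, the computer device can call the indirect centrality algorithm to determine the centrality of each node among the plurality of nodes included in the knowledge graph. Specifically, the computer device can determine the number of shortest paths on which each node lies, determine the number of shortest paths between the plurality of nodes, and then determine from these two numbers the indirect centrality of each node as its centrality. The indirect centrality algorithm is as follows: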
Figure PCTCN2021097113-appb-000001
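Here, BC is the indirect centrality. d_st(v) is the number of shortest paths from another node s to the target node t that pass through node v, and d_st is the number of shortest paths from another node s to the target node t. The target node t is one of the plurality of nodes, and the other nodes s are the nodes in the plurality other than t. Node v in formula 1.1 is a node other than the target node t and the other node s.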
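The formula image for formula 1.1 is not reproduced in this text. For orientation only, the textbook definition of betweenness centrality, written with the symbols defined above, is given below; it is an assumption that the application's formula has this shape, and the worked example that follows appears to use aggregated counts (total paths through v divided by total shortest paths) rather than a per-pair sum.

```latex
% Standard betweenness centrality (assumed form; the original formula image is unavailable)
BC(v) \;=\; \sum_{s \neq v \neq t} \frac{d_{st}(v)}{d_{st}}
```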
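For example, in enterprise risk control and information disclosure scenarios, a knowledge graph often needs to present a company's business registration information, executive information, and the shareholdings of the company or its executives. Referring to the knowledge graph shown in FIG. 3, the graph includes multiple nodes, which are company nodes or executive nodes. A node's attribute is a company name or a person's name, and an edge's attribute is a relationship between two companies, between a person and a company, or between two people. The graph in FIG. 3 can be divided into node cluster 1, which contains the Company 1 node, and node cluster 2, which contains the Company 5 node. Company 2, Company 3, and Company 4 are subsidiaries of Company 1. Company 6, Company 7, and Company 8 are subsidiaries of Company 5. In this example, a strong intermediary role can be defined as holding many shares, and a weak one as holding few. The following explains how the computer device calls the indirect centrality algorithm to determine the indirect centrality of each node in node cluster 1.
The computer device takes each node in node cluster 1 as the target node t and the remaining nodes in node cluster 1 as the other nodes s. To calculate the indirect centrality of these nodes, it must count the number of shortest paths from the other nodes s to the target node t (d_st); in this example, it can count the shortest paths whose edge attributes include "shareholding" or "shareholder". It must also count the number of shortest paths between the nodes of node cluster 1 that pass through a given intermediary node (d_st(v)); in this example, it can count the shortest paths through an intermediary node whose edge attributes include "shareholding" or "shareholder". These steps yield the following two tables:
Table 1: Enumeration of shortest paths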
Figure PCTCN2021097113-appb-000002
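Table 2: Indirect centrality statistics
Node          d_st(v)   d_st   BC
Company 1     6         8      0.75
Company 2     6         8      0.75
Company 3     0         8      0
Company 4     0         8      0
Executive 1   0         8      0
Executive 2   0         8      0
Table 1 enumerates the node pairs formed by a target node t and another node s, the shortest path of each pair, and the intermediary nodes on that path. Table 2 lists, for each node, the number of shortest paths from other nodes s to target nodes t that pass through it, and the number of shortest paths between the nodes. Substituting these values into formula 1.1 gives the indirect centrality of each node in node cluster 1 (the last column of Table 2). Company 2 has a complex shareholding situation, so its intermediary role is the strongest, that is, its indirect centrality is the highest. Taking the Company 2 node as an example, substituting its d_st(v) and d_st values into formula 1.1 gives its indirect centrality, as follows.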
Figure PCTCN2021097113-appb-000003
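The counting behind Table 2 can be sketched in code. The example below is a minimal sketch, not the application's implementation: it assumes a directed, unweighted graph given as an adjacency mapping, counts shortest paths with breadth-first search, and reports for each node the aggregated ratio (shortest paths passing through the node) / (all shortest paths), mirroring how Table 2 divides d_st(v) by d_st. The toy graph is invented and is not FIG. 3.

```python
from collections import deque
from typing import Dict, Hashable, List, Tuple

Node = Hashable
Graph = Dict[Node, List[Node]]


def bfs_counts(graph: Graph, source: Node) -> Tuple[Dict[Node, int], Dict[Node, int]]:
    """Return (distance, number of shortest paths) from `source` to every reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph.get(u, ()):
            if w not in dist:                 # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:        # u is a predecessor on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def aggregated_betweenness(graph: Graph) -> Dict[Node, float]:
    """For each node v: (shortest paths with v as an inner node) / (all shortest paths)."""
    nodes = set(graph) | {w for ws in graph.values() for w in ws}
    info = {s: bfs_counts(graph, s) for s in nodes}
    total = 0
    through = {v: 0 for v in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue
            total += sigma_s[t]
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s->t path iff the distances add up exactly
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    through[v] += sigma_s[v] * sigma_v[t]
    return {v: (through[v] / total if total else 0.0) for v in nodes}


toy = {"A": ["B"], "B": ["C"], "C": []}   # A -> B -> C
scores = aggregated_betweenness(toy)       # only B acts as an intermediary
```

With the Table 2 counts, the same ratio for Company 2 is 6 / 8 = 0.75. Restricting the graph to edges with a given attribute (such as shareholding) before calling the function would reproduce the edge filter described in the example.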
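The centrality determined by the degree centrality algorithm is called the degree centrality and characterizes how well connected a node is. In one embodiment, this has two possible meanings: the degree centrality can characterize the degree centrality of a single node, or it can characterize the degree centrality of the group of nodes to which the node belongs. A node with high degree centrality has strong "relationship richness" in the network structure explored so far, and the value of continuing to mine it lies in finding, from multiple directions or dimensions, other nodes related to it.
In one embodiment, the computer device can call the degree centrality algorithm to determine the centrality of each node among the plurality of nodes included in the knowledge graph. In one approach, the computer device determines the number of nodes of the target attribute connected to each node, and from that number determines the degree centrality of each node as its centrality. The target attribute can be any one of the node attributes or a specified node attribute. The degree centrality algorithm is as follows:
DC(v) = deg(v)    (Formula 1.2)
Here, DC is the degree centrality. Node v in formula 1.2 can be any of the plurality of nodes. deg(v) in formula 1.2 is the number of nodes connected to node v that satisfy the specified condition.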
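Formula 1.2 amounts to counting, for each node, the neighbors that satisfy a condition. A minimal sketch follows, assuming the graph is given as a neighbor mapping and that the condition is "the neighbor's attribute is in a target set"; the function name, the attribute labels, and the sample data are illustrative only.

```python
from typing import Dict, Hashable, Iterable, List, Set

Node = Hashable


def degree_centrality(
    neighbors: Dict[Node, List[Node]],
    node_attr: Dict[Node, str],
    target_attrs: Set[str],
) -> Dict[Node, int]:
    """DC(v) = number of nodes connected to v whose attribute is a target attribute."""
    return {
        v: sum(1 for w in ws if node_attr.get(w) in target_attrs)
        for v, ws in neighbors.items()
    }


# Illustrative data: one company connected to two companies and one executive.
neighbors = {"Company X": ["Company Y", "Company Z", "Executive Q"]}
attrs = {"Company Y": "company", "Company Z": "company", "Executive Q": "executive"}

companies_only = degree_centrality(neighbors, attrs, {"company"})
all_kinds = degree_centrality(neighbors, attrs, {"company", "executive"})
```

Changing `target_attrs` changes which connections count, which is how the same graph can answer different business questions.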
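For example, financial analysis sometimes requires quickly locating companies with broad resources and large scale. For a given market or industry, judging from the relationship network whether a company's resources are broad or its relationships extensive can be done by calculating the company's degree centrality in the knowledge graph. In this example, the knowledge graph shown in FIG. 3 can also include an Industry A node. To identify the companies in Industry A with broad resources and large relationship scale, the degree centrality of each company node belonging to Industry A in the graph of FIG. 3 can be calculated and used to assess the breadth of resources and the scale of relationships of each of those companies. Specifically, the computer device can determine, for each node, the number of connected nodes that satisfy a specified condition, and determine the degree centrality of each node from that number. For example, the computer device can treat the nodes of a target attribute (such as company or executive) connected to a node as the connected nodes that satisfy the condition. As another example, the computer device can identify the edges of a specified attribute (such as employment or shareholding) connected to a node, and treat the nodes at the other end of those edges as the connected nodes that satisfy the condition. The computer device can call the degree centrality algorithm and calculate a degree centrality of 3 for the Company 1 node and 5 for the Company 5 node. The degree centrality of the Company 5 node is higher than that of the Company 1 node, which indicates that Company 5 has broader resources and a larger relationship scale than Company 1.
In one embodiment, in another approach to determining the centrality of each node with the degree centrality algorithm, the computer device determines the number of nodes of the target attribute connected to each node, determines the number of target paths between the plurality of nodes, and from these two numbers determines the degree centrality of the plurality of nodes as the centrality of each node. In one embodiment, the computer device can determine the degree centrality of each node from the number of nodes of the target attribute connected to it, and then determine, from the degree centrality of each node and the number of shortest paths between the plurality of nodes, the degree centrality of the plurality of nodes as the degree centrality of each node. The degree centrality algorithm used here is as follows: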
Figure PCTCN2021097113-appb-000004
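Here, DC is the degree centrality of the plurality of nodes. n* is the node with the highest degree centrality among the plurality of nodes, and DC(n*) is its degree centrality. DC(v_i) is the degree centrality of another node v_i among the plurality of nodes. (V-1)(V-2) represents the maximum possible connectivity, where V can be the number of target paths between the plurality of nodes. In one embodiment, the target paths can include the paths between n* and the nodes connected to n*. In one embodiment, V can be understood as the maximum number of connections of n*. In one embodiment, V can be the number of nodes in the plurality. That is, the computer device determines the number of nodes of the target attribute connected to each node, determines the number of nodes in the plurality, and from these two numbers determines the degree centrality of the plurality of nodes as the centrality of each node.
Some scenarios in enterprise analysis, such as scale estimation or industry concentration estimation, require the share of the market held by each leading company in an industry. In such a scenario, cluster degree centrality can be used to measure the centrality that a node cluster holds within the whole relationship network, and the attributes of the entities mapped by the nodes can be combined with it to judge the cluster's influence on the network. Taking FIG. 3 as an example, suppose the scale influence of the Company 5 node on the Industry A node needs to be judged; the degree centrality of node cluster 2 must then be calculated. Specifically, the computer device can determine that n* in node cluster 2 is the Company 5 node, and count the degree centrality of the Company 5 node and of the other company nodes in node cluster 2. In this example, if the goal is to analyze the total number of companies in each group within the industry and the group's prominence in the industry, the executive nodes can be ignored and only the links between company nodes considered. In this directed graph, the path from the Industry A node to the Company 5 node can also be taken into account. The degree centrality DC(n*) of the Company 5 node is thus 3, and the degree centrality DC(v_i) of each of the Company 6, Company 7, and Company 8 nodes is 1. The maximum number of connections V of the Company 5 node is the sum of its out-degree and its in-degree from the industry node, namely 4. Substituting these values into formula 1.3 gives the degree centrality of node cluster 2, as follows.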
Figure PCTCN2021097113-appb-000005
The degree centrality of node cluster 2 can be used as the centrality of each company node in node cluster 2.
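The formula image for formula 1.3 is not reproduced in this text, so its exact form cannot be confirmed. The sketch below assumes the Freeman-style degree centralization that the surrounding symbols suggest: the sum over the nodes of DC(n*) - DC(v_i), divided by (V-1)(V-2). The function name is illustrative, and the numeric result is only what this assumed formula yields for the values quoted in the example.

```python
from typing import Iterable


def group_degree_centrality(node_degrees: Iterable[float], v: int) -> float:
    """Assumed form: sum(DC(n*) - DC(v_i)) / ((V - 1) * (V - 2)).

    `node_degrees` holds the degree centrality of every node in the cluster,
    and `v` is the V used in the denominator (see the text for its possible meanings).
    """
    degrees = list(node_degrees)
    if not degrees or v <= 2:
        raise ValueError("need at least one node and V > 2")
    highest = max(degrees)                       # DC(n*)
    spread = sum(highest - d for d in degrees)   # the n* term itself contributes 0
    return spread / ((v - 1) * (v - 2))


# Values from the example: DC(n*) = 3 for Company 5, DC(v_i) = 1 for Companies 6-8, V = 4.
cluster_2 = group_degree_centrality([3, 1, 1, 1], v=4)
```

Under this assumption, a value near 1 means the cluster is dominated by a single hub, and a value near 0 means connections are spread evenly across its nodes.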
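In one embodiment, the computer device can obtain both the degree centrality and the indirect centrality of each node among the plurality of nodes. It can then multiply the degree centrality of each node by a first weight to obtain a first weighted result for the node, multiply the indirect centrality of each node by a second weight to obtain a second weighted result for the node, and add the two weighted results of each node to obtain the node's centrality.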
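The weighted combination is a per-node linear blend of the two scores. A minimal sketch, with illustrative names and weights chosen only for the example:

```python
from typing import Dict, Hashable

Node = Hashable


def combined_centrality(
    degree: Dict[Node, float],
    indirect: Dict[Node, float],
    first_weight: float,
    second_weight: float,
) -> Dict[Node, float]:
    """centrality(v) = first_weight * degree(v) + second_weight * indirect(v)."""
    return {
        v: first_weight * degree[v] + second_weight * indirect.get(v, 0.0)
        for v in degree
    }


blend = combined_centrality(
    degree={"a": 3, "b": 1},
    indirect={"a": 0.5, "b": 0.0},
    first_weight=0.5,
    second_weight=0.5,
)
```

Because degree counts and betweenness ratios live on different scales, the weights in practice also act as a rescaling; the text leaves their choice to the application scenario.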
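S204. Perform the next data mining operation according to the sorting result.
The computer device can determine the mining priority of each node from the sorting result and perform the next data mining operation according to those priorities. For example, it can mine high-priority nodes first and low-priority nodes later. When the centrality of each node is the degree centrality of the plurality of nodes, every node has the same mining priority, and the computer device can perform the next data mining operation on all nodes at the same time.
As can be seen, in the embodiment shown in FIG. 2, the computer device can obtain the data mining result after the current data mining operation is performed, generate a knowledge graph from it, determine the centrality of each node among the plurality of nodes in the graph, and sort the nodes by centrality to obtain a sorting result, so that the next data mining operation is performed according to the sorting result. This process enables effective data mining across different knowledge systems, improving mining efficiency while helping to ensure mining quality.
Please refer to FIG. 4, a schematic flowchart of another data mining method provided by an embodiment of the present application. The method can be applied to a computer device such as a server, which can be a single server or a server cluster. Specifically, the method can include the following steps:
S401. Obtain the data mining result after the current data mining operation is performed.
S402. Generate a knowledge graph according to the data mining result.
S403. Determine the centrality of each node among the plurality of nodes included in the knowledge graph, and sort the nodes according to the centrality of each node to obtain a sorting result.
For steps S401 to S403, see steps S201 to S203 in the embodiment of FIG. 2; they are not repeated here.
S404. Determine the number of data mining operations that have been performed.
After step S404, the computer device decides, according to the number of data mining operations performed, whether to execute step S405 or step S407.
S405. When the number of data mining operations performed is less than or equal to a preset number of times, determine the mining priority for each node according to the sorting result.
S406. Perform the next data mining operation according to the mining priority of each node.
In steps S405 and S406, when the number of data mining operations performed is less than or equal to the preset number of times, the computer device determines the mining priority for each node according to the sorting result and performs the next data mining operation according to those priorities.
S407. When the number of data mining operations performed is greater than the preset number of times, calculate the degree gain of each node.
S408. According to the degree gain of each node, determine from the plurality of nodes the target nodes whose degree gain is greater than or equal to a preset value.
S409. When there are multiple target nodes, determine the mining priority for each target node according to the sorting result and the degree gain of each target node.
S410. Perform the data mining operation according to the mining priority of each target node.
In steps S407 to S410, when the number of data mining operations performed is greater than the preset number of times, the computer device can calculate the degree gain of each node and perform the next data mining operation according to the sorting result and the degree gains. In one embodiment, the computer device can determine from the plurality of nodes the target nodes whose degree gain is greater than or equal to a preset value (such as 0) and, when there are multiple target nodes, determine the mining priority for each target node according to its degree gain, then perform the data mining operation according to those priorities. In one embodiment, the computer device can determine the mining priority for each target node according to both the sorting result and the target node's degree gain. The preset number of times can be set to 1 or to another value. In one embodiment, a node with a high degree gain has a high mining priority, and a node with a low degree gain has a low mining priority.
To keep data mining efficient, a standard is needed to judge whether each round of mining has expanded the graph more than the previous round and whether some nodes have stopped growing. The embodiments of the present application call this standard the "degree gain". For example, the computer device can calculate the degree gain of each node once it determines that more than one data mining operation has been performed. It then determines from the plurality of nodes the target nodes whose degree gain is greater than or equal to a preset value and, when there are multiple target nodes, determines the priority of each target node according to the sorting result and its degree gain, and performs the data mining operation according to those priorities.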
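The branch between steps S405 to S406 and steps S407 to S410 can be sketched as a single planning function. This is a sketch under assumptions: the text does not say exactly how the sorting result and the degree gain are combined, so the version below orders target nodes by degree gain first and uses the position in the centrality ranking only as a tie-break. All names and default values are illustrative.

```python
from typing import Dict, Hashable, List, Optional

Node = Hashable


def plan_next_round(
    ranking: List[Node],                        # nodes sorted by centrality (the sorting result)
    executed_count: int,                        # data mining operations performed so far
    gains: Optional[Dict[Node, float]] = None,  # degree gain per node, if available
    preset_count: int = 1,
    preset_value: float = 0.0,
) -> List[Node]:
    """Return the nodes to mine next, highest priority first."""
    if executed_count <= preset_count or gains is None:
        # S405-S406: priorities follow the centrality ranking directly
        return list(ranking)
    # S408: keep only nodes whose degree gain reaches the preset value
    position = {node: i for i, node in enumerate(ranking)}
    targets = [n for n in ranking if gains.get(n, 0.0) >= preset_value]
    # S409 (assumed rule): higher gain first, earlier ranking position breaks ties
    return sorted(targets, key=lambda n: (-gains.get(n, 0.0), position[n]))


first_round = plan_next_round(["a", "b", "c"], executed_count=1)
later_round = plan_next_round(
    ["a", "b", "c"], executed_count=2, gains={"a": 0.0, "b": 0.5, "c": -0.1}
)
```

With the default preset value of 0, a node whose centrality fell since the previous round (negative gain) is dropped from the next round, which matches the idea that nodes that have stopped growing need no further mining.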
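In one embodiment, the computer device calculates the degree gain of each node as follows: it obtains the centrality of each node obtained after the previous data mining operation, and calculates the degree gain of each node from that centrality and the centrality of each node obtained after the current data mining operation. The degree gain can be calculated as follows: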
Figure PCTCN2021097113-appb-000006
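Here, D is the degree gain. i is the number of data mining operations that have been performed once the current operation is complete. v_i is the centrality of the node after the current data mining operation, and V_{i-1} is the centrality of the node after the previous data mining operation. As the number of mining rounds increases, the degree gain usually rises first and then falls toward zero; when the gain reaches zero, mining of that node can stop.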
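The formula image for the degree gain is not reproduced in this text, so its exact expression is unknown. The sketch below assumes the simplest form consistent with the description: the difference between the centrality after the current round and the centrality after the previous round. A relative (ratio) form would be equally compatible with the statement that mining stops when the gain reaches zero. Treating a node that did not exist in the previous round as having previous centrality 0 is also an assumption.

```python
from typing import Dict, Hashable

Node = Hashable


def degree_gain(
    current: Dict[Node, float],    # centrality after the current round (v_i)
    previous: Dict[Node, float],   # centrality after the previous round (V_{i-1})
) -> Dict[Node, float]:
    """Assumed form: D = v_i - V_{i-1}, computed per node."""
    return {node: value - previous.get(node, 0.0) for node, value in current.items()}


gains = degree_gain(
    current={"a": 3.0, "b": 1.0, "new": 2.0},
    previous={"a": 2.0, "b": 1.0},
)
```

Here node "b" has a gain of zero, so under the rule stated above it would no longer be mined, while "a" and the newly discovered node keep positive gain.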
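As can be seen, in the embodiment shown in FIG. 4, the data mining apparatus can obtain the data mining result after the current data mining operation is performed, generate a knowledge graph from it, determine the centrality of each node among the plurality of nodes in the graph, and sort the nodes by centrality to obtain a sorting result, so that the next data mining operation is performed according to the sorting result. This process can improve data mining efficiency and help ensure data mining quality.
This application involves blockchain technology; for example, data mining results can be written into a blockchain, or different rounds of data mining operations can be performed on data stored in a blockchain.
Please refer to FIG. 5, a schematic structural diagram of a data mining apparatus provided by an embodiment of the present application. The apparatus can be applied to a computer device and can include:
a data mining module 501, configured to obtain the data mining result after the current data mining operation is performed;
a generating module 502, configured to generate a knowledge graph according to the data mining result;
a sorting module 503, configured to determine the centrality of each node among the plurality of nodes included in the knowledge graph, and to sort the nodes according to the centrality of each node to obtain a sorting result;
the data mining module 501 being further configured to perform the next data mining operation according to the sorting result.
In an optional implementation, the sorting module 503 determines the centrality of each node among the plurality of nodes included in the knowledge graph specifically by: determining the number of shortest paths on which each node lies, and determining the number of shortest paths between the plurality of nodes; and determining, according to these two numbers, the indirect centrality of each node as the centrality of each node.
In an optional implementation, the sorting module 503 determines the centrality of each node among the plurality of nodes included in the knowledge graph specifically by: determining the number of nodes of the target attribute connected to each node; and determining, according to that number, the degree centrality of each node as the centrality of each node.
In an optional implementation, the sorting module 503 calculates the centrality of each node in the knowledge graph specifically by: determining the number of nodes of the target attribute connected to each node, and determining the number of target paths between the plurality of nodes; and determining, according to these two numbers, the degree centrality of the plurality of nodes as the centrality of each node.
In an optional implementation, the data mining module 501 performs the next data mining operation according to the sorting result specifically by: determining the number of data mining operations that have been performed; when that number is less than or equal to a preset number of times, determining the mining priority for each node according to the sorting result; and performing the next data mining operation according to the mining priority of each node.
In an optional implementation, the data mining module 501 performs the next data mining operation according to the sorting result further by: when the number of data mining operations performed is greater than the preset number of times, calculating the degree gain of each node; determining, according to the degree gain of each node, the target nodes whose degree gain is greater than or equal to a preset value; when there are multiple target nodes, determining the mining priority for each target node according to the sorting result and the degree gain of each target node; and performing the data mining operation according to the mining priority of each target node.
In an optional implementation, the data mining module 501 calculates the degree gain of each node specifically by: obtaining the centrality of each node obtained after the previous data mining operation; and calculating the degree gain of each node from that centrality and the centrality of each node obtained after the current data mining operation.
As can be seen, in the embodiment shown in FIG. 5, the data mining apparatus can obtain the data mining result after the current data mining operation is performed, generate a knowledge graph from it, determine the centrality of each node among the plurality of nodes in the graph, and sort the nodes by centrality to obtain a sorting result, so that the next data mining operation is performed according to the sorting result. This process enables effective data mining across different knowledge systems, improving mining efficiency while helping to ensure mining quality.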
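Please refer to FIG. 6, a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device described in this embodiment can include one or more processors 1000 and a memory 2000. The processor 1000 and the memory 2000 can be connected by a bus or by other means.
The processor 1000 can be a central processing unit (CPU); it can also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor can be a microprocessor or any conventional processor.
The memory 2000 can be a high-speed RAM or a non-volatile memory, such as disk storage. The memory 2000 is used to store a set of program codes, and the processor 1000 can call the program codes stored in the memory 2000. Specifically:
The processor 1000 is configured to obtain the data mining result after the current data mining operation is performed; generate a knowledge graph according to the data mining result; determine the centrality of each node among the plurality of nodes included in the knowledge graph, and sort the nodes according to the centrality of each node to obtain a sorting result; and perform the next data mining operation according to the sorting result.
In one embodiment, the processor 1000 determines the centrality of each node among the plurality of nodes included in the knowledge graph specifically by: determining the number of shortest paths on which each node lies, and determining the number of shortest paths between the plurality of nodes; and determining, according to these two numbers, the indirect centrality of each node as the centrality of each node.
In one embodiment, the processor 1000 determines the centrality of each node among the plurality of nodes included in the knowledge graph specifically by: determining the number of nodes of the target attribute connected to each node; and determining, according to that number, the degree centrality of each node as the centrality of each node.
In one embodiment, the processor 1000 determines the centrality of each node among the plurality of nodes included in the knowledge graph specifically by: determining the number of nodes of the target attribute connected to each node, and determining the number of target paths between the plurality of nodes; and determining, according to these two numbers, the degree centrality of the plurality of nodes as the centrality of each node.
In one embodiment, the processor 1000 performs the next data mining operation according to the sorting result specifically by: determining the number of data mining operations that have been performed; when that number is less than or equal to a preset number of times, determining the mining priority for each node according to the sorting result; and performing the next data mining operation according to the mining priority of each node.
In one embodiment, the processor 1000 performs the next data mining operation according to the sorting result further by: when the number of data mining operations performed is greater than the preset number of times, calculating the degree gain of each node; determining, according to the degree gain of each node, the target nodes whose degree gain is greater than or equal to a preset value; when there are multiple target nodes, determining the mining priority for each target node according to the sorting result and the degree gain of each target node; and performing the data mining operation according to the mining priority of each target node.
In one embodiment, the processor 1000 calculates the degree gain of each node specifically by: obtaining the centrality of each node obtained after the previous data mining operation; and calculating the degree gain of each node from that centrality and the centrality of each node obtained after the current data mining operation.
In a specific implementation, the processor 1000 described in the embodiments of the present application can execute the implementations described in the embodiments of FIG. 2 and FIG. 4, which are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the steps of the methods in the above embodiments can be implemented, or the functions of the modules of the apparatus in the above embodiments can be implemented, which are not repeated here. Optionally, the storage medium involved in this application, such as a computer-readable storage medium, can be non-volatile or volatile.
The functional modules in the embodiments of the present application can be integrated into one processing module, can each exist physically on their own, or two or more of them can be integrated into one module. The integrated modules can be implemented in the form of hardware or in the form of software function modules.
A person of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the method embodiments described above. The computer-readable storage medium can be volatile or non-volatile. For example, it can be a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM). The computer-readable storage medium can mainly include a storage program area and a storage data area, wherein the storage program area can store an operating system, an application program required for at least one function, and the like, and the storage data area can store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
What is disclosed above is only a preferred embodiment of the present application and does not limit the scope of its claims. A person of ordinary skill in the art will understand that implementations of all or part of the flows of the above embodiments, and equivalent changes made in accordance with the claims of the present application, still fall within the scope covered by the present application.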

Claims (20)

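  1. A data mining method, comprising:
    obtaining a data mining result after the current data mining operation is performed;
    generating a knowledge graph according to the data mining result;
    determining the centrality of each node among a plurality of nodes included in the knowledge graph, and sorting the nodes according to the centrality of each node to obtain a sorting result;
    performing the next data mining operation according to the sorting result.
  2. The method according to claim 1, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of shortest paths on which each node among the plurality of nodes included in the knowledge graph lies, and determining the number of shortest paths between the plurality of nodes;
    determining, according to the number of shortest paths on which each node lies and the number of shortest paths between the plurality of nodes, the indirect centrality of each node as the centrality of each node.
  3. The method according to claim 1, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of nodes of a target attribute connected to each node among the plurality of nodes;
    determining, according to the number of nodes of the target attribute connected to each node, the degree centrality of each node as the centrality of each node.
  4. The method according to claim 1, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of nodes of a target attribute connected to each node among the plurality of nodes, and determining the number of target paths between the plurality of nodes;
    determining, according to the number of nodes of the target attribute connected to each node and the number of target paths between the plurality of nodes, the degree centrality of the plurality of nodes as the centrality of each node.
  5. The method according to claim 1, wherein performing the next data mining operation according to the sorting result comprises:
    determining the number of data mining operations that have been performed;
    when the number of data mining operations performed is less than or equal to a preset number of times, determining the mining priority for each node according to the sorting result;
    performing the next data mining operation according to the mining priority of each node.
  6. The method according to claim 5, wherein performing the next data mining operation according to the sorting result further comprises:
    when the number of data mining operations performed is greater than the preset number of times, calculating the degree gain of each node;
    determining, according to the degree gain of each node, target nodes whose degree gain is greater than or equal to a preset value from the plurality of nodes;
    when there are multiple target nodes, determining the mining priority for each target node according to the sorting result and the degree gain of each target node;
    performing the data mining operation according to the mining priority of each target node.
  7. The method according to claim 6, wherein calculating the degree gain of each node comprises:
    obtaining the centrality of each node obtained after the previous data mining operation;
    calculating the degree gain of each node according to the centrality of each node obtained after the previous data mining operation and the centrality of each node obtained after the current data mining operation.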
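  8. A data mining apparatus, comprising:
    a data mining module, configured to obtain a data mining result after the current data mining operation is performed;
    a generating module, configured to generate a knowledge graph according to the data mining result;
    a sorting module, configured to determine the centrality of each node among a plurality of nodes included in the knowledge graph, and to sort the nodes according to the centrality of each node to obtain a sorting result;
    the data mining module being further configured to perform the next data mining operation according to the sorting result.
  9. A computer device, comprising a processor and a memory connected to each other, wherein the memory is used to store a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the following method:
    obtaining a data mining result after the current data mining operation is performed;
    generating a knowledge graph according to the data mining result;
    determining the centrality of each node among a plurality of nodes included in the knowledge graph, and sorting the nodes according to the centrality of each node to obtain a sorting result;
    performing the next data mining operation according to the sorting result.
  10. The computer device according to claim 9, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of shortest paths on which each node among the plurality of nodes included in the knowledge graph lies, and determining the number of shortest paths between the plurality of nodes;
    determining, according to the number of shortest paths on which each node lies and the number of shortest paths between the plurality of nodes, the indirect centrality of each node as the centrality of each node.
  11. The computer device according to claim 9, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of nodes of a target attribute connected to each node among the plurality of nodes;
    determining, according to the number of nodes of the target attribute connected to each node, the degree centrality of each node as the centrality of each node.
  12. The computer device according to claim 9, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of nodes of a target attribute connected to each node among the plurality of nodes, and determining the number of target paths between the plurality of nodes;
    determining, according to the number of nodes of the target attribute connected to each node and the number of target paths between the plurality of nodes, the degree centrality of the plurality of nodes as the centrality of each node.
  13. The computer device according to claim 9, wherein performing the next data mining operation according to the sorting result comprises:
    determining the number of data mining operations that have been performed;
    when the number of data mining operations performed is less than or equal to a preset number of times, determining the mining priority for each node according to the sorting result;
    performing the next data mining operation according to the mining priority of each node.
  14. The computer device according to claim 13, wherein performing the next data mining operation according to the sorting result further comprises:
    when the number of data mining operations performed is greater than the preset number of times, calculating the degree gain of each node;
    determining, according to the degree gain of each node, target nodes whose degree gain is greater than or equal to a preset value from the plurality of nodes;
    when there are multiple target nodes, determining the mining priority for each target node according to the sorting result and the degree gain of each target node;
    performing the data mining operation according to the mining priority of each target node.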
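  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, implements the following method:
    obtaining a data mining result after the current data mining operation is performed;
    generating a knowledge graph according to the data mining result;
    determining the centrality of each node among a plurality of nodes included in the knowledge graph, and sorting the nodes according to the centrality of each node to obtain a sorting result;
    performing the next data mining operation according to the sorting result.
  16. The computer-readable storage medium according to claim 15, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of shortest paths on which each node among the plurality of nodes included in the knowledge graph lies, and determining the number of shortest paths between the plurality of nodes;
    determining, according to the number of shortest paths on which each node lies and the number of shortest paths between the plurality of nodes, the indirect centrality of each node as the centrality of each node.
  17. The computer-readable storage medium according to claim 15, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of nodes of a target attribute connected to each node among the plurality of nodes;
    determining, according to the number of nodes of the target attribute connected to each node, the degree centrality of each node as the centrality of each node.
  18. The computer-readable storage medium according to claim 15, wherein determining the centrality of each node among the plurality of nodes included in the knowledge graph comprises:
    determining the number of nodes of a target attribute connected to each node among the plurality of nodes, and determining the number of target paths between the plurality of nodes;
    determining, according to the number of nodes of the target attribute connected to each node and the number of target paths between the plurality of nodes, the degree centrality of the plurality of nodes as the centrality of each node.
  19. The computer-readable storage medium according to claim 15, wherein performing the next data mining operation according to the sorting result comprises:
    determining the number of data mining operations that have been performed;
    when the number of data mining operations performed is less than or equal to a preset number of times, determining the mining priority for each node according to the sorting result;
    performing the next data mining operation according to the mining priority of each node.
  20. The computer-readable storage medium according to claim 19, wherein performing the next data mining operation according to the sorting result further comprises:
    when the number of data mining operations performed is greater than the preset number of times, calculating the degree gain of each node;
    determining, according to the degree gain of each node, target nodes whose degree gain is greater than or equal to a preset value from the plurality of nodes;
    when there are multiple target nodes, determining the mining priority for each target node according to the sorting result and the degree gain of each target node;
    performing the data mining operation according to the mining priority of each target node.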
PCT/CN2021/097113 2021-04-16 2021-05-31 Data mining method, apparatus, computer device, and storage medium WO2022217712A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110410056.4A CN112948469B (zh) 2021-04-16 2021-04-16 Data mining method, apparatus, computer device, and storage medium
CN202110410056.4 2021-04-16

Publications (1)

Publication Number Publication Date
WO2022217712A1 true WO2022217712A1 (zh) 2022-10-20

Family

ID=76232811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097113 WO2022217712A1 (zh) 2021-04-16 2021-05-31 Data mining method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112948469B (zh)
WO (1) WO2022217712A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779116B (zh) * 2021-09-10 2023-07-11 平安科技(深圳)有限公司 对象排序方法、相关设备及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170185910A1 (en) * 2015-12-28 2017-06-29 International Business Machines Corporation Steering graph mining algorithms applied to complex networks
CN108491511A (zh) * 2018-03-23 2018-09-04 腾讯科技(深圳)有限公司 基于图数据的数据挖掘方法和装置、模型训练方法和装置
CN109299090A (zh) * 2018-09-03 2019-02-01 平安科技(深圳)有限公司 基金知识推理方法、系统、计算机设备和存储介质
CN110427341A (zh) * 2019-06-11 2019-11-08 福建奇点时空数字科技有限公司 一种基于路径排序的知识图谱实体关系挖掘方法
CN110941664A (zh) * 2019-12-11 2020-03-31 北京百度网讯科技有限公司 知识图谱的构建方法、检测方法、装置、设备及存储介质

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1675060A1 (en) * 2004-12-23 2006-06-28 IBM Corporation A method and system for managing customer network value
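CN102693317B (zh) * 2012-05-29 2014-11-05 华为软件技术有限公司 Data mining process generation method and apparatus
CN105740381B (zh) * 2016-01-27 2019-05-17 北京工业大学 Method for mining user interests based on complex network characteristics and neural network clustering
CN107784111B (zh) * 2017-11-06 2020-08-25 北京锐安科技有限公司 Data mining method, apparatus, device, and storage medium
CN110135890A (zh) * 2019-04-15 2019-08-16 深圳壹账通智能科技有限公司 Product data push method based on knowledge relationship mining, and related device
CN110647522B (zh) * 2019-09-06 2022-12-27 中国建设银行股份有限公司 Data mining method, apparatus, and system thereof
CN111309776A (zh) * 2020-01-15 2020-06-19 成都深思科技有限公司 Distributed network traffic aggregation and dimensionality-reduction statistics method based on data sorting
CN111858720A (zh) * 2020-07-31 2020-10-30 苏州水易数据科技有限公司 Database-based bidirectional data mining method and apparatus
CN112001649B (zh) * 2020-08-27 2022-11-29 支付宝(杭州)信息技术有限公司 Risk data mining method, apparatus, and device
CN112231350B (zh) * 2020-10-13 2022-04-12 汉唐信通(北京)科技有限公司 Knowledge graph-based enterprise business opportunity mining method and apparatus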

Also Published As

Publication number Publication date
CN112948469B (zh) 2023-10-13
CN112948469A (zh) 2021-06-11

Similar Documents

Publication Publication Date Title
US20220075670A1 (en) Systems and methods for replacing sensitive data
CN107633265B (zh) Data processing method and apparatus for optimizing a credit evaluation model
Dai et al. An approach to evaluate data trustworthiness based on data provenance
CN109299090B (zh) Fund centrality calculation method, system, computer device, and storage medium
US9703860B2 (en) Returning related previously answered questions based on question affinity
CN111784508A (zh) Enterprise risk assessment method and apparatus, and electronic device
Xiao An intelligent complex event processing with D numbers under fuzzy environment
US20230385034A1 (en) Automated decision making using staged machine learning
US20160364392A1 (en) Method and system for scoring credibility of information sources
CN103513983A (zh) Method and system for a predictive alert threshold determination tool
US11562252B2 (en) Systems and methods for expanding data classification using synthetic data generation in machine learning models
US20160098737A1 (en) Corpus Management Based on Question Affinity
US8489639B2 (en) Information source alignment
Verma et al. Fuzzy association rule mining based model to predict students’ performance
US20120072456A1 (en) Adaptive resource allocation for multiple correlated sub-queries in streaming systems
Grbac et al. Stability of software defect prediction in relation to levels of data imbalance
Gross et al. Systemic test and evaluation of a hard+ soft information fusion framework: Challenges and current approaches
Sharma et al. Big data reliability: A critical review
Ridzuan et al. Diagnostic analysis for outlier detection in big data analytics
WO2022217712A1 (zh) Data mining method and apparatus, computer device, and storage medium
Asmono et al. Absolute correlation weighted naïve bayes for software defect prediction
US11853400B2 (en) Distributed machine learning engine
Ceolin et al. Efficient semi-automated assessment of annotations trustworthiness
WO2023035526A1 (zh) Object ranking method, related device, and medium
Wang et al. Intelligent weight generation algorithm based on binary isolation tree

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21936584

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21936584

Country of ref document: EP

Kind code of ref document: A1