WO2022142859A1 - Data processing method, apparatus, computer-readable medium, and electronic device - Google Patents

Data processing method, apparatus, computer-readable medium, and electronic device

Info

Publication number
WO2022142859A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
nodes
graph network
computing
core degree
Prior art date
Application number
PCT/CN2021/132221
Other languages
English (en)
French (fr)
Inventor
李晓森
许杰
欧阳文
陶阳宇
肖品
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP21913601.7A (EP4198771A4)
Priority to JP2023521789A (JP2023546040A)
Publication of WO2022142859A1
Priority to US17/964,778 (US20230033019A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2323 Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H04L41/0813 Configuration setting characterised by the conditions triggering a change of settings
    • H04L41/082 Configuration setting characterised by the conditions triggering a change of settings, the condition being updates or upgrades of network functionality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies
    • H04L41/122 Discovery or management of network topologies of virtualised topologies, e.g. software-defined networks [SDN] or network function virtualisation [NFV]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H04L41/0806 Configuration setting for initial configuration or provisioning, e.g. plug-and-play
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0895 Configuration of virtualised networks or elements, e.g. virtualised network function or OpenFlow elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/22 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]

Definitions

  • This application relates to the field of artificial intelligence technology, in particular to data processing technology.
  • Reasonable sorting and mining of user data can enable the network platform to summarize user characteristics, and then combine user characteristics to better provide users with convenient and efficient platform services.
  • As the data scale keeps growing, the data processing pressure increases, and the network platform has to spend more and more computing resources and time on analyzing and processing user data. How to improve the efficiency of big data analysis and reduce the related costs is therefore an urgent problem to be solved.
  • The embodiments of the present application provide a data processing method, a data processing apparatus, a computer-readable medium, an electronic device, and a computer program product, which can overcome, to a certain extent, the technical problems of high computing resource consumption and low data processing efficiency in big data analysis.
  • A data processing method executed by an electronic device, the method including: acquiring a relational graph network, the relational graph network including nodes representing interactive objects and edges representing the interaction relationships between multiple interactive objects; and performing core degree mining on the relational graph network through a device cluster including a plurality of computing devices, iteratively updating the node core degree of all or some of the nodes in the relational graph network.
  • A data processing apparatus including: a graph network acquisition module configured to acquire a relational graph network, the relational graph network including nodes representing interactive objects and edges representing the interaction relationships between multiple interactive objects; a core degree mining module configured to perform core degree mining on the relational graph network through a device cluster including a plurality of computing devices, and to iteratively update the node core degree of all or some of the nodes in the relational graph network; and a cluster compression module configured to perform compression processing on the device cluster when the network scale of the relational graph network satisfies a preset network compression condition, removing some of the computing devices from the device cluster.
  • A computer-readable medium on which a computer program is stored, the computer program implementing the data processing method in the above technical solution when executed by a processor.
  • An electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the data processing method in the above technical solution by executing the executable instructions.
  • a computer program product or computer program where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method as in the above technical solutions.
  • A relational graph network is established from the business data that records the interaction relationships between interactive objects. By exploiting the structural characteristics and sparsity of the relational graph network, core degree mining can first be carried out region by region as distributed computing on a device cluster. As the node core degree is iteratively updated, the relational graph network is pruned: nodes that have already converged, together with their edges, are "pruned" away, so that the relational graph network keeps shrinking as the iteration proceeds and the consumption of computing resources is reduced. On this basis, once the relational graph network has been compressed to an appropriate size, the device cluster used for core degree mining can be compressed as well. In this way, not only can a large amount of computing resources be released, but the additional time overhead that parallel computing introduces, such as data distribution, is also saved, which improves data processing efficiency.
  • FIG. 1 shows an architectural block diagram of a data processing system applying the technical solution of the present application
  • FIG. 2 shows a flowchart of steps of a data processing method in an embodiment of the present application
  • FIG. 3 shows a flowchart of the method steps for core degree mining based on distributed computing in an embodiment of the present application
  • FIG. 4 shows a flowchart of steps for core degree mining on a partitioned graph network in an embodiment of the present application
  • FIG. 5 shows a flowchart of steps for selecting a computing node in an embodiment of the present application
  • FIG. 6 shows a flowchart of steps for determining the h-index of a computing node in an embodiment of the present application
  • FIG. 7 shows a flowchart of steps for summarizing node core degree mining results of a partition graph network in an embodiment of the present application
  • FIG. 8 shows a schematic diagram of a process of compressing and pruning a relational graph network based on iterative update of node coreness in an embodiment of the present application
  • FIG. 9 shows the overall architecture and processing flowchart of k-core mining in an application scenario according to an embodiment of the present application.
  • FIG. 10 schematically shows a structural block diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 11 schematically shows a structural block diagram of a computer system suitable for implementing the electronic device of the embodiment of the present application.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • FIG. 1 shows an architectural block diagram of a data processing system to which the technical solutions of the present application can be applied.
  • the data processing system 100 may include a terminal device 110 , a network 120 and a server 130 .
  • the terminal device 110 may include various electronic devices such as smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, smart glasses, and in-vehicle terminals.
  • Various application clients such as video application client, music application client, social application client, payment application client, etc. can be installed on the terminal device 110, so that the user can use corresponding application services based on the application client.
  • The server 130 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the network 120 may be a communication medium of various connection types capable of providing a communication link between the terminal device 110 and the server 130, such as a wired communication link or a wireless communication link.
  • the system architecture in this embodiment of the present application may include any number of terminal devices, networks, and servers.
  • the server 130 may be a server group composed of a plurality of server devices.
  • the technical solutions provided in the embodiments of the present application may be applied to the terminal device 110 or the server 130 , or may be jointly implemented by the terminal device 110 and the server 130 , which is not specifically limited in the present application.
  • For example, when a user uses a social networking application on the terminal device 110, he or she can exchange information with other users on the social networking platform, or carry out social activities such as voice and video conversations. In this way a social relationship can be established with other users, and corresponding social business data is generated on the social networking platform.
  • As another example, when a user uses a payment application on the terminal device 110, he or she can make payments to, or collect payments from, other users on the online payment platform. In this way a transaction relationship can be established with other users, and corresponding transaction business data is generated on the online payment platform.
  • The embodiments of the present application can construct a graph network model based on the interaction relationships corresponding to the user data, and perform data mining on the graph network model to obtain the business attributes of users in those interaction relationships.
  • Taking the transaction scenario as an example, in a graph network model that reflects the transaction relationships between merchants and consumers, a node represents a merchant or a consumer, and an edge represents a transaction relationship between two nodes.
  • Merchant nodes are mostly located in the center of the network, so the core degree (core value) of a node can be used as a topological feature and fed into a downstream machine learning task to carry out business model mining, that is, to identify whether a node in the graph network model is a merchant or a consumer.
  • Data mining can also be performed on the graph network model to detect whether a node (or edge) exhibits abnormal transaction behavior, which supports detection tasks for abnormal transactions such as illegal credit intermediary, cash out, long-term lending, and gambling.
  • the embodiments of the present application may use cloud technology to perform distributed computing.
  • Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software, and network in a wide area network or a local area network to realize data computing, storage, processing and sharing.
  • Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and so on that are applied under the cloud computing business model. The background services of technical network systems, such as video websites, picture websites, and other portal websites, require a large amount of computing and storage resources. As the Internet industry continues to develop, each item may in the future have its own identification mark that needs to be transmitted to a back-end system for logical processing, and data of different levels will be processed separately. All kinds of industry data therefore require strong system backing, which can only be achieved through cloud computing.
  • Cloud computing is a computing model that distributes computing tasks on a resource pool composed of a large number of computers, enabling various application systems to obtain computing power, storage space and information services as needed.
  • the network that provides the resources is called the “cloud”.
  • the resources in the “cloud” are infinitely expandable in the eyes of users, and can be obtained at any time, used on demand, expanded at any time, and paid for according to usage.
  • As a basic capability provider of cloud computing, a cloud computing resource pool is established (referred to as a cloud platform, and generally called an IaaS (Infrastructure as a Service) platform), and various types of virtual resources are deployed in the resource pool for external customers to choose and use.
  • The cloud computing resource pool mainly includes computing devices (virtualized machines, including operating systems), storage devices, and network devices.
  • The PaaS (Platform as a Service) layer can be deployed on the IaaS layer, and the SaaS (Software as a Service) layer can be deployed on the PaaS layer; SaaS can also be deployed directly on the IaaS layer.
  • PaaS is a platform on which software runs, such as databases and web containers.
  • SaaS is a variety of business software, such as web portals, SMS group senders, etc.
  • SaaS and PaaS are upper layers relative to IaaS.
  • Big data refers to a collection of data that cannot be captured, managed, and processed by conventional software tools within a certain time frame. It is a massive, fast-growing, and diverse information asset that requires new processing modes in order to provide stronger decision-making power, insight discovery, and process optimization capability. With the advent of the cloud era, big data has attracted more and more attention. Big data requires special technologies to process large amounts of data efficiently. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
  • Artificial intelligence cloud services are also generally referred to as AIaaS (AI as a Service).
  • The AIaaS platform splits several types of common AI services and provides them in the cloud as independent or packaged services.
  • This service model is similar to opening an AI-themed mall: all developers can access one or more of the artificial intelligence services provided by the platform through API interfaces, and some senior developers can also use the AI framework and AI infrastructure provided by the platform to deploy and operate their own cloud AI services.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, and autonomous driving. It is believed that, as the technology develops, artificial intelligence will be applied in more fields and deliver increasingly important value.
  • FIG. 2 shows a flowchart of steps of a data processing method in an embodiment of the present application.
  • The data processing method may be executed by an electronic device, for example on the terminal device 110 shown in FIG. 1, on the server 130 shown in FIG. 1, or jointly by the terminal device 110 and the server 130.
  • the data processing method may mainly include the following steps S210 to S240.
  • Step S210 Obtain a relational graph network, where the relational graph network includes nodes used to represent interactive objects and edges used to represent interactive relationships between multiple interactive objects.
  • Step S220 Perform core degree mining on the relational graph network through a device cluster including a plurality of computing devices, and iteratively update the node coreness of all or some nodes in the relational graph network.
  • Step S230 Perform pruning processing on the relational graph network according to the core degree of the nodes, and remove some nodes and some edges in the relational graph network.
  • Step S240 When the network scale of the relational graph network satisfies the preset network compression condition, perform compression processing on the device cluster, and remove some computing devices in the device cluster.
  • A relational graph network is established from the business data that records the interaction relationships between interactive objects. By exploiting the structural characteristics and sparsity of the relational graph network, core degree mining can first be carried out region by region as distributed computing on a device cluster. As the node core degree is iteratively updated, the relational graph network is pruned: nodes that have already converged, together with their edges, are "pruned" away, so that the relational graph network keeps shrinking as the iteration proceeds and the consumption of computing resources is reduced. On this basis, once the relational graph network has been compressed to an appropriate size, the device cluster used for core degree mining can be compressed as well. In this way, not only can a large amount of computing resources be released, but the additional time overhead that parallel computing introduces, such as data distribution, is also saved, which improves data processing efficiency.
  • step S210 a relational graph network is obtained, where the relational graph network includes nodes used to represent interactive objects and edges used to represent interactive relationships between multiple interactive objects.
  • the interaction object can be a user object that performs business interaction on the network business platform.
  • the interaction object can include the consumer who initiates online payment and the merchant who receives payment.
  • The interaction relationship between the interaction objects is then the network transaction relationship established between consumers and merchants based on payment events.
  • From the business data, a plurality of interactive objects and the interaction relationships between them can be extracted, thereby establishing a relational graph network composed of nodes (Node) and edges (Edge), in which each node represents an interactive object, and an edge connecting two nodes represents the interaction relationship between the interactive objects corresponding to those two nodes.
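  • As an illustration of the graph construction described above, the following minimal sketch builds an undirected adjacency map from payment events. The function name `build_relational_graph`, the `(payer, payee)` record layout, and the decision to skip self-payments are assumptions made for the example and are not taken from the application.

```python
from collections import defaultdict


def build_relational_graph(payment_events):
    """Build an undirected adjacency map from (payer, payee) pairs.

    Hypothetical helper: each distinct participant becomes a node and each
    distinct payer/payee pair becomes one edge, however often they transact.
    """
    adjacency = defaultdict(set)
    for payer, payee in payment_events:
        if payer == payee:
            continue  # assumption: self-payments carry no interaction relationship
        adjacency[payer].add(payee)
        adjacency[payee].add(payer)
    return dict(adjacency)


# Example records (invented for illustration only).
events = [
    ("consumer_a", "merchant_x"),
    ("consumer_b", "merchant_x"),
    ("consumer_a", "merchant_x"),  # repeated payment, still a single edge
]
graph = build_relational_graph(events)
```

  • Storing neighbors in sets keeps repeated payments between the same pair from inflating the node degree.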
  • step S220 coreness mining is performed on the relational graph network through a device cluster including a plurality of computing devices, and the node coreness of all nodes or some nodes in the relational graph network is iteratively updated.
  • the node coreness is a parameter used to measure the importance of each node in the graph network.
  • For example, the core number (coreness) of each node, as determined by performing k-core decomposition on the graph network, can be used as the node coreness.
  • The k-core of a graph is the subgraph that remains after repeatedly removing nodes whose degree is less than k, where the degree of a node is the number of neighbor nodes directly adjacent to it.
  • the degree of a node can reflect the importance of the node in the local area of the graph network to a certain extent, and the importance of the node can be better measured globally by mining the number of cores of the node.
  • k-core mining is an algorithm for calculating the number of cores of all nodes in a graph network.
  • For example, the 0-core is the original graph network.
  • The 1-core is the graph obtained by removing all isolated points from the graph network.
  • The 2-core is obtained by first removing all nodes with degree less than 2 from the graph network, then removing the nodes with degree less than 2 from the remaining graph, and so on, until no more nodes can be removed.
  • The 3-core is obtained by first removing all nodes with degree less than 3 from the graph network, then removing the nodes with degree less than 3 from the remaining graph, and so on, until no more nodes can be removed.
  • The core number of a node is defined as the order of the largest core that contains the node. For example, if a node belongs to the 5-core but not to the 6-core, its core number is 5.
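  • The definition above can be made concrete with a single-machine sketch of the standard peeling procedure. This is the textbook sequential approach, shown only to illustrate what a core number is; it is not the distributed method of the application. The function name `core_numbers` and the sample graph are assumptions, and the adjacency map is assumed to be symmetric with every node present as a key.

```python
def core_numbers(adjacency):
    """Return {node: core number} by repeatedly peeling low-degree nodes."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is <= k cannot belong to the (k+1)-core.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        while peel:
            v = peel.pop()
            if v not in remaining:
                continue  # already removed via an earlier entry
            remaining.discard(v)
            core[v] = k
            for u in adjacency[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        peel.append(u)
    return core


# A triangle (a, b, c), a pendant node d attached to a, and an isolated node e.
sample = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": set(),
}
cores = core_numbers(sample)
```

  • In this sample the isolated node has core number 0, the pendant node 1, and the three triangle nodes 2, even though node a has degree 3.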
  • FIG. 3 shows a flowchart of steps of a method for core degree mining based on distributed computing in an embodiment of the present application.
  • In step S220, performing core degree mining on the relational graph network through a device cluster including a plurality of computing devices, and iteratively updating the node core degree of all or some of the nodes in the relational graph network, may include the following steps S310 to S330.
  • Step S310 segment the relational graph network to obtain a partitioned graph network composed of some nodes and some edges in the relational graph network.
  • the method for segmenting a relational graph network may include: first, selecting a plurality of segmentation center points in the relational graph network according to a preset number of divisions; then using the segmentation center point as a clustering center , clustering all nodes in the relational graph network to assign each node to the nearest split center point; finally, according to the clustering results of the nodes, the relational graph network is divided into multiple partitioned graph networks.
  • the segmentation center point may be a node selected according to a preset rule in the relational graph network, or a randomly selected node.
  • A certain overlapping area may be reserved between two adjacent partitioned graph networks, so that the two partitioned graph networks share some of the nodes and edges in the overlapping area. This introduces a degree of computational redundancy and improves the reliability of coreness mining on each partitioned graph network.
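  • The segmentation in step S310 can be sketched as follows, under the assumption that "nearest" means the smallest hop count, which the text does not specify. A multi-source breadth-first search started from all segmentation center points assigns each node to the center that reaches it first. The function name `segment_by_centers`, the tie-breaking by queue order, and the handling of unreachable nodes are illustrative choices; the overlapping area between partitions is not modeled here.

```python
from collections import deque


def segment_by_centers(adjacency, centers):
    """Assign every node to its nearest center by hop count (multi-source BFS)."""
    owner = {c: c for c in centers}
    queue = deque(centers)
    while queue:
        v = queue.popleft()
        for u in adjacency[v]:
            if u not in owner:
                owner[u] = owner[v]  # first center to reach u is a nearest one
                queue.append(u)
    # Assumption: nodes unreachable from every center go to the first center.
    for v in adjacency:
        owner.setdefault(v, centers[0])
    partitions = {c: set() for c in centers}
    for v, c in owner.items():
        partitions[c].add(v)
    return partitions


# A path graph 1-2-3-4-5-6 split around the two end points.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
parts = segment_by_centers(path, [1, 6])
```

  • Each resulting node set, together with the edges among its nodes, would correspond to one partitioned graph network.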
  • Step S320 Allocate the partition graph network to a device cluster including a plurality of computing devices, and determine a computing device for performing core degree mining on the partition graph network.
  • the distributed computing of core degree mining can be realized through the device cluster composed of computing devices, and the data processing efficiency can be improved.
  • the relational graph network when the relational graph network is segmented, the relational graph network may be divided into a corresponding number of partitioned graph networks according to the number of available computing devices in the device cluster. For example, it is assumed that a device cluster for distributed computing includes M computing devices, so the relational graph network can be divided into M partitioned graph networks accordingly.
  • the relational graph network can also be divided into several partitioned graph networks of similar scales according to the computing capability of a single computing device, and then each partitioned graph network can be allocated to the same number of computing devices.
  • if the relational graph network includes N nodes, it can be divided into N/T partitioned graph networks, where T is the number of nodes of a single partitioned graph network that a single computing device can process, determined according to its computing capability.
  • when the scale of the relational graph network is large and the number of partitioned graph networks is large, the number of nodes contained in each partitioned graph network is basically equal to T.
  • N/T computing devices are selected from the device cluster, and one partitioned graph network is allocated to each computing device.
  • multiple partition graph networks may be allocated to some or all of the computing devices according to the computing power and working status of the computing devices.
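The sizing rule above (N nodes, T nodes per device, N/T partitions) can be illustrated with a short sketch. It assumes N/T is rounded up and, when there are fewer devices than partitions, that partitions are handed out round-robin. The application instead mentions allocating according to computing power and working status, so the round-robin policy and the name `plan_partitions` are hypothetical simplifications.

```python
import math


def plan_partitions(num_nodes, nodes_per_device, device_ids):
    """Return the partition count and a device -> partition-index mapping."""
    num_partitions = math.ceil(num_nodes / nodes_per_device)
    assignment = {device: [] for device in device_ids}
    for index in range(num_partitions):
        # Round-robin allocation (an assumed policy, not from the source).
        assignment[device_ids[index % len(device_ids)]].append(index)
    return num_partitions, assignment


count, assignment = plan_partitions(10, 3, ["dev0", "dev1", "dev2"])
```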
  • Step S330 Perform core degree mining on the partitioned graph network through the allocated computing device, and iteratively update the node core degree of each node in the partitioned graph network.
  • the node core degree of each node in the relational graph network may be initialized and assigned according to a preset rule, and then the node core degree of each node is iteratively updated in each iteration round.
  • the node core degree may be initialized according to the degree of the node. Specifically, for each node in the relational graph network, the number of neighbor nodes that have an adjacency relationship with the node is obtained, and the node core degree of the node is then initialized according to that number.
  • the degree of a node represents the number of adjacent nodes that have an adjacency relationship with a node.
  • the weight information can also be determined in combination with the node's own attributes, and then the node core degree is initialized and assigned according to the degree and weight information of the node.
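A minimal sketch of the degree-based initialization follows. It assumes the graph is an undirected adjacency mapping and leaves out the optional weight information; the name `init_coreness` and the toy graph are hypothetical.

```python
def init_coreness(adjacency):
    """Initialize each node's core degree to its number of distinct neighbors."""
    return {node: len(set(neighbors)) for node, neighbors in adjacency.items()}


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
initial = init_coreness(graph)
```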
  • FIG. 4 shows a flow chart of steps of performing core degree mining on a partitioned graph network in an embodiment of the present application.
  • performing core degree mining on the partitioned graph network in step S330 and iteratively updating the node core degree of each node in the partitioned graph network may include the following steps S410 to S440.
  • Step S410 Select a computing node that performs core degree mining in the current iteration round in the partition graph network, and determine neighbor nodes that have an adjacency relationship with the computing node.
  • all nodes in the partitioned graph network can be determined as computing nodes; a computing node is a node that needs to perform core degree mining calculation in the current iteration round, and whether to update the node core degree of each such node can be determined according to the mining result.
  • the computing nodes that need core degree mining in the current iteration round can be determined according to the core degree mining result of the previous iteration round and the update result of the node core degree. Some or all of the nodes will update the node coreness in the current iteration round. Nodes other than computing nodes will not perform core degree mining in the current iteration round, and naturally will not update the core degree of nodes.
  • Neighbor nodes in the embodiments of the present application refer to other nodes that have a direct connection relationship with one node. Since the node core degree of each node is affected by its neighbor nodes, as the iteration progresses, the node whose core degree has not been updated in the current iteration round may also be selected as the computing node in the subsequent iteration process.
  • FIG. 5 shows a flowchart of steps for selecting a computing node in an embodiment of the present application.
  • selecting a computing node to perform core degree mining in the current iteration round in the partitioned graph network may include the following steps S510 to S520.
  • Step S510 Read the node identifier of the node to be updated from the first storage space.
  • the node to be updated includes the active node that updated the core degree of the node in the previous iteration round and the neighbor node that has an adjacency relationship with the active node.
  • the edge regions of two adjacent partitioned graph networks may include nodes that were originally adjacent to each other in the relational graph network, and the node core degrees of such nodes still affect each other. Therefore, in order to maintain the synchronization and consistency of node core degree updates in each partitioned graph network during distributed computing, the embodiment of the present application allocates a first storage space in the system to store the node identifiers of all nodes to be updated in the relational graph network.
  • the node when a node in a partitioned graph network updates its node core degree according to the core degree mining result, the node can be marked as an active node.
  • the active node and the neighbor nodes of the active node are used as nodes to be updated, and the node identifiers of the nodes to be updated are written into the first storage space.
  • Step S520 According to the node identifier of the node to be updated, select the computing node that performs core degree mining in the current iteration round in the partition graph network.
  • each computing device can read the node identifiers of the nodes to be updated from the first storage space and then, according to the node identifiers read, select in its partitioned graph network the computing nodes that perform core degree mining in the current iteration round.
  • the first storage space can be used to collect the node identifiers of all nodes to be updated in the relational graph network after each iteration round ends and, at the beginning of a new iteration round, to distribute those node identifiers to the different computing devices, so that each computing device selects computing nodes in the partitioned graph network that it maintains.
  • Step S420 Obtain the current node coreness of the computing node and the neighbor nodes of the computing node in the current iteration round.
  • the embodiment of the present application can monitor and update the node coreness of the node in real time according to the coreness mining result in each iteration round.
  • the current node core degree of each node in the current iteration round is the latest node core degree determined after the previous iteration round.
  • this embodiment of the present application may allocate a second storage space in the system to store the node coreness of all nodes in the relational graph network.
  • when a computing device needs to perform core degree mining and updating according to the existing core degree data, it can read the current node core degrees of the computing node and its neighbor nodes in the current iteration round from the second storage space.
  • Step S430 Determine the temporary node core degree of the computing node according to the current node core degrees of the neighbor nodes, and determine whether the temporary node core degree of the computing node is less than the current node core degree of the computing node; if so, mark the computing node as an active node.
  • a method that calculates core numbers by gradually shrinking the graph network as a whole from the outside inward can only use centralized computing to process the overall graph network data serially, and is difficult to parallelize in a distributed manner. Such a method suffers from long computing time and poor computing performance when faced with ultra-large-scale relational graph networks (on the order of tens or hundreds of billions).
  • an iterative method based on the h-index can be used to perform core degree mining.
  • the h-index of the computing node can be determined according to the current node core degrees of its neighbor nodes and used as the temporary node core degree of the computing node; the h-index is the largest value h such that at least h of the neighbor nodes of the computing node have a current node core degree greater than or equal to h.
  • suppose a computing node has five neighbor nodes whose current node core degrees are 2, 3, 4, 5, and 6, respectively.
  • counting in order of node core degree from small to large, the five neighbor nodes include 5 neighbor nodes whose current node core degree is greater than or equal to 1, 5 whose current node core degree is greater than or equal to 2, 4 whose current node core degree is greater than or equal to 3, 3 whose current node core degree is greater than or equal to 4, 2 whose current node core degree is greater than or equal to 5, and 1 whose current node core degree is greater than or equal to 6.
  • 4 neighbor nodes have a current node core degree greater than or equal to 3, but only 3 have a current node core degree greater than or equal to 4, so the h-index of the computing node is 3, and the temporary node core degree of the computing node can accordingly be determined to be 3.
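The worked example can be restated compactly. In the sketch below, the symbols are notation assumed here for illustration and are not taken from the application: N(v) is the set of neighbor nodes of computing node v, and c_u is the current node core degree of neighbor u.

```latex
h(v) \;=\; \max\Bigl\{\, k \ge 0 \;:\; \bigl|\{\, u \in N(v) : c_u \ge k \,\}\bigr| \ge k \,\Bigr\}
```

For neighbor core degrees 2, 3, 4, 5, 6, the counts for k = 1, ..., 6 are 5, 5, 4, 3, 2, 1. The largest k whose count is at least k is k = 3, so h(v) = 3, matching the example.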
  • FIG. 6 shows a flowchart of steps for determining the h-index of a computing node in an embodiment of the present application.
  • the method for determining the h-index of a computing node according to the current node core degrees of its neighbor nodes may include the following steps S610 to S630.
  • Step S610 Sort all the neighbor nodes of the computing node in descending order of current node core degree, and assign each neighbor node a sequence number starting from 0.
  • Step S620 Compare the sequence number of each neighbor node with that neighbor node's current node core degree, and select, according to the comparison result, the neighbor nodes whose sequence number is greater than or equal to their current node core degree.
  • Step S630 Among the selected neighbor nodes, the current node core degree of the neighbor node with the smallest sequence number is determined as the h-index of the computing node.
  • the embodiments of the present application can quickly and efficiently determine the h-index of the computing nodes by means of sorting and filtering, which is especially suitable for a situation where the number of computing nodes is relatively large.
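Steps S610 to S630 can be sketched as follows, under two stated assumptions. First, the sketch returns the smallest qualifying sequence number. In the example from the text (2, 3, 4, 5, 6) this equals the core degree of the selected neighbor, but the two can differ in general: for core degrees 6, 5, 1 the selected neighbor has core degree 1, while the standard h-index is 2. Second, when no neighbor qualifies, the sketch returns the number of neighbors. The function name `h_index` is hypothetical.

```python
def h_index(neighbor_coreness):
    """Largest h such that at least h of the values are >= h."""
    # S610: sort in descending order; enumerate() supplies sequence numbers from 0.
    ranked = sorted(neighbor_coreness, reverse=True)
    for sequence_number, core in enumerate(ranked):
        # S620: the first neighbor whose sequence number reaches its core degree.
        if sequence_number >= core:
            # S630 (as assumed here): use the sequence number as the h-index.
            return sequence_number
    # No neighbor qualifies: every neighbor counts towards h.
    return len(ranked)


example = h_index([2, 3, 4, 5, 6])
```

Sorting makes the cost O(d log d) for a node with d neighbors.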
  • Step S440 Update the current node core degree of the active node according to the temporary node core degree; and determine the active node and the neighbor nodes that have an adjacency relationship with the active node as computing nodes for core degree mining in the next iteration round.
  • the temporary node core degree of the active node can be compared with its current node core degree. If the temporary node core degree is smaller than the current node core degree, the current node core degree is replaced with the temporary node core degree; if the two are the same, the computing node does not need to be updated in the current iteration round.
  • the overall update result in the relational graph network may be aggregated from the node core degree update results in each partitioned graph network, providing the basis for core degree mining in the next iteration round.
  • FIG. 7 shows a flowchart of steps for aggregating node coreness mining results of a partition graph network in an embodiment of the present application.
  • the method for summarizing the node core degree mining results of each partition graph network may include the following steps S710 to S730.
  • Step S710 Write the updated current node coreness of the active node into the second storage space, where the second storage space is used to store the node coreness of all nodes in the relational graph network.
  • Step S720 Obtain the node identifier of the active node and the node identifiers of the neighbor nodes of the active node, and write the obtained node identifiers into the third storage space, where the third storage space is used to store the node identifiers of the computing nodes that perform core degree mining in the next iteration round.
  • Step S730 After completing the core degree mining of all partition graph networks in the current iteration round, use the data in the third storage space to overwrite the data in the first storage space, and reset the third storage space.
  • the embodiment of the present application configures a third storage space, and aggregates and distributes the node core degree mining results of the partitioned graph networks by updating and resetting the third storage space in each iteration round, which ensures the stability and reliability of data processing while improving data processing efficiency.
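The interplay of steps S410 to S440 and S710 to S730 can be sketched in a single process. This is a simplified illustration, not the distributed implementation. Plain Python containers stand in for the three storage spaces, core degrees are initialized to node degrees, and updates are applied only at the end of a round so that every computation sees the values left by the previous round. All function and variable names are hypothetical.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    ranked = sorted(values, reverse=True)
    for sequence_number, core in enumerate(ranked):
        if sequence_number >= core:
            return sequence_number
    return len(ranked)


def run_round(adjacency, coreness, read_ids):
    """Run one iteration round and return the identifiers for the next round."""
    write_ids = set()  # stands in for the third storage space
    updates = {}
    for node in read_ids:  # S410: computing nodes of this round
        # S420/S430: temporary core degree from the neighbors' current values.
        temporary = h_index([coreness[n] for n in adjacency[node]])
        if temporary < coreness[node]:  # the node becomes an active node
            updates[node] = temporary
            write_ids.add(node)  # S720: active node and its neighbors
            write_ids.update(adjacency[node])
    coreness.update(updates)  # S440/S710: publish the new core degrees
    return write_ids  # S730: replaces the first storage space


def mine_coreness(adjacency):
    """Iterate until no node is updated; returns node -> core degree."""
    coreness = {n: len(nbrs) for n, nbrs in adjacency.items()}  # second space
    read_ids = set(adjacency)  # first storage space: all nodes initially
    while read_ids:
        read_ids = run_round(adjacency, coreness, read_ids)
    return coreness


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
result = mine_coreness(graph)
```

On the toy graph only node c is updated (from 3 to 2) in the first round, and the second round produces no updates, so the loop stops.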
  • step S230 the relational graph network is pruned according to the core degree of the nodes, and some nodes and some edges in the relational graph network are removed.
  • some nodes and edges in the relational graph network will gradually converge to a stable state: their node core degrees will not be updated in subsequent iterations, nor will they affect the core degree mining results of other nodes.
  • such converged nodes and edges can be pruned away to reduce the data scale of the relational graph network and the partitioned graph networks.
  • this embodiment of the present application can obtain the minimum core degree of the active nodes in the current iteration round and the minimum core degree of the active nodes in the previous iteration round. If the former is greater than the latter, convergent nodes in the relational graph network are screened according to the minimum core degree of the active nodes in the previous iteration round, where a convergent node is a node whose node core degree is less than or equal to the minimum core degree of the active nodes in the previous iteration round; the convergent nodes and the edges connected to them are then removed from the relational graph network.
  • FIG. 8 shows a schematic diagram of a process of compressing and pruning a relational graph network based on iterative update of node coreness in an embodiment of the present application.
  • the key to the compression pruning method is to analyze how the core value of each node changes in each iteration. Consider the core value of node v in the t-th iteration, and let minCore(t) denote the minimum core value among the nodes whose core value is updated in the t-th iteration.
  • when the core value of a node is updated, the updated core value is smaller than the original core value.
  • when minCore(t) > minCore(t-1), all nodes whose core value is less than or equal to minCore(t-1) have converged and will no longer be updated.
  • nodes with smaller core values do not affect the iteration of nodes with larger core values, so the converged nodes and their corresponding edges can be "cut out" in each iteration, so that the relational graph network is gradually compressed and becomes smaller as the iteration progresses.
  • when minCore(2) > minCore(1), pruning of the relational graph network can be triggered to remove nodes with a core value of 1, so as to achieve the purpose of compressing the relational graph network.
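The pruning trigger described above can be sketched as follows. The sketch assumes an undirected adjacency mapping and that the caller tracks the two minimum core values; the core values of removed nodes are treated as final and kept by the caller. The name `prune_converged` is hypothetical.

```python
def prune_converged(adjacency, coreness, min_core_previous, min_core_current):
    """Remove converged nodes and their edges when the minimum core value rises."""
    if min_core_current <= min_core_previous:
        return adjacency, set()  # trigger condition not met: nothing to prune
    converged = {n for n in adjacency if coreness[n] <= min_core_previous}
    pruned = {
        node: [n for n in neighbors if n not in converged]
        for node, neighbors in adjacency.items()
        if node not in converged
    }
    return pruned, converged


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
coreness = {"a": 2, "b": 2, "c": 2, "d": 1}
# minCore rose from 1 to 2, so nodes with core value <= 1 are cut out.
smaller_graph, removed = prune_converged(graph, coreness, 1, 2)
```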
  • step S240 when the network scale of the relational graph network satisfies the preset network compression condition, a compression process is performed on the device cluster, and some computing devices in the device cluster are removed.
  • the network compression condition may include that the number of edges in the relational graph network is less than a preset number threshold.
  • the device cluster can be compressed by re-segmenting the relational graph network according to its network scale after the pruning process to obtain a reduced number of partitioned graph networks, and invoking correspondingly fewer computing devices based on the reduced number of partitioned graph networks.
  • when the network scale of the compressed relational graph network satisfies a certain condition, one computing device may be selected in the device cluster as a target device for performing single-computer computing on the relational graph network, and the computing devices other than the target device may be removed from the device cluster.
  • in this way, a distributed computing mode based on multiple computing devices is switched to a centralized computing mode based on a single computing device.
  • the data processing method provided by the embodiment of the present application relates to a method for k-core mining based on the idea of compression and pruning.
  • the method may rely on the iterative update of the h-index to perform graph network compression and pruning automatically when specified conditions are met.
  • the method flow of the data processing method provided in the embodiment of the present application in an application scenario may include the following steps.
  • step (5) Judging whether numMsgs is 0, when numMsgs is 0, it indicates that the core values of all nodes are no longer updated, and the iteration is stopped; otherwise, step (5) is performed.
  • the above iterative steps are first carried out in a distributed parallel computing manner.
  • when the scale of the compressed subgraph G'(V, E) satisfies a given condition (for example, the number of edges is less than 30 million), the distributed computing can be converted into a single-computer computing mode.
  • the single-computer computing mode can not only release a large amount of computing resources, but also save additional time overhead such as data distribution caused by parallel computing.
  • the later stage of iteration usually focuses on the update of long-chain nodes, and it is more appropriate to use a single-computer computing mode.
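The switch from distributed to single-computer computing can be sketched as a simple threshold check on the edge count. The 30 million figure is the example threshold from the text. Keeping the first device in the list as the target device is an arbitrary assumption, and the names `count_edges` and `compress_cluster` are hypothetical.

```python
def count_edges(adjacency):
    """Count undirected edges; each edge appears in two adjacency lists."""
    return sum(len(neighbors) for neighbors in adjacency.values()) // 2


def compress_cluster(devices, num_edges, edge_threshold=30_000_000):
    """Return (retained_devices, released_devices) for the given graph size."""
    if num_edges < edge_threshold and len(devices) > 1:
        # Keep one target device and release the rest (assumed choice: the first).
        return devices[:1], devices[1:]
    return devices, []


devices = ["dev0", "dev1", "dev2"]
retained, released = compress_cluster(devices, 12_000_000)
still_all, none_released = compress_cluster(devices, 45_000_000)
```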
  • the k-core mining algorithm in the embodiments of the present application can implement distributed computing on the Spark on Angel platform.
  • Spark is a fast and general computing engine designed for large-scale data processing
  • Angel is a high-performance distributed machine learning platform designed and developed based on the concept of a Parameter Server (PS).
  • the Spark on Angel platform is a high-performance distributed computing platform that combines Angel's powerful parameter server function with Spark's large-scale data processing capabilities, and supports traditional machine learning, deep learning and various graph algorithms.
  • FIG. 9 shows the overall architecture and processing flowchart of k-core mining in an application scenario according to an embodiment of the present application.
  • each Executor is responsible for storing the adjacency table partition data (that is, the network data of the partition graph network GraphPartion), calculating the h-index value, and performing compression and pruning operations.
  • the Angel Parameter Server is responsible for storing and updating the node core values, that is, the coreness vector in FIG. 9.
  • the PS also stores the nodes that need to be calculated in the current iteration and in the next iteration, which are the ReadMessage vector and the WriteMessage vector in FIG. 9, respectively.
  • the node that has been updated in this iteration is called the active node.
  • the core value of a node is determined by its neighbor nodes. A change in the core value of an active node affects the core values of its neighbor nodes, so its neighbor nodes should be calculated in the next iteration; what is stored in WriteMessage in real time is therefore the neighbor nodes of the active nodes in this iteration.
  • Executor and PS perform data processing in the following interactive manner in each iteration.
  • the k-core mining method based on the compression idea provided by the embodiments of the present application can solve the problems of high resource overhead and long time consuming caused by k-core mining in an ultra-large-scale network.
  • a real-time compression method is designed, which can release part of the computing resources as the iteration progresses; combine the advantages of distributed parallel computing and single-computer computing to improve k-core mining performance;
  • the k-core mining method based on the compression idea is implemented on a high-performance graph computing platform, which can support ultra-large networks with tens of billions/hundreds of billions of edges, with low resource overhead and high performance.
  • FIG. 10 schematically shows a structural block diagram of a data processing apparatus provided by an embodiment of the present application.
  • as shown in FIG. 10, the data processing apparatus 1000 may mainly include: a graph network obtaining module 1010, configured to obtain a relational graph network, the relational graph network including nodes for representing interactive objects and edges for representing the interaction relationships between a plurality of interactive objects; a core degree mining module 1020, configured to perform core degree mining on the relational graph network through a device cluster including a plurality of computing devices, and iteratively update the node core degrees of all or some of the nodes in the relational graph network; a network pruning module 1030, configured to perform pruning processing on the relational graph network according to the node core degrees, and remove some nodes and some edges in the relational graph network; and a cluster compression module 1040, configured to perform compression processing on the device cluster and remove some computing devices in the device cluster when the network scale of the relational graph network satisfies a preset network compression condition.
  • the cluster compression module 1040 includes: a single-computer computing unit, configured to select one computing device in the device cluster as a target device for performing single-computer computing on the relational graph network, and to remove the computing devices other than the target device from the device cluster.
  • the core degree mining module 1020 includes: a network segmentation unit, configured to perform segmentation processing on the relational graph network to obtain partitioned graph networks each composed of some of the nodes and some of the edges in the relational graph network; a network allocation unit, configured to allocate the partitioned graph networks to a device cluster including a plurality of computing devices, and to determine the computing device that performs core degree mining on each partitioned graph network; and a partition mining unit, configured to perform core degree mining on the partitioned graph network through the allocated computing device, and iteratively update the node core degree of each node in the partitioned graph network.
  • the partition mining unit includes: a node selection subunit, configured to select, in the partitioned graph network, the computing nodes that perform core degree mining in the current iteration round, and to determine the neighbor nodes that have an adjacency relationship with each computing node; a core degree acquisition subunit, configured to acquire the current node core degrees of the computing node and the neighbor nodes in the current iteration round; a core degree calculation subunit, configured to determine the temporary node core degree of the computing node according to the current node core degrees of the neighbor nodes, determine whether the temporary node core degree of the computing node is less than the current node core degree of the computing node, and if so, mark the computing node as an active node; and a core degree update subunit, configured to update the current node core degree of the active node according to the temporary node core degree, and to determine the active node and the neighbor nodes that have an adjacency relationship with the active node as computing nodes that perform core degree mining in the next iteration round.
  • the core degree calculation subunit includes: an h-index calculation subunit, configured to determine the h-index of the computing node according to the current node core degrees of the neighbor nodes and to use the h-index as the temporary node core degree of the computing node, where the h-index is the largest value h such that at least h of the neighbor nodes of the computing node have a current node core degree greater than or equal to h.
  • the h-index calculation subunit includes: a node sorting subunit, configured to sort all neighbor nodes of the computing node in descending order of current node core degree, and to assign each neighbor node a sequence number starting from 0; a node screening subunit, configured to compare the sequence number of each neighbor node with that neighbor node's current node core degree, and to select, according to the comparison result, the neighbor nodes whose sequence number is greater than or equal to their current node core degree; and an h-index determination subunit, configured to determine the current node core degree of the neighbor node with the smallest sequence number among the selected neighbor nodes as the h-index of the computing node.
  • the node selection subunit includes: an identifier reading subunit, configured to read the node identifiers of the nodes to be updated from the first storage space, where the nodes to be updated include the active nodes that updated their node core degree in the previous iteration round and the neighbor nodes that have an adjacency relationship with those active nodes; and an identifier selection subunit, configured to select, according to the node identifiers of the nodes to be updated, the computing nodes that perform core degree mining in the current iteration round.
  • the data processing apparatus further includes: a core degree writing module, configured to write the updated current node core degree of the active node into the second storage space, where the second storage space is used to store the node core degrees of all nodes in the relational graph network; an identifier writing module, configured to obtain the node identifier of the active node and the node identifiers of the neighbor nodes of the active node, and to write the obtained node identifiers into the third storage space, where the third storage space is used to store the node identifiers of the computing nodes that perform core degree mining in the next iteration round; and a further module, configured to overwrite the data in the first storage space with the data in the third storage space and to reset the third storage space after the core degree mining of all partitioned graph networks in the current iteration round is completed.
  • the core degree acquisition subunit includes: a core degree reading subunit, configured to read from the second storage space the current node core degrees of the computing node and the neighbor nodes in the current iteration round, where the second storage space is used to store the node core degrees of all nodes in the relational graph network.
  • the network pruning module 1030 includes: a minimum core degree obtaining unit, configured to obtain the minimum core degree of the active nodes in the current iteration round and the minimum core degree of the active nodes in the previous iteration round; a convergent node screening unit, configured to, if the minimum core degree of the active nodes in the current iteration round is greater than the minimum core degree of the active nodes in the previous iteration round, screen the convergent nodes in the relational graph network according to the minimum core degree of the active nodes in the previous iteration round, where a convergent node is a node whose node core degree is less than or equal to the minimum core degree of the active nodes in the previous iteration round; and a convergent node removal unit, configured to remove the convergent nodes and the edges connected to the convergent nodes from the relational graph network.
  • the network compression condition includes that the number of edges in the relational graph network is less than a preset number threshold.
  • the apparatus further includes a core degree initialization module, where the core degree initialization module is configured to obtain, for each node in the relational graph network, the number of neighbor nodes that have an adjacency relationship with the node, and to initialize, for each node, the node core degree of the node according to the number of neighbor nodes that have an adjacency relationship with the node.
  • FIG. 11 schematically shows a structural block diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
  • the computer system 1100 includes a central processing unit 1101 (Central Processing Unit, CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory 1102 (Read-Only Memory, ROM) or a program loaded from a storage section 1108 into a random access memory 1103 (Random Access Memory, RAM). The random access memory 1103 also stores various programs and data necessary for system operation.
  • the central processing unit 1101, the read-only memory 1102 and the random access memory 1103 are connected to each other through a bus 1104.
  • An input/output interface 1105 (Input/Output interface, ie, I/O interface) is also connected to the bus 1104.
  • the following components are connected to the input/output interface 1105: an input section 1106 including a keyboard, a mouse, etc.; an output section 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc. ; a storage section 1108 including a hard disk, etc.; and a communication section 1109 including a network interface card such as a local area network card, a modem, and the like. The communication section 1109 performs communication processing via a network such as the Internet.
  • a drive 1110 is also connected to the input/output interface 1105 as required.
  • a removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 1110 as needed so that a computer program read therefrom is installed into the storage section 1108 as needed.
  • embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication portion 1109, and/or installed from the removable medium 1111.
  • when the computer program is executed by the central processing unit 1101, various functions defined in the system of the present application are executed.
  • the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Erasable Programmable Read Only Memory (EPROM), flash memory, optical fiber, portable Compact Disc Read-Only Memory (CD-ROM), optical storage device, magnetic storage device, or any suitable of the above The combination.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein.
  • Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination of the foregoing.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more logical functions for implementing the specified functions executable instructions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • the exemplary embodiments described herein may be implemented by software, or may be implemented by software combined with necessary hardware. Therefore, the technical solutions according to the embodiments of the present application may be embodied in the form of software products, and the software products may be stored in a non-volatile storage medium (which may be CD-ROM, U disk, mobile hard disk, etc.) or on the network , which includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Software Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Game Theory and Decision Science (AREA)
  • Molecular Biology (AREA)
  • Discrete Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

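This application belongs to the field of artificial intelligence technology and relates specifically to a data processing method, a data processing apparatus, a computer-readable medium, and an electronic device. The method includes: obtaining a relationship graph network, the relationship graph network including nodes that represent interaction objects and edges that represent interaction relationships among a plurality of interaction objects; performing core degree mining on the relationship graph network by using a device cluster that includes a plurality of computing devices, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network; pruning the relationship graph network according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network; and, when the network scale of the relationship graph network satisfies a preset network compression condition, compressing the device cluster, to remove some of the computing devices from the device cluster. The method can reduce the consumption of computing resources and improve data processing efficiency.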

Description

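Data processing method and apparatus, computer-readable medium, and electronic device
This application claims priority to Chinese Patent Application No. 2020116269066, entitled "Data processing method and apparatus, computer-readable medium, and electronic device" and filed with the China Patent Office on December 31, 2020, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence technology, and specifically to data processing technology.
Background
With the development of computer and network technologies, users can establish a wide variety of interaction relationships on top of the business services provided by network platforms. For example, a user can establish social relationships with other users on an online social platform, or establish transaction relationships with other users on an online payment platform. As a result, a network platform accumulates a large amount of user data, including data related to a user's own attributes that is generated when the user uses the platform, and interaction data generated when different users establish interaction relationships.
Properly organizing and mining user data allows a network platform to summarize user characteristics and, based on them, provide users with more convenient and efficient platform services. However, as user data keeps accumulating, the growing data scale places increasing pressure on data processing, and the network platform has to spend more and more computing resources and time on analyzing and processing user data. How to improve the efficiency of big data analysis and reduce the associated cost is therefore a problem that urgently needs to be solved.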
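Summary
The embodiments of this application provide a data processing method, a data processing apparatus, a computer-readable medium, an electronic device, and a computer program product that can, to a certain extent, overcome technical problems in big data analysis such as high consumption of computing resources and low data processing efficiency.
Other features and advantages of this application will become apparent from the following detailed description, or will be learned in part through practice of this application.
According to one aspect of the embodiments of this application, a data processing method is provided, performed by an electronic device, the method including: obtaining a relationship graph network, the relationship graph network including nodes that represent interaction objects and edges that represent interaction relationships among a plurality of interaction objects; performing core degree mining on the relationship graph network by using a device cluster that includes a plurality of computing devices, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network; pruning the relationship graph network according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network; and, when the network scale of the relationship graph network satisfies a preset network compression condition, compressing the device cluster, to remove some of the computing devices from the device cluster.
According to one aspect of the embodiments of this application, a data processing apparatus is provided, the apparatus including: a graph network obtaining module, configured to obtain a relationship graph network, the relationship graph network including nodes that represent interaction objects and edges that represent interaction relationships among a plurality of interaction objects; a core degree mining module, configured to perform core degree mining on the relationship graph network by using a device cluster that includes a plurality of computing devices, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network; a network pruning module, configured to prune the relationship graph network according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network; and a cluster compression module, configured to compress the device cluster when the network scale of the relationship graph network satisfies a preset network compression condition, to remove some of the computing devices from the device cluster.
According to one aspect of the embodiments of this application, a computer-readable medium is provided, storing a computer program that, when executed by a processor, implements the data processing method in the above technical solutions.
According to one aspect of the embodiments of this application, an electronic device is provided, the electronic device including: a processor; and a memory configured to store executable instructions of the processor, where the processor is configured to perform the data processing method in the above technical solutions by executing the executable instructions.
According to one aspect of the embodiments of this application, a computer program product or computer program is provided, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the data processing method in the above technical solutions.
In the technical solutions provided by the embodiments of this application, a relationship graph network is built from business data that involves interaction relationships between interaction objects. Exploiting the structural characteristics and sparsity of the relationship graph network, core degree mining can first be carried out region by region through distributed computation on a device cluster. As the node core degrees are iteratively updated, the relationship graph network is pruned: nodes that have already converged, together with their connecting edges, are "cut off", so that the relationship graph network keeps shrinking as the iteration proceeds and the consumption of computing resources is reduced. On this basis, once the relationship graph network has been compressed to a suitable size, the device cluster used for core degree mining can itself be compressed. This not only releases a large amount of computing resources but also saves the extra time overhead, such as data distribution, that parallel computation incurs, thereby improving data processing efficiency.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit this application.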
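Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the principles of this application. The drawings described below are merely some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 shows an architecture block diagram of a data processing system to which the technical solutions of this application are applied;
Figure 2 shows a flowchart of the steps of a data processing method in an embodiment of this application;
Figure 3 shows a flowchart of the steps of a method for core degree mining based on distributed computing in an embodiment of this application;
Figure 4 shows a flowchart of the steps of core degree mining on a partition graph network in an embodiment of this application;
Figure 5 shows a flowchart of the steps of selecting computing nodes in an embodiment of this application;
Figure 6 shows a flowchart of the steps of determining the h-index of a computing node in an embodiment of this application;
Figure 7 shows a flowchart of the steps of aggregating the node core degree mining results of the partition graph networks in an embodiment of this application;
Figure 8 shows a schematic diagram of the process of compressing and pruning the relationship graph network based on iterative updates of node core degrees in an embodiment of this application;
Figure 9 shows the overall architecture and processing flow of k-core mining in an application scenario according to an embodiment of this application;
Figure 10 schematically shows a structural block diagram of a data processing apparatus provided by an embodiment of this application;
Figure 11 schematically shows a structural block diagram of a computer system of an electronic device suitable for implementing the embodiments of this application.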
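Detailed Description
Example implementations will now be described more fully with reference to the accompanying drawings. However, the example implementations can be carried out in many forms and should not be construed as limited to the examples set forth here; rather, these implementations are provided so that this application will be thorough and complete, and will fully convey the concept of the example implementations to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of this application. However, those skilled in the art will appreciate that the technical solutions of this application may be practiced without one or more of the specific details, or with other methods, components, apparatuses, steps, and the like. In other cases, well-known methods, apparatuses, implementations, or operations are not shown or described in detail, to avoid obscuring aspects of this application.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
The flowcharts shown in the drawings are merely illustrative. They need not include all of the contents and operations/steps, nor must the operations/steps be performed in the order described. For example, some operations/steps may be decomposed and others may be merged or partially merged, so the actual order of execution may change according to the actual situation.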
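Figure 1 shows an architecture block diagram of a data processing system to which the technical solutions of this application can be applied.
As shown in Figure 1, the data processing system 100 may include a terminal device 110, a network 120, and a server 130.
The terminal device 110 may include various electronic devices such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, smart glasses, and an in-vehicle terminal. Clients of various applications, such as a video application client, a music application client, a social application client, and a payment application client, may be installed on the terminal device 110, so that a user can use the corresponding application services through those clients.
The server 130 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The network 120 may be a communication medium of any connection type capable of providing a communication link between the terminal device 110 and the server 130, for example a wired communication link or a wireless communication link.
Depending on implementation needs, the system architecture in the embodiments of this application may include any number of terminal devices, networks, and servers. For example, the server 130 may be a server group composed of multiple server devices. In addition, the technical solutions provided by the embodiments of this application may be applied to the terminal device 110, to the server 130, or may be implemented jointly by the terminal device 110 and the server 130; this application places no special limitation on this.
For example, when using a social application on the terminal device 110, a user can exchange messages with other users on an online social platform, or engage in online social behavior such as voice or video sessions. Through this process the user can establish social relationships with other users, and corresponding social business data is generated on the online social platform. As another example, when using a payment application on the terminal device 110, a user can make payments to or receive payments from other users on an online payment platform. Through this process the user can establish transaction relationships with other users, and corresponding transaction business data is generated on the online payment platform.
After user data such as social business data or transaction business data is collected, the embodiments of this application can build a graph network model based on the interaction relationships corresponding to the user data and mine the model to obtain the business attributes of users within those relationships. Taking a transaction scenario as an example, in a graph network model that reflects the transaction relationships between merchants and consumers, a node represents a merchant or a consumer, and an edge indicates that a transaction relationship exists between two nodes. Merchant nodes are generally located closer to the center of the network. The core degree (core value) of a node can serve as a topological feature and be fed into downstream machine learning tasks, for example a business pattern mining task that identifies whether a node in the graph network model is a merchant or a consumer. In addition, in risk control scenarios for payment services, data mining on the graph network model can be used to detect whether a node (or edge) exhibits abnormal transaction behavior, supporting the detection of abnormal transactions such as illegal credit intermediation, cash-out fraud, borrowing from multiple lenders, and gambling.
To improve the efficiency of analyzing and mining big data, the embodiments of this application can use cloud technology for distributed computing.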
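Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources within a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. Cloud technology covers the network technology, information technology, integration technology, management platform technology, and application technology used in the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. The background services of technical network systems, such as video websites, image websites, and portal websites, require large amounts of computing and storage resources. With the rapid development and application of the Internet industry, every item may in the future carry its own identifier that needs to be transmitted to a background system for logical processing, data of different levels will be processed separately, and all kinds of industry data will need strong system support, which can only be achieved through cloud computing.
Cloud computing is a computing model that distributes computing tasks over a resource pool composed of a large number of computers, so that application systems can obtain computing power, storage space, and information services as needed. The network that provides the resources is called the "cloud". From the user's point of view, the resources in the cloud are infinitely expandable: they can be obtained at any time, used on demand, expanded at any time, and paid for according to use.
A basic capability provider of cloud computing builds a cloud computing resource pool (a cloud platform for short, generally called an IaaS (Infrastructure as a Service) platform) and deploys multiple types of virtual resources in the pool for external customers to choose from. The cloud computing resource pool mainly includes computing devices (virtualized machines, including operating systems), storage devices, and network devices.
In terms of logical function, a PaaS (Platform as a Service) layer can be deployed on the IaaS layer, and a SaaS (Software as a Service) layer can be deployed on the PaaS layer; SaaS can also be deployed directly on IaaS. PaaS is the platform on which software runs, such as databases and web containers. SaaS comprises all kinds of business software, such as web portals and bulk SMS senders. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
Big data refers to a collection of data that cannot be captured, managed, and processed with conventional software tools within a given time frame. It is a massive, fast-growing, and diversified information asset that requires new processing models with stronger decision-making power, insight discovery, and process optimization capabilities. With the advent of the cloud era, big data has attracted more and more attention, and it requires special technologies to process large amounts of data effectively. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
Artificial intelligence cloud services are generally also called AIaaS (AI as a Service). This is currently a mainstream service model for artificial intelligence platforms. Specifically, an AIaaS platform splits several common types of AI services and provides them in the cloud either independently or as packages. This service model is similar to opening an AI-themed store: all developers can access and use one or more of the artificial intelligence services provided by the platform through API interfaces, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own dedicated cloud artificial intelligence services.
Artificial Intelligence (AI) is the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, and smart customer service. As the technology develops, artificial intelligence is expected to be applied in more fields and to deliver increasingly important value.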
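The technical solutions provided by the embodiments of this application, including the data processing method, the data processing apparatus, the computer-readable medium, the electronic device, and the computer program product, are described in detail below with reference to specific implementations.
Figure 2 shows a flowchart of the steps of a data processing method in an embodiment of this application. The data processing method may be performed by an electronic device: for example, on the terminal device 110 shown in Figure 1, on the server 130 shown in Figure 1, or jointly by the terminal device 110 and the server 130. As shown in Figure 2, the data processing method may mainly include the following steps S210 to S240.
Step S210: Obtain a relationship graph network, the relationship graph network including nodes that represent interaction objects and edges that represent interaction relationships among a plurality of interaction objects.
Step S220: Perform core degree mining on the relationship graph network by using a device cluster that includes a plurality of computing devices, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network.
Step S230: Prune the relationship graph network according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network.
Step S240: When the network scale of the relationship graph network satisfies a preset network compression condition, compress the device cluster, to remove some of the computing devices from the device cluster.
In the data processing method provided by the embodiments of this application, a relationship graph network is built from business data that involves interaction relationships between interaction objects. Exploiting the structural characteristics and sparsity of the relationship graph network, core degree mining can first be carried out region by region through distributed computation on a device cluster. As the node core degrees are iteratively updated, the relationship graph network is pruned: nodes that have already converged, together with their connecting edges, are "cut off", so that the relationship graph network keeps shrinking as the iteration proceeds and the consumption of computing resources is reduced. On this basis, once the relationship graph network has been compressed to a suitable size, the device cluster used for core degree mining can itself be compressed. This not only releases a large amount of computing resources but also saves the extra time overhead, such as data distribution, that parallel computation incurs, thereby improving data processing efficiency.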
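Each step of the data processing method in the above embodiment is described in detail below.
In step S210, a relationship graph network is obtained, the relationship graph network including nodes that represent interaction objects and edges that represent interaction relationships among a plurality of interaction objects.
An interaction object may be a user object that interacts on a network business platform. For example, in an online payment scenario involving commodity transactions, the interaction objects may include consumers who initiate online payments and merchants who receive them, and the interaction relationship between interaction objects is the online transaction relationship established between a consumer and a merchant through a payment event.
In the embodiments of this application, by collecting the business data generated when multiple interaction objects do business with one another, the interaction objects and the interaction relationships among them can be extracted, and a relationship graph network composed of nodes and edges can be built. Each node can represent one interaction object, and an edge connecting two nodes represents the interaction relationship between the interaction objects corresponding to those two nodes.
In step S220, core degree mining is performed on the relationship graph network by using a device cluster that includes a plurality of computing devices, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network.
The node core degree is a parameter that measures the importance of each node in a graph network. In the embodiments of this application, the coreness of each node, determined when k-core decomposition is performed on the graph network, can be used to represent the node core degree. The k-core of a graph is the subgraph that remains after nodes with degree less than k are repeatedly removed. The degree of a node equals the number of neighbor nodes directly adjacent to it. Generally speaking, the degree of a node reflects, to some extent, its importance in a local region of the graph network, whereas mining the coreness of a node measures its importance better on a global scale.
If a node exists in the k-core but is removed in the (k+1)-core, the coreness of the node is k. k-core mining is an algorithm for computing the coreness of all nodes in a graph network. For example, the original graph network is the 0-core; the 1-core is the graph with all isolated nodes removed; the 2-core is obtained by first removing all nodes with degree less than 2, then removing the nodes with degree less than 2 from the remaining graph, and so on until no more nodes can be removed; the 3-core is obtained by first removing all nodes with degree less than 3, then removing the nodes with degree less than 3 from the remaining graph, and so on until no more nodes can be removed. The coreness of a node is defined as the order of the highest-order core that contains the node. For example, if a node is in the 5-core but not in the 6-core, its coreness is 5.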
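The k-core definition above can be illustrated with a small single-machine sketch. This is not the distributed method of this application; it is the classic peeling procedure, written under the assumption that the graph is undirected and given as a list of node pairs. The function name `coreness_by_peeling` is hypothetical.

```python
from collections import defaultdict


def coreness_by_peeling(edges):
    """Return {node: coreness} for an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is <= k cannot be part of the (k+1)-core.
        peel = [node for node in remaining if degree[node] <= k]
        if not peel:
            k += 1  # nothing left to peel at this level, move to the next core
            continue
        for node in peel:
            core[node] = k
            remaining.discard(node)
            for nbr in adj[node]:
                if nbr in remaining:
                    degree[nbr] -= 1
    return core
```

For a triangle with one pendant node attached, the pendant node gets coreness 1 and the three triangle nodes get coreness 2, matching the rule that a node's coreness is the order of the highest-order core containing it.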
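Figure 3 shows a flowchart of the steps of a method for core degree mining based on distributed computing in an embodiment of this application. As shown in Figure 3, on the basis of the above embodiments, performing core degree mining on the relationship graph network by using a device cluster that includes a plurality of computing devices in step S220, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network, may include the following steps S310 to S330.
Step S310: Partition the relationship graph network to obtain partition graph networks, each composed of some of the nodes and some of the edges in the relationship graph network.
A large relationship graph network can be partitioned into multiple, relatively small partition graph networks. In an embodiment of this application, the method for partitioning the relationship graph network may include: first, selecting multiple partition center points in the relationship graph network according to a preset number of partitions; then, using the partition center points as cluster centers, clustering all nodes in the relationship graph network so that each node is assigned to the partition center point closest to it; and finally, partitioning the relationship graph network into multiple partition graph networks according to the clustering result. A partition center point may be a node selected in the relationship graph network according to a preset rule, or a randomly selected node.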
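One way to picture the center-based partitioning is a multi-source breadth-first search that assigns every node to the center it reaches first. This is only a sketch under assumptions that the source does not state: distance is taken to be hop count, ties go to whichever center's frontier arrives first, and nodes unreachable from every center are left unassigned. The name `partition_by_centers` is hypothetical.

```python
from collections import defaultdict, deque


def partition_by_centers(edges, centers):
    """Assign each reachable node to its nearest center (by hop count)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    owner = {center: center for center in centers}
    queue = deque(centers)  # all centers start at distance 0
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in owner:  # first frontier to arrive wins
                owner[nbr] = owner[node]
                queue.append(nbr)

    parts = defaultdict(set)
    for node, center in owner.items():
        parts[center].add(node)
    return dict(parts)
```

On a six-node path with the two end nodes as centers, each center receives the three nodes on its side. A production partitioner would also need to balance partition sizes, which this sketch does not attempt.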
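In an embodiment of this application, a certain overlapping region may be kept between two adjacent partition graph networks. The two partition graph networks share some nodes and edges in the overlapping region, which introduces a degree of computational redundancy and improves the reliability of core degree mining on each partition graph network.
Step S320: Assign the partition graph networks to the device cluster that includes a plurality of computing devices, and determine the computing devices used for core degree mining on the partition graph networks.
By assigning the partition graph networks to different computing devices, the device cluster formed by those computing devices can carry out core degree mining as a distributed computation, improving data processing efficiency.
In an embodiment of this application, when the relationship graph network is partitioned, it may be divided into a number of partition graph networks that corresponds to the number of available computing devices in the device cluster. For example, if the device cluster used for distributed computing includes M computing devices, the relationship graph network can be divided into M partition graph networks accordingly.
In another embodiment of this application, the relationship graph network may instead be divided into a number of partition graph networks of similar size according to the computing capability of a single computing device, and the partition graph networks are then assigned to the same number of computing devices. For example, if the relationship graph network includes N nodes, it can be divided into N/T partition graph networks, where T is the number of nodes in a single partition graph network that a single computing device can process, determined from that device's computing capability. When the relationship graph network is large and there are many partition graph networks, each partition graph network contains approximately T nodes. After the relationship graph network is partitioned, N/T computing devices are selected from the device cluster and each is assigned one partition graph network. When the device cluster contains fewer than N/T devices, multiple partition graph networks can be assigned to some or all of the computing devices according to their computing power and working state.
Step S330: Perform core degree mining on the partition graph networks by using the assigned computing devices, to iteratively update the node core degree of each node in the partition graph networks.
In an embodiment of this application, the node core degree of each node in the relationship graph network may first be initialized according to a preset rule, and the node core degree of each node is then iteratively updated in each iteration round.
In some optional implementations, the node core degree can be initialized from the degree of the node. Specifically, in the relationship graph network, for each node, the number of neighbor nodes adjacent to the node is obtained; then, for each node, the node core degree is initialized according to that number of adjacent neighbor nodes. The degree of a node is the number of neighbor nodes adjacent to it. In some other implementations, weight information may also be determined from the node's own attributes, and the node core degree is then initialized from both the degree of the node and the weight information.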
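Figure 4 shows a flowchart of the steps of core degree mining on a partition graph network in an embodiment of this application. As shown in Figure 4, on the basis of the above embodiments, performing core degree mining on the partition graph network in step S330, to iteratively update the node core degree of each node in the partition graph network, may include the following steps S410 to S440.
Step S410: In the partition graph network, select the computing nodes on which core degree mining is performed in the current iteration round, and determine the neighbor nodes adjacent to the computing nodes.
In the first iteration round after the node core degrees have been initialized, all nodes in the partition graph network can be determined as computing nodes. A computing node is a node for which core degree mining needs to be computed in the current iteration round; whether to update the node core degree of each node is determined from the mining result.
In each iteration round of core degree mining, the computing nodes for the current iteration round can be determined from the core degree mining result and the node core degree update result of the previous iteration round. Some or all of these computing nodes will have their node core degrees updated in the current iteration round. Nodes other than the computing nodes do not undergo core degree mining in the current iteration round, and therefore their node core degrees are not updated either.
A neighbor node in the embodiments of this application is another node that is directly connected to a given node. Because the node core degree of each node is affected by its neighbor nodes, a node whose node core degree is not updated in the current iteration round may still be selected as a computing node in a later iteration as the iteration proceeds.
Figure 5 shows a flowchart of the steps of selecting computing nodes in an embodiment of this application. As shown in Figure 5, selecting, in the partition graph network, the computing nodes on which core degree mining is performed in the current iteration round in step S410 may include the following steps S510 to S520.
Step S510: Read the node identifiers of nodes to be updated from a first storage space, the nodes to be updated including active nodes whose node core degrees were updated in the previous iteration round and the neighbor nodes adjacent to the active nodes.
The partition graph networks that make up the relationship graph network are processed in a distributed manner on different computing devices. The edge regions of two adjacent partition graph networks may contain nodes that were adjacent to each other in the original relationship graph network, and the node core degrees of those nodes still affect each other. Therefore, to keep the node core degree updates in the partition graph networks synchronized and consistent during distributed computation, the embodiments of this application allocate a first storage space in the system to store the node identifiers of all nodes to be updated in the relationship graph network.
In an iteration round, when a node in a partition graph network updates its node core degree according to the core degree mining result, the node can be marked as an active node. Active nodes and their neighbor nodes are treated as nodes to be updated, and their node identifiers are written to the first storage space.
Step S520: According to the node identifiers of the nodes to be updated, select, in the partition graph network, the computing nodes on which core degree mining is performed in the current iteration round.
When an iteration round starts, each computing device can read the node identifiers of the nodes to be updated from the first storage space and, according to those identifiers, select the computing nodes for the current iteration round in the partition graph network assigned to that computing device.
By performing steps S510 to S520 described above, the first storage space can be used to aggregate the node identifiers of all nodes to be updated in the relationship graph network after each iteration round ends, and to distribute them to the different computing devices when a new iteration round starts, so that each computing device selects computing nodes in the partition graph network it maintains.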
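Step S420: Obtain the current node core degrees, in the current iteration round, of the computing nodes and of the neighbor nodes of the computing nodes.
The embodiments of this application can monitor and update node core degrees in real time according to the core degree mining result in each iteration round. The current node core degree of a node in the current iteration round is the latest node core degree determined by the preceding iteration rounds.
In an optional implementation, the embodiments of this application can allocate a second storage space in the system to store the node core degrees of all nodes in the relationship graph network. When a computing device needs to perform core degree mining and updating based on existing core degree data, it can read the current node core degrees of the computing nodes and their neighbor nodes in the current iteration round from the second storage space.
Step S430: Determine the temporary node core degree of a computing node according to the current node core degrees of its neighbor nodes, and determine whether the temporary node core degree of the computing node is less than its current node core degree; if so, mark the computing node as an active node.
Taking coreness as the core degree as an example, in the related art of this application, core degree mining can be performed on the relationship graph network by recursive pruning based on the definition of the k-core. Specifically, starting from k=1, nodes with degree less than or equal to k and their connecting edges are repeatedly removed from the graph until all remaining nodes have degree greater than k. Recursive pruning is similar to "peeling an onion": the core value of all nodes peeled off in round k is k. However, because this method computes coreness by gradually shrinking the graph network as a whole from the outside in, it can only use centralized computation that processes the whole graph serially, and it is hard to adapt to distributed parallel processing. For ultra-large relationship graph networks (on the order of tens or hundreds of billions), it suffers from excessively long computation time and poor computing performance.
To overcome this problem, an embodiment of this application may use an iterative method based on the h-index for core degree mining. Specifically, the embodiments of this application can determine the h-index of a computing node according to the current node core degrees of its neighbor nodes and use the h-index as the temporary node core degree of the computing node. The h-index is the largest value h such that, among all neighbor nodes of the computing node, at least h neighbor nodes have a current node core degree greater than or equal to h.
For example, suppose a computing node has five neighbor nodes whose current node core degrees are 2, 3, 4, 5, and 6. Going through the core degree values in ascending order, among the five neighbor nodes of the computing node there are 5 neighbor nodes with a current node core degree greater than or equal to 1, 5 with a current node core degree greater than or equal to 2, 4 with a current node core degree greater than or equal to 3, 3 with a current node core degree greater than or equal to 4, 2 with a current node core degree greater than or equal to 5, and 1 with a current node core degree greater than or equal to 6. At least 3 neighbor nodes have a current node core degree greater than or equal to 3, but fewer than 4 have a current node core degree greater than or equal to 4, so the h-index of the computing node is 3, and the temporary node core degree of the computing node is therefore 3.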
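Figure 6 shows a flowchart of the steps of determining the h-index of a computing node in an embodiment of this application. As shown in Figure 6, on the basis of the above embodiments, the method for determining the h-index of a computing node according to the current node core degrees of its neighbor nodes may include the following steps S610 to S630.
Step S610: Sort all neighbor nodes of the computing node in descending order of current node core degree, and assign each neighbor node a rank number starting from 0.
Step S620: Compare the rank number of each neighbor node with its current node core degree, and filter out the neighbor nodes whose rank number is greater than or equal to their current node core degree.
Step S630: Among the filtered neighbor nodes, determine the current node core degree of the neighbor node with the smallest rank number as the h-index of the computing node.
By sorting and filtering, the embodiments of this application can determine the h-index of a computing node quickly and efficiently, which is especially suitable when the number of computing nodes is large.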
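The sort-and-scan idea of steps S610 to S630 can be sketched in a few lines. One caveat, stated as an assumption of this sketch: it returns the rank number at the first position where the rank is greater than or equal to the core degree, which is the conventional h-index. In the worked example with core degrees 2, 3, 4, 5, 6 this coincides with the core degree found at that position (both are 3), but the two can differ for other inputs. The name `h_index` is hypothetical.

```python
def h_index(neighbor_cores):
    """Largest h such that at least h of the values are >= h."""
    ranked = sorted(neighbor_cores, reverse=True)  # S610: descending, ranks start at 0
    for rank, core in enumerate(ranked):
        if rank >= core:                           # S620: first rank that reaches the value
            return rank                            # exactly `rank` values exceed this rank
    return len(ranked)                             # every value is larger than its rank
```

For example, the neighbor core degrees 6, 5, 1 give an h-index of 2, because two neighbors have a core degree of at least 2 but fewer than three have a core degree of at least 3.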
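Step S440: Update the current node core degree of the active node according to the temporary node core degree, and determine the active node and the neighbor nodes adjacent to the active node as computing nodes on which core degree mining is performed in the next iteration round.
After the temporary node core degree of an active node is obtained, it can be compared with the node's current node core degree. If the temporary node core degree is less than the current node core degree, the current node core degree is replaced with the temporary node core degree. If the two are equal, the computing node does not need to be updated in the current iteration round.
In an embodiment of this application, after the current node core degree of the active node is updated according to the temporary node core degree, the overall update result for the relationship graph network can be aggregated from the node core degree update results in each partition graph network, providing the basis for core degree mining in the next iteration round.
Figure 7 shows a flowchart of the steps of aggregating the node core degree mining results of the partition graph networks in an embodiment of this application. As shown in Figure 7, on the basis of the above embodiments, the method for aggregating the node core degree mining results of the partition graph networks may include the following steps S710 to S730.
Step S710: Write the updated current node core degree of the active node to a second storage space, the second storage space being used to store the node core degrees of all nodes in the relationship graph network.
Step S720: Obtain the node identifier of the active node and the node identifiers of the neighbor nodes of the active node, and write the obtained node identifiers to a third storage space, the third storage space being used to store the node identifiers of the computing nodes on which core degree mining is performed in the next iteration round.
Step S730: After core degree mining on all partition graph networks has been completed in the current iteration round, overwrite the data in the first storage space with the data in the third storage space, and reset the third storage space.
The embodiments of this application configure a third storage space and, in each iteration round, aggregate and distribute the node core degree mining results of the partition graph networks by updating and resetting the third storage space. This ensures the stability and reliability of data processing while distributed computing is used to improve data processing efficiency.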
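The interplay of the three storage spaces in steps S710 to S730 can be pictured with an in-memory stand-in. This is a minimal sketch, assuming plain Python containers in place of whatever shared storage a real deployment would use; the class name `CoreStore` and its method names are hypothetical.

```python
class CoreStore:
    """In-memory stand-in for the first, second, and third storage spaces."""

    def __init__(self, initial_cores):
        self.coreness = dict(initial_cores)  # second space: core degree of every node
        self.read_ids = set(initial_cores)   # first space: nodes to compute this round
        self.write_ids = set()               # third space: nodes to compute next round

    def report(self, node, new_core, neighbors):
        """S710 and S720: record an active node's new core degree and queue follow-up work."""
        self.coreness[node] = new_core
        self.write_ids.add(node)
        self.write_ids.update(neighbors)

    def end_round(self):
        """S730: the third space overwrites the first, then the third is reset."""
        self.read_ids = self.write_ids
        self.write_ids = set()
```

Keeping the next round's identifiers in a separate buffer means that devices still working on the current round never see a partially written work list.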
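In step S230, the relationship graph network is pruned according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network.
As the node core degrees are mined and iteratively updated, nodes and edges in the relationship graph network gradually reach a converged, stable state: their node core degrees will not be updated in later iterations, and they will not affect the core degree mining results of other nodes. Such converged nodes can be pruned away to reduce the data scale of the relationship graph network and of the partition graph networks.
In an optional implementation, the embodiments of this application can obtain the minimum core degree of the active nodes in the current iteration round and the minimum core degree of the active nodes in the previous iteration round. If the minimum core degree of the active nodes in the current iteration round is greater than that in the previous iteration round, converged nodes in the relationship graph network are filtered out according to the minimum core degree of the active nodes in the previous iteration round, a converged node being a node whose node core degree is less than or equal to the minimum core degree of the active nodes in the previous iteration round. The converged nodes and the edges connected to them are then removed from the relationship graph network.
Figure 8 shows a schematic diagram of the process of compressing and pruning the relationship graph network based on iterative updates of node core degrees in an embodiment of this application.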
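The key to the compression pruning method is to analyze how the core value of each node changes in each iteration round. Let $core_v^{(t)}$ denote the core value of node $v$ in iteration round $t$, and let $minCore^{(t)}$ denote the minimum core value among the nodes whose core values are updated in iteration round $t$:
$$minCore^{(t)} = \min_{v \,:\, core_v^{(t)} < core_v^{(t-1)}} core_v^{(t)}$$
When the core value of a node is updated, the updated core value is always less than the original core value. Because the core value of a node can only decrease from one iteration round to the next, $minCore^{(t)} > minCore^{(t-1)}$ indicates that all nodes whose core values are less than or equal to $minCore^{(t-1)}$ have converged and will not be updated again. From the characteristics of k-core mining, nodes with smaller core values do not affect the iteration of nodes with larger core values, so the nodes that have converged in each iteration round, together with their connecting edges, can be "cut off", and the relationship graph network is gradually compressed as the iteration proceeds.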
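As shown in Figure 8, the initial minimum core value determined from the initialized core values is $minCore^{(0)} = 1$. After the first iteration round, the core values of some nodes are updated, and the minimum core value among those updated nodes is $minCore^{(1)} = 1$. After the second iteration round, the core values of another set of nodes are updated, and the minimum core value among those updated nodes is $minCore^{(2)} = 2$.
Because $minCore^{(2)} > minCore^{(1)}$, pruning of the relationship graph network is triggered, and the nodes whose core value is 1 are removed, thereby compressing the relationship graph network.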
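The pruning trigger described above can be sketched as a single function: when the minimum updated core value rises from one round to the next, every node whose core value does not exceed the previous minimum is treated as converged and removed together with its edges. The graph representation (a dictionary of neighbor sets) and the name `prune_converged` are assumptions made for this illustration.

```python
def prune_converged(adj, core, min_core_prev, min_core_now):
    """Return (pruned adjacency, {converged node: final core value})."""
    if min_core_now <= min_core_prev:
        return adj, {}  # condition not met, keep iterating on the same graph

    converged = {node for node in adj if core[node] <= min_core_prev}
    finished = {node: core[node] for node in converged}
    pruned = {
        node: nbrs - converged          # drop edges that lead to converged nodes
        for node, nbrs in adj.items()
        if node not in converged
    }
    return pruned, finished
```

The converged core values are returned separately so that they can be saved as final results before the nodes disappear from the working graph.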
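In step S240, when the network scale of the relationship graph network satisfies a preset network compression condition, the device cluster is compressed, to remove some of the computing devices from the device cluster.
As the relationship graph network keeps shrinking, the computing resources needed for node core degree mining on it also decrease, so part of the computing resources can be released as the iteration proceeds to reduce resource overhead.
In an embodiment of this application, the network compression condition may include the number of edges in the relationship graph network being less than a preset number threshold. The device cluster may be compressed by re-partitioning the relationship graph network according to its network scale after pruning, obtaining a smaller number of partition graph networks, and invoking correspondingly fewer computing devices based on the reduced number of partition graph networks.
In an embodiment of this application, when the network scale of the compressed relationship graph network satisfies a certain condition, one computing device can be selected from the device cluster as the target device for single-machine computation on the relationship graph network, and the computing devices other than the target device are removed from the device cluster. In this way, the distributed computing mode based on multiple computing devices can be switched to a centralized computing mode based on a single computing device.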
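As a rough illustration of the cluster compression decision, the sketch below picks how many computing devices to keep from the current edge count. The sizing rule and all parameter names (`edges_per_device`, `single_machine_threshold`) are assumptions made for this example; the source only states that the condition may be an edge-count threshold and that the cluster may shrink to a single target device.

```python
import math


def plan_cluster(num_edges, edges_per_device, single_machine_threshold):
    """Return how many computing devices to keep for the pruned graph."""
    if num_edges < single_machine_threshold:
        return 1  # switch to single-machine computation on one target device
    # Otherwise re-partition so that each device holds roughly edges_per_device edges.
    return max(1, math.ceil(num_edges / edges_per_device))
```

With a threshold of 30 million edges, a pruned graph of 25 million edges would be handled by one device, while a graph of 95 million edges at 10 million edges per device would still need ten.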
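It should be noted that although the steps of the methods in the embodiments of this application are described in a particular order in the drawings, this does not require or imply that the steps must be performed in that order, or that all of the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one, and/or one step may be split into multiple steps.
From the description of the data processing method in the above embodiments, the data processing method provided by the embodiments of this application involves a k-core mining method based on the idea of compression pruning. In some optional implementations, the method can rely on iterative updates of the h-index and can automatically compress and prune the graph network when a specified condition is met. In one application scenario, the flow of the data processing method provided by the embodiments of this application may include the following steps.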
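(1) For each node v in the relationship graph network G(V,E), initialize its core value with the node degree, $core_v^{(0)} = \deg(v)$, where deg(v) denotes the degree of the node, that is, the number of its neighbor nodes. Initialize minCore with the minimum node degree, that is, $minCore^{(0)} = \min_{v \in V} \deg(v)$.
(2) Set a parameter numMsgs to represent the number of nodes whose core values change in each iteration round, and initialize numMsgs to zero.
(3) For each node in G(V,E), compute the h-index from the core values of its neighbor nodes (that is, $core_v^{(t)}$, the h-index of the values $core_u^{(t-1)}$ over the neighbor nodes $u$ of $v$).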
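(4) Determine whether numMsgs is 0. When numMsgs is 0, no node's core value is being updated any more, and the iteration stops; otherwise, perform step (5).
(5) Determine whether $minCore^{(t)} > minCore^{(t-1)}$ holds. If it does, apply the compression pruning strategy: save the nodes whose core values are less than or equal to $minCore^{(t-1)}$ together with their core values, and remove those nodes and their connecting edges from the iteration graph G(V,E) to obtain the compressed subgraph G′(V,E). The iteration of steps (3) to (5) then continues on G′(V,E). When $minCore^{(t)} > minCore^{(t-1)}$ does not hold, the iteration of steps (3) to (5) continues on the original graph.
For k-core mining on a large-scale graph network, the above iterative steps are first carried out as a distributed parallel computation. When the scale of the compressed subgraph G′(V,E) satisfies a given condition (for example, the number of edges is less than 30 million), the distributed computation can be switched to single-machine computation. Single-machine computation not only releases a large amount of computing resources but also saves the extra time overhead, such as data distribution, that parallel computation incurs. In particular, for graph networks that contain long chain structures, the later iterations usually concentrate on updating the chain nodes, and single-machine computation is more suitable at that stage.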
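Steps (1) to (5) can be put together in a single-machine sketch. It is written under several assumptions: the graph is an undirected dictionary of neighbor sets, all nodes are recomputed synchronously in every round (the source restricts later rounds to active nodes' neighbors), and the helper names `h_index` and `kcore_with_pruning` are hypothetical. It is meant to show the control flow, not to reproduce the distributed implementation.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    ranked = sorted(values, reverse=True)
    for rank, value in enumerate(ranked):
        if rank >= value:
            return rank
    return len(ranked)


def kcore_with_pruning(adj):
    """adj: {node: set of neighbors}. Returns {node: core value}."""
    graph = {v: set(nbrs) for v, nbrs in adj.items()}
    core = {v: len(nbrs) for v, nbrs in graph.items()}   # step (1): core = degree
    if not core:
        return {}
    finished = {}
    min_core_prev = min(core.values())                    # step (1): minCore = min degree

    while True:
        updates = {}
        for v, nbrs in graph.items():                     # step (3): h-index of neighbors
            h = h_index(core[u] for u in nbrs)
            if h < core[v]:
                updates[v] = h
        num_msgs = len(updates)                           # step (2): changed nodes this round
        if num_msgs == 0:                                 # step (4): nothing changed, stop
            break
        core.update(updates)
        min_core_now = min(updates.values())
        if min_core_now > min_core_prev:                  # step (5): compression pruning
            converged = {v for v in graph if core[v] <= min_core_prev}
            finished.update({v: core[v] for v in converged})
            graph = {v: nbrs - converged
                     for v, nbrs in graph.items() if v not in converged}
        min_core_prev = min_core_now

    finished.update({v: core[v] for v in graph})
    return finished
```

On a triangle with one pendant node, the first round lowers the hub's value from 3 to 2, the minimum updated value rises from 1 to 2, the pendant node is pruned with core value 1, and the second round finds nothing left to update.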
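The k-core mining algorithm in the embodiments of this application can be implemented as a distributed computation on the Spark on Angel platform. Spark is a fast, general-purpose computing engine designed for large-scale data processing, and Angel is a high-performance distributed machine learning platform designed and developed around the Parameter Server (PS) concept. Spark on Angel is a high-performance distributed computing platform that combines Angel's powerful parameter server functionality with Spark's large-scale data processing capability, and it supports traditional machine learning, deep learning, and various graph algorithms.
Figure 9 shows the overall architecture and processing flow of k-core mining in an application scenario according to an embodiment of this application. As shown in Figure 9, driven by the Spark Driver, each Executor is responsible for storing adjacency-list partition data (that is, the network data of a partition graph network, GraphPartion), computing h-index values, and performing compression pruning, while the Angel Parameter Server is responsible for storing and updating node core values, that is, the coreness vector in Figure 9. To exploit the sparsity of k-core mining and speed up convergence, the PS also stores the nodes that need to be computed in the current iteration round and in the next iteration round, which are the ReadMessage vector and the WriteMessage vector in Figure 9, respectively. A node that is updated in the current iteration round is called an active node. Because the core value of a node is determined by its neighbor nodes, a change in the core value of an active node affects the core values of its neighbor nodes, so those neighbor nodes should be computed in the next iteration round. WriteMessage therefore stores, in real time, the neighbor nodes of the active nodes in the current iteration round.
In each iteration round, the Executors and the PS process data through the following interaction.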
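(1) On the Executor, initialize $minCore^{(t)} = minCore^{(t-1)}$, and allocate two vector spaces for the current iteration round, changedCore and keys2calc, which store the nodes updated in the current round and the nodes that need to be computed in the next round, respectively.
(2) Pull from ReadMessage on the PS the nodes that need to be computed in the current iteration round (referred to below simply as computing nodes); in the first iteration, these are all nodes.
(3) From the computing nodes obtained in step (2), determine all nodes involved in the computation of the current iteration round (the computing nodes and their neighbors), and pull the corresponding core values from coreness on the PS.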
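(4) For each node v among the computing nodes, compute the h-index of the core values of its neighbor nodes as the node's core value for the new round, $core_v^{(t)}$. If $core_v^{(t)} < core_v^{(t-1)}$, write $core_v^{(t)}$ into changedCore, write the neighbor nodes of v whose core values are greater than $minCore^{(t-1)}$ into keys2calc, and determine $minCore^{(t)}$ as the minimum of the updated core values, $minCore^{(t)} = \min_{v \,:\, core_v^{(t)} < core_v^{(t-1)}} core_v^{(t)}$.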
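(5) Use changedCore to update the coreness vector on the PS, and use keys2calc to update the WriteMessage vector on the PS.
Finally, after all data partitions have completed one iteration round, ReadMessage is replaced with WriteMessage on the PS and WriteMessage is reset, in preparation for the next round of PS reads and writes. After the results of all data partitions are aggregated to obtain the global $minCore^{(t)}$, it is determined whether $minCore^{(t)} > minCore^{(t-1)}$ holds. If it does, the compression pruning method described above is applied to all data partitions.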
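The Executor and PS interaction in steps (1) to (5) can be simulated in a single process. The sketch below does not use Spark or Angel APIs; `ParameterServer`, `executor_round`, and `run_round` are hypothetical stand-ins, and all Executors are assumed to read one consistent snapshot before any of them writes back. Pruning is left out to keep the round structure visible.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    ranked = sorted(values, reverse=True)
    for rank, value in enumerate(ranked):
        if rank >= value:
            return rank
    return len(ranked)


class ParameterServer:
    """Stand-in for the PS vectors: coreness, ReadMessage, WriteMessage."""

    def __init__(self, degrees):
        self.coreness = dict(degrees)
        self.read_message = set(degrees)   # first round: every node is computed
        self.write_message = set()

    def swap_messages(self):
        self.read_message, self.write_message = self.write_message, set()


def executor_round(partition, ps, min_core_prev):
    """partition: {node: set of neighbors} held by one Executor."""
    changed_core, keys2calc = {}, set()                                   # step (1)
    todo = [v for v in partition if v in ps.read_message]                 # step (2)
    snapshot = {n: ps.coreness[n] for v in todo for n in partition[v] | {v}}  # step (3)
    for v in todo:                                                        # step (4)
        new_core = h_index(snapshot[u] for u in partition[v])
        if new_core < snapshot[v]:
            changed_core[v] = new_core
            keys2calc.update(u for u in partition[v] if snapshot[u] > min_core_prev)
    return changed_core, keys2calc


def run_round(partitions, ps, min_core_prev):
    """Run one round on every partition; return the global minCore or None."""
    results = [executor_round(p, ps, min_core_prev) for p in partitions]
    for changed_core, keys2calc in results:                               # step (5)
        ps.coreness.update(changed_core)
        ps.write_message |= keys2calc
    ps.swap_messages()
    updated = [c for changed_core, _ in results for c in changed_core.values()]
    return min(updated) if updated else None
```

A driver loop would call `run_round` until it returns `None`, comparing each returned value with the previous one to decide when to prune.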
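The compression-based k-core mining method provided by the embodiments of this application can address the high resource overhead and long running time of k-core mining on ultra-large networks. Based on the iterative characteristics of k-core mining, a real-time compression method is designed that releases part of the computing resources as the iteration proceeds. The advantages of distributed parallel computation and single-machine computation are combined to improve k-core mining performance. The compression-based k-core mining method is implemented on the Spark on Angel high-performance graph computing platform and can support ultra-large networks with tens or hundreds of billions of edges, with low resource overhead and high performance.
The apparatus embodiments of this application are described below; they can be used to perform the data processing method in the above embodiments of this application. Figure 10 schematically shows a structural block diagram of a data processing apparatus provided by an embodiment of this application. As shown in Figure 10, the data processing apparatus 1000 may mainly include: a graph network obtaining module 1010, configured to obtain a relationship graph network, the relationship graph network including nodes that represent interaction objects and edges that represent interaction relationships among a plurality of interaction objects; a core degree mining module 1020, configured to perform core degree mining on the relationship graph network by using a device cluster that includes a plurality of computing devices, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network; a network pruning module 1030, configured to prune the relationship graph network according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network; and a cluster compression module 1040, configured to compress the device cluster when the network scale of the relationship graph network satisfies a preset network compression condition, to remove some of the computing devices from the device cluster.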
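In some embodiments of this application, based on the above embodiments, the cluster compression module 1040 includes: a single-machine computing unit, configured to select one computing device from the device cluster as the target device for single-machine computation on the relationship graph network, and to remove the computing devices other than the target device from the device cluster.
In some embodiments of this application, based on the above embodiments, the core degree mining module 1020 includes: a network partitioning unit, configured to partition the relationship graph network to obtain partition graph networks composed of some of the nodes and some of the edges in the relationship graph network; a network assignment unit, configured to assign the partition graph networks to the device cluster that includes a plurality of computing devices and determine the computing devices used for core degree mining on the partition graph networks; and a partition mining unit, configured to perform core degree mining on the partition graph networks by using the assigned computing devices, to iteratively update the node core degree of each node in the partition graph networks.
In some embodiments of this application, based on the above embodiments, the partition mining unit includes: a node selection subunit, configured to select, in the partition graph network, the computing nodes on which core degree mining is performed in the current iteration round, and determine the neighbor nodes adjacent to the computing nodes; a core degree obtaining subunit, configured to obtain the current node core degrees of the computing nodes and the neighbor nodes in the current iteration round; a core degree computing subunit, configured to determine the temporary node core degree of a computing node according to the current node core degrees of the neighbor nodes, and determine whether the temporary node core degree of the computing node is less than its current node core degree, and if so, mark the computing node as an active node; and a core degree updating subunit, configured to update the current node core degree of the active node according to the temporary node core degree, and determine the active node and the neighbor nodes adjacent to the active node as computing nodes on which core degree mining is performed in the next iteration round.
In some embodiments of this application, based on the above embodiments, the core degree computing subunit includes: an h-index computing subunit, configured to determine the h-index of the computing node according to the current node core degrees of the neighbor nodes and use the h-index as the temporary node core degree of the computing node, the h-index indicating that, among all neighbor nodes of the computing node, at most h neighbor nodes have a current node core degree greater than or equal to h.
In some embodiments of this application, based on the above embodiments, the h-index computing subunit includes: a node sorting subunit, configured to sort all neighbor nodes of the computing node in descending order of current node core degree and assign each neighbor node a rank number starting from 0; a node filtering subunit, configured to compare the rank number of each neighbor node with its current node core degree and filter out the neighbor nodes whose rank number is greater than or equal to their current node core degree; and an h-index determining subunit, configured to determine, among the filtered neighbor nodes, the current node core degree of the neighbor node with the smallest rank number as the h-index of the computing node.
In some embodiments of this application, based on the above embodiments, the node selection subunit includes: an identifier reading subunit, configured to read the node identifiers of nodes to be updated from a first storage space, the nodes to be updated including active nodes whose node core degrees were updated in the previous iteration round and the neighbor nodes adjacent to the active nodes; and an identifier selection subunit, configured to select, according to the node identifiers of the nodes to be updated, the computing nodes on which core degree mining is performed in the current iteration round in the partition graph network.
In some embodiments of this application, based on the above embodiments, the data processing apparatus further includes: a core degree writing module, configured to write the updated current node core degree of the active node to a second storage space, the second storage space being used to store the node core degrees of all nodes in the relationship graph network; an identifier writing module, configured to obtain the node identifier of the active node and the node identifiers of the neighbor nodes of the active node and write the obtained node identifiers to a third storage space, the third storage space being used to store the node identifiers of the computing nodes on which core degree mining is performed in the next iteration round; and a space overwriting module, configured to overwrite the data in the first storage space with the data in the third storage space after core degree mining on all partition graph networks has been completed in the current iteration round, and to reset the third storage space.
In some embodiments of this application, based on the above embodiments, the core degree obtaining subunit includes: a core degree reading subunit, configured to read the current node core degrees of the computing nodes and the neighbor nodes in the current iteration round from a second storage space, the second storage space being used to store the node core degrees of all nodes in the relationship graph network.
In some embodiments of this application, based on the above embodiments, the network pruning module 1030 includes: a minimum core degree obtaining unit, configured to obtain the minimum core degree of the active nodes in the current iteration round and the minimum core degree of the active nodes in the previous iteration round; a converged node filtering unit, configured to, if the minimum core degree of the active nodes in the current iteration round is greater than that in the previous iteration round, filter out converged nodes in the relationship graph network according to the minimum core degree of the active nodes in the previous iteration round, a converged node being a node whose node core degree is less than or equal to the minimum core degree of the active nodes in the previous iteration round; and a converged node removing unit, configured to remove the converged nodes and the edges connected to the converged nodes from the relationship graph network.
In some embodiments of this application, based on the above embodiments, the network compression condition includes the number of edges in the relationship graph network being less than a preset number threshold.
In some embodiments of this application, based on the above embodiments, the apparatus further includes a core degree initialization module, configured to: in the relationship graph network, for each node, obtain the number of neighbor nodes adjacent to the node; and, for each node, initialize the node core degree of the node according to the number of neighbor nodes adjacent to the node.
The specific details of the data processing apparatus provided in the embodiments of this application have been described in detail in the corresponding method embodiments and are not repeated here.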
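Figure 11 schematically shows a structural block diagram of a computer system of an electronic device used to implement the embodiments of this application.
It should be noted that the computer system 1100 of the electronic device shown in Figure 11 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of this application.
As shown in Figure 11, the computer system 1100 includes a central processing unit 1101 (CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory 1102 (ROM) or a program loaded from a storage section 1108 into a random access memory 1103 (RAM). The random access memory 1103 also stores various programs and data necessary for system operation. The central processing unit 1101, the read-only memory 1102, and the random access memory 1103 are connected to each other through a bus 1104. An input/output interface 1105 (I/O interface) is also connected to the bus 1104.
The following components are connected to the input/output interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a local area network card or a modem. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the input/output interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read from it is installed into the storage section 1108 as needed.
In particular, according to the embodiments of this application, the processes described in the method flowcharts can be implemented as computer software programs. For example, the embodiments of this application include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 1109, and/or installed from the removable medium 1111. When the computer program is executed by the central processing unit 1101, the various functions defined in the system of this application are performed.
It should be noted that the computer-readable medium shown in the embodiments of this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless or wired media, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
It should be noted that although several modules or units of a device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to the implementations of this application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied in multiple modules or units.
From the description of the above implementations, those skilled in the art will readily understand that the example implementations described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solutions according to the implementations of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a touch terminal, or a network device) to perform the method according to the implementations of this application.
Other implementations of this application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in this application.
It should be understood that this application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.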

Claims (16)

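  1. A data processing method, performed by an electronic device, the method comprising:
    obtaining a relationship graph network, the relationship graph network comprising nodes that represent interaction objects and edges that represent interaction relationships among a plurality of interaction objects;
    performing core degree mining on the relationship graph network by using a device cluster that comprises a plurality of computing devices, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network;
    pruning the relationship graph network according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network; and
    when the network scale of the relationship graph network satisfies a preset network compression condition, compressing the device cluster, to remove some of the computing devices from the device cluster.
  2. The data processing method according to claim 1, wherein the compressing the device cluster, to remove some of the computing devices from the device cluster, comprises:
    selecting one computing device from the device cluster as a target device for single-machine computation on the relationship graph network, and removing the computing devices other than the target device from the device cluster.
  3. The data processing method according to claim 1, wherein the performing core degree mining on the relationship graph network by using a device cluster that comprises a plurality of computing devices, to iteratively update the node core degree of each node in the relationship graph network, comprises:
    partitioning the relationship graph network to obtain partition graph networks composed of some of the nodes and some of the edges in the relationship graph network;
    assigning the partition graph networks to the device cluster that comprises a plurality of computing devices, and determining the computing devices used for core degree mining on the partition graph networks; and
    performing core degree mining on the partition graph networks by using the assigned computing devices, to iteratively update the node core degree of each node in the partition graph networks.
  4. The data processing method according to claim 3, wherein the performing core degree mining on the partition graph networks, to iteratively update the node core degree of each node in the partition graph networks, comprises:
    selecting, in the partition graph network, computing nodes on which core degree mining is performed in a current iteration round, and determining neighbor nodes adjacent to the computing nodes;
    obtaining the current node core degrees of the computing nodes and the neighbor nodes in the current iteration round;
    determining a temporary node core degree of a computing node according to the current node core degrees of the neighbor nodes, and determining whether the temporary node core degree of the computing node is less than the current node core degree of the computing node, and if so, marking the computing node as an active node; and
    updating the current node core degree of the active node according to the temporary node core degree, and determining the active node and the neighbor nodes adjacent to the active node as computing nodes on which core degree mining is performed in a next iteration round.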
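  5. The data processing method according to claim 4, wherein the determining a temporary node core degree of a computing node according to the current node core degrees of the neighbor nodes comprises:
    determining an h-index of the computing node according to the current node core degrees of the neighbor nodes, and using the h-index as the temporary node core degree of the computing node, the h-index indicating that, among all neighbor nodes of the computing node, at most h neighbor nodes have a current node core degree greater than or equal to h.
  6. The data processing method according to claim 5, wherein the determining an h-index of the computing node according to the current node core degrees of the neighbor nodes comprises:
    sorting all neighbor nodes of the computing node in descending order of current node core degree, and assigning each neighbor node a rank number starting from 0;
    comparing the rank number of each neighbor node with its current node core degree, and filtering out the neighbor nodes whose rank number is greater than or equal to their current node core degree; and
    determining, among the filtered neighbor nodes, the current node core degree of the neighbor node with the smallest rank number as the h-index of the computing node.
  7. The data processing method according to claim 4, wherein the selecting, in the partition graph network, computing nodes on which core degree mining is performed in a current iteration round comprises:
    reading node identifiers of nodes to be updated from a first storage space, the nodes to be updated comprising active nodes whose node core degrees were updated in a previous iteration round and neighbor nodes adjacent to the active nodes; and
    selecting, according to the node identifiers of the nodes to be updated, the computing nodes on which core degree mining is performed in the current iteration round in the partition graph network.
  8. The data processing method according to claim 7, wherein after the updating the current node core degree of the active node according to the temporary node core degree, the method further comprises:
    writing the updated current node core degree of the active node to a second storage space, the second storage space being used to store the node core degrees of all nodes in the relationship graph network;
    obtaining the node identifier of the active node and the node identifiers of the neighbor nodes of the active node, and writing the obtained node identifiers to a third storage space, the third storage space being used to store the node identifiers of the computing nodes on which core degree mining is performed in the next iteration round; and
    after core degree mining on all partition graph networks has been completed in the current iteration round, overwriting the data in the first storage space with the data in the third storage space, and resetting the third storage space.
  9. The data processing method according to claim 4, wherein the obtaining the current node core degrees of the computing nodes and the neighbor nodes in the current iteration round comprises:
    reading the current node core degrees of the computing nodes and the neighbor nodes in the current iteration round from a second storage space, the second storage space being used to store the node core degrees of all nodes in the relationship graph network.
  10. The data processing method according to claim 1, wherein the pruning the relationship graph network according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network, comprises:
    obtaining a minimum core degree of the active nodes in a current iteration round and a minimum core degree of the active nodes in a previous iteration round;
    if the minimum core degree of the active nodes in the current iteration round is greater than the minimum core degree of the active nodes in the previous iteration round, filtering out converged nodes in the relationship graph network according to the minimum core degree of the active nodes in the previous iteration round, a converged node being a node whose node core degree is less than or equal to the minimum core degree of the active nodes in the previous iteration round; and
    removing the converged nodes and the edges connected to the converged nodes from the relationship graph network.
  11. The data processing method according to claim 1, wherein the network compression condition comprises the number of edges in the relationship graph network being less than a preset number threshold.
  12. The data processing method according to claim 1, wherein before the performing core degree mining on the relationship graph network by using a device cluster that comprises a plurality of computing devices, the method further comprises:
    in the relationship graph network, for each node, obtaining the number of neighbor nodes adjacent to the node; and
    for each node, initializing the node core degree of the node according to the number of neighbor nodes adjacent to the node.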
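  13. A data processing apparatus, comprising:
    a graph network obtaining module, configured to obtain a relationship graph network, the relationship graph network comprising nodes that represent interaction objects and edges that represent interaction relationships among a plurality of interaction objects;
    a core degree mining module, configured to perform core degree mining on the relationship graph network by using a device cluster that comprises a plurality of computing devices, to iteratively update the node core degrees of all or some of the nodes in the relationship graph network;
    a network pruning module, configured to prune the relationship graph network according to the node core degrees, to remove some of the nodes and some of the edges from the relationship graph network; and
    a cluster compression module, configured to compress the device cluster when the network scale of the relationship graph network satisfies a preset network compression condition, to remove some of the computing devices from the device cluster.
  14. A computer-readable medium, storing a computer program that, when executed by a processor, implements the data processing method according to any one of claims 1 to 12.
  15. An electronic device, comprising:
    a processor; and
    a memory, configured to store executable instructions of the processor;
    wherein the processor is configured to perform the data processing method according to any one of claims 1 to 12 by executing the executable instructions.
  16. A computer program product, comprising instructions that, when run on a computer, cause the computer to implement the data processing method according to any one of claims 1 to 12.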
PCT/CN2021/132221 2020-12-31 2021-11-23 数据处理方法、装置、计算机可读介质及电子设备 WO2022142859A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21913601.7A EP4198771A4 (en) 2020-12-31 2021-11-23 DATA PROCESSING METHOD AND APPARATUS, COMPUTER READABLE MEDIUM AND ELECTRONIC DEVICE
JP2023521789A JP2023546040A (ja) 2020-12-31 2021-11-23 データ処理方法、装置、電子機器、及びコンピュータプログラム
US17/964,778 US20230033019A1 (en) 2020-12-31 2022-10-12 Data processing method and apparatus, computer-readable medium, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011626906.6 2020-12-31
CN202011626906.6A CN113515672A (zh) 2020-12-31 2020-12-31 数据处理方法、装置、计算机可读介质及电子设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/964,778 Continuation US20230033019A1 (en) 2020-12-31 2022-10-12 Data processing method and apparatus, computer-readable medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2022142859A1 true WO2022142859A1 (zh) 2022-07-07

Family

ID=78060647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/132221 WO2022142859A1 (zh) 2020-12-31 2021-11-23 数据处理方法、装置、计算机可读介质及电子设备

Country Status (5)

Country Link
US (1) US20230033019A1 (zh)
EP (1) EP4198771A4 (zh)
JP (1) JP2023546040A (zh)
CN (1) CN113515672A (zh)
WO (1) WO2022142859A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115689066A (zh) * 2022-12-30 2023-02-03 湖南三湘银行股份有限公司 基于图数据算法的目标供应商风险预测方法及装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515672A (zh) * 2020-12-31 2021-10-19 腾讯科技(深圳)有限公司 数据处理方法、装置、计算机可读介质及电子设备
US11860977B1 (en) * 2021-05-04 2024-01-02 Amazon Technologies, Inc. Hierarchical graph neural networks for visual clustering
CN115455244B (zh) * 2022-09-16 2023-09-22 北京百度网讯科技有限公司 图数据的处理方法、装置、设备和介质
CN116436799B (zh) * 2023-06-13 2023-08-11 中国人民解放军国防科技大学 复杂网络节点重要性评估方法、装置、设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110188405A1 (en) * 2010-01-30 2011-08-04 International Business Machines Corporation Systems and methods for finding star structures as communities in networks
CN106126341A (zh) * 2016-06-23 2016-11-16 成都信息工程大学 应用于大数据的多计算框架处理系统及关联规则挖掘方法
CN108965141A (zh) * 2018-09-18 2018-12-07 深圳市风云实业有限公司 一种多路径路由树的计算方法及装置
CN113515672A (zh) * 2020-12-31 2021-10-19 腾讯科技(深圳)有限公司 数据处理方法、装置、计算机可读介质及电子设备

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013126144A2 (en) * 2012-02-20 2013-08-29 Aptima, Inc. Systems and methods for network pattern matching
US9336388B2 (en) * 2012-12-10 2016-05-10 Palo Alto Research Center Incorporated Method and system for thwarting insider attacks through informational network analysis
US9195941B2 (en) * 2013-04-23 2015-11-24 International Business Machines Corporation Predictive and descriptive analysis on relations graphs with heterogeneous entities
US9483580B2 (en) * 2013-06-11 2016-11-01 International Business Machines Corporation Estimation of closeness of topics based on graph analytics
US9660869B2 (en) * 2014-11-05 2017-05-23 Fair Isaac Corporation Combining network analysis and predictive analytics
US9699205B2 (en) * 2015-08-31 2017-07-04 Splunk Inc. Network security system
US10855706B2 (en) * 2016-10-11 2020-12-01 Battelle Memorial Institute System and methods for automated detection, reasoning and recommendations for resilient cyber systems
US10728105B2 (en) * 2018-11-29 2020-07-28 Adobe Inc. Higher-order network embedding
US11696241B2 (en) * 2019-07-30 2023-07-04 Qualcomm Incorporated Techniques for synchronizing based on sidelink synchronization signal prioritization
US11671436B1 (en) * 2019-12-23 2023-06-06 Hrl Laboratories, Llc Computational framework for modeling adversarial activities
US20230177834A1 (en) * 2021-12-07 2023-06-08 Insight Direct Usa, Inc. Relationship modeling and evaluation based on video data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4198771A4

Also Published As

Publication number Publication date
CN113515672A (zh) 2021-10-19
US20230033019A1 (en) 2023-02-02
EP4198771A1 (en) 2023-06-21
EP4198771A4 (en) 2024-04-03
JP2023546040A (ja) 2023-11-01

Similar Documents

Publication Publication Date Title
WO2022142859A1 (zh) Data processing method and apparatus, computer-readable medium, and electronic device
US11836578B2 (en) Utilizing machine learning models to process resource usage data and to determine anomalous usage of resources
US9916394B2 (en) Vectorized graph processing
CN107871166B (zh) Feature processing method and feature processing system for machine learning
US10812551B1 (en) Dynamic detection of data correlations based on realtime data
CN104077723B (zh) Social network recommendation system and method
US11935049B2 (en) Graph data processing method and apparatus, computer device, and storage medium
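CN111427971B (zh) Business modeling method, apparatus, system, and medium for computer systems
CN106815254A (zh) Data processing method and apparatus
CN114667507A (zh) Resilient execution of machine learning workloads using application-based profiling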
Qureshi et al. A survey on association rule mining in cloud computing
CN106599122B (zh) Parallel frequent closed sequence mining method based on vertical decomposition
Garcia et al. Flute: A scalable, extensible framework for high-performance federated learning simulations
Benlachmi et al. A comparative analysis of hadoop and spark frameworks using word count algorithm
Choi et al. Intelligent reconfigurable method of cloud computing resources for multimedia data delivery
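WO2020147601A1 (zh) System for learning on graphs
CN112182111A (zh) Blockchain-based layered processing method for distributed systems, and electronic device
JP2022534160A (ja) Method and apparatus for outputting information, electronic device, storage medium, and computer program
CN111581443A (zh) Distributed graph computing method, terminal, system, and storage medium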
US20200118016A1 (en) Data attribution using frequent pattern analysis
Liu et al. Cloud service selection based on rough set theory
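CN113780333A (zh) User group classification method and apparatus
CN113052507A (zh) Method and apparatus for evenly distributing data
CN110334067A (zh) Sparse matrix compression method, apparatus, device, and storage medium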
Majumder et al. Optimal and Effective Resource Management in Edge Computing.

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 21913601; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 2021913601; Country of ref document: EP; Effective date: 20230313
ENP Entry into the national phase. Ref document number: 2023521789; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase. Ref country code: DE