WO2019174393A1 - Graph structure model training and spam account identification - Google Patents

Graph structure model training and spam account identification

Info

Publication number
WO2019174393A1
Authority
WO
WIPO (PCT)
Prior art keywords
account
node
data
structure model
graph structure
Prior art date
Application number
PCT/CN2019/071868
Other languages
English (en)
French (fr)
Inventor
刘子奇
陈超超
周俊
李小龙
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Priority to SG11202004182WA priority Critical patent/SG11202004182WA/en
Priority to EP19768037.4A priority patent/EP3703332B1/en
Publication of WO2019174393A1 publication Critical patent/WO2019174393A1/zh
Priority to US16/882,084 priority patent/US10917425B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/552Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/554Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425Traffic logging, e.g. anomaly detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic
    • H04L63/1483Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks

Definitions

  • The present specification relates to the field of computer software technology, and in particular, to graph structure model training and spam account identification methods, apparatuses, and devices.
  • Some users or organizations register large numbers of accounts for malicious purposes and use these accounts to perform abnormal operations, such as spreading spam messages, posting fake advertisements, and order brushing. These accounts may bring risks to the platform while contributing little value to it, and are considered spam accounts.
  • In the prior art, spam accounts are generally identified by means of user reports, and corresponding processing, such as freezing or cancellation, is then performed.
  • The embodiments of the present specification provide graph structure model training and spam account identification methods, apparatuses, and devices, which are used to solve the following technical problem: an effective spam account identification scheme is needed.
  • A method for training a graph structure model includes: acquiring an account-medium network graph, where nodes in the graph represent accounts and media, and at least some of the edges indicate that the nodes they connect have a login behavior relationship; acquiring feature data and risk labeling data of the nodes, the feature data reflecting the login behavior of the corresponding node over a time series; and training a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
  • A method for identifying spam accounts includes: acquiring feature data of an account to be identified, and acquiring the account-medium network graph to which the account to be identified belongs; inputting the feature data of the account to be identified, together with the topology corresponding to the account to be identified in the account-medium network graph, into a graph structure model trained using the above graph structure model training method for calculation; and determining, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
  • A graph structure model training apparatus includes: a first acquisition module, which acquires an account-medium network graph, where nodes in the graph represent accounts and media, and at least some of the edges indicate that the nodes they connect have a login behavior relationship; a second acquisition module, which acquires feature data and risk labeling data of the nodes, the feature data reflecting the login behavior of the corresponding node over a time series; and a training and identification module, which trains a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
  • A spam account identification apparatus includes: an acquisition module, which acquires feature data of an account to be identified and the account-medium network graph to which the account to be identified belongs; an input module, which inputs the feature data of the account to be identified, together with the topology corresponding to the account to be identified in the account-medium network graph, into a graph structure model trained using the above graph structure model training method for calculation; and a determination module, which determines, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
  • A graph structure model training device includes: at least one processor; and a memory communicatively coupled to the at least one processor.
  • The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to: acquire an account-medium network graph, where nodes in the graph represent accounts and media, and at least some of the edges indicate that the nodes they connect have a login behavior relationship; acquire feature data and risk labeling data of the nodes, the feature data reflecting the login behavior of the corresponding node over a time series; and train a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
  • The above supervised graph embedding scheme based on the account-medium network graph can effectively identify spam accounts by exploiting the media aggregation and time aggregation of spam accounts.
  • FIG. 1 is a schematic diagram of an overall architecture involved in an implementation scenario of the present specification;
  • FIG. 2 is a schematic flowchart of a graph structure model training method according to an embodiment of the present specification;
  • FIG. 3 is a schematic flowchart of a spam account identification method according to an embodiment of the present specification;
  • FIG. 4 is a schematic diagram of an implementation of the foregoing methods according to an embodiment of the present specification;
  • FIG. 5 is a schematic structural diagram of a graph structure model training apparatus corresponding to FIG. 2 according to an embodiment of the present specification;
  • FIG. 6 is a schematic structural diagram of a spam account identification apparatus corresponding to FIG. 3 according to an embodiment of the present specification;
  • FIG. 7 is a schematic structural diagram of a graph structure model training device corresponding to FIG. 2 according to an embodiment of the present specification;
  • FIG. 8 is a schematic structural diagram of a spam account identification device corresponding to FIG. 3 according to an embodiment of the present specification.
  • Embodiments of the present specification provide graph structure model training and spam account identification methods, apparatuses, and devices.
  • An account used for abnormal behavior can be regarded as a spam account, for example, an account automatically registered in bulk by a machine.
  • The identification of spam accounts is of great significance for risk control and security.
  • The difficulty is that newly registered accounts do not yet have enough account profile information to determine whether they are spam accounts.
  • This specification takes into account two characteristics that spam accounts often exhibit, media aggregation and time aggregation, and based on these two characteristics proposes a supervised graph embedding scheme for spam account identification that can identify spam accounts effectively.
  • Here, embedding refers to mapping some original data of the nodes in the graph into a specified feature space (referred to in this specification as the hidden feature space) to obtain a corresponding embedding vector that represents each node.
  • Media aggregation means that multiple spam accounts registered by the same malicious user are often registered through the same medium or a small number of media. The reason for media aggregation is that malicious users seek to profit, and they do not have enough resources to register large numbers of accounts through a correspondingly large number of media.
  • Time aggregation means that spam accounts controlled by the same malicious user often produce a large number of abnormal behaviors within a short period of time.
  • The reason for time aggregation is that malicious users often pursue short-term profit goals, causing the accounts under their control to generate a large number of abnormal behaviors in a short period of time.
  • FIG. 1 is a schematic diagram of an overall architecture involved in an implementation scenario of the present specification.
  • The overall architecture mainly involves a supervised learning server holding the predefined graph structure model, together with three types of data that can be used to train the graph structure model: an account-medium network graph reflecting a specified behavior relationship, feature data reflecting the specified behavior of the nodes in the account-medium network graph over a time series, and risk labeling data of the nodes.
  • The specified behavior is, for example, login behavior, registration behavior, or transaction behavior.
  • These training data can be generated by the supervised learning server or another device, or prepared manually.
  • FIG. 2 is a schematic flowchart of a graph structure model training method according to an embodiment of the present specification.
  • the process in Figure 2 includes the following steps:
  • S202: Acquire an account-medium network graph, where nodes in the graph represent accounts and media, and at least some of the edges indicate that the nodes they connect have a login behavior relationship.
  • In the embodiments of the present specification, the account-medium network graph is a heterogeneous network graph, where heterogeneity refers to differences in node properties.
  • Specifically, some nodes in the graph may represent accounts, and other nodes may represent account-related media.
  • A medium is something through which an account is registered or used, such as a device, an IP network, or a physical address.
  • The account-medium network graph may be generated according to historical data of accounts within a certain time range.
  • The historical data may include registration behavior data of an account, for example, through what kind of medium the account was registered and the registration information filled in when the account was registered; the historical data may also include behavior data generated after the account was registered, such as the account's login behavior data and transaction behavior data.
  • The certain time range is not specifically limited here and may be preset, for example, to the past several days.
  • In practice, an account-medium network graph can be generated based on account registration behavior data and/or specified behavior data within a certain time range (usually a short one) after registration, so that spam accounts can be identified as early as possible.
  • Taking devices as an example medium, the account-medium network graph is specifically an account-device network graph.
  • To generate it, the accounts and devices to be represented may first be determined, with each account represented by one node and each device also represented by one node, so that any node represents either an account or a device. Then, if two nodes have a login behavior relationship, an edge representing that relationship is established between them, thereby generating the account-device network graph.
  • The login behavior relationship mainly refers to a relationship between an account and a device: if an account has logged in on a device within the certain time range, the account is said to have a login behavior relationship with that device. It should be noted that, in practical applications, the specific meaning of the login behavior relationship can be broadened if needed.
  • For example, the login behavior relationship may also include a relationship between accounts: if an account and another account have logged in on the same device within the certain time range, the two accounts can be said to have a login behavior relationship with each other.
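As an illustration of the broadened definition, the sketch below derives both kinds of login behavior relationships from a toy set of login records; the account and device names are hypothetical, not from the specification:

```python
from itertools import combinations

# Hypothetical login records within the chosen time range: (account, device).
logins = [("acct_1", "dev_A"), ("acct_2", "dev_A"),
          ("acct_1", "dev_B"), ("acct_3", "dev_B")]

# Account-device login behavior relationships follow directly from the records.
account_device_edges = set(logins)

# Broadened account-account relationships: two accounts that logged in on the
# same device within the time range have a login behavior relationship.
accounts_by_device = {}
for acct, dev in logins:
    accounts_by_device.setdefault(dev, set()).add(acct)

account_account_edges = set()
for accts in accounts_by_device.values():
    for a, b in combinations(sorted(accts), 2):
        account_account_edges.add((a, b))
```

Here dev_A relates acct_1 to acct_2, and dev_B relates acct_1 to acct_3.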
  • The account-medium network graph may be an undirected graph or a directed graph, which is not specifically limited here.
  • If only login behavior relationships are reflected, an undirected graph can be used; if further relationships such as transaction behavior relationships are also reflected, a directed graph can be used, with the direction of an edge indicating the direction of the business relationship.
  • For example, if node A represents a buyer account and node B represents a seller account, the edge representing the transaction behavior relationship between them may point from node A to node B, which can also reflect the direction of the capital flow.
  • To facilitate graph computation, the account-medium network graph can be represented by a matrix. Different rows and columns of the matrix represent different nodes in the account-medium network graph, and the elements of the matrix indicate whether the nodes represented by the corresponding row and column have a login behavior relationship.
  • For example, an n×n matrix can be used, where n is the number of accounts to be represented plus the number of devices to be represented.
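A minimal sketch of this matrix representation, with hypothetical node names and n = 2 accounts + 2 devices = 4:

```python
import numpy as np

# Hypothetical node ordering: accounts first, then devices.
nodes = ["acct_1", "acct_2", "dev_A", "dev_B"]
index = {name: i for i, name in enumerate(nodes)}

# Login behavior relationships between accounts and devices.
edges = [("acct_1", "dev_A"), ("acct_2", "dev_A"), ("acct_2", "dev_B")]

# G[i, j] = 1 when the nodes represented by row i and column j have a login
# behavior relationship; the graph is undirected, so G is symmetric.
n = len(nodes)
G = np.zeros((n, n))
for a, d in edges:
    G[index[a], index[d]] = 1.0
    G[index[d], index[a]] = 1.0
```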
  • S204: Acquire feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding node over a time series.
  • The nodes in step S204 may be only some of the nodes in the account-medium network graph, not necessarily all of them; for example, they may be at least some of the nodes representing accounts.
  • Feature data of at least some of the nodes representing media may also be acquired; the feature data of a node representing a medium does not necessarily reflect the login behavior of the corresponding node over a time series.
  • For example, for a node representing a device, its feature data may reflect device information such as the device type and the device manufacturer.
  • The feature data may be generated according to historical data of the account within the certain time range.
  • For example, the time range can be serialized (such as by dividing it into multiple time intervals, or by sampling discrete time points) to determine the distribution of the account's login behavior over the resulting time series, for example, the specific times at which logins occur, their duration, and the number of logins per unit time.
  • The feature data can generally be represented as a vector or a matrix; the following embodiments are mainly described by taking feature data in vector form as an example.
  • The following takes the nodes in the account-device network graph as an example.
  • Some nodes may represent spam accounts, and some nodes may represent devices on which spam accounts have logged in. Which nodes these are is not known in advance, and at least some of them need to be clarified through specific means in order to obtain training samples with training labels for subsequent supervised learning. The specific means are not limited here; examples include accurate tracking analysis based on sampling, or user reports.
  • Risk labeling data can be marked for some of the nodes in advance or in real time. The risk labeling data can indicate the risk of a node, for example, whether it represents a spam account, or whether it represents a device on which a spam account has logged in.
  • The risks here need not be limited to spam accounts; for example, a normal account's vulnerability to attack is also a risk.
  • The above training labels can be obtained based on the risk labeling data.
  • For example, the risk labeling data can be used directly as training labels.
  • The representation of the risk labeling data is varied and is not specifically limited here. For example, if it is determined that a node is unrelated to spam accounts, the risk labeling data of the node may be recorded as 1; if it is determined that a node represents a spam account, or represents a device on which a spam account has logged in, the risk labeling data of the node may be recorded as 0; and so on.
  • At least some of the parameters of the graph structure model are based on the graph structure, and those parameters may be assigned values based on at least part of the account-medium network graph and/or the feature data.
  • The graph structure model also has some parameters that need to be solved for through training and optimization.
  • The graph structure model is configured to calculate, according to the feature data of the nodes and the topology corresponding to the nodes in the account-medium network graph, the embedding vectors of the nodes in the hidden feature space after multiple iterations. Further, the graph structure model is also configured to calculate the prediction data of a node according to its embedding vector, where the prediction data indicates the likelihood that the node corresponds to a spam account.
  • The form of the prediction data is varied and is not specifically limited here; it may be, for example, a probability value, a non-probability score, or a classification category identifier.
  • Alternatively, the graph structure model need not calculate the prediction data itself: after calculating the embedding vectors, it can output them to another model. This specification does not analyze that situation in detail.
  • The following description is mainly based on the above example.
  • After the graph structure model is trained, it can be used for classification or regression to predict the risk properties of input data.
  • The input data may be the feature data corresponding to an account to be identified, together with the topology corresponding to the account to be identified in the account-medium network graph it belongs to (not necessarily the account-medium network graph of step S202); after calculation by the trained graph structure model, prediction data is output, from which it can be determined whether the account to be identified is a spam account.
  • The account to be identified may be represented by a node in the account-medium network graph of step S202, or may lie outside that graph. In the former case, the input data is already determined, so identification can be performed directly; in the latter case, the input data may not yet be determined, and the scheme of this specification can be used to first determine the input data and then perform identification.
  • In addition to identifying spam accounts, the graph structure model may be used to predict other risks; the principles are the same and are not described again here.
  • Based on the above method, the embodiments of the present specification further provide some specific implementations and extensions of the method, which are described below.
  • In the embodiments of the present specification, identifying spam accounts may specifically include: acquiring feature data of an account to be identified, and acquiring the account-medium network graph to which the account to be identified belongs; inputting the feature data of the account to be identified, together with the topology corresponding to the account to be identified in that account-medium network graph, into the trained graph structure model for calculation; and determining, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
  • In the embodiments of the present specification, the time series can be obtained by dividing a time range.
  • Accordingly, acquiring the feature data of a node may specifically include: acquiring the node's login behavior data within a certain time range; dividing that time range to obtain a time series; and generating a feature vector as the node's feature data according to the distribution of the login behavior data over the time series.
  • For example, suppose the certain time range is set to the past m days and is divided by hours; then a D-dimensional feature vector x_i can be constructed for account i, where D can be equal to m*24.
  • The specific construction of x_i is not limited here. For example, each element of x_i can represent the number of logins of account i in one of the hourly time segments, and the elements of x_i can be normalized.
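A sketch of one possible construction of x_i under these assumptions; the value of m, the login timestamps, and the normalization are all illustrative choices, not mandated by the specification:

```python
import numpy as np

m = 2            # look back over the past m days
D = m * 24       # one element per hourly segment, so x_i is D-dimensional

# Hypothetical login times of account i, in hours since the window start.
login_hours = [0.5, 0.7, 1.2, 30.0, 30.9]

x_i = np.zeros(D)
for h in login_hours:
    x_i[int(h)] += 1.0           # count logins falling into each hourly segment

x_i = x_i / max(x_i.sum(), 1.0)  # normalize the elements (one common choice)
```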
  • In the embodiments of the present specification, the embedding vector of a node in the hidden feature space after the t-th iteration may be calculated from the feature data of the node, the topology corresponding to the node in the account-medium network graph, and the embedding vector of the node in the hidden feature space after the (t-1)-th iteration. For example, the following formulas can be used:
  • μ^(t+1) = σ(W_1·X + W_2·G·μ^(t)); //(Formula 1) iteratively calculate the embedding vectors
  • pred_i = w^T·μ_i; //(Formula 2) calculate the prediction data based on the embedding vector
  • ⁇ (t+1) represents the embedded vector of at least one of the nodes in the implicit feature space after the t+1th iteration, and ⁇ represents a nonlinear transformation function (eg, Relu, Sigmoid, Tanh, etc.)
  • W 1 W 2 denotes a weight matrix
  • X denotes feature data of the at least one of the nodes
  • G denotes a topology corresponding to the at least one of the nodes in the account medium network diagram
  • pred i denotes an i th
  • ⁇ i represents the embedded vector after the multiple iterations of the i-th node in the implicit feature space
  • w T represents a parameter vector for dividing ⁇ i
  • T represents transposition
  • the calculation y i represents the risk annotation data of the i-th node
  • L represents a loss function for measuring the consistency gap between the prediction data and the corresponding risk annotation data, which is not specifically limited herein, for example, logistic can be used.
  • As an example, G can represent the complete topology of the account-medium network graph,
  • X can represent the feature data of all the nodes in the account-medium network graph,
  • and μ can represent the embedding vectors of all the nodes in the account-medium network graph. For example, each row of X represents the feature data of one node, k represents the dimension of the hidden feature space, and each row of μ represents the embedding vector of one node.
  • Alternatively, G may represent only part of the complete topology of the account-medium network graph; accordingly, X and μ may also contain data of only some of the nodes in the account-medium network graph.
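The iterative embedding calculation (Formula 1) and the per-node scoring (Formula 2) can be sketched with toy data as follows. All shapes and values are illustrative; X, G, and μ are held in row-per-node form here, so the weight matrices apply on the right:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 4, 6, 3        # n nodes, D-dimensional features, k-dimensional hidden space

X = rng.random((n, D))   # feature data: one row per node
G = np.ones((n, n))      # toy topology of the account-medium network graph

W1 = rng.random((D, k))  # weight matrices, solved for during training
W2 = rng.random((k, k))
w = rng.random(k)        # parameter vector w for classifying the embeddings

def sigma(z):
    # Nonlinear transformation function; ReLU here, Sigmoid or Tanh also work.
    return np.maximum(z, 0.0)

mu = np.zeros((n, k))    # embedding vectors in the hidden feature space
for t in range(3):       # multiple iterations of Formula 1
    mu = sigma(X @ W1 + G @ mu @ W2)

pred = mu @ w            # Formula 2: prediction data, one score per node
```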
  • During training, the consistency between the prediction data and the corresponding risk labeling data can be maximized as the training objective, and the graph structure model can be trained toward it.
  • Specifically, training the predefined graph structure model may include: training the weight matrices and the parameter vector of the graph structure model by using a back propagation algorithm and the risk labeling data.
  • Formula 1, Formula 2, and Formula 3 above are exemplary and are not the only possible forms.
  • For example, the X and G terms in Formula 1 can be transformed by multiplication, exponentiation, or logarithm operations, combined with each other, or one of them can be deleted; for another example, Formula 2 can instead use a softmax function to classify μ_i; for yet another example, if the loss function in Formula 3 measures the degree of agreement between the prediction data and the corresponding risk labeling data, Formula 3 can be adjusted to take a maximum instead of a minimum; and so on.
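As a minimal sketch of the training step, the fragment below minimizes a logistic loss by gradient descent (the back propagation algorithm mentioned above). For brevity it optimizes only the parameter vector w over fixed toy embeddings, whereas a full trainer would also propagate gradients into W_1 and W_2; all data here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
mu = rng.random((n, k)) - 0.5     # toy embeddings, as produced by the forward pass
y = (mu[:, 0] > 0).astype(float)  # toy risk labeling data correlated with them

w = np.zeros(k)                   # parameter vector w of Formula 2
lr = 0.5                          # gradient descent step size

for step in range(200):
    p = 1.0 / (1.0 + np.exp(-(mu @ w)))  # sigmoid of pred_i = w^T mu_i
    grad = mu.T @ (p - y) / n            # gradient of the mean logistic loss
    w -= lr * grad                       # descend toward the Formula 3 minimum

loss = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

The loss starts at log 2 with w = 0 and decreases as the predictions align with the labels.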
  • Based on the same idea, the embodiments of the present specification further provide a schematic flowchart of a spam account identification method based on the above graph structure model, as shown in FIG. 3.
  • the process in Figure 3 includes the following steps:
  • S302: Acquire feature data of an account to be identified, and acquire the account-medium network graph to which the account to be identified belongs.
  • S304: Input the feature data of the account to be identified, together with the topology corresponding to the account to be identified in the account-medium network graph, into the graph structure model trained by the above graph structure model training method for calculation.
  • S306: Determine, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
  • The embodiments of the present specification further provide a schematic diagram of an implementation of the foregoing methods, as shown in FIG. 4.
  • The scheme of FIG. 4 may include the following steps: acquiring the account-device network graph for the past m days, together with the login behavior data and risk labeling data of each account; training the predefined graph structure model through supervised learning to obtain a trained graph structure model; acquiring the data to be predicted (such as one or more accounts to be identified, the corresponding account-device network graph, and the login behavior data of each account); and performing prediction with the trained graph structure model to obtain the prediction result.
  • Based on the same idea, the embodiments of the present specification further provide corresponding apparatuses and devices, as shown in FIG. 5 to FIG. 8.
  • FIG. 5 is a schematic structural diagram of a graph structure model training apparatus corresponding to FIG. 2 according to an embodiment of the present specification.
  • The apparatus may be located in the execution body of the process in FIG. 2 and includes: a first acquisition module 501, which acquires an account-medium network graph, where nodes in the graph represent accounts and media, and at least some of the edges indicate that the nodes they connect have a login behavior relationship;
  • a second acquisition module 502, which acquires feature data and risk labeling data of the nodes, the feature data reflecting the login behavior of the corresponding node over a time series;
  • and a training and identification module 503, which trains a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
  • Optionally, the medium includes a device.
  • Optionally, the graph structure model is configured to calculate, according to the feature data of a node and the topology corresponding to the node in the account-medium network graph, the embedding vector of the node in the latent feature space after multiple iterations.
  • Optionally, the graph structure model is further configured to calculate prediction data of the node according to the embedding vector, where the prediction data indicates the likelihood that the node corresponds to a spam account.
  • Optionally, identifying spam accounts by the training and identification module 503 specifically includes: acquiring feature data of an account to be identified, and acquiring the account-medium network graph to which the account belongs; inputting the feature data of the account to be identified, together with the topology corresponding to that account in the account-medium network graph, into the trained graph structure model for calculation; and obtaining the prediction data output by the trained graph structure model to determine whether the account to be identified is a spam account.
  • Optionally, the embedding vector of a node in the latent feature space after the t-th iteration is calculated from the feature data of the node, the topology corresponding to the node in the account-medium network graph, and the embedding vector of the node in the latent feature space after the (t-1)-th iteration.
  • Optionally, calculating, according to the feature data of the node and the topology corresponding to the node in the account-medium network graph, the embedding vector of the node in the latent feature space after multiple iterations specifically includes calculating it with the following formula:
  • Φ^(t+1) = σ(XW_1 + GΦ^(t)W_2);
  • where Φ^(t+1) denotes the embedding vectors of at least one of the nodes in the latent feature space after the (t+1)-th iteration, σ denotes a nonlinear transformation function, W_1 and W_2 denote weight matrices, X denotes the feature data of the at least one node, and G denotes the topology corresponding to the at least one node in the account-medium network graph.
  • Optionally, calculating the prediction data of the node according to the embedding vector specifically includes calculating it with the following formula:
  • pred_i = w^T φ_i;
  • where pred_i denotes the prediction data of the i-th node after the iterations, φ_i denotes the embedding vector of the i-th node in the latent feature space after the multiple iterations, w^T denotes the parameter vector used to turn φ_i into a score, and T denotes the transpose operation.
  • Optionally, training the predefined graph structure model by the training and identification module 503 specifically includes: training the predefined graph structure model with maximizing the consistency between the prediction data and the corresponding risk labeling data as the training target.
  • Optionally, training the predefined graph structure model by the training and identification module 503 specifically includes: using a back propagation algorithm and the risk labeling data to optimize
  • arg min_{W_1, W_2, w} Σ_i L(pred_i, y_i)
  • so as to find the optimal W_1, W_2, and w, where y_i denotes the risk labeling data of the i-th node, and L denotes a loss function measuring the consistency gap between the prediction data and the corresponding risk labeling data.
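To make the optimization step concrete, the sketch below runs plain gradient descent on a toy instance of this objective, using numerical (finite-difference) gradients in place of back propagation purely for brevity; a real implementation would use an autodiff framework. The logistic loss, tanh nonlinearity, label coding y_i ∈ {-1, +1}, and all sizes are assumptions for illustration:

```python
import numpy as np

def unpack(params, shapes):
    """Split a flat parameter vector back into W1, W2, w."""
    out, i = [], 0
    for s in shapes:
        size = int(np.prod(s))
        out.append(params[i:i + size].reshape(s))
        i += size
    return out

def forward(X, G, W1, W2, w, n_iter=2):
    Phi = np.tanh(X @ W1)                     # Phi^(1)
    for _ in range(n_iter):
        Phi = np.tanh(X @ W1 + G @ Phi @ W2)  # embedding update
    return Phi @ w                            # pred_i = w^T phi_i

def loss(params, X, G, y, shapes):
    W1, W2, w = unpack(params, shapes)
    pred = forward(X, G, W1, W2, w)
    return float(np.mean(np.log1p(np.exp(-y * pred))))  # logistic loss

def num_grad(f, params, eps=1e-5):
    """Central finite differences; a stand-in for back propagation."""
    g = np.zeros_like(params)
    for i in range(params.size):
        p1, p2 = params.copy(), params.copy()
        p1[i] += eps
        p2[i] -= eps
        g[i] = (f(p1) - f(p2)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
n, d, k = 6, 4, 3                              # toy sizes
X = rng.random((n, d))
G = (rng.random((n, n)) < 0.4).astype(float)   # toy topology
y = rng.choice([-1.0, 1.0], size=n)            # risk labels
shapes = [(d, k), (k, k), (k,)]
params = rng.normal(scale=0.1, size=d * k + k * k + k)
f = lambda p: loss(p, X, G, y, shapes)
before = f(params)
for _ in range(200):                           # plain gradient descent
    params -= 0.1 * num_grad(f, params)
after = f(params)                              # should end up below `before`
```

Finite differences scale poorly with parameter count; back propagation computes the same gradient at essentially the cost of one extra forward pass, which is why the text prescribes it.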
  • FIG. 6 is a schematic structural diagram of a spam account identification apparatus corresponding to FIG. 3 according to an embodiment of this specification.
  • The apparatus may be located in the execution body of the process in FIG. 3, and includes: an acquisition module 601, which acquires feature data of an account to be identified and acquires the account-medium network graph to which the account belongs; an input module 602, which inputs the feature data of the account to be identified, together with the topology corresponding to that account in the account-medium network graph, into the graph structure model trained with the graph structure model training method above, for calculation; and a determination module 603, which determines, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
  • FIG. 7 is a schematic structural diagram of a graph structure model training device corresponding to FIG. 2 according to an embodiment of this specification.
  • The device includes: at least one processor; and a memory communicatively connected to the at least one processor.
  • The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to: acquire an account-medium network graph, where the nodes in the graph represent accounts and media, and at least some edges indicate that the nodes they connect have a login behavior relationship; acquire feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding node over a time series; and train a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
  • FIG. 8 is a schematic structural diagram of a spam account identification device corresponding to FIG. 3 according to an embodiment of this specification.
  • The device includes: at least one processor; and a memory communicatively connected to the at least one processor.
  • The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to: acquire feature data of an account to be identified, and acquire the account-medium network graph to which the account belongs; input the feature data of the account to be identified, together with the topology corresponding to that account in the account-medium network graph, into the graph structure model trained with the graph structure model training method above, for calculation; and determine, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
  • The embodiment of this specification further provides a non-volatile computer storage medium corresponding to FIG. 2, storing computer-executable instructions configured to: acquire an account-medium network graph, where the nodes in the graph represent accounts and media, and at least some edges indicate that the nodes they connect have a login behavior relationship; acquire feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding node over a time series; and train a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
  • The embodiment of this specification further provides a non-volatile computer storage medium corresponding to FIG. 3, storing computer-executable instructions configured to: acquire feature data of an account to be identified, and acquire the account-medium network graph to which the account belongs; input the feature data of the account to be identified, together with the topology corresponding to that account in the account-medium network graph, into the graph structure model trained with the graph structure model training method above, for calculation; and determine, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
  • The apparatuses, devices, and non-volatile computer storage media provided in the embodiments of this specification correspond to the methods, and therefore they also have beneficial technical effects similar to those of the corresponding methods. Since the beneficial technical effects of the methods have been described in detail above, those of the corresponding apparatuses, devices, and non-volatile computer storage media are not repeated here.
  • The controller can be implemented in any suitable manner. For example, the controller can take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory.
  • Besides implementing the controller in pure computer readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component.
  • Or even, the means for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
  • The systems, apparatuses, modules, or units illustrated in the above embodiments can be implemented by a computer chip or an entity, or by a product having a certain function.
  • A typical implementation device is a computer.
  • The computer can be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • The embodiments of this specification can be provided as a method, a system, or a computer program product.
  • Therefore, the embodiments of this specification can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
  • Moreover, the embodiments of this specification can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
  • In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media include permanent and non-permanent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • The information can be computer readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which can be used to store information accessible to a computing device.
  • As defined herein, computer readable media do not include transitory media, such as modulated data signals and carrier waves.
  • embodiments of the present description can be provided as a method, system, or computer program product. Accordingly, the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment in combination of software and hardware. Moreover, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the present specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communication network.
  • program modules can be located in both local and remote computer storage media including storage devices.


Abstract

The embodiments of this specification disclose graph structure model training and spam account identification methods, apparatuses, and devices. The scheme includes: obtaining an account-medium network graph, where nodes in the graph represent accounts and media, and at least some edges indicate a login behavior relationship between the nodes they connect; obtaining feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding node over a time series; training a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data; and identifying spam accounts using the trained graph structure model.

Description

Graph structure model training and spam account identification
Cross-reference to related applications
This patent application claims priority to Chinese Patent Application No. 201810209270.1, filed on March 14, 2018 and entitled "Graph structure model training, spam account identification methods, apparatuses, and devices", which is incorporated herein by reference in its entirety.
Technical field
This specification relates to the field of computer software technologies, and in particular, to graph structure model training and spam account identification methods, apparatuses, and devices.
Background
With the rapid development of computer and Internet technologies, many services can be performed online. To use these services, users often need to register corresponding accounts, such as e-commerce platform accounts, third-party payment platform accounts, and forum platform accounts.
For improper purposes, some users or organizations register a large number of accounts and use them for abnormal operations, such as spreading messages, promoting false advertisements, or faking orders. Such accounts can bring risks to the platform, have little value to it, and are regarded as spam accounts.
In the prior art, spam accounts are generally identified through user reports and then handled accordingly, for example by freezing or canceling them.
Based on the prior art, an effective spam account identification scheme is needed.
Summary
The embodiments of this specification provide graph structure model training and spam account identification methods, apparatuses, and devices, to solve the following technical problem: an effective spam account identification scheme is needed.
To solve the above technical problem, the embodiments of this specification are implemented as follows.
A graph structure model training method provided in an embodiment of this specification includes: obtaining an account-medium network graph, where nodes in the graph represent accounts and media, and at least some edges indicate that the nodes they connect have a login behavior relationship; obtaining feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding node over a time series; and training a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
A spam account identification method provided in an embodiment of this specification includes: obtaining feature data of an account to be identified, and obtaining the account-medium network graph to which the account belongs; inputting the feature data of the account to be identified, together with the topology corresponding to that account in the account-medium network graph, into the graph structure model trained with the above graph structure model training method, for calculation; and determining, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
A graph structure model training apparatus provided in an embodiment of this specification includes: a first acquisition module, which obtains an account-medium network graph, where nodes in the graph represent accounts and media, and at least some edges indicate that the nodes they connect have a login behavior relationship; a second acquisition module, which obtains feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding node over a time series; and a training and identification module, which trains a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
A spam account identification apparatus provided in an embodiment of this specification includes: an acquisition module, which obtains feature data of an account to be identified and obtains the account-medium network graph to which the account belongs; an input module, which inputs the feature data of the account to be identified, together with the topology corresponding to that account in the account-medium network graph, into the graph structure model trained with the above graph structure model training method, for calculation; and a determination module, which determines, according to the prediction data output by the trained graph structure model, whether the account to be identified is a spam account.
A graph structure model training device provided in an embodiment of this specification includes: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to: obtain an account-medium network graph, where nodes in the graph represent accounts and media, and at least some edges indicate that the nodes they connect have a login behavior relationship; obtain feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding node over a time series; and train a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
The above at least one technical solution adopted in the embodiments of this specification can achieve the following beneficial effect: through the above graph embedding scheme based on the account-medium network graph, the medium aggregation and time aggregation of spam accounts can be exploited to identify spam accounts effectively.
Brief description of the drawings
To describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some embodiments recorded in this specification, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an overall architecture involved in the solution of this specification in a practical application scenario;
FIG. 2 is a schematic flowchart of a graph structure model training method according to an embodiment of this specification;
FIG. 3 is a schematic flowchart of a spam account identification method according to an embodiment of this specification;
FIG. 4 is a schematic diagram of an implementation of the foregoing methods according to an embodiment of this specification;
FIG. 5 is a schematic structural diagram of a graph structure model training apparatus corresponding to FIG. 2 according to an embodiment of this specification;
FIG. 6 is a schematic structural diagram of a spam account identification apparatus corresponding to FIG. 3 according to an embodiment of this specification;
FIG. 7 is a schematic structural diagram of a graph structure model training device corresponding to FIG. 2 according to an embodiment of this specification;
FIG. 8 is a schematic structural diagram of a spam account identification device corresponding to FIG. 3 according to an embodiment of this specification.
Detailed description
The embodiments of this specification provide graph structure model training and spam account identification methods, apparatuses, and devices.
To enable a person skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are merely some rather than all of the embodiments of this application. Based on the embodiments of this specification, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
Generally, any account used for abnormal behavior can be regarded as a spam account, for example, accounts automatically registered in bulk by machines. In the risk control systems of most business platforms, identifying spam accounts is important for risk control security; the difficulty is that a newly registered account does not yet have enough account profile information to determine whether it is a spam account. This specification considers two characteristics that spam accounts often have, medium aggregation and time aggregation, and accordingly proposes a supervised graph embedding scheme for spam account identification that can identify spam accounts effectively. Here, graph embedding can refer to mapping some raw data of the nodes in a graph into a specified feature space (called the latent feature space in this specification) to obtain corresponding embedding vectors that represent the nodes.
Medium aggregation can mean that multiple spam accounts registered by the same malicious user are often registered through the same medium or a few media. The reason for medium aggregation is that malicious users pursue profit and do not have enough resources to register a large number of accounts through a large number of media.
Time aggregation can mean that spam accounts controlled by the same malicious user often produce a large amount of abnormal behavior within a short time period. The reason for time aggregation is that malicious users pursue short-term profit goals, which forces the accounts under their control to produce a large amount of abnormal behavior within a short time.
FIG. 1 is a schematic diagram of an overall architecture involved in the solution of this specification in a practical application scenario. The architecture mainly involves a supervised learning server where a predefined graph structure model resides, and three kinds of data that can be used to train the graph structure model: an account-medium network graph reflecting a specified behavior relationship, feature data of the nodes in the graph reflecting the specified behavior over a time series, and risk labeling data of the nodes. The specified behavior is, for example, login behavior, registration behavior, or transaction behavior. After training, the graph structure model can be used to identify spam accounts.
The training data can be generated by the supervised learning server or other devices, or can be written manually.
The solution of this specification is described in detail below based on the exemplary architecture in FIG. 1.
FIG. 2 is a schematic flowchart of a graph structure model training method according to an embodiment of this specification. The process in FIG. 2 includes the following steps.
S202: Obtain an account-medium network graph, where nodes in the graph represent accounts and media, and at least some edges indicate that the nodes they connect have a login behavior relationship.
In the embodiments of this specification, the account-medium network graph is a heterogeneous network graph; heterogeneity refers to differences in node nature. For example, some nodes in the graph may represent accounts, and some nodes may represent media related to the accounts. An account registers or performs subsequent behavior through a medium; a medium is, for example, a device, an IP network, or a physical address.
In the embodiments of this specification, the account-medium network graph can be generated from the historical data of accounts within a certain time range. The historical data can include account registration behavior data, such as through which medium an account was registered and the registration information filled in at registration; the historical data can also include behavior data after registration, such as the login behavior data and transaction behavior data of the account. The certain time range is not specifically limited here and can be preset, for example, the most recent several days.
For a newly registered account, for example, the account-medium network graph can be generated from the account registration behavior data and/or the specified behavior data within a certain time range after registration (usually a short time range), so as to identify spam accounts as early as possible.
For ease of description, some of the following embodiments are mainly described with a device as the medium and login behavior as the specified behavior; the account-medium network graph is then specifically an account-device network graph.
When generating the account-device network graph, the accounts and devices to be represented can first be determined; each account to be represented is represented by one node, and each device can also be represented by one node, so that any node represents either an account or a device. Further, if two nodes have a login relationship, an edge representing the login behavior relationship is established between them, thereby generating the account-device network graph.
Here, the login behavior relationship mainly refers to the relationship between an account and a device: if an account has logged in on a device within a certain time range, the account and the device can be said to have a login behavior relationship. It should be noted that, in practical applications, the specific meaning of the login behavior relationship can be broadened if required. For example, it can also include the relationship between two accounts: if an account and another account have logged in on the same device within a certain time range, the two accounts can be said to have a login behavior relationship.
In the embodiments of this specification, the account-medium network graph can be an undirected graph or a directed graph, which is not specifically limited here. Generally, if only login behavior relationships are reflected, an undirected graph suffices; if more relationships such as transaction behavior relationships are also reflected, a directed graph can be used, in which the direction of an edge indicates the direction of the business relationship. For example, if node A represents a buyer account and node B represents a seller account, the edge representing the transaction behavior relationship between them can point from node A to node B, and this direction can also reflect the direction of the flow of funds.
In the embodiments of this specification, to facilitate graph computation, the account-medium network graph can be represented by a matrix. Different rows and columns of the matrix can represent different nodes in the account-medium network graph, and the elements of the matrix represent the login behavior relationship between the nodes of the row and column where each element is located.
For example, for a matrix representing the account-device network graph, the matrix can be denoted as G ∈ R^{n×n}, with n rows and n columns, where n is the number of accounts plus the number of devices to be represented. Assume the account-device network graph is a bipartite graph, so an edge is only possible between a node representing an account and a node representing a device; if there is an edge, the corresponding element is 1, otherwise 0. For example, if there is an edge between the nodes representing account i and device j, then the element in row i and column j of G is g_{i,j} = 1.
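As an illustration of how such a bipartite adjacency matrix might be assembled, assuming login records are available as simple (account, device) index pairs (all names here are placeholders, not the patent's implementation):

```python
import numpy as np

def build_adjacency(n_accounts, n_devices, login_records):
    """Build the symmetric bipartite adjacency matrix G described above.

    Rows/columns 0..n_accounts-1 are account nodes; the remaining
    rows/columns are device nodes. login_records is a list of
    (account_index, device_index) pairs observed in the time window.
    """
    n = n_accounts + n_devices
    G = np.zeros((n, n))
    for acct, dev in login_records:
        G[acct, n_accounts + dev] = 1.0  # edge: account logged in on device
        G[n_accounts + dev, acct] = 1.0  # undirected graph, so mirror it
    return G

# 3 accounts, 2 devices; accounts 0 and 1 share device 0 (medium aggregation)
G = build_adjacency(3, 2, [(0, 0), (1, 0), (2, 1)])
```

Accounts that share a device end up two hops apart in G, which is exactly the medium-aggregation signal the iteration formula later propagates.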
S204: Obtain feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding node over a time series.
In the embodiments of this specification, the nodes in step S204 can be some, and not necessarily all, of the nodes in the account-medium network graph. For example, they can be at least some of the nodes representing accounts; of course, the feature data of at least some of the nodes representing media can also be obtained. The feature data of a node representing a medium does not necessarily reflect the login behavior of the corresponding node over a time series, which is not specifically limited here; if the medium is a device, its feature data can, for example, reflect device information such as the device type and the device manufacturer.
In the embodiments of this specification, the feature data can be generated from the historical data of accounts within a certain time range. In view of the time aggregation mentioned above, when generating the feature data, not only the login behavior of the account itself but also the relationship between the login behavior and time is considered. For example, time can be serialized (such as by dividing it into multiple time intervals, or by sampling discrete time points), and the distribution of the account's login behavior over the time series can be determined, such as the specific moments at which login behavior occurs, its duration, and the number of logins per unit time. Feature data can generally be represented as a vector or a matrix; some of the following embodiments are mainly described with feature data represented as a vector.
In the embodiments of this specification, take the nodes in the account-device network graph as an example. Some nodes may represent spam accounts, and some nodes may represent devices on which spam accounts have logged in; these situations are not yet clear, and at least some of them need to be clarified through specific means before training samples with training labels can be obtained for subsequent supervised learning. The specific means are not limited here; for example, they can be based on precise tracking analysis of samples, or on user reports.
Through the at least partially clarified situations above, risk labeling data can be labeled for some nodes in advance or in real time. The risk labeling data can indicate the risk of a node, for example, whether it represents a spam account, or whether it represents a device on which a spam account has logged in. In practical applications, the risk here is not limited to spam-account-related content; for example, it can also indicate the risk of a normal account being vulnerable to attack. The above training labels can be obtained from the risk labeling data; generally, the risk labeling data can be used directly as training labels.
The representation of the risk labeling data is varied and not specifically limited here. For example, if a node is determined to be unrelated to spam accounts, its risk labeling data can be recorded as 1; if a node is determined to represent a spam account or a device on which a spam account has logged in, its risk labeling data can be recorded as 0; and so on.
In addition, in practical applications, risk labeling data can also be labeled only for the nodes representing accounts, and not for the nodes representing media.
S206: Train a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
In the embodiments of this specification, at least some parameters of the graph structure model are based on the graph structure; these parameters can be assigned values according to at least part of the account-medium network graph and/or the feature data. The graph structure model also has some parameters that need to be solved through training and optimization.
For example, in a practical application scenario, the graph structure model is used to calculate, according to the feature data of a node and the topology corresponding to the node in the account-medium network graph, the embedding vector of the node in the latent feature space after multiple iterations; further, the graph structure model is also used to calculate the prediction data of the node according to the embedding vector, where the prediction data indicates the likelihood that the node corresponds to a spam account.
The form of the prediction data is varied and not specifically limited here; it can be, for example, a probability value, a non-probability score, or a classification category identifier.
In practical applications, the graph structure model does not necessarily have to calculate the prediction data; the embedding vector can also be output for use by another model after it is calculated. This specification does not analyze this case in detail; some of the following embodiments are still mainly described based on the above example.
In the embodiments of this specification, after the graph structure model is trained, it can be used for classification or regression to predict the risk nature of input data.
For example, the input data can be the feature data corresponding to an account to be identified and the corresponding topology in the account-medium network graph to which the account belongs (not necessarily the account-medium network graph in step S202); through calculation by the trained graph structure model, prediction data is output, from which it can be determined whether the account to be identified is a spam account. The account to be identified can be represented by a node in the account-medium network graph in step S202, or can be outside that graph. In the former case, the input data is already determined, so identification can proceed directly; in the latter case, the input data may not yet be determined, and the scheme of this specification can be used to determine the input data first and then perform identification.
Of course, depending on the specific content of the risk labeling data, the graph structure model may also be used to predict risks in other aspects besides identifying spam accounts; the principle is the same and is not repeated here.
Through the method in FIG. 2, with the above graph embedding scheme based on the account-medium network graph, the medium aggregation and time aggregation of spam accounts can be exploited to identify spam accounts effectively.
Based on the method in FIG. 2, the embodiments of this specification further provide some specific implementations and extensions of the method, which are described below.
In the embodiments of this specification, from the above example, for step S206, identifying spam accounts can specifically include: obtaining feature data of an account to be identified, and obtaining the account-medium network graph to which the account belongs; inputting the feature data of the account to be identified, together with the topology corresponding to that account in the account-medium network graph, into the trained graph structure model for calculation; and obtaining the prediction data output by the trained graph structure model to determine whether the account to be identified is a spam account.
In the embodiments of this specification, as mentioned above, the time series can be obtained by dividing a time range. In this case, for step S204, obtaining the feature data of a node can specifically include: obtaining the login behavior data of the node within a certain time range; dividing the certain time range to obtain a time series; and generating a feature vector as the feature data of the node according to the distribution of the login behavior data over the time series.
For example, assume the certain time range is set to the past m days and divided by hour, giving a time series of m*24 time segments; a d-dimensional feature vector x_i can then be generated from the number of logins of account i in each time segment. The specific construction of x_i is not limited here; for example, d can be equal to m*24, each element of x_i can represent the number of logins of account i in one of the time segments, and the elements of x_i can be normalized.
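One possible construction of such a vector, sketched under the stated assumptions (hour-level segments, counts normalized by the peak count; the normalization choice is illustrative, since the text does not fix one):

```python
import numpy as np

def login_feature_vector(login_hours, m_days):
    """Map an account's login timestamps to a d = m_days * 24 vector.

    login_hours: iterable of integer hour offsets (0 <= h < m_days * 24)
    counting from the start of the m-day window. Each element of the
    returned vector is the login count in that hourly segment, divided
    by the maximum count so values lie in [0, 1].
    """
    d = m_days * 24
    x = np.zeros(d)
    for h in login_hours:
        x[h] += 1.0
    peak = x.max()
    return x / peak if peak > 0 else x

# Two-day window: two logins in hour 0, one in hour 5, one in hour 47.
x = login_feature_vector([0, 0, 5, 47], m_days=2)
```

A spam account controlled in a burst would concentrate its mass in a few adjacent segments, which is the time-aggregation pattern the model is meant to pick up.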
In the embodiments of this specification, the embedding vector of a node in the latent feature space after the t-th iteration can be calculated from the feature data of the node, the topology corresponding to the node in the account-medium network graph, and the embedding vector of the node in the latent feature space after the (t-1)-th iteration. More intuitively, combining the examples above, the definition and training process of an exemplary graph structure model are as follows:
"Initialize the parameters of the graph structure model to be optimized, W_1, W_2, w,
for example with standard Gaussian initialization;
Iterate training for a set number of times, or until training converges:
initialize Φ^(1);
{for t = 1 to N:    // N iterations, to compute the embedding vectors
Φ^(t+1) = σ(XW_1 + GΦ^(t)W_2);}   // (Formula 1)
pred_i = w^T φ_i;   // (Formula 2) compute prediction data from the embedding vectors
arg min_{W_1, W_2, w} Σ_i L(pred_i, y_i);   // (Formula 3) optimize the parameters"
where Φ^(t+1) denotes the embedding vectors of at least one of the nodes in the latent feature space after the (t+1)-th iteration; σ denotes a nonlinear transformation function (for example, ReLU, Sigmoid, or Tanh); W_1 and W_2 denote weight matrices; X denotes the feature data of the at least one node; G denotes the topology corresponding to the at least one node in the account-medium network graph; pred_i denotes the prediction data of the i-th node after the iterations; φ_i denotes the embedding vector of the i-th node in the latent feature space after the multiple iterations; w^T denotes the parameter vector used to turn φ_i into a score, and T denotes the transpose operation; y_i denotes the risk labeling data of the i-th node; and L denotes a loss function measuring the consistency gap between the prediction data and the corresponding risk labeling data, which is not specifically limited here; for example, logistic loss, hinge loss, or cross-entropy loss can be used.
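The iteration and scoring formulas above can be sketched directly in a few lines of numpy; the ReLU choice for σ, the random initialization, and the toy sizes below are assumptions for illustration, not the patent's configuration:

```python
import numpy as np

def forward(X, G, W1, W2, w, n_iter):
    """Compute Phi via Formula 1 and the per-node scores via Formula 2.
    ReLU is used for sigma here; the text also allows Sigmoid or Tanh."""
    Phi = np.maximum(X @ W1, 0.0)                      # initialize Phi^(1)
    for _ in range(n_iter):
        Phi = np.maximum(X @ W1 + G @ Phi @ W2, 0.0)   # Formula 1
    scores = Phi @ w                                   # Formula 2, all nodes at once
    return Phi, scores

rng = np.random.default_rng(0)
n, d, k = 5, 8, 4                                      # toy sizes
X = rng.random((n, d))                                 # per-node feature data
G = (rng.random((n, n)) < 0.3).astype(float)           # toy adjacency
W1 = rng.normal(size=(d, k))
W2 = rng.normal(size=(k, k))
w = rng.normal(size=k)
Phi, pred = forward(X, G, W1, W2, w, n_iter=3)         # raw scores; a sigmoid or
                                                       # softmax could map them to
                                                       # probabilities if desired
```

Each iteration mixes a node's own features (XW_1) with its neighbors' current embeddings (GΦ^(t)W_2), so after N iterations a node's embedding reflects its N-hop neighborhood in the account-medium graph.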
In the earlier example, G ∈ R^{n×n}, and G then denotes the complete topology of the account-medium network graph. In this case, X can denote the feature data of all the nodes in the account-medium network graph, and Φ can denote the embedding vectors of all the nodes; for example, X ∈ R^{n×d}, where each row of X is the feature data of one node, and Φ ∈ R^{n×k}, where k denotes the dimension of the latent feature space of the embedding and each row of Φ is the embedding vector of one node.
Of course, G can also denote only a part of the complete topology of the account-medium network graph; correspondingly, X and Φ can also contain the data of only some of the nodes in the graph.
In the embodiments of this specification, the graph structure model can be trained with maximizing the consistency between the prediction data and the corresponding risk labeling data as the training target. Then, in the scenario of the above example, for step S206, training the predefined graph structure model can specifically include: using a back propagation algorithm and the risk labeling data to optimize
arg min_{W_1, W_2, w} Σ_i L(pred_i, y_i),
so as to find the optimal W_1, W_2, and w.
Formula 1, Formula 2, and Formula 3 above are exemplary and not the only scheme. For example, in Formula 1 the terms containing X and G can be transformed by operations such as multiplication, exponentiation, or logarithm, or the two terms can be merged, or one of them can be deleted; in Formula 2, a softmax function can also be used to turn φ_i into a score; and if the loss function in Formula 3 expresses the degree of consistency between the prediction data and the corresponding risk prediction data, Formula 3 can be adjusted to maximization instead of minimization; and so on.
Optionally, the second acquisition module 502 obtaining the feature data of a node specifically includes: the second acquisition module 502 obtains the login behavior data of the node within a certain time range; divides the certain time range to obtain a time series; and generates a feature vector as the feature data of the node according to the distribution of the login behavior data over the time series.
The foregoing describes specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference can be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, device, and non-volatile computer storage medium embodiments are basically similar to the method embodiments, so they are described relatively simply; for relevant parts, reference can be made to the description of the method embodiments.
In the 1990s, an improvement of a technology could be clearly distinguished as an improvement in hardware (for example, an improvement of circuit structures such as diodes, transistors, and switches) or an improvement in software (an improvement of a method flow). However, with the development of technology, improvements of many method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented with hardware entity modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is such an integrated circuit whose logic functions are determined by the user programming the device. Designers program a digital system onto a single PLD by themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled must be written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. A person skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
For convenience of description, the above apparatuses are described by dividing them into various units by function. Of course, when implementing this specification, the functions of the units can be implemented in one or more pieces of software and/or hardware.
This specification is described with reference to flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference can be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so they are described relatively simply; for relevant parts, reference can be made to the description of the method embodiments.
The above descriptions are merely embodiments of this specification and are not intended to limit this application. For a person skilled in the art, this application may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of this application shall fall within the scope of the claims of this application.

Claims (24)

  1. A graph structure model training method, comprising:
    obtaining an account-medium network graph, wherein nodes in the account-medium network graph represent accounts and media, and at least some edges indicate that the nodes they connect have a login behavior relationship;
    obtaining feature data and risk labeling data of the nodes, wherein the feature data reflects the login behavior of the corresponding node over a time series; and
    training a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, so as to identify spam accounts.
  2. The method according to claim 1, wherein the media comprise devices.
  3. The method according to claim 1, wherein the graph structure model is used to calculate, according to the feature data of a node and the topology corresponding to the node in the account-medium network graph, the embedding vector of the node in the latent feature space after multiple iterations.
  4. The method according to claim 3, wherein the graph structure model is further used to calculate prediction data of the node according to the embedding vector, the prediction data indicating the likelihood that the node corresponds to a spam account.
  5. The method according to claim 1, wherein obtaining the feature data of the nodes specifically includes:
    obtaining login behavior data of the nodes within a certain time range;
    dividing the certain time range to obtain a time series; and
    generating a feature vector as the feature data of the nodes according to the distribution of the login behavior data over the time series.
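As an illustration of the feature construction in claim 5, the sketch below buckets a node's login timestamps over a divided time range into a count vector. The 24-hour window, the eight time slices, and the sample timestamps are assumptions made for this example only; the claim itself fixes none of these values.

```python
import numpy as np

# Hypothetical login timestamps (in hours) for one node over a 24-hour window.
logins = [0.5, 1.2, 1.3, 22.7]

def login_feature_vector(timestamps, window_hours=24.0, n_buckets=8):
    """Divide the time range into equal slices and count logins per slice."""
    edges = np.linspace(0.0, window_hours, n_buckets + 1)
    counts, _ = np.histogram(timestamps, bins=edges)
    return counts.astype(float)

x = login_feature_vector(logins)  # feature vector: login count per 3-hour slice
```

In practice the raw counts are often normalized or extended with per-slice statistics, but any fixed-length summary of the login distribution over the time series fits the claimed construction.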
  6. The method according to claim 3, wherein the embedding vector of a node in the latent feature space after the t-th iteration is computed according to the feature data of the node, the topology corresponding to the node in the account-medium network graph, and the embedding vector of the node in the latent feature space after the (t-1)-th iteration.
  7. The method according to claim 4, wherein computing the embedding vectors of the nodes in the latent feature space after multiple iterations, according to the feature data of the nodes and the topology corresponding to the nodes in the account-medium network graph, specifically includes:
    computing the embedding vectors of the nodes in the latent feature space after multiple iterations according to the following formula:
    Φ^(t+1) = σ(XW_1 + GΦ^(t)W_2);
    where Φ^(t+1) denotes the embedding vectors of at least one of the nodes in the latent feature space after the (t+1)-th iteration,
    σ denotes a nonlinear transformation function,
    W_1 and W_2 denote weight matrices,
    X denotes the feature data of the at least one node, and
    G denotes the topology corresponding to the at least one node in the account-medium network graph.
  8. The method according to claim 7, wherein computing the prediction data of the nodes according to the embedding vectors specifically includes:
    computing the prediction data of the nodes according to the following formula:
    pred_i = w^T φ_i;
    where pred_i denotes the prediction data of the i-th node after the iterations,
    φ_i denotes the embedding vector of the i-th node in the latent feature space after the multiple iterations,
    w^T denotes a parameter vector used to convert φ_i into a score, and
    T denotes the transpose operation.
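Claims 7 and 8 together define a forward pass: iterate the embedding update, then score each node. A minimal numpy sketch follows, with tanh standing in for the unspecified nonlinearity σ, and with the node counts, dimensions, graph, and weights all randomly generated placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 5, 8, 4                             # node count and dimensions (illustrative)
X = rng.random((n, d))                        # feature data of the nodes
G = (rng.random((n, n)) < 0.4).astype(float)  # topology: a random adjacency matrix
W1 = rng.normal(scale=0.1, size=(d, k))       # weight matrices, randomly initialized
W2 = rng.normal(scale=0.1, size=(k, k))
w = rng.normal(scale=0.1, size=k)             # scoring parameter vector

def embed(X, G, W1, W2, T=3):
    """Iterate Phi^(t+1) = sigma(X W1 + G Phi^(t) W2); tanh plays the role of sigma."""
    Phi = np.zeros((X.shape[0], W2.shape[0]))
    for _ in range(T):
        Phi = np.tanh(X @ W1 + G @ Phi @ W2)
    return Phi

Phi = embed(X, G, W1, W2)   # embedding vectors after T iterations
scores = Phi @ w            # pred_i = w^T phi_i for every node i
```

The term GΦ^(t)W_2 is what propagates each node's embedding to its graph neighbors, so after T iterations a node's score reflects login behavior up to T hops away in the account-medium graph.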
  9. The method according to claim 4, wherein training the predefined graph structure model specifically includes:
    training the predefined graph structure model with a training objective of maximizing the consistency between the prediction data and the corresponding risk labeling data.
  10. The method according to claim 8, wherein training the predefined graph structure model specifically includes:
    optimizing, by using a backpropagation algorithm and the risk labeling data,
    min_{W_1, W_2, w} Σ_i L(pred_i, y_i)    (Figure PCTCN2019071868-appb-100001)
    to obtain the optimal W_1, W_2, and w;
    where y_i denotes the risk labeling data of the i-th node, and
    L denotes a loss function used to measure the consistency gap between the prediction data and the corresponding risk labeling data.
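Claim 10 leaves the loss function L abstract. The sketch below assumes a logistic (cross-entropy) loss and, to keep the example short, runs gradient descent only on the scoring vector w while holding the embeddings Φ fixed; the claimed training additionally backpropagates the same loss through W_1 and W_2.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 4))           # embeddings after the iterations (placeholder values)
y = np.array([0., 1., 0., 1., 1., 0.])  # risk labeling data y_i
w = np.zeros(4)                         # scoring parameter vector

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    # Assumed L: mean cross-entropy between sigmoid(pred_i) and y_i.
    p = sigmoid(Phi @ w)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def grad(w):
    # Analytic gradient of the cross-entropy loss with respect to w.
    return Phi.T @ (sigmoid(Phi @ w) - y) / len(y)

initial = loss(w)            # equals log(2) at w = 0
for _ in range(100):
    w -= 0.5 * grad(w)
final = loss(w)              # the consistency gap shrinks as w is optimized
```

Minimizing this summed loss is the same thing as the "consistency maximization" objective of claim 9, stated in loss-function form.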
  11. A junk account identification method, comprising:
    obtaining feature data of a to-be-identified account, and obtaining an account-medium network graph to which the to-be-identified account belongs;
    inputting the feature data of the to-be-identified account and the topology corresponding to the to-be-identified account in the account-medium network graph into a graph structure model trained by using the method according to any one of claims 1 to 10 for computation; and
    determining, according to prediction data output by the trained graph structure model, whether the to-be-identified account is a junk account.
  12. A graph structure model training apparatus, comprising:
    a first acquisition module, configured to obtain an account-medium network graph, where nodes in the account-medium network graph represent accounts and media, and at least some of the edges indicate that the nodes they connect have a login behavior relationship;
    a second acquisition module, configured to obtain feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding nodes over a time series; and
    a training and identification module, configured to train a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, for identifying junk accounts.
  13. The apparatus according to claim 12, wherein the medium includes a device.
  14. The apparatus according to claim 12, wherein the graph structure model is used to compute the embedding vectors of the nodes in a latent feature space after multiple iterations, according to the feature data of the nodes and the topology corresponding to the nodes in the account-medium network graph.
  15. The apparatus according to claim 14, wherein the graph structure model is further used to compute prediction data of the nodes according to the embedding vectors, where the prediction data indicates the probability that a node corresponds to a junk account.
  16. The apparatus according to claim 12, wherein the second acquisition module obtaining the feature data of the nodes specifically includes:
    the second acquisition module obtaining login behavior data of the nodes within a certain time range;
    dividing the certain time range to obtain a time series; and
    generating a feature vector as the feature data of the nodes according to the distribution of the login behavior data over the time series.
  17. The apparatus according to claim 14, wherein the embedding vector of a node in the latent feature space after the t-th iteration is computed according to the feature data of the node, the topology corresponding to the node in the account-medium network graph, and the embedding vector of the node in the latent feature space after the (t-1)-th iteration.
  18. The apparatus according to claim 15, wherein computing the embedding vectors of the nodes in the latent feature space after multiple iterations, according to the feature data of the nodes and the topology corresponding to the nodes in the account-medium network graph, specifically includes:
    computing the embedding vectors of the nodes in the latent feature space after multiple iterations according to the following formula:
    Φ^(t+1) = σ(XW_1 + GΦ^(t)W_2);
    where Φ^(t+1) denotes the embedding vectors of at least one of the nodes in the latent feature space after the (t+1)-th iteration,
    σ denotes a nonlinear transformation function,
    W_1 and W_2 denote weight matrices,
    X denotes the feature data of the at least one node, and
    G denotes the topology corresponding to the at least one node in the account-medium network graph.
  19. The apparatus according to claim 18, wherein computing the prediction data of the nodes according to the embedding vectors specifically includes:
    computing the prediction data of the nodes according to the following formula:
    pred_i = w^T φ_i;
    where pred_i denotes the prediction data of the i-th node after the iterations,
    φ_i denotes the embedding vector of the i-th node in the latent feature space after the multiple iterations,
    w^T denotes a parameter vector used to convert φ_i into a score, and
    T denotes the transpose operation.
  20. The apparatus according to claim 15, wherein the training and identification module training the predefined graph structure model specifically includes:
    the training and identification module training the predefined graph structure model with a training objective of maximizing the consistency between the prediction data and the corresponding risk labeling data.
  21. The apparatus according to claim 19, wherein the training and identification module training the predefined graph structure model specifically includes:
    the training and identification module optimizing, by using a backpropagation algorithm and the risk labeling data,
    min_{W_1, W_2, w} Σ_i L(pred_i, y_i)    (Figure PCTCN2019071868-appb-100002)
    to obtain the optimal W_1, W_2, and w;
    where y_i denotes the risk labeling data of the i-th node, and
    L denotes a loss function used to measure the consistency gap between the prediction data and the corresponding risk labeling data.
  22. A junk account identification apparatus, comprising:
    an acquisition module, configured to obtain feature data of a to-be-identified account and obtain an account-medium network graph to which the to-be-identified account belongs;
    an input module, configured to input the feature data of the to-be-identified account and the topology corresponding to the to-be-identified account in the account-medium network graph into a graph structure model trained by using the method according to any one of claims 1 to 10 for computation; and
    a determination module, configured to determine, according to prediction data output by the trained graph structure model, whether the to-be-identified account is a junk account.
  23. A graph structure model training device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; where
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
    obtain an account-medium network graph, where nodes in the account-medium network graph represent accounts and media, and at least some of the edges indicate that the nodes they connect have a login behavior relationship;
    obtain feature data and risk labeling data of the nodes, where the feature data reflects the login behavior of the corresponding nodes over a time series; and
    train a predefined graph structure model according to the account-medium network graph, the feature data, and the risk labeling data, for identifying junk accounts.
  24. A junk account identification device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; where
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:
    obtain feature data of a to-be-identified account, and obtain an account-medium network graph to which the to-be-identified account belongs;
    input the feature data of the to-be-identified account and the topology corresponding to the to-be-identified account in the account-medium network graph into a graph structure model trained by using the method according to any one of claims 1 to 10 for computation; and
    determine, according to prediction data output by the trained graph structure model, whether the to-be-identified account is a junk account.
PCT/CN2019/071868 2018-03-14 2019-01-16 Graph structure model training and junk account identification WO2019174393A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
SG11202004182WA SG11202004182WA (en) 2018-03-14 2019-01-16 Graph structure model training and junk account identification
EP19768037.4A EP3703332B1 (en) 2018-03-14 2019-01-16 Graph structure model training and junk account identification
US16/882,084 US10917425B2 (en) 2018-03-14 2020-05-22 Graph structure model training and junk account identification

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810209270.1 2018-03-14
CN201810209270.1A CN110278175B (zh) 2018-03-14 2018-03-14 Graph structure model training, junk account identification method, apparatus, and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/882,084 Continuation US10917425B2 (en) 2018-03-14 2020-05-22 Graph structure model training and junk account identification

Publications (1)

Publication Number Publication Date
WO2019174393A1 true WO2019174393A1 (zh) 2019-09-19

Family

ID=67907357

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/071868 WO2019174393A1 (zh) 2018-03-14 2019-01-16 图结构模型训练和垃圾账号识别

Country Status (6)

Country Link
US (1) US10917425B2 (zh)
EP (1) EP3703332B1 (zh)
CN (1) CN110278175B (zh)
SG (1) SG11202004182WA (zh)
TW (1) TWI690191B (zh)
WO (1) WO2019174393A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705629A (zh) * 2019-09-27 2020-01-17 北京市商汤科技开发有限公司 Data processing method and related apparatus
CN111612039A (zh) * 2020-04-24 2020-09-01 平安直通咨询有限公司上海分公司 Abnormal user identification method and apparatus, storage medium, and electronic device
CN112699217A (zh) * 2020-12-29 2021-04-23 西安九索数据技术股份有限公司 Method for identifying behaviorally abnormal users based on user text data and communication data

Families Citing this family (19)

Publication number Priority date Publication date Assignee Title
CN112861120A (zh) * 2019-11-27 2021-05-28 深信服科技股份有限公司 Identification method, device, and storage medium
CN111210279B (zh) * 2020-01-09 2022-08-16 支付宝(杭州)信息技术有限公司 Target user prediction method, apparatus, and electronic device
CN111311076B (zh) * 2020-01-20 2022-07-29 支付宝(杭州)信息技术有限公司 Account risk management method, apparatus, device, and medium
CN111340612B (zh) * 2020-02-25 2022-12-06 支付宝(杭州)信息技术有限公司 Account risk identification method, apparatus, and electronic device
CN111340112B (zh) * 2020-02-26 2023-09-26 腾讯科技(深圳)有限公司 Classification method, apparatus, and server
CN111382403A (zh) * 2020-03-17 2020-07-07 同盾控股有限公司 Training method, apparatus, device, and storage medium for a user behavior recognition model
CN111488494B (zh) * 2020-04-13 2023-08-25 中国工商银行股份有限公司 Method and apparatus for coloring an account fund transfer network graph
CN111506895A (zh) * 2020-04-17 2020-08-07 支付宝(杭州)信息技术有限公司 Method and apparatus for constructing an application login graph
CN113554438B (zh) * 2020-04-23 2023-12-05 北京京东振世信息技术有限公司 Account identification method, apparatus, electronic device, and computer-readable medium
CN111639687B (zh) * 2020-05-19 2024-03-01 北京三快在线科技有限公司 Model training and abnormal account identification method and apparatus
CN114201655B (zh) * 2020-09-02 2023-08-25 腾讯科技(深圳)有限公司 Account classification method, apparatus, device, and storage medium
CN111915381A (zh) * 2020-09-14 2020-11-10 北京嘀嘀无限科技发展有限公司 Method, apparatus, electronic device, and storage medium for detecting cheating behavior
CN114338416B (zh) * 2020-09-29 2023-04-07 中国移动通信有限公司研究院 Spatio-temporal multi-indicator prediction method, apparatus, and storage medium
CN112929348B (zh) * 2021-01-25 2022-11-25 北京字节跳动网络技术有限公司 Information processing method and apparatus, electronic device, and computer-readable storage medium
CN112861140B (zh) * 2021-01-26 2024-03-22 上海德启信息科技有限公司 Business data processing method and apparatus, and readable storage medium
CN112818257B (zh) * 2021-02-19 2022-09-02 北京邮电大学 Graph neural network-based account detection method, apparatus, and device
CN113283925B (zh) * 2021-04-13 2022-08-02 支付宝(杭州)信息技术有限公司 Network experiment traffic splitting and node relationship prediction method, apparatus, and device
CN113935407A (zh) * 2021-09-29 2022-01-14 光大科技有限公司 Method and apparatus for determining an abnormal behavior recognition model
CN115018280B (zh) * 2022-05-24 2024-06-18 支付宝(杭州)信息技术有限公司 Risk graph pattern mining method, risk identification method, and corresponding apparatus

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106503562A (zh) * 2015-09-06 2017-03-15 阿里巴巴集团控股有限公司 Risk identification method and apparatus
CN106803178A (zh) * 2015-11-26 2017-06-06 阿里巴巴集团控股有限公司 Method and device for processing entities
CN107066616A (zh) * 2017-05-09 2017-08-18 北京京东金融科技控股有限公司 Method, apparatus, and electronic device for account processing
CN107153847A (zh) * 2017-05-31 2017-09-12 北京知道创宇信息技术有限公司 Method and computing device for predicting whether a user exhibits malicious behavior

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
CN1106771C (zh) * 1996-11-22 2003-04-23 西门子公司 Method for dynamic communication management in a communication network
CN110009372B (zh) * 2012-08-03 2023-08-18 创新先进技术有限公司 User risk identification method and apparatus
CN102946331B (zh) * 2012-10-10 2016-01-20 北京交通大学 Method and apparatus for detecting zombie users in a social network
CN103778151B (zh) * 2012-10-23 2017-06-09 阿里巴巴集团控股有限公司 Method and apparatus for identifying characteristic groups, and search method and apparatus
CN103294833B (zh) * 2012-11-02 2016-12-28 中国人民解放军国防科学技术大学 Junk user discovery method based on users' follow relationships
US10009358B1 (en) * 2014-02-11 2018-06-26 DataVisor Inc. Graph based framework for detecting malicious or compromised accounts
US9396332B2 (en) * 2014-05-21 2016-07-19 Microsoft Technology Licensing, Llc Risk assessment modeling
CN104090961B (zh) * 2014-07-14 2017-07-04 福州大学 Machine learning-based method for filtering junk users in social networks
CN104318268B (zh) * 2014-11-11 2017-09-08 苏州晨川通信科技有限公司 Multi-transaction-account identification method based on local distance metric learning
CN104615658B (zh) * 2014-12-31 2018-01-16 中国科学院深圳先进技术研究院 Method for determining user identity
CN106355405A (zh) * 2015-07-14 2017-01-25 阿里巴巴集团控股有限公司 Risk identification method, apparatus, and risk prevention and control system
CN105279086B (zh) * 2015-10-16 2018-01-19 山东大学 Flowchart-based method for automatically detecting logic vulnerabilities in e-commerce websites
EP3475889A4 (en) * 2016-06-23 2020-01-08 Capital One Services, LLC NEURONAL NETWORKING SYSTEMS AND METHOD FOR GENERATING DISTRIBUTED PRESENTATIONS OF ELECTRONIC TRANSACTION INFORMATION
US10505954B2 (en) * 2017-06-14 2019-12-10 Microsoft Technology Licensing, Llc Detecting malicious lateral movement across a computer network
CN107633263A (zh) * 2017-08-30 2018-01-26 清华大学 Edge-based network graph embedding method

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN106503562A (zh) * 2015-09-06 2017-03-15 阿里巴巴集团控股有限公司 Risk identification method and apparatus
CN106803178A (zh) * 2015-11-26 2017-06-06 阿里巴巴集团控股有限公司 Method and device for processing entities
CN107066616A (zh) * 2017-05-09 2017-08-18 北京京东金融科技控股有限公司 Method, apparatus, and electronic device for account processing
CN107153847A (zh) * 2017-05-31 2017-09-12 北京知道创宇信息技术有限公司 Method and computing device for predicting whether a user exhibits malicious behavior

Non-Patent Citations (1)

Title
See also references of EP3703332A4 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN110705629A (zh) * 2019-09-27 2020-01-17 北京市商汤科技开发有限公司 Data processing method and related apparatus
CN111612039A (zh) * 2020-04-24 2020-09-01 平安直通咨询有限公司上海分公司 Abnormal user identification method and apparatus, storage medium, and electronic device
CN111612039B (zh) * 2020-04-24 2023-09-29 平安直通咨询有限公司上海分公司 Abnormal user identification method and apparatus, storage medium, and electronic device
CN112699217A (zh) * 2020-12-29 2021-04-23 西安九索数据技术股份有限公司 Method for identifying behaviorally abnormal users based on user text data and communication data
CN112699217B (zh) * 2020-12-29 2023-04-18 西安九索数据技术股份有限公司 Method for identifying behaviorally abnormal users based on user text data and communication data

Also Published As

Publication number Publication date
EP3703332A4 (en) 2020-12-16
TW201939917A (zh) 2019-10-01
CN110278175A (zh) 2019-09-24
TWI690191B (zh) 2020-04-01
US20200287926A1 (en) 2020-09-10
SG11202004182WA (en) 2020-06-29
EP3703332B1 (en) 2021-11-10
US10917425B2 (en) 2021-02-09
EP3703332A1 (en) 2020-09-02
CN110278175B (zh) 2020-06-02

Similar Documents

Publication Publication Date Title
WO2019174393A1 (zh) Graph structure model training and junk account identification
WO2019114344A1 (zh) Graph structure model-based abnormal account prevention and control method, apparatus, and device
TWI715879B (zh) Graph structure model-based transaction risk control method, apparatus, and device
CN108418825B (zh) Risk model training and junk account detection method, apparatus, and device
US11537852B2 (en) Evolving graph convolutional networks for dynamic graphs
CN110363449B (zh) Risk identification method, apparatus, and system
US20200074274A1 (en) System and method for multi-horizon time series forecasting with dynamic temporal context learning
CN110119860B (zh) Junk account detection method, apparatus, and device
US11640617B2 (en) Metric forecasting employing a similarity determination in a digital medium environment
TW201928848A (zh) Graph structure model-based credit risk control method, apparatus, and device
CN111415015B (zh) Business model training method, apparatus, system, and electronic device
CN112015909B (zh) Knowledge graph construction method and apparatus, electronic device, and storage medium
TW201931150A (zh) Social content risk identification method, apparatus, and device
Xiao et al. Simulation optimization using genetic algorithms with optimal computing budget allocation
US20220078198A1 (en) Method and system for generating investigation cases in the context of cybersecurity
CN111274907B (zh) Method and apparatus for determining a user's category label by using a category recognition model
US11593622B1 (en) Artificial intelligence system employing graph convolutional networks for analyzing multi-entity-type multi-relational data
CN110851600A (zh) Deep learning-based text data processing method and apparatus
CN114241411B (zh) Object detection-based counting model processing method, apparatus, and computer device
WO2016161631A1 (en) Hidden dynamic systems
CN109740054A (zh) Method and device for determining associated financial information of a target user
Wu et al. Applying a Probabilistic Network Method to Solve Business‐Related Few‐Shot Classification Problems
CN117094032B (zh) Privacy protection-based user information encryption method and system
Milov et al. Classification of dangerous situations for small sample size problem in maintenance decision support systems
Sudarsanam et al. Estimating Software Reliability Using Particle Swarm Optimization Technique

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19768037

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019768037

Country of ref document: EP

Effective date: 20200527

NENP Non-entry into the national phase

Ref country code: DE