WO2020147488A1 - Method and device for identifying abnormal groups - Google Patents

Method and device for identifying abnormal groups

Info

Publication number
WO2020147488A1
Authority
WO
WIPO (PCT)
Prior art keywords
analyzed
frequency
users
feature value
graph
Prior art date
Application number
PCT/CN2019/126030
Other languages
English (en)
French (fr)
Inventor
苗加成
章鹏
杨程远
向彪
严欢
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020147488A1

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This specification relates to the field of computer technology, and in particular to a method and device for identifying abnormal groups.
  • The purpose of one or more embodiments of this specification is to provide a method and device for identifying abnormal groups, so as to solve the problem of low accuracy in identifying abnormal groups.
  • One or more embodiments of this specification provide a method for identifying abnormal groups, including:
  • the abnormal group among the users to be analyzed is determined.
  • the acquiring of the feature values of each of the plurality of users to be analyzed includes:
  • the determining of the high-frequency feature values and the low-frequency feature values among the feature values of each user to be analyzed includes:
  • the first bipartite graph is constructed according to the feature values of the users to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
  • the high-frequency feature values and the low-frequency feature values among the feature values of each user to be analyzed are determined according to the high-frequency feature values and the low-frequency feature values.
  • the mining of maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent itemset mining strategy, and the obtaining of the low-frequency maximum frequent feature values in the maximum frequent item sets, includes:
  • the determining of the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed includes:
  • the second bipartite graph includes nodes corresponding to each of the users to be analyzed, nodes corresponding to each of the maximum frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values;
  • the determining of the abnormal groups among the users to be analyzed, according to the weights of the edges in the target bipartite graph and the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target bipartite graph, includes:
  • deleting the edges whose weight is less than the first preset weight from the target bipartite graph to obtain the bipartite graph to be clustered, dividing the nodes in the bipartite graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and determining the users to be analyzed corresponding to the nodes in each node set as one abnormal group.
  • the determining of the abnormal groups among the users to be analyzed, according to the weights of the edges in the target bipartite graph and the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target bipartite graph, includes:
  • the determining of the abnormal groups among the users to be analyzed through the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target cluster graph includes:
  • deleting the edges whose weight is less than the second preset weight from the target cluster graph to obtain the graph to be clustered, applying a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one of the abnormal groups; or
  • an abnormal group identification device, including:
  • an obtaining module, configured to obtain the feature values of each of the plurality of users to be analyzed;
  • a determining module, configured to determine the high-frequency feature values and the low-frequency feature values among the feature values of the users to be analyzed;
  • a mining module, configured to mine the maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and the preset frequent itemset mining strategy, and to obtain the low-frequency maximum frequent feature values in the maximum frequent item sets;
  • a construction module, configured to construct a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and to define the weights of the edges in the target bipartite graph;
  • a clustering module, configured to determine the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target bipartite graph.
  • the acquisition module includes:
  • a discretization unit, configured to discretize the original personal data of the multiple users to be analyzed to obtain the feature values of each user to be analyzed.
  • the determining module includes:
  • a first construction unit, configured to construct a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes a node corresponding to each user to be analyzed, a node corresponding to each feature value, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
  • a first determining unit, configured to obtain the degree of the node corresponding to each feature value in the first bipartite graph, and to determine the high-frequency feature values and the low-frequency feature values among the feature values according to those degrees;
  • a second determining unit, configured to determine the high-frequency feature values and the low-frequency feature values among the feature values of each user to be analyzed according to the high-frequency feature values and the low-frequency feature values.
  • the mining module includes:
  • a mining unit, configured to mine, using the FP-Growth method on the high-frequency feature values of each user to be analyzed, the frequent multi-item sets whose support meets the preset support, and to determine the maximum frequent item sets among the frequent multi-item sets;
  • a matching unit, configured to match the feature values of each user to be analyzed against the maximum frequent feature values in the maximum frequent item sets to obtain the maximum frequent feature values of each user to be analyzed;
  • a third determining unit, configured to determine the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed.
  • the third determining unit includes:
  • a construction subunit, configured to construct a second bipartite graph according to the maximum frequent feature values of each user to be analyzed, wherein the second bipartite graph includes a node corresponding to each user to be analyzed, a node corresponding to each maximum frequent feature value, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values;
  • a determining subunit, configured to obtain the degree of the node corresponding to each maximum frequent feature value in the second bipartite graph, and to determine the low-frequency maximum frequent feature values among the maximum frequent feature values according to those degrees.
  • the clustering module includes:
  • a first clustering unit, configured to delete the edges whose weight is less than the first preset weight from the target bipartite graph to obtain the bipartite graph to be clustered, to apply a connectivity algorithm to the bipartite graph to be clustered to obtain at least one maximal connected subgraph, and to determine the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one of the abnormal groups; or
  • a second clustering unit, configured to delete the edges whose weight is less than the first preset weight from the target bipartite graph to obtain the bipartite graph to be clustered, to divide the nodes in the bipartite graph to be clustered with a community discovery algorithm to obtain multiple node sets, and to determine the users to be analyzed corresponding to the nodes in each node set as one of the abnormal groups.
  • the clustering module includes:
  • a calculation unit, configured to calculate the weight between any two users to be analyzed according to the weights of the edges in the target bipartite graph;
  • a second construction unit, configured to convert each user to be analyzed into a node, set an edge between any two nodes, and set the weight of that edge to the weight between the two corresponding users to be analyzed, so as to construct the target cluster graph;
  • a third clustering unit, configured to determine the abnormal groups among the users to be analyzed through the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target cluster graph.
  • the third clustering unit includes:
  • a first clustering subunit, configured to delete the edges whose weight is less than the second preset weight from the target cluster graph to obtain the graph to be clustered, to apply a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and to determine the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one of the abnormal groups; or
  • a second clustering subunit, configured to delete the edges whose weight is less than the second preset weight from the target cluster graph to obtain the graph to be clustered, to divide the graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and to determine the users to be analyzed corresponding to each node set as one of the abnormal groups.
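The edge-pruning and connectivity step described for the clustering subunits can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `abnormal_groups`, the edge list, and the threshold value 0.5 are hypothetical, and a plain depth-first search stands in for whatever connectivity algorithm is actually used. Nodes are users, as in the target cluster graph.

```python
from collections import defaultdict


def abnormal_groups(weighted_edges, min_weight):
    """Drop edges lighter than min_weight, then return each connected component."""
    graph = defaultdict(set)
    for a, b, weight in weighted_edges:
        if weight >= min_weight:  # edges with weight < min_weight are deleted
            graph[a].add(b)
            graph[b].add(a)

    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first search collects one maximal connected subgraph.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(component)
    return groups


# Hypothetical user-to-user weights; the edge u3-u4 falls below the threshold.
edges = [("u1", "u2", 0.9), ("u2", "u3", 0.8), ("u3", "u4", 0.1), ("u4", "u5", 0.7)]
groups = abnormal_groups(edges, min_weight=0.5)
```

In this sketch a user whose edges are all pruned belongs to no group, which is one possible reading of the text rather than a stated requirement.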
  • an abnormal group identification device including:
  • a memory arranged to store computer-executable instructions that, when executed, cause the processor to:
  • the abnormal group among the users to be analyzed is determined.
  • one or more embodiments of the present specification provide a storage medium for storing computer-executable instructions, which when executed, implement the following processes:
  • the abnormal group among the users to be analyzed is determined.
  • the item set mining strategy mines the maximum frequent item sets, obtains the low-frequency maximum frequent feature values in the maximum frequent item sets, constructs the target bipartite graph according to the low-frequency feature values and the low-frequency maximum frequent feature values of each user to be analyzed, and sets the weights of the edges in the target bipartite graph;
  • graph clustering is performed on the target bipartite graph according to the weights of its edges to determine the abnormal groups among the users to be analyzed.
  • the maximum frequent item sets are mined from the high-frequency feature values of the users to be analyzed through the preset frequent itemset mining strategy, and the low-frequency maximum frequent feature values in the maximum frequent item sets are obtained, so as to mine the behavior sequences of the users to be analyzed,
  • the weights of the edges in the target bipartite graph are defined, and graph clustering is performed on the target bipartite graph according to those weights to obtain the abnormal groups.
  • the steps are simple and easy to execute.
  • FIG. 1 is a schematic flowchart of a method for identifying an abnormal group provided by an embodiment of the application;
  • FIG. 2 is a schematic flowchart of determining the high-frequency feature values and the low-frequency feature values among the feature values of the users to be analyzed according to an embodiment of the application;
  • FIG. 3 is a schematic diagram of the first bipartite graph provided by an embodiment of the application;
  • FIG. 4 is a first schematic flowchart of obtaining the low-frequency maximum frequent feature values provided by an embodiment of the application;
  • FIG. 5 is a second schematic flowchart of obtaining the low-frequency maximum frequent feature values provided by an embodiment of the application;
  • FIG. 6 is a schematic diagram of a process for determining an abnormal group provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of the composition of an abnormal group identification device provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of an abnormal group identification device provided by an embodiment of the application.
  • One or more embodiments of the present specification provide a method and device for identifying abnormal groups to solve the problem of low recognition accuracy of abnormal groups.
  • Figure 1 is a schematic flow chart of an abnormal group identification method provided by an embodiment of the application.
  • The execution subject of the method can be, for example, a terminal device or a server.
  • The terminal device can be, for example, a personal computer; the server can be, for example, a standalone server or a server cluster composed of multiple servers. This exemplary embodiment places no particular limitation on this.
  • the method may include the following steps:
  • Step S102: Obtain the feature values of each of a plurality of users to be analyzed.
  • The original personal data of the multiple users to be analyzed may be obtained first and then discretized to obtain the feature values of each user to be analyzed.
  • Obtaining the original personal data of the multiple users to be analyzed includes: obtaining the original personal data of each user to be analyzed through an obtaining module, and aggregating these records to form the original personal data of the multiple users to be analyzed.
  • the original personal data of each user to be analyzed may include basic personal data, behavior data, device data, etc., which is not particularly limited in this exemplary embodiment.
  • the basic personal data may include data with characteristics such as age, gender, occupation, income, education, hometown, contact information, account number, etc.
  • basic personal data may include: female (gender), 18 years old (age), undergraduate (education), lawyer (occupation), Shaanxi (hometown).
  • the behavior data may include data of multiple behavior characteristics.
  • the data of behavior characteristics included in the behavior data may be set according to different application scenarios.
  • the behavioral data may include: 2018.10.03 insurance (insured time), accident insurance (insurance type), 2019.2.1 insurance (insurance characteristics), etc.
  • the device data may include, for example, the device model, the device attribution, the commonly used address of the device used, the frequency of replacing the device, and other characteristic data, which is not particularly limited in this exemplary embodiment.
  • Discretizing the original personal data of the multiple users to be analyzed to obtain the feature values of each user to be analyzed may include: analyzing the data distribution of each feature across the original personal data of the multiple users to be analyzed; binning the data of each feature according to that distribution and a binning method; taking the interval into which each data item falls after binning as the feature value of that data item; and determining the feature values of each user to be analyzed by combining the per-feature feature values with that user's original personal data.
  • the binning method can be determined according to the nature of the feature. For continuous features (such as age, income, transaction amount, etc.), the equal frequency, equal width and other binning methods can be determined according to business experience and data distribution. For categorical features (for example, gender, educational background, occupation, etc.), the data of the type of feature can be binned according to the specific category of the feature. For text-based features (such as addresses, etc.), the texts with consistent patterns can be grouped into one type for binning.
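As an illustration of binning a continuous feature, the sketch below computes equal-width and equal-frequency cut points and maps each raw value to a discrete feature value. The helper names, the age data, the choice of three bins, and the `age_bin0` label format are all hypothetical; the text leaves these to business experience and the data distribution.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points chosen so each bin holds roughly the same count."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def to_feature_value(name, value, edges):
    """Map a raw value to a discrete feature value such as 'age_bin1'."""
    return f"{name}_bin{bisect_right(edges, value)}"


ages = [18, 22, 25, 31, 40, 47, 52, 63]  # hypothetical raw ages
width_edges = equal_width_edges(ages, 3)
freq_edges = equal_frequency_edges(ages, 3)
age_features = [to_feature_value("age", a, width_edges) for a in ages]
```

Categorical features need no cut points; under the same assumptions their category label can be used directly as the feature value.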
  • the user to be analyzed can be marked according to the unique identifier of the user to be analyzed to distinguish the user to be analyzed.
  • the unique identifier can be, for example, an ID card, an officer ID, an account id, etc., which are not particularly limited in this exemplary embodiment.
  • Step S104: Determine the high-frequency feature values and the low-frequency feature values among the feature values of each user to be analyzed.
  • The high-frequency feature values and the low-frequency feature values among the feature values of the users to be analyzed can be determined in either of the following two ways:
  • Method 1: Count the number of times each feature value appears among the feature values of the multiple users to be analyzed, and determine the high-frequency feature values and the low-frequency feature values according to the following determination rule: if the number of occurrences of a feature value satisfies $T2_i \ge X_i > T1_i$, the feature value is a low-frequency feature value, where $X_i$ is the number of occurrences of the i-th feature value among the feature values of the multiple users to be analyzed, $T2_i$ is the second preset number of occurrences corresponding to the i-th feature value, $T1_i$ is the first preset number of occurrences corresponding to the i-th feature value, and $T2_i > T1_i$. The specific values of $T2_i$ and $T1_i$ can be determined according to the feature to which the i-th feature value belongs.
  • The high-frequency feature values and the low-frequency feature values can then be matched against the feature values of each user to be analyzed to obtain the high-frequency feature values and the low-frequency feature values of each user to be analyzed.
  • high-frequency feature values include: A, B, D
  • low-frequency feature values include C, E
  • the feature values of the user to be analyzed include: A, B, C, E
  • the high-frequency feature values of the user to be analyzed include A, B
  • the low-frequency characteristic value of the user to be analyzed includes C and E
  • the characteristic value of the user to be analyzed includes: A, E, F
  • the high-frequency characteristic value of the user to be analyzed includes A
  • Low-frequency characteristic values include E.
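Method 1 and the matching example above can be sketched as follows. Two points are assumptions rather than statements from the text: a single pair of thresholds `t1`, `t2` replaces the per-feature values $T1_i$ and $T2_i$, and a feature value counts as high-frequency when its count exceeds `t2`. The function names and the `hypothetical_users` data are illustrative only.

```python
from collections import Counter


def classify_by_count(user_features, t1, t2):
    """Split feature values into (high, low) sets by how many users carry them."""
    counts = Counter(v for feats in user_features.values() for v in set(feats))
    low = {v for v, x in counts.items() if t2 >= x > t1}
    high = {v for v, x in counts.items() if x > t2}  # assumed high-frequency rule
    return high, low


def split_user_features(features, high, low):
    """Match one user's feature values against the global high/low sets."""
    return set(features) & high, set(features) & low


hypothetical_users = {"u1": ["A", "B", "C"], "u2": ["A", "B"], "u3": ["A", "C"], "u4": ["A"]}
high, low = classify_by_count(hypothetical_users, t1=1, t2=3)

# The example from the text: high = {A, B, D}, low = {C, E}.
first = split_user_features(["A", "B", "C", "E"], {"A", "B", "D"}, {"C", "E"})
second = split_user_features(["A", "E", "F"], {"A", "B", "D"}, {"C", "E"})
```

Feature value F in the second user matches neither set, so it is simply left out of both results.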
  • Method two may include the following steps:
  • Step S202: Construct a first bipartite graph according to the feature values of each user to be analyzed, where the first bipartite graph includes a node corresponding to each user to be analyzed, a node corresponding to each feature value, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values.
  • Each user to be analyzed is converted into a node, with exactly one node per user; each feature value of each user to be analyzed is likewise converted into a node, with exactly one node per feature value. That is, if a node for a feature value already exists during the conversion, that node is reused and no new node is created for the feature value.
  • The node corresponding to each user to be analyzed is located on one side of the first bipartite graph.
  • The node corresponding to each feature value is located on the other side of the first bipartite graph, and an edge is added between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values.
  • The constructed first bipartite graph is shown in FIG. 3, where node 1 corresponding to the first user to be analyzed, node 2 corresponding to the second user to be analyzed, node 3 corresponding to the third user to be analyzed, node 4 corresponding to the fourth user to be analyzed, and node 5 corresponding to the fifth user to be analyzed are located on the left side of FIG. 3;
  • the nodes corresponding to feature values A, B, C, D, E, and F are located on the right side of FIG. 3, and an edge is set between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values.
  • Step S204: Obtain the degree of the node corresponding to each feature value in the first bipartite graph, and determine the high-frequency feature values and the low-frequency feature values among the feature values according to those degrees.
  • the degree of the node corresponding to each feature value refers to the number of edges connected to the node corresponding to the feature value.
  • the degree of the node corresponding to feature value A is 2
  • the degree of the node corresponding to the characteristic value B is 3
  • the degree of the node corresponding to the characteristic value C is 3
  • the degree of the node corresponding to the characteristic value D is 4
  • the degree of the node corresponding to the characteristic value E is 1
  • the degree of the characteristic value F is 4.
  • The process of determining the high-frequency feature values and the low-frequency feature values according to the degree of the node corresponding to each feature value may use the following determination rule: if the degree of the node corresponding to a feature value satisfies $K2_i \ge \mathrm{degree}(V_i) > 1$, the feature value is a low-frequency feature value, where $\mathrm{degree}(V_i)$ is the degree of the node $V_i$ corresponding to the i-th feature value, $K2_i$ is the first preset value corresponding to the i-th feature value, and $K2_i > 1$. The specific value of $K2_i$ can be determined according to the feature to which the i-th feature value belongs, that is, different features correspond to different values of $K2_i$; if the degree of the node corresponding to the feature value satisfies the formula $K1_i$
  • the characteristic value A is a low-frequency characteristic value
  • the characteristic value B and the characteristic value C are high-frequency characteristic values.
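The degree-based rule of Method 2 can be sketched with a plain adjacency map. The user-to-feature assignment below is an assumption chosen to agree with the per-user lists in this example, a single `k2` replaces the per-feature $K2_i$, and treating a degree above `k2` as high-frequency is likewise an assumption because the corresponding rule is not fully given in the text.

```python
from collections import defaultdict


def build_bipartite(user_features):
    """Return {feature value: set of users}; each (value, user) pair is one edge."""
    adjacency = defaultdict(set)
    for user, feats in user_features.items():
        for value in feats:
            adjacency[value].add(user)  # an existing feature node is reused
    return adjacency


def classify_by_degree(adjacency, k2):
    """Classify feature values by the degree of their node in the graph."""
    degree = {value: len(users) for value, users in adjacency.items()}
    low = {v for v, d in degree.items() if k2 >= d > 1}
    high = {v for v, d in degree.items() if d > k2}  # assumed high-frequency rule
    return degree, high, low


# Assumed assignment of feature values to the five users to be analyzed.
users = {1: {"A", "B"}, 2: {"B", "C"}, 3: {"A", "C"}, 4: {"B"}, 5: {"C"}}
degree, high, low = classify_by_degree(build_bipartite(users), k2=2)
```

With `k2=2` this reproduces the stated outcome: A is low-frequency, while B and C are high-frequency.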
  • Step S206: Determine the high-frequency feature values and the low-frequency feature values among the feature values of each user to be analyzed according to the high-frequency feature values and the low-frequency feature values.
  • The high-frequency feature values are matched against the feature values of each user to be analyzed, and each feature value of a user that matches a high-frequency feature value is determined as a high-frequency feature value of that user; the low-frequency feature values are matched against the feature values of each user to be analyzed in the same way, and each feature value of a user that matches a low-frequency feature value is determined as a low-frequency feature value of that user.
  • the characteristic value A is a low-frequency characteristic value
  • the characteristic value B and the characteristic value C are high-frequency characteristic values.
  • the low frequency feature value of the first user to be analyzed includes feature value A
  • the high frequency feature value of the first user to be analyzed includes feature value B
  • the second user to be analyzed does not have low frequency feature values
  • the high-frequency feature values of the second user to be analyzed include feature value B and feature value C
  • the low frequency feature value of the third user to be analyzed includes feature value A
  • the high frequency feature value of the third user to be analyzed includes feature value C
  • the fourth user to be analyzed has no low frequency feature value
  • the high-frequency characteristic value of the fourth user to be analyzed includes the characteristic value B
  • the fifth user to be analyzed has no low-frequency characteristic value
  • the high-frequency characteristic value of the fifth user to be analyzed includes the characteristic value C.
  • Step S106: Mine the maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and the preset frequent itemset mining strategy, and obtain the low-frequency maximum frequent feature values in the maximum frequent item sets.
  • The preset frequent itemset mining strategy may be, for example, the Apriori algorithm (frequent itemset mining for association rules) or FP-Growth.
  • Step S402: Using the FP-Growth method on the high-frequency feature values of each user to be analyzed, mine the frequent multi-item sets whose support meets the preset support, and determine the maximum frequent item sets among the frequent multi-item sets.
  • The support is the number of occurrences of a high-frequency feature value among the multiple users to be analyzed.
  • The specific value of the preset support can be set as needed, for example to 1 or 2; this exemplary embodiment places no particular limitation on this.
  • A frequent multi-item set is a set that includes at least two high-frequency feature values.
  • A frequent multi-item set whose support meets the preset support is one in which the support of each high-frequency feature value is greater than the preset support.
  • The specific process of mining frequent multi-item sets includes: defining the preset support; scanning the high-frequency feature values of each user to be analyzed to obtain the number of occurrences (i.e., the support) of each high-frequency feature value among the multiple users to be analyzed; filtering out the high-frequency feature values whose support is less than the preset support; constructing an FP tree from the remaining high-frequency feature values of the users to be analyzed; and mining the frequent multi-item sets from the FP tree.
  • Each maximum frequent item set includes multiple high-frequency feature values.
  • The high-frequency feature values included in a maximum frequent item set are called maximum frequent feature values; that is, each maximum frequent item set includes multiple maximum frequent feature values.
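To make the notions of frequent multi-item set and maximum frequent item set concrete, the sketch below enumerates candidate sets by brute force instead of building an FP tree, so it is a stand-in for FP-Growth rather than an implementation of it. It assumes the common definition of support for a set (the number of users whose high-frequency feature values contain every item) and treats "meets the preset support" as greater than or equal; the function names and the sample data are hypothetical.

```python
from itertools import combinations


def frequent_multi_item_sets(transactions, min_support):
    """Brute-force every itemset of size >= 2 with support >= min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(2, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            support = sum(1 for t in transactions if set(combo) <= t)
            if support >= min_support:
                frequent[frozenset(combo)] = support
                found = True
        if not found:
            break  # no frequent set of this size, so no larger one exists
    return frequent


def maximal_sets(frequent):
    """Keep only frequent sets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical high-frequency feature values of four users to be analyzed.
transactions = [{"B", "C", "D"}, {"B", "C", "D"}, {"B", "C"}, {"C", "D"}]
frequent = frequent_multi_item_sets(transactions, min_support=2)
maximal = maximal_sets(frequent)
```

Here {B, C}, {B, D}, and {C, D} are all frequent, but only {B, C, D} is kept because each of the others is contained in it. The enumeration is exponential in the number of items, which is why a method such as FP-Growth would be preferred on real data.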
  • Step S404: Match the feature values of each user to be analyzed against the maximum frequent feature values in the maximum frequent item sets to obtain the maximum frequent feature values of each user to be analyzed.
  • The feature values of each user to be analyzed are matched against the maximum frequent feature values in the maximum frequent item sets, and each feature value of a user that matches a maximum frequent feature value is determined as a maximum frequent feature value of that user.
  • Step S406: Determine the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed.
  • The low-frequency maximum frequent feature values can be determined in either of the following two ways:
  • Method 1: According to the maximum frequent feature values of each user to be analyzed, count the number of occurrences of each maximum frequent feature value among the multiple users to be analyzed, and determine the low-frequency maximum frequent feature values with the following determination rule: if the number of occurrences of a maximum frequent feature value among the multiple users to be analyzed satisfies $P2_i \ge S_i$, that maximum frequent feature value is a low-frequency maximum frequent feature value, where $S_i$ is the number of occurrences of the i-th maximum frequent feature value among the multiple users to be analyzed and $P2_i$ is the preset number of occurrences corresponding to the i-th maximum frequent feature value. The specific value of $P2_i$ can be determined according to the feature to which the i-th maximum frequent feature value belongs; that is, different features correspond to different values of $P2_i$.
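The counting rule of Method 1 can be sketched as below. The per-value thresholds in `p2`, the function name, and the user data are hypothetical; a maximum frequent feature value without a configured threshold is treated here as not low-frequency, which is an assumption of this sketch.

```python
from collections import Counter


def low_frequency_max_frequent(user_max_features, p2):
    """Return the maximum frequent feature values whose count S_i satisfies P2_i >= S_i."""
    counts = Counter(v for feats in user_max_features.values() for v in set(feats))
    return {v for v, s in counts.items() if v in p2 and p2[v] >= s}


# Hypothetical maximum frequent feature values per user and thresholds P2_i.
user_max_features = {"u1": {"B", "C", "D"}, "u2": {"B", "C", "D"}, "u3": {"C"}}
p2 = {"B": 2, "C": 2, "D": 2}
low_max = low_frequency_max_frequent(user_max_features, p2)
```

C appears for three users and exceeds its threshold of 2, so only B and D are returned.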
  • Method 2, as shown in FIG. 5, may include the following steps:
  • Step S502: Construct a second bipartite graph according to the maximum frequent feature values of each user to be analyzed, where the second bipartite graph includes a node corresponding to each user to be analyzed, a node corresponding to each maximum frequent feature value, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values.
  • Specifically, each user to be analyzed is converted into a node, with each user corresponding to only one node, and each maximum frequent feature value is converted into a node, with each maximum frequent feature value corresponding to only one node. The nodes corresponding to the users to be analyzed are located on one side of the second bipartite graph, and the nodes corresponding to the maximum frequent feature values are located on the other side. An edge is added between the node corresponding to each user to be analyzed and the node corresponding to each of that user's maximum frequent feature values to complete the construction of the second bipartite graph.
  • Step S504 Obtain the degree of the node corresponding to each maximum frequent feature value in the second bipartite graph, and determine the low-frequency maximum frequent feature values from the maximum frequent feature values according to the degree of the node corresponding to each maximum frequent feature value.
  • the degree of the node corresponding to the maximum frequent feature value is the number of edges connected to the node corresponding to the maximum frequent feature value in the bipartite graph.
  • The process of determining the low-frequency maximum frequent feature values may include: determining the low-frequency maximum frequent feature values according to the degree of the node corresponding to each maximum frequent feature value in combination with the following determination rule: if the degree of the node corresponding to a maximum frequent feature value satisfies the formula L2_i ≥ degree(V_i), that maximum frequent feature value is a low-frequency maximum frequent feature value. Here degree(V_i) is the degree of the node V_i corresponding to the i-th maximum frequent feature value, and L2_i is the preset degree corresponding to the i-th maximum frequent feature value. The specific value of L2_i can be determined according to the feature to which the i-th maximum frequent feature value belongs; that is, different features have different values of L2_i.
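  • Steps S502 and S504 can be illustrated with a small adjacency-map sketch. The function names and the `preset_degree` callback are hypothetical, and the non-strict comparison is an assumption. Because each user contributes at most one edge to a value node, the degree of a value node equals the number of users that carry the value.

```python
from collections import defaultdict

def build_bipartite(user_values):
    """Map each feature-value node to the set of user nodes it is connected to."""
    value_to_users = defaultdict(set)
    for user, values in user_values.items():
        for value in values:
            value_to_users[value].add(user)  # one edge per (user, value) pair
    return value_to_users

def low_frequency_by_degree(value_to_users, preset_degree):
    """Keep the values whose node degree does not exceed the preset degree L2_i."""
    return {
        value
        for value, connected_users in value_to_users.items()
        # Assumption: non-strict comparison; preset_degree is a hypothetical lookup.
        if len(connected_users) <= preset_degree(value)
    }

graph = build_bipartite({"u1": {"A", "B"}, "u2": {"A", "B"}, "u3": {"A"}})
low_by_degree = low_frequency_by_degree(graph, preset_degree=lambda value: 2)
```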
  • Step S108 Construct a target bipartite graph according to the low-frequency maximum frequent feature value and the low-frequency feature value among the feature values of each user to be analyzed, and define the weights of edges in the target bipartite graph.
  • The low-frequency maximum frequent feature values are matched against the feature values of each user to be analyzed, and each feature value of a user that matches successfully is determined as a low-frequency maximum frequent feature value of that user.
  • The process of constructing the target bipartite graph according to the low-frequency maximum frequent feature values of each user to be analyzed and the low-frequency feature values of each user to be analyzed obtained in step S104 may include: converting each user to be analyzed into a node, converting each low-frequency feature value into a node, converting each low-frequency maximum frequent feature value into a node, adding edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's low-frequency feature values, and adding edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's low-frequency maximum frequent feature values, to complete the construction of the target bipartite graph.
  • Defining the weights of the edges in the target bipartite graph may include: defining the weights of the edges between the node corresponding to each user to be analyzed and the nodes corresponding to the low-frequency feature values, and defining the weights of the edges between the node corresponding to each user to be analyzed and the nodes corresponding to the low-frequency maximum frequent feature values.
  • Defining the weight of the edge between the node corresponding to each user to be analyzed in the target bipartite graph and the node corresponding to a low-frequency feature value may include: determining the weight of each low-frequency feature value according to the feature to which it belongs. Specifically, the higher the weight of a low-frequency feature value, the higher the probability that a user to be analyzed who has that low-frequency feature value belongs to an abnormal group; the lower the weight, the lower that probability.
  • the weights of the edges connected to the nodes corresponding to each low-frequency feature value are all set as the weights of the corresponding low-frequency feature values. For example, if the low-frequency feature values include frequent risks (risk features) and unemployed (occupation features), and the weight of frequent risks is 0.5, and the weight of unemployed is 0.1, then the weights of edges connected to nodes corresponding to frequent risks are set to 0.5, the weights of edges connected to nodes corresponding to unemployed are all set to 0.1.
  • Defining the weight of the edge between the node corresponding to each user to be analyzed in the target bipartite graph and the node corresponding to a low-frequency maximum frequent feature value may include: determining the weight of each low-frequency maximum frequent feature value according to the feature to which it belongs. Specifically, the higher the weight of a low-frequency maximum frequent feature value, the higher the probability that a user to be analyzed who has that value belongs to an abnormal group; the lower the weight, the lower that probability.
  • the weight of the edge connected to the node corresponding to each low-frequency maximum frequent feature value is set as the weight of each corresponding low-frequency maximum frequent feature value.
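  • The construction and weighting in step S108 can be sketched as an edge map. The function name, the node label `combo-1` and its weight 0.8 are hypothetical; the weights 0.5 and 0.1 reuse the example above. The sketch assumes that low-frequency feature values and low-frequency maximum frequent feature values have distinct labels.

```python
def build_target_bipartite(user_low_values, user_low_max_values, value_weights):
    """Return {(user, value): weight}; each edge takes the weight of its value node."""
    edges = {}
    for source in (user_low_values, user_low_max_values):
        for user, values in source.items():
            for value in values:
                # Every edge connected to a value node carries that value's weight.
                edges[(user, value)] = value_weights[value]
    return edges

user_low = {"u1": {"frequent risks"}, "u2": {"frequent risks", "unemployed"}}
user_low_max = {"u1": {"combo-1"}, "u2": {"combo-1"}}  # hypothetical labels
weights = {"frequent risks": 0.5, "unemployed": 0.1, "combo-1": 0.8}
target_edges = build_target_bipartite(user_low, user_low_max, weights)
```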
  • Step S110 according to the weights of the edges in the target bipartite graph and the clustering results of multiple users to be analyzed obtained by graph clustering on the target bipartite graph, determine an abnormal group of users to be analyzed.
  • the abnormal group of users to be analyzed can be determined in the following two ways, among which:
  • Method 1: Delete the edges whose weight is less than the first preset weight from the target bipartite graph to obtain the bipartite graph to be clustered, apply a connectivity algorithm to the bipartite graph to be clustered to obtain at least one largest connected subgraph, and determine the users to be analyzed corresponding to the nodes in each largest connected subgraph as one abnormal group.
  • The specific value of the first preset weight can be set as needed and is not particularly limited in this exemplary embodiment. The weight of each edge in the target bipartite graph is compared with the first preset weight in turn. If the weight of an edge is less than the first preset weight, the edge is deleted from the target bipartite graph; otherwise the edge is retained. The target bipartite graph from which the edges with a weight less than the first preset weight have been deleted is determined as the bipartite graph to be clustered.
  • The connectivity algorithm is applied to the bipartite graph to be clustered to obtain at least one largest connected subgraph. In each largest connected subgraph, the nodes corresponding to low-frequency feature values and the nodes corresponding to low-frequency maximum frequent feature values are filtered out, and the users to be analyzed corresponding to the remaining nodes are collected to obtain the set of users to be analyzed corresponding to that largest connected subgraph. The set of users to be analyzed corresponding to each largest connected subgraph is determined as one abnormal group.
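  • A minimal sketch of Method 1, assuming the edge-map representation `{(user, value): weight}` and using a breadth-first search to find connected components. The function name `abnormal_groups` and the sample data are hypothetical. One simplification: a user whose edges are all deleted does not appear in any group here, whereas in the graph it would remain as an isolated node.

```python
from collections import defaultdict, deque

def abnormal_groups(edges, min_weight):
    """Drop light edges, then return the user set of each connected component."""
    adjacency = defaultdict(set)
    for (user, value), weight in edges.items():
        if weight < min_weight:
            continue  # delete edges below the first preset weight
        user_node, value_node = ("user", user), ("value", value)
        adjacency[user_node].add(value_node)
        adjacency[value_node].add(user_node)

    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:  # breadth-first traversal of one connected component
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        # Filter out feature-value nodes and keep only the users.
        groups.append({name for kind, name in component if kind == "user"})
    return groups

sample_edges = {("u1", "x"): 0.5, ("u2", "x"): 0.5, ("u3", "y"): 0.1, ("u4", "z"): 0.6}
groups = abnormal_groups(sample_edges, min_weight=0.3)
```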
  • Method 2: Delete the edges whose weight is less than the first preset weight from the target bipartite graph to obtain the bipartite graph to be clustered, divide the nodes in the bipartite graph to be clustered by a community discovery algorithm to obtain multiple node sets, and determine the users to be analyzed corresponding to the nodes in each node set as one abnormal group.
  • the principle of obtaining the bipartite graph to be clustered is the same as the principle in the above-mentioned method 1, so it will not be repeated here.
  • The community discovery algorithm may be, for example, the Louvain algorithm, which is not particularly limited in this exemplary embodiment.
  • After the nodes in the bipartite graph to be clustered are divided by the community discovery algorithm into multiple node sets, the nodes corresponding to low-frequency feature values and the nodes corresponding to low-frequency maximum frequent feature values are first filtered out of each node set. The users to be analyzed corresponding to the remaining nodes in each node set are then collected to obtain the set of users to be analyzed corresponding to that node set, and the set of users to be analyzed corresponding to each node set is determined as one abnormal group.
  • After the abnormal groups are obtained, they can be verified. For example, the total number of users to be analyzed in each abnormal group can be obtained, the abnormal groups whose total number of users to be analyzed is less than a preset number can be screened out, and the remaining abnormal groups are determined as the finally identified abnormal groups. Alternatively, the modularity of the largest connected subgraph corresponding to each abnormal group can be calculated and taken as the modularity of that abnormal group, the abnormal groups whose modularity is less than a preset modularity can be screened out, and the remaining abnormal groups are determined as the finally identified abnormal groups.
  • The above two verification methods are only exemplary and are not intended to limit the present invention. The abnormal groups can also be verified by analyzing the business characteristics of each user to be analyzed in the abnormal group.
  • determining the abnormal group among users to be analyzed may include the following steps:
  • Step S602 Calculate the weight between any two users to be analyzed according to the weight of the edge in the target bipartite graph.
  • In the target bipartite graph, the nodes corresponding to low-frequency feature values and the nodes corresponding to low-frequency maximum frequent feature values that are connected to both of the nodes corresponding to any two users to be analyzed are obtained and determined as target nodes. The weight between the two users to be analyzed is then calculated from the weights of the edges between the nodes corresponding to the two users and each target node, in combination with the following formula.
  • the above formula is:
  • where weight(e) is the weight between any two users to be analyzed, j is the total number of target nodes, and w(item_i) is the weight of the edge between the i-th target node and the nodes corresponding to the two users to be analyzed.
  • Step S604 Convert each user to be analyzed into a node, set an edge between any two nodes, and set the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct a target cluster graph.
  • Each user to be analyzed is converted into a node, that is, one user to be analyzed corresponds to one node. An edge is set between any two nodes, and the weight between any two users to be analyzed is set as the weight of the edge between the two nodes corresponding to those users, which completes the construction of the target cluster graph.
  • In this way, the target bipartite graph, which includes the nodes corresponding to the users to be analyzed, the nodes corresponding to low-frequency feature values and the nodes corresponding to low-frequency maximum frequent feature values, is transformed into a target cluster graph that includes only the nodes corresponding to the users to be analyzed.
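  • Steps S602 and S604 amount to projecting the bipartite graph onto the user nodes. The sketch below is illustrative only: the function name is hypothetical, and because the formula itself is not reproduced in this text, the weight between two users is assumed to be the sum of w(item_i) over their shared target nodes. Pairs of users with no shared target node are simply omitted instead of being stored as zero-weight edges.

```python
from collections import defaultdict
from itertools import combinations

def project_to_user_graph(edges):
    """Turn {(user, value): weight} into {(user_a, user_b): weight between users}."""
    value_to_users = defaultdict(set)
    weight_of = {}
    for (user, value), weight in edges.items():
        value_to_users[value].add(user)
        weight_of[value] = weight  # all edges of a value node share one weight

    user_edges = defaultdict(float)
    for value, connected_users in value_to_users.items():
        # Each value node is a target node for every pair of users it connects.
        for user_a, user_b in combinations(sorted(connected_users), 2):
            user_edges[(user_a, user_b)] += weight_of[value]  # assumed: summation
    return dict(user_edges)

bipartite_edges = {
    ("u1", "x"): 0.5, ("u2", "x"): 0.5,
    ("u1", "y"): 0.1, ("u2", "y"): 0.1, ("u3", "y"): 0.1,
}
cluster_graph = project_to_user_graph(bipartite_edges)
```

  • The resulting edge map can then be thresholded with the second preset weight and clustered in the same way as the bipartite graph.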
  • Step S606 Determine an abnormal group of users to be analyzed through clustering results of multiple users to be analyzed obtained by performing graph clustering on the target cluster graph.
  • the abnormal population can be determined in the following two ways, among which:
  • Method 1: Delete the edges whose weight is less than the second preset weight from the target cluster graph to obtain the graph to be clustered, apply a connectivity algorithm to the graph to be clustered to obtain at least one largest connected subgraph, and determine the users to be analyzed corresponding to the nodes in each largest connected subgraph as one abnormal group.
  • The specific value of the second preset weight can be set as needed and is not particularly limited in this exemplary embodiment. The weight of each edge in the target cluster graph is compared with the second preset weight, and the edges with a weight less than the second preset weight are deleted from the target cluster graph, so as to convert the target cluster graph into the graph to be clustered. The users to be analyzed corresponding to the nodes in each largest connected subgraph are collected to obtain the set of users to be analyzed corresponding to that largest connected subgraph, and the set of users to be analyzed corresponding to each largest connected subgraph is determined as one abnormal group.
  • Method 2: Delete the edges whose weight is less than the second preset weight from the target cluster graph to obtain the graph to be clustered, divide the graph to be clustered by a community discovery algorithm to obtain multiple node sets, and determine the users to be analyzed corresponding to each node set as one abnormal group.
  • The second preset weight has been described above and is not repeated here. The weight of each edge in the target cluster graph is compared with the second preset weight, and the edges with a weight less than the second preset weight are deleted from the target cluster graph, so as to convert the target cluster graph into the graph to be clustered. The community discovery algorithm may be, for example, the Louvain algorithm, which is not particularly limited in this exemplary embodiment.
  • The users to be analyzed corresponding to the nodes in each node set are collected to obtain the set of users to be analyzed corresponding to that node set, and the set of users to be analyzed corresponding to each node set is determined as one abnormal group.
  • The weight between any two users to be analyzed is calculated according to the weights of the edges in the target bipartite graph, and the target cluster graph is constructed according to the weight between any two users to be analyzed, so that the target bipartite graph is converted into a target cluster graph. The target cluster graph reflects the relationships between the users to be analyzed more accurately and more intuitively, which in turn makes the abnormal groups obtained from the target cluster graph more accurate.
  • After the abnormal groups are obtained, they can be verified. For example, the total number of users to be analyzed in each abnormal group can be obtained, the abnormal groups whose total number of users to be analyzed is less than a preset number can be screened out, and the remaining abnormal groups are determined as the finally identified abnormal groups. Alternatively, the modularity of the largest connected subgraph corresponding to each abnormal group can be calculated and taken as the modularity of that abnormal group, the abnormal groups whose modularity is less than a preset modularity can be screened out, and the remaining abnormal groups are determined as the finally identified abnormal groups.
  • The above two verification methods are only exemplary and are not intended to limit the present invention. The abnormal groups can also be verified by analyzing the business characteristics of each user to be analyzed in the abnormal group.
  • The maximum frequent item set is mined from the high-frequency feature values of the users to be analyzed through the preset frequent itemset mining strategy, and the low-frequency maximum frequent feature values in the maximum frequent item set are obtained, so as to mine the users' behavior sequences, which in turn makes the identification of abnormal groups more accurate. In addition, it is only necessary to obtain the low-frequency feature values and low-frequency maximum frequent feature values of each user to be analyzed, construct the target bipartite graph from them, define the weights of the edges in the target bipartite graph, and perform graph clustering on the target bipartite graph according to those weights to obtain the abnormal groups; the steps are simple and easy to execute.
  • Corresponding to the abnormal group identification method described above, an embodiment of the present application also provides an abnormal group identification device. As shown in FIG. 7, the device 700 may include: an acquisition module 701, a determination module 702, a mining module 703, a construction module 704, and a clustering module 705, wherein:
  • the obtaining module 701 is configured to obtain the characteristic value of each user to be analyzed among a plurality of users to be analyzed;
  • the determining module 702 is configured to determine the high-frequency characteristic value and the low-frequency characteristic value among the characteristic values of the users to be analyzed;
  • the mining module 703 is configured to mine the maximum frequent item set according to the high-frequency feature value of each user to be analyzed and the preset frequent itemset mining strategy, and obtain the low-frequency maximum frequent feature value in the maximum frequent item set;
  • the construction module 704 is configured to construct a target bipartite graph according to the low-frequency maximum frequent feature value and the low-frequency feature value in the feature values of each user to be analyzed, and to define the weights of edges in the target bipartite graph ;
  • the clustering module 705 is configured to determine the abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target bipartite graph.
  • the obtaining module 701 may include:
  • the discretization unit is used to discretize the original personal data of the multiple users to be analyzed to obtain the characteristic value of each user to be analyzed.
  • the determining module 702 may include:
  • the first construction unit is configured to construct a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes a node corresponding to each user to be analyzed, a node corresponding to each feature value, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
  • the first determining unit is configured to obtain the degree of the node corresponding to each feature value in the first bipartite graph, and determine the high-frequency feature values and low-frequency feature values among the feature values according to the degree of the node corresponding to each feature value;
  • the second determining unit is configured to determine the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed according to the high-frequency feature values and the low-frequency feature values.
  • the mining module 703 may include:
  • the mining unit is used to mine, from the high-frequency feature values of each user to be analyzed and in combination with the FP-Growth method, the frequent multi-item sets whose support meets the preset support, and to determine the maximum frequent item set among the frequent multi-item sets;
  • a matching unit configured to match the characteristic value of each user to be analyzed with the maximum frequent characteristic value in the maximum frequent item set to obtain the maximum frequent characteristic value of each user to be analyzed;
  • the third determining unit is configured to determine the maximum frequent feature value of low frequency among the maximum frequent feature values of the users to be analyzed.
  • the third determining unit may include:
  • the construction subunit is used to construct a second bipartite graph according to the maximum frequent feature value of each user to be analyzed, wherein the second bipartite graph includes a node corresponding to each user to be analyzed, and each The node corresponding to the maximum frequent feature value, and the edge between the node corresponding to each user to be analyzed and the node corresponding to the maximum frequent feature value;
  • the determining subunit is used to obtain the degree of the node corresponding to each maximum frequent feature value in the second bipartite graph, and to determine the low-frequency maximum frequent feature values from the maximum frequent feature values according to the degree of the node corresponding to each maximum frequent feature value.
  • the clustering module 705 may include:
  • the first clustering unit is used to delete the edges whose weight is less than the first preset weight from the target bipartite graph to obtain the bipartite graph to be clustered, to apply a connectivity algorithm to the bipartite graph to be clustered to obtain at least one largest connected subgraph, and to determine the users to be analyzed corresponding to the nodes in each largest connected subgraph as one of the abnormal groups; or
  • the second clustering unit is used to delete the edges whose weight is less than the first preset weight from the target bipartite graph to obtain the bipartite graph to be clustered, to divide the nodes in the bipartite graph to be clustered by a community discovery algorithm to obtain multiple node sets, and to determine the users to be analyzed corresponding to the nodes in each node set as one of the abnormal groups.
  • the clustering module 705 may include:
  • a calculation unit configured to calculate the weight between any two users to be analyzed according to the weights of the edges in the target bipartite graph;
  • the second construction unit is used to convert each user to be analyzed into a node, set an edge between any two nodes, and set the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct the target cluster graph;
  • the third clustering unit is configured to determine an abnormal group among the users to be analyzed through the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target cluster graph.
  • the third clustering unit may include:
  • the first clustering subunit is used to delete the edges whose weight is less than the second preset weight from the target cluster graph to obtain the graph to be clustered, to apply a connectivity algorithm to the graph to be clustered to obtain at least one largest connected subgraph, and to determine the users to be analyzed corresponding to the nodes in each largest connected subgraph as one of the abnormal groups; or
  • the second clustering subunit is used to delete edges with a weight less than a second preset weight in the target cluster graph to obtain a graph to be clustered, and to divide the graph to be clustered by a community discovery algorithm, To obtain a plurality of node sets, and to determine the users to be analyzed corresponding to each of the node sets as one of the abnormal groups.
  • The abnormal group identification device in the embodiment of the present application mines the maximum frequent item set from the high-frequency feature values of each user to be analyzed through a preset frequent itemset mining strategy and obtains the low-frequency maximum frequent feature values in the maximum frequent item set, which makes the identification of abnormal groups more accurate. In addition, it is only necessary to obtain the low-frequency feature values and low-frequency maximum frequent feature values of each user to be analyzed, construct the target bipartite graph from them, define the weights of the edges in the target bipartite graph, and perform graph clustering on the target bipartite graph according to those weights to obtain the abnormal groups; the steps are simple and easy to implement.
  • FIG. 8 is a schematic structural diagram of the abnormal group identification device provided by the embodiment of the present application, and the device is used to perform the abnormal group identification method described above.
  • The abnormal group identification device may vary considerably with configuration or performance, and may include one or more processors 801 and a memory 802; the memory 802 may store one or more application programs or data. The memory 802 may be short-term storage or persistent storage.
  • the application program stored in the memory 802 may include one or more modules (not shown in the figure), and each module may include a series of computer-executable instructions in the device for identifying abnormal groups.
  • the processor 801 may be configured to communicate with the memory 802, and execute a series of computer executable instructions in the memory 802 on the abnormal group identification device.
  • the abnormal group identification device may also include one or more power supplies 803, one or more wired or wireless network interfaces 804, one or more input and output interfaces 805, one or more keyboards 806, and so on.
  • the abnormal group identification device includes a memory, and one or more programs, one or more programs are stored in the memory, and one or more programs may include one or more modules, and Each module may include a series of computer-executable instructions in the device for identifying abnormal groups, and the one or more programs configured to be executed by one or more processors include the following computer-executable instructions:
  • determining the abnormal group among the users to be analyzed.
  • the acquiring the characteristic value of each of the plurality of users to be analyzed includes:
  • the determining the high-frequency characteristic value and the low-frequency characteristic value of the characteristic values of the users to be analyzed includes:
  • constructing a first bipartite graph according to the feature values of the users to be analyzed, wherein the first bipartite graph includes a node corresponding to each user to be analyzed, a node corresponding to each feature value, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
  • the high-frequency characteristic value and the low-frequency characteristic value of the characteristic values of each user to be analyzed are determined according to the high-frequency characteristic value and the low-frequency characteristic value.
  • mining the maximum frequent item set according to the high-frequency feature values of the users to be analyzed and a preset frequent itemset mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item set, includes:
  • the determining the low-frequency maximum frequent feature value from the maximum frequent feature value of the user to be analyzed includes:
  • the second bipartite graph includes a node corresponding to each of the users to be analyzed, a node corresponding to each of the maximum frequent feature values, and edges between the node corresponding to each of the users to be analyzed and the nodes corresponding to that user's maximum frequent feature values;
  • determining the abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
  • deleting the edges whose weight is less than the first preset weight from the target bipartite graph to obtain the bipartite graph to be clustered, dividing the nodes in the bipartite graph to be clustered by a community discovery algorithm to obtain a plurality of node sets, and determining the users to be analyzed corresponding to the nodes in each node set as one abnormal group.
  • determining the abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
  • determining the abnormal group among the users to be analyzed through the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target cluster graph includes:
  • deleting the edges whose weight is less than the second preset weight from the target cluster graph to obtain the graph to be clustered, applying a connectivity algorithm to the graph to be clustered to obtain at least one largest connected subgraph, and determining the users to be analyzed corresponding to the nodes in each largest connected subgraph as one of the abnormal groups; or
  • The abnormal group identification device in the embodiment of the present application mines the maximum frequent item set from the high-frequency feature values of each user to be analyzed through a preset frequent itemset mining strategy and obtains the low-frequency maximum frequent feature values in the maximum frequent item set, which makes the identification of abnormal groups more accurate. In addition, it is only necessary to obtain the low-frequency feature values and low-frequency maximum frequent feature values of each user to be analyzed, construct the target bipartite graph from them, define the weights of the edges in the target bipartite graph, and perform graph clustering on the target bipartite graph according to those weights to obtain the abnormal groups; the steps are simple and easy to implement.
  • the embodiment of the present application also provides a storage medium for storing computer-executable instructions.
  • the storage medium may be a U disk or an optical disk.
  • the computer executable instructions stored in the storage medium can realize the following processes:
  • the abnormal group among the users to be analyzed is determined.
  • obtaining the feature values of each of the plurality of users to be analyzed includes:
  • determining the high-frequency feature values and the low-frequency feature values among the feature values of the users to be analyzed includes:
  • the first bipartite graph is constructed according to the feature values of the users to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
  • the high-frequency feature values and the low-frequency feature values among the feature values of each user to be analyzed are determined according to the high-frequency feature values and the low-frequency feature values.
  • mining the maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item sets, includes:
  • determining the low-frequency maximum frequent feature values from the maximum frequent feature values of the users to be analyzed includes:
  • the second bipartite graph includes nodes corresponding to each of the users to be analyzed, nodes corresponding to each of the maximum frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values;
  • determining the abnormal group among the users to be analyzed according to the weights of edges in the target bipartite graph and the clustering results of the multiple users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
  • the maximum frequent item sets are mined from the high-frequency feature values of each user to be analyzed using a preset frequent item set mining strategy, and the low-frequency maximum frequent feature values in the maximum frequent item sets are obtained, so as to mine the behavior sequences of the users to be analyzed and thereby make the identification of abnormal groups more accurate; in addition, the abnormal groups are obtained simply by obtaining the low-frequency feature values and low-frequency maximum frequent feature values of each user to be analyzed, constructing the target bipartite graph from them, defining the weights of the edges in the target bipartite graph, and performing graph clustering on the target bipartite graph according to those weights, so the steps are simple and easy to execute.
  • Abbreviations: PLD (Programmable Logic Device); FPGA (Field Programmable Gate Array); HDL (Hardware Description Language)
  • the controller can be implemented in any suitable manner.
  • the controller can take the form of, for example, a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, application specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include but are not limited to the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller can also be implemented as part of the control logic of the memory.
  • the method steps can be logically programmed so that the controller realizes the same function in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, or the like. Therefore, such a controller can be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. The devices for realizing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.
  • the system, device, module or unit explained in the above embodiments may be specifically implemented by a computer chip or entity, or by a product with a certain function.
  • a typical implementation device is a computer.
  • the computer can be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or Any combination of these devices.
  • the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
  • Computer-readable media including permanent and non-permanent, removable and non-removable media, can store information by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by computing devices.
  • computer-readable media does not include temporary computer-readable media (transitory media), such as modulated data signals and carrier waves.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present application can also be practiced in distributed computing environments. In these distributed computing environments, remote processing devices connected through a communication network perform tasks.
  • program modules can be located in local and remote computer storage media including storage devices.

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

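A method and device for identifying abnormal groups. The method includes: obtaining the feature values of each of a plurality of users to be analyzed (S102); determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed (S104); mining maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item sets (S106); constructing a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and defining the weights of the edges in the target bipartite graph (S108); and determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph (S110). The method improves the accuracy of abnormal group identification, and its steps are simple and easy to execute.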

Description

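Method and device for identifying abnormal groups
Technical Field
This specification relates to the field of computer technology, and in particular to a method and device for identifying abnormal groups.
Background
At present, in various risk-control scenarios (such as spam registration, marketing fraud, theft of cards and accounts, and insurance fraud), offences are increasingly committed by organized groups. This seriously disrupts normal business order and causes huge losses to merchants. How to identify such gangs (that is, abnormal groups) has therefore become one of the important problems merchants face in their operations.
In commonly used ways of identifying abnormal groups, the lack of labeled samples and the variability of the ways in which abnormal groups operate lead to low identification accuracy.
Summary of the Invention
The purpose of one or more embodiments of this specification is to provide a method and device for identifying abnormal groups, so as to solve the problem of low accuracy in identifying abnormal groups.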
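To solve the above technical problem, one or more embodiments of this specification are implemented as follows:
In one aspect, one or more embodiments of this specification provide a method for identifying abnormal groups, including:
obtaining the feature values of each of a plurality of users to be analyzed;
determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
mining maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item sets;
constructing a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and defining the weights of the edges in the target bipartite graph;
determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.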
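Optionally, obtaining the feature values of each of the plurality of users to be analyzed includes:
obtaining the original personal data of the plurality of users to be analyzed;
discretizing the original personal data of the plurality of users to be analyzed to obtain the feature values of each user to be analyzed.
Optionally, determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed includes:
constructing a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
obtaining, in the first bipartite graph, the degree of the node corresponding to each feature value, and determining the high-frequency feature values and low-frequency feature values among the feature values according to those degrees;
determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed according to the high-frequency feature values and the low-frequency feature values.
Optionally, mining maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item sets, includes:
mining, according to the high-frequency feature values of each user to be analyzed and in combination with the FP-Growth method, frequent multi-item sets whose support satisfies a preset support, and determining the maximum frequent item sets among the frequent multi-item sets;
matching the feature values of each user to be analyzed against the maximum frequent feature values in the maximum frequent item sets to obtain the maximum frequent feature values of each user to be analyzed;
determining the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed.
Optionally, determining the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed includes:
constructing a second bipartite graph according to the maximum frequent feature values of each user to be analyzed, wherein the second bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the maximum frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values;
obtaining, in the second bipartite graph, the degree of the node corresponding to each maximum frequent feature value, and determining the low-frequency maximum frequent feature values among the maximum frequent feature values according to those degrees.
Optionally, determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
deleting, from the target bipartite graph, edges whose weight is less than a first preset weight to obtain a bipartite graph to be clustered, applying a connectivity algorithm to the bipartite graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group; or
deleting, from the target bipartite graph, edges whose weight is less than the first preset weight to obtain a bipartite graph to be clustered, dividing the nodes in the bipartite graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and determining the users to be analyzed corresponding to the nodes in each node set as one abnormal group.
Optionally, determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
calculating the weight between any two users to be analyzed according to the weights of the edges in the target bipartite graph;
converting each user to be analyzed into a node, placing an edge between any two nodes, and setting the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct a target clustering graph;
determining the abnormal groups among the users to be analyzed from the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph.
Optionally, determining the abnormal groups among the users to be analyzed from the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph includes:
deleting, from the target clustering graph, edges whose weight is less than a second preset weight to obtain a graph to be clustered, applying a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group; or
deleting, from the target clustering graph, edges whose weight is less than the second preset weight to obtain a graph to be clustered, dividing the graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and determining the users to be analyzed corresponding to each node set as one abnormal group.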
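In another aspect, one or more embodiments of this specification provide a device for identifying abnormal groups, including:
an acquisition module, configured to obtain the feature values of each of a plurality of users to be analyzed;
a determination module, configured to determine the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
a mining module, configured to mine maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and to obtain the low-frequency maximum frequent feature values in the maximum frequent item sets;
a construction module, configured to construct a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and to define the weights of the edges in the target bipartite graph;
a clustering module, configured to determine the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.
Optionally, the acquisition module includes:
an acquisition unit, configured to obtain the original personal data of the plurality of users to be analyzed;
a discretization unit, configured to discretize the original personal data of the plurality of users to be analyzed to obtain the feature values of each user to be analyzed.
Optionally, the determination module includes:
a first construction unit, configured to construct a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
a first determination unit, configured to obtain, in the first bipartite graph, the degree of the node corresponding to each feature value, and to determine the high-frequency feature values and low-frequency feature values among the feature values according to those degrees;
a second determination unit, configured to determine the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed according to the high-frequency feature values and the low-frequency feature values.
Optionally, the mining module includes:
a mining unit, configured to mine, according to the high-frequency feature values of each user to be analyzed and in combination with the FP-Growth method, frequent multi-item sets whose support satisfies a preset support, and to determine the maximum frequent item sets among the frequent multi-item sets;
a matching unit, configured to match the feature values of each user to be analyzed against the maximum frequent feature values in the maximum frequent item sets to obtain the maximum frequent feature values of each user to be analyzed;
a third determination unit, configured to determine the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed.
Optionally, the third determination unit includes:
a construction subunit, configured to construct a second bipartite graph according to the maximum frequent feature values of each user to be analyzed, wherein the second bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the maximum frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values;
a determination subunit, configured to obtain, in the second bipartite graph, the degree of the node corresponding to each maximum frequent feature value, and to determine the low-frequency maximum frequent feature values among the maximum frequent feature values according to those degrees.
Optionally, the clustering module includes:
a first clustering unit, configured to delete, from the target bipartite graph, edges whose weight is less than a first preset weight to obtain a bipartite graph to be clustered, to apply a connectivity algorithm to the bipartite graph to be clustered to obtain at least one maximal connected subgraph, and to determine the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group; or
a second clustering unit, configured to delete, from the target bipartite graph, edges whose weight is less than the first preset weight to obtain a bipartite graph to be clustered, to divide the nodes in the bipartite graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and to determine the users to be analyzed corresponding to the nodes in each node set as one abnormal group.
Optionally, the clustering module includes:
a calculation unit, configured to calculate the weight between any two users to be analyzed according to the weights of the edges in the target bipartite graph;
a second construction unit, configured to convert each user to be analyzed into a node, place an edge between any two nodes, and set the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct a target clustering graph;
a third clustering unit, configured to determine the abnormal groups among the users to be analyzed from the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph.
Optionally, the third clustering unit includes:
a first clustering subunit, configured to delete, from the target clustering graph, edges whose weight is less than a second preset weight to obtain a graph to be clustered, to apply a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and to determine the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group; or
a second clustering subunit, configured to delete, from the target clustering graph, edges whose weight is less than the second preset weight to obtain a graph to be clustered, to divide the graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and to determine the users to be analyzed corresponding to each node set as one abnormal group.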
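In yet another aspect, one or more embodiments of this specification provide abnormal group identification equipment, including:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain the feature values of each of a plurality of users to be analyzed;
determine the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
mine maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtain the low-frequency maximum frequent feature values in the maximum frequent item sets;
construct a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and define the weights of the edges in the target bipartite graph;
determine the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.
In yet another aspect, one or more embodiments of this specification provide a storage medium for storing computer-executable instructions which, when executed, implement the following process:
obtaining the feature values of each of a plurality of users to be analyzed;
determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
mining maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item sets;
constructing a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and defining the weights of the edges in the target bipartite graph;
determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.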
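With the technical solutions of one or more embodiments of this specification, the high-frequency and low-frequency feature values among the feature values of each user to be analyzed are determined; maximum frequent item sets are mined from the high-frequency feature values of each user using a preset frequent item set mining strategy, and the low-frequency maximum frequent feature values in the maximum frequent item sets are obtained; a target bipartite graph is constructed from the low-frequency feature values and the low-frequency maximum frequent feature values of each user, and the weights of its edges are set; and the abnormal groups among the users to be analyzed are determined according to those weights and by clustering the target bipartite graph. On the one hand, mining the maximum frequent item sets from the high-frequency feature values and obtaining the low-frequency maximum frequent feature values mines the behavior sequences of the users to be analyzed, which makes the identification of abnormal groups more accurate. On the other hand, the abnormal groups are obtained simply by obtaining the low-frequency feature values and low-frequency maximum frequent feature values of each user, constructing the target bipartite graph from them, defining the weights of its edges, and performing graph clustering on the target bipartite graph according to those weights, so the steps are simple and easy to execute.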
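Brief Description of the Drawings
In order to explain the technical solutions in one or more embodiments of this specification more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in one or more embodiments of this specification, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic flowchart of the abnormal group identification method provided by an embodiment of this application;
Figure 2 is a schematic flowchart of determining the high-frequency and low-frequency feature values among the feature values of each user to be analyzed, provided by an embodiment of this application;
Figure 3 is a schematic diagram of the first bipartite graph provided by an embodiment of this application;
Figure 4 is the first schematic flowchart of obtaining the low-frequency maximum frequent feature values, provided by an embodiment of this application;
Figure 5 is the second schematic flowchart of obtaining the low-frequency maximum frequent feature values, provided by an embodiment of this application;
Figure 6 is a schematic flowchart of determining abnormal groups, provided by an embodiment of this application;
Figure 7 is a schematic diagram of the composition of the abnormal group identification device provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of the abnormal group identification equipment provided by an embodiment of this application.
Detailed Description
One or more embodiments of this specification provide a method and device for identifying abnormal groups, so as to solve the problem of low accuracy in identifying abnormal groups.
In order to enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, those technical solutions are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of one or more embodiments of this specification without creative effort shall fall within the scope of protection of one or more embodiments of this specification.
Figure 1 is a schematic flowchart of the abnormal group identification method provided by an embodiment of this application. The method may be executed by, for example, a terminal device or a server, where the terminal device may be, for example, a personal computer, and the server may be a single independent server or a server cluster composed of multiple servers, which is not specifically limited in this exemplary embodiment. As shown in Figure 1, the method may include the following steps: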
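Step S102: obtain the feature values of each of a plurality of users to be analyzed.
In the embodiments of this application, the original personal data of the plurality of users to be analyzed may be obtained first and then discretized to obtain the feature values of each user to be analyzed. Obtaining the original personal data of the plurality of users to be analyzed includes: obtaining the original personal data of each user to be analyzed through an acquisition module, and aggregating the original personal data of the individual users to obtain the original personal data of the plurality of users to be analyzed. The original personal data of each user to be analyzed may include basic personal data, behavior data, device data, and so on, which is not specifically limited in this exemplary embodiment. Basic personal data may include data on features such as age, gender, occupation, income, education, place of origin, contact information, and account, which is not specifically limited in this exemplary embodiment. For example, basic personal data may include: female (gender), 18 years old (age), bachelor's degree (education), lawyer (occupation), Shaanxi (place of origin). Behavior data may include data on multiple behavior features; specifically, the behavior features included in the behavior data may be set according to the application scenario. For example, in an insurance scenario, behavior data may include: policy taken out on 2018.10.03 (time of insurance), accident insurance (type of insurance), claim made on 2019.2.1 (claim feature), and so on. Device data may include, for example, data on features such as device model, device home location, the address where the device is commonly used, and the frequency of device replacement, which is not specifically limited in this exemplary embodiment.
Discretizing the original personal data of the plurality of users to be analyzed to obtain the feature values of each user to be analyzed may include: analyzing the distribution of the data of each feature according to the data of that feature in the original personal data of the plurality of users to be analyzed; binning the data of each feature according to its distribution and a binning method; determining the interval into which the data of each feature falls after binning as the feature value of that data; and determining the feature values of each user to be analyzed according to the feature values of the data of each feature together with the original personal data of each user to be analyzed.
The binning method may be determined by the nature of the feature. For continuous features (such as age, income, and transaction amount), equal-frequency, equal-width, or other binning methods may be chosen according to business experience and the data distribution. For categorical features (such as gender, education, and occupation), the data may be binned according to the specific categories of the feature. For text features (such as addresses), binning may be done by clustering texts with the same pattern into one class.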
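The two binning methods named for continuous features can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names, the sample ages, the bin counts, and the `age_bin<k>` label format are all assumptions made for the example.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_bins] for k in range(1, n_bins)]


def bin_label(feature, value, edges):
    """Map a raw value to a discrete feature value (hypothetical label format)."""
    return f"{feature}_bin{bisect_right(edges, value)}"


# Hypothetical raw ages of eight users to be analyzed.
ages = [18, 22, 25, 31, 38, 44, 52, 60]
width_edges = equal_width_edges(ages, 3)      # cut points at 32.0 and 46.0
freq_edges = equal_frequency_edges(ages, 4)   # cut points at 25, 38 and 52
age_feature_values = [bin_label("age", a, freq_edges) for a in ages]
```

With equal-frequency binning each of the four bins receives two of the eight sample ages, whereas equal-width binning depends only on the range of the data.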
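It should be noted that each user to be analyzed may be marked with a unique identifier in order to distinguish the users to be analyzed. The unique identifier may be, for example, an identity card, a military officer's certificate, or an account ID, which is not specifically limited in this exemplary embodiment.
Step S104: determine the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed.
In this exemplary embodiment, the high-frequency and low-frequency feature values among the feature values of the users to be analyzed may be determined in either of the following two ways:
Method 1: count the number of times each feature value appears among the feature values of the plurality of users to be analyzed, and determine the high-frequency and low-frequency feature values according to the following rule. If the number of occurrences of a feature value satisfies T2_i ≥ X_i > T1_i, the feature value is a low-frequency feature value, where X_i is the number of times the i-th feature value appears among the feature values of the plurality of users to be analyzed, T1_i is the first preset number of occurrences for the i-th feature value, T2_i is the second preset number of occurrences for the i-th feature value, and T2_i > T1_i. If the number of occurrences satisfies T3_i ≥ X_i > T2_i, the feature value is a high-frequency feature value, where T3_i is the third preset number of occurrences for the i-th feature value and T3_i > T2_i. The specific values of T1_i, T2_i, and T3_i may be determined according to the feature to which the i-th feature value belongs; that is, different features have different values of T1_i, T2_i, and T3_i.
After the high-frequency and low-frequency feature values are determined, they may be matched against the feature values of each user to be analyzed to obtain the high-frequency and low-frequency feature values of each user. For example, suppose the high-frequency feature values are A, B, and D and the low-frequency feature values are C and E. If a user to be analyzed has the feature values A, B, C, and E, that user's high-frequency feature values are A and B and the low-frequency feature values are C and E; if a user to be analyzed has the feature values A, E, and F, that user's high-frequency feature value is A and the low-frequency feature value is E.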
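Method 1 can be sketched in a few lines. The sketch below is illustrative only: it assumes a single set of thresholds shared by all features (the text allows per-feature thresholds), and the function names and the small `sample_users` data set are hypothetical.

```python
from collections import Counter


def classify_by_count(user_features, t1, t2, t3):
    """Split feature values into (high, low) sets by occurrence count X:
    low if t1 < X <= t2, high if t2 < X <= t3."""
    counts = Counter(v for feats in user_features.values() for v in set(feats))
    low = {v for v, x in counts.items() if t1 < x <= t2}
    high = {v for v, x in counts.items() if t2 < x <= t3}
    return high, low


def split_user_features(features, high, low):
    """Match one user's feature values against the global high/low sets."""
    return sorted(set(features) & high), sorted(set(features) & low)


# Hypothetical data: A appears 4 times, B and C twice each.
sample_users = {"u1": ["A", "B"], "u2": ["A", "B", "C"], "u3": ["A", "C"], "u4": ["A"]}
sample_high, sample_low = classify_by_count(sample_users, t1=1, t2=2, t3=4)

# The matching example from the text: high = {A, B, D}, low = {C, E}.
first = split_user_features(["A", "B", "C", "E"], {"A", "B", "D"}, {"C", "E"})
second = split_user_features(["A", "E", "F"], {"A", "B", "D"}, {"C", "E"})
```

Feature values whose count falls outside both ranges (here F in the second user) are simply assigned to neither set.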
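Method 2: as shown in Figure 2, the following steps may be included:
Step S202: construct a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values.
In the embodiments of this application, each user to be analyzed is converted into a node, with exactly one node per user, and the feature values of the users are converted into nodes, with exactly one node per feature value; that is, if a node for a feature value already exists during the conversion, that node is reused and no further node is created for that feature value. The nodes corresponding to the users to be analyzed are on one side of the first bipartite graph, the nodes corresponding to the feature values are on the other side, and an edge is added between the node of each user and the nodes of that user's feature values. For example, suppose there are five users to be analyzed, the first to the fifth, where the feature values of the first user are A, B, D; those of the second user are B, C, F; those of the third user are A, C, D, F; those of the fourth user are B, D, F; and those of the fifth user are C, D, E, F. The resulting first bipartite graph is shown in Figure 3: node 1 for the first user, node 2 for the second user, node 3 for the third user, node 4 for the fourth user, and node 5 for the fifth user are on the left of Figure 3; the nodes for feature values A, B, C, D, E, and F are on the right; and an edge is placed between the node of each user and the nodes of that user's feature values.
Step S204: obtain, in the first bipartite graph, the degree of the node corresponding to each feature value, and determine the high-frequency and low-frequency feature values among the feature values according to those degrees.
In the embodiments of this application, the degree of the node corresponding to a feature value is the number of edges connected to that node. For example, in Figure 3 the degree of the node for feature value A is 2, for B it is 3, for C it is 3, for D it is 4, for E it is 1, and for F it is 4.
Determining the high-frequency and low-frequency feature values according to the degrees of the nodes corresponding to the feature values may follow this rule. If the degree of the node corresponding to a feature value satisfies K2_i ≥ degree(V_i) > 1, the feature value is a low-frequency feature value, where degree(V_i) is the degree of the node corresponding to the i-th feature value V_i, K2_i is the first preset degree for the i-th feature value V_i, and K2_i > 1. If the degree satisfies K1_i ≥ degree(V_i) > K2_i, the feature value is a high-frequency feature value, where K1_i is the second preset degree for the i-th feature value V_i and K1_i > K2_i. The specific values of K2_i and K1_i may be determined according to the feature to which the i-th feature value V_i belongs; that is, different features have different values of K2_i and K1_i.
For example, as shown in Figure 3, if K2_i is 2 and K1_i is 3, then feature value A is a low-frequency feature value, and feature values B and C are high-frequency feature values. Feature values D and F (degree 4) and E (degree 1) fall outside both ranges.
Step S206: determine the high-frequency and low-frequency feature values among the feature values of each user to be analyzed according to the high-frequency feature values and the low-frequency feature values.
In the embodiments of this application, the high-frequency feature values are matched against the feature values of each user to be analyzed, and the feature values of each user that match a high-frequency feature value are determined as that user's high-frequency feature values; the low-frequency feature values are matched against the feature values of each user to be analyzed, and the feature values of each user that match a low-frequency feature value are determined as that user's low-frequency feature values. For example, as shown in Figure 3, if K2_i is 2 and K1_i is 3, feature value A is a low-frequency feature value, and feature values B and C are high-frequency feature values. On this basis, the first user's low-frequency feature value is A and high-frequency feature value is B; the second user has no low-frequency feature value and has high-frequency feature values B and C; the third user's low-frequency feature value is A and high-frequency feature value is C; the fourth user has no low-frequency feature value and has high-frequency feature value B; and the fifth user has no low-frequency feature value and has high-frequency feature value C.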
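Steps S202 to S206 can be traced on the five-user example of Figure 3. The sketch below represents the first bipartite graph as a plain edge list; the variable and function names are assumptions for illustration, and a single pair of thresholds (K2 = 2, K1 = 3) is used for every feature value, as in the example.

```python
from collections import Counter

# Feature values of the five users to be analyzed, as listed for Figure 3.
users = {
    1: ["A", "B", "D"],
    2: ["B", "C", "F"],
    3: ["A", "C", "D", "F"],
    4: ["B", "D", "F"],
    5: ["C", "D", "E", "F"],
}

# S202: one edge per (user node, feature-value node) pair.
edges = [(user, value) for user, feats in users.items() for value in feats]

# S204: the degree of a feature-value node is the number of edges touching it.
degree = Counter(value for _, value in edges)


def classify_by_degree(degree, k2, k1):
    """low if 1 < degree <= k2, high if k2 < degree <= k1."""
    low = {v for v, d in degree.items() if 1 < d <= k2}
    high = {v for v, d in degree.items() if k2 < d <= k1}
    return high, low


high, low = classify_by_degree(degree, k2=2, k1=3)

# S206: match the global sets against each user's own feature values.
per_user = {
    user: {"high": [v for v in feats if v in high],
           "low": [v for v in feats if v in low]}
    for user, feats in users.items()
}
```

Because each user contributes at most one edge to a given feature-value node, the node degree equals the occurrence count used in Method 1; the graph form mainly makes the later graph steps uniform.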
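Step S106: mine maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtain the low-frequency maximum frequent feature values in the maximum frequent item sets.
In the embodiments of this application, the preset frequent item set mining strategy may be, for example, the Apriori strategy (which mines frequent item sets for association rules) or FP-Growth, which is not specifically limited in this exemplary embodiment. The process is described below taking FP-Growth as the preset frequent item set mining strategy; as shown in Figure 4, it may include the following steps:
Step S402: mine, according to the high-frequency feature values of each user to be analyzed and in combination with the FP-Growth method, frequent multi-item sets whose support satisfies a preset support, and determine the maximum frequent item sets among the frequent multi-item sets.
In the embodiments of this application, support is the number of times a high-frequency feature value appears among the plurality of users to be analyzed. The specific value of the preset support may be set as required, for example 1 or 2, which is not specifically limited in this exemplary embodiment. A frequent multi-item set is a set that includes at least two high-frequency feature values. A frequent multi-item set whose support satisfies the preset support is one in which the support of every high-frequency feature value is greater than the preset support.
Specifically, mining the frequent multi-item sets includes: defining the preset support; scanning the high-frequency feature values of each user to be analyzed to obtain the number of times each high-frequency feature value appears among the plurality of users (that is, its support); filtering out, from the high-frequency feature values of each user, those whose support is less than the preset support; constructing an FP-tree from the remaining high-frequency feature values of each user; and mining the frequent multi-item sets in the FP-tree. Among the frequent multi-item sets, those that have no superset are obtained and determined as the maximum frequent item sets. It should be noted that each maximum frequent item set includes multiple high-frequency feature values; here, the high-frequency feature values included in a maximum frequent item set are called maximum frequent feature values, that is, each maximum frequent item set includes multiple maximum frequent feature values.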
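The notion of a maximum frequent item set (a frequent multi-item set with no frequent superset) can be illustrated without building an FP-tree. The sketch below deliberately swaps in brute-force level-wise enumeration in place of FP-Growth, so it is only suitable for tiny inputs. It also assumes the conventional item-set support (the number of users whose high-frequency feature values contain every item of the set) and a "greater than or equal to" threshold; the sample transactions are hypothetical.

```python
from itertools import combinations


def frequent_multi_itemsets(transactions, min_support):
    """Enumerate item sets of size >= 2 contained in at least min_support transactions.
    Brute force, not FP-Growth: stops at the first size with no frequent set."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    for size in range(2, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[frozenset(candidate)] = support
                found = True
        if not found:
            break  # no frequent set of this size, so none of any larger size
    return frequent


def maximal_itemsets(frequent):
    """Keep only the frequent sets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical high-frequency feature values of five users to be analyzed.
transactions = [{"B", "C", "D"}, {"B", "C", "D"}, {"B", "C"}, {"C", "D", "F"}, {"B", "F"}]
frequent = frequent_multi_itemsets(transactions, min_support=2)
maximal = maximal_itemsets(frequent)
max_frequent_values = set().union(*maximal)  # the "maximum frequent feature values"
```

In this sample the pairs {B, C}, {B, D}, and {C, D} are all frequent but are absorbed by {B, C, D}, which is the only maximum frequent item set.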
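Step S404: match the feature values of each user to be analyzed against the maximum frequent feature values in the maximum frequent item sets to obtain the maximum frequent feature values of each user to be analyzed.
In the embodiments of this application, the feature values of each user to be analyzed are matched against the maximum frequent feature values in the maximum frequent item sets, and the feature values of each user that match a maximum frequent feature value are determined as that user's maximum frequent feature values.
Step S406: determine the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed.
In the embodiments of this application, the low-frequency maximum frequent feature values may be determined in either of the following two ways:
Method 1: count, from the maximum frequent feature values of each user to be analyzed, the number of times each maximum frequent feature value appears among the plurality of users to be analyzed, and determine the low-frequency maximum frequent feature values according to the following rule. If the number of occurrences of a maximum frequent feature value satisfies P2_i ≥ S_i, it is a low-frequency maximum frequent feature value, where S_i is the number of times the i-th maximum frequent feature value appears among the plurality of users to be analyzed and P2_i is the preset number of occurrences for the i-th maximum frequent feature value. The specific value of P2_i may be determined according to the feature to which the i-th maximum frequent feature value belongs; that is, different features have different values of P2_i.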
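The counting rule P2_i ≥ S_i, with a threshold that may differ per feature value, can be sketched as follows. The per-value threshold dictionary, the default threshold, and the sample data are assumptions for illustration.

```python
from collections import Counter


def low_frequency_max_frequent(user_max_frequent, thresholds, default):
    """Return the maximum frequent feature values whose occurrence count S
    does not exceed their preset count P2 (looked up per value, else default)."""
    counts = Counter(v for vals in user_max_frequent.values() for v in set(vals))
    return {v for v, s in counts.items() if s <= thresholds.get(v, default)}


# Hypothetical maximum frequent feature values of four users: B x3, C x4, D x3.
user_max_frequent = {1: ["B", "C", "D"], 2: ["B", "C", "D"], 3: ["B", "C"], 4: ["C", "D"]}

with_default = low_frequency_max_frequent(user_max_frequent, {}, default=3)
with_override = low_frequency_max_frequent(user_max_frequent, {"C": 4}, default=3)
```

Raising the threshold for C alone moves it into the low-frequency set, which mirrors the statement that different features may use different preset counts.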
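Method 2: as shown in Figure 5, the following steps may be included:
Step S502: construct a second bipartite graph according to the maximum frequent feature values of each user to be analyzed, wherein the second bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the maximum frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values.
In the embodiments of this application, each user to be analyzed is converted into a node, with exactly one node per user, and the maximum frequent feature values of the users are converted into nodes, with exactly one node per maximum frequent feature value. The nodes corresponding to the users to be analyzed are on one side of the second bipartite graph, the nodes corresponding to the maximum frequent feature values are on the other side, and an edge is added between the node of each user and the nodes of that user's maximum frequent feature values, which completes the construction of the second bipartite graph.
Step S504: obtain, in the second bipartite graph, the degree of the node corresponding to each maximum frequent feature value, and determine the low-frequency maximum frequent feature values among the maximum frequent feature values according to those degrees.
In the embodiments of this application, the degree of the node corresponding to a maximum frequent feature value is the number of edges in the second bipartite graph connected to that node. The low-frequency maximum frequent feature values may be determined from these degrees according to the following rule. If the degree of the node corresponding to a maximum frequent feature value satisfies L2_i ≥ degree(V_i), it is a low-frequency maximum frequent feature value, where degree(V_i) is the degree of the node corresponding to the i-th maximum frequent feature value V_i and L2_i is the preset degree for the i-th maximum frequent feature value V_i. The specific value of L2_i may be determined according to the feature to which the i-th maximum frequent feature value V_i belongs; that is, different features have different values of L2_i.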
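Step S108: construct a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and define the weights of the edges in the target bipartite graph.
In the embodiments of this application, the low-frequency maximum frequent feature values are matched against the feature values of each user to be analyzed, and the feature values of each user that match a low-frequency maximum frequent feature value are determined as that user's low-frequency maximum frequent feature values. Constructing the target bipartite graph from the low-frequency maximum frequent feature values of each user and the low-frequency feature values of each user obtained in step S104 may include: converting each user to be analyzed into a node, converting each low-frequency feature value into a node, converting each low-frequency maximum frequent feature value into a node, adding an edge between the node of each user and the nodes of that user's low-frequency feature values, and adding an edge between the node of each user and the nodes of that user's low-frequency maximum frequent feature values, which completes the construction of the target bipartite graph.
Defining the weights of the edges in the target bipartite graph may include: defining the weights of the edges between the node of each user to be analyzed and the nodes of that user's low-frequency feature values, and defining the weights of the edges between the node of each user to be analyzed and the nodes of that user's low-frequency maximum frequent feature values. Defining the former may include: determining the weight of each low-frequency feature value according to the feature to which it belongs. Specifically, the higher the weight of a low-frequency feature value, the higher the probability that the users to be analyzed who share it form one abnormal group; the lower the weight, the lower that probability. After the weight of each low-frequency feature value is determined, the weight of every edge connected to the node of that low-frequency feature value is set to that weight. For example, if the low-frequency feature values include frequent claims (claim feature) and unemployed (occupation feature), with a weight of 0.5 for frequent claims and 0.1 for unemployed, then every edge connected to the node for frequent claims is given the weight 0.5 and every edge connected to the node for unemployed is given the weight 0.1. Similarly, defining the weights of the edges between the node of each user and the nodes of that user's low-frequency maximum frequent feature values may include: determining the weight of each low-frequency maximum frequent feature value according to the feature to which it belongs. Specifically, the higher the weight of a low-frequency maximum frequent feature value, the higher the probability that the users to be analyzed who share it form one abnormal group; the lower the weight, the lower that probability. The weight of every edge connected to the node of a low-frequency maximum frequent feature value is set to the weight of that value.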
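A weighted target bipartite graph of this kind can be held as a dictionary from (user, feature value) pairs to edge weights. The sketch below reuses the 0.5 and 0.1 weights from the example in the text; the user identifiers, the `device_X` value, and its weight of 0.4 are hypothetical.

```python
def build_target_graph(user_low, user_low_max, value_weight):
    """Return {(user, value): weight}; every edge into a value node
    carries the weight assigned to that value."""
    edges = {}
    for source in (user_low, user_low_max):
        for user, values in source.items():
            for value in values:
                edges[(user, value)] = value_weight[value]
    return edges


# Low-frequency feature values per user (weights taken from the text's example).
user_low = {"u1": ["frequent_claims", "unemployed"], "u2": ["frequent_claims"], "u3": ["unemployed"]}
# Low-frequency maximum frequent feature values per user (hypothetical).
user_low_max = {"u1": ["device_X"], "u2": ["device_X"]}
value_weight = {"frequent_claims": 0.5, "unemployed": 0.1, "device_X": 0.4}

target_edges = build_target_graph(user_low, user_low_max, value_weight)
```

Keeping users and feature values in separate positions of the key tuple preserves the two sides of the bipartite graph without a separate node table.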
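Step S110: determine the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.
In the embodiments of this application, the abnormal groups among the users to be analyzed may be determined in either of the following two ways:
Method 1: delete, from the target bipartite graph, edges whose weight is less than a first preset weight to obtain a bipartite graph to be clustered, apply a connectivity algorithm to the bipartite graph to be clustered to obtain at least one maximal connected subgraph, and determine the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group.
In the embodiments of this application, the specific value of the first preset weight may be set as required, which is not specifically limited in this exemplary embodiment. The weight of each edge in the target bipartite graph is compared in turn with the first preset weight: if the weight of the edge is less than the first preset weight, the edge is deleted from the target bipartite graph; otherwise the edge is kept. The target bipartite graph from which the edges with weight less than the first preset weight have been removed is the bipartite graph to be clustered. A connectivity algorithm is applied to the bipartite graph to be clustered to obtain at least one maximal connected subgraph; in each maximal connected subgraph, the nodes corresponding to low-frequency feature values and to low-frequency maximum frequent feature values are removed, and the users to be analyzed corresponding to the remaining nodes are collected into the set of users to be analyzed for that subgraph; each such set is determined as one abnormal group.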
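Method 1 amounts to thresholding the edges and then finding connected components. The sketch below uses a breadth-first search as the connectivity algorithm and drops feature-value nodes when collecting each group; the function name, the edge data, and the threshold of 0.3 are assumptions for illustration.

```python
from collections import defaultdict, deque


def groups_from_bipartite(edges, first_preset_weight):
    """Drop edges below the threshold, then return the user sets of the
    connected components of what remains (feature-value nodes are discarded)."""
    adjacency = defaultdict(set)
    for (user, value), weight in edges.items():
        if weight >= first_preset_weight:
            adjacency[("user", user)].add(("value", value))
            adjacency[("value", value)].add(("user", user))
    seen, groups = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            kind, name = queue.popleft()
            if kind == "user":
                members.append(name)
            for neighbour in adjacency[(kind, name)]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(sorted(members))
    return sorted(groups)


# Hypothetical weighted edges of a target bipartite graph.
bipartite_edges = {
    ("u1", "frequent_claims"): 0.5, ("u2", "frequent_claims"): 0.5,
    ("u1", "unemployed"): 0.1, ("u3", "unemployed"): 0.1,
    ("u4", "device_X"): 0.4, ("u5", "device_X"): 0.4,
}
candidate_groups = groups_from_bipartite(bipartite_edges, first_preset_weight=0.3)
```

In this sketch a user whose edges are all deleted, such as u3 here, no longer appears in the graph and so belongs to no candidate group.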
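Method 2: delete, from the target bipartite graph, edges whose weight is less than the first preset weight to obtain a bipartite graph to be clustered, divide the nodes in the bipartite graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and determine the users to be analyzed corresponding to the nodes in each node set as one abnormal group.
In the embodiments of this application, deleting edges whose weight is less than the first preset weight to obtain the bipartite graph to be clustered works in the same way as in Method 1 above, so it is not repeated here. The community discovery algorithm may be, for example, the Louvain algorithm, which is not specifically limited in this exemplary embodiment. After the nodes in the bipartite graph to be clustered have been divided into a plurality of node sets by the community discovery algorithm, the nodes corresponding to low-frequency feature values and to low-frequency maximum frequent feature values are first removed from each node set, the users to be analyzed corresponding to the remaining nodes of each node set are collected into the set of users to be analyzed for that node set, and each such set is determined as one abnormal group.
Further, after the abnormal groups are obtained, they may be verified in order to further improve the accuracy of abnormal group identification. The total number of users to be analyzed in each abnormal group may be obtained, the abnormal groups whose total number of users is less than a preset number may be removed, and the remaining abnormal groups determined as the finally identified abnormal groups. Alternatively, the modularity of the maximal connected subgraph corresponding to each abnormal group may be calculated and taken as the modularity of that abnormal group, the abnormal groups whose modularity is less than a preset modularity may be removed, and the remaining abnormal groups determined as the finally identified abnormal groups. It should be noted that these two verification methods are only exemplary and do not limit the present invention; the abnormal groups may also be verified by analyzing the business features of each user to be analyzed in the abnormal group.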
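The first of the two verification steps, discarding groups with fewer members than a preset number, is a one-line filter. The sketch below is illustrative; the function name, the sample groups, and the preset number of 2 are assumptions, and the modularity-based check is not shown.

```python
def filter_small_groups(groups, preset_number):
    """Keep only the candidate groups with at least preset_number users."""
    return [group for group in groups if len(group) >= preset_number]


# Hypothetical candidate groups produced by a clustering step.
candidates = [["u1", "u2", "u3"], ["u4"], ["u5", "u6"]]
final_groups = filter_small_groups(candidates, preset_number=2)
```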
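In order to cluster the users to be analyzed more accurately and thus obtain more accurate abnormal groups, as shown in Figure 6, determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph may include the following steps:
Step S602: calculate the weight between any two users to be analyzed according to the weights of the edges in the target bipartite graph.
In the embodiments of this application, the nodes corresponding to low-frequency feature values and to low-frequency maximum frequent feature values that are connected to both of the nodes corresponding to any two users to be analyzed are obtained in the target bipartite graph and determined as target nodes. The weight between the two users to be analyzed is calculated from the weights of the edges between the node of either of the two users and each target node, using the following formula:
Figure PCTCN2019126030-appb-000001
where weight(e) is the weight between the two users to be analyzed, j is the total number of target nodes, and w(item_i) is the weight of the edge between the i-th target node item_i and the node corresponding to either of the two users to be analyzed.
Step S604: convert each user to be analyzed into a node, place an edge between any two nodes, and set the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct a target clustering graph.
In the embodiments of this application, each user to be analyzed is converted into a node, that is, one node per user; an edge is placed between any two nodes; and the weight between any two users to be analyzed is set as the weight of the edge between the two nodes corresponding to those users, which completes the construction of the target clustering graph. As can be seen, steps S602 and S604 convert the target bipartite graph, which includes nodes for the users to be analyzed, nodes for the low-frequency feature values, and nodes for the low-frequency maximum frequent feature values, into a target clustering graph that includes only nodes for the users to be analyzed.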
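Steps S602 and S604 project the bipartite graph onto the users. The formula itself survives in the text only as an image reference, so the sketch below makes an explicit assumption: the weight between two users is taken to be the sum of w(item_i) over their j shared target nodes. If the original formula aggregates differently (for example with a normalization), only the marked line would change. Function and variable names and the sample weights are hypothetical.

```python
from itertools import combinations


def project_to_user_graph(edges):
    """Build {(user_a, user_b): weight} for every pair of users.
    ASSUMPTION: the pair weight is the sum of the edge weights of the target
    nodes both users are connected to (0 when they share none)."""
    by_user = {}
    for (user, value), weight in edges.items():
        by_user.setdefault(user, {})[value] = weight
    graph = {}
    for a, b in combinations(sorted(by_user), 2):
        shared = by_user[a].keys() & by_user[b].keys()
        graph[(a, b)] = sum(by_user[a][v] for v in shared)  # assumed aggregation
    return graph


# Hypothetical target bipartite graph: u1 and u2 share two target nodes.
pair_source_edges = {
    ("u1", "frequent_claims"): 0.5, ("u2", "frequent_claims"): 0.5,
    ("u1", "device_X"): 0.4, ("u2", "device_X"): 0.4, ("u3", "device_X"): 0.4,
    ("u1", "unemployed"): 0.1,
}
user_graph = project_to_user_graph(pair_source_edges)
```

Because every edge into a given target node carries the same weight, it does not matter which of the two users' edges is read, which matches the wording "either of the two users".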
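Step S606: determine the abnormal groups among the users to be analyzed from the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph.
In the embodiments of this application, the abnormal groups may be determined in either of the following two ways:
Method 1: delete, from the target clustering graph, edges whose weight is less than a second preset weight to obtain a graph to be clustered, apply a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and determine the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group.
In the embodiments of this application, the specific value of the second preset weight may be set as required, which is not specifically limited in this exemplary embodiment. The weight of each edge in the target clustering graph is compared with the second preset weight, and the edges whose weight is less than the second preset weight are deleted, which converts the target clustering graph into the graph to be clustered. The users to be analyzed corresponding to the nodes in each maximal connected subgraph are collected into the set of users to be analyzed for that subgraph, and each such set is determined as one abnormal group.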
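On the user-only target clustering graph the same threshold-then-connect idea applies. The sketch below uses a union-find structure as the connectivity algorithm, which is one common choice and not something the text prescribes; the names, the sample pair weights, and the threshold of 0.5 are assumptions.

```python
def connected_user_groups(user_graph, second_preset_weight):
    """Union users joined by an edge of sufficient weight and return the
    resulting components as sorted lists of users."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), weight in user_graph.items():
        root_a, root_b = find(a), find(b)
        if weight >= second_preset_weight and root_a != root_b:
            parent[root_a] = root_b
    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical pairwise weights between three users to be analyzed.
pair_weights = {("u1", "u2"): 0.9, ("u1", "u3"): 0.4, ("u2", "u3"): 0.4}
clustered = connected_user_groups(pair_weights, second_preset_weight=0.5)
```

A user left without any surviving edge, such as u3 here, forms a single-member component; a later size check of the kind described in the verification step would remove it.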
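Method 2: delete, from the target clustering graph, edges whose weight is less than the second preset weight to obtain a graph to be clustered, divide the graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and determine the users to be analyzed corresponding to each node set as one abnormal group.
In the embodiments of this application, the second preset weight has been described above, so it is not repeated here. The weight of each edge in the target clustering graph is compared with the second preset weight, and the edges whose weight is less than the second preset weight are deleted, which converts the target clustering graph into the graph to be clustered. The community discovery algorithm may be, for example, the Louvain algorithm, which is not specifically limited in this exemplary embodiment. After the nodes in the graph to be clustered have been divided into a plurality of node sets by the community discovery algorithm, the users to be analyzed corresponding to the nodes in each node set are collected into the set of users to be analyzed for that node set, and each such set is determined as one abnormal group.
As can be seen, calculating the weight between any two users to be analyzed from the weights of the edges in the target bipartite graph, and constructing the target clustering graph from the weights between any two users, converts the target bipartite graph into a target clustering graph that reflects the relationships between the users to be analyzed more accurately and more intuitively, so that the abnormal groups obtained from the target clustering graph are more accurate.
It should be noted that the two ways of determining abnormal groups described above are only exemplary and do not limit the present invention.
Further, after the abnormal groups are obtained, they may be verified in order to further improve the accuracy of abnormal group identification. The total number of users to be analyzed in each abnormal group may be obtained, the abnormal groups whose total number of users is less than a preset number may be removed, and the remaining abnormal groups determined as the finally identified abnormal groups. Alternatively, the modularity of the maximal connected subgraph corresponding to each abnormal group may be calculated and taken as the modularity of that abnormal group, the abnormal groups whose modularity is less than a preset modularity may be removed, and the remaining abnormal groups determined as the finally identified abnormal groups. It should be noted that these two verification methods are only exemplary and do not limit the present invention; the abnormal groups may also be verified by analyzing the business features of each user to be analyzed in the abnormal group.
In summary, mining the maximum frequent item sets from the high-frequency feature values of each user to be analyzed using a preset frequent item set mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item sets, mines the behavior sequences of the users to be analyzed and thus makes the identification of abnormal groups more accurate. In addition, the abnormal groups are obtained simply by obtaining the low-frequency feature values and low-frequency maximum frequent feature values of each user to be analyzed, constructing the target bipartite graph from them, defining the weights of its edges, and performing graph clustering on the target bipartite graph according to those weights, so the steps are simple and easy to execute.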
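Corresponding to the above abnormal group identification method and based on the same technical concept, an embodiment of this application further provides an abnormal group identification device. Figure 7 is a schematic diagram of the composition of the abnormal group identification device provided by an embodiment of this application; the device is used to execute the above abnormal group identification method. As shown in Figure 7, the device 700 may include an acquisition module 701, a determination module 702, a mining module 703, a construction module 704, and a clustering module 705, where:
the acquisition module 701 is configured to obtain the feature values of each of a plurality of users to be analyzed;
the determination module 702 is configured to determine the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
the mining module 703 is configured to mine maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and to obtain the low-frequency maximum frequent feature values in the maximum frequent item sets;
the construction module 704 is configured to construct a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and to define the weights of the edges in the target bipartite graph;
the clustering module 705 is configured to determine the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.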
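Optionally, the acquisition module 701 may include:
an acquisition unit, configured to obtain the original personal data of the plurality of users to be analyzed;
a discretization unit, configured to discretize the original personal data of the plurality of users to be analyzed to obtain the feature values of each user to be analyzed.
Optionally, the determination module 702 may include:
a first construction unit, configured to construct a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
a first determination unit, configured to obtain, in the first bipartite graph, the degree of the node corresponding to each feature value, and to determine the high-frequency feature values and low-frequency feature values among the feature values according to those degrees;
a second determination unit, configured to determine the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed according to the high-frequency feature values and the low-frequency feature values.
Optionally, the mining module 703 may include:
a mining unit, configured to mine, according to the high-frequency feature values of each user to be analyzed and in combination with the FP-Growth method, frequent multi-item sets whose support satisfies a preset support, and to determine the maximum frequent item sets among the frequent multi-item sets;
a matching unit, configured to match the feature values of each user to be analyzed against the maximum frequent feature values in the maximum frequent item sets to obtain the maximum frequent feature values of each user to be analyzed;
a third determination unit, configured to determine the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed.
Optionally, the third determination unit may include:
a construction subunit, configured to construct a second bipartite graph according to the maximum frequent feature values of each user to be analyzed, wherein the second bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the maximum frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values;
a determination subunit, configured to obtain, in the second bipartite graph, the degree of the node corresponding to each maximum frequent feature value, and to determine the low-frequency maximum frequent feature values among the maximum frequent feature values according to those degrees.
Optionally, the clustering module 705 may include:
a first clustering unit, configured to delete, from the target bipartite graph, edges whose weight is less than a first preset weight to obtain a bipartite graph to be clustered, to apply a connectivity algorithm to the bipartite graph to be clustered to obtain at least one maximal connected subgraph, and to determine the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group; or
a second clustering unit, configured to delete, from the target bipartite graph, edges whose weight is less than the first preset weight to obtain a bipartite graph to be clustered, to divide the nodes in the bipartite graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and to determine the users to be analyzed corresponding to the nodes in each node set as one abnormal group.
Optionally, the clustering module 705 may include:
a calculation unit, configured to calculate the weight between any two users to be analyzed according to the weights of the edges in the target bipartite graph;
a second construction unit, configured to convert each user to be analyzed into a node, place an edge between any two nodes, and set the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct a target clustering graph;
a third clustering unit, configured to determine the abnormal groups among the users to be analyzed from the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph.
Optionally, the third clustering unit may include:
a first clustering subunit, configured to delete, from the target clustering graph, edges whose weight is less than a second preset weight to obtain a graph to be clustered, to apply a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and to determine the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group; or
a second clustering subunit, configured to delete, from the target clustering graph, edges whose weight is less than the second preset weight to obtain a graph to be clustered, to divide the graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and to determine the users to be analyzed corresponding to each node set as one abnormal group.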
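The abnormal group identification device in the embodiments of this application mines the maximum frequent item sets from the high-frequency feature values of each user to be analyzed using a preset frequent item set mining strategy, and obtains the low-frequency maximum frequent feature values in the maximum frequent item sets, so as to mine the behavior sequences of the users to be analyzed and thus make the identification of abnormal groups more accurate. In addition, the abnormal groups are obtained simply by obtaining the low-frequency feature values and low-frequency maximum frequent feature values of each user to be analyzed, constructing the target bipartite graph from them, defining the weights of its edges, and performing graph clustering on the target bipartite graph according to those weights, so the steps are simple and easy to execute.
Corresponding to the above abnormal group identification method and based on the same technical concept, an embodiment of this application further provides abnormal group identification equipment. Figure 8 is a schematic structural diagram of the abnormal group identification equipment provided by an embodiment of this application; the equipment is used to execute the above abnormal group identification method.
As shown in Figure 8, abnormal group identification equipment may vary considerably with configuration or performance, and may include one or more processors 801 and a memory 802, where the memory 802 may store one or more application programs or data. The memory 802 may be transient or persistent storage. The application programs stored in the memory 802 may include one or more modules (not shown in the figure), and each module may include a series of computer-executable instructions for the abnormal group identification equipment. Furthermore, the processor 801 may be configured to communicate with the memory 802 and to execute, on the abnormal group identification equipment, the series of computer-executable instructions in the memory 802. The abnormal group identification equipment may also include one or more power supplies 803, one or more wired or wireless network interfaces 804, one or more input/output interfaces 805, one or more keyboards 806, and so on.
In a specific embodiment, the abnormal group identification equipment includes a memory and one or more programs, where the one or more programs are stored in the memory and may include one or more modules, each module may include a series of computer-executable instructions for the abnormal group identification equipment, and the one or more programs are configured to be executed by one or more processors and contain computer-executable instructions for: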
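obtaining the feature values of each of a plurality of users to be analyzed;
determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
mining maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item sets;
constructing a target bipartite graph according to the low-frequency maximum frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and defining the weights of the edges in the target bipartite graph;
determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.
Optionally, when the computer-executable instructions are executed, obtaining the feature values of each of the plurality of users to be analyzed includes:
obtaining the original personal data of the plurality of users to be analyzed;
discretizing the original personal data of the plurality of users to be analyzed to obtain the feature values of each user to be analyzed.
Optionally, when the computer-executable instructions are executed, determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed includes:
constructing a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
obtaining, in the first bipartite graph, the degree of the node corresponding to each feature value, and determining the high-frequency feature values and low-frequency feature values among the feature values according to those degrees;
determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed according to the high-frequency feature values and the low-frequency feature values.
Optionally, when the computer-executable instructions are executed, mining maximum frequent item sets according to the high-frequency feature values of each user to be analyzed and a preset frequent item set mining strategy, and obtaining the low-frequency maximum frequent feature values in the maximum frequent item sets, includes:
mining, according to the high-frequency feature values of each user to be analyzed and in combination with the FP-Growth method, frequent multi-item sets whose support satisfies a preset support, and determining the maximum frequent item sets among the frequent multi-item sets;
matching the feature values of each user to be analyzed against the maximum frequent feature values in the maximum frequent item sets to obtain the maximum frequent feature values of each user to be analyzed;
determining the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed.
Optionally, when the computer-executable instructions are executed, determining the low-frequency maximum frequent feature values among the maximum frequent feature values of the users to be analyzed includes:
constructing a second bipartite graph according to the maximum frequent feature values of each user to be analyzed, wherein the second bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the maximum frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximum frequent feature values;
obtaining, in the second bipartite graph, the degree of the node corresponding to each maximum frequent feature value, and determining the low-frequency maximum frequent feature values among the maximum frequent feature values according to those degrees.
Optionally, when the computer-executable instructions are executed, determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
deleting, from the target bipartite graph, edges whose weight is less than a first preset weight to obtain a bipartite graph to be clustered, applying a connectivity algorithm to the bipartite graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group; or
deleting, from the target bipartite graph, edges whose weight is less than the first preset weight to obtain a bipartite graph to be clustered, dividing the nodes in the bipartite graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and determining the users to be analyzed corresponding to the nodes in each node set as one abnormal group.
Optionally, when the computer-executable instructions are executed, determining the abnormal groups among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
calculating the weight between any two users to be analyzed according to the weights of the edges in the target bipartite graph;
converting each user to be analyzed into a node, placing an edge between any two nodes, and setting the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct a target clustering graph;
determining the abnormal groups among the users to be analyzed from the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph.
Optionally, when the computer-executable instructions are executed, determining the abnormal groups among the users to be analyzed from the clustering results of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph includes:
deleting, from the target clustering graph, edges whose weight is less than a second preset weight to obtain a graph to be clustered, applying a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed corresponding to the nodes in each maximal connected subgraph as one abnormal group; or
deleting, from the target clustering graph, edges whose weight is less than the second preset weight to obtain a graph to be clustered, dividing the graph to be clustered with a community discovery algorithm to obtain a plurality of node sets, and determining the users to be analyzed corresponding to each node set as one abnormal group.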
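The abnormal group identification equipment in the embodiments of this application mines the maximum frequent item sets from the high-frequency feature values of each user to be analyzed using a preset frequent item set mining strategy, and obtains the low-frequency maximum frequent feature values in the maximum frequent item sets, so as to mine the behavior sequences of the users to be analyzed and thus make the identification of abnormal groups more accurate. In addition, the abnormal groups are obtained simply by obtaining the low-frequency feature values and low-frequency maximum frequent feature values of each user to be analyzed, constructing the target bipartite graph from them, defining the weights of its edges, and performing graph clustering on the target bipartite graph according to those weights, so the steps are simple and easy to execute.
Corresponding to the above abnormal group identification method and based on the same technical concept, an embodiment of this application further provides a storage medium for storing computer-executable instructions. In a specific embodiment, the storage medium may be a USB flash drive, an optical disc, a hard disk, or the like, and the computer-executable instructions stored in the storage medium, when executed by a processor, can implement the following process: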
获取多个待分析用户中的各所述待分析用户的特征值;
确定各所述待分析用户的特征值中的高频特征值和低频特征值;
根据各所述待分析用户的高频特征值和预设的频繁项集挖掘策略挖掘最大频繁项集,获取所述最大频繁项集中的低频最大频繁特征值;
根据各所述待分析用户的特征值中的所述低频最大频繁特征值和所述低频特征值构建目标二部图,并定义所述目标二部图中的边的权重;
根据所述目标二部图中的边的权重,以及通过对所述目标二部图进行图聚类所得到的所述多个待分析用户的聚类结果,确定所述待分析用户中的异常群体。
可选的,该存储介质存储的计算机可执行指令在被处理器执行时,所述获取多个待分析用户中的各所述待分析用户的特征值包括:
获取所述多个待分析用户的原始个人数据;
对所述多个待分析用户的原始个人数据进行离散化,以得到各所述待分析用户的特征值。
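Optionally, when the computer-executable instructions stored on the storage medium are executed by a processor, determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed includes:
constructing a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
obtaining, in the first bipartite graph, the degree of the node corresponding to each feature value, and determining high-frequency feature values and low-frequency feature values among the feature values according to those degrees;
determining, according to the high-frequency feature values and the low-frequency feature values, the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed.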
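The degree-based split can be sketched as follows. In the first bipartite graph, the degree of a feature-value node is simply the number of users that have that value. The single threshold `high_degree`, and treating everything below it as low-frequency, are assumptions for the example; the source only says the split is made according to node degree.

```python
from collections import Counter


def split_by_degree(user_features, high_degree):
    """Split each user's feature values into high- and low-frequency sets.

    user_features: {user: set of feature values}. The degree of a feature
    node equals the number of users connected to it.
    high_degree: assumed cut-off; degree >= high_degree counts as high.
    """
    degree = Counter(v for values in user_features.values() for v in values)
    high = {v for v, d in degree.items() if d >= high_degree}
    per_user = {
        user: {"high": values & high, "low": values - high}
        for user, values in user_features.items()
    }
    return degree, per_user


user_features = {
    "u1": {"city=HZ", "dev=1"},
    "u2": {"city=HZ", "dev=2"},
    "u3": {"city=HZ", "dev=2"},
}
degree, per_user = split_by_degree(user_features, high_degree=3)
```

Here `city=HZ` is shared by all three users and is classed as high-frequency, while the device values are low-frequency.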
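Optionally, when the computer-executable instructions stored on the storage medium are executed by a processor, mining the maximal frequent itemsets according to the high-frequency feature values of each user to be analyzed and the preset frequent-itemset mining strategy, and obtaining the low-frequency maximal frequent feature values in the maximal frequent itemsets, includes:
mining, according to the high-frequency feature values of each user to be analyzed and in combination with the FP-Growth method, frequent multi-item sets whose support satisfies a preset support, and determining maximal frequent itemsets among the frequent multi-item sets;
matching the feature values of each user to be analyzed against the maximal frequent feature values in the maximal frequent itemsets to obtain the maximal frequent feature values of each user to be analyzed;
determining low-frequency maximal frequent feature values among the maximal frequent feature values of the users to be analyzed.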
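To make "maximal frequent itemset" concrete, the sketch below enumerates frequent itemsets level by level and keeps those with no frequent proper superset, then matches them against each user. Note that this is a brute-force stand-in and not FP-Growth: it yields the same frequent itemsets on small inputs but does not build an FP-tree and will not scale. The function names, the absolute-count support, and the minimum size of two items (for "multi-item" sets) are assumptions.

```python
from itertools import combinations


def maximal_frequent_itemsets(transactions, min_support, min_size=2):
    """Brute-force enumeration (NOT FP-Growth) of maximal frequent itemsets.

    transactions: list of sets of high-frequency feature values, one per user.
    min_support: minimum number of users that must contain the itemset.
    """
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for size in range(min_size, len(items) + 1):
        level = [
            frozenset(c)
            for c in combinations(items, size)
            if sum(1 for t in transactions if set(c) <= t) >= min_support
        ]
        if not level:
            break  # no frequent set of this size, so none of a larger size
        frequent.extend(level)
    # Maximal: no frequent itemset is a proper superset.
    return [s for s in frequent if not any(s < other for other in frequent)]


def match_users(user_high, maximal):
    """For each user, list the maximal frequent itemsets fully contained."""
    return {u: [s for s in maximal if s <= vals] for u, vals in user_high.items()}


user_high = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "c"},
    "u3": {"a", "b"},
    "u4": {"d"},
}
maximal = maximal_frequent_itemsets(list(user_high.values()), min_support=2)
matched = match_users(user_high, maximal)
```

In the sample, `{a, b}`, `{a, c}`, `{b, c}` and `{a, b, c}` are all frequent, but only `{a, b, c}` is maximal, and it matches `u1` and `u2`.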
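Optionally, when the computer-executable instructions stored on the storage medium are executed by a processor, determining the low-frequency maximal frequent feature values among the maximal frequent feature values of the users to be analyzed includes:
constructing a second bipartite graph according to the maximal frequent feature values of each user to be analyzed, wherein the second bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the maximal frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximal frequent feature values;
obtaining, in the second bipartite graph, the degree of the node corresponding to each maximal frequent feature value, and determining the low-frequency maximal frequent feature values among the maximal frequent feature values according to those degrees.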
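A minimal sketch of this second degree filter follows, with each maximal frequent itemset treated as one node of the second bipartite graph. The upper bound `max_degree` and its direction (degree at or below the bound counts as low-frequency) are assumptions, as are the function name and input shape.

```python
from collections import Counter


def low_frequency_maximal(user_maximal, max_degree):
    """Keep, per user, the maximal frequent itemsets whose node degree is low.

    user_maximal: {user: set of frozenset itemsets matched to that user}.
    max_degree: assumed cut-off; an itemset node linked to at most this many
    users is treated as a low-frequency maximal frequent feature value.
    """
    # Edge list of the second bipartite graph: (user node, itemset node).
    edges = [(u, s) for u, sets in user_maximal.items() for s in sets]
    degree = Counter(s for _, s in edges)
    low = {s for s, d in degree.items() if d <= max_degree}
    return {u: {s for s in sets if s in low} for u, sets in user_maximal.items()}


ab = frozenset({"a", "b"})
cd = frozenset({"c", "d"})
user_maximal = {"u1": {ab, cd}, "u2": {ab, cd}, "u3": {ab}, "u4": {ab}}
low_only = low_frequency_maximal(user_maximal, max_degree=2)
```

Here `ab` is linked to four users and is dropped, while `cd` is linked to two users and is kept for `u1` and `u2`.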
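Optionally, when the computer-executable instructions stored on the storage medium are executed by a processor, determining the abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
deleting, from the target bipartite graph, edges whose weight is less than a first preset weight to obtain a bipartite graph to be clustered, applying a connectivity algorithm to the bipartite graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed that correspond to the nodes in each maximal connected subgraph as one abnormal group; or
deleting, from the target bipartite graph, edges whose weight is less than the first preset weight to obtain a bipartite graph to be clustered, partitioning the nodes in the bipartite graph to be clustered by a community detection algorithm to obtain a plurality of node sets, and determining the users to be analyzed that correspond to the nodes in each node set as one abnormal group.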
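Optionally, when the computer-executable instructions stored on the storage medium are executed by a processor, determining the abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph includes:
calculating a weight between any two of the users to be analyzed according to the weights of the edges in the target bipartite graph;
converting each user to be analyzed into a node, setting an edge between any two nodes, and setting the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct a target clustering graph;
determining the abnormal group among the users to be analyzed from the clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph.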
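Optionally, when the computer-executable instructions stored on the storage medium are executed by a processor, determining the abnormal group among the users to be analyzed from the clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph includes:
deleting, from the target clustering graph, edges whose weight is less than a second preset weight to obtain a graph to be clustered, applying a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed that correspond to the nodes in each maximal connected subgraph as one abnormal group; or
deleting, from the target clustering graph, edges whose weight is less than the second preset weight to obtain a graph to be clustered, partitioning the graph to be clustered by a community detection algorithm to obtain a plurality of node sets, and determining the users to be analyzed that correspond to each node set as one abnormal group.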
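When executed by a processor, the computer-executable instructions stored on the storage medium in the embodiments of this application mine maximal frequent itemsets by applying a preset frequent-itemset mining strategy to the high-frequency feature values of each user to be analyzed, and obtain the low-frequency maximal frequent feature values in the maximal frequent itemsets, so as to mine the behavior sequences of the users to be analyzed and thereby identify abnormal groups more accurately. In addition, abnormal groups are obtained merely by acquiring the low-frequency feature values and low-frequency maximal frequent feature values of each user to be analyzed, constructing a target bipartite graph from them, defining the weights of the edges in the target bipartite graph, and performing graph clustering on the target bipartite graph according to those weights; the steps are simple and easy to carry out.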
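In the 1990s, an improvement to a technology could be clearly classified as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program a digital system themselves to "integrate" it onto a single PLD, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, this programming is now mostly done with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing the logical method flow can easily be obtained simply by describing the method flow in one of the above hardware description languages and programming it into an integrated circuit.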
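A controller may be implemented in any suitable manner. For example, a controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by that (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. Alternatively, the means for implementing various functions may be regarded as both software modules implementing the method and structures within the hardware component.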
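The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of units divided by function. Of course, when implementing this application, the functions of the units may be implemented in one or more pieces of software and/or hardware.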
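Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.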
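In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.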
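It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element introduced by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.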
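Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the method embodiments.
The above are merely embodiments of this application and are not intended to limit it. Various modifications and changes to this application are possible for those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the scope of its claims.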

Claims (11)

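  1. An abnormal group identification method, characterized by comprising:
    acquiring the feature values of each of a plurality of users to be analyzed;
    determining high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
    mining maximal frequent itemsets according to the high-frequency feature values of each user to be analyzed and a preset frequent-itemset mining strategy, and obtaining low-frequency maximal frequent feature values in the maximal frequent itemsets;
    constructing a target bipartite graph according to the low-frequency maximal frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and defining weights of the edges in the target bipartite graph;
    determining an abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and a clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.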
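  2. The abnormal group identification method according to claim 1, characterized in that acquiring the feature values of each of the plurality of users to be analyzed comprises:
    acquiring raw personal data of the plurality of users to be analyzed;
    discretizing the raw personal data of the plurality of users to be analyzed to obtain the feature values of each user to be analyzed.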
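  3. The abnormal group identification method according to claim 1, characterized in that determining the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed comprises:
    constructing a first bipartite graph according to the feature values of each user to be analyzed, wherein the first bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's feature values;
    obtaining, in the first bipartite graph, the degree of the node corresponding to each feature value, and determining high-frequency feature values and low-frequency feature values among the feature values according to those degrees;
    determining, according to the high-frequency feature values and the low-frequency feature values, the high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed.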
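  4. The abnormal group identification method according to claim 1, characterized in that mining the maximal frequent itemsets according to the high-frequency feature values of each user to be analyzed and the preset frequent-itemset mining strategy, and obtaining the low-frequency maximal frequent feature values in the maximal frequent itemsets, comprises:
    mining, according to the high-frequency feature values of each user to be analyzed and in combination with the FP-Growth method, frequent multi-item sets whose support satisfies a preset support, and determining maximal frequent itemsets among the frequent multi-item sets;
    matching the feature values of each user to be analyzed against the maximal frequent feature values in the maximal frequent itemsets to obtain the maximal frequent feature values of each user to be analyzed;
    determining low-frequency maximal frequent feature values among the maximal frequent feature values of the users to be analyzed.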
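  5. The abnormal group identification method according to claim 4, characterized in that determining the low-frequency maximal frequent feature values among the maximal frequent feature values of the users to be analyzed comprises:
    constructing a second bipartite graph according to the maximal frequent feature values of each user to be analyzed, wherein the second bipartite graph includes nodes corresponding to the users to be analyzed, nodes corresponding to the maximal frequent feature values, and edges between the node corresponding to each user to be analyzed and the nodes corresponding to that user's maximal frequent feature values;
    obtaining, in the second bipartite graph, the degree of the node corresponding to each maximal frequent feature value, and determining the low-frequency maximal frequent feature values among the maximal frequent feature values according to those degrees.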
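  6. The abnormal group identification method according to claim 1, characterized in that determining the abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph comprises:
    deleting, from the target bipartite graph, edges whose weight is less than a first preset weight to obtain a bipartite graph to be clustered, applying a connectivity algorithm to the bipartite graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed that correspond to the nodes in each maximal connected subgraph as one abnormal group; or
    deleting, from the target bipartite graph, edges whose weight is less than the first preset weight to obtain a bipartite graph to be clustered, partitioning the nodes in the bipartite graph to be clustered by a community detection algorithm to obtain a plurality of node sets, and determining the users to be analyzed that correspond to the nodes in each node set as one abnormal group.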
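  7. The abnormal group identification method according to claim 1, characterized in that determining the abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and the clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph comprises:
    calculating a weight between any two of the users to be analyzed according to the weights of the edges in the target bipartite graph;
    converting each user to be analyzed into a node, setting an edge between any two nodes, and setting the weight of the edge between any two nodes to the weight between the corresponding two users to be analyzed, so as to construct a target clustering graph;
    determining the abnormal group among the users to be analyzed from the clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph.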
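  8. The abnormal group identification method according to claim 7, characterized in that determining the abnormal group among the users to be analyzed from the clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target clustering graph comprises:
    deleting, from the target clustering graph, edges whose weight is less than a second preset weight to obtain a graph to be clustered, applying a connectivity algorithm to the graph to be clustered to obtain at least one maximal connected subgraph, and determining the users to be analyzed that correspond to the nodes in each maximal connected subgraph as one abnormal group; or
    deleting, from the target clustering graph, edges whose weight is less than the second preset weight to obtain a graph to be clustered, partitioning the graph to be clustered by a community detection algorithm to obtain a plurality of node sets, and determining the users to be analyzed that correspond to each node set as one abnormal group.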
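  9. An abnormal group identification apparatus, characterized by comprising:
    an acquisition module, configured to acquire the feature values of each of a plurality of users to be analyzed;
    a determination module, configured to determine high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
    a mining module, configured to mine maximal frequent itemsets according to the high-frequency feature values of each user to be analyzed and a preset frequent-itemset mining strategy, and to obtain low-frequency maximal frequent feature values in the maximal frequent itemsets;
    a construction module, configured to construct a target bipartite graph according to the low-frequency maximal frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and to define weights of the edges in the target bipartite graph;
    a clustering module, configured to determine an abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and a clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.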
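  10. An abnormal group identification device, characterized by comprising:
    a processor; and
    a memory arranged to store computer-executable instructions that, when executed, cause the processor to:
    acquire the feature values of each of a plurality of users to be analyzed;
    determine high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
    mine maximal frequent itemsets according to the high-frequency feature values of each user to be analyzed and a preset frequent-itemset mining strategy, and obtain low-frequency maximal frequent feature values in the maximal frequent itemsets;
    construct a target bipartite graph according to the low-frequency maximal frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and define weights of the edges in the target bipartite graph;
    determine an abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and a clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.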
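  11. A storage medium for storing computer-executable instructions, characterized in that the computer-executable instructions, when executed, implement the following flow:
    acquiring the feature values of each of a plurality of users to be analyzed;
    determining high-frequency feature values and low-frequency feature values among the feature values of each user to be analyzed;
    mining maximal frequent itemsets according to the high-frequency feature values of each user to be analyzed and a preset frequent-itemset mining strategy, and obtaining low-frequency maximal frequent feature values in the maximal frequent itemsets;
    constructing a target bipartite graph according to the low-frequency maximal frequent feature values and the low-frequency feature values among the feature values of each user to be analyzed, and defining weights of the edges in the target bipartite graph;
    determining an abnormal group among the users to be analyzed according to the weights of the edges in the target bipartite graph and a clustering result of the plurality of users to be analyzed obtained by performing graph clustering on the target bipartite graph.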
PCT/CN2019/126030 2019-01-17 2019-12-17 Abnormal group identification method and device WO2020147488A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910045152.6A CN109948641B (zh) 2019-01-17 2019-01-17 Abnormal group identification method and device
CN201910045152.6 2019-01-17

Publications (1)

Publication Number Publication Date
WO2020147488A1 (zh) 2020-07-23

Family

ID=67006647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126030 WO2020147488A1 (zh) 2019-01-17 2019-12-17 Abnormal group identification method and device

Country Status (3)

Country Link
CN (1) CN109948641B (zh)
TW (1) TWI718643B (zh)
WO (1) WO2020147488A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
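CN112560961A (zh) * 2020-12-17 2021-03-26 中国平安人寿保险股份有限公司 Target identification method and apparatus based on graph clustering, electronic device, and storage medium
CN112581062A (zh) * 2020-12-25 2021-03-30 同方威视科技江苏有限公司 Method for discovering express-parcel sending and receiving organizations based on relationship mining, and related device
CN116244650A (zh) * 2023-05-12 2023-06-09 北京富算科技有限公司 Feature binning method and apparatus, electronic device, and computer-readable storage medium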

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
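CN109948641B (zh) * 2019-01-17 2020-08-04 阿里巴巴集团控股有限公司 Abnormal group identification method and device
CN110602101B (zh) * 2019-09-16 2021-01-01 北京三快在线科技有限公司 Method, apparatus, device, and storage medium for determining abnormal network groups
CN110609783B (zh) * 2019-09-24 2023-08-04 京东科技控股股份有限公司 Method and apparatus for identifying users with abnormal behavior
CN110880040A (zh) * 2019-11-08 2020-03-13 支付宝(杭州)信息技术有限公司 Method and system for automatically generating cumulative features
CN111160917A (zh) * 2019-12-18 2020-05-15 北京三快在线科技有限公司 Object state detection method and apparatus, electronic device, and readable storage medium
CN111371767B (zh) * 2020-02-20 2022-05-13 深圳市腾讯计算机系统有限公司 Malicious account identification method, malicious account identification apparatus, medium, and electronic device
CN111770047B (zh) * 2020-05-07 2022-09-23 拉扎斯网络科技(上海)有限公司 Abnormal group detection method, apparatus, and device
CN111931048B (zh) * 2020-07-31 2022-07-08 平安科技(深圳)有限公司 Artificial-intelligence-based method for detecting black-market accounts, and related apparatus
CN112529639B (zh) * 2020-12-23 2024-09-20 中国银联股份有限公司 Abnormal account identification method, apparatus, device, and medium
CN112968870B (zh) * 2021-01-29 2024-09-13 国家计算机网络与信息安全管理中心 Network gang discovery method based on frequent itemsets
CN113761080B (zh) * 2021-04-01 2024-07-19 京东城市(北京)数字科技有限公司 Community division method, apparatus, device, and storage medium
CN114117418B (zh) * 2021-11-03 2023-03-14 中国电信股份有限公司 Method, system, device, and storage medium for detecting abnormal accounts based on communities
CN114662110B (zh) * 2022-05-18 2022-09-02 杭州海康威视数字技术股份有限公司 Website detection method and apparatus, and electronic device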

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140119652A1 (en) * 2011-08-30 2014-05-01 Intellectual Ventures Fund 83 Llc Detecting recurring themes in consumer image collections
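CN104573116A (zh) * 2015-02-05 2015-04-29 哈尔滨工业大学 Traffic anomaly identification method based on taxi GPS data mining
CN105681312A (zh) * 2016-01-28 2016-06-15 李青山 Method for detecting abnormal mobile Internet users based on frequent itemset mining
CN105959372A (zh) * 2016-05-06 2016-09-21 华南理工大学 Internet user data analysis method based on mobile applications
CN109948641A (zh) * 2019-01-17 2019-06-28 阿里巴巴集团控股有限公司 Abnormal group identification method and device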

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8719190B2 (en) * 2007-07-13 2014-05-06 International Business Machines Corporation Detecting anomalous process behavior
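CN103812872B (zh) * 2014-02-28 2016-11-23 中国科学院信息工程研究所 Method and system for detecting paid-poster ("Internet water army") behavior based on a mixture Dirichlet process
CN103927398B (zh) * 2014-05-07 2016-12-28 中国人民解放军信息工程大学 Method for discovering microblog hype groups based on maximal frequent itemset mining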
TW201612790A (en) * 2014-09-29 2016-04-01 Chunghwa Telecom Co Ltd Method of increasing effectiveness of information security risk assessment and risk recognition
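CN107870934B (zh) * 2016-09-27 2021-07-20 武汉安天信息技术有限责任公司 App user clustering method and apparatus
CN107391548B (zh) * 2017-04-06 2020-08-04 华东师范大学 Method and system for detecting ranking-manipulation user groups in mobile application markets
CN107332931A (zh) * 2017-08-07 2017-11-07 合肥工业大学 Method and apparatus for identifying machine-type forum paid posters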

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
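CN112560961A (zh) * 2020-12-17 2021-03-26 中国平安人寿保险股份有限公司 Target identification method and apparatus based on graph clustering, electronic device, and storage medium
CN112560961B (zh) * 2020-12-17 2024-04-26 中国平安人寿保险股份有限公司 Target identification method and apparatus based on graph clustering, electronic device, and storage medium
CN112581062A (zh) * 2020-12-25 2021-03-30 同方威视科技江苏有限公司 Method for discovering express-parcel sending and receiving organizations based on relationship mining, and related device
CN116244650A (zh) * 2023-05-12 2023-06-09 北京富算科技有限公司 Feature binning method and apparatus, electronic device, and computer-readable storage medium
CN116244650B (zh) * 2023-05-12 2023-10-03 北京富算科技有限公司 Feature binning method and apparatus, electronic device, and computer-readable storage medium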

Also Published As

Publication number Publication date
TW202029079A (zh) 2020-08-01
CN109948641A (zh) 2019-06-28
CN109948641B (zh) 2020-08-04
TWI718643B (zh) 2021-02-11

Similar Documents

Publication Publication Date Title
WO2020147488A1 (zh) Abnormal group identification method and device
KR102178295B1 (ko) Decision model construction method and apparatus, computer device, and storage medium
US10504120B2 (en) Determining a temporary transaction limit
US10216558B1 (en) Predicting drive failures
US20200034749A1 (en) Training corpus refinement and incremental updating
WO2018161900A1 (zh) Method and apparatus for automatically processing risk control events
US20220229854A1 (en) Constructing ground truth when classifying data
TW201915777A (zh) Financial unstructured text analysis system and method
CN113837635A (zh) Risk detection processing method, apparatus, and device
Zhang et al. An affinity propagation clustering algorithm for mixed numeric and categorical datasets
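WO2021120845A1 (zh) Method, apparatus, device, and medium for generating a feature set of homogeneous risk units
CN108229564B (zh) Data processing method, apparatus, and device
CN112257959A (zh) User risk prediction method and apparatus, electronic device, and storage medium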
US10353927B2 (en) Categorizing columns in a data table
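CN115563268A (zh) Text summary generation method and apparatus, electronic device, and storage medium
CN109902129B (zh) Insurance agent classification method based on big data analysis, and related device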
US10216792B2 (en) Automated join detection
US20170337486A1 (en) Feature-set augmentation using knowledge engine
CN114357184A (zh) Item recommendation method and related apparatus, electronic device, and storage medium
US11048730B2 (en) Data clustering apparatus and method based on range query using CF tree
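CN110059272B (zh) Page feature identification method and apparatus
CN110941719B (zh) Data classification method, testing method, apparatus, and storage medium
CN110781309A (zh) Pattern-matching-based method for computing the similarity of entity coordinate relations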
US11500933B2 (en) Techniques to generate and store graph models from structured and unstructured data in a cloud-based graph database system
CN114579573B (zh) Information retrieval method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19909738

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19909738

Country of ref document: EP

Kind code of ref document: A1