CN113127491B - Flow graph dividing system based on correlation characteristics - Google Patents

Flow graph dividing system based on correlation characteristics

Publication number
CN113127491B
Authority
CN
China
Prior art keywords
data
edge
stream
point
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110468957.9A
Other languages
Chinese (zh)
Other versions
CN113127491A (en
Inventor
王新根
陈伟
唐迪佳
杨运平
黄文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Bangsheng Real Time Intelligent Technology Co ltd
Original Assignee
Shenzhen Bangsheng Real Time Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Bangsheng Real Time Intelligent Technology Co ltd
Priority to CN202110468957.9A
Publication of CN113127491A
Application granted
Publication of CN113127491B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Abstract

The invention discloses a flow graph partitioning system based on association features, comprising a data analysis module, a data rearrangement module, a point element storage module, an edge element storage module and a data navigation module. The data analysis module parses the transaction data stream into the data format of an association graph and generates a point stream and an edge stream; the data rearrangement module shuffles the edge stream data, reducing the influence of the particular ordering of the transaction data on the subsequent partitioning algorithm; the data navigation module selects a suitable storage location for each piece of edge stream data; and the edge element storage module and the point element storage module write the partitioned edge stream data and point stream data into a database. The system optimizes the partitioning of association graph data according to the characteristics of the transaction data stream and improves the performance of graph analysis tasks executed afterwards.

Description

Flow graph dividing system based on correlation characteristics
Technical Field
The invention belongs to the field of transaction anti-fraud, and particularly relates to a flow graph partitioning system based on association features, suitable for the distributed storage and analysis of streaming transaction data.
Background
In the field of transaction anti-fraud, data such as customer information and transaction records is often modeled as a graph structure to construct an association graph. For example, an association graph may be constructed with bank card numbers as nodes and transfers between bank cards as edges. Some of the bank card nodes can be marked as abnormal accounts, and analyses such as risk assessment can then be carried out on the unmarked accounts based on the association relationships expressed by the graph.
Common algorithms for association graph analysis include graph traversal, community discovery, loop detection, connectivity detection, and the like. In practice, these analysis algorithms are typically implemented on a distributed graph computation framework. Mainstream graph computation frameworks adopt a Pregel-like message propagation model, whose computational cost can be approximated by the volume of communication between distributed nodes. Therefore, optimizing the way the graph data is partitioned reduces the communication load of the distributed graph computation framework and improves overall computing performance.
In real-world applications, large amounts of data pour into the system as a data stream. For example, a single transaction is one piece of data, with attributes such as borrower, lender, time, platform and geographic location. The system builds a graph from this data according to preset meta-rules, for example by defining the borrower and the lender as two nodes of the graph and creating an edge between them, with the nodes and the edge recording the remaining information of the transaction as attributes. Typically, node attributes hold information that is fixed for the transaction entity, such as bank card number, address of the account-opening bank, mobile phone number used to open the account and identification number, while edge attributes hold information specific to a single transaction, such as time, platform and amount.
As the above example shows, the transaction data stream is large in volume and complex in format. Constructing the association graph from the transaction data stream in real time, and partitioning the graph as efficiently as possible while keeping the storage and compute nodes load-balanced, is a relatively difficult problem.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a flow graph partitioning system that constructs an association graph from a transaction data stream. The system runs before the association graph analysis tasks, partitions the association graph in real time and keeps the nodes load-balanced.
The purpose of the invention is achieved by the following technical scheme: a flow graph partitioning system based on association features comprises a data analysis module, a data rearrangement module, a point element storage module, an edge element storage module and a data navigation module.
The data analysis module parses the original transaction data stream received by the system and generates the point data stream and edge data stream of the association graph, called the point stream and the edge stream for short. Specifically, meta-rules define the format of point data, which comprises a primary attribute and non-primary attributes, and the format of edge data, which comprises the primary attributes of the two endpoints and the non-primary attributes of the edge. The primary attribute serves as the unique identifier of a point, and the non-primary attributes serve as its attribute description. The point stream data and edge stream data generated from the original transaction data stream are passed to the data rearrangement module.
The data rearrangement module shuffles the edge stream data according to a fixed rule, reducing the interference of the particular ordering of the transaction data stream with the flow graph partitioning algorithm. The module provides a data accumulation queue of preset size [formula image 001]. Edge stream data sent from the upstream data analysis module is first appended to the tail of the data accumulation queue and then swapped with a randomly chosen item in the queue. Each piece of edge stream data carries a timestamp recording when it entered the queue; once its residence time in the queue exceeds a preset value [formula image 002], it is pushed directly to the downstream data navigation module. When the size of the data accumulation queue exceeds the preset size [formula image 001], the item at the head of the queue is pushed to the downstream data navigation module. Only edge stream data enters the data accumulation queue and is eventually passed to the data navigation module; point stream data is pushed directly to the point element storage module.
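The queue behaviour described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the class name `ShuffleQueue`, the parameters `max_size` and `max_wait`, and the choice to check expiry on every push are all assumptions made for the example.

```python
import random
import time
from collections import namedtuple

# One buffered edge plus the time it entered the queue.
Entry = namedtuple("Entry", ["edge", "enqueued_at"])


class ShuffleQueue:
    """Bounded buffer that lightly shuffles edges before releasing them (hypothetical sketch)."""

    def __init__(self, max_size, max_wait, rng=None, clock=time.monotonic):
        self.max_size = max_size      # preset queue size
        self.max_wait = max_wait      # preset maximum residence time, in seconds
        self.rng = rng or random.Random()
        self.clock = clock
        self.items = []

    def push(self, edge):
        """Add one edge and return the edges released downstream by this call."""
        now = self.clock()
        self.items.append(Entry(edge, now))
        # Swap the newly appended tail element with a randomly chosen position.
        k = self.rng.randrange(len(self.items))
        self.items[k], self.items[-1] = self.items[-1], self.items[k]
        # Release every entry whose residence time exceeds max_wait.
        released = [e.edge for e in self.items if now - e.enqueued_at > self.max_wait]
        self.items = [e for e in self.items if now - e.enqueued_at <= self.max_wait]
        # Release from the head while the queue is larger than its preset size.
        while len(self.items) > self.max_size:
            released.append(self.items.pop(0).edge)
        return released
```

Whatever `push` returns would be handed to the data navigation module; a production version would presumably also flush expired entries from a timer rather than only when new data arrives.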
The data navigation module determines the specific storage location of the edge stream data. The overall system is defined to contain [formula image 003] edge element storage partitions for storing edge stream data, numbered [formula image 004]. For each partition [formula image 005], a local Bloom filter [formula image 006] is maintained to record whether the partition contains an endpoint of a given piece of edge stream data. The j-th piece of edge stream data is denoted [formula image 007], where [formula image 008] and [formula image 009] are the primary attributes of its start point and end point respectively, [formula image 010] are the non-primary attributes of its start point and end point respectively, and [formula image 011] is the non-primary attribute of the edge itself. For partition [formula image 005], an objective function [formula image 012] is designed so that endpoints of edge stream data with associated features are placed in the same partition. The partition that maximizes the objective function is selected, that is, the partition numbered [formula image 013], and the edge stream data [formula image 014] is pushed to the edge element storage module of partition [formula image 015]. If several partitions [formula image 015] attain the maximum at the same time, one of them is selected at random.
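The selection rule (take the partition with the highest objective value, breaking ties at random) can be sketched as follows. The objective function itself is passed in as a callable; the `toy_score` helper is a stand-in invented for this example and is not the patent's formula. It uses plain Python sets in place of the local Bloom filters and assumes zero-based partition numbers.

```python
import random


def choose_partition(edge, num_partitions, score, rng=None):
    """Return the index of a partition that maximizes score(i, edge); ties are broken at random."""
    rng = rng or random.Random()
    scores = [score(i, edge) for i in range(num_partitions)]
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if s == best]
    return rng.choice(candidates)


def toy_score(partition_endpoints, partition_sizes, capacity):
    """Build a hypothetical objective: endpoint locality plus a simple balance bonus."""
    def score(i, edge):
        src, dst = edge
        # Count how many endpoints of the edge already live in partition i.
        locality = (src in partition_endpoints[i]) + (dst in partition_endpoints[i])
        # Emptier partitions receive a larger bonus.
        balance = 1.0 - partition_sizes[i] / capacity
        return locality + balance
    return score
```

For instance, with three empty partitions except that partition 1 already holds endpoint "0001", the edge ("0001", "0002") is routed to partition 1.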
The point element storage module stores the point stream data in a global key-value lookup table. For point stream data whose primary attribute is [formula image 016], [formula image 017] is used as the primary key, and the primary and non-primary attributes of the point stream data are written into the HBase database. If a non-primary attribute is duplicated, the latest version is retained.
The edge element storage module stores the edge stream data in a local key-value lookup table. Edge stream data with start point [formula image 008] and end point [formula image 009] is written into the HBase database with [formula image 018] as the primary key. [formula image 019] denotes converting a string [formula image 020] of arbitrary length into a string of fixed length [formula image 021]; by default, [formula image 021] is the maximum length that can represent a node's primary attribute. [formula image 022] is the string concatenation operation. HBase is divided into regions according to the [formula image 003] edge element storage partitions, and the region used for storage is designated according to the partitioning result of the data navigation module.
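As an illustration of a fixed-length, concatenated edge key, the sketch below pads short primary attributes and hashes long ones. The exact key layout in the patent is given only as a formula image, so the padding character, the use of SHAKE-128 for over-long strings, and the function names are all assumptions made for this example.

```python
import hashlib


def fixed_length(s, length):
    """Map a string of any length to a string of exactly `length` characters (assumed scheme)."""
    if len(s) <= length:
        # Short enough: left-pad with zeros so the original value stays readable.
        return s.rjust(length, "0")
    # Too long: fall back to a hash truncated to the requested length.
    digest = hashlib.shake_128(s.encode("utf-8")).hexdigest((length + 1) // 2)
    return digest[:length]


def edge_row_key(start, end, length):
    """Concatenate the fixed-length forms of the start and end primary attributes."""
    return fixed_length(start, length) + fixed_length(end, length)
```

Because both halves have the same width, all edges that share a start point share a key prefix, which suits prefix scans in a sorted key-value store such as HBase.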
Furthermore, in the data analysis module, each transaction datum is a key-value dictionary represented in JSON format, recording in detail the information available when the transaction occurred.
Further, the data analysis module generates the point stream data and the edge stream data according to predefined meta-rules as follows:
a) For each point data format defined in the meta-rules, with primary attribute [formula image 023], check whether the transaction data contains the [formula image 023] field. If it does, generate point stream data whose primary attribute value is the [formula image 024] value of the transaction data, and push it into the point stream. The non-primary attributes of the point are defined by the meta-rules and taken from the transaction data; any that are absent are ignored.
b) For each edge data format defined in the meta-rules, with the primary attributes of its two endpoints being [formula image 025] and [formula image 026], check whether the transaction data contains the [formula image 025] field and the [formula image 026] field. Only when both fields are present is one piece of edge stream data generated and pushed into the edge stream. The other attributes of the edge stream data are defined by the meta-rules and taken from the transaction data; any that are absent are ignored.
c) Depending on the specific meta-rule definitions, a single transaction datum may be parsed into several pieces of flow graph data, such as a single point, two points, or two points and one edge.
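Rules a) to c) can be sketched as a small parsing function. The rule dictionaries (`entry_fields`, `primary`, `non_primary`, `start`, `end`) and the sample field names are hypothetical shapes chosen for the illustration; the patent does not prescribe this layout.

```python
def parse_transaction(tx, point_rule, edge_rule):
    """Split one transaction dictionary into point records and edge records."""
    points, edges = [], []
    pk = point_rule["primary"]
    # Rule a): emit a point for every entry field that carries the primary attribute.
    for field in point_rule["entry_fields"]:
        entity = tx.get(field)
        if isinstance(entity, dict) and pk in entity:
            point = {pk: entity[pk]}
            for name in point_rule["non_primary"]:
                if name in entity:          # absent non-primary attributes are ignored
                    point[name] = entity[name]
            points.append(point)
    # Rule b): emit an edge only when both endpoints are present.
    src = tx.get(edge_rule["start"])
    dst = tx.get(edge_rule["end"])
    if isinstance(src, dict) and isinstance(dst, dict) and pk in src and pk in dst:
        edge = {"start": src[pk], "end": dst[pk]}
        for name in edge_rule["non_primary"]:
            if name in tx:
                edge[name] = tx[name]
        edges.append(edge)
    return points, edges


point_rule = {"entry_fields": ["borrower", "lender"], "primary": "id",
              "non_primary": ["card number", "account opening site"]}
edge_rule = {"start": "borrower", "end": "lender",
             "non_primary": ["transaction amount", "transaction time"]}
```

A transaction that contains only one of the two parties yields a single point and no edge, which corresponds to case c).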
Further, the data navigation module maintains two globally distributed key-value storage structures, [formula image 027] and [formula image 028]. [formula image 027] is a distributed hash table that stores a mapping from an arbitrary string to a 64-bit positive integer value. [formula image 028] is a global Bloom filter used to determine whether an arbitrary string exists. [formula image 027] and [formula image 028] are implemented with Redis deployed in Cluster mode.
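For readers unfamiliar with Bloom filters, the following is a minimal single-process sketch of the membership structure. It is not the Redis-backed implementation the text describes; the bit-array size, the number of hash functions and the use of salted BLAKE2b are choices made only for this example.

```python
import hashlib


class BloomFilter:
    """In-memory Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # A different salt per seed gives independent hash functions.
            digest = hashlib.blake2b(data, digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

The counting structure that accompanies the filter can be pictured as an ordinary dictionary from strings to integers; distributing both across a cluster is what the Redis deployment provides.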
Further, in the data navigation module, the partitioning objective function [formula image 012] consists of two parts. The first part penalizes unbalanced data partitioning; its formula is [formula image 029], where [formula image 030] and [formula image 031] are hyperparameters, the amount of data stored in partition [formula image 005] is [formula image 032], the maximum data capacity of each partition is [formula image 033], and the minimum data capacity is [formula image 034]. The second part optimizes the locality of data partitioning; its formula is [formula image 035]. In this formula, the function [formula image 036] [formula image 037] is a Bloom filter: when point [formula image 016] exists in partition [formula image 005], [formula image 038]; otherwise it returns [formula image 039]. [formula image 040] and [formula image 041] are hyperparameters, and [formula image 042] is the degree of node [formula image 016] in the flow graph written so far. [formula image 043] evaluates the degree of matching between the node non-primary attribute set [formula image 044] and partition [formula image 005]; specifically, for each non-primary attribute [formula image 045] in the set, [formula image 046] is computed, where [formula image 022] is the string concatenation operator. [formula image 028] is the global Bloom filter, which returns [formula image 048] when the passed-in argument exists and [formula image 039] otherwise, and [formula image 027] is a distributed hash table that disregards key collisions. Finally, [formula image 049] [formula image 050] is obtained.
Further, in the edge element storage module, the attributes of repeated edges may be merged using a business-specific aggregation function to reduce storage space.
The invention has the following beneficial effects: by exploiting the spatial and temporal locality of transaction entities, transaction data with similar features is preferentially stored in the same partition, which reduces the copying of point data while offline analysis tasks run, thereby lowering the communication load and improving overall efficiency. The invention determines the storage partition of each transaction record in constant time, independent of the size of the graph data set; it can construct an association graph from a transaction data stream in real time and write it to the storage modules in a way that satisfies the graph partitioning optimization.
Drawings
FIG. 1 is a block diagram of the system;
FIG. 2 is a data navigation module service flow diagram.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in fig. 1, the invention relates to a flow graph partitioning system based on association features. Seeing only the individual transaction data and limited statistical information, the system automatically partitions the association graph that is eventually generated within sub-linear time complexity, while meeting requirements such as load balancing and reducing the communication load of subsequent tasks. The system comprises a data analysis module, a data rearrangement module, a point element storage module, an edge element storage module and a data navigation module.
The data analysis module parses the original transaction data stream received by the system. Each transaction datum is a key-value dictionary represented in JSON format, recording in detail the information available when the transaction occurred. From the original transaction data stream, the module generates the point data stream and edge data stream of the association graph, called the point stream and the edge stream for short. Specifically, meta-rules define the format of point data, which comprises a primary attribute and non-primary attributes, and the format of edge data, which comprises the primary attributes of the two endpoints and the non-primary attributes of the edge. The primary attribute serves as the unique identifier of a point, and the non-primary attributes serve as its attribute description. The point stream data and edge stream data generated from the original transaction data stream are passed to the data rearrangement module. The data analysis module generates point stream data and edge stream data according to predefined meta-rules as follows:
a) For each point data format defined in the meta-rules, with primary attribute [formula image 023], check whether the transaction data contains the [formula image 023] field. If it does, generate point stream data whose primary attribute value is the [formula image 024] value of the transaction data, and push it into the point stream. The non-primary attributes of the point are defined by the meta-rules and taken from the transaction data; any that are absent are ignored.
b) For each edge data format defined in the meta-rules, with the primary attributes of its two endpoints being [formula image 025] and [formula image 026], check whether the transaction data contains the [formula image 025] field and the [formula image 026] field. Only when both fields are present is one piece of edge stream data generated and pushed into the edge stream. The other attributes of the edge stream data are defined by the meta-rules and taken from the transaction data; any that are absent are ignored.
c) Depending on the specific meta-rule definitions, a single transaction datum may be parsed into several pieces of flow graph data, such as a single point, two points, or two points and one edge.
The data rearrangement module shuffles the edge stream data according to a fixed rule, reducing the interference of the particular ordering of the transaction data stream with the flow graph partitioning algorithm. The module provides a data accumulation queue of preset size [formula image 001]. Edge stream data sent from the upstream data analysis module is first appended to the tail of the data accumulation queue and then swapped with a randomly chosen item in the queue. Each piece of edge stream data carries a timestamp recording when it entered the queue; once its residence time in the queue exceeds a preset value [formula image 002], it is pushed directly to the downstream data navigation module. When the size of the data accumulation queue exceeds the preset size [formula image 001], the item at the head of the queue is pushed to the downstream data navigation module. Only edge stream data enters the data accumulation queue and is eventually passed to the data navigation module; point stream data is pushed directly to the point element storage module.
As shown in fig. 2, the data navigation module determines the specific storage location of the edge stream data. The overall system is defined to contain [formula image 003] edge element storage partitions for storing edge stream data, numbered [formula image 004]. For each partition [formula image 005], a local Bloom filter [formula image 006] is maintained to record whether the partition contains an endpoint of a given piece of edge stream data. The j-th piece of edge stream data is denoted [formula image 007], where [formula image 008] and [formula image 009] are the primary attributes of its start point and end point respectively, [formula image 010] are the non-primary attributes of its start point and end point respectively, and [formula image 011] is the non-primary attribute of the edge itself. For partition [formula image 005], an objective function [formula image 012] is designed so that endpoints of edge stream data with associated features are placed in the same partition. The partition that maximizes the objective function is selected, that is, the partition numbered [formula image 013], and the edge stream data [formula image 014] is pushed to the edge element storage module of partition [formula image 015]. If several partitions [formula image 015] attain the maximum at the same time, one of them is selected at random.
The data navigation module maintains two globally distributed key-value storage structures, [formula image 027] and [formula image 028]. [formula image 027] is a distributed hash table that stores a mapping from an arbitrary string to a 64-bit positive integer value. [formula image 028] is a global Bloom filter used to determine whether an arbitrary string exists. [formula image 027] and [formula image 028] are implemented with Redis deployed in Cluster mode.
In the data navigation module, the partitioning objective function [formula image 012] consists of two parts. The first part penalizes unbalanced data partitioning; its formula is [formula image 029], where [formula image 030] and [formula image 031] are hyperparameters, the amount of data stored in partition [formula image 005] is [formula image 032], the maximum data capacity of each partition is [formula image 033], and the minimum data capacity is [formula image 034]. The second part optimizes the locality of data partitioning; its formula is [formula image 035]. In this formula, the function [formula image 036] [formula image 037] is a Bloom filter: when point [formula image 016] exists in partition [formula image 005], [formula image 038]; otherwise it returns [formula image 039]. [formula image 040] and [formula image 041] are hyperparameters, and [formula image 042] is the degree of node [formula image 016] in the flow graph written so far. [formula image 043] evaluates the degree of matching between the node non-primary attribute set [formula image 044] and partition [formula image 005]; specifically, for each non-primary attribute [formula image 045] in the set, [formula image 046] is computed, where [formula image 022] is the string concatenation operator. [formula image 028] is the global Bloom filter, which returns [formula image 048] when the passed-in argument exists and [formula image 039] otherwise, and [formula image 027] is a distributed hash table that disregards key collisions. Finally, [formula image 049] [formula image 050] is obtained.
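The attribute-matching step can be pictured with the sketch below. The `@`-separated key layout is modelled on the example string given in the embodiment; how the existence check and the occurrence count are combined is an assumption, since the actual formula survives only as an image. A Python set stands in for the global Bloom filter and a dictionary for the distributed hash table, and all function names are hypothetical.

```python
def attribute_key(partition, name, value, sep="@"):
    """Build the lookup string for one non-primary attribute in one partition."""
    return f"{partition}{sep}{name}{sep}{value}"


def match_degree(partition, attributes, seen, counts):
    """Sum the occurrence counts of the attributes already recorded for this partition."""
    total = 0
    for name, value in attributes.items():
        key = attribute_key(partition, name, value)
        if key in seen:                     # stand-in for the global Bloom filter check
            total += counts.get(key, 0)     # stand-in for the distributed hash table lookup
    return total


def record(partition, attributes, seen, counts):
    """Register the attributes of a node that has just been written to a partition."""
    for name, value in attributes.items():
        key = attribute_key(partition, name, value)
        seen.add(key)
        counts[key] = counts.get(key, 0) + 1
```

Checking the membership structure before the counter lookup avoids a hash table query for attribute values that have never been seen, which is presumably why the two structures are kept side by side.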
The point element storage module stores the point stream data in a global key-value lookup table. For point stream data whose primary attribute is [formula image 016], [formula image 017] is used as the primary key, and the primary and non-primary attributes of the point stream data are written into the HBase database. If a non-primary attribute is duplicated, the latest version is retained.
The edge element storage module stores the edge stream data in a local key-value lookup table. Edge stream data with start point [formula image 008] and end point [formula image 009] is written into the HBase database with [formula image 018] as the primary key. [formula image 019] denotes converting a string [formula image 020] of arbitrary length into a string of fixed length [formula image 051]; by default, [formula image 021] is the maximum length that can represent a node's primary attribute. [formula image 022] is the string concatenation operation. HBase is divided into regions according to the [formula image 003] edge element storage partitions, and the region used for storage is designated according to the partitioning result of the data navigation module. For repeated edges, a business-specific aggregation function may be used to merge the attributes of the edges and reduce storage space.
Example:
The invention provides a system that constructs an association graph from a transaction data stream, partitions the flow graph in real time, and finally writes the association graph into an underlying database. Following the order of the modules, the overall process comprises three steps: a) parsing the data and generating an out-of-order point stream and edge stream; b) partitioning the edge stream data and generating edge stream data with partition labels; c) generating primary keys for the point stream and the edge stream and writing them into the HBase database.
An example is given below for each of the three steps:
Step a) Parse the data and generate an out-of-order point stream and edge stream:
Consider a transaction datum expressed in JSON format as follows:
{
  "borrower": {
    "id": "0001",
    "mobile phone": 13511112222,
    "card number": 123456789,
    "account opening site": "xxx01"
  },
  "lender": {
    "id": "0002",
    "mobile phone": 13522221111,
    "card number": 987654321
  },
  "transaction amount": 100.00,
  "trading platform": "dd",
  "transaction time": 123999923212
}
The system administrator defines the point structure in the metadata as:
{
  "entry field": ["borrower", "lender"],
  "primary attribute": "id",
  "non-primary attribute": ["card number", "account opening site"]
}
The edge structure is:
{
  "primary attribute": {
    "start point": "borrower",
    "end point": "lender"
  },
  "non-primary attribute": ["transaction amount", "transaction time"]
}
As described for the data analysis module, the system parses the raw transaction data into point stream data and edge stream data based on the metadata. In this example, the point stream data contains [formula image 052] = {"id": "0001", "card number": 123456789, "account opening site": "xxx01"} and [formula image 053] = {"id": "0002", "card number": 987654321}, and the edge stream data contains [formula image 054] = {"start point": {"id": "0001"}, "end point": {"id": "0002"}, "transaction amount": 100.00, "transaction time": 123999923212}.
The parsed point stream data can be sent directly to the downstream point element storage module and written into the HBase database immediately. The parsed edge stream data is sent to the data rearrangement module and, after waiting for a bounded time, passed on to the data navigation module.
Step b) Partition the edge stream data to obtain edge stream data with partition labels. For the partition numbered [formula image 005], the objective function, that is, the partition selection function [formula image 055], is calculated. In this example, [formula image 008] is the start point primary attribute, with value 0001, and [formula image 009] is the end point primary attribute, with value 0002; [formula image 052] is the attribute list of the start point above, and [formula image 053] is the attribute list of the end point above. Here [formula image 056]; [formula image 033] and [formula image 034] are the preset maximum and minimum capacities of the partition; and [formula image 057], [formula image 058] and [formula image 059] are calculated according to the following formulas:
[formula image 061]
[formula image 063]
[formula image 065]
[formula image 067]
where [formula image 068] are all hyperparameters and [formula image 022] is the string concatenation function. [formula image 028] and [formula image 006] are functions that determine whether a string is present: [formula image 006] is mainly used to determine whether a node primary attribute such as [formula image 008] is in partition [formula image 005], while [formula image 028] is used to determine whether an arbitrary string has appeared; both functions are implemented with Bloom filters. [formula image 027] mainly returns the number of occurrences of a string and is implemented with a distributed hash table. Suppose [formula image 069] and the attribute [formula image 070] is "account opening site": xxx01; then [formula image 071] is "3@account opening site@xxx01". When the string [formula image 071] has appeared, [formula image 072] returns 1; otherwise it returns 0.
Return 1, otherwise return 0. Each time a string appears in the new data
Figure 181899DEST_PATH_IMAGE071
Then, then
Figure DEST_PATH_IMAGE073
. In view of the performance requirements,
Figure 415434DEST_PATH_IMAGE027
key conflicts can be resolved without means such as a linked list and the like, and certain data errors are allowed.
Figure 777146DEST_PATH_IMAGE027
Should be designed according to the attributes of the traffic field,for example, when one attribute is "location": street yy in xx area of hangzhou, zhejiang, the attribute granularity can be adjusted to "location": street of hangzhou, zhejiang, and then generalized hash value calculation is performed.
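The two lookup structures used in step b) can be illustrated with a small in-memory sketch. It builds the example key from partition 3, the attribute "account opening site" and the value xxx01, records it in a Bloom filter, and counts occurrences in a fixed-size table that does not resolve collisions, so counts may be overestimated. Everything here is an assumption for illustration: the names attribute_key, BloomFilter and LossyCounter, the SHA-256 based hashing, and the table sizes. According to claim 4, the deployed structures are backed by Redis in Cluster mode, which this sketch does not model.

```python
import hashlib


def attribute_key(partition: int, name: str, value: str) -> str:
    """Concatenate partition number, attribute name and value with '@'."""
    return "@".join([str(partition), name, value])


class BloomFilter:
    """Membership test with possible false positives and no false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class LossyCounter:
    """Fixed-size counter table; colliding keys share one slot."""

    def __init__(self, num_slots: int = 1 << 16):
        self.slots = [0] * num_slots

    def _slot(self, key: str) -> int:
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.slots)

    def increment(self, key: str) -> None:
        self.slots[self._slot(key)] += 1

    def count(self, key: str) -> int:
        return self.slots[self._slot(key)]


key = attribute_key(3, "account opening site", "xxx01")
seen = BloomFilter()
counts = LossyCounter()
seen_before = key in seen  # False: the filter is still empty
for _ in range(2):         # the string appears twice in new data
    seen.add(key)
    counts.increment(key)
```

Sharing slots between colliding keys is the trade-off the text accepts: lookups stay constant-time and memory stays bounded, at the cost of occasionally inflated counts.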
As shown in Fig. 2, the main flow of step b) is to calculate the value of the partition selection function [formula 012] on each partition in turn, and then select the partition with the largest value, [formula 015], as the target partition for the data.
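The selection loop of step b) amounts to an argmax over partitions, and claim 1 adds that ties are broken at random. The sketch below shows only that control flow. The function name select_partition, the zero-based partition numbering and the score callback are assumptions; the actual objective function is given by the formulas of the patent and is not reproduced here.

```python
import random
from typing import Callable, Optional


def select_partition(
    num_partitions: int,
    score: Callable[[int], float],
    rng: Optional[random.Random] = None,
) -> int:
    """Return the partition with the largest score, breaking ties at random."""
    rng = rng or random.Random()
    # Evaluate the partition selection function on every partition in turn.
    scores = [score(i) for i in range(num_partitions)]
    best = max(scores)
    # Several partitions may share the maximum; pick one of them at random.
    candidates = [i for i, s in enumerate(scores) if s == best]
    return rng.choice(candidates)


# Hypothetical scores for three partitions; partition 1 has the largest value.
example_scores = [0.1, 0.7, 0.3]
target = select_partition(3, lambda i: example_scores[i])
```

Evaluating every partition costs time linear in the number of partitions per edge, which is acceptable in this sketch because each evaluation is assumed to be a handful of Bloom filter and hash table lookups.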
Step c): generate primary keys for the point stream and the edge stream and write them to the HBase database. The primary keys of the point stream data are [formula 074] and [formula 075], i.e. 0001 and 0002 in the example above. The primary key of the edge stream data is designed as [formula 018]. Assuming that the partition code is stored with length 3, that the main attribute of the stream data is stored with length 5, and that [formula 015] = 3, the example above gives [formula 076] = 003@00001@00002. When the HBase table is created, its regions should be split on the highest 3 digits of [formula 076], so that data with the same [formula 015] is written to the same batch of regions.
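The edge primary key of step c) can be reproduced with a few lines of string handling. The sketch assumes that the fixed-length conversion is left zero-padding, which is consistent with the example (0001 becomes 00001) but is not stated explicitly in the text; the function name edge_row_key and its parameters are hypothetical.

```python
def edge_row_key(partition, start_id, end_id, partition_width=3, id_width=5):
    """Build '<partition>@<start>@<end>' with fixed-width, zero-padded fields."""

    def pad(value, width):
        text = str(value)
        if len(text) > width:
            # Truncating would make two different keys collide, so fail loudly.
            raise ValueError(f"{text!r} does not fit in {width} characters")
        return text.zfill(width)

    return "@".join([
        pad(partition, partition_width),
        pad(start_id, id_width),
        pad(end_id, id_width),
    ])


row_key = edge_row_key(3, "0001", "0002")  # "003@00001@00002", as in the example
```

Because the partition code is the fixed-width prefix of the key, keys of the same partition sort next to each other, which is what allows the table's regions to be split on the first three characters.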
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.

Claims (6)

1. A flow graph dividing system based on correlation characteristics is characterized by comprising a data analysis module, a data rearrangement module, a point element storage module, an edge element storage module and a data navigation module;
the data analysis module is used for analyzing the original transaction data stream received by the system and generating a point data stream and an edge data stream of the associated map, which are called point stream and edge stream for short; specifically, the format of point data is defined through meta-rules, and the point data comprises a main attribute and a non-main attribute; defining the format of edge data through meta-rules, wherein the edge data comprises main attributes of two endpoints and non-main attributes of edges; the main attribute is used as the unique identification of the point, and the non-main attribute is used as the attribute description of the point; point flow data and edge flow data generated by the original transaction data flow are transmitted to a data rearrangement module;
the data rearrangement module is used for shuffling the edge stream data according to a certain rule, reducing the interference of the order of a specific transaction data stream with the flow graph partitioning algorithm; the data rearrangement module provides a data accumulation queue of preset size [formula 002]; edge stream data sent from the upstream data parsing module is first stored at the tail of the data accumulation queue and then randomly swapped with one item in the queue; each edge stream datum carries a timestamp recording the time at which it entered the queue, and when its stay time in the queue exceeds a preset value [formula 004], it is pushed directly to the downstream data navigation module; when the size of the data accumulation queue exceeds the preset size [formula 002], the data at the head of the queue is pushed to the downstream data navigation module; only edge stream data enters the data accumulation queue and is finally passed to the data navigation module, while point stream data is pushed directly to the point element storage module;
the data navigation module is used for determining the specific storage location of the edge stream data; the overall system is defined to contain [formula 006] edge element storage partitions for storing edge stream data, numbered [formula 008]; for each partition [formula 010], a local Bloom filter [formula 012] is maintained to record whether the partition contains an endpoint of a given edge stream datum; the j-th edge stream datum is denoted [formula 014], where [formula 016] and [formula 018] are respectively the main attributes of the start point and end point of the j-th edge stream datum, [formula 020] are respectively the non-main attributes of its start point and end point, and [formula 022] is the non-main attribute of its edge; an objective function [formula 024] is designed for partition [formula 010], so that endpoints of edge stream data with associated features are divided into the same partition; the partition that maximizes the value of the objective function is selected, i.e. the partition numbered [formula 026], and the edge stream datum [formula 028] is pushed to the edge element storage module of partition [formula 030]; if several [formula 030] attain the maximum simultaneously, one of them is selected at random;
the point element storage module is used for storing point stream data into a global key-value lookup table; for point stream data whose main attribute is [formula 032], [formula 034] is used as the primary key, and the main attribute and non-main attributes of the point stream data are written to the HBase database; if a non-main attribute is duplicated, the latest version is retained;
the edge element storage module is used for storing edge stream data into a local key-value lookup table; for edge stream data with start point [formula 016] and end point [formula 018], [formula 036] is used as the primary key when writing to the HBase database; [formula 038] denotes converting a string [formula 040] of arbitrary length into a string of fixed length [formula 042]; by default, the length [formula 042] is the maximum length needed to represent a node main attribute; [formula 044] is the string concatenation operation; HBase divides its regions according to the [formula 006] edge element storage partitions and assigns regions for storage according to the partitioning result of the data navigation module.
2. The system for dividing a flow graph based on associated features of claim 1, wherein in the data parsing module, each transaction datum is a key-value dictionary table represented by json format, and various information when a transaction occurs is recorded in detail.
3. The system for dividing a flow graph based on associated features according to claim 1, wherein the data parsing module generates the point stream data and the edge stream data according to predefined meta-rules as follows:
a) for each point data format defined in the meta-rules, with main attribute m, check whether the transaction data contains the [formula 046] field; if it does, generate a point stream datum whose main attribute value is the [formula 048] value of the transaction data and push it into the point stream; the non-main attributes of the point are defined by the meta-rules and obtained from the transaction data, and are ignored if absent;
b) for each edge data format defined in the meta-rules, with endpoint main attributes [formula 050] and [formula 052], check whether the transaction data contains the [formula 050] field and the [formula 052] field; generate an edge stream datum and push it into the edge stream only when both fields are present; the other attributes of the edge stream data are defined by the meta-rules and obtained from the transaction data, and are ignored if absent;
c) for a single transaction, the data is parsed into flow graph data consisting of a single point, two points, or two points and one edge, according to the specific definition of the meta-rules.
4. The system for flow graph partitioning based on associated features according to claim 1, wherein the data navigation module maintains two globally distributed key-value storage structures [formula 054] and [formula 056]; [formula 054] is a distributed hash table storing a mapping from arbitrary strings to 64-bit positive integer values; [formula 056] is a global Bloom filter for determining whether an arbitrary string exists; [formula 054] and [formula 056] are implemented with Redis deployed in Cluster mode.
5. The flow graph dividing system based on associated features according to claim 4, wherein the objective function [formula 024] used by the data navigation module for partitioning consists of two parts; the first part penalizes unbalanced data partitioning, with the specific formula [formula 058], where [formula 060] and [formula 062] are hyperparameters, the amount of data stored in partition [formula 010] is [formula 064], the maximum data capacity of each partition is [formula 066], and the minimum data capacity is [formula 068]; the second part optimizes the locality of data partitioning, with the specific formula [formula 070]; where the function [formula 072], [formula 074] is a Bloom filter: when point [formula 032] is present in partition [formula 010], [formula 076], and otherwise it returns [formula 078]; [formula 080] and [formula 082] are hyperparameters; [formula 084] is the degree of node [formula 032] in the flow graph already written; [formula 086] evaluates the degree of matching between the node non-main attribute set [formula 088] and partition [formula 010]: specifically, for each non-main attribute [formula 090] in the set, [formula 092] is calculated; [formula 044] is the string concatenation operator; [formula 056] is the global Bloom filter, which returns [formula 094] when the passed-in parameter exists and otherwise returns [formula 078]; [formula 054] is the distributed hash table; finally, [formula 096] [formula 098] is obtained.
6. The system according to claim 1, wherein for the repeated edges, a service-dependent aggregation function is adopted to combine attributes of the edges in the edge storage module to reduce storage space.
CN202110468957.9A 2021-04-28 2021-04-28 Flow graph dividing system based on correlation characteristics Active CN113127491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110468957.9A CN113127491B (en) 2021-04-28 2021-04-28 Flow graph dividing system based on correlation characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110468957.9A CN113127491B (en) 2021-04-28 2021-04-28 Flow graph dividing system based on correlation characteristics

Publications (2)

Publication Number Publication Date
CN113127491A CN113127491A (en) 2021-07-16
CN113127491B true CN113127491B (en) 2022-03-22

Family

ID=76780928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110468957.9A Active CN113127491B (en) 2021-04-28 2021-04-28 Flow graph dividing system based on correlation characteristics

Country Status (1)

Country Link
CN (1) CN113127491B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7689602B1 (en) * 2005-07-20 2010-03-30 Bakbone Software, Inc. Method of creating hierarchical indices for a distributed object system
US8972337B1 (en) * 2013-02-21 2015-03-03 Amazon Technologies, Inc. Efficient query processing in columnar databases using bloom filters
CN109426574A (en) * 2017-08-31 2019-03-05 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
CN109740037A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 The distributed online real-time processing method of multi-source, isomery fluidised form big data and system
CN110704630A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) Self-optimization mechanism for identified associated graph

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11636115B2 (en) * 2019-09-26 2023-04-25 Fungible, Inc. Query processing using data processing units having DFA/NFA hardware accelerators
US11636154B2 (en) * 2019-09-26 2023-04-25 Fungible, Inc. Data flow graph-driven analytics platform using data processing units having hardware accelerators

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
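Internet-based multi-domain collaborative sensing and fusion analysis technology for physical objects; Zhang Yi; China Excellent Doctoral and Master's Theses Full-text Database (Master's); 20190115 (No. 01); full text *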

Also Published As

Publication number Publication date
CN113127491A (en) 2021-07-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant