CN113127491A - Flow graph dividing system based on correlation characteristics - Google Patents

Flow graph dividing system based on correlation characteristics

Info

Publication number
CN113127491A
CN113127491A CN202110468957.9A CN202110468957A CN113127491A CN 113127491 A CN113127491 A CN 113127491A CN 202110468957 A CN202110468957 A CN 202110468957A CN 113127491 A CN113127491 A CN 113127491A
Authority
CN
China
Prior art keywords
data
stream
edge
point
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110468957.9A
Other languages
Chinese (zh)
Other versions
CN113127491B (en)
Inventor
王新根
陈伟
唐迪佳
杨运平
黄文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Bangsheng Real Time Intelligent Technology Co ltd
Original Assignee
Shenzhen Bangsheng Real Time Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Bangsheng Real Time Intelligent Technology Co ltd
Priority to CN202110468957.9A
Publication of CN113127491A
Application granted
Publication of CN113127491B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Strategic Management (AREA)
  • Computational Linguistics (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a flow graph dividing system based on correlation characteristics, which comprises a data analysis module, a data rearrangement module, a point element storage module, an edge element storage module and a data navigation module. The data analysis module parses the transaction data stream into the data format of an association graph and generates a point stream and an edge stream; the data rearrangement module shuffles the edge stream data, reducing the influence of the particular order of the transaction data on the subsequent partitioning algorithm; the data navigation module selects a suitable storage location for each piece of edge stream data; and the edge element storage module and the point element storage module write the partitioned edge stream data and point stream data into a database. The flow graph dividing system provided by the invention can optimize the partitioning of association graph data according to the characteristics of the transaction data stream and improve the performance of graph analysis tasks executed afterwards.

Description

Flow graph dividing system based on correlation characteristics
Technical Field
The invention belongs to the field of transaction anti-fraud, and particularly relates to a flow graph dividing system based on correlation characteristics, which is suitable for distributed storage and analysis of streaming transaction data.
Background
In the field of transaction anti-fraud, data such as customer information and transaction records are often modeled as a graph to construct an association graph. For example, an association graph may be constructed with bank card numbers as nodes and transfers between bank cards as edges. Some of the bank card nodes can be marked as abnormal accounts, and analyses such as risk assessment can then be carried out on unmarked accounts based on the relationships expressed by the association graph.
Common algorithms for association graph analysis include graph traversal, community detection, cycle detection, connectivity detection, and the like. In practice, these analysis algorithms are typically implemented on a distributed graph computation framework. Mainstream graph computation frameworks adopt a Pregel-like message propagation model, whose computational cost can be approximated by the volume of communication between distributed nodes. Therefore, optimizing how the graph data is partitioned can reduce the communication load of the distributed graph computation framework and improve overall computing performance.
In real-world applications, large amounts of data pour into the system as a data stream. For example, a single transaction is one piece of data, which includes attributes such as debit party, credit party, time, platform and geographic location. The system builds these data into a graph according to preset meta-rules, for example by defining the debit party and the credit party as two nodes of the graph and creating an edge between them, where the nodes and the edge record the other information of the transaction as attributes. Typically, node attributes hold information that is fixed for the transacting entity, such as the bank card number, the address of the account-opening bank, the mobile phone number used to open the account and the identification number, while edge attributes hold information specific to a single transaction, such as time, platform and amount.
As the above example shows, the transaction data stream is large in volume and complex in format. Constructing the association graph from the transaction data stream in real time, while keeping the storage and computation nodes load-balanced and partitioning the graph as efficiently as possible, is a relatively difficult problem.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a flow graph dividing system that constructs an association graph from a transaction data stream. It runs before association graph analysis tasks, partitions the association graph in real time and keeps the load of the nodes balanced.
The purpose of the invention is achieved by the following technical solution: a flow graph dividing system based on correlation characteristics comprises a data analysis module, a data rearrangement module, a point element storage module, an edge element storage module and a data navigation module.
The data analysis module parses the original transaction data stream received by the system and generates the point data stream and the edge data stream of the association graph, referred to as the point stream and the edge stream for short. Specifically, the format of point data is defined by a meta-rule (meta-graph) and comprises a main attribute and non-main attributes; the format of edge data is likewise defined by a meta-rule (meta-graph) and comprises the main attributes of the two endpoints and the non-main attributes of the edge. The main attribute serves as the unique identifier of a point, and the non-main attributes describe the point. The point stream data and edge stream data generated from the original transaction data stream are passed to the data rearrangement module.
The data rearrangement module shuffles the edge stream data according to a certain rule, reducing the interference of the particular order of the transaction data stream with the flow graph partitioning algorithm. The module provides a data accumulation queue of preset size $N_p$. Edge stream data sent from the upstream data analysis module is first stored at the tail of the queue and then swapped with a randomly chosen item in the queue. Each piece of edge stream data carries a timestamp recording when it entered the queue; when its dwell time in the queue exceeds a preset value $\sigma$, it is pushed directly to the downstream data navigation module. When the size of the queue exceeds the preset $N_p$, the item at the head of the queue is pushed to the downstream data navigation module. Only edge stream data enters the data accumulation queue and is eventually passed to the data navigation module; point stream data is pushed directly to the point element storage module.
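The queue behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ShuffleQueue`, the `emit` callback standing in for the data navigation module, and the injectable `clock` are hypothetical and do not come from the patent, and the expiry scan is linear in the queue size for simplicity.

```python
import random
import time


class ShuffleQueue:
    """Bounded buffer that randomizes the order of edge records before emitting them."""

    def __init__(self, capacity, max_wait, emit, rng=None, clock=time.monotonic):
        self.capacity = capacity    # N_p: preset queue size
        self.max_wait = max_wait    # sigma: maximum dwell time, in clock units
        self.emit = emit            # hypothetical callback standing in for the navigation module
        self.rng = rng or random.Random()
        self.clock = clock
        self.items = []             # list of (enqueue_time, edge)

    def push(self, edge):
        # Store the new record at the tail, then swap it with a random position.
        self.items.append((self.clock(), edge))
        j = self.rng.randrange(len(self.items))
        self.items[-1], self.items[j] = self.items[j], self.items[-1]
        self.flush_expired()
        # When the queue exceeds N_p, emit records from the head.
        while len(self.items) > self.capacity:
            self.emit(self.items.pop(0)[1])

    def flush_expired(self):
        # Emit every record whose dwell time exceeds sigma; keep the rest in order.
        now = self.clock()
        kept = []
        for entered, edge in self.items:
            if now - entered > self.max_wait:
                self.emit(edge)
            else:
                kept.append((entered, edge))
        self.items = kept
```

Passing a fixed-seed `random.Random` and a fake clock makes the shuffle reproducible in tests, while production use would keep the defaults.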
The data navigation module determines the specific storage location of the edge stream data. The whole system is defined to contain $N_E$ edge element storage partitions for storing edge stream data, numbered $0, 1, \dots, N_E-1$. For each partition $i$, a local bloom filter $B_l^{(i)}$ is maintained to record whether the partition contains an endpoint of some edge stream data. The $j$-th piece of edge stream data is denoted $E_j=(v_1,v_2,A_1,A_2,P_e)$, where $v_1$ and $v_2$ are the main attributes of its start point and end point, $A_1$ and $A_2$ are the non-main attributes of its start point and end point, respectively, and $P_e$ is the non-main attributes of the edge itself. For partition $i$, an objective function $C(i,v_1,v_2,A_1,A_2)$ is designed so that the endpoints of edge stream data with associated features are placed in the same partition. The partition with the largest objective function value is selected, i.e. the partition numbered $ind=\arg\max_i\{C(i,v_1,v_2,A_1,A_2)\}$, and the edge stream data $(v_1,v_2,P_e)$ is pushed to the edge element storage module of partition $ind$. If several partitions attain the maximum at the same time, one of them is selected at random.
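The selection rule, an arg-max over partitions with random tie-breaking, might look as follows in Python. The function name `choose_partition` and the `objective` callable are hypothetical; the callable stands in for the objective function $C$ with the edge arguments already bound, and its actual form is not reproduced here.

```python
import random


def choose_partition(num_partitions, objective, rng=random):
    """Return the partition index with the largest objective value.

    `objective` is a placeholder for C(i, ...) with the edge already bound;
    ties are broken uniformly at random, as the text specifies.
    """
    scores = [objective(i) for i in range(num_partitions)]
    best = max(scores)
    candidates = [i for i, score in enumerate(scores) if score == best]
    return rng.choice(candidates)
```

Because every partition is scored exactly once, the cost per edge depends only on the number of partitions and not on the size of the graph written so far.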
The point element storage module stores the point stream data in a global key-value lookup table. For point stream data whose main attribute is $v$, $K_v=v$ is used as the primary key, and the main attribute and non-main attributes of the point stream data are written into the HBase database. If a non-main attribute is duplicated, the latest version is retained.
The edge element storage module stores the edge stream data in a local key-value lookup table. For edge stream data with start point $v_1$ and end point $v_2$, $K_e=\mathrm{fix}(ind)\oplus\mathrm{fix}(v_1)\oplus\mathrm{fix}(v_2)$ is used as the primary key for writing into the HBase database. $\mathrm{fix}(x)$ converts a character string $x$ of arbitrary length into a character string of fixed length $l_{fix}$; by default, $l_{fix}$ is the maximum length needed to represent a node's main attribute. $\oplus$ is the string concatenation operation. HBase divides its regions according to the $N_E$ edge element storage partitions, and the region used for storage is specified according to the partitioning result of the data navigation module.
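A minimal Python sketch of building such a fixed-width edge key is given below. The left zero-padding, the "@" separator and the widths 3 and 5 are assumptions inferred from the worked example in the embodiment (partition 3, endpoints 0001 and 0002); the function names `fix` and `edge_key` are illustrative only.

```python
def fix(value, length, pad="0"):
    """Left-pad `value` to exactly `length` characters (padding rule is an assumption)."""
    text = str(value)
    if len(text) > length:
        raise ValueError(f"{text!r} does not fit in {length} characters")
    return text.rjust(length, pad)


def edge_key(ind, v1, v2, partition_len=3, attr_len=5, sep="@"):
    """Join the fixed-width partition code and both endpoint identifiers."""
    return sep.join([fix(ind, partition_len), fix(v1, attr_len), fix(v2, attr_len)])


print(edge_key(3, "0001", "0002"))  # 003@00001@00002
```

Putting the partition code first means that keys from the same partition share a prefix, which is what allows a store with sorted row keys to keep them in the same key range.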
Furthermore, in the data analysis module, each piece of transaction data is a key-value dictionary expressed in JSON format, recording in detail the various information at the time the transaction occurred.
Further, the data analysis module generates the point stream data and the edge stream data according to the predefined meta-rules as follows:
a) For each point data format defined in the meta-rules, with main attribute $K_m$, check whether the transaction data contains the field $K_m$. If it does, generate point stream data whose main attribute value is the value of field $K_m$ in the transaction data, and push it into the point stream. The non-main attributes of the point are defined by the meta-rule and taken from the transaction data; any that are absent are ignored.
b) For each edge data format defined in the meta-rules, with endpoint main attributes $S_m$ and $D_m$, check whether the transaction data contains the fields $S_m$ and $D_m$. Only when both fields are present is one piece of edge stream data generated and pushed into the edge stream. The other attributes of the edge stream data are defined by the meta-rule and taken from the transaction data; any that are absent are ignored.
c) Depending on the specific definition of the meta-rules, a single piece of transaction data can be parsed into several pieces of flow graph data, such as a single point, two points, or two points and one edge.
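Rules a) to c) can be illustrated with a short Python sketch. The rule layout (`primary`, `source`, `target`, `attributes`) and all field names in the usage below are hypothetical, since the patent's own meta-rule format is not reproduced in this text.

```python
def parse_transaction(txn, point_rules, edge_rules):
    """Split one transaction dictionary into point records and edge records."""
    points, edges = [], []
    for rule in point_rules:
        key = rule["primary"]                      # K_m
        if key in txn:
            # Non-main attributes that are missing from the transaction are ignored.
            attrs = {a: txn[a] for a in rule.get("attributes", []) if a in txn}
            points.append({"id": txn[key], "attributes": attrs})
    for rule in edge_rules:
        src, dst = rule["source"], rule["target"]  # S_m and D_m
        if src in txn and dst in txn:              # both endpoint fields are required
            attrs = {a: txn[a] for a in rule.get("attributes", []) if a in txn}
            edges.append({"source": txn[src], "target": txn[dst], "attributes": attrs})
    return points, edges


# Hypothetical rules and transaction, for illustration only.
POINT_RULES = [
    {"primary": "payer_card", "attributes": ["payer_branch"]},
    {"primary": "payee_card", "attributes": ["payee_branch"]},
]
EDGE_RULES = [{"source": "payer_card", "target": "payee_card", "attributes": ["amount", "time"]}]
TXN = {"payer_card": "0001", "payee_card": "0002", "payer_branch": "xxx01", "amount": 100.0}
```

With these inputs, `parse_transaction(TXN, POINT_RULES, EDGE_RULES)` yields two points and one edge, and a transaction lacking `payee_card` yields a single point and no edge.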
Further, the data navigation module maintains two globally distributed key-value storage structures, $M_g$ and $B_g$. $M_g$ is a distributed hash table that stores a mapping from arbitrary character strings to 64-bit positive integer values. $B_g$ is a distributed bloom filter used to determine whether an arbitrary character string has been seen. $M_g$ and $B_g$ are implemented with Redis deployed in Cluster mode.
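To make the role of a bloom filter such as $B_g$ concrete, here is a single-process Python sketch. It is a stand-in for illustration only and not the Redis-based deployment the text describes; the class name, bit-array size and number of hash functions are assumptions.

```python
import hashlib


class BloomFilter:
    """In-memory bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # A different salt per seed gives independent-looking hash functions.
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Membership tests and insertions touch a fixed number of bits regardless of how many strings have been added, which fits the constant-time lookups the navigation module relies on.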
Further, in the data navigation module, the objective function $C(i,v_1,v_2,V_1,V_2)$ of a partition consists of two parts. The first part, $C_1(i)$, penalizes unbalanced data partitioning; its formula is
[Formula image BDA0003044566770000031: definition of $C_1(i)$]
where $\lambda_3$ and $e$ are hyper-parameters, $H_i$ is the volume of data already stored in partition $i$, $H_{mx}$ is the maximum data capacity of a partition and $H_{mn}$ is the minimum data capacity. The second part optimizes the locality of the data partitioning; its formula is $C_2(i,v_1,v_2,V_1,V_2)=\lambda_1\times S(i,v_1,V_1)+\lambda_2\times S(i,v_2,V_2)$, where the function $S$ is given by
[Formula image BDA0003044566770000032]
$B_l^{(i)}$ is a bloom filter: when point $v$ is present in partition $i$, $B_l^{(i)}(v)$ returns 1; otherwise it returns 0. $\pi_1$ and $\pi_2$ are hyper-parameters, and $d(v)$ is the degree of node $v$ in the flow graph written so far. $sim(i,V)$ evaluates how well the node's non-main attribute set $V$ matches partition $i$. Specifically, for each non-main attribute $a\in V$, the following is computed:
[Formula image BDA0003044566770000035]
where $\oplus$ is the string concatenation operator, $B_g$ is the global bloom filter, which returns 1 when the argument passed in has been seen and 0 otherwise, and $M_g$ is the global hash table, which ignores key collisions. Finally, the following is obtained:
[Formula images BDA0003044566770000037 and BDA0003044566770000038]
$C(i,v_1,v_2,V_1,V_2)=C_1(i)+C_2(i,v_1,v_2,V_1,V_2)$.
Further, in the edge element storage module, for repeated edges, a business-specific aggregation function may be used to merge the attributes of the edges, reducing storage space.
The invention has the following beneficial effects: by exploiting the spatial and temporal locality of transaction entities, transaction data with similar features is preferentially stored in the same partition, which reduces the copying of point data while offline analysis tasks run, thereby reducing the communication load and improving overall operating efficiency. The invention determines the storage partition of each transaction record in constant time, independent of the size of the graph data set, so it can construct an association graph from a transaction data stream in real time and write it into the storage modules in a way that meets the requirements of graph partitioning optimization.
Drawings
FIG. 1 is a block diagram of the system;
FIG. 2 is a data navigation module service flow diagram.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention relates to a flow graph dividing system based on correlation characteristics. Seeing only each individual piece of transaction data and limited statistical information, the system automatically partitions the association graph that is eventually generated within sub-linear time complexity, while meeting requirements such as load balancing and reducing the communication load of subsequent tasks. The system comprises a data analysis module, a data rearrangement module, a point element storage module, an edge element storage module and a data navigation module.
The data analysis module parses the original transaction data stream received by the system. Each piece of transaction data is a key-value dictionary expressed in JSON format, recording in detail the various information at the time the transaction occurred. From the original transaction data stream, the module generates the point data stream and the edge data stream of the association graph, referred to as the point stream and the edge stream for short. Specifically, the format of point data is defined by a meta-rule (meta-graph) and comprises a main attribute and non-main attributes; the format of edge data is likewise defined by a meta-rule (meta-graph) and comprises the main attributes of the two endpoints and the non-main attributes of the edge. The main attribute serves as the unique identifier of a point, and the non-main attributes describe the point. The point stream data and edge stream data generated from the original transaction data stream are passed to the data rearrangement module. The data analysis module generates point stream data and edge stream data according to the predefined meta-rules as follows:
a) For each point data format defined in the meta-rules, with main attribute $K_m$, check whether the transaction data contains the field $K_m$. If it does, generate point stream data whose main attribute value is the value of field $K_m$ in the transaction data, and push it into the point stream. The non-main attributes of the point are defined by the meta-rule and taken from the transaction data; any that are absent are ignored.
b) For each edge data format defined in the meta-rules, with endpoint main attributes $S_m$ and $D_m$, check whether the transaction data contains the fields $S_m$ and $D_m$. Only when both fields are present is one piece of edge stream data generated and pushed into the edge stream. The other attributes of the edge stream data are defined by the meta-rule and taken from the transaction data; any that are absent are ignored.
c) Depending on the specific definition of the meta-rules, a single piece of transaction data can be parsed into several pieces of flow graph data, such as a single point, two points, or two points and one edge.
The data rearrangement module shuffles the edge stream data according to a certain rule, reducing the interference of the particular order of the transaction data stream with the flow graph partitioning algorithm. The module provides a data accumulation queue of preset size $N_p$. Edge stream data sent from the upstream data analysis module is first stored at the tail of the queue and then swapped with a randomly chosen item in the queue. Each piece of edge stream data carries a timestamp recording when it entered the queue; when its dwell time in the queue exceeds a preset value $\sigma$, it is pushed directly to the downstream data navigation module. When the size of the queue exceeds the preset $N_p$, the item at the head of the queue is pushed to the downstream data navigation module. Only edge stream data enters the data accumulation queue and is eventually passed to the data navigation module; point stream data is pushed directly to the point element storage module.
As shown in FIG. 2, the data navigation module determines the specific storage location of the edge stream data. The whole system is defined to contain $N_E$ edge element storage partitions for storing edge stream data, numbered $0, 1, \dots, N_E-1$. For each partition $i$, a local bloom filter $B_l^{(i)}$ is maintained to record whether the partition contains an endpoint of some edge stream data. The $j$-th piece of edge stream data is denoted $E_j=(v_1,v_2,A_1,A_2,P_e)$, where $v_1$ and $v_2$ are the main attributes of its start point and end point, $A_1$ and $A_2$ are the non-main attributes of its start point and end point, respectively, and $P_e$ is the non-main attributes of the edge itself. For partition $i$, an objective function $C(i,v_1,v_2,A_1,A_2)$ is designed so that the endpoints of edge stream data with associated features are placed in the same partition. The partition with the largest objective function value is selected, i.e. the partition numbered $ind=\arg\max_i\{C(i,v_1,v_2,A_1,A_2)\}$, and the edge stream data $(v_1,v_2,P_e)$ is pushed to the edge element storage module of partition $ind$. If several partitions attain the maximum at the same time, one of them is selected at random.
The data navigation module maintains two globally distributed key-value storage structures, $M_g$ and $B_g$. $M_g$ is a distributed hash table that stores a mapping from arbitrary character strings to 64-bit positive integer values. $B_g$ is a distributed bloom filter used to determine whether an arbitrary character string has been seen. $M_g$ and $B_g$ are implemented with Redis deployed in Cluster mode.
In the data navigation module, the objective function $C(i,v_1,v_2,V_1,V_2)$ of a partition consists of two parts. The first part, $C_1(i)$, penalizes unbalanced data partitioning; its formula is
[Formula image BDA0003044566770000051: definition of $C_1(i)$]
where $\lambda_3$ and $e$ are hyper-parameters, $H_i$ is the volume of data already stored in partition $i$, $H_{mx}$ is the maximum data capacity of a partition and $H_{mn}$ is the minimum data capacity. The second part optimizes the locality of the data partitioning; its formula is $C_2(i,v_1,v_2,V_1,V_2)=\lambda_1\times S(i,v_1,V_1)+\lambda_2\times S(i,v_2,V_2)$, where the function $S$ is given by
[Formula image BDA0003044566770000052]
$B_l^{(i)}$ is a bloom filter: when point $v$ is present in partition $i$, $B_l^{(i)}(v)$ returns 1; otherwise it returns 0. $\pi_1$ and $\pi_2$ are hyper-parameters, and $d(v)$ is the degree of node $v$ in the flow graph written so far. $sim(i,V)$ evaluates how well the node's non-main attribute set $V$ matches partition $i$. Specifically, for each non-main attribute $a\in V$, the following is computed:
[Formula image BDA0003044566770000055]
where $\oplus$ is the string concatenation operator, $B_g$ is the global bloom filter, which returns 1 when the argument passed in has been seen and 0 otherwise, and $M_g$ is the global hash table, which ignores key collisions. Finally, the following is obtained:
[Formula images BDA0003044566770000057 and BDA0003044566770000058]
$C(i,v_1,v_2,V_1,V_2)=C_1(i)+C_2(i,v_1,v_2,V_1,V_2)$.
The point element storage module stores the point stream data in a global key-value lookup table. For point stream data whose main attribute is $v$, $K_v=v$ is used as the primary key, and the main attribute and non-main attributes of the point stream data are written into the HBase database. If a non-main attribute is duplicated, the latest version is retained.
The edge element storage module stores the edge stream data in a local key-value lookup table. For edge stream data with start point $v_1$ and end point $v_2$, $K_e=\mathrm{fix}(ind)\oplus\mathrm{fix}(v_1)\oplus\mathrm{fix}(v_2)$ is used as the primary key for writing into the HBase database. $\mathrm{fix}(x)$ converts a character string $x$ of arbitrary length into a character string of fixed length $l_{fix}$; by default, $l_{fix}$ is the maximum length needed to represent a node's main attribute. $\oplus$ is the string concatenation operation. HBase divides its regions according to the $N_E$ edge element storage partitions, and the region used for storage is specified according to the partitioning result of the data navigation module. For repeated edges, a business-specific aggregation function may be used to merge the attributes of the edges, reducing storage space.
Example:
The invention provides a system that constructs an association graph from a transaction data stream, partitions the flow graph in real time and finally writes the association graph into the underlying database. Following the order of the modules, the whole process is divided into three steps: a) parse the data and generate the shuffled point stream and edge stream; b) partition the edge stream data and generate edge stream data carrying partition labels; c) generate primary keys for the point stream and the edge stream and write them into the HBase database.
An example is given below for each of the three steps:
Step a) Parse the data and generate the shuffled point stream and edge stream:
Suppose a piece of transaction data expressed in JSON format as follows:
[Image BDA0003044566770000061: example transaction data in JSON format]
The system administrator defines metadata whose point structure is:
[Image BDA0003044566770000062: point structure of the metadata]
and whose edge structure is:
[Images BDA0003044566770000063 and BDA0003044566770000071: edge structure of the metadata]
As described for the data analysis module, the system parses the original transaction data into point stream data and edge stream data according to the metadata. In this example, the point stream data contains V1: 0001 {"id": 123456789, "account opening site": xxx01} and V2: 0002 {"card number": 987654321}, and the edge stream data contains E: {"starting point id": 0001, "ending point id": 0002, "transaction amount": 100.00, "transaction time": 123999923212}.
The point stream data obtained by parsing can be sent directly to the downstream point element storage module and written into the HBase database immediately. The parsed edge stream data is sent to the data rearrangement module and, after waiting for a bounded time, is sent on to the data navigation module.
Step b) Partition the edge stream data to obtain edge stream data carrying partition labels. For the partition numbered $i$, the objective function is the partition selection function, computed as $C(i,v_1,v_2,V_1,V_2)=C_1(i)+C_2(i,v_1,v_2,V_1,V_2)$. In the example, $v_1$ is the main attribute of the start point, with value 0001, and $v_2$ is the main attribute of the end point, with value 0002; $V_1$ is the attribute list of the start point above and $V_2$ is the attribute list of the end point above. Here
[Formula image BDA0003044566770000072]
$H_{mx}$ and $H_{mn}$ are the preset maximum and minimum partition capacities, $C_2(i,v_1,v_2,V_1,V_2)=\lambda_1 S(i,v_1,V_1)+\lambda_2 S(i,v_2,V_2)$, and $S(i,v_1,V_1)$ and $S(i,v_2,V_2)$ are calculated according to the following formulas:
[Formula images BDA0003044566770000073 to BDA0003044566770000076]
where $\lambda_1,\lambda_2,\lambda_3,\pi_1,\pi_2$ are all hyper-parameters and $\oplus$ is the string concatenation function. $B_g$ and $B_l^{(i)}$ are both functions that determine whether a character string has been seen: $B_l^{(i)}$ is mainly used to judge whether a node main attribute such as $v_1$ is on partition $i$, and $B_g$ is used to judge whether an arbitrary character string has appeared; both are implemented with bloom filters. $M_g$ mainly returns the number of occurrences of a character string and is implemented with a distributed hash table. Assuming $i=3$ and the attribute $a$ is "account opening site": xxx01, then $i\oplus a$ is "3@account opening site@xxx01". When the character string $i\oplus a$ has appeared, $B_g(i\oplus a)$ returns 1; otherwise it returns 0. Each time the character string $i\oplus a$ appears in new data, $M_g(i\oplus a)$ is incremented by 1.
For performance reasons, $M_g$ may leave key collisions unresolved, without using means such as linked lists, and a certain amount of data error is tolerated. The hash function of $M_g$ should be designed according to the attributes of the business domain. For example, when an attribute is "location": yhy Street, xx District, Hangzhou, Zhejiang, its granularity can be coarsened to "location": Hangzhou, Zhejiang before the generalized hash value is computed.
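The collision-tolerant counting and the attribute coarsening described for $M_g$ can be sketched in Python as follows. This is an in-memory illustration under assumptions: `LossyCounter`, `attribute_key` and `coarsen` are hypothetical names, the slot count is arbitrary, and the address is assumed to arrive as components ordered from coarse to fine.

```python
import hashlib


class LossyCounter:
    """Fixed-size counter table: colliding keys share a slot and are never resolved."""

    def __init__(self, num_slots=1 << 16):
        self.slots = [0] * num_slots

    def _slot(self, key):
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % len(self.slots)

    def increment(self, key):
        self.slots[self._slot(key)] += 1

    def count(self, key):
        # May overestimate when another key hashes to the same slot.
        return self.slots[self._slot(key)]


def attribute_key(partition, name, value, sep="@"):
    """Join partition number and attribute into one string, e.g. '3@branch@xxx01'."""
    return sep.join([str(partition), name, str(value)])


def coarsen(components, levels):
    """Keep only the coarsest `levels` components of a hierarchical attribute."""
    return tuple(components[:levels])
```

For instance, `coarsen(("Zhejiang", "Hangzhou", "xx District", "yhy Street"), 2)` keeps only the province and city, so that nearby addresses count towards the same key; the trade-off of skipping collision handling is that counts can only err upwards.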
The main flow of step b) is shown in FIG. 2: the value of the partition selection function $C(i,v_1,v_2,V_1,V_2)$ is computed on each partition in turn, and the partition $ind$ with the largest value is then selected as the target partition of the data.
Step c) Generate primary keys for the point stream and the edge stream and write them into the HBase database. The primary keys of the point stream data are $K_v=v_1$ and $K_v=v_2$, i.e. 0001 and 0002 in the example above. The primary key of the edge stream data is designed as $K_e=\mathrm{fix}(ind)\oplus\mathrm{fix}(v_1)\oplus\mathrm{fix}(v_2)$. Assuming that the length needed to store the partition code is 3, the length needed to store the main attribute of point stream data is 5, and $ind=3$, then in the example above $K_e$ = 003@00001@00002. When the HBase table is created, regions should be split according to the highest 3 characters of $K_e$, ensuring that data with the same $ind$ is written to the same group of regions.
The above embodiments are intended to illustrate rather than limit the invention; any modification or variation made to the invention within its spirit and the scope of the appended claims falls within the scope of protection of the invention.

Claims (6)

1. A flow graph dividing system based on correlation characteristics is characterized by comprising a data analysis module, a data rearrangement module, a point element storage module, an edge element storage module and a data navigation module.
The data analysis module is used to parse the original transaction data stream received by the system and to generate the point data stream and the edge data stream of the association graph, referred to as the point stream and the edge stream for short. Specifically, the format of point data is defined by a meta-rule (meta-graph) and comprises a main attribute and non-main attributes; the format of edge data is likewise defined by a meta-rule (meta-graph) and comprises the main attributes of the two endpoints and the non-main attributes of the edge. The main attribute serves as the unique identifier of a point, and the non-main attributes serve as its attribute description. The point stream data and edge stream data generated from the original transaction data stream are passed to the data rearrangement module.
The data rearrangement module is used to shuffle the edge stream data according to a certain rule, reducing the interference that the order of a particular transaction data stream has on the flow graph partitioning algorithm. The data rearrangement module provides a data accumulation queue of preset size N_p. Edge stream data sent from the upstream data analysis module is first stored at the tail of the data accumulation queue and then swapped with a randomly chosen item in the queue. Each piece of edge stream data carries a timestamp recording the time at which it entered the queue; when its residence time in the queue exceeds a preset value σ, it is pushed directly to the downstream data navigation module. When the size of the data accumulation queue exceeds the preset N_p, the item at the head of the queue is pushed to the downstream data navigation module. Only edge stream data enters the data accumulation queue and is finally passed to the data navigation module; point stream data is pushed directly to the point element storage module.
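A minimal single-process sketch of such an accumulation queue follows. The class name `RearrangeQueue`, the explicit `now` argument (used instead of a wall clock so that the behaviour is reproducible), and the choice to check expiry on every push are assumptions; the text does not say when the residence-time check is performed.

```python
import random
from typing import Any, List, Optional, Tuple


class RearrangeQueue:
    """Bounded queue that swaps each new item with a random queued item."""

    def __init__(self, capacity: int, max_wait: float,
                 rng: Optional[random.Random] = None) -> None:
        self.capacity = capacity          # preset queue size (N_p in the text)
        self.max_wait = max_wait          # maximum residence time (sigma in the text)
        self.rng = rng or random.Random()
        self.items: List[Tuple[float, Any]] = []  # (enqueue time, edge)

    def push(self, edge: Any, now: float) -> List[Any]:
        """Enqueue one edge and return the edges released downstream."""
        self.items.append((now, edge))
        # Swap the new tail with a randomly chosen position (possibly itself).
        j = self.rng.randrange(len(self.items))
        self.items[-1], self.items[j] = self.items[j], self.items[-1]

        # Release every edge that has waited longer than max_wait.
        released = [e for t, e in self.items if now - t > self.max_wait]
        self.items = [(t, e) for t, e in self.items if now - t <= self.max_wait]

        # Release from the head while the queue is over capacity.
        while len(self.items) > self.capacity:
            released.append(self.items.pop(0)[1])
        return released
```

Each timestamp travels with its edge when positions are swapped, so expired edges can sit anywhere in the queue; this is why the sketch scans the whole list rather than only the head.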
The data navigation module is used to determine the specific storage location of edge stream data. The whole system is defined to contain N_E edge element storage partitions for storing edge stream data, numbered 0, 1, …, N_E - 1. A local bloom filter B_l^(i) is maintained for each partition i to record whether the partition contains an endpoint of a given piece of edge stream data. The j-th piece of edge stream data is denoted E_j = (v1, v2, A1, A2, P_e), where v1 and v2 are the main attributes of its start point and end point, A1 and A2 are the non-main attributes of its start point and end point respectively, and P_e is the non-main attribute of the edge itself. For partition i, an objective function C(i, v1, v2, A1, A2) is designed so that the endpoints of edge stream data with associated characteristics are placed in the same partition. The partition with the maximum objective function value is selected, i.e. the partition numbered ind = argmax_i {C(i, v1, v2, A1, A2)}, and the edge stream data (v1, v2, P_e) is pushed to the edge element storage module of partition ind. If several partitions attain the maximum at the same time, one of them is selected at random.
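The selection rule, including random tie-breaking, can be sketched as below. The objective function itself is passed in as a callable because its terms are defined elsewhere; the function name `choose_partition` and its signature are illustrative assumptions.

```python
import random
from typing import Callable, Optional


def choose_partition(n_partitions: int,
                     score: Callable[[int], float],
                     rng: Optional[random.Random] = None) -> int:
    """Return the index with the largest score, picking randomly among ties."""
    rng = rng or random.Random()
    values = [score(i) for i in range(n_partitions)]
    best = max(values)
    candidates = [i for i, v in enumerate(values) if v == best]
    return rng.choice(candidates)
```

A call such as `choose_partition(3, lambda i: [0.0, 5.0, 1.0][i])` returns 1; when two partitions share the top score, either may be returned.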
The point element storage module is used to store point stream data in a global key-value lookup table. For point stream data whose main attribute is v, K_v = v is used as the primary key, and the main attribute and non-main attributes of the point stream data are written to the HBase database. If a non-main attribute is duplicated, the latest version is retained.
The edge element storage module is used to store edge stream data in a local key-value lookup table. For edge stream data with start point v1 and end point v2,
Figure FDA0003044566760000011
is used as the primary key when writing to the HBase database. fix(x) denotes converting a character string x of arbitrary length into a character string of fixed length l_fix; by default, l_fix is the maximum length needed to represent a node's main attribute.
Figure FDA0003044566760000012
denotes the string concatenation operation. HBase splits its regions according to the N_E edge element storage partitions, and the region used for storage is designated according to the partitioning result of the data navigation module.
2. The system for dividing a flow graph based on associated features of claim 1, wherein, in the data analysis module, each piece of transaction data is a key-value dictionary represented in JSON format that records in detail the various information present when the transaction occurs.
3. The system for dividing a flow graph based on associated features according to claim 1, wherein the data analysis module generates the point stream data and the edge stream data according to a predefined meta-rule as follows:
a) For each point data format defined in the meta-rule, with main attribute K_m, check whether the transaction data contains the field K_m. If it does, generate a piece of point stream data whose main attribute value is the value of K_m in the transaction data, and push it into the point stream. The non-main attributes of the point are defined by the meta-rule and obtained from the transaction data; an attribute that is not present is ignored.
b) For each edge data format defined in the meta-rule, with endpoint main attributes S_m and D_m, check whether the transaction data contains the field S_m and the field D_m. Only when both fields are present is a piece of edge stream data generated and pushed into the edge stream. The other attributes of the edge stream data are defined by the meta-rule and obtained from the transaction data; an attribute that is not present is ignored.
c) Depending on the specific definition of the meta-rule, a single piece of transaction data may be parsed into several pieces of flow graph data, such as a single point, two points, or two points and one edge.
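Steps a) to c) can be sketched as follows. The meta-rule layout (`points`, `edges`, `key`, `src`, `dst`, `attrs`) and the field names `payer`, `payee`, `amount` and `payer_city` are hypothetical; the text does not prescribe a concrete meta-rule syntax.

```python
import json
from typing import Any, Dict, List, Tuple

Point = Tuple[str, Dict[str, Any]]
Edge = Tuple[str, str, Dict[str, Any]]

# Hypothetical meta-rule: two point formats and one edge format.
META_RULE = {
    "points": [
        {"key": "payer", "attrs": ["payer_city"]},
        {"key": "payee", "attrs": ["payee_city"]},
    ],
    "edges": [
        {"src": "payer", "dst": "payee", "attrs": ["amount"]},
    ],
}


def parse_transaction(tx: Dict[str, Any],
                      rule: Dict[str, Any]) -> Tuple[List[Point], List[Edge]]:
    """Split one transaction into point records and edge records."""
    points: List[Point] = []
    for p in rule["points"]:
        if p["key"] in tx:  # step a): emit a point only if its key field exists
            attrs = {a: tx[a] for a in p["attrs"] if a in tx}
            points.append((tx[p["key"]], attrs))
    edges: List[Edge] = []
    for e in rule["edges"]:
        if e["src"] in tx and e["dst"] in tx:  # step b): both endpoints required
            attrs = {a: tx[a] for a in e["attrs"] if a in tx}
            edges.append((tx[e["src"]], tx[e["dst"]], attrs))
    return points, edges


raw = '{"payer": "0001", "payee": "0002", "amount": 12.5, "payer_city": "Hangzhou"}'
points, edges = parse_transaction(json.loads(raw), META_RULE)
```

For this input the sketch produces two points and one edge; a transaction that carries only `payer` produces a single point and no edge, matching step c).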
4. The system for dividing a flow graph based on associated features as claimed in claim 1, wherein the data navigation module maintains two global distributed key-value storage structures, M_g and B_g. M_g is a distributed hash table that stores a mapping from arbitrary strings to 64-bit positive integer values. B_g is a distributed bloom filter used to determine whether an arbitrary string exists. M_g and B_g are implemented with Redis deployed in Cluster mode.
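The membership test that a bloom filter provides can be illustrated with a small in-memory stand-in. It is only a sketch: the claim specifies a Redis Cluster deployment, whereas the class below keeps its bits in a local `bytearray`, and the bit count, the number of hash functions and the use of salted BLAKE2b are assumptions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """In-memory bloom filter; false positives possible, false negatives not."""

    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes  # must stay below 256 for the one-byte salt
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive n_hashes bit positions by salting a single hash function.
        for k in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                     salt=bytes([k])).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item: str) -> int:
        """Return 1 if the item may be present, 0 if it is definitely absent."""
        hit = all(self.bits[pos // 8] & (1 << (pos % 8))
                  for pos in self._positions(item))
        return int(hit)


bf = BloomFilter()
bf.add("0001")
```

Returning 1 or 0 rather than a boolean mirrors how the filter result is used as a numeric term elsewhere in the claims.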
5. The system for dividing a flow graph based on associated features according to claim 4, wherein the objective function C(i, v1, v2, V1, V2) of the data navigation module consists of two parts. The first part penalizes unbalanced data partitioning; its specific formula is
Figure FDA0003044566760000021
where λ3 and e are hyper-parameters, H_i is the amount of data stored in partition i, H_mx is the maximum data capacity among the partitions, and H_mn is the minimum data capacity. The second part optimizes the locality of data partitioning; its specific formula is C2(i, v1, v2, V1, V2) = λ1 × S(i, v1, V1) + λ2 × S(i, v2, V2), where the function
Figure FDA0003044566760000022
Figure FDA0003044566760000023
Figure FDA0003044566760000024
is a bloom filter: when point v is present in partition i,
Figure FDA0003044566760000025
returns 1, and otherwise returns 0. π1 and π2 are hyper-parameters, and d(v) is the degree of node v in the flow graph written so far. sim(i, V) evaluates how well the set V of a node's non-main attributes matches partition i; specifically, for each non-main attribute a ∈ V in the set, the following is calculated
Figure FDA0003044566760000026
Figure FDA0003044566760000027
is the string concatenation operator. B_g is a global bloom filter that returns 1 when the argument passed in exists and 0 otherwise. M_g is a global hash table that ignores key conflicts. Finally,
Figure FDA0003044566760000028
Figure FDA0003044566760000029
C(i, v1, v2, V1, V2) = C1(i) + C2(i, v1, v2, V1, V2).
6. The system for dividing a flow graph based on associated features according to claim 1, wherein, for repeated edges, the edge element storage module merges the attributes of the edges using a business-dependent aggregation function in order to reduce storage space.
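Merging repeated edges with an aggregation function can be sketched as follows. The in-memory dictionary stands in for the edge table, and the particular aggregation (summing an `amount` attribute and counting occurrences) is a hypothetical business rule chosen only for illustration.

```python
from typing import Callable, Dict

EdgeAttrs = Dict[str, float]


def merge_sum_count(old: EdgeAttrs, new: EdgeAttrs) -> EdgeAttrs:
    """Example aggregation: total amount and number of merged edges."""
    return {
        "amount": old["amount"] + new["amount"],
        "count": old["count"] + new["count"],
    }


class EdgeStore:
    """Keeps one row per edge key and folds repeated edges into it."""

    def __init__(self, aggregate: Callable[[EdgeAttrs, EdgeAttrs], EdgeAttrs]) -> None:
        self.aggregate = aggregate
        self.rows: Dict[str, EdgeAttrs] = {}

    def write(self, key: str, attrs: EdgeAttrs) -> None:
        if key in self.rows:
            # Repeated edge: combine with the stored row instead of adding one.
            self.rows[key] = self.aggregate(self.rows[key], attrs)
        else:
            self.rows[key] = dict(attrs)


store = EdgeStore(merge_sum_count)
store.write("003@00001@00002", {"amount": 10.0, "count": 1})
store.write("003@00001@00002", {"amount": 2.5, "count": 1})
```

After the two writes the store holds a single row for the key, so storage grows with the number of distinct edges rather than with the number of transactions.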
CN202110468957.9A 2021-04-28 2021-04-28 Flow graph dividing system based on correlation characteristics Active CN113127491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110468957.9A CN113127491B (en) 2021-04-28 2021-04-28 Flow graph dividing system based on correlation characteristics

Publications (2)

Publication Number Publication Date
CN113127491A true CN113127491A (en) 2021-07-16
CN113127491B CN113127491B (en) 2022-03-22

Family

ID=76780928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110468957.9A Active CN113127491B (en) 2021-04-28 2021-04-28 Flow graph dividing system based on correlation characteristics

Country Status (1)

Country Link
CN (1) CN113127491B (en)

Also Published As

Publication number Publication date
CN113127491B (en) 2022-03-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant