WO2021134318A1 - Fast processing method of massive time-series data based on aggregated edge and time-series aggregated edge
- Publication number
- WO2021134318A1 (PCT/CN2019/130147)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- time
- graph
- aggregation
- series
- edges
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/244—Grouping and aggregation
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2477—Temporal data queries
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
- G06F16/903—Querying
- G06F16/9035—Filtering based on additional data, e.g. user or group profiles
Abstract
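A fast processing method for massive time-series data based on aggregated edges and time-series aggregated edges, which enables fast, real-time processing of association relationships modeled as a graph data structure under massive data volumes. Building on incremental stream computing over time windows, the method proposes two data structures, the "aggregated edge" and the "time-series aggregated edge", which are suited to data modeling of real-time dynamic graphs. It also introduces a time-series graph query language that adds descriptive semantics for time-series information; the language supports basic queries on vertices, edges, and attributes, and also lets users run graph queries, including graph matching and graph filtering, against indicator results computed within a given time window. The method is suited to fields such as marketing and real-time risk control based on massive data mining, and offers good latency control and high scalability.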
Description
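The present invention provides a fast processing method for massive time-series data based on aggregated edges and time-series aggregated edges. The method covers modeling based on a graph data structure, dynamic graph construction, the Aggregated Edge and the Time-series Aggregated Edge, and graph association relationship queries and pattern matching built on aggregated edges. It is mainly applicable to fields such as finance, electric power, transportation, and the Internet, for real-time analysis of the association relationships present in data.
In fields such as real-time financial risk control and precision marketing, computations frequently involve variables defined over association relationships, for example a user's "merchants where purchases were made in the past 24 hours" or "counterparties whose cumulative transfer amount over the past 180 days exceeds 1 million yuan". They also involve pattern-matching problems over association relationships, such as whether a user "made more than 100 transfers to another given user within the past week".
For association relationship queries, simple queries can be answered with database tables and table joins. In complex business scenarios, where there are many entity types and many relationship types, joins become very complicated because database tables are essentially built on binary relations, and query performance may not meet business requirements. The usual technical choice is therefore to model the association relationships of a complex business scenario as a graph. A graph is a data structure composed of nodes (vertices) and edges. A graph database is a database that uses graph data structures for semantic queries and represents and stores data as nodes, edges, and attributes. Business scenarios are usually modeled with the property graph model. In a property graph, nodes represent entities and edges represent relationships. A node or edge can carry zero, one, or more attributes, and the attribute keys of an entity are unique. In a transaction scenario, for example, the property graph model treats users as nodes and completed transactions as edges. Attributes on an edge can record the transaction details (amount, location, and so on). If several transactions occur between two users, several edges are created to represent the relationship.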
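Computing association relationships with a general-purpose graph database product and property graph modeling has several main drawbacks:
1. When the query result set is large, response times are long and cannot meet business requirements for real-time response. In real business scenarios, a query for the objects related through transactions over the past 30 days cannot be completed on a single node of a cluster, because of the huge volume of transaction records in that period.
2. Graph database products convert all data into vertices, edges, and attributes and load them into the graph store. The advantage is that the transaction information in the graph database is complete and easy to display; the drawback is that the data is so comprehensive that merely filtering out the data needed for a computation takes time.
3. Large-scale computing equipment is required. Because this approach performs intensive computation on the graph data structure, it generally requires a large cluster and graph-computing framework middleware, which entails substantial hardware and middleware maintenance costs.
4. Current graph database products lack time-series aggregation capabilities. In business scenarios that query time-series aggregates, the aggregation must be computed during the query itself, which delays the query. With massive data, many graph database products cannot produce query results in real time at all.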
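Summary of the Invention
To address the problems that current graph databases and graph-computing middleware have in processing massive time-series data, the present invention proposes a fast method for processing association relationships in massive time-series data based on aggregated edges and time-series aggregated edges, enabling fast, real-time processing of association relationships modeled as a graph data structure under massive data volumes.
Building on incremental stream computing over time windows, the present invention proposes two novel data structures, the "aggregated edge" and the "time-series aggregated edge", which are suited to data modeling of real-time dynamic graphs. The present invention also introduces a time-series graph query language that adds descriptive semantics for time-series information; it supports basic queries on vertices, edges, and attributes, and also lets users run graph queries, including graph matching and graph filtering, against indicator results computed within a given time window.
The object of the present invention is achieved by the following technical solution: a fast processing method for massive time-series data based on aggregated edges and time-series aggregated edges, comprising:
performing aggregation on the time-series data based on time windows to obtain aggregation results;
generating aggregated edges and time-series aggregated edges from the aggregation results, and storing the data in memory according to the aggregated-edge data structure;
building the aggregated edges or time-series aggregated edges into graph relationships in the manner of a property graph;
performing fast queries on the generated graph relationships based on the aggregated indicators.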
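Further, generating the aggregated edge comprises: during construction of the association relationship graph, performing aggregation in advance on attribute fields pre-selected and defined by the business, and storing the aggregation result as an attribute of the edge.
Further, generating the time-series aggregated edge comprises: cutting continuous time by a given time unit into a series of fixed-length time windows; assigning every time-series record to the corresponding time window according to the value of an agreed time attribute field; and aggregating the data in each time window with the aggregation algorithm the business requires, to obtain the aggregated value for that window.
Further, different indicators use different aggregation algorithms depending on what they compute, for example one or more of count, sum, average, maximum, minimum, variance, standard deviation, collection, and deduplicated collection; different indicators can also be assigned different time window lengths according to their business meaning.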
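The window assignment and per-indicator aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `AGGREGATORS`, `window_start`, and `aggregate_by_window` are hypothetical, timestamps are plain integer seconds, and only some of the listed aggregation algorithms are shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical registry of aggregation algorithms, keyed by indicator type.
AGGREGATORS = {
    "count": len,
    "sum": sum,
    "avg": mean,
    "max": max,
    "min": min,
    "collect": list,
    "distinct": lambda values: sorted(set(values)),
}


def window_start(timestamp: int, window_seconds: int) -> int:
    """Return the start of the fixed-length window that contains timestamp."""
    return timestamp - timestamp % window_seconds


def aggregate_by_window(records, window_seconds, aggregator):
    """Assign (timestamp, value) records to windows and aggregate each window."""
    buckets = defaultdict(list)
    for timestamp, value in records:
        buckets[window_start(timestamp, window_seconds)].append(value)
    fn = AGGREGATORS[aggregator]
    return {start: fn(values) for start, values in sorted(buckets.items())}


# Made-up sample records: (timestamp in seconds, amount)
records = [(10, 5), (3599, 7), (3600, 2), (7300, 9), (7400, 9)]

hourly_sum = aggregate_by_window(records, 3600, "sum")
hourly_distinct = aggregate_by_window(records, 3600, "distinct")
two_hour_count = aggregate_by_window(records, 7200, "count")
```

The same records yield different aggregated values depending on the indicator and on the window length chosen for it, which is the flexibility the paragraph above describes.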
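Further, the data on the aggregated edges is queried through a graph association relationship query language, which adds descriptive semantics for time-series information and supports queries on vertices, edges, and attributes.
Further, users can run graph queries against indicator results computed within a given time window; graph queries include graph matching and graph filtering.
Further, the graph association relationship query language supports predicate filtering semantics and fuzzy matching over edges with a variable number of steps.
Further, graph matching specifically means: given a starting vertex and a graph pattern to match, returning the entity objects that satisfy the pattern.
Further, graph filtering specifically means: on top of graph matching, obtaining a specified subset of the results according to filter conditions.
Further, in graph filtering, a filter condition can specify a time window, and that window can differ from the time window used for graph matching.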
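Beneficial effects of the present invention: the "aggregated edge" and "time-series aggregated edge" techniques, which build on the results of time-window aggregation, are well suited to fields such as marketing and real-time risk control based on massive data mining. Their main advantages are:
1) Good latency control. Computing association relationships often involves traversing the graph structure. Because aggregation results are computed in advance, the amount of computation during graph traversal is greatly reduced, and the aggregation results also shrink the search space of graph search.
2) High scalability. As the number of computed variables or the business scale grows, computing capacity can be increased simply by adding computing devices and distributed in-memory storage, which keeps the latency of complex logic computations under control.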
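To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic comparison of simple edges and aggregated edges in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a time-series aggregated edge implementation in a simple business scenario;
FIG. 3 is a schematic diagram of a time-series aggregated edge implementation in a complex business scenario.
To make the above objects, features, and advantages of the present invention clearer, specific embodiments are described in detail below with reference to the drawings.
Many specific details are set out in the following description to aid full understanding of the present invention, but the invention can also be implemented in ways other than those described here, and those skilled in the art can generalize similarly without departing from its substance. The invention is therefore not limited by the specific embodiments disclosed below.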
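FIG. 1 shows how the present invention builds the graph relationship structure and compares it with graph construction in a general graph database. The innovation of the present invention lies in the "aggregated edge" concept applied while the graph data structure is generated.
Traditional graph structures are usually built with "simple edges". With the "aggregated edge" proposed by the present invention, aggregation can be performed in advance, during graph construction, on attribute fields pre-selected and defined by the business. For example, when building the association relationship graph for transfers, the present invention does not merely record the transfer details between each pair of counterparties. Based on the indicators that business queries need, for example if the number of transfers is the key variable, the number of transfers can be aggregated in advance and stored as an attribute of the edge, forming an "aggregated edge".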
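The contrast between simple edges and an aggregated edge holding a pre-computed transfer count can be illustrated with the minimal Python sketch below. The transfer records and the names `simple_edges`, `aggregated_edges`, and `transfer_count` are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

# Made-up transfer details: (payer, payee, amount)
transfers = [
    ("A", "B", 100),
    ("A", "B", 250),
    ("A", "C", 40),
    ("A", "B", 75),
]

# Simple edges: one edge per transfer, with the details kept on each edge.
simple_edges = [
    {"src": src, "dst": dst, "amount": amount} for src, dst, amount in transfers
]

# Aggregated edges: one edge per (payer, payee) pair that holds the
# indicator computed in advance, here the number of transfers.
aggregated_edges = defaultdict(lambda: {"transfer_count": 0})
for src, dst, _amount in transfers:
    aggregated_edges[(src, dst)]["transfer_count"] += 1


def transfer_count(src, dst):
    """Read the pre-aggregated indicator without scanning transfer details."""
    edge = aggregated_edges.get((src, dst))
    return edge["transfer_count"] if edge else 0
```

With simple edges, answering "how many transfers from A to B" means scanning every edge between the two nodes; with the aggregated edge it is a single attribute lookup.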
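Building on the "aggregated edge", the present invention extends the concept with "time slicing" to obtain the "time-series aggregated edge". "Time slicing" means cutting continuous time by a given time unit (for example every day, every hour, or the past 30 minutes) into a series of fixed-length time windows. Every record is assigned to the corresponding time window according to the value of an agreed time attribute field (such as transaction time or event time). The data in each time window is aggregated with the algorithm the business requires, producing one aggregated value for that window. Different indicators use different aggregation algorithms depending on what they compute, and different indicators can be assigned different time slice lengths according to their business meaning.
In the graph model structure and storage scheme of the present invention, indicator results are stored in system memory in the "time-series aggregated edge" data structure. These results can then be used for all kinds of real-time association relationship computations and real-time decisions.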
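One way to picture such an in-memory structure is the Python sketch below, which updates a running sum per window as each event arrives. It is an assumption-laden illustration: the class `TimeSeriesAggregatedEdge`, the function `on_event`, and the `edges` dictionary are hypothetical names, and the source does not specify the actual memory layout.

```python
from collections import defaultdict


class TimeSeriesAggregatedEdge:
    """Hypothetical in-memory edge keeping one running sum per time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.values = defaultdict(int)  # window start -> aggregated value

    def add(self, timestamp, amount):
        """Fold one event into the window that contains its timestamp."""
        start = timestamp - timestamp % self.window_seconds
        self.values[start] += amount

    def total(self, start, end):
        """Sum the stored values of all windows starting in [start, end)."""
        return sum(v for t, v in self.values.items() if start <= t < end)


# Edge store keyed by (source entity, target entity).
edges = {}


def on_event(src, dst, timestamp, amount, window_seconds=3600):
    """Incremental update: create the edge on first use, then aggregate."""
    edge = edges.setdefault((src, dst), TimeSeriesAggregatedEdge(window_seconds))
    edge.add(timestamp, amount)


on_event("A", "B", 100, 10)
on_event("A", "B", 200, 5)
on_event("A", "B", 3700, 7)
```

Each incoming event touches exactly one stored value, so the raw event does not need to be retained for the indicator to stay queryable.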
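FIG. 2 gives an implementation example of the "time-series aggregated edge". Suppose account A made a series of transfers to account B over six hours, where each number is a transfer amount; the query indicator is the total amount transferred from account A to account B over some number of hours; and the time window length is defined as 1 hour. There are 13 data entries and 6 time windows. Instead of storing the 13 raw records, the present invention stores only the result aggregated for each time window, 6 values in total, called aggregated values. When answering a query, these aggregated values are combined by a pre-agreed algorithm (addition in this example). The amount of computation at query time is clearly much smaller than with the raw data, so the present invention consumes less time and less space than other graph databases. In this way, the information (amount) carried by the association relationship (transfer) between entities (accounts) is aggregated per time window according to a predefined indicator (cumulative amount) and stored in system memory, which forms the "time-series aggregated edge".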
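The FIG. 2 scenario can be reproduced as a small worked example. The figure's actual transfer amounts are not available in this text, so the 13 amounts below are made up purely for illustration; `window_totals` and `total_between` are hypothetical names.

```python
WINDOW_MINUTES = 60

# Made-up transfers from account A to account B over six hours:
# (minutes since the start of the first window, amount)
transfers = [
    (5, 100), (40, 50),
    (65, 20), (90, 30), (119, 10),
    (130, 200),
    (185, 5), (200, 15),
    (250, 60), (290, 40),
    (310, 25), (330, 35), (359, 10),
]

# Only these six aggregated values are stored, not the 13 raw records.
window_totals = [0] * 6
for minute, amount in transfers:
    window_totals[minute // WINDOW_MINUTES] += amount


def total_between(first_window, last_window):
    """Total transferred in windows first..last (inclusive), by addition only."""
    return sum(window_totals[first_window:last_window + 1])
```

A query over hours 2 to 4 adds three stored values instead of rescanning the raw transfers that fall into that period.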
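FIG. 3 shows a more complex business scenario. The nodes in the figure represent different entities in the scenario, some in the user dimension and some in the IP address dimension. All indicators and association relationships (for example the login frequency of IP_A, or the list of devices seen over a past period) carry time-series information, so the result for a given period can be produced for whichever time window is specified. As the business requires, the "time-series aggregated edge" technique keeps the aggregation results in memory for further queries.
To let users query the data on "time-series aggregated edges" conveniently, the present invention also introduces a new, matching graph association relationship query language, whose syntax is similar to that of the Cypher language. Besides basic queries on specific vertices, edges, and attributes, the language adds descriptive semantics for time-series information, so that users can run graph queries, including graph matching and graph filtering, against indicator results computed within a given time window. The present invention also supports predicate filtering semantics such as "all" and "any", as well as fuzzy matching over edges with a variable number of steps, which greatly strengthens its technical support for business scenarios.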
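The source text does not reproduce the query language itself, so the Python sketch below only illustrates the semantics named above: matching over a variable number of steps, combined with "all" or "any" predicates over the edges of a path. The graph, the `total` attribute, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical graph: node -> list of (neighbor, aggregated edge attributes)
graph = {
    "A": [("B", {"total": 500}), ("C", {"total": 50})],
    "B": [("D", {"total": 700})],
    "C": [("D", {"total": 900})],
    "D": [("E", {"total": 20})],
    "E": [],
}


def paths_up_to(start, max_hops):
    """Yield (nodes, edges) for every simple path of 1..max_hops edges."""
    queue = deque([(start, [start], [])])
    while queue:
        node, nodes, edges = queue.popleft()
        if edges:
            yield nodes, edges
        if len(edges) == max_hops:
            continue
        for neighbor, attrs in graph[node]:
            if neighbor not in nodes:
                queue.append((neighbor, nodes + [neighbor], edges + [attrs]))


def match(start, max_hops, quantifier, predicate):
    """End nodes of paths whose edges satisfy predicate under all() or any()."""
    return sorted({
        nodes[-1]
        for nodes, edges in paths_up_to(start, max_hops)
        if quantifier(predicate(edge) for edge in edges)
    })


def is_large(edge):
    return edge["total"] >= 100


all_large = match("A", 2, all, is_large)
any_large = match("A", 3, any, is_large)
```

Passing the built-in `all` or `any` as the quantifier keeps the two predicate semantics side by side; a real query engine would express the same thing declaratively.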
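The following table shows several examples of the query language:
Graph matching in the present invention means that, given a starting vertex and a graph pattern to match, the entity objects that satisfy the pattern are returned. An example is "find all merchants where bank cards bound to account number 123 made purchases in the past 24 hours". Here "account is bound to bank card" and "bank card is used for purchases at merchant" are two association relationships; a query statement in the graph association relationship query language can describe such a chain of relationships and return the matches.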
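Graph filtering in the present invention means that, on top of graph matching, a specified subset of the results can be obtained according to filter conditions. In the example above, one could add the condition "keep only the merchants whose cumulative transaction amount over 3 months exceeds 100,000 yuan". A filter condition can also specify a time window, and this window can differ from the one used for graph matching (3 months versus 24 hours here).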
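The matching and filtering examples above, each with its own time window, can be sketched as follows. Everything here is an assumption made for illustration: the data, the fixed reference time `NOW`, hourly windows on card-to-merchant edges, daily windows for the merchant indicator, and "3 months" approximated as 90 days.

```python
HOUR = 3600
DAY = 24 * HOUR
NOW = 100 * DAY  # fixed reference time so the example is deterministic

# Hypothetical "account is bound to bank card" relationships.
account_cards = {"123": ["card1", "card2"]}

# Hypothetical time-series aggregated edges "card spends at merchant":
# (card, merchant) -> {hourly window start: amount spent in yuan}
card_merchant_hourly = {
    ("card1", "M1"): {NOW - 2 * HOUR: 300},
    ("card1", "M2"): {NOW - 3 * DAY: 80},
    ("card2", "M3"): {NOW - 5 * HOUR: 1200, NOW - 30 * DAY: 10},
}

# Hypothetical merchant indicator: merchant -> {daily window start: total yuan}
merchant_daily_total = {
    "M1": {NOW - 10 * DAY: 60_000, NOW - 40 * DAY: 70_000},
    "M2": {NOW - 3 * DAY: 999_999},
    "M3": {NOW - 1 * DAY: 5_000, NOW - 200 * DAY: 500_000},
}


def window_sum(series, start, end):
    """Add up the aggregated values of windows starting in [start, end)."""
    return sum(value for t, value in series.items() if start <= t < end)


def match_merchants(account, since):
    """Graph matching: account -> bound cards -> merchants with recent spend."""
    cards = set(account_cards.get(account, []))
    return sorted({
        merchant
        for (card, merchant), series in card_merchant_hourly.items()
        if card in cards and window_sum(series, since, NOW) > 0
    })


def filter_merchants(merchants, since, threshold):
    """Graph filtering: keep merchants whose windowed total exceeds threshold."""
    return [
        m for m in merchants
        if window_sum(merchant_daily_total.get(m, {}), since, NOW) > threshold
    ]


matched = match_merchants("123", NOW - DAY)                     # 24-hour window
filtered = filter_merchants(matched, NOW - 90 * DAY, 100_000)   # ~3-month window
```

The match step and the filter step read different indicators over different windows, yet both only add up values that were aggregated in advance.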
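Aggregate computation in the present invention refers to the aggregation capability on the "time-series aggregated edge"; it also refers to ordinary or aggregate computation across the indicators of different vertices and edges. An example is "find the set S of accounts that account X, with account number 123, transferred to in the past 24 hours, and compute X's maximum transfer amount minus c, where c is the average of the maximum transfer amounts of all elements of S". The "maximum" in this example is specific to one account; it is stored in the "time-series aggregated edge" as one of its aggregated values, so it can be read directly from the in-memory data of the graph cube (图立方). The "average" is an aggregation over the set S, performed in the execution plan of the graph association relationship query language; it is an aggregate computation across the indicators of different vertices. The "minus" is an ordinary computation between intermediate results.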
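The three kinds of computation in this example (stored maximum, average over a set, subtraction of intermediate results) can be made concrete with a short Python sketch. The stored maxima and the set S below are made-up values, and the names `max_transfer`, `recipients`, and `max_minus_peer_average` are hypothetical.

```python
from statistics import mean

# Hypothetical stored aggregated values: account -> maximum transfer amount,
# assumed to have been read directly from time-series aggregated edges.
max_transfer = {"X": 900, "S1": 400, "S2": 600, "S3": 200}

# Hypothetical set S: accounts that X transferred to in the past 24 hours.
recipients = ["S1", "S2", "S3"]


def max_minus_peer_average(account, peers):
    """X's stored maximum minus the average of the peers' stored maxima."""
    c = mean(max_transfer[peer] for peer in peers)  # aggregation across vertices
    return max_transfer[account] - c                # ordinary computation


result = max_minus_peer_average("X", recipients)
```

Only the averaging step iterates over S; each maximum is a lookup of a value that already exists.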
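A graph query in the present invention returns the result of the above operations to the requester, or passes it to other modules as a feature for further graph analysis.
To measure how much the "aggregated edge" and "time-series aggregated edge" techniques improve association relationship query performance, we ran a comparative performance test against the open-source software Neo4j. Neo4j is a high-performance open-source NoSQL graph database. Because the open-source edition of Neo4j does not support horizontal scaling, the test was run on a single node. The test focused on graph construction efficiency and on queries for business scenarios.
The test used transaction records typical of financial scenarios, with the data structure shown in the following table: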
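Field | Type | Remarks |
---|---|---|
transTime | Long | Transfer time |
fromCardNo | String(32) | Card number of the transfer initiator (payer) |
toCardNo | String(32) | Card number of the transfer recipient (payee) |
transAmt | Long | Transfer amount, in fen |
channel | String(4) | Channel code |
bizCode | String(3) | Business code corresponding to the transaction |
stat | Integer | Transaction status: 0 = success, 1 = insufficient balance |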
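For readers who want to experiment with this record layout, a minimal Python sketch of it follows. The field names and length limits come from the table; the class name `Transaction`, the validation in `__post_init__`, and the helper properties are assumptions added for illustration, and the unit of `transTime` is not stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    transTime: int    # transfer time (unit not stated in the source)
    fromCardNo: str   # payer card number, at most 32 characters
    toCardNo: str     # payee card number, at most 32 characters
    transAmt: int     # transfer amount in fen (1 yuan = 100 fen)
    channel: str      # channel code, at most 4 characters
    bizCode: str      # business code, at most 3 characters
    stat: int         # 0 = success, 1 = insufficient balance

    def __post_init__(self):
        # Enforce the String(n) length limits given in the table.
        limits = {"fromCardNo": 32, "toCardNo": 32, "channel": 4, "bizCode": 3}
        for name, limit in limits.items():
            if len(getattr(self, name)) > limit:
                raise ValueError(f"{name} exceeds {limit} characters")

    @property
    def amount_yuan(self) -> float:
        """Transfer amount converted from fen to yuan."""
        return self.transAmt / 100

    @property
    def succeeded(self) -> bool:
        return self.stat == 0


sample = Transaction(
    1577664000000, "6222000000000001", "6222000000000002", 123456, "web", "T01", 0
)
```

Keeping `transAmt` as an integer number of fen avoids floating-point rounding when amounts are summed inside a time window.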
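In the graph-construction comparison, a benchmark tool loaded 100 million transaction records to build the graph and measure construction efficiency. Neo4j took 6153.544 s in total to build the graph from 100 million records; the present invention took 1026 s, roughly one sixth of the time. Response times were measured over 90 data items, and the performance comparison is shown in the following table:
For graph relationship queries, the following three business scenarios were compared.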
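Business scenario 1: over the past 4 days, starting from any payer card number A and going at most 4 levels deep, there exists a card number B such that, over the past 1 day, B's cumulative payment amount through channels other than web does not equal 600 yuan; and within the past n days (n = 2, 3, 4, 5), its cumulative payment amount through channels other than web does not equal 7000 yuan; and within the past n days (n = 2, 3, 4, 5), its cumulative payment amount does not equal 10000 yuan.
Business scenario 2: over the past 4 days, starting from any payer card number A and going at most 4 levels deep, for every node in the graph the correlation between the cumulative amount received and the cumulative amount paid over the past 2 days does not exceed 90%.
Business scenario 3: over the past 4 days, starting from any payer card number A and going at most 4 levels deep, the transaction chain contains an adjacent payer and payee whose transaction amounts within the past 2 days have a standard deviation greater than 10.0 and an average below 13000 yuan.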
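A condition like the one in scenario 3 can be evaluated from per-window aggregated values without the raw amounts, provided each window stores a count, a sum, and a sum of squares. The Python sketch below shows this standard technique; it is not taken from the patent. The amounts are made up, and the population standard deviation is assumed because the source does not say which variant is used.

```python
import math


def summarize(amounts):
    """Per-window aggregated value: (count, sum, sum of squares)."""
    return (len(amounts), sum(amounts), sum(a * a for a in amounts))


def merge(*summaries):
    """Combine window aggregates by adding them component-wise."""
    return tuple(sum(parts) for parts in zip(*summaries))


def mean_and_std(summary):
    """Mean and population standard deviation from a merged aggregate."""
    n, total, squares = summary
    mean = total / n
    variance = max(squares / n - mean * mean, 0.0)  # guard against rounding
    return mean, math.sqrt(variance)


# Made-up amounts (yuan) on one payer -> payee edge, one list per daily window.
day_1 = [100.0, 200.0]
day_2 = [300.0, 400.0]

# Only these two tuples would need to be stored on the edge.
stored = [summarize(day_1), summarize(day_2)]

mean, std = mean_and_std(merge(*stored))
edge_matches = std > 10.0 and mean < 13000
```

The sum-of-squares form is easy to merge across windows but can lose precision when values are large and close together; a mergeable Welford-style update is a common alternative in that case.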
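The test results above show that the present invention achieves a higher average number of transactions per second and shorter response times, which confirms that the proposed "aggregated edge" and "time-series aggregated edge" can speed up association relationship queries over massive time-series data. In addition, unlike the open-source Neo4j graph database, the present invention supports horizontal scaling into a cluster, whose capacity further strengthens its ability to process massive data.
The above is only a preferred embodiment of the present invention. Although the invention has been disclosed above through preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or to modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical substance of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.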
Claims (10)
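- A fast processing method for massive time-series data based on aggregated edges and time-series aggregated edges, characterized by comprising: performing aggregation on time-series data based on time windows to obtain aggregation results; generating aggregated edges and time-series aggregated edges from the aggregation results, and storing the data in memory according to the aggregated-edge data structure; building the aggregated edges or time-series aggregated edges into graph relationships in the manner of a property graph; and performing fast queries on the generated graph relationships based on the aggregated indicators.
- The method according to claim 1, characterized in that generating the aggregated edge comprises: during construction of the association relationship graph, performing aggregation in advance on attribute fields pre-selected and defined by the business, and storing the aggregation result as an attribute of the edge.
- The method according to claim 1, characterized in that generating the time-series aggregated edge comprises: cutting continuous time by a given time unit into a series of fixed-length time windows; assigning every time-series record to the corresponding time window according to the value of an agreed time attribute field; and aggregating the data in each time window with the aggregation algorithm the business requires, to obtain the aggregated value for that window.
- The method according to claim 3, characterized in that different indicators use different aggregation algorithms depending on what they compute, and different indicators can be assigned different time window lengths according to their business meaning.
- The method according to claim 1, characterized in that the data on the aggregated edges is queried through a graph association relationship query language, which adds descriptive semantics for time-series information and supports queries on vertices, edges, and attributes.
- The method according to claim 5, characterized in that users can run graph queries against indicator results computed within a given time window, the graph queries including graph matching and graph filtering.
- The method according to claim 6, characterized in that the graph matching specifically means: given a starting vertex and a graph pattern to match, returning the entity objects that satisfy the pattern.
- The method according to claim 6, characterized in that the graph filtering specifically means: on top of graph matching, obtaining a specified subset of the results according to filter conditions.
- The method according to claim 8, characterized in that, in the graph filtering, a filter condition can specify a time window, and the specified time window can differ from the time window used for graph matching.
- The method according to claim 6, characterized in that the graph association relationship query language supports predicate filtering semantics and fuzzy matching over edges with a variable number of steps.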
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/130147 WO2021134318A1 (zh) | 2019-12-30 | 2019-12-30 | Fast processing method of massive time-series data based on aggregated edge and time-series aggregated edge |
US17/358,017 US12013847B2 (en) | 2019-12-30 | 2021-06-25 | Fast processing method of massive time-series data based on aggregated edge and time-series aggregated edge |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/130147 WO2021134318A1 (zh) | 2019-12-30 | 2019-12-30 | Fast processing method of massive time-series data based on aggregated edge and time-series aggregated edge |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/358,017 Continuation US12013847B2 (en) | 2019-12-30 | 2021-06-25 | Fast processing method of massive time-series data based on aggregated edge and time-series aggregated edge |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021134318A1 (zh) | 2021-07-08 |
Family
ID=76686070
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/130147 WO2021134318A1 (zh) | 2019-12-30 | 2019-12-30 | Fast processing method of massive time-series data based on aggregated edge and time-series aggregated edge |
Country Status (2)
Country | Link |
---|---|
US (1) | US12013847B2 (zh) |
WO (1) | WO2021134318A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20240202814A1 (en) * | 2022-12-14 | 2024-06-20 | International Business Machines Corporation | Graph feature based system for flow management |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
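CN103593433A (zh) * | 2013-11-12 | 2014-02-19 | Institute of Information Engineering, Chinese Academy of Sciences | Graph data processing method and system for massive time-series data |
CN104867055A (zh) * | 2015-06-16 | 2015-08-26 | Xianning Municipal Public Security Bureau | Method for tracking and identifying suspicious funds in financial networks |
CN106682986A (zh) * | 2016-12-27 | 2017-05-17 | Nanjing Souwen Information Technology Co., Ltd. | Method for constructing complex financial transaction network activity graphs based on big data |
CN109164980A (zh) * | 2018-08-03 | 2019-01-08 | Beijing Taos Data Technology Co., Ltd. | Aggregation optimization processing method for time-series data |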
US20190370818A1 (en) * | 2014-10-08 | 2019-12-05 | Morgan Stanley Services Group Inc. | Computerized account database access tool |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10078664B2 (en) * | 2014-12-05 | 2018-09-18 | General Electric Company | Searching for and finding data across industrial time series data |
US11429627B2 (en) * | 2018-09-28 | 2022-08-30 | Splunk Inc. | System monitoring driven by automatically determined operational parameters of dependency graph model with user interface |
- 2019-12-30: WO, PCT/CN2019/130147, patent/WO2021134318A1/zh, active, Application Filing
- 2021-06-25: US, US17/358,017, patent/US12013847B2/en, active, Active
Also Published As
Publication number | Publication date |
---|---|
US20210319014A1 (en) | 2021-10-14 |
US12013847B2 (en) | 2024-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19958157; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 19958157; Country of ref document: EP; Kind code of ref document: A1 |