CN113434556B - Data processing method and system

Info

Publication number: CN113434556B
Application number: CN202110834327.9A
Authority: CN (China)
Prior art keywords: path, paths, node, data, sub-path
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113434556A
Inventor: 唐坤
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority: CN202210766110.3A (published as CN115080622A); CN202110834327.9A (published as CN113434556B)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The embodiments of this specification disclose a data processing method and a data processing system. The method comprises: obtaining a data flow graph, where the data flow graph comprises nodes between which data flows and edges that reflect the direction in which the data flows between the nodes; obtaining, from the data flow graph, the flow path of a given piece of data from a start node to a termination node, and thereby obtaining a plurality of flow paths corresponding to different pieces of data; truncating at least the plurality of flow paths, each a plurality of times, from the start node toward the termination node to obtain a plurality of outgoing sub-paths; truncating at least the plurality of flow paths, each a plurality of times, from the termination node toward the start node to obtain a plurality of source sub-paths; and storing the plurality of flow paths, the plurality of outgoing sub-paths, and the plurality of source sub-paths on a plurality of storage devices in a distributed manner for query.

Description

Data processing method and system
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to a data processing method and system.
Background
Data flows arise frequently in daily life and production, for example the circulation of funds among different fund accounts, the transport of goods among different warehouses, and the travel and migration of users among different places. Recording such data allows it to be used for subsequent analysis, decision-making, and the like, creating additional convenience or value for production.
However, as time goes on, more and more such data is generated, and how to record the data so that it can be used efficiently becomes a problem to be solved.
Disclosure of Invention
One aspect of the embodiments of this specification provides a data processing method. The method comprises: obtaining a data flow graph, where the data flow graph comprises nodes between which data flows and edges that reflect the direction in which the data flows between the nodes; obtaining, from the data flow graph, the flow path of a given piece of data from a start node to a termination node, and thereby obtaining a plurality of flow paths corresponding to different pieces of data; truncating at least the plurality of flow paths, each a plurality of times, from the start node toward the termination node to obtain a plurality of outgoing sub-paths; truncating at least the plurality of flow paths, each a plurality of times, from the termination node toward the start node to obtain a plurality of source sub-paths; and storing the plurality of flow paths, the plurality of outgoing sub-paths, and the plurality of source sub-paths on a plurality of storage devices in a distributed manner for query.
Another aspect of the embodiments of this specification provides a data processing system. The system comprises: a data flow graph acquisition module, configured to obtain a data flow graph, where the data flow graph comprises nodes between which data flows and edges that reflect the direction in which the data flows between the nodes; a flow path acquisition module, configured to obtain, from the data flow graph, the flow path of a given piece of data from a start node to a termination node, and thereby obtain a plurality of flow paths corresponding to different pieces of data; an outgoing path truncation module, configured to truncate at least the plurality of flow paths, each a plurality of times, from the start node toward the termination node to obtain a plurality of outgoing sub-paths; a source path truncation module, configured to truncate at least the plurality of flow paths, each a plurality of times, from the termination node toward the start node to obtain a plurality of source sub-paths; and a path storage module, configured to store the plurality of flow paths, the plurality of outgoing sub-paths, and the plurality of source sub-paths on a plurality of storage devices in a distributed manner for query.
Another aspect of the embodiments of this specification provides a data query method. The method comprises: obtaining a query request, where the query request comprises a path matching condition that describes a plurality of nodes in a path to be queried and the data flow relationships between them; obtaining, based on the query request, one or more candidate paths from the data flow paths and/or sub-paths stored in one or more storage devices, where the flow paths and/or sub-paths in the one or more storage devices are obtained by the data processing method described above; and aggregating the one or more candidate paths to obtain the path to be queried.
Another aspect of the embodiments of this specification provides a data query system. The system comprises: a query request acquisition module, configured to obtain a query request, where the query request comprises a path matching condition that describes a plurality of nodes in a path to be queried and the data flow relationships between them; a path acquisition module, configured to obtain, based on the query request, one or more candidate paths from the data flow paths and/or sub-paths stored in one or more storage devices, where the flow paths and/or sub-paths in the one or more storage devices are obtained by the data processing method described above; and a path processing module, configured to aggregate the one or more candidate paths to obtain the path to be queried.
Another aspect of the embodiments of this specification provides a data processing apparatus comprising at least one storage medium storing computer instructions and at least one processor; the at least one processor is configured to execute the computer instructions to implement the data processing method.
Another aspect of the embodiments of this specification provides a computer-readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform the data processing method.
Another aspect of the embodiments of this specification provides a data processing apparatus comprising at least one storage medium storing computer instructions and at least one processor; the at least one processor is configured to execute the computer instructions to implement the data query method.
Another aspect of the embodiments of this specification provides a computer-readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform the data query method.
Drawings
This description is further explained below by way of exemplary embodiments, which are described in detail with reference to the accompanying drawings. These embodiments are not intended to be limiting; in these embodiments, like numerals indicate like structures, wherein:
FIG. 1 is an exemplary diagram of a data flow diagram shown in accordance with some embodiments of the present description;
FIG. 2 is an exemplary flow diagram of a data processing method according to some embodiments of the present description;
FIG. 3 is an exemplary diagram illustrating obtaining multiple flow paths according to some embodiments of the present description;
FIG. 4 is an exemplary schematic diagram of outgoing truncation, shown in accordance with some embodiments of the present description;
FIG. 5 is an exemplary schematic diagram of a source truncation shown in accordance with some embodiments of the present description;
FIG. 6 is an exemplary diagram of similar path merging, shown in accordance with some embodiments of the present description;
FIG. 7 is an exemplary diagram illustrating distributed storage in terms of time partitions according to some embodiments of the present description;
FIG. 8 is an exemplary diagram of an update flow path shown in accordance with some embodiments herein;
FIG. 9 is an exemplary flow diagram of a data query method, shown in accordance with some embodiments of the present description;
FIG. 10 is an exemplary block diagram of a data processing system, shown in accordance with some embodiments of the present description;
FIG. 11 is an exemplary block diagram of a data query system, shown in accordance with some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "device", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, portions or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a," "an," and "the" do not refer exclusively to the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprise" and "include" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.
Data flows arise frequently in daily life and production, for example the circulation of funds among different fund accounts, the transport of goods among different warehouses, and the travel and migration of users among different places. These scenarios generate a large amount of data flow data, which contains abundant and valuable information.
Take funds as an example. Funds may be transferred between users and enterprises or between enterprises, and such transfers may also be referred to as the circulation of funds.
Analyzing fund flow data makes it possible to understand the intent behind transfers made by users or enterprises and to discover illegal behavior in time, such as raising funds through improper means or using funds through abnormal channels. It is therefore useful to analyze fund circulation information, for example where each sum of money in the circulation process comes from (recharge, transfer, etc.), which accounts it passes through, where it finally flows to and how long that takes, whether the route by which the funds are used is illegal, where the funds remaining in a given account come from, and whether that source is normal.
As time goes on, more and more circulation information accumulates and the volume of the graph grows. It is therefore necessary to provide a data processing method that reduces the pressure of data storage and improves the data query speed during subsequent analysis; other embodiments of this specification provide a data query method that increases the speed and flexibility of data queries. It should be noted that the above examples relating to fund circulation are for illustration only and are not intended to limit the application scenarios of the technical solutions disclosed in this specification; for example, the solutions may also be used to analyze telephone charges, data traffic, freight transport, the movement of people, the movement of vehicles, and the like. The technical solutions disclosed in this specification are explained in detail below with reference to the drawings.
Fig. 1 is an exemplary diagram of a data flow diagram shown in accordance with some embodiments of the present description.
In some embodiments, data flows may be represented by data flow diagrams. As shown in fig. 1, data flow graph 100 may include nodes 110 and edges 120.
Nodes 110 have data flow between them. In some embodiments, node 110 may correspond to an account (such as a bank account, an application account, etc. capable of storing funds), a place, a warehouse, etc. Data may be an abstract reference to information about people or things that are flowing between different nodes. Illustratively, the data may be information relating to funds, telephone charges, traffic, people, goods, vehicles, and the like. In some embodiments, the data flow graph 100 may be used to represent the flow of funds, charges, traffic, etc. between nodes, e.g., from which node the data originated and to which node the data went. In some embodiments, the data flow diagram 100 may also be used to represent a trajectory of motion of a person or vehicle, etc., e.g., where the person or vehicle came from, where it went to.
Edges 120 may be used to represent the direction in which data flows between nodes 110. An edge 120 may be a directed edge pointing from one node to another. In some embodiments, the edges 120 may also represent the scenario in which data flows between the nodes 110, for example the manner in which the data is transferred. For example, when the data is funds, telephone charges, data traffic, or the like, the data may flow from node A to node I, and A-B-C-D-E-F-I may represent the path through which the data flows. Along this flow path the data may move toward node I in a number of ways, where node A may be user a's bank account, node B may be user a's application account, node C may be user b's bank account, and so on. The edges can represent the direction of the data flow together with scenario information such as transfer, recharge, or consumption. For example, edge A-B may indicate that node A transferred 10 yuan to node B on January 1, 2021, and edge B-C may indicate that node B recharged node C with that 10 yuan on January 2, 2021. A-B-C-D-E-F-I then indicates that the 10 yuan from node A passed through nodes B, C, D, E, and F via the corresponding circulation scenarios and finally flowed into node I. As another example, the nodes may represent locations and the edges may represent the directions and modes of movement of people or vehicles. For example, the path A-B-C-D-E-F-I may indicate when and by what means (e.g., walking, cycling, driving, taking public transport) a user moved from node A to node I, or which nodes a vehicle passed on its way from node A to node I.
In some embodiments, relevant information needs to be obtained from the data flow graph based on a preset query condition (also referred to as a query request). For example, if the path matching condition in the query request includes "A-B", all flow paths that contain the sub-path from node A to node B need to be obtained from the data flow graph. Obtaining all flow paths that satisfy the path matching condition in this way requires traversing the whole graph, which touches every node and edge, has low query efficiency, and occupies considerable computing resources. Therefore, some embodiments of this specification provide a data processing method that disassembles or merges the data of the data flow graph before storage, so as to improve query efficiency and save query time.
FIG. 2 is an exemplary flow diagram of a data processing method according to some embodiments of the present description. In some embodiments, flow 200 may be performed by a processing device (e.g., a server or a load balancing device in a distributed cluster of devices). For example, the process 200 may be stored in a storage device (e.g., an onboard memory unit of a processing device or an external storage device) in the form of a program or instructions that, when executed, may implement the process 200. The flow 200 may include the following operations.
Step 202, obtaining a data flow map. In some embodiments, step 202 may be performed by the dataflow graph acquisition module 1010.
Data may be an abstract reference to information for something. Such as funds, charges, traffic, people, vehicles, etc.
A data flow graph is a data structure that reflects how data flows between nodes. The nodes may differ for different kinds of data. For example, when the data is information related to funds, telephone charges, or data traffic, a node may be an account capable of storing or holding funds, traffic, or charges (identified by a type and an account number, such as a 2088 balance account, a 139xxxxxxxx mobile phone number, or a 1399 application balance), a bank account number (e.g., a 2088xx bank debit card or a 2072yy bank credit card), and the like; when the data is information related to a person or a vehicle (such as an identity or license plate), a node may be a space such as a building or a place. An edge may represent the direction in which data is transferred from one node to another, for example funds transferred from the 2088xx bank debit card to the 2072yy bank credit card, or user XX travelling from the Beijing railway station to a certain residential area.
In some embodiments, the edges between nodes may also represent the scenario in which data flows between the nodes, that is, the circumstances under which the data is transferred. For example, funds may be transferred from the 2088xx bank debit card to the 2072yy bank credit card by bank transfer on January 1, 2021; the time and the transfer manner can both be regarded as the scenario of the data flow. In some embodiments, the scenario may also include recharge, consumption, red envelope, and the like.
In some embodiments, the processing device may obtain the data flow graph by reading from a storage device, a database, invoking a data interface, or constructing a data flow graph based on nodes and edges.
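As an illustration of the last option, the sketch below shows one possible in-memory representation of a data flow graph built from edge records that carry the source node, target node, amount, time, and scenario. The class and field names are illustrative assumptions, not part of the claimed method.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List

@dataclass
class Edge:
    """A directed edge: data flows from src to dst under a given scenario."""
    src: str
    dst: str
    amount: float      # e.g. yuan transferred
    time: str          # e.g. "2021-01-01"
    scenario: str      # e.g. "transfer", "recharge", "consumption"

@dataclass
class DataFlowGraph:
    out_edges: dict = field(default_factory=lambda: defaultdict(list))
    in_edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_edge(self, edge: Edge) -> None:
        self.out_edges[edge.src].append(edge)
        self.in_edges[edge.dst].append(edge)

    def start_nodes(self) -> List[str]:
        """Nodes with outgoing edges but no incoming edges."""
        return [n for n in self.out_edges if n not in self.in_edges]

    def termination_nodes(self) -> List[str]:
        """Nodes with incoming edges but no outgoing edges."""
        return [n for n in self.in_edges if n not in self.out_edges]

# Example usage: funds flow A -> B -> C.
g = DataFlowGraph()
g.add_edge(Edge("A", "B", 10, "2021-01-01", "transfer"))
g.add_edge(Edge("B", "C", 10, "2021-01-02", "recharge"))
print(g.start_nodes(), g.termination_nodes())   # ['A'] ['C']
```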
It should be noted that, in the following description, for the sake of clarity and conciseness, the technical solution disclosed in the present specification is mainly described by taking data as fund-related information as an example, but it should be understood that the technical solution disclosed in the present specification can also be applied to other scenarios.
Step 204, obtaining, from the data flow graph, the flow path of the same data from a start node to a termination node, and further obtaining a plurality of flow paths corresponding to different data. In some embodiments, step 204 may be performed by the flow path acquisition module 1020.
A flow path is the path, formed by nodes and the edges connecting them, that a certain sum of funds passes through as it circulates in the data flow graph. The graph shown in FIG. 1 contains a plurality of flow paths, such as A-B-C-D-E-F-I, A-B-C-D-E-F-X, and K-L-D-E-F-H.
The start node may be the first node that a certain sum of funds touches in the data flow graph, i.e., the node through which funds from outside the graph first enter it. Such a node typically has no incoming edges but has outgoing edges, for example node A, node K, and node M in FIG. 1. In some embodiments, even a node with incoming edges may be the start node for a certain sum of funds. For example, if node L receives 50 yuan from node K and at the same time 30 yuan flows in from outside the graph, node L can be regarded as the start node of the 30-yuan fund data.
The termination node may be the last node that a certain sum of funds touches in the data flow graph, i.e., the node where the funds finally stay. Such a node typically has incoming edges but no outgoing edges, for example node I, node X, node H, and node J in FIG. 1. In some embodiments, even a node with outgoing edges may be the termination node for a certain sum of funds. For example, if node G receives 30 yuan from node F and transfers 10 yuan to node J, the remaining 20 yuan stays at node G, and node G can be regarded as the termination node of the 20-yuan fund data. It should be appreciated that a flow path is tied to specific data; in some embodiments, determining the same data also yields its flow path.
In some embodiments, each piece of circulating data may carry a unique identity, in which case the same circulating data can be identified quickly and conveniently based on that identity, for example a user identified by a user number or a vehicle identified by a license plate number. For fund data, however, especially in the form of digital currency, there is no explicit identity attached to different sums of funds, and different sums may contain or be split from one another. For example, node A receives 500 yuan from node B, transfers 200 yuan to node C and 150 yuan to node E, and keeps the remaining 150 yuan. There are then four pieces of fund data: the incoming 500 yuan, the 200 yuan transferred to node C, the 150 yuan transferred to node E, and the remaining 150 yuan. The 200 yuan is contained in, or split from, the 500 yuan, so the incoming edge of the 500-yuan transfer from B to A and the outgoing edge of the 200-yuan transfer from A to C can be regarded as involving the same 200 yuan of fund data. A practical method is therefore needed to find the same fund data in the data flow graph.
In some embodiments, the same data among the inflows and outflows of a node may be determined first. For example, a portion of the fund data flowing into a node and a portion of the fund data flowing out of the same node that satisfy a preset condition are determined to be the same data. The preset condition may include last-in-first-out and equal amount. Last-in-first-out means that, among the funds that have flowed into a node, the funds that arrived most recently are considered to flow out first; that is, fund data b that flows out right after fund data a flows into the same node is considered to contain the same fund data. Equal amount means that when funds flow out of a node, the outflow is preferentially matched to an earlier inflow of equal amount; that is, an inflow and a subsequent outflow of the same node with equal amounts are considered to be the same data. In some embodiments, the equal-amount principle may be combined with the last-in-first-out principle to determine the same data among the inflows and outflows of a node.
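The following sketch illustrates the last-in-first-out and equal-amount matching described above for a single node, assuming each inflow and outflow is a simple (time, amount) record; it is only meant to show the pairing principle, not the exact rule set of any particular implementation.

```python
def match_in_out(inflows, outflows):
    """Pair each outflow of a node with the inflow(s) it is taken from.

    inflows / outflows: lists of (time, amount) tuples with comparable times.
    Sketch of the rule: an outflow first looks for an earlier inflow whose
    unmatched amount equals its own ("equal amount"); failing that, it draws
    from the most recent earlier inflows ("last in, first out"), splitting an
    inflow when necessary. Returns (inflow_idx, outflow_idx, amount) triples.
    """
    remaining = [amount for _, amount in inflows]      # unmatched inflow parts
    matches = []
    for out_idx, (out_time, out_amount) in enumerate(outflows):
        earlier = [i for i, (in_time, _) in enumerate(inflows)
                   if in_time <= out_time and remaining[i] > 0]
        # 1) equal-amount preference
        equal = [i for i in earlier if remaining[i] == out_amount]
        if equal:
            i = equal[-1]                              # latest equal inflow
            matches.append((i, out_idx, out_amount))
            remaining[i] = 0
            continue
        # 2) otherwise: last in, first out, splitting inflows if needed
        need = out_amount
        for i in reversed(earlier):
            take = min(need, remaining[i])
            matches.append((i, out_idx, take))
            remaining[i] -= take
            need -= take
            if need == 0:
                break
    return matches

# Node L: receives 50 from K, then 30 from outside; later pays out 30.
print(match_in_out([("t1", 50), ("t2", 30)], [("t3", 30)]))
# -> [(1, 0, 30)]  (the 30-yuan outflow matches the 30-yuan inflow)
```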
In some embodiments, after the same data within individual nodes has been determined, the next node of a data flow path may be determined based on the intersection of the same data between adjacent nodes, and a flow path can thus be determined. It is therefore easy to see that the data of a flow path is the intersection of the same data of all nodes on that path.
For ease of understanding, the same data and flow paths are described with reference to FIG. 1. For example, suppose fund data of 100 yuan enters the data flow graph at node D; node D then transfers 50 yuan of it to node E, node E transfers 20 yuan to node F and 30 yuan to node Q, node F transfers 10 yuan to node G, and the 10 yuan finally stays at node G. For node E, the 50 yuan received from node D contains the 30 yuan transferred to node Q, so the flow path of the 30-yuan fund data is D-E-Q, and the data of this flow path is the intersection of the same data at each node, namely 30 yuan. Similarly, for node E the 50 yuan received from node D contains the 20 yuan transferred to node F, and for node F a 10-yuan portion of the 20 yuan received from node E is the same data as the 10 yuan transferred from node F to node G; this gives the flow path D-E-F-G, whose data is the intersection of the same data at all nodes, namely 10 yuan.
In some embodiments, based on one or more pieces of the same data and on one or more start nodes and termination nodes in the flow graph, the processing device may extract the nodes whose same data intersects, together with the edges connecting those nodes, resulting in multiple flow paths corresponding to different data.
Referring to fig. 3, fig. 3 is an exemplary diagram illustrating obtaining multiple flow paths according to some embodiments herein. For the sake of example, in fig. 3, 310 represents a simple data flow diagram, and 320 represents a plurality of flow paths obtained from the data flow diagram 310.
Suppose node A holds 50 yuan and transfers 30 yuan, 15 yuan, and 5 yuan to node B at different times. For node B, the 30 yuan, 15 yuan, and 5 yuan are each a separate piece of data, i.e., three pieces of data in total. Node B in turn passes these pieces of data on to node C at different times, and they eventually reach node F. Node F then transfers the 30 yuan to node I, the 15 yuan to node X, and the 5 yuan to node H. Three flow paths can therefore be obtained: the flow path 321 from node A to node I, the flow path 322 from node A to node X, and the flow path 323 from node A to node H, whose data are 30 yuan, 15 yuan, and 5 yuan respectively. As FIG. 3 shows, node A is the start node of these three flow paths, and nodes I, X, and H are their respective termination nodes. A flow path can be regarded as the complete path of a sum of funds from its entry into the flow graph to its final resting point. It will be appreciated that the data flow graph grows over time, so a cut-off time may be given when discussing a flow path, which is then the complete path of the fund data from its entry into the flow graph up to that cut-off time.
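One possible way to enumerate flow paths from such per-node matches is sketched below; the `forward` mapping (same-data links between consecutive edges) is an assumed intermediate structure, and the amount of a path is taken as the intersection, i.e., the minimum shared amount along the chain.

```python
def flow_paths(start_edges, forward):
    """Enumerate flow paths as (node_list, amount) pairs.

    start_edges: (src, dst, amount) edges that leave a start node.
    forward: dict mapping an edge to a list of (next_edge, shared_amount)
             pairs judged to carry the same data at the shared node
             (e.g. produced by the per-node matching sketched earlier).
    """
    paths = []

    def extend(edge, nodes, amount):
        nexts = [(e, min(amount, a)) for e, a in forward.get(edge, [])]
        if not nexts:                      # data stays here: path terminates
            paths.append((nodes, amount))
            return
        remaining = amount - sum(a for _, a in nexts)
        if remaining > 0:                  # part of the data stays at this node
            paths.append((nodes, remaining))
        for next_edge, shared in nexts:
            extend(next_edge, nodes + [next_edge[1]], shared)

    for edge in start_edges:
        extend(edge, [edge[0], edge[1]], edge[2])
    return paths

# D receives 100; 50 of it goes D-E, of which 30 goes E-Q and 20 goes E-F,
# and 10 of that 20 goes F-G (matching the example in the text).
fwd = {
    ("D", "E", 50): [(("E", "Q", 30), 30), (("E", "F", 20), 20)],
    ("E", "F", 20): [(("F", "G", 10), 10)],
}
print(flow_paths([("D", "E", 50)], fwd))
# -> [(['D','E','Q'], 30), (['D','E','F'], 10), (['D','E','F','G'], 10)]
#    (the middle entry is the 10 yuan that stays at node F)
```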
Step 206, truncating at least the plurality of flow paths, each a plurality of times, from the start node toward the termination node to obtain a plurality of outgoing sub-paths. In some embodiments, step 206 may be performed by the outgoing path truncation module 1030.
Truncation means cutting a portion off a flow path and discarding it. A sub-path is a path obtained by cutting a flow path, with fewer nodes and edges than the original flow path.
An outgoing sub-path is a sub-path obtained by cutting a flow path in the direction from its start node toward its termination node. An outgoing sub-path can be used to represent where funds go from any node other than the termination node to the termination node of the flow path. For example, refer to FIG. 4, which is an exemplary schematic diagram of outgoing truncation according to some embodiments of this description. In the diagram 400, assume path 410 is a flow path to be truncated; the path A-B-C-D-E-F-I indicates that a certain sum of fund data starts from node A, circulates among a number of nodes, and finally stays at node I.
For each flow path, the processing device may, for each node from the node after the start node to the node before the termination node, truncate and discard the portion in front of that node, obtaining a plurality of initial outgoing sub-paths. For example, referring to FIG. 4, the portion in front of each node from node B (the node after start node A) to node F (the node before termination node I), i.e., in front of nodes B, C, D, E, and F, is truncated and discarded in turn, and each truncation yields one initial outgoing sub-path: the initial outgoing sub-paths 411, 412, 413, 414, and 415. Truncating each of the plurality of flow paths in this way yields the initial outgoing sub-paths of the plurality of flow paths.
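A minimal sketch of outgoing truncation, assuming a path is represented simply as a list of node names (amounts and scenarios omitted for brevity), could look as follows.

```python
def outgoing_subpaths(path):
    """Truncate a flow path from its start node toward its termination node.

    path: list of node names, e.g. ["A", "B", "C", "D", "E", "F", "I"].
    Each truncation drops the portion in front of one node, from the node
    after the start node up to the node before the termination node, so the
    result is every suffix that still ends at the termination node.
    """
    return [path[i:] for i in range(1, len(path) - 1)]

path = ["A", "B", "C", "D", "E", "F", "I"]
for sub in outgoing_subpaths(path):
    print("-".join(sub))
# B-C-D-E-F-I, C-D-E-F-I, D-E-F-I, E-F-I, F-I  (cf. FIG. 4)
```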
In some embodiments, the processing device may directly take the initial outgoing sub-paths as the outgoing sub-paths.
In some embodiments, the processing device may merge similar sub-paths among the initial outgoing sub-paths of the multiple flow paths to obtain the multiple outgoing sub-paths. For example, for some analysis scenarios, the initial outgoing sub-paths within the minimum analysis time span or period (e.g., one day, one week, one month) may be merged, which effectively reduces the storage space for the paths while still meeting the required analysis time granularity.
Similar outgoing sub-paths are sub-paths that have at least the same nodes and edges reflecting the same flow direction; that is, the nodes are the same and the directions of the edges between them (including the order of the nodes) are the same. In some embodiments, the data of similar outgoing sub-paths may be the same or different; for example, several similar outgoing sub-paths may differ only in the amount of funds or the time of the transaction. In some embodiments, similar outgoing sub-paths may also be required to have the same scenario, for example the same manner of fund transaction (e.g., all transfers, or all recharges). For example, suppose several similar outgoing sub-paths can all be represented as E-F-I and, in addition to having the same nodes and edges, share the same scenario, such as a transfer between E and F and a recharge between F and I; then in every one of these similar outgoing sub-paths, E-F must be a transfer and F-I must be a recharge.
Merging means taking the union, or the sum, of the data corresponding to the similar outgoing sub-paths. Taking the union means that the data of the different similar outgoing sub-paths are all kept as the data of the merged outgoing sub-path. For example, when the data is information related to people or vehicles, the identities of the people or vehicles can be united, and the data of the merged sub-path is the information of several people or several vehicles. Summing means adding up the data of the different similar outgoing sub-paths; for example, if the data of several similar outgoing sub-paths are 30 yuan, 20 yuan, and 10 yuan, the data of the merged outgoing sub-path is 30 + 20 + 10 = 60 yuan.
In some embodiments, a union may also be taken when the data is funds. For example, suppose there are three similar outgoing sub-paths F-I (obtained by truncating different flow paths) whose data are 30 yuan, 20 yuan, and 10 yuan; taking the union means the data of the merged outgoing sub-path is "30 yuan, 20 yuan, 10 yuan". In some embodiments, the merged outgoing sub-path may also retain the transaction scenario of the data, such as the transaction manner, in which case its data may be "30 yuan transferred on January 1, 2021; 20 yuan recharged on January 2, 2021; 10 yuan consumed on January 3, 2021".
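A possible sketch of this merging step is shown below; the key used to group similar sub-paths (nodes plus per-edge scenario) and the two merge modes are assumptions chosen for illustration.

```python
from collections import defaultdict

def merge_similar(subpaths, mode="sum"):
    """Merge sub-paths with identical nodes, edge directions and scenarios.

    subpaths: list of (key, data) where key is a tuple describing the path,
              e.g. (("E", "transfer", "F"), ("F", "recharge", "I")), and data
              is an amount (for mode="sum") or any record (for mode="union").
    Returns a dict mapping each distinct key to its merged data.
    """
    merged = defaultdict(list)
    for key, data in subpaths:
        merged[key].append(data)
    if mode == "sum":
        return {k: sum(v) for k, v in merged.items()}
    return dict(merged)                      # union: keep all records

# Three similar F-I sub-paths with amounts 30, 20 and 10 (cf. FIG. 6).
key = (("F", "transfer", "I"),)
print(merge_similar([(key, 30), (key, 20), (key, 10)]))
# -> {(('F', 'transfer', 'I'),): 60}
```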
Refer to FIG. 6, which is an exemplary diagram of similar-path merging according to some embodiments of this description. In the diagram 600, sub-path 610, sub-path 620, and sub-path 630 are similar sub-paths; it will be understood that these similar paths may represent initial outgoing sub-paths. The data of sub-path 610 is 10 yuan, the data of sub-path 620 is 20 yuan, and the data of sub-path 630 is 30 yuan; merging them yields sub-path 640 with data of 60 yuan.
It will be understood that a merged outgoing sub-path has the same nodes and edges, reflecting the same flow direction, as the original initial outgoing sub-paths, and that its data may be the union or the sum of the data of those initial outgoing sub-paths.
It should be noted that, in a sub-path obtained by truncation, the first node may be called the head node; the head node may or may not be a start node (as defined in the description of step 204). For example, the head node of an outgoing sub-path is not a start node, whereas the head node of a source sub-path is a start node. Similarly, the last node of a truncated sub-path may be called the tail node; it may or may not be a termination node (as defined in the description of step 204). For example, the tail nodes of outgoing sub-paths are all termination nodes, whereas the tail node of a source sub-path is not a termination node. In the example above, the flow path 410 can be used to query or analyse where the funds of node A go. In the outgoing sub-paths obtained by truncating this flow path, the head nodes become the other nodes through which node A's funds flow, such as node B and node C, and the corresponding outgoing sub-paths can be used to indicate where the funds of those other nodes go. For example, outgoing sub-path 411 can represent where the funds of node B go, and outgoing sub-path 412 where the funds of node C go. Therefore, to analyse all paths from node B through node C to node I, initial candidate paths whose head node is node B can be selected from the flow paths and their outgoing sub-paths, and candidate paths passing through node C can then be selected from those initial candidates as the query result. Compared with traversing the whole graph, this greatly improves query efficiency.
Step 208, truncating at least the plurality of flow paths, each a plurality of times, from the termination node toward the start node to obtain a plurality of source sub-paths. In some embodiments, step 208 may be performed by the source path truncation module 1040.
A source sub-path is a sub-path obtained by cutting a flow path in the direction from its termination node toward its start node. A source sub-path can be used to represent how any node other than the start node receives funds from the start node of the flow path. For example, refer to FIG. 5, which is an exemplary schematic diagram of source truncation according to some embodiments of this description. In the diagram 500, assume path 510 is a flow path to be source-truncated; the path A-B-C-D-E-F-I indicates that the funds that finally stay at node I entered from node A and passed through a number of nodes. It will be appreciated that the source sub-paths can also be used to represent the source of funds for each node after the start node of the flow path: by looking at the flow path 510 and each source sub-path cut from it, it is easy to see that the funds of every node after the start node (nodes B, C, D, E, F, and I) come from start node A. For each flow path, the processing device may, for each node from the node before the termination node down to the node after the start node, truncate and discard the portion behind that node, obtaining a plurality of initial source sub-paths. For example, referring to FIG. 5, the portion behind each node from node F (the node before termination node I) down to node B (the node after start node A), i.e., behind nodes F, E, D, C, and B, is truncated and discarded in turn, yielding the initial source sub-paths 511, 512, 513, 514, and 515. Truncating each of the plurality of flow paths in this way yields the initial source sub-paths of the plurality of flow paths.
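Under the same list-of-nodes representation assumed earlier, source truncation can be sketched as taking every prefix of the flow path that still begins at the start node.

```python
def source_subpaths(path):
    """Truncate a flow path from its termination node toward its start node.

    path: list of node names, e.g. ["A", "B", "C", "D", "E", "F", "I"].
    Each truncation drops the portion behind one node, from the node before
    the termination node down to the node after the start node, so the
    result is every prefix that still begins at the start node.
    """
    return [path[:i] for i in range(len(path) - 1, 1, -1)]

path = ["A", "B", "C", "D", "E", "F", "I"]
for sub in source_subpaths(path):
    print("-".join(sub))
# A-B-C-D-E-F, A-B-C-D-E, A-B-C-D, A-B-C, A-B  (cf. FIG. 5)
```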
In some embodiments, the processing device may directly take the initial source sub-paths as the source sub-paths. In some embodiments, the processing device may merge similar source sub-paths among the initial source sub-paths of the plurality of flow paths to obtain the plurality of source sub-paths. Similar source sub-paths have at least the same nodes and edges reflecting the same flow direction, and merging includes taking the union or the sum of the data corresponding to the similar source sub-paths.
For more details on similar source sub-paths and on merging them, refer to the description of similar outgoing sub-paths in step 206; the similarity conditions and the merging manner are analogous and are not repeated here.
In these embodiments, cutting and splitting the flow paths in the manner described above not only makes path splitting efficient but also allows the destination or source of funds at many nodes to be queried quickly and accurately, which improves query efficiency and makes queries more flexible. In addition, merging the similar sub-paths obtained by splitting effectively reduces the data volume and relieves the data storage pressure.
Step 210, storing the plurality of flow paths, the plurality of outgoing sub-paths, and the plurality of source sub-paths on multiple storage devices in a distributed manner for query. In some embodiments, step 210 may be performed by the path storage module 1050.
In some embodiments, the processing device may store the plurality of flow paths obtained from the data flow graph, together with the plurality of outgoing sub-paths and the plurality of source sub-paths obtained by truncation, on the multiple storage devices in a distributed manner according to a preset storage rule.
The preset storage rule may be that the storage spaces of the plurality of storage devices are spatially divided, and the plurality of circulation paths, the plurality of going sub-paths, and the plurality of source sub-paths are stored in the corresponding intervals. In some embodiments, the processing device may partition the storage space of the plurality of storage devices by time partition, target classification partition, target crowd partition, or the like. Time partitioning may refer to partitioning the storage space in time spans, e.g., partitioning by hours, by days, by weeks, etc., with one time period corresponding to one partition. The target classification partition may be a partition according to a category of data corresponding to the path, for example, dividing the fund into a large amount of fund, a medium amount of fund and a small amount of fund, which respectively correspond to different partitions. The target population zones may be zones according to gender of the person, for example, male corresponds to one zone, female corresponds to one zone, and the like. It should be understood that a partition on a storage device may refer to a physical partition on a storage medium, e.g., the same storage page where data of the same partition is stored in the storage medium; a partition of a storage device may also refer to a logical partition, such as where data of the same partition logically belongs to the same partition, but is not necessarily stored in the same physical partition of the storage medium when actually stored.
Refer to FIG. 7, which is an exemplary diagram of distributed storage by time partitions according to some embodiments of this description. In the diagram 700, each storage device may have multiple time partitions, for example partition 710, partition 720, partition 730, and partition 740, taking partitioning by day as an example. If T denotes a given day, partition 710 may correspond to T-1, partition 720 to T, partition 730 to T+1, partition 740 to T+2, and so on.
When the processing device stores the plurality of flow paths, the plurality of outgoing sub-paths, and the plurality of source sub-paths on multiple storage devices in a distributed manner, it may divide the paths (including the flow paths and the various sub-paths) according to the truncation mode and the time partition, and then store the division results into the corresponding time partitions of the multiple storage devices. For example, for outgoing truncation, the plurality of flow paths and the plurality of outgoing sub-paths are stored together according to the division results; for source truncation, the plurality of flow paths and the plurality of source sub-paths are stored together according to the division results.
In some embodiments, when storing the plurality of flow paths and the plurality of outgoing sub-paths by time partition, the flow paths and outgoing sub-paths whose time related to the start node or head node belongs to the same period may be stored in the same time partition on one or more storage devices. The time related to the start node or head node may be the time at which a transaction involving that node occurred, the time at which the node received the fund data of the corresponding flow path, the time at which the node transferred that fund data to its downstream node, or the like. For example, assume A-B-C is a flow path and B-C is its outgoing sub-path. When storing them, the time related to the start node or head node of the flow path A-B-C and of the outgoing sub-path B-C, such as the time of the transaction involving that node (e.g., the time of the fund transaction from node A to node B, and the time of the fund transaction from node B to node C), determines the corresponding partition: if that time is January 1, 2021, the path corresponds to partition 710; if it is January 2, 2021, to partition 720; and if it is January 3, 2021, to partition 730. In some embodiments, the processing device may store all paths assigned to the same partition across the corresponding storage spaces randomly or according to some rule (e.g., filling one device before moving to the next, or distributing the paths evenly across several devices).
In some embodiments, when storing the plurality of flow paths and the plurality of source sub-paths, the flow paths and source sub-paths whose time related to the termination node or tail node belongs to the same period may be stored in the same time partition on one or more storage devices. Following the example above, the time related to the termination node or tail node of the flow path A-B-C and of the source sub-path A-B, such as the time of the transaction involving that node (e.g., the time of the fund transaction from node B to node C, and the time of the fund transaction from node A to node B), determines the partition: January 1, 2021 corresponds to partition 710, January 2, 2021 to partition 720, and January 3, 2021 to partition 730.
In some embodiments, the storage space of the storage devices may not be partitioned in advance; instead, after the flow paths, outgoing sub-paths, and source sub-paths have been divided by time partition, the paths in each partition are stored on one or more storage devices in a distributed manner, and the paths of each partition naturally form a logical time partition on the storage devices. For example, as shown in FIG. 7, the paths assigned to partition 710 are stored on storage devices 1, 2, ..., M, and the paths assigned to partitions 720, 730, 740, and so on are distributed to storage devices 1, 2, ..., M in the same way. It should be noted that the processing device may store the paths of a partition in any manner, for example randomly, evenly, or in chronological order.
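A toy sketch of such time-partitioned distributed placement is given below; the day-level partition key, the four-device cluster, and the hash-based device choice are all illustrative assumptions rather than the scheme required by the method.

```python
import hashlib

def partition_key(path, record_time):
    """Derive (time_partition, device_index) for one path record.

    For outgoing truncation the partition comes from the time related to the
    start/head node; for source truncation, from the time related to the
    termination/tail node (the caller passes the relevant time in as
    record_time, e.g. "2021-01-02 10:15:00"). The device index here is a
    simple hash over the node sequence; filling one device before the next,
    or spreading paths evenly, would work just as well.
    """
    n_devices = 4                                   # assumed cluster size
    digest = hashlib.md5("-".join(path).encode("utf-8")).hexdigest()
    return record_time[:10], int(digest, 16) % n_devices

# Flow path A-B-C and its outgoing sub-path B-C land in the same day partition.
storage = {}                                        # (day, device) -> list of paths
for path, t in [(["A", "B", "C"], "2021-01-01 09:00:00"),
                (["B", "C"], "2021-01-01 10:30:00")]:
    storage.setdefault(partition_key(path, t), []).append(path)
print(storage)
```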
After the data storage is completed, the user can load the path corresponding to one or more partitions from one or more storage devices through the processing device to perform query. For more description of the query, refer to fig. 9 and its related description, which are not described herein.
In some embodiments, the plurality of flow paths, the plurality of outgoing sub-paths, and the plurality of source sub-paths are encoded and/or compressed before storage. Encoding may express a path in a format such as a vector, for example encoding the node information of a path into a vector; compression compresses the path data. Encoding and/or compression can effectively reduce the space occupied during storage.
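As one hedged example of encoding and compression (JSON plus zlib here is merely an illustrative choice; the method itself does not prescribe a format), a path record might be serialized as follows.

```python
import json
import zlib

def encode_path(path, amount, scenario_by_edge=None):
    """Encode one path record and compress it for storage (illustrative only)."""
    record = {"nodes": path, "amount": amount, "scenarios": scenario_by_edge or {}}
    raw = json.dumps(record, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)

def decode_path(blob):
    """Reverse the encoding above when a path is loaded for a query."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

blob = encode_path(["A", "B", "C"], 30, {"A-B": "transfer", "B-C": "recharge"})
print(len(blob), decode_path(blob)["nodes"])   # compressed size, ['A', 'B', 'C']
```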
The storage and analysis of data flow data may be continuous. In some embodiments, data flow data may be collected over a preset time period, and the existing data flow graph is augmented based on the data flow data acquired in each new time period. The time period may be a day, a week, a month, and so on.
Illustratively, the processing device may update the path in accordance with the methods described in the embodiments below.
In some embodiments, the processing device may obtain new data streaming data. The newly added data flow data comprises nodes and edges between the nodes. In some embodiments, the additional data flow data may include transaction data for a date next to the current date. For example, funds remaining in the flow path termination node are passed through the transaction flow to the new node at the next date.
The processing device may update the plurality of flow paths, the plurality of outgoing sub-paths, and the plurality of source sub-paths based on the newly added data flow data.
In some embodiments, all of the funds remaining at the termination node of a flow path flow to a new node on the next date. Correspondingly, the update may include appending the one or more new nodes after the termination node of the existing flow path and of its existing outgoing sub-paths. After new nodes have been appended to the termination node of a flow path, outgoing truncation and source truncation may be performed again with reference to steps 206 and 208 to obtain the outgoing sub-paths and source sub-paths for the newly added nodes. For example, take the existing flow path A-B-C, its outgoing sub-path B-C, and its source sub-path A-B. If on the next date all the fund data of this flow path passes through node D and finally stays at node E, then nodes D and E are appended after the termination node of the existing flow path and of its outgoing sub-path, giving the flow path A-B-C-D-E and the outgoing sub-path B-C-D-E. In addition, outgoing truncation and source truncation must also be performed on the flow path with the newly added nodes D and E, giving the outgoing sub-paths C-D-E and D-E and the source sub-paths A-B-C-D and A-B-C for the new nodes. After the update, the original flow path A-B-C, outgoing sub-path B-C, and source sub-path A-B have grown into the flow path A-B-C-D-E, the outgoing sub-paths B-C-D-E, C-D-E, and D-E, and the source sub-paths A-B-C-D, A-B-C, and A-B.
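For this case, the update can be sketched as appending the new nodes and re-running both truncations; the helper functions repeat the list-of-nodes sketches given earlier and are illustrative only.

```python
def outgoing_subpaths(path):
    # every suffix that still ends at the termination node
    return [path[i:] for i in range(1, len(path) - 1)]

def source_subpaths(path):
    # every prefix that still begins at the start node
    return [path[:i] for i in range(len(path) - 1, 1, -1)]

def extend_full(flow_path, new_nodes):
    """Update for the case where all funds left at the termination node flow
    onward through new_nodes on the next date: append the new nodes and
    re-run both truncations (the partial case additionally splits the amount).
    """
    updated = flow_path + new_nodes
    return {
        "flow_path": updated,
        "outgoing_subpaths": outgoing_subpaths(updated),
        "source_subpaths": source_subpaths(updated),
    }

result = extend_full(["A", "B", "C"], ["D", "E"])
print(result["flow_path"])                                   # ['A', 'B', 'C', 'D', 'E']
print(["-".join(p) for p in result["outgoing_subpaths"]])    # B-C-D-E, C-D-E, D-E
print(["-".join(p) for p in result["source_subpaths"]])      # A-B-C-D, A-B-C, A-B
```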
In some embodiments, only part of the funds remaining at the termination node of a flow path flows to the new node on the next date. Correspondingly, the update may include splitting the existing flow paths, outgoing sub-paths, and source sub-paths according to the data. Splitting means dividing the data of the original path (for example, splitting an amount of 50 into 30 and 20), appending the one or more new nodes after the termination node of the split-off portion, and adding the corresponding outgoing sub-paths and source sub-paths for the split flow path to which the nodes have been appended.
Referring to FIG. 8, assume the data of flow path 810 is 50 yuan and the transaction time related to its start node is T. At T+1, the termination node I of flow path 810 takes part in a new transaction in which 30 yuan passes through node X and stays at node Y, i.e., a new path I-X-Y is formed (820) whose data is 30 yuan. When the data flow graph is updated based on the flow data of T+1, the original flow path 810 may be split into a path 811 with data of 30 yuan and a flow path 812 with data of 20 yuan. Nodes X and Y are appended after the termination node of the split path 811, giving the flow path 830 with data of 30 yuan. The outgoing sub-paths and source sub-paths of the original flow path 810 may be split in the same way, giving the outgoing and source sub-paths of the 20-yuan flow path 812 and part of the outgoing and source sub-paths of the 30-yuan flow path 830. Outgoing truncation and source truncation are then performed on the split flow path 830 according to steps 206 and 208 to obtain the outgoing sub-paths and source sub-paths for the newly added nodes X and Y, and hence all the outgoing and source sub-paths of flow path 830.
With continued reference to FIG. 8, at T +2, the termination node I of the flow path 812 again generates a new transaction and forms a new path I-U-V (840). Adding the new path 840 to the flow path 812 results in a path 850. Since all the funds in the termination node I flow to the newly added path I-U-V, the processing may be performed according to the first updating method, which is not described herein again.
In some embodiments, an entirely new flow path appears on the next date, for example funds from outside the data flow graph enter through node H, pass through node G, and stay at node K. Correspondingly, the processing device may directly add the new flow path H-G-K together with its outgoing sub-paths and source sub-paths.
FIG. 9 is an exemplary flow diagram of a data query method, shown in accordance with some embodiments of the present description. In some embodiments, flow 900 may be performed by a processing device (e.g., a query server having a signal connection to a cluster of storage devices). For example, the process 900 may be stored in a storage device (e.g., an onboard storage unit of a processing device or an external storage device) in the form of a program or instructions that, when executed, may implement the process 900. Flow 900 may include the following operations.
Step 902, obtain a query request. In some embodiments, step 902 may be performed by query request acquisition module 1110.
The query request can be initiated by a user needing to perform the analysis of the streaming data. Specifically, the query request may be initiated to the processing device through the user side. The query request may include description information or matching conditions of the path that the user wants to query. Such as path nodes, edges, data flow relationships, and the like.
In some embodiments, the query request may include a path matching condition that describes a plurality of nodes in the path to be queried and the data flow relationships between them. The path to be queried is the path the user wants to query. The data flow relationship may include the data flow scenario (e.g., transfer, recharge, consumption, the time of the flow) and the data flow direction (e.g., from which node to which node the transaction goes). For example, the path matching condition may be "payment node: node A; collection node: node B".
In some embodiments, the query request may also include a label item indicating outgoing analysis or source analysis. The label may take any predetermined form, for example "1" for outgoing analysis and "2" for source analysis. The label item may be used to instruct the processing device, upon receiving the query request, to obtain the desired query path from the flow paths together with either the outgoing sub-paths or the source sub-paths.
Step 904, acquire one or more candidate paths from the circulation paths and/or sub-paths in one or more storage devices, respectively, based on the query request. In some embodiments, step 904 may be performed by the path acquisition module 1120.
A candidate path is a path that is obtained from the one or more storage devices according to the query request and matches the query request. The candidate paths may include circulation paths, outgoing sub-paths, and/or source sub-paths.
In some embodiments, the circulation path and/or the sub-path in one or more storage devices may be processed by a data processing method as described in the embodiments of the present specification.
In some embodiments, the processing device may obtain, in each storage device, the circulation paths and/or sub-paths matching the query request based on the query request. For example, in the query request the user may specify, for the path to be queried, a start node, the next node of the start node, and the data flow relationship between the two, or a termination node, the previous node of the termination node, and the data flow relationship between the two. The processing device may then match the paths in each storage device based on these nodes, edges, and data flow relationships; if the nodes, edges, and data flow relationships all match, the processing device determines that the path is a candidate path and acquires it.
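A minimal sketch of this matching, for the variant in which the start node, its next node, and their data flow relationship are specified, might look as follows. The way a stored path is modeled here (node sequence plus per-edge attributes) is an assumption of the sketch.

    # A stored path is modeled as a node sequence plus per-edge attributes;
    # direction is implied by node order, and "scenario" stands for the data
    # flow relationship. This modeling is an assumption for illustration.
    def matches(path, condition):
        nodes = path["nodes"]
        wanted = (condition["payment_node"], condition["payee_node"])
        if len(nodes) < 2 or (nodes[0], nodes[1]) != wanted:
            return False                      # start node and its next node must match
        edge = path["edges"].get(wanted, {})
        scenario = condition.get("flow_scenario")
        return scenario is None or edge.get("scenario") == scenario

    def candidate_paths(stored_paths, condition):
        # Run the same match against the paths held by one storage device.
        return [p for p in stored_paths if matches(p, condition)]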
For example, the processing device may obtain one or more candidate paths from the circulation paths and/or sub-paths in the one or more storage devices based on the flag item and the path matching condition in the query request, in the manner described in the following embodiments.
When the flag item in the query request indicates outgoing analysis, one or more candidate paths satisfying the path matching condition are acquired from the circulation paths and the outgoing sub-paths in the one or more storage devices, respectively.
When the flag item in the query request indicates source analysis, one or more candidate paths satisfying the path matching condition are acquired from the circulation paths and the source sub-paths in the one or more storage devices, respectively.
In this case, the candidate paths obtained by the processing device from the one or more storage devices correspond to the flag item in addition to satisfying the path matching condition in the query request.
In some embodiments, the flag item in the query request is "1", and the path matching condition includes "payment nodes: A, C, E; payee nodes: B, F". A payment node refers to the start node or head node of a path, i.e., the source of the data on the path. For example, referring to fig. 4, the start node A of the circulation path 410 is a payment node, and the head node B of the outgoing sub-path 411 is a payment node. A payee node may be a node downstream of the payment node, such as node B in the circulation path 410 or node C in the outgoing sub-path 411.
The processing device may extract, from the circulation paths and the outgoing sub-paths in the one or more storage devices, one or more initial candidate paths whose start node or head node is the same as a payment node. For example, in conjunction with fig. 4, if the payment nodes in the query request are nodes A, C, and E, matching them against the start node or head node of each path in fig. 4 yields the circulation path 410, the outgoing sub-path 412, and the outgoing sub-path 414 as initial candidate paths.
The processing device may then obtain the one or more candidate paths from the one or more initial candidate paths based on the other nodes specified by the path matching condition and the data flow relationships between them. For example, the path matching condition further specifies that the payee node is node B or node F. Matching the path matching condition against the initial candidate paths shows that the circulation path 410 and the outgoing sub-path 414 satisfy the condition, so the candidate paths obtained from the initial candidate paths are the circulation path 410 and the outgoing sub-path 414.
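The two-step screening in this example can be sketched as follows. The node sequences assigned to paths 410, 412, and 414 are hypothetical stand-ins for FIG. 4, which is not reproduced here.

    # Step 1: initial candidates -- paths whose start node or head node is one
    # of the payment nodes. Step 2: keep those with a payee node downstream.
    def destination_candidates(paths, payment_nodes, payee_nodes):
        initial = [p for p in paths if p["nodes"][0] in payment_nodes]
        return [p for p in initial
                if any(n in payee_nodes for n in p["nodes"][1:])]

    # Node sequences below are hypothetical stand-ins for the paths of FIG. 4.
    paths = [
        {"id": "410", "nodes": ["A", "B", "C", "D"]},  # circulation path 410
        {"id": "412", "nodes": ["C", "D", "E"]},       # outgoing sub-path 412
        {"id": "414", "nodes": ["E", "F"]},            # outgoing sub-path 414
    ]
    hits = destination_candidates(paths, {"A", "C", "E"}, {"B", "F"})
    # -> paths "410" and "414", matching the example above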
In some embodiments, the tag entry in the query request is "2", and the path matching condition includes "payee node: C. e, the payment node: d' is used.
The processing device may first extract, from the circulation paths and the source sub-paths in the one or more storage devices, one or more initial candidate paths whose termination node or tail node is the same as a payee node. For example, in conjunction with fig. 5, the source sub-path 512 and the source sub-path 514 may be matched as initial candidate paths.
The processing device may then obtain the one or more candidate paths from the one or more initial candidate paths based on the other nodes specified by the path matching condition and the data flow relationships between them. For example, the path matching condition may also specify that the payment node is node D. Matching the path matching condition against the initial candidate paths shows that the source sub-path 512 satisfies the condition, so the candidate path obtained from the initial candidate paths is the source sub-path 512.
Performing path queries in this way avoids traversing the nodes of the whole graph and effectively improves query efficiency.
In some embodiments, the user may also specify an interest time span in the query request. The interest time span is the time period during which the data in the path to be queried flows along the circulation path, for example, May 18, 2021 to May 20, 2021. Accordingly, the processing device queries the storage devices for paths whose flow occurred within the interest time span and which satisfy the other conditions in the query request. Specifically, the processing device may determine the one or more time periods involved based on the interest time span, and thus determine the corresponding time partitions in the one or more storage devices. For example, if the interest time span is T to T+1, the time periods involved include T and T+1. As previously described, the time period of partition 710 in FIG. 7 is T-1, that of partition 720 is T, and that of partition 730 is T+1; based on the interest time span, the corresponding time partitions in the one or more storage devices are therefore determined to be partition 720 and partition 730.
The processing device may obtain one or more initial candidate paths from the corresponding time partitions in the one or more storage devices based on the other conditions in the query request, such as the path matching condition. For example, the processing device may retrieve the initial candidate paths from the storage devices corresponding to partition 720 and partition 730. The retrieval manner can refer to the related description above and is not repeated here.
In some embodiments, the nodes or edges of the paths stored in the storage devices, including the circulation paths and their sub-paths, may relate to multiple time periods, so the processing device may further truncate the one or more initial candidate paths based on the interest time span to obtain the portion of each initial candidate path that belongs to the interest time span. For example, if the time span of an initial candidate path covers periods T to T+10 and the interest time span is T to T+1, the processing device may truncate and discard the part of the path corresponding to T+2 to T+10; the remaining part after truncation is the portion of the initial candidate path belonging to the interest time span.
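A minimal sketch of mapping an interest time span to time partitions and truncating retrieved paths might look as follows, with time simplified to integer periods; both helpers are assumptions of the sketch.

    # Time is simplified to integer periods (T = 0, T+1 = 1, ...); a real
    # system would use dates. Both helpers below are illustrative assumptions.
    def partitions_for_span(span_start, span_end, partition_of_period):
        # Map each period covered by the interest time span to its time partition.
        return {partition_of_period[t]
                for t in range(span_start, span_end + 1)
                if t in partition_of_period}

    def truncate_to_span(path_edges, span_start, span_end):
        # path_edges: list of (src, dst, period); keep only edges whose flow
        # time falls inside the interest time span, discarding the rest.
        return [e for e in path_edges if span_start <= e[2] <= span_end]

    # Example mirroring FIG. 7: partitions 710, 720, 730 hold periods T-1, T, T+1.
    partition_of_period = {-1: "710", 0: "720", 1: "730"}
    parts = partitions_for_span(0, 1, partition_of_period)   # {"720", "730"}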
The processing device may directly take the portions of the one or more initial candidate paths that belong to the interest time span as the one or more candidate paths.
Step 906, summarize the one or more candidate paths to obtain the path to be queried. In some embodiments, step 906 may be performed by the path processing module 1130.
In some embodiments, the processing device may retrieve the one or more candidate paths from the one or more storage devices. For example, referring to fig. 7, the processing device may obtain the candidate paths corresponding to partition 720 and partition 730 from storage devices 1, 2, ..., M, respectively.
The summarizing process may include one or more of merging, decompressing, and decoding. Merging may refer to taking the union of, or summing, the data corresponding to similar sub-paths among the candidate paths. For example, step 904 truncates the initial candidate paths based on the interest time span, and similar sub-paths may exist among the truncated paths, which can then be merged. Further description of merging and similar paths may be found in the related description of step 206 and is not repeated here. Decompressing may refer to restoring the compressed path data described in step 210 to its pre-compression state. Decoding may refer to restoring the encoded data to plaintext using the inverse of the encoding algorithm, for example decoding the nodes, edges, and data of a path represented as vectors into a plaintext form that the user can understand.
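For illustration, the summarizing step might be sketched as follows. The zlib-plus-JSON storage format assumed here is only an example; the embodiments merely require that stored paths can be decompressed, decoded, and merged by taking the union or sum of their data.

    import json
    import zlib

    def summarize(raw_candidates):
        # Decompress and decode each stored candidate path, then merge similar
        # paths (same node sequence, hence same edges and flow direction) by
        # summing the data they carry.
        decoded = [json.loads(zlib.decompress(blob).decode("utf-8"))
                   for blob in raw_candidates]
        merged = {}
        for p in decoded:
            key = tuple(p["nodes"])
            if key in merged:
                merged[key]["amount"] += p["amount"]
            else:
                merged[key] = dict(p)
        return list(merged.values())

    # Example: two similar candidate sub-paths are merged into one 50-unit path.
    blobs = [zlib.compress(json.dumps({"nodes": ["A", "B"], "amount": a}).encode())
             for a in (20, 30)]
    result = summarize(blobs)   # [{"nodes": ["A", "B"], "amount": 50}]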
It should be noted that the above description of the respective processes is only for illustration and description and does not limit the applicable scope of the present specification. Various modifications and changes to the processes may occur to those skilled in the art given the benefit of this disclosure; however, such modifications and changes remain within the scope of the present specification. For example, pre-processing steps or storage steps may be added to the processes.
FIG. 10 is an exemplary block diagram of a data processing system shown in accordance with some embodiments of the present description. As shown in fig. 10, the system 1000 may include a data flow transition diagram acquisition module 1010, a flow transition path acquisition module 1020, a go path truncation module 1030, a source path truncation module 1040, and a path storage module 1050.
The data flow graph obtaining module 1010 may be configured to obtain a data flow graph.
Data may be an abstract reference to something, such as funds, charges, traffic, people, vehicles, and the like.
The data flow graph is a data structure that represents data with nodes and uses edges between the nodes to reflect the flow information of the data between the nodes. In some embodiments, the edges between nodes may also represent the scenarios in which data flows between the nodes.
In some embodiments, the data flow graph obtaining module 1010 may obtain the data flow graph by reading it from a storage device or database, by calling a data interface, or by constructing it based on nodes and edges.
The flow path obtaining module 1020 may be configured to obtain a flow path of the same data from the start node to the end node from the data flow graph, so as to obtain multiple flow paths corresponding to different data.
The circulation path refers to the path, formed by nodes and the edges connecting them, that a certain fund passes through when it circulates in the data flow graph.
In some embodiments, the data is funds, and a node may include an account for storing the funds. A portion of the transferred-in funds and a portion of the transferred-out funds of the same node that satisfy a preset condition are determined to be the same data, and the data of a circulation path is the intersection of the same data of the nodes on the circulation path. The preset condition includes last-in-first-out and equal amounts.
In some embodiments, the flow path acquisition module 1020 may first determine the same data among the incoming data and outgoing data of a node. Once the same data within the nodes is determined, the next node of the data circulation path can be determined based on the intersection of the same data among the nodes, so that a circulation path is determined.
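A minimal sketch of the last-in-first-out, equal-amount pairing that identifies the same data at a node might look as follows; the record layout is an assumption of the sketch.

    # incoming / outgoing: lists of (time, amount, counterparty) for one node,
    # ordered by time. An outgoing transfer is paired with the most recent
    # earlier incoming transfer of equal amount (last-in-first-out + equal
    # amount); the paired funds are treated as the same data at the node.
    def pair_same_data(incoming, outgoing):
        pairs, used = [], set()
        for out_time, out_amt, payee in outgoing:
            for i in range(len(incoming) - 1, -1, -1):      # latest first (LIFO)
                in_time, in_amt, payer = incoming[i]
                if i not in used and in_time <= out_time and in_amt == out_amt:
                    pairs.append((payer, payee, out_amt))
                    used.add(i)
                    break
        return pairs

    # Example: the 30-unit transfer out of the node is traced back to the latest
    # 30-unit transfer into it, so payer "C" and payee "D" lie on the same path.
    incoming = [(1, 30, "B"), (2, 30, "C")]
    outgoing = [(3, 30, "D")]
    print(pair_same_data(incoming, outgoing))   # [("C", "D", 30)]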
In some embodiments, the flow path obtaining module 1020 may extract, based on one or more pieces of the same data, one or more start nodes, and one or more termination nodes in the data flow graph, the nodes that are intersections of the same data and the edges connecting these nodes, so as to obtain multiple flow paths corresponding to different data.
The outgoing path truncation module 1030 may be configured to truncate at least a plurality of flow paths from the start node to the end node for a plurality of times, respectively, to obtain a plurality of outgoing sub-paths.
In some embodiments, the outgoing path truncation module 1030 may, for each circulation path, sequentially truncate and discard the front portion before each node, from the node next to the start node to the node previous to the end node, to obtain a plurality of initial outgoing sub-paths, and thus the initial outgoing sub-paths of the plurality of circulation paths. Similar outgoing sub-paths among the initial outgoing sub-paths of the plurality of circulation paths are merged to obtain the plurality of outgoing sub-paths; similar outgoing sub-paths at least have the same nodes and edges reflecting the same flow direction, and the merging includes taking the union of, or summing, the data corresponding to the similar outgoing sub-paths.
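A minimal sketch of the outgoing truncation on a single circulation path (merging of similar sub-paths omitted) might look as follows.

    # For a circulation path n0 -> n1 -> ... -> nk, drop the prefix before each
    # node from the node next to the start node up to the node previous to the
    # termination node, giving the initial outgoing sub-paths.
    def outgoing_subpaths(path_nodes):
        return [path_nodes[i:] for i in range(1, len(path_nodes) - 1)]

    # Hypothetical path A -> B -> C -> D yields B -> C -> D and C -> D.
    print(outgoing_subpaths(["A", "B", "C", "D"]))  # [['B', 'C', 'D'], ['C', 'D']]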
The source path truncation module 1040 may be configured to at least truncate the plurality of flow paths from the terminating node to the starting node for a plurality of times, respectively, to obtain a plurality of source sub-paths.
In some embodiments, the source path truncation module 1040 may, for each circulation path, sequentially truncate and discard the rear portion after each node, from the node previous to the termination node to the node next to the start node, to obtain a plurality of initial source sub-paths, and thus the initial source sub-paths of the plurality of circulation paths. Similar source sub-paths among the initial source sub-paths of the plurality of circulation paths are merged to obtain the plurality of source sub-paths; similar source sub-paths at least have the same nodes and edges reflecting the same flow direction, and the merging includes taking the union of, or summing, the data corresponding to the similar source sub-paths.
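A corresponding sketch of the source truncation on a single circulation path (again omitting the merging of similar sub-paths) might look as follows.

    # For a circulation path n0 -> n1 -> ... -> nk, drop the suffix after each
    # node from the node previous to the termination node down to the node
    # next to the start node, giving the initial source sub-paths.
    def source_subpaths(path_nodes):
        return [path_nodes[:i + 1] for i in range(len(path_nodes) - 2, 0, -1)]

    # Hypothetical path A -> B -> C -> D yields A -> B -> C and A -> B.
    print(source_subpaths(["A", "B", "C", "D"]))    # [['A', 'B', 'C'], ['A', 'B']]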
The path storage module 1050 may be configured to store the plurality of circulation paths, the plurality of going sub-paths, and the plurality of source sub-paths in a distributed manner on a plurality of storage devices for querying.
In some embodiments, the path storage module 1050 may store the multiple flow paths obtained from the data flow graph, together with the multiple outgoing sub-paths and multiple source sub-paths obtained after truncation, in a distributed manner on multiple storage devices according to a preset storage rule.
In some embodiments, the storage devices have a plurality of time partitions. The path storage module 1050 may store, in the same time partition on one or more storage devices, the paths among the plurality of circulation paths and the plurality of outgoing sub-paths whose time relative to the start node belongs to the same time period, and store, in the same time partition on one or more storage devices, the paths among the plurality of circulation paths and the plurality of source sub-paths whose time relative to the termination node belongs to the same time period.
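For illustration, the time-partitioned placement might be sketched as follows. The record fields and the hash-based assignment of partitions to devices are assumptions of the sketch, not a prescribed storage rule.

    # Each path record carries its kind and the periods of its start and end
    # nodes; these field names and the hash-based placement are assumptions.
    def partition_key(path):
        if path["kind"] in ("circulation", "outgoing"):
            return path["start_period"]    # keyed by the time of the start/head node
        return path["end_period"]          # source sub-path: termination/tail node

    def store(path, storage_devices):
        # storage_devices: list of dicts mapping time partition -> list of paths.
        part = partition_key(path)
        device = storage_devices[hash(part) % len(storage_devices)]
        device.setdefault(part, []).append(path)

    # Example: a circulation path of period T and an outgoing sub-path of
    # period T land in the same time partition.
    devices = [{}, {}]
    store({"kind": "circulation", "start_period": "T", "end_period": "T+1"}, devices)
    store({"kind": "outgoing", "start_period": "T", "end_period": "T"}, devices)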
In some embodiments, the path storage module 1050 may store the plurality of flow paths, the plurality of going sub-paths, and the plurality of source sub-paths after encoding and/or compressing.
FIG. 11 is an exemplary block diagram of a data query system, shown in accordance with some embodiments of the present description. As shown in fig. 11, the system 1100 may include a query request acquisition module 1110, a path acquisition module 1120, and a path processing module 1130.
The query request obtaining module 1110 may be configured to obtain a query request.
The query request may be initiated by a user who needs to analyze data flow. Specifically, the user may initiate the query request to the processing device through a user terminal. The query request may include description information or matching conditions of the path that the user wants to query, such as path nodes, edges, and data flow relationships.
In some embodiments, the query request further includes a flag item indicating outgoing analysis or source analysis.
The path obtaining module 1120 may be configured to obtain one or more candidate paths from the data stream path and/or the sub-path in the one or more storage devices, respectively, based on the query request.
In some embodiments, the path retrieval module 1120 may retrieve, in each storage device, a data flow path and/or sub-path matching the query request based on the query request.
In some embodiments, the query request further includes a flag item indicating outgoing analysis or source analysis. When the flag item in the query request indicates outgoing analysis, the path obtaining module 1120 may acquire one or more candidate paths satisfying the path matching condition from the data circulation paths and the outgoing sub-paths in the one or more storage devices, respectively; when the flag item indicates source analysis, it may acquire one or more candidate paths satisfying the path matching condition from the data circulation paths and the source sub-paths in the one or more storage devices, respectively.
In some embodiments, the path matching condition includes a payment node. The path obtaining module 1120 may extract, from the data circulation paths and the outgoing sub-paths in the one or more storage devices, one or more initial candidate paths whose start node or head node is the same as the payment node, and acquire the one or more candidate paths from the one or more initial candidate paths based on the other nodes specified by the path matching condition and the data flow relationships between them.
In some embodiments, the path matching condition includes a payee node. The path obtaining module 1120 may extract, from the data circulation paths and the source sub-paths in the one or more storage devices, one or more initial candidate paths whose termination node or tail node is the same as the payee node, and acquire the one or more candidate paths from the one or more initial candidate paths based on the other nodes specified by the path matching condition and the data flow relationships between them.
In some embodiments, the query request further includes an interest time span. The path acquisition module 1120 may determine the one or more time periods involved based on the interest time span, and thus determine the corresponding time partitions in the one or more storage devices; obtain one or more initial candidate paths from the corresponding time partitions in the one or more storage devices based on the path matching condition; truncate the one or more initial candidate paths based on the interest time span to obtain the portions of the one or more initial candidate paths belonging to the interest time span; and take the portions of the one or more initial candidate paths belonging to the interest time span as the one or more candidate paths.
The path processing module 1130 may be configured to perform summarization on the one or more candidate paths to obtain the path to be queried.
In some embodiments, the aggregation processing includes one or more of merging, decompressing, and decoding.
The path processing module 1130 may retrieve the one or more candidate paths from the one or more storage devices.
With regard to the detailed description of the modules of the system shown above, reference may be made to the flow chart section of this specification, e.g., the associated description of fig. 2-9.
It should be understood that the systems shown in fig. 10 and 11 and their modules may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD- or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices, but also by software executed by various types of processors, or by a combination of hardware circuits and software (e.g., firmware).
It should be noted that the above description of the data processing system and its modules is merely for convenience of description and is not intended to limit the present specification to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given an understanding of the principles of the system, modules may be combined in any manner or connected to other modules as sub-systems without departing from these principles. For example, in some embodiments, the data flow graph obtaining module 1010, the flow path obtaining module 1020, the outgoing path truncation module 1030, the source path truncation module 1040, and the path storage module 1050 may be different modules in one system, or a single module may implement the functions of two or more of these modules. For another example, the outgoing path truncation module 1030 and the source path truncation module 1040 may be two modules, or one module may have both the outgoing truncation function and the source truncation function. As yet another example, the modules may share one storage module, or each module may have its own storage module. Such variations are within the scope of the present specification.
The beneficial effects that may be brought by the embodiments of the present specification include, but are not limited to: (1) by extracting circulation paths from a data flow graph formed from massive data flow data and performing truncation and partition processing on the circulation paths, the efficiency of data queries can be effectively improved and the data storage pressure reduced to a certain extent; (2) path splitting and/or data partitioning is performed at the minimum granularity of analysis, so as to support flexible and changeable query modes and improve the user experience. It should be noted that different embodiments may produce different advantages; in different embodiments, any one or a combination of the above advantages, or any other advantage, may be obtained.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, this specification uses specific words to describe its embodiments. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the specification. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment," "one embodiment," or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, or Python, a conventional procedural programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, or ABAP, a dynamic programming language such as Python, Ruby, or Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any network form, such as a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service, such as software as a service (SaaS).
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While certain presently contemplated useful embodiments of the invention have been discussed in the foregoing disclosure by way of various examples, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein described. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not to be interpreted as implying that the claimed subject matter requires more features than are expressly recited in each claim. Indeed, the embodiments may be characterized as having less than all of the features of a single disclosed embodiment.
Numerals describing the number of components, attributes, and the like are used in some embodiments; it should be understood that such numerals used in the description of the embodiments are modified in some instances by the modifier "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and employ a general digit-preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope are approximations, in specific embodiments such numerical values are set forth as precisely as practicable.
For each patent, patent application publication, and other material cited in this specification, such as articles, books, specifications, publications, and documents, the entire contents are hereby incorporated by reference into this specification, except for any application history documents that are inconsistent with or conflict with the contents of this specification, and any documents (currently or later appended to this specification) that would limit the broadest scope of the claims of this specification. It should be noted that if there is any inconsistency or conflict between the descriptions, definitions, and/or use of terms in the materials accompanying this specification and the contents of this specification, the descriptions, definitions, and/or use of terms in this specification shall prevail.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments described herein. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims (19)

1. A method of data processing, the method comprising:
acquiring a data flow transfer diagram; the data flow graph comprises nodes and edges, wherein data flow exists among the nodes, and the edges among the nodes reflect the flow direction of data among the nodes; the data comprises funding data, the node corresponding to an account storing the data;
acquiring a circulation path of the same data from a starting node to a terminating node from the data circulation graph so as to obtain a plurality of circulation paths corresponding to different data;
at least sequentially truncating a plurality of circulation paths from the starting node to the ending node to obtain a plurality of outgoing sub-paths;
at least sequentially truncating a plurality of circulation paths from a termination node to an initial node to obtain a plurality of source sub-paths;
and storing the plurality of circulation paths, the plurality of going sub-paths and the plurality of source sub-paths on a plurality of storage devices in a distributed manner for query.
2. The method of claim 1, wherein the truncating at least the plurality of flow paths from the start node to the end node for a plurality of times to obtain a plurality of outgoing sub-paths comprises:
for each circulation path, sequentially cutting off and discarding the front part of each node from the next node of the starting node to the previous node of the ending node in the circulation path to obtain a plurality of initial outgoing sub-paths; thus obtaining a plurality of initial outgoing sub-paths of a plurality of circulation paths;
taking a plurality of initial outgoing sub-paths of a plurality of circulation paths as the plurality of outgoing sub-paths, or merging similar outgoing sub-paths in the plurality of initial outgoing sub-paths of the plurality of circulation paths to obtain the plurality of outgoing sub-paths; the similar going sub-paths at least have the same nodes and edges reflecting the same flow turning direction, and the merging comprises the union or summation of data corresponding to the similar going sub-paths.
3. The method of claim 1, wherein the truncating at least the plurality of flow paths from the terminating node to the originating node for a plurality of times to obtain a plurality of source sub-paths comprises:
for each circulation path, sequentially cutting off and discarding the back part of each node from the previous node of the termination node to the next node of the starting node in the circulation path to obtain a plurality of initial source sub-paths; thus, a plurality of initial source sub-paths of a plurality of circulation paths are obtained;
taking a plurality of initial source sub-paths of a plurality of circulation paths as the plurality of source sub-paths, or merging similar source sub-paths in the plurality of initial source sub-paths of the plurality of circulation paths to obtain the plurality of source sub-paths; the similar source sub-paths at least have the same nodes and edges reflecting the same flow turning direction, and the merging comprises the union or summation of data corresponding to the similar source sub-paths.
4. The method of claim 1, the storage device having a plurality of time partitions; the distributively storing the plurality of circulation paths, the plurality of going sub-paths and the plurality of source sub-paths on a plurality of storage devices comprises:
storing paths, of which the time related to the starting node or the head node belongs to the same time period, in the multiple circulation paths and the multiple outgoing sub-paths in the same time partition on one or more storage devices;
and storing paths, of the plurality of circulation paths and the plurality of source sub-paths, of which the time related to the termination node or the tail node belongs to the same time period in the same time partition on one or more storage devices.
5. The method according to claim 1, wherein the part of the transferred-in fund data and the transferred-out fund data of the same node meeting the preset condition are determined as the same data, and the data of the circulation path is the intersection of the transferred-in fund data and the same data in the transferred-out fund data of each node on the circulation path.
6. The method of claim 5, wherein the preset conditions include last-in-first-out and equal amount.
7. The method of claim 1, wherein the plurality of flow paths, the plurality of outgoing sub-paths, and the plurality of source sub-paths are encoded and/or compressed for storage.
8. The method of claim 1, further comprising:
acquiring newly added data circulation data; the newly added data circulation data comprises nodes and edges between the nodes;
and updating the plurality of circulation paths, the plurality of outgoing sub-paths and the plurality of source sub-paths based on the newly added data circulation data.
9. The method of claim 8, the update comprising one or more of:
respectively adding one or more nodes after the termination nodes of one or more existing circulation paths and one or more existing outgoing sub-paths, and, for a circulation path to which nodes are added after its termination node, adding outgoing sub-paths and source sub-paths for the newly added nodes;
splitting one or more existing circulation paths, one or more existing outgoing sub-paths and one or more existing source sub-paths respectively based on data, adding one or more nodes respectively after the end nodes of the split partial paths, and adding outgoing sub-paths and source sub-paths aiming at newly added nodes to the split circulation paths with the nodes added after the end nodes; the splitting comprises decomposing the data of the original path;
and adding a new circulation path and a corresponding outgoing sub-path and a corresponding source sub-path.
10. A data processing system, the system comprising:
the data flow turning chart acquisition module is used for acquiring a data flow turning chart; the data flow graph comprises nodes and edges, wherein data flow exists among the nodes, and the edges among the nodes reflect the flow direction of data among the nodes; the data comprises funding data, the node corresponding to an account storing the data;
a flow path obtaining module, configured to obtain a flow path of the same data from the start node to the end node from the data flow graph, so as to obtain multiple flow paths corresponding to different data;
the outgoing path truncation module is used for sequentially truncating at least a plurality of circulation paths from the starting node to the ending node to obtain a plurality of outgoing sub-paths;
the source path truncation module is used for sequentially truncating at least a plurality of circulation paths from the termination node to the starting node to obtain a plurality of source sub-paths;
and the path storage module is used for storing the plurality of circulation paths, the plurality of outgoing sub-paths and the plurality of source sub-paths on a plurality of storage devices in a distributed manner for query.
11. A data processing apparatus comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of any of claims 1-9.
12. A method of data query, the method comprising:
acquiring a query request; the query request comprises a path matching condition, and the path matching condition describes a plurality of nodes in a path to be queried and a data flow relation between the nodes;
respectively acquiring one or more candidate paths from circulation paths and/or sub-paths in one or more storage devices based on the query request; wherein a flow path and/or a sub-path in one or more storage devices is obtained by a processing method according to any one of claims 1-9;
and summarizing the candidate paths to obtain the path to be queried.
13. The method of claim 12, the query request further comprising a flag item indicating outgoing analysis or source analysis;
the obtaining one or more candidate paths from the circulation path and/or the sub-path in one or more storage devices based on the query request respectively includes:
when the flag item in the query request indicates outgoing analysis, respectively acquiring one or more candidate paths satisfying the path matching condition from the circulation path and the outgoing sub-path in one or more storage devices;
and when the flag item in the query request indicates source analysis, respectively acquiring one or more candidate paths satisfying the path matching condition from the circulation path and the source sub-path in one or more storage devices.
14. The method of claim 13, the path matching condition comprising a payment node; the obtaining one or more candidate paths satisfying the path matching condition from the circulation path and the outgoing sub-path in one or more storage devices includes:
extracting one or more initial candidate paths of which the starting node or the head node is the same as the payment node from the circulation path and the outgoing sub-path in one or more storage devices;
and acquiring the one or more candidate paths from the one or more initial candidate paths based on other nodes specified by the path matching condition and the data flow relation between the other nodes.
15. The method of claim 13, the path matching condition comprising a collection node; the obtaining one or more candidate paths satisfying the path matching condition from the circulation path and the source sub-path in one or more storage devices includes:
extracting one or more initial candidate paths of which the termination node or the tail node is the same as the collection node from the circulation path and the source sub-path in one or more storage devices;
and acquiring the one or more candidate paths from the one or more initial candidate paths based on other nodes specified by the path matching condition and the data flow relation between the other nodes.
16. The method of claim 12, the query request further comprising a time span of interest;
the obtaining one or more candidate paths from the circulation path and/or the sub-path in one or more storage devices based on the query request respectively includes:
determining one or more time periods involved based on the time span of interest, and further determining corresponding time partitions in one or more storage devices;
obtaining one or more initial candidate paths from corresponding time partitions in one or more storage devices based on the path matching conditions;
truncating the one or more initial candidate paths based on the interest time span to obtain a portion of the one or more initial candidate paths belonging to the interest time span;
and taking the portion of the one or more initial candidate paths belonging to the interest time span as the one or more candidate paths.
17. The method of claim 12, the aggregation process comprising one or more of: merging, decompressing and decoding.
18. A data query system, the system comprising:
the query request acquisition module is used for acquiring a query request; the query request comprises a path matching condition, and the path matching condition describes a plurality of nodes in a path to be queried and a data flow relation between the nodes;
a path obtaining module, configured to obtain one or more candidate paths from a circulation path and/or a sub-path in one or more storage devices, respectively, based on the query request; wherein a flow path and/or a sub-path in one or more storage devices is obtained by a processing method according to any one of claims 1-9;
and the path processing module is used for summarizing the one or more candidate paths to obtain the path to be queried.
19. A data query device comprising at least one storage medium and at least one processor, the at least one storage medium storing computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of any of claims 12-17.
GR01 Patent grant