WO2021174945A1

WO2021174945A1 - Data cost calculation method, system, computer device, and storage medium

Info

Publication number: WO2021174945A1
Application number: PCT/CN2020/135737
Authority: WO
Inventors: 陈玉; 张茜; 凌海挺; 刘丽扬
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-10-21
Filing date: 2020-12-11
Publication date: 2021-09-10
Also published as: CN112256720B; CN112256720A

Abstract

Provided is a data cost calculation method based on data lineage: generating a data lineage relationship by means of SQL statements or SQL statements contained in processing scripts, said data lineage relationship forming a directed acyclic graph; obtaining statistical information and frequency information of task execution of a data platform, and corresponding same to the directed acyclic graph (S32); calculating the cost of nodes and edges related to target data in the directed acyclic graph (S33); obtaining the cost of the edges and nodes and adding up to obtain the total cost of the target data (S34). Hence, after combining with the data lineage relationship, it is possible to calculate and display the cost of data at a higher degree of granularity; at the same time, the invention causes the manner of pricing of data applications to be more reasonable. Further, the invention provides a more detailed and reasonable reference for the evaluation of data values inside and outside an enterprise, facilitating the most granular calculation of the cost of data, such that the cost of each piece of data can be accurately quantified. The described method also relates to blockchain technology.

Description

Data cost calculation method, system, computer equipment and storage medium

This application claims priority to Chinese patent application No. 202011132525.2, filed with the Chinese Patent Office on October 21, 2020 and entitled "Data cost calculation method, system, computer equipment and storage medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of data processing technology, in particular to data cost calculation methods, systems, computer equipment and storage media.

Background

Existing data lineage analysis programs or systems are mostly used for data traceability, citation analysis, and the like; no case has been found in which they are combined with data cost calculation. Enterprises are processing and storing more and more data, and big data technology is widely used. Data processing and storage consume substantial resources, but the corresponding costs cannot be effectively calculated and displayed. At present, data cost within an enterprise is calculated at a coarse granularity, so differences in data cost cannot be reflected at a finer granularity for internal management and related decision-making.

The inventors realized that most data costs are currently computed from the overall processing flow and the storage resources occupied, so table-level, field-level, or record-level costs cannot be obtained. Only when the data cost is clear can reasonable pricing or cost settlement be made when the data is used internally or externally.

The inventors also realized that the cost of data can be calculated from the expenses incurred by the resources used, but that other data consumed during processing should also be counted toward the cost of the current data, and that the cost or value of data can be assessed from more perspectives.

Summary of the invention

Based on this, this application provides a data cost calculation method, system, computer equipment, and storage medium that can calculate and display the cost of data at a finer granularity and, at the same time, make the pricing of data applications more reasonable.

To achieve the above objective, this application provides a data cost calculation method based on data lineage, and the data cost calculation method includes:

Acquire the SQL statements used in the data processing process or the scripts used in the data processing process, and generate the data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph;

Obtain the statistical information and frequency information of task execution on the data platform, and map them to the directed acyclic graph;

Calculate the cost of nodes and edges related to the target data in the directed acyclic graph;

Obtain the costs of the edges and nodes, and add them to obtain the total cost of the target data.

To achieve the above objective, the present application also provides a data cost calculation system based on data lineage, and the data cost calculation system includes:

The data set module is used to obtain the SQL statements used in the data processing process or the scripts used in the data processing process, and to generate the data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph;

The information module is used to obtain the statistical information and frequency information of the task execution of the data platform, and correspond to the directed acyclic graph;

The first calculation module is used to calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph;

The second calculation module is used to obtain the cost of the edges and nodes, and add them to obtain the total cost of the target data.

To achieve the above objective, the present application also provides a computer device including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: obtain the SQL statements used in the data processing process or the scripts used in the data processing process, and generate the data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph; obtain the statistical information and frequency information of task execution on the data platform, and map them to the directed acyclic graph; calculate the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; obtain the costs of the edges and nodes, and add them to obtain the total cost of the target data.

To achieve the above objective, this application also provides a storage medium that stores a program file capable of implementing the following steps: obtain the SQL statements or scripts used in the data processing process, and generate the data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph; obtain the statistical information and frequency information of task execution on the data platform, and map them to the directed acyclic graph; calculate the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; obtain the costs of the edges and nodes, and add them to obtain the total cost of the target data.

The foregoing application provides a data cost calculation method, system, computer equipment, and storage medium. The data cost calculation method obtains the SQL statements or scripts used in the data processing process and generates the data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph; obtains the statistical information and frequency information of task execution on the data platform and maps them to the directed acyclic graph; calculates the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; and adds the costs of the edges and nodes to obtain the total cost of the target data. Therefore, by incorporating the data lineage, the data cost calculation method described in this application can calculate and display the cost of data at a finer granularity, make the pricing of data applications more reasonable, and provide a more detailed and reasonable reference for assessing the value of data.

Description of the drawings

Figure 1 is an implementation environment diagram of a data cost calculation method provided in an embodiment;

Figure 2 is a block diagram of the internal structure of a computer device in an embodiment;

Figure 3 is a flow chart of a data cost calculation method in an embodiment;

Figure 4 is a schematic diagram of a directed acyclic graph in an embodiment;

Figure 5 is a flowchart of node and edge calculation in a directed acyclic graph in an embodiment;

Figure 6 is a schematic diagram of a directed acyclic graph in which the SQL statement is multi-in and multi-out in an embodiment;

Figure 7 is a schematic diagram of a data cost calculation system in an embodiment;

Figure 8 is a schematic structural diagram of a computer device in an embodiment;

Figure 9 is a schematic diagram of the structure of a storage medium in an embodiment.

Detailed description

In order to make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit it.

It can be understood that the terms "first", "second", etc. used in this application can be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.

Figure 1 is an implementation environment diagram of a data cost calculation method based on data lineage provided in an embodiment. As shown in Figure 1, the implementation environment includes a computer device 110 and a display device 120.

The computer device 110 may be a computer device used by the user, such as a computer, on which a data cost calculation system based on data lineage is installed. The user can perform the calculation on the computer device 110 according to the data cost calculation method based on data lineage, and display the calculation result on the display device 120.

It should be noted that the combination of the computer device 110 and the display device 120 can be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but is not limited to this.

Figure 2 is a schematic diagram of the internal structure of a computer device in an embodiment. As shown in Figure 2, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions. The database may store control information sequences. When the computer-readable instructions are executed by the processor, the processor can implement a data cost calculation method based on data lineage. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. Computer-readable instructions may also be stored in the memory of the computer device; when executed by the processor, they cause the processor to execute a data cost calculation method based on data lineage. The network interface of the computer device is used to connect to and communicate with a terminal. Those skilled in the art will understand that the structure shown in Figure 2 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.

As shown in Figure 3, in one embodiment, a data cost calculation method based on data lineage is proposed, where the data cost covers direct or indirect expenses and expenditures. This application can also be applied to data warehouse scenarios, thereby supporting big data construction. The data cost calculation method may be applied to the above-mentioned computer device 110 and display device 120, and specifically may include the following steps:

Step 31: Obtain the SQL statements used in the data processing process or the scripts used in the data processing process, and generate the data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph.

Specifically, data processing and data volume in a data warehouse resemble a pyramid: data is processed and stored bottom-up, and the volume of bottom-level data and the resources used to process it are far larger than the volume of data ultimately provided for use. For data at the top of the pyramid, its own processing and storage cost does not reflect its true production cost; it is more reasonable to include the production and storage cost of the lower-level data involved in producing it. The cumulative cost of data can therefore be calculated from the data lineage. There are two ways to do so. The first calculates the overall cost of each node in the lineage and then accumulates it level by level along the lineage until a limiting condition is met. The second uses the directed acyclic graph generated from the data lineage, calculates the cost of the nodes and the cost of the edges separately, and then sums the costs of the edges and nodes related to the calculation target. This method adopts the second way in order to calculate the data cost correctly. A concrete example follows, in which the relevant indicators of a customer's daily average deposit balance are calculated in these steps:

Step 1. Read data from the local currency current account table (A, store local currency current account and balance data), write it into the local currency daily average deposit balance table (E), and calculate the daily customer local currency current deposit balance (A->E);

Step 2. Read data from the local currency fixed-term account table (B, store local currency fixed-term account and balance data), write it into the local currency daily average deposit balance table (E), and calculate the daily customer local currency fixed deposit balance (B->E);

Step 3. Read data from the foreign currency current account table (C, store foreign currency current account and balance data), write it into the foreign currency daily average deposit balance table (F), and calculate the daily customer foreign currency current account balance (C->F);

Step 4. Read data from the foreign currency fixed-term account table (D, store foreign currency fixed-term account and balance data), write it into the foreign currency daily average deposit balance table (F), and calculate the daily customer foreign currency fixed deposit balance (D->F);

Step 5. Read data from the local currency daily average deposit balance table (E, which stores user ID and local currency deposit balance data) and from the foreign currency daily average deposit balance table (F, which stores user ID and foreign currency deposit balance data), write it into the customer's daily average deposit balance table (G, which stores user ID and balance data), and calculate the customer's daily average deposit balance (E->G, F->G).

Steps 1 to 4 all also need to read the customer account relationship table (Z, which stores the correspondence between user IDs and accounts) so that the customer information is written into the target table at the same time. Each step executes the corresponding SQL statement, which reads data from the source tables, processes it, and writes it into the target table. The data lineage is generated by parsing the executed SQL statements to obtain the relationships between tables and between fields. These relationships can be stored as a two-dimensional table in which each lineage record describes one relationship between data items, for example field A -> field E. From multiple lineage records, a directed acyclic graph (DAG) as shown in Figure 4 can be drawn.
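As a minimal sketch of how lineage records stored as a two-dimensional table could be turned into a graph, the Python below keeps one (source, target) row per table-level relationship and walks the graph upward from a target. The table letters follow the example; the function names and the row layout are illustrative assumptions, not part of the application.

```python
from collections import defaultdict

# Table-level lineage records for the example, one (source, target) row each.
LINEAGE = [
    ("A", "E"), ("B", "E"), ("Z", "E"),
    ("C", "F"), ("D", "F"), ("Z", "F"),
    ("E", "G"), ("F", "G"),
]

def build_parents(records):
    """Map each target node to the set of its direct source nodes."""
    parents = defaultdict(set)
    for source, target in records:
        parents[target].add(source)
    return parents

def upstream(node, parents):
    """Return every distinct node the given node depends on, directly or indirectly."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for src in parents.get(current, ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

parents = build_parents(LINEAGE)
print(sorted(upstream("G", parents)))  # ['A', 'B', 'C', 'D', 'E', 'F', 'Z']
```

Collecting the upstream nodes into a set means that a shared source such as Z is reported once, however many paths lead to it.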

Please further refer to Figure 4. The nodes in the figure represent data storage, and the connections between nodes represent data processing. A node can represent a data table, a record, or a single field, and a directed edge between nodes represents the computing resources occupied by the corresponding data processing. All edges in the graph are directed, pointing from the data source table or field to the data target table or field. The cost calculation based on data lineage mainly covers the cost of the storage and computing resources used during data storage and processing; the costs of manpower, space, electricity, and other resources are not considered in this data cost calculation method. It should be noted that the method only uses the resulting data lineage and is not concerned with how the lineage was generated; even manually written lineage can be used.

Further, in one embodiment, generating the data lineage from the SQL statements contained in the processing script, and generating the directed acyclic graph from the data lineage, specifically includes:

S311. Extract normalized SQL statements from the script file containing the SQL code, and complete the cleaning of the SQL statements;

Further, S311 includes:

S3111. Obtain the script file containing the SQL code, and locate the flags that mark the SQL code;

Preferably, the script file may be a script such as a Perl script.

S3112. Use the flags to filter out irrelevant content in the script file, and retain the normalized SQL statements.

S312. Perform lexical analysis on the normalized SQL statements to generate the data lineage, and generate a directed acyclic graph from the data lineage.
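Steps S311 and S312 can be illustrated with a rough Python sketch. The source does not say what the flags look like, so the `#SQL_BEGIN` / `#SQL_END` markers here are hypothetical, and the "lexical analysis" is reduced to a simple token scan that treats the name after INTO as the target and the names after FROM or JOIN as sources. A production parser would need to handle subqueries, aliases, and dialect differences.

```python
import re

# Hypothetical flag markers; the real flags are not specified in the source.
BEGIN, END = "#SQL_BEGIN", "#SQL_END"

def extract_sql(script_text):
    """Keep only the text between the flag markers, one cleaned statement per block."""
    statements, buffer, inside = [], [], False
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped == BEGIN:
            inside, buffer = True, []
        elif stripped == END and inside:
            inside = False
            statements.append(" ".join(" ".join(buffer).split()))
        elif inside:
            buffer.append(stripped)
    return statements

TOKEN = re.compile(r"[A-Za-z_][A-Za-z_0-9.]*")

def table_lineage(sql):
    """Rough token scan: table after INTO is the target, tables after FROM/JOIN are sources."""
    tokens = TOKEN.findall(sql)
    targets, sources = [], []
    for keyword, name in zip(tokens, tokens[1:]):
        kw = keyword.lower()
        if kw == "into":
            targets.append(name)
        elif kw in ("from", "join"):
            sources.append(name)
    return [(s, t) for t in targets for s in sources]

script = """
my $dbh = connect();
#SQL_BEGIN
insert into table_E
select z.cust_id, a.bal
from table_A a
join table_Z z on a.acct_no = z.acct_no
#SQL_END
print "done";
"""

statements = extract_sql(script)
print(table_lineage(statements[0]))
```

For the sample script, the scan yields the two table-level edges table_A -> table_E and table_Z -> table_E.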

Step 32: Obtain the statistical information and frequency information of task execution on the data platform, and map them to the directed acyclic graph.

The statistical information includes the resource usage of each task, such as storage usage, CPU usage, and memory usage; the frequency information includes the number of historical executions of the task and the start and end times of each execution.

Specifically, a task of the data platform can be a SQL statement, and each SQL statement corresponds to one or more edges in the directed acyclic graph. Once this mapping is established, the resource usage corresponding to each edge can be referenced during the calculation.

Specifically, the processing cost of the designated data in each different time period can be calculated according to the different time periods of task execution. For example, a certain task is executed once a month, and the resource usage and cost of related processing can be calculated every quarter or every half year. In this way, the relevant information of the target data can be clearly known based on the statistical information and frequency information, which can facilitate the calculation of the data cost in each time period.
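The mapping from tasks to edges and the per-period aggregation might look like the following Python sketch. The execution log layout, the task id `X1`, and the usage figures are all made-up assumptions for illustration; only the idea of a monthly task being summed per quarter comes from the text.

```python
from collections import defaultdict
from datetime import date

# Hypothetical execution log: (task id, run date, CPU core-seconds, memory GB-seconds).
RUNS = [
    ("X1", date(2020, 1, 31), 2000, 500),
    ("X1", date(2020, 2, 29), 2200, 520),
    ("X1", date(2020, 4, 30), 2100, 510),
]

# Hypothetical mapping from each task (SQL statement) to the DAG edges it produces.
TASK_EDGES = {"X1": [("A", "E"), ("Z", "E")]}

def usage_by_quarter(runs):
    """Sum CPU and memory usage per (task, year, quarter)."""
    totals = defaultdict(lambda: [0, 0])
    for task, day, cpu, mem in runs:
        key = (task, day.year, (day.month - 1) // 3 + 1)
        totals[key][0] += cpu
        totals[key][1] += mem
    return {key: tuple(value) for key, value in totals.items()}

quarterly = usage_by_quarter(RUNS)
for (task, year, quarter), (cpu, mem) in sorted(quarterly.items()):
    print(task, TASK_EDGES[task], f"{year}Q{quarter}", cpu, mem)
```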

Step 33: Calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph.

Of the two ways of calculating the cumulative cost described above, the first may count a node that is referenced multiple times more than once, which enlarges the error of the result. For example, nodes A, B, C, and D in Figure 4 would each accumulate the cost of node Z. The second way calculates the cost of each node separately, then the cost of each edge, and finally takes the sum of the two as the cost of the target data. Its result is more accurate, and it is the data cost calculation method described in this application.
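The double counting can be made concrete with a small Python sketch. The storage costs are invented numbers, and the graph uses the table-level edges Z→E and Z→F from the SQL example as a simplifying assumption; the point is only that level-by-level accumulation reaches Z along more than one path, while a distinct-node sum counts it once.

```python
# Hypothetical storage costs per node (yuan) and direct sources per node.
STORAGE = {"A": 10, "B": 10, "C": 10, "D": 10, "Z": 2, "E": 3, "F": 3, "G": 1}
PARENTS = {"E": ["A", "B", "Z"], "F": ["C", "D", "Z"], "G": ["E", "F"]}

def cumulative_naive(node):
    """First way: a node's cost plus the cumulative cost of each direct source."""
    return STORAGE[node] + sum(cumulative_naive(p) for p in PARENTS.get(node, []))

def cumulative_distinct(node):
    """Node part of the second way: every upstream node is counted exactly once."""
    seen, stack = {node}, [node]
    while stack:
        for p in PARENTS.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return sum(STORAGE[n] for n in seen)

print(cumulative_naive("G"))     # 51: Z is counted via E and again via F
print(cumulative_distinct("G"))  # 49: Z is counted once
```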

Further, when data is processed in batches in a big data environment, the main resources occupied are storage, CPU, and memory (MEM). Storage is measured in bytes multiplied by the number of redundant copies; CPU is measured in seconds × number of cores; memory is measured in seconds × MB. In a cloud environment the calculation is relatively simple, because purchased resources can be converted directly into these units; a traditional environment requires a reasonable way of converting software and hardware costs into the corresponding units. In short, a unit-price parameter for resource usage is introduced per data platform, since the unit price of resource usage may differ between platforms, and the data cost is calculated according to the technology and the type of hardware used for data processing and storage. Furthermore, within the same enterprise, a reasonable and unified pricing method for data exchange can be formed on the basis of the data cost.

Specifically, the following example is used for illustration. The resource costs of the current big data processing environment are as follows:

1000 CPU cores cost 1 million yuan per year, so the price per core·second is about 1000000/1000 (number of cores)/(365*86400) = 0.0000317 yuan;

5 TB of memory costs 500,000 yuan per year, so the price per GB·second is about 500000/(5*1024)/(365*86400) = 0.0000030966 yuan;

20 TB of storage costs 50,000 yuan per year, so the annual price per GB is about 50000/(20*1024) = 2.4414 yuan.

According to Figure 4, assume that the computing resources used to execute the aforementioned SQL (processing instructions) are 2000 core·s of CPU and 500 GB·s of MEM, and that the data of node A occupies 10 GB of storage, the data of node Z occupies 2 GB, and the data of node E occupies 3 GB. The processing and storage cost for this part of the directed acyclic graph is then (CPU unit price) 0.0000317*2000 + (memory unit price) 0.0000030966*500 + (storage unit price) 2.4414*(10+2+3) = 0.0634 + 0.0015483 + 36.621 = 36.6859483 yuan. The data cost of this part can thus be calculated accurately and quickly.
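The arithmetic above can be checked with a few lines of Python. The variable and function names are illustrative; the figures are the ones given in the example. Because the script keeps full precision for the unit prices instead of the rounded values, its total differs from the hand calculation in the fourth decimal place.

```python
SECONDS_PER_YEAR = 365 * 86400

# Unit prices derived from the annual costs in the example.
cpu_price = 1_000_000 / 1000 / SECONDS_PER_YEAR      # yuan per core-second
mem_price = 500_000 / (5 * 1024) / SECONDS_PER_YEAR  # yuan per GB-second
storage_price = 50_000 / (20 * 1024)                 # yuan per GB per year

def partial_cost(cpu_core_s, mem_gb_s, storage_gb):
    """Processing cost (CPU + memory) plus storage cost for one part of the graph."""
    return (cpu_core_s * cpu_price
            + mem_gb_s * mem_price
            + storage_gb * storage_price)

# 2000 core-seconds, 500 GB-seconds, and 10 + 2 + 3 GB stored for nodes A, Z, and E.
cost = partial_cost(2000, 500, 10 + 2 + 3)
print(f"{cpu_price:.7f} {mem_price:.10f} {storage_price:.4f}")
print(f"{cost:.2f}")
```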

Further, in one embodiment, to calculate the cost $C_k$ of data node (table) K, the source tables (nodes in the DAG) and the resources consumed by the related processing SQL (edges in the DAG) are obtained through the data lineage. $S_i$ denotes the cost of the storage resources occupied by a related node, and $X$ denotes the resources consumed by the SQL that processes and generates the target table. Several SQL statements may generate the target table data, so $X_p$ denotes the cost of the resources consumed by each SQL statement. Each SQL statement may be executed multiple times, so $X_{pq}$ denotes the cost of the resources consumed by each SQL statement in each execution. The lineage generated by one SQL statement may correspond to several edges in the DAG, and $\mathrm{count}(L_x)$ denotes the number of edges in the DAG corresponding to each SQL statement. Referring to Figure 5, the calculation is as follows:

331. Calculate the cost of a node in a directed acyclic graph;

Specifically, the cost of a node is its storage cost. According to the above description, the node cost is calculated as $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the cost of the storage resources occupied by a related node and $S_k$ represents the storage cost of the target data.

332. Calculate the cost of edges in the directed acyclic graph.

Specifically, the cost of an edge is the cost of CPU and MEM. According to the above description, the edge cost is calculated as:

$$\sum_p \sum_q X_{pq} \cdot \frac{N_{Lp}}{\mathrm{count}(L_x)}$$

where $N_{Lp}$ represents the number of edges related to the target data, $X_{pq}$ represents the cost of the resources consumed by each processing instruction in each execution, and $\mathrm{count}(L_x)$ represents the number of edges in the directed acyclic graph corresponding to each processing instruction.

Step 34: Obtain the costs of the edges and nodes, and add them to obtain the total cost of the target data.

When the SQL statement is multi-in and one-out (insert…from…), $N_{Lp}$ is equal to $\mathrm{count}(L_x)$; when the SQL statement is multi-in and multi-out (from…insert…insert…), $N_{Lp}$ is less than $\mathrm{count}(L_x)$.

Based on this, the total cost of the target data can be summarized; that is, the total data cost $C_k$ of node (table) K is given by the following formula:

$$C_k = \sum_i \mathrm{distinct}\{S_i\} + S_k + \sum_p \sum_q X_{pq} \cdot \frac{N_{Lp}}{\mathrm{count}(L_x)}$$

Further, an example is given. Taking the data processing of node G in Figure 4 as an example, the SQL statements are multi-in and one-out, and a total of 5 SQL statements are involved, namely:

A+Z→E is $X_1$:

    insert into table_E
    select z.cust_id, a.bal
    from table_A a
    join table_Z z
      on a.acct_no = z.acct_no;

From this SQL, the table-level data lineage can be generated:

A→E is denoted $L_{AE}$ and Z→E is denoted $L_{ZE}$. This SQL corresponds to the two edges Z→E and A→E in the graph: the cust_id data in table E comes from table Z, and the bal data in table E comes from table A.

For $X_1$, $\mathrm{count}(L_{x1}) = 2$ and $N_{L1} = 2$. By analogy, B+Z→E is $X_2$, C+Z→F is $X_3$, and D+Z→F is $X_4$, each with $\mathrm{count}(L_x) = 2$ and $N_{Lp} = 2$.

E+F→G is $X_5$:

    insert into table_G
    select nvl(e.cust_id, f.cust_id) as cust_id,
           sum(nvl(e.bal, 0) + nvl(f.bal, 0)) as bal
    from table_E e
    full outer join table_F f
      on e.cust_id = f.cust_id
    group by nvl(e.cust_id, f.cust_id);

For $X_5$, $\mathrm{count}(L_{x5}) = 2$ and $N_{L5} = 2$.

The data in table G comes from tables A, B, C, D, Z, E, and F. Node Z appears multiple times in the DAG, and the storage cost of a node that appears multiple times must be deduplicated when calculating the cost, so in $\mathrm{distinct}\{S_i\}$, $i \in \{A, B, C, D, Z, E, F\}$. Assuming that each SQL statement has been executed 10 times (that is, multiple times on the same day), q ranges from 1 to 10. From the above, the total cost $C_G$ of table G is:

$$C_G = \sum_{i \in \{A,B,C,D,Z,E,F\}} \mathrm{distinct}\{S_i\} + S_G + \sum_{p=1}^{5} \sum_{q=1}^{10} X_{pq} \cdot \frac{2}{2}$$
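A minimal Python sketch of the total for table G follows, assuming the allocation described above: distinct upstream storage costs, plus the target's own storage cost, plus each statement's per-run costs scaled by N_Lp / count(L_x). All numeric values are invented for illustration, and the function name and argument layout are assumptions.

```python
def total_cost(target, storage, upstream_nodes, sql_runs):
    """Distinct upstream storage + target storage + allocated processing cost.

    sql_runs maps a statement id to (per-run costs X_pq, N_Lp, count(L_x)).
    """
    node_cost = sum(storage[n] for n in set(upstream_nodes)) + storage[target]
    edge_cost = sum(
        sum(run_costs) * n_lp / edge_count
        for run_costs, n_lp, edge_count in sql_runs.values()
    )
    return node_cost + edge_cost

# Hypothetical storage costs in yuan for each table.
storage = {"A": 10, "B": 10, "C": 10, "D": 10, "Z": 2, "E": 3, "F": 3, "G": 1}

# Z is listed once per statement that reads it; set() removes the repeats.
upstream_of_g = ["A", "Z", "B", "Z", "C", "Z", "D", "Z", "E", "F"]

# Five statements, ten runs each at a hypothetical 0.5 yuan, N_Lp = count(L_x) = 2.
sql_runs = {f"X{p}": ([0.5] * 10, 2, 2) for p in range(1, 6)}

c_g = total_cost("G", storage, upstream_of_g, sql_runs)
print(c_g)  # 74.0 = 49 (storage) + 25 (processing)
```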

Further, in the current big data environment, data processing is performed at the table level, and the table-level data cost can be calculated as described above. Field-level cost can be derived from it. For example, if table G in Figure 4 contains 11 data fields, the cost of table G divided by 11 can be used as the cost of each field. Alternatively, storage cost can be apportioned by size: if each record in table G stores 20 bytes in total, of which 10 fields store 1 byte each and the remaining field stores 10 bytes, then the storage cost of the 10-byte field is 50% of the storage cost of table G, and the storage cost of each of the other fields is 5% of it. Record-level cost is calculated similarly. For example, if table G contains 100,000 records, the cost of each record is $C_G/100000$.
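The field-level and record-level apportioning can be sketched in Python as follows. The field names and the table cost of 52 yuan are made up; the byte sizes and the record count are those of the example.

```python
def field_storage_shares(field_bytes):
    """Split a table's storage cost across fields in proportion to bytes per record."""
    total = sum(field_bytes.values())
    return {name: size / total for name, size in field_bytes.items()}

# Table G as in the example: ten 1-byte fields and one 10-byte field.
fields = {f"f{i}": 1 for i in range(1, 11)}
fields["bal"] = 10
shares = field_storage_shares(fields)

table_cost = 52.0                       # hypothetical C_G in yuan
equal_field_cost = table_cost / len(fields)
record_cost = table_cost / 100_000      # 100,000 records in table G

print(shares["bal"], shares["f1"])      # 0.5 0.05
print(record_cost)
```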

In another embodiment, the SQL statement is multi-in and multi-out. Refer to Figure 6 for the corresponding graph. The processing SQL is as follows:

    from table_A a
    join table_B b
      on a.id = b.id
    insert into table_C
      select a.id, a.bal + b.bal
      where a.type = 1 and b.type = 2
    insert into table_D
      select b.id, a.bal + b.bal
      where a.type = 3 and b.type = 4;

This SQL generates 4 edges, as shown in Figure 6. Assuming that the resource cost of a single execution of this SQL is $X_p$, then $\mathrm{count}(L_x) = 4$. When the processing cost of node D is calculated, only two edges are related to node D, namely A→D and B→D, so $N_{Lp} = 2$. Assuming that this SQL has also been executed q = 10 times, substituting into the formula gives the processing cost of node D after these 10 executions:

$$\sum_{q=1}^{10} X_{pq} \cdot \frac{2}{4}$$
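For the multi-in, multi-out case, the share of one statement's cost assigned to a target can be derived directly from the edge list. In this Python sketch the edge list matches Figure 6 as described, while the per-run cost of 4 yuan is a made-up value and the function name is an assumption.

```python
# Edges produced by the single multi-insert statement.
SQL_EDGES = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D")]

def allocated_cost(target, edges, run_costs):
    """Share of the statement's cost for one target: sum(X_pq) * N_Lp / count(L_x)."""
    n_lp = sum(1 for _, dst in edges if dst == target)
    return sum(run_costs) * n_lp / len(edges)

runs = [4.0] * 10  # hypothetical: ten executions at 4 yuan each

print(allocated_cost("D", SQL_EDGES, runs))  # 20.0
print(allocated_cost("C", SQL_EDGES, runs))  # 20.0
```

Under this allocation the shares of C and D add up to the full cost of the statement, so no processing cost is counted twice or lost.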

The steps above describe the data cost calculation method based on data lineage. The method can be applied to the cost calculation of table-level and field-level data. Record-level cost is derived from the table-level or field-level cost by averaging over the number of records. Specifically, the data processing (SQL) corresponds to the edges in the graph; because each edge processes multiple records of a table in a batch, an average can be used to calculate the cost for data of the same table that was processed in multiple batches.

Further, in an embodiment, the amount of resources used by the same SQL may differ between executions because the amount of data changes. Take A->E in Figure 4: assuming that the first execution costs 10 yuan in processing resources and generates 10,000 records, and the second execution costs 12 yuan and generates 14,000 records, the average processing cost of these 24,000 records is (10+12)/24000, or about 0.00092 yuan per record.
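The averaging over batches can be written as a short Python function. The function name is illustrative; the two batches are the ones from the example.

```python
def average_record_cost(batches):
    """batches: one (processing cost in yuan, records produced) pair per execution."""
    total_cost = sum(cost for cost, _ in batches)
    total_records = sum(records for _, records in batches)
    return total_cost / total_records

avg = average_record_cost([(10, 10_000), (12, 14_000)])
print(f"{avg:.6f}")  # 0.000917
```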

In an optional implementation, the calculation result of the data cost calculation method based on data lineage can also be uploaded to a blockchain.

Specifically, summary information is obtained from the calculation result by hashing it, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users. A user can download the summary information from the blockchain to verify whether the calculation result has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
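The hashing and verification step can be sketched with Python's standard `hashlib`. The structure of the result record and the JSON serialization are assumptions made for illustration, and the upload to a blockchain is deliberately left out because the source does not name a platform or API.

```python
import hashlib
import json

def result_digest(result):
    """Serialize the result deterministically, then hash it with SHA-256."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def is_untampered(result, published_digest):
    """Recompute the digest and compare it with the one stored on the chain."""
    return result_digest(result) == published_digest

# Hypothetical calculation result for table G.
result = {"target": "G", "total_cost_yuan": 36.69}
digest = result_digest(result)

print(digest)
print(is_untampered(result, digest))
```

Sorting the keys before hashing keeps the digest stable regardless of the order in which the result fields were assembled.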

This application provides a data cost calculation method based on the blood relationship of the data. By defining a data set, a directed acyclic graph is generated based on the blood relationship of the data; the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph are calculated; and the costs of the edges and nodes are obtained and added to give the total cost of the target data. Therefore, by taking the blood relationship of the data into account, the cost of the data can be calculated and displayed at a finer granularity, and the pricing of data applications can be made more reasonable. Furthermore, a more detailed and reasonable reference is provided for evaluating the value of data inside and outside the enterprise, which facilitates calculating the cost of data at the finest granularity, so that the cost of each piece of data can be accurately quantified. This application also involves blockchain technology.

As shown in Figure 7, the present application also provides a data cost calculation system based on data blood relationship. The data cost calculation system can be integrated into the above-mentioned computer device 110, and specifically can include a data set module 20, an information module 30, a first calculation module 40, and a second calculation module 50.

The data set module 20 is used to obtain SQL statements used in the data processing process or scripts used in the data processing process, and to generate a data blood relationship through the SQL statements or the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph;

The information module 30 is used to obtain the statistical information and frequency information of the task execution of the data platform, and to associate them with the directed acyclic graph;

The first calculation module 40 is used to calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph;

The second calculation module 50 is used to obtain the cost of the edges and nodes, and add them to obtain the total cost of the target data.
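The work of the data set module can be sketched as follows. This is a minimal illustration rather than the parser used in the application: the regular expressions, the function names `lineage_edges` and `build_dag`, and the sample statements are all hypothetical, and real SQL would need proper lexical analysis. The acyclicity check relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
import re
from graphlib import TopologicalSorter

# Simplified, hypothetical patterns; real SQL needs a lexer or parser.
TARGET_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def lineage_edges(sql):
    """Return (source, target) table pairs found in one SQL statement."""
    targets = TARGET_RE.findall(sql)
    sources = SOURCE_RE.findall(sql)
    return [(src, dst) for dst in targets for src in sources]


def build_dag(statements):
    """Build a predecessor map and a topological order of the tables."""
    predecessors = {}
    for sql in statements:
        for src, dst in lineage_edges(sql):
            predecessors.setdefault(dst, set()).add(src)
            predecessors.setdefault(src, set())
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    order = list(TopologicalSorter(predecessors).static_order())
    return predecessors, order


statements = [
    "INSERT INTO e SELECT * FROM a JOIN b ON a.id = b.id",
    "INSERT OVERWRITE TABLE k SELECT * FROM e",
]
predecessors, order = build_dag(statements)
```

Here each table is a node and each (source, target) pair is an edge, so the statistics gathered by the information module could later be attached to the edge that a given statement produced.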

In one embodiment, the statistical information includes the resource usage of each task, where the resource usage includes information such as storage usage, CPU usage, and memory usage; the frequency information includes information such as the historical number of executions of the task and the start and end times of execution.

In one embodiment, the first calculation module 40 is used to calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph.

Among them, in one embodiment, the cost of the node in the directed acyclic graph is calculated. Specifically, the cost of the node is the storage cost. According to the above description, the calculation formula of the node is: $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the storage resource cost occupied by a related node, and $S_k$ represents the storage cost of the target data.
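The node formula can be illustrated with a short Python sketch. It assumes that the "related nodes" are all upstream ancestors of the target in the directed acyclic graph and that "distinct" means each such node is counted once even when several paths reach it; both readings, along with the node names and storage costs, are assumptions made for illustration.

```python
def upstream_nodes(predecessors, target):
    """Collect every distinct ancestor of target in the DAG."""
    seen = set()
    stack = list(predecessors.get(target, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(predecessors.get(node, ()))
    return seen


def node_cost(predecessors, storage_cost, target):
    """Sum of distinct S_i over related nodes, plus S_k of the target."""
    related = upstream_nodes(predecessors, target)
    return sum(storage_cost[n] for n in related) + storage_cost[target]


# Hypothetical graph: A feeds both E and F, which both feed the target K.
predecessors = {"K": {"E", "F"}, "E": {"A"}, "F": {"A"}, "A": set()}
storage_cost = {"A": 2.0, "E": 1.0, "F": 1.5, "K": 0.5}  # yuan, hypothetical
cost_k = node_cost(predecessors, storage_cost, "K")  # A is counted once
```

Without the deduplication, node A would be charged twice because it is reachable through both E and F.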

Wherein, the cost of the edge in the directed acyclic graph is calculated. Specifically, the cost of the edge is the cost of the CPU and MEM. According to the above description, the calculation formula of the edge is:

Further, in one embodiment, the second calculation module 50 is used to obtain the cost of the edges and nodes, and add them to obtain the total cost of the target data.

Among them, when the SQL statement is multi-in and one-out (insert…from…), $N_{Lp}$ is equal to $\mathrm{count}(L_x)$; when the SQL statement is multi-in and multi-out (from…insert…insert…), $N_{Lp}$ is less than $\mathrm{count}(L_x)$.
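One way to read the two cases above is that a processing instruction's cost is attributed to the target data in proportion to the share of its edges that relate to the target. The sketch below adopts that reading as an explicit assumption: it is not the edge formula of the application, and the field names, the proportional rule, and all numbers are hypothetical.

```python
def edge_cost(instructions):
    """Apportion each instruction's total run cost by n_lp / count_lx.

    ASSUMPTION: proportional apportionment; not taken from the source.
    """
    total = 0.0
    for ins in instructions:
        share = ins["n_lp"] / ins["count_lx"]  # 1.0 for multi-in, one-out
        total += share * sum(ins["x_pq"])      # x_pq: cost of each execution
    return total


def total_cost(node_cost_value, edge_cost_value):
    """Total cost of the target data: node cost plus edge cost."""
    return node_cost_value + edge_cost_value


instructions = [
    # multi-in, one-out: every edge of the statement relates to the target
    {"n_lp": 2, "count_lx": 2, "x_pq": [10.0, 12.0]},
    # multi-in, multi-out: only 1 of the statement's 4 edges relates to it
    {"n_lp": 1, "count_lx": 4, "x_pq": [8.0]},
]
edges = edge_cost(instructions)   # 22.0 + 2.0
total = total_cost(5.0, edges)    # 5.0 is a hypothetical node cost
```

Under this reading a multi-in, one-out statement charges its full cost to the target, while a multi-in, multi-out statement charges only a fraction, which matches the equal-to and less-than cases stated above.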

In one embodiment, the data cost calculation system further includes a display module (not shown) for displaying calculation results. The display module may be a display of a desktop computer or a display device of other computer equipment.

Please refer to FIG. 8, which is a schematic structural diagram of a device according to an embodiment of the application. As shown in FIG. 8, the device 200 includes a processor 201 and a memory 202 coupled to the processor 201.

The memory 202 stores program instructions for implementing the data cost calculation method based on the blood relationship of the data described in any of the above embodiments.

The processor 201 is configured to execute program instructions stored in the memory 202.

The processor 201 may also be referred to as a CPU (Central Processing Unit). The processor 201 may be an integrated circuit chip with signal processing capability. The processor 201 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Refer to FIG. 9, which is a schematic structural diagram of a storage medium according to an embodiment of the application. The storage medium of the embodiment of the present application stores a program file 301 that can implement all of the above methods. The program file 301 may be stored in the above storage medium in the form of a software product, and the computer-readable storage medium may be non-volatile or volatile. The program file includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage media include media that can store program code, such as a U disk, a mobile hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disk, or terminal devices such as computers, servers, mobile phones, and tablets.

It should be noted that in this document, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, device, article, or method. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.

The serial numbers of the foregoing embodiments of the present application are for description only, and do not represent the superiority or inferiority of the embodiments. Through the description of the above implementation manners, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disk) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the method described in each embodiment of the present application.

Claims

A data cost calculation method based on the blood relationship of the data, wherein the data cost calculation method includes:

Acquire SQL statements used in the data processing process or scripts used in the data processing process, and generate data blood relationship through the SQL statement or the SQL statement contained in the processing script, and the data blood relationship forms a directed acyclic graph;

Obtain the statistical information and frequency information of the task execution of the data platform, and correspond to the directed acyclic graph;

Calculate the cost of nodes and edges related to the target data in the directed acyclic graph;

Obtain the costs of the edges and nodes, and add them to obtain the total cost of the target data.
The data cost calculation method according to claim 1, wherein the statistical information includes resource usage of each task, the resource usage includes storage usage, CPU usage, and memory usage; and the frequency information includes the historical number of executions of the task and the start and end times of execution.
The data cost calculation method according to claim 2, wherein a unit price parameter for the resource usage of the data platform is introduced according to the differences between data platforms; in the calculation process of the data cost, the cost of the node is the storage cost, and the cost of the edge is the cost of CPU and memory.
The data cost calculation method according to claim 1, wherein the calculation of the cost of the node related to the target data in the directed acyclic graph comprises: $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the storage resource cost occupied by a related node, and $S_k$ represents the storage cost of the target data;

The calculation of the cost of the edge related to the target data in the directed acyclic graph:
wherein $N_{Lp}$ represents the number of edges related to the target data, $X_{pq}$ represents the cost of resources consumed by each processing instruction each time, and $\mathrm{count}(L_x)$ represents the number of edges in the directed acyclic graph corresponding to each processing instruction.
The data cost calculation method according to claim 4, wherein said obtaining the costs of the edges and nodes and adding them to obtain the total cost of the target data comprises:
wherein $C_k$ represents the total cost of the target data.
The data cost calculation method according to claim 1, wherein generating the data blood relationship through the SQL statements contained in the processing script, the data blood relationship forming a directed acyclic graph, comprises:

Extract regularized SQL statements from script files containing SQL codes to complete the cleaning of SQL statements;

Perform lexical analysis on the regularized SQL statements to generate data blood relationship, and generate a directed acyclic graph according to the data blood relationship.
The data cost calculation method according to claim 1, wherein, after the total cost of the target data is obtained, the total cost of the target data is uploaded to the blockchain, so that the blockchain stores the total cost of the target data in encrypted form.
A data cost calculation system based on data blood relationship, wherein the data cost calculation system includes:

The data set module is used to obtain the SQL statements used in the data processing or the scripts used in the data processing, and to generate the data blood relationship through the SQL statements or the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph;

The information module is used to obtain the statistical information and frequency information of the task execution of the data platform, and correspond to the directed acyclic graph;

The first calculation module is used to calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph;

The second calculation module is used to obtain the cost of the edges and nodes, and add them to obtain the total cost of the target data.
A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the processor executes the following steps: acquiring SQL statements used in the data processing process or scripts used in the data processing process, and generating a data blood relationship through the SQL statements or the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph; obtaining the statistical information and frequency information of the task execution of the data platform, and associating them with the directed acyclic graph; calculating the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; and obtaining the costs of the edges and nodes and adding them to obtain the total cost of the target data.
The computer device according to claim 9, wherein the statistical information includes resource usage of each task, the resource usage includes storage usage, CPU usage, and memory usage; and the frequency information includes the historical number of executions of the task and the start and end times of execution.
The computer device according to claim 10, wherein a unit price parameter for the resource usage of the data platform is introduced according to the differences between data platforms; in the calculation process of the data cost, the cost of the node is the storage cost, and the cost of the edge is the cost of CPU and memory.
The computer device according to claim 9, wherein the calculation of the cost of the node related to the target data in the directed acyclic graph comprises: $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the storage resource cost occupied by a related node, and $S_k$ represents the storage cost of the target data;

The calculation of the cost of the edge related to the target data in the directed acyclic graph:
wherein $N_{Lp}$ represents the number of edges related to the target data, $X_{pq}$ represents the cost of resources consumed by each processing instruction each time, and $\mathrm{count}(L_x)$ represents the number of edges in the directed acyclic graph corresponding to each processing instruction.
The computer device according to claim 12, wherein said obtaining the costs of the edges and nodes and adding them to obtain the total cost of the target data comprises:
wherein $C_k$ represents the total cost of the target data.
The computer device according to claim 9, wherein generating the data blood relationship through the SQL statements contained in the processing script, the data blood relationship forming a directed acyclic graph, comprises:

Extract regularized SQL statements from script files containing SQL codes to complete the cleaning of SQL statements;

Perform lexical analysis on the regularized SQL statements to generate data blood relationship, and generate a directed acyclic graph according to the data blood relationship.
A storage medium, in which a program file capable of realizing the following steps is stored, the steps including: obtaining SQL statements used in the data processing process or scripts used in the data processing process, and generating a data blood relationship through the SQL statements or the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph; obtaining the statistical information and frequency information of the task execution of the data platform, and associating them with the directed acyclic graph; calculating the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; and obtaining the costs of the edges and nodes and adding them to obtain the total cost of the target data.
The storage medium according to claim 15, wherein the statistical information includes resource usage of each task, the resource usage includes storage usage, CPU usage, and memory usage; and the frequency information includes the historical number of executions of the task and the start and end times of execution.
The storage medium according to claim 16, wherein a unit price parameter for the resource usage of the data platform is introduced according to the differences between data platforms; in the calculation process of the data cost, the cost of the node is the storage cost, and the cost of the edge is the cost of CPU and memory.
The storage medium according to claim 15, wherein the calculation of the cost of the node related to the target data in the directed acyclic graph comprises: $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the storage resource cost occupied by a related node, and $S_k$ represents the storage cost of the target data;

The calculation of the cost of the edge related to the target data in the directed acyclic graph:
wherein $N_{Lp}$ represents the number of edges related to the target data, $X_{pq}$ represents the cost of resources consumed by each processing instruction each time, and $\mathrm{count}(L_x)$ represents the number of edges in the directed acyclic graph corresponding to each processing instruction.
The storage medium of claim 18, wherein said obtaining the cost of the edges and nodes and adding them to obtain the total cost of the target data comprises:
wherein $C_k$ represents the total cost of the target data.
The storage medium of claim 15, wherein generating the data blood relationship through the SQL statements contained in the processing script, the data blood relationship forming a directed acyclic graph, comprises:

Extract regularized SQL statements from script files containing SQL codes to complete the cleaning of SQL statements;

Perform lexical analysis on the regularized SQL statements to generate data blood relationship, and generate a directed acyclic graph according to the data blood relationship.