WO2021174945A1 - Data cost calculation method, system, computer device, and storage medium - Google Patents

Data cost calculation method, system, computer device, and storage medium

Info

Publication number
WO2021174945A1
WO2021174945A1 PCT/CN2020/135737 CN2020135737W WO2021174945A1 WO 2021174945 A1 WO2021174945 A1 WO 2021174945A1 CN 2020135737 W CN2020135737 W CN 2020135737W WO 2021174945 A1 WO2021174945 A1 WO 2021174945A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
cost
directed acyclic
acyclic graph
blood relationship
Prior art date
Application number
PCT/CN2020/135737
Other languages
French (fr)
Chinese (zh)
Inventor
陈玉
张茜
凌海挺
刘丽扬
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021174945A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance

Definitions

  • This application relates to the field of data processing technology, in particular to data cost calculation methods, systems, computer equipment and storage media.
  • The inventor realizes that most current data costs are calculated statistically from the overall processing procedure and the storage resources occupied, so table-level, field-level, or record-level costs cannot be obtained. Only when the data cost is clear can reasonable pricing or cost settlement be made when the data is used internally or externally.
  • The inventor also realizes that the cost of data can be calculated from the expenses incurred by the use of related resources, but other data used during data processing should also be counted toward the cost of the current data, and more perspectives are possible for assessing the cost or value of data.
  • This application provides a data cost calculation method, system, computer equipment, and storage medium that can calculate and display the cost of data at a finer granularity and, at the same time, make the pricing of data applications more reasonable.
  • this application provides a data cost calculation method based on the blood relationship of the data, and the data cost calculation method includes:
  • the present application also provides a data cost calculation system based on the blood relationship of the data, and the data cost calculation system includes:
  • the data set module is used to obtain the SQL statements or scripts used in data processing, and to generate the data blood relationship from the SQL statements or from the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph;
  • the information module is used to obtain the statistical information and frequency information of the task execution of the data platform, and correspond to the directed acyclic graph;
  • the first calculation module is used to calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph;
  • the second calculation module is used to obtain the cost of the edges and nodes, and add them to obtain the total cost of the target data.
  • The present application also provides a computer device, including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: obtain the SQL statements or scripts used in data processing, and generate the data blood relationship from the SQL statements or from the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph; obtain the statistical information and frequency information of task execution on the data platform, and map them to the directed acyclic graph; calculate the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; obtain the costs of the edges and nodes, and add them up to obtain the total cost of the target data.
  • This application also provides a storage medium that stores a program file capable of implementing the following steps: obtain the SQL statements or scripts used in data processing, and generate the data blood relationship from the SQL statements or from the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph; obtain the statistical information and frequency information of task execution on the data platform, and map them to the directed acyclic graph; calculate the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; obtain the costs of the edges and nodes, and add them up to obtain the total cost of the target data.
  • The foregoing application provides a data cost calculation method, system, computer equipment, and storage medium. The data cost calculation method obtains the SQL statements or scripts used in data processing, and generates the data blood relationship from the SQL statements or from the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph; obtains the statistical information and frequency information of task execution on the data platform, and maps them to the directed acyclic graph; calculates the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; and obtains the costs of the edges and nodes and adds them up to obtain the total cost of the target data.
  • By combining the blood relationship of the data, the data cost calculation method described in this application can calculate and display the cost of data at a finer granularity. At the same time, it makes the pricing of data applications more reasonable and provides a more detailed and reasonable reference for assessing the value of data.
  • Figure 1 is an implementation environment diagram of a data cost calculation method provided in an embodiment
  • Figure 2 is a block diagram of the internal structure of a computer device in an embodiment
  • Figure 3 is a flow chart of a data cost calculation method in an embodiment
  • Figure 4 is a schematic diagram of a directed acyclic graph in an embodiment
  • FIG. 5 is a flowchart of node and edge calculation in a directed acyclic graph in an embodiment
  • FIG. 6 is a schematic diagram of a directed acyclic graph in which SQL statements are multiple-in and multiple-out in an embodiment
  • Figure 7 is a schematic diagram of a data cost calculation system in an embodiment
  • Figure 8 is a schematic structural diagram of a computer device in an embodiment
  • FIG. 9 is a schematic diagram of the structure of a storage medium in an embodiment.
  • FIG. 1 is an implementation environment diagram of a data cost calculation method based on data blood relationship provided in an embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a display device 120.
  • the computer device 110 may be a computer device used by the user, such as a computer, and a data cost calculation system based on the blood relationship of the data is installed on the computer device 110.
  • the user can perform the calculation on the computer device 110 according to the data cost calculation method based on the blood relationship of the data, and display the calculation result on the display device 120.
  • the combination of the computer device 110 and the display device 120 can be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but is not limited to this.
  • Figure 2 is a schematic diagram of the internal structure of a computer device in an embodiment.
  • the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • the database may store control information sequences.
  • When the computer-readable instructions are executed by the processor, the processor can implement a data cost calculation method based on the blood relationship of the data.
  • the processor of the computer equipment is used to provide calculation and control capabilities, and supports the operation of the entire computer equipment.
  • Computer-readable instructions may be stored in the memory of the computer device; when executed by the processor, they may cause the processor to execute a data cost calculation method based on the blood relationship of the data.
  • the network interface of the computer device is used to connect and communicate with the terminal.
  • A data cost calculation method based on the blood relationship of data is proposed. The data cost here refers to direct or indirect expenses and expenditures. This application can also be applied to data warehouse scenarios, thereby promoting the construction of big data.
  • the data cost calculation method may be applied to the above-mentioned computer device 110 and display device 120, and specifically may include the following steps:
  • Step 31 Obtain the SQL statements or scripts used in data processing, and generate the data blood relationship from the SQL statements or from the SQL statements contained in the processing scripts; the data blood relationship forms a directed acyclic graph.
  • Data processing and data volume in a data warehouse resemble a pyramid: data is processed and stored from the bottom up, and the volume of bottom-level data and the resources used to process it are much larger than the volume of data finally provided for use.
  • For such upper-level data, its own processing and storage cost does not reflect its true production cost; it is more reasonable to also include the production and storage cost of the lower-level data involved in its processing. The cumulative cost of data can therefore be calculated conveniently based on the blood relationship of the data.
  • The cumulative cost can be calculated in two ways. The first method calculates the general cost of each node in the blood relationship and then accumulates it step by step along the blood relationship until a limit is met. The second method uses the directed acyclic graph generated from the data blood relationship: it calculates the cost of the nodes and the cost of the edges in the graph separately, and then sums the costs of the edges and nodes related to the calculation target. This application adopts the second method to calculate the data cost correctly.
  • the following specific examples are used to illustrate. For example, the calculation steps of the relevant indicators of the customer’s daily average deposit balance are as follows:
  • Step 1 Read data from the local currency current account table (A, store local currency current account and balance data), write it into the local currency daily average deposit balance table (E), and calculate the daily customer local currency current deposit balance (A->E);
  • Step 2 Read data from the local currency fixed-term account table (B, store local currency fixed-term account and balance data), write it into the local currency daily average deposit balance table (E), and calculate the daily customer local currency fixed deposit balance (B->E);
  • Step 3 Read data from the foreign currency current account table (C, store foreign currency current account and balance data), write it into the foreign currency daily average deposit balance table (F), and calculate the daily customer foreign currency current account balance (C->F);
  • Step 4 Read data from the foreign currency fixed-term account table (D, store foreign currency fixed-term account and balance data), write it into the foreign currency daily average deposit balance table (F), and calculate the daily customer foreign currency fixed deposit balance (D->F);
  • Step 5 Read data from the local currency daily average deposit balance table (E, stores user ID and local currency deposit balance data) and from the foreign currency daily average deposit balance table (F, stores user ID and foreign currency deposit balance data), write it into the customer's daily average deposit balance table (G, stores user ID and balance data), and calculate the customer's daily average deposit balance (E->G, F->G).
  • Steps 1-4 all need to read the customer account relationship table (Z, stores the correspondence between user IDs and accounts) and write the customer information into the target table at the same time.
  • Each step executes the corresponding SQL statement, which reads and processes data from the source table and writes it into the target table.
  • The blood relationship of the data is generated by analyzing the executed SQL statements, yielding table-to-table and field-to-field relationships.
  • The relationships can be stored as a two-dimensional table in which each blood relationship record captures one relationship between data, for example field A->field E. From multiple such records, a directed acyclic graph (DAG) as shown in Figure 4 can be drawn.
  • the nodes in the figure represent data storage, the connections between nodes represent the data processing process; the nodes can represent data tables, records or a single field, and the edges with directions between the nodes represent the related data processing process.
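As a minimal sketch of how such a two-dimensional lineage table could be turned into a graph, the Python below stores one edge per record and walks upstream from a target table. The record layout `(source, target, sql_id)` and the function names are assumptions made for illustration; the table letters and SQL labels follow the deposit-balance example above.

```python
from collections import defaultdict

# Hypothetical table-level lineage records, one row per edge: (source, target, sql_id).
LINEAGE = [
    ("A", "E", "X1"), ("Z", "E", "X1"),
    ("B", "E", "X2"), ("Z", "E", "X2"),
    ("C", "F", "X3"), ("Z", "F", "X3"),
    ("D", "F", "X4"), ("Z", "F", "X4"),
    ("E", "G", "X5"), ("F", "G", "X5"),
]


def build_parents(records):
    """Map each target node to the set of nodes it is directly derived from."""
    parents = defaultdict(set)
    for source, target, _sql_id in records:
        parents[target].add(source)
    return parents


def upstream_nodes(parents, target):
    """Return every node the target depends on, directly or indirectly."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = build_parents(LINEAGE)
print(sorted(upstream_nodes(parents, "G")))
```

Because the upstream nodes are collected in a set, a node such as Z that feeds several tables is returned only once.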
  • The cost calculation related to the blood relationship of data mainly involves the cost of the storage and computing resources used during data storage and processing. Other costs, such as manpower, space, and electricity, are not considered in this data cost calculation method. Note that the method only consumes the resulting blood relationship and is not concerned with how it was generated; even a manually written blood relationship can be used.
  • generating the data blood relationship through the SQL statements contained in the processing script, and generating the directed acyclic graph through the data blood relationship specifically includes:
  • the S311 includes:
  • the script file may be, for example, a Perl script.
  • S312 Perform lexical analysis on the normalized SQL statements to generate the data blood relationship, and generate a directed acyclic graph according to the data blood relationship.
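The idea of deriving table-level lineage from a SQL statement can be sketched very roughly as follows. This is not the lexical analysis of the application, which is not detailed in this text; it is a simplified regular-expression scan that assumes plain `INSERT ... SELECT` statements. The sample SQL is hypothetical: the column names `cust_id` and `bal` appear in the example later in this description, while `acct_id` is invented for the join.

```python
import re

# Target table: the name after INSERT INTO / INSERT OVERWRITE [TABLE].
INSERT_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
# Source tables: the name directly after FROM or JOIN (subqueries are skipped).
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return sorted (source, target) pairs for a simple INSERT ... SELECT."""
    targets = INSERT_RE.findall(sql)
    sources = SOURCE_RE.findall(sql)
    return sorted({(src, dst) for src in sources for dst in targets})


sql = """
INSERT INTO E (cust_id, bal)
SELECT z.cust_id, a.bal
FROM A a JOIN Z z ON a.acct_id = z.acct_id
"""
print(table_lineage(sql))
```

A production implementation would need a real SQL parser: this scan pairs every source with every target, so it over-approximates multi-insert statements, and it can be fooled by constructs such as `EXTRACT(... FROM ...)`.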
  • Step 32 Obtain the statistical information and frequency information of the task execution of the data platform, and correspond to the directed acyclic graph.
  • the statistical information includes the resource usage of each task, the resource usage includes information such as storage usage, CPU usage, and memory usage; the frequency information includes information such as the number of historical executions of the task and the start and end time of execution.
  • A task of the data platform can be a SQL statement, and each SQL statement corresponds to one or more edges in the directed acyclic graph. After the mapping relationship is established, the resource usage corresponding to each edge can be referenced in the calculation process.
  • the processing cost of the designated data in each different time period can be calculated according to the different time periods of task execution. For example, a certain task is executed once a month, and the resource usage and cost of related processing can be calculated every quarter or every half year. In this way, the relevant information of the target data can be clearly known based on the statistical information and frequency information, which can facilitate the calculation of the data cost in each time period.
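To make the per-period calculation concrete, the sketch below groups the cost of individual task runs by quarter. The log format, the sample dates, and the cost figures are all assumptions for illustration; the source only states that execution counts and start and end times are available.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical execution log: (sql_id, start time, cost of that run in yuan).
RUNS = [
    ("X1", datetime(2020, 1, 31, 2, 0), 10.0),
    ("X1", datetime(2020, 2, 29, 2, 0), 12.0),
    ("X1", datetime(2020, 4, 30, 2, 0), 11.0),
]


def cost_per_quarter(runs):
    """Sum the run costs of each SQL per (year, quarter)."""
    totals = defaultdict(float)
    for sql_id, started, cost in runs:
        quarter = (started.month - 1) // 3 + 1
        totals[(sql_id, started.year, quarter)] += cost
    return dict(totals)


print(cost_per_quarter(RUNS))
```

The same grouping works for half-year or yearly periods by changing how the period key is derived from the start time.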
  • Step 33 Calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph.
  • The first method may count nodes that are referenced multiple times repeatedly, which makes the calculation error larger. For example, node A, node B, node C, and node D in Figure 4 would each accumulate the cost of node Z.
  • The second method calculates the cost of each node separately, then the cost of each edge, and finally takes the sum of the two as the cost of the target data. Its result is more accurate, and it is the data cost calculation method described in this application.
  • The main resources occupied are storage, CPU, and memory (MEM). Storage is measured in bytes multiplied by the number of redundant copies; CPU is measured in seconds × number of cores; memory is measured in seconds × MB.
  • the calculation in the cloud environment is relatively simple, and the purchased resources can be converted into the corresponding measurement unit for easy calculation, while the traditional environment requires a reasonable way to convert the software and hardware costs into the corresponding measurement unit for calculation.
  • A unit price parameter for data platform resource usage is introduced, since the unit price of resource usage may differ between data platforms; the data cost is then calculated according to the technology and type of hardware used for data processing and storage. Furthermore, within the same enterprise, a reasonable and unified pricing method can be formed according to the cost of the data during data exchange.
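The measurement units and the unit price parameter can be combined as in the following sketch. The class name, field names, and all prices are hypothetical; the source gives only the units (bytes × copies, seconds × cores, seconds × MB) and the idea that each platform has its own unit prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitPrices:
    """Hypothetical per-platform unit prices, in yuan."""
    per_byte: float          # price of one stored byte (one copy)
    per_core_second: float   # price of one CPU core for one second
    per_mb_second: float     # price of one MB of memory for one second


def storage_cost(size_bytes, copies, prices):
    """Storage is metered in bytes multiplied by the number of redundant copies."""
    return size_bytes * copies * prices.per_byte


def processing_cost(cpu_seconds, cores, mem_seconds, mem_mb, prices):
    """CPU is metered in seconds x cores, memory in seconds x MB."""
    cpu = cpu_seconds * cores * prices.per_core_second
    mem = mem_seconds * mem_mb * prices.per_mb_second
    return cpu + mem


prices = UnitPrices(per_byte=1e-9, per_core_second=1e-4, per_mb_second=1e-7)
print(storage_cost(10**9, 3, prices))
print(processing_cost(120, 4, 120, 2048, prices))
```

Keeping the prices in a separate object means the same usage statistics can be re-priced for a different platform without touching the metering code.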
  • the table of the data source (nodes in the DAG graph) and the related processing SQL (edges in the DAG graph) need to be obtained through the blood relationship of the data.
  • Each SQL statement will be executed multiple times; $X_{pq}$ represents the cost of the resources consumed by SQL statement $p$ in its $q$-th execution. The blood relationship generated by each SQL statement may correspond to multiple edges in the DAG; $\mathrm{count}(L_x)$ denotes the number of edges in the DAG corresponding to each SQL statement. Please refer to Figure 5; the details are as follows:
  • the cost of the node is the storage cost.
  • the calculation formula of the node is: $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the storage resource cost occupied by the relevant node, and $S_k$ represents the storage cost of the target data.
  • the cost of the edge is the cost of CPU and MEM.
  • The calculation formula of the edge uses the following quantities: $N_{Lp}$ represents the number of edges related to the target data, $X_{pq}$ represents the cost of the resources consumed by each processing instruction in each execution, and $\mathrm{count}(L_x)$ represents the number of edges in the directed acyclic graph corresponding to each processing instruction.
  • Step 34 Obtain the costs of the edges and nodes, and add them to obtain the total cost of the target data.
  • $N_{Lp}$ is equal to $\mathrm{count}(L_x)$; when the SQL statement is multi-in and multi-out (from...insert...insert...), $N_{Lp}$ is less than $\mathrm{count}(L_x)$.
  • The total cost of the target data can then be summarized; that is, the total data cost $C_k$ of node (table) K is given by the following calculation formula:
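The formula itself is not reproduced in this text. One reading that is consistent with the symbol definitions given here is offered below as an assumption, not as the application's own expression: each processing instruction's accumulated run cost is shared evenly among its edges, and only the share belonging to edges related to the target data is counted. The index of $\mathrm{count}(L_x)$ is taken to be the same instruction index $p$.

```latex
C_k \;=\; \underbrace{\sum_i \mathrm{distinct}\{S_i\} + S_k}_{\text{node (storage) cost}}
\;+\; \underbrace{\sum_{p} \frac{N_{Lp}}{\mathrm{count}(L_p)} \sum_{q} X_{pq}}_{\text{edge (CPU and MEM) cost}}
```

Under this reading, the ratio $N_{Lp}/\mathrm{count}(L_p)$ equals 1 whenever every edge of the instruction is related to the target data, and is below 1 for a multi-in, multi-out statement of which only some outputs feed the target.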
  • A+Z→E is $X_1$:
  • table-level data blood relationship can be generated:
  • A→E is marked as $L_{AE}$ and Z→E is marked as $L_{ZE}$.
  • This SQL corresponds to the two edges Z→E and A→E in the graph.
  • The cust_id data in table E comes from table Z, and the bal data in table E comes from table A.
  • B+Z→E is $X_2$
  • C+Z→F is $X_3$
  • D+Z→F is $X_4$
  • E+F→G is $X_5$:
  • The data in Table G comes from Tables A, B, C, D, Z, E, and F. Node Z appears multiple times in the DAG, and the storage cost of a node that appears multiple times should be deduplicated when calculating the cost, hence the distinct operator.
  • The total cost of table G is $C_G$, and the formula can be obtained:
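The calculation for table G can be illustrated with a small Python sketch. All storage and run-cost numbers are invented, and the edge rule (each SQL's summed run cost is split evenly over its edges, counting only edges related to the target) is an assumed reading of the symbol definitions rather than the application's stated formula.

```python
# Hypothetical storage cost of each table, in yuan.
STORAGE = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "Z": 1.0, "E": 2.0, "F": 1.5, "G": 0.5}
# Hypothetical cost of every historical run of each SQL, keyed by SQL id.
RUN_COSTS = {"X1": [10.0, 12.0], "X2": [8.0], "X3": [6.0], "X4": [4.0], "X5": [3.0]}
# Edges (source, target) produced by each SQL, following the example above.
EDGES = {
    "X1": [("A", "E"), ("Z", "E")],
    "X2": [("B", "E"), ("Z", "E")],
    "X3": [("C", "F"), ("Z", "F")],
    "X4": [("D", "F"), ("Z", "F")],
    "X5": [("E", "G"), ("F", "G")],
}


def upstream_tables(edges, target):
    """Collect every table the target is derived from; a set, so Z is kept once."""
    all_edges = [edge for sql_edges in edges.values() for edge in sql_edges]
    found, frontier = set(), [target]
    while frontier:
        node = frontier.pop()
        for src, dst in all_edges:
            if dst == node and src not in found:
                found.add(src)
                frontier.append(src)
    return found


def total_cost(target, storage, run_costs, edges):
    """Node cost (deduplicated storage) plus the related share of each SQL's cost."""
    ancestors = upstream_tables(edges, target)
    related = ancestors | {target}
    node_cost = sum(storage[t] for t in ancestors) + storage[target]
    edge_cost = 0.0
    for sql_id, sql_edges in edges.items():
        n_related = sum(1 for _src, dst in sql_edges if dst in related)
        edge_cost += n_related / len(sql_edges) * sum(run_costs[sql_id])
    return node_cost + edge_cost


print(total_cost("G", STORAGE, RUN_COSTS, EDGES))
```

With these made-up figures the storage of A, B, C, D, Z, E, F, and G sums to 19.0 (Z counted once) and the five SQL statements contribute 43.0, so the call for G returns 62.0.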
  • data processing is all at the table level, and the table-level data cost can be calculated according to the above description.
  • For example, Table G in Figure 4 contains 11 data fields.
  • The cost of Table G divided by 11 can be used as the cost of each field. Alternatively, the cost can be apportioned by size: for example, if each record in Table G stores a total of 20 bytes, of which 10 fields store only 1 byte each and the remaining field stores 10 bytes, then the storage cost of the 10-byte field is 50% of the storage cost of table G, and the storage cost of each other field is 5% of that of table G.
  • The record-level cost calculation method is similar. For example, if table G contains 100,000 records, the cost of each record is $C_G$/100000.
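The field-level and record-level apportioning described here amounts to simple proportional splits, sketched below. The table cost of 100 yuan and the field names `f1` to `f11` are hypothetical; the byte sizes and record count follow the example in the text.

```python
def field_costs_by_size(table_cost, field_bytes):
    """Split a table-level cost across fields in proportion to their byte size."""
    total_bytes = sum(field_bytes.values())
    return {name: table_cost * size / total_bytes for name, size in field_bytes.items()}


def record_cost(table_cost, record_count):
    """Average table-level cost per record."""
    return table_cost / record_count


# Ten 1-byte fields plus one 10-byte field, 20 bytes per record in total.
fields = {f"f{i}": 1 for i in range(1, 11)}
fields["f11"] = 10

shares = field_costs_by_size(100.0, fields)
print(shares["f11"], shares["f1"])
print(record_cost(100.0, 100_000))
```

With a table cost of 100, the 10-byte field receives 50 and each 1-byte field receives 5, matching the 50% and 5% shares stated above.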
  • The above steps describe the data cost calculation method based on the blood relationship of the data.
  • The data cost calculation method can be applied to cost calculation for table-level and field-level data.
  • The record-level cost builds on the table-level or field-level cost: it is calculated as an average over the number of records.
  • The data processing SQL corresponds to the edges in the figure. Because each edge processes, in batches, multiple records of a table, the average value can be used to calculate the cost of data processed in multiple batches in the same table.
  • The amount of resources used by the same SQL may differ between runs because the amount of data changes. Take A->E in Figure 4 as an example: if the first run costs 10 yuan in resources and generates 10,000 records, and the second run costs 12 yuan and generates 14,000 records, then the average processing cost of these 24,000 records is (10+12)/24000, or about 0.00092 yuan per record.
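The batch averaging in this example can be checked with a few lines of Python. The function name and the `(cost, records)` tuple layout are illustrative choices; the figures are those of the A->E example.

```python
def average_cost_per_record(batches):
    """Average processing cost per record over several runs of the same SQL.

    Each batch is a (cost_in_yuan, records_generated) pair.
    """
    total_cost = sum(cost for cost, _records in batches)
    total_records = sum(records for _cost, records in batches)
    return total_cost / total_records


# Two runs of A->E: 10 yuan for 10,000 records, then 12 yuan for 14,000 records.
print(round(average_cost_per_record([(10.0, 10_000), (12.0, 14_000)]), 6))
```

Dividing the summed cost by the summed record count weights each run by its volume, which differs from averaging the two per-run unit costs.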
  • Corresponding summary (digest) information is obtained from the calculation result of the data cost calculation method based on the blood relationship of the data.
  • The summary information is obtained by hashing the calculation result of the data cost calculation method based on the blood relationship of the data, for example with the SHA-256 algorithm.
  • Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users.
  • the user can download the summary information from the blockchain in order to verify whether the calculation result of the data cost calculation method based on the blood relationship of the data has been tampered with.
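A minimal sketch of producing and checking such a digest with Python's standard `hashlib` follows. The result structure and the choice of canonical JSON serialization are assumptions; the source only says that the calculation result is hashed, for example with SHA-256, and that the digest is stored on a blockchain (the upload itself is not shown).

```python
import hashlib
import json


def result_digest(result):
    """SHA-256 hex digest of a calculation result, serialized deterministically."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(result, published_digest):
    """Check a locally held result against the digest retrieved from the chain."""
    return result_digest(result) == published_digest


result = {"table": "G", "total_cost": 62.0}  # hypothetical calculation result
digest = result_digest(result)
print(digest)
print(verify(result, digest))
print(verify({"table": "G", "total_cost": 61.0}, digest))
```

Sorting the keys before hashing keeps the digest stable regardless of the order in which the result fields were assembled.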
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • This application provides a data cost calculation method based on the blood relationship of the data.
  • A directed acyclic graph is generated based on the blood relationship of the data; the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph are calculated; the costs of the edges and nodes are obtained and added to get the total cost of the target data. Therefore, after combining the blood relationship of the data, the cost of the data can be calculated and displayed at a finer granularity, and at the same time, the pricing of data applications can be made more reasonable.
  • the present application also provides a data cost calculation system based on data blood relationship.
  • The data cost calculation system can be integrated into the above-mentioned computer device 110, and specifically can include a data set module 20, an information module 30, a first calculation module 40, and a second calculation module 50.
  • the data set module 20 is used to obtain SQL statements used in the data processing process or scripts used in the data processing process, and generate data blood relationship through the SQL statement or the SQL statement contained in the processing script, the data blood relationship Form a directed acyclic graph;
  • the information module 30 is used to obtain the statistical information and frequency information of the task execution of the data platform, and correspond to the directed acyclic graph;
  • the first calculation module 40 is used to calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph;
  • the second calculation module 50 is used to obtain the cost of the edges and nodes, and add them to obtain the total cost of the target data.
  • The statistical information includes the resource usage of each task, and the resource usage includes information such as storage usage, CPU usage, and memory usage; the frequency information includes information such as the number of historical executions of the task and the start and end times of execution.
  • the first calculation module 40 is used to calculate the cost of the node and the cost of the edge related to the target data in the directed acyclic graph.
  • the cost of the node in the directed acyclic graph is calculated.
  • the cost of the node is the storage cost.
  • the calculation formula of the node is: $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the storage resource cost occupied by the relevant node, and $S_k$ represents the storage cost of the target data.
  • the cost of the edge in the directed acyclic graph is calculated.
  • the cost of the edge is the cost of the CPU and MEM.
  • The calculation formula of the edge uses the following quantities: $N_{Lp}$ represents the number of edges related to the target data, $X_{pq}$ represents the cost of the resources consumed by each processing instruction in each execution, and $\mathrm{count}(L_x)$ represents the number of edges in the directed acyclic graph corresponding to each processing instruction.
  • the second calculation module 50 is used to obtain the cost of the edges and nodes, and add them to obtain the total cost of the target data.
  • $N_{Lp}$ is equal to $\mathrm{count}(L_x)$; when the SQL statement is multi-in and multi-out (from...insert...insert...), $N_{Lp}$ is less than $\mathrm{count}(L_x)$.
  • The total cost of the target data can then be summarized; that is, the total data cost $C_k$ of node (table) K is given by the following calculation formula:
  • the data cost calculation system further includes a display module (not shown) for displaying calculation results.
  • the display module may be a display of a desktop computer or a display device of other computer equipment.
  • FIG. 8 is a schematic structural diagram of a device according to an embodiment of the application.
  • the device 200 includes a processor 201 and a memory 202 coupled to the processor 201.
  • the memory 202 stores program instructions for implementing the data cost calculation method based on the blood relationship of the data described in any of the above embodiments.
  • the processor 201 is configured to execute program instructions stored in the memory 202.
  • the processor 201 may also be referred to as a CPU (Central Processing Unit, central processing unit).
  • the processor 201 may be an integrated circuit chip with signal processing capability.
  • The processor 201 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • The general-purpose processor may be a microprocessor, or it may be any conventional processor or the like.
  • FIG. 9 is a schematic structural diagram of a storage medium according to an embodiment of the application.
  • The storage medium of the embodiment of the present application stores a program file 301 that can implement all the above methods. The program file 301 may be stored in the storage medium in the form of a software product; the computer-readable storage medium may be non-volatile or volatile. The program file includes several instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of this application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive (U disk), mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, as well as terminal devices such as computers, servers, mobile phones, and tablets.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided is a data cost calculation method based on data lineage: generating data lineage relationships from SQL statements or from SQL statements contained in processing scripts, the data lineage relationships forming a directed acyclic graph; obtaining statistical information and frequency information on task execution on a data platform, and mapping them onto the directed acyclic graph (S32); calculating the costs of the nodes and edges related to target data in the directed acyclic graph (S33); and obtaining the costs of the edges and nodes and summing them to obtain the total cost of the target data (S34). Combined with data lineage, the cost of data can thus be calculated and displayed at a finer granularity, and the pricing of data applications becomes more reasonable. The invention further provides a more detailed and reasonable reference for evaluating the value of data inside and outside an enterprise, allowing data cost to be calculated at the finest granularity so that the cost of each piece of data can be accurately quantified. The described method also relates to blockchain technology.

Description

Data cost calculation method, system, computer device, and storage medium
This application claims priority to Chinese patent application No. 202011132525.2, filed with the China Patent Office on October 21, 2020 and entitled "Data cost calculation method, system, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of data processing technology, and in particular to a data cost calculation method, system, computer device, and storage medium.
Background
Existing data lineage analysis programs and systems are mostly used for data tracing, dependency and reference analysis, and similar purposes; no case of combining them with data cost calculation has been found. Enterprises now process and store ever more data, and big data technology is widely applied. Data processing and storage consume large amounts of resources, yet the corresponding costs cannot be effectively calculated and displayed. Within enterprises, data cost is currently calculated at a coarse granularity, so differences in data cost cannot be shown at a finer granularity for use in internal management and related decision-making.
The inventor realized that data cost is currently mostly computed in aggregate, by processing job and by storage resources occupied, so costs at the table, field, or record level cannot be obtained. Only when data costs are clear can data be reasonably priced, or its cost settled, when it is used inside or outside the enterprise.
The inventor also realized that the cost of data can be calculated from the expenses incurred in using the related resources, but that other data used during processing should also count toward the cost of the current data, and that more perspectives are available for assessing the cost or value of data.
Summary of the invention
In view of this, this application provides a data cost calculation method, system, computer device, and storage medium that can calculate and display data cost at a finer granularity and make the pricing of data applications more reasonable.
To achieve the above objective, this application provides a data cost calculation method based on data lineage, the method including:
acquiring the SQL statements or the scripts used in data processing, and generating data lineage relationships from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage relationships forming a directed acyclic graph;
obtaining statistical information and frequency information on task execution on the data platform, and mapping them onto the directed acyclic graph;
calculating the costs of the nodes and the costs of the edges related to the target data in the directed acyclic graph;
obtaining the costs of the edges and nodes, and summing them to obtain the total cost of the target data.
To achieve the above objective, this application also provides a data cost calculation system based on data lineage, the system including:
a data set module, configured to acquire the SQL statements or the scripts used in data processing, and to generate data lineage relationships from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage relationships forming a directed acyclic graph;
an information module, configured to obtain statistical information and frequency information on task execution on the data platform, and to map them onto the directed acyclic graph;
a first calculation module, configured to calculate the costs of the nodes and the costs of the edges related to the target data in the directed acyclic graph;
a second calculation module, configured to obtain the costs of the edges and nodes and sum them to obtain the total cost of the target data.
To achieve the above objective, this application also provides a computer device including a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the following steps: acquiring the SQL statements or the scripts used in data processing, and generating data lineage relationships from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage relationships forming a directed acyclic graph; obtaining statistical information and frequency information on task execution on the data platform, and mapping them onto the directed acyclic graph; calculating the costs of the nodes and the costs of the edges related to the target data in the directed acyclic graph; and obtaining the costs of the edges and nodes and summing them to obtain the total cost of the target data.
To achieve the above objective, this application also provides a storage medium storing a program file capable of implementing the following steps: acquiring the SQL statements or the scripts used in data processing, and generating data lineage relationships from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage relationships forming a directed acyclic graph; obtaining statistical information and frequency information on task execution on the data platform, and mapping them onto the directed acyclic graph; calculating the costs of the nodes and the costs of the edges related to the target data in the directed acyclic graph; and obtaining the costs of the edges and nodes and summing them to obtain the total cost of the target data.
This application thus provides a data cost calculation method, system, computer device, and storage medium. The method acquires the SQL statements or the scripts used in data processing and generates data lineage relationships from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage relationships forming a directed acyclic graph; obtains statistical information and frequency information on task execution on the data platform and maps them onto the directed acyclic graph; calculates the costs of the nodes and the costs of the edges related to the target data in the directed acyclic graph; and obtains the costs of the edges and nodes and sums them to obtain the total cost of the target data. By combining cost calculation with data lineage, the method can calculate and display data cost at a finer granularity and make the pricing of data applications more reasonable, giving enterprises a more detailed and reasonable reference for assessing the value of data.
Description of the drawings
FIG. 1 is a diagram of the implementation environment of the data cost calculation method provided in one embodiment;
FIG. 2 is a block diagram of the internal structure of a computer device in one embodiment;
FIG. 3 is a flowchart of the data cost calculation method in one embodiment;
FIG. 4 is a schematic diagram of a directed acyclic graph in one embodiment;
FIG. 5 is a flowchart of the node and edge calculation in a directed acyclic graph in one embodiment;
FIG. 6 is a schematic diagram of a directed acyclic graph in which the SQL statements are many-in, many-out in one embodiment;
FIG. 7 is a schematic diagram of the data cost calculation system in one embodiment;
FIG. 8 is a schematic structural diagram of a computer device in one embodiment;
FIG. 9 is a schematic structural diagram of a storage medium in one embodiment.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain this application and do not limit it.
It will be understood that the terms "first", "second", and so on used in this application may describe various elements, but these elements are not limited by the terms. The terms are only used to distinguish one element from another.
FIG. 1 is a diagram of the implementation environment of the data-lineage-based data cost calculation method provided in one embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a display device 120.
The computer device 110 may be a computer or similar device used by the user, on which a data-lineage-based data cost calculation system is installed. To perform a calculation, the user runs the data-lineage-based data cost calculation method on the computer device 110, and the result is shown on the display device 120.
It should be noted that the computer device 110 and the display device 120 together may be a smartphone, tablet computer, notebook computer, desktop computer, or the like, but are not limited to these.
FIG. 2 is a schematic diagram of the internal structure of a computer device in one embodiment. As shown in FIG. 2, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. The non-volatile storage medium stores an operating system, a database, and computer-readable instructions; the database may store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a data-lineage-based data cost calculation method. The processor provides computing and control capabilities and supports the operation of the whole computer device. The memory may store computer-readable instructions that, when executed by the processor, cause the processor to perform a data-lineage-based data cost calculation method. The network interface is used to connect to and communicate with a terminal. Those skilled in the art will understand that the structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer devices to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange components differently.
As shown in FIG. 3, one embodiment proposes a data cost calculation method based on data lineage, where data cost refers to an enterprise's direct or indirect expenditures and expenses for acquiring, transferring, representing, storing, searching, and processing data. This application can also be applied to data warehouse scenarios, thereby promoting the construction of big data. The method may be applied to the computer device 110 and display device 120 described above, and may include the following steps:
Step 31: Acquire the SQL statements or the scripts used in data processing, and generate data lineage relationships from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage relationships forming a directed acyclic graph.
Specifically, in a data warehouse the processing flow and data volumes resemble a pyramid: data is processed and stored from the bottom up, and the data volume and processing resources at the lower layers are much larger than the volume of data ultimately provided for use. For data at the top of the pyramid, its own processing and storage cost does not reflect its true production cost; it is more reasonable to also include the production and storage costs of the lower-layer data involved in its processing. Data lineage makes it relatively simple to calculate this cumulative cost. Cumulative cost can be calculated in two ways. The first calculates the individual cost of each node in the lineage and then accumulates recursively, level by level along the lineage, until a termination condition is met. The second uses the directed acyclic graph generated from the data lineage to calculate the costs of the nodes and of the edges separately, then sums the costs of the edges and nodes relevant to the calculation target. This method uses the second way so that data cost is calculated correctly. As a concrete example, the indicators related to a customer's daily average deposit balance are calculated as follows:
Step 1. Read data from the local-currency current account table (A, storing local-currency current account numbers and balances), write to the local-currency daily average deposit balance table (E), and calculate each customer's daily local-currency current deposit balance (A->E);
Step 2. Read data from the local-currency fixed-term account table (B, storing local-currency fixed-term account numbers and balances), write to the local-currency daily average deposit balance table (E), and calculate each customer's daily local-currency fixed-term deposit balance (B->E);
Step 3. Read data from the foreign-currency current account table (C, storing foreign-currency current account numbers and balances), write to the foreign-currency daily average deposit balance table (F), and calculate each customer's daily foreign-currency current deposit balance (C->F);
Step 4. Read data from the foreign-currency fixed-term account table (D, storing foreign-currency fixed-term account numbers and balances), write to the foreign-currency daily average deposit balance table (F), and calculate each customer's daily foreign-currency fixed-term deposit balance (D->F);
Step 5. Read data from the local-currency daily average deposit balance table (E, storing user IDs and local-currency deposit balances) and from the foreign-currency daily average deposit balance table (F, storing user IDs and foreign-currency deposit balances), write to the customer daily average deposit balance table (G, storing user IDs and balances), and calculate each customer's daily average deposit balance (E->G, F->G).
Steps 1-4 each also read the customer-account relationship table (Z, storing the mapping between user IDs and account numbers) so that customer information is written to the target table at the same time. Each step executes the corresponding SQL statement, which reads data from the source tables, processes it, and writes it to the target table. Data lineage is produced by analyzing the executed SQL statements to derive the table-to-table and field-to-field relationships. These relationships can be stored as a two-dimensional table in which each lineage record describes one relationship between data items, such as field A -> field E. A directed acyclic graph (DAG) such as the one shown in FIG. 4 can therefore be drawn from multiple lineage records.
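The lineage records from steps 1-5 can be held as plain (source, target) pairs and walked backwards to find everything a target table depends on. The sketch below is illustrative only: the edge list is a deduplicated table-level reading of the steps above (with Z feeding E and F because steps 1-4 join the customer-account table), and the function names are hypothetical rather than taken from the application.

```python
from collections import defaultdict

# Table-level lineage records (source, target) read off steps 1-5.
# Assumption: Z feeds E and F, since steps 1-4 each join table Z.
LINEAGE = [
    ("A", "E"), ("B", "E"), ("C", "F"), ("D", "F"),
    ("Z", "E"), ("Z", "F"), ("E", "G"), ("F", "G"),
]

def build_parents(records):
    """Map each target node to the set of its direct source nodes."""
    parents = defaultdict(set)
    for source, target in records:
        parents[target].add(source)
    return parents

def upstream(node, parents):
    """Return every node reachable by following edges backwards from node."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for src in parents.get(current, ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

PARENTS = build_parents(LINEAGE)
```

Calling `upstream("G", PARENTS)` collects all seven tables that contribute to G, which is the node set the later cost formulas range over.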
Referring further to FIG. 4, nodes in the graph represent stored data and the connections between nodes represent data processing. A node may represent a data table, a record, or a single field, and a directed edge between nodes represents the computing resources consumed by the related processing. All edges are directed, pointing from the source table or field to the target table or field. Lineage-based cost calculation mainly concerns the cost of the storage and computing resources used to store and process data; the costs of labor, premises, electricity, and other resources are outside the scope of this method. Note that the method mainly uses the data lineage results and is indifferent to how they were produced; even manually written lineage results can be used.
Further, in one embodiment, generating the data lineage relationships from the SQL statements contained in the processing scripts, and generating the directed acyclic graph from those relationships, specifically includes:
S311. Extract normalized SQL statements from script files containing SQL code, completing the cleaning of the SQL statements;
Further, S311 includes:
S3111. Obtain a script file containing SQL code and locate the markers of the SQL code;
Preferably, the script file may be a Perl script or the like.
S3112. Use the markers to filter out irrelevant content in the script file, keeping the normalized SQL statements.
S312. Perform lexical analysis on the normalized SQL statements to generate the data lineage relationships, and generate the directed acyclic graph from them.
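Steps S311 and S312 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `SQL_BEGIN`/`SQL_END` comment markers are hypothetical (the application does not say what the markers look like), and the regular expressions only handle a simple single-target `insert into ... select ... from ... join ...` statement. A production system would use a real SQL lexer or parser.

```python
import re

# Hypothetical markers; the source does not specify the real ones.
SQL_BLOCK = re.compile(r"--\s*SQL_BEGIN(.*?)--\s*SQL_END", re.DOTALL)
TARGET = re.compile(r"\binsert\s+into\s+(\w+)", re.IGNORECASE)
SOURCE = re.compile(r"\b(?:from|join)\s+(\w+)", re.IGNORECASE)

def extract_sql(script_text):
    """S311: keep only the text between markers, one statement per block."""
    return [" ".join(block.split()) for block in SQL_BLOCK.findall(script_text)]

def table_lineage(sql):
    """S312 (simplified): return (source, target) pairs for one statement."""
    target = TARGET.search(sql)
    if target is None:
        return []
    return [(src, target.group(1)) for src in SOURCE.findall(sql)]

# A toy Perl-like script with one embedded SQL statement.
SCRIPT = """
print "loading";
-- SQL_BEGIN
insert into table_E
select z.cust_id, a.bal
from table_A a
join table_Z z on a.acct_no = z.acct_no
-- SQL_END
print "done";
"""
```

Running `table_lineage` on each statement returned by `extract_sql(SCRIPT)` yields the table-level pairs that become edges of the directed acyclic graph.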
Step 32: Obtain statistical information and frequency information on task execution on the data platform, and map them onto the directed acyclic graph.
The statistical information includes the resource usage of each task run, covering storage, CPU, and memory usage; the frequency information includes the task's historical execution count and the start and end times of each execution.
Specifically, a task on the data platform may be a single SQL statement, and each SQL statement corresponds to one or more edges in the directed acyclic graph. Once this mapping is established, the resource usage of each edge can be referenced during calculation.
Specifically, the processing cost of specified data can be tallied separately for different periods of task execution. For example, for a task that runs once a month, the resource usage and cost of the related processing can be tallied per quarter or per half year. The statistical and frequency information thus gives a clear picture of the target data and simplifies cost calculation for each period.
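As a rough illustration of step 32, the sketch below groups per-run resource usage by quarter and looks up the usage attached to a given edge. The run log, its numbers, and all names are hypothetical; only the idea of mapping each SQL statement to its edges and tallying usage per period comes from the text.

```python
from collections import defaultdict
from datetime import date

# Hypothetical execution log: (statement id, run date, CPU core*s, memory GB*s).
RUNS = [
    ("X1", date(2020, 1, 31), 2000, 500),
    ("X1", date(2020, 2, 29), 1800, 450),
    ("X1", date(2020, 4, 30), 2200, 550),
]
# Edges each statement maps to in the DAG (X1 is A+Z->E in the example).
STATEMENT_EDGES = {"X1": [("A", "E"), ("Z", "E")]}

def usage_by_quarter(runs):
    """Sum CPU and memory usage per (statement, year, quarter)."""
    totals = defaultdict(lambda: [0, 0])
    for stmt, day, cpu, mem in runs:
        key = (stmt, day.year, (day.month - 1) // 3 + 1)
        totals[key][0] += cpu
        totals[key][1] += mem
    return {key: tuple(value) for key, value in totals.items()}

def usage_for_edge(edge, quarter_totals):
    """Collect usage of every statement whose lineage includes the edge."""
    return {key: value for key, value in quarter_totals.items()
            if edge in STATEMENT_EDGES.get(key[0], [])}
```

With the quarterly totals in hand, multiplying by unit prices gives the processing cost of each edge for each period.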
Step 33: Calculate the costs of the nodes and the costs of the edges related to the target data in the directed acyclic graph.
Of the two ways of calculating cumulative cost, the first may count multiply referenced nodes more than once, producing a large error; in FIG. 4, for example, nodes A, B, C, and D would each accumulate the cost of node Z. The second way, which is the data cost calculation method of this application, calculates the cost of each node and of each edge separately and takes their sum as the cost of the target data, giving a more accurate result.
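The double counting can be seen on a toy version of the example graph. The sketch assumes table-level edges in which Z feeds both E and F, and gives every node a placeholder cost of 1.0; neither the edge layout nor the costs are taken from FIG. 4 itself.

```python
# Assumed direct sources of each derived table; Z is referenced twice.
PARENTS = {"E": ["A", "B", "Z"], "F": ["C", "D", "Z"], "G": ["E", "F"]}
OWN_COST = {name: 1.0 for name in "ABCDEFGZ"}  # placeholder unit costs

def recursive_cost(node):
    """First way: own cost plus the recursive cost of every direct source."""
    return OWN_COST[node] + sum(recursive_cost(p) for p in PARENTS.get(node, []))

def distinct_cost(node):
    """Node part of the second way: each upstream node is counted once."""
    seen, stack = {node}, [node]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sum(OWN_COST[n] for n in seen)
```

For target G the recursive total exceeds the distinct total by exactly one unit, the extra copy of Z reached through both E and F.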
Further, when data is generated by batch processing in a big data environment, the main resources consumed are storage, CPU, and memory (MEM). Storage is measured in bytes, multiplied by a factor according to the number of redundant copies; CPU is measured in seconds * number of cores; memory is measured in seconds * MB. In a cloud environment the calculation is relatively simple, because purchased resources can be converted directly into these units. A traditional environment needs a reasonable way to convert software and hardware costs into these units before calculation. Put simply, a unit-price parameter for resource usage is introduced per data platform, since unit prices may differ between platforms; data cost can in turn inform the choice of technology and hardware used for processing and storage. Within one enterprise, a reasonable and uniform pricing method for data exchange can also be formed on the basis of data cost.
Specifically, as an example, suppose the resource costs of the current big data processing environment are as follows:
1000 CPU cores cost 1 million yuan per year, so the price per core*s is about 1000000/1000 (number of cores)/(365*86400) = 0.0000317 yuan;
5 TB of memory costs 500,000 yuan per year, so the price per GB per second is about 500000/(5*1024)/(365*86400) = 0.0000030966 yuan;
20 TB of storage costs 50,000 yuan per year, so the price per GB per year is about 50000/(20*1024) = 2.4414 yuan.
Referring to FIG. 4, suppose the execution of the aforementioned SQL (processing instruction) uses the following computing resources: CPU 2000 core*s and MEM 500 GB*s; the data of node A occupies 10 GB of storage, node Z occupies 2 GB, and node E occupies 3 GB. The processing and storage cost calculated on this part of the directed acyclic graph is then 0.0000317*2000 (CPU) + 0.0000030966*500 (memory) + 2.4414*(10+2+3) (storage) = 0.0634 + 0.0015483 + 36.621 = 36.6859483 yuan, so the data cost of this part can be calculated accurately and quickly.
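The arithmetic of this example can be checked with a few lines of Python. The sketch uses the stated annual fees (1 million yuan for 1000 cores, 500,000 yuan for 5 TB of memory, 50,000 yuan for 20 TB of storage) and unrounded unit prices, so its total differs from the figure in the text in the fourth decimal place; the variable and function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 86400

cpu_price = 1_000_000 / 1000 / SECONDS_PER_YEAR        # yuan per core*s
mem_price = 500_000 / (5 * 1024) / SECONDS_PER_YEAR    # yuan per GB*s
storage_price = 50_000 / (20 * 1024)                   # yuan per GB per year

def example_cost(cpu_core_s, mem_gb_s, storage_gb):
    """Processing cost plus one year of storage, at the unit prices above."""
    return (cpu_core_s * cpu_price
            + mem_gb_s * mem_price
            + storage_gb * storage_price)

# CPU 2000 core*s, MEM 500 GB*s, storage for nodes A, Z and E.
total = example_cost(2000, 500, 10 + 2 + 3)
```

Storage dominates the total here: the CPU and memory terms together contribute well under one yuan.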
Further, in one embodiment, to calculate the cost $C_k$ of data node (table) K, data lineage is used to obtain the source tables (nodes in the DAG) and the resources consumed by the related processing SQL (edges in the DAG). Here $S_i$ denotes the storage resource cost of a related node, and $X$ denotes the resources consumed by the SQL that produces the target table. Several SQL statements may produce the target table's data, so $X_p$ denotes the cost of the resources consumed by each statement; each statement may be executed many times, so $X_{pq}$ denotes the cost of the resources consumed by statement $p$ on execution $q$; the lineage produced by one statement may correspond to several edges in the DAG, so $\mathrm{count}(L_x)$ denotes the number of DAG edges corresponding to each statement. Referring to FIG. 5, the details are as follows:
331. Calculate the costs of the nodes in the directed acyclic graph;
Specifically, the cost of a node is its storage cost. From the above, the node cost is calculated as $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ denotes the storage resource cost of a related node and $S_k$ denotes the storage cost of the target data.
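A small sketch of the node-cost formula follows, using the storage sizes from the earlier worked example (A 10 GB, Z 2 GB, target E 3 GB) and the storage unit price derived from 50,000 yuan per year for 20 TB. The function name and the choice to deduplicate by table name are assumptions made for illustration.

```python
STORAGE_PRICE = 50_000 / (20 * 1024)  # yuan per GB per year

def node_cost(source_gb, target_gb):
    """sum_i distinct{S_i} + S_k, with sources given as (name, GB) pairs."""
    distinct_sources = dict(source_gb)  # a repeated name is kept once
    return (sum(distinct_sources.values()) + target_gb) * STORAGE_PRICE

# Z is listed twice, as it would be if two statements both read it.
cost_e = node_cost([("A", 10), ("Z", 2), ("Z", 2)], target_gb=3)
```

Because Z is deduplicated, the result matches the 15 GB storage term of the worked example rather than 17 GB.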
332. Calculate the cost of the edges in the directed acyclic graph.
Specifically, the cost of an edge is the cost of CPU and memory (MEM). Based on the description above, the edge cost is calculated as:
Figure PCTCN2020135737-appb-000001
where N_Lp denotes the number of edges related to the target data, X_pq denotes the resource cost of each execution of each processing instruction, and count(L_x) denotes the number of edges in the directed acyclic graph corresponding to each processing instruction.
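The formula image is not reproduced in this text. The sketch below assumes one reading that is consistent with the symbol definitions and with the worked examples later in the description: each statement's execution costs X_pq are summed over its runs and scaled by N_Lp / count(L_x). The dictionary keys and the figures are hypothetical.

```python
def edge_cost(statements):
    """Assumed reading: sum over p and q of X_pq * N_Lp / count(L_x)."""
    total = 0.0
    for stmt in statements:
        share = stmt["related_edges"] / stmt["total_edges"]  # N_Lp / count(L_x)
        total += share * sum(stmt["run_costs"])              # sum over q of X_pq
    return total


# Hypothetical figures: one multi-in, single-out statement (2 of 2 edges)
# and one multi-in, multi-out statement (2 of 4 edges), two runs each.
statements = [
    {"run_costs": [3.0, 3.0], "related_edges": 2, "total_edges": 2},
    {"run_costs": [4.0, 4.0], "related_edges": 2, "total_edges": 4},
]
cost = edge_cost(statements)
```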
Step 34: Obtain the costs of the edges and nodes, and sum them to obtain the total cost of the target data.
When the SQL statement is multi-in, single-out (insert…from…), N_Lp equals count(L_x); when the SQL statement is multi-in, multi-out (from…insert…insert…), N_Lp is less than count(L_x).
Accordingly, the total cost of the target data, that is, the total data cost C_k of node (table) K, can be summarized by the following formula:
Figure PCTCN2020135737-appb-000002
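Since the formula image is not reproduced here, the following is a hedged reconstruction rather than the typeset original. It combines the node-cost expression stated in the text with an edge-cost term inferred from the symbol definitions and from the ratios used in the worked examples (2/2 for multi-in, single-out; 2/4 for node D in the multi-in, multi-out case). The index placement on N and L is an assumption.

```latex
C_k \;=\; \sum_{i} \operatorname{distinct}\{S_i\} \;+\; S_k
      \;+\; \sum_{p} \sum_{q} X_{pq} \cdot \frac{N_{L_p}}{\operatorname{count}(L_{x_p})}
```

Under this reading, the first two terms are the node (storage) cost and the double sum is the edge (CPU and memory) cost.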
Further, an example: taking the data processing of node G in Figure 4, the SQL statements are multi-in, single-out, and five SQL statements are involved in total:
A+Z→E is X_1:
insert into table_E
select z.cust_id, a.bal
from table_A a
join table_Z z
on a.acct_no = z.acct_no;
Table-level data lineage can be generated from this SQL:
A→E is labeled L_AE and Z→E is labeled L_ZE. This SQL corresponds to the two edges Z→E and A→E in the graph: the cust_id data in table E comes from table Z, and the bal data in table E comes from table A.
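How table-level lineage edges might be derived from such a statement can be sketched as follows. This is a deliberately simplified regular-expression approach that handles only a single `insert into ... select ... from ... join ...` statement; it is not the lexical analysis the application refers to, and the function name `table_lineage` is hypothetical.

```python
import re


def table_lineage(sql):
    """Return (source_table, target_table) edges for one insert-select statement."""
    target = re.search(r"\binsert\s+into\s+(\w+)", sql, re.IGNORECASE).group(1)
    sources = re.findall(r"\b(?:from|join)\s+(\w+)", sql, re.IGNORECASE)
    return [(source, target) for source in sources]


sql_x1 = """
insert into table_E
select z.cust_id, a.bal
from table_A a
join table_Z z
on a.acct_no = z.acct_no
"""
edges = table_lineage(sql_x1)
```

For the statement X_1 this yields two edges, so count(L_x1) = 2. Subqueries, comments and multi-insert statements would need a real SQL parser.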
For X_1, count(L_x1) = 2 and N_L1 = 2. By analogy, B+Z→E is X_2, C+Z→E is X_3, and D+Z→E is X_4, each with count(L_x) = 2 and N_Lp = 2.
E+F→G is X_5:
insert into table_G
select nvl(e.cust_id, f.cust_id) as cust_id,
sum(nvl(e.bal, 0) + nvl(f.bal, 0)) as bal
from table_E e
full outer join table_F f
on e.cust_id = f.cust_id
group by nvl(e.cust_id, f.cust_id);
For X_5, count(L_x5) = 2 and N_L5 = 2.
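Once the edges of the five statements are collected, the set of tables that feed a target can be found by walking the DAG backwards. The sketch below assumes the edge list implied by the text (X_1 to X_4 each write E from one of A, B, C, D plus Z; X_5 writes G from E and F); the function name `upstream_nodes` is hypothetical.

```python
from collections import defaultdict


def upstream_nodes(edges, target):
    """Collect every node from which `target` is reachable in the DAG."""
    parents = defaultdict(set)
    for source, destination in edges:
        parents[destination].add(source)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:  # a node reached on several paths is kept once
                seen.add(parent)
                stack.append(parent)
    return seen


edges = [
    ("A", "E"), ("Z", "E"),  # X_1
    ("B", "E"), ("Z", "E"),  # X_2
    ("C", "E"), ("Z", "E"),  # X_3
    ("D", "E"), ("Z", "E"),  # X_4
    ("E", "G"), ("F", "G"),  # X_5
]
sources_of_g = upstream_nodes(edges, "G")
```

Because the result is a set, node Z is included once even though it appears on several edges.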
The data in table G comes from tables A, B, C, D, Z, E, and F. Node Z appears multiple times in the DAG, and the storage cost of a node that appears multiple times must be deduplicated when calculating the cost, so i ∈ {A, B, C, D, Z, E, F} in distinct{S_i}. Assume each SQL statement has been executed 10 times (that is, several times on the same day), so q = 10. Based on the above, the total cost of table G is C_G; substituting into the formula gives:
Figure PCTCN2020135737-appb-000003
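The substitution for table G can be worked through numerically. The formula image is not reproduced here, so the sketch assumes the edge term is each statement's summed run cost scaled by N_Lp / count(L_x); all monetary figures below are hypothetical placeholders, not values from the application.

```python
def total_cost(upstream_storage, target_storage, statements):
    """Node cost (distinct storage plus target) plus the assumed edge-cost term."""
    node_part = sum(upstream_storage.values()) + target_storage
    edge_part = sum(
        (related / produced) * sum(run_costs)
        for run_costs, related, produced in statements
    )
    return node_part + edge_part


# Hypothetical figures: every table stores 1.0 unit of cost,
# and every run of every statement costs 2.0 units.
upstream_storage = {name: 1.0 for name in "ABCDZEF"}  # Z appears once as a key
target_storage = 1.0                                  # S_G
statements = [([2.0] * 10, 2, 2) for _ in range(5)]   # X_1..X_5, 10 runs, N_Lp = count = 2
c_g = total_cost(upstream_storage, target_storage, statements)
```

With these placeholder values the node part is 8.0 and the edge part is 100.0.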
Further, in the current big data environment, data is processed at the table level, so the table-level data cost can be calculated as described above. For example, if table G in Figure 4 contains 11 data fields, the cost of table G divided by 11 can be taken as the cost of each field. Alternatively, suppose each record in table G stores 20 bytes in total, where 10 fields store 1 byte each and the remaining field stores 10 bytes; the field that stores 10 bytes then accounts for 50% of the storage cost of table G, and each of the other fields accounts for 5%. The record-level cost is calculated in a similar way: for example, if table G contains 100,000 records, the cost of each record is C_G/100000.
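The byte-proportional field split and the per-record average can be sketched as follows. The field names and the table cost used for the record-level division are hypothetical.

```python
def field_cost_shares(field_bytes):
    """Share of the table's storage cost attributed to each field, by bytes stored."""
    total_bytes = sum(field_bytes.values())
    return {name: size / total_bytes for name, size in field_bytes.items()}


# Ten 1-byte fields and one 10-byte field, 20 bytes per record in total.
field_bytes = {f"field_{i}": 1 for i in range(1, 11)}
field_bytes["field_11"] = 10
shares = field_cost_shares(field_bytes)

# Record-level cost: the table cost divided evenly over its records.
c_g = 108.0  # hypothetical table cost
cost_per_record = c_g / 100_000
```

Here `shares["field_11"]` is 0.5 and each 1-byte field has a share of 0.05, matching the 50% and 5% figures in the text.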
In another embodiment, an example is given for a multi-in, multi-out SQL statement. The multi-in, multi-out graph is shown in Figure 6, and the related processing SQL is as follows:
from table_A a
join table_B b
on a.id = b.id
insert into table_C
select a.id, a.bal + b.bal
where a.type = 1 and b.type = 2
insert into table_D
select b.id, a.bal + b.bal
where a.type = 3 and b.type = 4;
This SQL generates the four edges shown in Figure 6. Suppose the resource cost of a single execution of this SQL is X_p; then count(L_x) = 4. When the processing cost of node D is calculated, only two edges are related to node D, namely A→D and B→D, so N_Lp = 2. Assuming this SQL has likewise been executed 10 times (q = 10), the cost of node D after 10 executions is obtained by substituting into the formula as follows:
Figure PCTCN2020135737-appb-000004
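The ratio N_Lp / count(L_x) for this statement can be derived directly from its edge list. The sketch below assumes that "edges related to the target" means the edges ending at the target table for a statement that writes it directly, and that the apportioned cost is the ratio times the summed run cost; the value of X_p is a hypothetical placeholder.

```python
def statement_share(statement_edges, target):
    """N_Lp / count(L_x): fraction of a statement's edges that end at the target."""
    related = [edge for edge in statement_edges if edge[1] == target]
    return len(related) / len(statement_edges)


# The four edges produced by the multi-insert statement (Figure 6).
edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D")]
x_p = 4.0   # hypothetical cost of one execution
runs = 10   # q = 10

share_d = statement_share(edges, "D")
cost_d = share_d * x_p * runs
```

Node D is thus charged half of the statement's total execution cost, with the other half falling to node C.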
Based on the above, steps 1 to 3 describe the data cost calculation method based on data lineage. The method can be applied to cost calculation for table-level and field-level data; the record-level cost is obtained from the table-level or field-level cost by averaging over the number of records. Specifically, the data processing (SQL) corresponds to the edges in the graph. Because processing is done in batches, each edge corresponds to multiple records in a table, and for data in the same table processed in multiple batches the cost can be calculated as an average.
Further, in one embodiment, the same SQL may consume different amounts of resources on each run because the data volume changes. For example, for A→E in Figure 4, suppose the first run costs 10 yuan in resources and produces 10,000 records, and the second run costs 12 yuan and produces 14,000 records. The average processing cost of these 24,000 records is then (10+12)/24000, approximately 0.00092 yuan per record.
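The averaging over batches amounts to dividing the summed resource cost by the summed record count, as this short sketch shows. The function name `average_record_cost` is hypothetical; the figures are those of the A→E example.

```python
def average_record_cost(batches):
    """Total resource cost of all batches divided by total records produced."""
    total_cost = sum(cost for cost, _ in batches)
    total_records = sum(records for _, records in batches)
    return total_cost / total_records


# (cost in yuan, records produced) for the two runs of A -> E.
batches = [(10, 10_000), (12, 14_000)]
avg = average_record_cost(batches)
```

The result is 22/24000 yuan per record, which is below one thousandth of a yuan.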
In an optional implementation, the calculation result of the data cost calculation method based on data lineage may also be uploaded to a blockchain.
Specifically, corresponding digest information is obtained from the calculation result of the data cost calculation method based on data lineage. The digest information is obtained by hashing the calculation result, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user can download the digest information from the blockchain to verify whether the calculation result has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
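Producing and checking such a digest can be sketched with Python's standard `hashlib`. The canonical JSON serialization and the shape of the result dictionary are assumptions for illustration; the application only states that the result is hashed, and the upload to a blockchain is not shown.

```python
import hashlib
import json


def result_digest(result):
    """SHA-256 digest of a cost result, serialized deterministically as JSON."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical calculation result for one table.
result = {"table": "table_G", "total_cost": 108.0}
digest = result_digest(result)

# Verification: recompute the digest and compare with the stored one.
untampered = result_digest({"total_cost": 108.0, "table": "table_G"}) == digest
tampered = result_digest({"table": "table_G", "total_cost": 1.0}) == digest
```

Sorting the keys makes the digest independent of dictionary ordering, so only a change in the values alters it.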
This application provides a data cost calculation method based on data lineage: a data set is defined and a directed acyclic graph is generated from the data lineage; the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph are calculated; and the costs of the edges and nodes are obtained and summed to give the total cost of the target data. By incorporating data lineage, the cost of data can be calculated and presented at a finer granularity, and the pricing of data applications can be made more reasonable. Furthermore, this gives evaluations of data value inside and outside the enterprise a more detailed and reasonable reference, and allows the cost of data to be calculated at the finest granularity so that the cost of each piece of data can be quantified. This application also involves blockchain technology.
As shown in Figure 7, this application also provides a data cost calculation system based on data lineage. The data cost calculation system can be integrated into the above-mentioned computer device 110, and may specifically include a data set module 20, an information module 30, a first calculation module 40, and a second calculation module 50.
The data set module 20 is configured to obtain the SQL statements used in data processing or the scripts used in data processing, and to generate data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph;
The information module 30 is configured to obtain statistical information and frequency information on task execution on the data platform, and to map them onto the directed acyclic graph;
The first calculation module 40 is configured to calculate the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph;
The second calculation module 50 is configured to obtain the costs of the edges and nodes, and to sum them to obtain the total cost of the target data.
In one embodiment, the statistical information includes the resource usage of each task run, where the resource usage includes information such as storage usage, CPU usage, and memory usage; the frequency information includes information such as the historical number of executions of the task and the start and end times of each execution.
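One way such statistics could be turned into the per-execution cost X_pq is to multiply each run's CPU and memory usage by per-platform unit prices, in line with the unit-price parameter mentioned in claim 3. The record types, the units (core-seconds and GB-seconds) and the prices below are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    cpu_core_seconds: float   # CPU usage of one execution (assumed unit)
    memory_gb_seconds: float  # memory usage of one execution (assumed unit)


@dataclass
class UnitPrices:
    per_cpu_core_second: float
    per_memory_gb_second: float


def run_cost(run, prices):
    """Cost of one execution: CPU usage and memory usage times their unit prices."""
    return (run.cpu_core_seconds * prices.per_cpu_core_second
            + run.memory_gb_seconds * prices.per_memory_gb_second)


prices = UnitPrices(per_cpu_core_second=0.5, per_memory_gb_second=0.25)
history = [TaskRun(4.0, 8.0), TaskRun(6.0, 8.0)]  # one entry per historical execution
costs = [run_cost(run, prices) for run in history]
```

The length of `history` plays the role of the historical execution count, and each entry of `costs` is one X_pq value for the statement.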
In one embodiment, the first calculation module 40 is configured to calculate the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph.
In one embodiment, the cost of a node in the directed acyclic graph is calculated as follows. Specifically, the cost of a node is its storage cost, and based on the description above the node cost is calculated as: ∑_i distinct{S_i} + S_k, where S_i denotes the storage resource cost of a related node and S_k denotes the storage cost of the target data.
The cost of an edge in the directed acyclic graph is calculated as follows. Specifically, the cost of an edge is the cost of CPU and memory (MEM), and based on the description above the edge cost is calculated as:
Figure PCTCN2020135737-appb-000005
where N_Lp denotes the number of edges related to the target data, X_pq denotes the resource cost of each execution of each processing instruction, and count(L_x) denotes the number of edges in the directed acyclic graph corresponding to each processing instruction.
Further, in one embodiment, the second calculation module 50 is configured to obtain the costs of the edges and nodes, and to sum them to obtain the total cost of the target data.
When the SQL statement is multi-in, single-out (insert…from…), N_Lp equals count(L_x); when the SQL statement is multi-in, multi-out (from…insert…insert…), N_Lp is less than count(L_x).
Accordingly, the total cost of the target data, that is, the total data cost C_k of node (table) K, can be summarized by the following formula:
Figure PCTCN2020135737-appb-000006
In one embodiment, the data cost calculation system further includes a display module (not shown) for displaying calculation results. The display module may be the monitor of a desktop computer or the display device of another computer device.
Please refer to FIG. 8, which is a schematic structural diagram of a device according to an embodiment of this application. As shown in FIG. 8, the device 200 includes a processor 201 and a memory 202 coupled to the processor 201.
The memory 202 stores program instructions for implementing the data cost calculation method based on data lineage described in any of the above embodiments.
The processor 201 is configured to execute the program instructions stored in the memory 202.
The processor 201 may also be referred to as a CPU (Central Processing Unit). The processor 201 may be an integrated circuit chip with signal processing capability. The processor 201 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Refer to FIG. 9, which is a schematic structural diagram of a storage medium according to an embodiment of this application. The storage medium of this embodiment stores a program file 301 capable of implementing all of the above methods. The program file 301 may be stored in the storage medium in the form of a software product; the computer-readable storage medium may be non-volatile or volatile, and the program file includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc, or terminal devices such as computers, servers, mobile phones, and tablets.
It should be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, device, article, or method. Absent further limitation, an element qualified by the phrase "including a..." does not exclude the presence of additional identical elements in the process, device, article, or method that includes it.
The serial numbers of the above embodiments of this application are for description only and do not indicate the relative merit of the embodiments. From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.

Claims (20)

  1. A data cost calculation method based on data lineage, wherein the data cost calculation method comprises:
    obtaining the SQL statements used in data processing or the scripts used in data processing, and generating data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph;
    obtaining statistical information and frequency information on task execution on the data platform, and mapping them onto the directed acyclic graph;
    calculating the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph;
    obtaining the costs of the edges and nodes, and summing them to obtain the total cost of the target data.
  2. The data cost calculation method according to claim 1, wherein the statistical information includes the resource usage of each task run, the resource usage including storage usage, CPU usage, and memory usage; and the frequency information includes the historical number of executions of the task and the start and end times of execution.
  3. The data cost calculation method according to claim 2, wherein a unit price parameter for the resource usage of the data platform is introduced according to the data platform; and in the data cost calculation, the cost of a node is the storage cost and the cost of an edge is the cost of CPU and memory.
  4. The data cost calculation method according to claim 1, wherein calculating the cost of the nodes related to the target data in the directed acyclic graph comprises: ∑_i distinct{S_i} + S_k, where S_i denotes the storage resource cost of a related node and S_k denotes the storage cost of the target data;
    calculating the cost of the edges related to the target data in the directed acyclic graph comprises:
    Figure PCTCN2020135737-appb-100001
    where N_Lp denotes the number of edges related to the target data, X_pq denotes the resource cost of each execution of each processing instruction, and count(L_x) denotes the number of edges in the directed acyclic graph corresponding to each processing instruction.
  5. The data cost calculation method according to claim 4, wherein obtaining the costs of the edges and nodes and summing them to obtain the total cost of the target data comprises:
    Figure PCTCN2020135737-appb-100002
    where C_k denotes the total cost of the target data.
  6. The data cost calculation method according to claim 1, wherein generating data lineage from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph, comprises:
    extracting regularized SQL statements from script files containing SQL code, thereby completing the cleaning of the SQL statements;
    performing lexical analysis on the regularized SQL statements to generate data lineage, and generating a directed acyclic graph from the data lineage.
  7. The data cost calculation method according to claim 1, wherein, after the total cost of the target data is obtained, the total cost of the target data is uploaded to a blockchain so that the blockchain stores the total cost of the target data in encrypted form.
  8. A data cost calculation system based on data lineage, wherein the data cost calculation system comprises:
    a data set module, configured to obtain the SQL statements used in data processing or the scripts used in data processing, and to generate data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph;
    an information module, configured to obtain statistical information and frequency information on task execution on the data platform, and to map them onto the directed acyclic graph;
    a first calculation module, configured to calculate the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph;
    a second calculation module, configured to obtain the costs of the edges and nodes, and to sum them to obtain the total cost of the target data.
  9. A computer device comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: obtaining the SQL statements used in data processing or the scripts used in data processing, and generating data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph; obtaining statistical information and frequency information on task execution on the data platform, and mapping them onto the directed acyclic graph; calculating the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; obtaining the costs of the edges and nodes, and summing them to obtain the total cost of the target data.
  10. The computer device according to claim 9, wherein the statistical information includes the resource usage of each task run, the resource usage including storage usage, CPU usage, and memory usage; and the frequency information includes the historical number of executions of the task and the start and end times of execution.
  11. The computer device according to claim 10, wherein a unit price parameter for the resource usage of the data platform is introduced according to the data platform; and in the data cost calculation, the cost of a node is the storage cost and the cost of an edge is the cost of CPU and memory.
  12. The computer device according to claim 9, wherein calculating the cost of the nodes related to the target data in the directed acyclic graph comprises: ∑_i distinct{S_i} + S_k, where S_i denotes the storage resource cost of a related node and S_k denotes the storage cost of the target data;
    calculating the cost of the edges related to the target data in the directed acyclic graph comprises:
    Figure PCTCN2020135737-appb-100003
    where N_Lp denotes the number of edges related to the target data, X_pq denotes the resource cost of each execution of each processing instruction, and count(L_x) denotes the number of edges in the directed acyclic graph corresponding to each processing instruction.
  13. The computer device according to claim 12, wherein obtaining the costs of the edges and nodes and summing them to obtain the total cost of the target data comprises:
    Figure PCTCN2020135737-appb-100004
    where C_k denotes the total cost of the target data.
  14. The computer device according to claim 9, wherein generating data lineage from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph, comprises:
    extracting regularized SQL statements from script files containing SQL code, thereby completing the cleaning of the SQL statements;
    performing lexical analysis on the regularized SQL statements to generate data lineage, and generating a directed acyclic graph from the data lineage.
  15. A storage medium storing a program file capable of implementing the following steps: obtaining the SQL statements used in data processing or the scripts used in data processing, and generating data lineage from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph; obtaining statistical information and frequency information on task execution on the data platform, and mapping them onto the directed acyclic graph; calculating the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; obtaining the costs of the edges and nodes, and summing them to obtain the total cost of the target data.
  16. The storage medium according to claim 15, wherein the statistical information includes the resource usage of each task run, the resource usage including storage usage, CPU usage, and memory usage; and the frequency information includes the historical number of executions of the task and the start and end times of execution.
  17. The storage medium according to claim 16, wherein a unit price parameter for the resource usage of the data platform is introduced according to the data platform; and in the data cost calculation, the cost of a node is the storage cost and the cost of an edge is the cost of CPU and memory.
  18. The storage medium according to claim 15, wherein calculating the cost of the nodes related to the target data in the directed acyclic graph comprises: ∑_i distinct{S_i} + S_k, where S_i denotes the storage resource cost of a related node and S_k denotes the storage cost of the target data;
    calculating the cost of the edges related to the target data in the directed acyclic graph comprises:
    Figure PCTCN2020135737-appb-100005
    where N_Lp denotes the number of edges related to the target data, X_pq denotes the resource cost of each execution of each processing instruction, and count(L_x) denotes the number of edges in the directed acyclic graph corresponding to each processing instruction.
  19. The storage medium according to claim 18, wherein obtaining the costs of the edges and nodes and summing them to obtain the total cost of the target data comprises:
    Figure PCTCN2020135737-appb-100006
    where C_k denotes the total cost of the target data.
  20. The storage medium according to claim 15, wherein generating the data blood relationship from the SQL statements contained in the processing scripts, the data blood relationship forming a directed acyclic graph, comprises:
    extracting regularized SQL statements from the script files that contain SQL code, thereby completing the cleaning of the SQL statements;
    performing lexical analysis on the regularized SQL statements to generate the data blood relationship, and generating the directed acyclic graph from the data blood relationship.
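
The steps recited in claims 15 to 20 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the regular expressions stand in for the lexical analysis of claim 20 and only handle simple `INSERT INTO ... SELECT ... FROM/JOIN` statements, and the names `extract_edges`, `build_dag`, `total_cost`, `storage_cost`, and `task_cost` are hypothetical. Because the edge-cost formula is given only as a figure, the sketch assumes that each task's cost (per-run CPU and memory cost multiplied by its run count) is counted once, which is equivalent to sharing that cost evenly among the edges the task creates.

```python
import re
from collections import defaultdict

# Assumed, simplified patterns; a real lexer would handle subqueries, CTEs, etc.
_TARGET = re.compile(r"insert\s+(?:into|overwrite\s+table)\s+([\w.]+)", re.IGNORECASE)
_SOURCE = re.compile(r"(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(sql):
    """Return (source_table, target_table) lineage pairs for one statement."""
    target = _TARGET.search(sql)
    if target is None:
        return []
    sources = sorted(set(_SOURCE.findall(sql)) - {target.group(1)})
    return [(src, target.group(1)) for src in sources]


def build_dag(tasks):
    """Map each target table to its incoming edges as (source, task_id) pairs."""
    parents = defaultdict(list)
    for task_id, sql in tasks.items():
        for src, dst in extract_edges(sql):
            parents[dst].append((src, task_id))
    return parents


def total_cost(target, parents, storage_cost, task_cost):
    """Distinct upstream storage costs + target storage cost + upstream task costs."""
    seen_nodes, seen_tasks, stack = set(), set(), [target]
    while stack:
        node = stack.pop()
        for src, task_id in parents.get(node, []):
            seen_tasks.add(task_id)  # each task is counted once (assumption)
            if src not in seen_nodes and src != target:
                seen_nodes.add(src)  # "distinct": shared ancestors counted once
                stack.append(src)
    node_cost = sum(storage_cost[n] for n in seen_nodes) + storage_cost[target]
    edge_cost = sum(task_cost[t] for t in seen_tasks)
    return node_cost + edge_cost


if __name__ == "__main__":
    tasks = {
        "t1": "INSERT INTO dw.orders_clean SELECT * FROM ods.orders",
        "t2": "INSERT INTO dm.report SELECT * FROM dw.orders_clean o "
              "JOIN ods.users u ON o.uid = u.id",
    }
    storage = {"ods.orders": 5.0, "ods.users": 2.0,
               "dw.orders_clean": 3.0, "dm.report": 1.0}
    compute = {"t1": 4.0, "t2": 6.0}  # per-run cost x run count, already priced
    print(total_cost("dm.report", build_dag(tasks), storage, compute))
```

In this toy example the table names and cost figures are invented. The traversal walks upstream from the target table, so a table that feeds the target through several paths contributes its storage cost only once, matching the `distinct` term in claim 18.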
PCT/CN2020/135737 2020-10-21 2020-12-11 Data cost calculation method, system, computer device, and storage medium WO2021174945A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011132525.2 2020-10-21
CN202011132525.2A CN112256720B (en) 2020-10-21 2020-10-21 Data cost calculation method, system, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2021174945A1 (en) 2021-09-10

Family

ID=74264461

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135737 WO2021174945A1 (en) 2020-10-21 2020-12-11 Data cost calculation method, system, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112256720B (en)
WO (1) WO2021174945A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868253A (en) * 2021-09-28 2021-12-31 中通服创立信息科技有限责任公司 Data relationship capturing and big data relationship tree construction method
CN113934750A (en) * 2021-10-26 2022-01-14 上海泽字信息科技有限公司 Data blood relationship analysis method based on compiling mode
CN114090018A (en) * 2022-01-25 2022-02-25 树根互联股份有限公司 Index calculation method and device of industrial internet equipment and electronic equipment
CN114254081A (en) * 2021-12-22 2022-03-29 中冶赛迪重庆信息技术有限公司 Enterprise big data search system and method and electronic equipment
CN114428822A (en) * 2022-01-27 2022-05-03 云启智慧科技有限公司 Data processing method and device, electronic equipment and storage medium
CN117076095A (en) * 2023-10-16 2023-11-17 华芯巨数(杭州)微电子有限公司 Task scheduling method, system, electronic equipment and storage medium based on DAG

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114064640A (en) * 2021-11-09 2022-02-18 珠海市新德汇信息技术有限公司 Blood relationship construction method, storage medium and equipment applied to data tracing
CN115511644A (en) * 2022-08-29 2022-12-23 易保网络技术(上海)有限公司 Processing method for target policy, electronic device and readable storage medium
CN118134530B (en) * 2024-05-07 2024-10-01 杭州逸琨科技有限公司 Resource consumption evaluation method for data element, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100153431A1 (en) * 2008-12-11 2010-06-17 Louis Burger Alert triggered statistics collections
CN106991101A (en) * 2016-01-21 2017-07-28 阿里巴巴集团控股有限公司 A kind of method and apparatus of spreadsheet analysis processing
CN109325078A (en) * 2018-09-18 2019-02-12 拉扎斯网络科技(上海)有限公司 Data blood margin determination method and device based on structural data
CN111125269A (en) * 2019-12-31 2020-05-08 腾讯科技(深圳)有限公司 Data management method, blood relationship display method and related device
CN111652652A (en) * 2020-06-09 2020-09-11 苏宁云计算有限公司 Cost calculation method and device for calculation platform, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2789196B1 (en) * 1999-01-28 2001-03-30 Univ Paris Curie METHOD FOR GENERATING MULTIMEDIA DOCUMENT DESCRIPTIONS
CN107644073A (en) * 2017-09-18 2018-01-30 广东中标数据科技股份有限公司 A kind of field consanguinity analysis method, system and device based on depth-first traversal
CN108446383B (en) * 2018-03-21 2021-12-10 吉林大学 Data task redistribution method based on geographic distributed data query
CN111694858A (en) * 2020-04-28 2020-09-22 平安科技(深圳)有限公司 Data blood margin analysis method, device, equipment and computer readable storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868253A (en) * 2021-09-28 2021-12-31 中通服创立信息科技有限责任公司 Data relationship capturing and big data relationship tree construction method
CN113868253B (en) * 2021-09-28 2024-04-23 中通服创立信息科技有限责任公司 Data relationship capturing and big data relationship tree construction method
CN113934750A (en) * 2021-10-26 2022-01-14 上海泽字信息科技有限公司 Data blood relationship analysis method based on compiling mode
CN114254081A (en) * 2021-12-22 2022-03-29 中冶赛迪重庆信息技术有限公司 Enterprise big data search system and method and electronic equipment
CN114254081B (en) * 2021-12-22 2024-06-04 中冶赛迪信息技术(重庆)有限公司 Enterprise big data search system, method and electronic equipment
CN114090018A (en) * 2022-01-25 2022-02-25 树根互联股份有限公司 Index calculation method and device of industrial internet equipment and electronic equipment
CN114090018B (en) * 2022-01-25 2022-05-24 树根互联股份有限公司 Index calculation method and device of industrial internet equipment and electronic equipment
CN114428822A (en) * 2022-01-27 2022-05-03 云启智慧科技有限公司 Data processing method and device, electronic equipment and storage medium
CN114428822B (en) * 2022-01-27 2022-07-29 云启智慧科技有限公司 Data processing method and device, electronic equipment and storage medium
CN117076095A (en) * 2023-10-16 2023-11-17 华芯巨数(杭州)微电子有限公司 Task scheduling method, system, electronic equipment and storage medium based on DAG
CN117076095B (en) * 2023-10-16 2024-02-09 华芯巨数(杭州)微电子有限公司 Task scheduling method, system, electronic equipment and storage medium based on DAG

Also Published As

Publication number Publication date
CN112256720B (en) 2021-08-17
CN112256720A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
WO2021174945A1 (en) Data cost calculation method, system, computer device, and storage medium
CN110008257B (en) Data processing method, device, system, computer equipment and storage medium
US10803187B2 (en) Computerized methods and systems for implementing access control to time series data
WO2020119051A1 (en) Cloud platform resource usage prediction method and terminal device
US10255108B2 (en) Parallel execution of blockchain transactions
Agmon Ben-Yehuda et al. Deconstructing Amazon EC2 spot instance pricing
Zheng et al. Service-generated big data and big data-as-a-service: an overview
Hang et al. Optimal blockchain network construction methodology based on analysis of configurable components for enhancing hyperledger fabric performance
Ruiz-Alvarez et al. An automated approach to cloud storage service selection
US7031901B2 (en) System and method for improving predictive modeling of an information system
US7035786B1 (en) System and method for multi-phase system development with predictive modeling
Zhao et al. Cloud data management
US11687535B2 (en) Automatic computation of features from a data stream
CN114741402A (en) Method and device for processing service feature pool, computer equipment and storage medium
TW202013277A (en) Subscription risk quantization method, withholding risk quantization method, apparatuses and devices
US10691653B1 (en) Intelligent data backfill and migration operations utilizing event processing architecture
US9652766B1 (en) Managing data stored in memory locations having size limitations
Gonzalez-Aparicio et al. Evaluation of ACE properties of traditional SQL and NoSQL big data systems
CN115936789A (en) Resource numerical value change data generation method and device considering nonlinear time constant
CN106874327B (en) Counting method and device for business data
CN113901046A (en) Virtual dimension table construction method and device
Kanagasabai et al. Ec2bargainhunter: It's easy to hunt for cost savings on amazon ec2!
Barreto Real time data intake and data warehouse integration
CN110442587B (en) Service information upgrading method and terminal equipment
US20220327634A1 (en) Generating relevant attribute data for benchmark comparison

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20923479; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20923479; Country of ref document: EP; Kind code of ref document: A1)