CN116975051A - Blood relationship data determination method, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN116975051A
Authority
CN
China
Prior art keywords
data
task
entity
data table
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310459188.5A
Other languages
Chinese (zh)
Inventor
贾骐玮
罗亮
李玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310459188.5A
Publication of CN116975051A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24558 Binary matching operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a blood relationship data determination method, apparatus, computer device and storage medium. The method comprises: acquiring a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task; determining, based on the data table entity map and the task entity map, a data table and a data development task that share the same neighbor information; determining the matching similarity between the data table and the data development task that share the same neighbor information; performing entity fusion on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map; and determining updated blood relationship data based on the entity fusion map. With this method, different entities can be fused by means of a knowledge graph, and updated blood relationship data can be determined from the entity fusion map, which reduces identification errors and invalid data during data processing and improves the coverage and accuracy of the determined blood relationship data.

Description

Blood relationship data determination method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for determining blood relationship data.
Background
With the development of artificial intelligence technology and the growing volume of data involved in data development, approaches have emerged that use the blood relationship of data tables to identify upstream and downstream job dependencies among data development tasks, thereby improving data development efficiency. The blood relationship of a data table, commonly called data lineage, refers to the upstream-downstream relationships and link relationships formed between data over its life cycle, from generation through processing and circulation to retirement.
In the conventional technology, the plaintext SQL logic of a data development task is parsed to obtain the read tables and write tables of the task, and the read tables and write tables are then chained upstream and downstream to obtain the blood relationship data of the data tables.
However, in the conventional parsing approach, only a few task types expose plaintext SQL logic and the data sources are difficult to obtain, so the full range of data development tasks cannot be covered. Because the acquired data is one-sided, the accuracy of the resulting blood relationship data is low. The blood relationship data of data tables obtained by the conventional parsing approach therefore still suffers from low accuracy.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a blood relationship data determination method, apparatus, computer device, computer-readable storage medium and computer program product that can improve the accuracy of the blood relationship data determined between data tables.
In a first aspect, the present application provides a blood relationship data determination method. The method comprises:
acquiring a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task;
determining, based on the data table entity map and the task entity map, a data table and a data development task that share the same neighbor information;
determining the matching similarity between the data table and the data development task that share the same neighbor information;
and performing entity fusion on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map, and determining updated blood relationship data based on the entity fusion map.
In a second aspect, the present application further provides a blood relationship data determination apparatus. The apparatus comprises:
a map acquisition module, configured to acquire a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task;
a first determining module, configured to determine, based on the data table entity map and the task entity map, a data table and a data development task that share the same neighbor information;
a second determining module, configured to determine the matching similarity between the data table and the data development task that share the same neighbor information;
and an entity fusion module, configured to perform entity fusion on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map, and to determine updated blood relationship data based on the entity fusion map.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task;
determining, based on the data table entity map and the task entity map, a data table and a data development task that share the same neighbor information;
determining the matching similarity between the data table and the data development task that share the same neighbor information;
and performing entity fusion on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map, and determining updated blood relationship data based on the entity fusion map.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, performs the steps of:
acquiring a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task;
determining, based on the data table entity map and the task entity map, a data table and a data development task that share the same neighbor information;
determining the matching similarity between the data table and the data development task that share the same neighbor information;
and performing entity fusion on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map, and determining updated blood relationship data based on the entity fusion map.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task;
determining, based on the data table entity map and the task entity map, a data table and a data development task that share the same neighbor information;
determining the matching similarity between the data table and the data development task that share the same neighbor information;
and performing entity fusion on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map, and determining updated blood relationship data based on the entity fusion map.
In the above blood relationship data determination method, apparatus, computer device, storage medium and computer program product, a data table and a data development task that share the same neighbor information are determined based on the task entity map corresponding to the data development task to be executed and the data table entity map associated with the data development task. This partitions the data tables and data development tasks into blocks before similarity matching, so that fewer matching operations are needed afterwards and data processing efficiency improves. Further, the matching similarity between the data table and the data development task that share the same neighbor information is determined, and entity fusion is performed on the data table entity map and the task entity map whose matching similarity meets the preset fusion condition to obtain an entity fusion map. Updated blood relationship data is then determined based on the entity fusion map, which reduces identification errors and invalid data during data processing and improves the coverage and accuracy of the determined blood relationship data.
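The blocking idea described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the text at this point does not define what neighbor information contains, so the sketch assumes it is a set of adjacent entity identifiers, and it assumes the preset fusion condition is a similarity threshold. The names `block_by_neighbors`, `select_for_fusion`, the sample tables and tasks, and the caller-supplied `similarity` function are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def block_by_neighbors(table_neighbors, task_neighbors):
    """Return (table, task) candidate pairs that share at least one neighbor.

    Assumption: neighbor information is modeled as a set of neighbor ids.
    """
    tables_by_neighbor = defaultdict(set)
    for table, neighbors in table_neighbors.items():
        for neighbor in neighbors:
            tables_by_neighbor[neighbor].add(table)

    tasks_by_neighbor = defaultdict(set)
    for task, neighbors in task_neighbors.items():
        for neighbor in neighbors:
            tasks_by_neighbor[neighbor].add(task)

    candidates = set()
    # Only neighbors seen on both sides can produce a candidate pair.
    for neighbor in tables_by_neighbor.keys() & tasks_by_neighbor.keys():
        candidates.update(
            product(tables_by_neighbor[neighbor], tasks_by_neighbor[neighbor])
        )
    return candidates


def select_for_fusion(candidates, similarity, threshold):
    """Keep the pairs whose similarity meets the (assumed) threshold condition."""
    return {pair for pair in candidates if similarity(*pair) >= threshold}


# Hypothetical sample data: two tables and two tasks with their neighbors.
table_neighbors = {"t_orders": {"job1"}, "t_users": {"job3"}}
task_neighbors = {"job2": {"job1"}, "job4": {"job5"}}
candidates = block_by_neighbors(table_neighbors, task_neighbors)
```

With this sample data, only one of the four possible table-task pairs shares a neighbor, so the similarity function would be called once instead of four times.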
Drawings
FIG. 1 is a diagram of an application environment for a method of determining blood relationship data in one embodiment;
FIG. 2 is a flow chart of a method of determining blood relationship data in one embodiment;
FIG. 3 is a schematic diagram of a process for obtaining an entity fusion map in one embodiment;
FIG. 4 is a flow chart of a method for determining blood relationship data according to another embodiment;
FIG. 5 is a flow diagram of determining matching similarity between a data table and a data development task for the same neighbor information in one embodiment;
FIG. 6 is a flow chart of determining weight data corresponding to each attribute similarity one-to-one in one embodiment;
FIG. 7 is a flow chart of a method of determining blood relationship data in yet another embodiment;
FIG. 8 is a flow chart of a method of determining blood relationship data in yet another embodiment;
FIG. 9 is a schematic overall flow diagram of a method of determining blood relationship data in one embodiment;
FIG. 10 is a block diagram of a blood relationship data determination device in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The blood relationship data determination method provided by the embodiments of the present application can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, network media and assisted driving. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence is a comprehensive discipline that covers a wide range of fields and includes both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart healthcare and smart customer service. It is believed that, as the technology develops, artificial intelligence will be applied in more fields and will become increasingly valuable.
Cloud technology refers to a hosting technology that unifies a series of resources, such as hardware, software and networks, in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Specifically, cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are applied on the basis of the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. Background services of technical network systems, such as video websites, picture websites and other portal websites, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, each item may come to have its own identification mark, which needs to be transmitted to a backend system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong backend system support, which can only be achieved through cloud computing. Cloud computing technology has therefore become an important support for technical network systems and practical business applications. Big data in cloud technology refers to a data set that cannot be captured, managed and processed by conventional software tools within a certain time range; it is a massive, fast-growing and diversified information asset that requires new processing modes in order to deliver stronger decision-making, insight discovery and process optimization capabilities. With the advent of the cloud era, big data has attracted more and more attention, and special techniques are required to process large amounts of data effectively within a tolerable elapsed time.
Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the internet, and scalable storage systems. Driven by the development of the internet, real-time data streams and the diversification of connected devices, and by the demands of search services, social networks, mobile commerce, open collaboration and the like, cloud computing has developed rapidly. Unlike earlier parallel distributed computing, the emergence of cloud computing will conceptually promote a revolutionary change in the whole internet model and in enterprise management models.
The blood relationship data determination method provided by the embodiments of the present application relates in particular to big data technology within cloud technology and to artificial intelligence technology, and can be applied in the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on a cloud or another network server. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, internet of things device, portable wearable device, aircraft, and so on. The internet of things device may be a smart speaker, a smart in-vehicle device, or the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be an independent physical server, a server cluster formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms. The terminal 102 and the server 104 may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the present application.
Further, the terminal 102 and the server 104 may each perform the blood relationship data determination method provided by the embodiments of the present application on their own, or may perform it cooperatively. Taking cooperative execution as an example, the server 104 acquires a task entity map corresponding to the data development task to be executed and a data table entity map associated with the data development task, and determines, based on the data table entity map and the task entity map, a data table and a data development task that share the same neighbor information. The task entity map and the data table entity map may be stored in the cloud storage of the server 104, in the data storage system, or in the local storage of the terminal 102, and are obtained from the server 104, the data storage system, or the terminal 102 when blood relationship data needs to be determined. Further, the server 104 determines the matching similarity between the data table and the data development task that share the same neighbor information, performs entity fusion on the data table entity map and the task entity map whose matching similarity meets the preset fusion condition to obtain an entity fusion map, and determines updated blood relationship data based on the entity fusion map.
Once the updated blood relationship data is obtained, it can be used to help find missing dependency associations between the upstream and downstream of data development tasks and to identify erroneous dependency relationships. During data governance it can also be used to find nodes or data that have no downstream so that they can be governed, which saves cost and further improves the processing efficiency of data development tasks.
In one embodiment, as shown in fig. 2, a blood relationship data determination method is provided. The method is described as applied to the server in fig. 1 by way of example, and includes the following steps:
Step S202: acquire a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task.
When developing different application programs or data development projects, massive amounts of data must be processed and analyzed according to varying practical requirements. To reduce cumbersome processing steps and resource consumption during data development, the data involved needs to be preprocessed in advance. Preprocessing, followed by matching and fusion of the data, yields the association or dependency relationships that exist between different data. In scenarios such as invalid-data governance and duplicate-development detection, these relationships help identify and govern data or nodes that have no downstream, which improves data development efficiency and reduces data operation and maintenance costs.
Specifically, for the different data development tasks in an application program or data development project, the task entity map corresponding to each data development task and the data table entity map associated with it need to be acquired. The maps are then fused based on the task entities in the task entity map and the data table entities in the data table entity map to obtain a fused entity map, from which newly added blood relationship data among data development tasks, among data tables, and between data development tasks and data tables is obtained. The newly added blood relationship data helps uncover dependencies that were missed between the upstream and downstream of data development tasks, which addresses the data island problem, and allows invalid nodes or data with no downstream to be governed, which reduces data operation and maintenance costs during data development.
In one embodiment, generating the task entity map corresponding to the data development task to be executed comprises:
acquiring the scheduling dependency relationships among the data development tasks, and generating a task dependency relationship data table corresponding to the data development tasks according to the scheduling dependency relationships; determining a task attribute data table corresponding to the data development tasks according to the various metadata information in the data development tasks; and generating a task entity map with tasks as entities according to the task dependency relationship data table and the task attribute data table.
Specifically, task scheduling execution data of the data development tasks on different data platforms is acquired, the scheduling dependency relationships among the data development tasks are derived from it, and a task dependency relationship data table corresponding to the data development tasks is generated according to those scheduling dependency relationships. The task scheduling execution data can be understood as the execution order of the data development tasks, whether they run on the same platform or across platforms. For example, if the detailed execution record shows that job1 is executed first and job2 afterwards, the scheduling dependency relationship between job1 and job2 can be determined. Collecting the detailed execution records of the different data development tasks in this way yields the task dependency relationship data table.
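As a rough illustration of turning execution order into a task dependency relationship data table, the following Python sketch assumes that each execution record is an ordered list of task names and that consecutive tasks in a record depend on each other. Both are simplifying assumptions made for illustration, and the function name `dependency_rows` is hypothetical.

```python
def dependency_rows(execution_orders):
    """Build (upstream, downstream) rows from ordered execution records.

    Assumption: a task that runs immediately after another depends on it.
    """
    rows = set()
    for order in execution_orders:
        # Pair each task with the task executed right after it.
        for upstream, downstream in zip(order, order[1:]):
            rows.add((upstream, downstream))
    return sorted(rows)


# Hypothetical execution records: job1 runs before job2, job2 before job3.
task_dependency_table = dependency_rows(
    [["job1", "job2"], ["job1", "job2", "job3"]]
)
```

Duplicate observations of the same order (job1 then job2 appears in both records) collapse into a single dependency row.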
After the task dependency relationship data table is obtained, the task attribute data table corresponding to the data development tasks is determined according to the various metadata information in the data development tasks. Metadata can be understood as information that describes the attributes of data and supports functions such as indicating storage locations, tracking historical data, searching for resources, and keeping file records. The task attribute data corresponding to a data development task can therefore be determined from the data attribute information of the metadata involved in the task, which gives the task attribute data table.
Specifically, the data attribute information of the metadata involved in a data development task may include the task name, the responsible person, the task creation time, the execution time of the task's associated instances, and the like. An associated instance of a data development task can be understood as one complete execution of that task.
Further, after the task dependency relationship data table and the task attribute data table are obtained, the two tables are merged to generate a task entity map with tasks as entities.
Specifically, an association key shared by the task dependency relationship data table and the task attribute data table is determined, and the two tables are merged on that key. The merged result is graph relationship data with tasks as entities, which gives the task entity map. The association key may be an attribute that exists in both tables, such as the data development task itself (for example job1, job2, ..., job N).
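A minimal Python sketch of this merge step follows, assuming the association key is the task name, the task dependency relationship data table is a list of (upstream, downstream) pairs, and the task attribute data table is a list of dictionaries. The function name `build_task_entity_map`, the field names, the `schedules` relation label and the sample values are hypothetical and chosen only for illustration.

```python
def build_task_entity_map(dependency_table, attribute_table, key="task"):
    """Merge the two tables on the association key into nodes and edges."""
    # One node per task, carrying its attributes.
    nodes = {row[key]: dict(row) for row in attribute_table}
    edges = []
    for upstream, downstream in dependency_table:
        # Tasks without attribute rows still become (bare) entities.
        for task in (upstream, downstream):
            nodes.setdefault(task, {key: task})
        edges.append(
            {"source": upstream, "target": downstream, "relation": "schedules"}
        )
    return {"nodes": nodes, "edges": edges}


# Hypothetical sample tables.
dependency_table = [("job1", "job2"), ("job2", "job3")]
attribute_table = [
    {"task": "job1", "owner": "alice", "created_at": "2023-01-01"},
    {"task": "job2", "owner": "bob", "created_at": "2023-01-02"},
]
task_entity_map = build_task_entity_map(dependency_table, attribute_table)
```

Keeping tasks that have no attribute row as bare nodes is a design choice of this sketch; an inner join that drops them would be an equally plausible reading of the text.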
In one embodiment, generating the data table entity map associated with the data development task comprises:
acquiring a task read-write data table and a task input-output storage configuration data table corresponding to the data development task to be executed; merging and combining the task read-write data table and the task input-output storage configuration data table to obtain a primary data blood relationship between the data tables and the data development task; determining a table attribute data table corresponding to the data tables according to the various metadata information in the data development task; and generating a data table entity map with data tables as entities based on the primary data blood relationship and the table attribute data table.
Specifically, the basic execution logic of the data development task to be executed is parsed to obtain the task's read tables (the data tables from which data is read when the task executes) and write tables (the data tables to which data is written after the task executes), which together form the task read-write data table. Similarly, the task configuration data corresponding to the data development task is extracted and parsed to obtain the task's input configuration data and output configuration data. Based on these, the read and write data tables for task types including export (outbound) tasks and import (inbound) tasks are obtained, which gives the task input-output storage configuration data table.
After the task read-write data table and the task input-output storage configuration data table are obtained, they are merged and combined to obtain the primary data blood relationship between the data tables and the data development tasks.
Specifically, the task read-write data table and the task input-output storage configuration data table are merged by a union operation: the data development tasks and their read and write tables recorded in the two tables are united, which yields the primary data blood relationship between the data table and the data development task. The primary data blood relationship can be understood as the dependency relationship between the data development tasks and the data tables involved in the two tables, including the data tables from which each task reads data and into which it writes data when executed.
Similarly, after the task read-write data table and the task input-output storage configuration data table are obtained, the table attribute data table corresponding to the data table is determined according to the various metadata information in the data development task. Metadata can be understood as information describing the attributes of data. From the metadata information in the data development task, the table attribute data of each data table involved in the task read-write data table and the task input-output storage configuration data table can be determined, and the table attribute data table is obtained from this table attribute data. The table attribute data specifically includes the table name, the responsible person, the creation time of the table, the partition generation time, and the like, and may further include the generating task corresponding to the data table. The partition generation time is the point in time at which the storage space (partition) for storing the data was created. The generating task is the data development task corresponding to the data table: when a data development task uses a data table during execution, that data development task is recorded in the attribute data of the table.
Further, after the primary data blood relationship and the table attribute data table are obtained, a data table entity map taking the data table as an entity is generated based on them. Specifically, an association key shared by the primary data blood relationship and the table attribute data table is determined, and the two are joined on the association key to obtain map relationship data taking the data table as an entity, and thereby the data table entity map. The association key may be an attribute present in both the primary data blood relationship and the table attribute data table, such as the data table itself (table1, table2, tableN, and the like).
Step S204, determining a data table and a data development task having the same neighbor information based on the data table entity map and the task entity map.
Specifically, the data tables having a connection relationship in the data table entity map and the data development tasks having a connection relationship in the task entity map are determined. Based on these, a data table and a data development task that have the same connection object are determined as the data table and the data development task having the same neighbor information.
Based on the data table entity map taking the data table as an entity, each data table with a connection relationship, such as table1 and table2, can be determined, and similarly, based on the task entity map taking the task as an entity, each task with a connection relationship, such as job1 and job2, can be determined.
Further, because the primary data blood relationship expresses the dependency relationship between the data development tasks and the data tables involved in the task read-write data table and the task input-output storage configuration data table, a connection relationship also exists between a data development task and a data table. It will be appreciated that, based on the data tables having a connection relationship, the data development tasks having a connection relationship, and the primary data blood relationship, the data table and the data development task having the same connection object can be further determined.
For example, if job1 and job2 have a connection relationship, job2 and job3 have a connection relationship, table2 and table3 have a connection relationship, and job1 and table1 have a connection relationship, then it can be determined that table1 and job2 are both connected to job1. That is, job1 is the common connection object of table1 and job2, so table1 and job2 are a data table and a data development task having the same neighbor information. Similarly, it can be determined that job2 and table3 are both connected to table2, so table2 is the common connection object of job2 and table3, and job2 and table3 are likewise a data table and a data development task having the same neighbor information.
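As a non-limiting illustration, the same-neighbor determination in this example can be sketched as follows. The sketch assumes the connection relationships are held as undirected edge pairs and that entities can be told apart by their name prefix; the function name `same_neighbor_pairs` and the in-memory edge list are hypothetical, and the job2-table2 connection used in the second half of the example is included as an edge.

```python
from collections import defaultdict
from itertools import product

# Connection relationships from the example (hypothetical in-memory form).
edges = [
    ("job1", "job2"), ("job2", "job3"),      # task entity map
    ("table2", "table3"),                    # data table entity map
    ("job1", "table1"), ("job2", "table2"),  # task-to-table connections
]

def same_neighbor_pairs(edges):
    """Return {(table, job): shared neighbors} for pairs sharing a neighbor."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Assumption: entity kind is inferred from the name prefix.
    tables = sorted(n for n in neighbors if n.startswith("table"))
    jobs = sorted(n for n in neighbors if n.startswith("job"))
    pairs = {}
    for table, job in product(tables, jobs):
        shared = (neighbors[table] & neighbors[job]) - {table, job}
        if shared:
            pairs[(table, job)] = shared
    return pairs

candidates = same_neighbor_pairs(edges)
```

Only the pairs returned in `candidates` would proceed to similarity matching, which is how this kind of blocking reduces the number of comparisons.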
Step S206, determining the matching similarity between the data table and the data development task having the same neighbor information.
Specifically, for the data table and the data development task having the same neighbor information, attribute information is identified and matched to determine the attribute information of the same category between the data table and the data development task. For example, the table name of the data table and the task name of the data development task can be determined as attribute information of the same category, as can the creation time of the table and the creation time of the task. The data table and the data development task involve many kinds of attribute information, so many categories of matching attribute information can be obtained; the categories are not limited to names (task name or table name) and creation times.
Further, after the attribute information of the same category is obtained, the attribute similarity between the data table and the data development task is determined for each category of attribute information. The attribute similarities are then fused by weighting, using the weight data corresponding to each attribute similarity, to obtain the matching similarity between the data table and the data development task having the same neighbor information.
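As a non-limiting illustration, the weighted fusion of per-category attribute similarities can be sketched as follows. The text does not specify a similarity metric or weights; the use of `difflib.SequenceMatcher`, the attribute values, and the weight values below are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher

def attribute_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; an illustrative choice of metric."""
    return SequenceMatcher(None, a, b).ratio()

def matching_similarity(table_attrs, task_attrs, weights):
    """Weighted fusion of the per-category attribute similarities."""
    total_weight = sum(weights.values())
    weighted = sum(
        weight * attribute_similarity(table_attrs[cat], task_attrs[cat])
        for cat, weight in weights.items()
    )
    return weighted / total_weight

# Hypothetical attribute values and weights, keyed by shared category.
table_attrs = {"name": "dwd_order_detail", "owner": "alice", "created": "2023-04-01"}
task_attrs = {"name": "job_dwd_order_detail", "owner": "alice", "created": "2023-04-01"}
weights = {"name": 0.5, "owner": 0.3, "created": 0.2}

score = matching_similarity(table_attrs, task_attrs, weights)
```

Dividing by the total weight keeps the result in the range 0 to 1 even when the weights do not sum to 1.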
Step S208, performing entity fusion on the data table entity map and the task entity map whose matching similarity meets the preset fusion condition to obtain an entity fusion map, and determining updated blood relationship data based on the entity fusion map.
Specifically, the preset fusion condition includes the matching similarity being greater than a preset similarity threshold. By acquiring the preset similarity threshold and comparing the matching similarity against it, the target data table entity map and the target task entity map whose matching similarity is greater than the preset similarity threshold can be screened out.
The preset similarity threshold can be set and adjusted for different actual development tasks and is not limited to a specific value. The larger the preset similarity threshold, the less data qualifies for fusion and the more closely associated the qualifying data is, so that more accurate blood relationship data is obtained after fusion. For example, if the preset similarity threshold is set to 0.8, the target data table entity map and the target task entity map whose matching similarity is greater than 0.8 are fused.
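As a non-limiting illustration, the screening against the preset similarity threshold can be sketched as follows. The threshold 0.8 comes from the example above; the candidate pairs and their similarity values are hypothetical.

```python
SIMILARITY_THRESHOLD = 0.8  # value from the example; adjustable per task

# Hypothetical matching similarities for candidate (table, task) pairs.
scores = {
    ("table1", "job1"): 0.93,
    ("table2", "job2"): 0.88,
    ("table3", "job3"): 0.85,
    ("table1", "job2"): 0.41,
}

def select_fusion_targets(scores, threshold=SIMILARITY_THRESHOLD):
    """Keep only pairs whose similarity is strictly greater than the threshold."""
    return {pair for pair, value in scores.items() if value > threshold}

targets = select_fusion_targets(scores)
```

The comparison is strict because the fusion condition is stated as "greater than" the threshold.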
Further, logic processing is performed on the target data table corresponding to the target data table entity map, the target data table is encapsulated into the target data development task corresponding to the target task entity map, and entity fusion is performed on the target data table and the target data development task to obtain the entity fusion map. By aligning and fusing data tables with data development tasks using the capability of the knowledge graph, the upstream and downstream read-write relationships of the data development tasks and the data tables are chained together, the dependency and association relationships among data are completed, and the blood relationship coverage among data is enhanced.
The logic processing performed on the target data table corresponding to the target data table entity map may specifically be ETL processing (Extract, Transform, Load; that is, the process of extracting data from a source, transforming it, and loading it into a destination). Specifically, the target data table in the target data table entity map is extracted, transformed, loaded, and encapsulated into the target data development task corresponding to the target task entity map, thereby fusing the two entities, the target data table and the target data development task, and obtaining the final entity fusion map.
In one embodiment, after the entity fusion map is obtained, the updated blood relationship data is further determined according to the blood relationships between the data tables in the data table entity map, the blood relationships between the data development tasks in the task entity map, and the entity fusion map. The updated blood relationship data may specifically include newly added blood relationships between data tables, between data development tasks, or between data tables and data development tasks.
Specifically, the connection relationships between the data tables in the data table entity map can be taken as the blood relationships between those data tables, and likewise the connection relationships between the data development tasks in the task entity map can be taken as the blood relationships between those tasks. Then, by comparing these existing blood relationships with the connection relationships (that is, the blood relationships) of the nodes in the entity fusion map, the newly added blood relationship data produced by the entity fusion processing can be obtained.
For example, the target data table entity map subjected to entity fusion processing includes table1, table2, and table3, with a connection relationship between table2 and table3, while the target task entity map includes job1, job2, and job3, with connection relationships between job1 and job2 and between job2 and job3. The entity fusion map obtained by the entity fusion processing includes a table1-job1 node, a table2-job2 node, and a table3-job3 node, with a connection relationship between the table2-job2 node and the table3-job3 node and a connection relationship between the table1-job1 node and the table2-job2 node.
It can be understood that, after the fusion processing, the table1-job1 node shares the knowledge of the original job1 and can therefore be linked to the table2-job2 node, so a new connection relationship exists between the table1-job1 node and the table2-job2 node. That is, the newly added blood relationship data specifically includes the connection relationship between table1 and table2.
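As a non-limiting illustration, the fusion in this example and the derivation of the newly added connection can be sketched as follows. The sketch assumes undirected connections and a given one-to-one pairing of tables and tasks; the function name `fuse` and the node naming scheme are hypothetical.

```python
# Connection relationships before fusion, from the example (undirected).
table_edges = {("table2", "table3")}
task_edges = {("job1", "job2"), ("job2", "job3")}

# Pairs selected for fusion: each table is merged with its matched task.
fused = {"table1": "job1", "table2": "job2", "table3": "job3"}

def fuse(table_edges, task_edges, fused):
    """Map every entity to its fused node and merge both edge sets."""
    node_of = {}
    for table, job in fused.items():
        node_of[table] = node_of[job] = f"{table}-{job}"
    merged = set()
    for a, b in table_edges | task_edges:
        u, v = node_of.get(a, a), node_of.get(b, b)
        if u != v:
            merged.add(frozenset((u, v)))
    return node_of, merged

node_of, fusion_edges = fuse(table_edges, task_edges, fused)

# Table-to-table connections present after fusion but absent before it.
table_of = {node_of[table]: table for table in fused}
known = {frozenset(edge) for edge in table_edges}
new_table_edges = {
    frozenset(table_of.get(node, node) for node in edge) for edge in fusion_edges
} - known
```

Here the only table-level connection that did not exist before fusion is the one between table1 and table2, which matches the example.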
In one embodiment, FIG. 3 shows the process of obtaining the entity fusion map. Referring to FIG. 3, after the target data table entity map and the target task entity map whose matching similarity is greater than the preset similarity threshold are determined, entity fusion is performed on them. The target data table entity map includes table1, table2, and table3, with a connection relationship between table2 and table3; the target task entity map includes job1, job2, and job3, with connection relationships between job1 and job2 and between job2 and job3.
Specifically, entity fusion of the target data table entity map and the target task entity map yields the entity fusion map shown in FIG. 3. Its nodes include a table1-job1 node, a table2-job2 node, and a table3-job3 node. Because of the connection relationship between table2 and table3 and the connection relationship between job2 and job3, the table2-job2 node is connected to the table3-job3 node; and because the table1-job1 node shares the knowledge of the original job1, which is connected to job2, it can be linked to the table2-job2 node.
Therefore, in the resulting entity fusion map, a connection relationship exists between the table2-job2 node and the table3-job3 node, and a new connection relationship is established between the table1-job1 node and the table2-job2 node. Updated blood relationship data can thus be obtained, namely blood relationships between nodes that had no connection relationship before entity fusion, which enhances the blood relationship coverage among data. Meanwhile, fusing data with multiple kinds of attributes mitigates the data silo problem and further completes the dependency and association relationships between data tables and data development tasks. This addresses the loss of blood relationships that occurs when missing data (for example, a missing library name) causes errors in the data parsing result.
In the above blood relationship data determining method, the task entity map corresponding to the data development task to be executed and the data table entity map associated with the data development task are obtained, and the data table and the data development task having the same neighbor information are determined based on the two maps. This blocks the data tables and data development tasks before similarity matching, which reduces the number of subsequent matching operations and improves data processing efficiency. Further, the matching similarity between the data table and the data development task having the same neighbor information is determined, and entity fusion is performed on the data table entity map and the task entity map whose matching similarity meets the preset fusion condition to obtain the entity fusion map. The updated blood relationship data is then determined based on the entity fusion map, which reduces identification errors and invalid data during data processing and improves the coverage and accuracy of the determined blood relationship data.
In one embodiment, as shown in fig. 4, a blood relationship data determining method is provided, which specifically includes the following steps:
Step S401, obtaining the scheduling dependency relationships among the data development tasks, and generating a task dependency relationship data table corresponding to the data development tasks according to the scheduling dependency relationships.
Specifically, task scheduling execution data of different data development tasks on different data platforms is obtained, and the scheduling dependency relationships among the data development tasks are determined from it. The task scheduling execution data can be understood as the execution order of the data development tasks, whether on the same platform or across platforms. For example, from the detailed execution record that job1 is executed before job2, a scheduling dependency relationship between job1 and job2 can be determined. By obtaining the detailed execution records of the different data development tasks, the scheduling dependency relationships among them are obtained, and the task dependency relationship data table corresponding to the data development tasks is generated accordingly.
In one embodiment, the task dependency data table is specifically shown in table 1 below:
Table 1 Task dependency relationship data table

| From task | To task |
| --- | --- |
| job1 | job2 |
| job2 | job3 |
| job3 | job4 |
Referring to Table 1, the entities in the task dependency relationship data table are tasks. A from task and a to task in the same row have a scheduling dependency relationship in which the from task is executed first and the to task is executed afterwards. For example, job1 is executed before job2, and job2 is executed before job3.
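As a non-limiting illustration, rows of the form shown in Table 1 can be derived from an observed execution order. The sketch assumes a single linear run order in which each task depends on the one scheduled immediately before it; the function name `dependency_rows` and the field names are hypothetical.

```python
def dependency_rows(execution_order):
    """Pair each task with the task scheduled immediately after it."""
    return [
        {"from_task": first, "to_task": second}
        for first, second in zip(execution_order, execution_order[1:])
    ]

# Hypothetical observed order, chosen to reproduce the rows of Table 1.
rows = dependency_rows(["job1", "job2", "job3", "job4"])
```

Real scheduling data is generally a graph rather than a single chain, so a practical version would read explicit upstream and downstream declarations instead of inferring them from order alone.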
Step S402, determining a task attribute data table corresponding to the data development task according to various metadata information in the data development task.
Specifically, metadata can be understood as information describing the attributes of data, and supports functions such as indicating storage locations, recording history, searching resources, and recording files. According to the data attribute information of the various metadata involved in the data development task (such as the name, storage location, and creation time of the data), the task attribute data corresponding to the data development task can be determined, including the task name, the responsible person, the task creation time, the execution time of the associated instance of the task, and the like, thereby obtaining the task attribute data table.
In one embodiment, the task attribute data table is specifically shown in table 2 below:
Table 2 Task attribute data table

| Task | Task name | Responsible person | Task creation time | Associated instance execution time |
| --- | --- | --- | --- | --- |
| job1 | XX | XX | XX | XX |
| job2 | XX | XX | XX | XX |
| job3 | XX | XX | XX | XX |
| job4 | XX | XX | XX | XX |
As can be seen from Table 2, each task, such as job1, job2, or job3, carries detailed attribute information such as the task name, the responsible person, the task creation time, and the execution time of its associated instance. An associated instance of a data development task can be understood as one complete execution of that data development task.
Step S403, generating a task entity map taking the task as an entity according to the task dependency relationship data table and the task attribute data table.
Specifically, the task dependency relationship data table and the task attribute data table are combined to generate the task entity map taking the task as an entity. To do so, an association key shared by the two tables is determined, and the two tables are joined on the association key to obtain map relationship data taking the task as an entity. The association key may be an attribute present in both the task dependency relationship data table and the task attribute data table, such as the data development task itself (job1, job2, jobN, and the like).
In one embodiment, the data table corresponding to the task entity map taking the task as an entity is shown in Table 3 below:
Table 3 Data table corresponding to the task entity map
Referring to Table 3, the attribute common to Table 1 (the task dependency relationship data table) and Table 2 (the task attribute data table), namely the task, is used as the association key. With the task as the entity, the from task as the from entity, and the to task as the to entity, the two tables are joined to generate the data table corresponding to the task entity map. For example, for a row in Table 3 whose from entity is job1 and whose to entity is another task, the from entity attribute holds the task attribute information of job1 in Table 2, including the task name, responsible person, task creation time, and associated instance execution time of job1.
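As a non-limiting illustration, the join on the task association key can be sketched as follows. The exact column layout of Table 3 is not reproduced in the text, so the output field names (including `to_attributes`) and the function name `build_entity_rows` are assumptions; the "XX" placeholders are kept as in Table 2.

```python
# Table 1: (from task, to task)
dependencies = [("job1", "job2"), ("job2", "job3"), ("job3", "job4")]

# Table 2, keyed by task, with the original "XX" placeholders.
task_attributes = {
    job: {"task_name": "XX", "owner": "XX", "created": "XX", "instance_time": "XX"}
    for job in ("job1", "job2", "job3", "job4")
}

def build_entity_rows(edges, attributes):
    """Join dependency edges with attributes on the shared task key."""
    return [
        {
            "from_entity": src,
            "to_entity": dst,
            "from_attributes": attributes.get(src, {}),
            "to_attributes": attributes.get(dst, {}),
        }
        for src, dst in edges
    ]

task_entity_rows = build_entity_rows(dependencies, task_attributes)
```

Each resulting row can be read as one edge of the task entity map with the attributes of its endpoints attached.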
Step S404, obtaining a task read-write data table and a task input-output storage configuration data table corresponding to the data development task to be executed.
Specifically, the task log data corresponding to each data development task to be executed is obtained and parsed to extract the executed SQL logic it carries, and the executed SQL logic is parsed to obtain the task read-write data table. During data development, data generally flows between different data development tasks; when data flows to an SQL task, the corresponding SQL logic is executed. The executed SQL logic can be understood as the SQL statements to be executed, such as select statements, from statements, to statements, and the like.
The executed SQL logic may specifically be parsed with an SQL parsing engine (such as Apache Calcite, a dynamic data management framework) to obtain the read table and write table of the data development task, and thereby the task read-write data table. The task read-write data table of the data development task is shown in Table 4 below:
Table 4 Task read-write data table

| Task | Read table | Write table |
| --- | --- | --- |
| job1 | table1 | table2 |
| job2 | table2 | table3 |
As can be seen from Table 4, executing job1 requires reading the required data from table1 and writing the resulting data into table2; similarly, executing job2 requires reading the required data from table2 and writing the resulting data into table3.
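As a non-limiting illustration, extracting read and write tables from SQL text can be sketched with regular expressions. This is a deliberate simplification and not the Apache Calcite parser named above: it ignores comments, quoted identifiers, and common table expressions. The SQL statements are hypothetical and were written to be consistent with Table 4.

```python
import re

# Simplified patterns; a real SQL parser should be used in practice.
WRITE_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
READ_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def read_write_tables(sql: str):
    """Return (read tables, write tables) found in one SQL statement."""
    return set(READ_RE.findall(sql)), set(WRITE_RE.findall(sql))

# Hypothetical SQL carried by the task logs of job1 and job2.
task_sql = {
    "job1": "INSERT INTO table2 SELECT id, amount FROM table1 WHERE amount > 0",
    "job2": "INSERT OVERWRITE TABLE table3 SELECT a.id FROM table2 a",
}

read_write = {job: read_write_tables(sql) for job, sql in task_sql.items()}
```

Each entry of `read_write` corresponds to one row of the task read-write data table.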
Similarly, the task configuration data corresponding to each data development task to be executed is obtained and parsed to obtain the input configuration data and output configuration data of the task. Based on these, the read and write tables of tasks whose type is ex-warehouse or in-warehouse are obtained, thereby obtaining the task input-output storage configuration data table.
Further, the task input-output storage configuration data table is shown in Table 5 below:
Table 5 Task input-output storage configuration data table

| Task | Read table | Write table |
| --- | --- | --- |
| job3 | table3 | Ex-warehouse |
| job4 | In-warehouse | table1 |
By parsing the input configuration data and output configuration data of a data development task, the read and write tables of tasks of the ex-warehouse and in-warehouse types can be obtained. Referring to Table 5, job3 is an ex-warehouse task whose read table is table3 and whose write table is recorded as ex-warehouse (delivery out of the warehouse): when job3 is executed, the required data is read from table3 and the resulting data is exported, the emphasis being on identifying the specific data that actually needs to be exported. Similarly, job4 is an in-warehouse task whose read table is recorded as in-warehouse (entry into the warehouse) and whose write table is table1: when job4 is executed, the data to be imported is acquired from other platforms or from internal sources and written into table1, the emphasis being on identifying the specific data that actually needs to be imported.
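As a non-limiting illustration, turning task configuration data into rows of the form shown in Table 5 can be sketched as follows. The configuration structure, its field names, and the marker strings "ex-warehouse" and "in-warehouse" (standing in for the non-table entries of Table 5) are all assumptions.

```python
# Hypothetical task configurations; the field names are assumptions.
task_configs = {
    "job3": {
        "type": "ex-warehouse",
        "input": {"table": "table3"},
        "output": {"target": "external"},
    },
    "job4": {
        "type": "in-warehouse",
        "input": {"source": "external"},
        "output": {"table": "table1"},
    },
}

def config_row(job, config):
    """Turn one task configuration into a (task, read table, write table) row."""
    if config["type"] == "ex-warehouse":
        return (job, config["input"]["table"], "ex-warehouse")
    if config["type"] == "in-warehouse":
        return (job, "in-warehouse", config["output"]["table"])
    raise ValueError(f"unsupported task type: {config['type']}")

config_rows = [config_row(job, cfg) for job, cfg in task_configs.items()]
```

Only one side of each row names a data table, because the other side lies outside the warehouse.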
Step S405, performing merging and combining processing based on the task read-write data table and the task input-output storage configuration data table to obtain a primary data blood relationship between the data table and the data development task.
Specifically, the task read-write data table and the task input-output storage configuration data table are merged by a union operation: the data development tasks and their read and write tables recorded in the two tables are united to obtain the primary data blood relationship between the data table and the data development task. The union operation does not sort the rows and does not delete duplicate rows (comparable to UNION ALL in SQL); it simply completes the merging of the two data tables.
The primary data blood relationship can be understood as the dependency relationship between the data development tasks and the data tables involved in the task read-write data table and the task input-output storage configuration data table, including the data tables from which each task reads data and into which it writes data when executed.
Further, the data table corresponding to the primary data blood relationship is shown in Table 6 below:
Table 6 Data table corresponding to the primary data blood relationship

| Task | Read table | Write table |
| --- | --- | --- |
| job1 | table1 | table2 |
| job2 | table2 | table3 |
| job3 | table3 | Ex-warehouse |
| job4 | In-warehouse | table1 |
As can be seen from Table 6, performing the union operation on the data development tasks and their read and write tables in Table 4 (the task read-write data table) and Table 5 (the task input-output storage configuration data table) yields the dependency relationship between the data development tasks and the data tables involved in the two tables, including the data tables from which each task reads data and into which it writes data. For example, executing job1 requires reading the required data from table1 and writing the resulting data into table2, and executing job2 requires reading the required data from table2 and writing the resulting data into table3. Similarly, when job3 is executed, the required data is read from table3 and the resulting data is exported; when job4 is executed, the data to be imported is acquired from other platforms or from internal sources and written into table1.
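As a non-limiting illustration, the union described here (no sorting, no removal of duplicate rows) can be sketched as a plain concatenation. The row tuples mirror Tables 4 and 5; the function name `union_all` and the marker strings for the non-table entries are assumptions.

```python
# Table 4: (task, read table, write table)
read_write_rows = [("job1", "table1", "table2"), ("job2", "table2", "table3")]
# Table 5, with marker strings standing in for the non-table entries.
config_rows = [("job3", "table3", "ex-warehouse"), ("job4", "in-warehouse", "table1")]

def union_all(*tables):
    """Concatenate rows in input order, without sorting or de-duplicating."""
    return [row for table in tables for row in table]

primary_relationship = union_all(read_write_rows, config_rows)
```

Keeping duplicates means a task that appears in both source tables contributes one row from each.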
Step S406, determining a table attribute data table corresponding to the data table according to various metadata information in the data development task.
Specifically, metadata can be understood as information describing the attributes of data, and supports functions such as indicating storage locations, recording history, searching resources, and recording files. According to the data attribute information of the various metadata involved in the data development task (such as the name, storage location, and creation time of the data), the table attribute data corresponding to the data table can be determined, including the table name, the responsible person, the creation time of the table, the partition generation time, and the like, and optionally the generating task corresponding to the data table, thereby obtaining the table attribute data table corresponding to the data table.
The table attribute data table corresponding to the data table is shown in the following table 7:
Table 7 Table attribute data table

| Data table | Table name | Responsible person | Creation time of table | Partition generation time |
| --- | --- | --- | --- | --- |
| table1 | XX | XX | XX | XX |
| table2 | XX | XX | XX | XX |
| table3 | XX | XX | XX | XX |
| table4 | XX | XX | XX | XX |
As can be seen from Table 7, each data table, such as table1, table2, or table3, carries detailed attribute information such as the table name, the responsible person, the creation time of the table, and the partition generation time. The generating task denotes the data development task corresponding to the data table: when a data development task uses a data table during execution, that data development task is recorded in the attribute data of the table.
Step S407, generating a data table entity map taking the data table as an entity based on the primary data blood relationship and the table attribute data table.
Specifically, an association key shared by the primary data blood relationship and the table attribute data table is determined, and the two are joined on the association key to obtain map relationship data taking the data table as an entity, and thereby the data table entity map taking the data table as an entity. The association key may be an attribute present in both the primary data blood relationship and the table attribute data table, such as the data table itself (table1, table2, tableN, and the like).
In one embodiment, the data table corresponding to the data table entity map taking the data table as an entity is shown in Table 8 below:
Table 8 Data table corresponding to the data table entity map
Referring to Table 8, the attribute common to Table 6 (the data table corresponding to the primary data blood relationship) and Table 7 (the table attribute data table), namely the data table, is used as the association key. With the data table as the entity, the read table as the from entity, and the write table as the to entity, the two tables are joined to generate the data table corresponding to the data table entity map. For example, for a row in Table 8 whose from entity is table1 and whose to entity is table2, the from entity attribute holds the table attribute information of table1 in Table 7, including the generating task, table name, responsible person, creation time of the table, partition generation time, and the like.
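As a non-limiting illustration, the join of Table 6 with Table 7 on the data table key can be sketched as follows. The layout of Table 8 is not reproduced in the text, so the output field names are assumptions, as is the choice to skip rows whose read or write side is not a known data table (the ex-warehouse and in-warehouse entries).

```python
# Table 6: (task, read table, write table), with marker strings for non-tables.
primary_relationship = [
    ("job1", "table1", "table2"),
    ("job2", "table2", "table3"),
    ("job3", "table3", "ex-warehouse"),
    ("job4", "in-warehouse", "table1"),
]

# Table 7, keyed by data table, with the original "XX" placeholders.
table_attributes = {
    table: {"table_name": "XX", "owner": "XX", "created": "XX", "partition_time": "XX"}
    for table in ("table1", "table2", "table3", "table4")
}

def table_entity_rows(relationship, attributes):
    """Join read/write pairs with table attributes on the data table key."""
    rows = []
    for task, read_table, write_table in relationship:
        # Assumption: rows with an endpoint that is not a known table are skipped.
        if read_table not in attributes or write_table not in attributes:
            continue
        rows.append({
            "from_entity": read_table,
            "to_entity": write_table,
            "generating_task": task,
            "from_attributes": attributes[read_table],
        })
    return rows

entity_rows = table_entity_rows(primary_relationship, table_attributes)
```

Under these assumptions the result contains the table1-to-table2 and table2-to-table3 edges, each annotated with the task that produced it.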
Step S408, each data table with connection relation in the data table entity map is determined, and each data development task with connection relation in the task entity map is determined.
Specifically, based on the data table entity map taking the data table as an entity, each data table with a connection relationship, such as table1 and table2, can be determined, and similarly, based on the task entity map taking the task as an entity, each task with a connection relationship, such as job1 and job2, can be determined.
Step S409, determining the data table and the data development task with the same connection object as the data table and the data development task with the same neighbor information based on the data tables with the connection relationship and the data development tasks with the connection relationship.
Specifically, the primary data blood relationship can be understood as the dependency association relationship between the related data development task and the data table in the task read-write data table and the task input-output storage configuration data table, and further the connection relationship exists between the data development task and the data table. It will be appreciated that, based on each data table having a connection relationship, each data development task having a connection relationship, and a primary data blood relationship, it is possible to further determine the data table and the data development task having the same connection object.
For example, if job1 and job2 have a connection relationship, job2 and job3 have a connection relationship, table2 and table3 have a connection relationship, and job1 and table1 have a connection relationship, then it can be determined that table1 and job2 are both connected to job1. That is, job1 is the common connection object of table1 and job2, so table1 and job2 are a data table and a data development task of the same neighbor information. Similarly, it can be determined that job2 and table3 are both connected to table2, so table2 is the common connection object of job2 and table3, and job2 and table3 likewise belong to the data table and data development task of the same neighbor information.
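The grouping by common connection object can be sketched as follows. This is a minimal illustration under the assumption that connection relationships are available as undirected pairs and that tables can be told apart from tasks by name; the function name and the naming convention are hypothetical.

```python
from collections import defaultdict
from itertools import product


def same_neighbor_pairs(edges, is_table):
    """Return (table, task) pairs that share at least one connection object."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    pairs = set()
    for adjacent in neighbors.values():
        tables = [n for n in adjacent if is_table(n)]
        tasks = [n for n in adjacent if not is_table(n)]
        # Every table and task attached to the same node become a candidate pair.
        for table, task in product(tables, tasks):
            pairs.add((table, task))
    return pairs


# Connection relationships listed in the first part of the example above.
edges = [("job1", "job2"), ("job2", "job3"), ("table2", "table3"), ("job1", "table1")]
pairs = same_neighbor_pairs(edges, lambda name: name.startswith("table"))
```

With only the four connections listed, the sketch yields the single candidate pair (table1, job2), whose common connection object is job1. Only such candidate pairs need to go through the later similarity calculation.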
Step S410, determining the matching similarity between the data table and the data development task of the same neighbor information.
Specifically, for the data table and the data development task of the same neighbor information, attribute information is identified and matched to determine the attribute information of the same category between the data table and the data development task, such as the table name in the data table and the task name in the data development task, so that similarity can be determined on attribute information of the same category.
Further, according to the categories of the attribute information, the attribute similarity between the data table and the data development task under each category is determined, so that the weighted fusion is carried out on the attribute similarity by combining weight data corresponding to the attribute similarity, and the matching similarity between the data table and the data development task of the same neighbor information is obtained through fusion.
Step S411, carrying out entity fusion on the data table entity map and the task entity map whose matching similarity meets the preset fusion condition to obtain an entity fusion map, and determining updated blood relationship data based on the entity fusion map.
Specifically, a target data table entity map and a target task entity map, the matching similarity of which is larger than a preset similarity threshold, are screened out by acquiring the preset similarity threshold and comparing the matching similarity with the preset similarity threshold.
Further, the processing logic of the target data table corresponding to the target data table entity map is handled so that the target data table is encapsulated as the target data development task corresponding to the target task entity map, and entity fusion is performed on the target data table and the target data development task to obtain an entity fusion map.
In one embodiment, after the entity fusion map is obtained, updated blood-relationship data is further determined according to blood-relationship between data tables in the data-table entity map, blood-relationship between data development tasks between task entity maps, and the entity fusion map.
Specifically, the connection relationships between the data tables in the data table entity map can be used as the blood relationships between those data tables, and likewise the connection relationships between the data development tasks in the task entity map can be used as the blood relationships between those tasks. Then, from the existing blood relationships between data tables in the data table entity map, the blood relationships between data development tasks in the task entity map, and the connection relationships (i.e., blood relationships) of the nodes in the entity fusion map, the blood relationship data newly added by the entity fusion processing can be obtained.
For example, the target data table entity map for entity fusion processing includes table1, table2 and table3, with a connection relationship between table2 and table3, while the target task entity map includes job1, job2 and job3, with connection relationships between job1 and job2 and between job2 and job3. The entity fusion map obtained by the entity fusion processing includes a table1-job1 node, a table2-job2 node and a table3-job3 node, where a connection relationship exists between the table2-job2 node and the table3-job3 node, and between the table1-job1 node and the table2-job2 node.
It can be understood that, after the fusion processing, the table1-job1 node inherits the knowledge of the original job1 and can therefore be linked to the table2-job2 node. A new connection relationship thus exists between the table1-job1 node and the table2-job2 node, that is, the newly added blood relationship data specifically includes the connection relationship between table1 and table2.
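The table1/job1 example can be reproduced with a short sketch. It assumes that connection relationships are stored as ordered pairs and that the matched table-task pairs are already known; the function name fuse_entities and the "table-job" node label format are illustrative choices, not part of the described method.

```python
def fuse_entities(table_edges, task_edges, matches):
    """Fuse matched tables and tasks; matches maps each target table to its task."""
    table_of = {task: table for table, task in matches.items()}

    def node(table):
        # Label of the fused node, e.g. "table1-job1".
        return f"{table}-{matches[table]}"

    fused_edges = set()
    new_table_edges = set()
    # Existing table-to-table connections carry over to the fused nodes.
    for src, dst in table_edges:
        if src in matches and dst in matches:
            fused_edges.add((node(src), node(dst)))
    # Task-to-task connections are projected onto the fused nodes as well.
    for src, dst in task_edges:
        if src in table_of and dst in table_of:
            t_src, t_dst = table_of[src], table_of[dst]
            fused_edges.add((node(t_src), node(t_dst)))
            if (t_src, t_dst) not in table_edges:
                # This table-level connection was not known before fusion.
                new_table_edges.add((t_src, t_dst))
    return fused_edges, new_table_edges


table_edges = {("table2", "table3")}
task_edges = {("job1", "job2"), ("job2", "job3")}
matches = {"table1": "job1", "table2": "job2", "table3": "job3"}
fused_edges, new_table_edges = fuse_entities(table_edges, task_edges, matches)
```

Here the only newly added table-level connection is the one between table1 and table2, matching the example above.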
In one embodiment, the matching similarity between the data table entity map and the task entity map is determined, and the target data table entity map and the target task entity map for entity fusion are determined according to the matching similarity. Performing entity fusion in this way helps to find dependency relationships that are missing upstream or downstream of a data development task. For example, if the matching similarity between job1 and table1 is greater than the preset similarity threshold, the upstream of table1 is table2, and the task write table of job2 is table2, then job1 and job2 have a high similarity and a dependency relationship needs to be configured between them for scheduled operation.
Similarly, a misconfigured dependency relationship can be found. For example, a scheduling dependency relationship exists between job1 and job2, and the task write table of job2 is table2, but after the matching similarity between job1 and table2 is calculated, it does not meet the preset fusion condition. This indicates that a misconfigured dependency relationship exists among job1, job2 and table2, which needs further identification and adjustment.
In this embodiment, the task entity map corresponding to the data development task to be executed and the data table entity map associated with the data development task are acquired, and the data table and the data development task of the same neighbor information are determined based on the two maps. This blocks the data tables and data development tasks before similarity matching, which reduces the number of subsequent matching operations and improves data processing efficiency. The matching similarity between the data table and the data development task of the same neighbor information is determined through diversified similarity identification processing. Because the actual business logic of the data development process does not need to be acquired in real time, the similarity can be calculated from basic attribute information alone, which reduces the risk of leaking the actual business logic. Further, entity fusion is carried out on the data table entity map and the task entity map whose matching similarity meets the preset fusion condition to obtain an entity fusion map, and updated blood relationship data is determined based on the entity fusion map, which reduces identification errors and invalid data in the data processing process and improves the coverage rate and accuracy of the determined blood relationship data.
In one embodiment, as shown in fig. 5, the step of determining the matching similarity between the data table and the data development task of the same neighbor information specifically includes:
Step S502, determining attribute information of the same category between the data table and the data development task, the attribute information including a plurality of categories.
The attribute information of the data table includes the table name, the responsible person, the creation time of the table, the partition generation time and the like, and the attribute information of the data development task includes the task name, the responsible person, the task creation time, the associated instance execution time of the task and the like.
Specifically, attribute information of the same category between the data table and the data development task is determined. For example, the table name in the data table and the task name in the data development task are attribute information of the same category. Similarly, the responsible person in the data table and the responsible person in the data development task, the creation time of the table and the task creation time, and the partition generation time in the data table and the associated instance execution time in the data development task each belong to the same category of attribute information.
In addition, an upstream task queue and a downstream task queue are obtained according to the scheduling dependency relationships among tasks in the task dependency relationship data table, and an upstream table queue and a downstream table queue are obtained according to the upstream and downstream relationships of each data table in the data table corresponding to the primary data blood relationship. The upstream task queue and the upstream table queue, and the downstream task queue and the downstream table queue, also each belong to the same category of attribute information.
Step S504, obtaining similarity determination logic corresponding to attribute information of different categories, and determining the similarity of each attribute between a data table and a data development task of the same neighbor information according to the similarity determination logic and the attribute information of the corresponding category.
Specifically, the similarity determination logic corresponding to the attribute information of different categories is different, and the similarity determination logic under the corresponding category is executed according to the attribute information content under the category by acquiring the similarity determination logic corresponding to the attribute information of different categories, so as to obtain each attribute similarity between the data table and the data development task of the same neighbor information.
For example, for the attribute category of task name (or table name), the similarity determination logic calculates the text overlap, for example the Jaccard coefficient used to measure text similarity, to determine the attribute similarity between the task name and the table name. A greater text overlap (i.e., a larger Jaccard value) indicates a greater similarity between the two.
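A minimal sketch of a Jaccard coefficient on names follows. The description does not say how names are tokenized, so splitting on underscores is an assumption, and the sample names are hypothetical.

```python
def jaccard_similarity(table_name, task_name):
    """Jaccard coefficient of the token sets of two names (tokens split on '_')."""
    table_tokens = set(table_name.lower().split("_"))
    task_tokens = set(task_name.lower().split("_"))
    union = table_tokens | task_tokens
    if not union:
        return 0.0
    return len(table_tokens & task_tokens) / len(union)


# Three shared tokens (dwd, user, login) out of five distinct tokens in total.
name_similarity = jaccard_similarity("dwd_user_login_di", "job_dwd_user_login")
```

Character n-grams or another tokenizer could be substituted without changing the overall procedure; the choice only affects how the text overlap is measured.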
Similarly, for attribute categories such as the table creation time (or task creation time) and the partition generation time (or associated instance execution time), a fixed time is preset. The time difference between the table creation time and the fixed time and the time difference between the task creation time and the fixed time are calculated respectively, and then the difference between the two time differences is calculated. The attribute similarity of the time category is determined by judging whether this difference is smaller than a preset difference threshold: if it is, the similarity between the two times is high.
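The time comparison can be sketched as below. The fixed reference time, the one-hour threshold, and the mapping of "high similarity" to 1.0 and everything else to 0.0 are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

REFERENCE_TIME = datetime(2020, 1, 1)   # assumed preset fixed time
DIFF_THRESHOLD = timedelta(hours=1)     # assumed preset difference threshold


def time_similarity(table_time, task_time,
                    reference=REFERENCE_TIME, threshold=DIFF_THRESHOLD):
    """Return 1.0 when the two offsets from the fixed time differ by less than the threshold."""
    table_offset = table_time - reference
    task_offset = task_time - reference
    return 1.0 if abs(table_offset - task_offset) < threshold else 0.0


close = time_similarity(datetime(2023, 4, 1, 10, 0), datetime(2023, 4, 1, 10, 30))
far = time_similarity(datetime(2023, 4, 1, 10, 0), datetime(2023, 4, 1, 12, 0))
```

Since both offsets are measured from the same fixed time, their difference equals the gap between the two timestamps themselves.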
Step S506, determining the matching similarity between the data table and the data development task of the same neighbor information according to the attribute similarity and the weight data corresponding to the attribute similarity one by one.
Specifically, in the actual data development process, the weight data corresponding to the similarity of the attribute information of different categories are different, namely the weight data corresponding to the similarity of different attributes needs to be acquired, and the weighted fusion processing is performed on the similarity of each attribute according to the weight data corresponding to the similarity of each attribute one by one, so as to acquire the matching similarity between the data table of the same neighbor information and the data development task.
Further, the matching similarity S between the data table and the data development task of the same neighbor information is calculated by the following formula (1):
S = S1*W1 + S2*W2 + S3*W3 + … + Sn*Wn    (1)
where S1, S2, …, Sn represent the attribute similarities between the data table and the data development task of the same neighbor information, and W1, W2, …, Wn represent the weight data corresponding one-to-one to the attribute similarities. The matching similarity S is obtained by weighted fusion of the attribute similarities according to their corresponding weight data.
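Formula (1) is a plain weighted sum and can be sketched directly. The sample similarities, weights and the 0.5 threshold below are made-up values used only to show the arithmetic and the comparison against a preset similarity threshold.

```python
def matching_similarity(similarities, weights):
    """Weighted fusion per formula (1): S = S1*W1 + S2*W2 + ... + Sn*Wn."""
    if len(similarities) != len(weights):
        raise ValueError("each attribute similarity needs exactly one weight")
    return sum(s * w for s, w in zip(similarities, weights))


def meets_fusion_condition(score, threshold):
    """Fusion condition used here: matching similarity greater than the threshold."""
    return score > threshold


# Hypothetical attribute similarities (name, creation time, responsible person).
attribute_similarities = [0.6, 1.0, 0.0]
attribute_weights = [0.5, 0.3, 0.2]
score = matching_similarity(attribute_similarities, attribute_weights)
should_fuse = meets_fusion_condition(score, threshold=0.5)
```

For these sample values the score is 0.6*0.5 + 1.0*0.3 + 0.0*0.2 = 0.6, which exceeds the assumed threshold.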
In this embodiment, by determining attribute information of the same category between the data table and the data development task and acquiring similarity determination logic corresponding to attribute information of different categories, each attribute similarity between the data table and the data development task of the same neighbor information is determined according to the similarity determination logic and the attribute information of the corresponding category. Further, according to the attribute similarity and weight data corresponding to the attribute similarity one by one, the matching similarity between the data table and the data development task of the same neighbor information is determined. Different attributes in the data table and the data development task are comprehensively considered, and based on attribute similarity of the different attributes and weight data corresponding to the attributes, the similarity matching degree between the data table and the data development task of the same neighbor information is rapidly and accurately determined, so that fusion and matching processing operations between data with low similarity are reduced, data operation and maintenance cost is reduced, and data development processing efficiency is improved.
In one embodiment, as shown in fig. 6, the step of determining weight data corresponding to each attribute similarity one by one specifically includes:
Step S602, initializing weight data of each attribute similarity to obtain initial weights.
The attribute information of the same category between the data table and the data development task includes: the responsible person in the data table and the responsible person in the data development task, the creation time of the table in the data table and the task creation time in the data development task, the partition generation time in the data table and the associated instance execution time in the data development task, the upstream task queue and the upstream table queue, and the downstream task queue and the downstream table queue.
Specifically, the attribute similarities between the data table and the data development task include the attribute similarities between the responsible person in the data table and the responsible person in the data development task, between the creation time of the table and the creation time of the task, between the partition generation time and the associated instance execution time, between the upstream task queue and the upstream table queue, and between the downstream task queue and the downstream table queue.
The weight data corresponding to the attribute similarity of different attribute categories are different, the initial weight can be obtained after initializing the weight data of each attribute similarity, and then the weight data for matching similarity calculation can be obtained through final training by performing regression training on the initial weight.
In step S604, a labeled standard blood-edge relationship data set is obtained, where the labeled standard blood-edge relationship data set includes metadata with a blood-edge relationship and metadata without a blood-edge relationship.
Specifically, the labeled standard blood relationship data set used for training the weights of the attribute similarities includes metadata with a blood relationship and metadata without a blood relationship. Metadata with a blood relationship can be understood as metadata determined, after pre-screening, checking and labeling, to have a true dependency and association relationship. Similarly, metadata without a blood relationship can be understood as metadata determined, after pre-screening, checking and labeling, to have no dependency relationship.
The ratio between the metadata with a blood relationship and the metadata without a blood relationship in the labeled standard blood relationship data set is 1:1, that is, the data set includes the same quantity of each.
Step S606, performing regression training on the initial weight of each attribute similarity according to the marked standard blood relationship data set, and training to obtain weight data corresponding to each attribute similarity one by one.
Specifically, regression training is adopted: the initial weights of the attribute similarities are continuously adjusted and updated according to the labeled standard blood relationship data set until a training ending condition is reached, and the weight data corresponding to the attribute similarities is obtained.
The regression training method may specifically include multiple regression methods such as linear regression, logistic regression, and polynomial regression. The training ending condition may be understood as that the number of iterations or adjustments of the initial weight of the attribute similarity reaches a preset number of times threshold, or that the training loss value in the training process of the initial weight of the attribute similarity reaches a preset loss threshold.
Further, the initial weight of each attribute similarity is adjusted and updated by using the marked standard blood relationship data set, and when the iteration and adjustment times of the initial weight of the attribute similarity reach a preset times threshold or the training loss value in the training process of the initial weight of the attribute similarity reaches a preset loss threshold, training of the initial weight is completed, and weight data corresponding to each attribute similarity are obtained through training.
In one embodiment, logistic regression (i.e., a Logistic Regression model) is used to perform regression training on the initial weights of the attribute similarities to obtain the weight data corresponding to the attribute similarities. Specifically, during training the initial weights are continuously adjusted using the labeled metadata with a blood relationship and the labeled metadata without a blood relationship in the labeled standard blood relationship data set, until the training ending condition is determined to be met.
The training ending condition may be that the training loss value reaches a preset loss threshold. The loss function in the regression training process may be a mean square error (MSE) loss function, a cross entropy loss function, or the like. When the mean square error loss value or the cross entropy loss value reaches the preset loss threshold, training of the initial weights is determined to have ended and the trained weight data is obtained. The preset loss threshold can be set according to actual service requirements and is not limited to one or more specific values.
Similarly, the training ending condition may be that the number of iterations of the initial weights reaches a preset number threshold: if the number of iterations is detected to reach the preset number threshold, training is determined to have ended and the trained weight data is obtained. The preset number threshold can be set according to actual service requirements and is not limited to one or more specific values.
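The training loop with both ending conditions can be sketched in plain Python. This is one possible reading, not the patented implementation: it uses batch gradient descent on a cross entropy loss, adds a bias term, and picks the learning rate, thresholds and sample data arbitrarily. All of these are assumptions.

```python
import math


def train_weights(samples, labels, lr=0.5, max_iters=1000, loss_threshold=0.1):
    """Logistic regression by gradient descent; stops on loss threshold or iteration cap."""
    n = len(samples[0])
    weights = [0.0] * n          # initial weights (assumed to start at zero)
    bias = 0.0                   # bias term is an assumption, not in the description
    loss = float("inf")
    for _ in range(max_iters):   # ending condition 2: preset number of iterations
        grad_w = [0.0] * n
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard against log(0)
            loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        loss /= len(samples)
        if loss <= loss_threshold:   # ending condition 1: preset loss threshold
            break
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias, loss


# Hypothetical labeled set with a 1:1 ratio: label 1 = blood relationship exists.
samples = [[0.9, 1.0], [0.8, 1.0], [0.1, 0.0], [0.2, 0.0]]
labels = [1, 1, 0, 0]
weights, bias, final_loss = train_weights(samples, labels)
```

Each sample is a vector of attribute similarities for one table-task pair, so the learned coefficients play the role of W1, ..., Wn in formula (1).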
In this embodiment, by initializing the weight data of each attribute similarity, an initial weight is obtained, a labeled standard blood-edge relationship dataset including metadata with a blood-edge relationship and metadata without a blood-edge relationship is obtained, and then regression training is performed on the initial weight of each attribute similarity according to the labeled standard blood-edge relationship dataset, so as to train to obtain weight data corresponding to each attribute similarity one by one. The method and the device realize comprehensive consideration of different attribute similarities, and adjust and update the weight of each attribute similarity until weight data meeting actual service requirements are obtained, so that the accuracy of the matching similarity determined by the attribute similarity and the weight data can be improved, and further fusion of the data table entity map and the task entity map can be performed according to the accurate matching similarity, so that accurate and comprehensive blood-edge relationship data between the data can be obtained.
In one implementation, as shown in fig. 7, a method for determining blood-edge relationship data is provided, which specifically includes the following steps:
Step S701, acquiring a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task.
Step S702, determining each data table with connection relation in the data table entity map, and determining each data development task with connection relation in the task entity map.
Step S703, determining the data table and the data development task having the same connection object as the data table and the data development task having the same neighbor information based on the data tables having the connection relationship and the data development tasks having the connection relationship.
In step S704, attribute information of the same category between the data table and the data development task is determined, the attribute information including a plurality of categories.
Step S705, obtaining similarity determination logic corresponding to attribute information of different categories, and determining each attribute similarity between a data table and a data development task of the same neighbor information according to the similarity determination logic and the attribute information of the corresponding category.
Step S706, according to the attribute similarity and the weight data corresponding to the attribute similarity one by one, the matching similarity between the data table and the data development task of the same neighbor information is determined.
Step S707, a target data table entity map and a target task entity map with matching similarity greater than a preset similarity threshold are obtained.
Step S708, the processing logic of the target data table corresponding to the target data table entity map is handled so that the target data table is encapsulated as the target data development task corresponding to the target task entity map, and entity fusion is performed on the target data table and the target data development task to obtain an entity fusion map.
Step S709, determining updated blood relationship data according to the blood relationships between data tables in the data table entity map, the blood relationships between data development tasks in the task entity map, and the entity fusion map.
According to the blood relationship data determining method, the task entity map corresponding to the data development task to be executed and the data table entity map associated with the data development task are acquired, and the data table and the data development task of the same neighbor information are determined based on the two maps. This blocks the data tables and data development tasks before similarity matching, which reduces the number of subsequent matching operations and improves data processing efficiency. Further, the matching similarity between the data table and the data development task of the same neighbor information is determined, and entity fusion is carried out on the data table entity map and the task entity map whose matching similarity meets the preset fusion condition to obtain an entity fusion map. Updated blood relationship data is then determined based on the entity fusion map, which reduces identification errors and invalid data in the data processing process and improves the coverage rate and accuracy of the determined blood relationship data.
In one embodiment, as shown in fig. 8, a blood relationship data determining method is provided, which specifically includes the following steps:
Step S801, a scheduling dependency relationship among the data development tasks is obtained, and a task dependency relationship data table corresponding to the data development tasks is generated according to the scheduling dependency relationship.
Step S802, determining a task attribute data table corresponding to the data development task according to various metadata information in the data development task.
Step S803, a task entity map taking the task as an entity is generated according to the task dependency relationship data table and the task attribute data table.
Step S804, a task read-write data table and a task input-output storage configuration data table corresponding to the data development task to be executed are obtained.
Step S805, the task read-write data table and the task input-output storage configuration data table are merged to obtain a primary data blood relationship between the data table and the data development task.
Step S806, determining a table attribute data table corresponding to the data table according to various metadata information in the data development task.
Step S807, a data table entity map is generated with the data table as an entity based on the primary data blood relationship and the table attribute data table.
Step S808, determining each data table with connection relations in the data table entity map, and determining each data development task with connection relations in the task entity map.
Step S809, based on each data table having a connection relationship and each data development task having a connection relationship, determines the data table and the data development task having the same connection object as the data table and the data development task of the same neighbor information.
Step S810, determining attribute information of the same category between the data table and the data development task; the attribute information includes a plurality of categories.
Step S811, obtaining similarity determination logic corresponding to attribute information of different categories, and determining the similarity of each attribute between the data table and the data development task of the same neighbor information according to the similarity determination logic and the attribute information of the corresponding category.
Step S812, initializing weight data of the attribute similarity to obtain initial weights.
Step S813, a labeled standard blood-edge relationship data set is obtained, where the labeled standard blood-edge relationship data set includes metadata with a blood-edge relationship and metadata without a blood-edge relationship.
Step S814, performing regression training on the initial weights of the attribute similarities according to the labeled standard blood relationship data set, and training to obtain weight data corresponding to the attribute similarities one by one.
Step S815, determining the matching similarity between the data table and the data development task of the same neighbor information according to the attribute similarity and the weight data corresponding to the attribute similarity one by one.
Step S816, a target data table entity map and a target task entity map with the matching similarity larger than a preset similarity threshold are obtained.
Step S817, the processing logic of the target data table corresponding to the target data table entity map is handled so that the target data table is encapsulated as the target data development task corresponding to the target task entity map, and entity fusion is performed on the target data table and the target data development task to obtain an entity fusion map.
Step S818, determining updated blood relationship data according to the blood relationships between data tables in the data table entity map, the blood relationships between data development tasks in the task entity map, and the entity fusion map.
In the above blood relationship data determining method, the task entity map corresponding to the data development task to be executed and the data table entity map associated with the data development task are obtained, and the data tables and data development tasks of the same neighbor information are determined based on the two maps. This groups the data tables and data development tasks into blocks before similarity matching, which reduces the number of subsequent matching operations and improves data processing efficiency. The matching similarity between a data table and a data development task of the same neighbor information is determined through diversified similarity identification, and the actual service logic of the data development process does not need to be acquired when the similarity is calculated; the calculation relies only on basic attribute information, which reduces the risk of leaking the actual service logic. Further, entity fusion is performed on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map, and updated blood-edge relationship data are determined based on the entity fusion map. This reduces identification errors and invalid data during data processing and improves the coverage and accuracy of the determined blood-edge relationship data.
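The blocking step described above (pairing only data tables and data development tasks that share a connection object) can be sketched as follows. This is a minimal illustration, not the patented implementation: the edge representation as `(entity, neighbor)` pairs and the function name `candidate_pairs` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

def candidate_pairs(table_edges, task_edges):
    """Pair each table with each task that shares at least one connection object.

    table_edges / task_edges are hypothetical (entity, neighbor) pairs.
    """
    tables_by_neighbor = defaultdict(set)
    tasks_by_neighbor = defaultdict(set)
    for table, neighbor in table_edges:
        tables_by_neighbor[neighbor].add(table)
    for task, neighbor in task_edges:
        tasks_by_neighbor[neighbor].add(task)

    pairs = set()
    # Only neighbors seen on both sides can produce a candidate pair.
    for neighbor in tables_by_neighbor.keys() & tasks_by_neighbor.keys():
        pairs.update(product(tables_by_neighbor[neighbor], tasks_by_neighbor[neighbor]))
    return pairs

table_edges = [("table1", "job_a"), ("table2", "job_b")]
task_edges = [("job1", "job_a"), ("job2", "job_c")]
blocked = candidate_pairs(table_edges, task_edges)
```

Only the pairs in `blocked` would go on to similarity matching, which is where the reduction in matching operations comes from.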
In one implementation, as shown in fig. 9, a method for determining blood-edge relationship data is provided, which specifically includes the following parts:
P1, obtaining a task read-write data table corresponding to a data development task
Specifically, task log data corresponding to each data development task to be executed are obtained and parsed to extract the executed SQL logic carried in the task log data, and the task read-write data table of the task is obtained by parsing that SQL logic. During data development, data generally flows between different data development tasks, and when the data flows to an SQL task the corresponding SQL logic is executed. The executed SQL logic can be understood as the SQL statements to be executed, such as a select statement, a from statement, a to statement, and the like.
When the executed SQL logic is parsed, an SQL parsing engine (such as Apache Calcite, which can be understood as a dynamic data management framework) may be used to obtain the task read table and the task write table of the data development task, and thereby the task read-write data table of the task.
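As a rough illustration of what extracting read and write tables from SQL looks like, the sketch below uses two regular expressions as a stand-in for a real SQL parsing engine. It is an assumption-laden simplification: the patterns, the function name `read_write_tables`, and the sample statement are hypothetical, and a regex cannot handle subqueries, comments, or quoted identifiers the way a full parser can.

```python
import re

# Hypothetical patterns: tables after FROM/JOIN are reads, tables after INSERT are writes.
READ_PATTERN = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)
WRITE_PATTERN = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)

def read_write_tables(sql):
    """Return (tables read, tables written) for one SQL statement."""
    reads = set(READ_PATTERN.findall(sql))
    writes = set(WRITE_PATTERN.findall(sql))
    return reads, writes

sql = (
    "INSERT INTO dw.table3 "
    "SELECT a.id FROM ods.table1 a JOIN ods.table2 b ON a.id = b.id"
)
reads, writes = read_write_tables(sql)
```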
P2, obtaining a task input-output storage configuration data table corresponding to the data development task
Specifically, task configuration data corresponding to each data development task to be executed are obtained and parsed to obtain the input configuration data and output configuration data of the data development task. Based on the input configuration data and the output configuration data, the read-write data tables for the task types, including tasks that export data from storage and tasks that import data into storage, are obtained, yielding the task input-output storage configuration data table.
P3, determining the blood-edge relation of the primary data
Specifically, the task read-write data table and the task input-output storage configuration data table are merged to obtain the primary data blood-edge relationship between the data tables and the data development tasks.
Further, the two data tables are merged by a union operation: the data development tasks and their read-write data tables recorded in the task read-write data table and in the task input-output storage configuration data table are combined, which yields the primary data blood-edge relationship between the data tables and the data development tasks. The union operation performs no sorting and does not delete repeated rows; it simply concatenates the rows of the two data tables.
The primary data blood-edge relationship can be understood as a dependency relationship between a related data development task and a data table in a task read-write data table and a task input-output storage configuration data table, and the dependency relationship comprises data tables which respectively need to read data and write data when different tasks are executed.
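The union described in P3 keeps row order and duplicates, which matches the behavior of SQL `UNION ALL`. A minimal sketch, assuming each row is a hypothetical `(task, table, access)` tuple:

```python
def union_all(read_write_rows, storage_config_rows):
    """Concatenate two row sets without sorting or removing repeated rows."""
    return list(read_write_rows) + list(storage_config_rows)

# Hypothetical rows: (data development task, data table, access mode).
read_write_rows = [("job1", "table1", "read"), ("job1", "table2", "write")]
storage_config_rows = [("job1", "table2", "write"), ("job2", "table2", "read")]

primary_lineage = union_all(read_write_rows, storage_config_rows)
```

The repeated `("job1", "table2", "write")` row appears twice in `primary_lineage`, as the text specifies.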
P4, obtaining a table attribute data table corresponding to the data table
Specifically, a table attribute data table corresponding to the data table is determined according to the various metadata information in the data development task. Metadata can be understood as information describing the attributes of data, and supports functions such as indicating storage locations, tracking historical data, searching resources, and recording files. From the data attribute information of the metadata involved in the data development task (such as the name, storage location, and creation time of the data), the table attribute data corresponding to the data table can be determined, including the table name, the responsible person, the creation time of the table, the partition generation time, and the like, and may further include the generation task corresponding to the data table. The table attribute data table corresponding to the data table is thereby obtained.
P5, determining a data table entity map taking the data table as an entity
Specifically, a data table entity map is generated with the data table as an entity based on the primary data blood relationship and the table attribute data table.
Further, the association key shared by the primary data blood-edge relationship and the table attribute data table is determined, and the two are joined on that key to obtain map relationship data with the data table as the entity, that is, the data table entity map. The association key may be an attribute present in both the primary data blood-edge relationship and the table attribute data table, such as the data table itself (table1, table2, table N, and the like).
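The join on the association key can be pictured as attaching attributes and edges to each table node. The sketch below is one possible in-memory layout, with the dictionary structure and the name `build_table_entity_map` chosen for illustration only.

```python
def build_table_entity_map(lineage_rows, table_attributes):
    """Join lineage rows with table attributes on the table name (the association key)."""
    entity_map = {}
    for task, table, access in lineage_rows:
        node = entity_map.setdefault(
            table,
            {"attributes": table_attributes.get(table, {}), "edges": []},
        )
        # Each edge records which task touches the table and how.
        node["edges"].append((task, access))
    return entity_map

lineage_rows = [
    ("job1", "table1", "read"),
    ("job1", "table2", "write"),
    ("job2", "table2", "read"),
]
table_attributes = {"table1": {"owner": "alice"}, "table2": {"owner": "bob"}}
table_entity_map = build_table_entity_map(lineage_rows, table_attributes)
```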
P6, obtaining a task dependency relationship data table corresponding to the data development task
Specifically, the scheduling dependency relationships among the data development tasks are obtained, and a task dependency relationship data table corresponding to the data development tasks is generated according to the scheduling dependency relationships.
Further, task scheduling execution data of different data development tasks on different data platforms are obtained, and the scheduling dependency relationships among the data development tasks are determined according to the task scheduling execution data. The task scheduling execution data can be understood as the execution order of data development tasks on the same platform or across platforms. For example, if job1 is executed before job2, a scheduling dependency relationship between job1 and job2 can be determined. By obtaining the detailed execution conditions of the different data development tasks, the scheduling dependency relationships among them are obtained, and the task dependency relationship data table corresponding to the data development tasks is generated from those relationships.
P7, obtaining a task attribute data table corresponding to the data development task
Specifically, a task attribute data table corresponding to the data development task is determined according to various metadata information in the data development task.
Metadata can be understood as information describing the attributes of data, and supports functions such as indicating storage locations, tracking historical data, searching resources, and recording files. From the data attribute information of the metadata involved in the data development task (such as the name, storage location, and creation time of the data), the task attribute data corresponding to the data development task can be determined, including the task name, the responsible person, the task creation time, and the execution time of the associated instance of the task. The task attribute data table is thereby obtained.
P8, determining a task entity map taking tasks as entities
Specifically, a task entity map taking a task as an entity is generated according to the task dependency relationship data table and the task attribute data table.
Further, the task dependency relationship data table and the task attribute data table are combined to generate the task entity map with the task as the entity. The association key shared by the two data tables is determined, and the two tables are joined on that key to obtain map relationship data with the task as the entity, that is, the task entity map. The association key may be an attribute present in both the task dependency relationship data table and the task attribute data table, such as the data development task itself (job1, job2, job N, and the like).
P9, generating an entity fusion map through entity fusion, and determining updated blood relationship data according to the entity fusion map
Specifically, based on the data table entity map and the task entity map, determining a data table and a data development task of the same neighbor information, determining matching similarity between the data table and the data development task of the same neighbor information, further carrying out entity fusion on the data table entity map and the task entity map, obtaining an entity fusion map, and determining updated blood-edge relationship data based on the entity fusion map.
The attribute information of the same category between the data table and the data development task is determined, and the similarity determination logic corresponding to each category of attribute information is obtained. Each attribute similarity between the data table and the data development task of the same neighbor information is then determined according to the similarity determination logic and the attribute information of the corresponding category, and the matching similarity between them is determined according to the attribute similarities and the weight data corresponding one-to-one to the attribute similarities.
Specifically, the attribute similarities between the data table and the data development task include the similarities between the following pairs of attribute categories: the responsible person of the data table and the responsible person of the data development task, the creation time of the table and the creation time of the task, the partition generation time and the execution time of the associated instance, the upstream task queue and the upstream table queue, and the downstream task queue and the downstream table queue.
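The text does not specify the similarity determination logic for each category, so the sketch below picks simple stand-ins purely for illustration: a decaying score for time differences, exact match for the responsible person, and Jaccard overlap for upstream queues (assuming the queues have already been mapped to comparable identifiers). All function names, field names, and the `scale` constant are hypothetical. The matching similarity is then the weighted sum of the attribute similarities.

```python
def time_similarity(a, b, scale=3600.0):
    """Closer timestamps (seconds) score nearer 1; `scale` is an assumed constant."""
    return 1.0 / (1.0 + abs(a - b) / scale)

def owner_similarity(a, b):
    """Exact match on the responsible person."""
    return 1.0 if a == b else 0.0

def jaccard(a, b):
    """Overlap of two identifier collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def attribute_similarities(table, task):
    return {
        "creation_time": time_similarity(table["created"], task["created"]),
        "owner": owner_similarity(table["owner"], task["owner"]),
        "partition_vs_instance": time_similarity(
            table["partition_time"], task["instance_time"]
        ),
        "upstream": jaccard(table["upstream"], task["upstream"]),
    }

def matching_similarity(similarities, weights):
    """Weighted sum, one weight per attribute similarity."""
    return sum(weights[name] * value for name, value in similarities.items())

table = {"created": 0, "owner": "alice", "partition_time": 100, "upstream": ["a", "b"]}
task = {"created": 3600, "owner": "alice", "instance_time": 100, "upstream": ["b", "c"]}
sims = attribute_similarities(table, task)
score = matching_similarity(sims, {name: 0.25 for name in sims})
```

In practice the weights would come from the regression training described later in the text rather than being set uniformly.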
Further, a target data table entity map and a target task entity map whose matching similarity is greater than a preset similarity threshold are obtained. Processing logic is applied to the target data table corresponding to the target data table entity map, the target data table is packaged into the target data development task corresponding to the target task entity map, and entity fusion is performed on the target data table and the target data development task to obtain an entity fusion map. The updated blood-edge relationship data can then be determined according to the blood-edge relationships between data tables in the data table entity map, the blood-edge relationships between data development tasks in the task entity map, and the entity fusion map.
In one embodiment, processing logic is applied to the target data table corresponding to the target data table entity map, the target data table is packaged into the target data development task corresponding to the target task entity map, and entity fusion is performed on the target data table and the target data development task to obtain an entity fusion map. The data table and the data development task are aligned and fused by means of the knowledge graph, so that the data development tasks are chained with the upstream and downstream read-write relationships of the data tables, the dependency and association relationships among data are completed, and blood-edge coverage among the data is improved.
When processing logic is applied to the target data table corresponding to the target data table entity map, ETL processing (Extract, Transform, Load: the process of extracting data from a source, transforming it, and loading it to a destination) may be used. Specifically, the target data table in the target data table entity map is extracted, transformed, loaded, and packaged into the target data development task corresponding to the target task entity map, which fuses the two entities, the target data table and the target data development task, and yields the final entity fusion map.
In one embodiment, after the entity fusion map is obtained, updated blood-relationship data is further determined according to blood-relationship between data tables in the data-table entity map, blood-relationship between data development tasks between task entity maps, and the entity fusion map.
Specifically, the connection relationships between data tables in the data table entity map can be used as the blood-edge relationships between those data tables, and likewise the connection relationships between data development tasks in the task entity map can be used as the blood-edge relationships between those data development tasks. Then, from the existing blood-edge relationships between data tables in the data table entity map, the blood-edge relationships between data development tasks in the task entity map, and the connection relationships (that is, the blood-edge relationships) of the nodes in the entity fusion map, the newly added blood-edge relationship data resulting from the entity fusion processing can be obtained.
For example, the target data table entity map used for entity fusion processing includes table1, table2, and table3, with a connection relationship between table2 and table3, while the target task entity map includes job1, job2, and job3, with connection relationships between job1 and job2 and between job2 and job3. The entity fusion map obtained by the entity fusion processing contains a table1-job1 node, a table2-job2 node, and a table3-job3 node, with a connection relationship between the table2-job2 node and the table3-job3 node and a connection relationship between the table1-job1 node and the table2-job2 node.
It can be understood that, after the fusion processing, the table1-job1 node inherits the connections of the original job1, so it can be linked to the table2-job2 node. A new connection relationship therefore exists between the table1-job1 node and the table2-job2 node, and the newly added blood-edge relationship data specifically include the connection relationship between table1 and table2.
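The table1/job1 example above can be reproduced with a few lines of code. This sketch assumes every table and task in the example has a matched partner and represents edges as ordered pairs; the function names `fuse` and `new_table_edges` are illustrative only.

```python
def fuse(table_edges, task_edges, matches):
    """Map table edges and task edges onto fused "table-job" nodes."""
    table_to_node = {t: f"{t}-{j}" for t, j in matches}
    task_to_node = {j: f"{t}-{j}" for t, j in matches}
    fused = set()
    for a, b in table_edges:
        fused.add((table_to_node[a], table_to_node[b]))
    for a, b in task_edges:
        fused.add((task_to_node[a], task_to_node[b]))
    return fused

def new_table_edges(table_edges, task_edges, matches):
    """Table-level edges inherited from task edges that were not known before fusion."""
    task_to_table = {j: t for t, j in matches}
    inherited = {(task_to_table[a], task_to_table[b]) for a, b in task_edges}
    return inherited - set(table_edges)

table_edges = {("table2", "table3")}
task_edges = {("job1", "job2"), ("job2", "job3")}
matches = [("table1", "job1"), ("table2", "job2"), ("table3", "job3")]

fused_edges = fuse(table_edges, task_edges, matches)
added = new_table_edges(table_edges, task_edges, matches)
```

Here `added` contains only the table1 to table2 edge, which is the new connection relationship the example describes.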
In one embodiment, the method for determining the weight data corresponding to the attribute similarity one by one comprises the following steps:
initializing weight data of each attribute similarity to obtain initial weights; acquiring a marked standard blood edge relation data set, wherein the marked standard blood edge relation data set comprises metadata with blood edge relation and metadata without blood edge relation; and carrying out regression training on the initial weight of each attribute similarity according to the marked standard blood relationship data set, and training to obtain weight data corresponding to each attribute similarity one by one.
Specifically, after initializing weight data of each attribute similarity, initial weights are obtained, and then regression training is performed on the initial weights, so that weight data for calculating the matching similarity is finally obtained through training.
Specifically, a labeled standard blood-edge relationship data set is obtained, which includes metadata with a blood-edge relationship and metadata without a blood-edge relationship. Using regression training, the initial weights of the attribute similarities are repeatedly adjusted and updated according to the labeled data set until a training end condition is reached, yielding the weight data corresponding one-to-one to the attribute similarities. The training end condition can be understood as either the number of iterations and adjustments of the weights reaching a preset threshold, or the training loss reaching a preset loss value.
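The text says only "regression training", so the sketch below assumes logistic regression fitted by batch gradient descent as one plausible choice. Each sample is a vector of attribute similarities for a table and task pair, the label is 1 when a blood-edge relationship exists and 0 otherwise, and training stops at either of the two end conditions named above. The function name, hyperparameters, and toy data are all hypothetical.

```python
import math

def train_weights(samples, labels, max_iterations=1000, target_loss=0.05,
                  learning_rate=0.5):
    """Fit one weight per attribute similarity with logistic regression."""
    n = len(samples)
    weights = [0.0] * len(samples[0])  # initial weights
    bias = 0.0
    loss = float("inf")
    for _ in range(max_iterations):          # end condition 1: iteration count
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            for i, v in enumerate(x):
                grad_w[i] += (p - y) * v
            grad_b += p - y
        loss /= n
        if loss <= target_loss:              # end condition 2: loss threshold
            break
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias, loss

# Toy labeled set: [creation-time similarity, owner similarity] -> has lineage?
samples = [[0.9, 1.0], [0.8, 1.0], [0.1, 0.0], [0.2, 0.0]]
labels = [1, 1, 0, 0]
weights, bias, loss = train_weights(samples, labels)
```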
In one embodiment, a specific application scenario of the above blood relationship data determining method includes:
1) The blood-edge relationship data obtained by the above blood relationship data determining method are applied to practical business projects such as application development and data development to perform dependency analysis of each data development task, yielding the dependency and association relationships between each data development task and the data tables.
Through diversified similarity recognition and knowledge graph fusion processing, the association and dependency relationships between data tables and data development tasks can be completed and deeper blood-edge relationship data can be mined. This improves blood-edge coverage among data with different attributes and reduces problems caused by low coverage, such as erroneous task dependencies and mistaken governance of valid nodes, which in turn improves data development efficiency and lowers data operation and maintenance costs.
Furthermore, the plaintext SQL processing logic does not need to be acquired for all data development tasks. Diversified similarity recognition is used instead, and the attribute information it relies on is basic attribute information that does not involve actual service information from the project development process. Leakage of actual service information is thereby avoided, and service security during data development is improved.
2) The blood-edge relationship data obtained by the above blood relationship data determining method are applied to data processing. The data assist in finding nodes that have no downstream so that those nodes can be handled, and in scenarios where repeated data development is being judged, data tables with no downstream blood-edge relationship and tables with similar processing links can be identified, which improves data operation and maintenance efficiency.
3) The blood-edge relationship data obtained by the above blood relationship data determining method are applied to downstream impact analysis. Specifically, when an upstream node changes, the blood-edge relationship data are used to evaluate the scope of impact on downstream nodes, so that the affected downstream nodes are determined accurately and promptly and are notified, allowing their processing logic to be adjusted in time.
4) New types of platform are accessed and extended horizontally, and the blood relationship data determining method is applied to the different accessed platforms to generate blood-edge relationship data among data with different attributes. Specifically, the whole calculation link for the multi-attribute similarity calculation uses a common structure for upstream and downstream serial calculation, so a new data source only needs to be connected at the source in the same structure to be merged into the whole blood-edge calculation link. This achieves low-cost horizontal extension and produces accurate and comprehensive blood-edge relationship data.
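Scenarios 2) and 3) above both reduce to simple graph queries once the blood-edge relationship data are available as directed edges. The sketch below assumes edges are `(upstream, downstream)` pairs; the function names and sample graph are hypothetical.

```python
from collections import deque

def nodes_without_downstream(edges):
    """Scenario 2: nodes that never appear as an upstream endpoint."""
    upstream = {up for up, _ in edges}
    everything = upstream | {down for _, down in edges}
    return everything - upstream

def affected_downstream(edges, changed):
    """Scenario 3: every node reachable from the changed node (breadth-first)."""
    children = {}
    for up, down in edges:
        children.setdefault(up, []).append(down)
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(changed)  # guard against cycles reporting the node itself
    return seen

edges = [
    ("table1", "job1"),
    ("job1", "table2"),
    ("table2", "job2"),
    ("job1", "table3"),
]
leaves = nodes_without_downstream(edges)
impacted = affected_downstream(edges, "job1")
```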
In the above blood relationship data determining method, the task entity map corresponding to the data development task to be executed and the data table entity map associated with the data development task are obtained, and the data tables and data development tasks of the same neighbor information are determined based on the two maps. This groups the data tables and data development tasks into blocks before similarity matching, which reduces the number of subsequent matching operations and improves data processing efficiency. The matching similarity between a data table and a data development task of the same neighbor information is determined through diversified similarity identification, and the actual service logic of the data development process does not need to be acquired when the similarity is calculated; the calculation relies only on basic attribute information, which reduces the risk of leaking the actual service logic. Further, entity fusion is performed on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map, and updated blood-edge relationship data are determined based on the entity fusion map. This reduces identification errors and invalid data during data processing and improves the coverage and accuracy of the determined blood-edge relationship data.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a blood-edge relationship data determining device for realizing the blood-edge relationship data determining method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the device for determining blood-edge relationship data provided below may refer to the limitation of the method for determining blood-edge relationship data hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 10, there is provided a blood-edge relationship data determining apparatus including: a map acquisition module 1002, a first determining module 1004, a second determining module 1006, and an entity fusion module 1008, wherein:
The map acquisition module 1002 is configured to acquire a task entity map corresponding to a data development task to be executed, and a data table entity map associated with the data development task.
The first determining module 1004 is configured to determine a data table and a data development task of the same neighbor information based on the data table entity graph and the task entity graph.
A second determining module 1006 is configured to determine a matching similarity between the data table and the data development task of the same neighbor information.
The entity fusion module 1008 is configured to perform entity fusion on the data table entity map and the task entity map whose matching similarity meets a preset fusion condition to obtain an entity fusion map, and to determine updated blood-edge relationship data based on the entity fusion map.
In the above blood-edge relationship data determining apparatus, the task entity map corresponding to the data development task to be executed and the data table entity map associated with the data development task are acquired, and the data tables and data development tasks of the same neighbor information are determined based on the data table entity map and the task entity map. This groups the data tables and data development tasks into blocks before similarity matching, which reduces the number of subsequent matching operations and improves data processing efficiency. Further, the matching similarity between a data table and a data development task of the same neighbor information is determined, and entity fusion is performed on the data table entity map and the task entity map whose matching similarity meets the preset fusion condition to obtain an entity fusion map, so that updated blood-edge relationship data are determined based on the entity fusion map. This reduces identification errors and invalid data during data processing and improves the coverage and accuracy of the determined blood-edge relationship data.
In one embodiment, the map acquisition module is further configured to: acquire the scheduling dependency relationships among the data development tasks, and generate a task dependency relationship data table corresponding to the data development tasks according to the scheduling dependency relationships; determine a task attribute data table corresponding to the data development task according to the various metadata information in the data development task; and generate a task entity map with the task as the entity according to the task dependency relationship data table and the task attribute data table.
In one embodiment, the map acquisition module is further configured to: acquire a task read-write data table and a task input-output storage configuration data table corresponding to a data development task to be executed; merge the task read-write data table and the task input-output storage configuration data table to obtain a primary data blood-edge relationship between the data tables and the data development tasks; determine a table attribute data table corresponding to the data table according to the various metadata information in the data development task; and generate a data table entity map with the data table as the entity based on the primary data blood-edge relationship and the table attribute data table.
In one embodiment, the first determining module is further configured to: determining each data table with connection relation in the data table entity map, and determining each data development task with connection relation in the task entity map; and determining the data table and the data development task with the same connection object as the data table and the data development task with the same neighbor information based on the data tables with the connection relation and the data development tasks with the connection relation.
In one embodiment, the second determining module is further configured to: determining attribute similarity between a data table and a data development task of the same neighbor information; and determining the matching similarity between the data table and the data development task of the same neighbor information according to the attribute similarity and the weight data corresponding to the attribute similarity one by one.
In one embodiment, the second determining module is further configured to: determining attribute information of the same category between the data table and the data development task; the attribute information includes a plurality of categories; and obtaining similarity determination logic corresponding to the attribute information of different categories, and determining the similarity of each attribute between the data table and the data development task of the same neighbor information according to the similarity determination logic and the attribute information of the corresponding category.
In one embodiment, the second determining module is further configured to: initializing weight data of each attribute similarity to obtain initial weights; acquiring a marked standard blood edge relation data set, wherein the marked standard blood edge relation data set comprises metadata with blood edge relation and metadata without blood edge relation; and carrying out regression training on the initial weight of each attribute similarity according to the marked standard blood relationship data set, and training to obtain weight data corresponding to each attribute similarity one by one.
In one embodiment, the entity fusion module is further configured to: acquire a target data table entity map and a target task entity map whose matching similarity is greater than a preset similarity threshold; and apply processing logic to the target data table corresponding to the target data table entity map, package the target data table into the target data development task corresponding to the target task entity map, and perform entity fusion on the target data table and the target data development task to obtain an entity fusion map.
In one embodiment, the entity fusion module is further configured to: and determining updated blood relationship data according to the blood relationship between the data tables in the data table entity patterns, the blood relationship between the data development tasks between the task entity patterns and the entity fusion patterns.
The modules in the above blood-edge relationship data determining apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in FIG. 11. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data such as the task entity map corresponding to the data development task, the data table entity map associated with the data development task, the data table and the data development task of the same neighbor information, the matching similarity between the data table and the data development task of the same neighbor information, the entity fusion map, and the updated blood relationship data. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used to communicate with an external terminal through a network connection. When executed by the processor, the computer program implements a blood relationship data determination method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 11 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the method embodiments described above when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including, but not limited to, user device information and user personal information) and the data (including, but not limited to, data for analysis, stored data, and presented data) involved in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Those skilled in the art will appreciate that all or part of the processes of the method embodiments described above may be accomplished by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples represent only a few embodiments of the application. They are described specifically and in detail, but are not thereby to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be determined by the appended claims.

Claims (13)

1. A method of determining blood relationship data, the method comprising:
acquiring a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task;
determining a data table and a data development task of the same neighbor information based on the data table entity map and the task entity map;
determining matching similarity between the data table and the data development task of the same neighbor information;
and performing entity fusion on the data table entity map and the task entity map whose matching similarity satisfies a preset fusion condition to obtain an entity fusion map, and determining updated blood relationship data based on the entity fusion map.
2. The method of claim 1, wherein generating the task entity map corresponding to the data development task to be executed comprises:
acquiring a scheduling dependency relationship among the data development tasks, and generating a task dependency relationship data table corresponding to the data development tasks according to the scheduling dependency relationship;
determining a task attribute data table corresponding to the data development task according to various metadata information in the data development task;
and generating a task entity map taking the task as an entity according to the task dependency relationship data table and the task attribute data table.
3. The method of claim 1, wherein generating the data table entity map associated with the data development task comprises:
acquiring a task read-write data table and a task input-output storage configuration data table corresponding to the data development task to be executed;
merging and combining data tables based on the task read-write data table and the task input-output storage configuration data table to obtain a primary data blood relationship between the data table and the data development task;
determining a table attribute data table corresponding to the data table according to various metadata information in the data development task;
and generating a data table entity map taking the data table as an entity based on the primary data blood relationship and the table attribute data table.
4. A method according to any one of claims 1 to 3, wherein determining matching similarities between data tables and data development tasks of the same neighbor information comprises:
determining attribute similarity between a data table and a data development task of the same neighbor information;
and determining the matching similarity between the data table and the data development task of the same neighbor information according to the attribute similarity and the weight data corresponding to the attribute similarity one by one.
5. The method of claim 4, wherein determining the attribute similarity between the data table and the data development task for the same neighbor information comprises:
determining attribute information of the same category between the data table and the data development task; the attribute information includes a plurality of categories;
and obtaining the similarity determination logic corresponding to each category of attribute information, and determining each attribute similarity between the data table and the data development task of the same neighbor information according to the similarity determination logic and the attribute information of the corresponding category.
6. The method of claim 4, wherein determining weight data that corresponds one-to-one to each of the attribute similarities comprises:
initializing weight data of each attribute similarity to obtain initial weights;
acquiring a labeled standard blood relationship data set, wherein the labeled standard blood relationship data set comprises metadata having a blood relationship and metadata having no blood relationship;
and performing regression training on the initial weight of each attribute similarity according to the labeled standard blood relationship data set, to obtain weight data in one-to-one correspondence with the attribute similarities.
7. A method according to any one of claims 1 to 3, wherein the determining a data table and a data development task of the same neighbor information based on the data table entity map and the task entity map comprises:
determining each data table with connection relation in the data table entity map, and determining each data development task with connection relation in the task entity map;
and determining, based on the data tables having a connection relationship and the data development tasks having a connection relationship, a data table and a data development task that have the same connection object as the data table and the data development task of the same neighbor information.
8. A method according to any one of claims 1 to 3, wherein the preset fusion condition comprises the matching similarity being greater than a preset similarity threshold; and the performing entity fusion on the data table entity map and the task entity map whose matching similarity satisfies the preset fusion condition to obtain an entity fusion map comprises:
acquiring a target data table entity map and a target task entity map whose matching similarity is greater than the preset similarity threshold;
and applying processing-logic processing to the target data table corresponding to the target data table entity map, encapsulating the target data table into the target data development task corresponding to the target task entity map, and performing entity fusion on the target data table and the target data development task to obtain the entity fusion map.
9. The method of claim 8, wherein determining updated blood relationship data based on the entity fusion map comprises:
determining the updated blood relationship data according to the blood relationships between data tables in the data table entity map, the blood relationships between data development tasks in the task entity map, and the entity fusion map.
10. A blood relationship data determining apparatus, the apparatus comprising:
a map acquisition module, configured to acquire a task entity map corresponding to a data development task to be executed and a data table entity map associated with the data development task;
a first determining module, configured to determine a data table and a data development task of the same neighbor information based on the data table entity map and the task entity map;
a second determining module, configured to determine the matching similarity between the data table and the data development task of the same neighbor information;
and an entity fusion module, configured to perform entity fusion on the data table entity map and the task entity map whose matching similarity satisfies a preset fusion condition to obtain an entity fusion map, and to determine updated blood relationship data based on the entity fusion map.
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 when the computer program is executed.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 9.
13. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 9.
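
By way of illustration only, the determination of a data table and a data development task of the same neighbor information, as recited in claim 7, might be sketched as follows. The sketch assumes that each entity map has already been reduced to a mapping from an entity to the set of objects it is connected to, and that connection objects carry identifiers comparable across the two maps; neither assumption is stated in the claims.

```python
def same_neighbor_pairs(table_neighbors: dict, task_neighbors: dict) -> set:
    """Pair every data table with every task that shares a connection object.

    table_neighbors: {table_id: set of connected object ids}
    task_neighbors:  {task_id: set of connected object ids}
    """
    return {
        (table, task)
        for table, table_conn in table_neighbors.items()
        for task, task_conn in task_neighbors.items()
        if table_conn & task_conn  # at least one common connection object
    }
```

Only the pairs returned here would then be passed on to the matching similarity computation, which limits the number of comparisons.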

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310459188.5A CN116975051A (en) 2023-04-19 2023-04-19 Blood relationship data determination method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310459188.5A CN116975051A (en) 2023-04-19 2023-04-19 Blood relationship data determination method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116975051A true CN116975051A (en) 2023-10-31

Family

ID=88475603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310459188.5A Pending CN116975051A (en) 2023-04-19 2023-04-19 Blood relationship data determination method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116975051A (en)

Legal Events

Date Code Title Description
PB01 Publication