CN115062016A

CN115062016A - Incidence relation extraction method and device and computer equipment

Info

Publication number: CN115062016A
Application number: CN202210666119.7A
Authority: CN
Inventors: 侯义福; 聂冬琴; 赵喆; 王伟
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2022-06-14
Filing date: 2022-06-14
Publication date: 2022-09-16

Abstract

The application relates to an incidence relation extraction method, an incidence relation extraction device, computer equipment, a storage medium and a computer program product, relates to the technical field of big data, and can be used in the field of financial science and technology or other related fields. The method comprises the following steps: identifying a script of a target system to obtain a plurality of SQL sentences contained in the script; analyzing source table information at a source table position corresponding to a target field in an SQL statement; determining a first association relation between a target field and a preset temporary table under the condition that the source table information is a search statement; analyzing the source table information at the source table position corresponding to the target field in the search statement, and determining a second association relation between a preset temporary table at the target table position of the search statement and the source data table until the analyzed source table information corresponding to the target field is the source data table; and in the first incidence relation and the second incidence relation, extracting the target field and the source data table as the target incidence relation of the target system, thereby achieving the effect of extracting the accurate incidence relation of all the fields.

Description

Incidence relation extraction method and device and computer equipment

Technical Field

The present application relates to the field of big data technologies, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for extracting an association relationship.

Background

With the development of the big data platform technology, massive data can be converted into an information set required by a user through processing, and the information set can also be used in various scenes of big data processing, the data flow is complex, the data link is longer and longer, and the relationship between the data is complicated. Especially in the field of financial technology, analyzing the data flow direction among systems is an important channel for analyzing the influence among systems. The field information in the source data table of a complex system may be used by a plurality of subsystems, and how to accurately acquire the flow direction relationship of data information among the systems is an urgent problem to be solved.

In the conventional technology, data flow direction is generally positioned through a data table level relation, but the source data table comprises a plurality of pieces of data, and the data in the target table is not completely the same as the data in the source table, so that the range of data flow direction is expanded, and the accuracy of data influence analysis is reduced.

Disclosure of Invention

In view of the above, it is necessary to provide a correlation extraction method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product, which can improve the accuracy of data impact analysis, in view of the above technical problems.

In a first aspect, the present application provides an association relationship extraction method. The method comprises the following steps:

identifying a script of a target system to obtain a plurality of SQL sentences contained in the script;

analyzing source table information at a source table position corresponding to a target field in the SQL statement;

determining a first association relation between the target field and a preset temporary table under the condition that the source table information is a search statement; analyzing source table information at a source table position corresponding to a target field in the search statement, and determining a second association relation between a preset temporary table at the target table position of the search statement and the source data table until the analyzed source table information corresponding to the target field is a source data table;

and in the first incidence relation and the second incidence relation, extracting the target field and the source data table as a target incidence relation of the target system.

In one embodiment, the identifying the script of the target system to obtain the plurality of SQL statements contained in the script includes:

and identifying the script of the target system according to a preset regular expression to obtain a plurality of SQL sentences contained in the script.

In one embodiment, the method further comprises:

and adding the target association relation between the target field and the source data table to a preset field level blood relationship database.

In one embodiment, the method further comprises:

analyzing the SQL statement to obtain an alias of the source data table and an alias of the preset temporary table;

the extracting the target field and the source data table as the target association relationship of the target system in the first association relationship and the second association relationship includes:

and in the first association relationship and the second association relationship, extracting the target field and the source data table as a target association relationship of the target system according to the alias of a first preset temporary table and the alias of the source data table.

In one embodiment, the method further comprises:

acquiring a first quantity of scripts of the target system and a quantity of SQL sentences corresponding to each script;

calculating the average processing SQL statement quantity of each preset parallel task according to the quantity of the SQL statements corresponding to each script and the quantity of the preset parallel tasks;

and averagely dividing the SQL sentences corresponding to the scripts into preset parallel tasks according to the number of the SQL sentences corresponding to the scripts and the number of the average processing SQL sentences.

In one embodiment, the target table location of the SQL statement comprises a plurality of fields, and the method further comprises:

and searching source table information contained in the source table position corresponding to the field according to a preset search algorithm based on prior knowledge, and extracting the field and the source data table to serve as a target association relation of the target system under the condition that the source table information is a source data table.

In a second aspect, the present application further provides an association relationship extraction apparatus. The device comprises:

the identification module is used for identifying the script of the target system to obtain a plurality of SQL sentences contained in the script;

the first analysis module is used for analyzing the source table information at the source table position corresponding to the target field in the SQL statement;

the second analysis module is used for determining a first association relation between the target field and a preset temporary table under the condition that the source table information is a search statement; analyzing source table information at a source table position corresponding to a target field in the search statement, and determining a second association relation between a preset temporary table at the target table position of the search statement and the source data table until the analyzed source table information corresponding to the target field is a source data table;

and the extracting module is used for extracting the target field and the source data table from the first incidence relation and the second incidence relation to be used as the target incidence relation of the target system.

In one embodiment, the identification module is specifically configured to:

In one embodiment, the association relationship extracting apparatus further includes:

and the adding module is used for adding the target association relation between the target field and the target data corresponding to the source data table to a preset field level blood relationship database.

the alias acquisition module is used for analyzing the SQL statement to obtain an alias of the source data table and an alias of the preset temporary table;

the extraction module is specifically configured to:

and in the first association relation and the second association relation, extracting the target field and the source data table as a target association relation of the target system according to the alias of a first preset temporary table and the alias of the source data table.

the concurrent task setting module is used for acquiring a first quantity of scripts of the target system and a quantity of SQL sentences corresponding to each script; calculating the average processing SQL statement quantity of each preset parallel task according to the SQL statement quantity corresponding to each script and the quantity of preset parallel tasks; and averagely dividing the SQL sentences corresponding to the scripts into preset parallel tasks according to the number of the SQL sentences corresponding to the scripts and the number of the average processing SQL sentences. In one embodiment, the target table position of the SQL statement comprises a plurality of fields, and the incidence relation extracting device further comprises:

and the searching module is used for searching the source table information contained in the source table position corresponding to the field according to a preset searching algorithm based on prior knowledge, and extracting the field and the source data table to be used as the target association relation of the target system under the condition that the source table information is a source data table.

In a third aspect, the application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:

and extracting the target field and the source data table from the first incidence relation and the second incidence relation as a target incidence relation of the target system.

In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

In one embodiment, the identifying a script of a target system to obtain a plurality of SQL statements contained in the script includes:

In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:

The incidence relation extracting method, the incidence relation extracting device, the computer equipment, the storage medium and the computer program product comprise the following steps: identifying a script of a target system to obtain a plurality of SQL sentences contained in the script; analyzing source table information at a source table position corresponding to a target field in an SQL statement; determining a first association relation between a target field and a preset temporary table under the condition that the source table information is a search statement; analyzing the source table information at the source table position corresponding to the target field in the search statement, and determining a second association relation between a preset temporary table at the target table position of the search statement and the source data table until the analyzed source table information corresponding to the target field is the source data table; and extracting the target field and the source data table as the target association relation of the target system in the first association relation and the second association relation. By adopting the method, a plurality of SQL sentences can be obtained based on the script of the target system, and the accurate incidence relation of all the fields in the target system can be obtained based on the source table information among the SQL sentences and the incidence relation among the target fields, so that the terminal can carry out influence analysis according to the accurate incidence relation based on the analysis method in the related technology to obtain the accurate influence analysis result of the target system.

Drawings

FIG. 1 is a flow chart illustrating a method for extracting associations according to an embodiment;

FIG. 2 is a flow diagram illustrating the parallel task partitioning step in one embodiment;

FIG. 3 is a schematic diagram of the system in one embodiment;

FIG. 4 is a flowchart illustrating a method for extracting association relationships in another embodiment;

FIG. 5 is a block diagram showing the structure of an association extracting apparatus according to an embodiment;

FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Today with advanced information technology, mass data needs to be processed and then converted into data sets required by users, with the gradual increase of data processing scenes, the flow direction of the data and links from production to final use of the data become longer and longer, and the complex and intricate relationships among the data are interwoven together like a network. In the field of financial technology, it is a very valuable task to analyze data flow among systems, and data (fields) in a source data table in one system may be used by target tables in other systems. Since the data in the target table is not completely the same as the information in the source data table in the related art, the data in the source data table, which is different from the data in the target table, does not affect the target table even if the data in the source data table is changed. That is, analyzing the data flow direction of a single table dimension in the related art may result in the influence range of the source data table, resulting in a decrease in the accuracy of the intersystem influence analysis.

In an embodiment, as shown in fig. 1, an association relationship extraction method is provided, and this embodiment is exemplified by applying the method to a terminal, it can be understood that the method may also be applied to a server, and may also be applied to a system including the terminal and the server, and is implemented by interaction between the terminal and the server, where the terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device and the like, and the server can be realized by an independent server or a server cluster formed by a plurality of servers. The incidence relation method provided by the embodiment can acquire the field dimension incidence relation, and the incidence relation extraction method comprises the following steps:

step 102, identifying a script of the target system to obtain a plurality of SQL statements (Structured Query Language) included in the script.

The data source system may include a plurality of subsystems, and each subsystem may be a target system. The SQL statement may be a command statement that operates on the database, which may be an add operation, a delete operation, a modify operation, a lookup operation, and so on.

Specifically, the terminal obtains scripts of each subsystem in the data source system, and analyzes the scripts of each subsystem respectively to obtain an SQL statement or multiple SQL statements contained in the scripts of each subsystem.

In one example, for a script of each subsystem, the terminal may perform separator identification on the script to obtain a plurality of separators included in the script, and perform division based on the plurality of separators script to obtain a plurality of SQL statements.

In another example, multiple programming languages may be included in the script. For example, SQL and perl languages may be included. In this way, the terminal can also match the SQL sentences in the script through a preset SQL sentence regular expression, output the sentences successfully matched as SQL sentences, and output the SQL sentences to the SQL sentence storage file corresponding to each subsystem; and removing the sentences which fail to be matched as other information. Thus, each subsystem has an SQL statement storage file that contains all SQL statements in the subsystem.

Optionally, the analyzing, by the terminal, the script of the subsystem to obtain an SQL statement may be:

insert into a select col1,col2 from b；

and 104, analyzing the source table information at the source table position corresponding to the target field in the SQL statement.

The target field of the SQL statement is a field at a target table position, the field is a content field of the target table, and the number of the fields may be one or more. Each field corresponds to source table information at a source table position, namely, each field has a source data table corresponding to the field, and the source table information between the fields can be the same or different and can be determined according to actual application scenarios.

Specifically, the terminal parses the SQL statement, and determines source table information at a source table position corresponding to a target field included in the SQL statement, where the source table information may be a source data table or a nested search statement, and the search statement may be a sub-query statement or a selection statement (select statement).

For example, in the case of an SQL statement of insert into a select col1, col2 from b, the target table of the SQL statement is a, the target fields of the target table locations are col1 and col2, the source table information at the source table location corresponding to the target field col1 is b, and the source table information at the source table location corresponding to the target field col2 is b. b may be a source data table (actual data table) or b may be a lookup statement.

Optionally, the source data table may be in a subsystem with the SQL statement or not in a subsystem with the SQL statement, and in the target system, the subsystem may perform field calling and query across the subsystems.

For example, in the case where the SQL statement is as follows:

insert into xx1col3selectsum(C.col3)fromleft join(select col3+10from xx2 D)C。

the target table of the SQL statement is xx1, the target field at the target table position is sum (c.col3), and the source table information at the source table position corresponding to the target field sum (c.col3) is a lookup statement: left join (select col3+10from xx 2D) C.

And 106, under the condition that the source table information is the search statement, determining a first association relation between the target field and the preset temporary table, analyzing the source table information at the source table position corresponding to the target field in the search statement, and determining a second association relation between the preset temporary table at the target table position of the search statement and the source data table until the analyzed source table information corresponding to the target field is the source data table.

Specifically, the terminal may analyze the SQL statement through an open source analysis module in the PYTHON language, where the open source analysis module may be an SQLPARSE module. In this way, the terminal analyzes the SQL statement to obtain the source table information at the source table position corresponding to the target field included in the target table position of the SQL statement. When the source table information at the source table position corresponding to the target field is the search statement, the terminal may determine an association relationship between the target field and the first preset temporary table, where the source representing the target field is the first preset temporary table, and the first preset temporary table is a table obtained by the search statement. The terminal may analyze the search statement based on the open source analysis module to obtain source table information at a source table position corresponding to a field including the target field in the search statement, and obtain a second association relationship between the first preset temporary table and the source data table when the source table information is the source data table. If the source table information at the source table position corresponding to the field containing the target field is also the search statement in the search statement, the terminal re-analyzes the source table information until the analyzed source table information is the source data table, and obtains the association relation between the preset temporary table and the source data table.

Optionally, in the search statement, the field including the target field may be a field that is exactly the same as the target field, or may be a field that performs processing such as function calculation on the target field, for example, the target field may be sum (c. col3), and the field including the target field may be col3+ 10.

And 108, extracting the target field and the source data table as the target association relation of the target system in the first association relation and the second association relation.

Specifically, the first association relationship is an association relationship between a target table corresponding to the target field and a preset temporary table, the second association relationship may be an association relationship between the preset temporary table corresponding to the target field and a next-layer preset temporary table, the association relationship may be multiple, and the second association relationship further includes an association relationship between the preset temporary table corresponding to the target field and the source data table. The terminal may extract the target field and the source data table as the target association relationship of the target system in the first association relationship and the second association relationship.

In one example, the SQL statement may be:

insert into xx1 col3selectsum(C.col3)from left join(select col3+10from xx2 D)C。

in this way, the terminal first determines that the target table of the SQL statement is xx1, the target field is sum (c. col3), and the source table position of the first layer corresponding to the target field col3 is behind the first source table identifier, which may be from. Thus, the source table information at the source table position of the first layer corresponding to the destination field col3 determined by the terminal may be left join (select col3+10from xx 2D) C. The terminal may determine that the source table information of the first layer is a lookup statement. In the case that the source table information of the first layer is a lookup statement, the terminal may determine a first association relationship between a target field sum (c.col3) of the target table xx1 and a first preset temporary table _ temp1 (the name of the preset temporary table may be preset). Thus, the terminal needs to analyze the lookup statement in the first layer, that is, analyze "left join (select col3+10from xx 2D) C", determine a field col3+10 corresponding to the target field sum (c.col3), and further determine source table information xx2 at the source table position corresponding to the field col3+10, so that the terminal can determine that the source table information xx2 is a source data table, and the terminal can obtain a second association relationship between the first preset temporary table _ temp1 and the source data table xx2 of the field col3+ 10.

A first association relationship, target field sum (c.col3), target table xx1 and a first preset temporary table _ temp 1; second association relationship, field col3+10 corresponding to target field sum (c. col3), first preset temporary table _ temp1 (target table) and source data table xx 2.

In this way, for the target field sum (c.col3), the terminal may extract in the first association relationship and the second association relationship to obtain the target association relationship between the target field and the source data table as the target system. I.e. the source data table for the target field sum (c. col3) is xx2, i.e. the target field sum (c. col3) is from xx 2.

In another example, the source table information xx2 may also be a lookup statement, so that the terminal needs to parse the lookup statement xx2 to obtain a field corresponding to the field col3+10 in the lookup statement xx2 and source table information at the source table position of the field, until the directly parsed source table information is a source data table.

Optionally, the terminal further needs to acquire metadata information of each subsystem, where the metadata information includes a library name, a table name, and a field name. Each subsystem has a corresponding database, the name of the database is the name of the database, each subsystem has a plurality of SQL statements, each SQL statement can relate to a target table, an intermediate temporary table and a source data table in the execution process, the tables are all stored in the database, and the table names can include the name of the target table, the name of the intermediate temporary table and the name of the source data table. The field names may be the names of the various fields contained in the target table.

In the incidence relation extraction method, a script of a target system is identified to obtain a plurality of SQL sentences contained in the script; analyzing source table information at a source table position corresponding to a target field in an SQL statement; determining a first association relation between a target field and a preset temporary table under the condition that the source table information is a search statement; analyzing the source table information at the source table position corresponding to the target field in the search statement, and determining a second association relation between a preset temporary table at the target table position of the search statement and the source data table until the analyzed source table information corresponding to the target field is the source data table; and extracting the target field and the source data table as the target association relation of the target system in the first association relation and the second association relation. By adopting the method, a plurality of SQL sentences can be obtained based on the script of the target system, and the accurate incidence relation of all fields in the target system can be obtained based on the source table information among the SQL sentences and the incidence relation among the target fields, so that the terminal can carry out influence analysis according to the accurate incidence relation based on the analysis method in the related technology, and obtain the accurate influence analysis result of the target system.

In one embodiment, the specific processing procedure of "identifying the script of the target system to obtain a plurality of SQL statements contained in the script" in step 102 includes:

Specifically, the preset regular expression may be an SQL statement regular expression, so that when the script of the target system includes multiple programming languages (e.g., SQL language and perl language), the terminal matches the SQL statement existing in the script through the preset SQL statement regular expression, outputs the successfully matched statement as an SQL statement, and can output the SQL statement to the SQL statement storage file corresponding to each subsystem; and removing the sentences which fail to be matched as other information. Thus, each subsystem has an SQL statement storage file that contains all SQL statements in the subsystem.

In the embodiment, the script is analyzed through the preset regular expression to obtain the SQL statement, so that the effect of accurately analyzing the script can be achieved.

In one embodiment, the association relationship extraction method further includes:

and adding the target association relation corresponding to the target field and the source data table to a preset field level blood relationship database.

Specifically, the terminal can add the target association relations of the source data tables corresponding to the target fields and the plurality of target fields contained in the target table to a preset field level consanguinity relation database, so that the field level consanguinity relation database can be updated in real time.

and analyzing the SQL statement to obtain the alias of the source data table and the alias of the preset temporary table.

Specifically, the terminal may analyze the SQL statement through an open source analysis module in the PYTHON language to obtain an alias of each table included in the SQL statement, that is, each table included in the SQL statement has a table name and an alias of the table, and the table name corresponds to the alias of the table.

For example, in the case where the SQL statement is as follows:

the name of the preset temporary table contained in the SQL statement may be table _ temp1, and the alias of the preset temporary table may be C; the table name of the source data table contained in the SQL statement may be xx2, and the alias of the source data table may be D.

Correspondingly, the specific processing procedure of "extracting the target field and the source data table as the target association relationship of the target system in the first association relationship and the second association relationship" in step 108 includes:

and in the first incidence relation and the second incidence relation, extracting a target field and a source data table as a target incidence relation of a target system according to the alias of the preset temporary table and the alias of the source data table.

Specifically, in the first association relationship and the second association relationship, the terminal may determine a correspondence between the association relationships according to the aliases of the plurality of preset temporary tables and the aliases of the source data table to obtain an association relationship chain, and extract a head portion (target field) of the association relationship chain and a tail portion (source data table) of the association relationship chain as a target association relationship of the target system.

In one embodiment, as shown in fig. 2, the association relationship extracting method further includes:

step 202, acquiring a first number of scripts of the target system and a number of SQL statements corresponding to each script.

Wherein the first number of scripts of the target system is the number of subsystems that the target system contains.

Specifically, the terminal needs to acquire the number of scripts included in the target system and the number of SQL statements included in each script.

And 204, calculating the average processing SQL statement quantity of each preset parallel task according to the SQL statement quantity corresponding to each script and the quantity of the preset parallel tasks.

Specifically, the terminal may calculate the number of all SQL statements included in the target system according to the number of SQL statements corresponding to each script, and may use the sum of the number of SQL statements corresponding to each script as the number of all SQL statements included in the target system. In this way, the terminal can take the ratio of the number of all SQL statements contained in the target system to the number of the preset parallel tasks as the average processing SQL statement number of each preset parallel task. The terminal can calculate by the following formula:

Ct＝floor(S/n)，

where Ct is the average number of SQL statements processed by each preset parallel task, Floor is a Floor function, S is the number of all SQL statements included in the target system, and n is the number of preset parallel tasks.

And step 206, averagely dividing the SQL statements corresponding to the scripts into preset parallel tasks according to the number of the SQL statements corresponding to the scripts and the number of the average processed SQL statements.

Specifically, the terminal can perform analysis by utilizing system resources more fully when a plurality of preset parallel tasks are executed by using a load balancing algorithm based on weighted distribution. The terminal may count the number of SQL statements in all scripts to be analyzed as S, the number of currently used parallel tasks is n, and the number of the files in which the SQL statements are located is P, and then each concurrent theoretical execution SQL count Ct (the average number of SQL statements processed by each preset parallel task) may be calculated according to the above formula when the SQL statements are greater than the preset number threshold. And the terminal distributes the SQL sentences for each preset parallel task according to the quantity of the SQL sentences contained in each file. For example, the method can be based on the principle that a script with more sentences is distributed first and a script with less sentences is distributed later, so that the number of the sentences executed in each concurrency is basically consistent, the situation that a single concurrency execution sentence has more execution time than other concurrencies is avoided, the execution time of each parallel task can be reduced, and the execution effect of each parallel task is improved.

In one embodiment, the SQL statement includes a plurality of fields at the target table location.

Correspondingly, the association relation extraction method further comprises the following steps:

and searching source table information contained in the source table position corresponding to the field according to a preset search algorithm based on prior knowledge, and extracting the field and the source data table to serve as a target association relation of a target system under the condition that the source table information is a source data table.

Specifically, the target table position of the SQL statement analyzed by the terminal includes a plurality of fields, and the priori knowledge in the preset search algorithm based on the priori knowledge may be: and analyzing the source table information corresponding to the source table position corresponding to each field, and if the source table information at the source table position corresponding to the field is a source data table, directly extracting the association relationship between the field and the source data table as a target association relationship of the target system.

For example, one possible first SQL statement may be:

insert into dwods.dam test(col1,co12,co13)

select

A.coll

,B.col2

,sum(C.col3)

from dwsdata.dcm test1

left join dusumdata.dcm test2 B

on A.party_id＝B.party_id

left join(selectcol3+10from dwpdata.dam test3 D)C。

specifically, the plurality of fields contained at the target table location of the SQL statement may be col1, co12, co 13. Thus, the terminal performs a search in a priori knowledge based search algorithm, and the source table information at the source table location of the field col1 may be "dwsdata. dcm test 1", the source table information at the source table location of the field col2 may be "dusumdata. dcm test 2", and the source table information at the source table location of the field col3 may be "select col3+10from dwddr data. dam test 3D". In this way, the terminal can determine that the source data table corresponding to the field col1 is "dwsdata. dcm test 1" and the source data table of the field col2 is "dusumdata. dcm test 2", and the terminal can output the association as a target association of the target system and add the target association to a preset field-level blood relationship database. Therefore, the terminal can delete the association relation of the target table corresponding to the field so as to narrow the subsequent searching range and improve the extraction efficiency of the association relation.

The specific implementation process of the association extraction method provided in the embodiment of the present invention is described in detail below with reference to a specific schematic diagram, and specifically, the association extraction method is applied to an association extraction system, where the structural schematic diagram of the association extraction system may be as shown in fig. 3, and the association extraction system includes a data source system unit 1, a system job script SQL statement extraction unit 2, a source system metadata extraction unit 3, a system job script SQL parsing unit 4, a target table and source table field association extraction unit 5, a field level blood relationship extraction unit 6, a field level blood relationship storage unit 7, an analyst workstation unit 8, and an influence analysis result unit 9.

The data source system unit 1 includes a plurality of subsystems, and different independent business process logics exist among the subsystems. The incidence relation extraction system can extract SQL sentences from all the subsystem scripts in the data source system unit 1 by using regular expression matching through the system operation script SQL sentence extraction unit 2, and filter other information except the SQL sentences to output the information to a file. And the metadata information in the data source system is extracted by the source system metadata extraction unit 3, so that each field information included in the target table is obtained when the SQL statement performs field extraction.

The incidence relation extraction system can analyze the SQL statement through the system operation script SQL analysis unit 4 to obtain a plurality of incidence relations and corresponding relations between table names of the tables and table aliases. Specifically, the SQL statement may be decomposed into an association relationship between a target table (including an intermediate temporary table) and a source table field, and a corresponding alias relationship between the source table and the intermediate temporary table and the sub-query by an open source SQL parsing module SQLPARSE in the PYTHON language.

For example, for the above possible first SQL statement, the extraction result of the system job script SQL parsing unit 4 may be as shown in table 1 and table 2:

TABLE 1

TABLE 2

Source meter	Source table alias
		dwpdata.dam test3	D
dwsdata.dcm test1	A
		dusumdata.dcm test2	B
table_temp1	C

The incidence relation extraction system can search the corresponding field relation between the target table and the source table in the system cache through a quick search algorithm based on prior knowledge through the target table and source table field incidence relation extraction unit 5, and for a single script analysis result (the field source of the target table and the alias information of the source table), two pieces of prior knowledge can be preset:

prior knowledge 1: firstly, filtering the temporary table and the sub-query condition by searching the target table, and directly searching the field source of the target table;

prior knowledge 2: and filtering the target table in the process of circularly traversing the source table to reduce the search range.

The extraction result of the source table field association extraction unit 5 may be as shown in table 3 below:

TABLE 3

The association relation extraction system may perform field name extraction by the field level consanguinity relation extraction unit 6. I.e., a process that removes all other information except the field name, such as TRIM (), the COALEASE () function, etc. The field names can be extracted using regular expressions in this cell. For example, the field "Sum (col3+ 10)" culling may be modified to "col 3".

The extraction result of the field level kindred relation extraction unit 6 may be as shown in table 4:

in this way, the extraction result of the field level kinship extraction unit 6 can be stored in the field kinship table in the database, i.e., the field level kinship storage unit 7.

The analyst workbench unit 8 and the influence analysis result unit 9 may be used for analyzing the association relationship between the fields by the user.

As shown in fig. 4, the complete extraction procedure of the genetic relationship of the fields includes the following steps:

s401, preprocessing to obtain an sql file and system metadata information;

s402, performing concurrent analysis on the extracted sql file according to the concurrent number and based on a load balancing algorithm of weighted distribution, so that the system resources can be utilized to the maximum extent to quickly acquire the relationships between the field and the alias of the target table and the relationship between the source table and the alias. And judging whether the analysis is finished. If the parsing step of the sql file is completed, S407 is performed, and if the parsing step of the sql file is not completed, S403 is performed.

S403, analyzing the sql file.

S404, judging whether the analysis is finished, if not, returning to the step S403 for repeated execution; if the parsing is completed, S405 is performed.

S405, based on the breadth-first search algorithm of the priori knowledge, quickly extracting the blood relationship correlation of the field of the target table, namely searching the corresponding field relationship between the target table and the source table.

S406, whether the last field is already determined is judged. If it is the last field, executing S407; if not, returning to execute S405.

And S407, removing the key word information by regular matching to obtain all the source table field information.

And S408, storing the field level blood relationship in a database table to finish the back exit program.

In an example, according to the method for extracting an association provided by the embodiment of the present invention, a target association in a target system, that is, an association between a target field and a source data table, may be obtained. The actual application scenario for performing impact analysis according to the target association relationship in the target system may be, for example: after a field in an upstream table is changed, how to determine the scope of influence of the changed field in the upstream table. The specific process for performing the impact analysis may be: after determining that the target field in the upstream table changes, the terminal may determine the target association relationship corresponding to the target field according to the association relationship extraction method provided in the above embodiment. Therefore, the target table related to the field is determined according to the target association relationship corresponding to the field, and the target table can be modified in a targeted manner, and the modification manner can be determined according to the actual application scenario, which is not limited in the present invention. By the method for extracting the association relation, provided by the embodiment of the invention, the influence range of the changed field in the upstream table can be quickly and accurately determined, and the operation and maintenance efficiency of the target system and the like are improved.

The method provided by the embodiment of the invention can extract the field level blood relationship among all the subsystem operation scripts related to the data warehouse system, store the field level blood relationship in the physical table, support real-time updating, and accurately query the dependency relationship of the field in all the subsequent subsystems through the field level blood relationship table, so that accurate and rapid impact analysis can be performed, the analysis effectiveness is improved, the labor cost is reduced, and the analysis accuracy is improved.

It should be understood that, although the steps in the flowcharts related to the embodiments are shown in sequence as indicated by the arrows, the steps are not necessarily executed in sequence as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a part of the steps in the flowcharts related to the above embodiments may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of performing the steps or stages is not necessarily sequential, but may be performed alternately or alternately with other steps or at least a part of the steps or stages in other steps.

Based on the same inventive concept, the embodiment of the present application further provides an association relationship extraction apparatus for implementing the association relationship extraction method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme described in the method, so the specific limitations in one or more embodiments of the association extraction device provided below may refer to the limitations on the association extraction method in the above, and are not described herein again.

In one embodiment, as shown in fig. 5, there is provided an association relation extraction apparatus 300 including:

the identification module 301 is configured to identify a script of a target system to obtain a plurality of SQL statements included in the script;

a first parsing module 302, configured to parse source table information at a source table position corresponding to a target field in an SQL statement;

the second parsing module 303 is configured to determine a first association relationship between the target field and the preset temporary table when the source table information is the search statement; analyzing the source table information at the source table position corresponding to the target field in the search statement, and determining a second association relation between a preset temporary table at the target table position of the search statement and the source data table until the analyzed source table information corresponding to the target field is the source data table;

and the extracting module 304 is configured to extract the target association relationship between the target field and the source data table as the target association relationship of the target system in the first association relationship and the second association relationship.

The modules in the association relationship extraction apparatus 300 may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, the identification module is specifically configured to:

and the adding module is used for adding the target association relation corresponding to the target field and the source data table to a preset field level blood relationship database.

the alias acquisition module is used for analyzing the SQL statement to obtain an alias of the source data table and an alias of a preset temporary table;

an extraction module specifically configured to:

and in the first association relation and the second association relation, extracting a target field and a source data table as a target association relation of a target system according to the alias of the first preset temporary table and the alias of the source data table.

the concurrent task setting module is used for acquiring a first number of scripts of the target system and the number of SQL sentences corresponding to each script; calculating the average processing SQL statement quantity of each preset parallel task according to the SQL statement quantity corresponding to each script and the quantity of the preset parallel tasks; and averagely dividing the SQL sentences corresponding to the scripts into preset parallel tasks according to the number of the SQL sentences corresponding to the scripts and the number of the average processing SQL sentences. In one embodiment, the target table location of the SQL statement comprises a plurality of fields, and the association extracting apparatus further comprises:

and the searching module is used for searching the source table information contained in the source table position corresponding to the field according to a preset searching algorithm based on priori knowledge, and extracting the field and the source data table as a target association relation of a target system under the condition that the source table information is the source data table.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the association relation data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an association relationship extraction method.

Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.

It should be noted that the method and apparatus in the embodiments of the present disclosure may be used in the field of big data technology, and may be used in the field of financial technology or other related fields.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high-density embedded nonvolatile Memory, resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene Memory, and the like. Volatile Memory can include Random Access Memory (RAM), external cache Memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a block chain based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing based data processing logic devices, etc., without limitation.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features. The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application should be subject to the appended claims.

Claims

1. An incidence relation extraction method, characterized in that the method comprises:

identifying a script of a target system to obtain a plurality of Structured Query Language (SQL) statements contained in the script;

2. The method according to claim 1, wherein the identifying the script of the target system to obtain the plurality of SQL statements contained in the script comprises:

3. The method according to claim 1 or 2, characterized in that the method further comprises:

4. The method of claim 2, further comprising:

5. The method of claim 1, further comprising:

calculating the average processing SQL statement quantity of each preset parallel task according to the SQL statement quantity corresponding to each script and the quantity of preset parallel tasks;

6. The method of claim 1, wherein the SQL statement comprises a plurality of fields at target table locations, the method further comprising:

7. An association relationship extraction apparatus, characterized in that the apparatus comprises:

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 6 when executed by a processor.