CN117493391A - Task matching method, device, computer equipment and storage medium - Google Patents

Task matching method, device, computer equipment and storage medium

Info

Publication number
CN117493391A
Authority
CN
China
Prior art keywords
task
field
scheduling
data table
wide
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310770053.0A
Other languages
Chinese (zh)
Inventor
聂志学
周家林
王思远
蒋宁
吴海英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202310770053.0A
Publication of CN117493391A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2452 Query translation
    • G06F16/24526 Internal representations for queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24542 Plan optimisation
    • G06F16/24545 Selectivity estimation or determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Operations Research (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a task matching method, a task matching device, computer equipment and a storage medium, and relates to the technical field of data analysis. The method comprises: acquiring wide table configuration information of a wide table data table; parsing the task text of each scheduling task to obtain the task field blood relationship of each scheduling task; performing field traversal collision between the wide table configuration information of the wide table data table and the task field blood relationship of each scheduling task to obtain a first traversal collision result, where the first traversal collision result comprises field hit information of each scheduling task relative to the wide table data table; and determining, from the scheduling tasks and based on the first traversal collision result, the target scheduling tasks matched with the wide table data table. The method can improve the accuracy of matching scheduling tasks with the wide table data table.

Description

Task matching method, device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a task matching method, a task matching device, computer equipment and a storage medium.
Background
The wide table data table refers to a database table with a relatively large number of fields, and generally refers to a database table in which indexes, dimensions and attributes related to a service theme are related together. Through the use of the wide-table data table, the multi-table association problem during multi-dimensional analysis can be solved, and the speed of data query and the convenience of analysis operation can be improved. It follows that the use of a wide table data table may improve the efficiency of data analysis and data processing.
To increase the utilization of wide table data, scheduling tasks can be matched against a wide table data table to determine which tasks match it, so that a matched scheduling task, when scheduled for execution, calls the wide table data table instead of its source data tables. How to determine the scheduling tasks that potentially match a wide table data table is therefore one of the current research hot spots in the field of data processing.
Disclosure of Invention
The embodiment of the application provides a task matching method, a task matching device, computer equipment and a storage medium, which can improve the matching accuracy of a wide table data table and a scheduling task. The technical scheme is as follows:
In one aspect, a task matching method is provided, the method including:
acquiring wide table configuration information of a wide table data table; the wide table configuration information comprises the source data table to which each field in the wide table data table belongs and the attribute information of the field in the source data table to which it belongs;
parsing the task texts of N scheduling tasks to obtain the task field blood relationships of the N scheduling tasks; the task field blood relationship of the ith scheduling task represents the source data table called by the ith scheduling task when it is executed and the attribute information of the called fields; N is an integer greater than 1, i is an integer greater than 1 and less than or equal to N;
performing field traversal collision between the wide table configuration information and the task field blood relationships of the N scheduling tasks to obtain a first traversal collision result; the first traversal collision result comprises field hit information of each scheduling task relative to the wide table data table;
and determining, from the N scheduling tasks and based on the first traversal collision result, the target scheduling task matched with the wide table data table.
In one aspect, a task matching device is provided, the device comprising:
an acquisition module, used for acquiring the wide table configuration information of each wide table data table; the wide table configuration information of each wide table data table comprises the source data table to which each field in the corresponding wide table data table belongs and the attribute information of the field in the source data table to which it belongs;
a parsing module, used for parsing the task texts of N scheduling tasks to obtain the task field blood relationships of the N scheduling tasks; the task field blood relationship of the ith scheduling task represents the source data table called by the ith scheduling task when it is executed and the attribute information of the called fields; N is an integer greater than 1, i is an integer greater than 1 and less than or equal to N;
a collision module, used for performing field traversal collision between the wide table configuration information and the task field blood relationships of the N scheduling tasks to obtain a first traversal collision result; the first traversal collision result comprises field hit information of each scheduling task relative to the wide table data table;
and a matching module, used for determining, from the N scheduling tasks and based on the first traversal collision result, the target scheduling task matched with the wide table data table.
In one aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one computer program, the at least one computer program loaded and executed by the processor to implement the task matching method described above.
In one aspect, a computer readable storage medium having at least one computer program stored therein is provided, the computer program being loaded and executed by a processor to implement the task matching method described above.
In one aspect, a computer program product is provided that includes at least one computer program that is loaded and executed by a processor to implement the task matching method provided in the various alternative implementations described above.
The technical scheme provided by this application can include the following beneficial effects:
According to the task matching method, the wide table configuration information of the wide table data table and the task field blood relationship of each scheduling task are obtained, and field traversal collision is performed between the two to obtain a first traversal collision result. Based on the first traversal collision result, the target scheduling task matched with each wide table data table is determined from the scheduling tasks, so that when the target scheduling task is executed, its data source is switched from its source data table to the wide table data table, and the target scheduling task is executed by using the wide table configuration information in the wide table data table. The matching of the wide table data table with the scheduling tasks is therefore a field-level process. Fields are the minimum unit required by both the wide table data table and the execution of a scheduling task, and matching at the level of this minimum unit enables fine-grained matching and improves matching accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 illustrates a flow chart of a task matching method provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a schematic diagram of a standardized splitting process provided by an exemplary embodiment of the present application;
FIG. 3 illustrates a Spark SQL overall execution plan flow chart provided by an exemplary implementation of the present application;
FIG. 4 is a schematic diagram of a field traversal collision process provided by an exemplary embodiment of the present application;
FIG. 5 illustrates a schematic diagram of another field traversal collision process provided by an exemplary embodiment of the present application;
FIG. 6 illustrates a schematic diagram of a data source determination method provided by an exemplary embodiment of the present application;
FIG. 7 illustrates a block diagram of a task matching device provided in an exemplary embodiment of the present application;
FIG. 8 illustrates a block diagram of a computer device provided by an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
First, the terms used in the present application are explained:
1) Data table: also called a data model, an abstraction of data features that describes the static features, dynamic behavior, and constraints of a system at an abstract level, providing a framework for the information representation and operation of a database system. A data table is described in three parts: the data structure, the data operations, and the data constraints. The data table in this application can be a public, stable data table in a data warehouse.
2) Wide table: a database table with a relatively large number of fields, typically one that associates the metrics, dimensions, and attributes related to a business topic. Using a wide table avoids multi-table joins during multi-dimensional analysis and improves both the speed of data queries and the convenience of analysis.
3) Spark SQL (Structured Query Language) execution plan: an execution plan converts SQL into a set of optimized logical and physical operations. In Spark SQL, the execution plan is the process of converting SQL, through the processing and optimization of a series of components, into a group of Jobs (DAGs) that are then submitted to the Spark executors for execution. The execution plan is divided into: a parsed logical plan (Parsed Logical Plan), an analyzed logical plan (Analyzed Logical Plan), an optimized logical plan (Optimized Logical Plan), and a physical execution plan (Physical Plan).
4) Scheduling task: scheduling is the process by which a system executes a task at a specific time or under specific conditions so that the task is completed automatically. Because the system performs scheduling tasks automatically, they free up manpower. A scheduling task in this application can refer to a task that writes output tables for data tables in an offline data warehouse environment under a big data cluster, with a scheduling period of days, weeks, months, and so on.
5) Metadata: also described as "data about data", that is, data that exists to describe the relevant information of other data.
When a scheduling task is processed, data in data tables usually has to be called to complete the task, and each task is usually configured with one or more original data sources. The data required by a single scheduling task may be scattered, so that calls to multiple sources are needed, whereas a wide table data table associates many related fields in one data table, which simplifies data queries. Therefore, to improve the efficiency of data acquisition during the execution of a scheduling task, and in turn the efficiency of task completion, the wide table data table can be used as the data source of the scheduling task. This requires matching wide table data tables with scheduling tasks, so that when a scheduling task is executed, it can be executed by calling the wide table data table matched with it.
In the related art, when a scheduling task is matched with a wide table data table, the table-to-table, table-to-task, and task-to-task relationships are first established through a metadata database. The scheduling task list is then mined through these relationships: 1. the established table-to-table relationship is queried to obtain the list of upstream ODS (Operational Data Store) tables referenced in the processing of the known wide table data table; 2. the established table-to-task relationship is queried to obtain the list of downstream SQL tasks of those ODS tables; 3. the scheduling tasks that already reference the wide table data table are removed from the task list through the established table-to-task relationship, leaving the list of tasks that use the ODS tables directly without referencing the wide table data table. Finally, the scheduling tasks in the resulting list are taken as the scheduling tasks to which the wide table data table is potentially applicable.
In the related art, the range of tasks to which the wide table data table applies is inferred only from the relationships established in the metadata database. The result is limited by the accuracy of the metadata, so its accuracy is low. In addition, the inference is carried out only at the level of the wide table data table name and the scheduling task name, so the confidence of the matching result is low.
To match wide table data tables with scheduling tasks with high accuracy, the embodiments of the application provide a task matching method that determines the matching relationship between a wide table data table and the scheduling tasks by field-level matching. The task matching method of the present application is described below by means of several embodiments.
Referring to fig. 1, a flowchart of a task matching method according to an exemplary embodiment of the present application is provided, and the method may be performed by a computer device, which may be implemented as a server or a terminal. The terminal can comprise a mobile phone, a notebook computer, intelligent interactable equipment, a vehicle-mounted terminal and the like; the servers may include independent physical servers, a server cluster consisting of multiple servers, or cloud servers capable of cloud computing. As shown in fig. 1, the task matching method may include the steps of:
Step 110, obtaining the wide table configuration information of the wide table data table; the wide table configuration information comprises the source data table to which each field in the wide table data table belongs and the attribute information of the field in the source data table to which it belongs.
In one possible implementation, the wide table data table may be any wide table data table in a wide table data table set that has not yet been traversed, where the set may include M wide table data tables and "not traversed" means that the wide table data table has not yet been matched against the scheduling tasks. The matching process is performed for each wide table data table in the set, and the matching process of steps 110 to 140 is the same for each of them, so unless otherwise stated the present application describes the matching process for any one wide table data table as an example.
In one possible implementation, the computer device may obtain the wide table configuration information of the wide table data table from a field reference source table and a field information table of the wide table data table. The field reference source table includes information on the source data table to which each field in the wide table data table belongs, and the field information table includes the attribute information of each field of the wide table data table in its reference source table, where the attribute information may include a field name, a field data type, a field Chinese description, and so on.
When a wide table data table is developed, two needs arise. On the one hand, the wide table has many fields, that is, many columns, so the source information and attribute information of the data of each field must be recorded. On the other hand, when a field is newly added in the source of the field data, the wide table data table needs to be able to add the field automatically and dynamically. For these two reasons, a configuration table is developed alongside the wide table data table to record, for each column (feature) of the wide table data table, the source of the field and the attribute information of the source field. The configuration information of the wide table data table may refer to the information recorded in this configuration table.
Illustratively, table 1 shows configuration information of a broad table data table provided in an exemplary embodiment of the present application:
TABLE 1
Source library name | Source table name | Source field name | Field data type | Field Chinese description
--- | --- | --- | --- | ---
Source library name 1 | Source table name 1 | Source field name 1 | double | xx score
Source library name 2 | Source table name 2 | Source field name 2 | double | xx score
Source library name 3 | Source table name 3 | Source field name 3 | double | xx fusion score
Source library name 4 | Source table name 4 | Source field name 4 | int | Number of xx applications
Source library name 5 | Source table name 5 | Source field name 5 | string | xx classification level
As shown in table 1, the wide table configuration information of the wide table data table includes fields such as a source library name and a source table name indicating source information of field data in the wide table data table; fields indicating attribute information of field data in the wide table data table such as source field name, field data type, field chinese description are also included. It should be noted that the attribute information of the field data shown in table 1 is merely illustrative, and the number and types of the attribute information of the field data may be extracted and set differently based on the actual scheduling task requirement, which is not limited in this application.
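By way of illustration only, the configuration information of Table 1 could be held in memory as one record per wide table field. The sketch below is not part of the application; the class name `WideTableField`, the helper `key`, and all library, table, and field names are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class WideTableField:
    """One row of the wide table configuration (hypothetical structure)."""
    source_library: str   # source library name
    source_table: str     # source table name
    source_field: str     # source field name
    data_type: str        # field data type, e.g. "double", "int", "string"
    description: str      # field description

    def key(self) -> Tuple[str, str, str]:
        # Identify a field by where it comes from, so it can later be
        # compared with the fields a scheduling task reads.
        return (self.source_library, self.source_table, self.source_field)


# Placeholder rows mirroring the shape of Table 1.
wide_table_config = [
    WideTableField("src_lib_1", "src_table_1", "score_a", "double", "xx score"),
    WideTableField("src_lib_4", "src_table_4", "apply_cnt", "int", "number of xx applications"),
    WideTableField("src_lib_5", "src_table_5", "level", "string", "xx classification level"),
]

# Set of source-field keys used for field-level comparison.
config_keys: Set[Tuple[str, str, str]] = {f.key() for f in wide_table_config}
```

Keying each field by its source library, source table, and source field is one possible choice; it keeps fields with the same name in different source tables distinct.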
Step 120, parsing the task texts of the N scheduling tasks to obtain the task field blood relationships of the N scheduling tasks, wherein N is an integer greater than 1.
The task field blood relationship of each scheduling task represents the source data table called by the corresponding scheduling task when it is executed and the attribute information of the called fields. For example, the N scheduling tasks include an ith scheduling task, i is an integer greater than 1 and less than or equal to N, and the task field blood relationship of the ith scheduling task represents the source data table called by the ith scheduling task when it is executed and the attribute information of the called fields.
A scheduling task is a task executed automatically by the system at a specified time or when a specific condition is met; the task text is the task operation text corresponding to the scheduling task. In the embodiments of the application, a scheduling task is executed by calling the data in its original data sources, so parsing the task text of the scheduling task yields the task field blood relationship, which indicates the source data tables to be called and the attribute information of the fields to be called.
It should be noted that the process of obtaining the wide table configuration information of the wide table data table and the process of obtaining the task field blood relationship of the scheduling tasks may be performed separately or simultaneously, which is not limited in this application.
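As a rough illustration of what parsing a task text can yield, the sketch below extracts (library, table, field) triples from a simple SQL statement with regular expressions. It is a deliberately naive stand-in and an assumption of this example, not the parsing described in the application; it handles only `library.table [AS] alias` references and `alias.field` column references, and the function name and sample SQL are hypothetical. A practical implementation would more likely rely on a real SQL parser or on the engine's analyzed execution plan.

```python
import re
from typing import Dict, Set, Tuple

FieldRef = Tuple[str, str, str]  # (source library, source table, field)

# Words that may follow a table name but are not aliases.
_KEYWORDS = "ON|WHERE|JOIN|LEFT|RIGHT|INNER|FULL|CROSS|GROUP|ORDER|LIMIT|UNION|HAVING"
_TABLE_RE = re.compile(
    r"\b(?:FROM|JOIN)\s+(\w+)\.(\w+)(?:\s+(?:AS\s+)?(?!(?:" + _KEYWORDS + r")\b)(\w+))?",
    re.IGNORECASE,
)
_FIELD_RE = re.compile(r"\b(\w+)\.(\w+)\b")


def extract_field_lineage(task_sql: str) -> Set[FieldRef]:
    """Return the source fields a task reads (toy parser, simple SQL only)."""
    aliases: Dict[str, Tuple[str, str]] = {}
    for library, table, alias in _TABLE_RE.findall(task_sql):
        # Without an alias, columns are assumed to be qualified by table name.
        aliases[alias or table] = (library, table)

    lineage: Set[FieldRef] = set()
    for qualifier, field in _FIELD_RE.findall(task_sql):
        if qualifier in aliases:
            library, table = aliases[qualifier]
            lineage.add((library, table, field))
    return lineage


sample_sql = """
INSERT OVERWRITE TABLE dm.report
SELECT o.amount, u.level
FROM ods.orders o
JOIN ods.users AS u ON o.user_id = u.id
"""
sample_lineage = extract_field_lineage(sample_sql)
```

For the sample statement, the result contains the four fields read from `ods.orders` and `ods.users`; the output table `dm.report` is not included because it is written, not read.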
Step 130, performing field traversal collision between the wide table configuration information and the task field blood relationships of the N scheduling tasks to obtain a first traversal collision result.
The first traversal collision result comprises field hit information of each scheduling task relative to the corresponding wide table data table. The field hit information may include a field hit number and a field hit rate. A hit means that two fields coincide, that is, they are the same. The field hit number of the ith scheduling task relative to the wide table data table is the number of fields, among the fields the ith scheduling task needs to reference when executed, that coincide with fields of the wide table data table. The field hit rate is calculated from the field hit number and the number of fields each scheduling task requires.
In the embodiments of the application, the computer device uses the wide table data table as the unit of traversal collision, and performs field traversal collision between the wide table configuration information of the wide table data table and the task field blood relationship of each scheduling task to obtain the first traversal collision result of the wide table data table.
After the wide table configuration information of the wide table data table and the task field blood relationship of each scheduling task are obtained, field traversal collision is performed on the two, that is, the fields in the wide table configuration information are collided with the fields in the task field blood relationship. This mines the task statements that a scheduling task originally completes with source data tables but could complete with the wide table data table, and yields the field hit information of the scheduling task relative to the wide table data table. In the embodiments of the application, the field hit information of each scheduling task relative to the wide table data table comprises a field hit number and a field hit rate. Optionally, the field hit information may further include a hit field set and a matching type, where the matching type indicates whether the scheduling task's access to wide table data tables is dispersed or not dispersed. Dispersed access means that the scheduling task calls data from multiple wide table data tables; non-dispersed access means that the scheduling task calls data from a single wide table data table.
When performing traversal collision, the computer device can take the fields contained in the task field blood relationship of the scheduling task one by one and judge whether each field exists among the fields contained in the wide table configuration information of the current wide table data table. If it does, the name of the field is added to a preset hit field set and one hit is recorded for the corresponding scheduling task. This process is repeated until all fields contained in the task field blood relationship of the scheduling task have been traversed. After the traversal collision of the scheduling task against the current wide table data table is complete, the field hit information of the scheduling task relative to the current wide table data table is computed: the field hit number is the number of fields in the hit field set, and the field hit rate is the ratio of the number of fields in the hit field set to the number of fields of the scheduling task.
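The traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the application: fields are represented as plain strings, the function name `collide_fields` is hypothetical, and the number of fields of the scheduling task is taken to be its number of distinct fields.

```python
from typing import Dict, Iterable, List, Set, Union

HitInfo = Dict[str, Union[List[str], int, float]]


def collide_fields(task_fields: Iterable[str], wide_table_fields: Iterable[str]) -> HitInfo:
    """Collide the fields of one scheduling task with one wide table."""
    wide: Set[str] = set(wide_table_fields)
    seen: Set[str] = set()       # distinct task fields visited so far
    hit_fields: List[str] = []   # hit field set, kept in traversal order

    for field in task_fields:    # take the task's fields one by one
        if field in seen:
            continue             # assumption: repeated references count once
        seen.add(field)
        if field in wide:        # the field also exists in the wide table
            hit_fields.append(field)

    hit_count = len(hit_fields)
    hit_rate = hit_count / len(seen) if seen else 0.0
    return {"hit_fields": hit_fields, "hit_count": hit_count, "hit_rate": hit_rate}


# Hypothetical field names: 3 of the task's 4 fields exist in the wide table.
example = collide_fields(
    ["lib.t1.score", "lib.t1.level", "lib.t2.cnt", "lib.t3.other"],
    ["lib.t1.score", "lib.t1.level", "lib.t2.cnt", "lib.t9.extra"],
)
```

In this example the hit field set holds three fields, so the field hit number is 3 and the field hit rate is 3/4.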
In the embodiments of the present application, for a wide table data table and a scheduling task whose hit rate in the first traversal collision result is greater than a hit rate threshold, the computer device may determine the matching type between them to be non-dispersed access of the scheduling task to the wide table data table, which is recorded in the embodiments of the present application as the first matching type. The value of the hit rate threshold may be set based on actual requirements; illustratively, it may be 0 or any other desired value, which is not limited in this application. Further, the computer device may collect the scheduling tasks and wide table data tables belonging to the first matching type into a first data table, where the first data table includes the task name, the wide table data table name, the hit field set, the field hit number, and the field hit rate. The first data table is the first traversal collision result, from which the target scheduling task matched with the wide table data table can subsequently be determined.
Step 140, determining, from the N scheduling tasks and based on the first traversal collision result, a target scheduling task matched with the wide table data table.
When the target scheduling task matched with the wide table data table is executed, it switches from calling its source data table to calling the wide table data table. The target scheduling task is executed through the wide table configuration information in the wide table data table. Because a wide table data table associates many related fields, this reduces the number of data sources the target scheduling task needs to call, improves the efficiency of data acquisition, and at the same time increases the utilization of existing wide table data tables. The number of target scheduling tasks matched with the wide table data table may be one or more, which is not limited in the embodiments of the present application.
In an embodiment of the application, the computer device may determine the target scheduling task matched with the wide table data table based on the field hit information in the first traversal collision result. In a specific implementation, a first collision result meeting a first hit condition may be selected from the first traversal collision result, and the scheduling task corresponding to the first collision result is then determined to be a target scheduling task matched with the wide table data table. Meeting the first hit condition may mean that the field hit number in the field hit information is greater than a first hit number threshold, and/or the field hit rate in the field hit information is greater than a first hit rate threshold. For example, suppose there are 2 scheduling tasks: the first has a field hit number of 5 and a field hit rate of 30%, the second has a field hit number of 7 and a field hit rate of 60%, the first hit number threshold is 6, and the first hit rate threshold is 50%. The first scheduling task exceeds neither threshold, whereas the second scheduling task exceeds both, so the second scheduling task is determined to be the target scheduling task matched with the wide table data table.
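The two-task example above can be checked with a short sketch. The names `FieldHitInfo` and `meets_first_hit_condition` are hypothetical, and the `require_both` flag is an assumption added here to cover both readings of "and/or".

```python
from typing import List, NamedTuple


class FieldHitInfo(NamedTuple):
    task_name: str
    hit_count: int    # field hit number
    hit_rate: float   # field hit rate as a fraction, e.g. 0.30 for 30%


def meets_first_hit_condition(
    info: FieldHitInfo,
    count_threshold: int,
    rate_threshold: float,
    require_both: bool = True,
) -> bool:
    """Check the first hit condition; both comparisons are strict."""
    count_ok = info.hit_count > count_threshold
    rate_ok = info.hit_rate > rate_threshold
    return (count_ok and rate_ok) if require_both else (count_ok or rate_ok)


# Values taken from the example: thresholds are 6 hits and 50%.
results: List[FieldHitInfo] = [
    FieldHitInfo("task_1", 5, 0.30),
    FieldHitInfo("task_2", 7, 0.60),
]
targets = [r.task_name for r in results if meets_first_hit_condition(r, 6, 0.50)]
targets_either = [
    r.task_name
    for r in results
    if meets_first_hit_condition(r, 6, 0.50, require_both=False)
]
```

With these values only the second scheduling task is selected under either reading, because the first task exceeds neither threshold.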
In summary, in the task matching method provided by the embodiments of the present application, the wide table configuration information of the wide table data table and the task field blood-edge relationship of each scheduling task are acquired; field traversal collision is performed between the wide table configuration information and the task field blood-edge relationship of each scheduling task to obtain a first traversal collision result; and the target scheduling task matched with the wide table data table is determined from the scheduling tasks based on the first traversal collision result. When the target scheduling task is executed, its data source is switched from the source data table of the target scheduling task to the wide table data table, so that the target scheduling task is executed using the wide table configuration information in the wide table data table. The matching between the wide table data table and the scheduling tasks is therefore a field-level matching process: fields are the minimum unit required for executing both the wide table data table and the scheduling tasks, and matching from this minimum unit enables fine-grained matching and improves matching accuracy.
It should be appreciated that, among the target scheduling tasks matched with the wide table data table, there may be scheduling tasks that have already been matched with (that is, already call) the wide table data table; such scheduling tasks do not need to be matched with the wide table data table again. Based on this, in step S140, determining the target scheduling task matched with the wide table data table from the N scheduling tasks based on the first traversal collision result may include:
determining the wide table data tables already called by the N scheduling tasks based on the calling relations between wide table data tables and scheduling tasks that have been established in the metadata database; determining candidate scheduling tasks matched with the wide table data table from the N scheduling tasks based on the first traversal collision result; and performing de-duplication processing on the candidate scheduling tasks according to the wide table data tables already called by the N scheduling tasks, to obtain the target scheduling tasks matched with the wide table data table.
The metadata database table records the calling relations already established between wide table data tables and scheduling tasks. Therefore, based on the calling relation records in the metadata database table, the wide table data tables that have not yet established an application relation with a scheduling task can be screened out from the wide table data tables matched with that scheduling task.
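The de-duplication step can be sketched as follows. This is a minimal illustration that assumes the calling relations read from the metadata database are available as a mapping from task name to the set of wide tables the task already calls; the function name and the mapping layout are hypothetical.

```python
def deduplicate_candidates(candidates, called_tables, wide_table):
    """Drop candidate tasks that already call the given wide table.

    candidates    -- candidate task names from the first traversal collision result
    called_tables -- dict: task name -> set of wide table names the task already calls
                     (assumed to be loaded from the metadata database)
    wide_table    -- name of the wide table currently being matched
    """
    return [t for t in candidates if wide_table not in called_tables.get(t, set())]


called = {"task1": {"wide_table_1"}, "task2": set()}
targets = deduplicate_candidates(["task1", "task2", "task3"], called, "wide_table_1")
print(targets)
```

Here task1 is removed because a calling relation with wide_table_1 is already recorded, while task2 and task3 remain as target scheduling tasks.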
As can be seen from the foregoing, the wide table data table mentioned in steps S110 to S140 may refer to any wide table data table in the wide table data table set that has not yet been traversed. After steps S110 to S140 are performed, if the wide table data table set still contains wide table data tables that have not been traversed, steps S110 to S140 are repeated until every wide table data table in the set has been traversed.
After all of the wide table data tables in the wide table data table set have been traversed, there may still be remaining scheduling tasks among the N scheduling tasks that do not match any wide table data table. This is because the data required by the same scheduling task may exist in different wide table data tables, that is, the scheduling task accesses wide table data tables in a dispersed manner. For example, a scheduling task requires field a and field b, but field a matches one wide table data table and field b matches another; in the above traversal collision, the field hit information with respect to neither of the two wide table data tables may satisfy the first hit condition, so the scheduling task cannot be matched with any single wide table data table. To handle this situation, the embodiments of the present application provide another approach: the wide table configuration information of the M wide table data tables in the wide table data table set is integrated to obtain integrated wide table configuration information; traversal collision is performed between the remaining scheduling tasks and the integrated wide table configuration information to obtain a second traversal collision result; and matched wide table data tables are then determined for the remaining scheduling tasks according to the second traversal collision result. In a specific implementation, the method may include the following steps:
Step 210: if there are remaining scheduling tasks among the N scheduling tasks that are not matched with any wide table data table, integrating the wide table configuration information of the M wide table data tables to obtain integrated wide table configuration information.
If there are remaining scheduling tasks that do not match any wide table data table, that is, the traversal collision result between a remaining scheduling task and each wide table data table does not meet the first hit condition, the remaining scheduling task cannot obtain enough of its required data from a single wide table data table and may need to obtain the required data from multiple wide table data tables. For example, half of the field data required by a certain remaining scheduling task is contained in wide table data table 1 and the other half is contained in wide table data table 2. Therefore, to match wide table data tables for the remaining scheduling tasks, in the embodiments of the present application the wide table configuration information of all the wide table data tables is packaged into one piece of integrated wide table configuration information, and the matching of wide table data tables for the remaining scheduling tasks is performed based on the integrated wide table configuration information.
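One possible way to represent the integrated wide table configuration information is an inverted index from field to the wide tables that contain it. The sketch below assumes that a field is identified by a `(source_table, field_name)` pair and that each wide table's configuration is reduced to a set of such pairs; both the representation and the table names are illustrative assumptions.

```python
def integrate_wide_table_config(configs):
    """Merge the configuration of several wide tables into one field index.

    configs -- dict: wide table name -> set of (source_table, field_name) pairs
    Returns a dict: (source_table, field_name) -> set of wide table names containing it.
    """
    integrated = {}
    for wide_table, fields in configs.items():
        for field in fields:
            integrated.setdefault(field, set()).add(wide_table)
    return integrated


configs = {
    "wide_table_1": {("orders", "order_id"), ("orders", "amount")},
    "wide_table_2": {("users", "user_id"), ("orders", "order_id")},
}
integrated = integrate_wide_table_config(configs)
print(sorted(integrated))
```

Keeping the owning wide tables alongside each field makes it straightforward to recover, after the collision, which wide tables a hit field set belongs to.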
Step 220: performing collision between the integrated wide table configuration information and the task field blood-edge relationship of the remaining scheduling tasks to obtain a second traversal collision result.
The second traversal collision result includes field hit information of the remaining scheduling tasks relative to the integrated wide table configuration information. In this process, the computer device takes a remaining scheduling task as the unit of traversal collision, acquires the fields in the task field blood-edge relationship of the remaining scheduling task, and judges whether each of these fields exists among the fields contained in the integrated wide table configuration information. If so, the name of the field is added to a preset hit field set and one hit is recorded for the remaining scheduling task. This is repeated until all task fields contained in the task field blood-edge relationship of the remaining scheduling task have been traversed. After the traversal collision of the remaining scheduling task against the integrated wide table configuration information is finished, the field hit information of the remaining scheduling task relative to the integrated wide table configuration information is counted. In the embodiments of the present application, this field hit information includes a field hit number and a field hit rate, and the second traversal collision result further includes the hit field set. The hit field set includes a plurality of hit fields, a hit field being a field that appears both in the task field blood-edge relationship of the remaining scheduling task and in the integrated wide table configuration information. The field hit rate of the remaining scheduling task relative to the integrated wide table configuration information is the ratio of the number of fields in the hit field set to the number of fields of the remaining scheduling task.
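The traversal collision described above can be sketched in a few lines of Python. The function name and the returned dictionary layout are assumptions for illustration; fields are represented as plain strings, and duplicate task fields are counted once, which is one reasonable reading of "the number of fields of the remaining scheduling task".

```python
def collide_with_integrated(task_fields, integrated_fields):
    """Collide one remaining task's fields with the integrated wide table fields.

    task_fields       -- fields from the task field blood-edge relationship
    integrated_fields -- set of fields in the integrated wide table configuration
    """
    # De-duplicate while preserving order so each field is counted once.
    unique_fields = list(dict.fromkeys(task_fields))
    hit_fields = [f for f in unique_fields if f in integrated_fields]
    hit_count = len(hit_fields)
    # Hit rate = fields in the hit field set / fields of the task.
    hit_rate = hit_count / len(unique_fields) if unique_fields else 0.0
    return {"hit_fields": hit_fields, "hit_count": hit_count, "hit_rate": hit_rate}


result = collide_with_integrated(["a", "b", "c", "d", "a"], {"a", "b", "c", "x"})
print(result)
```

In this example three of the four distinct task fields are found in the integrated configuration, giving a field hit number of 3 and a field hit rate of 0.75.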
In the embodiments of the present application, for a remaining scheduling task whose field hit rate in the second traversal collision result is greater than the hit rate threshold, the computer device may determine the matching type between the wide table data tables and the remaining scheduling task as "task accesses wide table data tables in a dispersed manner", which is denoted as the second matching type in the embodiments of the present application. Further, the computer device collects the fields matched by the remaining scheduling tasks in the integrated field set into a statistical data set, which may include the task name, the hit field set, the field hit number, and the field hit rate, and generates a second data table based on the hit field set in the statistical data set. The second data table includes the task name, the data table set, the hit field set, the field hit number, and the field hit rate. The second data table is the second traversal collision result, so that at least two matched wide table data tables can be determined for the remaining scheduling tasks based on the second data table.
Step 230: determining matched wide table data tables for the remaining scheduling tasks according to the second traversal collision result.
In the embodiments of the present application, the computer device may screen the data table sets in the second data table (that is, the second traversal collision result) by the field hit number and/or the field hit rate, to obtain a second collision result under the second matching type. The second collision result takes the form of scheduling task-data table set pairs, and each wide table data table in a data table set is determined as a wide table data table matched with the corresponding remaining scheduling task. The number of wide table data tables in a data table set may be two or more, which is not limited in this application. Specifically, if the second traversal collision result meets a second hit condition, at least two wide table data tables to which the hit field set belongs are determined according to the fields included in each wide table data table, where meeting the second hit condition includes at least one of: the field hit number in the second traversal collision result is greater than a second hit number threshold; the field hit rate in the second traversal collision result is greater than a second hit rate threshold.
The at least two wide table data tables to which the hit field set belongs are taken as the wide table data tables matched with the remaining scheduling tasks. It should be noted that the values of the second hit number threshold and the second hit rate threshold in the second hit condition may be set based on actual requirements, which is not limited in this application.
Schematically, if the second traversal collision result of a remaining scheduling task meets the second hit condition, and it is determined, according to the fields included in each wide table data table, that the fields in the hit field set of the second traversal collision result belong to wide table data table 1 and wide table data table 2 respectively, then these two wide table data tables are determined as the wide table data tables matched with the remaining scheduling task.
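Step 230 can be sketched as follows. The sketch assumes that both indexes of the second hit condition must be exceeded (the application also allows using only one of them) and that each wide table's fields are available as a set of strings; the function name and sample table names are hypothetical.

```python
def match_dispersed_wide_tables(hit_fields, task_field_count, table_fields,
                                count_threshold, rate_threshold):
    """Return the wide tables to which the hit field set belongs, or [] if
    the second hit condition is not met.

    hit_fields       -- hit field set from the second traversal collision result
    task_field_count -- number of fields of the remaining scheduling task
    table_fields     -- dict: wide table name -> set of fields it contains
    """
    hit_count = len(hit_fields)
    hit_rate = hit_count / task_field_count if task_field_count else 0.0
    # Assumption: both thresholds must be exceeded.
    if not (hit_count > count_threshold and hit_rate > rate_threshold):
        return []
    hits = set(hit_fields)
    return sorted(t for t, fields in table_fields.items() if fields & hits)


table_fields = {
    "wide_table_1": {"a", "b", "c"},
    "wide_table_2": {"d", "e"},
    "wide_table_3": {"z"},
}
matched = match_dispersed_wide_tables(["a", "b", "c", "d", "e"], 5, table_fields,
                                      count_threshold=4, rate_threshold=0.90)
print(matched)
```

The five hit fields are spread over wide_table_1 and wide_table_2, so these two tables form the data table set matched with the task; wide_table_3 contributes no hit field and is left out.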
If the second traversal collision result does not meet the second hit condition, none of the currently existing wide table data tables, and no data table set, meets the data requirement of the scheduling task, and the subsequent data source update cannot be performed for that task. After the wide table data table or data table set matched with each scheduling task is determined, when each scheduling task is executed, the call to its source data table is switched to a call to the corresponding wide table data table or to the wide table data tables in the corresponding data table set.
In the embodiment of fig. 1, the process of parsing the task texts of the N scheduling tasks to obtain the task field blood-edge relationship of each scheduling task may be implemented as follows:
S301, performing standardized splitting on the task texts of the scheduling tasks to obtain task statement sets of the N scheduling tasks; each task statement set comprises a plurality of standardized task statements.
The task text of a scheduling task depends on what users develop on the big data platform, so the text quality and text conventions of task texts are not uniform. The task text therefore needs to be split and converted into standardized task statements to improve the accuracy of extracting the task field blood-edge relationship. Illustratively, the standardized splitting process may include annotation processing, text segmentation by a target symbol, and special grammar filtering. Annotation processing refers to deleting personalized annotation content from the task text. Text segmentation by a target symbol refers to segmenting the task text into individual task statements at the target symbol; schematically, the target symbol may be a semicolon. Special grammar filtering refers to filtering out special grammar statements in the task text, such as DDL (Data Definition Language) and/or DML (Data Manipulation Language) statements, for example drop/alter/set/refresh/truncate statements. The standardized splitting process may be implemented by a corresponding standardized splitting engine: the task text of each scheduling task is processed by the standardized splitting engine to obtain the task statement set corresponding to each scheduling task, and an association between each scheduling task and its task statement set is formed for subsequent processing.
Fig. 2 shows a schematic diagram of a standardized splitting process provided by an exemplary embodiment of the present application. Taking an SQL task as the scheduling task, after the SQL text corresponding to the scheduling task is obtained, standardized splitting may be performed on the SQL text by an encapsulated SQL standardized splitting engine 210. The standardized splitting process may include annotation processing, splitting the SQL text by semicolons, and special grammar filtering, finally yielding the SQL task statement set corresponding to the SQL task.
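The three stages of standardized splitting can be sketched with the Python standard library. This is a deliberately naive illustration, not the splitting engine of the application: it assumes SQL-style `--` and `/* */` comments, and it does not handle semicolons or comment markers that appear inside string literals. The function name and the sample table names are hypothetical.

```python
import re

# Leading keywords of statements treated as "special grammar" (from the text above).
SPECIAL_KEYWORDS = {"drop", "alter", "set", "refresh", "truncate"}


def standardized_split(task_text):
    """Split raw SQL task text into standardized task statements."""
    # 1) Annotation processing: delete block comments and line comments.
    text = re.sub(r"/\*.*?\*/", " ", task_text, flags=re.DOTALL)
    text = re.sub(r"--[^\n]*", " ", text)
    # 2) Text segmentation by the target symbol (semicolon), normalizing whitespace.
    statements = [" ".join(part.split()) for part in text.split(";")]
    # 3) Special grammar filtering: drop empty and special statements.
    return [s for s in statements
            if s and s.split()[0].lower() not in SPECIAL_KEYWORDS]


sql_text = """
-- daily report
set hive.exec.dynamic.partition=true;
drop table if exists tmp.t1;
/* build result */
insert overwrite table dm.report
select a, b from ods.orders where dt = '2023-06-01';
"""
statements = standardized_split(sql_text)
print(statements)
```

Only the insert-select statement survives, collapsed onto one line, which is the form a downstream execution plan parser would receive.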
S320, performing execution plan analysis on each standardized task sentence in the task sentence set of each scheduling task to obtain an abstract syntax tree corresponding to each standardized task sentence in each scheduling task; the abstract syntax tree comprises configuration information of an original data source corresponding to the task sentence.
An execution plan, also called a query plan or explain plan, is the specific sequence of steps by which the database executes a task statement. Taking an SQL task as the scheduling task, the SQL execution plan here mainly refers to an execution plan based on the Spark SQL engine, in which the SQL statement to be executed is processed by the optimizer (Catalyst), converted into an RDD (Resilient Distributed Dataset), and then submitted to the cluster to execute the corresponding task. Illustratively, fig. 3 shows a flowchart of the overall Spark SQL execution plan provided by an exemplary embodiment of the present application. As shown in fig. 3, the processing of an SQL statement by the optimizer (Catalyst) 310 includes: 1) converting the SQL statement into an unresolved logical execution plan (Unresolved Logical Plan); at this point the correctness of the SQL grammar has been verified, but the correctness of the table names and column names still needs to be verified; 2) based on the metadata information in the Catalog, which describes the attributes and locations of the data sets, verifying the correctness of the table names and column names, and converting the plan into a resolved logical execution plan (Logical Plan); 3) optimizing the resolved logical execution plan to obtain an optimized logical execution plan (Optimized Logical Plan); 4) converting the optimized logical execution plan into physical execution plans (Physical Plans); 5) determining a target physical execution plan (Selected Physical Plan) from the physical execution plans based on cost optimization rules through the CBO (cost-based optimizer); 6) converting the target physical execution plan into an RDD, so that the task corresponding to the SQL statement is executed based on the RDD.
In the process of converting an SQL statement into an RDD by Catalyst, a target physical execution plan is generated; in the embodiments of the present application, an abstract syntax tree of a target task statement may be generated based on the target physical execution plan. The process may be implemented as:
converting the target task statement into a logic execution plan; the target task statement is any one of standardized task statements in each scheduling task;
optimizing the logic execution plan, and converting the optimized logic execution plan into a physical execution plan; determining a target physical execution plan from the physical execution plans according to the cost optimization rule; an abstract syntax tree of target task statements is generated based on the target physical execution plan.
An abstract syntax tree (AST, Abstract Syntax Tree) is a tree representation of the abstract syntax structure of source code, in which each node represents a structure in the source code. The levels of the abstract syntax tree of the physical execution plan include: the Project level, representing the projection operation in SQL, i.e., selecting columns; the HashAggregate level, representing data aggregation; the Exchange level, representing the shuffle phase in which data needs to be moved across the cluster; the BroadcastHashJoin (broadcast hash join) level, representing a hash join based on broadcasting; and so on. In the embodiments of the present application, the planning process of scanning data in Hadoop File Relations is parsed from the Spark Plan (i.e., the Physical Plan), that is, the File Scan node and the Filter node under the Project level are located. The contents of these two nodes are parsed to obtain the selected columns, the screening conditions of the scanned partitions, and the column-based filtering conditions used throughout the SQL execution process, that is, the values of the four attributes Scan, Partition Count, Pushed Filters and Partition Filters are captured. An abstract syntax tree corresponding to the physical execution plan of each standardized SQL task statement is thereby obtained, and the abstract syntax tree comprises configuration information of the original data source corresponding to the task statement, such as information of the source table, partition information of the source table, the fields used when reading the source table, and the filtering conditions used when reading the source table.
The information of the source table comprises a source table name, a file path of the source table, a storage format of the source table and the like; the partition information of the source table indicates the source table partition that needs to be accessed when the SQL task statement is executed.
S330, constructing the blood relationship of the task fields of each scheduling task based on the abstract syntax tree of each standardized task statement in each scheduling task.
The computer device may parse the abstract syntax trees corresponding to the standardized task statements and construct the task field blood-edge relationships of the standardized task statements. A task field blood-edge relationship includes field information such as the task name (task), source table name (table), source field name (column), and operation type (op); the task field blood-edge relationship includes each field corresponding to the standardized task statements.
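A task field blood-edge relationship of this shape can be sketched as a list of `(task, table, column, op)` records. The sketch below assumes that the information extracted from each abstract syntax tree has already been reduced to a simple `SourceScan` structure; both dataclasses, the function name, and the sample table names are hypothetical and only illustrate the record layout named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceScan:
    """Source-table information taken from one statement's syntax tree (hypothetical)."""
    table: str
    columns: tuple
    op: str = "select"


@dataclass(frozen=True)
class LineageRecord:
    """One row of the task field blood-edge relationship."""
    task: str
    table: str
    column: str
    op: str


def build_task_lineage(task, scans):
    """Flatten the scans of all statements of a task into de-duplicated records."""
    records, seen = [], set()
    for scan in scans:
        for column in scan.columns:
            key = (scan.table, column, scan.op)
            if key not in seen:  # the same field may be read by several statements
                seen.add(key)
                records.append(LineageRecord(task, scan.table, column, scan.op))
    return records


scans = [
    SourceScan("ods.orders", ("order_id", "amount")),
    SourceScan("ods.orders", ("order_id",)),
    SourceScan("ods.users", ("user_id",)),
]
lineage = build_task_lineage("task1", scans)
for record in lineage:
    print(record)
```

De-duplicating at this stage keeps the later hit rate from being distorted by fields that several statements of the same task read repeatedly.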
Based on the embodiments shown in fig. 1 and fig. 2 of the present application, the computer device may match all scheduling tasks with all wide table data tables at the same time, that is, match each scheduling task in the task set with each wide table data table in the wide table data table set. After the task field blood-edge relationship of each scheduling task and the wide table configuration information of each wide table data table are acquired, field traversal collision is performed in turn between the task field blood-edge relationship of each scheduling task and the wide table configuration information of each wide table data table to obtain a first traversal collision result for each wide table data table; screening is performed based on the first hit condition to determine a first collision result under the first matching type, where the first collision result comprises a wide table data table and its corresponding scheduling tasks. The scheduling tasks in the first collision result are then eliminated, and field traversal collision is performed between each of the remaining scheduling tasks and the integrated wide table configuration information to obtain a second traversal collision result corresponding to each remaining scheduling task; a second collision result under the second matching type is determined based on the second hit condition and the hit field set, where the second collision result comprises the remaining scheduling tasks and the data table set corresponding to each of them.
Schematically, fig. 4 shows a schematic diagram of a field traversal collision process provided in an exemplary embodiment of the present application. As shown in fig. 4, the task set includes task1 and task2, each task having a corresponding task field blood-edge relationship, and the wide table data table set includes wide table data table 1 and wide table data table 2, each having corresponding wide table configuration information. Traversal collision is performed between the task field blood-edge relationship of task1 and the wide table configuration information of wide table data table 1 and of wide table data table 2, and the hit fields are recorded as collision result records under the corresponding task-wide table data table pair; likewise, traversal collision is performed between the task field blood-edge relationship of task2 and the wide table configuration information of wide table data table 1 and of wide table data table 2, and the hit fields are recorded in the same way. A first data table 410 is generated based on the recorded content; the first data table 410 comprises task names, wide table data table names, hit field sets, field hit numbers and field hit rates, and a first traversal collision result covering each scheduling task-wide table data table combination can be obtained from the first data table. The scheduling task-wide table data table combinations in the first traversal collision result are then screened by the first hit condition, and the combinations that meet the first hit condition are determined as the first collision result for the case in which the task does not access wide table data tables in a dispersed manner (namely, the first matching type). As shown in fig. 4, if the first hit number threshold in the first hit condition is 2, the first hit rate threshold is 60%, and the first hit condition uses both the field hit number and the field hit rate, the first collision result comprises scheduling task 1-wide table data table 1 and scheduling task 2-wide table data table 1.
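The construction of the first data table can be sketched as a nested loop over tasks and wide tables. The sketch assumes fields are represented as strings, that the field hit rate is the share of a task's fields found in the wide table, and that both thresholds must be exceeded; the field values are invented for illustration and are not the contents of fig. 4.

```python
def first_traversal_collision(task_fields, wide_table_fields):
    """Collide every task with every wide table; return rows of the first data table.

    task_fields       -- dict: task name -> set of fields in its blood-edge relationship
    wide_table_fields -- dict: wide table name -> set of fields in its configuration
    """
    rows = []
    for task, fields in task_fields.items():
        for table, table_fields in wide_table_fields.items():
            hit_fields = sorted(fields & table_fields)
            if not hit_fields:
                continue  # no hit, nothing to record for this pair
            rows.append({
                "task": task,
                "wide_table": table,
                "hit_fields": hit_fields,
                "hit_count": len(hit_fields),
                "hit_rate": len(hit_fields) / len(fields),
            })
    return rows


task_fields = {"task1": {"a", "b", "c", "d"}, "task2": {"a", "b", "c"}}
wide_table_fields = {"wide_table_1": {"a", "b", "c", "x"}, "wide_table_2": {"d", "y"}}
rows = first_traversal_collision(task_fields, wide_table_fields)

# Screening with a hit number threshold of 2 and a hit rate threshold of 60%.
first_collision = [(r["task"], r["wide_table"]) for r in rows
                   if r["hit_count"] > 2 and r["hit_rate"] > 0.60]
print(first_collision)
```

Each row corresponds to one line of the first data table (task name, wide table name, hit field set, field hit number, field hit rate), and the final list comprehension applies the first hit condition.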
If, apart from task1 and task2, the other tasks in the task set have no corresponding first collision result under the first matching type, the scheduling tasks in the first collision result are eliminated, and traversal collision is performed between the remaining scheduling tasks and the integrated wide table configuration information, where the integrated wide table configuration information comprises the wide table configuration information of wide table data table 1 and the wide table configuration information of wide table data table 2. Fig. 5 shows a schematic diagram of another field traversal collision process provided in an exemplary embodiment of the present application. As shown in fig. 5, assume that task3 and task4 are the remaining tasks after the scheduling tasks in the first collision result are eliminated. Field traversal collision is performed between the task field blood-edge relationship of task3 and the integrated wide table configuration information 510, and the hit fields are recorded as collision result records; likewise, field traversal collision is performed between the task field blood-edge relationship of task4 and the integrated wide table configuration information 510, and the hit fields are recorded as collision result records. The field hit number and field hit rate of task3 and of task4 in the integrated wide table configuration information 510 are counted respectively, and a second data table 520 is generated based on the statistical results and the collision result records. The second data table 520 contains the field hit number and field hit rate of each task in the integrated wide table configuration information 510, and the second traversal collision result can be obtained from the second data table 520.
Then, the scheduling task-wide table data table set combinations in the second traversal collision result are screened by the second hit condition, and the combinations that meet the second hit condition are determined as the second collision result for the case in which the task accesses wide table data tables in a dispersed manner (namely, the second matching type). If the second hit number threshold in the second hit condition is 4, the second hit rate threshold is 90%, and the second hit condition uses both the field hit number and the field hit rate, the second collision result comprises task3-wide table data tables 1 and 2, and task4-wide table data tables 1 and 2.
In one possible implementation, after the wide table data tables already matched with each scheduling task are removed from the first collision result and the second collision result, a scheduling task-wide table data table list is generated based on the result of the removal. The list indicates the wide table data tables that each scheduling task can call, and the data source of each scheduling task is replaced with the corresponding wide table data table or data table set.
In this embodiment, the computer device may implement all or part of the steps of the embodiment shown in fig. 1 by calling corresponding modules. Fig. 6 shows a schematic diagram of a data source determining method provided in an exemplary embodiment of the present application. As shown in fig. 6, the computer device may include a configuration information acquisition module 610, a standardized splitting module 620, an execution plan parsing module 630, a task field blood-edge relationship construction module 640, a field traversal collision hit module 650, and a scheduling task-wide table data table determining module 660. The configuration information acquisition module 610 is configured to read the wide table data table and the corresponding source library field information configuration table to obtain the field reference source table and the field information table of the wide table data table, thereby obtaining the wide table configuration information of the wide table data table. The standardized splitting module 620 is configured to perform standardized splitting on the task text of each task instance in the big data cluster to obtain the set of standardized task statements of each scheduling task. The execution plan parsing module 630 is configured to perform execution plan parsing on each standardized task statement in the task statement set corresponding to each scheduling task to obtain the abstract syntax tree corresponding to each standardized task statement. The task field blood-edge relationship construction module 640 is configured to traverse each scheduling task and construct the task field blood-edge relationship of each scheduling task based on the abstract syntax trees corresponding to the standardized task statements of that scheduling task. The field traversal collision hit module 650 is configured to perform field traversal collision between the wide table configuration information of each wide table data table and the task field blood-edge relationship of each scheduling task to obtain the field hit information of each scheduling task relative to each wide table data table. The scheduling task-wide table data table determining module 660 is configured to filter the results by setting thresholds on the field hit number and the field hit rate, and to reversely eliminate the matching results of tasks that already call the wide table data table, finally obtaining a list of potentially applicable scheduling tasks for each wide table data table, that is, a list of the wide table data tables that each scheduling task can call, so that dedicated promotion and governance can be carried out based on this list and the data sources of the scheduling tasks can be changed.
Fig. 7 shows a block diagram of a task matching device according to an exemplary embodiment of the present application, where the task matching device may perform all or part of the steps of the embodiment shown in fig. 1, and as shown in fig. 7, the task matching device may include:
an obtaining module 710, configured to obtain the wide table configuration information of each wide table data table; the wide table configuration information of each wide table data table comprises the source data table to which each field in the corresponding wide table data table belongs and the attribute information of the field in the source data table to which it belongs;
the parsing module 720 is configured to parse the task texts of the N scheduling tasks to obtain the task field blood-edge relationships of the N scheduling tasks; the task field blood-edge relationship of the ith scheduling task is used for representing the source data table called by the ith scheduling task when it is executed and the attribute information of the called fields; N is an integer greater than 1, and i is an integer greater than or equal to 1 and less than or equal to N;
the collision module 730 is configured to perform field traversal collision between the wide table configuration information and the task field blood-edge relationships of the N scheduling tasks to obtain a first traversal collision result; the first traversal collision result comprises field hit information of each scheduling task relative to the wide table data table;
and a matching module 740, configured to determine, from the N scheduling tasks, a target scheduling task that is matched with the wide table data table based on the first traversal collision result.
In one embodiment, the field hit information of the task relative to any one of the wide table data tables includes a field hit number and a field hit rate;
the matching module 740 performs the following steps when determining, based on the first traversal collision result, a target scheduled task that is matched with each of the wide-table data tables from the N scheduled tasks:
determining a first collision result meeting a first hit condition for each wide table data table from the first traversal collision results of each wide table data table; wherein meeting the first hit condition includes at least one of: the field hit number is greater than a first hit number threshold; the field hit rate is greater than a first hit rate threshold;
and determining the scheduling task corresponding to the first collision result of each wide table data table as a target scheduling task matched with each wide table data table.
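The first hit condition can be sketched as a simple filter. The sketch below reads "at least one of" as meaning that exceeding either threshold is sufficient; the threshold values and the function names are illustrative assumptions, not values given in the application.

```python
def meets_hit_condition(hit_count: int, hit_rate: float,
                        min_count: int = 3, min_rate: float = 0.6) -> bool:
    # Assumption: exceeding either threshold satisfies the condition.
    return hit_count > min_count or hit_rate > min_rate


def target_tasks(hit_info: dict[str, tuple[int, float]],
                 min_count: int = 3, min_rate: float = 0.6) -> list[str]:
    """Return the tasks whose hit information passes the filter."""
    return [
        task
        for task, (count, rate) in hit_info.items()
        if meets_hit_condition(count, rate, min_count, min_rate)
    ]


# Illustrative (hit number, hit rate) pairs for one wide table data table.
hit_info = {"task_a": (5, 0.4), "task_b": (2, 0.9), "task_c": (1, 0.1)}
matched = target_tasks(hit_info)
```

An implementation that requires both thresholds to be exceeded would replace `or` with `and`.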
In a possible implementation manner, the task matching device further includes an integrating module 750, configured to integrate the wide table configuration information of each wide table data table to obtain integrated wide table configuration information if, among the scheduling tasks, there are remaining scheduling tasks that are not matched with any wide table data table;
The determining module 760 is configured to collide the integrated wide table configuration information with the task field blood relationships of the remaining scheduling tasks to obtain a second traversal collision result, where the second traversal collision result includes field hit information of the remaining scheduling tasks relative to the integrated wide table configuration information;
and the matching module 740 is further configured to determine a matched wide table data table for the remaining scheduling tasks according to the second traversal collision result.
In one possible implementation, the field hit information of the remaining scheduling tasks relative to the integrated wide table configuration information includes a field hit number and a field hit rate; the second traversal collision result further comprises a hit field set, wherein the hit field set comprises a plurality of hit fields, and a hit field is a field contained in both the task field blood relationship of the remaining scheduling tasks and the integrated wide table configuration information;
the matching module 740 performs the following steps when determining a matched broad-table data table for the remaining scheduled tasks according to the second traversal collision result:
if the second traversal collision result meets a second hit condition, determining at least two wide table data tables to which the hit field set belongs according to the fields included in each wide table data table; wherein meeting the second hit condition includes at least one of: the field hit number in the second traversal collision result is greater than a second hit number threshold; the field hit rate in the second traversal collision result is greater than a second hit rate threshold;
and taking the at least two wide table data tables to which the hit field set belongs as the wide table data tables matched with the remaining scheduling tasks.
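The second round of matching for a remaining scheduling task can be sketched as follows. The sketch assumes that "integrating" means taking the union of the fields of all wide table data tables, that fields are plain strings, and that the hit rate is measured against the task's own fields; the thresholds and the name `match_remaining` are hypothetical.

```python
def match_remaining(task_fields: set[str],
                    wide_tables: dict[str, set[str]],
                    min_count: int = 2,
                    min_rate: float = 0.5) -> list[str]:
    """Return the wide tables that together cover a remaining task's hit fields."""
    # Assumption: integrated configuration = union of all wide table fields.
    merged = set().union(*wide_tables.values())
    hits = task_fields & merged
    rate = len(hits) / len(task_fields) if task_fields else 0.0
    if not (len(hits) > min_count or rate > min_rate):
        return []
    # Map the hit field set back to the wide tables that own those fields.
    return sorted(name for name, fields in wide_tables.items() if fields & hits)


# Illustrative data only.
wide_tables = {
    "wide_orders": {"orders.order_id", "orders.amount"},
    "wide_users": {"users.user_id", "users.city"},
    "wide_risk": {"risk.score"},
}
task_fields = {"orders.amount", "users.city", "logs.ts"}
owners = match_remaining(task_fields, wide_tables)
```

In this example the task hits fields spread over two wide tables, which is the situation the embodiment addresses; the sketch itself does not enforce a minimum of two tables.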
In one possible implementation manner, the determining module 760 is further configured to determine, based on the calling relationship between wide table data tables and scheduling tasks established in the metadata base, the wide table data tables that have already been called by each scheduling task;
the matching module 740 performs the following steps when determining, based on the first traversal collision result, a target scheduled task that is matched with each of the wide-table data tables from the N scheduled tasks:
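determining the wide table data tables already called by the N scheduling tasks, based on the calling relationship between wide table data tables and scheduling tasks established in the metadata base; determining candidate scheduling tasks matched with the wide table data table from the N scheduling tasks based on the first traversal collision result; and performing de-duplication processing on the candidate scheduling tasks according to the wide table data tables already called by the N scheduling tasks, to obtain the target scheduling tasks matched with the wide table data table.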
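Excluding tasks that already call a wide table data table can be sketched as a lookup against the recorded calling relationship. The dictionary shapes and the name `remove_existing` below are assumptions made for illustration; the application does not describe how the metadata base stores this relationship.

```python
def remove_existing(candidates: dict[str, list[str]],
                    called: dict[str, set[str]]) -> dict[str, list[str]]:
    """Drop candidate tasks that already call the wide table they matched.

    candidates: wide table name -> candidate task ids from the collision step.
    called:     task id -> wide tables the task already calls (assumed to be
                read from the metadata base).
    """
    return {
        wide: [task for task in tasks if wide not in called.get(task, set())]
        for wide, tasks in candidates.items()
    }


# Illustrative data only.
candidates = {"wide_orders": ["task_a", "task_b"], "wide_users": ["task_a"]}
called = {"task_a": {"wide_orders"}}
targets = remove_existing(candidates, called)
```

A task is removed only for the wide table it already calls, so `task_a` stays a target for `wide_users`.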
In one possible implementation manner, the parsing module 720 performs the following steps when parsing the task texts in the N scheduling tasks to obtain the task field blood relationships of the N scheduling tasks:
carrying out standardized splitting on task texts of each scheduling task to obtain task statement sets of each scheduling task; the task statement set comprises a plurality of standardized task statements;
Performing execution plan analysis on each standardized task sentence in the task sentence set of each scheduling task to obtain an abstract syntax tree corresponding to each standardized task sentence in each scheduling task; the abstract syntax tree comprises configuration information of an original data source corresponding to a task sentence;
constructing the blood relationship of the task fields of each scheduling task based on the abstract syntax tree of each standardized task statement in each scheduling task.
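The standardized splitting step can be sketched as below, assuming the task text is a SQL script whose statements are separated by semicolons. The application does not specify the splitting rules, so the comment stripping and whitespace normalization shown here are illustrative choices, and the split is deliberately naive (it does not handle semicolons inside string literals).

```python
import re


def split_statements(task_text: str) -> list[str]:
    """Split a task script into normalized single statements."""
    # Remove /* ... */ block comments, then -- line comments.
    text = re.sub(r"/\*.*?\*/", " ", task_text, flags=re.S)
    text = re.sub(r"--[^\n]*", " ", text)
    statements = []
    for raw in text.split(";"):
        # Collapse runs of whitespace so each statement is on one line.
        stmt = re.sub(r"\s+", " ", raw).strip()
        if stmt:
            statements.append(stmt)
    return statements


# Illustrative task text only.
task_text = (
    "-- load\n"
    "INSERT INTO t\nSELECT a, b FROM src;\n\n"
    "/* tmp */ SELECT  c FROM other ;"
)
statements = split_statements(task_text)
```

Each resulting statement would then be passed to execution plan parsing to obtain its abstract syntax tree, from which the source tables and fields are read.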
In one possible implementation manner, the parsing module 720 performs the following steps when performing execution plan parsing on a target task statement to obtain the abstract syntax tree corresponding to the target task statement:
converting the target task statement into a logical execution plan; the target task statement is any one of the standardized task statements in each scheduling task;
optimizing the logical execution plan, and converting the optimized logical execution plan into physical execution plans;
determining a target physical execution plan from the physical execution plans according to cost optimization rules;
and generating an abstract syntax tree of the target task sentence based on the target physical execution plan.
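The selection of the target physical execution plan can be illustrated with a toy sketch. It assumes that each candidate physical plan carries a single estimated cost and that the cost optimization rule is "pick the cheapest"; the class, the plan names, and the cost figures are hypothetical, and a real query engine's cost model is considerably richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalPlan:
    name: str
    estimated_cost: float


def choose_plan(plans: list[PhysicalPlan]) -> PhysicalPlan:
    """Pick the candidate physical plan with the lowest estimated cost."""
    if not plans:
        raise ValueError("no physical plans to choose from")
    return min(plans, key=lambda plan: plan.estimated_cost)


# Illustrative candidates for one optimized logical plan.
plans = [
    PhysicalPlan("sort_merge_join", 120.0),
    PhysicalPlan("broadcast_hash_join", 45.5),
    PhysicalPlan("nested_loop_join", 900.0),
]
best = choose_plan(plans)
```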
In summary, in the task matching method provided by the embodiments of the present application, the wide table configuration information of the wide table data table and the task field blood relationship of each scheduling task are acquired; field traversal collision is performed on the wide table configuration information and the task field blood relationship of each scheduling task to obtain a first traversal collision result; and the target scheduling task matched with each wide table data table is determined from the scheduling tasks based on the first traversal collision result. When the target scheduling task is executed, its data source can then be switched from its source data table to the wide table data table, so that the target scheduling task is executed using the wide table data table. The matching between wide table data tables and scheduling tasks is therefore performed at field level. Because fields are the smallest units on which both the wide table data table and the scheduling task operate, matching at this granularity is fine-grained and improves matching accuracy.
Fig. 8 illustrates a block diagram of a computer device 800 according to an exemplary embodiment of the present application. The computer device 800 includes a central processing unit (Central Processing Unit, CPU) 801, a system memory 804 including a random access memory (Random Access Memory, RAM) 802 and a read-only memory (Read-Only Memory, ROM) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The computer device 800 also includes a mass storage device 806 for storing an operating system 809, application programs 810, and other program modules 811.
Without loss of generality, the computer readable medium may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile discs (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to those described above. The system memory 804 and mass storage device 806 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 800 may also operate by being connected to a remote computer on a network, such as the Internet. That is, the computer device 800 may be connected to the network 808 through a network interface unit 807 coupled to the system bus 805; the network interface unit 807 may also be used to connect to other types of networks or remote computer systems (not shown).
The memory further stores at least one instruction, at least one program, a code set, or an instruction set, and the central processing unit 801 implements all or part of the steps of the task matching method shown in the foregoing embodiments by executing the at least one instruction, the at least one program, the code set, or the instruction set.
In an exemplary embodiment, there is also provided a computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement all or part of the steps in the task matching method described above. For example, the computer readable storage medium may be Read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), compact disc Read-Only Memory (CD-ROM), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which comprises at least one computer program, the at least one computer program being loaded and executed by a processor to perform all or part of the steps of the task matching method described in the embodiment shown in Fig. 1 above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method of task matching, the method comprising:
Acquiring wide table configuration information of a wide table data table; the wide table configuration information comprises a source data table to which a field belongs in the wide table data table and attribute information of the field in the source data table to which the field belongs;
analyzing task texts in the N scheduling tasks to obtain the blood relationship of task fields of the N scheduling tasks; the task field blood relationship of the ith scheduling task is used for representing a source data table called by the ith scheduling task when the ith scheduling task is executed and called field attribute information; n is an integer greater than 1, i is an integer greater than 1 and less than or equal to N;
performing field traversal collision on the wide table configuration information and the task field blood relationships of the N scheduling tasks to obtain a first traversal collision result; the first traversal collision result comprises field hit information of each scheduling task relative to the wide table data table;
and determining the target scheduling task matched with the wide table data table from N scheduling tasks based on the first traversal collision result.
2. The method of claim 1, wherein the field hit information for each scheduled task relative to the wide table data table includes a field hit number and a field hit rate;
The determining, based on the first traversal collision result, a target scheduling task matched with the wide table data table from N scheduling tasks, including:
determining a first collision result meeting a first hit condition for the wide table data table from the first traversal collision results; meeting the first hit condition includes at least one of: the field hit number is greater than a first hit number threshold; the field hit rate is greater than the first hit rate threshold;
and determining the scheduling task corresponding to the first collision result as a target scheduling task matched with the wide table data table.
3. The method of claim 2, wherein the wide table data table belongs to any one of a wide table data table set, and the wide table data table set includes M wide table data tables; the method further comprises the steps of:
if no untraversed wide table data table remains in the wide table data table set, acquiring the remaining scheduling tasks among the N scheduling tasks, wherein the remaining scheduling tasks are not matched with any wide table data table in the wide table data table set;
integrating the wide table configuration information of the M wide table data tables to obtain integrated wide table configuration information, and colliding the integrated wide table configuration information with the task field blood relationships of the remaining scheduling tasks to obtain a second traversal collision result, wherein the second traversal collision result comprises field hit information of the remaining scheduling tasks relative to the integrated wide table configuration information;
And determining a matched wide-table data table for the remaining scheduling tasks according to the second traversal collision result.
4. The method of claim 3, wherein the field hit information of the remaining scheduling tasks relative to the integrated wide table configuration information includes a field hit number and a field hit rate; the second traversal collision result further comprises a hit field set, wherein the hit field set comprises a plurality of hit fields, and a hit field is a field contained in both the task field blood relationship of the remaining scheduling tasks and the integrated wide table configuration information;
the determining a matched wide table data table for the remaining scheduling tasks according to the second traversal collision result includes:
if the second traversal collision result meets a second hit condition, determining at least two wide table data tables to which the hit field set belongs according to the fields included in each wide table data table; wherein meeting the second hit condition includes at least one of: the field hit number in the second traversal collision result is greater than a second hit number threshold; the field hit rate in the second traversal collision result is greater than a second hit rate threshold;
And taking at least two wide table data tables to which the hit field set belongs as wide table data tables matched with the residual scheduling tasks.
5. The method according to any one of claims 2 to 4, wherein the determining, based on the first traversal collision result, a target scheduling task matched with the wide table data table from N scheduling tasks comprises:
determining the wide table data tables already called by the N scheduling tasks, based on the calling relationship between wide table data tables and scheduling tasks established in the metadata base;
determining candidate scheduling tasks matched with the wide-table data table from N scheduling tasks based on the first traversal collision result;
and performing de-duplication processing on candidate scheduling tasks matched with the wide table data table according to the invoked wide table data tables of the N scheduling tasks to obtain target scheduling tasks matched with the wide table data table.
6. The method of claim 1, wherein the parsing task texts in the N scheduling tasks to obtain the task field blood relationships of the N scheduling tasks comprises:
carrying out standardized splitting on task texts of each scheduling task to obtain task statement sets of each scheduling task; each task sentence set comprises a plurality of standardized task sentences;
Performing execution plan analysis on each standardized task sentence in a task sentence set of each scheduling task to obtain an abstract syntax tree corresponding to each standardized task sentence in each scheduling task;
constructing the blood relationship of the task fields of each scheduling task based on the abstract syntax tree of each standardized task statement in each scheduling task.
7. The method of claim 6, wherein performing execution plan parsing on a target task statement to obtain the abstract syntax tree corresponding to the target task statement comprises:
converting the target task statement into a logical execution plan; the target task statement is any one of the standardized task statements in each scheduling task;
optimizing the logical execution plan, and converting the optimized logical execution plan into physical execution plans;
determining a target physical execution plan from the physical execution plans according to cost optimization rules;
and generating an abstract syntax tree of the target task sentence based on the target physical execution plan.
8. A task matching device, the device comprising:
the acquisition module is used for acquiring the wide table configuration information of each wide table data table; the wide table configuration information of each wide table data table comprises the source data table to which each field in the corresponding wide table data table belongs and the attribute information of the field in that source data table;
The analysis module is used for analyzing task texts in the N scheduling tasks to obtain the task field blood relationship of the N scheduling tasks; the task field blood relationship of the ith scheduling task is used for representing a source data table called by the ith scheduling task when the ith scheduling task is executed and called field attribute information; n is an integer greater than 1, i is an integer greater than 1 and less than or equal to N;
the collision module is used for performing field traversal collision on the wide table configuration information and the task field blood edge relations of the N scheduling tasks to obtain a first traversal collision result; the first traversal collision result comprises field hit information of each scheduling task relative to the wide table data table;
and the matching module is used for determining the target scheduling task matched with the wide table data table from N scheduling tasks based on the first traversal collision result.
9. A computer device, characterized in that it comprises a processor and a memory, said memory storing at least one computer program, said at least one computer program being loaded and executed by said processor to implement the task matching method according to any of claims 1 to 7.
10. A computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement the task matching method of any one of claims 1 to 7.
CN202310770053.0A 2023-06-26 2023-06-26 Task matching method, device, computer equipment and storage medium Pending CN117493391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310770053.0A CN117493391A (en) 2023-06-26 2023-06-26 Task matching method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117493391A true CN117493391A (en) 2024-02-02

Family

ID=89681605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310770053.0A Pending CN117493391A (en) 2023-06-26 2023-06-26 Task matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117493391A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination