CN117331975A - Method and device for executing data processing task, computer equipment and storage medium


Info

Publication number
CN117331975A
Authority
CN
China
Prior art keywords
data
cache
data processing
connection
processing task
Prior art date
Legal status
Pending
Application number
CN202311331276.3A
Other languages
Chinese (zh)
Inventor
王少伟
蒋杰
刘煜宏
陈鹏
罗韩梅
范晓亮
杨昱睿
侯忱
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202311331276.3A
Publication of CN117331975A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 - Tablespace storage structures; Management thereof
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2455 - Query execution
    • G06F 16/24552 - Database cache management
    • G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 - Protecting access to data via a platform, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 - Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a data processing task execution method, apparatus, computer device, storage medium and computer program product. The method relates to artificial intelligence technology and can be applied to database query scenarios. The method includes the following steps: determining a data processing task to be executed; when it is determined that the data processing task meets a cache reuse determination condition, determining at least two data sources to be accessed when the data processing task is executed; acquiring respective data connection cache results from the at least two data sources; each data connection cache result includes connection data in the corresponding data source, the connection data being the data that belongs to that data source among the data joined in the process of connecting the respective data of the at least two data sources; and obtaining a data connection result according to the data connection cache results and executing the data processing task based on the data connection result. By adopting the method, the execution efficiency of data processing tasks can be improved.

Description

Method and device for executing data processing task, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for performing a data processing task.
Background
With the development of computer technology, objects and information in the real world are converted by digital means into forms that computers can process, so that information can be stored, transmitted and processed digitally. Digital technology is widely applied in fields such as social media, electronic commerce, online payment, travel services and mobile office. In different application scenarios the resulting data is often stored in different databases; for example, different departments of different institutions store and maintain their data in separate databases.
When data from different databases are analyzed jointly, the data stored in the different databases usually have to be gathered and joined centrally before they can be analyzed. In the related art, joining the data centrally in this way makes the analysis inefficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing task execution method, apparatus, computer device, computer readable storage medium, and computer program product that can improve the execution efficiency of data processing tasks.
In a first aspect, the present application provides a method for performing a data processing task. The method comprises the following steps:
determining a data processing task to be executed;
when it is determined that the data processing task meets a cache reuse determination condition, determining at least two data sources to be accessed when the data processing task is executed;
acquiring respective data connection cache results from the at least two data sources, wherein each data connection cache result comprises connection data in the corresponding data source, and the connection data is the data that belongs to the corresponding data source among the data joined in the process of connecting the respective data of the at least two data sources; and
obtaining a data connection result according to the data connection cache results, and executing the data processing task based on the data connection result.
In a second aspect, the present application further provides a data processing task execution device. The device comprises:
a task determining module, configured to determine a data processing task to be executed;
a data source determining module, configured to determine, when it is determined that the data processing task meets a cache reuse determination condition, at least two data sources to be accessed when the data processing task is executed;
a cache result acquisition module, configured to acquire respective data connection cache results from the at least two data sources, wherein each data connection cache result comprises connection data in the corresponding data source, and the connection data is the data that belongs to the corresponding data source among the data joined in the process of connecting the respective data of the at least two data sources; and
a cache result processing module, configured to obtain a data connection result according to the data connection cache results and to execute the data processing task based on the data connection result.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor that implements the steps of the above data processing task execution method when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the above data processing task execution method.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above data processing task execution method.
According to the above data processing task execution method, apparatus, computer device, storage medium and computer program product, when it is determined that a data processing task to be executed meets the cache reuse determination condition, respective data connection cache results are acquired from the at least two data sources to be accessed when the task is executed; each data connection cache result includes the connection data that belongs to the corresponding data source among the data joined in the process of connecting the respective data of the at least two data sources; a data connection result is obtained according to the data connection cache results, and the data processing task is executed based on the data connection result. For a data processing task that meets the cache reuse determination condition, the respective data connection cache results are obtained directly from the at least two data sources and combined into the data connection result used to execute the task, so the task can be executed with connection results prepared in advance. This simplifies the data connection process and thereby improves the efficiency of executing data processing tasks.
Drawings
To describe the embodiments of the present application or the technical solutions in the related art more clearly, the drawings required by the embodiments or the related descriptions are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is an application environment diagram of a method of performing data processing tasks in one embodiment;
FIG. 2 is a flow diagram of a method of performing data processing tasks in one embodiment;
FIG. 3 is a diagram of the results of table join of two data tables in one embodiment;
FIG. 4 is a schematic diagram of splitting table join results into data join cache results in one embodiment;
FIG. 5 is a flow chart of buffering data connection buffering results in one embodiment;
FIG. 6 is a flow diagram of query statement task execution in one embodiment;
FIG. 7 is a schematic diagram of task execution for two tasks with the same table join result in one embodiment;
FIG. 8 is a flow chart of reusing a table join result in one embodiment;
FIG. 9 is a schematic diagram of a temporary wide table in one embodiment;
FIG. 10 is a flow diagram of a lookup management table using linear probing in one embodiment;
FIG. 11 is a schematic interface diagram of a data analysis system in one embodiment;
FIG. 12 is a block diagram of a data processing task execution device in one embodiment;
FIG. 13 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in ways similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of intelligent machines so that the machines can perceive, reason and make decisions. Artificial intelligence is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training model technology, operation/interaction systems, mechatronics and the like. The pre-training model, also called a large model or foundation model, can be fine-tuned and then widely applied to downstream tasks in all major directions of artificial intelligence. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning and other directions.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied throughout the fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction. The pre-training model is the latest development of deep learning and integrates these techniques.
With the research and advancement of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, digital twins, virtual humans, robots, artificial intelligence generated content (AIGC), conversational interaction, smart healthcare, smart customer service and game AI. It is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
The solution provided in the embodiments of the present application relates to artificial intelligence and machine learning technology and can be used to schedule data processing tasks for joint data analysis. The solution is described in detail in the following embodiments.
The data processing task execution method provided in the embodiments of the present application can be applied to the application environment shown in FIG. 1. The terminal 102 communicates with each service server 104 via a network. A data storage system may store the data that the corresponding service server 104 needs to process; it may be provided separately, integrated on the corresponding service server 104, or placed on a cloud or another server. The server 106 that schedules the execution of data processing tasks may connect to the data storage systems via a network to query data from them for processing, for example for online analytical processing (OLAP). During operation, for example while communicating with the service servers 104 to obtain various services, the terminal 102 may generate various kinds of service data, which may be stored by the respective data storage systems of the service servers 104; these data storage systems may be relatively independent of one another. If a data processing task to be executed involves data stored in several of these data storage systems, then, when the server 106 determines that the task meets the cache reuse determination condition, it obtains respective data connection cache results from the at least two data sources to be accessed when the task is executed, for example from the respective data storage systems. Each data connection cache result includes the connection data that belongs to the corresponding data source among the data joined in the process of connecting the respective data of the at least two data sources. The server 106 obtains a data connection result according to the data connection cache results and executes the data processing task based on the data connection result. Further, the server may feed the task execution result back to the terminal 102, so that the terminal 102 can obtain it, for example an online analytical processing result.
The terminal 102 may be, but is not limited to, a desktop computer, a notebook computer, a smart phone, a tablet computer, an Internet of Things device or a portable wearable device; the Internet of Things device may be a smart speaker, a smart television, a smart air conditioner, a smart in-vehicle device and the like, and the portable wearable device may be a smart watch, a smart bracelet, a head-mounted device and the like. The service server 104 and the server 106 may each be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud service, cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and artificial intelligence platforms. The terminal and the server may be connected directly or indirectly through wired or wireless communication.
Cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. It is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied on the basis of the cloud computing business model; these resources can form a resource pool and be used flexibly and on demand. Cloud computing technology is becoming an important support: the background services of technical network systems, such as video websites, picture websites and portal websites, require large amounts of computing and storage resources. With the development of the Internet industry, each item may have its own identification mark in the future, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong back-end system support, which can only be realized through cloud computing.
Cloud storage is a new concept extended and developed from the concept of cloud computing. A distributed cloud storage system (hereinafter referred to as a storage system) is a storage system that, through functions such as cluster application, grid technology and distributed storage file systems, integrates a large number of storage devices of various types in a network (the storage devices are also referred to as storage nodes) so that they work together through application software or application interfaces, and provides data storage and service access functions externally. At present, the storage method of such a storage system is as follows: when a logical volume is created, physical storage space is allocated to it; the physical storage space may be composed of the disks of one or several storage devices. A client stores data on a logical volume, that is, the data is stored on a file system; the file system divides the data into many parts, each part is an object, and an object contains not only the data but also additional information such as a data identification (ID). The file system writes each object into the physical storage space of the logical volume and records the storage position information of each object, so that when the client requests access to the data, the file system can let the client access the data according to the storage position information of each object. The process by which the storage system allocates physical storage space to the logical volume is as follows: the physical storage space is divided into stripes in advance according to an estimate of the capacity of the objects to be stored on the logical volume (this estimate often leaves a large margin relative to the capacity of the objects actually to be stored) and the redundant array of independent disks (RAID) scheme; a logical volume can be understood as a stripe, and physical storage space is thereby allocated to the logical volume.
A database can be regarded as an electronic filing cabinet, a place for storing electronic files, in which users can add, query, update and delete data. A database is a collection of data that is stored together in a way that can be shared by multiple users, has as little redundancy as possible and is independent of applications. A database management system (DBMS) is computer software designed for managing databases, and generally provides basic functions such as storage, retrieval, security and backup. Database management systems can be classified according to the database models they support, such as relational or XML (Extensible Markup Language); according to the type of computer supported, such as a server cluster or a mobile phone; according to the query language used, such as SQL (Structured Query Language) or XQuery; according to the performance emphasis, such as maximum scale or maximum operating speed; or according to other classification schemes. Regardless of the classification used, some DBMSs can span categories, for example supporting multiple query languages at the same time.
In an exemplary embodiment, as shown in FIG. 2, a data processing task execution method is provided. The method is executed by a computer device; specifically, it may be executed by a computer device such as a terminal or a server alone, or by the terminal and the server together. In this embodiment, the method is described using its application to the server in FIG. 1 as an example, and includes the following steps 202 to 208. Wherein:
step 202, determining a data processing task to be performed.
A data processing task is a task that processes data from databases jointly, and in particular a task that jointly analyzes and computes over data in different databases. A data processing task may be, for example, a task that compares data sizes across different databases, a task that mines data in different databases, or a task that performs statistical analysis on data in different databases along multiple dimensions. Data processing tasks correspond to actual application scenarios, and different scenarios may correspond to different data processing tasks.
Specifically, the server may determine a data processing task to be executed, where the data processing task may be configured in advance according to actual needs, for example, the data processing task to be executed may be configured according to an application scenario and the data processing requirement.
Step 204, when it is determined that the data processing task meets a cache reuse determination condition, determining at least two data sources to be accessed when the data processing task is executed.
The cache reuse determination condition is used to determine whether a pre-cached intermediate result can be reused when the data processing task is executed. The intermediate result may be intermediate data involved in executing the task, such as a virtual table obtained by joining data tables from multiple databases. The cache reuse determination condition may be set according to actual needs; for example, the condition may be considered satisfied when it is determined that a reusable intermediate result exists for the data processing task. A data source is a source of the data processed by the data processing task, and may specifically be a database; each data source may correspond to one database or to several databases. The data processing task involves processing data in multiple data sources, that is, at least two data sources need to be accessed so that data can be obtained from them for joint processing when the task is executed.
Optionally, the server may judge, based on the cache reuse determination condition, whether the data processing task satisfies it. For example, the server may determine whether a reusable intermediate result exists for the data processing task, and if so, consider the condition satisfied. In a specific application, the server may mark reusable intermediate results with identification information and use that identification information to determine whether a reusable intermediate result exists for the task. The identification information may be a mark for storing the reusable intermediate result; for instance, if the reusable intermediate result is stored in a cache table, the identification information may be the table information of that cache table, and the condition is considered satisfied when the table information of the cache table exists. When the data processing task meets the cache reuse determination condition, the cached intermediate result can be reused directly when the task is executed, which avoids regenerating the intermediate result and improves the execution efficiency of the task. The server may then determine the at least two data sources to be accessed when the data processing task is executed; specifically, the server may determine each data source to be accessed according to the task configuration information of the data processing task.
Step 206, acquiring respective data connection cache results from the at least two data sources, where each data connection cache result comprises connection data in the corresponding data source, and the connection data is the data that belongs to the corresponding data source among the data joined in the process of connecting the respective data of the at least two data sources.
Each data connection cache result contains connection data in the corresponding data source, where the connection data is the data that belongs to that data source among the data joined when the respective data of the at least two data sources are connected. When data are obtained from multiple data sources and processed centrally, the data obtained from the different sources need to be joined, and the processing is performed on the joined result, which contains data from each of the data sources. For example, when online analysis is performed across data source A, data source B and data source C, data must be obtained from each of them, say data a from data source A, data b from data source B and data c from data source C; the data d to be processed is obtained by joining data a, data b and data c, and the online analysis can then be performed on data d. Here data a belongs to data source A and is the connection data belonging to data source A in the process of joining data a, data b and data c; similarly, data b is the connection data of data source B, and data c is the connection data of data source C.
Specifically, the server may access the at least two data sources and obtain the respective data connection cache results from each of them. A data connection cache result contains connection data cached in advance, namely the data that belongs to the corresponding data source among the data joined when the data of the data sources were connected. By combining the data connection cache results, the data connection result of the data sources can be obtained, and the data processing task, such as online analytical processing, can be executed according to it. In a specific application, the data connection cache result of each data source may be cached in the form of a cache table; the server may query the corresponding cache table in each data source and obtain the corresponding data connection cache result from it.
In one specific application, the connection of data in different data sources is a table join, i.e. the joining of data tables from different data sources. As shown in FIG. 3, data table 1 and data table 2 may come from different databases. Data table 1 records account numbers and the corresponding user names, including Zhang, Li and Zhao, whose account numbers are 1, 2 and 3 respectively; data table 2 records account numbers, dates and the corresponding order numbers, for example order number XXXXX1 generated for account number 1 on date 20231001. When data table 1 and data table 2 are processed jointly, at least part of their data may need to be joined; specifically, data table 1 and data table 2 may be joined directly to obtain table join result 1, which may include the user name, the date and the corresponding order number, i.e. the data in data table 1 and data table 2 are joined to form a complete data table. Table join result 1 includes data from data table 1 and data from data table 2, and each database may cache its own data connection cache result: for example, the data in table join result 1 that come from data table 1 may be cached by database 1, and the data that come from data table 2 may be cached by database 2. When a data processing task to be executed can reuse table join result 1, the server can quickly obtain the data connection cache results directly from database 1 and database 2 and merge them to recover table join result 1 in order to execute the task.
Besides joining all of the data in data table 1 and data table 2, the two tables may also be joined only partially. As shown in FIG. 4, table join result 2 includes the user name and the order number; splitting table join result 2 yields data connection cache result 1 and data connection cache result 2, where the data in data connection cache result 1 come from data table 1 and the data in data connection cache result 2 come from data table 2. Data connection cache result 1 may be cached in database 1, to which data table 1 belongs, and data connection cache result 2 may be cached in database 2, to which data table 2 belongs. When a data processing task to be executed can reuse table join result 2, the server can quickly obtain data connection cache result 1 and data connection cache result 2 from database 1 and database 2 respectively and merge them to recover table join result 2 in order to execute the task.
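The following is a minimal, illustrative sketch of the join-and-split step described above, using in-memory lists of dictionaries to stand in for data table 1 and data table 2 from FIG. 3 and FIG. 4. All function, table and column names are assumptions made for illustration; they are not the patent's implementation.

```python
# Illustrative sketch only: join two source tables, then split the join result
# into per-source cache results as described for FIG. 4. Names are hypothetical.

def hash_join(left_rows, right_rows, key):
    """Equi-join two row lists on a shared key column."""
    index = {}
    for row in left_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for right in right_rows:
        for left in index.get(right[key], []):
            joined.append({**left, **right})
    return joined

def split_by_source(joined_rows, source_columns):
    """Split a join result into per-source cache results, keeping only the
    columns that originate from each data source."""
    return {
        source: [{col: row[col] for col in cols if col in row} for row in joined_rows]
        for source, cols in source_columns.items()
    }

# Data table 1 (database 1) and data table 2 (database 2), as in FIG. 3.
table1 = [{"account": 1, "user": "Zhang"}, {"account": 2, "user": "Li"}, {"account": 3, "user": "Zhao"}]
table2 = [{"account": 1, "date": "20231001", "order": "XXXXX1"}]

table_join_result = hash_join(table1, table2, key="account")
cache_results = split_by_source(
    table_join_result,
    {"database_1": ["account", "user"], "database_2": ["account", "date", "order"]},
)
# cache_results["database_1"] would be cached in database 1,
# cache_results["database_2"] in database 2.
```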
Step 208, obtaining a data connection result according to the data connection cache results, and executing the data processing task based on the data connection result.
The data connection result is the result of joining the data of multiple data sources and is used to execute the data processing task. For example, when the data source is a database whose data are stored in the form of data tables, the data tables in the databases can be joined (table Join) as required to obtain a table join result, which can serve as the data connection result for executing the corresponding data processing task. A table join combines rows from two or more tables; in a privacy computing scenario, a virtual wide table may be generated using homomorphic encryption, secure multi-party computation, differential privacy and similar techniques, with each participant holding its own portion of the data in the virtual wide table. The server may combine the data connection cache results to obtain the data connection result and execute the data processing task based on it, so the data connection does not need to be repeated; this simplifies the data connection flow and improves the execution efficiency of the data processing task.
In a specific application, the configured data processing tasks to be executed include task 1, task 2 and task 3, all of which need to process data table a in database A and data table b in database B. That is, when task 1, task 2 or task 3 is executed, the data sources to be accessed include database A and database B, and joining data table a with data table b yields the same table join result for each task, so each task executes its own processing based on the same table join result. The server may execute task 1, task 2 and task 3 in order. When task 1 is executed, the server may determine that no table join result exists yet, i.e. neither database A nor database B holds a data connection cache result; the server then joins data table a in database A with data table b in database B to obtain table join result d and executes task 1 based on it. Table join result d includes data from both database A and database B; for example, it includes data connection cache result 1 from database A and data connection cache result 2 from database B, and the server may cache data connection cache result 1 in database A and data connection cache result 2 in database B.
The server may then execute task 2. Having determined that task 2 meets the cache reuse determination condition, i.e. that a reusable cache result exists for task 2, the server can directly determine that database A and database B are to be accessed when task 2 is executed, obtain data connection cache result 1 from database A and data connection cache result 2 from database B, and obtain the data connection result from them; for example, the server may merge data connection cache result 1 and data connection cache result 2 to quickly restore table join result d and execute task 2 based on it. For task 3, the server may determine that task 3 also satisfies the cache reuse determination condition, so task 3 can be executed in the same way as task 2. For task 2 and task 3, the data connection cache results cached when task 1 was executed can be reused to obtain the data connection result, and the same table join does not have to be repeated, which simplifies the data connection flow and improves the execution efficiency of the data processing tasks.
In the above data processing task execution method, when it is determined that a data processing task to be executed meets the cache reuse determination condition, respective data connection cache results are acquired from the at least two data sources to be accessed when the task is executed; each data connection cache result includes the connection data that belongs to the corresponding data source among the data joined in the process of connecting the respective data of the at least two data sources; a data connection result is obtained according to the data connection cache results, and the data processing task is executed based on the data connection result. For a data processing task that meets the cache reuse determination condition, the respective data connection cache results are obtained directly from the at least two data sources and combined into the data connection result used to execute the task, so the task can be executed with connection results prepared in advance. This simplifies the data connection process and thereby improves the efficiency of executing data processing tasks.
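The sketch below models the flow for tasks 1 to 3 in the example above: the first task builds the table join result and caches each database's part of it, and later tasks reuse those parts. It is a hedged, self-contained illustration; the data structures and names are assumptions, not the patent's API.

```python
# Hypothetical reuse flow: per-database cache tables are modeled as dicts,
# and each table is a mapping of join key -> that database's columns.

db_a = {1: {"user": "Zhang"}, 2: {"user": "Li"}, 3: {"user": "Zhao"}}   # data table a
db_b = {1: {"order": "XXXXX1"}, 3: {"order": "XXXXX2"}}                 # data table b
cache = {"A": {}, "B": {}}                                              # per-database caches

def join_result_for(task_name):
    sig = "a_join_b"  # identifies the join that all three tasks rely on
    if sig in cache["A"] and sig in cache["B"]:
        print(f"{task_name}: reusing cached join parts")
        part_a, part_b = cache["A"][sig], cache["B"][sig]
    else:
        print(f"{task_name}: joining tables and caching the parts")
        keys = db_a.keys() & db_b.keys()
        part_a = {k: db_a[k] for k in keys}   # connection data belonging to database A
        part_b = {k: db_b[k] for k in keys}   # connection data belonging to database B
        cache["A"][sig], cache["B"][sig] = part_a, part_b
    # Merge the two parts back into table join result d.
    return {k: {**part_a[k], **part_b[k]} for k in part_a}

for task in ("task 1", "task 2", "task 3"):
    result = join_result_for(task)  # each task then runs its own processing on `result`
```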
In an exemplary embodiment, when it is determined that the data processing task meets the cache reuse determination condition, determining the at least two data sources to be accessed when the data processing task is executed includes: acquiring task configuration information of the data processing task; and when the task configuration information indicates that a reusable cache result associated with the data processing task exists, determining the at least two data sources to be accessed when the data processing task is executed.
The task configuration information is the configuration information of the data processing task and may include various configuration parameters for its execution, for example the execution time, execution trigger condition, number of executions, execution permissions, address information and cache processing type of the task. Task configuration information can be configured for each data processing task according to actual needs. Based on the task configuration information it can be determined whether the data processing task satisfies the cache reuse determination condition, for example by checking whether a reusable cache result associated with the task exists. The reusable cache result may specifically include the data connection cache results cached in the respective data sources, from which the data connection result required for executing the task can be obtained.
Optionally, the server may query the task configuration information of the data processing task; specifically, it may query the attribute information of the task according to the task identifier and obtain the task configuration information from that attribute information. The task configuration information may include, for example, a cache address and the status of a clear-cache switch, and can be used to determine whether the data processing task meets the cache reuse determination condition. When the task configuration information indicates that a reusable cache result associated with the data processing task exists, meaning that a reusable cache result has been cached in advance for the task, the server may determine that the task meets the cache reuse determination condition and determine the at least two data sources to be accessed when the task is executed. In a specific implementation, the server may query, based on the cache address in the task configuration information, whether a reusable cache result exists for the task; if it does, the server may determine that the task meets the cache reuse determination condition and determine the at least two data sources to be accessed.
In this embodiment, the server determines, based on the task configuration information of the data processing task, whether the task meets the cache reuse determination condition, so that the way the task is executed can be decided accurately; when the condition is met, the task can be executed by reusing the data connection cache results, which helps improve the execution efficiency of the task.
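As an illustration of such task configuration information, the sketch below uses a plain dictionary; the field names and the address format are assumptions for this example (the text only mentions items such as a cache address and a clear-cache switch), not a defined schema.

```python
# Hypothetical task configuration and condition check; all fields are illustrative.

task_config = {
    "task_id": "task_2",
    "data_sources": ["database_A", "database_B"],
    "reusable_cache_address": "hdfs://cluster/cache/a_join_b",  # where cached join parts live
    "clear_cache": False,                                       # clear-cache switch status
}

def meets_cache_reuse_condition(config, cache_exists):
    """cache_exists(address) is assumed to report whether a reusable cache
    result is actually stored at the configured address."""
    address = config.get("reusable_cache_address")
    return bool(address) and not config.get("clear_cache") and cache_exists(address)
```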
In one exemplary embodiment, when the task configuration information indicates that a reusable cache result associated with the data processing task exists, determining the at least two data sources to be accessed when the data processing task is executed includes: when the task configuration information includes a reusable cache address associated with the data processing task, determining a cache management table according to the reusable cache address; and when the cache management table includes a cache table that records the reusable cache result, determining, based on the reusable cache address, the at least two data sources to be accessed when the data processing task is executed.
The task configuration information of the data processing task may include a cache address, which may specifically be a reusable cache address, i.e. an address where data connection cache results are stored. Data connection cache results can be recorded in cache tables, which can be managed through a cache management table; the corresponding cache management table can be queried based on the reusable cache address, and the corresponding data connection cache result obtained from a cache table recorded in it. The cache management table including a cache table that records the reusable cache result means that the cache management table may directly contain the cache table, or may contain cache table information from which the corresponding cache table can be looked up. In a specific application, the cache management table may record meta information of the cache tables, such as the cache table name, the cache table creator, the creation time, the latest usage time and the header information of the cache table; based on this meta information, the corresponding cache table can be queried to obtain the corresponding data connection cache result. When the cache management table directly contains the cache table, the cache table can be read from the cache management table and the corresponding data connection cache result obtained from it.
For example, the server may parse the task configuration information of the data processing task to determine whether it includes a reusable cache address associated with the task, specifically by checking the fields of the task configuration information. When the task configuration information includes a reusable cache address associated with the task, the server may determine the cache management table based on that address, for example by querying according to the reusable cache address. The server may then traverse the cache management table to determine whether it contains the cache table itself or the meta information of a cache table, where the cache table records the reusable cache result, which may specifically be the data connection cache results cached by the respective data sources. If the cache management table contains the cache table itself or its meta information, the server may determine, based on the reusable cache address, the at least two data sources to be accessed when the task is executed; for example, the data sources pointed to by the reusable cache address can be determined, and from them the at least two data sources to be accessed.
In this embodiment, the server determines the cache management table based on the reusable cache address associated with the data processing task in the task configuration information and, when the cache management table includes a cache table that records the reusable cache result, determines the at least two data sources according to the reusable cache address. Whether a data connection cache result exists can thus be determined accurately from the cache management table, and when it exists the task can be executed by reusing it, which helps improve the execution efficiency of the task.
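One possible shape for such a cache management table and its lookup is sketched below, assuming the management table is a list of meta-information rows; the text describes the meta information (name, creator, times, header) but not a concrete structure, so the layout and values here are illustrative.

```python
# Hypothetical cache management table: one meta row per managed cache table.

cache_management_table = [
    {
        "cache_table_name": "cache_a_join_b_9f3d",
        "creator": "scheduler",
        "create_time": "2023-10-01 08:00:00",
        "last_used_time": "2023-10-02 09:30:00",
        "header": ["account", "user"],  # header of the cache table this row describes
    },
]

def find_cache_table(management_table, cache_table_name):
    """Return the meta row for the named cache table, or None if the management
    table does not record a reusable cache result under that name."""
    for row in management_table:
        if row["cache_table_name"] == cache_table_name:
            return row
    return None

meta = find_cache_table(cache_management_table, "cache_a_join_b_9f3d")
# If meta is not None, a reusable cache result exists, and the data sources to
# access can be derived from the reusable cache address that led to this table.
```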
In one exemplary embodiment, determining the cache management table according to the reusable cache address includes: determining a management table to be checked according to the reusable cache address and acquiring header information of the management table to be checked; and when the header information passes the management table check, determining the management table to be checked as the cache management table.
The management table to be checked is a management table that needs to be verified. The reusable cache address may contain several management tables serving different roles, among them the cache management table that manages the cache tables, and the server can find the cache management table by traversing them. Header information can be used to characterize a management table; different management tables may have different header information, so each management table can be distinguished accurately by its header. The header information may include the field name and field attribute of each field of the corresponding management table, for example a string-typed cache table name field name, a string-typed cache table creator field creator, a string-typed last used time field last_used_time, and so on.
Optionally, the server may determine the management table to be checked based on the reusable cache address, specifically by querying management tables according to that address and taking each queried management table as a management table to be checked. In a specific implementation, the path pointed to by the reusable cache address may contain multiple management tables; the server may traverse them to find the cache management table, taking the management table reached in each traversal step as the management table to be checked. The server may acquire the header information of the management table to be checked by reading it directly from that table, and then verify it: when the header information passes the management table check, the table belongs to the management tables used for managing cache tables, and the server may determine it to be the cache management table. When checking the header information, it can be compared with the predefined header format of the cache management table to decide whether the management table to be checked is the cache management table.
In this embodiment, the server obtains the management table to be checked according to the reusable cache address, verifies its header information, and determines it to be the cache management table when the check passes, so that the cache management table can be identified accurately from the reusable cache address.
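A minimal sketch of the header check follows. The expected header is an assumption based on the fields mentioned above (name, creator, last_used_time); the real expected format would be whatever the system predefines for its cache management table.

```python
# Hypothetical header verification: compare a candidate table's header against
# the predefined header format of the cache management table.

EXPECTED_HEADER = {
    "name": "string",            # cache table name
    "creator": "string",         # cache table creator
    "last_used_time": "string",  # latest usage time of the cache table
}

def is_cache_management_table(header):
    """header: mapping of field name -> field attribute for the table being checked."""
    return all(header.get(field) == attr for field, attr in EXPECTED_HEADER.items())

# A management table whose header matches is treated as the cache management table.
candidate_header = {"name": "string", "creator": "string", "last_used_time": "string"}
assert is_cache_management_table(candidate_header)
```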
In an exemplary embodiment, when the cache management table includes a cache table that records the reusable cache result, determining, based on the reusable cache address, the at least two data sources to be accessed when the data processing task is executed includes: determining a cache table name, the cache table name identifying the cache table used to record the reusable cache result; and when the cache management table includes the cache table name, determining, based on the data source to which the reusable cache address belongs, the at least two data sources to be accessed when the data processing task is executed.
The cache table name identifies the cache table that records the reusable cache result, and different cache tables may correspond to different cache table names. Specifically, the server may determine the cache table name by computing it, for example according to a fixed format and the data source information. In a specific application, the format of the cache table name may be "fixed prefix + table name + data source hash information". The server may query the cache management table according to the cache table name to determine whether it contains that name, or whether it contains the cache table identified by that name. When the server determines that the cache table name exists in the cache management table, or that the cache table identified by it exists there, it can conclude that the cache management table includes a cache table recording the reusable cache result; the server may then determine the data source to which the reusable cache address belongs and, according to it, the at least two data sources to be accessed when the task is executed.
In this embodiment, the server uses the cache table name to determine whether the cache table exists in the cache management table, and when it does, determines the at least two data sources based on the data source to which the reusable cache address belongs, so that the cache table can be located accurately by its name, which helps ensure the efficient execution of the data processing task.
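The naming format "fixed prefix + table name + data source hash information" could be realized as sketched below; the prefix, the choice of hash function and the connection-string style data source information are illustrative assumptions.

```python
# Hypothetical cache table name derivation following the stated format.

import hashlib

def cache_table_name(table_name, data_source_info, prefix="join_cache_"):
    """Derive a deterministic cache table name so that the same table in the same
    data source always maps to the same cache table."""
    digest = hashlib.sha256(data_source_info.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}{table_name}_{digest}"

name = cache_table_name("data_table_a", "database_A@10.0.0.1:3306")
# The server can then look this name up in the cache management table to decide
# whether a reusable cache result exists.
```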
In an exemplary embodiment, acquiring the respective data connection cache results from the at least two data sources includes: determining the respective cache tables of the at least two data sources, where a cache table is used to record the connection data in the corresponding data source; and obtaining the data connection cache results of the at least two data sources from their respective cache tables.
A cache table records the connection data in the corresponding data source, and the connection data cached by the individual data sources can be merged into a complete data connection result. Specifically, after determining the at least two data sources to be accessed when the data processing task is executed, the server further determines the cache table of each data source and obtains each data source's cached data connection cache result from its cache table. In a specific application, for each of the at least two data sources, the server may query the cache table in that data source; specifically, it may query the cache management table that manages the cache tables in that data source and determine the corresponding cache table from it.
In this embodiment, the server obtains the respective data connection cache results from the cache tables of the respective data sources, so that the data connection cache results can be recorded in cache tables and the data processing task can be executed by reusing them, which helps improve the execution efficiency of the task.
In one exemplary embodiment, obtaining data connection results from respective data connection cache results and performing data processing tasks based on the data connection results includes: combining the data connection caching results to obtain a data connection result; and acquiring data to be processed aimed at by the data processing task from the data connection result, and executing the data processing task based on the data to be processed.
The data to be processed is the data targeted when the data processing task is executed, that is, the data at which the data processing task is directed. Specifically, the server may combine the data connection cache results obtained from the respective data sources to obtain a data connection result. In a specific application, each data connection cache result may be a partial result of the table connection, and a complete table connection result can be obtained by merging these partial results. When executing the data processing task based on the data connection result, the server acquires, from the data connection result, the data to be processed targeted by the data processing task, and executes the data processing task based on the acquired data. In a specific implementation, different data processing tasks directed at the same data connection result may process different data, that is, different data processing tasks may correspond to different data to be processed.
In this embodiment, the server merges the data connection cache results to obtain a data connection result, acquires from it the data to be processed targeted by the data processing task, and executes the data processing task based on that data. Executing the data processing task with a data connection cache result obtained by pre-connection simplifies the data connection processing and improves the processing efficiency of executing the data processing task.
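As a minimal sketch of the merging described above, the following code assumes that each data source's connection cache result is held as a list of row dictionaries sharing a join key; the function names, the "id" key and the field layout are illustrative assumptions, not part of the claimed method.

```python
# A minimal sketch, assuming each data connection cache result is a list of
# row dicts sharing an "id" join key; all names here are illustrative only.
def merge_connection_caches(cache_a, cache_b, join_key="id"):
    """Combine per-data-source cache results into one data connection result."""
    rows_b = {row[join_key]: row for row in cache_b}
    merged = []
    for row_a in cache_a:
        row_b = rows_b.get(row_a[join_key])
        if row_b is not None:
            combined = dict(row_a)
            combined.update(row_b)   # columns of both sources on one merged row
            merged.append(combined)
    return merged

def select_task_data(merged_rows, needed_columns):
    """Pick out only the to-be-processed data targeted by one data processing task."""
    return [{c: r[c] for c in needed_columns} for r in merged_rows]
```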
In an exemplary embodiment, as shown in fig. 5, the data processing task execution method further includes a process of caching a data connection cache result, specifically includes steps 502 to 506, where:
step 502, when it is determined that the data processing task does not meet the buffer multiplexing determination condition, data to be connected for the data processing task are respectively obtained from at least two data sources.
The data to be connected refers to the data involved when the data of the respective data sources are connected; different data sources can have different data to be connected, and different connection modes can also correspond to different data to be connected. Optionally, when it is determined that the data processing task does not meet the cache multiplexing determination condition, this indicates that the task cannot be executed using cached data and that the data connection processing needs to be performed in full; the server may then obtain the data to be connected for the data processing task from the at least two data sources respectively. Specifically, the server may determine the targeted data tables based on the data processing task, and obtain the targeted data to be connected from the at least two data sources according to those data tables.
And step 504, connecting the data to be connected to obtain a data connection result, and executing a data processing task based on the data connection result.
Specifically, the server may connect each piece of to-be-connected data, and specifically may connect each piece of to-be-connected data according to a data connection manner determined by the data processing task, e.g., may perform table connection on each piece of to-be-connected data according to a table connection manner determined by the data processing task, so as to obtain a data connection result. The server may perform the data processing task for the data based on the data connection result, e.g., may acquire data to be processed from the data connection result, and perform the data processing task based on the acquired data to be processed.
Step 506, obtaining respective data connection caching results of the at least two data sources according to the data connection results, and storing the data connection caching results in the corresponding data sources.
The data connection cache result comprises the connection data in the corresponding data source, and can be obtained by splitting the data connection result according to data source. The server may obtain the respective data connection cache results of the at least two data sources according to the data connection result; specifically, the server may split the data connection result by data source to determine the data connection cache result of each data source, and then store each obtained data connection cache result in its corresponding data source.
In this embodiment, for a data processing task that does not meet the cache multiplexing determination condition, the server acquires the data to be connected from each data source and then executes the data processing task, obtains the data connection cache results based on the data connection result, and stores them in the corresponding data sources. By caching the data connection cache results in this way, a subsequent data processing task that meets the cache multiplexing determination condition can be executed using the data connection cache result obtained by pre-connection, which simplifies the data connection processing and thus improves the processing efficiency of executing data processing tasks.
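The following sketch illustrates one way of splitting a data connection result back into per-data-source cache results, assuming each source owns a known set of columns plus the join key; the function and parameter names are hypothetical.

```python
# Illustrative sketch: split a data connection result into per-data-source
# connection data, assuming each source owns a known set of columns plus the
# join key. Names are assumptions made for this example only.
def split_connection_result(connection_rows, source_columns, join_key="id"):
    """Return {source_name: rows containing only that source's columns}."""
    per_source = {name: [] for name in source_columns}
    for row in connection_rows:
        for name, cols in source_columns.items():
            per_source[name].append({c: row[c] for c in [join_key, *cols]})
    return per_source

# Each partial result would then be written into the cache table created
# inside the corresponding data source.
```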
In one exemplary embodiment, storing data connection cache results in respective data sources includes: based on the data source information of at least two data sources, respectively creating respective cache tables for the at least two data sources; and storing the data connection cache result into a cache table of a corresponding data source in the at least two data sources.
The data source information is used for describing the data source, and specifically may include a data source identifier of the data source. Specifically, the server may obtain the respective data source information of each data source, for example, may obtain the respective data source name of each data source, and the server may create respective cache tables for each data source based on the respective data source name of each data source. For example, the server may create a respective cache table of each data source according to the format of "fixed prefix+data source information+suffix sequence number", where the data source information may be used to mark each data source, and specifically may include a data source name, data source hash information, and the like, and the created cache table is used to store a corresponding reusable cache result, that is, to cache a data connection cache result. For each data connection cache result, the server stores the data connection cache result into a cache table of a corresponding data source of the at least two data sources.
In this embodiment, the server creates the cache table based on the data source information of the data source to store the data connection cache result, so that the data connection cache result can be cached, so that the data processing task can be executed by using the data connection cache result obtained by the pre-connection for the data processing task meeting the cache multiplexing determination condition, the data connection processing process can be simplified, and the processing efficiency of executing the data processing task can be improved.
In an exemplary embodiment, the data processing task execution method further includes: when the task configuration information of the data processing task does not include a reusable cache address associated with the data processing task, when no cache management table exists at the reusable cache address, or when the cache management table does not include a cache table for recording a reusable cache result, determining that the data processing task does not meet the cache multiplexing determination condition.
Wherein the reusable cache address is an address for storing the data connection cache result. The data connection caching result can be recorded through a caching table, the caching table can be managed through a caching management table, the corresponding caching management table can be queried based on the reusable caching address, and the corresponding data connection caching result is obtained from the caching table of the caching management table.
Optionally, the server may obtain and parse the task configuration information of the data processing task. When the task configuration information does not include a reusable cache address associated with the data processing task, this indicates that no reusable cache address has been configured for the task and that no reusable intermediate result exists for it, and it may be determined that the data processing task does not meet the cache multiplexing determination condition. If the server determines that the task configuration information includes a reusable cache address associated with the data processing task but no cache management table exists at that address, this indicates that no cache management table has been configured for the task, and it may likewise be determined that the data processing task does not meet the cache multiplexing determination condition. Finally, when the task configuration information includes a reusable cache address associated with the data processing task and a cache management table exists at that address, but the cache management table does not include a cache table for recording a reusable cache result, this indicates that no reusable intermediate result has been cached for the task, and the server may determine that the data processing task does not meet the cache multiplexing determination condition.
In this embodiment, when the task configuration information of the data processing task does not include a reusable cache address associated with the data processing task, when no cache management table exists at the reusable cache address, or when the cache management table does not include a cache table for recording a reusable cache result, the server determines that the data processing task does not satisfy the cache multiplexing determination condition. A multi-level determination can thus be performed based on the reusable cache address, the cache management table and the cache table, ensuring the accuracy of the cache multiplexing determination.
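The multi-level determination can be pictured as the following sketch, under the assumption that the task configuration, the management table lookup and the cache table lookup are exposed by the hypothetical helpers passed in; none of these names come from the original description.

```python
# Sketch of the multi-level cache multiplexing determination; the two callables
# stand in for the management table and cache table lookups described above.
def satisfies_cache_reuse(task_config, load_management_table, find_cache_table):
    address = task_config.get("reusable_cache_address")
    if not address:                        # level 1: no reusable cache address configured
        return False
    management_table = load_management_table(address)
    if management_table is None:           # level 2: no cache management table at the address
        return False
    cache_table = find_cache_table(management_table, task_config.get("cache_table_name"))
    return cache_table is not None         # level 3: a cache table recording a reusable result
```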
In one exemplary embodiment, determining a data processing task to be performed includes: acquiring a task set; the task set comprises at least two data processing tasks to be grouped; grouping each data processing task to be grouped according to the data connection information of the data processing tasks to be grouped to obtain at least one task group; the data processing tasks to be grouped, which belong to the same task group, have the same data connection information; a data processing task to be performed is determined from at least one task group.
The task set includes at least two data processing tasks to be grouped, where a data processing task to be grouped is a data processing task that has not yet undergone grouping processing. The data connection information is used to describe the connection processing required when executing a data processing task to be grouped, and may specifically include the data to be connected, the connection mode, and the like. Tasks with the same data connection information obtain the same data connection result after connection.
For example, the server may obtain a task set including a plurality of data processing tasks to be grouped, and group the data processing tasks in the task set. Specifically, the server may determine the respective data connection information of each data processing task to be grouped, which may include the data tables to be connected and the table connection mode, and group the tasks according to this data connection information, so as to obtain at least one task group, where the data processing tasks to be grouped that belong to the same task group have the same data connection information. The server may then determine the data processing task to be executed based on the at least one task group obtained; specifically, it may obtain the data processing task to be executed from one of these task groups.
In this embodiment, the server groups the data processing tasks to be grouped in the task set according to their data connection information, so that tasks with the same data connection information fall into the same task group, and determines the data processing task to be executed from a task group. The data processing tasks in the same group can then be executed using the data connection cache result obtained by pre-connection, which simplifies the data connection processing and thus improves the processing efficiency of executing data processing tasks.
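A minimal grouping sketch follows; the dictionary keys "tables" and "join_condition" are assumed stand-ins for the data connection information described above, not mandated fields.

```python
# Illustrative grouping of to-be-grouped data processing tasks by their data
# connection information (tables to connect plus the connection mode).
from collections import defaultdict

def group_tasks(task_set):
    groups = defaultdict(list)
    for task in task_set:
        key = (
            tuple(sorted(task["tables"])),   # data tables to be connected
            task["join_condition"],          # table connection mode
        )
        groups[key].append(task)
    # Tasks in the same group can reuse one pre-connected data connection result.
    return list(groups.values())
```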
In an exemplary embodiment, the data processing task execution method further includes: and deleting the data connection cache results stored in each of the at least two data sources when the clearing judgment condition for the data connection cache results is met.
The clearing determination condition is used to decide whether to delete the data connection cache results, and can be set flexibly according to actual needs. For example, when all the data processing tasks directed at the same data connection result have been executed, the clearing determination condition can be regarded as satisfied. Specifically, the server may monitor the clearing determination condition for the data connection cache results; when it detects that the condition is satisfied, the data connection cache results are no longer of use, and the server may delete the data connection cache results stored in each of the at least two data sources to release space. In a specific application, the server deletes the data connection cache results stored in each data source, thereby clearing the intermediate results cached in each data source and releasing the storage space of the data sources in time.
In this embodiment, when the condition of clearing and determining the data connection cache result is satisfied, the server deletes the data connection cache result stored in each data source, so that the storage space in each data source can be released in time.
The present application also provides an application scenario in which the above data processing task execution method is applied. Specifically, the application of the data processing task execution method in this scenario is as follows:
In an actual application scenario, data is stored and maintained independently among different departments of different institutions. Traditional OLAP (online analytical processing) needs to bring scattered data together, which not only causes a certain degree of plaintext data leakage but also introduces potential data security hazards. Because of user privacy protection, data protection regulations, business interests and other factors, data scattered across different institutions cannot be directly concentrated, so the phenomenon of "data islands" is ubiquitous. Privacy protection refers to taking measures to protect the information and identities of individuals from access, use or disclosure by unauthorized persons; it may include encryption, anonymization, access control, data deletion and other technical and policy measures, and is critical to protecting individuals' rights, preventing identity theft, preventing malicious activity and building trust. A data island arises when data lacks correlation across systems and the databases involved are incompatible with each other.
One solution is to use privacy computing technology to perform secure joint analysis, in which the data being processed remains invisible, ensuring its secure use. However, the additionally introduced complex cryptographic computation and communication processes make the computation too slow to meet services with strict speed requirements, so reducing task time consumption is a key problem for putting the privacy computing scheme into practical service.
Specifically, the secure joint analysis scheme introduces a complex cryptographic calculation process and an intensive communication process, so its operation speed is low. As shown in fig. 6, in the SQL task execution flow of secure joint analysis, that is, in a conventional SQL (Structured Query Language) task main flow, the data of each database participant is used only in its own computing engine. Specifically, for the A-side database 1 (TDW, Tencent distributed Data Warehouse) and the B-side database 2 (Hive, a data warehouse tool), the following phases are performed in sequence: a read input data phase (Reader), a Sub Query phase that filters the input data, a Join (PSI, Private Set Intersection) phase that performs secure sample alignment through the private set intersection technique, a Group By phase that performs federated grouping, a result field calculation phase for non-sensitive data (Receive Results, Transfer Results), and a result output phase (Writer). The dashed arrows connecting the PSI, Group By and result transmission phases of the two sides indicate that the parties need to communicate through the network. The Join (PSI) stage is the most time-consuming because of the complexity of the federated algorithm and its intensive network communication requirements; statistics show that the Join (PSI) stage accounts for more than 60% of the total task time, which is why secure joint analysis runs slowly.
Secure joint analysis means that, in scenarios where the original data is not allowed to leave its local environment but joint analysis with partner data is required, multiparty secure computation techniques keep the data usable but invisible; each party computes locally on its own data, and only the aggregated result is provided to the task initiator, never any raw data from the other participants. Secure sample alignment refers to the process of matching or aligning data in different data sources without exposing sensitive information; it is an important issue in the privacy protection field, with common application scenarios including medical data sharing, financial fraud detection, and the like. To achieve secure sample alignment, commonly employed techniques include homomorphic encryption, secure multiparty computation, differential privacy, and the like. A table Join is used to combine rows from two or more tables; in a privacy computing scenario, a virtual wide table T is generated using homomorphic encryption, secure multiparty computation, differential privacy and similar techniques, with each participant holding its respective data portion of T.
In addition, the secure joint analysis scheme has the problem of repeated calculation links. Task statistics show that the calculation links of multiple independent tasks are often repeated, and the highly time-consuming Join stage is frequently included in these repeated links. As shown in fig. 7, for task 1, after the game table and the flow table are joined on id, the sum of the push_count column is computed grouped by the gamename column (the corresponding pseudocode may be select gamename, sum(push_count) from game join flow on game.id=flow.id group by gamename). In a specific execution, the original data of the game table and the flow table need to be obtained from two different databases, where the game table includes a gamename column and an id column, and the flow table includes an id column, a date column and a push_count column. In the intermediate process of executing task 1, the game table and the flow table need to be joined (Join) to obtain a table connection result, specifically a virtual wide table including a gamename column, a date column and a push_count column; based on this virtual wide table, grouping by the gamename column and summing the push_count column yields the execution result of task 1.
Similarly, for task 2, after the game table and the flow table are joined on id, the count of the gamename column is computed grouped by the date column (the corresponding pseudocode may be select date, count(gamename) from game join flow on game.id=flow.id group by date). When task 2 is executed, just as for task 1, the game table and the flow table need to be joined on id to obtain the same virtual wide table, which is then grouped by the date column and the count statistic computed on the gamename column, yielding the execution result of task 2. It can be seen that the two independent tasks 1 and 2 use the same two tables and join them in the same way, so the stages of reading the original data and the Join in the calculation links of the two tasks are repeated. Therefore, in scenarios where a group of tasks needs to perform grouped calculation on the same Join stage result, multiplexing the Join stage result on the premise of ensuring data security can significantly improve the task execution speed of the secure joint analysis scheme.
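The two tasks can be written out side by side as SQL, following the pseudocode described above; this is only an illustrative reconstruction of those statements, held here as plain strings.

```python
# The two independent tasks described above share an identical join of the
# same game and flow tables, so the Join stage result can be cached once
# and reused by both.
TASK_1 = """
SELECT gamename, SUM(push_count)
FROM game JOIN flow ON game.id = flow.id
GROUP BY gamename
"""

TASK_2 = """
SELECT date, COUNT(gamename)
FROM game JOIN flow ON game.id = flow.id
GROUP BY date
"""
```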
The schemes for protecting the security of the computing process in existing privacy computing technology also include a closed-domain technical scheme, implemented as follows: maintain dedicated storage and computing clusters and configure the related application groups and cluster associations; set up the computing cluster so that jobs submitted on it can only write results into the closed-domain storage cluster and cannot write data to other clusters; set IP tables (Internet Protocol address tables) for the storage cluster so that only the computing cluster in the closed domain can access the storage nodes, and computing nodes of external clusters cannot access the data in the closed domain; optimize the export component so that when data in the closed domain needs to be exported an audit operation is triggered, and the data can be exported only after a system administrator's audit passes; place relatively private or sensitive data in the closed domain and authorize the application group to ensure it has access to the data in the closed domain. However, the closed-domain technical scheme needs to store the sensitive data of different participants centrally in the closed domain, so the cost of maintaining data consistency is high; and exporting calculation results from the closed domain triggers manual audits, whose cost is also high.
Based on the above, and aiming at the long time consumption of privacy computing, this embodiment provides a Join result multiplexing method for privacy computing based on the caching idea, which multiplexes the results of the Join stage, on the premise of ensuring data security, for a group of tasks that need to perform grouped calculation on the same Join stage result. The method includes three parts: caching the Join result, multiplexing the Join result and cleaning the Join result cache; it multiplexes task intermediate results composed of data from multiple participants and, by introducing a management table mechanism, ensures that other data tables are not affected when the Join result cache tables are operated on. The caching idea is that if some resource or data is frequently used, it can be cached: each time an operation is performed, the cache is checked first; if the data exists it is used directly, otherwise it is obtained and placed into the cache so that the next access can read it directly from the cache, saving a great deal of time. Caching is a typical scheme of trading space for time. With the data processing task execution method provided by this embodiment, each participant stores data locally without centralizing it in a closed domain, which fundamentally avoids the risk of data leakage; high data maintenance costs and manual auditing costs are avoided; and, when repeated calculation links exist among independent tasks, the task execution speed is improved.
Specifically, the data processing task execution method provided in this embodiment may be applied to a federated learning platform system, and in particular to video Apps (Applications), game platforms, financial services, and the like. For example, the payment research and development department and the technical architecture department of a social application both need to use each other's data to meet the requirements of a financial credit business, but to ensure data security the two departments cannot exchange data directly. The payment research and development department, the technical architecture department and the data platform department can therefore jointly form a financial federated learning joint project, in which the federated learning platform system of the data platform department provides financial federated application services that keep data usable but invisible and meet security inspection requirements, so that neither party's local data leaves its premises while the data usage requirements of the financial credit business are satisfied. Moreover, the federated learning platform system refines the federated learning and joint analysis functions specific to the financial credit business, optimizes the performance of federated algorithms for financial credit scenarios in a targeted manner, and continuously improves the capabilities available to users.
Specifically, the data processing task execution method solves the problem of executing independent tasks that contain repeated calculation links when secure joint analysis is performed in a privacy computing scenario, and in particular adopts a Join result multiplexing scheme to reduce the unnecessary time consumption caused by repeatedly computing the Join process. The data processing task execution method provided by this embodiment provides a privacy-preserving, whole-process Join result multiplexing scheme for reusing the results of repeated calculation links, which comprises the following four parts: a management table mechanism, concerning how to prevent the cache tables under the cache path from affecting the existing data tables; caching Join results, concerning how to cache Join results during task execution on the premise of ensuring data security; multiplexing Join results, concerning how to find and reuse previously cached Join results during task execution; and cleaning the Join result cache, concerning how to safely clean up a Join result once it no longer has use value.
In particular, as shown in fig. 8, in the process flow of Join result multiplexing, for a data processing task to be executed, the server may determine whether a buffer table address is configured for the data processing task, that is, determine whether the data processing task meets a buffer multiplexing determination condition, if it is determined that the buffer table address is configured, consider that it is determined that the data processing task meets the buffer multiplexing determination condition, the server may further determine whether to clear a buffer table, and if it is required to clear the buffer table, the server may clear a table connection (Join) result buffer. If the cache table does not need to be cleaned, the server further determines whether the cache table is cached, if the cache table exists, the server can directly multiplex the table connection result, execute the data processing task based on the multiplexed table connection result, and particularly can perform operations such as grouping calculation according to the data processing task requirement based on the multiplexed table connection result. If the cache does not exist, the server can read the data, perform table connection operation, cache the table connection result, and further perform operations such as grouping calculation according to the data processing task requirement based on the table connection result.
Further, for a set of tasks containing repeated calculation links, the overall flow when using the Join result multiplexing scheme is as follows. Assume that after a secure Join, the databases of parties A and B generate a virtual temporary wide table T containing only the intersection portion of parties A and B; as shown in fig. 9, party A holds the data of the id and group columns in table T, and party B holds the data of the id and value columns in table T. The tables stored by party A and party B respectively form the virtual table T, and each record of the virtual table T is linked through the id columns of parties A and B, which may be a single id or a joint id. Based on table T, the data interaction and calculation of the other privacy computing algorithm protocols that follow the Join can be performed.
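A minimal sketch of the two halves of the virtual wide table T follows; the values are made up purely for illustration.

```python
# Sketch of the virtual wide table T after a secure Join: party A holds the id
# and group columns, party B holds the id and value columns, and records are
# linked only through the shared id column. Values are illustrative.
party_a_part = [
    {"id": 1, "group": "g1"},
    {"id": 2, "group": "g2"},
]
party_b_part = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 20},
]
# Neither side materializes the other side's columns; later privacy computing
# protocol steps operate on T through these two local halves.
```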
In a specific implementation, task set initialization is needed: the task set is analyzed first and the tasks are grouped according to the data tables used and the table connection mode, so that tasks that obtain the same temporary wide table T after the secure Join are placed in one group; multiple tasks in the same group can then reuse the same wide table T. For a group containing only one task, the task is executed directly and Join result multiplexing is not involved. For tasks in the same group, the cache table addresses Address A and Address B of parties A and B and the cache table name are configured respectively, so that repeated calculation of the Join result within each group of tasks is eliminated by caching the table connection result.
After initialization, the task set is divided into several groups. Taking a group containing N (N > 1) tasks as an example, the tasks in the group are executed in sequence; when computing resources are sufficient, only the first task needs to be executed first, and once its Join result is cached the remaining tasks can be executed simultaneously. In the configuration, the cache table addresses are set to Address A and Address B for party A and party B respectively.
When the first task is executed, there is no cached Join result at the cache table addresses, so the result caching is completed during task execution: the A part of the wide table T is stored in Address A, and the B part of the wide table T is stored in Address B.
When executing the second to Nth tasks, because the first task has cached Join results in the Address A and the Address B, the A side and the B side can load respective parts of the wide table T from the Address A and the Address B respectively, do not need to carry out a safe Join process, and directly carry out subsequent processes such as joint group calculation and the like on the wide table T according to a privacy calculation algorithm protocol.
After the execution of N tasks belonging to the same group is finished, the cached Join result has no use value, and the storage space occupied by the cache table in the Address A and the Address B can be released.
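The per-group execution flow just described can be summarized by the following sketch; the cache object and the two callables stand in for the caching and privacy computation mechanisms described in this embodiment and are purely illustrative.

```python
# Sketch of executing one group of N tasks with Join result multiplexing:
# only the first task performs the costly secure Join, the rest reuse it.
def run_group(tasks, cache, secure_join, run_aggregation):
    cached = None
    for task in tasks:
        if cached is None:
            cached = secure_join(task)   # first task: secure Join (PSI) over both parties
            cache.store(cached)          # cache the per-party parts of wide table T
        run_aggregation(task, cached)    # every task: joint group-by calculation on T
    cache.release()                      # all N tasks done: free the cache storage
```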
Further, to ensure the security of the other data tables under the cache table storage path, a management table mechanism is introduced: the meta information of the cache tables under the path is stored in a cache table management table. When data processing tasks are executed, each pair of data sources, namely the Host side and the Guest side, corresponds to one management table, which manages the Join results the two sides have cached in past tasks. Creating a management table for each pair of task participants prevents the cache tables of others from being misused; the pair of task participants then updates the meta information in the management table whenever a cache table is created, looked up or deleted. The cache table meta information stored in the management table may include the cache table name, the cache table creator, the creation time, the last use time, the header information of the cache table, and the like.
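A hypothetical record layout for one entry of the management table is sketched below; the field names follow the meta information listed above but are illustrative rather than prescriptive.

```python
# Illustrative record layout for one entry of the cache table management table.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class CacheTableMeta:
    name: str                  # cache table name
    creator: str               # cache table creator
    created_at: datetime       # creation time
    last_used_time: datetime   # last use time
    header: List[str]          # header (field names) of the cache table
```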
The management table itself may also affect other tables under the storage path; this effect can be reduced by using linear probing when creating and looking up the management table. As shown in fig. 10, to find the management table by linear probing, the management table name is first initialized in the format "fixed prefix + task participant information + suffix number", that is, the name consists of a fixed prefix, the task participant information and a suffix number; the suffix number is modified linearly during the lookup, and whether a table is the cache table management table is confirmed by verifying its header. The management table is searched for in the database to determine whether it exists. If it exists, the header information is further verified, that is, whether the header information is correct. If the header information is correct, the server may read the management table, obtain the cache table based on it, read the pre-cached data connection cache result from the cache table and execute the corresponding data processing task. If the header information is incorrect, the current table is not the management table; the suffix number of the management table name is incremented by one and the query and verification are repeated until the correct management table is obtained, so as to obtain the pre-cached data connection cache result and execute the corresponding data processing task. In addition, if the management table does not exist, the server may determine whether a management table needs to be created, and create it if necessary, so that the caching of the data connection cache result can be realized by reading the management table.
As a specific example of the linear probing method, assume a task has two participants, userA and userB. The initial management table name is "powerfl_sql_cache_management_userA_userB_0", where "powerfl_sql_cache_management_" is the fixed prefix, "userA_userB" is the task participant information, and the trailing "_0" is the suffix number. The suffix number is the part that is modified linearly.
Assuming that n rounds of lookup have already been performed, the management table name of the current round, that is, the (n+1)th round, is "powerfl_sql_cache_management_userA_userB_n". The lookup procedure is as follows:
Step 1, search the database for the management table by table name: search the database for a management table named "powerfl_sql_cache_management_userA_userB_n" for the current round; the search has two outcomes, found and not found. If the table is found, perform step 2; if not, perform step 3.
Step 2, compare the header information: obtain the header information of the found table, including the field name and field attributes of each field. The header of every management table is fixed and includes a string-type cache table name field name, a string-type cache table creator field creator, a time-type last use time field last_used_time, and the like, so the found header information can be compared with the expected header information.
If they are consistent, the correct management table has been found, and step 4 is performed; if not, the table is not the management table, the suffix number is incremented by one, and the search restarts from step 1 as the next round. The suffix number of the management table name searched in the next round is that of the current round plus one: if the suffix number of this round is n, the next round uses n+1. For example, when the suffix number of this round is 0, incrementing it changes "powerfl_sql_cache_management_userA_userB_0" into "powerfl_sql_cache_management_userA_userB_1".
Step 3, determine whether a management table currently needs to be created: if the step of caching Join results described in 3.2.3 is being performed, a management table named "powerfl_sql_cache_management_userA_userB_n" is created and the new management table is returned; if the step of multiplexing Join results in 3.2.4 or of cleaning the Join result cache in 3.2.5 is being performed, the absence of the management table indicates that no cache table exists, the linear search ends, and an empty table is returned.
Step 4, return the result: after the multi-round search, if a management table whose header matches expectations was found in step 2, that table is loaded as the search result, and the search ends.
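The lookup steps above can be condensed into the following sketch, assuming a storage interface with `table_exists`, `create_table` and `read_header` style helpers; all identifiers and the expected header are illustrative assumptions.

```python
# Sketch of the linear probing lookup for the management table.
EXPECTED_HEADER = ["name", "creator", "last_used_time"]

def find_management_table(storage, prefix, participants, create_if_missing):
    n = 0
    while True:
        table_name = f"{prefix}_{participants}_{n}"       # fixed prefix + parties + suffix number
        if not storage.table_exists(table_name):
            if create_if_missing:                         # caching Join results: create it
                storage.create_table(table_name, EXPECTED_HEADER)
                return table_name
            return None                                   # reuse / cleanup: no cache exists
        if storage.read_header(table_name) == EXPECTED_HEADER:
            return table_name                             # header matches: correct table found
        n += 1                                            # not the management table: probe next
```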
For the processing of caching the Join result: obtain the related parameter configuration, including the cache table addresses of the two parties and the state of the clean-cache switch; check the clean-cache switch, and if the cache table needs to be cleaned, execute the Join result cache cleaning process instead; load the cache table management table under the path using the linear probing method, creating it if it does not yet exist; calculate the actual storage name of the cache table; search the management table for the designated cache table and find that it does not exist (that is, when a task is executed for the first time, the cache table does not exist); execute the whole task flow and, after the Join stage is completed, store the result into the designated table under the designated path; and update the cache table management table by adding the information of the new cache table to it.
The reason the cache table name in the configuration item is not used directly as the actual storage name is to avoid the result failing to be cached correctly when the configured name duplicates an existing table name. A fixed prefix and a hash value of the participant names are added to the cache table name from the configuration item to obtain the cache table name actually stored under the storage path. For example, if the configured cache table name is "cacheTableName", the actual storage name is "powerfl_sql_cachetablename_1568604896", where "powerfl_sql" is the fixed prefix and "1568604896" is a hash value calculated from the participant information. The actual storage name of the cache table is calculated in the same way in the Join result multiplexing and cleaning processes.
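The following sketch shows one way such a storage name could be derived; the prefix and the particular hash function are assumptions made for illustration, not the exact calculation used by the system.

```python
# Illustrative computation of the actual storage name of a cache table:
# fixed prefix + configured name + hash of the participant information.
import zlib

def storage_name(cache_table_name, participants, prefix="powerfl_sql"):
    participant_hash = zlib.crc32("_".join(participants).encode("utf-8"))
    return f"{prefix}_{cache_table_name}_{participant_hash}"

# e.g. storage_name("cacheTableName", ["userA", "userB"]) yields a name of the
# same shape as "powerfl_sql_cachetablename_1568604896", though the hash value
# here is not the one quoted above.
```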
For the processing of multiplexing the Join result: obtain the related parameter configuration, including the cache table addresses of the two parties and the state of the clean-cache switch; check the clean-cache switch and, if the cache table needs to be cleaned, execute the Join result cache cleaning process; load the cache table management table under the path using the linear probing method, and if the management table does not exist, no cache table exists and the Join result cannot be multiplexed, so the caching Join result process is executed instead; determine whether the cache table exists in the cache table management table, and if no record of it exists, switch to the caching Join result process; read the designated cache table and load it into the task execution environment; and provide the data in the cache table as the Join result for use in subsequent stages. By multiplexing the Join result, the repeated computation link of the Join stage is skipped, and the time taken to load the cache table for multiplexing is almost negligible compared with the extremely time-consuming Join process.
For the process of cleaning the Join result cache, once the cache table no longer has use value it needs to be cleaned; the specific process is as follows: turn on the clean-cache switch in the task script; when the parameters are acquired and checked, the clean-cache switch is found to be on and the cache table address is set; load the cache table management table under the path using the linear probing method, and if the management table does not exist, no cache table exists and the cleaning work ends directly; determine whether the cache table exists in the cache table management table, and if not, the cleaning work ends directly; delete the cache table; delete the corresponding record in the cache table management table; and, if no record remains in the management table after deletion, delete the management table as well. The Join result cache cleaning work thus comprises two parts, cleaning the cache table and cleaning the management table, and through the management table mechanism this cleaning is completed without affecting the other data tables under the storage path.
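A condensed sketch of this cleaning flow follows; the storage helpers are placeholders for the management table mechanism described above and are not real APIs.

```python
# Sketch of the Join result cache cleaning flow.
def clean_join_cache(storage, management_table, cache_table_name):
    if management_table is None:
        return                                       # no management table: nothing is cached
    if not storage.has_record(management_table, cache_table_name):
        return                                       # no cache table record: nothing to clean
    storage.drop_table(cache_table_name)             # delete the cache table itself
    storage.delete_record(management_table, cache_table_name)
    if storage.is_empty(management_table):           # last record removed
        storage.drop_table(management_table)         # delete the management table too
```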
According to the data processing task execution method provided by this embodiment, the storage address of the cache table, the cache table name and the time of cleaning the cache table can all be determined by the user. In one specific application, configuration items need to be added when using an SQL data analysis system. As shown in fig. 11, in the front-end interface of the SQL data analysis system, lines 3-4 of the code are the cache table cleaning switch used to start the cache cleaning process; lines 5-8 describe the cache table addresses of both parties involved in the task, through which the pre-cached table connection result is obtained; and lines 10-14 are the user query statement script, specifically the user SQL script, used to perform joint analysis processing on the table connection result. The user may start the execution of the data processing task by triggering the run control. The task configuration items of the first to Nth tasks are the same, and the cache table cleaning switch is set to "true" only when the cleaning work is to be started: the (N+1)th task sets the cache table cleaning switch to "true". The data processing task execution method provided in this embodiment is not limited to the SQL data analysis system mentioned here and also covers a variety of other product forms.
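The configuration items described above might be laid out as in the following sketch; the key names and paths are hypothetical and do not reflect the actual front-end fields.

```python
# Hypothetical task configuration corresponding to the front-end settings above.
task_config = {
    "clean_cache_table": False,                              # "true" only in the (N+1)th task
    "cache_table_address_a": "hdfs://party_a/cache/path",    # party A cache table address
    "cache_table_address_b": "hdfs://party_b/cache/path",    # party B cache table address
    "cache_table_name": "cacheTableName",
    "sql": "SELECT gamename, SUM(push_count) FROM game JOIN flow "
           "ON game.id = flow.id GROUP BY gamename",
}
```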
The data processing task execution method provided by this embodiment reduces the total time consumption of a group of tasks that perform the same secure Join operation. Performance testing was performed on the cluster with hundred-million-scale tasks; the number of data partitions at runtime is 200, and the Spark computing engine resources are configured as 50 executors with 12G of memory each and two drivers with 45G of memory each. The measured task times are shown in Table 1 below. The main optimization of this method is reducing the time consumption of the Join stage; since the Join stage accounts for a different proportion of the time in different tasks, the improvement depends on the specific task type, but the execution speed of the various tasks is improved by more than 240%, which is a significant effect. Specifically, as shown in Table 1 below, Join is the table connection process, UDAF (user defined aggregation function) is aggregation according to a user-defined aggregation function, group by is the grouping process, and A and B may be two different data tables.
TABLE 1
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different moments; the execution order of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or with sub-steps or stages of those steps.
Based on the same inventive concept, the embodiment of the application also provides a data processing task execution device for realizing the above related data processing task execution method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the device for performing data processing tasks provided below may refer to the limitation of the method for performing data processing tasks hereinabove, and will not be repeated herein.
In one exemplary embodiment, as shown in fig. 12, there is provided a data processing task execution apparatus 1200, including: a task determination module 1202, a data source determination module 1204, a cache result acquisition module 1206, and a cache result processing module 1208, wherein:
a task determination module 1202 for determining a data processing task to be performed;
the data source determining module 1204 is configured to determine that the data processing task is executed corresponding to at least two data sources to be accessed when it is determined that the data processing task meets the buffer multiplexing determination condition;
a cache result obtaining module 1206, configured to obtain respective data connection cache results from at least two data sources respectively; the data connection caching result comprises connection data in corresponding data sources; the connection data is data belonging to corresponding data sources in the data aimed at connection in the process of connecting the data of at least two data sources respectively;
The buffer result processing module 1208 is configured to obtain a data connection result according to each data connection buffer result, and execute a data processing task based on the data connection result.
In one embodiment, the data source determining module 1204 is further configured to obtain task configuration information of the data processing task; when the task configuration information indicates that a reusable cache result associated with the data processing task exists, determining that at least two data sources to be accessed correspond to the data processing task when the data processing task is executed.
In one embodiment, the data source determining module 1204 is further configured to determine, when the task configuration information includes a reusable buffer address associated with the data processing task, a buffer management table according to the reusable buffer address; when the cache management table comprises a cache table for recording a reusable cache result, determining at least two data sources to be accessed when the data processing task is executed based on the reusable cache address.
In one embodiment, the data source determining module 1204 is further configured to determine a management table to be checked according to the reusable cache address, and obtain header information of the management table to be checked; and when the header information passes the verification of the management table, determining the management table to be verified as a cache management table.
In one embodiment, the data source determination module 1204 is further configured to determine a cache table name; the cache table name identifies the cache table used to record the reusable cache result; when the cache management table comprises a cache table name, determining at least two data sources to be accessed when the data processing task is executed based on the data source to which the reusable cache address belongs.
In one embodiment, the cache result obtaining module 1206 is further configured to determine respective cache tables of at least two data sources; the cache table is used for recording connection data in the corresponding data source; and respectively obtaining the data connection caching results of the at least two data sources from the caching tables of the at least two data sources.
In one embodiment, the buffer result processing module 1208 is further configured to combine the buffer results of each data connection to obtain a data connection result; and acquiring data to be processed aimed at by the data processing task from the data connection result, and executing the data processing task based on the data to be processed.
In one embodiment, the apparatus further comprises a to-be-connected data acquisition module, a to-be-connected data connection module and a cache processing module, wherein: the to-be-connected data acquisition module is used for respectively acquiring the data to be connected targeted by the data processing task from the at least two data sources when it is determined that the data processing task does not meet the cache multiplexing determination condition; the to-be-connected data connection module is used for connecting the data to be connected to obtain a data connection result and executing the data processing task based on the data connection result; and the cache processing module is used for obtaining the respective data connection cache results of the at least two data sources according to the data connection result and storing the data connection cache results into the corresponding data sources.
In one embodiment, the cache processing module is further configured to create respective cache tables for the at least two data sources based on data source information of the at least two data sources, respectively; and storing the data connection cache result into a cache table of a corresponding data source in the at least two data sources.
In one embodiment, the apparatus further includes a multiplexing condition determination module, configured to determine that the data processing task does not satisfy the cache multiplexing determination condition when the task configuration information of the data processing task does not include a reusable cache address associated with the data processing task, when no cache management table exists at the reusable cache address, or when the cache management table does not include a cache table for recording a reusable cache result.
In one embodiment, the task determination module 1202 is further configured to obtain a task set; the task set comprises at least two data processing tasks to be grouped; grouping each data processing task to be grouped according to the data connection information of the data processing tasks to be grouped to obtain at least one task group; the data processing tasks to be grouped, which belong to the same task group, have the same data connection information; a data processing task to be performed is determined from at least one task group.
In one embodiment, the method further comprises a buffer clearing module, configured to delete the data connection buffer result stored in each of the at least two data sources when a clearing determination condition for the data connection buffer result is satisfied.
The respective modules in the above-described data processing task execution device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one exemplary embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 13. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing data processing task data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of performing data processing tasks.
It will be appreciated by those skilled in the art that the structure shown in fig. 13 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use, and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that implementing all or part of the methods described above may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without being limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above embodiments represent only a few implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the present application. It should be noted that various modifications and improvements can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (16)

1. A method of performing a data processing task, the method comprising:
determining a data processing task to be executed;
when it is determined that the data processing task meets a cache reuse determination condition, determining at least two data sources to be accessed when the data processing task is executed;
acquiring a respective data connection cache result from each of the at least two data sources; each data connection cache result comprises connection data in the corresponding data source; the connection data is the data, among the data targeted for connection, that belongs to the corresponding data source in the process of connecting the data of the at least two data sources;
and obtaining a data connection result according to the data connection cache results, and executing the data processing task based on the data connection result.
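Purely for illustration (not part of the claims), a minimal, self-contained Python sketch of the flow recited in claim 1 might look as follows; all function, table, and column names are invented for this example, and each data source is modelled as a plain dictionary of tables.

```python
# Each "data source" is modelled as a dictionary of tables; helper and table
# names are invented for this example and do not appear in the patent text.

def get_cache_result(source: dict, cache_table: str):
    """Return the data connection cache result recorded in the source's cache table."""
    return source.get(cache_table, [])

def merge_cache_results(left_rows: list, right_rows: list, key: str) -> list:
    """Obtain a data connection result by joining the per-source cached rows on `key`."""
    right_index = {row[key]: row for row in right_rows}
    return [{**left, **right_index[left[key]]} for left in left_rows if left[key] in right_index]

# Two toy data sources, each already holding its own share of the connection data.
source_a = {"cache_join_orders": [{"uid": 1, "amount": 30}, {"uid": 2, "amount": 50}]}
source_b = {"cache_join_users": [{"uid": 1, "name": "alice"}, {"uid": 2, "name": "bob"}]}

cache_a = get_cache_result(source_a, "cache_join_orders")
cache_b = get_cache_result(source_b, "cache_join_users")
connection_result = merge_cache_results(cache_a, cache_b, key="uid")

# Execute the data processing task (here: a trivial aggregation) on the join result.
print(connection_result, sum(row["amount"] for row in connection_result))
```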
2. The method according to claim 1, wherein the determining, when it is determined that the data processing task meets the cache reuse determination condition, at least two data sources to be accessed when the data processing task is executed comprises:
acquiring task configuration information of the data processing task;
and when the task configuration information indicates that a reusable cache result associated with the data processing task exists, determining at least two data sources to be accessed when the data processing task is executed.
3. The method of claim 2, wherein the determining, when the task configuration information indicates that a reusable cache result associated with the data processing task exists, at least two data sources to be accessed when the data processing task is executed comprises:
when the task configuration information comprises a reusable cache address associated with the data processing task, determining a cache management table according to the reusable cache address;
and when the cache management table comprises a cache table for recording a reusable cache result, determining, based on the reusable cache address, at least two data sources to be accessed when the data processing task is executed.
4. The method of claim 3, wherein said determining a cache management table from said reusable cache address comprises:
determining a management table to be verified according to the reusable cache address, and acquiring header information of the management table to be verified;
and when the header information passes management table verification, determining the management table to be verified as the cache management table.
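As an illustrative sketch only, the header verification of claim 4 could take the following form; the header layout (a magic string plus a version number) is an assumption made purely for this example and is not prescribed by the claims.

```python
# Hypothetical header format: a magic string plus a version number.
EXPECTED_MAGIC = "CACHE_MGMT"
SUPPORTED_VERSIONS = {1, 2}

def verify_management_table(table_to_verify: dict):
    """Accept the table as the cache management table only if its header verifies."""
    header = table_to_verify.get("header", {})
    if header.get("magic") != EXPECTED_MAGIC:
        return None  # header information fails management table verification
    if header.get("version") not in SUPPORTED_VERSIONS:
        return None
    return table_to_verify  # verified: this is the cache management table

candidate = {"header": {"magic": "CACHE_MGMT", "version": 1},
             "tables": ["cache_join_daily_report"]}
print(verify_management_table(candidate) is not None)  # True
```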
5. The method according to claim 3, wherein the determining, when the cache management table includes a cache table for recording a reusable cache result, based on the reusable cache address, at least two data sources to be accessed when the data processing task is executed comprises:
determining a cache table name; the cache table name identifies a cache table for recording a reusable cache result;
and when the cache management table comprises the cache table name, determining, based on the data source to which the reusable cache address belongs, at least two data sources to be accessed when the data processing task is executed.
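The following hypothetical Python sketch illustrates the lookup described in claims 3 to 5: the cache table name is checked against the cache management table, and the data sources to access are derived from the data source to which the reusable cache address belongs; the naming scheme and address format are invented for illustration.

```python
# The cache table naming scheme and the address format ("<data source>/<path>")
# are hypothetical and only serve this illustration.

def find_data_sources(task: dict, cache_management_table: dict, reusable_cache_address: str):
    """Return the data sources to access if a reusable cache result is recorded, else None."""
    cache_table_name = f"cache_join_{task['name']}"  # identifies the reuse-cache table
    if cache_table_name not in cache_management_table.get("tables", []):
        return None  # no cache table recording a reusable result -> no reuse
    # the data source to which the reusable cache address belongs, plus the
    # sources named by the task's own connection information
    owning_source = reusable_cache_address.split("/", 1)[0]
    return sorted({owning_source, *task["sources"]})

task = {"name": "daily_report", "sources": ["orders", "users"]}
mgmt = {"tables": ["cache_join_daily_report"]}
print(find_data_sources(task, mgmt, "orders/cache/daily_report"))  # ['orders', 'users']
```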
6. The method of claim 1, wherein the obtaining respective data connection cache results from the at least two data sources, respectively, comprises:
determining a respective cache table for each of the at least two data sources; the cache table is used for recording connection data in the corresponding data source;
and obtaining the data connection cache result of each of the at least two data sources from its cache table.
7. The method according to claim 1, wherein obtaining a data connection result from each of the data connection cache results and performing the data processing task based on the data connection result comprises:
merging the data connection cache results to obtain a data connection result;
and acquiring, from the data connection result, the data to be processed that is targeted by the data processing task, and executing the data processing task based on the data to be processed.
8. The method according to claim 1, wherein the method further comprises:
when it is determined that the data processing task does not meet the cache reuse determination condition, acquiring, from each of the at least two data sources, the data to be connected that is targeted by the data processing task;
connecting the data to be connected to obtain a data connection result, and executing the data processing task based on the data connection result;
and obtaining respective data connection cache results of the at least two data sources according to the data connection result, and storing each data connection cache result into the corresponding data source.
9. The method of claim 8, wherein storing the data connection cache results in the respective data sources comprises:
creating respective cache tables for the at least two data sources based on the data source information of the at least two data sources;
and storing each data connection cache result into the cache table of the corresponding data source among the at least two data sources.
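For illustration only, the write-back described in claims 8 and 9 could be sketched as follows; the per-source cache table naming scheme and the column ownership mapping are assumptions made for this example.

```python
# Hypothetical write-back: each data source receives a newly created cache table
# holding only the columns it owns; names and the ownership mapping are invented.

def store_cache_results(data_sources: dict, connection_rows: list, owned_columns: dict) -> None:
    """Create a cache table per data source and store its share of the join result."""
    for name, source in data_sources.items():
        columns = owned_columns[name]            # columns belonging to this data source
        cache_table = f"cache_join_{name}"       # derived from the data source information
        source[cache_table] = [{c: row[c] for c in columns} for row in connection_rows]

sources = {"orders": {}, "users": {}}
joined = [{"uid": 1, "amount": 30, "name": "alice"}]
store_cache_results(sources, joined, {"orders": ["uid", "amount"], "users": ["uid", "name"]})
print(sources)
# {'orders': {'cache_join_orders': [{'uid': 1, 'amount': 30}]},
#  'users': {'cache_join_users': [{'uid': 1, 'name': 'alice'}]}}
```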
10. The method of claim 8, wherein the method further comprises:
and determining that the data processing task does not meet the cache reuse determination condition when the task configuration information of the data processing task does not comprise a reusable cache address associated with the data processing task, when no cache management table exists at the reusable cache address, or when the cache management table does not include a cache table for recording a reusable cache result.
11. The method of claim 1, wherein the determining the data processing task to be performed comprises:
acquiring a task set; the task set comprises at least two data processing tasks to be grouped;
grouping the data processing tasks to be grouped according to their data connection information to obtain at least one task group; data processing tasks to be grouped that belong to the same task group have the same data connection information;
and determining a data processing task to be executed from the at least one task group.
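As an illustrative sketch of the grouping in claim 11 (the representation of data connection information is an assumption, not something prescribed by the claim), tasks sharing the same pair of data sources and join key fall into the same task group:

```python
# The structure of "data connection information" is assumed for this example:
# here it is the (unordered) pair of data sources plus the join key.
from collections import defaultdict

def group_tasks(tasks: list) -> list:
    """Tasks sharing the same data connection information end up in the same group."""
    groups = defaultdict(list)
    for task in tasks:
        connection_info = (tuple(sorted(task["sources"])), task["join_key"])
        groups[connection_info].append(task)
    return list(groups.values())

tasks = [
    {"name": "t1", "sources": ["orders", "users"], "join_key": "uid"},
    {"name": "t2", "sources": ["users", "orders"], "join_key": "uid"},
    {"name": "t3", "sources": ["orders", "items"], "join_key": "order_id"},
]
print([[t["name"] for t in g] for g in group_tasks(tasks)])  # [['t1', 't2'], ['t3']]
```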
12. The method according to any one of claims 1 to 11, further comprising:
and deleting the data connection cache results stored in the at least two data sources when a clearing determination condition for the data connection cache results is met.
13. A data processing task execution device, the device comprising:
the task determining module is used for determining a data processing task to be executed;
the data source determining module is used for determining at least two data sources to be accessed when the data processing task is executed, when it is determined that the data processing task meets a cache reuse determination condition;
the cache result acquisition module is used for acquiring a respective data connection cache result from each of the at least two data sources; each data connection cache result comprises connection data in the corresponding data source; the connection data is the data, among the data targeted for connection, that belongs to the corresponding data source in the process of connecting the data of the at least two data sources;
and the cache result processing module is used for obtaining a data connection result according to the data connection cache results and executing the data processing task based on the data connection result.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 12 when executing the computer program.
15. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 12.
16. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 12.
CN202311331276.3A 2023-10-13 2023-10-13 Method and device for executing data processing task, computer equipment and storage medium Pending CN117331975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311331276.3A CN117331975A (en) 2023-10-13 2023-10-13 Method and device for executing data processing task, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311331276.3A CN117331975A (en) 2023-10-13 2023-10-13 Method and device for executing data processing task, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117331975A true CN117331975A (en) 2024-01-02

Family

ID=89292861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311331276.3A Pending CN117331975A (en) 2023-10-13 2023-10-13 Method and device for executing data processing task, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117331975A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117851413A (en) * 2024-03-07 2024-04-09 腾讯科技(深圳)有限公司 Data operation method, device, electronic equipment, storage medium and program product
CN117851413B (en) * 2024-03-07 2024-06-07 腾讯科技(深圳)有限公司 Data operation method, device, electronic equipment, storage medium and program product


Similar Documents

Publication Publication Date Title
US11328003B2 (en) Data relationships storage platform
Jaseena et al. Issues, challenges, and solutions: big data mining
US10725981B1 (en) Analyzing big data
US10614248B2 (en) Privacy preserving cross-organizational data sharing with anonymization filters
US9652512B2 (en) Secure matching supporting fuzzy data
US9361320B1 (en) Modeling big data
US8346774B1 (en) Protecting network entity data while preserving network properties
Tarekegn et al. Big data: security issues, challenges and future scope
TW202025020A (en) Block chain-based content management system, method and device and electronic equipment
CN113127848A (en) Storage method of permission system data and related equipment
US20120310918A1 (en) Unique join data caching method
Zheng Database as a service-current issues and its future
US11968214B2 (en) Efficient retrieval and rendering of access-controlled computer resources
Irudayasamy et al. Parallel bottom-up generalization approach for data anonymization using map reduce for security of data in public cloud
WO2021118413A2 (en) Data processing method, comprising secure multilateral computing and data analysis methods
CN117331975A (en) Method and device for executing data processing task, computer equipment and storage medium
Gao et al. An efficient framework for multiple subgraph pattern matching models
Kaur et al. Comparison study of big data processing systems for IoT cloud environment
US11675751B2 (en) Systems and methods for capturing data schema for databases during data insertion
US20230367636A1 (en) System and method for determining memory resource configuration for network nodes to operate in a distributed computing network
Mayuri et al. A Study on Use of Big Data in Cloud Computing Environment
Kumar et al. Security Issues in Hadoop Associated with Big Data
Liu User identity linkage method based on user online habit
Zala et al. A survey on data mining and analysis in Hadoop and MongoDB
Suriakala A Comparative Study Of Different Types Of Nosql Datamodels And Databases

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication