CN117407430B - Data query method, device, computer equipment and storage medium - Google Patents

Data query method, device, computer equipment and storage medium

Info

Publication number
CN117407430B
Authority
CN
China
Prior art keywords
data
target condition
target
python
meet
Prior art date
Legal status
Active
Application number
CN202311670343.4A
Other languages
Chinese (zh)
Other versions
CN117407430A (en)
Inventor
汪翔春
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311670343.4A priority Critical patent/CN117407430B/en
Publication of CN117407430A publication Critical patent/CN117407430A/en
Application granted granted Critical
Publication of CN117407430B publication Critical patent/CN117407430B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/242 Query formulation
    • G06F 16/2433 Query languages
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A big data processing platform maintains a custom function created by a user based on the Python language. The platform receives a data query statement for a source data table and, by parsing it, determines that the data query statement contains a target condition statement, where the target condition statement includes a target condition and a result expression corresponding to the target condition, and the result expression calls the custom function. The platform identifies whether the data records in the source data table meet the target condition and, based on the identification result, generates a data mark characterizing whether the data records in the source data table meet the target condition. The data mark is transferred to a Python process for executing the custom function, so that, while computing over the source data table, the Python process filters out the data records that do not meet the target condition based on the data mark and executes the custom function only on the data records that meet the target condition.

Description

Data query method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a data query method, a data query device, a computer device, and a storage medium.
Background
In some big data processing platforms, problems may arise when executing SQL (Structured Query Language). Take Spark as an example: Spark is a fast, general-purpose engine for large-scale data processing. Spark provides users with many built-in operators, i.e., functions or methods for transforming and computing data sets, to implement various data processing and analysis operations.
Spark provides the PySpark interface to Python (a programming language) developers. A developer can use PySpark to submit the code of a custom function and register it in Spark SQL as a Python operator; the custom function can then be called by users in SQL statements, and Spark will use it for the corresponding calculations.
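For illustration, the following is a minimal sketch of how such a custom function might be submitted and registered through PySpark. The function name DotSimilarity, the similarity logic and the session setup are illustrative assumptions, not taken from this disclosure:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

def dot_similarity(poi, trademark):
    # Hypothetical similarity between two strings (illustrative only).
    union = set(poi) | set(trademark)
    common = set(poi) & set(trademark)
    return float(len(common)) / max(len(union), 1)

# Register the Python function so it can be called by name from Spark SQL statements.
spark.udf.register("DotSimilarity", dot_similarity, DoubleType())

After such a registration, an SQL statement like SQL_1 below can reference DotSimilarity by name.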
However, there are problems when custom functions are used in such platforms today. For example, a conditional statement may include a target condition specified by the user and an operation to be executed when the target condition is satisfied. If a Python custom function needs to be called in that operation, then when Spark actually executes the SQL, the Python operator is first made to compute over the full data in the data table, and other operators then screen out, from the results of that full computation, the calculation results corresponding to the data that satisfies the target condition. The prior art therefore has the defect that the Python operator performs redundant calculation on data that does not meet the target condition.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a data query method, apparatus, computer device, and storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided a data query method applied to a big data processing platform, where the big data processing platform maintains a custom function created by a user based on Python language, the method including:
receiving a data query statement of a source data table;
Analyzing the data query statement and determining a target conditional statement contained in the data query statement; the target condition statement comprises a target condition and a result expression corresponding to the target condition, wherein the result expression is used for calling the custom function;
Identifying whether the data records in the source data table meet the target condition, and generating a data mark representing whether the data records in the source data table meet the target condition based on the identification result;
And transmitting the data mark to a Python process for executing the custom function, so that the Python process performs filtering processing on the data records which do not meet the target condition based on the data mark in the process of calculating the source data table, and executing the custom function to calculate the data records which meet the target condition.
According to a second aspect of embodiments of the present specification, there is provided a data querying device for use in a big data processing platform having a user-defined function created based on Python language maintained therein, the device comprising:
the receiving module is used for receiving a data query statement of the source data table;
The analysis module analyzes the data query statement and determines a target condition statement contained in the data query statement; the target condition statement comprises a target condition and a result expression corresponding to the target condition, wherein the result expression is used for calling the custom function;
the generation module is used for identifying whether the data records in the source data table meet the target conditions or not and generating data marks representing whether the data records in the source data table meet the target conditions or not based on the identification result;
And the transmission module is used for transmitting the data mark to a Python process for executing the custom function, so that the Python process performs filtering processing on the data records which do not meet the target condition based on the data mark in the process of calculating the source data table, and executes the custom function to calculate the data records which meet the target condition.
According to a third aspect of embodiments of the present specification, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method embodiments of the first aspect are implemented when the computer program is executed by the processor.
According to a fourth aspect of embodiments of the present specification, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method embodiments of the first aspect described above.
The technical scheme provided by the embodiment of the specification can comprise the following beneficial effects:
In the embodiments of this specification, the big data processing platform parses the SQL. If the result expression corresponding to a target condition in a target condition statement of the SQL needs to call a Python custom function, the platform identifies whether each data record in the SQL's source data table meets the target condition and generates, for the data records, a data mark characterizing whether they meet the target condition. The data mark can be transferred to the Python process that executes the Python custom function, so that the Python process knows, based on the data mark, which data records require calling the custom function for calculation and which data records need not be calculated and can be filtered out directly. The Python process therefore does not need to perform a full computation over the source data table, and no redundant computation by the Python process occurs.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a code schematic diagram of Spark's execution plan for an SQL statement, according to an exemplary embodiment of this specification.
FIG. 2 is a flow chart illustrating a method of querying data according to an exemplary embodiment of the present description.
Fig. 3 is a hardware structure diagram of a computer device in which a data query apparatus is located, according to an exemplary embodiment of this specification.
Fig. 4 is a block diagram of a data querying device shown in accordance with an exemplary embodiment of the present description.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present description as detailed in the accompanying claims.
The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present description. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
User information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in this disclosure are information and data authorized by the users or fully authorized by all parties. The collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entries are provided for users to choose whether to authorize or to deny.
Taking Spark as an example, Spark provides users with many built-in operators, i.e., functions or methods for transforming and computing data sets, to implement various data processing and analysis operations. A user may also submit a custom function created based on the Python language, which in Spark may also be referred to as a Python operator.
For a user's SQL, Spark parses the SQL, performs logical optimization and physical optimization, and the actual execution of the SQL is finally completed by a number of operators in a certain order.
It has been found that Spark can be problematic when actually executing some SQL that contains conditional statements. For example, a conditional statement may include a target condition specified by the user and an operation to be performed when the target condition is satisfied. If a Python custom function needs to be called in that operation, then when Spark actually executes the SQL, the Python operator is first made to compute over the full data, and the other operators then screen out, from the results of that full computation, the calculation results corresponding to the data that satisfies the target condition; this causes the Python operator to perform redundant calculation on the data that does not meet the target condition.
Taking CASE WHEN as an example, assume that SQL_1 is used to select the fields poi and trademark from table1 and to generate a new computed field similarity_dot. The similarity_dot field is generated according to a target condition specified by the user: if "poi is not null and trademark is not null" is satisfied, the custom function DotSimilarity is called to calculate the similarity value of poi and trademark; otherwise the similarity value is set to 0.
SQL_1:
SELECT
    poi,                                                 -- select field poi
    trademark,                                           -- select field trademark
    (CASE
        WHEN poi IS NOT NULL AND trademark IS NOT NULL   -- target condition
        THEN DotSimilarity(poi, trademark)               -- call the custom function DotSimilarity when the target condition is met
        ELSE 0                                           -- default value when the target condition is not met
    END) AS similarity_dot                               -- computed field similarity_dot
FROM table1;                                             -- source data table table1
Spark parses this SQL as follows:
'Project ['poi, 'trademark, CASE WHEN (isnotnull('poi) AND isnotnull('trademark)) THEN 'DotSimilarity('poi, 'trademark) ELSE 0 END AS similarity_dot]
+- 'UnresolvedRelation [table1];
The meaning of the above code is: based on table1, three fields are to be fetched using the projection operator: poi, trademark and similarity_dot. For the field similarity_dot, its value is the return value of the function DotSimilarity if the target condition is satisfied, and 0 if the target condition is not satisfied.
Based on this parse, Spark's execution plan is as follows (a code schematic can be seen in FIG. 1; it can also be seen from the code that the CASE WHEN statement is extracted into the Project operator, whereas the Python operator does not carry that statement):
① First, the Python operator calculates the similarity of poi and trademark for every piece of data in table1, yielding a data table with an added similarity_dot field;
② Then, the projection operator Project performs a projection operation on the data table with the added similarity_dot field. The projection operator is a Spark built-in operator used to select specific columns from a data table; it takes poi, trademark and similarity_dot out of the data table, and for similarity_dot it applies the target condition: the similarity value is kept when the target condition is met, and the similarity_dot field of records that do not meet the target condition is set to the default value 0 specified in the SQL.
Therefore, when Spark executes, the Python operator is made to perform a full computation over the data table, and the projection operator is then used for screening. The target condition contained in the SQL already indicates which data records do not need the custom function called for them, but with Spark's native behavior the Python operator first computes over the full data and the projection operator filters afterwards, so the Python operator ends up performing redundant calculation.
In some scenarios, if the data records that do not meet the target condition do not cause the Python custom function to throw an error, the SQL can still be executed normally and the only cost is wasted performance. However, if a data record that, per the SQL, does not meet the target condition causes the Python custom function to throw an error, the above execution of the SQL by Spark will also cause the SQL query to fail.
One idea for solving the above problem is for the user to add filtering code to the custom function's own code, but this approach is inflexible and also makes the custom function hard to reuse.
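As a sketch of that workaround (assuming the DotSimilarity illustration from the Background section), the null check from the target condition ends up duplicated inside the function body:

def dot_similarity_guarded(poi, trademark):
    # The target condition from the SQL (poi/trademark not null) is duplicated here.
    if poi is None or trademark is None:
        # The SQL default value is hard-coded into the function, so changing the
        # default in the SQL, or reusing the function elsewhere, requires editing it.
        return 0.0
    union = set(poi) | set(trademark)
    common = set(poi) & set(trademark)
    return float(len(common)) / max(len(union), 1)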
Based on this, the embodiments of this specification provide a data query scheme: the big data processing platform parses the SQL, and if the result expression corresponding to a target condition in a target condition statement of the SQL needs to call a Python custom function, it identifies whether the data records in the SQL's source data table meet the target condition and generates for them a data mark characterizing whether they meet the target condition. The data mark can be transferred to the Python process that executes the Python custom function, so that, based on the data mark, the Python process knows which data records require calling the custom function for calculation and which data records need not be calculated and can be filtered out. The Python process therefore does not need to perform a full computation over the source data table, redundant computation by the Python process does not occur, and SQL query failures can be avoided.
FIG. 2 is a flow chart of a data query method according to an exemplary embodiment of this specification. The method may be applied to a big data processing platform that maintains a custom function created by a user based on the Python language, and may include the following steps:
in step 202, a data query statement is received for a source data table.
In step 204, the data query statement is parsed to determine a target conditional statement contained in the data query statement.
The target condition statement comprises a target condition and a result expression corresponding to the target condition, and the result expression is used for calling the custom function.
In step 206, it is identified whether the data record in the source data table meets the target condition, and a data flag is generated based on the identification result, which characterizes whether the data record in the source data table meets the target condition.
In step 208, the data tag is transferred to a Python process for executing the custom function, so that the Python process performs filtering processing on the data records that do not meet the target condition based on the data tag in the process of calculating the source data table, and executes the custom function to calculate for the data records that meet the target condition.
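To make steps 206 and 208 concrete, the following plain-Python sketch shows a 0/1 mark being generated per record and the Python side skipping unmarked records; the record layout, condition, custom function and default value are hypothetical, and this is not the platform's actual implementation:

def make_marks(records, target_condition):
    # Step 206: mark each record with 1 if it meets the target condition, else 0.
    return [1 if target_condition(record) else 0 for record in records]

def python_side_compute(records, marks, custom_fn, default=0):
    # Step 208: the Python process filters out unmarked records based on the marks
    # instead of running the custom function on the full table.
    results = []
    for record, mark in zip(records, marks):
        if mark == 1:
            results.append(custom_fn(record))   # record meets the target condition
        else:
            results.append(default)             # filtered out; no custom-function call
    return results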
The method of this embodiment can be applied to big data processing platforms including, but not limited to, Spark, Hadoop (a big data processing platform), MapReduce (a big data processing platform), and the like. For convenience of description, the following embodiments take Spark as an example. It should be emphasized that, although Spark is used for illustration, those skilled in the art will understand that the solution of this embodiment may be applied to other big data processing platforms, and this embodiment is not limited in this respect.
Taking application to Spark as an example, the method may be applied in Spark SQL, the component in Spark for structured data processing, and specifically in the Python start-up component under that component that is responsible for starting the Python process, such as Spark's BasePythonRunner component; this can be configured as needed in practical applications. Spark is written in the Scala language (a programming language) on top of the Java (a programming language) virtual machine, and it also provides support for other programming languages such as Python. Spark provides the PySpark interface to Python developers, who can use PySpark to submit the code of a custom function and register it with Spark SQL. In this embodiment, the custom function created by the user based on the Python language may be one, or may be multiple (i.e., at least two).
In some examples, for an SQL query submitted by a user, Spark may convert the SQL statement into a concrete execution job: for example, it may parse the SQL, convert the parse result into a logical execution plan, optimize the logical execution plan, convert the optimized logical execution plan into a physical execution plan, and so on, finally obtaining the execution job corresponding to the SQL. Optionally, the generation and optimization of the execution plan may be handled by Catalyst (a functional, relational query optimization framework) in the Spark SQL component, and this embodiment may obtain, from Catalyst's processing, the result of whether the data query statement in step 204 contains a target condition statement.
The target condition statement in this embodiment includes a target condition whose corresponding result expression needs to call a custom function; as examples, the target condition statement may be a CASE WHEN statement or an IF statement.
Taking the CASE WHEN statement as an example:
SELECT field 1,
    ……,
    (CASE
        WHEN condition 1 THEN UDF1
        WHEN condition 2 THEN UDF2
        ELSE 0
    END) AS similarity_dot    -- computed field similarity_dot
FROM source data table;
The "SELECT … … FROM" in the above SQL statement represents the source data table to be queried by the SQL.
A target condition statement may include at least one target condition; the above SQL statement includes two target conditions, namely condition 1 and condition 2.
"THEN … …" in "CASE WHEN … … THEN … … ELSE … …" is a result expression corresponding to the target condition; the result expression corresponding to the condition 1 in the SQL needs to call the custom function UDF1, the result expression corresponding to the condition 2 needs to call the custom function UDF2, and the result expressions respectively represent that the similarity_dot field of the data meeting the condition 1 in the source data table is the calculation result of UDF1, and the similarity_dot field of the data meeting the condition 2 in the source data table is the calculation result of UDF 2.
The default value "0" after "ELSE" in the above SQL indicates that the similarity_dot field of data in the source data table that satisfies neither condition 1 nor condition 2 is 0.
A CASE WHEN statement may be used for one or more condition determinations. An IF statement is similar to CASE WHEN and may also be used for one or more condition determinations; as an example:
SELECT field 1,
    ……,
    (IF condition 1 THEN UDF1
    ELSE 0
    END) AS similarity_dot
FROM source data table;
This SQL contains one target condition. The "THEN …" part of "IF … THEN … ELSE 0" is the result expression corresponding to the target condition; in this SQL, the result expression corresponding to condition 1 needs to call the custom function UDF1. The "0" after "ELSE" is the default value.
In Spark's native design, the Python operator is given the full data of the data table; the native Python operator does not support filtering, and it calculates, with the custom function, over all the data provided to it. This processing flow is not a problem if the custom function is not used inside a conditional statement. However, if the custom function is used in the result expression corresponding to a target condition in a conditional statement, the above-mentioned problems of redundant calculation or execution failure may occur.
Regarding the execution of the above SQL: because the SQL involves selecting fields from the data table, Spark uses a projection operator (Project operator) to perform the field-selection operation; and because the calculation result of the custom function needs to become a new field of the table, the execution of the Python operator is wrapped inside the projection operator, and Spark makes the Python operator execute first and the projection operator execute afterwards. Since the Python operator executes first and is closest to the source data table, its input is the full data of the table and it performs a full computation; the projection operator then processes the new field using the target condition and the default value.
In order for the Python operator not to perform unnecessary calculations, this embodiment provides the Python operator with a data mark indicating whether each data record satisfies the target condition. In step 206, this embodiment may identify whether the data records in the source data table satisfy the target condition and, based on the identification result, generate a data mark characterizing whether the data records in the source data table satisfy the target condition.
Based on this, in step 208, the data mark may be provided to the Python process used to execute the custom function; that is, the Python operator (the Python function) runs in the Python process, and in practice the process may perform the calculations of one or more Python operators depending on the number of custom functions involved in the SQL statement. The Python process may be started by a Python process start-up component in Spark; that component may apply the method of this embodiment at start-up, start a Python process capable of executing the Python custom function, and have the Python process, while computing over the source data table, filter out the data records that do not meet the target condition based on the data mark and execute the custom function to calculate only the data records that meet the target condition.
Optionally, for the task of executing SQL of the source data table, multiple identical Python processes may be started in Spark, and these processes may be run on different nodes of the cluster to perform parallel computation on different data slices of the source data table.
Optionally, step 208 requires communicating with the Python process and providing it with the data mark. The data mark may be passed to the Python process via Spark's ArrowEvalPythonExec operator, a built-in Spark operator that can be used to execute Python code in PySpark. Specifically, Python code fragments can be sent to Python interpreters, which can execute the Python code in parallel in a distributed environment and return the results.
There are many ways to generate the data marks. For example, a piece of data may be generated that stores an identifier of each data record together with the data mark of that record, and this data may be transmitted to the Python process separately. For example, the data marks may be in JSON (JavaScript Object Notation, a lightweight data-interchange format) format, and the JSON data recording the data mark of each data record in the source data table may be transmitted to the Python process. Optionally, during interaction with the Python process, the data recording the data mark of each data record in the source data table (such as the JSON data) may be exchanged separately and independently of the source data table itself.
In other examples, taking row-oriented storage as an example, a new column may be added to the source data table to obtain a new data table, and the newly added column serves as a data mark field storing the data mark that indicates whether each data record meets the target condition. Take the following Table 1 as an example:
Table 1

Column1    Column2    Mark
…          …          0
…          …          1
The source data table contains two columns, Column1 and Column2; a new column Mark is added, and the Mark field of each data record stores the data mark indicating whether that data record meets the target condition.
Optionally, Table 1 takes the case of a single custom function as an example, i.e., the result expressions in the SQL need to call only one custom function. In that case the data mark may include a tag value indicating whether the data record meets the target condition; the specific tag values can be configured flexibly as needed, for example "0" may indicate that the data record does not meet the target condition and "1" may indicate that it does.
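For the single-function case, a mark column like the one in Table 1 could be derived on the platform side roughly as follows. This is a PySpark sketch that reuses the session, table and condition from the earlier illustration; the table and column names are assumptions:

from pyspark.sql import functions as F

df = spark.table("table1")  # assumes the SparkSession and table from the earlier sketch
marked = df.withColumn(
    "Mark",
    F.when(F.col("poi").isNotNull() & F.col("trademark").isNotNull(), 1).otherwise(0),
)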
Optionally, in practical applications the SQL may involve multiple custom functions, and the Python process may execute all of them. In order to distinguish the data records to be calculated for each custom function, this embodiment handles the case of multiple custom functions as follows: the target condition statement includes multiple target conditions and a result expression corresponding to each of the target conditions, where the result expressions corresponding to the different target conditions call different custom functions;
The identifying whether the data record in the source data table meets the target condition and generating a data flag indicating whether the data record in the source data table meets the target condition based on the identification result may include:
Identifying whether the data record in the source data table meets each of the plurality of target conditions, and generating a data flag characterizing whether the data record in the source data table meets each of the plurality of target conditions based on the identification result.
For example, a CASE WHEN statement may include multiple target conditions; see the following example:
CASE
    WHEN condition 1 THEN UDF1
    WHEN condition 2 THEN UDF2
    ELSE default result
END
The above SQL involves two target conditions, and the two target conditions call different custom functions over the same data table. In order for the Python process to distinguish the data records to be calculated for each custom function, this embodiment may identify, for each data record, whether it meets each of the target conditions, and generate for each data record a data mark characterizing whether it meets each of the target conditions.
In other examples, the SQL may also include multiple CASE WHEN statements, and each CASE WHEN statement may include one or more target conditions; this embodiment is not limited in this respect.
Optionally, the data mark characterizing whether a data record meets each of the multiple target conditions may be implemented in several ways. Each data record may be associated with multiple data marks, each of which characterizes whether the data record meets one of the target conditions. In other examples, the data mark includes key-value pairs in which the key is the function identifier of a custom function and the value is a tag value indicating whether the data record satisfies the corresponding target condition.
Take the following table 2 as an example:
Table 2

Column1    Column2    Mark
…          …          {"udf1": 0, "udf2": 1}
…          …          …
In Table 2, the Mark field of each data record stores two pieces of information about whether the record meets the two target conditions, specifically two key-value pairs. Taking the Mark of the first data record as an example, it contains the two key-value pairs "udf1": 0 and "udf2": 1, where target condition 1 corresponds to udf1 and target condition 2 corresponds to udf2. The first key-value pair indicates that the first data record does not meet target condition 1, so udf1 does not need to be used to calculate it; the second key-value pair indicates that the record meets target condition 2 and needs to be calculated with udf2.
Based on this, the data recording the data mark of every data record in the source data table can be transmitted in one pass to the Python process that executes the multiple custom functions, without transmitting a separate data mark for each custom function. Optionally, the data mark may be stored in JSON format, and the JSON data recording the data mark of each data record in the source data table may be transmitted to the Python process. Of course, in practical applications the data mark may also take other formats, which this embodiment does not limit.
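As a sketch of the key-value form of the mark in Table 2, the per-record JSON could be built with PySpark roughly as follows, assuming the DataFrame df from the sketch after Table 1; the two conditions here merely stand in for target condition 1 and target condition 2 and are not taken from this disclosure:

from pyspark.sql import functions as F

cond1 = F.col("poi").isNotNull()        # stand-in for target condition 1 (udf1)
cond2 = F.col("trademark").isNotNull()  # stand-in for target condition 2 (udf2)

marked = df.withColumn(
    "Mark",
    F.to_json(F.struct(
        F.when(cond1, 1).otherwise(0).alias("udf1"),
        F.when(cond2, 1).otherwise(0).alias("udf2"),
    )),
)
# Each row's Mark is then a string such as {"udf1":0,"udf2":1}, matching Table 2.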
Optionally, the transferring the data tag to a Python process for executing the custom function, so that the Python process performs filtering processing on the data record that does not meet the target condition based on the data tag in the process of calculating the source data table, and executing the custom function to perform calculation on the data record that meets the target condition may include:
And transmitting the data mark to a Python process for executing the plurality of custom functions, so that the Python process executes the calculation of each custom function in the plurality of custom functions on the source data table respectively, and based on the data mark, filtering the data records which do not meet the target conditions corresponding to the custom functions, and executing the calculation of the custom functions on the data records which meet the target conditions corresponding to the custom functions.
That is, when each of the plurality of custom functions is used to calculate over the source data table, the data records that meet the filtering condition corresponding to that custom function are filtered out based on the mark of each data record, and the custom function is used to calculate the data records that do not meet its corresponding filtering condition (i.e., the records that meet its target condition).
When the data mark characterizes whether a data record meets each target condition, the Python process can determine, based on the data mark, which data records each custom function needs to calculate while it executes that custom function. In some examples, the function parameters of each custom function, which may be one or more fields of a data record, are transferred to the Python process, and the data mark corresponding to each custom function may be transferred together with those function parameters.
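A plain-Python sketch of how the Python side might use such per-function marks when dispatching records to several custom functions follows; the record layout, JSON mark format, function names and defaults are illustrative assumptions rather than the platform's actual protocol:

import json

def compute_with_marks(records, marks_json, udfs, defaults):
    # records: list of argument tuples; marks_json: per-record JSON marks such as
    # '{"udf1": 0, "udf2": 1}'; udfs: {"udf1": fn1, ...}; defaults: {"udf1": 0, ...}.
    results = {name: [] for name in udfs}
    for record, mark_str in zip(records, marks_json):
        marks = json.loads(mark_str)
        for name, fn in udfs.items():
            if marks.get(name) == 1:
                results[name].append(fn(*record))       # meets the target condition
            else:
                results[name].append(defaults[name])    # filtered; no function call
    return results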
In some examples, the target condition statement may also contain a default result;
the transferring the data tag to a Python process for executing the custom function, so that the Python process performs filtering processing on the data records that do not meet the target condition based on the data tag in the process of calculating the source data table, and executing the custom function to calculate the data records that meet the target condition may include:
And transmitting the data mark and the default result to a Python process for executing the custom function, so that the Python process determines the calculation result of the data records which do not meet the target condition as the default result based on the data mark in the process of calculating the source data table, and returns the calculation result of each data record in the source data table after executing the custom function on the data records which meet the target condition to obtain the calculation result.
As analyzed above, in "CASE WHEN … THEN … ELSE …" the part after "ELSE" carries the default result, and likewise in "IF … THEN … ELSE …" the part after "ELSE" carries the default result. The default result is the result corresponding to data records that do not meet the target condition and therefore do not need to be calculated by the custom function. The full data of the source data table is still transmitted to the Python process; the custom function is executed to calculate the data records that meet the target condition, while the data records that do not meet the target condition correspond to the default result. In this embodiment the default result is also transmitted to the Python process, so that the Python process still returns a calculation result for every data record in the source data table, with the calculation result of the records that do not meet the target condition being the default result.
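In a vectorized (pandas/Arrow) style, the same idea of returning a result for every record while computing only the marked ones might look like the sketch below, reusing the dot_similarity function from the earlier illustration; the signature, the assumption that poi, trademark and mark arrive as pandas Series, and the default value are all illustrative:

import pandas as pd

def apply_with_default(poi, trademark, mark, default=0.0):
    # Compute the custom function only where mark == 1; every other position keeps
    # the SQL default, so a result is still returned for every record in the slice.
    out = pd.Series(default, index=poi.index, dtype="float64")
    selected = mark == 1
    out[selected] = [
        dot_similarity(p, t) for p, t in zip(poi[selected], trademark[selected])
    ]
    return out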
Based on the calculation results of each data record returned by the Python process, and as can be seen from Spark's analysis of the SQL's execution plan, subsequent calculation can be carried out by the other operators to finish executing the SQL. Since the results returned by the Python process already include the calculation results of the records that meet the target condition and the default results of the records that do not, the subsequent operators can do less work: for example, they no longer need to screen based on the target condition, nor to set the results corresponding to records that do not meet the target condition to the default value.
As can be seen from the above embodiments, the big data processing platform parses the SQL, and if the result expression corresponding to a target condition in a target condition statement of the SQL needs to call a Python custom function, it identifies whether the data records in the SQL's source data table meet the target condition and generates a data mark characterizing whether they do. The data mark can be transferred to the Python process that executes the Python custom function, so that, based on the data mark, the Python process knows which data records require calling the custom function for calculation and which can be filtered out without calculation. The Python process therefore does not need to perform a full computation over the source data table, no redundant computation occurs, and SQL query failures can be avoided.
This embodiment pushes down to the Python operator the condition, specified by the user in the SQL, that identifies the data records for which the custom function need not be called, thereby preventing records that should not participate in the calculation from being calculated and reducing the overhead of Python function calls; it also avoids SQL query failures caused by exceptions thrown when the custom function is applied to data records that should not be calculated.
The data query method of the present embodiment may run in a computer device, including but not limited to a server, a cloud server, a server cluster, a tablet computer, a personal digital assistant, a laptop computer, or a desktop computer.
Corresponding to the foregoing embodiments of the data query method, this specification also provides embodiments of a data query device and of a computer device to which the data query device is applied.
The embodiments of the data query device of this specification may be applied to a computer device, such as a server or a terminal device. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device in the logical sense is formed by the processor of the computer device on which it is located reading the corresponding computer program instructions from a non-volatile memory into memory. In terms of hardware, FIG. 3 shows a hardware structure diagram of the computer device on which the data query device of this specification is located; besides the processor 310, memory 330, network interface 320 and non-volatile memory 340 shown in FIG. 3, the computer device on which the data query device 331 of this embodiment is located may further include other hardware according to its actual functions, which will not be described here.
As shown in fig. 4, fig. 4 is a block diagram of a data query apparatus according to an exemplary embodiment of the present disclosure, the apparatus being applied to a big data processing platform, the big data processing platform maintaining a custom function created by a user based on Python language, the apparatus comprising:
A receiving module 41 that receives a data query statement for a source data table;
The parsing module 42 parses the data query statement and determines a target conditional statement contained in the data query statement; the target condition statement comprises a target condition and a result expression corresponding to the target condition, wherein the result expression is used for calling the custom function;
a generation module 43 for identifying whether the data record in the source data table meets the target condition, and generating a data mark representing whether the data record in the source data table meets the target condition based on the identification result;
A transmission module 44, which transfers the data mark to a Python process for executing the custom function, so that the Python process performs filtering processing on the data records which do not meet the target condition based on the data mark in the process of calculating the source data table, and executes the custom function to calculate the data records which meet the target condition.
In some examples, the target conditional statement further includes a default result;
The transferring module 44 transfers the data tag to a Python process for executing the custom function, so that the Python process performs filtering processing on the data record that does not meet the target condition based on the data tag in the process of calculating the source data table, and executes the custom function to calculate the data record that meets the target condition, including:
And transmitting the data mark and the default result to a Python process for executing the custom function, so that the Python process determines the calculation result of the data records which do not meet the target condition as the default result based on the data mark in the process of calculating the source data table, and returns the calculation result of each data record in the source data table after executing the custom function on the data records which meet the target condition to obtain the calculation result.
In some examples, there are a plurality of the custom functions;
The target condition statement comprises a plurality of target conditions and result expressions respectively corresponding to each target condition in the plurality of target conditions, and the result expressions respectively corresponding to each target condition are used for calling different custom functions;
The generating module identifies whether the data record in the source data table meets the target condition, and generates a data mark representing whether the data record in the source data table meets the target condition based on the identification result, and the generating module comprises the following steps:
Identifying whether the data record in the source data table meets each of the plurality of target conditions, and generating a data flag characterizing whether the data record in the source data table meets each of the plurality of target conditions based on the identification result.
In some examples, the transferring module transfers the data tag to a Python process for executing the custom function, so that the Python process performs filtering processing on the data record that does not meet the target condition based on the data tag in the process of calculating the source data table, and executes the custom function to calculate for the data record that meets the target condition, including:
And transmitting the data mark to a Python process for executing the plurality of custom functions, so that the Python process executes the calculation of each custom function in the plurality of custom functions on the source data table respectively, and based on the data mark, filtering the data records which do not meet the target conditions corresponding to the custom functions, and executing the calculation of the custom functions on the data records which meet the target conditions corresponding to the custom functions.
In some examples, there is one custom function; the data mark includes a tag value indicating whether the data record satisfies the target condition.
In some examples, there are a plurality of the custom functions; the data mark comprises a key-value pair in which the key is the function identifier of the custom function and the value is a tag value indicating whether the data record meets the target condition.
In some examples, the target condition statement includes a CASE WHEN statement or an IF statement.
The implementation process of the functions and roles of each module in the data query device is specifically shown in the implementation process of the corresponding steps in the data query method, and will not be described herein.
Accordingly, the present description also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the foregoing data query method embodiments.
Accordingly, the embodiments of the present disclosure further provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the steps of the embodiments of the data query method are implemented when the processor executes the program.
Accordingly, the present description also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data query method embodiments.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The above-described embodiments may be applied to one or more computer devices, i.e., devices capable of automatically performing numerical calculation and/or information processing according to preset or pre-stored instructions; the hardware of a computer device includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer device may be any electronic product that can interact with a user, such as a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an interactive Internet Protocol television (IPTV), a smart wearable device, etc.
The computer device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The network in which the computer device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The above division of the method steps is made for clarity of description; when implemented, the steps may be combined into one step or split into multiple steps, and as long as they contain the same logical relationship they fall within the protection scope of this patent. Adding insignificant modifications or insignificant designs to the algorithm or flow, without changing the core design of the algorithm and flow, also falls within the scope of this application.
Where the description refers to "a specific example", "some examples", etc., it means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this specification. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims (10)

1. A data query method applied to a big data processing platform, the big data processing platform maintaining a custom function created by a user based on Python language, the method comprising:
receiving a data query statement of a source data table;
Analyzing the data query statement and determining a target conditional statement contained in the data query statement; the target condition statement comprises a target condition and a result expression corresponding to the target condition, wherein the result expression is used for calling the custom function;
Identifying whether the data records in the source data table meet the target condition, and generating a data mark representing whether the data records in the source data table meet the target condition based on the identification result;
And transmitting the data mark to a Python process for executing the custom function, so that the Python process performs filtering processing on the data records which do not meet the target condition based on the data mark in the process of calculating the source data table, and executing the custom function to calculate the data records which meet the target condition.
2. The method of claim 1, the target conditional statement further comprising a default result;
The transferring the data mark to a Python process for executing the custom function, so that the Python process performs filtering processing on the data records which do not meet the target condition based on the data mark in the process of calculating the source data table, and executing the custom function to calculate the data records which meet the target condition, including:
And transmitting the data mark and the default result to a Python process for executing the custom function, so that the Python process determines the calculation result of the data records which do not meet the target condition as the default result based on the data mark in the process of calculating the source data table, and returns the calculation result of each data record in the source data table after executing the custom function on the data records which meet the target condition to obtain the calculation result.
3. The method of claim 1, wherein there are a plurality of custom functions;
the target condition statement comprises a plurality of target conditions and a result expression corresponding to each of the plurality of target conditions, and the result expressions corresponding to the respective target conditions are used for calling different custom functions;
and the identifying whether data records in the source data table meet the target condition and generating, based on the identification result, a data flag representing whether the data records in the source data table meet the target condition comprises:
identifying whether the data records in the source data table meet each of the plurality of target conditions, and generating, based on the identification result, a data flag representing whether the data records in the source data table meet each of the plurality of target conditions.
4. The method according to claim 3, wherein the transmitting of the data flag to the Python process for executing the custom function, so that the Python process filters out data records which do not meet the target condition based on the data flag in the course of performing computation on the source data table and executes the custom function to perform computation on data records which meet the target condition, comprises:
transmitting the data flag to a Python process for executing the plurality of custom functions, so that when the Python process performs the computation of each of the plurality of custom functions on the source data table, it filters out, based on the data flag, data records which do not meet the target condition corresponding to that custom function, and performs the computation of that custom function on data records which meet the corresponding target condition.
5. The method of claim 1, wherein the number of custom functions is one, and the data flag comprises a tag value indicating whether a data record meets the target condition.
6. The method of claim 1, wherein there are a plurality of custom functions, and the data flag comprises a key-value pair whose key is a function identifier of a custom function and whose value is a tag indicating whether the data record meets the target condition.
7. The method of claim 1, wherein the target condition statement comprises a CASE statement or an IF statement.
8. A data query device applied to a big data processing platform, the big data processing platform maintaining a custom function created by a user based on the Python language, the device comprising:
a receiving module, configured to receive a data query statement for a source data table;
an analysis module, configured to analyze the data query statement and determine a target condition statement contained in the data query statement, wherein the target condition statement comprises a target condition and a result expression corresponding to the target condition, the result expression being used for calling the custom function;
a generation module, configured to identify whether data records in the source data table meet the target condition, and to generate, based on the identification result, a data flag representing whether the data records in the source data table meet the target condition; and
a transmission module, configured to transmit the data flag to a Python process for executing the custom function, so that the Python process, in the course of performing computation on the source data table, filters out data records which do not meet the target condition based on the data flag, and executes the custom function to perform computation on data records which meet the target condition.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of claims 1 to 7.
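As an illustrative aid only (not part of the claims), the following minimal Python sketch shows the kind of flow recited in claims 1, 2, and 7: a query whose CASE statement contains a target condition and a result expression calling a Python custom function, and engine-side generation of a per-record data flag. The query text, column names, table contents, and helper names (source_table, amount, py_udf, generate_data_flags) are hypothetical and do not come from the patent.

# Hypothetical illustration only: not the patented implementation.
# A query whose CASE statement (claim 7) contains a target condition
# ("amount > 100"), a result expression that calls a Python custom
# function ("py_udf(amount)"), and a default result (claim 2).
QUERY = """
SELECT id,
       CASE WHEN amount > 100 THEN py_udf(amount) ELSE 0 END AS score
FROM source_table
"""

# A toy "source data table": one dict per data record.
SOURCE_TABLE = [
    {"id": 1, "amount": 250},
    {"id": 2, "amount": 40},
    {"id": 3, "amount": 130},
]

def target_condition(record):
    """Engine-side evaluation of the target condition 'amount > 100'."""
    return record["amount"] > 100

def generate_data_flags(table):
    """Generate one data flag per record: True if the record meets the
    target condition, False otherwise (claim 1)."""
    return [target_condition(record) for record in table]

if __name__ == "__main__":
    flags = generate_data_flags(SOURCE_TABLE)
    print(flags)  # [True, False, True]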
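A second sketch, again hypothetical, illustrates the Python-process side described in claims 1 and 2: the process receives the data flags together with a default result, skips the custom function for records whose flag is false, and executes it only for records that meet the target condition. The UDF body and the default value of 0 are assumptions chosen for the example.

# Hypothetical illustration only: the UDF body, the default result, and the
# way flags reach the Python worker are assumptions made for this sketch.
def py_udf(amount):
    """Stand-in for a potentially expensive user-defined Python function."""
    return amount * 0.9

def run_udf_with_flags(records, flags, default_result=0):
    """Apply py_udf only to records whose data flag is True; records whose
    flag is False are filtered from UDF execution and receive the default
    result instead, mirroring the filtering step in claims 1 and 2."""
    results = []
    for record, flag in zip(records, flags):
        if flag:
            results.append(py_udf(record["amount"]))
        else:
            results.append(default_result)
    return results

if __name__ == "__main__":
    records = [{"id": 1, "amount": 250}, {"id": 2, "amount": 40}, {"id": 3, "amount": 130}]
    flags = [True, False, True]  # e.g. produced by the engine-side sketch above
    print(run_udf_with_flags(records, flags))  # [225.0, 0, 117.0]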

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311670343.4A CN117407430B (en) 2023-12-05 2023-12-05 Data query method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117407430A CN117407430A (en) 2024-01-16
CN117407430B CN117407430B (en) 2024-04-16

Family

ID=89491050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311670343.4A Active CN117407430B (en) 2023-12-05 2023-12-05 Data query method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117407430B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242059B2 (en) * 2014-05-30 2019-03-26 International Business Machines Corporation Distributed execution of expressions in a query
US11334552B2 (en) * 2017-07-31 2022-05-17 Palantir Technologies Inc. Lightweight redundancy tool for performing transactions

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999526A (en) * 2011-09-16 2013-03-27 阿里巴巴集团控股有限公司 Splitting and inquiring method and system of database relational table
CN106815353A (en) * 2017-01-20 2017-06-09 星环信息科技(上海)有限公司 A kind of method and apparatus of data query
CN110032574A (en) * 2019-03-07 2019-07-19 北京东方国信科技股份有限公司 The processing method and processing device of SQL statement
CN111190884A (en) * 2019-12-17 2020-05-22 腾讯科技(深圳)有限公司 Data verification method, device and computer readable storage medium
CN115048421A (en) * 2022-05-12 2022-09-13 北京京东振世信息技术有限公司 Data processing method, device, equipment and storage medium
CN115080607A (en) * 2022-06-20 2022-09-20 上海达梦数据库有限公司 Method, device, equipment and storage medium for optimizing structured query statement
CN115292350A (en) * 2022-08-02 2022-11-04 北京奥星贝斯科技有限公司 SQL statement processing method and device
CN116244367A (en) * 2022-12-29 2023-06-09 中国航天系统科学与工程研究院 Visual big data analysis platform based on multi-model custom algorithm
CN116521705A (en) * 2023-04-10 2023-08-01 华东师范大学 Data query method and device, storage medium and electronic equipment
CN116775488A (en) * 2023-06-26 2023-09-19 中国建设银行股份有限公司 Abnormal data determination method, device, equipment, medium and product
CN116881287A (en) * 2023-08-01 2023-10-13 支付宝(杭州)信息技术有限公司 Data query method and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Exploration and Practice of Library Smart Services Based on the SPARK Big Data Processing Platform; Song Weiwei; Xia Shaomo; Li Zan; Information Science; 2018-06-05 (No. 06); pp. 47-51 *

Also Published As

Publication number Publication date
CN117407430A (en) 2024-01-16

Similar Documents

Publication Publication Date Title
US10613971B1 (en) Autonomous testing of web-based applications
CN109376166B (en) Script conversion method, script conversion device, computer equipment and storage medium
CN110580189A (en) method and device for generating front-end page, computer equipment and storage medium
CN110427188B (en) Configuration method, device, equipment and storage medium of single-test assertion program
CN108460068B (en) Method, device, storage medium and terminal for importing and exporting report
CN113076104A (en) Page generation method, device, equipment and storage medium
US20190138433A1 (en) Evaluation of library test suites using mutation testing
US11275625B2 (en) System and method for automated application programming interface generation
KR102013657B1 (en) Apparatus for statically analyzing assembly code including assoxiated multi files
CN113901083A (en) Heterogeneous data source operation resource analysis positioning method and equipment based on multiple analyzers
CN115599359A (en) Code generation method, device, equipment and medium
CN113238739A (en) Plug-in development and data acquisition method, device, electronic equipment and medium
US10795650B2 (en) Code lineage tool
US20160125026A1 (en) Proactive query migration to prevent failures
CN117407430B (en) Data query method, device, computer equipment and storage medium
US10719424B1 (en) Compositional string analysis
CN109542890B (en) Data modification method, device, computer equipment and storage medium
US11921763B2 (en) Methods and systems to parse a software component search query to enable multi entity search
US11740995B2 (en) Source quality check service
CN117271482B (en) Historical data migration method and device, electronic equipment and storage medium
US11960560B1 (en) Methods for analyzing recurring accessibility issues with dynamic web site behavior and devices thereof
CN111221843A (en) Big data processing method and device
CN115545006B (en) Rule script generation method, device, computer equipment and medium
CN115291889B (en) Data blood relationship establishing method and device and electronic equipment
US10481880B2 (en) Code lineage tool

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant