WO2024021790A1 - Virtual column construction method and data query method based on a data lake - Google Patents

Virtual column construction method and data query method based on a data lake

Info

Publication number
WO2024021790A1
Authority
WO
WIPO (PCT)
Prior art keywords
expression
column
data
virtual column
statement
Prior art date
Application number
PCT/CN2023/094998
Other languages
English (en)
French (fr)
Inventor
郭俊
谢佳君
孙科
罗旋
Original Assignee
北京火山引擎科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京火山引擎科技有限公司
Publication of WO2024021790A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution

Definitions

  • This application relates to the field of data processing technology, and in particular to a virtual column construction method and data query method based on a data lake.
  • this application provides a virtual column construction method and data query method based on a data lake, which can improve user data query experience.
  • the embodiment of this application provides a method for constructing virtual columns based on a data lake.
  • the method includes:
  • the virtual column construction description information includes the expression to be used
  • a virtual column corresponding to the expression to be used is constructed.
  • the number of statements to be analyzed is N;
  • Determining an expression to be used from the at least one statement to be analyzed includes:
  • n is a positive integer
  • n ≤ N; N is a positive integer
  • the first expression is determined to be the expression to be used.
  • determining the nth expression to be analyzed from the nth statement to be analyzed includes:
  • the method further includes:
  • the virtual column construction description information also includes a column name
  • the process of determining the column name corresponding to the expression to be used includes:
  • the column name corresponding to the expression to be used is determined from the at least one statement to be referenced; the expression carried by the statement to be referenced and the expression to be used satisfy a preset semantic-identity condition; the statement to be referenced includes the column name corresponding to the expression it carries.
  • determining the column name corresponding to the expression to be used from the at least one statement to be referenced includes:
  • the column name corresponding to the expression to be used is determined.
  • the column name corresponding to the expression carried by the at least one statement to be referenced includes the target column name
  • Determining the column name corresponding to the expression to be used based on the column name statistical results includes:
  • the target column name is determined to be the column name corresponding to the expression to be used.
  • the method further includes:
  • if the statement to be referenced does not exist among the at least one statement to be analyzed, then a second expression is determined from at least one preset expression to be matched, based on the similarity representation data between each expression to be matched and the expression to be used; the similarity representation data between the second expression and the expression to be used satisfies a preset similarity condition;
  • the mapping relationship is used to record the correspondence between each expression to be matched and the column name corresponding to that expression.
  • the column name corresponding to the expression to be used is determined.
  • the number of expressions to be matched is M;
  • the determination process of similarity representation data between the mth expression to be matched and the expression to be used includes:
  • the similarity representation data between the m-th expression to be matched and the expression to be used is determined from two similarities: the similarity between the field name vector of the expression to be used and the field name vector of the m-th expression to be matched, and the similarity between the keyword vector of the expression to be used and the keyword vector of the m-th expression to be matched.
  • the process of determining the field name vector of the expression to be used includes:
  • the process of determining the keyword vector of the expression to be used includes:
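  • As a rough illustration of the similarity computation described above, the following sketch builds bag-of-words vectors over field names and keywords and combines their cosine similarities. The tokenizer, the keyword set, the use of cosine similarity, the equal weighting, and every function name are assumptions made for illustration; the text here does not specify them.

```python
import math
import re
from collections import Counter

# Hypothetical keyword set; the source does not list which tokens count as keywords.
KEYWORDS = {"if", "case", "when", "then", "else", "end", "coalesce", "cast", "concat"}


def tokenize(expression):
    """Split an expression into lowercase identifier tokens."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z_][A-Za-z_0-9]*", expression)]


def field_name_vector(expression):
    """Count identifier tokens that are not keywords (assumed to be field names)."""
    return Counter(tok for tok in tokenize(expression) if tok not in KEYWORDS)


def keyword_vector(expression):
    """Count identifier tokens that belong to the assumed keyword set."""
    return Counter(tok for tok in tokenize(expression) if tok in KEYWORDS)


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity(expr_to_use, expr_to_match, field_weight=0.5):
    """Weighted mix of field-name similarity and keyword similarity (weights assumed)."""
    field_sim = cosine(field_name_vector(expr_to_use), field_name_vector(expr_to_match))
    keyword_sim = cosine(keyword_vector(expr_to_use), keyword_vector(expr_to_match))
    return field_weight * field_sim + (1 - field_weight) * keyword_sim
```

  • Under these assumptions, the expression to be matched with the highest score that also passes a preset threshold would play the role of the "second expression".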
  • the virtual column construction description information also includes a data type; the data type is determined based on the data type corresponding to the column name carried by the expression to be used.
  • Embodiments of this application also provide a data query method based on a data lake.
  • the method includes:
  • the first data query request is used to request a data query for a target virtual column; wherein the target virtual column is constructed using any implementation of the data lake-based virtual column construction method provided by this application;
  • Data query processing is performed according to the second data query request.
  • the embodiment of this application also provides a virtual column construction device based on a data lake, including:
  • An expression determination unit configured to determine an expression to be used from the at least one statement to be analyzed after acquiring at least one statement to be analyzed;
  • An information determination unit configured to determine the virtual column construction description information corresponding to the expression to be used; the virtual column construction description information includes the expression to be used;
  • a request generation unit configured to generate a virtual column construction request based on the virtual column construction description information corresponding to the expression to be used;
  • a virtual column construction unit configured to construct a virtual column corresponding to the expression to be used according to the virtual column construction request.
  • the embodiment of this application also provides a data query device based on a data lake, including:
  • a request acquisition unit configured to obtain a first data query request; the first data query request is used to request a data query for a target virtual column; wherein the target virtual column is constructed using any implementation of the data lake-based virtual column construction method provided by this application;
  • An information replacement unit configured to use an expression corresponding to the target virtual column to replace the column name of the target virtual column in the first data query request to obtain a second data query request;
  • a data query unit is configured to perform data query processing according to the second data query request.
  • An embodiment of the present application also provides an electronic device, where the device includes: a processor and a memory;
  • the memory is used to store instructions or computer programs
  • the processor is configured to execute the instructions or computer programs in the memory, so that the electronic device executes any implementation of the data lake-based virtual column construction method provided by this application, or executes any implementation of the data lake-based data query method provided by this application.
  • Embodiments of the present application also provide a computer-readable medium. Instructions or computer programs are stored in the computer-readable medium. When the instructions or computer programs are run on a device, they cause the device to execute any implementation of the data lake-based virtual column construction method provided by this application, or any implementation of the data lake-based data query method provided by this application.
  • An embodiment of the present application also provides a computer program product, which includes a computer program carried on a non-transitory computer-readable medium.
  • the computer program includes program code for executing the data lake-based virtual column construction method provided by the embodiments of the present application.
  • In a data lake scenario, the technical solution provided by the embodiments of this application can automatically perform expression statistical analysis on a large number of statements to be analyzed in the data lake (for example, SQL statements under various engines) to obtain expressions to be used that satisfy preset virtual column construction conditions (for example, expressions that appear more frequently). Based on the virtual column construction description information corresponding to the expression to be used (for example, column name, data type, expression, etc.), it then automatically generates a virtual column construction request, which asks for a virtual column that can represent the expression to be used. Finally, it constructs the virtual column corresponding to the expression to be used according to that request. A user can later issue a data query request against the virtual column to automatically trigger a data query on the expression to be used, which avoids the problems that arise when the user enters such a query manually (for example, how to write the expression correctly) and thus effectively improves the user's data query experience.
  • Figure 1 is a schematic diagram of metadata of a virtual column provided by an embodiment of the present application.
  • Figure 2 is a flow chart of a data lake-based virtual column construction method provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of a syntax conversion provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of a syntax tree provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a mapping library provided by an embodiment of the present application.
  • Figure 6 is a flow chart of a data query method based on a data lake provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a data lake-based virtual column construction device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a data query device based on a data lake provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • SQL: Structured Query Language.
  • A data lake is a centralized repository that allows users to store structured and unstructured data from multiple sources at any scale, keep the data as-is without structuring it first, and run different types of analytics and data processing, such as big data processing, real-time analysis, and machine learning, to guide better decisions.
  • JavaCC: Java Compiler Compiler.
  • JavaCC is a generator that produces lexical analyzers and parsers.
  • this application provides a virtual column construction method based on a data lake.
  • the method includes: in a data lake scenario, expression statistical analysis is first performed automatically on a large number of statements to be analyzed in the data lake (for example, SQL statements under various engines) to obtain an expression to be used that satisfies the preset virtual column construction conditions (for example, the expression "a+1"); then, based on the virtual column construction description information corresponding to the expression to be used (for example, column name, data type, expression, etc.), a virtual column construction request corresponding to the expression to be used is automatically generated, so that the request asks for a virtual column that can represent the expression to be used; finally, according to the virtual column construction request, a virtual column corresponding to the expression to be used is constructed (for example, a virtual column with the character "c" as the column name). Because the virtual column represents the expression to be used, a user can later issue a data query request against the virtual column (for example, the SQL statement "SELECT c FROM t1") to automatically trigger a data query on the expression to be used, which avoids the problems that arise when the user enters such a query manually (for example, how to write the expression correctly) and thus effectively improves the user's data query experience.
  • the embodiments of the present application do not limit the execution subject of the virtual column construction method based on the data lake (or the data query method based on the data lake).
  • the virtual column construction method based on the data lake can be applied to devices with data processing functions such as terminal devices or servers.
  • the terminal device can be a smartphone, a computer, a personal digital assistant (Personal Digital Assistant, PDA) or a tablet computer, etc.
  • the server can be a standalone server, a cluster server or a cloud server.
  • A virtual column is used to represent an expression.
  • a virtual column with the character "c" as the column name (hereinafter referred to as virtual column c) can be used to represent the expression "a+1", so that in the future the user can use query statements targeting this virtual column (for example, the SQL statement "SELECT c FROM t1") to query data for the expression "a+1".
  • the embodiment of this application also provides some grammatical contents of virtual columns, which are described below in conjunction with Tables 1 to 4.
  • the string “t1” refers to the table name of the data table to which a virtual column needs to be added; the character “c” refers to the virtual column.
  • For step 11 in Table 1 above, the embodiment of the present application does not limit the implementation of step 11.
  • any existing or future statement parsing method can be used for implementation.
  • step 11 may specifically include: after obtaining the written lexical file (for example, the SQL statement shown in row 2, column 1 of Table 1 above), a pre-built SQL parser can be used to parse the lexical file (for example, convert the SQL statement into a syntax tree) and obtain the parsing result.
  • the embodiments of the present application do not limit the SQL parser; for example, it can be generated with the help of JavaCC.
  • For step 12 in Table 1 above, the embodiment of the present application does not limit the implementation of step 12.
  • any existing or future statement verification method can be used for implementation.
  • step 12 may specifically include: after obtaining the above parsing processing results, verifying the legality of the column names, types, and expressions based on the parsing processing results.
  • in step 12, besides verifying the legality of certain information, pre-built type conversion rules can also be used to convert illegal information in the above parsing result into legal information, effectively avoiding the adverse effects of illegal information. For example, if the column name of the virtual column is already used by at least one existing data column in the data table to which the virtual column belongs, a number can be appended to the end of the virtual column's name so that the number distinguishes the virtual column from those existing data columns.
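  • The renaming rule mentioned above (append a number when the name clashes with an existing column) could look roughly like the sketch below. The function name, the starting value of the suffix, and the policy of trying successive numbers are assumptions; the text only states that a number is appended.

```python
def legalize_column_name(name, existing_columns):
    """Return `name` unchanged if it is free, else append the smallest free numeric suffix.

    Hypothetical helper: the source only says "add a number at the end of the
    column name"; starting at 1 and incrementing is an illustrative choice.
    """
    if name not in existing_columns:
        return name
    suffix = 1
    while f"{name}{suffix}" in existing_columns:
        suffix += 1
    return f"{name}{suffix}"
```

  • For instance, with existing columns a and c, a requested virtual column name c would become c1 under this sketch.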
  • step 13 may specifically include: after completing the statement verification for the above parsing result, the relevant information of the virtual column (for example, column name, data type, expression, declaration, etc.) may be extracted from the parsing result; the virtual column is then added to the data table to which it belongs (for example, a data table with the string "t1" as the table name), and the relevant information of the virtual column is stored as the metadata of the virtual column in the metadata service (HiveMetaStore), so that it can be looked up from the metadata service in the future.
  • the embodiments of the present application do not limit the storage method of the above "information related to virtual columns".
  • for example, when the "information related to virtual columns" includes the column name (name), the expression (expression), the data type (type), and the declaration (comment) of the virtual column, these four items can all be stored in the metadata of HiveMetaStore.
  • HiveMetaStore usually stores data using key-value pairs; and its storage of a single key-value pair has a character length limit.
  • the embodiment of the present application also provides a possible implementation for storing the expression of the virtual column, which may specifically be: if the expression of the virtual column exceeds the character length limit, the expression is split, with reference to that limit, into multiple sub-expressions (for example, the first sub-expression to the R-th sub-expression shown in Figure 1), which are stored separately so that each sub-expression complies with the character length limit.
  • R is a positive integer.
  • a number can be assigned to each virtual column (for example, the number "1" shown in area 101 of Figure 1), so that the number uniquely identifies the virtual column and the relevant information of the virtual column can later be queried from HiveMetaStore by this number.
  • when a new virtual column is added to a data table, 1 is added to the current maximum number corresponding to the data table to obtain the number of the new virtual column.
  • in HiveMetaStore, it is also agreed that only virtual columns may use the string "virtual." as a prefix, so that the string "virtual." identifies virtual columns; in the future, the "virtual." string can then be used to determine which content in HiveMetaStore is virtual column metadata and which is not.
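  • A minimal sketch of the storage conventions described above (per-column number, "virtual." prefix, and splitting a long expression into sub-expressions) is given below. The key layout `virtual.<number>.<field>`, the helper names, and the use of a plain dict in place of HiveMetaStore are all assumptions for illustration; the actual layout is the one shown in Figure 1, which is not reproduced in the text.

```python
def next_virtual_column_number(existing_numbers):
    """New number = current maximum number for the table + 1 (1 if there is none)."""
    return max(existing_numbers, default=0) + 1


def build_virtual_column_metadata(number, name, data_type, expression, comment, max_value_len):
    """Return the key-value pairs for one virtual column.

    The key layout is hypothetical. The expression is cut into chunks of at most
    `max_value_len` characters so each stored value respects the length limit.
    """
    prefix = f"virtual.{number}."
    pairs = {
        prefix + "name": name,
        prefix + "type": data_type,
        prefix + "comment": comment,
    }
    chunks = [expression[i:i + max_value_len] for i in range(0, len(expression), max_value_len)]
    for r, chunk in enumerate(chunks, start=1):
        pairs[f"{prefix}expression.{r}"] = chunk
    return pairs


def read_expression(pairs, number):
    """Reassemble the expression by concatenating its sub-expressions in order."""
    prefix = f"virtual.{number}.expression."
    parts = sorted((int(key[len(prefix):]), value)
                   for key, value in pairs.items() if key.startswith(prefix))
    return "".join(value for _, value in parts)
```

  • Splitting by raw character count may cut through a token; that is harmless in this sketch because the sub-expressions are only ever concatenated back together before use.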
  • the purpose of the step "add the virtual column to the data table to which the virtual column belongs" is to shield the data users of the data lake from the usage differences between ordinary columns and virtual columns, which can effectively improve the user's data query experience.
  • For step 21 in Table 2 above, the embodiment of the present application does not limit the implementation of step 21.
  • any existing or future statement parsing method can be used for implementation.
  • step 21 may specifically include: after obtaining the written lexical file (for example, the SQL statement shown in row 2, column 1 of Table 2 above), a pre-built SQL parser can be used to parse the lexical file (for example, convert the SQL statement into a syntax tree) and obtain the parsing result.
  • step 22 may specifically include: after obtaining the above parsing result, verifying whether the table (for example, the data table with the string "t1" as the table name) and the virtual column to be deleted (for example, the virtual column with the character "c" as the column name) exist. Note that the existence check on the column can be skipped with the string "IF EXISTS" in Table 2: if virtual column c does not exist in data table t1, "IF EXISTS" causes this problem to be ignored.
  • For step 23 in Table 2 above, the embodiment of the present application does not limit the implementation of step 23.
  • it can be implemented using any existing or future statement execution method.
  • step 23 may specifically include: after completing the statement verification for the above parsing result, first extract the table name and column name from the parsing result; then look up the number corresponding to the column name (for example, the number "1" shown in area 101 of Figure 1) in the number mapping relationship corresponding to the table name; next, delete all key-value pairs under this number (for example, the content shown in area 100 of Figure 1); finally, delete the data column with this column name from the data table with this table name.
  • the number mapping relationship is used to record the correspondence between each virtual column in the data table with the table name and the number corresponding to each virtual column.
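  • The deletion flow of steps 22 and 23 can be sketched as follows, using in-memory structures in place of HiveMetaStore. The key layout `virtual.<number>.<field>`, the function name, and the `if_exists` flag (standing in for the "IF EXISTS" string) are hypothetical.

```python
def drop_virtual_column(metadata, number_by_column, table_columns, column_name, if_exists=False):
    """Delete a virtual column: resolve its number, remove its key-value pairs, drop the column.

    metadata:         dict of key-value pairs (assumed keys "virtual.<number>.<field>")
    number_by_column: the number mapping relationship (column name -> number)
    table_columns:    list of column names in the data table
    Returns True if a column was dropped, False if it was absent and if_exists is set.
    """
    if column_name not in number_by_column:
        if if_exists:
            return False  # "IF EXISTS": silently ignore the missing column
        raise KeyError(f"virtual column {column_name!r} does not exist")
    number = number_by_column.pop(column_name)
    prefix = f"virtual.{number}."  # trailing dot keeps number 1 distinct from 10, 11, ...
    for key in [k for k in metadata if k.startswith(prefix)]:
        del metadata[key]
    if column_name in table_columns:
        table_columns.remove(column_name)
    return True
```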
  • the string "t1" refers to the table name of the data table where the virtual column needs to be deleted.
  • For step 31 in Table 3 above, the embodiment of the present application does not limit the implementation of step 31.
  • any existing or future statement parsing method can be used for implementation.
  • step 31 may specifically include: after obtaining the written lexical file (for example, the SQL statement shown in row 2, column 1 of Table 3 above), a pre-built SQL parser can be used to parse the lexical file (for example, convert the SQL statement into a syntax tree) and obtain the parsing result.
  • For step 32 in Table 3 above, the embodiment of the present application does not limit the implementation of step 32.
  • any existing or future statement verification method can be used for implementation.
  • step 32 may specifically include: after obtaining the above parsing processing result, verifying whether the table (for example, a data table with the string "t1" as the table name) exists based on the parsing processing result.
  • step 33 may specifically include: after completing the statement verification for the above parsing result, first extract the table name from the parsing result; then look up, from HiveMetaStore, the relevant information of each virtual column under that table name (for example, name, expression, type, comment, etc.); next, aggregate the relevant information by number to obtain the information set of each virtual column; finally, sort and display these information sets by number to obtain the virtual column viewing result for the table name.
  • the information set of the dth virtual column includes the column name of the dth virtual column, the data type of the dth virtual column, and the expression of the dth virtual column.
  • d is a positive integer
  • d ≤ D; D is a positive integer
  • D represents the number of virtual columns in the data table with this table name.
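  • The aggregation in step 33 (group the stored key-value pairs by number, then sort by number) might be sketched like this. The key layout `virtual.<number>.<field>` with numbered `expression.<r>` chunks and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def show_virtual_columns(metadata):
    """Group virtual column metadata by number and return the sets sorted by number."""
    grouped = defaultdict(dict)
    chunks = defaultdict(dict)
    for key, value in metadata.items():
        parts = key.split(".")
        if parts[0] != "virtual" or len(parts) < 3:
            continue  # not virtual column metadata
        number, field = int(parts[1]), parts[2]
        if field == "expression":
            chunks[number][int(parts[3])] = value  # r-th sub-expression
        else:
            grouped[number][field] = value
    result = []
    for number in sorted(set(grouped) | set(chunks)):
        info = dict(grouped[number])
        # Concatenate sub-expressions in order to recover the full expression.
        info["expression"] = "".join(chunks[number][r] for r in sorted(chunks[number]))
        result.append((number, info))
    return result
```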
  • For step 41 in Table 4 above, the embodiment of the present application does not limit the implementation of step 41.
  • any existing or future statement verification method can be used for implementation.
  • step 41 may specifically include: after obtaining the written lexical file (for example, the SQL statement shown in row 2, column 1 of Table 4 above), the legality of the lexical file may be verified (for example, verifying whether virtual column c exists in data table t1, etc.).
  • For step 42 in Table 4 above, the embodiment of the present application does not limit the implementation of step 42.
  • any existing or future statement adjustment method may be used for implementation.
  • step 42 may specifically include: after completing the statement verification for the above lexical file, the column name of the virtual column in the lexical file can be replaced with the expression of the virtual column to obtain a replaced file; the replaced file is then translated, according to the translation rules corresponding to the different engines, into the executable plans corresponding to those engines, so that each engine can subsequently complete the virtual column query task by executing its executable plan.
  • For step 43 in Table 4 above, the embodiment of the present application does not limit the implementation of step 43.
  • any existing or future statement execution method can be used for implementation.
  • step 43 may specifically include: after obtaining the executable plan corresponding to an engine, the executable plan may be sent to that engine, so that the engine completes the virtual column query task by executing the executable plan.
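  • The replacement part of step 42 can be illustrated with a deliberately simple token-level rewrite. This is only a sketch under assumptions: the function name is hypothetical, the added parentheses are an illustrative choice to preserve operator precedence, and a real implementation would more plausibly rewrite the parsed syntax tree, since this version does not account for string literals, aliases, or qualified names.

```python
import re


def rewrite_query(query, virtual_columns):
    """Replace each virtual column name in `query` with its parenthesized expression.

    virtual_columns maps a virtual column name to its expression, e.g. {"c": "a+1"}.
    """
    def substitute(match):
        token = match.group(0)
        expression = virtual_columns.get(token)
        return f"({expression})" if expression is not None else token

    # Only whole identifiers are replaced, so "cc" is not affected by a rule for "c".
    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", substitute, query)
```

  • With virtual column c defined as "a+1", the query "SELECT c FROM t1" is rewritten to "SELECT (a+1) FROM t1", which can then be translated for a specific engine.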
  • when a virtual column belongs to a data table, the virtual column can represent an expression over ordinary columns in that data table, so that a query task for the expression can later be carried out through a query task for the virtual column.
  • although the constructed virtual column appears in the data table, it is usually not used to record data; when a data query request is later triggered for the virtual column, the data is not read directly from the virtual column in the data table. Instead, the data query result of the virtual column is determined with the help of the expression of the virtual column.
  • the embodiments of this application also provide some solutions for automatically generating the SQL statements used to build virtual columns (that is, some possible implementations of the data lake-based virtual column construction method). For ease of understanding, they are described below with reference to the accompanying drawings.
  • the data lake-based virtual column construction method includes S201-S204:
  • the above "at least one statement to be analyzed” refers to some SQL statements collected from all engines corresponding to the data lake (for example, MySQL, Hive, Spark, Presto and other engines).
  • the above “at least one statement to be analyzed” refers to some SQL statements collected from H types of engines.
  • H is a positive integer.
  • the embodiments of this application do not limit the method of obtaining the above "at least one statement to be analyzed".
  • for example, the method may be: after the h-th engine receives an SQL statement, the h-th engine stores the SQL statement in a preset storage space, so that once the preset statement analysis condition is met, the stored SQL statements are read from the preset storage space as the statements to be analyzed.
  • h is a positive integer
  • H is a positive integer
  • H represents the number of engines corresponding to the data lake.
  • the above "statement analysis condition" can be set in advance. For example, it may be that the time difference between the current time and the trigger time of the last automatic virtual column construction process (for example, S201-S204) reaches a preset time difference. Based on this condition, a large number of SQL statements can be pulled automatically at regular intervals, and expressions with virtual column construction requirements can be identified automatically from them.
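  • The example condition above reduces to a single time comparison, sketched here. The function and parameter names are hypothetical, and "reaches" is read as "greater than or equal to".

```python
from datetime import datetime, timedelta


def statement_analysis_due(now, last_trigger_time, preset_time_difference):
    """True once the time since the last automatic construction run reaches the preset difference."""
    return now - last_trigger_time >= preset_time_difference
```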
  • expression to be used refers to an expression that meets the preset virtual column construction conditions, so that the expression to be used can express an expression with virtual column construction requirements.
  • the expression to be used may be the expression "a+1".
  • the above virtual column construction condition can be set in advance.
  • it can be: the occurrence frequency of the expression to be used is higher than the preset frequency threshold.
  • the preset frequency threshold can be set in advance.
  • S201 may specifically include S2011-S2013:
  • S2011: Determine the nth expression to be analyzed from the nth statement to be analyzed.
  • n is a positive integer
  • n ≤ N
  • N is a positive integer
  • the nth expression to be analyzed is used to represent the semantic information carried by the expression involved in the nth statement to be analyzed.
  • the embodiment of the present application does not limit the determination process of the nth expression to be analyzed.
  • the expression appearing in the nth statement to be analyzed can be directly determined as the nth expression to be analyzed.
  • SQL statements under different engines are usually written in different dialects. Therefore, to improve the statement analysis effect, these SQL statements can first be converted into the same dialect, and expressions can then be extracted from the converted SQL statements.
  • the embodiment of the present application also provides another possible implementation of the above determination process of the nth expression to be analyzed, which may specifically include S20111-S20112:
  • S20111 Perform syntax conversion processing on the nth statement to be analyzed, and obtain the syntax conversion result.
  • the syntax conversion processing is used to convert the nth statement to be analyzed from the first dialect to the second dialect.
  • the first dialect refers to the dialect in which the nth statement to be analyzed is written.
  • the second dialect refers to the target dialect.
  • the above "syntax conversion result" expresses, in the second dialect, the semantic information carried by the nth statement to be analyzed.
  • for example, the syntax conversion result can be as shown in Figure 3.
  • the expression appearing in the syntax conversion result (for example, the expression "if(col<>null,col,'')" appearing in the SQL statement shown as 302 in Figure 3) can be determined as the nth expression to be analyzed.
  • in this way, the nth expression to be analyzed can be determined from the nth statement to be analyzed, so that the nth expression to be analyzed expresses the semantic information carried by the expression in the nth statement to be analyzed, and it can subsequently be determined whether the nth expression to be analyzed has a virtual column construction requirement.
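  • the expression statistical results are used to describe the occurrence frequency of the different expressions among the above N expressions to be analyzed.
  • for example, the expression statistical results may include the number of occurrences of expression 1 in the N expressions to be analyzed, the number of occurrences of expression 2 in the N expressions to be analyzed, ..., and the number of occurrences of expression Y in the N expressions to be analyzed.
  • Y is a positive integer.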
  • the embodiment of the present application does not limit the implementation of S2012.
  • for example, it can be implemented using any existing or future method that can perform statistical analysis on a set of expressions.
  • another possible implementation of S2012 may specifically include S20121-S20122:
  • S20121 Perform syntax tree construction processing on the nth expression to be analyzed, and obtain the syntax tree of the nth expression to be analyzed.
  • the syntax tree of the nth expression to be analyzed represents the semantic information carried by the nth expression to be analyzed in the form of an abstract syntax tree. For example, when the nth expression to be analyzed is the expression "if(col<>null,col,'')", its syntax tree can be the syntax tree shown in Figure 4.
  • S20122 Perform statistical analysis and processing on the syntax trees of N expressions to be analyzed, and obtain statistical analysis results.
  • for example, any two of these syntax trees can first be compared to obtain a comparison result that indicates whether the two syntax trees are equal; the comparison results between every pair of syntax trees are then used to determine the statistical analysis results of the N expressions to be analyzed, so that the statistical analysis results represent the occurrence frequency of the different expressions among these expressions to be analyzed.
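  • The idea in S20121-S20122 can be sketched as follows. This is only an illustration under stated assumptions: Python's own `ast` module stands in for an SQL expression parser (so only expressions that are also valid Python, such as "a+1", are used), and instead of comparing every pair of trees the sketch groups trees by a canonical dump, which yields the same equality classes. The function names are hypothetical.

```python
import ast
from collections import Counter


def tree_key(expression):
    """Build the syntax tree of an expression and return a canonical dump of it."""
    tree = ast.parse(expression.strip(), mode="eval")
    return ast.dump(tree)  # positions are omitted, so formatting differences vanish


def count_expressions(expressions):
    """Count expressions whose syntax trees are equal."""
    counts = Counter()
    representative = {}
    for expression in expressions:
        key = tree_key(expression)
        representative.setdefault(key, expression)  # first spelling seen
        counts[key] += 1
    return {representative[key]: count for key, count in counts.items()}


stats = count_expressions(["a+1", "a + 1", "(a+1)", "b*2"])
```

  • Here the three spellings of "a+1" fall into one group because their trees are equal, while "b*2" forms its own group.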
  • the first expression refers to the expression to be analyzed whose occurrence frequency is higher than the preset frequency threshold.
  • the first expression may be the expression "if(col<>null,col,'')".
  • the embodiment of the present application does not limit the determination process of the first expression.
  • for example, when the expression statistical results include the number of occurrences of expression 1 in the N expressions to be analyzed, the number of occurrences of expression 2 in the N expressions to be analyzed, ..., and the number of occurrences of expression Y in the N expressions to be analyzed, it can be judged whether the number of occurrences of expression y in the N expressions to be analyzed is higher than the preset frequency threshold. If it is, expression y can be determined as the first expression.
  • y is a positive integer, y ≤ Y.
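  • The threshold check described above can be sketched in a few lines. The counts and the threshold value below are hypothetical and serve only to show the comparison "higher than the preset frequency threshold".

```python
def select_first_expressions(expression_counts, frequency_threshold):
    """Return the expressions whose occurrence count is higher than the threshold."""
    return [
        expression
        for expression, count in expression_counts.items()
        if count > frequency_threshold
    ]


# Hypothetical statistics for three expressions.
counts = {"a+1": 5, "if(col<>null,col,'')": 12, "b*2": 1}
first_expressions = select_first_expressions(counts, 4)
```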
  • the virtual column construction description information corresponding to the expression to be used describes the content that needs to be referenced when constructing the virtual column corresponding to the expression to be used (for example, the column name of the virtual column, the data type of the virtual column, the expression of the virtual column, and the table name of the data table to which the virtual column belongs).
  • the embodiment of the present application does not limit the above "virtual column construction description information corresponding to the expression to be used".
  • for example, it may include the column name of the virtual column, the data type of the virtual column, the expression of the virtual column, and the table name of the data table to which the virtual column belongs.
  • the virtual column refers to the data column used to represent the expression to be used.
  • the column name of the virtual column refers to the name identifier of the virtual column used to represent the expression to be used.
  • for example, when the expression to be used is the expression "if(col<>null,col,'')" shown at 302 in Figure 3, the name identifier of the virtual column used to represent the expression to be used can be as shown in Figure 3.
  • the embodiment of the present application does not limit the above determination process of "column name of virtual column".
  • it may specifically include steps 51 to 54:
  • Step 51 Determine whether there is a statement to be referenced that satisfies the preset statement reference conditions in at least one of the above statements to be analyzed. If yes, perform step 52; if not, perform steps 53 to 54.
  • the statement to be referenced refers to a statement to be analyzed that satisfies the preset statement reference condition. The statement reference condition is used to filter, from the at least one statement to be analyzed, content related to the column name corresponding to the expression to be used, and the embodiment of the present application does not limit the statement reference condition. For example, it may be: the expression carried by the statement to be referenced and the expression to be used satisfy the preset semantic identical condition, and the statement to be referenced includes the column name corresponding to the expression it carries.
  • for example, it can first be determined whether the expression carried by the nth statement to be analyzed (for example, the expression "coalesce(col,'')" in the SQL statement shown at 301 in Figure 3) and the expression to be used (for example, the expression "if(col<>null,col,'')") satisfy the preset semantic identical condition. If they do, the semantic information of the expression carried by the nth statement to be analyzed is the same as that of the expression to be used, and it is then determined whether the nth statement to be analyzed contains a column name corresponding to the expression it carries.
  • if such a column name exists, the nth statement to be analyzed can be determined as a statement to be referenced, so that the column name corresponding to the expression to be used can subsequently be determined based on the column name appearing in the nth statement to be analyzed.
  • n is a positive integer, n ≤ N, and N is a positive integer.
  • preset semantic identical condition can be set in advance.
  • it can specifically be: the semantic information of the expression carried by the n-th statement to be analyzed is the same as the semantic information of the expression to be used.
  • Step 52 Determine the column name corresponding to the expression to be used from at least one statement to be referenced.
  • step 52 may specifically include steps 521 to 522:
  • Step 521 Perform statistical analysis and processing on the column name corresponding to the expression carried by at least one statement to be referenced, and obtain the column name statistical result.
  • the column name corresponding to the expression carried by the jth statement to be referenced refers to the column name that appears in the jth statement to be referenced and corresponds to the expression carried by that statement.
  • for example, when the expression carried by the jth statement to be referenced is the expression "coalesce(col,'')" in the SQL statement shown as 301 in Figure 3, and the column name "not_null_col" exists in the jth statement to be referenced and corresponds to that expression, the column name "not_null_col" can be determined as the column name corresponding to the expression carried by the jth statement to be referenced.
  • j is a positive integer
  • J is a positive integer
  • J represents the number of statements to be referenced.
  • the above column name statistical results represent the occurrence frequency of the different names among the above "column names corresponding to the expressions carried by the at least one statement to be referenced".
  • for example, the column name statistical results can include the number of occurrences of column name 1 among these column names, the number of occurrences of column name 2 among these column names, ..., and the number of occurrences of column name H among these column names.
  • H is a positive integer
  • Step 522 Determine the column name corresponding to the expression to be used based on the column name statistical results.
  • for example, based on the column name statistical results, the column name with the highest occurrence frequency among the above "column names corresponding to the expressions carried by the at least one statement to be referenced" can be determined as the column name corresponding to the expression to be used.
  • that is, step 522 may specifically be: if the column name statistical results indicate that the occurrence frequency of the target column name meets the preset frequency condition, the target column name is determined as the column name corresponding to the expression to be used.
  • the preset frequency condition can be set in advance. For example, it can be: the occurrence frequency of the target column name is higher than the occurrence frequency of every other column name among the column names corresponding to the expressions carried by the at least one statement to be referenced.
  • Based on the relevant content of the above step 52, if at least one statement to be referenced exists among the at least one statement to be analyzed, these statements to be referenced can provide usable column names for the expression to be used. The column name with the highest occurrence frequency can therefore be selected from these statements to be referenced as the column name corresponding to the expression to be used, and this column name can then serve as the column name of the virtual column representing the expression to be used.
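  • Steps 521 to 522 can be sketched as a simple vote over the column names collected from the statements to be referenced. The function name is hypothetical, and returning `None` when two names tie for first place is an assumption of this sketch; the application only states that the target column name must occur more often than every other name.

```python
from collections import Counter


def pick_column_name(candidate_names):
    """Return the name that is strictly more frequent than all others, else None."""
    ranked = Counter(candidate_names).most_common(2)
    if not ranked:
        return None  # no statement to be referenced supplied a name
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the preset frequency condition is not met
    return ranked[0][0]


chosen = pick_column_name(["not_null_col", "nn_col", "not_null_col"])
```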
  • Step 53 Determine a second expression from at least one expression to be matched based on the preset similarity representation data between at least one expression to be matched and the expression to be used. Wherein, the similarity representation data between the second expression and the expression to be used satisfies a preset similarity condition.
  • the mth expression to be matched refers to a preset expression with a virtual column name.
  • m is a positive integer, m ≤ M, and M is a positive integer.
  • M represents the number of expressions to be matched.
  • the mth expression to be matched may be the expression corresponding to the column name c0 in the mapping library shown in Figure 5.
  • the embodiments of this application do not limit the acquisition process of "at least one expression to be matched" mentioned above.
  • the expressions corresponding to each column name in the pre-built mapping library can be determined as expressions to be matched.
  • the similarity representation data between the mth expression to be matched and the expression to be used characterizes the degree of similarity between the mth expression to be matched and the expression to be used. The embodiment of the present application does not limit the determination process of "the similarity representation data between the mth expression to be matched and the expression to be used"; for example, it may specifically include steps 61 to 62:
  • Step 61 Determine the field name vector and keyword vector of the expression to be used.
  • the field name vector of the expression to be used represents the field names carried by the expression to be used. The embodiment of the present application does not limit the determination process of the "field name vector of the expression to be used"; for example, it may specifically include steps 611 to 612:
  • Step 611 Perform field name extraction processing on the expression to be used, and obtain the field name extraction result of the expression to be used.
  • the field name extraction result of the expression to be used is used to describe the field name carried by the expression to be used.
  • for example, when the expression to be used is the expression "if(c1<>null,c1,c2)", the field name extraction result of the expression to be used may be {c1,c2}.
  • Step 612 Vectorize the extraction result of the field name of the expression to be used to obtain a vector of field names of the expression to be used.
  • for example, with the help of a pre-built field name dictionary (for example, the dictionary {c0, c1, c2, c3}), the field name extraction result of the expression to be used can be vectorized to obtain the field name vector of the expression to be used (for example, the vector (0,2,1,0)).
  • the first "0" in the vector means that the field name c0 does not appear in the expression to be used; the "2" means that the field name c1 appears twice in the expression to be used; the "1" means that the field name c2 appears once in the expression to be used; the second "0" means that the field name c3 does not appear in the expression to be used.
  • the embodiment of the present application does not limit the above "field name dictionary".
  • for example, the determination process of the above "field name dictionary" may be: search the first mapping relationship for the field name dictionary that corresponds to the data table of the expression to be used, and determine it as the above "field name dictionary".
  • the data table corresponding to the expression to be used refers to the data table to which the virtual column representing the expression to be used belongs.
  • Based on the relevant content of steps 611 to 612, field name extraction can first be performed on the expression to be used to obtain the field name extraction result of the expression to be used; then, with reference to the pre-built field name dictionary, field name statistics are computed on the extraction result to obtain the field name vector of the expression to be used, so that the field name vector represents which field names appear in the expression to be used and how often each field name appears.
  • the keyword vector of the expression to be used is used to represent the keywords carried by the expression to be used; and the embodiment of the present application does not limit the determination process of the "keyword vector of the expression to be used". For example, it may specifically include Step 613-Step 614:
  • Step 613 Perform keyword extraction processing on the expression to be used, and obtain the keyword extraction result of the expression to be used.
  • the keyword extraction result of the expression to be used is used to describe the keywords carried by the expression to be used.
  • for example, when the expression to be used is the expression "if(c1<>null,c1,c2)", the keyword extraction result of the expression to be used may be {if,null}.
  • the execution order of step 611 and step 613 is not limited: step 611 and step 613 may be executed in sequence, step 613 and step 611 may be executed in sequence, or step 611 and step 613 may be executed simultaneously.
  • Step 614 Vectorize the keyword extraction result of the expression to be used to obtain the keyword vector of the expression to be used.
  • for example, with the help of a pre-built keyword dictionary (for example, the dictionary {as, if, in, null}), the keyword extraction result of the expression to be used can be vectorized to obtain the keyword vector of the expression to be used (for example, the vector (0,1,0,1)).
  • the first "0" in the vector indicates that the keyword as does not appear in the expression to be used; the first "1" indicates that the keyword if appears once in the expression to be used; the second "0" indicates that the keyword in does not appear in the expression to be used; the second "1" indicates that the keyword null appears once in the expression to be used.
  • Based on the relevant content of steps 613 to 614, keyword extraction can first be performed on the expression to be used to obtain the keyword extraction result of the expression to be used; then, with reference to the pre-built keyword dictionary, keyword statistics are computed on the extraction result to obtain the keyword vector of the expression to be used, so that the keyword vector represents which keywords appear in the expression to be used and how often each keyword appears.
  • Based on the relevant content of step 61, after the expression to be used is obtained, field name extraction and keyword extraction can be performed on it to obtain the field name vector and the keyword vector of the expression to be used, so that these vectors represent the semantic information carried by the expression to be used.
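  • The two vectorizations in step 61 can be sketched with a single counting routine applied to two dictionaries. The dictionaries and the example expression are the ones given above; the regular-expression tokenizer and the lowercasing of tokens are assumptions of this sketch, and the tokenizer would miscount words inside string literals.

```python
import re
from collections import Counter

FIELD_NAME_DICTIONARY = ["c0", "c1", "c2", "c3"]
KEYWORD_DICTIONARY = ["as", "if", "in", "null"]


def tokenize(expression):
    """Split an expression into identifier-like tokens (assumed tokenizer)."""
    return [t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z_0-9]*", expression)]


def vectorize(tokens, dictionary):
    """Count how often each dictionary entry occurs among the tokens."""
    counts = Counter(tokens)
    return tuple(counts[entry] for entry in dictionary)


tokens = tokenize("if(c1<>null,c1,c2)")
field_name_vector = vectorize(tokens, FIELD_NAME_DICTIONARY)
keyword_vector = vectorize(tokens, KEYWORD_DICTIONARY)
```

  • With these inputs the sketch yields (0, 2, 1, 0) for the field names and (0, 1, 0, 1) for the keywords, matching the vectors in the example.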
  • Step 62 Based on the similarity between the field name vector of the expression to be used and the field name vector of the mth expression to be matched, and the similarity between the keyword vector of the expression to be used and the keyword vector of the mth expression to be matched, determine the similarity representation data between the mth expression to be matched and the expression to be used.
  • the above "similarity between the field name vector of the expression to be used and the field name vector of the mth expression to be matched" represents how similar the expression to be used and the mth expression to be matched are in terms of field names. The embodiment of the present application does not limit its determination process; for example, the Euclidean distance between the field name vector of the expression to be used and the field name vector of the mth expression to be matched can be determined as this similarity.
  • the above "similarity between the keyword vector of the expression to be used and the keyword vector of the mth expression to be matched" represents how similar the expression to be used and the mth expression to be matched are in terms of keywords. The embodiment of the present application does not limit its determination process; for example, the Euclidean distance between the keyword vector of the expression to be used and the keyword vector of the mth expression to be matched can be determined as this similarity.
  • the embodiment of the present application does not limit the implementation of step 62. For example, according to preset weights, the similarity between the field name vector of the expression to be used and the field name vector of the mth expression to be matched, and the similarity between the keyword vector of the expression to be used and the keyword vector of the mth expression to be matched, can be weighted and summed to obtain the similarity representation data between the mth expression to be matched and the expression to be used.
  • the embodiment of the present application does not limit the above weights. For example, the weight corresponding to the "similarity between the field name vector of the expression to be used and the field name vector of the mth expression to be matched" is 0.75, and the weight corresponding to the "similarity between the keyword vector of the expression to be used and the keyword vector of the mth expression to be matched" is 0.25.
  • Based on the relevant content of steps 61 to 62, the field name vector and keyword vector of the expression to be used can be determined first; then, based on the field name vector and keyword vector of the expression to be used and the field name vector and keyword vector of the mth expression to be matched, the similarity representation data between the mth expression to be matched and the expression to be used is calculated, so that the similarity representation data represents the degree of similarity between the mth expression to be matched and the expression to be used.
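  • Step 62 can be sketched as a weighted sum of two Euclidean distances, using the example weights 0.75 and 0.25 given above. The function name and the vectors of the candidate expression are hypothetical. Because the data is distance-based, a smaller value means a more similar pair.

```python
import math


def similarity_data(field_a, keyword_a, field_b, keyword_b,
                    field_weight=0.75, keyword_weight=0.25):
    """Weighted sum of the field name distance and the keyword distance."""
    field_distance = math.dist(field_a, field_b)
    keyword_distance = math.dist(keyword_a, keyword_b)
    return field_weight * field_distance + keyword_weight * keyword_distance


# Expression to be used: field names (0,2,1,0), keywords (0,1,0,1).
# Hypothetical expression to be matched: same keywords, but c2 does not appear.
score = similarity_data((0, 2, 1, 0), (0, 1, 0, 1), (0, 2, 0, 0), (0, 1, 0, 1))
```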
  • m is a positive integer
  • M is a positive integer
  • M represents the number of expressions to be matched.
  • the above "second expression" refers to an expression to be matched whose similarity representation data with the expression to be used satisfies the preset similarity condition.
  • the preset similarity condition can be set in advance. For example, if the above similarity representation data is determined with the help of Euclidean distance, the preset similarity condition can be: the similarity representation data between the second expression and the expression to be used is less than the similarity representation data between the expression to be used and every other expression to be matched in the above "at least one expression to be matched". That is, among the "at least one expression to be matched", the similarity representation data between the second expression and the expression to be used is the minimum.
  • Based on the relevant content of step 53, the similarity representation data between the expression to be used and the expressions to be matched recorded in the mapping library can be used to determine, from these expressions to be matched, the second expression that is most similar to the expression to be used, so that the column name corresponding to the second expression can be used to infer the column name corresponding to the expression to be used.
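  • Under the distance-based condition above, selecting the second expression reduces to taking the minimum. In the sketch below the expressions and their similarity representation data are hypothetical values, and ties are resolved in favour of the first entry, which is an assumption of the sketch.

```python
def pick_second_expression(similarity_by_expression):
    """Return the expression to be matched with the smallest distance-based data."""
    return min(similarity_by_expression, key=similarity_by_expression.get)


# Hypothetical similarity representation data for three expressions to be matched.
similarity_by_expression = {
    "if(c1<>null,c1,c3)": 1.06,
    "c0+1": 2.40,
    "coalesce(c1,c2)": 0.35,
}
second_expression = pick_second_expression(similarity_by_expression)
```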
  • Step 54 Find the column name corresponding to the second expression from the pre-built mapping relationship, and determine the column name corresponding to the expression to be used based on the column name corresponding to the second expression.
  • mapping relationship is used to record the correspondence between each expression to be matched and the column name corresponding to each expression to be matched.
  • the embodiment of the present application does not limit the implementation of step 54. For example, the column name corresponding to the second expression can be looked up in the pre-built mapping relationship and directly determined as the column name corresponding to the expression to be used.
  • the embodiment of the present application also provides another possible implementation of step 54: after the column name corresponding to the second expression is found in the pre-built mapping relationship, it can be determined whether the data table corresponding to the expression to be used contains existing column names that include the column name corresponding to the second expression. If so, the column name corresponding to the expression to be used is determined based on the column name corresponding to the second expression and the number of such existing column names, so that the column name corresponding to the expression to be used differs from every existing column name in the data table that includes the column name corresponding to the second expression.
  • the embodiment of the present application does not limit the implementation of the above step of "determining the column name corresponding to the expression to be used based on the column name corresponding to the second expression and the number of existing column names including the column name corresponding to the second expression". For example: first add 1 to the number of existing column names that include the column name corresponding to the second expression to obtain the number to be used; then append the number to be used to the end of the column name corresponding to the second expression to obtain the column name corresponding to the expression to be used.
  • in this way, the column name corresponding to the second expression can be used to infer the column name corresponding to the expression to be used.
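  • The numbering scheme just described can be sketched as follows. Interpreting "including the column name" as substring containment is an assumption of this sketch, as are the function name and the sample tables.

```python
def derive_column_name(base_name, existing_column_names):
    """Append (count + 1) when existing column names already include base_name."""
    matches = [name for name in existing_column_names if base_name in name]
    if not matches:
        return base_name  # no clash: reuse the second expression's column name
    number_to_use = len(matches) + 1
    return f"{base_name}{number_to_use}"


unused = derive_column_name("c0", ["a", "b"])
one_clash = derive_column_name("c0", ["a", "c0"])
two_clashes = derive_column_name("c0", ["c0", "c02"])
```

  • The sketch does not itself verify that the generated name is unused; a table that already holds an unrelated name such as "c03" would need an extra check.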
  • in addition, the mapping library may be updated to improve the accuracy of the mapping library.
  • Based on the relevant content of the above "column name of the virtual column" (that is, the column name corresponding to the expression to be used), the column name of the virtual column used to represent the expression to be used can be determined automatically, which helps reduce the user's workload.
  • data type of the virtual column refers to the data type of the virtual column used to represent the expression to be used.
  • the embodiments of the present application do not limit the determination process of the above "data type of the virtual column” (that is, the data type corresponding to the expression to be used). For example, it can be set manually.
  • the data type of a virtual column is related to the data type of the existing data column involved in the expression represented by the virtual column.
  • the embodiment of the present application also provides a possible implementation of the determination process of the above "data type of the virtual column" (that is, the data type corresponding to the expression to be used), which can be: determine the data type corresponding to the expression to be used according to the data type corresponding to the column name carried by the expression to be used.
  • the embodiment of the present application does not limit the implementation of the above step "determine the data type corresponding to the expression to be used according to the data type corresponding to the column name carried by the expression to be used".
  • for example, the data type (for example, int) corresponding to the column name carried by the expression to be used (that is, the character "a") can be directly determined as the data type corresponding to the expression to be used (for example, int).
  • alternatively, the data type corresponding to the expression to be used can be inferred from the data type corresponding to the column name carried by the expression to be used, in accordance with preset data type inference rules.
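  • One way to picture such inference rules is a widening order over types. The order ("int", "bigint", "double"), the function name, and the rule "take the widest referenced type" are all hypothetical; the application does not specify its inference rules.

```python
import re

# Hypothetical widening order: a later type can hold any earlier type.
WIDENING_ORDER = ["int", "bigint", "double"]


def infer_expression_type(expression, column_types):
    """Infer the type of an expression from the columns it references."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", expression)
    types = {column_types[token] for token in tokens if token in column_types}
    if not types:
        raise ValueError("expression references no known column")
    if len(types) == 1:
        return types.pop()  # single type: reuse it directly
    return max(types, key=WIDENING_ORDER.index)  # otherwise take the widest


same_type = infer_expression_type("a+1", {"a": "int"})
widened = infer_expression_type("a+b", {"a": "int", "b": "double"})
```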
  • expression of a virtual column refers to the expression represented by the virtual column.
  • the expression of the virtual column is the expression to be used.
  • the table name of the data table to which the virtual column belongs refers to the table name of the data table to which the virtual column representing the expression to be used belongs. The embodiment of the present application does not limit the determination process of "the table name of the data table to which the virtual column belongs".
  • for example, the table name carried by a target statement in the above "at least one statement to be analyzed" can be determined as the table name corresponding to the expression to be used, where the expression carried by the target statement and the expression to be used satisfy the preset semantic identical condition, and the target statement carries a table name.
  • in this way, the virtual column construction description information corresponding to the expression to be used can be obtained. This description information describes the content that needs to be referenced when constructing the virtual column corresponding to the expression to be used (for example, the column name of the virtual column, the data type of the virtual column, the expression of the virtual column, and the table name of the data table to which the virtual column belongs), so that the virtual column corresponding to the expression to be used can later be constructed from it.
  • S203 Generate a virtual column construction request based on the virtual column construction description information corresponding to the expression to be used.
  • the virtual column construction request is used to request the generation of a virtual column representing the expression to be used; and the embodiment of the present application does not limit the implementation of the virtual column construction request.
  • for example, when the virtual column construction description information corresponding to the expression to be used includes {(column name, c), (expression, a+1), (data type, int), (data table, t1)}, the virtual column construction request can be the second one in Table 1 above.
  • for example, the executable task corresponding to the first engine can be generated according to the virtual column construction request, and the executable task can then be sent to the first engine, so that the first engine constructs the virtual column corresponding to the expression to be used by executing the executable task.
  • the first engine refers to the execution engine used to construct the virtual column corresponding to the expression to be used; and the embodiment of the present application does not limit the first engine. For example, it can be an engine specified by the user, or it can be based on the current Execution engine selected by resource conditions.
  • the embodiments of this application do not limit the generation process of the "executable task corresponding to the first engine" mentioned above.
  • for example, after the virtual column construction request is obtained, the translation rules corresponding to the first engine can be used to translate the virtual column construction request into the executable task corresponding to the first engine.
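  • The translation step can be pictured as filling an engine-specific template from the description information. Everything in this sketch is hypothetical: the engine names, the statement shapes in `TRANSLATION_RULES`, and the `VirtualColumnRequest` type are illustrative only and are not the syntax of any real engine or of the request shown in Table 1.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VirtualColumnRequest:
    column_name: str
    expression: str
    data_type: str
    table_name: str


# Hypothetical translation rules, one template per engine.
TRANSLATION_RULES = {
    "engine_a": "ALTER TABLE {table_name} ADD COLUMN {column_name} {data_type} AS ({expression})",
    "engine_b": "ADD VIRTUAL COLUMN {column_name}:{data_type} = {expression} ON {table_name}",
}


def translate(request, engine):
    """Translate a construction request into an executable task for one engine."""
    return TRANSLATION_RULES[engine].format(**asdict(request))


request = VirtualColumnRequest("c", "a+1", "int", "t1")
task_a = translate(request, "engine_a")
task_b = translate(request, "engine_b")
```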
  • in this way, the virtual column corresponding to the expression to be used (for example, a virtual column with the character "c" as the column name) is built according to the virtual column construction request, so that the virtual column represents the expression to be used. In the future, the user can use a data query request for the virtual column (for example, the SQL statement "SELECT c FROM t1") to automatically trigger a data query for the expression to be used. This avoids the problems that arise when the user manually enters a data query request for the expression to be used (for example, how to write the expression correctly), thereby effectively improving the user's data query experience.
  • the embodiment of this application also provides a data query method based on the data lake, as shown in Figure 6, the method includes S601-S603:
  • S601 Obtain the first data query request.
  • the first data query request is used to request data query for the target virtual column.
  • the first data query request is used to request data query for a target virtual column in the target data table.
  • the target virtual column is constructed using any implementation of the data lake-based virtual column construction method provided by the embodiments of this application; and the target data table refers to the data table to which the target virtual column belongs. In order to facilitate understanding, the following is explained with examples.
  • for example, when the target data table above is the data table with the string "t1" as the table name and the target virtual column is the virtual column with the character "c" as the column name, the first data query request may specifically be the SQL statement shown in row 2, column 1 of Table 4 above.
  • S602 Use the expression corresponding to the target virtual column to replace the column name of the target virtual column in the first data query request to obtain the second data query request.
  • for example, the column name of the target virtual column (for example, the character "c") and the table name corresponding to the target virtual column (for example, the string "t1") can first be extracted from the first data query request, and the number mapping relationship corresponding to that table name is determined; this number mapping relationship records the number corresponding to each virtual column in the data table with that table name. Next, the number corresponding to the column name of the target virtual column is looked up in the number mapping relationship as the number to be used. Then, the expression corresponding to the number to be used (for example, the expression "a+1") is looked up in HiveMetaStore. Finally, this expression is used to replace the column name of the target virtual column in the first data query request, yielding the second data query request (for example, SELECT a+1 FROM t1).
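The lookup-and-replace flow of S602 can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionaries `NUMBER_MAPPING` and `EXPRESSIONS` are hypothetical in-memory stand-ins for the number mapping relationship and the HiveMetaStore metadata, and the regular-expression substitution is a simplification (a real implementation would more likely rewrite a parsed syntax tree, so that string literals and aliases are handled correctly).

```python
import re

# Hypothetical stand-ins for the per-table number mapping relationship and
# for the expressions stored in HiveMetaStore; the real layout is not given.
NUMBER_MAPPING = {"t1": {"c": 1}}
EXPRESSIONS = {("t1", 1): "a+1"}


def rewrite_query(sql: str, table: str, virtual_column: str) -> str:
    """Replace a virtual column name in `sql` with its stored expression."""
    number = NUMBER_MAPPING[table][virtual_column]   # number to be used
    expression = EXPRESSIONS[(table, number)]        # expression for that number
    # Whole-word match, so a column such as "cc" is left untouched.
    pattern = r"\b" + re.escape(virtual_column) + r"\b"
    return re.sub(pattern, lambda _match: expression, sql)


second_request = rewrite_query("SELECT c FROM t1", "t1", "c")
print(second_request)
```

The whole-word pattern is a design choice of this sketch; it prevents a short virtual column name from being substituted inside longer identifiers.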
  • S603 Perform data query processing according to the second data query request.
  • the executable task corresponding to the second engine can first be generated according to the second data query request; the executable task can then be sent to the second engine, so that the second engine completes the data query process for the expression represented by the target virtual column by executing the executable task.
  • the second engine refers to an execution engine used to perform data query processing on the expression represented by the target virtual column; the embodiments of the present application do not limit the second engine. For example, it can be an engine specified by the user, or an execution engine selected based on current resource conditions.
  • the embodiments of this application do not limit the above generation process of the "executable task corresponding to the second engine". For example, it may be as follows: after the second data query request is obtained, the translation rules corresponding to the second engine are used to translate the second data query request into the executable task corresponding to the second engine.
  • embodiments of the present application also provide a data lake-based virtual column construction device, which will be explained and described below with reference to the accompanying drawings. It should be noted that, for technical details of the data lake-based virtual column construction device provided in this application, please refer to the relevant content of the data lake-based virtual column construction method mentioned above.
  • Figure 7 is a schematic structural diagram of a data lake-based virtual column construction device provided by an embodiment of the present application.
  • the data lake-based virtual column construction device 700 provided by the embodiment of this application includes:
  • the expression determination unit 701 is configured to determine an expression to be used from the at least one statement to be analyzed after acquiring at least one statement to be analyzed;
  • the information determining unit 702 is used to determine the virtual column construction description information corresponding to the expression to be used; the virtual column construction description information includes the expression to be used;
  • the request generation unit 703 is configured to generate a virtual column construction request according to the virtual column construction description information corresponding to the expression to be used;
  • the virtual column construction unit 704 is configured to construct a virtual column corresponding to the expression to be used according to the virtual column construction request.
  • the number of statements to be analyzed is N;
  • the expression determination unit 701 includes:
  • the first determination subunit is used to determine the nth expression to be analyzed from the nth statement to be analyzed; n is a positive integer, n ≤ N, and N is a positive integer;
  • the first statistical subunit is used to perform statistical analysis and processing on N expressions to be analyzed to obtain expression statistical results;
  • the second determination subunit is used to determine the first expression as the expression to be used if the expression statistical result indicates that the occurrence frequency of the first expression among the N expressions to be analyzed is higher than the preset frequency threshold.
  • the first determining subunit is specifically configured to: perform syntax conversion processing on the nth statement to be analyzed to obtain a syntax conversion result; and extract the nth expression to be analyzed from the syntax conversion result.
  • the expression determination unit 701 further includes:
  • the syntax tree construction subunit is used to perform syntax tree construction processing on the nth expression to be analyzed, and obtain the syntax tree of the nth expression to be analyzed;
  • the first statistical subunit is specifically used to perform statistical analysis on the syntax trees of the N expressions to be analyzed, and obtain statistical analysis results.
  • the virtual column construction description information also includes a column name
  • the information determining unit 702 includes:
  • the third determination subunit is used to determine the column name corresponding to the expression to be used from the at least one statement to be referenced if there is at least one statement to be referenced in the at least one statement to be analyzed; the expression carried by the statement to be referenced and the expression to be used satisfy a preset semantic identity condition; the statement to be referenced includes the column name corresponding to the expression carried by the statement to be referenced.
  • the third determining subunit includes:
  • the second statistical subunit is used to perform statistical analysis and processing on the column names corresponding to the expressions carried by the at least one to-be-referenced statement to obtain column name statistical results;
  • the fourth determination subunit is used to determine the column name corresponding to the expression to be used based on the column name statistical result.
  • the column name corresponding to the expression carried by the at least one statement to be referenced includes the target column name
  • the fourth determination subunit is specifically used to: if the column name statistical result indicates that the frequency of occurrence of the target column name meets the preset frequency condition, determine the target column name as the column name corresponding to the expression to be used.
  • the information determining unit 702 further includes:
  • the fifth determination subunit is used to, if the statement to be referenced does not exist in the at least one statement to be analyzed, determine a second expression from at least one preset expression to be matched according to the similarity representation data between the at least one expression to be matched and the expression to be used; the similarity representation data between the second expression and the expression to be used satisfies a preset similarity condition;
  • the first search unit is used to search for the column name corresponding to the second expression from a pre-constructed mapping relationship; the mapping relationship is used to record the correspondence between each expression to be matched and the column name corresponding to each expression to be matched;
  • the sixth determination subunit is used to determine the column name corresponding to the expression to be used according to the column name corresponding to the second expression.
  • the number of expressions to be matched is M
  • the information determining unit 702 also includes:
  • the seventh determination subunit is used to determine the field name vector and keyword vector of the expression to be used;
  • the eighth determination subunit is used to determine the similarity representation data between the mth expression to be matched and the expression to be used according to the similarity between the field name vector of the expression to be used and the field name vector of the mth expression to be matched, and the similarity between the keyword vector of the expression to be used and the keyword vector of the mth expression to be matched.
  • the seventh determining subunit includes:
  • the ninth determination subunit is used to perform field name extraction processing on the expression to be used to obtain a field name extraction result; perform vectorization processing on the field name extraction result to obtain a field name vector of the expression to be used .
  • the seventh determining subunit includes:
  • the tenth determination subunit is used to perform keyword extraction processing on the expression to be used to obtain a keyword extraction result; perform vectorization processing on the keyword extraction result to obtain a keyword vector of the expression to be used. .
  • the virtual column construction description information also includes a data type; the data type is determined according to the data type corresponding to the column name carried by the expression to be used.
  • the data lake-based virtual column construction device 700 can first automatically perform expression statistical analysis on a large number of statements to be analyzed in the data lake (for example, SQL statements under various engines) to obtain expressions to be used that meet the preset virtual column construction conditions (for example, expressions that appear more frequently); then, based on the virtual column construction description information corresponding to the expression to be used (for example, column name, data type, expression, etc.), it automatically builds a virtual column construction request corresponding to the expression to be used, so that the request can be used to request construction of a virtual column that can represent the expression to be used; then, according to the virtual column construction request, it constructs the virtual column corresponding to the expression to be used, so that the virtual column can represent the expression to be used. In the future, users can use a data query request for the virtual column to automatically trigger the data query request for the expression to be used, which avoids the problems that arise when the user manually enters a data query request for that expression (for example, how to write a correct expression) and effectively improves the user's data query experience.
  • embodiments of the present application also provide a data query device based on the data lake, which will be explained and described below with reference to the accompanying drawings. It should be noted that for the technical details of the data query device based on the data lake provided in this application, please refer to the relevant content of the data query method based on the data lake mentioned above.
  • FIG 8 is a schematic structural diagram of a data query device based on a data lake provided by an embodiment of the present application.
  • the data query device 800 based on the data lake provided by the embodiment of this application includes:
  • the request acquisition unit 801 is used to obtain a first data query request; the first data query request is used to request data query for a target virtual column; wherein the target virtual column is constructed using any implementation of the data lake-based virtual column construction method provided by this application;
  • the information replacement unit 802 is configured to use the expression corresponding to the target virtual column to replace the column name of the target virtual column in the first data query request to obtain a second data query request;
  • the data query unit 803 is configured to perform data query processing according to the second data query request.
  • a data query request for the virtual column (for example, the SQL statement "SELECT c FROM t1") can be used to trigger the data query request for the corresponding expression, so that the user does not have to deal with the factors that must be considered when entering the expression directly (for example, how to write a correct expression), which can effectively improve the user's data query experience.
  • embodiments of the present application also provide an electronic device, which includes a processor and a memory: the memory is used to store instructions or computer programs; the processor is used to execute the instructions or computer programs in the memory, so as to cause the electronic device to execute any implementation of the data lake-based virtual column construction method provided by this application, or to execute any implementation of the data lake-based data query method provided by this application.
  • Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablets), PMPs (Portable Multimedia Players), and vehicle-mounted terminals (such as car navigation terminals), and fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 9 is only an example and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 900 may include a processing device (e.g., central processing unit, graphics processor, etc.) 901 that may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903.
  • in the RAM 903, various programs and data required for the operation of the electronic device 900 are also stored.
  • the processing device 901, ROM 902 and RAM 903 are connected to each other via a bus 904.
  • An input/output (I/O) interface 905 is also connected to bus 904.
  • the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 907 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 908 including, for example, a magnetic tape, hard disk, etc.; and a communication device 909.
  • the communication device 909 may allow the electronic device 900 to communicate wirelessly or wiredly with other devices to exchange data.
  • although FIG. 9 illustrates an electronic device 900 having various means, it should be understood that implementation or availability of all illustrated means is not required. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 909, or from storage device 908, or from ROM 902.
  • when the computer program is executed by the processing device 901, the above-mentioned functions defined in the method of the embodiment of the present disclosure are performed.
  • the electronic device provided by the embodiments of the present disclosure and the method provided by the above embodiments belong to the same inventive concept.
  • Technical details that are not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
  • Embodiments of the present application also provide a computer-readable medium. Instructions or computer programs are stored in the computer-readable medium. When the instructions or computer programs are run on a device, they cause the device to execute any implementation of the data lake-based virtual column construction method provided by this application, or any implementation of the data lake-based data query method provided by this application.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optics, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device .
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communications network). Examples of communications networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the above method.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagram may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or can be implemented using a combination of special-purpose hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure can be implemented in software or hardware. Among them, the name of the unit/module does not constitute a limitation on the unit itself under certain circumstances.
  • examples of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus or devices, or any suitable combination of the foregoing.
  • more specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • "At least one (item)" refers to one or more, and "plurality" refers to two or more.
  • "And/or" is used to describe the relationship between associated objects, indicating that there can be three relationships. For example, "A and/or B" can mean: only A exists, only B exists, or A and B exist simultaneously, where A and B can be singular or plural. The character "/" generally indicates that the related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items.
  • for example, "at least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, c can be single or multiple.
  • random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.

Abstract

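This application discloses a data lake-based virtual column construction method and a data query method. First, expression statistical analysis is automatically performed on a large number of statements to be analyzed in a data lake to obtain an expression to be used. Next, a virtual column construction request corresponding to the expression to be used is automatically built from the virtual column construction description information corresponding to that expression, so that the request can be used to request construction of a virtual column that represents the expression to be used. Then, the virtual column corresponding to the expression to be used is constructed according to the virtual column construction request, so that the virtual column represents the expression. A user can later issue a data query request for the virtual column to automatically trigger a data query request for the expression to be used, which avoids the problems that arise when the user manually enters a data query request for the expression and thereby effectively improves the user's data query experience.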

Description

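A data lake-based virtual column construction method and data query method
This application claims priority to Chinese patent application No. 202210892749.6, entitled "一种基于数据湖的虚拟列构建方法以及数据查询方法" (A data lake-based virtual column construction method and data query method), filed with the China Patent Office on July 27, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of data processing technology, and in particular to a data lake-based virtual column construction method and a data query method.
Background
With the development of information technology, data lakes are used in more and more scenarios.
In practice, a large number of data query tasks may run against the data every day to meet users' data query needs.
However, defects in some data query solutions lead to a poor data query experience for users.
Summary
To solve the above technical problem, this application provides a data lake-based virtual column construction method and a data query method, which can improve the user's data query experience.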
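To achieve the above objective, the embodiments of this application provide the following technical solutions:
An embodiment of this application provides a data lake-based virtual column construction method, the method including:
after at least one statement to be analyzed is acquired, determining an expression to be used from the at least one statement to be analyzed;
determining virtual column construction description information corresponding to the expression to be used; the virtual column construction description information includes the expression to be used;
generating a virtual column construction request according to the virtual column construction description information corresponding to the expression to be used;
constructing the virtual column corresponding to the expression to be used according to the virtual column construction request.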
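In a possible implementation, the number of statements to be analyzed is N;
determining the expression to be used from the at least one statement to be analyzed includes:
determining the nth expression to be analyzed from the nth statement to be analyzed; n is a positive integer, n ≤ N, and N is a positive integer;
performing statistical analysis on the N expressions to be analyzed to obtain an expression statistical result;
if the expression statistical result indicates that the occurrence frequency of a first expression among the N expressions to be analyzed is higher than a preset frequency threshold, determining the first expression as the expression to be used.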
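The frequency-threshold selection described above can be illustrated with a short Python sketch. It is not the method of this application as implemented: the whitespace normalization and the function name `select_expressions` are assumptions made for illustration, and the later implementations compare syntax trees rather than raw strings.

```python
from collections import Counter


def select_expressions(expressions, threshold):
    """Return the expressions whose occurrence count is above `threshold`."""
    # Whitespace is removed so that "a + 1" and "a+1" are counted together;
    # this normalization is an assumption of the sketch, not part of the text.
    counts = Counter("".join(expression.split()) for expression in expressions)
    return sorted(e for e, count in counts.items() if count > threshold)


# N = 5 expressions to be analyzed, one extracted from each statement.
analyzed = ["a+1", "a + 1", "b*2", "a+1", "upper(name)"]
print(select_expressions(analyzed, threshold=2))
```

With a threshold of 2, only the expression that occurs three times is kept as an expression to be used.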
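In a possible implementation, determining the nth expression to be analyzed from the nth statement to be analyzed includes:
performing syntax conversion on the nth statement to be analyzed to obtain a syntax conversion result;
extracting the nth expression to be analyzed from the syntax conversion result.
In a possible implementation, the method further includes:
performing syntax tree construction on the nth expression to be analyzed to obtain the syntax tree of the nth expression to be analyzed;
performing statistical analysis on the N expressions to be analyzed to obtain a statistical analysis result includes:
performing statistical analysis on the syntax trees of the N expressions to be analyzed to obtain the statistical analysis result.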
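In a possible implementation, the virtual column construction description information further includes a column name;
the process of determining the column name corresponding to the expression to be used includes:
if at least one statement to be referenced exists among the at least one statement to be analyzed, determining the column name corresponding to the expression to be used from the at least one statement to be referenced; the expression carried by the statement to be referenced and the expression to be used satisfy a preset semantic identity condition; the statement to be referenced includes the column name corresponding to the expression it carries.
In a possible implementation, determining the column name corresponding to the expression to be used from the at least one statement to be referenced includes:
performing statistical analysis on the column names corresponding to the expressions carried by the at least one statement to be referenced to obtain a column name statistical result;
determining the column name corresponding to the expression to be used according to the column name statistical result.
In a possible implementation, the column names corresponding to the expressions carried by the at least one statement to be referenced include a target column name;
determining the column name corresponding to the expression to be used according to the column name statistical result includes:
if the column name statistical result indicates that the occurrence frequency of the target column name satisfies a preset frequency condition, determining the target column name as the column name corresponding to the expression to be used.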
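In a possible implementation, the method further includes:
if no statement to be referenced exists among the at least one statement to be analyzed, determining a second expression from at least one preset expression to be matched according to the similarity representation data between the at least one expression to be matched and the expression to be used; the similarity representation data between the second expression and the expression to be used satisfies a preset similarity condition;
looking up the column name corresponding to the second expression in a pre-constructed mapping relationship; the mapping relationship records the correspondence between each expression to be matched and its corresponding column name;
determining the column name corresponding to the expression to be used according to the column name corresponding to the second expression.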
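In a possible implementation, the number of expressions to be matched is M;
the process of determining the similarity representation data between the mth expression to be matched and the expression to be used includes:
determining the field name vector and the keyword vector of the expression to be used;
determining the similarity representation data between the mth expression to be matched and the expression to be used according to the similarity between the field name vector of the expression to be used and the field name vector of the mth expression to be matched, and the similarity between the keyword vector of the expression to be used and the keyword vector of the mth expression to be matched.
In a possible implementation, the process of determining the field name vector of the expression to be used includes:
performing field name extraction on the expression to be used to obtain a field name extraction result; vectorizing the field name extraction result to obtain the field name vector of the expression to be used.
In a possible implementation, the process of determining the keyword vector of the expression to be used includes:
performing keyword extraction on the expression to be used to obtain a keyword extraction result; vectorizing the keyword extraction result to obtain the keyword vector of the expression to be used.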
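One way to picture the two-vector similarity is the Python sketch below. Everything specific in it is an assumption: the text does not say how field names and keywords are extracted or vectorized, nor how the two similarities are combined, so the sketch uses bag-of-words counts, cosine similarity, an illustrative `KEYWORDS` set, and equal weights. Function names such as `upper` would be treated as field names here, which a real extractor would presumably avoid.

```python
import math
import re
from collections import Counter

# Assumed keyword list; the text does not specify which tokens count as keywords.
KEYWORDS = {"case", "when", "then", "else", "end", "cast", "as", "and", "or"}


def vectors(expression):
    """Split an expression into bag-of-words field-name and keyword vectors."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z_]\w*", expression)]
    keywords = Counter(t for t in tokens if t in KEYWORDS)
    fields = Counter(t for t in tokens if t not in KEYWORDS)
    return fields, keywords


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity(expr_a, expr_b, field_weight=0.5, keyword_weight=0.5):
    """Weighted sum of field-name similarity and keyword similarity."""
    fields_a, keywords_a = vectors(expr_a)
    fields_b, keywords_b = vectors(expr_b)
    return (field_weight * cosine(fields_a, fields_b)
            + keyword_weight * cosine(keywords_a, keywords_b))


print(similarity("cast(price as int)", "cast(amount as int)"))
```

The second expression would then be the expression to be matched whose score satisfies the preset similarity condition, for instance the highest score above some cutoff.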
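In a possible implementation, the virtual column construction description information further includes a data type; the data type is determined according to the data type corresponding to the column name carried by the expression to be used.
An embodiment of this application further provides a data lake-based data query method, the method including:
acquiring a first data query request; the first data query request is used to request a data query for a target virtual column; wherein the target virtual column is constructed using any implementation of the data lake-based virtual column construction method provided by the embodiments of this application;
replacing the column name of the target virtual column in the first data query request with the expression corresponding to the target virtual column to obtain a second data query request;
performing data query processing according to the second data query request.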
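An embodiment of this application further provides a data lake-based virtual column construction device, including:
an expression determination unit, configured to determine an expression to be used from at least one statement to be analyzed after the at least one statement to be analyzed is acquired;
an information determination unit, configured to determine the virtual column construction description information corresponding to the expression to be used; the virtual column construction description information includes the expression to be used;
a request generation unit, configured to generate a virtual column construction request according to the virtual column construction description information corresponding to the expression to be used;
a virtual column construction unit, configured to construct the virtual column corresponding to the expression to be used according to the virtual column construction request.
An embodiment of this application further provides a data lake-based data query device, including:
a request acquisition unit, configured to acquire a first data query request; the first data query request is used to request a data query for a target virtual column; wherein the target virtual column is constructed using any implementation of the data lake-based virtual column construction method provided by the embodiments of this application;
an information replacement unit, configured to replace the column name of the target virtual column in the first data query request with the expression corresponding to the target virtual column to obtain a second data query request;
a data query unit, configured to perform data query processing according to the second data query request.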
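An embodiment of this application further provides an electronic device, the device including a processor and a memory;
the memory is configured to store instructions or a computer program;
the processor is configured to execute the instructions or computer program in the memory, so that the electronic device executes any implementation of the data lake-based virtual column construction method provided by the embodiments of this application, or executes any implementation of the data lake-based data query method provided by the embodiments of this application.
An embodiment of this application further provides a computer-readable medium storing instructions or a computer program which, when run on a device, cause the device to execute any implementation of the data lake-based virtual column construction method provided by the embodiments of this application, or any implementation of the data lake-based data query method provided by the embodiments of this application.
An embodiment of this application further provides a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing any implementation of the data lake-based virtual column construction method provided by the embodiments of this application, or program code for executing any implementation of the data lake-based data query method provided by the embodiments of this application.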
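Compared with the prior art, the embodiments of this application have at least the following advantages:
In the technical solution provided by the embodiments of this application, for a data lake scenario, expression statistical analysis can first be performed automatically on a large number of statements to be analyzed in the data lake (for example, SQL statements under various engines) to obtain an expression to be used that meets a preset virtual column construction condition (for example, an expression with a relatively high occurrence frequency). Next, a virtual column construction request corresponding to the expression to be used is automatically built from the virtual column construction description information corresponding to that expression (for example, column name, data type, expression, etc.), so that the request can be used to request construction of a virtual column that represents the expression to be used. Then, the virtual column corresponding to the expression to be used is constructed according to the virtual column construction request, so that the virtual column represents the expression. A user can later issue a data query request for the virtual column to automatically trigger a data query request for the expression to be used, which avoids the problems that arise when the user manually enters a data query request for the expression (for example, how to write a correct expression) and thereby effectively improves the user's data query experience.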
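Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments recorded in this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of the metadata of a virtual column provided by an embodiment of this application;
Figure 2 is a flowchart of a data lake-based virtual column construction method provided by an embodiment of this application;
Figure 3 is a schematic diagram of a syntax conversion provided by an embodiment of this application;
Figure 4 is a schematic diagram of a syntax tree provided by an embodiment of this application;
Figure 5 is a schematic diagram of a mapping library provided by an embodiment of this application;
Figure 6 is a flowchart of a data lake-based data query method provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of a data lake-based virtual column construction device provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of a data lake-based data query device provided by an embodiment of this application;
Figure 9 is a schematic structural diagram of an electronic device provided by an embodiment of this application.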
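Detailed Description
To facilitate understanding of the technical solution of this application, some technical terms involved in this application are explained below.
Structured Query Language (SQL) is a database query and programming language; SQL can be used for data access, data query, data update, data management, and so on against a data lake.
A data lake is a centralized repository that lets users store structured and unstructured data from multiple sources at any scale. Data can be stored as-is, without first being structured, and different types of analytics (for example, big data processing, real-time analytics, and machine learning) can be run on the data to guide better decisions.
JavaCC (Java Compiler Compiler) is a generator that produces parsers and lexical analyzers.
Based on the above technical terms, the technical solution of this application is described below.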
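To solve the technical problem described in the Background section, this application provides a data lake-based virtual column construction method. For a data lake scenario, expression statistical analysis can first be performed automatically on a large number of statements to be analyzed in the data lake (for example, SQL statements under various engines) to obtain an expression to be used that meets a preset virtual column construction condition (for example, the expression "a+1"). Next, a virtual column construction request corresponding to the expression to be used is automatically built from the virtual column construction description information corresponding to that expression (for example, column name, data type, expression, etc.), so that the request can be used to request construction of a virtual column that represents the expression to be used. Then, the virtual column corresponding to the expression to be used (for example, a virtual column with the character "c" as its column name) is constructed according to the virtual column construction request, so that the virtual column represents the expression. A user can later issue a data query request for the virtual column (for example, the SQL statement "SELECT c FROM t1") to automatically trigger a data query request for the expression to be used, which avoids the problems that arise when the user manually enters a data query request for the expression (for example, how to write a correct expression) and thereby effectively improves the user's data query experience.
In addition, the embodiments of this application do not limit the execution subject of the data lake-based virtual column construction method (or the data lake-based data query method). For example, the method can be applied to a device with data processing capability, such as a terminal device or a server. The terminal device may be a smartphone, a computer, a personal digital assistant (PDA), a tablet computer, or the like. The server may be a standalone server, a cluster server, or a cloud server.
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.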
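To facilitate a better understanding of the technical solution of this application, virtual columns are introduced first.
Virtual columns
A virtual column represents an expression. For example, a virtual column with the character "c" as its column name (hereinafter, virtual column c) can represent the expression "a+1", so that a user can later query the expression "a+1" through a query statement for the virtual column (for example, the SQL statement "SELECT c FROM t1").
To make virtual columns usable in a data lake, the embodiments of this application also provide some syntax for virtual columns, described below with reference to Tables 1 to 4.
(1) Virtual column construction statements: as shown in Table 1 below, for an expression (for example, the expression "a+1"), the syntax definition shown in Table 1 can first be used to generate the SQL statement that constructs the virtual column for the expression (for example, the SQL statement shown in row 2, column 1 of Table 1 below); the syntax implementation shown in Table 1 is then used to execute that SQL statement.
Table 1: Virtual column construction statements
For the SQL statement shown in row 2, column 1 of Table 1 above, the string "t1" is the name of the data table to which the virtual column is to be added; the character "c" is the column name of the virtual column; the string "int" is the data type of the virtual column; and the string "a+1" is the expression represented by the virtual column. Thus, when data table t1 already contains an ordinary column named a (hereinafter, ordinary column a), the SQL statement shown in row 2, column 1 of Table 1 expresses the following: add a virtual column to data table t1 (that is, the data table with the string "t1" as its table name); virtual column c represents the expression "a+1"; and the data type of virtual column c is int, so that virtual column c can express an expression related to ordinary column a. It follows that each virtual column and the data columns involved in its expression (for example, ordinary column a) belong to the same data table. Here, an ordinary column is a data column in a data table that records data.
For step 11 in Table 1 above, the embodiments of this application do not limit its implementation; for example, any existing or future statement parsing method may be used. As another example, step 11 may specifically include: after the written lexical file (for example, the SQL statement shown in row 2, column 1 of Table 1 above) is obtained, parsing it with a pre-built SQL parser (for example, converting an SQL statement into a syntax tree) to obtain a parsing result. Note that the embodiments of this application do not limit the SQL parser; for example, it may be generated with JavaCC.
For step 12 in Table 1 above, the embodiments of this application do not limit its implementation; for example, any existing or future statement validation method may be used. As another example, step 12 may specifically include: after the above parsing result is obtained, validating the legality of the column name, type, and expression in the parsing result.
Note that the statement validation in step 12 can do more than validate the legality of information. It can also use pre-built type conversion rules to convert illegal information in the parsing result into legal information, which effectively avoids the adverse effects of illegal information. One example of such a rule: if the column name of the virtual column already appears among the column names of existing data columns in the data table to which the virtual column belongs, a number can be appended to the end of the virtual column's name so that the number distinguishes the virtual column from those existing data columns.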
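The renaming rule mentioned above can be sketched as a small Python helper. The text only says that a number is appended to the end of the column name; choosing the smallest positive integer that avoids a clash, and the name `resolve_column_name`, are assumptions of this sketch.

```python
def resolve_column_name(name, existing_columns):
    """Return `name`, or `name` plus a numeric suffix if it is already taken."""
    if name not in existing_columns:
        return name
    suffix = 1
    # Assumed policy: try 1, 2, 3, ... until the candidate no longer clashes.
    while f"{name}{suffix}" in existing_columns:
        suffix += 1
    return f"{name}{suffix}"


print(resolve_column_name("c", {"a", "b", "c"}))
```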
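For step 13 in Table 1 above, the embodiments of this application do not limit its implementation; for example, any existing or future statement execution method may be used. As another example, step 13 may specifically include: after statement validation of the above parsing result is complete, first extracting the information related to the virtual column from the parsing result (for example, column name, data type, expression, comment, etc.); then adding the virtual column to the data table to which it belongs (for example, the data table with the string "t1" as its table name) and storing the related information in the metadata service (HiveMetaStore) as the metadata of the virtual column, so that the information can later be looked up from the metadata service.
In addition, the embodiments of this application do not limit how the above "information related to the virtual column" is stored. For example, when this information includes the column name (name), the expression (expression), the data type (type), and the declaration (comment) of the virtual column, all four can be stored in the metadata of HiveMetaStore.
Furthermore, HiveMetaStore usually stores data as key-value pairs, and it imposes a character length limit on each key-value pair. Because an expression may exceed this limit, the embodiments of this application also provide a possible way of storing the expression of a virtual column: if the expression of the virtual column exceeds the character length limit, the expression is split, with reference to the limit, into multiple sub-expressions (for example, the 1st to the Rth sub-expressions shown in Figure 1) for storage, so that every sub-expression complies with the limit. Here, R is a positive integer.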
需要说明的是,因为一个数据表(例如,以“t1”这一字符串作为表名的数据表)中可能会存在很多虚拟列,故为了在HiveMetaStore中更好地存储以及查询这些虚拟列的相关信息,可以针对每个虚拟列给予一个编号(例如,图1中101区域所示的“1”这一个编号),以使该编号能够唯一标识该虚拟列,以便日后能够按照该编号从HiveMetaStore中查询该虚拟列的相关信息。另外,对于一个数据表来说,在该数据表中每新增一个虚拟列,则会在该数据表对应的当前最大编号上加1,以得到该新增虚拟列的编号。
还需要说明的是,在HiveMetaStore中,还约定了只有虚拟列可以以“virtual.”这一字符串作为前缀,以使该“virtual.”这一字符串可以用于标识虚拟列,从而使得日后能够基于该“virtual.”这一字符串判断HiveMetaStore中哪些内容是虚拟列的元数据,哪些内容不是虚拟列的元数据。
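The storage scheme above (a reserved "virtual." prefix, a per-table number, and expressions split to fit a length limit) can be sketched roughly as follows. This is only an illustration under stated assumptions: the key layout `virtual.<number>.<field>` and the helper names are hypothetical, since Figure 1 is not reproduced here, and a plain dictionary stands in for the metadata service.

```python
from typing import Dict, Iterable, List

PREFIX = "virtual."  # prefix reserved for virtual-column metadata (from the text)


def split_expression(expression: str, max_len: int) -> List[str]:
    """Split an expression into consecutive chunks of at most max_len characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [expression[i:i + max_len] for i in range(0, len(expression), max_len)]


def next_number(existing_numbers: Iterable[int]) -> int:
    """A new virtual column gets the table's current maximum number plus 1."""
    return max(existing_numbers, default=0) + 1


def to_key_values(number: int, name: str, col_type: str,
                  expression: str, comment: str, max_len: int) -> Dict[str, str]:
    """Build key-value pairs for one virtual column (hypothetical key layout)."""
    base = f"{PREFIX}{number}."
    kv = {base + "name": name, base + "type": col_type, base + "comment": comment}
    chunks = split_expression(expression, max_len)
    kv[base + "expression.count"] = str(len(chunks))
    for r, chunk in enumerate(chunks, start=1):
        kv[f"{base}expression.{r}"] = chunk  # r-th sub-expression
    return kv


def join_expression(kv: Dict[str, str], number: int) -> str:
    """Reassemble the full expression from its stored sub-expressions."""
    base = f"{PREFIX}{number}."
    count = int(kv[base + "expression.count"])
    return "".join(kv[f"{base}expression.{r}"] for r in range(1, count + 1))
```

Splitting purely by character count may cut a token in half, which is harmless here because the chunks are only ever concatenated back in order before use.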
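Note further that the purpose of the step "adding the virtual column to the data table to which it belongs (for example, the data table named "t1")" is to hide, from users of the data lake, any difference between using ordinary columns and using virtual columns, which improves the user's data query experience.
(2) Virtual column deletion statement: as shown in Table 2 below, for a given virtual column (for example, the virtual column named "c"), the syntax definition in Table 2 is used to generate an SQL statement that deletes the virtual column (for example, the SQL statement in row 2, column 1 of Table 2); the syntax implementation in Table 2 is then used to execute that SQL statement.
Table 2: Virtual column deletion statement
In the SQL statement in row 2, column 1 of Table 2, the string "t1" is the name of the data table from which the virtual column is deleted, and the character "c" is the column name of the virtual column.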
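For step 21 in Table 2, the embodiments do not limit how the step is implemented; any existing or future statement parsing method may be used. As another example, step 21 may include: after obtaining the written lexical file (for example, the SQL statement in row 2, column 1 of Table 2), parsing the lexical file with a pre-built SQL parser (for example, converting an SQL statement into a syntax tree) to obtain a parsing result.
For step 22 in Table 2, the embodiments do not limit how the step is implemented; any existing or future statement validation method may be used. As another example, step 22 may include: after obtaining the parsing result, verifying whether the table (for example, the data table named "t1") and the virtual column to be deleted (for example, the virtual column named "c") exist. The existence check for the column can be skipped by means of the string "IF EXISTS" in Table 2; that is, if virtual column c does not exist in data table t1, "IF EXISTS" causes this problem to be ignored.
For step 23 in Table 2, the embodiments do not limit how the step is implemented; any existing or future statement execution method may be used. As another example, step 23 may include: after the statement validation of the parsing result is complete, first extracting the table name and the column name from the parsing result; then looking up the number corresponding to the column name in the number mapping for that table name (for example, the number "1" shown in region 101 of Figure 1); and finally deleting all key-value pairs under that number (for example, the content shown in region 100 of Figure 1) and deleting the data column with that column name from the data table with that table name. The number mapping records the correspondence between each virtual column in the data table with that table name and the number of that virtual column.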
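(3) Virtual column viewing statement: as shown in Table 3 below, for a given data table (for example, the data table named "t1"), the syntax definition in Table 3 is used to generate an SQL statement that views all virtual columns of the data table (for example, the SQL statement in row 2, column 1 of Table 3); the syntax implementation in Table 3 is then used to execute that SQL statement.
Table 3: Virtual column viewing statement
In the SQL statement in row 2, column 1 of Table 3, the string "t1" is the name of the data table whose virtual columns are to be viewed.
For step 31 in Table 3, the embodiments do not limit how the step is implemented; any existing or future statement parsing method may be used. As another example, step 31 may include: after obtaining the written lexical file (for example, the SQL statement in row 2, column 1 of Table 3), parsing the lexical file with a pre-built SQL parser (for example, converting an SQL statement into a syntax tree) to obtain a parsing result.
For step 32 in Table 3, the embodiments do not limit how the step is implemented; any existing or future statement validation method may be used. As another example, step 32 may include: after obtaining the parsing result, verifying from that result whether the table (for example, the data table named "t1") exists.
For step 33 in Table 3, the embodiments do not limit how the step is implemented; any existing or future statement execution method may be used. As another example, step 33 may include: after the statement validation of the parsing result is complete, first extracting the table name from the parsing result; then looking up in HiveMetaStore, by table name, the information about each virtual column under that table (for example, name, expression, type, and comment); then aggregating that information by number to obtain an information set for each virtual column; and finally sorting the information sets by number for display, which yields the virtual column viewing result for the table. The information set of the d-th virtual column includes its column name, its data type, and its expression. Here d is a positive integer, d ≤ D, and D is a positive integer denoting the number of virtual columns in the data table with that table name.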
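The execution steps for deletion (step 23) and viewing (step 33) both operate on the numbered key-value pairs, so they can be sketched together. The sketch below assumes a hypothetical key layout `virtual.<number>.<field>` and uses a plain dictionary in place of HiveMetaStore; neither is specified by the text.

```python
from collections import defaultdict
from typing import Dict, List

PREFIX = "virtual."  # only virtual-column metadata carries this prefix


def list_virtual_columns(metadata: Dict[str, str]) -> List[dict]:
    """Aggregate virtual-column metadata by number and return it sorted by number."""
    grouped = defaultdict(dict)
    for key, value in metadata.items():
        if not key.startswith(PREFIX):
            continue  # not virtual-column metadata
        number_text, _, field = key[len(PREFIX):].partition(".")
        grouped[int(number_text)][field] = value
    return [{"number": n, **grouped[n]} for n in sorted(grouped)]


def drop_virtual_column(metadata: Dict[str, str], number: int) -> Dict[str, str]:
    """Return a copy of the metadata without any key-value pair under the given number."""
    doomed = f"{PREFIX}{number}."
    return {k: v for k, v in metadata.items() if not k.startswith(doomed)}
```

Because the trailing dot is part of the deletion prefix, dropping number 1 leaves entries for number 10 untouched.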
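(4) Virtual column usage statement (that is, query statement): as shown in Table 4 below, for a given virtual column (for example, virtual column c in data table t1), the syntax definition in Table 4 is used to generate an SQL statement that queries data through the virtual column (for example, the SQL statement in row 2, column 1 of Table 4), so that the SQL statement expresses the need to query that virtual column; the syntax implementation in Table 4 is then used to execute that SQL statement.
In the SQL statement in row 2, column 1 of Table 4, the string "t1" is the name of the data table whose virtual column is queried, and the character "c" is the column name of the virtual column.
Table 4: Virtual column usage statement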
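For step 41 in Table 4, the embodiments do not limit how the step is implemented; any existing or future statement validation method may be used. As another example, step 41 may include: after obtaining the written lexical file (for example, the SQL statement in row 2, column 1 of Table 4), validating the legality of the lexical file (for example, verifying whether virtual column c exists in data table t1).
For step 42 in Table 4, the embodiments do not limit how the step is implemented; any existing or future statement adjustment method may be used. As another example, step 42 may include: after the statement validation of the lexical file is complete, replacing the column name of the virtual column in the lexical file with the expression of the virtual column to obtain a replaced file; then translating the replaced file, according to the translation rules of each engine, into an executable plan for that engine, so that each engine can later complete the virtual column query task by executing its own executable plan.
For step 43 in Table 4, the embodiments do not limit how the step is implemented; any existing or future statement execution method may be used. As another example, step 43 may include: after obtaining the executable plan for an engine, sending the plan to that engine so that the engine completes the virtual column query task by executing it.
From the above, a virtual column that belongs to a data table can represent an expression related to an ordinary column of that table, so that a query task against the virtual column accomplishes a query task against the expression. Moreover, although a constructed virtual column can appear in the data table, it is generally not used to record data. When a data query request is triggered against the virtual column, the data is therefore not read directly from the data table; instead, the query result is determined by means of the virtual column's expression.
In practice, the SQL statement used to construct a virtual column may be entered manually or determined automatically. On this basis, the embodiments also provide schemes for automatically generating that SQL statement (that is, possible implementations of the data-lake-based virtual column construction method), described below with reference to the accompanying drawings.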
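Automatic construction of virtual columns
As shown in Figure 2, the data-lake-based virtual column construction method provided by the embodiments includes S201 to S204:
S201: After at least one statement to be analyzed is obtained, determine an expression to be used from the at least one statement to be analyzed.
The "at least one statement to be analyzed" refers to SQL statements collected from all the engines associated with the data lake (for example, MySQL, Hive, Spark, and Presto). For example, if data processing tasks on the data lake can be executed by H kinds of engines, the "at least one statement to be analyzed" refers to SQL statements collected from those H kinds of engines. Here H is a positive integer.
The embodiments do not limit how the "at least one statement to be analyzed" is obtained. For example: after the h-th engine receives an SQL statement, it may store the statement in a preset storage space, so that once a preset statement analysis condition is met, the SQL statements stored there are read out as statements to be analyzed. Here h is a positive integer, h ≤ H, and H is a positive integer denoting the number of engines associated with the data lake.
The "statement analysis condition" can be preset. For example, it may be that the time difference between the current moment and the moment at which the previous automatic virtual column construction process (for example, S201 to S204) was triggered reaches a preset time difference. This condition makes it possible to pull large numbers of SQL statements periodically and automatically, and to automatically identify, among them, the expressions that call for a virtual column.
The "expression to be used" is an expression that meets a preset virtual column construction condition, so that it denotes an expression for which a virtual column should be constructed. For example, the expression to be used may be "a+1".
The virtual column construction condition can be preset. For example, it may be that the occurrence frequency of the expression is higher than a preset frequency threshold, where the threshold itself can be preset.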
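The embodiments do not limit how S201 is implemented. For example, when the "at least one statement to be analyzed" consists of N statements to be analyzed, S201 may include S2011 to S2013:
S2011: Determine the n-th expression to be analyzed from the n-th statement to be analyzed. Here n is a positive integer, n ≤ N, and N is a positive integer.
The n-th expression to be analyzed represents the semantic information carried by the expression involved in the n-th statement to be analyzed.
The embodiments do not limit how the n-th expression to be analyzed is determined. For example, the expression appearing in the n-th statement to be analyzed may be taken directly as the n-th expression to be analyzed.
In practice, SQL statements from different engines are usually written in different dialects. To improve the analysis, these statements can first be converted into a single dialect, and expressions can then be extracted from the converted statements.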
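On this basis, the embodiments also provide another possible way of determining the n-th expression to be analyzed, which may include S20111 to S20112:
S20111: Perform syntax conversion on the n-th statement to be analyzed to obtain a syntax conversion result.
The syntax conversion converts the n-th statement to be analyzed from a first dialect to a second dialect. The first dialect is the dialect used by the n-th statement to be analyzed. The second dialect is the target dialect.
The "syntax conversion result" expresses, in the second dialect, the semantic information carried by the n-th statement to be analyzed. For example, when the n-th statement to be analyzed is the SQL statement shown at 301 in Figure 3, the syntax conversion result may be the SQL statement shown at 302 in Figure 3. The two statements carry the same semantic information but use different dialects.
S20112: Extract the n-th expression to be analyzed from the syntax conversion result.
In the embodiments, after the syntax conversion result for the n-th statement to be analyzed is obtained (for example, the SQL statement shown at 302 in Figure 3), the expression appearing in the syntax conversion result (for example, the expression "if(col<>null,col,'')" in the SQL statement shown at 302 in Figure 3) may be determined as the n-th expression to be analyzed.
From S2011 it follows that, once the n-th statement to be analyzed is obtained, the n-th expression to be analyzed can be determined from it. That expression represents the semantic information carried by the expression in the statement, so that it can later be judged whether the expression calls for a virtual column.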
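S2012: Perform statistical analysis on the N expressions to be analyzed to obtain an expression statistics result.
The expression statistics result describes the occurrence frequency of each distinct expression among the N expressions to be analyzed. For example, when expression 1, expression 2, ..., and expression Y occur among the N expressions to be analyzed, the expression statistics result may include the number of occurrences of expression 1, the number of occurrences of expression 2, ..., and the number of occurrences of expression Y among the N expressions to be analyzed. Here Y is a positive integer.
The embodiments do not limit how S2012 is implemented; any existing or future method capable of statistical analysis of expressions may be used.
In practice, the statistical analysis usually requires comparing whether any two expressions to be analyzed are the same. To better identify identical expressions, the embodiments also provide another possible implementation of S2012, which may include S20121 to S20122: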
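S20121: Build a syntax tree for the n-th expression to be analyzed to obtain the syntax tree of the n-th expression to be analyzed.
The syntax tree of the n-th expression to be analyzed represents, in the form of an abstract syntax tree, the semantic information carried by that expression. For example, when the n-th expression to be analyzed is "if(col<>null,col,'')", its syntax tree may be the one shown in Figure 4.
S20122: Perform statistical analysis on the syntax trees of the N expressions to be analyzed to obtain a statistical analysis result.
In the embodiments, after the syntax trees of the N expressions to be analyzed are obtained, any two of the syntax trees are first compared to obtain a comparison result that indicates whether the two trees are equal. The pairwise comparison results are then used to determine the statistical analysis result for the N expressions to be analyzed, which indicates the occurrence frequency of each distinct expression among them.
The embodiments do not limit how the "comparison result" is determined. For example, a recursive algorithm may be used to compare whether two syntax trees are equal.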
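A recursive comparison of two syntax trees might look like the sketch below. The `Node` class and the particular tree shape used for "if(col<>null,col,'')" are assumptions made for illustration; Figure 4 is not reproduced here and the text does not prescribe a node structure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """A minimal syntax-tree node: a label plus ordered children (hypothetical shape)."""
    label: str
    children: Tuple["Node", ...] = ()


def trees_equal(a: Node, b: Node) -> bool:
    """Two trees are equal if their roots match and their children match pairwise."""
    if a.label != b.label or len(a.children) != len(b.children):
        return False
    return all(trees_equal(x, y) for x, y in zip(a.children, b.children))


def example_tree() -> Node:
    """One plausible tree for if(col<>null, col, '')."""
    condition = Node("<>", (Node("col"), Node("null")))
    return Node("if", (condition, Node("col"), Node("''")))
```

Comparing trees rather than raw strings means that differences in whitespace or formatting between two statements do not make otherwise identical expressions count as different.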
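From S2012 it follows that, once the N expressions to be analyzed are obtained, statistical analysis can be performed on them to obtain the expression statistics result. That result indicates not only which distinct expressions occur among the expressions to be analyzed but also how often each occurs, so that expressions calling for a virtual column can later be determined from it.
S2013: If the expression statistics result indicates that the occurrence frequency of a first expression among the N expressions to be analyzed is higher than the preset frequency threshold, determine the first expression as the expression to be used.
The first expression is an expression to be analyzed whose occurrence frequency is higher than the preset frequency threshold. For example, the first expression may be "if(col<>null,col,'')".
The embodiments do not limit how the first expression is determined. For example, when the expression statistics result includes the number of occurrences of expression 1, expression 2, ..., and expression Y among the N expressions to be analyzed, it can be judged whether the number of occurrences of expression y is higher than the preset frequency threshold; if so, expression y may be determined as the first expression. Here y is a positive integer and y ≤ Y.
From S2013 it follows that, once the expression statistics result for the N expressions to be analyzed is obtained, the first expression, whose occurrence frequency is higher than the preset frequency threshold, can be determined from the N expressions on the basis of that result and taken as the expression to be used.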
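S2012 and S2013 together amount to counting distinct expressions and keeping those above a threshold. A minimal sketch follows; the whitespace-stripping `normalize` helper is an assumption standing in for the syntax-tree comparison described earlier, and the function names are illustrative.

```python
from collections import Counter
from typing import Iterable, List


def normalize(expression: str) -> str:
    """Crude stand-in for syntax-tree equality: ignore all whitespace."""
    return "".join(expression.split())


def frequent_expressions(expressions: Iterable[str], threshold: int) -> List[str]:
    """Return the distinct expressions that occur strictly more than threshold times."""
    counts = Counter(normalize(e) for e in expressions)
    return [expr for expr, n in counts.items() if n > threshold]
```

The comparison is strict (`>`), matching the text's wording that the frequency must be higher than the preset threshold.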
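From S201 it follows that, in a data lake scenario, after statements to be analyzed are obtained from the many engines associated with the data lake, the expressions that meet the preset virtual column construction condition can be determined from those statements and taken as expressions to be used, so that virtual columns can later be constructed automatically for them.
S202: Determine the virtual column construction description information corresponding to the expression to be used.
The virtual column construction description information corresponding to the expression to be used describes what must be referenced when constructing the virtual column for that expression (for example, the column name of the virtual column, its data type, its expression, and the name of the data table to which it belongs).
The embodiments do not limit the "virtual column construction description information corresponding to the expression to be used". For example, it may include the column name of the virtual column, the data type of the virtual column, the expression of the virtual column, and the name of the data table to which the virtual column belongs. The virtual column here is the data column that represents the expression to be used.
The "column name of the virtual column" (that is, the column name corresponding to the expression to be used) is the name of the virtual column that represents the expression to be used. For example, when the expression to be used is "if(col<>null,col,'')" at 302 in Figure 3, the name of the virtual column representing it may be the string "not_null_col" at 302 in Figure 3.
The embodiments do not limit how the "column name of the virtual column" is determined. For example, the process may include steps 51 to 54: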
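Step 51: Judge whether the at least one statement to be analyzed contains a reference statement that satisfies a preset statement reference condition. If so, perform step 52; if not, perform steps 53 and 54.
A reference statement is a statement to be analyzed that satisfies the preset statement reference condition. The statement reference condition is used to select, from the at least one statement to be analyzed, content related to the column name corresponding to the expression to be used. The embodiments do not limit this condition. For example, it may be that the expression carried by the reference statement and the expression to be used satisfy a preset semantic-equivalence condition, and that the reference statement includes the column name corresponding to the expression it carries.
Thus, for the n-th statement to be analyzed (for example, the SQL statement shown at 301 in Figure 3), it is first judged whether the expression it carries (for example, "coalesce(col,'')" in the SQL statement shown at 301 in Figure 3) and the expression to be used (for example, "if(col<>null,col,'')") satisfy the preset semantic-equivalence condition. If they do, the two expressions have the same semantic information, and it is then judged whether the n-th statement to be analyzed contains a column name corresponding to the expression it carries (for example, the string "not_null_col" in the SQL statement shown at 301 in Figure 3). If it does, the n-th statement to be analyzed may be determined as a reference statement, so that the column name corresponding to the expression to be used can later be determined from the column name appearing in that statement. Here n is a positive integer, n ≤ N, and N is a positive integer.
The "preset semantic-equivalence condition" can be preset. For example, it may be that the semantic information of the expression carried by the n-th statement to be analyzed is the same as the semantic information of the expression to be used.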
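Step 52: Determine the column name corresponding to the expression to be used from the at least one reference statement.
As an example, step 52 may include steps 521 and 522:
Step 521: Perform statistical analysis on the column names corresponding to the expressions carried by the at least one reference statement to obtain a column name statistics result.
The column name corresponding to the expression carried by the j-th reference statement is the column name that appears in the j-th reference statement and corresponds to the expression that statement carries. For example, when the j-th reference statement is the SQL statement shown at 301 in Figure 3 and the expression it carries is "coalesce(col,'')", the statement contains the column name "not_null_col", which corresponds to that expression; "not_null_col" may therefore be determined as the column name corresponding to the expression carried by the j-th reference statement. Here j is a positive integer, j ≤ J, and J is a positive integer denoting the number of reference statements.
The column name statistics result indicates the occurrence frequency of each distinct name among the "column names corresponding to the expressions carried by the at least one reference statement". For example, when those column names include column name 1, column name 2, column name 3, ..., and column name H, the column name statistics result may include the number of occurrences of each of column name 1, column name 2, column name 3, ..., and column name H among them. Here H is a positive integer.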
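Step 522: Determine the column name corresponding to the expression to be used according to the column name statistics result.
In the embodiments, after the column name statistics result is obtained, the column name with the highest occurrence frequency among the "column names corresponding to the expressions carried by the at least one reference statement" may be determined, according to that result, as the column name corresponding to the expression to be used.
The embodiments do not limit how step 522 is implemented. For example, when the "column names corresponding to the expressions carried by the at least one reference statement" include a target column name, step 522 may be: if the column name statistics result indicates that the occurrence frequency of the target column name satisfies a preset frequency condition, determine the target column name as the column name corresponding to the expression to be used. The preset frequency condition can be preset; for example, it may be that the occurrence frequency of the target column name is higher than that of every other column name among those column names.
From step 52 it follows that, if the at least one statement to be analyzed contains at least one reference statement, those reference statements can supply usable column names for the expression to be used. The most frequent column name can therefore be selected from them as the column name corresponding to the expression to be used, and later used as the column name of the virtual column that represents that expression.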
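Steps 521 and 522 can be sketched as a frequency count followed by a strict-majority pick. The function name is illustrative, and returning `None` on a tie is an assumption: the text only says the target name must be more frequent than every other name, and does not say what happens otherwise.

```python
from collections import Counter
from typing import Iterable, Optional


def pick_column_name(names: Iterable[str]) -> Optional[str]:
    """Return the column name that occurs strictly more often than all others, if any."""
    ranked = Counter(names).most_common(2)
    if not ranked:
        return None  # no reference statement supplied a name
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no single name satisfies the frequency condition
    return ranked[0][0]
```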
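Step 53: Determine a second expression from at least one preset expression to be matched, according to the similarity characterization data between each expression to be matched and the expression to be used. The similarity characterization data between the second expression and the expression to be used satisfies a preset similarity condition.
The m-th expression to be matched is a preset expression that already has a virtual column name. Here m is a positive integer, m ≤ M, and M is a positive integer denoting the number of expressions to be matched. For example, the m-th expression to be matched may be the expression corresponding to column name c0 in the mapping library shown in Figure 5.
The embodiments do not limit how the "at least one expression to be matched" is obtained. For example, the expressions corresponding to the column names in a pre-built mapping library may all be determined as expressions to be matched.
The similarity characterization data between the m-th expression to be matched and the expression to be used characterizes how similar the two are. The embodiments do not limit how this data is determined. For example, the process may include steps 61 and 62: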
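Step 61: Determine the field name vector and the keyword vector of the expression to be used.
The field name vector of the expression to be used represents the field names that the expression carries. The embodiments do not limit how the field name vector is determined. For example, the process may include steps 611 and 612:
Step 611: Perform field name extraction on the expression to be used to obtain its field name extraction result.
The field name extraction result describes the field names carried by the expression to be used. For example, when the expression to be used is "if(c1<>null,c1,c2)", its field name extraction result may be {c1,c2}.
Step 612: Vectorize the field name extraction result of the expression to be used to obtain its field name vector.
In the embodiments, after the field name extraction result is obtained, a pre-built field name dictionary (for example, the dictionary {c0,c1,c2,c3}) may be used to vectorize it, yielding the field name vector of the expression to be used (for example, the vector (0,2,1,0)).
In the vector (0,2,1,0), the first "0" indicates that the field name c0 does not appear in the expression to be used; the "2" indicates that c1 appears twice; the "1" indicates that c2 appears once; and the second "0" indicates that c3 does not appear.
The embodiments do not limit the "field name dictionary". For example, in scenarios where different data tables have different field name dictionaries, the dictionary may be determined as follows: look up, in a first mapping, the field name dictionary that corresponds to the data table associated with the expression to be used, and take it as the field name dictionary. The data table associated with the expression to be used is the data table to which the virtual column representing that expression belongs.
From steps 611 and 612 it follows that, once the expression to be used is obtained, field names are first extracted from it to obtain the field name extraction result; the pre-built field name dictionary is then used to count the field names in that result, yielding the field name vector. The field name vector thus indicates which field names appear in the expression to be used and how often each appears.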
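The keyword vector of the expression to be used represents the keywords that the expression carries. The embodiments do not limit how the keyword vector is determined. For example, the process may include steps 613 and 614:
Step 613: Perform keyword extraction on the expression to be used to obtain its keyword extraction result.
The keyword extraction result describes the keywords carried by the expression to be used. For example, when the expression to be used is "if(c1<>null,c1,c2)", its keyword extraction result may be {if,null}.
The embodiments do not limit the execution order of steps 611 and 613. For example, step 611 may be performed before step 613; step 613 may be performed before step 611; or the two steps may be performed at the same time.
Step 614: Vectorize the keyword extraction result of the expression to be used to obtain its keyword vector.
In the embodiments, after the keyword extraction result is obtained, a pre-built keyword dictionary (for example, the dictionary {as,if,in,null}) may be used to vectorize it, yielding the keyword vector of the expression to be used (for example, the vector (0,1,0,1)).
In the vector (0,1,0,1), the first "0" indicates that the keyword as does not appear in the expression to be used; the first "1" indicates that the keyword if appears once; the second "0" indicates that the keyword in does not appear; and the second "1" indicates that the keyword null appears once.
The embodiments do not limit the "keyword dictionary"; for example, it can be preset.
From steps 613 and 614 it follows that, once the expression to be used is obtained, keywords are first extracted from it to obtain the keyword extraction result; the pre-built keyword dictionary is then used to count the keywords in that result, yielding the keyword vector. The keyword vector thus indicates which keywords appear in the expression to be used and how often each appears.
From step 61 it follows that, once the expression to be used is obtained, its field name vector and keyword vector can be extracted, and together these vectors represent the semantic information carried by the expression.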
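Both vectors in step 61 are occurrence counts against an ordered dictionary, so a single helper can produce either. The sketch below reproduces the two worked examples from the text; the regular-expression tokenizer and the function name are assumptions, not something the text specifies.

```python
import re
from typing import Sequence, Tuple

# Identifier-like tokens: field names and keywords alike (assumed tokenization).
TOKEN = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def vectorize(expression: str, dictionary: Sequence[str]) -> Tuple[int, ...]:
    """Count how often each dictionary entry occurs as a whole token in the expression."""
    tokens = [t.lower() for t in TOKEN.findall(expression)]
    return tuple(tokens.count(term) for term in dictionary)


field_vector = vectorize("if(c1<>null,c1,c2)", ["c0", "c1", "c2", "c3"])
keyword_vector = vectorize("if(c1<>null,c1,c2)", ["as", "if", "in", "null"])
```

Matching whole tokens rather than substrings matters here: a substring search for the keyword "in" would otherwise fire on unrelated identifiers that merely contain those letters.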
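Step 62: Determine the similarity characterization data between the m-th expression to be matched and the expression to be used according to (i) the similarity between the field name vector of the expression to be used and the field name vector of the m-th expression to be matched, and (ii) the similarity between the keyword vector of the expression to be used and the keyword vector of the m-th expression to be matched.
The field name vector similarity indicates how similar the expression to be used and the m-th expression to be matched are in terms of field names. The embodiments do not limit how it is determined. For example, the Euclidean distance between the two field name vectors may be taken as this similarity.
The keyword vector similarity indicates how similar the expression to be used and the m-th expression to be matched are in terms of keywords. The embodiments do not limit how it is determined. For example, the Euclidean distance between the two keyword vectors may be taken as this similarity.
The embodiments do not limit how step 62 is implemented. For example, the field name vector similarity and the keyword vector similarity may be combined in a weighted sum, using preset weights, to obtain the similarity characterization data between the m-th expression to be matched and the expression to be used.
The embodiments do not limit these weights. For example, the weight of the field name vector similarity may be 0.75 and the weight of the keyword vector similarity may be 0.25.
From steps 61 and 62 it follows that, once the expression to be used is obtained, its field name vector and keyword vector are determined first; the similarity characterization data between the m-th expression to be matched and the expression to be used is then computed from those vectors and the corresponding vectors of the m-th expression to be matched, so that the data indicates how similar the two expressions are. Here m is a positive integer, m ≤ M, and M is a positive integer denoting the number of expressions to be matched.
The "second expression" is the expression to be matched whose similarity characterization data with the expression to be used satisfies the preset similarity condition. The condition can be preset. For example, if the similarity characterization data is determined from Euclidean distances (so that a smaller value means a closer match), the condition may be: the similarity characterization data between the second expression and the expression to be used is smaller than that between every other expression to be matched and the expression to be used. In other words, among the at least one expression to be matched, the second expression has the minimum similarity characterization data with the expression to be used.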
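The weighted combination in step 62 and the minimum-distance selection of the second expression can be written out as a short sketch. The weights 0.75 and 0.25 are the example values from the text; the function names and the dictionary-of-candidates layout are assumptions for illustration.

```python
import math
from typing import Dict, Sequence, Tuple

Vectors = Tuple[Sequence[float], Sequence[float]]  # (field name vector, keyword vector)


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def similarity_data(target: Vectors, candidate: Vectors,
                    w_field: float = 0.75, w_keyword: float = 0.25) -> float:
    """Weighted sum of the two distances; smaller means more similar."""
    return (w_field * euclidean(target[0], candidate[0])
            + w_keyword * euclidean(target[1], candidate[1]))


def closest_candidate(target: Vectors, candidates: Dict[str, Vectors]) -> str:
    """Return the key of the candidate with the minimum similarity characterization data."""
    return min(candidates, key=lambda name: similarity_data(target, candidates[name]))
```

As a worked case: with a target of ((0,2,1,0), (0,1,0,1)) and a candidate whose field name vector is identical but whose keyword vector is (0,1,0,0), the field distance is 0 and the keyword distance is 1, giving 0.75·0 + 0.25·1 = 0.25.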
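From step 53 it follows that, if the at least one statement to be analyzed contains no reference statement, those statements cannot supply a usable column name for the expression to be used. In that case, the similarity characterization data between the expressions to be matched recorded in the mapping library and the expression to be used can be used to determine the second expression, that is, the expression to be matched that is most similar to the expression to be used, so that the column name corresponding to the second expression can later be used to infer the column name corresponding to the expression to be used.
Step 54: Look up the column name corresponding to the second expression in a pre-built mapping, and determine the column name corresponding to the expression to be used according to it.
The mapping records the correspondence between each expression to be matched and its column name.
The embodiments do not limit how step 54 is implemented. For example, the column name corresponding to the second expression, as found in the pre-built mapping, may be determined directly as the column name corresponding to the expression to be used.
In practice, to avoid duplicate names, the embodiments also provide another possible implementation of step 54: after the column name corresponding to the second expression is found in the pre-built mapping, judge whether the data table associated with the expression to be used contains existing column names that include the column name corresponding to the second expression. If so, determine the column name corresponding to the expression to be used according to the column name corresponding to the second expression and the number of such existing column names, so that the resulting column name differs from each of them.
The embodiments do not limit how the step "determine the column name corresponding to the expression to be used according to the column name corresponding to the second expression and the number of existing column names that include it" is implemented. For example: first add 1 to the number of existing column names that include the column name corresponding to the second expression to obtain a number to be used; then append that number to the end of the column name corresponding to the second expression to obtain the column name corresponding to the expression to be used.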
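The renaming rule just described (count the existing names that contain the base name, add 1, append the result) can be sketched as follows. The function name is illustrative, and interpreting "include" as a substring match is an assumption.

```python
from typing import Iterable


def dedupe_column_name(base: str, existing: Iterable[str]) -> str:
    """Append (count of existing names containing base) + 1 when a clash exists."""
    clashes = sum(1 for name in existing if base in name)
    if clashes == 0:
        return base  # no existing column includes the base name
    return f"{base}{clashes + 1}"
```

For example, with existing columns "not_null_col" and "not_null_col2", the base name "not_null_col" becomes "not_null_col3". The rule as stated does not by itself guarantee uniqueness in every case (an unrelated column whose name merely contains the base name also raises the count), so a real implementation would probably re-check the result against the table.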
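From step 54 it follows that, once the second expression has been determined as the expression to be matched that is most similar to the expression to be used, the column name corresponding to the second expression can be used to infer the column name corresponding to the expression to be used.
From steps 51 to 54 it follows that, once the expression to be used is obtained, if the at least one statement to be analyzed already gives a virtual column name for that expression, the column name can be extracted from those statements; if the statements do not mention one, the column name can be inferred from the column names of the expressions to be matched that already exist in the mapping library.
After the column name corresponding to the expression to be used is obtained, the mapping library can be updated with the correspondence among the expression to be used, its field name vector and keyword vector, and its column name, which improves the accuracy of the mapping library.
From the above description of the "column name of the virtual column" (that is, the column name corresponding to the expression to be used), it follows that in some application scenarios the column name of the virtual column representing the expression to be used can be derived automatically, which reduces the user's workload.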
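The "data type of the virtual column" (that is, the data type corresponding to the expression to be used) is the data type of the virtual column that represents the expression to be used.
The embodiments do not limit how the "data type of the virtual column" is determined; for example, it can be set manually.
In practice, the data type of a virtual column is related to the data types of the existing data columns involved in the expression it represents. On this basis, the embodiments also provide a possible way of determining the data type: determine the data type corresponding to the expression to be used according to the data types of the column names carried by that expression.
The embodiments do not limit how the step "determine the data type corresponding to the expression to be used according to the data types of the column names carried by that expression" is implemented. For example, when the expression to be used is "a+1", the data type (for example, int) of the column name it carries (that is, the character "a") may be taken directly as the data type corresponding to the expression to be used (for example, int). As another example, the data type corresponding to the expression to be used may be inferred from the data types of the column names it carries according to preset data type inference rules.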
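A data type inference rule of the kind mentioned above might be sketched as follows. The text does not define the rules, so the numeric widening order used here (int < bigint < float < double) and the function name are purely illustrative assumptions.

```python
import re
from typing import Dict, Optional

# Hypothetical widening order for numeric types; not taken from the text.
WIDENING = ["int", "bigint", "float", "double"]


def infer_type(expression: str, schema: Dict[str, str]) -> Optional[str]:
    """Infer an expression's type from the types of the columns it references."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", expression)
    types = {schema[t] for t in tokens if t in schema}
    if not types:
        return None  # no known column referenced
    if len(types) == 1:
        return next(iter(types))  # e.g. "a+1" with a:int -> int
    if types <= set(WIDENING):
        return max(types, key=WIDENING.index)  # widest numeric type wins
    return None  # mixed, non-numeric types: leave for manual setting
```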
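The "expression of the virtual column" is the expression that the virtual column represents. For the virtual column that represents the expression to be used, the expression of the virtual column is the expression to be used itself.
The "name of the data table to which the virtual column belongs" (that is, the table name corresponding to the expression to be used) is the name of the data table to which the virtual column representing the expression to be used belongs. The embodiments do not limit how this table name is determined. For example, the table name carried by a target statement among the at least one statement to be analyzed may be determined as the table name corresponding to the expression to be used, where the expression carried by the target statement and the expression to be used satisfy the preset semantic-equivalence condition and the target statement carries a table name.
From S202 it follows that, once the expression to be used is obtained, the corresponding virtual column construction description information can be obtained. That information describes what must be referenced when constructing the virtual column for the expression (for example, the column name, data type, and expression of the virtual column and the name of the data table to which it belongs), so that the virtual column can later be constructed from it.
S203: Generate a virtual column construction request according to the virtual column construction description information corresponding to the expression to be used.
The virtual column construction request asks for a virtual column that represents the expression to be used. The embodiments do not limit the form of the request. For example, when the virtual column construction description information includes {(column name, c), (expression, a+1), (data type, int), (data table, t1)}, the virtual column construction request may be the SQL statement in row 2, column 1 of Table 1 above.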
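S204: Construct the virtual column corresponding to the expression to be used according to the virtual column construction request.
In the embodiments, after the virtual column construction request is obtained, an executable task for a first engine is first generated according to the request; the task is then sent to the first engine, which constructs the virtual column corresponding to the expression to be used by executing it. The first engine is the execution engine used to construct the virtual column. The embodiments do not limit the first engine; for example, it may be an engine specified by the user or an execution engine selected on the basis of current resource conditions.
The embodiments do not limit how the "executable task for the first engine" is generated. For example, after the virtual column construction request is obtained, it may be translated into the executable task for the first engine according to the translation rules of that engine.
From S201 to S204 it follows that the data-lake-based virtual column construction method provided by the embodiments first automatically performs expression statistics on a large number of statements to be analyzed in the data lake (for example, SQL statements from various engines) to obtain an expression to be used that meets the preset virtual column construction condition (for example, "a+1"). It then automatically generates, from the virtual column construction description information corresponding to that expression (for example, column name, data type, and expression), a virtual column construction request that asks for a virtual column able to represent the expression. Finally, it constructs the virtual column according to the request (for example, a virtual column named "c"), so that the virtual column represents the expression to be used. A user can later issue a data query request against the virtual column (for example, the SQL statement "SELECT c FROM t1"), which automatically triggers a data query for the expression to be used. This avoids the problems that arise when the user enters the query for the expression by hand (for example, how to write the expression correctly) and thereby improves the user's data query experience.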
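Data query through virtual columns
Because a virtual column in a data table does not record data, its data query flow differs from that of an ordinary column. On this basis, the embodiments also provide a data-lake-based data query method. As shown in Figure 6, the method includes S601 to S603:
S601: Obtain a first data query request. The first data query request asks for a data query against a target virtual column.
More specifically, the first data query request asks for a data query against the target virtual column in a target data table. The target virtual column is constructed by any implementation of the data-lake-based virtual column construction method provided by the embodiments, and the target data table is the data table to which the target virtual column belongs. An example follows.
As an example, when the target data table is the data table named "t1" above and the target virtual column is the virtual column named "c" above, the first data query request may be the SQL statement in row 2, column 1 of Table 4 above.
S602: Replace the column name of the target virtual column in the first data query request with the expression corresponding to the target virtual column to obtain a second data query request.
In the embodiments, after the first data query request is obtained, the column name of the target virtual column (for example, the character "c") and the table name corresponding to the target virtual column (for example, the string "t1") are first extracted from it. The number mapping associated with that table name is then determined; this mapping records the number of each virtual column in the data table with that table name. Next, the number corresponding to the column name of the target virtual column is looked up in the number mapping and taken as the number to be used. The expression corresponding to that number (for example, "a+1") is then looked up in HiveMetaStore. Finally, the column name of the target virtual column in the first data query request is replaced with that expression to obtain the second data query request (for example, SELECT a+1 FROM t1).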
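The substitution in S602 can be illustrated with a token-level rewrite. This is a deliberately naive sketch: a dictionary stands in for the number mapping and the HiveMetaStore lookup, the function name is hypothetical, and a real implementation would more plausibly rewrite the parsed syntax tree so that string literals and same-named tables are left alone.

```python
import re
from typing import Dict


def rewrite_query(sql: str, virtual_columns: Dict[str, str]) -> str:
    """Replace each whole-token virtual column name with its parenthesized expression."""
    def substitute(match: "re.Match") -> str:
        token = match.group(0)
        if token in virtual_columns:
            return f"({virtual_columns[token]})"
        return token

    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", substitute, sql)


second_request = rewrite_query("SELECT c FROM t1", {"c": "a+1"})
```

The sketch wraps the expression in parentheses, which the text's example "SELECT a+1 FROM t1" does not do; the parentheses keep operator precedence intact when the virtual column appears inside a larger expression such as c*2.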
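S603: Perform data query processing according to the second data query request.
In the embodiments, after the second data query request is obtained, an executable task for a second engine is first generated according to the request; the task is then sent to the second engine, which completes the data query for the expression represented by the target virtual column by executing it. The second engine is the execution engine used to perform the data query for the expression represented by the target virtual column. The embodiments do not limit the second engine; for example, it may be an engine specified by the user or an execution engine selected on the basis of current resource conditions.
The embodiments do not limit how the "executable task for the second engine" is generated. For example, after the second data query request is obtained, it may be translated into the executable task for the second engine according to the translation rules of that engine.
From S601 to S603 it follows that, in the data-lake-based data query method provided by the embodiments, after the first data query request for the target virtual column in the target data table is obtained, the column name of the target virtual column in that request is first replaced with the expression corresponding to the target virtual column to obtain the second data query request; data query processing is then performed according to the second data query request. A data query request against the virtual column (for example, the SQL statement "SELECT c FROM t1") thus triggers a data query for the expression to be used. This spares the user the concerns involved in entering the expression directly (for example, how to write the expression correctly) and thereby improves the user's data query experience.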
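Based on the data-lake-based virtual column construction method described above, the embodiments also provide a data-lake-based virtual column construction apparatus, explained below with reference to the accompanying drawings. For technical details of the apparatus, refer to the description of the data-lake-based virtual column construction method above.
Figure 7 is a schematic structural diagram of a data-lake-based virtual column construction apparatus provided by the embodiments.
The data-lake-based virtual column construction apparatus 700 provided by the embodiments includes:
an expression determination unit 701, configured to determine, after at least one statement to be analyzed is obtained, an expression to be used from the at least one statement to be analyzed;
an information determination unit 702, configured to determine the virtual column construction description information corresponding to the expression to be used, where the virtual column construction description information includes the expression to be used;
a request generation unit 703, configured to generate a virtual column construction request according to the virtual column construction description information corresponding to the expression to be used;
a virtual column construction unit 704, configured to construct the virtual column corresponding to the expression to be used according to the virtual column construction request.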
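In a possible implementation, the number of statements to be analyzed is N;
the expression determination unit 701 includes:
a first determination subunit, configured to determine the n-th expression to be analyzed from the n-th statement to be analyzed, where n is a positive integer, n ≤ N, and N is a positive integer;
a first statistics subunit, configured to perform statistical analysis on the N expressions to be analyzed to obtain an expression statistics result;
a second determination subunit, configured to determine a first expression as the expression to be used if the expression statistics result indicates that the occurrence frequency of the first expression among the N expressions to be analyzed is higher than a preset frequency threshold.
In a possible implementation, the first determination subunit is specifically configured to: perform syntax conversion on the n-th statement to be analyzed to obtain a syntax conversion result; and extract the n-th expression to be analyzed from the syntax conversion result.
In a possible implementation, the expression determination unit 701 further includes:
a syntax tree construction subunit, configured to build a syntax tree for the n-th expression to be analyzed to obtain the syntax tree of the n-th expression to be analyzed;
the first statistics subunit is specifically configured to: perform statistical analysis on the syntax trees of the N expressions to be analyzed to obtain a statistical analysis result.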
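In a possible implementation, the virtual column construction description information further includes a column name;
the information determination unit 702 includes:
a third determination subunit, configured to determine, if the at least one statement to be analyzed contains at least one reference statement, the column name corresponding to the expression to be used from the at least one reference statement, where the expression carried by the reference statement and the expression to be used satisfy a preset semantic-equivalence condition, and the reference statement includes the column name corresponding to the expression carried by the reference statement.
In a possible implementation, the third determination subunit includes:
a second statistics subunit, configured to perform statistical analysis on the column names corresponding to the expressions carried by the at least one reference statement to obtain a column name statistics result;
a fourth determination subunit, configured to determine the column name corresponding to the expression to be used according to the column name statistics result.
In a possible implementation, the column names corresponding to the expressions carried by the at least one reference statement include a target column name;
the fourth determination subunit is specifically configured to: determine the target column name as the column name corresponding to the expression to be used if the column name statistics result indicates that the occurrence frequency of the target column name satisfies a preset frequency condition.
In a possible implementation, the information determination unit 702 further includes:
a fifth determination subunit, configured to determine, if the at least one statement to be analyzed contains no reference statement, a second expression from at least one preset expression to be matched according to the similarity characterization data between each expression to be matched and the expression to be used, where the similarity characterization data between the second expression and the expression to be used satisfies a preset similarity condition;
a first lookup unit, configured to look up the column name corresponding to the second expression in a pre-built mapping, where the mapping records the correspondence between each expression to be matched and its column name;
a sixth determination subunit, configured to determine the column name corresponding to the expression to be used according to the column name corresponding to the second expression.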
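In a possible implementation, the number of expressions to be matched is M;
the information determination unit 702 further includes:
a seventh determination subunit, configured to determine the field name vector and the keyword vector of the expression to be used;
an eighth determination subunit, configured to determine the similarity characterization data between the m-th expression to be matched and the expression to be used according to the similarity between the field name vector of the expression to be used and the field name vector of the m-th expression to be matched, and the similarity between the keyword vector of the expression to be used and the keyword vector of the m-th expression to be matched.
In a possible implementation, the seventh determination subunit includes:
a ninth determination subunit, configured to perform field name extraction on the expression to be used to obtain a field name extraction result, and to vectorize the field name extraction result to obtain the field name vector of the expression to be used.
In a possible implementation, the seventh determination subunit includes:
a tenth determination subunit, configured to perform keyword extraction on the expression to be used to obtain a keyword extraction result, and to vectorize the keyword extraction result to obtain the keyword vector of the expression to be used.
In a possible implementation, the virtual column construction description information further includes a data type; the data type is determined according to the data types of the column names carried by the expression to be used.
From the above description of the data-lake-based virtual column construction apparatus 700, it follows that the apparatus first automatically performs expression statistics on a large number of statements to be analyzed in the data lake (for example, SQL statements from various engines) to obtain an expression to be used that meets the preset virtual column construction condition (for example, an expression with a relatively high occurrence frequency). It then automatically generates, from the virtual column construction description information corresponding to that expression (for example, column name, data type, and expression), a virtual column construction request that asks for a virtual column able to represent the expression. Finally, it constructs the virtual column according to the request, so that the virtual column represents the expression to be used. A user can later issue a data query request against the virtual column, which automatically triggers a data query for the expression to be used. This avoids the problems that arise when the user enters the query for the expression by hand (for example, how to write the expression correctly) and thereby improves the user's data query experience.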
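Based on the data-lake-based data query method described above, the embodiments also provide a data-lake-based data query apparatus, explained below with reference to the accompanying drawings. For technical details of the apparatus, refer to the description of the data-lake-based data query method above.
Figure 8 is a schematic structural diagram of a data-lake-based data query apparatus provided by the embodiments.
The data-lake-based data query apparatus 800 provided by the embodiments includes:
a request obtaining unit 801, configured to obtain a first data query request, where the first data query request asks for a data query against a target virtual column, and the target virtual column is constructed by any implementation of the data-lake-based virtual column construction method provided by the present application;
an information replacement unit 802, configured to replace the column name of the target virtual column in the first data query request with the expression corresponding to the target virtual column to obtain a second data query request;
a data query unit 803, configured to perform data query processing according to the second data query request.
From the above description of the data-lake-based data query apparatus 800, it follows that, after the first data query request for the target virtual column in the target data table is obtained, the apparatus first replaces the column name of the target virtual column in that request with the expression corresponding to the target virtual column to obtain the second data query request, and then performs data query processing according to the second data query request. A data query request against the virtual column (for example, the SQL statement "SELECT c FROM t1") thus triggers a data query for the expression to be used. This spares the user the concerns involved in entering the expression directly (for example, how to write the expression correctly) and thereby improves the user's data query experience.
In addition, the embodiments also provide an electronic device that includes a processor and a memory. The memory is configured to store instructions or a computer program. The processor is configured to execute the instructions or computer program in the memory, so that the electronic device performs any implementation of the data-lake-based virtual column construction method provided by the embodiments, or any implementation of the data-lake-based data query method provided by the embodiments.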
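Figure 9 shows a schematic structural diagram of an electronic device 900 suitable for implementing the embodiments of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 9 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in Figure 9, the electronic device 900 may include a processing apparatus (for example, a central processing unit or a graphics processing unit) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage apparatus 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the electronic device 900. The processing apparatus 901, the ROM 902, and the RAM 903 are connected to one another through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Generally, the following apparatuses can be connected to the I/O interface 905: an input apparatus 906 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output apparatus 907 including, for example, a liquid crystal display (LCD), speaker, or vibrator; a storage apparatus 908 including, for example, a magnetic tape or hard disk; and a communication apparatus 909. The communication apparatus 909 allows the electronic device 900 to communicate with other devices wirelessly or by wire to exchange data. Although Figure 9 shows an electronic device 900 with various apparatuses, it is not required to implement or include all of them; more or fewer apparatuses may alternatively be implemented or included.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication apparatus 909, installed from the storage apparatus 908, or installed from the ROM 902. When the computer program is executed by the processing apparatus 901, the functions defined in the methods of the embodiments of the present disclosure are performed.
The electronic device provided by the embodiments of the present disclosure and the methods provided by the above embodiments belong to the same inventive concept. For technical details not described exhaustively in this embodiment, refer to the above embodiments; this embodiment has the same beneficial effects as the above embodiments.
The embodiments of the present application also provide a computer-readable medium storing instructions or a computer program. When the instructions or computer program run on a device, they cause the device to perform any implementation of the data-lake-based virtual column construction method provided by the embodiments, or any implementation of the data-lake-based data query method provided by the embodiments.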
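Note that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to: electric wire, optical cable, RF (radio frequency), or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future network protocol, such as HTTP (Hyper Text Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future network.
The computer-readable medium may be included in the electronic device, or it may exist separately without being assembled into the electronic device.
The computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to perform the methods described above.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).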
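The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. Note also that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. Note further that each block in the block diagrams and/or flowcharts, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit/module does not constitute a limitation on the unit itself.
The functions described above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
Note that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be cross-referenced. Because the systems or apparatuses disclosed in the embodiments correspond to the methods disclosed in the embodiments, their description is relatively brief; for the relevant details, refer to the description of the methods.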
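It should be understood that, in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of a single item or multiple items. For example, at least one of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or multiple.
Note also that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. The present application is therefore not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.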

Claims (17)

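  1. A data-lake-based virtual column construction method, the method comprising:
    after at least one statement to be analyzed is obtained, determining an expression to be used from the at least one statement to be analyzed;
    determining virtual column construction description information corresponding to the expression to be used, the virtual column construction description information including the expression to be used;
    generating a virtual column construction request according to the virtual column construction description information corresponding to the expression to be used;
    constructing a virtual column corresponding to the expression to be used according to the virtual column construction request.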
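  2. The method according to claim 1, wherein the number of statements to be analyzed is N;
    the determining an expression to be used from the at least one statement to be analyzed comprises:
    determining an n-th expression to be analyzed from an n-th statement to be analyzed, where n is a positive integer, n ≤ N, and N is a positive integer;
    performing statistical analysis on the N expressions to be analyzed to obtain an expression statistics result;
    if the expression statistics result indicates that the occurrence frequency of a first expression among the N expressions to be analyzed is higher than a preset frequency threshold, determining the first expression as the expression to be used.
  3. The method according to claim 2, wherein the determining an n-th expression to be analyzed from an n-th statement to be analyzed comprises:
    performing syntax conversion on the n-th statement to be analyzed to obtain a syntax conversion result;
    extracting the n-th expression to be analyzed from the syntax conversion result.
  4. The method according to claim 2, wherein the method further comprises:
    building a syntax tree for the n-th expression to be analyzed to obtain the syntax tree of the n-th expression to be analyzed;
    the performing statistical analysis on the N expressions to be analyzed to obtain a statistical analysis result comprises:
    performing statistical analysis on the syntax trees of the N expressions to be analyzed to obtain the statistical analysis result.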
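  5. The method according to claim 1, wherein the virtual column construction description information further includes a column name;
    the process of determining the column name corresponding to the expression to be used comprises:
    if the at least one statement to be analyzed contains at least one reference statement, determining the column name corresponding to the expression to be used from the at least one reference statement, where the expression carried by the reference statement and the expression to be used satisfy a preset semantic-equivalence condition, and the reference statement includes the column name corresponding to the expression carried by the reference statement.
  6. The method according to claim 5, wherein the determining the column name corresponding to the expression to be used from the at least one reference statement comprises:
    performing statistical analysis on the column names corresponding to the expressions carried by the at least one reference statement to obtain a column name statistics result;
    determining the column name corresponding to the expression to be used according to the column name statistics result.
  7. The method according to claim 6, wherein the column names corresponding to the expressions carried by the at least one reference statement include a target column name;
    the determining the column name corresponding to the expression to be used according to the column name statistics result comprises:
    if the column name statistics result indicates that the occurrence frequency of the target column name satisfies a preset frequency condition, determining the target column name as the column name corresponding to the expression to be used.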
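  8. The method according to claim 5, wherein the method further comprises:
    if the at least one statement to be analyzed contains no reference statement, determining a second expression from at least one preset expression to be matched according to similarity characterization data between the at least one expression to be matched and the expression to be used, where the similarity characterization data between the second expression and the expression to be used satisfies a preset similarity condition;
    looking up the column name corresponding to the second expression in a pre-built mapping, where the mapping records the correspondence between each expression to be matched and the column name corresponding to that expression;
    determining the column name corresponding to the expression to be used according to the column name corresponding to the second expression.
  9. The method according to claim 8, wherein the number of expressions to be matched is M;
    the process of determining the similarity characterization data between an m-th expression to be matched and the expression to be used comprises:
    determining a field name vector and a keyword vector of the expression to be used;
    determining the similarity characterization data between the m-th expression to be matched and the expression to be used according to the similarity between the field name vector of the expression to be used and the field name vector of the m-th expression to be matched, and the similarity between the keyword vector of the expression to be used and the keyword vector of the m-th expression to be matched.
  10. The method according to claim 9, wherein the process of determining the field name vector of the expression to be used comprises:
    performing field name extraction on the expression to be used to obtain a field name extraction result; vectorizing the field name extraction result to obtain the field name vector of the expression to be used;
    and/or,
    the process of determining the keyword vector of the expression to be used comprises:
    performing keyword extraction on the expression to be used to obtain a keyword extraction result; vectorizing the keyword extraction result to obtain the keyword vector of the expression to be used.
  11. The method according to claim 1, wherein the virtual column construction description information further includes a data type;
    the data type is determined according to the data types of the column names carried by the expression to be used.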
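  12. A data-lake-based data query method, the method comprising:
    obtaining a first data query request, where the first data query request asks for a data query against a target virtual column, and the target virtual column is constructed by the data-lake-based virtual column construction method according to any one of claims 1 to 11;
    replacing the column name of the target virtual column in the first data query request with the expression corresponding to the target virtual column to obtain a second data query request;
    performing data query processing according to the second data query request.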
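  13. A data-lake-based virtual column construction apparatus, comprising:
    an expression determination unit, configured to determine, after at least one statement to be analyzed is obtained, an expression to be used from the at least one statement to be analyzed;
    an information determination unit, configured to determine virtual column construction description information corresponding to the expression to be used, the virtual column construction description information including the expression to be used;
    a request generation unit, configured to generate a virtual column construction request according to the virtual column construction description information corresponding to the expression to be used;
    a virtual column construction unit, configured to construct a virtual column corresponding to the expression to be used according to the virtual column construction request.
  14. A data-lake-based data query apparatus, comprising:
    a request obtaining unit, configured to obtain a first data query request, where the first data query request asks for a data query against a target virtual column, and the target virtual column is constructed by the data-lake-based virtual column construction method according to any one of claims 1 to 11;
    an information replacement unit, configured to replace the column name of the target virtual column in the first data query request with the expression corresponding to the target virtual column to obtain a second data query request;
    a data query unit, configured to perform data query processing according to the second data query request.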
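  15. An electronic device, comprising: a processor and a memory;
    the memory is configured to store instructions or a computer program;
    the processor is configured to execute the instructions or computer program in the memory, so that the electronic device performs the data-lake-based virtual column construction method according to any one of claims 1 to 11, or performs the data-lake-based data query method according to claim 12.
  16. A computer-readable medium storing instructions or a computer program that, when run on a device, cause the device to perform the data-lake-based virtual column construction method according to any one of claims 1 to 11, or to perform the data-lake-based data query method according to claim 12.
  17. A computer program product, comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the data-lake-based virtual column construction method according to any one of claims 1 to 11, or program code for performing the data-lake-based data query method according to claim 12.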
PCT/CN2023/094998 2022-07-27 2023-05-18 Data-lake-based virtual column construction method and data query method WO2024021790A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210892749.6 2022-07-27
CN202210892749.6A CN115221191A (zh) 2022-07-27 2022-07-27 Data-lake-based virtual column construction method and data query method

Publications (1)

Publication Number Publication Date
WO2024021790A1 true WO2024021790A1 (zh) 2024-02-01

Family

ID=83614631

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/094998 WO2024021790A1 (zh) 2022-07-27 2023-05-18 Data-lake-based virtual column construction method and data query method

Country Status (2)

Country Link
CN (1) CN115221191A (zh)
WO (1) WO2024021790A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221191A (zh) * 2022-07-27 2022-10-21 北京火山引擎科技有限公司 Data-lake-based virtual column construction method and data query method
CN115809249B (zh) * 2023-02-03 2023-04-25 杭州比智科技有限公司 Data lake management method and system based on specialized data sets

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
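CN111680030A (zh) * 2019-03-11 2020-09-18 阿里巴巴集团控股有限公司 Data fusion method and apparatus, and meta-information-based data processing method and apparatus
CN112347133A (zh) * 2019-08-09 2021-02-09 北京京东尚科信息技术有限公司 Data query method and apparatus
CN114416773A (zh) * 2021-12-30 2022-04-29 联通智网科技股份有限公司 Data processing method and apparatus, storage medium, and server
CN115221191A (zh) * 2022-07-27 2022-10-21 北京火山引擎科技有限公司 Data-lake-based virtual column construction method and data query method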

Also Published As

Publication number Publication date
CN115221191A (zh) 2022-10-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23845005

Country of ref document: EP

Kind code of ref document: A1