CN112597213B - Batch request processing method and device for feature calculation, electronic equipment and storage medium


Info

Publication number
CN112597213B
CN112597213B
Authority
CN
China
Prior art keywords
common
data
public
calculation
columns
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011553734.4A
Other languages
Chinese (zh)
Other versions
CN112597213A (en)
Inventor
包新启
王太泽
陈迪豪
陈靓
王子贤
邓龙
王豹
孔全
穆妮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN202311562303.8A (CN117555944A)
Priority to CN202011553734.4A (CN112597213B)
Publication of CN112597213A
Application granted
Publication of CN112597213B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2458 - Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 - Query processing support for facilitating data mining operations in structured databases
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2457 - Query processing with adaptation to user needs
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the disclosure relate to a batch request processing method and apparatus for feature calculation, an electronic device, and a storage medium. The method comprises: receiving a batch request for feature calculation, wherein the batch request comprises a plurality of data rows that share a calculation target and input field information, the input field information specifying information of a plurality of columns included in the data rows; generating a first execution plan based on the input field information and the calculation target, the first execution plan including a plurality of calculation steps and the inputs and outputs of each calculation step; determining information of one or more common columns in the input field information, and optimizing the calculation steps and/or the inputs of the calculation steps in the first execution plan based on the information of the common columns to obtain a second execution plan; and obtaining a batch calculation result of the batch request based on the second execution plan. It can be seen that the execution plan of the batch request is generated but not executed directly; instead, it is first optimized with manual heuristic information, namely the common column information, and then executed, which improves the processing efficiency of the batch request.

Description

Batch request processing method and device for feature calculation, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of artificial intelligence, in particular to a batch request processing method and device for feature calculation, electronic equipment and a storage medium.
Background
With the development of artificial intelligence, feature computation has become a very important link in processing online service requests (e.g., online recommendation requests) with machine learning. In certain service scenarios (e.g., recommendation scenarios), requests for online feature computation arrive at the computing system in batches.
At present, the naive processing mode of the batch request for feature calculation is as follows: each request in the batch is traversed, internal computing logic is invoked separately for each request, and finally all computing results are merged together and returned. This approach has the following drawbacks:
(1) Oversized input/output data
A batch request input contains multiple request inputs, and the returned result contains the calculation results of multiple requests, which leads to high data transfer pressure when the batch is relatively large (i.e., when the batch contains a large number of requests).
(2) Large calculation amount and low efficiency
When the batch is large, each request in the batch is traversed, and the internal calculation logic is independently called for each request, so that the calculation amount is large and the efficiency is low.
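For illustration only, the naive processing mode described above can be sketched in Python as follows; the type and function names (BatchRequest, compute_features_for_row, process_batch_naively) are assumptions for this sketch and are not part of the disclosure.

    from dataclasses import dataclass
    from typing import Any, Dict, List


    @dataclass
    class BatchRequest:
        # All data rows in a batch share one calculation target and one input schema.
        calculation_target: str
        schema: List[str]
        rows: List[Dict[str, Any]]


    def compute_features_for_row(target: str, row: Dict[str, Any]) -> Dict[str, Any]:
        # Stand-in for the internal feature-calculation logic of a single request.
        return {"target": target, "input": row}


    def process_batch_naively(batch: BatchRequest) -> List[Dict[str, Any]]:
        # Traverse every request in the batch, call the calculation logic for each
        # request separately, and finally merge all calculation results and return them.
        return [compute_features_for_row(batch.calculation_target, row) for row in batch.rows]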
The above description of the discovery process of the problem is merely for aiding in understanding the technical solution of the present disclosure, and does not represent an admission that the above is prior art.
Disclosure of Invention
To solve at least one problem in the prior art, at least one embodiment of the present disclosure provides a method, an apparatus, an electronic device, and a storage medium for processing a batch request for feature calculation.
In a first aspect, an embodiment of the present disclosure provides a method for processing a batch request for feature computation, including:
receiving a batch request of feature calculation; wherein the batch request includes a plurality of data rows sharing a computation target and input field information for specifying information of a plurality of columns included in the data rows;
generating a first execution plan based on the input field information and the calculation target; wherein said first execution plan includes a plurality of calculation steps and inputs and outputs for each of said calculation steps;
determining information of one or more public columns in the input field information, and optimizing the calculation step and/or the input of the calculation step in the first execution plan based on the information of the public columns to obtain a second execution plan;
And obtaining a batch calculation result of the batch request based on the second execution plan.
In some embodiments, the determining information for one or more common columns in the input field information includes:
acquiring information of one or more public columns in the input field information through a public column interface which is established in advance;
or alternatively,
information for one or more common columns in the input field information is determined based on a plurality of rows of data included in the batch request.
In some embodiments, optimizing the input of the calculation step and/or the calculation step in the first execution plan based on the information of the common column, to obtain the second execution plan includes:
for each calculation step in the first execution plan:
judging whether the calculation step meets a first optimization condition, wherein the first optimization condition is:
the calculation step performs calculation based on data in one or more of the common columns and does not perform calculation based on data in any non-common column;
if the first optimization condition is met, optimizing the input of the calculation step to be the data in the one or more common columns.
In some embodiments, if the calculating step is determined not to satisfy the first optimization condition, then:
Judging whether the calculating step meets a second optimizing condition, wherein the second optimizing condition is as follows: the calculating step performs calculation based on data in one or more of the common columns and performs calculation based on data in one or more non-common columns;
if the second optimization condition is met, optimizing the calculation step into at least one first sub-step and at least one second sub-step; wherein the first sub-step is based solely on data in one or more of the common columns; the second sub-step is calculated based at least on data in one or more non-common columns.
In some embodiments, the calculation step is a time window pull, the input field information includes a primary key column and a time column, and the time window pull obtains window data conforming to the calculation target based on the data in the primary key column and the data in the time column;
and if the primary key column and the time column are both common columns, optimizing the input of the time window pull to be the data in the primary key column and the data in the time column.
In some embodiments, the calculation step is table splicing, where the table splicing splices a left table and a right table based on a splicing condition, the left table is obtained based on the batch request, and the right table is obtained based on a table stored in a database;
if the splicing condition is a common column, optimizing the table splicing into one first sub-step and two second sub-steps;
wherein the first sub-step is for: performing first splicing based on the splicing conditions on the data of all the public columns in the left table and the data of all the public columns in the right table;
the two second sub-steps comprise: a right splicing sub-step and a left splicing sub-step;
the right stitching substep is for: performing a second splice based on the unique index of the right table for the output of the first sub-step and all non-common columns of data in the right table;
the left stitching substep is for: performing third splicing on the output of the right splicing substep and the data of all non-common columns in the left table;
and the output of the left splicing substep is used as the calculation result of the table splicing.
In some embodiments, the calculation step is an aggregate calculation that produces a plurality of calculation results, each calculation result being a common calculation result, a first non-common calculation result, or a second non-common calculation result;
wherein the public calculation result is calculated based on data in one or more public columns only; the first non-public calculation result is calculated based on data in one or more public columns and data in one or more non-public columns; the second non-common calculation result is calculated based on data in one or more non-common columns.
In some embodiments, the aggregate calculation is optimized into a plurality of first sub-steps and a plurality of second sub-steps; wherein the plurality of first sub-steps includes at least one first common computing step and at least one second common computing step; the plurality of second sub-steps includes at least one first non-common computing step and at least one second non-common computing step;
the input of the first common calculation step is data in one or more common columns, and the output is common intermediate data;
the input of the second common calculation step is data in one or more common columns, and the output is a common calculation result;
the input of the first non-public computing step is public intermediate data output by one or more first public computing steps and data in one or more non-public columns, and the output is a first non-public computing result;
the input of the second non-public calculation step is data in one or more non-public columns, and the output is a second non-public calculation result.
In some embodiments, after determining the information of one or more common columns in the input field information, the method further includes:
Establishing a shared storage area;
storing common data into the shared memory area; wherein the common data includes: data in the common column, common intermediate data, and a common calculation result;
wherein the common intermediate data is intermediate data generated by a first common calculation step in the second execution plan, the first common calculation step being a step of generating intermediate data based on data in one or more of the common columns;
wherein the common calculation result is a calculation result generated by a second common calculation step in the second execution plan, and the second common calculation step is a step of generating a calculation result based on data in one or more of the common columns.
In some embodiments, the deriving the batch calculation of the batch request based on the second execution plan includes:
and respectively executing the second execution plan for each data row in the batch request to obtain a calculation result of each data row, and merging the calculation results into a batch calculation result of the batch request.
In some embodiments, in executing the second execution plan:
for a first common calculation step in the second execution plan:
Querying an index relation, wherein the index relation is the relation between the identification of the public calculation step and the storage position of the data generated in the public calculation step in the shared storage area;
if the identification of the first public computing step is queried, determining a first storage position corresponding to the identification of the first public computing step from the index relation;
based on the first storage location, data is retrieved from the shared storage area as common intermediate data resulting from the first common computing step.
In some embodiments, the method further comprises:
if the identification of the first public calculation step is not queried, generating public intermediate data through the first public calculation step based on the data in the public column;
and storing the public intermediate data generated by the first public calculation step into the shared storage area, and recording the relation between the identification of the first public calculation step and the storage position in the index relation.
In some embodiments, in executing the second execution plan:
for a second common calculation step in the second execution plan:
querying an index relation, wherein the index relation is the relation between the identification of the public calculation step and the storage position of the data generated in the public calculation step in the shared storage area;
If the identification of the second public computing step is queried, determining a second storage position corresponding to the identification of the second public computing step from the index relation;
based on the second storage location, data is retrieved from the shared storage area as a common calculation result produced by the second common calculation step.
In some embodiments, the method further comprises:
if the identification of the second public calculation step is not queried, generating a public calculation result through the second public calculation step based on the data in the public column;
and storing the public calculation result generated by the second public calculation step into the shared storage area, and recording the relation between the identification of the second public calculation step and the storage position in the index relation.
In some embodiments, the query index relationship comprises:
and inquiring the index relation through a pre-established public data cache interface.
In some embodiments, the storing the common data into the shared memory area comprises:
and storing the public data into the shared memory area through a pre-established data coding memory interface.
In a second aspect, an embodiment of the present disclosure further provides a batch request processing apparatus for feature calculation, including:
A receiving unit for receiving a batch request for feature calculation; wherein the batch request includes a plurality of data rows sharing a computation target and input field information for specifying information of a plurality of columns included in the data rows;
a generation unit configured to generate a first execution plan based on the input field information and the calculation target; wherein said first execution plan includes a plurality of calculation steps and inputs and outputs for each of said calculation steps;
the optimizing unit is used for determining information of one or more public columns in the input field information, optimizing the calculation step and/or the input of the calculation step in the first execution plan based on the information of the public columns, and obtaining a second execution plan;
and the calculating unit is used for obtaining a batch calculation result of the batch request based on the second execution plan.
In some embodiments, the optimizing unit determining information of one or more common columns in the input field information includes:
acquiring information of one or more public columns in the input field information through a public column interface which is established in advance;
or alternatively,
information for one or more common columns in the input field information is determined based on a plurality of rows of data included in the batch request.
In some embodiments, the optimizing unit optimizes the calculation step and/or the input of the calculation step in the first execution plan based on the information of the common column, and the obtaining the second execution plan includes:
for each calculation step in the first execution plan:
judging whether the calculation step meets a first optimization condition, wherein the first optimization condition is as follows: the calculating step calculates based on data in one or more of the common columns and does not calculate based on data in any of the non-common columns;
if the first optimization condition is satisfied, the input to the optimizing the computing step is the data in the one or more common columns.
In some embodiments, if the optimizing unit determines that the calculating step does not meet the first optimizing condition, then:
judging whether the calculating step meets a second optimizing condition, wherein the second optimizing condition is as follows: the calculating step performs calculation based on data in one or more of the common columns and performs calculation based on data in one or more non-common columns;
if the second optimization condition is met, optimizing the calculation step into at least one first sub-step and at least one second sub-step; wherein the first sub-step is based solely on data in one or more of the common columns; the second sub-step is calculated based at least on data in one or more non-common columns.
In some embodiments, the calculation step is a time window pull, the input field information includes a primary key column and a time column, and the time window pull obtains window data conforming to the calculation target based on the data in the primary key column and the data in the time column;
and if the primary key column and the time column are both common columns, the optimizing unit optimizes the input of the time window pull to be the data in the primary key column and the data in the time column.
In some embodiments, the calculation step is table splicing, where the table splicing splices a left table and a right table based on a splicing condition, the left table is obtained based on the batch request, and the right table is obtained based on a table stored in a database;
if the splicing condition is a common column, the optimizing unit optimizes the table splicing into one first sub-step and two second sub-steps;
wherein the first sub-step is for: performing first splicing based on the splicing conditions on the data of all the public columns in the left table and the data of all the public columns in the right table;
the two second sub-steps comprise: a right splicing sub-step and a left splicing sub-step;
The right stitching substep is for: performing a second splice based on the unique index of the right table for the output of the first sub-step and all non-common columns of data in the right table;
the left stitching substep is for: performing third splicing on the output of the right splicing substep and the data of all non-common columns in the left table;
and the output of the left splicing substep is used as the calculation result of the table splicing.
In some embodiments, the calculation step is an aggregate calculation that produces a plurality of calculation results, each calculation result being a common calculation result, a first non-common calculation result, or a second non-common calculation result;
wherein the public calculation result is calculated based on data in one or more public columns only; the first non-public calculation result is calculated based on data in one or more public columns and data in one or more non-public columns; the second non-common calculation result is calculated based on data in one or more non-common columns.
In some embodiments, the optimization unit optimizes the aggregate calculation into a plurality of first sub-steps and a plurality of second sub-steps; wherein the plurality of first sub-steps includes at least one first common computing step and at least one second common computing step; the plurality of second sub-steps includes at least one first non-common computing step and at least one second non-common computing step;
The input of the first common calculation step is data in one or more common columns, and the output is common intermediate data;
the input of the second common calculation step is data in one or more common columns, and the output is a common calculation result;
the input of the first non-public computing step is public intermediate data output by one or more first public computing steps and data in one or more non-public columns, and the output is a first non-public computing result;
the input of the second non-public calculation step is data in one or more non-public columns, and the output is a second non-public calculation result.
In some embodiments, after determining the information of one or more common columns in the input field information, the optimizing unit is further configured to:
establishing a shared storage area;
storing common data into the shared memory area; wherein the common data includes: data in the common column, common intermediate data, and a common calculation result;
wherein the common intermediate data is intermediate data generated by a first common calculation step in the second execution plan, the first common calculation step being a step of generating intermediate data based on data in one or more of the common columns;
Wherein the common calculation result is a calculation result generated by a second common calculation step in the second execution plan, and the second common calculation step is a step of generating a calculation result based on data in one or more of the common columns.
In some embodiments, the computing unit is to:
and respectively executing the second execution plan for each data row in the batch request to obtain a calculation result of each data row, and merging the calculation results into a batch calculation result of the batch request.
In some embodiments, the computing unit, when executing the second execution plan:
for a first common calculation step in the second execution plan:
querying an index relation, wherein the index relation is the relation between the identification of the public calculation step and the storage position of the data generated in the public calculation step in the shared storage area;
if the identification of the first public computing step is queried, determining a first storage position corresponding to the identification of the first public computing step from the index relation;
based on the first storage location, data is retrieved from the shared storage area as common intermediate data resulting from the first common computing step.
In some embodiments, the computing unit is further to:
if the identification of the first public calculation step is not queried, generating public intermediate data through the first public calculation step based on the data in the public column;
and storing the public intermediate data generated by the first public calculation step into the shared storage area, and recording the relation between the identification of the first public calculation step and the storage position in the index relation.
In some embodiments, the computing unit, when executing the second execution plan:
for a second common calculation step in the second execution plan:
querying an index relation, wherein the index relation is the relation between the identification of the public calculation step and the storage position of the data generated in the public calculation step in the shared storage area;
if the identification of the second public computing step is queried, determining a second storage position corresponding to the identification of the second public computing step from the index relation;
based on the second storage location, data is retrieved from the shared storage area as a common calculation result produced by the second common calculation step.
In some embodiments, the computing unit is further to:
If the identification of the second public calculation step is not queried, generating a public calculation result through the second public calculation step based on the data in the public column;
and storing the public calculation result generated by the second public calculation step into the shared storage area, and recording the relation between the identification of the second public calculation step and the storage position in the index relation.
In some embodiments, the computing unit querying the index relationship comprises:
and inquiring the index relation through a pre-established public data cache interface.
In some embodiments, the computing unit storing common data into the shared memory area comprises: and storing the public data into the shared memory area through a pre-established data coding memory interface.
In a third aspect, an embodiment of the present disclosure further proposes an electronic device, including: a processor and a memory; the processor is configured to perform the steps of the method according to any of the embodiments of the first aspect by invoking a program or instruction stored in the memory.
In a fourth aspect, embodiments of the present disclosure also propose a non-transitory computer-readable storage medium storing a program or instructions for causing a computer to perform the steps of the method according to any one of the embodiments of the first aspect.
It can be seen that, in at least one embodiment of the present disclosure, an execution plan of a batch request is generated, but the execution plan is not directly executed, and the correlation inside a plurality of requests in the batch is mined by combining manual heuristic information, i.e., common column information, so as to optimize the execution plan, and then execute the optimized execution plan.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the embodiments or in the description of the prior art are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present disclosure, and a person of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 is a schematic illustration of a directed acyclic graph of an execution plan within a feature computation service provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of batch request processing for feature computation provided by an embodiment of the present disclosure;
FIG. 3 is a schematic illustration of the input of a calculation step in a feature calculation service optimization execution plan provided by an embodiment of the present disclosure;
FIG. 4 is a schematic illustration of the calculation steps themselves in a feature calculation service optimization execution plan provided by an embodiment of the present disclosure;
FIG. 5 is a schematic illustration of the calculation steps themselves in another feature calculation service optimization execution plan provided by an embodiment of the present disclosure;
FIG. 6 is a block diagram of a batch request processing apparatus for feature computation provided by an embodiment of the present disclosure;
fig. 7 is an exemplary block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order that the above-recited objects, features and advantages of the present disclosure may be more clearly understood, a more particular description of the disclosure will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It is to be understood that the described embodiments are some, but not all, of the embodiments of the present disclosure. The specific embodiments described herein are to be considered in an illustrative rather than a restrictive sense. All other embodiments derived by a person of ordinary skill in the art based on the described embodiments of the present disclosure fall within the scope of the present disclosure.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
For ease of understanding, the following explains terms related to the schemes of the embodiments of the present disclosure:
on-line feature computation
In the machine learning field, online feature computation is carried out by a feature computation service, which is a long-running program with specific computation logic inside for computing specific features.
Request for feature computation
The request for feature computation is logically composed of three parts:
(1) A calculation target specifying what feature to calculate as an output;
(2) An input pattern (schema) specifying which columns the input data contains and the data type for each column;
(3) Input data rows, where the data format of each data row must match the input schema.
For example, for the feature "the total spend of the same user ID on fruit within 3 days", the request for online feature calculation may be:
(1) Calculation target: "the total spend of the same user ID on fruit within 3 days";
(2) Input schema: user ID (int64), time, purchased commodity (string);
(3) Input data row: (user ID = 10081, time = 2020.8.1, purchased commodity = fruit).
In a practical application scenario, the feature computation may be requested in the form of a batch, which may be understood as a batch request of feature calculation, i.e., a group of requests with the same calculation target and input schema are aggregated together to request the feature computation service.
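For illustration only, the three logical parts of a request and their aggregation into a batch request might be modelled as in the sketch below; the field names are assumptions, and since the text does not give the data type of the time column, "date" is assumed here.

    from dataclasses import dataclass
    from typing import Any, Dict, List, Tuple


    @dataclass
    class FeatureRequest:
        calculation_target: str              # (1) what feature to compute as output
        input_schema: List[Tuple[str, str]]  # (2) (column name, data type) for each column
        data_row: Dict[str, Any]             # (3) one input data row matching the schema


    @dataclass
    class BatchRequest:
        # A batch aggregates requests with the same calculation target and input
        # schema, so only the data rows differ between the aggregated requests.
        calculation_target: str
        input_schema: List[Tuple[str, str]]
        data_rows: List[Dict[str, Any]]


    single = FeatureRequest(
        calculation_target="total spend of the same user ID on fruit within 3 days",
        input_schema=[("user_id", "int64"), ("time", "date"), ("commodity", "string")],
        data_row={"user_id": 10081, "time": "2020.8.1", "commodity": "fruit"},
    )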
Execution plan
Specific computing steps are generated inside the feature computing service for the specified computing targets and input schemas; when the request of feature calculation comes, the feature calculation service calculates the input data line according to the generated calculation step to obtain a return result. The execution plan is organized in the form of a Directed Acyclic Graph (DAG), each computation step being a node in the graph, an input edge of the node representing an input of the computation step, and an output edge of the node representing an output of the computation step.
For example, for the aforementioned request for feature computation, a directed acyclic graph of an execution plan within a feature computation service is shown in fig. 1, including computation steps 101 to 104:
calculation step 101: a line of data is input. The data line is a data line included in the request for feature calculation, for example: user id=10081, time= 2020.8.1, purchase commodity=fruit.
A calculation step 102: the time window is pulled. For example, the feature calculation service queries window data from the time series database based on the user ID of the input data line as a query primary key and time as a query time, that is, queries behavior data of the same user ID within 3 days. Wherein the window data includes a plurality of columns of data including at least a column of purchased goods.
Calculation step 103: table splicing. For example, based on the window data, the feature calculation service queries the commodity price table for the price of the purchased commodity in each row of the window data, and splices the queried price onto the window data.
Calculation step 104: and (5) performing aggregation calculation. For example, the feature calculation service sums values on all price columns of commodity = fruit based on the spliced data.
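For illustration only, the execution plan of Fig. 1 might be represented as a list of DAG nodes as in the sketch below; the Step type and the data names are assumptions, not the disclosed data structures.

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Step:
        name: str           # the calculation step (a node of the DAG)
        inputs: List[str]   # input edges: data produced upstream or taken from the request
        output: str         # output edge: the data this step produces


    # Mirrors calculation steps 101 to 104 of Fig. 1.
    execution_plan = [
        Step("input_row",    inputs=[],                    output="request"),
        Step("window_pull",  inputs=["request"],           output="window_of_request"),
        Step("table_splice", inputs=["window_of_request"], output="spliced_window"),
        Step("aggregate",    inputs=["spliced_window"],    output="feature_value"),
    ]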
The inventors of the present disclosure found that: in addition to sharing the same input pattern (schema) and execution plan of the computational objective, multiple requests in a batch may also be identical for input values on a partial column, repeating the computational features on the same input each time creates a significant waste of computational effort.
The inventors of the present disclosure also found that: in batch request processing of recommendation scenarios, users typically have a priori domain knowledge of the input schema information of the requests; specifically, before batch request processing is performed, the user can confirm in advance which columns have identical values across a batch of input request rows. The inventors of the present disclosure define a batch request common column as: a column whose values are identical in all request rows of a batch request. The batch request common columns constitute the manual a priori knowledge, i.e., the manual heuristic information, required by the present disclosure.
Therefore, the embodiment of the disclosure provides a batch request processing scheme for feature calculation, which generates an execution plan of a batch request, but does not directly execute the plan, and instead, the execution plan is optimized by combining manual heuristic information, namely common column information, and then the optimized execution plan is executed.
Fig. 2 is a flow chart of a batch request processing method for feature computation according to an embodiment of the present disclosure, where the execution subject of the method is a feature computation service. The method may comprise the following steps 201 to 204:
201. receiving a batch request of feature calculation; wherein the batch request includes a plurality of data rows sharing a calculation target and input field information for specifying information of a plurality of columns included in the data rows, the input field information being understood as input pattern (schema) information.
202. Generating a first execution plan based on the input field information and the calculation target; wherein the first execution plan includes a plurality of calculation steps and inputs and outputs for each calculation step.
203. Determining information of one or more common columns in the input field information, and optimizing the calculation steps and/or the inputs of the calculation steps in the first execution plan based on the information of the common columns to obtain a second execution plan.
In this embodiment, the feature calculation service may determine the information of the common column in various manners, for example, may manually specify the information of the common column, or may automatically determine the information of the common column.
In some embodiments, the feature computation service obtains information of one or more common columns in the input field information through a common column interface established in advance; in this way, the feature computation service does not determine the information of the common columns itself, but receives the information of the common columns manually specified through the common column interface. The common column interface is used to specify which columns in the input schema information of the batch request are common columns. In some embodiments, one common column interface may be set for all batch requests with the same calculation target and the same input schema.
In some embodiments, the feature computation service determines information of one or more common columns in the input field information based on a plurality of data lines included in the batch request, in such a way that the feature computation service automatically determines information of the common columns. In some embodiments, the feature computation service determines whether the values on a column in the plurality of rows of data included in the batch request are identical, and if so, determines the column as a common column.
204. Based on the second execution plan, a batch calculation result of the batch request is obtained.
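For illustration only, the automatic determination of common columns mentioned under step 203 (a column is common when its value is identical in every data row of the batch) could look like the following sketch; the function name detect_common_columns is an assumption.

    from typing import Any, Dict, List


    def detect_common_columns(schema: List[str], rows: List[Dict[str, Any]]) -> List[str]:
        # A column is a common column when all data rows of the batch carry the
        # same value for it (the batch is assumed to be non-empty).
        common = []
        for column in schema:
            first = rows[0][column]
            if all(row[column] == first for row in rows):
                common.append(column)
        return common


    rows = [
        {"user_id": 10081, "time": "2020.8.1", "commodity": "fruit"},
        {"user_id": 10081, "time": "2020.8.1", "commodity": "milk"},
    ]
    # Prints ['user_id', 'time']: 'commodity' differs between rows, so it is not common.
    print(detect_common_columns(["user_id", "time", "commodity"], rows))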
The following is an illustration of the optimization of the calculation steps in the first execution plan and/or the inputs of the calculation steps mentioned in step 203, respectively.
Optimizing the input of a calculation step
In some embodiments, the feature computation service may optimize the input of the computation steps in the first execution plan based on the information of the common column. Specifically, the feature computation service performs, for each computation step in the first execution plan:
judging whether the calculation step meets a first optimization condition, wherein the first optimization condition is: the calculation step performs calculation based on data in one or more common columns and does not perform calculation based on data in any non-common column;
if the first optimization condition is met, the input of the calculation step is optimized to be the data in the one or more common columns.
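For illustration only, the first optimization condition and the resulting input optimization might be sketched as follows; the function names and the representation of a step's input by its column names are assumptions.

    from typing import Any, Dict, List, Set


    def meets_first_condition(step_columns: Set[str], common_columns: Set[str]) -> bool:
        # First optimization condition: the step reads at least one common column
        # and does not read any non-common column.
        return bool(step_columns) and step_columns <= common_columns


    def optimize_step_input(step_columns: Set[str],
                            common_columns: Set[str],
                            rows: List[Dict[str, Any]]):
        if meets_first_condition(step_columns, common_columns):
            # One shared input record suffices for the whole batch, because the
            # values of the common columns are identical in every data row.
            return {col: rows[0][col] for col in step_columns}
        # Otherwise the step keeps its original per-row input.
        return [{col: row[col] for col in step_columns} for row in rows]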
In some embodiments, the calculation step is a time window pull, the input field information of the batch request includes a primary key column and a time column, and the time window pull pulls window data conforming to the calculation target based on the data in the primary key column and the data in the time column. If the primary key column and the time column are both common columns, the input of the "time window pull" is optimized to be the data in the primary key column and the data in the time column.
Fig. 3 shows the feature calculation service optimizing the input of a calculation step in the execution plan, specifically the input of the "time window pull" step in the first execution plan, where "time window pull" satisfies the first optimization condition, i.e., the "time window pull" step performs calculation based on data in one or more common columns and does not perform calculation based on data in any non-common column.
The left graph of fig. 3 shows the input and output of the step "time window pull" in the first execution plan, where the request represents any one of the batch requests, the window represents the step "time window pull", the request is the input of the window, and window of request is the output of the window. window of request represents window data queried by the window.
The right graph of Fig. 3 shows the execution plan obtained by the feature calculation service optimizing the input of the "time window pull" step in the first execution plan. The primary key column and the time column in the input field information of the batch request are both common columns; accordingly, the feature computation service optimizes the input of window to be the data in the common columns on which window computes, namely common in Fig. 3, which includes the data of the primary key column and the data of the time column. The output of window is unchanged and is still window of request.
non-common in the right graph of Fig. 3 represents the data in the non-common columns. In the first execution plan this data is used by the splicing (concat) step, and common also participates in that splicing, so the input of the concat step is drawn with a dashed arrow to distinguish it from the input of window. That is, non-common is fed to concat (the dashed line) and is not fed to window.
In this way, compared with the left graph of Fig. 3, the optimized window in the right graph of Fig. 3 has less input data, which reduces the data transmission pressure.
Optimizing the calculation step itself
In some embodiments, the feature computation service may optimize the computation step itself in the first execution plan. Specifically, if the feature calculation service determines that the calculation step does not satisfy the first optimization condition, the feature calculation service:
judging whether the calculating step meets a second optimizing condition, wherein the second optimizing condition is as follows: the calculating step performs a calculation based on data in one or more common columns and a calculation based on data in one or more non-common columns;
if the second optimization condition is met, optimizing the calculation step into at least one first sub-step and at least one second sub-step; wherein the first sub-step is based solely on data in one or more common columns; the second sub-step is calculated based at least on data in one or more non-common columns. In some embodiments, the second sub-step may also be calculated based on common intermediate data, where common intermediate data is data calculated based on only data of a common column.
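For illustration only, the check for the second optimization condition and the split of a calculation step into sub-steps might look like the sketch below; the Step type and the naming of the sub-steps are assumptions.

    from dataclasses import dataclass
    from typing import List, Set


    @dataclass
    class Step:
        name: str
        input_columns: Set[str]


    def split_step(step: Step, common_columns: Set[str]) -> List[Step]:
        common_part = step.input_columns & common_columns
        non_common_part = step.input_columns - common_columns
        # Second optimization condition: the step reads both common and non-common columns.
        if common_part and non_common_part:
            # First sub-step: computes based solely on the common columns.
            first_sub = Step(step.name + "_common", input_columns=common_part)
            # Second sub-step: computes based on the non-common columns together with
            # the common intermediate data produced by the first sub-step.
            second_sub = Step(step.name + "_non_common",
                              input_columns=non_common_part | {first_sub.name + "_output"})
            return [first_sub, second_sub]
        return [step]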
In some embodiments, the calculation step is a table splice, which splices a left table and a right table based on a splicing condition, where the left table is obtained based on the batch request and the right table is obtained based on a table stored in a database. If the splicing condition is a common column, the table splice is optimized into one first sub-step and two second sub-steps.
Wherein the first sub-step is for: and performing first splicing based on splicing conditions on the data of all the common columns in the left table and the data of all the common columns in the right table.
Wherein the two second sub-steps comprise: a right splice sub-step and a left splice sub-step.
The right splice sub-step is for: and performing second splicing based on the unique index of the right table on the output of the first substep and all data of non-common columns in the right table.
The left stitching substep is for: and performing third splicing on the output of the right splicing substep and the data of all non-common columns in the left table.
The output of the left splicing substep is used as the calculation result of the table splicing.
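For illustration only, the three sub-steps of the optimized table splicing can be traced on toy data as below; the column names, the dictionary representation of tables, and the assumption of a single matching row in the right table are all illustrative and not part of the disclosure.

    # Toy data: the splicing condition is on "user_id", a common column, so every
    # row of the left table (derived from the batch request) carries the same value.
    left_rows = [
        {"user_id": 10081, "commodity": "fruit"},   # user_id: common; commodity: non-common
        {"user_id": 10081, "commodity": "milk"},
    ]
    right_rows = [
        {"row_id": 1, "user_id": 10081, "price": 3.5},   # row_id: unique index of the right table
        {"row_id": 2, "user_id": 10082, "price": 2.0},
    ]

    # First sub-step: splice only the common columns of both tables on the splicing
    # condition; it runs once for the whole batch rather than once per data row.
    common_value = left_rows[0]["user_id"]
    common_join = [{"row_id": r["row_id"], "user_id": r["user_id"]}
                   for r in right_rows if r["user_id"] == common_value]

    # Right splicing sub-step: attach the right table's non-common columns to the
    # output of the first sub-step through the right table's unique index.
    right_by_id = {r["row_id"]: r for r in right_rows}
    right_join = [{**j, "price": right_by_id[j["row_id"]]["price"]} for j in common_join]

    # Left splicing sub-step: concatenate the left table's non-common columns with
    # the right splicing output; this is the calculation result of the table splice.
    new_left = [{"commodity": l["commodity"], **r}
                for l in left_rows for r in right_join]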
Fig. 4 shows the feature calculation service optimizing a calculation step itself in the execution plan, specifically the "table splicing" step in the first execution plan, where "table splicing" satisfies the second optimization condition, i.e., "table splicing" is calculated based on data in one or more common columns and based on data in one or more non-common columns.
The left graph of Fig. 4 shows the input and output of the "table splicing" step in the first execution plan, where "+" represents the "table splicing" step, the left table left and the right table right are its inputs, the columns that left uses for splicing are common columns, and the columns that right uses for splicing are also common columns, i.e., the splicing condition is on common columns. new left is the output of "table splicing", i.e., "table splicing" splices right into left to form new left.
The right hand graph of fig. 4 shows the execution plan resulting from the feature computation service optimizing the "table stitching" step itself in the first execution plan. Specifically, the feature computation service optimization "table splice" itself is a first sub-step 401, a right splice sub-step 402, and a left splice sub-step 403.
The first substep 401 computes based only on data in the common columns, left2 and right2 being common inputs to the first substep 401, left2 being the common column in left and right2 being the common column in right. The first substep 401 is for first stitching left2 and right 2. Since the first sub-step 401 performs calculation based on only the data of the common column, the splice result output by the first sub-step 401 is referred to as a common splice result, and is denoted as common join. Also, common join may be understood as common intermediate data, as common join participates in subsequent computations, not the final result.
The right splice substep 402 computes based on both non-common columns and common columns, with right1 and common join together as inputs to the right splice substep 402, with right1 being the non-common column in right. The right splice substep 402 is used to second splice right1 with common join through the unique index (right unique index) of the right table.
The left splicing sub-step 403 computes based on non-common data: left1 and the output of the right splicing sub-step 402 together serve as its inputs, where left1 is the non-common columns in left, and the output of the right splicing sub-step 402 is non-common data because it is computed based on non-common columns. The left splicing sub-step 403 performs an unconditional third splice (concat) on left1 and the output of the right splicing sub-step 402, and outputs non-common data (non-common output) as the calculation result of "table splicing".
Thus, compared with the left graph of Fig. 4, each sub-step in the right graph of Fig. 4, obtained by optimizing "table splicing" itself, has less input data, which reduces the data transmission pressure.
In some embodiments, for the computing step being an aggregate computation, the aggregate computation produces a plurality of computation results, each computation result being a common computation result, a first non-common computation result, or a second non-common computation result.
Wherein the common calculation result is calculated based only on data in one or more common columns; the first non-public calculation result is calculated based on the data in one or more public columns and the data in one or more non-public columns; the second non-common calculation result is calculated based on data in one or more non-common columns.
In some embodiments, for the computing step to be an aggregate computation, the feature computation service optimizes the aggregate computation itself as a plurality of first sub-steps and a plurality of second sub-steps; wherein the plurality of first sub-steps includes at least one first common computing step and at least one second common computing step; the plurality of second sub-steps includes at least one first non-common computing step and at least one second non-common computing step.
The input of the first common calculation step is data in one or more common columns and output is common intermediate data.
The input of the second common calculation step is data in one or more common columns, and the output is a common calculation result.
The input of the first non-common computing step is common intermediate data output by one or more first common computing steps and data in one or more non-common columns, and the input is output as a first non-common computing result.
The input of the second non-common computing step is data in one or more non-common columns, and the output is a second non-common computing result.
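For illustration only, the four kinds of sub-steps of the optimized aggregate computation might be sketched as below; the concrete formulas and field names are assumptions and only show how the common sub-steps run once per batch while the non-common sub-steps run once per data row.

    from typing import Any, Dict, List


    def proj1(common: Dict[str, Any]) -> Dict[str, Any]:
        # First common computing step: input is common-column data only,
        # output is common intermediate data.
        return {"state": common["a"] + common["b"]}


    def proj2(common: Dict[str, Any]) -> Dict[str, Any]:
        # Second common computing step: input is common-column data only,
        # output is already a final common calculation result.
        return {"common_output": common["a"] * 2}


    def concat(state: Dict[str, Any], non_common: Dict[str, Any]) -> Dict[str, Any]:
        # First non-common computing step: combines common intermediate data with
        # the non-common columns of one data row; output is a first non-common result.
        return {**state, **non_common}


    def proj3(first_non_common: Dict[str, Any]) -> Dict[str, Any]:
        # Second non-common computing step: produces the per-row non-common output.
        return {"non_common_output": first_non_common["state"] + first_non_common["c"]}


    common = {"a": 1, "b": 2}
    non_common_rows: List[Dict[str, Any]] = [{"c": 10}, {"c": 20}]
    state = proj1(common)            # computed once per batch
    common_output = proj2(common)    # computed once per batch
    per_row = [proj3(concat(state, nc)) for nc in non_common_rows]   # once per data row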
Fig. 5 shows the feature computation service optimizing a computation step itself, specifically the "aggregate computation" step in the first execution plan, where "aggregate computation" satisfies the second optimization condition, i.e., "aggregate computation" is computed based on data in one or more common columns and based on data in one or more non-common columns.
The left-hand diagram of fig. 5 shows the input and output of the step "aggregate calculation" in the first execution plan, proj representing the step "aggregate calculation". The right hand graph of fig. 5 shows the execution plan resulting from the feature computation service optimizing the "aggregate computation" step itself in the first execution plan. Specifically, the feature computation service optimizes the "aggregate computation" itself to be a first common computation step proj1, a second common computation step proj2, a first non-common computation step concat, and a second non-common computation step proj3.
The input of the first common calculation step proj1 is the data (common) of the common column in the input, and the output is the common intermediate data (common state).
The input of the second common calculation step proj2 is the data (common) of the common columns in input, and the output is the common calculation result (common output).
The input of the first non-common calculation step concat is the data non-common of the non-common columns in input and the common intermediate data (common state) output by the first common calculation step proj1, and the output is the first non-common calculation result.
The input of the second non-common calculation step proj3 is the first non-common calculation result output by the first non-common calculation step concat, and the output is the second non-common calculation result (non-common output).
In this way, compared with the left graph of Fig. 5, each sub-step in the right graph of Fig. 5, obtained by optimizing "aggregate computation" itself, has less input data, which reduces the data transmission pressure.
For the problem of computational effort wasted by repeatedly computing features on the same input in current batch request processing, embodiments of the present disclosure further provide the following solutions.
Shared storage
In some embodiments, the feature computation service further establishes the shared memory area after determining one or more common columns of information in the input field information of the batch request. The feature computing service stores the public data in the shared memory area; wherein the common data includes: data in a common column, common intermediate data, and a common calculation result.
Data in common columns: for example, in the scenario shown in Fig. 3, the primary key column and the time column in the input field information of the batch request are both common columns, and their data is common data.
The common intermediate data includes intermediate data generated by the first common calculation step in the second execution plan, for example, the common state output by the first common calculation step proj1 in fig. 5 is the common intermediate data. As another example, in fig. 3, window of request of the window output is common intermediate data, since window of request will still be used in subsequent calculations.
The common calculation result includes a calculation result generated by the second common calculation step in the second execution plan, for example, common output from the second common calculation step proj2 in fig. 5 is the common calculation result.
In some embodiments, the feature computation service may store the common data into the shared memory area through a pre-established data encoding storage interface. In some embodiments, the feature computation service is located at the server, the batch requests received by the feature computation service are generated by the client, and the client can optimize storage of the batch requests by pre-establishing a data encoding storage interface when generating the batch requests, i.e., storing data of a common column of input field information of the batch requests into a shared storage area established by the client.
In some embodiments, the feature computation service may further compress the data of the common columns or the data of the non-common columns in the batch request using a compression algorithm.
In some embodiments, a server may be understood as a server that can create a feature computation service, and the server may be a server or a server group. The server groups may be centralized or distributed. The client may be any type of electronic device, such as a portable mobile device like a smart phone, a tablet computer, a notebook computer, and a fixed device like a desktop computer, a smart television, and the like.
In some embodiments, the feature computation service executes the second execution plan separately for each data line in the batch request, obtains a computation result for each data line, and merges the computation result for each data line into a batch computation result for the batch request.
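For illustration only, executing the second execution plan once per data row while sharing common data through a shared storage area might be sketched as follows; the Plan type and the dictionary-based shared storage are assumptions.

    from typing import Any, Callable, Dict, List

    # A "plan" is assumed to be a callable taking one data row and the shared
    # storage area (holding common data) and returning that row's result.
    Plan = Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]


    def execute_batch(plan: Plan,
                      rows: List[Dict[str, Any]],
                      common_columns: List[str]) -> List[Dict[str, Any]]:
        # Establish the shared storage area and store the common-column data in it;
        # common intermediate data and common calculation results are added to it
        # while the plan is executed.
        shared_storage: Dict[str, Any] = {
            "common_columns": {c: rows[0][c] for c in common_columns},
        }
        # Execute the second execution plan once for each data row and merge the
        # per-row results into the batch calculation result.
        return [plan(row, shared_storage) for row in rows]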
Multiplexing common intermediate data
In some embodiments, the feature computation service, when executing the second execution plan:
for a first common calculation step in a second execution plan:
querying an index relation, wherein the index relation is the relation between the identification of a common computing step and the storage location, in the shared storage area, of the data generated by that common computing step. In some embodiments, the index relation may be queried through a pre-established common data cache interface.
If the identification of the first public computing step is queried, determining a first storage position corresponding to the identification of the first public computing step from the index relation.
Based on the first storage location, data is retrieved from the shared storage area as common intermediate data resulting from the first common computing step.
Therefore, the public intermediate data stored in the shared memory area is directly multiplexed, the first public calculation step is not required to be calculated again, and the calculation force waste is reduced.
In some embodiments, the feature computation service generates common intermediate data through a first common computation step based on data in a common column if the identity of the first common computation step is not queried; and then the public intermediate data generated by the first public calculation step is stored in the shared storage area, and the relation between the identification of the first public calculation step and the storage position is recorded in the index relation, so that the public intermediate data stored in the shared storage area can be reused when aiming at the first public calculation step later, the calculation characteristics do not need to be repeated for the same input, and the calculation force waste is reduced.
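For illustration only, the lookup-or-compute behaviour of a common computing step against the index relation and the shared storage area might be sketched as follows; the same pattern applies to the second common computing step and its common calculation results. The function name get_or_compute and the dictionary representations are assumptions.

    from typing import Any, Callable, Dict

    shared_storage: Dict[int, Any] = {}   # shared storage area (assumed dictionary)
    index_relation: Dict[str, int] = {}   # step identification -> storage location


    def get_or_compute(step_id: str, compute: Callable[[], Any]) -> Any:
        # Query the index relation (e.g. through a common data cache interface).
        if step_id in index_relation:
            # Hit: reuse the stored common data instead of recomputing the step.
            return shared_storage[index_relation[step_id]]
        # Miss: run the common computing step, store its output in the shared storage
        # area, and record step identification -> storage location in the index relation.
        value = compute()
        location = len(shared_storage)
        shared_storage[location] = value
        index_relation[step_id] = location
        return value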
Multiplexing common calculation results
In some embodiments, the feature computation service, when executing the second execution plan:
For a second common calculation step in a second execution plan:
querying an index relation, wherein the index relation is the relation between the identification of a common calculation step and the storage location, in the shared storage area, of the data generated by that common calculation step. In some embodiments, the index relation may be queried through a pre-established public data cache interface.
If the identification of the second common calculation step is queried, determining a second storage location corresponding to the identification of the second common calculation step from the index relation;
based on the second storage location, data is retrieved from the shared storage area as a common calculation result produced by the second common calculation step.
In this way, the common calculation result stored in the shared storage area is reused directly, the second common calculation step does not need to be computed again, and wasted computing power is reduced.
In some embodiments, if the identification of the second common calculation step is not queried, the feature computation service generates the common calculation result through the second common calculation step based on the data in the common columns, stores the common calculation result generated by the second common calculation step into the shared storage area, and records the relation between the identification of the second common calculation step and the storage location in the index relation. In this way, the common calculation result stored in the shared storage area can be reused later for the same second common calculation step, the features do not need to be computed repeatedly for the same input, and wasted computing power is reduced.
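Both multiplexing paths above follow the same lookup pattern: key the index relation by the identification of the common calculation step, and fall back to computing and recording the data on a miss. A minimal sketch, assuming each common calculation step has a stable identification string and a callable that computes it from the common-column data; the names shared_storage, index_relation and run_common_step are hypothetical, and the same helper serves first common calculation steps (common intermediate data) and second common calculation steps (common calculation results):

```python
shared_storage = {}   # storage location -> stored data (the shared storage area)
index_relation = {}   # step identification -> storage location

def run_common_step(step_id, step_fn, common_data):
    if step_id in index_relation:
        # Identification found: reuse the stored data without recomputing.
        return shared_storage[index_relation[step_id]]
    # Identification not found: compute once from the common-column data,
    # store it, and record identification -> storage location.
    result = step_fn(common_data)
    location = f"slot:{len(shared_storage)}"
    shared_storage[location] = result
    index_relation[step_id] = location
    return result
```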
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions, but those skilled in the art will appreciate that the disclosed embodiments are not limited by the order of actions described, as some steps may be performed in other orders or concurrently in accordance with the disclosed embodiments. In addition, those skilled in the art will appreciate that the embodiments described in the specification are all optional embodiments.
Fig. 6 is a schematic diagram of a batch request processing apparatus for feature calculation according to an embodiment of the present disclosure. As shown in fig. 6, the batch request processing apparatus for feature computation may include, but is not limited to: a receiving unit 61, a generating unit 62, an optimizing unit 63 and a calculating unit 64. Each unit is described as follows:
a receiving unit 61 for receiving a batch request for feature calculation; the batch request comprises a plurality of data rows, wherein the plurality of data rows share a calculation target and input field information, and the input field information is used for specifying information of a plurality of columns included in the data rows;
a generation unit 62 for generating a first execution plan based on the input field information and the calculation target; wherein the first execution plan includes a plurality of calculation steps and inputs and outputs for each calculation step;
An optimizing unit 63, configured to determine information of one or more common columns in the input field information, and optimize the calculation step and/or the input of the calculation step in the first execution plan based on the information of the common columns, so as to obtain a second execution plan;
a calculation unit 64 for obtaining a batch calculation result of the batch request based on the second execution plan.
In some embodiments, the optimizing unit 63 determines information of one or more common columns in the input field information includes:
acquiring information of one or more public columns in the input field information through a public column interface which is established in advance; or, information of one or more common columns in the input field information is determined based on a plurality of data rows included in the batch request.
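As a sketch of the second alternative, the common columns can be determined by scanning the data rows of the batch request; assuming each data row is a dictionary keyed by column name, a column is common when its value is identical in every request row (the helper name is hypothetical):

```python
def find_common_columns(batch_rows):
    # A column is common when all request rows carry the same value for it.
    first = batch_rows[0]
    return [col for col in first
            if all(row.get(col) == first[col] for row in batch_rows[1:])]
```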
In some embodiments, the optimizing unit 63 optimizing the calculation steps and/or the inputs of the calculation steps in the first execution plan based on the information of the common columns to obtain the second execution plan includes:
for each calculation step in the first execution plan:
judging whether the calculation step meets a first optimization condition, wherein the first optimization condition is: the calculation step performs calculation based on data in one or more common columns and does not perform calculation based on data in any non-common column;
if the first optimization condition is met, optimizing the input of the calculation step to be the data in the one or more common columns.
In some embodiments, if the optimization unit 63 determines that the calculation step does not satisfy the first optimization condition, then:
judging whether the calculating step meets a second optimizing condition, wherein the second optimizing condition is as follows: the calculating step performs a calculation based on data in one or more common columns and a calculation based on data in one or more non-common columns;
if the second optimization condition is met, optimizing the calculation step into at least one first sub-step and at least one second sub-step; wherein the first sub-step performs calculation based solely on data in one or more common columns, and the second sub-step performs calculation based at least on data in one or more non-common columns.
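A minimal sketch of applying these two optimization conditions, assuming each calculation step declares the set of columns it reads; the Step structure and the naming of the sub-steps are hypothetical and only illustrate the classification, not the actual plan representation:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    input_cols: set = field(default_factory=set)

def optimize_step(step: Step, common_cols: set) -> list:
    common_in = step.input_cols & common_cols
    non_common_in = step.input_cols - common_cols
    if common_in and not non_common_in:
        # First optimization condition: the step reads common columns only,
        # so its input is narrowed to the data in those common columns.
        return [Step(step.name, common_in)]
    if common_in and non_common_in:
        # Second optimization condition: split into a first sub-step computed
        # only from the common columns and a second sub-step computed at least
        # from the non-common columns (it may also consume the first sub-step's output).
        return [Step(step.name + ":common", common_in),
                Step(step.name + ":per_row", non_common_in)]
    return [step]   # neither condition is met: leave the step unchanged
```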
In some embodiments, for the calculation step being a time window pull, the input field information includes a primary key column and a time column, and the time window pull pulls window data conforming to the calculation target based on the data in the primary key column and the data in the time column;
if the primary key column and the time column are both common columns, the optimizing unit 63 optimizes the input of the time window pull to be the data in the primary key column and the data in the time column.
In some embodiments, for the calculation step being a table splice, the table splice splices a left table and a right table based on a splicing condition, wherein the left table is derived from the batch request and the right table is derived from a table stored in a database;
if the splicing condition is a common column, the optimizing unit 63 optimizes the table splicing into a first sub-step and two second sub-steps;
wherein the first sub-step is for: performing first splicing based on splicing conditions on the data of all the public columns in the left table and the data of all the public columns in the right table;
the two second sub-steps include: a right splicing sub-step and a left splicing sub-step;
the right splicing sub-step is for: performing a second splice, based on the unique index of the right table, on the output of the first sub-step and the data of all non-common columns in the right table;
the left splicing sub-step is for: performing a third splice on the output of the right splicing sub-step and the data of all non-common columns in the left table;
the output of the left splicing substep is used as the calculation result of the table splicing.
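A minimal sketch of this three-way splice using pandas, assuming the splicing condition is a single common column user_id, the right table has a unique index column rid, and bid and city are non-common columns; all column names are hypothetical:

```python
import pandas as pd

left = pd.DataFrame({"user_id": ["u1", "u1", "u1"],   # common column of the batch
                     "bid": [3.0, 5.0, 7.0]})         # non-common column
right = pd.DataFrame({"rid": [1, 2],                  # unique index of the right table
                      "user_id": ["u1", "u1"],
                      "city": ["bj", "sh"]})          # non-common column of the right table

# First sub-step: splice only the common columns; one distinct row suffices
# because every request row carries the same common-column values.
first = left[["user_id"]].drop_duplicates().merge(
    right[["user_id", "rid"]], on="user_id", how="left")

# Right splicing sub-step: attach the right table's non-common columns through
# the right table's unique index.
right_joined = first.merge(right[["rid", "city"]], on="rid", how="left")

# Left splicing sub-step: attach the left table's non-common columns; the shared
# join result is broadcast onto every request row (a cross join here).
result = left[["bid"]].merge(right_joined, how="cross")
```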
In some embodiments, for the computing step being an aggregate computation, the aggregate computation producing a plurality of computation results, each computation result being a common computation result, a first non-common computation result, or a second non-common computation result;
Wherein the common calculation result is calculated based only on data in one or more common columns; the first non-public calculation result is calculated based on the data in one or more public columns and the data in one or more non-public columns; the second non-common calculation result is calculated based on data in one or more non-common columns.
In some embodiments, the optimization unit 63 optimizes the aggregate calculation into a plurality of first sub-steps and a plurality of second sub-steps; wherein the plurality of first sub-steps includes at least one first common computing step and at least one second common computing step; the plurality of second sub-steps includes at least one first non-common computing step and at least one second non-common computing step;
the input of the first common calculation step is data in one or more common columns, and the output is common intermediate data;
the input of the second common calculation step is data in one or more common columns, and the output is a common calculation result;
the input of the first non-public calculation step is public intermediate data output by one or more first public calculation steps and data in one or more non-public columns, and the output is a first non-public calculation result;
The input of the second non-common computing step is data in one or more non-common columns, and the output is a second non-common computing result.
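A minimal sketch of this four-way split, assuming the window rows for the shared primary key have already been pulled, that amount is a common column of the window data, and that bid is a non-common column of the request rows; the column names and the concrete aggregations are hypothetical:

```python
window = [{"amount": 10.0}, {"amount": 20.0}, {"amount": 30.0}]  # pulled once for the batch
batch_rows = [{"bid": 12.0}, {"bid": 25.0}]

# First common calculation step: common intermediate data from common columns only.
amounts = [r["amount"] for r in window]

# Second common calculation step: a common calculation result, computed once.
total_amount = sum(amounts)

per_row_results = []
for row in batch_rows:
    # First non-common calculation step: combines the common intermediate data
    # with a non-common column value, so it runs once per request row.
    count_above_bid = sum(1 for a in amounts if a > row["bid"])
    # Second non-common calculation step: uses non-common columns only.
    doubled_bid = row["bid"] * 2
    per_row_results.append({"sum_amount": total_amount,
                            "count_above_bid": count_above_bid,
                            "bid_x2": doubled_bid})
```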
In some embodiments, after determining the information of one or more common columns in the input field information, the optimizing unit 63 is further configured to:
establishing a shared storage area;
storing the public data in a shared memory area; wherein the common data includes: data in a common column, common intermediate data, and a common calculation result;
wherein the common intermediate data is intermediate data generated by a first common calculation step in the second execution plan, the first common calculation step being a step of generating intermediate data based on data in one or more common columns;
wherein the common calculation result is a calculation result generated by a second common calculation step in the second execution plan, the second common calculation step being a step of generating a calculation result based on data in one or more common columns.
In some embodiments, the computing unit 64 is to: and respectively executing a second execution plan for each data line in the batch request to obtain a calculation result of each data line, and merging the calculation results into a batch calculation result of the batch request.
In some embodiments, the computing unit 64, when executing the second execution plan:
For a first common calculation step in a second execution plan:
querying an index relationship, wherein the index relationship is the relationship between the identification of the public computing step and the storage position of the data generated in the public computing step in the shared storage area;
if the identification of the first public computing step is queried, determining a first storage position corresponding to the identification of the first public computing step from the index relation;
based on the first storage location, data is retrieved from the shared storage area as common intermediate data resulting from the first common computing step.
In some embodiments, the computing unit 64 is further to: if the identification of the first public computing step is not found, generating public intermediate data through the first public computing step based on the data in the public column; the common intermediate data generated by the first common computing step is stored in a shared memory area, and the relationship between the identity of the first common computing step and the memory location is recorded in an index relationship.
In some embodiments, the computing unit 64, when executing the second execution plan:
for a second common calculation step in a second execution plan:
querying an index relationship, wherein the index relationship is the relationship between the identification of the public computing step and the storage position of the data generated in the public computing step in the shared storage area;
If the identification of the second public computing step is queried, determining a second storage position corresponding to the identification of the second public computing step from the index relation;
based on the second storage location, data is retrieved from the shared storage area as a common calculation result produced by the second common calculation step.
In some embodiments, the computing unit 64 is further to: if the identification of the second public calculation step is not found, generating a public calculation result through the second public calculation step based on the data in the public column; and storing the public calculation result generated by the second public calculation step into a shared storage area, and recording the relation between the identification of the second public calculation step and the storage position in an index relation.
In some embodiments, the computing unit 64 querying the index relationship includes:
and inquiring the index relation through a pre-established public data cache interface.
In some embodiments, the computing unit 64 storing the common data into the shared memory area includes: the common data is stored into the shared memory area through a pre-established data encoding storage interface.
For details of the batch request processing apparatus for feature calculation, reference may be made to various embodiments of the batch request processing method for feature calculation, and in order to avoid repetition of description, details are not described herein.
In some embodiments, the division of the units in the batch request processing apparatus for feature calculation is merely a division by logical function; in actual implementation, other division manners may be adopted, for example, multiple units may be implemented as one unit, or one unit may be divided into multiple sub-units. It is understood that each unit or sub-unit can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Those skilled in the art may implement the described functionality in different ways for each particular application.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 7, the electronic device includes: at least one processor 71, at least one memory 72, and at least one communication interface 73. The various components in the electronic device are coupled together by a bus system 74. The communication interface 73 is used for information transfer with external devices. It is to be appreciated that the bus system 74 is used to enable connection and communication among these components. In addition to a data bus, the bus system 74 includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as the bus system 74 in fig. 7.
It will be appreciated that the memory 72 in this embodiment may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
In some embodiments, memory 72 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system and application programs.
The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, and is used for implementing various basic tasks and processing hardware-based tasks. The application programs include various applications, such as a media player and a browser, and are used to implement various application tasks. A program implementing the batch request processing method for feature calculation provided by the embodiments of the present disclosure may be included in the application programs.
In the embodiment of the present disclosure, the processor 71 is configured to execute the steps of each embodiment of the batch request processing method for feature calculation provided in the embodiment of the present disclosure by calling a program or an instruction stored in the memory 72, specifically, a program or an instruction stored in an application program.
The batch request processing method for feature calculation provided by the embodiments of the present disclosure may be applied to the processor 71 or implemented by the processor 71. The processor 71 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by an integrated logic circuit of hardware in the processor 71 or by instructions in the form of software. The processor 71 may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The steps of the batch request processing method for feature calculation provided in the embodiments of the present disclosure may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software units may be located in a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, or another storage medium well known in the art. The storage medium is located in the memory 72, and the processor 71 reads the information in the memory 72 and completes the steps of the method in combination with its hardware.
Embodiments of the present disclosure further provide a non-transitory computer readable storage medium storing a program or instructions that cause a computer to perform the steps of the embodiments of the batch request processing method for feature calculation; to avoid repetition of the description, details are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
Those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments.
Those skilled in the art will appreciate that the descriptions of the various embodiments are each focused on, and that portions of one embodiment that are not described in detail may be referred to as related descriptions of other embodiments.
Although embodiments of the present disclosure have been described with reference to the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such modifications and variations fall within the scope defined by the appended claims.

Claims (30)

1. A batch request processing method for feature computation, comprising:
receiving a batch request of feature calculation; wherein the batch request includes a plurality of data rows sharing a computation target and input field information for specifying information of a plurality of columns included in the data rows;
generating a first execution plan based on the input field information and the calculation target; wherein said first execution plan includes a plurality of calculation steps and inputs and outputs for each of said calculation steps;
Determining information of one or more common columns in the input field information, optimizing the calculation step and/or the input of the calculation step in the first execution plan based on the information of the common columns, and obtaining a second execution plan, wherein the method comprises the following steps:
for each calculation step in the first execution plan:
judging whether the calculation step meets a first optimization condition, wherein the first optimization condition is as follows: the calculating step calculates based on data in one or more of the common columns and does not calculate based on data in any of the non-common columns;
if the first optimization condition is met, optimizing the input of the calculating step to be the data in the one or more common columns;
if the calculating step is judged not to meet the first optimizing condition, then:
judging whether the calculating step meets a second optimizing condition, wherein the second optimizing condition is as follows: the calculating step performs calculation based on data in one or more of the common columns and performs calculation based on data in one or more non-common columns;
if the second optimization condition is met, optimizing the calculation step into at least one first sub-step and at least one second sub-step; wherein the first sub-step is based solely on data in one or more of the common columns; the second sub-step is based at least on data in one or more non-common columns;
wherein the common column is a column whose values are identical in all request rows of the batch request;
and obtaining a batch calculation result of the batch request based on the second execution plan.
2. The method of claim 1, wherein the determining information for one or more common columns in the input field information comprises:
acquiring information of one or more public columns in the input field information through a public column interface which is established in advance;
or,
information for one or more common columns in the input field information is determined based on a plurality of rows of data included in the batch request.
3. The method according to claim 1, wherein, for the calculation step being a time window pull, the input field information includes a primary key column and a time column, and the time window pull pulls window data conforming to the calculation target based on data in the primary key column and data in the time column;
and if the primary key column and the time column are both common columns, optimizing the input of the time window pull to be the data in the primary key column and the data in the time column.
4. The method of claim 1, wherein for the calculating step, a table splice is performed, the table splice splices a left table and a right table based on splice conditions, wherein the left table is obtained based on the batch request, and the right table is obtained based on a table stored in a database;
If the splicing condition is a common column, optimizing the table to be spliced into a first substep and two second substeps;
wherein the first sub-step is for: performing first splicing based on the splicing conditions on the data of all the public columns in the left table and the data of all the public columns in the right table;
the two second sub-steps comprise: a right splicing sub-step and a left splicing sub-step;
the right stitching substep is for: performing a second splice based on the unique index of the right table for the output of the first sub-step and all non-common columns of data in the right table;
the left stitching substep is for: performing third splicing on the output of the right splicing substep and the data of all non-common columns in the left table;
and the output of the left splicing substep is used as the calculation result of the table splicing.
5. The method of claim 1, wherein for the computing step is an aggregate computation, the aggregate computation producing a plurality of computation results, each computation result being a common computation result, a first non-common computation result, or a second non-common computation result;
wherein the public calculation result is calculated based on data in one or more public columns only; the first non-public calculation result is calculated based on data in one or more public columns and data in one or more non-public columns; the second non-common calculation result is calculated based on data in one or more non-common columns.
6. The method of claim 5, wherein the aggregate calculation is optimized into a plurality of first sub-steps and a plurality of second sub-steps; wherein the plurality of first sub-steps includes at least one first common computing step and at least one second common computing step; the plurality of second sub-steps includes at least one first non-common computing step and at least one second non-common computing step;
the input of the first common calculation step is data in one or more common columns, and the output is common intermediate data;
the input of the second common calculation step is data in one or more common columns, and the output is a common calculation result;
the input of the first non-public computing step is public intermediate data output by one or more first public computing steps and data in one or more non-public columns, and the output is a first non-public computing result;
the input of the second non-public calculation step is data in one or more non-public columns, and the output is a second non-public calculation result.
7. The method of claim 1, wherein after said determining the information of one or more common columns in the input field information, the method further comprises:
Establishing a shared storage area;
storing common data into the shared memory area; wherein the common data includes: data in the common column, common intermediate data, and a common calculation result;
wherein the common intermediate data is intermediate data generated by a first common calculation step in the second execution plan, the first common calculation step being a step of generating intermediate data based on data in one or more of the common columns;
wherein the common calculation result is a calculation result generated by a second common calculation step in the second execution plan, and the second common calculation step is a step of generating a calculation result based on data in one or more of the common columns.
8. The method of claim 7, wherein the deriving a batch calculation of the batch request based on the second execution plan comprises:
and respectively executing the second execution plan for each data row in the batch request to obtain a calculation result of each data row, and merging the calculation results into a batch calculation result of the batch request.
9. The method of claim 8, wherein, in executing the second execution plan:
For a first common calculation step in the second execution plan:
querying an index relation, wherein the index relation is the relation between the identification of the public calculation step and the storage position of the data generated in the public calculation step in the shared storage area;
if the identification of the first public computing step is queried, determining a first storage position corresponding to the identification of the first public computing step from the index relation;
based on the first storage location, data is retrieved from the shared storage area as common intermediate data resulting from the first common computing step.
10. The method of claim 9, wherein the method further comprises:
if the identification of the first public calculation step is not queried, generating public intermediate data through the first public calculation step based on the data in the public column;
and storing the public intermediate data generated by the first public calculation step into the shared storage area, and recording the relation between the identification of the first public calculation step and the storage position in the index relation.
11. The method of claim 8, wherein, in executing the second execution plan:
for a second common calculation step in the second execution plan:
Querying an index relation, wherein the index relation is the relation between the identification of the public calculation step and the storage position of the data generated in the public calculation step in the shared storage area;
if the identification of the second public computing step is queried, determining a second storage position corresponding to the identification of the second public computing step from the index relation;
based on the second storage location, data is retrieved from the shared storage area as a common calculation result produced by the second common calculation step.
12. The method of claim 11, wherein the method further comprises:
if the identification of the second public calculation step is not queried, generating a public calculation result through the second public calculation step based on the data in the public column;
and storing the public calculation result generated by the second public calculation step into the shared storage area, and recording the relation between the identification of the second public calculation step and the storage position in the index relation.
13. The method of any of claims 9 to 12, wherein the query index relationship comprises:
and inquiring the index relation through a pre-established public data cache interface.
14. The method of any of claims 7 to 12, wherein the storing of common data into the shared storage area comprises:
storing the common data into the shared storage area through a pre-established data encoding storage interface.
15. A batch request processing apparatus for feature computation, comprising:
a receiving unit for receiving a batch request for feature calculation; wherein the batch request includes a plurality of data rows sharing a computation target and input field information for specifying information of a plurality of columns included in the data rows;
a generation unit configured to generate a first execution plan based on the input field information and the calculation target; wherein said first execution plan includes a plurality of calculation steps and inputs and outputs for each of said calculation steps;
the optimizing unit is used for determining information of one or more public columns in the input field information, optimizing the calculation step and/or the input of the calculation step in the first execution plan based on the information of the public columns, and obtaining a second execution plan;
the optimizing unit optimizes the calculation step and/or the input of the calculation step in the first execution plan based on the information of the common column, and the obtaining of the second execution plan includes:
For each calculation step in the first execution plan:
judging whether the calculation step meets a first optimization condition, wherein the first optimization condition is as follows: the calculating step calculates based on data in one or more of the common columns and does not calculate based on data in any of the non-common columns;
if the first optimization condition is met, optimizing the input of the calculating step to be the data in the one or more common columns;
if the calculating step is judged not to meet the first optimizing condition, then:
judging whether the calculating step meets a second optimizing condition, wherein the second optimizing condition is as follows: the calculating step performs calculation based on data in one or more of the common columns and performs calculation based on data in one or more non-common columns;
if the second optimization condition is met, optimizing the calculation step into at least one first sub-step and at least one second sub-step; wherein the first sub-step performs calculation based solely on data in one or more of the common columns; the second sub-step performs calculation based at least on data in one or more non-common columns;
wherein the common column is a column whose values are identical in all request rows of the batch request;
And the calculating unit is used for obtaining a batch calculation result of the batch request based on the second execution plan.
16. The apparatus of claim 15, wherein the optimization unit determining information for one or more common columns in the input field information comprises:
acquiring information of one or more public columns in the input field information through a public column interface which is established in advance;
or,
information for one or more common columns in the input field information is determined based on a plurality of rows of data included in the batch request.
17. The apparatus of claim 15, wherein, for the calculation step being a time window pull, the input field information includes a primary key column and a time column, and the time window pull pulls window data conforming to the calculation target based on data in the primary key column and data in the time column;
and if the primary key column and the time column are both common columns, the optimizing unit optimizes the input of the time window pull to be the data in the primary key column and the data in the time column.
18. The apparatus of claim 15, wherein for the calculating step is a table splice that splices a left table and a right table based on splice conditions, wherein the left table is derived based on the batch request and the right table is derived based on tables stored in a database;
If the splicing condition is a common column, the optimizing unit optimizes the table to be spliced into a first substep and two second substeps;
wherein the first sub-step is for: performing first splicing based on the splicing conditions on the data of all the public columns in the left table and the data of all the public columns in the right table;
the two second sub-steps comprise: a right splicing sub-step and a left splicing sub-step;
the right stitching substep is for: performing a second splice based on the unique index of the right table for the output of the first sub-step and all non-common columns of data in the right table;
the left stitching substep is for: performing third splicing on the output of the right splicing substep and the data of all non-common columns in the left table;
and the output of the left splicing substep is used as the calculation result of the table splicing.
19. The apparatus of claim 15, wherein for the computing step is an aggregate computation, the aggregate computation producing a plurality of computation results, each computation result being a common computation result, a first non-common computation result, or a second non-common computation result;
wherein the public calculation result is calculated based on data in one or more public columns only; the first non-public calculation result is calculated based on data in one or more public columns and data in one or more non-public columns; the second non-common calculation result is calculated based on data in one or more non-common columns.
20. The apparatus of claim 19, wherein the optimization unit optimizes the aggregate calculation into a plurality of first sub-steps and a plurality of second sub-steps; wherein the plurality of first sub-steps includes at least one first common computing step and at least one second common computing step; the plurality of second sub-steps includes at least one first non-common computing step and at least one second non-common computing step;
the input of the first common calculation step is data in one or more common columns, and the output is common intermediate data;
the input of the second common calculation step is data in one or more common columns, and the output is a common calculation result;
the input of the first non-public computing step is public intermediate data output by one or more first public computing steps and data in one or more non-public columns, and the output is a first non-public computing result;
the input of the second non-public calculation step is data in one or more non-public columns, and the output is a second non-public calculation result.
21. The apparatus of claim 15, wherein the optimizing unit, after determining the information of the one or more common columns in the input field information, is further configured to:
Establishing a shared storage area;
storing common data into the shared memory area; wherein the common data includes: data in the common column, common intermediate data, and a common calculation result;
wherein the common intermediate data is intermediate data generated by a first common calculation step in the second execution plan, the first common calculation step being a step of generating intermediate data based on data in one or more of the common columns;
wherein the common calculation result is a calculation result generated by a second common calculation step in the second execution plan, and the second common calculation step is a step of generating a calculation result based on data in one or more of the common columns.
22. The apparatus of claim 21, wherein the computing unit is configured to:
and respectively executing the second execution plan for each data row in the batch request to obtain a calculation result of each data row, and merging the calculation results into a batch calculation result of the batch request.
23. The apparatus of claim 22, wherein the computing unit, when executing the second execution plan:
for a first common calculation step in the second execution plan:
Querying an index relation, wherein the index relation is the relation between the identification of the public calculation step and the storage position of the data generated in the public calculation step in the shared storage area;
if the identification of the first public computing step is queried, determining a first storage position corresponding to the identification of the first public computing step from the index relation;
based on the first storage location, data is retrieved from the shared storage area as common intermediate data resulting from the first common computing step.
24. The apparatus of claim 23, wherein the computing unit is further configured to:
if the identification of the first public calculation step is not queried, generating public intermediate data through the first public calculation step based on the data in the public column;
and storing the public intermediate data generated by the first public calculation step into the shared storage area, and recording the relation between the identification of the first public calculation step and the storage position in the index relation.
25. The apparatus of claim 22, wherein the computing unit, when executing the second execution plan:
for a second common calculation step in the second execution plan:
Querying an index relation, wherein the index relation is the relation between the identification of the public calculation step and the storage position of the data generated in the public calculation step in the shared storage area;
if the identification of the second public computing step is queried, determining a second storage position corresponding to the identification of the second public computing step from the index relation;
based on the second storage location, data is retrieved from the shared storage area as a common calculation result produced by the second common calculation step.
26. The apparatus of claim 25, wherein the computing unit is further configured to:
if the identification of the second public calculation step is not queried, generating a public calculation result through the second public calculation step based on the data in the public column;
and storing the public calculation result generated by the second public calculation step into the shared storage area, and recording the relation between the identification of the second public calculation step and the storage position in the index relation.
27. The apparatus of any of claims 23 to 26, wherein the computing unit querying an index relationship comprises:
and inquiring the index relation through a pre-established public data cache interface.
28. The apparatus of any of claims 21 to 26, wherein the computing unit storing common data into the shared storage area comprises:
storing the common data into the shared storage area through a pre-established data encoding storage interface.
29. An electronic device, comprising: a processor and a memory;
the processor is adapted to perform the steps of the method according to any of claims 1 to 14 by invoking a program or instruction stored in the memory.
30. A non-transitory computer readable storage medium storing a program or instructions that cause a computer to perform the steps of the method of any one of claims 1 to 14.
CN202011553734.4A 2020-12-24 2020-12-24 Batch request processing method and device for feature calculation, electronic equipment and storage medium Active CN112597213B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202311562303.8A CN117555944A (en) 2020-12-24 2020-12-24 Batch request processing method and device for feature calculation, electronic equipment and storage medium
CN202011553734.4A CN112597213B (en) 2020-12-24 2020-12-24 Batch request processing method and device for feature calculation, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011553734.4A CN112597213B (en) 2020-12-24 2020-12-24 Batch request processing method and device for feature calculation, electronic equipment and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202311562303.8A Division CN117555944A (en) 2020-12-24 2020-12-24 Batch request processing method and device for feature calculation, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112597213A CN112597213A (en) 2021-04-02
CN112597213B true CN112597213B (en) 2023-11-10

Family

ID=75202426

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311562303.8A Pending CN117555944A (en) 2020-12-24 2020-12-24 Batch request processing method and device for feature calculation, electronic equipment and storage medium
CN202011553734.4A Active CN112597213B (en) 2020-12-24 2020-12-24 Batch request processing method and device for feature calculation, electronic equipment and storage medium

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202311562303.8A Pending CN117555944A (en) 2020-12-24 2020-12-24 Batch request processing method and device for feature calculation, electronic equipment and storage medium

Country Status (1)

Country Link
CN (2) CN117555944A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020088701A (en) * 2001-05-19 2002-11-29 (주)가야테크놀로지 Electronic commerce method and system using internet
CN109146559A (en) * 2018-08-01 2019-01-04 夏颖 A kind of commodity purchase guiding system for Commercial Complex
CN109325808A (en) * 2018-09-27 2019-02-12 重庆智万家科技有限公司 Demand for commodity prediction based on Spark big data platform divides storehouse planing method with logistics
CN110502579A (en) * 2019-08-26 2019-11-26 第四范式(北京)技术有限公司 The system and method calculated for batch and real-time characteristic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cainiao - demand forecasting and warehouse allocation planning solution; limeiyang; github; main text, pages 1-10 *

Also Published As

Publication number Publication date
CN117555944A (en) 2024-02-13
CN112597213A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
US11416268B2 (en) Aggregate features for machine learning
US11531926B2 (en) Method and apparatus for generating machine learning model by using distributed computing framework
CN109670161B (en) Commodity similarity calculation method and device, storage medium and electronic equipment
RU2524855C2 (en) Extensibility for web-based diagram visualisation
EP4258132A1 (en) Recommendation method, recommendation network, and related device
Yue et al. LlamaRec: Two-stage recommendation using large language models for ranking
CN109992659B (en) Method and device for text sorting
CN108509179B (en) Method for detecting human face and device for generating model
CN112597213B (en) Batch request processing method and device for feature calculation, electronic equipment and storage medium
CN112269915B (en) Service processing method, device, equipment and storage medium
CN110827078B (en) Information recommendation method, device, equipment and storage medium
CN110659701B (en) Information processing method, information processing apparatus, electronic device, and medium
US20200242702A1 (en) Capital chain information traceability method, system, server and readable storage medium
US11734448B2 (en) Method for encrypting database supporting composable SQL query
CN115239442B (en) Method and system for popularizing internet financial products and storage medium
WO2023173550A1 (en) Cross-domain data recommendation method and apparatus, and computer device and medium
Khan et al. On uniform convergence of undiscounted optimal programs in the Mitra–Wan forestry model: the strictly concave case
CN114329280A (en) Method and device for resource recommendation, storage medium and electronic equipment
US20210357955A1 (en) User search category predictor
CN117407388A (en) Idempotent control method and device and electronic equipment
CN114492844A (en) Method and device for constructing machine learning workflow, electronic equipment and storage medium
CN112347242B (en) Digest generation method, device, equipment and medium
US11544240B1 (en) Featurization for columnar databases
CN111859939A (en) Text matching method and system and computer equipment
CN106484747A (en) A kind of webpage item recommendation method based on alternative events and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant