CN115774557A - Feature engineering compiling optimization method and device - Google Patents

Feature engineering compiling optimization method and device

Info

Publication number
CN115774557A
Authority
CN
China
Prior art keywords
feature
feature configuration
input data
node
data set
Prior art date
Legal status
Pending
Application number
CN202211465237.8A
Other languages
Chinese (zh)
Inventor
时承凯
高圣巍
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202211465237.8A
Publication of CN115774557A
Legal status: Pending

Abstract

The application discloses a feature engineering compilation optimization method and device. The method comprises the following steps: constructing a dependency directed graph based on the dependency relationships among feature configurations; determining a first feature configuration that meets a preset constant condition, and performing constant propagation processing on the other feature configurations according to the first feature configuration and the dependency directed graph to obtain the processed feature configurations; and sequentially reading the processed feature configurations according to the dependency directed graph and calculating the feature configurations according to the input data set. The input data set comprises a first input data set, and if the value of the first input data set remains unchanged within a preset first time, the feature configuration of the first input data set is calculated according to the value of the first input data set. By constructing the dependency directed graph, applying constant propagation processing and the like, the feature configurations are optimized and the processing efficiency is improved.

Description

Feature engineering compiling optimization method and device
Technical Field
The application relates to the technical field of computers, in particular to a feature engineering compiling optimization method and device.
Background
Feature engineering is the process of using domain knowledge and skills to process data and convert raw data into feature vectors, so that the feature vectors can work better in a machine learning algorithm. Data and features determine the upper limit of what machine learning can achieve; therefore, feature engineering is particularly important in machine learning.
In the prior art, feature engineering generally only acquires data, performs basic cleaning of dirty data, supplements missing values and the like, and does not consider optimization when converting the data into feature vectors, so the calculation efficiency is low. Therefore, a feature engineering compilation optimization method is needed to improve the computing efficiency of feature engineering.
Disclosure of Invention
In view of the above problems, embodiments of the present application are proposed to provide a feature engineering compilation optimization method and apparatus that overcome or at least partially solve the above problems.
According to a first aspect of embodiments of the present application, there is provided a feature engineering compilation optimization method, including:
constructing a dependency directed graph based on the dependency relationship among the feature configurations;
determining a first feature configuration which meets a preset constant condition, and performing constant propagation processing on other feature configurations according to the first feature configuration and the dependency directed graph to obtain the processed feature configuration;
sequentially reading the processed feature configurations according to the dependency directed graph, and calculating the feature configurations according to the input data set; the input data set comprises a first input data set, and if the value of the first input data set is kept unchanged within a preset first time, the feature configuration of the first input data set is calculated according to the value of the first input data set.
Optionally, constructing the dependency directed graph based on the dependency relationship between the feature configurations further comprises:
configuring each feature as a node of a dependency directed graph;
traversing each feature configuration, acquiring a dependent feature configuration of each feature configuration, and constructing a directed edge of a dependent directed graph according to a dependency relationship between each feature configuration and the corresponding dependent feature configuration to obtain the dependent directed graph; wherein the directed edge points from the dependent feature configuration to the feature configuration.
Optionally, determining that the first feature configuration meets the preset constant condition further includes:
judging whether the feature configuration meets a preset constant condition or not; the preset constant condition comprises that the value of the input data set corresponding to the feature configuration is kept unchanged within a preset second time;
if yes, determining the feature configuration as a first feature configuration, and replacing the first feature configuration with the value of the corresponding input data set.
Optionally, according to the dependency directed graph, performing constant propagation processing on other feature configurations according to the first feature configuration further includes:
according to the dependency directed graph, performing constant propagation processing on other feature configurations having a dependency relationship with the first feature configuration; the dependency relationship includes a direct dependency relationship or an indirect dependency relationship.
Optionally, according to the dependency directed graph, the constant propagation processing on the other feature configurations having a dependency relationship with the first feature configuration further includes:
traversing each node in the dependency directed graph, and determining a domination boundary of each node to obtain a static single assignment SSA;
according to the SSA, constant propagation processing is carried out on other feature configurations with dependency relationships by utilizing the first feature configuration according to a preset propagation rule.
Optionally, traversing each node in the dependency-directed graph, determining a dominant boundary of each node, and obtaining the static single-assignment SSA further includes:
S1, acquiring any node in the dependency directed graph, and judging whether the node has more predecessor nodes than a preset number; if yes, executing step S2; if not, cyclically executing step S1 until every node in the dependency directed graph has been traversed;
S2, acquiring the plurality of predecessor nodes of the node;
S3, for any predecessor node, setting the predecessor node as an execution node;
S4, judging whether the execution node is a dominating node of the node; if not, determining that the node belongs to the dominance boundary of the execution node, updating the execution node to be the dominating node of the execution node, and cyclically executing step S4 until the execution node is a dominating node of the node; if yes, cyclically executing step S3 until the plurality of predecessor nodes have been traversed;
and S5, analyzing the dependency directed graph, determining the branch-merge nodes in the dependency directed graph according to the dominance boundary of each node, and adding a preset assignment function to the feature configurations of the branch-merge nodes to obtain the SSA.
Optionally, according to the SSA, performing constant propagation processing on other feature configurations having dependencies by using the first feature configuration according to the preset propagation rule further includes:
and propagating the constant value of the first feature configuration to other feature configurations with dependency relations according to preset propagation rules to replace the first feature configuration according to the assignments in the SSA.
Optionally, sequentially reading the processed feature configurations according to the dependency digraph, and calculating the feature configurations according to the input data set further includes:
traversing from the initial node of the dependency directed graph according to the dependency directed graph to obtain the topological ordering of each feature configuration during calculation, so as to calculate the feature configuration according to the topological ordering and the input data set.
Optionally, traversing from a starting node of the dependency directed graph according to the dependency directed graph to obtain a topological ordering of each feature configuration during computation, so that computing each feature configuration according to the topological ordering further includes:
and sequentially and downwards performing depth-first traversal from the starting node depending on the directed graph to obtain the topological ordering of each feature configuration during calculation so as to calculate each feature configuration according to the topological ordering.
Optionally, if the value of the first input data set remains unchanged for a preset first time, calculating the feature configuration of the first input data set according to the value of the first input data set further includes:
if the value of the first input data set is kept unchanged within a preset first time, calculating the feature configuration of the first input data set according to the value of the first input data set according to topological sorting, caching the obtained calculation result, and freezing to stop repeated calculation of the feature configuration of the first input data set according to the topological sorting;
if the calculation of the feature configuration of the second input data set depends on the calculation result of the feature configuration of the first input data set, the value of the second input data set changes within a preset first time, the cached calculation result is obtained, and the feature configuration of the second input data set is calculated according to the value of the second input data set respectively according to the topological ordering.
According to a second aspect of the embodiments of the present application, there is provided a feature engineering compilation optimization apparatus, including:
the construction module is suitable for constructing a dependency directed graph based on the dependency relationship among the feature configurations;
the constant module is suitable for determining first feature configuration meeting a preset constant condition, and performing constant propagation processing on other feature configurations according to the dependency directed graph and the first feature configuration to obtain processed feature configuration;
the calculation module is suitable for reading the processed feature configurations in sequence according to the dependency directed graph and calculating the feature configurations according to the input data set; the input data set comprises a first input data set, and if the value of the first input data set remains unchanged within a preset first time, the feature configuration of the first input data set is calculated according to the value of the first input data set.
According to a third aspect of embodiments herein, there is provided a computing device comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the feature engineering compiling optimization method.
According to a fourth aspect of the embodiments of the present application, there is provided a computer storage medium having at least one executable instruction stored therein, where the executable instruction causes a processor to perform operations corresponding to the above feature engineering compilation optimization method.
According to the feature engineering compiling optimization method and device provided by the application, the dependency directed graph is constructed according to the dependency relationship among the feature configurations, and the constant propagation processing can be carried out by utilizing the first feature configuration which meets the preset constant condition on the basis of the dependency directed graph, so that the calculation is simplified, useless calculation is eliminated, and the calculation speed is increased. According to the dependency directed graph, the calculation can be carried out according to the dependency relationship sequence among the feature configurations, the previous feature configuration on which the feature configurations depend is guaranteed to be calculated first, and the calculation waiting time is avoided. Further, for the value of the first input data set which is kept unchanged within the preset first time, the characteristic configuration of the first input data set can be obtained through calculation according to the value of the first input data set, multiple times of calculation on the value of the first input data set is not needed, the calculation amount is greatly reduced, and the calculation efficiency is improved.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a feature engineering compilation optimization method according to one embodiment of the present application;
FIG. 2 illustrates a flow diagram of a feature engineering compilation optimization method according to another embodiment of the present application;
FIG. 3 illustrates a flow diagram for building static single assignments from a dependent directed graph;
FIG. 4 shows a schematic diagram of the semilattice used for constant propagation;
FIG. 5 shows a topological ordering diagram based on a feature configuration of a plurality of input data;
FIG. 6 is a block diagram of a feature engineering compilation optimization apparatus according to an embodiment of the present application;
FIG. 7 illustrates a block diagram of a computing device, according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
First, the noun terms referred to in one or more embodiments of the present application are explained.
Topological sorting: in the field of computer science, topological ordering or topological sequencing of a directed graph is a linear ordering of its nodes such that for each directed edge uv from node u to node v, u precedes v in the ordering; for example, nodes of a graph may represent tasks to be performed, edges may represent constraints that one task must execute before another task, and topological ordering is a valid order of tasks;
analyzing a data stream: a technique for gathering information on values computed by a computer program at different points. A Control Flow Graph (CFG) of a program is used to determine which parts of the program a value assignment to a variable may propagate. This information is typically used by compilers to optimize programs. A typical example of dataflow analysis is the computation of reach definitions;
constant propagation: in the field of computer science, sparse conditional constant propagation (sparse conditional propagation) is an optimization technique, which is commonly used in compilers optimized in static single-assignment form (SSA), and can remove some useless program code in programs and perform constant propagation, which is more powerful than dead code deletion and constant propagation.
FIG. 1 is a flow chart of a feature engineering compilation optimization method according to an embodiment of the present application, and as shown in FIG. 1, the method includes the following steps:
and S101, constructing a dependency directed graph based on the dependency relationship among the feature configurations.
In this embodiment, feature engineering is compiled and optimized. When feature vectors are converted based on an input data set, each feature configuration (feature conf) is analyzed, and the feature configurations are optimized according to various relationships among them, such as dependency relationships and calculation order. Before the feature vectors obtained by calculating the feature configurations are input to machine learning, the feature configurations are subjected to equivalence transformation, and equivalent substitution is performed by means such as constant propagation, so that the calculation of the feature configurations is reduced and the calculation efficiency of feature engineering compilation is greatly improved.
The dependency directed graph may be, for example, a DAG (Directed Acyclic Graph), and the directed edges in the DAG may be used to record the order of the dependency relationships. A feature configuration records its dependent feature configurations, and the feature configuration and its dependent feature configurations have a dependency relationship. For example, for feature configuration A and feature configuration B, if feature configuration A depends on feature configuration B, then the dependent feature configuration of feature configuration A is feature configuration B. According to the acquired dependency relationships among the feature configurations, topological sorting can be performed based on the dependency relationships and a DAG graph can be created. For feature configuration A and feature configuration B, a DAG graph may be established in which feature configuration A and feature configuration B are nodes, and a directed edge is created between them that points from feature configuration B to feature configuration A, i.e., the head node is feature configuration B and the tail node is feature configuration A. During subsequent processing of the feature configurations, feature configuration B can then be processed first according to the directed edge, and feature configuration A, which depends on feature configuration B, is processed afterwards. This avoids the situation where the calculation of feature configuration A has to wait for feature configuration B, and thus increases the processing speed.
For a plurality of feature configurations, each feature configuration can be traversed, and each node and directed edge of a dependent directed graph are constructed according to the dependency relationship among the feature configurations.
And S102, determining a first feature configuration meeting a preset constant condition, and performing constant propagation processing on other feature configurations according to the first feature configuration and the dependency directed graph to obtain the processed feature configuration.
The preset constant condition may be used to determine whether a feature configuration is a constant. For example, the preset constant condition includes that the value of the input data set corresponding to the feature configuration remains unchanged within a preset second time, where the preset second time may be set, for example, to the time range of the input data set acquired for feature engineering compilation, the time taken for the calculation of the input data set, and the like. For example, if the acquired input data set includes an occurrence time and the occurrence time is acquired in units of days, then within the same day the occurrence time in the input data set is a constant that is known not to change. The feature configuration corresponding to the occurrence time therefore meets the preset constant condition, it can be determined to be the first feature configuration, and the first feature configuration can be replaced with the value of the corresponding input data set, i.e., a constant value; for example, the occurrence time is November 9, 2022, a Wednesday (represented, for example, as 3). In other words, when determining the first feature configuration that meets the preset constant condition, the value of the input data set that remains unchanged within the preset second time may be determined from the input data set, the corresponding feature configuration may be determined, and the first feature configuration may be replaced with that value, thereby simplifying the calculation of the first feature configuration and improving the calculation speed.
After the first feature configuration is determined, the constant value of the first feature configuration may be propagated based on the dependency directed graph, and constant propagation processing may be performed on the other feature configurations. For example, if the constant value of the first feature configuration is 3 and another feature configuration involves a calculation on the first feature configuration, such as another feature configuration = the first feature configuration + 1, then 4 can be calculated directly and used in the constant propagation processing of that feature configuration, which reduces the amount of feature configuration calculation. Specifically, according to the dependency directed graph, starting from the node of the first feature configuration, a feature configuration may have a direct dependency relationship with the first feature configuration, e.g., the feature configuration of node a, pointed to by the directed edge from the node of the first feature configuration, directly depends on the first feature configuration and can undergo constant propagation processing according to the first feature configuration; or it may have an indirect dependency relationship with the first feature configuration, e.g., the feature configuration of a node b that depends on the first feature configuration only through intermediate nodes, and constant propagation processing can likewise be performed according to the first feature configuration.
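As a minimal illustration of this constant folding (a hypothetical Python sketch, not the claimed implementation; the expression table and the fold function are assumptions made for this example), suppose the first feature configuration has been replaced by the constant 3 and another feature configuration is defined as the first feature configuration + 1:

# Hypothetical sketch: fold constants through feature configurations that depend
# on a configuration already known to be constant.
feature_exprs = {
    "first": ("const", 3),          # first feature configuration, replaced by its constant value
    "other": ("add", "first", 1),   # other feature configuration = first + 1
}

def fold(name):
    kind, *args = feature_exprs[name]
    if kind == "const":
        return args[0]
    if kind == "add":
        left = fold(args[0])
        return None if left is None else left + args[1]
    return None  # not a constant; must be computed from the input data set at runtime

print(fold("other"))  # -> 4

Once fold reports a constant, the dependent configuration can itself be replaced by that constant, so no per-sample computation is needed for it.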
And step S103, sequentially reading the processed feature configuration according to the dependency digraph, and calculating the feature configuration according to the input data set.
After the above processing, the feature configurations of the nodes can be topologically sorted according to the order of the dependency relationships in the dependency directed graph. The topological ordering determines the calculation order of each node, so that the feature configurations can be calculated in that order and it can be determined which nodes' feature configurations have already been calculated and which no longer need to be calculated, and so on.
The calculation order of the feature configuration of each node is determined according to the dependency directed graph, and the feature configurations are calculated in that order according to the input data sets, such as input data sets of users, time, commodities and the like, to obtain the corresponding calculation results. The final calculation result represents, for example, a feature vector: if the input data set is a user, a time and a commodity, the resulting feature vector represents, for example, the operation of the user on the commodity at that time. This is not expanded upon here; the corresponding calculation result is obtained according to the actual implementation.
Further, the input data set includes a first input data set whose value remains unchanged within a preset first time. The preset first time may be, for example, a specified time range, the time range of the input data acquired for feature engineering compilation, the time spent on the input data calculation, and the like, which is not limited herein. The feature configuration of the first input data set is calculated according to the value of the first input data set, and this calculation can be performed only once. For example, among multiple input data sets such as a user's userid and a commodity's itemid, 1 userid and 10 itemid are input together into feature engineering for calculation. Since there is only 1 userid, its value remains unchanged within the preset first time and the userid is the first input data set; since there are multiple itemid, i.e., the itemid changes multiple times within the preset first time and corresponds to different itemid values, the itemid is the second input data set. The input may be 10 groups each consisting of 1 userid and 1 itemid, where the userid value is the same in every group and the itemid values differ; alternatively, only 1 userid and 10 different itemid may be input, and the form of input is not limited herein. During calculation, since the first input data set has only 1 userid value, the feature configurations related to the userid may be calculated only once. The calculation result can then be cached; when different itemid use the userid's calculation result, intermediate calculation result or final calculation result, the corresponding result can be obtained from the cache and used directly, so the userid does not need to be recalculated for each of the 10 commodities, which greatly reduces the amount of calculation. After the feature configuration of the first input data set is calculated from the first input data set and the obtained result is cached, the calculation related to the feature configuration of the first input data set is finished and will not be repeated; when the second input data set is calculated, the cached result can be used directly to complete the feature configuration calculation of the second input data set. This reduces the number of times the feature configuration of the first input data set is calculated, lets the result be fetched directly from the cache, and improves the overall calculation speed.
According to the feature engineering compiling and optimizing method provided by the application, the dependency directed graph is constructed according to the dependency relationship among the feature configurations, the constant propagation processing can be carried out by utilizing the first feature configuration which meets the preset constant condition based on the dependency directed graph, the calculation is simplified, useless calculation is removed, and the calculation speed is increased. According to the dependency directed graph, the calculation can be carried out according to the dependency relationship sequence among the feature configurations, the previous feature configuration on which the feature configurations depend is guaranteed to be calculated first, and the calculation waiting time is avoided. Further, for the value of the first input data set which is kept unchanged within the preset first time, the characteristic configuration of the first input data set can be obtained through calculation according to the value of the first input data set, multiple times of calculation on the value of the first input data set is not needed, the calculation amount is greatly reduced, and the calculation efficiency is improved.
FIG. 2 is a flow chart of a feature engineering compilation optimization method according to an embodiment of the present application, and as shown in FIG. 2, the method includes the steps of:
step S201, constructing a dependency directed graph based on the dependency relationship among the feature configurations.
For the feature configurations, each feature configuration may first be used as a node of the dependency directed graph; for example, feature configurations A, B, C, D and E give a plurality of nodes of the dependency directed graph. Then each feature configuration is traversed and its dependent feature configurations are obtained, where a dependent feature configuration is determined according to the dependency relationship: for example, if feature configuration A depends on feature configuration B, the dependent feature configuration of feature configuration A is feature configuration B. According to the dependency relationship between each feature configuration and its corresponding dependent feature configurations, the directed edges of the dependency directed graph can be constructed, where a directed edge points from the dependent feature configuration to the feature configuration that uses it. For example, feature configuration B points to feature configuration A along the directed edge BA, so during calculation feature configuration B is calculated first and feature configuration A, which depends on it, is calculated afterwards. If a feature configuration has multiple dependent feature configurations, for example the dependent feature configurations of feature configuration B are C and D, then when the directed edges are constructed, feature configuration C and feature configuration D each point to feature configuration B and 2 directed edges are constructed; or, if the dependent feature configuration of both feature configuration C and feature configuration D is feature configuration E, then feature configuration E points to feature configuration C and to feature configuration D and 2 directed edges are constructed. That is, when a feature configuration has multiple dependent feature configurations, or multiple feature configurations share the same dependent feature configuration, multiple directed edges may be constructed. Traversal may begin with any feature configuration, for example in the order of feature configurations A, B, C, D, E, as long as every feature configuration is traversed, and the dependency directed graph constructed jointly by the nodes and the directed edges is obtained. Taking the dependency directed graph as a DAG graph as an example, it may be constructed using the following code:
DAG graph = {}  -- first, set the DAG graph to be empty
for each feature configuration conf  -- traverse each feature configuration (feature conf)
    for each dependent feature configuration parent of conf  -- traverse the dependent feature configurations of conf
        add (conf, parent) to the DAG graph, forming a directed edge pointing from parent to conf  -- add each feature configuration together with its dependent feature configuration to the DAG graph as a directed edge
The above is an example of pseudocode; the specific implementation depends on the actual deployment and is not limited herein.
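A minimal Python sketch of the same construction is given below; the feature_confs dictionary is an invented example rather than data from the application, and it simply records, for each feature configuration, the names of the configurations it depends on:

# Hypothetical feature configurations and the configurations they depend on.
feature_confs = {
    "A": ["B"],        # feature configuration A depends on B
    "B": ["C", "D"],   # B depends on C and D
    "C": ["E"],
    "D": ["E"],
    "E": [],
}

# Directed edges point from the dependent (parent) configuration to the configuration
# that uses it, so parents are processed first during computation.
dag_edges = []
for conf, parents in feature_confs.items():
    for parent in parents:
        dag_edges.append((parent, conf))

print(dag_edges)  # [('B', 'A'), ('C', 'B'), ('D', 'B'), ('E', 'C'), ('E', 'D')]

Each (parent, conf) pair is a directed edge from the dependent feature configuration to the feature configuration that uses it, so dependencies appear before their dependents when the edges are later ordered.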
Step S202, determining a first feature configuration meeting a preset constant condition.
The preset constant condition includes that a value of the input data set corresponding to the feature configuration is kept unchanged within a preset second time, and the preset second time may be set to a time range for feature engineering compilation of the acquired input data, a time spent on input data calculation, and the like.
The input data set is judged according to the preset constant condition, and the first feature configuration whose input data set value remains unchanged within the preset second time is determined. Such an input data set has a fixed value, for example the date of the current day, and is not limited herein.
Step S203, traversing each node in the dependency directed graph, determining the domination boundary of each node, and obtaining the static single assignment SSA.
After the first feature configuration is determined, prior to constant propagation processing, SSA (Static Single-Assignment) is constructed. The SSA is an intermediate representation form, and the static single assignment means that each name involved is assigned once in the SSA, so that constant propagation based on the SSA is facilitated, the efficiency of sparse analysis can be improved by the SSA, and the constant propagation efficiency is improved.
The SSA can be constructed based on the dependency directed graph: the dependency directed graph is analyzed, and the assignments in each node are converted into single-assignment form, so that every variable name involved in a node is assigned only once in the SSA. For example, when the nodes are traversed and contain y = 1, y = 2, x = y and the like, the SSA converts them into y1 = 1, y2 = 2, x1 = y2 and the like, and each variable name is assigned only once. Besides traversing the dependency directed graph, the dominance boundary of each node also needs to be found. If the predecessor node of node b in the dependency directed graph is node a and node b cannot be reached by any other route, then node a is a dominating node of node b; if node c is a predecessor node of node d but node c does not dominate node d because node d has another predecessor node e, then node d belongs to the dominance boundary of node c and of node e. The minimal SSA can be determined from the dominance boundaries.
Specifically, the dependency directed graph is traversed to find the dominance boundaries of the nodes and determine the SSA; as shown in FIG. 3, the following steps are executed:
step S301, traversing any node in the dependency digraph, and judging whether the nodes have a plurality of prepositive nodes with the number larger than a preset number; if yes, go to step S302; if not, other nodes in the dependency directed graph are continuously traversed until each node is traversed.
The traversal may be started from any node, as it is necessary to traverse each node in the dependency directed graph. Acquiring any node in the dependence directed graph, firstly judging whether the node has a plurality of preposed nodes with the number larger than the preset number, if the preset number is set to be 1, judging whether the node has 2 or more than 2 preposed nodes, wherein the preposed nodes are head nodes of directed edges, if so, executing a step S302, and continuously judging the node; if not, namely the node only has one dominant node, traversing the dependency directed graph to obtain another node and continuing judging until traversing is completed for each node in the dependency directed graph.
Step S302, the plurality of predecessor nodes of the node are obtained.
When the node is judged to have more predecessor nodes than the preset number, the plurality of predecessor nodes of the node are acquired.
Step S303, the plurality of predecessor nodes are traversed, and for any predecessor node, the predecessor node is set as the execution node.
Here, the plurality of predecessor nodes need to be traversed and judged one by one; for any predecessor node, that predecessor node is set as the execution node.
Step S304, it is judged whether the execution node is a dominating node of the node.
It is judged whether the execution node is a dominating node of the node obtained in step S301; if yes, step S306 is executed to judge whether every predecessor node has been traversed; if not, step S305 is executed.
Step S305, determining that the node belongs to the dominance boundary of the execution node, updating the execution node to be the dominating node of the execution node, and cyclically executing step S304 until the execution node is a dominating node of the node.
It can be determined that the node belongs to the dominance boundary of the execution node, and the node is added to the dominance boundary of the execution node. The execution node is then updated to be the dominating node of the original execution node, and it is judged again whether the updated execution node is a dominating node of the node; if not, the node is added to the dominance boundary of the updated execution node, the execution node continues to be updated to its dominating node, and the judgment is repeated, cyclically executing step S304 until the execution node is judged to be a dominating node of the node. Then another predecessor node is obtained and judged, until all predecessor nodes have been traversed. For example, node 1 points to node 2 and node 3, and node 2 and node 3 point to node 4. The predecessor nodes of node 4 include node 2 and node 3. Taking node 3 as the execution node, it is judged that node 4 belongs to the dominance boundary of node 3; the execution node is updated to node 1, and node 1 is then judged to be a dominating node of node 4. Taking node 2 as the execution node, it is likewise judged that node 4 belongs to the dominance boundary of node 2; the execution node is updated to node 1, and node 1 is again judged to be a dominating node of node 4. Through this cyclic judgment, the traversal of each predecessor node is completed and the dominance boundaries of the nodes are obtained. For illustration, taking the dependency directed graph as a DAG graph as an example, the following code may be used:
for each node b in the DAG graph
    let the dominance boundary of b be the empty set {}
for each node b in the DAG graph
    if the number of predecessor nodes of b > 1
        for each predecessor node p of b
            runner := p
            while runner != a dominating node of b
                dominance boundary of runner := dominance boundary of runner ∪ {b}
                runner := the dominating node of runner
The above is pseudo code, and the specific execution is set according to the implementation, and is not limited here.
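For readers who prefer an executable form, the following Python sketch reproduces the same loop under the assumption that the immediate dominating node of every node has already been computed; the small graph (node 1 pointing to nodes 2 and 3, which both point to node 4) and the idom mapping are illustrative only and not taken from the application:

# Hypothetical graph: node 1 -> nodes 2 and 3, nodes 2 and 3 -> node 4.
predecessors = {1: [], 2: [1], 3: [1], 4: [2, 3]}
idom = {2: 1, 3: 1, 4: 1}  # immediate dominating node of each node, assumed precomputed

dominance_boundary = {n: set() for n in predecessors}
for b, preds in predecessors.items():
    if len(preds) > 1:                    # only nodes with more than one predecessor need processing
        for p in preds:
            runner = p
            while runner != idom[b]:      # walk upwards until the dominating node of b is reached
                dominance_boundary[runner].add(b)
                runner = idom[runner]

print(dominance_boundary)  # {1: set(), 2: {4}, 3: {4}, 4: set()}

The printed result matches the example above: node 4 belongs to the dominance boundary of node 2 and of node 3.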
Step S306, judging whether every predecessor node has been traversed.
It is judged whether every predecessor node has been traversed; if yes, step S307 is executed; if not, step S303 is executed again, another predecessor node is acquired and set as the execution node, and the execution node is then judged.
Step S307, analyzing the dependency directed graph, determining the branch-merge nodes in the dependency directed graph according to the dominance boundary of each node, and adding a preset assignment function to the feature configurations of the branch-merge nodes to obtain the SSA.
After the dominance boundaries of the nodes are determined, the dependency directed graph is analyzed, and the branch-merge nodes in the dependency directed graph can be determined based on the dominance boundary of each node. For example, for node 1, node 2, node 3 and node 4 in the example of step S305, it can be determined that the branch-merge node in the dependency directed graph is node 4. A preset assignment function, such as a phi function, is added to the feature configuration of the branch-merge node; after the phi function is inserted, it is guaranteed that for any path arriving at the branch-merge node only one corresponding assignment takes effect, depending on which of the nodes before the branch-merge node was actually executed, and the SSA is finally obtained.
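For illustration only (a hypothetical fragment written in conventional SSA notation, not the application's own notation), inserting the phi function at a branch-merge node turns two competing assignments into a single well-defined one:

before SSA conversion:
    if cond: y = 1
    else: y = 2
    x = y
after SSA conversion (phi function inserted at the branch-merge node):
    if cond: y1 = 1
    else: y2 = 2
    y3 = phi(y1, y2)
    x1 = y3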
And step S204, according to the SSA, performing constant propagation processing on other feature configurations with dependency relationships by using the first feature configuration according to a preset propagation rule.
The dependency relationship includes a direct dependency relationship or an indirect dependency relationship, and the dominance boundaries used to build the SSA cover both kinds of dependency relationship. According to the assignments in the SSA, the constant value of the first feature configuration can be propagated, according to the preset propagation rule, to the other feature configurations having a dependency relationship, so as to replace the first feature configuration. Constant propagation may use a data flow analysis technique; for example, if the input data of the first feature configuration is a constant date, the other feature configurations to which the date can be propagated are replaced with constants, which is precisely data flow analysis.
Constant propagation in this embodiment can use, for example, SCCP (sparse conditional constant propagation), which can remove some useless program code in a node and perform constant propagation; it improves the efficiency of the sparse analysis based on the SSA and is able to detect control-flow edges that are never executed because of constant branch conditions. When the constant propagation processing is performed, the constants may be propagated according to preset propagation rules. The constant propagation process may refer to the semilattice used for constant propagation shown in FIG. 4: the value of each variable in constant propagation is an element of this semilattice, where c1, c2, c3 and c4 each represent a possible constant value. In the initial stage of the constant propagation analysis, the values of all variables are undetermined: "undefined" (the top element, rendered as ⊤) means that the value is not yet defined, while "variable" means that the value is not a constant or cannot be determined to be a constant. For the meet operator used in constant propagation, the preset propagation rules include, for example, the following cases:
undefined meet undefined = undefined
undefined meet variable = variable
constant 1 meet constant 2 = constant 1 (if constant 1 equals constant 2)
constant 1 meet constant 2 = variable (if constant 1 does not equal constant 2)
For example, other propagation rules may be set according to the implementation situation, and are not limited herein.
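The following Python sketch is one way to express this meet operator over the lattice values; it is an illustration of the rules above rather than the application's code, and the two cases not listed above (a constant meeting undefined, or a constant meeting variable) follow the usual SCCP convention:

UNDEFINED = "undefined"   # top of the semilattice: value not yet determined
VARIABLE = "variable"     # known not to be a constant

def meet(a, b):
    # any value other than UNDEFINED/VARIABLE is treated as a concrete constant
    if a == UNDEFINED:
        return b                      # undefined meet x = x
    if b == UNDEFINED:
        return a
    if a == VARIABLE or b == VARIABLE:
        return VARIABLE               # variable absorbs everything
    return a if a == b else VARIABLE  # equal constants stay constant, otherwise variable

assert meet(UNDEFINED, UNDEFINED) == UNDEFINED
assert meet(UNDEFINED, VARIABLE) == VARIABLE
assert meet(3, 3) == 3
assert meet(3, 4) == VARIABLE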
According to the SSA, constant propagation processing is performed on the other feature configurations based on the constant value of the first feature configuration. SCCP performs the constant propagation analysis on the SSA representation, evaluates the conditions of conditional branches, distinguishes executable branches from non-executable ones, and removes the influence of the non-executable branches, so the constant propagation is more accurate.
Wherein, steps S201 to S204 may be executed when the feature configuration is loaded, and the optimization of the feature configuration is completed before the feature configuration is calculated.
And S205, traversing from the initial node of the dependency directed graph according to the dependency directed graph to obtain the topological ordering of each feature configuration during calculation, so as to calculate the feature configuration according to the topological ordering and the input data set.
This step is performed when the feature configurations are calculated online. The feature configurations can be topologically sorted according to the dependency directed graph, so that when a certain node N is calculated, its predecessor nodes (i.e., the variables of the dependent feature configurations and the like) have already been calculated, and when the feature configurations are written, the order in which they appear in a particular file does not need to be considered.
Specifically, a depth-first traversal is performed downwards from the start node of the dependency directed graph: the next-level nodes of the start node are obtained first, and for each next-level node its own next-level nodes are obtained in turn, until a terminal node is reached; the operation is then repeated on another next-level node until all the next-level nodes of the start node have been processed, and the topological ordering of each feature configuration during calculation is obtained, so that each feature configuration can be calculated according to the topological ordering. Taking the dependency directed graph as a DAG graph as an example, the traversal may use the following code:
L <- empty list
S <- the set of start nodes of the DAG graph (nodes with no incoming edges)
while S is non-empty do
    take a node n out of S and remove n from S
    add n to L
    for each outgoing edge e of n, with the end node of the edge denoted m, do
        remove e from the DAG graph
        if m has no remaining incoming edges
            add m to S
The above is an example of pseudocode; the specific implementation depends on the actual deployment and is not limited herein.
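The same procedure can be written in Python as the sketch below; the topological_order function and the example edge list are assumptions made for illustration, reusing the (parent, child) edge convention from the earlier DAG sketch:

from collections import defaultdict, deque

def topological_order(edges, nodes):
    # edges are (parent, child) pairs: the parent feature configuration must be computed first
    out_edges = defaultdict(list)
    in_degree = {n: 0 for n in nodes}
    for parent, child in edges:
        out_edges[parent].append(child)
        in_degree[child] += 1

    order = []                                              # L: the computed ordering
    ready = deque(n for n in nodes if in_degree[n] == 0)    # S: start nodes with no incoming edge
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in out_edges[n]:
            in_degree[m] -= 1          # remove the edge n -> m
            if in_degree[m] == 0:      # m has no remaining incoming edges
                ready.append(m)
    return order

print(topological_order([("B", "A"), ("E", "C"), ("E", "D"), ("C", "B"), ("D", "B")],
                        ["A", "B", "C", "D", "E"]))  # -> ['E', 'C', 'D', 'B', 'A']

For the example edges the function returns ['E', 'C', 'D', 'B', 'A'], so every dependent feature configuration is computed before the configurations that use it.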
After the topological order of each feature configuration during calculation is obtained, the feature configurations are calculated according to the input data set in that topological order. Using the topological ordering, the calculation order of userid (the identification of a user) and itemid (the identification of a commodity) shown in FIG. 5 can be obtained: the variable val1 is calculated from userid and fea1 (feature 1) is then calculated from val1; the variable val2 is calculated from itemid, and fea2 (feature 2), fea3 (feature 3) and the variable val3 are calculated from val2; fea4 (feature 4) is then calculated from the variable val3 together with the variable val1 obtained from userid. The above is for illustration only, and the specific implementation is set according to the actual situation and is not limited herein.
Step S206, if the value of the first input data set remains unchanged within the preset first time, the feature configuration of the first input data set is calculated from the value of the first input data set according to the topological ordering, the obtained calculation result is cached, and the calculation is frozen so that the feature configuration of the first input data set is not repeatedly calculated according to the topological ordering; if the calculation of the feature configuration of the second input data set depends on the calculation result of the feature configuration of the first input data set and the value of the second input data set changes within the preset first time, the cached calculation result is obtained and the feature configuration of the second input data set is calculated from each value of the second input data set according to the topological ordering.
Consider multiple input data sets in which the values of some input data sets remain unchanged within the preset first time; for example, the same user operates on several different commodities, and the input data sets include the user's userid and the commodities' itemid, where the userid has 1 value and the itemid has several values. The value of userid remains unchanged within the preset first time, so userid is the first input data set; itemid takes several different values, i.e., its value changes within the preset first time, so itemid is the second input data set. As shown in FIG. 5, the feature configuration of the first input data set is calculated in topological order starting from the first input data set userid: for example, the variable val1 is calculated from userid, fea1 (feature 1) is then calculated, and the obtained calculation results are cached. The calculation results include intermediate calculation results, final calculation results and the like; when caching, the results can be chosen according to what the feature configuration calculation of the second input data set actually depends on. As shown in FIG. 5, the calculation of the second input data set itemid depends on the intermediate calculation result val1, so the calculated intermediate result val1 is cached, and the topological-order calculation related to the feature configuration of the first input data set userid is frozen, so that userid does not need to be recalculated to obtain val1 when the different itemid values of the second input data set are calculated. When the second input data set is calculated, the cached calculation result val1 is obtained, and the feature configuration of the second input data set is calculated from each value of the second input data set according to the topological ordering. In the above, the calculation of the feature configuration of the second input data set depends on the calculation result of the feature configuration of the first input data set. Taking 1 userid and 10 itemid as an example, if the topological-order calculation related to the feature configuration of the first input data set userid were not frozen, then obtaining fea1 (feature 1) for the 1 userid and the corresponding fea2 (feature 2), fea3 (feature 3) and fea4 (feature 4) for the 10 itemid would require 40 calculations: the 1 userid would be calculated 10 times, once for each itemid, producing 10 identical fea1 (feature 1), and the 10 itemid would produce 10 fea2 (feature 2), 10 fea3 (feature 3) and 10 fea4 (feature 4). In this embodiment, after the 1 userid is calculated once to obtain fea1 (feature 1), the calculation of the userid is frozen, and only the 30 calculations for the 10 itemid are executed to obtain 10 fea2 (feature 2), 10 fea3 (feature 3) and 10 fea4 (feature 4), which greatly reduces the amount of calculation and improves the calculation efficiency.
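A simplified Python sketch of this freeze-and-cache idea is given below; the per-feature functions, the cache layout, and the use of only val1, fea1, fea2 and fea4 are illustrative assumptions (FIG. 5 also involves val3 and fea3, omitted here for brevity):

# Hypothetical per-feature computations; the real feature configurations are defined elsewhere.
def compute_val1(userid):      return len(userid) * 7
def compute_fea1(val1):        return val1 + 1
def compute_val2(itemid):      return len(itemid) + 2
def compute_fea2(val2):        return val2 * 2
def compute_fea4(val1, val2):  return val1 + val2

cache = {}

def features_for(userid, itemid):
    # The userid-only part is computed once, cached, and then frozen for later itemid values.
    if userid not in cache:
        val1 = compute_val1(userid)
        cache[userid] = {"val1": val1, "fea1": compute_fea1(val1)}
    frozen = cache[userid]
    val2 = compute_val2(itemid)            # itemid changes, so this part is recomputed every time
    return {"fea1": frozen["fea1"],
            "fea2": compute_fea2(val2),
            "fea4": compute_fea4(frozen["val1"], val2)}

results = [features_for("user_1", f"item_{i}") for i in range(10)]  # userid part computed once, not 10 times

Across the 10 calls, compute_val1 and compute_fea1 run only once for the unchanged userid, while the itemid-dependent features are recomputed for every commodity, mirroring the reduction in calculation described above.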
According to the feature engineering compilation optimization method provided by the application, the dependency directed graph is constructed according to the dependency relationships among the feature configurations, the dominance boundaries of the nodes are determined based on the dependency directed graph to obtain the minimal SSA, and the SSA is used to perform constant propagation processing with the first feature configuration that meets the preset constant condition, so that assignments are simplified in advance, useless calculation can be reduced, the amount of calculation is greatly reduced and the calculation speed is improved. According to the dependency directed graph, a depth-first traversal yields the topological ordering of the feature configurations, which guarantees during calculation that a feature configuration that comes earlier in the ordering, i.e., one that is depended on, is calculated first, avoiding waiting during calculation. For the first input data set whose value remains unchanged within the preset first time, the feature configuration of the first input data set can be calculated from its value, the calculation result is cached, and the calculation related to the feature configuration of the first input data set is frozen, so that the first input data set is not repeatedly calculated and the second input data set can be calculated directly using the cached result, which greatly reduces the amount of calculation and improves the calculation efficiency.
Fig. 6 shows a schematic structural diagram of a feature engineering compilation optimization device according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:
a construction module 610 adapted to construct a dependency directed graph based on dependency relationships among feature configurations;
the constant module 620 is adapted to determine a first feature configuration meeting a preset constant condition, and perform constant propagation processing on other feature configurations according to the dependency directed graph and the first feature configuration to obtain the processed feature configuration;
a calculation module 630, adapted to sequentially read the processed feature configurations according to the dependency directed graph, and calculate the feature configurations according to the input data set; the input data set comprises a first input data set, and if the value of the first input data set remains unchanged within a preset first time, the feature configuration of the first input data set is calculated according to the value of the first input data set.
Optionally, the building block 610 is further adapted to:
configuring each feature as a node of a dependency directed graph;
traversing each feature configuration, acquiring a dependent feature configuration of each feature configuration, and constructing a directed edge of a dependent directed graph according to a dependency relationship between each feature configuration and the corresponding dependent feature configuration to obtain the dependent directed graph; wherein the directed edge points from the dependent feature configuration to the feature configuration.
Optionally, the constant module 620 is further adapted to:
judging whether the feature configuration meets a preset constant condition or not; the preset constant condition comprises that the value of the input data set corresponding to the feature configuration is kept unchanged within a preset second time;
if yes, determining the feature configuration as a first feature configuration, and replacing the first feature configuration with the value of the corresponding input data set.
Optionally, the constant module 620 is further adapted to:
according to the dependency directed graph, performing constant propagation processing on other feature configurations having a dependency relationship with the first feature configuration; the dependency relationship includes a direct dependency relationship or an indirect dependency relationship.
Optionally, the constant module 620 is further adapted to:
traversing each node in the dependency directed graph, and determining a domination boundary of each node to obtain a static single assignment SSA;
according to the SSA, constant propagation processing is carried out on other feature configurations with dependency relationships by utilizing the first feature configuration according to a preset propagation rule.
Optionally, the constant module 620 is further adapted to:
s1, acquiring any node in a dependence directed graph, and judging whether a plurality of preposed nodes with the number larger than a preset number exist in the node; if yes, executing step S2; if not, circularly executing the step S1 until each node in the dependency directed graph is traversed;
s2, acquiring a plurality of front nodes of the nodes;
s3, aiming at any preposed node, setting the preposed node as an execution node;
s4, judging whether the execution node is a dominant node of the node; if not, determining that the node is the dominant boundary of the execution node, updating the execution node to be the dominant node of the execution node, and executing the step S4 in a circulating manner until the execution node is the dominant node of the node; if yes, circularly executing the step S3 until traversing a plurality of front nodes;
and S5, analyzing the dependence directed graph, determining branch and confluence nodes in the dependence directed graph according to the domination boundary of each node, and adding a preset assignment function to the feature configuration of the branch and confluence nodes to obtain the SSA.
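Steps S1 to S4 correspond to the classic dominance-frontier computation that walks each predecessor of a multi-predecessor node up the dominator tree until it reaches the node's immediate dominator. The sketch below assumes the immediate-dominator map `idom` is already available and only hints at step S5 in a comment; it illustrates the general technique, not the application's exact procedure.

```python
def dominance_frontiers(predecessors, idom):
    df = {node: set() for node in predecessors}
    for node, preds in predecessors.items():
        if len(preds) < 2:                 # S1: only nodes with multiple predecessors
            continue
        for runner in preds:               # S2/S3: take each predecessor as the execution node
            while runner != idom[node]:    # S4: stop at the node's immediate dominator
                df[runner].add(node)       # the node lies on the frontier of the runner
                runner = idom[runner]      # move the execution node up the dominator tree
    return df

# Step S5 would then insert the preset assignment function at the confluence
# (join) nodes appearing in these frontiers to obtain the SSA form.

# Example diamond graph: entry -> a, entry -> b, a -> join, b -> join.
preds = {"entry": [], "a": ["entry"], "b": ["entry"], "join": ["a", "b"]}
idom = {"entry": "entry", "a": "entry", "b": "entry", "join": "entry"}
print(dominance_frontiers(preds, idom))
# {'entry': set(), 'a': {'join'}, 'b': {'join'}, 'join': set()}
```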
Optionally, the constant quantization module 620 is further adapted to:
according to the assignments in the SSA form, propagating the constant value of the first feature configuration to the other feature configurations having a dependency relationship according to the preset propagation rule, so as to replace the first feature configuration.
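A minimal sketch of the propagation itself, under the assumption that each not-yet-constant feature configuration can be modelled as a function of named dependencies: once every input of a configuration is a known constant, its value is folded and can be propagated further. The data model is an assumption for the example only.

```python
def propagate_constants(configs, constants):
    """configs: name -> (dependency names, computing function); constants: known values."""
    resolved = dict(constants)
    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for name, (deps, fn) in configs.items():
            if name in resolved:
                continue
            if all(d in resolved for d in deps):                   # every input is constant
                resolved[name] = fn(*(resolved[d] for d in deps))  # fold the assignment
                changed = True
    return resolved

# Example: a and b are first feature configurations, so c = a + b folds to a constant.
configs = {"c": (["a", "b"], lambda a, b: a + b)}
print(propagate_constants(configs, {"a": 2, "b": 3}))  # {'a': 2, 'b': 3, 'c': 5}
```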
Optionally, the calculation module 630 is further adapted to:
and traversing the dependency directed graph from its starting node to obtain the topological ordering of the feature configurations for calculation, so as to calculate the feature configurations according to the topological ordering and the input data set.
Optionally, the calculation module 630 is further adapted to:
and performing a depth-first traversal downward from the starting node of the dependency directed graph to obtain the topological ordering of the feature configurations for calculation, so as to calculate each feature configuration according to the topological ordering.
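One common way to obtain such an ordering, used here as an assumption rather than as the application's exact procedure, is a post-order depth-first traversal followed by a reversal, which places every feature configuration after the configurations it depends on.

```python
def topological_order(edges, start_nodes):
    order, visited = [], set()

    def dfs(node):
        visited.add(node)
        for dependent in edges.get(node, []):
            if dependent not in visited:
                dfs(dependent)
        order.append(node)              # post-order: dependents are appended first

    for node in start_nodes:
        if node not in visited:
            dfs(node)
    return list(reversed(order))        # dependencies end up before their dependents

# Reusing the graph from the earlier example: a and b feed c, and c feeds d.
edges = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(topological_order(edges, ["a", "b"]))  # ['b', 'a', 'c', 'd']
```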
Optionally, the calculation module 630 is further adapted to:
if the value of the first input data set remains unchanged within the preset first time, calculating the feature configuration of the first input data set from the value of the first input data set according to the topological ordering, caching the obtained calculation result, and freezing the calculation so that the feature configuration of the first input data set is not repeatedly calculated according to the topological ordering;
if the calculation of the feature configuration of the second input data set depends on the calculation result of the feature configuration of the first input data set, and the value of the second input data set changes within the preset first time, obtaining the cached calculation result and calculating the feature configuration of the second input data set from the value of the second input data set according to the topological ordering.
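A minimal sketch of the caching and freezing behaviour described above; keying the cache by feature name and signalling change with an input_changed flag are illustrative assumptions rather than details taken from the application.

```python
cache = {}

def compute_feature(name, compute_fn, inputs, input_changed):
    if name in cache and not input_changed:
        return cache[name]           # frozen: reuse the cached result, no recomputation
    result = compute_fn(*inputs)     # recompute only when the input actually changed
    if not input_changed:
        cache[name] = result         # cache results derived from the stable first input
    return result

# The first input data set is stable, so its feature is computed once and cached;
# the second input data set changes, so its feature is recomputed on every pass.
stable = compute_feature("f1", lambda x: x * 10, [3], input_changed=False)
mixed = compute_feature("f2", lambda a, b: a + b, [stable, 7], input_changed=True)
print(stable, mixed)  # 30 37
```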
For the descriptions of the modules, reference may be made to the corresponding descriptions in the method embodiments, which are not repeated here.
According to the feature engineering compilation optimization device provided by the application, a dependency directed graph is constructed according to the dependency relationships among the feature configurations, and constant propagation processing can be carried out on the dependency directed graph using the first feature configurations that meet the preset constant condition, which simplifies the calculation, removes useless calculation, and increases the calculation speed. Based on the dependency directed graph, calculation proceeds in the order of the dependency relationships among the feature configurations, so that each feature configuration is calculated only after the feature configurations it depends on, avoiding waiting during calculation. For the unchanged first input data, once the calculation result of its feature configuration has been obtained, the result is cached and the related calculations are frozen; the feature configurations of the changed second input data can then be calculated based on the cached result without recalculating the first input data, which greatly reduces the amount of calculation and improves the calculation efficiency.
The present application further provides a non-volatile computer storage medium having stored therein at least one executable instruction, and the executable instruction causes a processor to perform the feature engineering compilation optimization method in any of the method embodiments described above.
Fig. 7 is a schematic structural diagram of a computing device according to an embodiment of the present application, and the specific embodiment of the present application does not limit the specific implementation of the computing device.
As shown in fig. 7, the computing device may include: a processor (processor) 702, a Communications Interface 704, a memory 706, and a communication bus 708.
Wherein:
the processor 702, communication interface 704, and memory 706 communicate with each other via a communication bus 708.
A communication interface 704 for communicating with network elements of other devices, such as clients or other servers.
The processor 702 is configured to execute the program 710, and may specifically execute the relevant steps in the above embodiment of the feature engineering compilation optimization method.
In particular, the program 710 may include program code that includes computer operating instructions.
The processor 702 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 706 stores a program 710. The memory 706 may comprise a high-speed RAM memory and may also include a non-volatile memory, such as at least one disk memory.
The program 710 may be specifically adapted to cause the processor 702 to perform the feature engineering compilation optimization method of any of the method embodiments described above. For the specific implementation of each step in the program 710, reference may be made to the corresponding steps and the corresponding descriptions of the units in the above feature engineering compilation optimization embodiments, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not repeated here.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose preferred embodiments of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components according to the application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, or provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims (13)

1. A method of feature engineering compilation optimization, comprising:
constructing a dependency directed graph based on the dependency relationship among the feature configurations;
determining a first feature configuration which meets a preset constant condition, and performing constant propagation processing on other feature configurations according to the dependency directed graph and the first feature configuration to obtain a processed feature configuration;
sequentially reading the processed feature configuration according to the dependency directed graph, and calculating the feature configuration according to an input data set; the input data set comprises a first input data set, and if the value of the first input data set is kept unchanged within a preset first time, the feature configuration of the first input data set is calculated according to the value of the first input data set.
2. The method of claim 1, wherein the building a dependency directed graph based on dependencies between feature configurations further comprises:
configuring each feature as a node of the dependency directed graph;
traversing each feature configuration, acquiring a dependent feature configuration of each feature configuration, and constructing a directed edge of the dependency directed graph according to a dependency relationship between each feature configuration and the corresponding dependent feature configuration to obtain the dependency directed graph; wherein the directed edge points from the dependent feature configuration to the feature configuration.
3. The method of claim 1, wherein the determining that the first feature configuration meets a preset constant condition further comprises:
judging whether the feature configuration meets a preset constant condition or not; the preset constant condition comprises that the value of the input data set corresponding to the feature configuration is kept unchanged within a preset second time;
if yes, determining that the feature configuration is a first feature configuration, and replacing the first feature configuration with a value of the corresponding input data set.
4. The method of claim 1, wherein the constant propagation processing of other feature configurations according to the first feature configuration according to the dependency directed graph further comprises:
according to the dependency directed graph, performing constant propagation processing on other feature configurations having dependency relationship with the first feature configuration; the dependency relationship comprises a direct dependency relationship or an indirect dependency relationship.
5. The method of claim 4, wherein the constant propagation processing of other feature configurations having a dependency relationship with the first feature configuration according to the dependency directed graph further comprises:
traversing each node in the dependency directed graph and determining the dominance frontier of each node to obtain a static single assignment (SSA) form;
and according to the SSA, performing constant propagation processing on other feature configurations with dependency relationships by using the first feature configuration according to a preset propagation rule.
6. The method of claim 5, wherein the traversing each node in the dependency directed graph and determining the dominance frontier of each node to obtain the static single assignment (SSA) form further comprises:
S1, acquiring any node in the dependency directed graph and judging whether the node has more than a preset number of predecessor nodes; if yes, executing step S2; if not, repeating step S1 until every node in the dependency directed graph has been traversed;
S2, acquiring the plurality of predecessor nodes of the node;
S3, for any predecessor node, setting the predecessor node as an execution node;
S4, judging whether the execution node is the immediate dominator of the node; if not, determining that the node belongs to the dominance frontier of the execution node, updating the execution node to the immediate dominator of the execution node, and repeating step S4 until the execution node is the immediate dominator of the node; if yes, repeating step S3 until the plurality of predecessor nodes have been traversed;
and S5, analyzing the dependency directed graph, determining the branch confluence nodes in the dependency directed graph according to the dominance frontier of each node, and adding a preset assignment function to the feature configuration of the branch confluence nodes to obtain the SSA form.
7. The method of claim 5, wherein the performing, according to the SSA, constant propagation processing on other feature configurations having dependencies using the first feature configuration according to a preset propagation rule further comprises:
and according to the assignments in the SSA, propagating the constant value of the first feature configuration to other feature configurations with dependency relationship according to a preset propagation rule to replace the first feature configuration.
8. The method according to any one of claims 1-7, wherein the sequentially reading the processed feature configuration according to the dependency directed graph and calculating the feature configuration according to the input data set further comprises:
traversing from the initial node of the dependency directed graph according to the dependency directed graph to obtain the topological ordering of each feature configuration during calculation, so as to calculate the feature configuration according to the topological ordering and the input data set.
9. The method of claim 8, wherein the traversing from the starting node of the dependency directed graph to obtain the topological ordering of each feature configuration during calculation, so as to calculate the feature configurations according to the topological ordering, further comprises:
and sequentially and downwards performing depth-first traversal from the initial node of the dependency directed graph to obtain the topological ordering of each feature configuration during calculation so as to calculate each feature configuration according to the topological ordering.
10. The method of claim 8 or 9, wherein calculating the feature configuration of the first input data set from the values of the first input data set if the values of the first input data set remain unchanged for a preset first time further comprises:
if the value of the first input data set remains unchanged within the preset first time, calculating the feature configuration of the first input data set from the value of the first input data set according to the topological ordering, caching the obtained calculation result, and freezing the calculation so that the feature configuration of the first input data set is not repeatedly calculated according to the topological ordering;
if the calculation of the feature configuration of the second input data set depends on the calculation result of the feature configuration of the first input data set, and the value of the second input data set changes within the preset first time, obtaining the cached calculation result and calculating the feature configuration of the second input data set from the value of the second input data set according to the topological ordering.
11. A feature engineering compilation optimization device, comprising:
the construction module is suitable for constructing a dependency directed graph based on the dependency relationship among the feature configurations;
the constant module is suitable for determining first feature configuration meeting a preset constant condition, and performing constant propagation processing on other feature configurations according to the dependency directed graph and the first feature configuration to obtain processed feature configurations;
the calculation module is suitable for sequentially reading the processed feature configuration according to the dependency directed graph and calculating the feature configuration according to an input data set; the input data set comprises a first input data set, and if the value of the first input data set is kept unchanged within a preset first time, the characteristic configuration of the first input data set is calculated according to the value of the first input data set.
12. A computing device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the feature engineering compiling optimization method according to any one of claims 1-10.
13. A computer storage medium having stored therein at least one executable instruction that causes a processor to perform operations corresponding to the feature engineering compilation optimization method of any of claims 1-10.