CN113688191B - Feature data generation method, electronic device, and storage medium - Google Patents

Feature data generation method, electronic device, and storage medium

Info

Publication number
CN113688191B
Authority
CN
China
Prior art keywords
feature
entity
characteristic
value
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110996469.5A
Other languages
Chinese (zh)
Other versions
CN113688191A (en
Inventor
王林
王桐
邓玉明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202110996469.5A
Publication of CN113688191A
Application granted
Publication of CN113688191B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a feature data generation method, an electronic device, a computer storage medium, and a program product. The feature data generation method includes: acquiring, according to a target entity for which feature data is to be generated, all feature paradigms corresponding to the target entity from a feature paradigm set, where a feature paradigm describes entity feature generation rule information based on entity relationship data; selecting feature paradigms from all the acquired feature paradigms according to a selection strategy, and determining the feature value corresponding to each selected feature paradigm according to the entity feature generation rule information it describes; evaluating the feature value to obtain an evaluation result, and determining a feature value whose evaluation result is better than the evaluation results of the historical feature values as a valid feature value; and generating feature data of the target entity according to the valid feature values and the feature paradigms corresponding to them. The embodiments of the present application improve the efficiency of feature data generation.

Description

Feature data generation method, electronic device, and storage medium
Technical Field
Embodiments of the present application relate to the field of computer technologies, and in particular, to a feature data generating method, an electronic device, a computer storage medium, and a computer program product.
Background
Feature engineering is an essential part of machine learning: through a series of engineering techniques, better data features are selected from raw data to improve the training effect of a model.
With the development of computer technology, feature engineering has entered the automation stage. Current feature engineering automation involves two stages: stage-one original feature generation (generating a wide feature table from multiple feature tables) and stage-two high-order feature combination. However, on the one hand, a good feature engineering scheme still has to be obtained by drawing on expert domain knowledge and through continuous exploration and repeated trial and error, a process that accounts for more than 70% of the labor cost of algorithm development. On the other hand, most current feature engineering automation focuses on stage-two high-order feature combination, and automation schemes for stage-one original feature generation have received little attention.
Therefore, how to provide a low-cost scheme applicable to stage-one original feature generation is a problem to be solved.
Disclosure of Invention
In view of the above, an embodiment of the present application provides a feature data generating scheme to at least partially solve the above-mentioned problems.
According to a first aspect of the embodiments of the present application, there is provided a feature data generation method, including: acquiring, according to a target entity for which feature data is to be generated, all feature paradigms corresponding to the target entity from a feature paradigm set, where a feature paradigm describes entity feature generation rule information based on entity relationship data; selecting feature paradigms from all the acquired feature paradigms according to a selection strategy, and determining the feature value corresponding to each selected feature paradigm according to the entity feature generation rule information it describes; evaluating the feature value to obtain an evaluation result, and determining a feature value whose evaluation result is better than the evaluation results of the historical feature values as a valid feature value; and generating feature data of the target entity according to the valid feature values and the feature paradigms corresponding to them.
According to a second aspect of the embodiments of the present application, there is provided a feature data generating apparatus, including: an acquisition module, configured to acquire, according to a target entity for which feature data is to be generated, all feature paradigms corresponding to the target entity from a feature paradigm set, where a feature paradigm describes entity feature generation rule information based on entity relationship data; a first determining module, configured to select feature paradigms from all the acquired feature paradigms according to a selection strategy, and to determine the feature value corresponding to each selected feature paradigm according to the entity feature generation rule information it describes; a second determining module, configured to evaluate the feature value to obtain an evaluation result, and to determine a feature value whose evaluation result is better than the evaluation results of the historical feature values as a valid feature value; and a generation module, configured to generate feature data of the target entity according to the valid feature values and the feature paradigms corresponding to them.
According to a third aspect of the embodiments of the present application, there is provided an electronic device, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction that causes the processor to perform the operations corresponding to the method according to the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect.
According to a fifth aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions for instructing a computing device to perform the operations corresponding to the method as described in the first aspect.
According to the feature data generation scheme provided by the embodiments of the present application, when feature data corresponding to an entity is generated, on the one hand, processing is based on feature paradigms. A feature paradigm describes entity feature generation rule information generated from entity relationship data, that is, the feature paradigms describe all possible ways of generating the feature data of the target entity. The entity relationship data therefore does not need to be combined and sorted out manually, which greatly reduces the cost of feature data generation and makes the scheme well suited to stage-one feature data generation. On the other hand, only the feature values whose evaluation results are better than those of the historical feature values, namely the valid feature values, are considered when the feature data is generated. In this way, the feature values and feature paradigms that effectively characterize the target entity can be screened out efficiently, so that the feature data of the target entity is generated efficiently, the generation efficiency is greatly improved, and the computation cost of generating the feature data is reduced.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them.
FIG. 1A is a flow chart illustrating steps of a method for generating feature data according to a first embodiment of the present application;
FIG. 1B is a schematic diagram of an example of a scenario in the embodiment of FIG. 1A;
FIG. 2A is a flowchart illustrating steps of a method for generating feature data according to a second embodiment of the present application;
FIG. 2B is an exemplary diagram of an entity relationship constructed in one of the embodiments shown in FIG. 2A;
FIG. 2C is a schematic diagram of generating a feature paradigm of an entity based on the entity-relationship example graph shown in FIG. 2B;
FIG. 3A is a flowchart illustrating steps of a method for generating feature data according to a third embodiment of the present application;
FIG. 3B is a schematic diagram of a policy update process according to the embodiment shown in FIG. 3A;
FIG. 4 is a block diagram of a feature data generating apparatus according to a fourth embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application.
Detailed Description
To help a person skilled in the art better understand the technical solutions in the embodiments of the present application, these solutions are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments derived by a person skilled in the art from the embodiments of the present application shall fall within the scope of protection of the embodiments of the present application.
The implementation of the embodiments of the present application will be further described below with reference to the accompanying drawings.
Embodiment One
Referring to fig. 1A, a flowchart of steps of a feature data generation method according to a first embodiment of the present application is shown.
The feature data generation method of the present embodiment includes the steps of:
Step S102: According to the target entity for which feature data is to be generated, acquire all feature paradigms corresponding to the target entity from the feature paradigm set.
The feature paradigm is used for describing entity feature generation rule information based on entity relation data. A target entity may correspond to at least one feature paradigm, each feature paradigm uniquely describing an entity feature generation approach. The entity relationship data is used to describe the entities and relationships between the entities, and may be, for example, data in an entity relationship data set.
A machine learning model must be trained with training data to obtain a model that meets actual requirements, and processing raw data into the training data required by the model cannot be done without feature engineering. At present, most feature engineering is done manually, and effective automated means are lacking. To address this, the scheme provided by the embodiments of the present application aims to generate the feature data of the target entity automatically by means of automated feature engineering, for use as training data of a machine learning model.
To this end, in this step, after the target entity for which feature data is to be generated is determined, all feature paradigms corresponding to the target entity are selected from the feature paradigm set obtained in advance. Because a feature paradigm is generated from the original entity relationship data and characterizes entity feature generation rule information, all possible ways of generating the feature data of the target entity, together with information on the entity relationship data they require, can be obtained from the feature paradigms. It should be noted that the feature paradigm set may be generated by the executing party from the entity relationship data, or may be generated by a third party, in which case the executing party of the embodiments of the present application obtains it from the third party when needed.
In one possible manner, the entity feature generation rule information includes: entity feature generation path information, operator information between the entity nodes on the path corresponding to the entity feature generation path information, and attribute information of the entity nodes on the path other than the target entity. In practice, an entity relationship graph can be constructed from the entity relationship data, the entity feature generation rule information can be determined from the graph, and the corresponding feature paradigm can then be determined.
An exemplary feature paradigm is as follows:
f4 = Item.NORM(Item.Cate.MAX(Item.MEAN(Order.User.SUM(Order.item_quantity))))
where f4 is an example name identifier of the feature paradigm; Item, Cate, Order, and User are entity examples; NORM, MAX, MEAN, and SUM are operator examples; and item_quantity is an attribute information example. As the feature paradigm shows, the entities appearing from right to left uniquely correspond to a feature generation path in an entity relationship graph. For example, the item_quantity feature of the Order entity can be aggregated onto the Item entity through the path [Order -> User -> Order -> Item -> Cate -> Item].
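To make the right-to-left reading concrete, the following minimal Python sketch recovers the entity path of f4 from its text. The helper names `decompose` and `generation_path`, the `OPERATORS` set, the lowercase-attribute convention, and the rule that collapses adjacent repeated entities are illustrative assumptions; the application does not define a parser.

```python
import re

# Assumed operator vocabulary, taken from the f4 example only.
OPERATORS = {"NORM", "MAX", "MEAN", "SUM"}


def decompose(paradigm: str):
    """Split a paradigm string into entities, operators, and attributes."""
    tokens = re.findall(r"[A-Za-z_]+", paradigm)
    operators = [t for t in tokens if t in OPERATORS]
    # Assumption: attribute names are written in lowercase.
    attributes = [t for t in tokens if t.islower()]
    entities = [t for t in tokens if t not in OPERATORS and not t.islower()]
    return entities, operators, attributes


def generation_path(paradigm: str):
    """Read the entities right to left, collapsing adjacent repeats."""
    entities, _, _ = decompose(paradigm)
    path = []
    for entity in reversed(entities):
        if not path or path[-1] != entity:
            path.append(entity)
    return path


f4 = "Item.NORM(Item.Cate.MAX(Item.MEAN(Order.User.SUM(Order.item_quantity))))"
print(" -> ".join(generation_path(f4)))
```

For f4 this prints Order -> User -> Order -> Item -> Cate -> Item, which matches the path given in the text.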
As described above, each entity corresponds to at least one feature paradigm, and based on the feature paradigms corresponding to the entity, multiple forms of features can be generated for the entity.
By means of the feature paradigm, the processing process of the feature data can be regularly described, so that the efficiency of feature data generation is improved.
Corresponding feature data can be generated from the feature paradigms. However, the number of feature paradigms is determined by the combination of three quantities: the number of entity feature generation paths, the number of operators between the entity nodes on the path corresponding to the entity feature generation path information, and the number of pieces of attribute information of the entity nodes on the path other than the target entity. The number of feature paradigms is therefore usually large. Moreover, the feature data corresponding to the training data required by a machine learning model is composed of the feature values corresponding to a combination of some of the feature paradigms, and the number of possible combinations is larger still. If all feature paradigms were used in subsequent processing, the feature values of all of them would have to be computed in full and then combined, which leads to problems such as a large feature paradigm space, high feature computation and combination costs, and a heavy burden on further data processing.
For this reason, the solution provided in the embodiment of the present application further needs to perform the processing of the subsequent steps S104 to S108, which is described in detail below.
Step S104: Select feature paradigms from all the acquired feature paradigms according to a selection strategy, and determine the feature value corresponding to each selected feature paradigm according to the entity feature generation rule information it describes.
In the initial process (for example, when the feature data of the first entity is generated), the initial selection strategy may be to select everything or to select at random: either all feature paradigms corresponding to the target entity are selected for processing, or some of them are selected at random. The selection strategy can then be updated from the data produced during processing, so that when feature data is generated for a later entity, the feature paradigms corresponding to the current entity are selected in a targeted manner according to the updated strategy. The feature data generation process for each target entity can thus serve as a basis for updating the feature paradigm selection strategy for subsequent target entities, so that the selection strategy is continuously optimized and the efficiency and effectiveness of feature data generation improve as a whole.
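The two initial strategies described above (select all, or select at random) can be sketched as follows. The function name `select_paradigms` and its parameters `strategy`, `k`, and `seed` are hypothetical and chosen only for illustration; the updated, targeted strategy is not modeled here.

```python
import random


def select_paradigms(paradigms, strategy="all", k=None, seed=None):
    """Initial selection: take every paradigm, or a random subset of size k."""
    if strategy == "all":
        return list(paradigms)
    if strategy == "random":
        rng = random.Random(seed)  # seeded only to make runs repeatable
        size = len(paradigms) if k is None else min(k, len(paradigms))
        return rng.sample(list(paradigms), size)
    raise ValueError(f"unknown strategy: {strategy}")


candidates = ["f1", "f2", "f3", "f4"]
print(select_paradigms(candidates))                          # all four paradigms
print(select_paradigms(candidates, "random", k=2, seed=0))   # two of them
```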
In one possible manner, when the entity feature generation rule information includes entity feature generation path information, operator information between the entity nodes on the path corresponding to the entity feature generation path information, and attribute information of the entity nodes on the path other than the target entity, the feature value in this step can be generated as follows: following the path corresponding to the entity feature generation path information and the relationships between the entity nodes on that path, generate an operation result from the attribute information of the other entity nodes and the operations indicated by the operator information, and aggregate the operation result onto the target entity to obtain the feature value of the target entity under the current feature paradigm. Feature values are thus obtained efficiently and accurately.
For example, in the feature paradigm f4 described above, the item_quantity feature of the Order entity is aggregated onto the Item entity through the path [Order -> User -> Order -> Item -> Cate -> Item].
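As a rough illustration of aggregating along a path, the sketch below evaluates only the inner part of f4, Item.MEAN(Order.User.SUM(Order.item_quantity)), on a toy order table. The rows, the column names, and the helper `aggregate` are assumptions made for this example and do not come from the application.

```python
from collections import defaultdict
from statistics import mean

# Toy order table (made-up data): each order links one user and one item.
orders = [
    {"order": "o1", "user": "u1", "item": "i1", "item_quantity": 2},
    {"order": "o2", "user": "u1", "item": "i2", "item_quantity": 3},
    {"order": "o3", "user": "u2", "item": "i1", "item_quantity": 1},
]


def aggregate(rows, key, value, op):
    """Group rows by `key` and apply operator `op` to the `value` column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: op(v) for k, v in groups.items()}


# Order -> User: SUM of item_quantity per user.
user_total = aggregate(orders, "user", "item_quantity", sum)
# User -> Order: carry each user's total back onto that user's orders.
enriched = [dict(row, user_total=user_total[row["user"]]) for row in orders]
# Order -> Item: MEAN of the carried totals per item.
item_feature = aggregate(enriched, "item", "user_total", mean)
print(item_feature)
```

Each hop on the path is one group-and-apply step, so a longer paradigm such as f4 would simply chain further calls of the same kind.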
In the embodiments of the present application, to distinguish it from the finally generated feature data of the target entity, the aggregated feature data generated from a feature paradigm selected in this step is referred to as a feature value.
Step S106: Evaluate the feature value to obtain an evaluation result, and determine a feature value whose evaluation result is better than the evaluation results of the historical feature values as a valid feature value.
The evaluation may be performed in any suitable manner. To make the generated feature data better suited to the machine learning model that will use it, that machine learning model may serve as the evaluator of the feature values. The embodiments of the present application are not limited to this, however, and other suitable evaluation methods are also applicable.
When a target entity has multiple corresponding feature paradigms, it also has multiple corresponding feature values, and the historical feature values are the feature values that have already been evaluated. In the initial stage, the historical feature value is empty, and the corresponding evaluation result takes a setting corresponding to empty, for example empty as well, or 0; the embodiments of the present application do not limit this.
To make it easier to determine and manage the valid feature values, in one possible manner, a valid feature paradigm set and a valid feature value set may be configured in advance for the target entity, both of which are initially empty. That is, a valid feature paradigm set and a valid feature value set are constructed for the current target entity. When feature data needs to be generated for the next target entity, a new valid feature paradigm set and a new valid feature value set are constructed for that entity.
On this basis, in one possible manner, after feature paradigms are selected from all the feature paradigms corresponding to the target entity according to the selection strategy, the corresponding feature value is computed for each selected feature paradigm, and the feature value is input into an evaluator together with all the historical valid feature values in the valid feature value set to obtain an evaluation result for the feature value. If the evaluation result of the feature value is better than the evaluation results corresponding to all the historical valid feature values in the valid feature value set, the feature value is determined to be a valid feature value and may at the same time be placed into the valid feature value set. The specific index used as the evaluation result can be set by a person skilled in the art according to actual requirements. To make the evaluation result more objective and effective, it may optionally be set to the accuracy of the evaluator's output.
Suppose the current valid feature value set of target entity X is {V1, V2, V3} and the current feature value is V4. V1, V2, V3, and V4 are input together into the evaluator, yielding the evaluation result corresponding to V4, say an accuracy of 0.6. If none of the evaluation results corresponding to {V1, V2, V3} exceeds 0.6, V4 is a valid feature value; otherwise V4 is not a valid feature value, and V4 and its corresponding feature paradigm are discarded.
If V4 is a valid feature value, it is added to the valid feature value set, which is updated to {V1, V2, V3, V4}. Evaluation then continues with the feature values of the remaining feature paradigms. Suppose the next feature value is V5: V1, V2, V3, V4, and V5 are input together into the evaluator, yielding the evaluation result corresponding to V5, say 0.5. V5 is then discarded, because its evaluation result is not better than that of V4.
It can be seen that, when a valid feature paradigm set and a valid feature value set are used, once a feature value is determined to be valid, it can be added to the valid feature value set to update that set, and its corresponding feature paradigm can be added to the valid feature paradigm set to update that set. In other words, both sets are updated as the feature values are evaluated, and the historical valid feature values against which each current feature value is compared may differ from one evaluation to the next.
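The screening loop described above can be sketched in a few lines of Python. The function `screen` and the lookup-table evaluator are hypothetical stand-ins: a real evaluator would score a machine learning model on the combined feature values, and only the accuracies 0.6 (V4) and 0.5 (V5) come from the text, while the others are made up for the example.

```python
def screen(candidates, evaluate):
    """Keep a feature value only if it scores better than every kept one."""
    valid_paradigms, valid_values, valid_scores = [], [], []
    for paradigm, value in candidates:
        # The candidate is evaluated together with the historical valid values.
        score = evaluate(valid_values + [value])
        # With an empty history, all() is True and the first value is kept.
        if all(score > past for past in valid_scores):
            valid_paradigms.append(paradigm)
            valid_values.append(value)
            valid_scores.append(score)
    return valid_paradigms, valid_values


# Toy evaluator: looks up a made-up accuracy for the newest value only.
ACCURACY = {"V1": 0.3, "V2": 0.4, "V3": 0.55, "V4": 0.6, "V5": 0.5}


def evaluate(values):
    return ACCURACY[values[-1]]


candidates = [(f"f{i}", f"V{i}") for i in range(1, 6)]
paradigms, values = screen(candidates, evaluate)
print(values)  # ['V1', 'V2', 'V3', 'V4']
```

V5 is rejected because its score of 0.5 does not exceed the scores already recorded for V3 and V4, mirroring the example in the text.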
In the embodiments of the present application, unless otherwise specified, the terms "plurality of" and "a plurality of" mean two or more.
Step S108: Generate feature data of the target entity according to the valid feature values and the feature paradigms corresponding to them.
After all the valid feature values corresponding to the target entity and their corresponding feature paradigms (referred to as valid feature paradigms for short) have been determined, feature data can be generated for the target entity from these two parts of data. The valid feature paradigms correspond one-to-one with the valid feature values. That is, the valid feature values may be selected in the order of the valid feature paradigms, and the feature data of the target entity may be generated in that order.
In some cases, however, the number of valid feature paradigms and valid feature values may still be large. For this reason, in one possible manner, once it is determined that the valid feature value set has been updated with the feature values corresponding to all the selected feature paradigms, and that the valid feature paradigm set has been updated with all the selected feature paradigms, the valid feature paradigms to be used are determined from the updated valid feature paradigm set according to a preset rule; the corresponding valid feature values to be used are determined from the updated valid feature value set according to those valid feature paradigms; and the feature data of the target entity is generated from the valid feature values to be used. The preset rule may be set by a person skilled in the art according to actual requirements, for example selecting the top N according to the evaluation results corresponding to the feature values. This further reduces the data processing burden and improves the efficiency of feature data generation.
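One way to read the top-N preset rule is as a ranking of the valid entries by evaluation result. The sketch below assumes each valid entry is stored as a (paradigm, feature value, score) tuple; that layout, the function name `top_n`, and the scores are illustrative assumptions.

```python
import heapq


def top_n(scored, n):
    """Return the n entries with the highest evaluation result, best first."""
    return heapq.nlargest(n, scored, key=lambda entry: entry[2])


# (valid feature paradigm, valid feature value, evaluation result); made-up scores.
scored = [
    ("f1", "V1", 0.3),
    ("f2", "V2", 0.4),
    ("f3", "V3", 0.55),
    ("f4", "V4", 0.6),
]
selected = top_n(scored, 2)
print([paradigm for paradigm, _, _ in selected])  # ['f4', 'f3']
```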
The above process is illustrated below using a sales forecasting scenario for a commodity as an example, as shown in FIG. 1B.
Suppose the sales of a commodity in the coming month are to be predicted. Historical sales data for the commodity, including consumer data, order data, store data, and so on, is first obtained from one or more e-commerce platforms on which it is sold. The historical sales data is preprocessed (data normalization and denoising), and entity feature data is then generated. The generation process includes: generating entity relationship data from the relationships among the three entities (consumer, order, and store) and their corresponding data; constructing an entity relationship graph from the entity relationship data; generating all feature paradigms corresponding to all entities from the entity relationship graph; for each entity, acquiring all of its corresponding feature paradigms; selecting some of those feature paradigms according to the selection strategy and obtaining their feature values; evaluating the feature values one by one in the manner of step S106 to determine the valid feature values and their corresponding feature paradigms; and, after all the feature paradigms and feature values have been processed, that is, after all valid feature paradigms and valid feature values of the current entity have been obtained, generating the feature data of the entity from them. The generated feature data is then input into a machine learning model (for example LightGBM or XGBoost) to train it. The training process and the post-training processing are neither described in detail nor limited in this example.
With this embodiment, when feature data corresponding to an entity is generated, on the one hand, processing is based on feature paradigms. A feature paradigm describes entity feature generation rule information generated from entity relationship data, that is, the feature paradigms describe all possible ways of generating the feature data of the target entity. The entity relationship data therefore does not need to be combined and sorted out manually, which greatly reduces the cost of feature data generation and makes the scheme well suited to stage-one feature data generation. On the other hand, only the feature values whose evaluation results are better than those of the historical feature values, namely the valid feature values, are considered when the feature data is generated. In this way, the feature values and feature paradigms that effectively characterize the target entity can be screened out efficiently, so that the feature data of the target entity is generated efficiently, the generation efficiency is greatly improved, and the computation cost of generating the feature data is reduced.
The feature data generation method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including, but not limited to: server, PC, etc.
Example two
Referring to fig. 2A, a flowchart of steps of a feature data generating method according to a second embodiment of the present application is shown.
This embodiment describes the scheme of the embodiments of the present application taking as an example the case in which the feature paradigms are generated by the implementing party itself. The feature data generation method of this embodiment includes the following steps:
Step S202: Construct an entity relationship graph based on the entity relationship data, construct the feature paradigms corresponding to the entities based on the entity relationship graph, and generate a feature paradigm set from all of the feature paradigms.
The feature paradigm describes entity feature generation rule information based on the entity relationship data. Specifically, the process of generating the feature paradigms may include: generating a directed acyclic entity relationship graph according to the entity relationship data, where the nodes in the graph are entity nodes that represent entities, the edges between entity nodes represent the relationships between entities, and each entity node has an attribute information table of its entity; performing path sampling based on the entity relationship graph to obtain information on the entity feature generation paths that take an entity node as the termination node; according to the relationships between entities represented by the edges between adjacent entity nodes on an entity feature generation path, mounting operators for the adjacent entity nodes, and mounting attribute information tables for the entity nodes on the path; and generating the feature paradigm corresponding to the entity node from the entity feature generation path with the operators and attribute information tables mounted.
Hereinafter, the above-described process will be specifically described.
(1) Generate a directed acyclic entity relationship graph according to the entity relationship data.
In one possible manner, the entity relationship graph may be obtained by analyzing the primary key and foreign key relationships of the entity relationship data set stored in a database. The analysis may be completed manually or by means of a suitable algorithm, which is not described in detail here. In this embodiment, the entity relationship graph is represented as a directed acyclic graph.
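The key-based analysis described above can be illustrated with a minimal sketch. The schema layout, table names, and the `derive_edges` helper are assumptions made for illustration only; the source does not specify how the analysis is implemented.

```python
# Hypothetical schema: table name -> (primary key, {foreign key column: referenced table}).
SCHEMA = {
    "consumer": ("consumer_id", {}),
    "item": ("item_id", {"store_id": "store"}),
    "store": ("store_id", {}),
    "order": ("order_id", {"consumer_id": "consumer", "item_id": "item"}),
}

def derive_edges(schema):
    """Return (forward, reverse, self_loop) edge lists derived from foreign keys."""
    forward = []
    for child, (_primary_key, foreign_keys) in schema.items():
        for _column, parent in foreign_keys.items():
            forward.append((child, parent))       # n-to-1: child -> parent
    forward.sort()
    reverse = [(parent, child) for child, parent in forward]  # 1-to-n: parent -> child
    self_loop = [(table, table) for table in sorted(schema)]  # node -> itself
    return forward, reverse, self_loop

forward, reverse, self_loop = derive_edges(SCHEMA)
```

Each foreign key yields one forward edge from the referencing (child) table to the referenced (parent) table; the reverse and self-loop edges follow mechanically from that set.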
For example, an entity relationship data set may be defined in terms of a node set $V = \{v_1, v_2, \ldots\}$, which represents the set of entity nodes. On this basis, the nodes in the entity relationship graph are defined as entities (e.g., commodities, consumers, orders, stores, etc.). $\varepsilon = \varepsilon_1 \cup \varepsilon_2 \cup \varepsilon_3$ is the set of edges, where: the forward edges $\varepsilon_1 = \cup\{\langle v_{child}, v_{parent} \rangle\}$ are directed edges from child nodes to parent nodes and indicate an n-to-1 relationship between entities (e.g., multiple sub-orders may correspond to one consumer); the reverse edges $\varepsilon_2 = \cup\{\langle v_{parent}, v_{child} \rangle\}$ are directed edges from parent nodes to child nodes and indicate a 1-to-n relationship between entities; and the self-loop edges $\varepsilon_3 = \cup\{\langle v_i, v_i \rangle\}$ point from a node to itself. Each entity node is mounted with its own attribute information table, which records the historical attribute data and historical behavior data of the entity. The tables are defined as $T = \{t_1, t_2, \ldots\}$, where $t_i$ is the attribute information table mounted on the $i$-th entity node; the primary key of the table points to the entity id of that entity node, and its foreign keys point to the entity ids of adjacent entity nodes.
An entity relationship graph constructed in an e-commerce scenario is shown in fig. 2B. As can be seen from the figure, the entity nodes in the graph include: consumer entity node 1, sub-order entity node 2, commodity entity node 3, store entity node 4, category entity node 5, brand entity node 6, and log entity node 7. Each entity node is mounted with a corresponding attribute information table (containing information describing the attributes and behaviors of the entity corresponding to that node). The graph contains directed edges from sub-order entity node 2 to consumer entity node 1 and commodity entity node 3; from commodity entity node 3 to store entity node 4, category entity node 5, and brand entity node 6; and from log entity node 7 to consumer entity node 1 and commodity entity node 3.
Therefore, through the entity relationship graph, the relationships between entities and the associated data can be represented and determined clearly and simply.
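As a hedged illustration, the forward edges listed for fig. 2B can be written down and checked for acyclicity. The node labels follow the text; the `topological_order` helper and the use of Kahn's algorithm are assumptions, not part of the described method.

```python
from collections import defaultdict, deque

# Node numbers and directed edges as listed in the description of fig. 2B.
NODES = {1: "consumer", 2: "sub_order", 3: "commodity", 4: "store",
         5: "category", 6: "brand", 7: "log"}
FORWARD_EDGES = [(2, 1), (2, 3), (3, 5), (3, 4), (3, 6), (7, 1), (7, 3)]

def topological_order(nodes, edges):
    """Kahn's algorithm; raises ValueError if the edges contain a cycle."""
    indegree = {v: 0 for v in nodes}
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    queue = deque(sorted(v for v in nodes if indegree[v] == 0))
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in children[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    if len(order) != len(nodes):
        raise ValueError("edges contain a cycle")
    return order

ORDER = topological_order(NODES, FORWARD_EDGES)
```

A successful topological sort confirms that the forward (child-to-parent) edges form a directed acyclic graph, with sub-order and log nodes appearing before the entities they reference.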
(2) Perform path sampling based on the entity relationship graph to obtain information on the entity feature generation paths that take an entity node as the termination node.
The entity feature generation paths are represented formally. Assuming that features of entity $v_i$ are to be generated, the feature generation paths are the union of all legal paths in the entity relationship graph that take $v_i$ as the termination node, with paths of different lengths distinguished. Examples are as follows:
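A minimal sketch of enumerating paths that terminate at a target node is given here for illustration. It treats every edge sequence up to a maximum length as legal, which is an assumption; the edge list, the entity names, and the `paths_ending_at` helper are likewise hypothetical.

```python
def paths_ending_at(target, edges, max_len):
    """Enumerate node paths with 1..max_len edges that terminate at `target`."""
    incoming = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(src)
    results = []
    frontier = [(target,)]
    for _ in range(max_len):
        next_frontier = []
        for path in frontier:
            # Extend each path backwards by one incoming edge.
            for src in incoming.get(path[0], []):
                next_frontier.append((src,) + path)
        results.extend(next_frontier)
        frontier = next_frontier
    return results

# Hypothetical edge set: forward edges, their reverses, and one self-loop.
EDGES = [("order", "item"), ("order", "user"), ("user", "order"),
         ("item", "order"), ("item", "item")]
PATHS = paths_ending_at("item", EDGES, max_len=2)
```

Bounding the path length keeps the enumeration finite even though reverse and self-loop edges allow a node to be revisited.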
(3) According to the relationships between entities represented by the edges between adjacent entity nodes on the entity feature generation path, mount operators for the adjacent entity nodes, and mount attribute information tables for the entity nodes on the entity feature generation path; then generate the feature paradigm corresponding to the entity node from the entity feature generation path with the operators and attribute information tables mounted. A feature paradigm set is generated from all of the feature paradigms.
In terms of mounting operators: an operator set can be mounted on each edge between adjacent nodes on each entity feature generation path. Aggregation operators can be mounted on forward edges $\varepsilon_1$, assignment operators on reverse edges $\varepsilon_2$, and conversion operators on self-loop edges $\varepsilon_3$. An operator represents a processing function that is applied to the information in the corresponding attribute information table along an edge between adjacent entity nodes in the entity relationship graph. In the embodiment of the present application, operators are divided into three main classes:
First category: aggregation operators (along the forward edge, information in the attribute information table is aggregated from child entities to parent entities, such as aggregating information of order entities to commodity entities). Aggregation operators include, but are not limited to, the following:
Second category: assignment operators (information in the attribute information table is assigned from parent entities to child entities along reverse edges, e.g., attribute information of commodities is assigned to order entities).
Operator | Operator meaning
Direct | Direct assignment
Third category: conversion operators (information in the attribute information table is converted from an entity to itself along self-loop edges). Conversion operators include, but are not limited to, the following:
Operator | Operator meaning
Percentile | Percentile ranking
Log | Logarithmic transformation
Sqrt | Square-root transformation
Sin | Sine transformation
Cos | Cosine transformation
EqualRangeDiscretizer | Equal-width discretization transformation
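The three operator classes can be sketched as a small registry keyed by edge type. The conversion and assignment operator names come from the tables above; the aggregation operator names (`Sum`, `Max`, `Mean`), the exact semantics of `Percentile` and `EqualRangeDiscretizer`, and the default bin count are assumptions made for illustration.

```python
import math
from bisect import bisect_right

def percentile(xs):
    """Fraction of values less than or equal to each value."""
    ordered = sorted(xs)
    return [bisect_right(ordered, x) / len(xs) for x in xs]

def equal_range_discretize(xs, bins=4):
    """Map each value to the index of its equal-width bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant columns
    return [min(int((x - lo) / width), bins - 1) for x in xs]

# Forward edges: aggregation (placeholder names, not a list taken from the source).
AGGREGATORS = {"Sum": sum, "Max": max, "Mean": lambda xs: sum(xs) / len(xs)}
# Reverse edges: assignment from parent to child.
ASSIGNERS = {"Direct": lambda parent_value: parent_value}
# Self-loop edges: conversion, applied to a whole column.
CONVERTERS = {
    "Percentile": percentile,
    "Log": lambda xs: [math.log(x) for x in xs],
    "Sqrt": lambda xs: [math.sqrt(x) for x in xs],
    "Sin": lambda xs: [math.sin(x) for x in xs],
    "Cos": lambda xs: [math.cos(x) for x in xs],
    "EqualRangeDiscretizer": equal_range_discretize,
}
OPERATORS_BY_EDGE = {"forward": AGGREGATORS, "reverse": ASSIGNERS, "self_loop": CONVERTERS}
```

Keying the registry by edge type mirrors the rule that the kind of edge determines which operators may be mounted on it.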
In terms of mounting attribute values: the attribute and behavior data columns in the attribute information table $t_i$ of entity $v_i$ are mounted on the corresponding entity node.
Based on the above arrangement, one or more corresponding feature paradigms can be generated for each entity, and all feature paradigms corresponding to all entities form a feature paradigm set, also referred to as a feature paradigm space.
From the above, the feature paradigm space of entity $v_i$ can be defined as $\mathcal{F}_{v_i} = \bigcup_{p \in \mathcal{P}_{v_i}} (\mathcal{O}_p \times \mathcal{A}_p)$, where $\mathcal{P}_{v_i}$ is the union of all legal paths that take $v_i$ as the termination node, $\mathcal{O}_p$ is the set of operator combinations that can be mounted on path $p$, $\mathcal{A}_p$ is the set of mountable attribute information (variables) in the attribute information table corresponding to the entity at the initial node of path $p$, and $\times$ denotes the Cartesian product. The feature data processing of entity $v_i$ can then be expressed as follows: a variable in $\mathcal{A}_p$ follows the corresponding path $p$ composed of edges in $\varepsilon$, has the corresponding operations in $\mathcal{O}_p$ applied to it, and is summarized onto entity $v_i$, which yields the feature paradigm set $\mathcal{F}_{v_i}$ of $v_i$.
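The Cartesian-product construction can be sketched as follows. The per-edge operator lists, attribute names, and helper functions are hypothetical; the sketch only illustrates a union over paths of operator combinations crossed with the attributes of the path's initial node.

```python
from itertools import product

# Hypothetical operators mountable on each edge and attributes mounted on each node.
EDGE_OPS = {("order", "item"): ["Sum", "Max"], ("user", "order"): ["Direct"]}
ATTRS = {"order": ["quantity", "price"], "user": ["age"]}

def operators_for_path(path):
    """All operator combinations: one operator per edge along the path."""
    per_edge = [EDGE_OPS[edge] for edge in zip(path, path[1:])]
    return list(product(*per_edge))

def attributes_for_path(path):
    """Attributes mounted on the initial node of the path."""
    return ATTRS[path[0]]

def feature_paradigm_space(paths):
    """Union over paths of (operator combination) x (attribute) pairs."""
    space = []
    for path in paths:
        for ops, attr in product(operators_for_path(path), attributes_for_path(path)):
            space.append((path, ops, attr))
    return space

PATHS = [("order", "item"), ("user", "order", "item")]
SPACE = feature_paradigm_space(PATHS)
```

Even in this toy setting the space grows multiplicatively with the number of operators per edge and attributes per node, which is why exhaustive evaluation becomes costly.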
An example of the above process is shown in fig. 2C, in which a feature paradigm f4 of an entity is generated from the entity relationship graph shown in fig. 2B by performing path sampling, operator mounting, and attribute value mounting in turn.
As can be seen from the above process, given the entity relationship data set and the entity relationship graph, the processing of feature data can be defined by a set of regularized descriptions, namely feature paradigms. Each feature paradigm uniquely defines one feature data processing rule on the entity relationship data set and consists of three elements: entities, operators, and original features (attribute information). As shown by f4 in fig. 2C, the entities, operators, and original features are connected by an expression to form a feature paradigm, in which the entities, read from right to left, uniquely correspond to one feature generation path in the entity relationship graph. For example, the item_quality feature of an Order entity can be summarized onto the Item entity through the path [Order -> User -> Order -> Item -> gate -> Item]. In the figure, AGG is an aggregation operator, T is a conversion operator, and D is an assignment operator.
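To make the idea of a feature paradigm as an executable rule concrete, here is a minimal sketch that evaluates a much shorter paradigm than f4: a conversion applied to an aggregation of an order attribute onto items. The rows, the `quantity` column, and the helper functions are assumptions for illustration.

```python
import math
from collections import defaultdict

ORDERS = [  # hypothetical order rows; item_id is the foreign key to the Item entity
    {"order_id": 1, "item_id": "A", "quantity": 2},
    {"order_id": 2, "item_id": "A", "quantity": 3},
    {"order_id": 3, "item_id": "B", "quantity": 1},
]

def aggregate(rows, key, column, agg):
    """Aggregate `column` of child rows onto the parent identified by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {parent: agg(values) for parent, values in groups.items()}

def transform(values, fn):
    """Apply a conversion operator to each entity's value (self-loop edge)."""
    return {entity: fn(value) for entity, value in values.items()}

# Paradigm read right to left: Log(Sum(Order.quantity)) summarized onto Item.
item_total = aggregate(ORDERS, "item_id", "quantity", sum)
item_feature = transform(item_total, math.log)
```

The expression is evaluated from the innermost (rightmost) entity outwards, matching the right-to-left reading of the path described above.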
After the feature pattern set is constructed and generated, a subsequent feature data generating operation may be performed, including the following steps S204-S210.
Step S204: and acquiring all feature patterns corresponding to the target entity from the feature pattern set according to the target entity of the feature data to be generated.
The target entity is from an entity corresponding to the entity node in the entity relation diagram.
Step S206: and selecting a characteristic normal form from all the characteristic normal forms corresponding to the obtained target entity according to a selection strategy, and generating rule information according to entity characteristics described by the selected characteristic normal form to determine characteristic values corresponding to the characteristic normal form.
Step S208: and evaluating the characteristic value to obtain an evaluation result, and determining the characteristic value of the evaluation result superior to the evaluation result of the historical characteristic value as a valid characteristic value.
Step S210: and generating feature data of the target entity according to the effective feature value and the feature paradigm corresponding to the effective feature value.
The specific implementation of the steps S204 to S210 may refer to the description of the corresponding parts in the first embodiment, and will not be described in detail here.
According to this embodiment, the feature paradigm set is constructed based on the entity relationship graph so as to represent the entity feature generation paths corresponding to each entity, so that the feature generation manner of each entity can be determined efficiently and quickly. The subsequent feature data generation processing is then performed on the feature paradigm set, and the feature values and feature paradigms that effectively characterize the target entity are screened out efficiently, so that the feature data of the target entity is generated efficiently, the generation efficiency is greatly improved, and the computational cost of generating the feature data is reduced.
The feature data generation method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including, but not limited to: server, PC, etc.
Example III
Referring to fig. 3A, a flowchart of steps of a feature data generation method according to a third embodiment of the present application is shown.
In this embodiment, the method for generating feature data provided by the embodiment of the present application is described with emphasis on updating a selection policy based on relevant data in a feature data generation process.
The feature data generation method of the present embodiment includes the steps of:
Step S302: Acquire, from the feature paradigm set, all feature paradigms corresponding to the target entity for which feature data is to be generated.
The feature paradigm is used to describe entity feature generation rule information based on entity relationship data. Optionally, the feature paradigm set and the feature paradigms in it may be obtained in the manner described in the second embodiment.
Step S304: Select a feature paradigm from all of the acquired feature paradigms according to a selection strategy, and determine the feature value corresponding to the selected feature paradigm according to the entity feature generation rule information it describes.
Step S306: Evaluate the feature value to obtain an evaluation result, and determine a feature value whose evaluation result is better than the evaluation results of the historical feature values to be a valid feature value.
Step S308: Generate the feature data of the target entity according to the valid feature value and the feature paradigm corresponding to the valid feature value.
The specific implementation of the steps S302-S308 may refer to the description of the corresponding parts in the first embodiment, and will not be described in detail here.
Step S310: Update the selection strategy according to the feature paradigms corresponding to the target entity.
Through this step, the selection strategy is updated based on the currently selected feature paradigm and the existing valid feature paradigms, so that the selection strategy is optimized and guides the direction in which subsequent feature paradigms are selected.
In one possible manner, the updating of the selection strategy may be implemented using a DQN model. On this basis, in one possible manner, the vector (encoding vector) of the feature paradigm selected at a given moment is taken as the action representation vector, and the feature statistics vector corresponding to the feature paradigms in the valid feature paradigm set at that moment (which may be implemented as a statistics vector over the encodings of those feature paradigms) is taken as the state representation vector. A DQN model is constructed based on a preset reward function, and the selection strategy is obtained through the constructed DQN model; the Q value corresponding to the current feature paradigm is obtained according to the reward value given by the reward function for the previous feature paradigm; and the selection strategy is dynamically updated according to the Q value. Optionally, the reward function is generated according to preset parameters, which include the accuracy of the evaluation result. Further optionally, the preset parameters also include at least one of the following: the importance ranking of the entity feature data and the complexity penalty of the feature paradigm. The importance ranking of the entity feature data may be used to indicate whether the feature data is ranked reasonably by importance, for example, whether, among the feature data corresponding to a consumer entity, commodity feature data ranks higher in importance than store feature data. The complexity penalty of the feature paradigm may be used to indicate whether the feature paradigm is too long: a length threshold may be set, and a feature paradigm whose length exceeds the threshold is penalized, optionally with a penalty that grows with the amount by which the threshold is exceeded.
The accuracy of the evaluation result, the importance ranking of the entity feature data, and the complexity penalty of the feature paradigm may be used in combination or independently. A person skilled in the art can set the reward target based on these factors according to actual requirements, and the embodiment of the present application does not limit the specific manner of setting it.
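One way the three factors could be combined into a scalar reward is sketched below. The linear form, the weights, the rank normalization, and the length threshold are all assumptions; the source leaves the specific setting open.

```python
def reward(gain, importance_rank, num_features, path_length,
           length_threshold=3, w_gain=1.0, w_rank=0.5, w_penalty=0.1):
    """Combine accuracy gain, importance rank, and a length penalty (weights assumed)."""
    # Rank 1 (most important) maps to 1.0, the last rank maps to 0.0.
    rank_score = 1.0 - (importance_rank - 1) / max(num_features - 1, 1)
    # Penalty grows with the amount by which the path length exceeds the threshold.
    excess = max(path_length - length_threshold, 0)
    return w_gain * gain + w_rank * rank_score - w_penalty * excess

short_path_reward = reward(gain=0.1, importance_rank=1, num_features=5, path_length=2)
long_path_reward = reward(gain=0.1, importance_rank=1, num_features=5, path_length=5)
```

With the same gain and rank, the longer feature paradigm receives the lower reward, which is the behavior the complexity penalty is meant to produce.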
DQN (Deep Q Network) is a reinforcement learning model. Reinforcement learning is an iterative process in which each iteration solves two problems: evaluating the value function of a given policy, and updating the policy according to the value function. DQN uses a neural network to approximate the Q function; the input of the neural network can be a state representation vector and an action representation vector, and the output is the Q value corresponding to the action representation vector. The reward for the agent is determined by a reward function, and each input to the DQN model yields a reward, i.e., a reward value. The Q value of the current feature paradigm is then obtained based on the reward value determined for the previous feature paradigm, and whether to update the policy is decided according to the Q value.
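For orientation, the standard Q-learning target used when training a DQN can be sketched as follows. This is the generic formulation from the reinforcement learning literature, not a detail stated in the source; the discount factor and function names are assumptions.

```python
def td_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max over next-state Q values."""
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)

def td_error(q_value, target):
    """Difference that the network update tries to shrink."""
    return target - q_value

target = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.5)
error = td_error(q_value=1.5, target=target)
```

In a full DQN, the next-state Q values would come from the Target Net and the error would drive a gradient step on the Policy Net.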
Specifically, in this embodiment, the encoding vector corresponding to the feature paradigm selected at a given moment, for example moment t, is taken as the action representation vector, and the feature statistics vector corresponding to the encodings of the feature paradigms in the valid feature paradigm set at that moment is taken as the state representation vector. The corresponding feature generation path vector, operator vector, attribute value vector, and feature statistics vector are thus input into the DQN, and the Q value output by the DQN is obtained based on the reward value determined by the previous reward function; whether to update the policy is then decided based on the Q value. If the current Q value is better than the historical Q value, or the current Q value meets a preset standard value, the choice of action representation vector and state representation vector can be considered good, and the current selection strategy can be updated with the action corresponding to the action representation vector, namely the action of selecting the corresponding feature paradigm. A feature paradigm selected by the updated strategy is more likely to be a valid feature paradigm, which improves the efficiency of selecting valid feature paradigms. The feature statistics vector corresponding to the feature paradigm encodings may be obtained from statistical histogram data generated from the encodings in the generation path dimension, the operator dimension, and the attribute value dimension; that is, the vector corresponding to the statistical histogram data is used as the state representation vector.
For example, assume that the action of the agent (Agent) at time $t$ is $a_t$, defined as selecting a feature paradigm $f_i^t$ from the feature paradigm space at time $t$; the action $a_t$ can be represented by the Embedding corresponding to the feature paradigm $f_i^t$. Assume that the immediate reward given by the environment (Environment) to the agent is $r_t$, where the reward (Reward) function may jointly consider the relative improvement in the evaluator's model performance, the relative ranking of the importance of the feature corresponding to the feature paradigm in the evaluator model, and a penalty on the complexity of the feature paradigm (e.g., its path length). Assume that the state of the environment at time $t$ is $s_t$, represented by the statistical histograms, in each dimension, of the encodings of the feature paradigms in the valid feature paradigm set at time $t$. The Policy Net/Target Net may be implemented by a multi-layer neural network. After the Policy Net/Target Net receives the action representation vector and the state representation vector, the corresponding Q value is obtained through the neural network, and the optimal action can be selected according to the Q value. The policy update, i.e., the main training and update procedure of the DQN, may be implemented using related techniques, such as Nature 2015, and is not described in detail here.
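Selecting an action from Q values can be sketched with an epsilon-greedy rule. The exploration step is a common reinforcement learning convention and an assumption here; the source only states that the optimal action is chosen according to the Q value. The candidate names and Q values are made up.

```python
import random

def select_paradigm(candidates, q_value, epsilon=0.1, rng=random):
    """Epsilon-greedy: explore with probability epsilon, else pick the max-Q candidate."""
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=q_value)

# Hypothetical Q values for three candidate feature paradigms.
Q_TABLE = {"f1": 0.1, "f2": 0.9, "f3": 0.3}
chosen = select_paradigm(["f1", "f2", "f3"], Q_TABLE.get, epsilon=0.0)
```

With `epsilon=0.0` the rule is purely greedy; a small positive value lets the search occasionally try feature paradigms whose Q estimates are still poor.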
Hereinafter, the above-described process is exemplarily described with a specific example, as shown in fig. 3B.
First, given a model (such as LightGBM or XGBoost) and a metric (such as MAPE (Mean Absolute Percentage Error) or SMAPE (Symmetric Mean Absolute Percentage Error), used to measure the difference between the feature data corresponding to two entities), and given the feature paradigm space defined by a certain entity relationship graph (i.e., the feature paradigm set formed by all feature paradigms), an optimal feature paradigm subset is selected from that space such that the feature set computed using the subset optimizes model performance. The formal definition involves the selected model and its hyperparameters, the metric (e.g., MAPE or SMAPE), the resulting model performance, and the label $y$.
Because the optimal feature paradigm subset is a subset of the feature paradigm space, if the number of rules (feature paradigms) in the space is $N$ (which is normally large), there are $2^N$ possible feature paradigm subsets. If the feature values corresponding to all feature paradigms were computed in full, all potential feature paradigm subsets were enumerated, each subset were evaluated, and the optimal subset were then selected, the computation would be extremely expensive. This is because the feature paradigm space is large, so computing the feature values for all feature paradigms is costly; moreover, the feature paradigm subsets can be combined in many ways, so evaluating the feature values is also costly.
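One way to write the selection objective described above is sketched here. All notation is assumed rather than taken from the original formula: $\mathcal{F}$ is the feature paradigm space, $\mathcal{F}'$ a candidate subset, $X_{\mathcal{F}'}$ the feature values computed from that subset, $\mathcal{M}$ the selected model with its hyperparameters, $\mathcal{L}$ an error metric to be minimized (as with MAPE or SMAPE), and $y$ the label.

```latex
\mathcal{F}^{*} \;=\; \operatorname*{arg\,min}_{\mathcal{F}' \subseteq \mathcal{F}}
  \; \mathcal{L}\bigl( \mathcal{M}(X_{\mathcal{F}'}),\, y \bigr),
\qquad
\bigl|\{\mathcal{F}' : \mathcal{F}' \subseteq \mathcal{F}\}\bigr| \;=\; 2^{|\mathcal{F}|}
```

The second expression restates why exhaustive search is impractical: the number of candidate subsets doubles with every additional feature paradigm.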
To this end, the present example provides an adaptive near-optimal feature paradigm combination selection strategy based on reinforcement learning. Specifically, during the feature pattern selection process, the reinforcement learning algorithm learns the evaluation result, such as scoring distribution, of the feature values corresponding to the feature patterns evaluated in the past, and guides the selection direction of the subsequent feature patterns.
Illustratively, the process is as follows:
(1) Determine the evaluator and its parameters (hyperparameters, set according to the model), referred to collectively below as the evaluator.
(2) Suppose that the features of entity $v_i$ are to be computed. After a preset parameter is given (set according to the length of the feature paradigms in the feature paradigm space), the feature paradigm space corresponding to entity $v_i$ is constructed by exhaustively enumerating the feature paradigms of entity $v_i$. In addition, the valid (near-optimal) feature paradigm set of entity $v_i$ is initialized to empty, and the corresponding valid feature value set is initialized to empty.
(3) Judge whether the termination condition of the search iteration is met (e.g., the number of iterations reaches a preset number, or another suitable termination condition holds). If so, end the process; otherwise, enter the $t$-th iteration and select a feature paradigm $f_i^t$ from the feature paradigm space according to the feature paradigm selection strategy $\pi$.
(4) Compute the feature value corresponding to the feature paradigm $f_i^t$, and feed it together with the valid feature value set into the evaluator for evaluation. If the model performance of the evaluator (e.g., the accuracy of its output) shows a gain, $f_i^t$ is determined to be a valid feature paradigm; otherwise, it is an invalid feature paradigm.
(5) If $f_i^t$ is valid, add $f_i^t$ to the valid feature paradigm set and add its feature value to the corresponding valid feature value set; otherwise, remove it. In one possible manner, this information can also be recorded synchronously in a knowledge base, which records the model performance gain of the evaluator after the feature paradigm $f_i^t$ is added.
At this point, once the valid feature paradigm set and the valid feature value set are complete, the feature data of entity $v_i$ can be generated from them. However, in order to optimize the selection strategy $\pi$, the following step (6) may also be performed.
(6) The strategy learner (such as a DQN model) updates the selection strategy $\pi$ according to the information recorded in the knowledge base, and the process returns to step (3) after the update is completed.
After the search finally ends, the valid feature paradigm set is the feature paradigm subset sought for entity $v_i$.
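The iterative search in steps (1) to (6) can be sketched as a loop. The evaluator, the value computation, and the selection rule are supplied as hypothetical callables, the policy update of step (6) is reduced to passing the knowledge base to the selector, and removing every evaluated paradigm from the candidate pool is an assumption.

```python
def search_valid_paradigms(space, compute_value, evaluate, select, max_iters=100):
    """Select paradigms one at a time and keep those that improve the evaluator score."""
    remaining = list(space)
    valid_paradigms, valid_values = [], []
    knowledge_base = []                       # (paradigm, gain) records for the learner
    best_score = evaluate(valid_values)
    for _ in range(max_iters):                # termination: iteration budget
        if not remaining:                     # termination: nothing left to try
            break
        paradigm = select(remaining, knowledge_base)
        remaining.remove(paradigm)
        value = compute_value(paradigm)
        score = evaluate(valid_values + [value])
        gain = score - best_score
        knowledge_base.append((paradigm, gain))
        if gain > 0:                          # valid: evaluator performance improved
            valid_paradigms.append(paradigm)
            valid_values.append(value)
            best_score = score
    return valid_paradigms, valid_values, knowledge_base

# Toy stand-ins: each paradigm's value is a number and the score is their sum.
TOY_VALUES = {"f1": 0.5, "f2": -0.2, "f3": 0.3}
valid, values, kb = search_valid_paradigms(
    ["f1", "f2", "f3"], TOY_VALUES.get, sum, lambda rem, _kb: rem[0])
```

Only one feature value is computed per iteration, so the cost scales with the number of iterations rather than with the size of the feature paradigm space.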
The strategy learner may be implemented based on a DQN model. Taking entity $v_i$ as an example, the entities, operators, and attribute values in the feature paradigms of entity $v_i$ may first be encoded separately and then represented by Embeddings. The action (Action) at time $t$ is the selection of a feature paradigm from the feature paradigm space at time $t$, and the action representation can directly use the Embedding representation of that feature paradigm. The state (State) at time $t$ is represented by the statistical histograms, in each dimension, of the encodings of the feature paradigms in the valid feature paradigm set; that is, the state representation is a statistical histogram representation. The reward (Reward) function may jointly consider the relative improvement in model performance, the relative ranking of feature importance, and a feature complexity penalty. After the Policy Net/Target Net receives the action representation vector and the state representation vector, the corresponding Q value is obtained through a neural network, and the selection strategy is updated based on the Q value. The statistical histograms of the feature paradigm encodings in each dimension may be statistical histograms of the one-hot encodings of the feature paradigms in the generation path dimension, the operator dimension, and the attribute value dimension, respectively. For example, all M generation paths with a maximum depth of 2 are one-hot encoded in sequence starting from 0; all N operators are one-hot encoded in sequence starting from 0; and all W attribute values (original features) are one-hot encoded in sequence starting from 0. Statistical histograms are then computed over these one-hot encodings to obtain the state representation.
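The histogram-based state representation can be sketched as a sum of one-hot codes per dimension. Representing a feature paradigm as a `(path_id, operator_ids, attribute_id)` tuple and concatenating the three histograms into one vector are assumptions for illustration.

```python
def one_hot(index, size):
    """One-hot code of `index` in a vector of length `size`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def state_histogram(valid_paradigms, num_paths, num_ops, num_attrs):
    """Sum the one-hot codes of the valid paradigms per dimension and concatenate."""
    paths, ops, attrs = [0] * num_paths, [0] * num_ops, [0] * num_attrs
    for path_id, op_ids, attr_id in valid_paradigms:
        paths = add(paths, one_hot(path_id, num_paths))
        for op_id in op_ids:
            ops = add(ops, one_hot(op_id, num_ops))
        attrs = add(attrs, one_hot(attr_id, num_attrs))
    return paths + ops + attrs

# Two hypothetical valid paradigms with M=2 paths, N=2 operators, W=3 attributes.
VALID = [(0, (1,), 2), (0, (0, 1), 1)]
STATE = state_histogram(VALID, num_paths=2, num_ops=2, num_attrs=3)
```

The resulting vector has fixed length M + N + W regardless of how many valid feature paradigms have been collected, which makes it usable as a neural network input.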
According to the scheme of this embodiment, based on the entity relationship graph (a directed acyclic graph) that expresses the structural relationships within the entity relationship data set, the rules for constructing feature data can be expressed abstractly through feature paradigms. Each feature paradigm uniquely describes one way of generating feature data, and all potential feature paradigms of an entity form a feature paradigm space that represents all possible ways of generating the feature data of that entity. Because the feature paradigm space is large and not every feature paradigm generates feature data with a positive effect, an effective subset of feature paradigms needs to be selected. With the reinforcement-learning-based feature paradigm selection strategy, the optimal feature paradigm subset is selected automatically from the feature paradigm space, which effectively avoids the computational cost of traversing all feature paradigms.
The feature data generation method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including, but not limited to: server, PC, etc.
Example IV
Referring to fig. 4, there is shown a block diagram of a feature data generating apparatus according to a fourth embodiment of the present application.
The feature data generation device of this embodiment includes: an obtaining module 402, configured to obtain, from a feature paradigm set, all feature paradigms corresponding to a target entity for which feature data is to be generated, where the feature paradigms are used to describe entity feature generation rule information based on entity relationship data; a first determining module 404, configured to select a feature paradigm from all of the feature paradigms according to a selection strategy, and determine the feature value corresponding to the selected feature paradigm according to the entity feature generation rule information it describes; a second determining module 406, configured to evaluate the feature value to obtain an evaluation result, and determine a feature value whose evaluation result is better than the evaluation results of the historical feature values to be a valid feature value; and a generating module 408, configured to generate the feature data of the target entity according to the valid feature value and the feature paradigm corresponding to the valid feature value.
Optionally, the second determining module 406 is configured to: for each feature value of the selected feature paradigm, input the feature value together with all historical valid feature values in the valid feature value set into the evaluator for evaluation, so as to obtain an evaluation result for the feature value; and, if the evaluation result of the feature value is better than the evaluation results corresponding to all historical valid feature values in the valid feature value set, determine the feature value to be a valid feature value.
Optionally, the apparatus of this embodiment further includes: an initialization module 410, configured to construct a valid feature paradigm set and a valid feature value set for the target entity after the obtaining module 402 obtains, from the feature paradigm set, all feature paradigms corresponding to the target entity for which feature data is to be generated, where the valid feature paradigm set and the valid feature value set are initially empty. The second determining module 406 is further configured to, after determining the feature value to be a valid feature value, add the valid feature value to the valid feature value set to update that set, and add the feature paradigm corresponding to the valid feature value to the valid feature paradigm set to update that set.
Optionally, the generating module 408 is configured to: if it is determined that the valid feature value set has been updated according to the feature values corresponding to all selected feature paradigms and the valid feature paradigm set has been updated according to all selected feature paradigms, determine, according to a preset rule, the valid feature paradigms to be used from the updated valid feature paradigm set; determine the corresponding valid feature values to be used from the updated valid feature value set according to the valid feature paradigms to be used; and generate the feature data of the target entity according to the valid feature values to be used.
Optionally, the entity feature generation rule information includes: entity feature generation path information, operator information between entity nodes on the path corresponding to the entity feature generation path information, and attribute information of the entity nodes on the path other than the target entity.
Optionally, the first determining module 404 is configured to: select a feature paradigm from all the obtained feature paradigms according to a selection policy; and, along the path corresponding to the entity feature generation path information of the feature paradigm and following the relationships between the entity nodes on the path, perform the operations indicated by the operator information on the attribute information of the other entity nodes to generate operation results, and aggregate the operation results to the target entity so as to obtain the feature value of the target entity under the current feature paradigm.
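As a rough illustration of how a feature value might be computed along a path, the following Python sketch aggregates an attribute hop by hop toward the target entity. The data layout (one dictionary per edge mapping an id on one node to the related id on the next node), the operator names, and the item/order/user example are all hypothetical and are not taken from the application.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Hypothetical operator table; the application does not enumerate its operators.
OPERATORS = {"sum": sum, "mean": mean, "max": max, "min": min}


def compute_feature_values(
    start_values: Dict[str, float],
    edges: List[Dict[str, str]],
    operators: List[str],
) -> Dict[str, float]:
    """Aggregate an attribute along a path that ends at the target entity.

    start_values: attribute value per entity id on the first node of the path.
    edges[k]:     maps an id on node k to the id it relates to on node k + 1.
    operators[k]: name of the operator applied when moving across edges[k].
    """
    current = dict(start_values)
    for edge, op_name in zip(edges, operators):
        groups = defaultdict(list)
        for child_id, value in current.items():
            parent_id = edge.get(child_id)
            if parent_id is not None:      # ids without a relation are skipped
                groups[parent_id].append(value)
        current = {pid: OPERATORS[op_name](vals) for pid, vals in groups.items()}
    return current


# Path: item -> order -> user, with "user" as the target entity.
item_price = {"i1": 10.0, "i2": 5.0, "i3": 7.0}
item_to_order = {"i1": "o1", "i2": "o1", "i3": "o2"}
order_to_user = {"o1": "u1", "o2": "u1"}

# Feature: the largest order total per user (sum per order, then max per user).
result = compute_feature_values(item_price, [item_to_order, order_to_user], ["sum", "max"])
```

Here the order totals are 15.0 and 7.0, so the feature value aggregated to user `u1` is 15.0.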
Optionally, the apparatus of this embodiment further includes: a building module 412, configured to: generate a directed acyclic entity relationship graph according to entity relationship data, where the nodes in the entity relationship graph are entity nodes representing entities, the edges between entity nodes represent relationships between the entities, and each entity node carries an attribute information table of its entity; perform path sampling based on the entity relationship graph to obtain information on entity feature generation paths that take an entity node as the termination node; mount operators on adjacent entity nodes according to the relationship between the entities represented by the edge between those adjacent entity nodes on an entity feature generation path, and mount attribute information tables on the entity nodes on the entity feature generation path; and generate the feature paradigm corresponding to the entity node according to the entity feature generation path on which the operators and the attribute information tables have been mounted.
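The construction of a feature paradigm from the entity relationship graph could be sketched as follows. This is a simplified illustration under stated assumptions: the `FeatureParadigm` record, the relation-type-to-operator table, the backward random walk used for path sampling, and the choice of a single attribute from the first node of the path are all hypothetical simplifications of the steps described above.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FeatureParadigm:
    path: Tuple[str, ...]       # entity nodes, ending at the termination (target) node
    operators: Tuple[str, ...]  # one operator per edge on the path
    attribute: str              # attribute read from the first node of the path


# Hypothetical mapping from relation type to the operators that may be mounted.
OPERATORS_BY_RELATION = {
    "many_to_one": ("sum", "mean", "max"),
    "one_to_one": ("identity",),
}


def sample_paradigm(
    edges: Dict[Tuple[str, str], str],   # (source node, destination node) -> relation type
    attributes: Dict[str, List[str]],    # attribute table (column names) per entity node
    target: str,
    rng: random.Random,
    max_hops: int = 3,
) -> FeatureParadigm:
    """Sample one path ending at `target`, then mount operators and an attribute."""
    incoming: Dict[str, List[str]] = {}
    for (src, dst) in edges:
        incoming.setdefault(dst, []).append(src)

    # Walk backwards from the target; a DAG guarantees the walk cannot loop.
    path = [target]
    while len(path) - 1 < max_hops and incoming.get(path[0]):
        path.insert(0, rng.choice(sorted(incoming[path[0]])))
    if len(path) < 2:
        raise ValueError(f"no entity node points to {target!r}")

    # Mount one operator per edge according to the relation on that edge.
    operators = tuple(
        rng.choice(OPERATORS_BY_RELATION[edges[(a, b)]])
        for a, b in zip(path, path[1:])
    )
    attribute = rng.choice(sorted(attributes[path[0]]))
    return FeatureParadigm(tuple(path), operators, attribute)


edges = {("item", "order"): "many_to_one", ("order", "user"): "many_to_one"}
attributes = {"item": ["price"], "order": ["amount"], "user": ["age"]}
paradigm = sample_paradigm(edges, attributes, target="user", rng=random.Random(0))
```

With this linear example graph the sampled path is always item, order, user; only the mounted operators vary with the random seed.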
Optionally, the apparatus of this embodiment further includes: and an updating module 414, configured to update the selection policy according to the feature paradigm corresponding to the target entity.
Optionally, the updating module 414 is configured to: construct a DQN model based on a preset reward function, taking the vector of the feature paradigm selected at a given moment as the action representation vector and the feature statistics vector corresponding to the feature paradigms in the valid feature paradigm set at that moment as the state representation vector; obtain a selection policy through the constructed DQN model, and obtain the Q value corresponding to the current feature paradigm according to the reward value that the reward function yields for the previous feature paradigm; and dynamically update the obtained selection policy according to the Q value.
Optionally, the reward data comprises: accuracy of the evaluation result.
Optionally, the reward data further comprises at least one of: the importance ordering of the entity feature data and the complexity penalty of the feature paradigm.
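To make the reward and Q-value update more concrete, here is a minimal Python sketch. It is not a DQN: a linear function of the concatenated state and action vectors stands in for the deep network so that the example stays self-contained. The reward formula, its weights, the learning rate, and the discount factor are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence


def reward(accuracy: float, importance_rank: int, num_features: int,
           path_length: int, rank_weight: float = 0.1,
           complexity_weight: float = 0.05) -> float:
    """Hypothetical reward: accuracy, plus a bonus for a high importance rank
    (rank 1 is the most important), minus a penalty that grows with path length."""
    rank_bonus = 1.0 - (importance_rank - 1) / num_features
    return accuracy + rank_weight * rank_bonus - complexity_weight * path_length


def q_value(weights: Sequence[float], state: Sequence[float],
            action: Sequence[float]) -> float:
    """Linear stand-in for the Q network: dot product with [state, action]."""
    return sum(w * x for w, x in zip(weights, list(state) + list(action)))


def td_update(weights: Sequence[float], state: Sequence[float], action: Sequence[float],
              r: float, next_state: Sequence[float],
              next_actions: Sequence[Sequence[float]],
              lr: float = 0.1, gamma: float = 0.9) -> List[float]:
    """One temporal-difference step toward r + gamma * max_a' Q(s', a')."""
    best_next = max((q_value(weights, next_state, a) for a in next_actions), default=0.0)
    error = r + gamma * best_next - q_value(weights, state, action)
    features = list(state) + list(action)
    return [w + lr * error * x for w, x in zip(weights, features)]


def select_action(weights: Sequence[float], state: Sequence[float],
                  actions: Sequence[Sequence[float]], epsilon: float,
                  rng: random.Random) -> Sequence[float]:
    """Epsilon-greedy selection policy over candidate feature paradigm vectors."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: q_value(weights, state, a))


state = [1.0, 0.0]                       # feature statistics of the valid paradigm set
actions = [[1.0, 0.0], [0.0, 1.0]]       # vectors of two candidate feature paradigms
weights = td_update([0.0] * 4, state, actions[1], r=1.0,
                    next_state=state, next_actions=[])
chosen = select_action(weights, state, actions, epsilon=0.0, rng=random.Random(0))
```

After the single update, the second candidate has the higher Q value, so the greedy policy (`epsilon=0.0`) selects it.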
The feature data generating device of the present embodiment is configured to implement the corresponding feature data generating method in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the feature data generating apparatus of this embodiment may refer to the description of the corresponding portion in the foregoing method embodiment, which is not repeated herein.
Example five
Referring to fig. 5, a schematic structural diagram of an electronic device according to a fifth embodiment of the present application is shown; the specific embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in fig. 5, the electronic device may include: a processor 502, a communication interface (Communications Interface) 504, a memory 506, and a communication bus 508.
Wherein:
processor 502, communication interface 504, and memory 506 communicate with each other via communication bus 508.
A communication interface 504 for communicating with other electronic devices or servers.
The processor 502 is configured to execute the program 510, and may specifically perform relevant steps in the above-described embodiment of the method for generating feature data.
In particular, program 510 may include program code including computer-operating instructions.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application. The one or more processors included in the electronic device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
A memory 506 for storing a program 510. Memory 506 may comprise high-speed RAM memory or may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 510 is specifically operable to cause the processor 502 to execute the feature data generation method described in any of the foregoing embodiments one to three.
The specific implementation of each step in the program 510 may refer to corresponding steps and corresponding descriptions in the units in the above embodiment of the feature data generating method, which are not described herein. It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and modules described above may refer to corresponding procedure descriptions in the foregoing method embodiments, which are not repeated herein.
With the electronic device of this embodiment, when feature data corresponding to an entity is generated, on the one hand, processing is based on feature paradigms, which describe entity feature generation rule information derived from entity relationship data, that is, all possible ways of generating feature data for the target entity. The entity relationship data therefore does not need to be combined and sorted out manually, which greatly reduces the cost of generating feature data and makes the approach effectively applicable to one-stage feature data generation processing. On the other hand, when the feature data is generated, only feature values whose evaluation results are better than those of the historical feature values, namely valid feature values, are considered. In this way, the feature values and feature paradigms that effectively characterize the target entity can be screened out efficiently, so that the feature data of the target entity is generated efficiently, the efficiency of feature data generation is greatly improved, and the computational cost of generating feature data is reduced.
The embodiments of the present application also provide a computer program product, which includes computer instructions that instruct a computing device to perform operations corresponding to any of the feature data generation methods in the above-described method embodiments.
It should be noted that, according to implementation requirements, each component/step described in the embodiments of the present application may be split into more components/steps, or two or more components/steps or part of operations of the components/steps may be combined into new components/steps, so as to achieve the objects of the embodiments of the present application.
The above-described methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code storable in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and then stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware such as an ASIC or FPGA. It is understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the feature data generation method described herein. Further, when a general-purpose computer accesses code for implementing the feature data generation method shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing the feature data generation method shown herein.
Those of ordinary skill in the art will appreciate that the elements and method steps of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only for illustrating the embodiments of the present application, but not for limiting the embodiments of the present application, and various changes and modifications may be made by one skilled in the relevant art without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also fall within the scope of the embodiments of the present application, and the scope of the embodiments of the present application should be defined by the claims.

Claims (12)

1. A feature data generation method, comprising:
acquiring, according to a target entity of feature data to be generated, all feature paradigms corresponding to the target entity from a feature paradigm set, wherein the feature paradigms are used for describing entity feature generation rule information based on entity relationship data;
selecting a feature paradigm from all the acquired feature paradigms according to a selection policy, and determining the feature value corresponding to the feature paradigm according to the entity feature generation rule information described by the selected feature paradigm;
evaluating the feature value to obtain an evaluation result, and determining a feature value whose evaluation result is better than the evaluation result of a historical feature value as a valid feature value;
generating feature data of the target entity according to the valid feature value and the feature paradigm corresponding to the valid feature value; wherein,
the entity feature generation rule information includes: entity feature generation path information, operator information between entity nodes on the path corresponding to the entity feature generation path information, and attribute information of the entity nodes on the path other than the target entity.
2. The method of claim 1, wherein the evaluating the feature value to obtain an evaluation result and determining a feature value of the evaluation result that is better than an evaluation result of a historical feature value as a valid feature value comprises:
for the feature value of each selected feature paradigm, inputting the feature value together with all the historical valid feature values in a valid feature value set into an evaluator for evaluation, to obtain an evaluation result for the feature value;
and if the evaluation result of the feature value is better than the evaluation results corresponding to all the historical valid feature values in the valid feature value set, determining the feature value to be a valid feature value.
3. The method of claim 2, wherein,
after acquiring, according to the target entity of the feature data to be generated, all feature paradigms corresponding to the target entity from the feature paradigm set, the method further comprises: constructing a valid feature paradigm set and a valid feature value set for the target entity; wherein the valid feature paradigm set and the valid feature value set are initially empty sets;
after determining the feature value to be a valid feature value, the method further comprises: adding the valid feature value to the valid feature value set to update the valid feature value set, and adding the feature paradigm corresponding to the valid feature value to the valid feature paradigm set to update the valid feature paradigm set.
4. The method of claim 3, wherein the generating feature data of the target entity according to the valid feature value and the feature paradigm corresponding to the valid feature value comprises:
if it is determined that the valid feature value set has been updated according to the feature values corresponding to all the selected feature paradigms and the valid feature paradigm set has been updated according to all the selected feature paradigms, determining, according to a preset rule, a valid feature paradigm to be used from the updated valid feature paradigm set; determining, according to the valid feature paradigm to be used, the corresponding valid feature value to be used from the updated valid feature value set;
and generating the feature data of the target entity according to the valid feature value to be used.
5. The method of claim 1, wherein the determining, according to the entity feature generation rule information described by the selected feature pattern, the feature value corresponding to the feature pattern includes:
along the path corresponding to the entity feature generation path information of the feature paradigm and following the relationships between the entity nodes on the path, performing the operations indicated by the operator information on the attribute information of the other entity nodes to generate operation results, and aggregating the operation results to the target entity so as to obtain the feature value of the target entity under the current feature paradigm.
6. The method of claim 1, wherein the method further comprises:
generating a directed acyclic entity relation graph according to entity relation data, wherein nodes in the entity relation graph are entity nodes used for representing entities, edges between the entity nodes are used for representing the relation between the entities, and the entity nodes are provided with attribute information tables of the entities;
performing path sampling based on the entity relation graph to obtain information on entity feature generation paths that take an entity node as the termination node;
mounting operators on adjacent entity nodes according to the relation between the entities represented by the edge between those adjacent entity nodes on an entity feature generation path, and mounting attribute information tables on the entity nodes on the entity feature generation path;
and generating the feature paradigm corresponding to the entity node according to the entity feature generation path on which the operators and the attribute information tables have been mounted.
7. The method of claim 2, wherein the method further comprises:
and updating the selection strategy according to the characteristic paradigm corresponding to the target entity.
8. The method of claim 7, wherein the updating the selection policy according to the feature paradigm corresponding to the target entity comprises:
taking the vector of the feature paradigm selected at a given moment as an action representation vector, taking the feature statistics vector corresponding to the feature paradigms in the valid feature paradigm set at that moment as a state representation vector, and constructing a DQN model based on a preset reward function;
obtaining a selection policy through the constructed DQN model, obtaining the Q value corresponding to the current feature paradigm according to the reward value that the reward function yields for the previous feature paradigm, and dynamically updating the obtained selection policy according to the Q value.
9. The method of claim 8, wherein the reward function is generated according to preset parameters, the preset parameters comprising: accuracy of the evaluation result.
10. The method of claim 9, wherein the preset parameters further comprise at least one of: the importance ordering of the entity feature data and the complexity penalty of the feature paradigm.
11. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another via the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the method of any one of claims 1-10.
12. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1-10.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110996469.5A CN113688191B (en) 2021-08-27 2021-08-27 Feature data generation method, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN113688191A CN113688191A (en) 2021-11-23
CN113688191B CN113688191B (en) 2023-08-18

Family

ID=78583452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110996469.5A Active CN113688191B (en) 2021-08-27 2021-08-27 Feature data generation method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN113688191B (en)

Also Published As

Publication number Publication date
CN113688191A (en) 2021-11-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant