CN111460047A - Method, device and equipment for constructing characteristics based on entity relationship and storage medium - Google Patents

Info

Publication number
CN111460047A
Authority
CN
China
Prior art keywords
relationship
entity
tables
main
main table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010156947.7A
Other languages
Chinese (zh)
Inventor
刘利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010156947.7A priority Critical patent/CN111460047A/en
Publication of CN111460047A publication Critical patent/CN111460047A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

The application discloses a feature construction method based on entity relationships, comprising the following steps: acquiring a main table and a plurality of auxiliary tables associated with the main table in a relational database, wherein the main table has a primary key column and a plurality of foreign key columns, each entry in the main table corresponds to an entity, and the auxiliary tables are associated with the main table through the foreign keys of the main table; constructing a directed inter-table relationship graph by taking the main table and the auxiliary tables as nodes and the pairwise association relationships among them as edges; traversing the inter-table relationship graph, starting from the node corresponding to the main table, to collect inter-table relationship data between each entity in the main table and the corresponding auxiliary tables; and performing conversion calculation on the inter-table relationship data based on a preset conversion function to construct the features corresponding to each entity in the main table. The application also discloses a device, equipment and a storage medium for feature construction based on entity relationships. Because feature data are collected based on the inter-table relationship graph, the features of the data can be expressed in multiple dimensions as a whole, which improves the modeling success rate.

Description

Method, device and equipment for constructing characteristics based on entity relationship and storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a method, an apparatus, a device, and a storage medium for feature construction based on entity relationships.
Background
Feature engineering is the most time-consuming and labor-intensive part of data analysis. Unlike algorithms and models, it is not a deterministic step; it relies more on engineering experience and trade-offs, so there is no uniform approach. Feature construction is an important part of feature engineering: new features are constructed from the original data so that features with physical significance can be found in it. Assuming that the raw data is tabular, features are typically created by mixing or combining attributes, or by decomposing or slicing the original features.
Existing feature construction usually performs function calculations on different tables that have association relations, relying on the forward or backward association relations among the data tables, in order to construct features.
Disclosure of Invention
The application mainly aims to provide a method, a device, equipment and a storage medium for constructing characteristics based on entity relationships, and aims to solve the technical problem that the characteristics of data cannot be expressed in multiple dimensions on the whole in the existing characteristic engineering technology.
In order to achieve the above object, the present application provides a method for constructing a feature based on an entity relationship, where the method for constructing a feature based on an entity relationship includes the following steps:
acquiring a main table and a plurality of auxiliary tables associated with the main table in a relational database, wherein the main table has a primary key column and a plurality of foreign key columns, each entry in the main table corresponds to an entity, and the auxiliary tables are associated with the main table through the foreign keys of the main table;
constructing a directed inter-table relationship graph by taking the main table and the auxiliary tables as nodes and the pairwise association relationships among the main table and the auxiliary tables as edges;
traversing the inter-table relationship graph, starting from the node corresponding to the main table, to collect inter-table relationship data between each entity in the main table and the corresponding auxiliary tables;
and performing conversion calculation on the inter-table relationship data based on a preset conversion function to construct the features corresponding to each entity in the main table.
Optionally, an edge M of the inter-table relationship graph is defined as follows:
$$M:\; T_{i-1} \xrightarrow{C_i} T_i$$
where $T_{i-1}$ and $T_i$ are tables in the database, $C_i$ is the key column linking tables $T_{i-1}$ and $T_i$, and $i$ is a positive integer;
A. when $C_i$ is the primary key of $T_{i-1}$, $T_{i-1}$ and $T_i$ are in a one-to-many association relationship;
B. when $C_i$ is the primary key of both $T_{i-1}$ and $T_i$, $T_{i-1}$ and $T_i$ are in a one-to-one association relationship;
C. when $C_i$ is the primary key of $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-one association relationship;
D. when $C_i$ is the primary key of neither $T_{i-1}$ nor $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-many association relationship.
Optionally, the connection path $P_k$ traversed for each entity in the inter-table relationship graph is formed by sequentially connecting k edges M of the inter-table relationship graph, and is defined as follows:
$$P_k:\; T_0 \xrightarrow{C_1} T_1 \xrightarrow{C_2} \cdots \xrightarrow{C_k} T_k \mapsto C$$
where $T_{i-1}$ and $T_i$ are tables in the database, $C_i$ is the key column linking tables $T_{i-1}$ and $T_i$, $i$ and $k$ are positive integers, $i$ is any positive integer from 2 to (k-1), $T_0$ represents the main table, $T_i$ represents an auxiliary table, and $C$ represents an attribute column of the last auxiliary table $T_k$ in the connection path.
Optionally, traversing the inter-table relationship graph, starting from the node corresponding to the main table, to collect inter-table relationship data between each entity in the main table and the corresponding auxiliary tables includes:
starting from the node corresponding to the main table, traversing the inter-table relationship graph along the connection path $P_i$, and generating, for each entity in the main table, a relationship tree corresponding to the connection path to the auxiliary tables;
based on the traversal depth of the inter-table relationship graph, performing a grouping operation on the relationship tree corresponding to each entity, so as to collect the inter-table relationship data between each entity in the main table and the auxiliary tables;
wherein the root node of the relationship tree corresponds to an entity in the main table, the leaf nodes of the relationship tree correspond to the attribute column $C$ of the auxiliary table $T_k$ collected by traversing the connection path $P_i$, and a child node at traversal depth $i$ corresponds to the foreign key column $C_i$ of the auxiliary table $T_i$ collected by traversing the connection path $P_i$.
Optionally, after the step of performing conversion calculation on the inter-table relationship data based on the preset conversion function to construct the features corresponding to each entity in the main table, the method further includes:
checking whether duplicate features exist among the constructed features;
if duplicate features exist, deleting the duplicates, and then using a chi-square hypothesis test to check whether each feature is correlated with the target variable;
if no duplicate features exist, using a chi-square hypothesis test to check whether each feature is correlated with the target variable;
retaining each feature that is correlated with the target variable and whose chi-square value is greater than a preset chi-square threshold, and deleting the feature otherwise.
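The de-duplication and chi-square filtering described above can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes categorical feature columns, computes the Pearson chi-square statistic by hand, and uses hypothetical names (`chi_square`, `select_features`), toy data, and a hypothetical threshold of 3.84.

```python
from collections import Counter

def chi_square(feature, target):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n
            stat += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return stat

def select_features(features, target, threshold):
    """Drop duplicated columns, then keep columns whose statistic exceeds threshold."""
    unique, seen = {}, set()
    for name, values in features.items():
        key = tuple(values)
        if key in seen:          # identical column already kept: a duplicate feature
            continue
        seen.add(key)
        unique[name] = values
    return {name: values for name, values in unique.items()
            if chi_square(values, target) > threshold}

# Toy data (hypothetical): f_b duplicates f_a, f_c is independent of the target.
target = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "f_a": [1, 1, 1, 1, 0, 0, 0, 0],
    "f_b": [1, 1, 1, 1, 0, 0, 0, 0],
    "f_c": [1, 0, 1, 0, 1, 0, 1, 0],
}
kept = select_features(features, target, threshold=3.84)
```

Here only `f_a` survives: `f_b` is removed as a duplicate and `f_c` has a statistic of zero. The threshold would in practice be the preset value the method refers to.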
Optionally, the conversion function includes at least: any one or more of an averaging function, a maximum function, a minimum function, a sum function, a difference function, and a product function.
Further, in order to achieve the above object, the present application also provides a feature construction apparatus based on entity relationship, where the feature construction apparatus based on entity relationship includes:
an acquisition module, configured to acquire a main table and a plurality of auxiliary tables associated with the main table in a relational database, wherein the main table has a primary key column and a plurality of foreign key columns, each entry in the main table corresponds to an entity, and the auxiliary tables are associated with the main table through the foreign keys of the main table;
an inter-table relationship graph construction module, configured to construct a directed inter-table relationship graph by taking the main table and the auxiliary tables as nodes and the pairwise association relationships among the main table and the auxiliary tables as edges;
a traversal module, configured to traverse the inter-table relationship graph, starting from the node corresponding to the main table, so as to collect inter-table relationship data between each entity in the main table and the corresponding auxiliary tables;
and a feature construction module, configured to perform conversion calculation on the inter-table relationship data based on a preset conversion function so as to construct the features corresponding to each entity in the main table.
Optionally, an edge M of the inter-table relationship graph is defined as follows:
$$M:\; T_{i-1} \xrightarrow{C_i} T_i$$
where $T_{i-1}$ and $T_i$ are tables in the database, $C_i$ is the key column linking tables $T_{i-1}$ and $T_i$, and $i$ is a positive integer;
A. when $C_i$ is the primary key of $T_{i-1}$, $T_{i-1}$ and $T_i$ are in a one-to-many association relationship;
B. when $C_i$ is the primary key of both $T_{i-1}$ and $T_i$, $T_{i-1}$ and $T_i$ are in a one-to-one association relationship;
C. when $C_i$ is the primary key of $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-one association relationship;
D. when $C_i$ is the primary key of neither $T_{i-1}$ nor $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-many association relationship.
Optionally, the connection path $P_k$ traversed for each entity in the inter-table relationship graph is formed by sequentially connecting k edges M of the inter-table relationship graph, and is defined as follows:
$$P_k:\; T_0 \xrightarrow{C_1} T_1 \xrightarrow{C_2} \cdots \xrightarrow{C_k} T_k \mapsto C$$
where $T_{i-1}$ and $T_i$ are tables in the database, $C_i$ is the key column linking tables $T_{i-1}$ and $T_i$, $i$ and $k$ are positive integers, $i$ is any positive integer from 2 to (k-1), $T_0$ represents the main table, $T_i$ represents an auxiliary table, and $C$ represents an attribute column of the last auxiliary table $T_k$ in the connection path.
Optionally, the traversal module is specifically configured to:
starting from the node corresponding to the main table, traverse the inter-table relationship graph along the connection path $P_i$, and generate, for each entity in the main table, a relationship tree corresponding to the connection path to the auxiliary tables;
based on the traversal depth of the inter-table relationship graph, perform a grouping operation on the relationship tree corresponding to each entity, so as to collect the inter-table relationship data between each entity in the main table and the auxiliary tables;
wherein the root node of the relationship tree corresponds to an entity in the main table, the leaf nodes of the relationship tree correspond to the attribute column $C$ of the auxiliary table $T_k$ collected by traversing the connection path $P_i$, and a child node at traversal depth $i$ corresponds to the foreign key column $C_i$ of the auxiliary table $T_i$ collected by traversing the connection path $P_i$.
Optionally, the entity relationship-based feature construction apparatus further includes:
a feature checking module, configured to check whether duplicate features exist among the constructed features; if duplicate features exist, delete the duplicates, and then use a chi-square hypothesis test to check whether each feature is correlated with the target variable; if no duplicate features exist, use a chi-square hypothesis test to check whether each feature is correlated with the target variable; and retain each feature that is correlated with the target variable and whose chi-square value is greater than a preset chi-square threshold, deleting the feature otherwise.
Optionally, the conversion function includes at least: any one or more of an averaging function, a maximum function, a minimum function, a sum function, a difference function, and a product function.
Further, to achieve the above object, the present application also provides an entity relationship based feature construction device, where the entity relationship based feature construction device includes a memory, a processor, and a feature construction program stored in the memory and executable on the processor, and when executed by the processor, the feature construction program further implements the steps of the entity relationship based feature construction method according to any one of the above.
Further, to achieve the above object, the present application also provides a computer readable storage medium, which stores a feature construction program, and when the feature construction program is executed by a processor, the computer readable storage medium further implements the steps of the entity relationship based feature construction method according to any one of the above items.
In the present application, the association relationships among the data tables are sorted out through the inter-table relationship graph, the data related to each entity are then collected based on the inter-table relationship graph, and finally the features are constructed from the collected data. Because feature data are collected based on the inter-table relationship graph, the features of the data can be expressed in multiple dimensions as a whole, which improves the modeling success rate. In addition, the feature construction process for the relational database requires no human participation and can serve different data sets, which helps to improve data modeling efficiency, helps managers make decisions quickly and at low cost, and supports the rapid development of enterprise business.
Drawings
FIG. 1 is a schematic structural diagram of the operating environment of the entity relationship-based feature construction device according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a first embodiment of a method for constructing features based on entity relationships according to the present application;
FIG. 3 is a table-to-table relationship diagram of an embodiment of the method for building a feature based on an entity relationship according to the present application;
FIG. 4 is a schematic diagram illustrating a detailed flow of step S30 in FIG. 2;
FIG. 5 is a schematic diagram of a relationship tree according to an embodiment of the method for constructing features based on entity relationships;
FIG. 6 is a flowchart illustrating a second embodiment of the method for constructing features based on entity relationships according to the present application;
FIG. 7 is a functional module diagram of an embodiment of the apparatus for constructing features based on entity relationships according to the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides a feature construction device based on entity relationships.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the operating environment of the entity relationship-based feature construction device according to an embodiment of the present application.
As shown in fig. 1, the entity relationship-based feature construction apparatus includes: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the hardware configuration of the entity relationship based feature construction apparatus shown in fig. 1 does not constitute a limitation of the entity relationship based feature construction apparatus, and may include more or less components than those shown, or some components in combination, or a different arrangement of components.
As shown in FIG. 1, the memory 1005, which is a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a computer program. The operating system is a program that manages and controls the hardware and software resources of the entity relationship-based feature construction device, and supports the operation of the feature construction program and other software and/or programs.
In the hardware structure of the entity relationship based feature construction device shown in fig. 1, the network interface 1004 is mainly used for accessing a network; the user interface 1003 is mainly used for detecting a confirmation instruction, an editing instruction, and the like. And the processor 1001 may be configured to invoke the feature construction program stored in the memory 1005 and perform the operations of the following embodiments of the entity relationship based feature construction method.
Based on the above feature construction device hardware structure based on the entity relationship, various embodiments of the feature construction method based on the entity relationship are provided.
Referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of the method for constructing features based on entity relationships according to the present application. In this embodiment, the method for constructing a feature based on an entity relationship includes the following steps:
Step S10, acquiring a main table and a plurality of auxiliary tables associated with the main table in a relational database, wherein the main table has a primary key column and a plurality of foreign key columns, each entry in the main table corresponds to an entity, and the auxiliary tables are associated with the main table through the foreign keys of the main table;
In this embodiment, feature construction is performed on the data tables in a relational database. To facilitate construction of the directed inter-table relationship graph, the data tables targeted by feature construction must include one main table and a plurality of auxiliary tables. The main table has a plurality of foreign key columns, each of its entries corresponds to one entity, and the auxiliary tables are associated with the main table through the foreign keys of the main table.
For example, consider a simple relational database with 4 tables, shown in Tables 1-4 below.
(1) Main table (main): contains information about train arrival times. The target column is the arrival time. Each entry in the main table is uniquely identified by the MessageID column and corresponds to a message sent when a train arrives at a station. The main table has two foreign keys: StationID and TrainID.
TABLE 1
TrainID  StationID  Arrival time         MessageID
IRE01    Dublin     2017-01-01 10:02:00  1
IRE01    AshTown    2017-01-01 10:12:00  2
IRE01    Maynooth   2017-01-01 10:24:00  3
IRE01    Dublin     2017-01-02 10:03:00  4
IRE01    AshTown    2017-01-02 10:15:00  5
IRE01    Maynooth   2017-01-02 10:27:30  6
IRE02    Dublin     2017-01-01 11:00:00  7
IRE02    Cork       2017-01-01 14:20:00  8
(2) Delay table (delay): contains train delay information. It is similar to the main table, except that the arrival time is replaced by the delay in seconds.
TABLE 2
TrainID  StationID  Delay  TimeStamp
IRE01    Dublin     120    2017-01-01 10:02:00
IRE01    AshTown    60     2017-01-01 10:12:00
IRE01    Maynooth   60     2017-01-01 10:24:00
IRE01    Dublin     180    2017-01-02 10:03:00
IRE01    AshTown    240    2017-01-02 10:15:00
IRE01    Maynooth   240    2017-01-02 10:27:30
IRE02    Dublin     0      2017-01-01 11:00:00
IRE02    Cork       60     2017-01-01 14:20:00
(3) Information table (info): contains detailed information about each train, such as the train class.
TABLE 3
TrainID  Trainclass  Max Speed (km/h)
IRE01    Regional    120
IRE02    Intercity   240
(4) Event table (event): a log of events occurring at the stations where trains are scheduled to arrive.
TABLE 4
StationID  Event     TimeStamp
Dublin     Roadwork  2017-01-01 10:00:00
Dublin     Roadwork  2017-01-01 18:00:00
AshTown    Roadwork  2017-01-01 10:00:00
Dublin     Strike    2017-01-02 9:00:00
AshTown    Strike    2017-01-02 9:00:00
Step S20, constructing a directed inter-table relationship graph by taking the main table and the auxiliary tables as nodes and the pairwise association relationships among the main table and the auxiliary tables as edges;
In this embodiment, the inter-table relationship graph is a graph whose nodes are tables and whose edges are the connections between tables. The inter-table relationship graph links the tables through their relationships, so that the features of the data can be expressed in a plurality of dimensions as a whole. It should be noted that the two tables connected by an edge of the inter-table relationship graph may be associated in various ways, such as a one-to-one, a one-to-many, or a many-to-many association relationship.
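As a rough illustration of step S20, the graph can be held in an adjacency list. The sketch below assumes that an edge exists between two tables whenever they share a key column, and that `TrainID` and `StationID` are the only key columns; the schema dictionary, the `ArrivalTime` and `MaxSpeed` column spellings, and the function name `build_graph` are hypothetical, and the edges actually drawn in FIG. 3 may differ.

```python
# Hypothetical schema description of the four example tables.
TABLES = {
    "main":  ["TrainID", "StationID", "ArrivalTime", "MessageID"],
    "delay": ["TrainID", "StationID", "Delay", "TimeStamp"],
    "info":  ["TrainID", "TrainClass", "MaxSpeed"],
    "event": ["StationID", "Event", "TimeStamp"],
}
KEY_COLUMNS = {"TrainID", "StationID"}  # assumption: only these columns link tables

def build_graph(tables, key_columns):
    """Return an adjacency list: table -> list of (key column, neighbouring table)."""
    graph = {name: [] for name in tables}
    for src, src_columns in tables.items():
        for dst, dst_columns in tables.items():
            if src == dst:
                continue
            for column in src_columns:
                if column in key_columns and column in dst_columns:
                    graph[src].append((column, dst))  # directed edge src -> dst
    return graph

graph = build_graph(TABLES, KEY_COLUMNS)
for key, neighbour in graph["main"]:
    print(f"main --{key}--> {neighbour}")
```

An adjacency list keeps the key column on each edge, which is the information later needed to follow a connection path.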
Step S30, traversing the inter-table relationship graph, starting from the node corresponding to the main table, to collect the inter-table relationship data between each entity in the main table and the corresponding auxiliary tables;
In this embodiment, data may be collected for each entity in the main table along any path that starts from the node corresponding to the main table. Collecting data by traversing different paths in the inter-table relationship graph is equivalent to exploring different relationships between the tables. In general, the number of paths grows exponentially with the depth of the graph, so the maximum traversal depth d must be limited, preferably to d = 2.
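The effect of the depth limit d can be seen by enumerating paths with a depth-bounded depth-first search. The sketch below is illustrative only: the `GRAPH` adjacency list is a hypothetical set of edges for the four example tables (the edges of FIG. 3 are not reproduced here), and `enumerate_paths` is an assumed helper name.

```python
# Hypothetical directed edges: table -> list of (key column, next table).
GRAPH = {
    "main":  [("TrainID", "delay"), ("StationID", "delay"),
              ("TrainID", "info"), ("StationID", "event")],
    "delay": [("TrainID", "info"), ("StationID", "event")],
    "info":  [("TrainID", "delay")],
    "event": [("StationID", "delay")],
}

def enumerate_paths(graph, start, max_depth):
    """List every path of 1..max_depth edges that starts at `start`."""
    paths = []

    def walk(node, path):
        if path:
            paths.append(tuple(path))
        if len(path) == max_depth:      # depth limit d reached: stop extending
            return
        for key, nxt in graph.get(node, []):
            path.append((key, nxt))
            walk(nxt, path)
            path.pop()

    walk(start, [])
    return paths

for d in (1, 2, 3):
    print(d, len(enumerate_paths(GRAPH, "main", d)))
```

With these assumed edges the count rises from 4 paths at d = 1 to 10 at d = 2, which is why a small bound such as d = 2 keeps the traversal tractable.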
And step S40, performing conversion calculation on the relation data between the tables based on a preset conversion function to construct the characteristics corresponding to each entity in the main table.
In this embodiment, the inter-table relationship data collected through the above steps is in effect a list of values taken from the last table in the traversal path. Table data are usually of numeric, categorical, timestamp or text type, so data conversion functions are supported, such as converting categorical labels into numerical variables and converting a timestamp into 4 different features: day of the week (1-7), day of the month (1-28/30/31), month (1-12) and hour (1-24).
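The two conversions mentioned above can be sketched with the Python standard library. The function names are hypothetical, and two details are assumptions: the day of the week uses ISO numbering (Monday = 1, Sunday = 7), and the hour is left in Python's 0-23 range rather than shifted to the 1-24 range given in the text.

```python
from datetime import datetime

def timestamp_features(ts):
    """Split a 'YYYY-MM-DD HH:MM:SS' timestamp into four calendar features."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "week": t.isoweekday(),  # 1-7, assuming ISO numbering (Monday = 1)
        "day": t.day,            # day of the month
        "month": t.month,        # 1-12
        "hour": t.hour,          # 0-23 in Python; not shifted to 1-24 here
    }

def encode_labels(labels):
    """Map categorical labels to integer codes in order of first appearance."""
    codes = {}
    return [codes.setdefault(label, len(codes)) for label in labels]

print(timestamp_features("2017-01-01 10:02:00"))
print(encode_labels(["Regional", "Intercity", "Regional"]))
```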
In this embodiment, a large number of new features having practical significance can be constructed by performing conversion calculation on the acquired data. For example, based on the data in tables 1-4, by calculating the difference between the arrival time and the delay time, the expected scheduled arrival time can be obtained; by comparing the delay time, the station name with the longest delay and the station name with the shortest delay can be obtained; by comparing the delay time of the same train arriving at the same station on different dates, the optimal travel time can be recommended.
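The first example, recovering the scheduled arrival time as arrival time minus delay, can be worked through on rows of Tables 1 and 2. This is a small sketch; the helper name `scheduled_arrival` is hypothetical, and it assumes that a main-table row and a delay-table row with the same train, station and timestamp describe the same arrival.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def scheduled_arrival(arrival, delay_seconds):
    """Scheduled arrival time = actual arrival time minus the delay in seconds."""
    actual = datetime.strptime(arrival, FMT)
    return (actual - timedelta(seconds=delay_seconds)).strftime(FMT)

# (arrival time from Table 1, delay in seconds from Table 2) for train IRE01
rows = [
    ("2017-01-01 10:02:00", 120),  # Dublin
    ("2017-01-02 10:27:30", 240),  # Maynooth
]
for arrival, delay in rows:
    print(arrival, "->", scheduled_arrival(arrival, delay))
```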
Optionally, the conversion function at least includes: any one or more of an averaging function, a maximum function, a minimum function, a sum function, a difference function, and a product function.
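As an illustration of these conversion functions, the sketch below applies four of them (average, maximum, minimum, sum) to the delays collected for train IRE01 in Table 2. The `AGGREGATORS` mapping, the function name, and the feature naming scheme are hypothetical; the difference and product functions are omitted for brevity.

```python
# Hypothetical registry of aggregation-style conversion functions.
AGGREGATORS = {
    "avg": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
    "sum": sum,
}

def aggregate_features(values, source):
    """Apply every aggregator to the collected values and name each result."""
    return {f"{name}({source})": fn(values) for name, fn in AGGREGATORS.items()}

# Delay values (seconds) of the six IRE01 rows in Table 2.
ire01_delays = [120, 60, 60, 180, 240, 240]
features = aggregate_features(ire01_delays, "delay.Delay")
print(features)
```

Each collected list yields one feature per conversion function, so a single connection path can contribute several columns to the entity's feature vector.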
In this embodiment, the association relationship between the data tables is sorted out through the inter-table relationship diagram, then data related to each entity is collected based on the inter-table relationship diagram, and finally, features are constructed based on the collected data. Characteristic data are collected based on the relation graph between tables, so that the characteristics of the data can be expressed in multiple dimensions on the whole, and the modeling success rate is improved. In addition, the embodiment does not need human participation in the feature construction process of the relational database, and can serve different data sets, so that the data modeling efficiency can be improved, a manager can be helped to make a decision quickly at a low cost, and the quick development of enterprise business is supported.
Further, in an embodiment of the feature construction method based on entity relationships, an edge M of the inter-table relationship graph is defined as follows:
$$M:\; T_{i-1} \xrightarrow{C_i} T_i$$
where $T_{i-1}$ and $T_i$ are tables in the database, $C_i$ is the key column linking tables $T_{i-1}$ and $T_i$, and $i$ is a positive integer;
A. when $C_i$ is the primary key of $T_{i-1}$, $T_{i-1}$ and $T_i$ are in a one-to-many association relationship;
B. when $C_i$ is the primary key of both $T_{i-1}$ and $T_i$, $T_{i-1}$ and $T_i$ are in a one-to-one association relationship;
C. when $C_i$ is the primary key of $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-one association relationship;
D. when $C_i$ is the primary key of neither $T_{i-1}$ nor $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-many association relationship.
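Rules A-D amount to a small decision function over the primary keys of the two tables. The sketch below is one way to express it, under simplifying assumptions: each table has at most one single-column primary key, rule B is checked first because it is the special case of A and C holding together, and the function name `classify_edge` is hypothetical. Treating the delay table as having no primary key is also an assumption, since the text does not declare one.

```python
def classify_edge(key, pk_left, pk_right):
    """Cardinality of the edge T_{i-1} --key--> T_i according to rules A-D.

    pk_left / pk_right are the primary key columns of T_{i-1} and T_i,
    or None when a table has no primary key.
    """
    is_left_pk = key == pk_left
    is_right_pk = key == pk_right
    if is_left_pk and is_right_pk:
        return "one-to-one"      # rule B
    if is_left_pk:
        return "one-to-many"     # rule A
    if is_right_pk:
        return "many-to-one"     # rule C
    return "many-to-many"        # rule D

# main (primary key MessageID) joined to info (primary key TrainID) on TrainID
print(classify_edge("TrainID", "MessageID", "TrainID"))
# main joined to delay on TrainID, assuming delay has no primary key
print(classify_edge("TrainID", "MessageID", None))
```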
In the inter-table relationship graph of this embodiment, the tables are the nodes and the association relationships between the tables are the edges. Taking the relational database described in the above embodiment as an example, the corresponding inter-table relationship graph is shown in FIG. 3.
It should be noted that the inter-table relationship graph is mainly composed of nodes, each representing a table, and edges, each representing an association relationship between tables, where the edge M is defined as follows:
$$M:\; T_{i-1} \xrightarrow{C_i} T_i$$
In this embodiment, the inter-table relationship graph may be stored using an adjacency matrix, an edge array, an adjacency list, an orthogonal (cross-linked) list, an adjacency multilist, and the like. The arrow in the edge M indicates both the association between the two tables ($T_{i-1}$, $T_i$) at the nodes that the edge connects, where $C_i$ is the key column linking $T_{i-1}$ and $T_i$, and the connection direction.
In the inter-table relationship graph depicted in FIG. 3, the edge $\text{main} \xrightarrow{\text{TrainID}} \text{info}$ is a many-to-one association relationship, that is, a plurality of records in the main table correspond to one record in the information table; while the edge from the main table to the delay table is a many-to-many association relationship, that is, each of a plurality of records in the main table corresponds to a plurality of records in the delay table.
In this embodiment, the relationships between the tables can be associated by the inter-table relationship diagram, and the features of the data can be expressed in a plurality of dimensions as a whole. It should be noted that there are various association relationships between two tables connected to each other in correspondence with the edges of the inter-table relationship diagram, for example, a one-to-one association relationship or a one-to-many association relationship.
Further, in a specific embodiment of the method for constructing features based on entity relationships, the connection path $P_k$ traversed for each entity in the inter-table relationship graph is formed by sequentially connecting k edges M of the inter-table relationship graph, and is defined as follows:
$$P_k:\; T_0 \xrightarrow{C_1} T_1 \xrightarrow{C_2} \cdots \xrightarrow{C_k} T_k \mapsto C$$
where $T_{i-1}$ and $T_i$ are tables in the database, $C_i$ is the key column linking tables $T_{i-1}$ and $T_i$, $i$ and $k$ are positive integers, $i$ is any positive integer from 2 to (k-1), $T_0$ represents the main table, $T_i$ represents an auxiliary table, and $C$ represents an attribute column of the last auxiliary table $T_k$ in the connection path.
Referring to FIG. 4, FIG. 4 is a schematic view of a detailed flow of step S30 in FIG. 2. In this embodiment, step S30 further includes:
Step S301, starting from the node corresponding to the main table, traversing the inter-table relationship graph along the connection path $P_i$, and generating, for each entity in the main table, a relationship tree corresponding to the connection path to the auxiliary tables;
Step S302, based on the traversal depth of the inter-table relationship graph, performing a grouping operation on the relationship tree corresponding to each entity, so as to collect the inter-table relationship data between each entity in the main table and the auxiliary tables;
wherein the root node of the relationship tree corresponds to an entity in the main table, the leaf nodes of the relationship tree correspond to the attribute column $C$ of the auxiliary table $T_k$ collected by traversing the connection path $P_i$, and a child node at traversal depth $i$ corresponds to the foreign key column $C_i$ of the auxiliary table $T_i$ collected by traversing the connection path $P_i$.
In the above embodiment, the relational database is taken as an example, and in this embodiment, the data collection for the entity e may be represented as a relational tree, as shown in fig. 5.
In this embodiment, the entity e is the entry in the main table Message whose TrainID = IRE01, and the traversed connection path is:
Figure BDA0002404405450000111
The root of the relationship tree in fig. 5 corresponds to the entity e, and the leaf nodes of the relationship tree correspond to the attribute column Event of the secondary table event collected by traversing the connection path $P_2$. A child node with traversal depth 1 corresponds to the foreign key column StationID of the secondary table delay collected by traversing the connection path $P_1$; a child node with traversal depth 2 corresponds to the foreign key column StationID of the secondary table event collected by traversing the connection path $P_2$. The traversal depth of the inter-table relationship graph is determined by the number of secondary tables traversed starting from the main table. For example, when the first secondary table is reached from the main table, the traversal depth is 1, and when 3 different secondary tables have been traversed in succession from the main table, the traversal depth is 3.
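The shape of such a relationship tree can be illustrated with a small sketch. The table contents below are invented purely for illustration (only the names TrainID, StationID, delay, event and Event come from the description), and `build_tree` is a hypothetical helper, not the application's own procedure.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]

# Toy tables; every row and value here is invented for illustration.
TABLES: Dict[str, List[Row]] = {
    "main": [{"TrainID": "IRE01", "StationID": "S1"},
             {"TrainID": "IRE01", "StationID": "S2"},
             {"TrainID": "IRE02", "StationID": "S3"}],
    "delay": [{"StationID": "S1", "Delay": "5"},
              {"StationID": "S2", "Delay": "12"}],
    "event": [{"StationID": "S1", "Event": "signal fault"},
              {"StationID": "S2", "Event": "storm"},
              {"StationID": "S2", "Event": "strike"}],
}

# Connection path as (source table, linking column, target table) triples.
PATH: List[Tuple[str, str, str]] = [("main", "StationID", "delay"),
                                    ("delay", "StationID", "event")]


def build_tree(rows: List[Row], path: List[Tuple[str, str, str]], attribute: str):
    """Return {key value: subtree}; at the last edge the subtree is a list of leaf values."""
    _, column, target = path[0]
    tree = {}
    for key in sorted({r[column] for r in rows}):
        # Rows of the next table reached through this key value.
        matched = [r for r in TABLES[target] if r[column] == key]
        if len(path) == 1:
            tree[key] = [r[attribute] for r in matched]           # leaf nodes
        else:
            tree[key] = build_tree(matched, path[1:], attribute)  # next depth
    return tree


# Root: the entity e, i.e. the main-table rows with TrainID == "IRE01".
entity_rows = [r for r in TABLES["main"] if r["TrainID"] == "IRE01"]
tree = build_tree(entity_rows, PATH, "Event")
```

Here the outer dictionary keys play the role of the depth-1 child nodes, the inner keys the depth-2 child nodes, and the lists the leaf nodes.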
For convenience of description, the relationship tree corresponding to the connection path of each entity is denoted
Figure BDA0002404405450000112
which represents the relationship tree of the connection path P for the entity e. In addition, in order to collect data across multiple dimensions, the data collected at different traversal depths are also grouped as they are collected.
Taking fig. 5 as an example, grouping operations at different traversal depths represent different information about the events affecting the train. For example, each child node at depth 1 in fig. 5 corresponds to the StationID attribute information associated with TrainID IRE01 in the main table, and the grouping operation
Figure BDA0002404405450000113
represents the information in the event table that affects the train delay.
Referring to fig. 6, fig. 6 is a schematic flowchart of a second embodiment of the method for constructing features based on entity relationships according to the present application. Based on the first embodiment, the present embodiment further includes, after the step S40, the following steps:
step S50, checking whether the constructed features have repeated features;
step S60, if repeated features exist, deleting the repeated features, and using a chi-square test to check whether each feature is correlated with the target variable;
step S70, if no repeated features exist, using a chi-square test to check whether each feature is correlated with the target variable;
step S80, if a correlation exists and the chi-square value is greater than the preset chi-square threshold, retaining the feature; otherwise, deleting the feature.
In this embodiment, features constructed across multiple dimensions inevitably include repeated features, that is, features whose actual data values are identical even though their physical meanings differ, so the constructed features need to be further selected. Feature selection removes irrelevant or redundant features, which reduces the number of features, improves model accuracy and shortens running time.
In this embodiment, the chi-square test, a filter method, is preferably used for feature selection. The chi-square test examines the correlation between a qualitative independent variable and a qualitative dependent variable. Assuming that the independent variable takes N values and the dependent variable takes M values, a statistic is constructed from the difference between the observed and the expected frequency of samples for which the independent variable equals i and the dependent variable equals j; this statistic measures the correlation of the independent variable with the dependent variable. If a correlation exists and the chi-square value is greater than the preset chi-square threshold, the feature is retained; otherwise, the feature is deleted.
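Steps S50 to S80 can be sketched as follows. This is a minimal illustration using the standard Pearson chi-square statistic on categorical values; the function names `chi_square` and `select_features`, the toy data and the threshold of 3.84 (the 5% critical value for one degree of freedom) are assumptions for the example, and numeric features would first have to be binned.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def chi_square(feature: Sequence[Hashable], target: Sequence[Hashable]) -> float:
    """Pearson chi-square statistic for two categorical sequences of equal length."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # frequency of each (i, j) pair
    row = Counter(feature)                     # marginal counts of the feature
    col = Counter(target)                      # marginal counts of the target
    stat = 0.0
    for i in row:
        for j in col:
            expected = row[i] * col[j] / n
            stat += (observed[(i, j)] - expected) ** 2 / expected
    return stat


def select_features(features: Dict[str, List], target: List,
                    threshold: float) -> List[str]:
    """Drop features whose values repeat an earlier feature, then keep those above threshold."""
    seen = set()
    kept = []
    for name, values in features.items():
        key = tuple(values)
        if key in seen:            # same data values as an earlier feature
            continue
        seen.add(key)
        if chi_square(values, target) > threshold:
            kept.append(name)
    return kept


# Toy data: "b" duplicates "a"; "c" is independent of the target.
FEATURES = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 1, 0, 1]}
TARGET = [0, 0, 1, 1]
KEPT = select_features(FEATURES, TARGET, threshold=3.84)
```

In this toy case only `"a"` survives: `"b"` is removed as a repeated feature and `"c"` has a chi-square value of 0.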
The application also provides a device for constructing the characteristics based on the entity relationship.
Referring to fig. 7, fig. 7 is a functional module schematic diagram of an embodiment of the feature construction apparatus based on entity relationship according to the present application. In this embodiment, the apparatus for constructing features based on entity relationships includes:
an obtaining module 10, configured to obtain a main table and multiple auxiliary tables associated with the main table in a relational database, where the main table is provided with a primary key column and multiple foreign key columns, each entry in the main table corresponds to an entity, and the auxiliary tables are associated with the main table through the foreign keys of the main table;
the inter-table relationship graph building module 20 is configured to build a directed inter-table relationship graph by using the main table and the auxiliary table as nodes and using an association relationship between each two of the main table and the auxiliary table as an edge;
a traversal module 30, configured to traverse the inter-table relationship graph with the primary table corresponding node as a starting point to acquire inter-table relationship data between each entity in the primary table and the corresponding secondary table;
and the feature construction module 40 is configured to perform conversion calculation on the relationship data between the tables based on a preset conversion function, so as to construct features corresponding to the entities in the main table.
Optionally, in a specific embodiment, the edge M of the inter-table relationship graph is defined as follows:
Figure BDA0002404405450000121
wherein $T_{i-1}$ and $T_i$ are tables in the database, $C_i$ is the column linking tables $T_{i-1}$ and $T_i$, and i is a positive integer;
A. when $C_i$ is the primary key of $T_{i-1}$, $T_{i-1}$ and $T_i$ are in a one-to-many association relationship;
B. when $C_i$ is the primary key of both $T_{i-1}$ and $T_i$, $T_{i-1}$ and $T_i$ are in a one-to-one association relationship;
C. when $C_i$ is the primary key of $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-one association relationship;
D. when $C_i$ is the primary key of neither $T_{i-1}$ nor $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-many association relationship.
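The four cases A to D amount to a small decision rule on whether the linking column is a primary key on each side of the edge. The function name `association_type` and the column names in the usage lines are assumptions for illustration, and the sketch assumes single-column primary keys.

```python
def association_type(column: str, pk_prev: str, pk_next: str) -> str:
    """Classify the edge T_{i-1} --C_i--> T_i.

    column  -- C_i, the column linking the two tables
    pk_prev -- primary key column of T_{i-1}
    pk_next -- primary key column of T_i
    """
    is_pk_prev = column == pk_prev
    is_pk_next = column == pk_next
    if is_pk_prev and is_pk_next:
        return "one-to-one"      # case B
    if is_pk_prev:
        return "one-to-many"     # case A
    if is_pk_next:
        return "many-to-one"     # case C
    return "many-to-many"        # case D


# Hypothetical examples: the linking column is a primary key only in the target table,
# and then in neither table.
kind_info = association_type("TrainID", pk_prev="MessageID", pk_next="TrainID")
kind_delay = association_type("StationID", pk_prev="MessageID", pk_next="DelayID")
```

Case B is tested first because it satisfies the conditions of both A and C.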
Optionally, in a specific embodiment, the connection path $P_k$ traversed for each entity in the inter-table relationship graph is formed by connecting k edges M of the inter-table relationship graph in sequence, and is defined as follows:
Figure BDA0002404405450000131
wherein $T_{i-1}$ and $T_i$ represent tables in the database, $C_i$ is the column linking tables $T_{i-1}$ and $T_i$, i and k are positive integers, i is any positive integer from 2 to (k-1), $T_0$ represents the main table, $T_i$ represents a secondary table, and C represents an attribute column of the last secondary table $T_k$ in the connection path.
Optionally, in a specific embodiment, the traversal module is specifically configured to:
take the node corresponding to the main table as the starting point, traverse the inter-table relationship graph according to the connection path $P_i$, and generate, for each entity in the main table, a relationship tree corresponding to the connection path to the secondary table;
based on the traversal depth of the inter-table relationship graph, perform a grouping operation on the relationship tree corresponding to each entity, so as to collect the inter-table relationship data between each entity in the main table and the secondary tables;
wherein the root node of the relationship tree corresponds to an entity in the main table, the leaf nodes of the relationship tree correspond to the attribute column C of the secondary table $T_k$ collected by traversing the connection path $P_i$, and a child node with traversal depth i corresponds to the foreign key column $C_i$ of the secondary table $T_i$ collected by traversing the connection path $P_i$.
Optionally, in a specific embodiment, the apparatus for constructing features based on entity relationships further includes:
a feature checking module, configured to check whether repeated features exist among the constructed features; if repeated features exist, delete the repeated features and use a chi-square test to check whether each feature is correlated with the target variable; if no repeated features exist, use a chi-square test to check whether each feature is correlated with the target variable; and if a correlation exists and the chi-square value is greater than the preset chi-square threshold, retain the feature, and otherwise delete the feature.
Optionally, in a specific embodiment, the conversion function at least includes: any one or more of an averaging function, a maximum function, a minimum function, a sum function, a difference function, and a product function.
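As an illustration of how such conversion functions might turn collected values into features, the sketch below applies each function to a group of numeric values. The names `CONVERSIONS` and `build_features`, the feature naming scheme and the sample values are assumptions; in particular, the application does not define the difference function, so it is taken here to mean the range (maximum minus minimum).

```python
import math
from statistics import mean
from typing import Callable, Dict, List

CONVERSIONS: Dict[str, Callable[[List[float]], float]] = {
    "mean": mean,
    "max": max,
    "min": min,
    "sum": sum,
    "difference": lambda xs: max(xs) - min(xs),  # assumption: range of the values
    "product": math.prod,
}


def build_features(collected: Dict[str, List[float]]) -> Dict[str, float]:
    """Apply every conversion function to every group of collected values."""
    features: Dict[str, float] = {}
    for group, values in collected.items():
        if not values:
            continue  # nothing was collected along this path for the entity
        for name, fn in CONVERSIONS.items():
            features[f"{name}({group})"] = fn(values)
    return features


# Hypothetical values collected for one entity along one connection path.
FEATS = build_features({"delay.Delay": [5.0, 12.0]})
```

For the sample input this yields six features, for example `mean(delay.Delay)` equal to 8.5 and `product(delay.Delay)` equal to 60.0.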
The embodiments of the device for constructing features based on entity relationships follow the same description as the embodiments of the method for constructing features based on entity relationships in the present application, and are therefore not described in detail here.
In this embodiment, the association relationship between the data tables is sorted out through the inter-table relationship diagram, then data related to each entity is collected based on the inter-table relationship diagram, and finally, features are constructed based on the collected data. Characteristic data are collected based on the relation graph between tables, so that the characteristics of the data can be expressed in multiple dimensions on the whole, and the modeling success rate is improved. In addition, the embodiment does not need human participation in the feature construction process of the relational database, and can serve different data sets, so that the data modeling efficiency can be improved, a manager can be helped to make a decision quickly at a low cost, and the quick development of enterprise business is supported.
The present application also provides a non-volatile computer-readable storage medium.
In this embodiment, the computer-readable storage medium stores a feature construction program, and when the feature construction program is executed by a processor, the steps of the entity relationship based feature construction method according to any one of the above embodiments are implemented. For the method implemented when the feature construction program is executed by the processor, reference may be made to the embodiments of the entity relationship based feature construction method of the present application, so the description is not repeated here.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes several instructions for enabling a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the drawings, but the present application is not limited to the above-mentioned embodiments, which are only illustrative and not restrictive, and those skilled in the art can make many changes and modifications without departing from the spirit and scope of the present application and the protection scope of the claims, and all changes and modifications that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (10)

1. A feature construction method based on entity relationship is characterized by comprising the following steps:
acquiring a main table and a plurality of auxiliary tables associated with the main table in a relational database, wherein the main table is provided with a primary key column and a plurality of foreign key columns, each entry in the main table corresponds to an entity, and the auxiliary tables are associated with the main table through the foreign keys of the main table;
constructing a directed inter-table relationship graph by taking the main table and the auxiliary tables as nodes and taking the pairwise association relationships among the main table and the auxiliary tables as edges;
traversing the relationship graph between the tables by taking the corresponding nodes of the main table as starting points to acquire relationship data between each entity in the main table and the corresponding auxiliary table;
and performing conversion calculation on the relation data between the tables based on a preset conversion function to construct the characteristics corresponding to each entity in the main table.
2. The method for constructing features based on entity relationship as claimed in claim 1, wherein the edge M of the relationship graph between tables is defined as follows:
Figure FDA0002404405440000011
wherein $T_{i-1}$ and $T_i$ are tables in the database, $C_i$ is the column linking tables $T_{i-1}$ and $T_i$, and i is a positive integer;
A. when $C_i$ is the primary key of $T_{i-1}$, $T_{i-1}$ and $T_i$ are in a one-to-many association relationship;
B. when $C_i$ is the primary key of both $T_{i-1}$ and $T_i$, $T_{i-1}$ and $T_i$ are in a one-to-one association relationship;
C. when $C_i$ is the primary key of $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-one association relationship;
D. when $C_i$ is the primary key of neither $T_{i-1}$ nor $T_i$, $T_{i-1}$ and $T_i$ are in a many-to-many association relationship.
3. The method of claim 2, wherein the connection path $P_k$ traversed for each entity in the inter-table relationship graph is formed by connecting k edges M of the inter-table relationship graph in sequence, and is defined as follows:
Figure FDA0002404405440000012
wherein $T_{i-1}$ and $T_i$ represent tables in the database, $C_i$ is the column linking tables $T_{i-1}$ and $T_i$, i and k are positive integers, i is any positive integer from 2 to (k-1), $T_0$ represents the main table, $T_i$ represents a secondary table, and C represents an attribute column of the last secondary table $T_k$ in the connection path.
4. The method for constructing features based on entity relationships according to claim 3, wherein traversing the inter-table relationship graph with the primary table corresponding node as a starting point to collect the inter-table relationship data between each entity in the primary table and the corresponding secondary table comprises:
taking the node corresponding to the main table as the starting point, traversing the inter-table relationship graph according to the connection path $P_i$, and generating, for each entity in the main table, a relationship tree corresponding to the connection path to the secondary table;
based on the traversal depth of the inter-table relationship graph, performing a grouping operation on the relationship tree corresponding to each entity, so as to collect the inter-table relationship data between each entity in the main table and the secondary tables;
wherein the root node of the relationship tree corresponds to an entity in the main table, the leaf nodes of the relationship tree correspond to the attribute column C of the secondary table $T_k$ collected by traversing the connection path $P_i$, and a child node with traversal depth i corresponds to the foreign key column $C_i$ of the secondary table $T_i$ collected by traversing the connection path $P_i$.
5. The method for constructing characteristics based on entity relationships according to any one of claims 1 to 4, wherein after the step of performing conversion calculation on the relationship data between tables based on the preset conversion function to construct the characteristics corresponding to each entity in the main table, the method further comprises:
checking whether repeated features exist in the constructed features;
if repeated features exist, deleting the repeated features, and using a chi-square test to check whether each feature is correlated with the target variable;
if no repeated features exist, using a chi-square test to check whether each feature is correlated with the target variable;
if a correlation exists and the chi-square value is greater than the preset chi-square threshold, retaining the feature; otherwise, deleting the feature.
6. The entity relationship based feature construction method according to claim 1, wherein the conversion function at least comprises: any one or more of an averaging function, a maximum function, a minimum function, a sum function, a difference function, and a product function.
7. An entity relationship-based feature construction apparatus, comprising:
an acquisition module, configured to acquire a main table and a plurality of auxiliary tables associated with the main table in a relational database, wherein the main table is provided with a primary key column and a plurality of foreign key columns, each entry in the main table corresponds to an entity, and the auxiliary tables are associated with the main table through the foreign keys of the main table;
the table-to-table relational graph building module is used for building a directed table-to-table relational graph by taking the main table and the auxiliary table as nodes and taking the association relationship between every two of the main table and the auxiliary table as an edge;
the traversal module is used for traversing the inter-table relationship graph by taking the corresponding node of the main table as a starting point so as to acquire the inter-table relationship data of each entity in the main table and the corresponding auxiliary table;
and the characteristic construction module is used for performing conversion calculation on the relation data between the tables based on a preset conversion function so as to construct the characteristics corresponding to each entity in the main table.
8. The entity relationship based feature construction apparatus according to claim 7, wherein the conversion function comprises at least: any one or more of an averaging function, a maximum function, a minimum function, a sum function, a difference function, and a product function.
9. An entity relationship based feature construction device, characterized in that the entity relationship based feature construction device comprises a memory, a processor and a feature construction program stored on the memory and executable on the processor, the feature construction program when executed by the processor further implementing the steps of the entity relationship based feature construction method according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a feature construction program, which when executed by a processor further implements the steps of the entity relationship based feature construction method according to any one of claims 1 to 6.
CN202010156947.7A 2020-03-09 2020-03-09 Method, device and equipment for constructing characteristics based on entity relationship and storage medium Pending CN111460047A (en)


Publications (1)

Publication Number Publication Date
CN111460047A true CN111460047A (en) 2020-07-28

Family

ID=71682639


Country Status (1)

Country Link
CN (1) CN111460047A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination