CN110245270A - Graph model-based data lineage (blood relationship) storage method, system, medium and device

Info

Publication number: CN110245270A
Application number: CN201910385135.7A
Authority: CN
Inventors: 陈政; 潘强; 蔡灿; 张翼飞
Original assignee: Chongqing Tianpeng Network Co Ltd
Current assignee: Chongqing Tianpeng Network Co Ltd
Priority date: 2019-05-09
Filing date: 2019-05-09
Publication date: 2019-09-17

Abstract

The present invention provides a graph model-based data lineage (blood relationship) storage method, comprising: parsing the SQL statements in a data processing script; creating an initial graph model; associating the parsing result with the initial graph model; and repeating the above operations to traverse the SQL statements in all data processing scripts and generate a lineage graph model. With the scheme of the embodiments of the present invention, first, the graph model is used to store data directly in a graph database as the nodes, relationships, and attributes of a graph, so no complex relational table structure has to be designed in advance, which greatly reduces the design difficulty and complexity of such scenarios. Second, thanks to the in-memory computing mechanism and optimized data structures of the graph database Neo4j, even with massive data, statistics such as the number of upstream and downstream lineage levels and the number of dependent tables can be computed within a few milliseconds, and data fields and table dependencies can be retrieved quickly.

Description

Graph model-based data lineage (blood relationship) storage method, system, medium and device

Technical Field

The invention relates to the technical field of software, and in particular to a graph model-based data lineage storage method, system, storage medium, and electronic device.

Background

In the prior art, a data warehouse generates a large number of data tables and a large volume of data in order to support different services. When solving data quality problems, cleaning up redundant data, or investigating data flow links, it is difficult to sort out the lineage dependencies among such a large amount of data quickly. Data lineage is usually stored as manual records or in a relational database such as MySQL; however, this approach is highly complex and error-prone, cannot support complex data lineage analysis, and struggles to meet performance requirements on large-scale data.

Therefore, through long-term research and development, the inventors have studied the storage of data lineage extensively and propose a graph model-based data lineage storage method to solve at least one of the above technical problems.

Disclosure of Invention

The present invention is directed to a graph model-based data lineage storage method, system, medium, and electronic device that can solve at least one of the above-mentioned technical problems. The specific scheme is as follows:

According to a first aspect of the present invention, there is provided a graph model-based data lineage storage method, including:

parsing the SQL statements in a data processing script;

creating an initial graph model;

associating the parsing result with the initial graph model;

and repeating the above operations, traversing the SQL statements in all data processing scripts, and generating a lineage graph model.

Wherein, after the SQL statements in the data processing script are parsed, the method comprises:

acquiring the data source table name and field names, the data target table name and field names, and the relationship between the data source table fields and the data target table fields.

Wherein, the creating of the initial graph model specifically includes:

creating the initial graph model in the graph database Neo4j.

Wherein the associating of the parsing result with the initial graph model comprises:

taking the data source table field names and the data target table field names respectively as nodes of the initial graph model, and writing the nodes into the graph database Neo4j.

Wherein the associating of the parsing result with the initial graph model further comprises:

taking the data source table name and the data target table name respectively as attributes of the nodes of the initial graph model, and writing the attributes into the graph database Neo4j;

and taking the relationship between the data source table field names and the data target table field names as edges of the initial graph model, and writing the edges into the graph database Neo4j.

Wherein the method further comprises visually displaying the lineage graph model.

Wherein the lineage graph model is a mesh graph that takes one node as its center, with the other nodes associated with the center node, and nodes are distinguished by color according to their lineage depth.

Wherein the node information in the lineage graph model comprises: table name, number of upstream layers, number of downstream layers, number of upstream tables, number of downstream tables, number of direct upstream tables, number of direct downstream tables, and a direct downstream table field list.

Wherein each entry in the direct downstream table field list includes the relationship between a data source table field name and a data target table field name.

According to a second aspect of the present invention, there is provided a graph model-based data lineage storage system, including:

a parsing module, configured to parse the SQL statements in a data processing script;

a creation module, configured to create an initial graph model in the graph database Neo4j;

a node writing module, configured to write the data source table field names and the data target table field names into the graph database Neo4j as nodes of the initial graph model, respectively;

an attribute writing module, configured to write the data source table name and the data target table name into the graph database Neo4j as attributes of the nodes of the initial graph model;

a relationship writing module, configured to write the relationship between the data source table field names and the data target table field names into the graph database Neo4j as edges of the initial graph model;

and a traversal module, configured to traverse and parse the SQL statements in all data processing scripts to generate a lineage graph model.

The parsing module is further configured to obtain the data source table name and field names, the data target table name and field names, and the relationship between the data source table fields and the data target table fields.

According to a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the graph model-based data lineage storage method described in any one of the above.

According to a fourth aspect of the present invention, there is provided an electronic device including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the graph model-based data lineage storage method described in any one of the above.

According to the scheme of the embodiments of the invention, first, the graph model is used to store data directly in the graph database as the nodes, relationships, and attributes of a graph, so no complex relational table structure has to be designed in advance, which greatly reduces the design difficulty and complexity of such scenarios. Second, thanks to the in-memory computing mechanism and optimized data structures of the graph database Neo4j, even with a large amount of data, statistics such as the number of upstream and downstream lineage levels and the number of dependent tables can be computed within a few milliseconds, and data fields and table dependencies can be retrieved quickly.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:

FIG. 1 is a flow chart of a graph model-based data lineage storage method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a graph model-based data lineage storage method according to another embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a graph model-based data lineage storage system according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.

It should be understood that although the terms first, second, third, etc. may be used to describe … … in embodiments of the present invention, these … … should not be limited to these terms. These terms are used only to distinguish … …. For example, the first … … can also be referred to as the second … … and similarly the second … … can also be referred to as the first … … without departing from the scope of embodiments of the present invention.

The word "if", as used herein, may be interpreted as "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in the article or device in which the element is included.

In the big data era, data has immeasurable value. The explosive growth of the mobile internet has enabled internet companies to accumulate petabyte-scale user data and business data. Driven by strong demand, big data technology has matured steadily, and massive, continuously growing data are recorded through storage components such as HDFS, HBase, MongoDB, and Kafka.

From generation through processing, fusion, and circulation to eventual retirement, data naturally form relationships with one another. These relationships are expressed by analogy with kinship in human society and are called data lineage (literally, the blood relationship of data).

Example 1

Referring to FIG. 1, an embodiment of the present invention provides a graph model-based data lineage storage method, including the following steps:

parsing the SQL statements in a data processing script;

creating an initial graph model;

associating the parsing result with the initial graph model.

Wherein, the creating of the initial graph model specifically includes:

creating the initial graph model in the graph database Neo4j.

The method further includes visually displaying the lineage graph model.

Example 2

Referring to FIG. 2, an embodiment of the present invention provides a graph model-based data lineage storage method, including the following steps:

S1, parsing the SQL statements in a data processing script to obtain the data source table name and field names, the data target table name and field names, and the relationship between the data source table fields and the data target table fields.

Specifically, the data processing script is parsed in some manner to obtain the lineage between the data tables and data fields in the data warehouse, which serves as the data basis for constructing the graph model-based data lineage graph. Since the emphasis of this embodiment is on the storage of the lineage, the script parsing process is not described here. In this embodiment, a parser parses the SQL statements in the data processing script to obtain the data source table name (S_TABLE), the data source table field name (S_COLUMN), the target table name (T_TABLE), the target table field name (T_COLUMN), and the relationship between the data source table field and the data target table field.
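The embodiment leaves the parser unspecified. Purely to illustrate the kind of record S1 is described as producing, the following Python sketch handles a single, simple statement shape (`INSERT INTO t (c1, c2) SELECT a, b FROM s`) with a regular expression. The function name `parse_lineage`, the positional pairing of columns, and the sample table names are assumptions made for illustration; a production parser would need a real SQL grammar.

```python
import re

# Assumption: only "INSERT INTO t (c1, c2) SELECT a, b FROM s" is recognised.
_INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<t_table>\w+)\s*\((?P<t_cols>[^)]*)\)\s*"
    r"SELECT\s+(?P<s_cols>.*?)\s+FROM\s+(?P<s_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)


def parse_lineage(sql):
    """Return one record per column pair, keyed S_TABLE, S_COLUMN, T_TABLE, T_COLUMN."""
    match = _INSERT_SELECT.search(sql)
    if match is None:
        return []  # statement shape not covered by this sketch
    t_cols = [c.strip() for c in match.group("t_cols").split(",")]
    s_cols = [c.strip() for c in match.group("s_cols").split(",")]
    if len(t_cols) != len(s_cols):
        raise ValueError("source and target column lists differ in length")
    # Assumption: the i-th selected column feeds the i-th target column.
    return [
        {
            "S_TABLE": match.group("s_table"),
            "S_COLUMN": s_col,
            "T_TABLE": match.group("t_table"),
            "T_COLUMN": t_col,
        }
        for s_col, t_col in zip(s_cols, t_cols)
    ]


records = parse_lineage(
    "INSERT INTO dw_orders (order_id, amount) SELECT id, total FROM ods_orders"
)
```

Each returned record carries the four names used in the later steps; statements the pattern does not recognise yield an empty list rather than a guess.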

S2, creating an initial graph model in the graph database Neo4j.

Specifically, after the lineage between the data has been parsed, an initial graph model is created according to the data model of the Neo4j graph database, and the data are then stored in the initial graph model.

S3, taking the data source table field names and the data target table field names respectively as nodes of the initial graph model, and writing them into the graph database Neo4j.

Specifically, the data source table field name is set as the node Node_A of the initial graph model and written into the graph database Neo4j; the data target table field name is set as the node Node_B of the initial graph model and written into the graph database Neo4j.

S4, taking the data source table name and the data target table name as attributes of the nodes of the initial graph model, and writing them into the graph database Neo4j.

Specifically, the data source table name is set as an attribute of the node Node_A of the initial graph model and written into the graph database Neo4j; at the same time, the data target table name is set as an attribute of the node Node_B of the initial graph model and written into the graph database Neo4j.

S5, taking the relationship between the data source table field names and the data target table field names as edges of the initial graph model, and writing the edges into the graph database Neo4j.

Specifically, the relationship between the nodes Node_A and Node_B of the initial graph model is set and written into the graph database Neo4j. In this embodiment, an application programming interface is used to specify the connection address and account name of a graph database object (Graph) and to establish a connection with the graph database Neo4j. Then the field names of the data source table and the data target table are modeled as vertices of the initial graph model: a node object is created for each with the create method, and the names of the data source table and the data target table are used as the name attributes of the corresponding node objects. Finally, the create method is used to create a relationship object in the graph database Neo4j, whose first parameter is the data source field, whose second parameter is the relationship description 'to', which indicates the direction, and whose third parameter is the data target field.
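As one possible way to express steps S3 to S5 against Neo4j, the sketch below builds a parameterised Cypher statement for a single parsed record. It swaps the create method described above for Cypher `MERGE`, so that repeated records do not duplicate nodes or edges. The node label `Field`, the relationship type `TO`, the property name `column`, and the helper `to_cypher` are assumptions for illustration; only the `name` attribute holding the table name comes from the description.

```python
# Assumptions: label "Field", relationship type "TO" and property "column"
# are illustrative names; the text only fixes "name" as the table-name attribute.
LINEAGE_CYPHER = (
    "MERGE (a:Field {column: $s_column, name: $s_table}) "
    "MERGE (b:Field {column: $t_column, name: $t_table}) "
    "MERGE (a)-[:TO]->(b)"
)


def to_cypher(record):
    """Map one parsed record to a (query, parameters) pair.

    Node_A is the source field, Node_B the target field, and the edge
    points from source to target.
    """
    params = {
        "s_column": record["S_COLUMN"],
        "s_table": record["S_TABLE"],
        "t_column": record["T_COLUMN"],
        "t_table": record["T_TABLE"],
    }
    return LINEAGE_CYPHER, params


query, params = to_cypher(
    {"S_TABLE": "ods_orders", "S_COLUMN": "id",
     "T_TABLE": "dw_orders", "T_COLUMN": "order_id"}
)
```

The resulting pair can be handed to whichever Neo4j client is in use; passing values as parameters rather than formatting them into the query text avoids quoting problems with unusual table or field names.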

S6, repeating the above steps, traversing and parsing the SQL statements in all data processing scripts, and generating a lineage graph model.

Specifically, steps S1 to S5 are repeated: the SQL statements in all data processing scripts are traversed and parsed, the fields and table names of all data source tables and data target tables are obtained, used as the nodes (Nodes), relationships (Relationships), and attributes (Properties) of the initial graph model, and written into the graph database Neo4j to form one large complete graph, namely the lineage graph model.
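The accumulation described in S6 can be pictured with a small in-memory stand-in for the graph database. This is a sketch under assumptions: the function `build_lineage_graph`, the use of a (table, field) pair as node identity, and the sample records are all illustrative and are not taken from the embodiment.

```python
def build_lineage_graph(records):
    """Fold parsed records into nodes, node attributes and directed edges.

    Assumption: a node is identified by its (table, field) pair, so the
    same field seen in several scripts is stored only once.
    """
    nodes = {}     # (table, field) -> attribute dict
    edges = set()  # ((s_table, s_field), (t_table, t_field))
    for record in records:
        src = (record["S_TABLE"], record["S_COLUMN"])
        dst = (record["T_TABLE"], record["T_COLUMN"])
        nodes.setdefault(src, {"name": record["S_TABLE"]})
        nodes.setdefault(dst, {"name": record["T_TABLE"]})
        edges.add((src, dst))
    return nodes, edges


# Records as they might come out of several scripts, including one repeat.
all_records = [
    {"S_TABLE": "ods_orders", "S_COLUMN": "id",
     "T_TABLE": "dw_orders", "T_COLUMN": "order_id"},
    {"S_TABLE": "dw_orders", "S_COLUMN": "order_id",
     "T_TABLE": "rpt_sales", "T_COLUMN": "order_id"},
    {"S_TABLE": "ods_orders", "S_COLUMN": "id",
     "T_TABLE": "dw_orders", "T_COLUMN": "order_id"},
]
nodes, edges = build_lineage_graph(all_records)
```

Because nodes and edges are keyed by identity, processing the same script twice leaves the graph unchanged, which is the property one would also want from the database writes.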

The lineage graph model is a mesh graph that takes one node as its center, with the other nodes associated with the center node, and nodes are distinguished according to their lineage depth. The node information in the lineage graph model comprises: table name, number of upstream layers, number of downstream layers, number of upstream tables, number of downstream tables, number of direct upstream tables, number of direct downstream tables, and a direct downstream table field list. Each entry in the direct downstream table field list includes the relationship between a data source table field name and a data target table field name.
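The per-node figures listed above can be derived by walking the graph in both directions. The following sketch computes them at table level with a breadth-first search. It assumes that "number of layers" means the greatest shortest-path distance from the table, which the text does not define; the function names and the sample edges are likewise illustrative.

```python
from collections import deque


def _reach(start, adjacency):
    """Breadth-first search returning {table: depth} for tables reachable from start."""
    depth = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        table, d = queue.popleft()
        for nxt in adjacency.get(table, ()):
            if nxt not in seen:
                seen.add(nxt)
                depth[nxt] = d + 1
                queue.append((nxt, d + 1))
    return depth


def table_stats(table, table_edges):
    """Summarise upstream and downstream lineage for one table.

    table_edges is an iterable of (source_table, target_table) pairs.
    """
    down, up = {}, {}
    for src, dst in table_edges:
        down.setdefault(src, set()).add(dst)
        up.setdefault(dst, set()).add(src)
    below = _reach(table, down)
    above = _reach(table, up)
    return {
        "table": table,
        "upstream_layers": max(above.values(), default=0),
        "downstream_layers": max(below.values(), default=0),
        "upstream_tables": len(above),
        "downstream_tables": len(below),
        "direct_upstream_tables": len(up.get(table, ())),
        "direct_downstream_tables": len(down.get(table, ())),
    }


sample_edges = [
    ("ods_orders", "dw_orders"),
    ("ods_users", "dw_orders"),
    ("dw_orders", "rpt_sales"),
]
stats = table_stats("dw_orders", sample_edges)
```

For the sample edges, `dw_orders` has two direct upstream tables and one direct downstream table, each one layer away.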

Further, the graph model-based data lineage storage method comprises visually displaying the lineage graph model. Specifically, data are retrieved using the Cypher query syntax of Neo4j and combined with the front-end framework Vue to complete the visual display.

According to the graph model-based data lineage storage method provided by this embodiment, the initial graph model is used to store data directly in the graph database as the nodes, relationships, and attributes of a graph, so no complex relational table structure has to be designed in advance, which greatly reduces the design difficulty and complexity of such scenarios. Based on the in-memory computing mechanism and optimized data structures of the graph database Neo4j, even with a large amount of data, statistics such as the number of upstream and downstream lineage levels and the number of dependent tables can be computed within a few milliseconds, and data fields and table dependencies can be retrieved quickly. In addition, combined with a well-designed visual interface, a user can quickly view and search data tables, and can see key information such as the lineage flow direction and lineage dependency hierarchy among data by clicking with a mouse, without writing code.
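A query of the kind the display step might issue can be sketched as a Cypher string built in Python. The label `Field`, the relationship type `TO`, the `name` property holding the table name, and the helper `neighbourhood_query` are assumptions for illustration; the embodiment does not give the actual query text.

```python
def neighbourhood_query(max_depth=3):
    """Build a Cypher query returning lineage paths around one table.

    The pattern is undirected, so both upstream and downstream paths of
    up to max_depth hops are matched; the table name is supplied at run
    time through the $table parameter.
    """
    depth = int(max_depth)
    if depth < 1:
        raise ValueError("max_depth must be at least 1")
    # The hop bound is written into the query text as a validated integer.
    return (
        "MATCH p = (n:Field {name: $table})-[:TO*1.." + str(depth) + "]-(m:Field) "
        "RETURN p"
    )


query_text = neighbourhood_query(2)
```

Bounding the path length keeps the result set manageable for rendering; a front end could request deeper levels only when the user expands a node.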

Example 3

Referring to FIG. 3, an embodiment of the invention provides a graph model-based data lineage storage system 200, where the system 200 includes: a parsing module 210, a creation module 220, a node writing module 230, an attribute writing module 240, a relationship writing module 250, and a traversal module 260.

The parsing module 210 is configured to parse the SQL statements in the data processing script to obtain the data source table name and field names, the data target table name and field names, and the relationship between the data source table fields and the data target table fields. Specifically, the parsing module 210 parses the data processing script in some manner to obtain the lineage between the data tables and data fields in the data warehouse, which serves as the data basis for constructing the graph model-based data lineage graph. Since the emphasis of this embodiment is on the storage of the lineage, the script parsing is not described here. In this embodiment, the parsing module 210 uses a parser to parse the SQL statements in the data processing script and obtain the data source table name (S_TABLE), the data source table field name (S_COLUMN), the target table name (T_TABLE), the target table field name (T_COLUMN), and the relationship between the data source table field and the data target table field.

The creation module 220 is configured to create an initial graph model in the graph database Neo4j. Specifically, after the lineage between the data has been parsed, the creation module 220 creates an initial graph model according to the data model of the Neo4j graph database, and the data are then stored in the initial graph model.

The node writing module 230 is configured to write the data source table field names and the data target table field names into the graph database Neo4j as nodes of the initial graph model, respectively. Specifically, the node writing module 230 sets the data source table field name as the node Node_A of the initial graph model and writes it into the graph database Neo4j; it sets the data target table field name as the node Node_B of the initial graph model and writes it into the graph database Neo4j.

The attribute writing module 240 is configured to write the data source table name and the data target table name into the graph database Neo4j as attributes of the nodes of the initial graph model. Specifically, the attribute writing module 240 sets the data source table name as an attribute of the node Node_A of the initial graph model and writes it into the graph database Neo4j; at the same time, it sets the data target table name as an attribute of the node Node_B of the initial graph model and writes it into the graph database Neo4j.

The relationship writing module 250 is configured to write the relationship between the data source table field names and the data target table field names into the graph database Neo4j as edges of the initial graph model. Specifically, the relationship writing module 250 sets the relationship between the nodes Node_A and Node_B of the initial graph model and writes it into the graph database Neo4j. In this embodiment, an application programming interface is used to specify the connection address and account name of a graph database object (Graph) and to establish a connection with the graph database Neo4j. Then the field names of the data source table and the data target table are modeled as vertices of the initial graph model: a node object is created for each with the create method, and the names of the data source table and the data target table are used as the name attributes of the corresponding node objects. Finally, the create method is used to create a relationship object in the graph database Neo4j, whose first parameter is the data source field, whose second parameter is the relationship description 'to', which indicates the direction, and whose third parameter is the data target field.

The traversal module 260 is configured to traverse and parse the SQL statements in all data processing scripts to generate a lineage graph model. Specifically, the traversal module 260 traverses and parses the SQL statements in all data processing scripts to obtain the fields and table names of all data source tables and data target tables, uses them as the nodes (Nodes), relationships (Relationships), and attributes (Properties) of the initial graph model, and writes them into the graph database Neo4j to form one large complete graph, namely the lineage graph model.

Further, the graph model-based data lineage storage system 200 includes a visualization display module 270 for visually displaying the lineage graph model. Specifically, the visualization display module 270 retrieves data using the Cypher query syntax of Neo4j and completes the visual display in combination with the front-end framework Vue. In this embodiment, after the graph model-based data lineage has been constructed, it is queried and analyzed through a matching visual interface.

By default, the lineage graph model displayed on the visual interface is centered on a particular data table node, which is shown in red. All data table nodes related to that node are displayed as a mesh graph and are connected by gray lines with arrows, and the color of the nodes becomes gradually lighter at each level according to the lineage depth. The nodes are displayed as circular icons, each labeled with the name of the data table that the node represents; the data tables comprise data source tables and data target tables. When a node is clicked, a pop-up box displays its related information, including: table name, number of upstream layers, number of downstream layers, number of upstream tables, number of downstream tables, number of direct upstream tables, number of direct downstream tables, and a direct downstream table field list. Clicking a table in the direct downstream table field list expands it to show the relationships between data fields, comprising: the data source table field name, the data target table field name, and a connecting line with a directional arrow.
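The depth-based lightening described above could be computed as a simple blend toward white. The sketch below is one such scheme under assumptions: the function `node_color`, the linear blend, and the choice of pure red as the base color are illustrative, since the text states only that the center is red and that deeper levels are lighter.

```python
def node_color(depth, max_depth, base=(255, 0, 0)):
    """Return a hex color that lightens from base as depth increases.

    depth 0 is the center node and keeps the base color; deeper levels
    are blended toward white using integer arithmetic, and the deepest
    level stops short of pure white so it stays visible.
    """
    max_depth = max(max_depth, 0)
    depth = min(max(depth, 0), max_depth)
    channels = [c + (255 - c) * depth // (max_depth + 1) for c in base]
    return "#{:02x}{:02x}{:02x}".format(*channels)


center = node_color(0, 3)
second_level = node_color(2, 3)
```

With a maximum depth of 3, the center node stays `#ff0000` and a node two levels away becomes `#ff7f7f`.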

The visual interface further comprises a search box for text input. After a table name is entered and the search button is clicked, the lineage graph is redrawn and the lineage graph related to the searched data table is displayed directly. The visualized graph model-based data lineage graph supports operations such as dragging and zooming in and out, so that a user can see at a glance how a given data table flows through the whole data link.

The graph model-based data lineage storage system 200 provided by this embodiment uses the initial graph model to store data directly in the graph database as the nodes, relationships, and attributes of a graph, so no complex relational table structure has to be designed in advance, which greatly reduces the design difficulty and complexity of such scenarios. Based on the in-memory computing mechanism and optimized data structures of the graph database Neo4j, even with a large amount of data, statistics such as the number of upstream and downstream lineage levels and the number of dependent tables can be computed within a few milliseconds, and data fields and table dependencies can be retrieved quickly. In addition, combined with a well-designed visual interface, a user can quickly view and search data tables, and can see key information such as the lineage flow direction and lineage dependency hierarchy among data by clicking with a mouse, without writing code.

Example 4

As shown in FIG. 4, the present embodiment provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to cause the at least one processor to:

parse the SQL statements in a data processing script to obtain the data source table name and field names, the data target table name and field names, and the relationship between the data source table fields and the data target table fields;

create an initial graph model in the graph database Neo4j;

take the data source table field names and the data target table field names respectively as nodes of the initial graph model, and write the nodes into the graph database Neo4j;

take the data source table name and the data target table name respectively as attributes of the nodes of the initial graph model, and write the attributes into the graph database Neo4j;

take the relationship between the data source table field names and the data target table field names as edges of the initial graph model, and write the edges into the graph database Neo4j;

and repeat the above steps, traversing and parsing the SQL statements in all data processing scripts, to generate a lineage graph model.

Example 5

The embodiment of the disclosure provides a non-volatile computer storage medium storing computer-executable instructions that can execute the graph model-based data lineage storage method of any of the method embodiments above.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation of the unit itself; for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
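As a software illustration of how such units might cooperate, the following minimal Python sketch parses the SQL statements in data processing scripts, creates an initial graph model, and associates each parsing result with it. It is a sketch under stated assumptions and not the implementation of the disclosure: the names `parse_statement`, `LineageGraph`, and `build_lineage` are hypothetical, the regular expressions are a simplified stand-in for a real SQL parser, the in-memory sets stand in for the graph database Neo4j, and only table-level (not field-level) relationships are tracked.

```python
import re
from collections import defaultdict

# Assumption: only simple "INSERT INTO/OVERWRITE TABLE <target> SELECT ... FROM <source> [JOIN <source>]"
# statements are handled; a real SQL parser would be needed for anything more complex.
_TARGET_RE = re.compile(r"\binsert\s+(?:into|overwrite\s+table)\s+([\w.]+)", re.IGNORECASE)
_SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def parse_statement(sql):
    """Return (target_table, sorted source_tables) for one statement, or None if no target is found."""
    target = _TARGET_RE.search(sql)
    if target is None:
        return None
    sources = sorted(set(_SOURCE_RE.findall(sql)))
    return target.group(1), sources


class LineageGraph:
    """In-memory stand-in for the graph database: tables are nodes, dependencies are edges."""

    def __init__(self):
        self.nodes = set()
        self.upstream = defaultdict(set)  # target table -> its direct source tables

    def associate(self, target, sources):
        """Merge one parsing result into the graph (sets make this idempotent)."""
        self.nodes.add(target)
        for source in sources:
            self.nodes.add(source)
            self.upstream[target].add(source)

    def all_upstream(self, table):
        """Every table that `table` depends on, directly or transitively."""
        seen, stack = set(), [table]
        while stack:
            for source in self.upstream.get(stack.pop(), ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


def build_lineage(scripts):
    """Traverse all scripts, parse each statement, and build one lineage graph."""
    graph = LineageGraph()  # the "initial graph model"
    for script in scripts:
        # Naive split: assumes no semicolons inside string literals or comments.
        for statement in script.split(";"):
            parsed = parse_statement(statement)
            if parsed is not None:
                graph.associate(*parsed)
    return graph
```

For example, given one script that loads `dw.orders` from `ods.orders` and `ods.customers`, and a second that loads `rpt.daily` from `dw.orders`, calling `build_lineage(scripts).all_upstream("rpt.daily")` returns all three upstream tables. In a deployment backed by Neo4j, `associate` would instead issue writes to the graph database, for instance with Cypher `MERGE` so that repeated runs do not duplicate nodes or relationships.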

Claims

1. A data blood relationship storage method based on a graph model, characterized by comprising the following steps:

parsing the SQL statements in the data processing script;

creating an initial graph model;

associating the parsing result with the initial graph model;

2. The method of claim 1, wherein after parsing the SQL statements in the data processing script, the method further comprises:

3. The method according to claim 2, wherein the creating of the initial graph model specifically comprises:

creating the initial graph model in the graph database Neo4j.

4. The method of claim 3, wherein the associating the parsed result with the initial graph model comprises:

5. The method of claim 3, wherein associating the parsed result with an initial graph model further comprises:

6. The method of claim 3, wherein the associating the parsed result with the initial graph model comprises:

7. A graph model-based data blood relationship storage system, comprising:

8. The system of claim 7, wherein the parsing module is further configured to obtain the data source table names and field names, the data target table names and field names, and the relationships between the fields of the data source tables and the fields of the data target tables.

9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the method according to any one of claims 1 to 6.

10. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1 to 6.