CN111523003A - Data application method and platform with time sequence dynamic map as core - Google Patents


Info

Publication number
CN111523003A
Authority
CN
China
Prior art keywords
data
graph
database
algorithm
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010343863.4A
Other languages
Chinese (zh)
Inventor
闭雨哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tutomos Technology Co ltd
Original Assignee
Beijing Tutomos Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tutomos Technology Co ltd
Priority to CN202010343863.4A
Publication of CN111523003A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3055Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80Database-specific techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Abstract

The invention belongs to the technical field of knowledge graphs, and in particular relates to a data application method and a data application platform with a time sequence dynamic graph as the core. The method comprises the following steps: constructing, in a database, a mixed data storage structure with the time sequence dynamic graph as the core; during data storage, converting data acquired from various data sources and storing it in the database according to the mixed data storage mode; during data use, calling various algorithms to analyze the data in the database; and, while data is being stored and used, monitoring the CPU/memory usage metrics of each server of the database. The invention can store data from different data sources through a mixed storage mode that combines time sequence dynamic and static graphs, can call various algorithms to analyze data when the database is queried, and can monitor the writing and use of the database, thereby providing users with an integrated big data platform that spans time sequence dynamic storage through data computation and analysis.

Description

Data application method and platform with time sequence dynamic map as core
Technical Field
The invention belongs to the technical field of knowledge graphs, and in particular relates to a data application method and a data application platform with a time sequence dynamic graph as the core.
Background
Today, as markets grow in scale and modernization advances, traditional modes of social governance struggle to adapt, and scientific governance by technical means is increasingly important to users. With the arrival of 5G networks, applications that build on Internet of Things infrastructure to achieve comprehensive perception, ubiquitous interconnection, pervasive computing and intelligent fusion face higher requirements for intelligence. A new generation of information technology is being applied across the industries of a city, and this advanced form of urban informatization rests on the next-generation innovation (Innovation 2.0) of the knowledge society. Knowledge graph technology, as a step in the move from big data toward artificial intelligence, expresses complex social structures in the form that most closely fits social behavior and is well suited to analyzing data with association relationships. Such social knowledge is stored in graph databases. At present, 90% of the graph databases in China use one of two technologies, Neo4j and JanusGraph. Both have poor write performance and cannot keep up with real-time reads and writes; importing large volumes of data is cumbersome; they lack integration with the big data and AI technology ecosystems; they are static graph storage systems that do not support time-series data storage; and they cannot support cross-graph/multi-graph association queries, that is, knowledge fusion.
Because the industrial application of graph databases is still in its infancy, and enterprises with relevant business scenarios have only just begun to explore knowledge graph applications in their industries, comparable technologies today only reach the level of a data graph: data in a traditional or distributed database is merely migrated into a graph database and changed into graph form, without achieving the more advanced functions that better fit the data forms produced by the knowledge society. In the Internet of Things era, knowledge graph technology is an important piece of infrastructure: it stores, in real time, the data generated by the large-scale interconnection of everything, and applies big data analysis in a timely manner to support upper-layer decisions. The technical requirements are very high: large-scale real-time data access comes first, followed by large-scale data analysis. Existing technology struggles to achieve the former and lacks the technical ecosystem required for the latter. The one exception on the domestic market is TigerGraph, a self-developed, large-scale commercial real-time graph database (of American and Chinese origin, and likewise a static graph store), which meets timeliness requirements by docking Kafka (a data access technology) for real-time storage; its other main characteristics are broadly the same as those of the graph database technologies on the market.
In summary, China urgently needs a large-scale, real-time, time-series graph storage system that combines static and dynamic graphs and integrates with the big data and artificial intelligence technology ecosystems, so as to empower industries such as finance, telecommunications, government, the Internet of Things, the Internet and insurance, and to realize intelligent digital transformation and upgrading.
Disclosure of Invention
In view of the defects in the prior art, the invention provides a data application method and a data application platform with a time sequence dynamic graph as the core. Data from different data sources can be stored in a mixed storage mode that combines time sequence dynamic and static graphs; when the database is queried, the data can be analyzed by calling graph mining algorithms and deep learning algorithms through docked interfaces; and the writing and use of the database can be monitored, thereby providing users with an integrated big data platform that spans time sequence dynamic storage through data computation and analysis.
In a first aspect, the present invention provides a data application method using a time sequence dynamic graph as a core, including the following steps:
constructing, in a database, a mixed data storage structure with the time sequence dynamic graph as the core;
during data storage, adopting the big data real-time processing technology Flink as the database's data access technology, accessing various data sources, converting the acquired data according to conversion rules, and storing the converted data in the database according to the mixed data storage mode;
during data use, retrieving the required data from the database through various standard workflow interfaces according to the query request of a function call, and, during the call, analyzing the data with a graph mining algorithm and/or a deep learning algorithm to obtain the analyzed business metrics;
and, while data is being stored and used, monitoring the CPU/memory usage metrics of each server of the database.
In a second aspect, the present invention provides a data application platform with a time sequence dynamic graph as a core, which is suitable for the data application method with the time sequence dynamic graph as the core in the first aspect, and includes:
a database, used for storing data according to a mixed data storage mode with the static graph and the time sequence dynamic graph as the core;
a data access unit, used for connecting to various data sources through data interfaces, with the big data real-time processing technology Flink as the database's data access technology, converting data from the various data sources according to conversion rules, and storing the converted data in the database;
a workflow interface unit, used for providing various transmission interfaces through which each application makes function calls on the data in the database;
a graph mining algorithm library, used for storing multiple graph mining algorithms so that a user can call them to analyze data in the database during a function call;
a deep learning docking unit, used for docking the database to a deep learning framework through a Python API, so that data in the database can be called when a user performs data analysis with AI algorithms;
a cluster resource monitoring unit, used for caching data so that a user can connect to the database and execute data queries within milliseconds, and for monitoring the resource usage metrics of each server;
and a graph structure construction unit, used for constructing in the database the mixed data storage structure with the time sequence dynamic graph as the core, updating the existing graph data in the database with that structure, and automatically updating and importing raw data.
Preferably, the database comprises a data storage subsystem, a metadata storage subsystem, a user isolation subsystem and a data index monitoring subsystem;
the data storage subsystem is used for storing relational graph data according to the mixed data storage mode with the static graph and the time sequence dynamic graph as the core;
the metadata storage subsystem is used for storing the graph structure and the graph configuration of each graph; the graph structure comprises the entities, entity dimensions, entity attributes, relationships, relationship dimensions, relationship attributes and data types in the graph, and the graph configuration comprises the key information for connecting to the database when each graph is instantiated and the system resources allocated to each graph;
the user isolation subsystem is used for storing data separately, with the user permission as the identifier; when the database receives a user operation, a permission query is triggered, and if the user's permission matches the corresponding user permission held in memory, the user can operate on the graph data within the scope of that permission;
and the data index monitoring subsystem is used for activating a real-time collection and monitoring program when the system is initialized, and for recording and displaying various operation metrics.
Preferably, the mixed data storage mode supports storing multiple kinds of data at the same time, including time-series data, geospatial data, relational data, text data, and graph data that combines dynamic and static graphs.
Preferably, the data sources include a Kafka data source, a RocketMQ data source, an ActiveMQ data source, an HDFS data source and a File data source;
the data interfaces include a Kafka interface, a RocketMQ interface, an ActiveMQ interface, a Hadoop interface and a universal data interface.
Preferably, the data of the data sources comprises real-time data and offline data, and access comprises real-time data access and offline data access. Real-time data access receives data from various message components in real time and stores it in the database after executing user-defined data processing logic. Offline data access reads data in large batches from local files or the distributed file system HDFS and stores it in the database after executing user-defined data processing logic.
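As a non-limiting illustration, the two access paths described above can share one ingestion routine that applies the user-defined processing logic before storing. The following Python sketch is only an assumption-laden toy: `ingest`, `to_element` and `message_stream` are hypothetical names, the list `database` stands in for the graph database, an in-memory CSV stands in for a local or HDFS file, and a generator stands in for a message-component consumer.

```python
import csv
import io


def to_element(row):
    """User-defined processing logic (hypothetical): drop malformed rows."""
    if len(row) != 2 or not row[1].strip():
        return None
    return {
        "group": "person",
        "vertex": row[0].strip(),
        "properties": {"city": row[1].strip()},
    }


def ingest(records, process, store):
    """Apply the processing logic to each record and store what survives."""
    stored = 0
    for record in records:
        element = process(record)
        if element is not None:
            store.append(element)
            stored += 1
    return stored


def message_stream():
    """Stand-in for a message-component consumer (real-time access)."""
    yield ["Zhang San", "Beijing"]
    yield ["broken"]              # malformed message, filtered out
    yield ["Liquan", "Shanghai"]


database = []  # stand-in for the graph database

# Offline access: a batch read from a file-like source.
offline_file = io.StringIO("Wang Wu,Shenzhen\nZhao Liu,\n")
offline_count = ingest(csv.reader(offline_file), to_element, database)

# Real-time access: records arrive one at a time from the stream.
realtime_count = ingest(message_stream(), to_element, database)
```

Both paths differ only in where the records come from; the conversion and storage steps are identical, which is the point the paragraph above makes.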
Preferably, the graph mining algorithm library includes 13 algorithm types: entity classification, entity clustering, graph coarsening, community division, network layering, link prediction, whole-network graph attribute calculation, in-network edge attribute calculation, in-network node attribute calculation, minimum spanning tree, NLP node vectorization, overlapping community detection, and shortest path algorithms.
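To make one of the listed algorithm types concrete, the sketch below shows a shortest path computation on an unweighted graph using breadth-first search. It is an illustrative assumption, not the library's implementation: the function name, the adjacency-dictionary input format and the toy graph are all hypothetical.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one path with the fewest edges from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # visited set and back-pointers in one dict
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the back-pointers to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical toy graph; vertex names are illustrative only.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
path = shortest_path(graph, "A", "E")
```

Breadth-first search suffices here because every edge counts the same; a weighted graph would call for a different technique such as Dijkstra's algorithm.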
Preferably, docking the database to the deep learning framework through the Python API specifically comprises: docking the graph data in the database, at the programming language level, to a deep learning framework developed in the same type of programming language, with the docking interface converting from the Java programming language to the Python programming language.
Preferably, the cluster resource monitoring unit includes a metadata storage and management module, a distributed graph instance storage module, a user permission management module, and a cluster resource monitoring and display module;
the metadata storage and management module is used for caching the graph structure held in the database into memory, with the graph structure in memory kept synchronized with the graph structure in the database;
the distributed graph instance storage module is used for storing each graph instance, after graph initialization, in a distributed cache system, with the graph instance in the distributed cache system kept synchronized with the graph instance in the database;
the user permission management module is used for storing the serialized user names in the distributed cache system, keeping the user permission information in the distributed cache system synchronized with that in the database, and verifying the user's permission when the user submits a database operation; if verification passes, the user can operate on the graph data within the scope of that permission;
and the cluster resource monitoring and display module is embedded in each database it accesses, interacts mainly with metadata and user permission storage and management, and monitors the CPU/memory usage metrics of each database server while it is running.
Preferably, the graph structure building unit includes a visualization UI of a graph structure and a building UI of a graph structure;
the visual UI of the graph structure is used for displaying the existing graph structure, analyzing the graph structure stored in the database into a node relation graph and displaying the node relation graph, and also has the functions of deleting the graph structure and loading a specified file to the specified graph by using the graph structure;
and the construction UI of the graph structure is used for depicting a graph structure according to the original data in a non-programming mode, finishing the whole process of mapping the original data into graph structure data and realizing automatic data updating and data importing of the original data.
According to the above technical solution, data from different data sources can be stored in a mixed storage mode that combines time sequence dynamic and static graphs; when the database is queried, the data can be analyzed by calling graph mining algorithms and deep learning algorithms through docked interfaces; and the writing and use of the database can be monitored, so that an integrated big data platform spanning time sequence dynamic storage through data computation and analysis is provided to the user.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 shows the data application platform with a time sequence dynamic graph as the core in this embodiment;
FIG. 2 shows the data application method with a time sequence dynamic graph as the core in this embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In particular implementations, the terminals described in embodiments of the invention include, but are not limited to, other portable devices such as mobile phones, laptop computers, or tablet computers having touch sensitive surfaces (e.g., touch screen displays and/or touch pads). It should also be understood that in some embodiments, the device is not a portable communication device, but is a desktop computer having a touch-sensitive surface (e.g., a touch screen display and/or touchpad).
Embodiment 1:
This embodiment provides a data application platform with a time sequence dynamic graph as the core, which is suitable for the data application method with the time sequence dynamic graph as the core described in Embodiment 2. As shown in FIG. 1, the platform includes:
A database, used for storing data according to a mixed data storage mode with the time sequence dynamic graph as the core. The mixed data storage mode supports storing multiple kinds of data at the same time, including time-series data, geospatial data, event data, machine learning features, relational data, text data, and graph data that combines dynamic and static graphs.
A data access unit, used for connecting to various data sources through data interfaces, converting data from the various data sources according to conversion rules, and storing the converted data in the database, with the big data real-time processing technology Flink as the database's data access technology.
A workflow interface unit, used for providing various transmission interfaces through which each application makes function calls on the data in the database.
A graph mining algorithm library, used for storing multiple graph mining algorithms so that a user can call them to analyze data in the database during a function call.
A deep learning docking unit, used for docking the database to a deep learning framework through a Python API, so that data in the database can be called when a user performs data analysis with AI algorithms.
A cluster resource monitoring unit, used for caching data so that a user can connect to the database and execute data queries within milliseconds, and for monitoring the resource usage metrics of each server.
A graph structure construction unit, used for constructing in the database the mixed data storage structure with the time sequence dynamic graph as the core, updating the existing graph data in the database with that structure, and automatically updating and importing raw data.
The database of this embodiment comprises a data storage subsystem, a metadata storage subsystem, a user isolation subsystem and a data index monitoring subsystem.
The data storage subsystem is used for storing relational graph data according to the mixed data storage mode with the time sequence dynamic graph as the core. The metadata storage subsystem is used for storing the graph structure and the graph configuration of each graph; the graph structure comprises the entities, entity dimensions, entity attributes, relationships, relationship dimensions, relationship attributes and data types in the graph, and the graph configuration comprises the key information for connecting to the database when each graph is instantiated and the system resources allocated to each graph. The user isolation subsystem is used for storing data separately, with the user permission as the identifier; when the database receives a user operation, a permission query is triggered, and if the user's permission matches the corresponding user permission held in memory, the user can operate on the graph data within the scope of that permission. The data index monitoring subsystem is used for activating a real-time collection and monitoring program when the system is initialized, and for recording and displaying various operation metrics.
The data storage subsystem is developed mainly by deeply optimizing and extending the kernel of Apache Accumulo (an open-source distributed key-value store built on the open-source distributed file system of Apache Hadoop), extending its original capability of storing only relational data so that it also supports graph storage. It comprises a data storage module, a graph data module, a graph function module, a serialization module and a data operation module.
The data storage module is mainly implemented as a multi-dimensional entity-relationship property graph (comprising static, dynamic, time-series and relationship dimensions), and each piece of data is stored in columns.
The storage form of an entity's static dimension is RowKey + ColumnFamily + Value. The entity name is the RowKey (for example, the RowKey of the entity Zhang San is "Zhang San"), the entity's dimension is stored as the ColumnFamily, and the specific attribute values of that dimension are stored together in the Value column. For example, if the daily body temperature (a dimension) of the entity Zhang San is 36.5, the Value is serialized by the data class of the corresponding attribute and stored in the database as a serialized Java object.
If the entity has a dynamic dimension, the storage form is RowKey + ColumnFamily + ColumnQualifier + Value. Unlike the static dimension, a ColumnQualifier column is added to hold the detailed aggregation rule: the aggregation rule specified in the graph structure becomes the specific ColumnQualifier column, and the attributes in Value change dynamically. When data streams in, the specified aggregation method is applied according to the ColumnQualifier rule to aggregate the incoming attributes with the historical attributes, and the result is serialized and stored again.
If the entity has a time-series dimension, the storage form is likewise RowKey + ColumnFamily + ColumnQualifier + Value. Unlike the static dimension, the aggregation time interval specified in the graph structure becomes the specific ColumnQualifier column, and the attributes in Value change dynamically. When data streams in, the aggregation time interval in the ColumnQualifier is used as a GroupBy operator, and the specified aggregation method is applied to aggregate with the historical attributes and serialize the result for storage again.
The storage form of the relationship dimension is RowKey + ColumnFamily + Value, where the RowKey is the combination of start entity + end entity + edge direction. For example, for the relationship "Zhang San" -(relation)-> "Liquan", the RowKey is "Zhang San + Liquan + directed edge". The dimension to which the relationship belongs is stored as the ColumnFamily, and the relationship's attribute information is stored in the same way as entity attributes. If the relationship has a dynamic dimension or a time-series dimension, its storage, apart from the RowKey, is the same as for an entity.
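The key layouts described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: a plain dictionary stands in for the column store, the time-series ColumnQualifier is assumed to be encoded as the start time of the aggregation bucket, and the helper names and the "steps" dimension are hypothetical.

```python
from datetime import datetime, timezone


def static_key(entity, dimension):
    # Static dimension: RowKey + ColumnFamily (the Value is stored under it).
    return (entity, dimension)


def time_series_key(entity, dimension, interval_seconds, timestamp):
    # Time-series dimension: the aggregation interval plays the role of the
    # ColumnQualifier. Assumption: it is encoded as the bucket start time.
    bucket = int(timestamp.timestamp()) // interval_seconds * interval_seconds
    return (entity, dimension, bucket)


def edge_row_key(source, destination, directed):
    # Relationship dimension: start entity + end entity + edge direction.
    direction = "directed" if directed else "undirected"
    return f"{source}+{destination}+{direction}"


def upsert(table, key, value, aggregate):
    # Aggregate the incoming value with the historical one, then store again.
    table[key] = aggregate(table[key], value) if key in table else value


table = {}  # stand-in for the column store

# Static dimension: the latest value simply replaces the old one.
upsert(table, static_key("Zhang San", "body_temperature"), 36.5,
       lambda old, new: new)

# Time-series dimension: values falling in the same 60-second bucket are summed.
t0 = datetime(2020, 4, 27, 8, 0, 10, tzinfo=timezone.utc)
t1 = datetime(2020, 4, 27, 8, 0, 50, tzinfo=timezone.utc)
t2 = datetime(2020, 4, 27, 8, 1, 5, tzinfo=timezone.utc)
for ts, steps in [(t0, 10), (t1, 5), (t2, 7)]:
    upsert(table, time_series_key("Zhang San", "steps", 60, ts), steps,
           lambda old, new: old + new)
```

Grouping by the bucket key is what lets the interval act as a GroupBy operator: the first two readings share a bucket and are aggregated, while the third starts a new one.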
The graph data module mainly defines entity and relationship data. Entity data is expressed as a dimension "group", a vertex and attributes "properties". Relationship data is expressed as a dimension "group", a source point "source", an end point "destination", a direction, a matched vertex "matchedVertex" and attributes "properties". The dimension, vertex, source point and end point are all encoded as character strings, and the direction of a relationship is either directed or undirected. The attributes of entities and relationships are stored using java.util.Map and, once retrieved, are converted into a Java Properties class for the user.
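The entity and relationship definitions above map naturally onto two small record types. The Python sketch below is illustrative only: the field names `group`, `source`, `destination`, `matchedVertex` and `properties` come from the text, while `vertex`, `directed`, the snake_case spelling `matched_vertex` and the `connects` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Entity:
    group: str                       # dimension
    vertex: str                      # vertex identifier, encoded as a string
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    group: str                       # dimension
    source: str                      # source point
    destination: str                 # end point
    directed: bool = True            # assumed field name for the direction
    matched_vertex: Optional[str] = None  # "matchedVertex" in the text
    properties: Dict[str, Any] = field(default_factory=dict)

    def connects(self, a: str, b: str) -> bool:
        """True if this edge can be traversed from a to b."""
        if self.source == a and self.destination == b:
            return True
        # An undirected edge can also be traversed in reverse.
        return (not self.directed
                and self.source == b and self.destination == a)


person = Entity("person", "Zhang San", {"body_temperature": 36.5})
knows = Edge("knows", "Zhang San", "Liquan", directed=True)
colleague = Edge("colleague", "Zhang San", "Liquan", directed=False)
```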
The graph function module comprises the submodules Core, Config, Function, Predicates and Types. The Core module mainly implements format conversion for the basic elements of the graph, such as Entity, Edge and the JSON interface data. The Config module is implemented as the configuration information class of the graph and mainly encapsulates storage attribute information of the underlying storage system. The Function module implements the built-in functions of the database, such as the common mathematical and statistical functions Sum, Max and Count. The Predicates module implements the built-in predicate functions of the database, such as the common comparison functions greater than, less than and equal to. The Types module implements the data type definitions of the graph data, which also appear in the definition of the graph structure.
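The distinction between the Function and Predicates submodules can be illustrated with two small lookup tables. This Python sketch is an assumption-based toy, not the database's API: the registry names, `apply_function` and `select` are hypothetical, and only the function and comparison names are taken from the text.

```python
import operator

# Built-in statistical functions: reduce a collection to a single value.
FUNCTIONS = {"Sum": sum, "Max": max, "Count": len}

# Built-in predicates: compare a value against a threshold, yielding a bool.
PREDICATES = {
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "equal_to": operator.eq,
}


def apply_function(name, values):
    """Look up a built-in function by name and apply it to the values."""
    return FUNCTIONS[name](list(values))


def select(values, predicate, threshold):
    """Keep only the values for which the named predicate holds."""
    test = PREDICATES[predicate]
    return [v for v in values if test(v, threshold)]


temperatures = [36.5, 37.2, 38.1]
count = apply_function("Count", temperatures)
feverish = select(temperatures, "greater_than", 37.0)
```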
The serialization module serializes each element of the graph data according to its data type, which benefits fast storage and data compression. It covers common data type serialization and JSON data serialization. Common data type serialization provides serializers for basic data types such as String, Date, Double, Float, Integer, Long, Set, Boolean, Roaring Bitmap, Sketch, Java and others, all inheriting from the serialization classes built into the Java language. JSON data serialization converts all database statement interfaces to and from JSON format, using Java's ObjectMapper class.
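The type-driven round trip described above can be sketched with Python's standard `json` module. This is only an analogy to the Java implementation the text describes: the `"@type"` tagging convention and the function names are assumptions made for the illustration.

```python
import json
from datetime import date


def to_json(element):
    """Serialize an element, tagging types that JSON cannot express natively."""
    def encode(value):
        if isinstance(value, date):
            return {"@type": "Date", "value": value.isoformat()}
        if isinstance(value, (set, frozenset)):
            return {"@type": "Set", "value": sorted(value)}
        raise TypeError(f"no serializer for {type(value).__name__}")
    return json.dumps(element, default=encode, sort_keys=True)


def from_json(text):
    """Deserialize, restoring tagged values to their original types."""
    def decode(obj):
        if obj.get("@type") == "Date":
            return date.fromisoformat(obj["value"])
        if obj.get("@type") == "Set":
            return set(obj["value"])
        return obj
    return json.loads(text, object_hook=decode)


element = {
    "group": "person",
    "vertex": "Zhang San",
    "properties": {"birthday": date(1990, 1, 1), "tags": {"a", "b"}},
}
restored = from_json(to_json(element))
```

Choosing the serializer by data type is what keeps the round trip lossless: without the type tag, a date would come back as a plain string and a set as a list.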
The data operation module comprises a series of graph query operation statements, such as OperationChain, AddElement, Count, Comparator, DataSeed, Export, Function, Generator, GetElement, Job, Join, Output, If, While, ForEach, Map, Reduce, Limit and Variable. The OperationChain statement executes a series of ordered operations in sequence in a single run and passes results between operations: each operation receives the output of the previous operation, performs its computation, and passes its result to the next operation. The AddElement statement adds elements to the database. Before adding, data is generally processed into elements (Entity/Edge) with GenerateElements (a data processor), followed by an AddElements operation that writes them to the database. AddElements is the basic data ingestion statement. For large-scale real-time data ingestion, operations that receive data from third-party data sources are also implemented: AddElementsFromFile, AddElementsFromHdfs, AddElementsFromKafka, AddElementsFromRockMQ, AddElementsFromSocket, AddElementsSpark Frame and AddElementsSparesSparkFrakGraphFrame. The Count statement counts the number of queried Entities + Edges. The Comparator statement implements the Max (greater), Min (less) and Sort operations. The DataSeed statement implements extracting the EntityId from an Entity and the EdgeId from an Edge.
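The chaining behaviour described for OperationChain, where each operation consumes the previous operation's output, can be sketched as a simple function composition. The Python below is a minimal illustration under assumptions: the function names mirror the statements in the text but are hypothetical stand-ins, and a list plays the role of the database.

```python
from functools import reduce


def operation_chain(*operations):
    """Build a callable that runs the operations in order, piping results."""
    def run(initial=None):
        return reduce(lambda result, op: op(result), operations, initial)
    return run


store = []  # stand-in for the database


def generate_elements(rows):
    # Turn raw rows into Entity-like dictionaries.
    return [
        {"group": "person", "vertex": name, "properties": {"age": age}}
        for name, age in rows
    ]


def add_elements(elements):
    # Write the elements and pass them on unchanged.
    store.extend(elements)
    return elements


def count(elements):
    return len(elements)


chain = operation_chain(generate_elements, add_elements, count)
added = chain([("Zhang San", 30), ("Liquan", 25)])
```

The chain is executed once as a unit, so intermediate results never need to leave the pipeline unless an operation explicitly stores them.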
The Export statement temporarily stores the result of the current operation in the operation chain, either in memory (for small amounts of data) or in AbutionGDB (for large amounts of data), and passes the current result to the next operation. It includes ExportToSet, ExportToAbutionResultCache, and ExportToOtherGraph: ExportToSet is intended for small results and keeps the intermediate result in memory; ExportToAbutionResultCache is intended for large results and keeps the intermediate result in the database; ExportToOtherGraph exports graph data into another graph. The Function statement implements the Aggregate, Filter, and Transform operations, each typically used within an operation chain: Aggregate aggregates the results of previous operations, Filter filters them, and Transform converts them. The Generator statement, a data generator and converter, processes data line by line into the Entity/Edge format of the database. 
The GetElement statement acquires data elements (Entity/Edge) and includes GetAllElements, GetElements, GetAdjacentIds, GetFromEndpoint, GetGraphFrameOfElements, and GetDataFrameOfElements. GetAllElements acquires all specified elements in the database; GetElements acquires the data related to the specified data; GetAdjacentIds acquires the neighbor node data related to the provided data; GetFromEndpoint obtains data from a specified URL over HTTP; GetGraphFrameOfElements obtains graph data as a Spark GraphFrame; GetDataFrameOfElements obtains graph data as a Spark DataFrame table. The Job statement comprises ExecuteJob, GetAllJobDetails, GetJobDetails, GetJobResults, and CancelScheduledJob. ExecuteJob executes a query task whose result is not returned immediately but is stored temporarily in the distributed cache system and can be fetched at any time by JobId; GetAllJobDetails acquires detailed information on all running and historical Jobs; GetJobDetails acquires the execution information of a given Job; GetJobResults obtains the execution result of a given Job by JobId; CancelScheduledJob cancels a graph operation task by JobId. The Join statement connects two Iterable objects and specifies the match and merge methods. Join has three join types: 1. FULL (full join) returns all objects on the key side together with all matching objects on the other side; 2. INNER (inner join) returns all keys that match an object on the other side; 3. OUTER (outer join) returns all keys that do not match an object on the other side. Join has two match methods: 1. ElementMatch matches elements with the same id, group, and group-by attributes; 2. KeyFunctionMatch matches any left and right objects based on two key functions, the first key function applying to whichever side the join is keyed on (the left-hand object for a left-keyed join, the right-hand object otherwise).
The Output statement comprises ToCsv, ToMap, ToArray/ToList/ToSet/ToStream, and ToVertices. ToCsv converts graph data Elements (vertex/edge) into CSV format; ToMap converts graph data Elements (vertex/edge) into key-value Map data; ToArray/ToList/ToSet/ToStream convert Iterable data into the corresponding collection type; ToVertices converts the ids of graph data Elements (vertex/edge) into vertices. The If statement is a conditional operation that tests an input object, or the result of a previous operation, against a provided Java condition. The While statement repeatedly executes the provided operation while a condition holds: it runs while the condition is true and stops when the maximum number of repeats (setMaxRepeats) is reached or the condition becomes false. The condition may take any form that resolves to a boolean value, and a Conditional may also be provided to control the operation based on the input object itself. The ForEach statement runs the provided operation on an Iterable of inputs, that is, for a given Iterable input it runs the provided operation on each input in turn. The Map statement maps the input to the output using a Jfunc function provided by the database. Either an existing system Jfunc function or a user-written function class may be used, provided that the class inherits Jfunc Function<input type, output type>. 
The Reduce statement uses a provided Jfunc function (a binary operator) to merge an Iterable input of any type into a single output value. Either an existing system Jfunc function or a user-written function class may be used, provided that the class inherits Jfunc Function<input type, output type>. The Limit statement limits the query result Iterable to a given number of items. The Variable statement comprises SetVariable and GetVariable: SetVariable sets a global variable in the current operation statement, whose value can be any data object, and GetVariable retrieves that variable in other operation statements.
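The chaining behavior described above, where each operation receives the previous operation's output, can be illustrated with a minimal Python sketch. The helper names (`limit`, `map_each`, `reduce_with`, `count`, `run_chain`) are hypothetical stand-ins for the Limit, Map, Reduce, Count, and OperationChain statements and are not the platform's actual API.

```python
from functools import reduce
from typing import Any, Callable, List

# Hypothetical stand-in: an operation is any callable that takes the
# previous operation's output and returns its own output.
Operation = Callable[[Any], Any]


def limit(n: int) -> Operation:
    """Limit: keep at most n items of the incoming iterable."""
    return lambda items: list(items)[:n]


def map_each(fn: Callable[[Any], Any]) -> Operation:
    """Map: apply fn to every item of the incoming iterable."""
    return lambda items: [fn(item) for item in items]


def reduce_with(fn: Callable[[Any, Any], Any], initial: Any) -> Operation:
    """Reduce: fold the incoming iterable into one value with a binary operator."""
    return lambda items: reduce(fn, items, initial)


def count() -> Operation:
    """Count: number of items in the incoming iterable."""
    return lambda items: sum(1 for _ in items)


def run_chain(operations: List[Operation], seed: Any = None) -> Any:
    """OperationChain: run operations in order, piping each result onward."""
    result = seed
    for operation in operations:
        result = operation(result)
    return result


# Toy edges as (source, destination, weight) tuples.
edges = [("a", "b", 3), ("b", "c", 5), ("c", "d", 7)]

# Limit to two edges, extract the weights, then sum them.
total_weight = run_chain(
    [limit(2), map_each(lambda e: e[2]), reduce_with(lambda x, y: x + y, 0)],
    edges,
)
```

Modelling every statement as a function of the previous result keeps the chain itself trivial; the real platform additionally serializes the chain to JSON, which this sketch omits.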
The metadata storage subsystem stores the graph structure and the graph configuration. The graph structure comprises every entity, entity dimension, entity attribute, relationship dimension, and relationship attribute in the graph, together with their data types. It allows the graphs in the current system and their structures to be obtained quickly, so that the required data can be selected according to the graph structure during data query operations; the graph structure can also be added to, deleted, and modified. The graph configuration store holds the key information used to connect to the database when each graph is instantiated, the allocation of system resources available to each graph, and similar settings.
The user isolation subsystem covers the graphs, metadata, and historical operations generated by each user's operations. This information is stored in distributed memory and is partitioned using the user authority as the identifier. A permission query is triggered whenever the database receives a user operation, and the user can operate the graphs within their jurisdiction only when the user matches the corresponding user authority held in memory.
The data index monitoring subsystem activates a real-time collection and monitoring code block when the underlying storage system is initialized, and records each index and displays it at the front end. The indexes include data intake (Entries/s), data query volume (Entries/s), data intake (MB/s), data query volume (MB/s), the number of stored GraphTables, running state, batch import rate, and the like;
additionally, the main difficulty in implementing cross-graph/multi-graph fusion queries is that different graphs have different graph structures and are likely to contain the same dimension with different attributes, which causes multi-graph merging to fail. To address this, each dimension name is suffixed with the graph name, uniquely identifying the graph to which each piece of data belongs. For example, the dimension name "temporary dimension" is stored as "temporary dimension-testGraph", indicating that all data in this dimension belongs to testGraph. Merging graph structures in this way guarantees that each dimension is unique. The other difficulty is merging the graph data. Here the queried data is returned as a native Java Iterator, and the results of multiple graphs, all being Java Iterators, can be merged directly without conflict because the data of different graphs already differs in graph structure. Once the graph structure and graph data difficulties are solved, the graph structures are stored in the cache system as JSON and a Schema fusion method call is added, so that graph structures can be merged quickly. The merged graph structure is then used to query and filter data by dimension or attribute, which yields graph data from multiple graphs without affecting the data stored in each graph.
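The suffixing and merging steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: schemas are modelled as plain dictionaries, query results as Python iterators standing in for Java Iterators, and the function names are hypothetical.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator

# Assumed schema shape: {dimension name: {attribute name: data type}}.
Schema = Dict[str, Dict[str, str]]


def qualify_schema(graph_name: str, schema: Schema) -> Schema:
    """Suffix every dimension name with the graph name, e.g. 'person-testGraph'."""
    return {f"{dim}-{graph_name}": dict(attrs) for dim, attrs in schema.items()}


def merge_schemas(schemas_by_graph: Dict[str, Schema]) -> Schema:
    """Merge per-graph schemas; suffixed names cannot collide across graphs."""
    merged: Schema = {}
    for graph_name, schema in schemas_by_graph.items():
        merged.update(qualify_schema(graph_name, schema))
    return merged


def merge_results(*results: Iterable) -> Iterator:
    """Concatenate per-graph result iterators lazily, without copying data."""
    return chain(*results)


# Two graphs share the dimension 'person' but with different attributes.
schemas = {
    "testGraph": {"person": {"age": "Integer"}},
    "otherGraph": {"person": {"name": "String"}},
}
merged = merge_schemas(schemas)
combined = list(merge_results(iter(["t1", "t2"]), iter(["o1"])))
```

Because the merge only builds a new dictionary and a lazy iterator, the per-graph schemas and stored data are left untouched, matching the requirement that data storage in each graph is not affected.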
Second, the data sources accessed by the data access unit of this embodiment include Kafka, RocketMQ, ActiveMQ, HDFS, and File data sources. The data from these sources comprises real-time data and offline data, and access is correspondingly divided into real-time data access and offline data access. Real-time data access ingests data from the various message components in real time and stores it in the database after executing user-defined data processing logic. Offline data access ingests data in bulk from a local file or from the distributed file system HDFS and stores it in the database after executing user-defined data processing logic.
The data interfaces adopted by the data access unit comprise a Kafka interface, a RocketMQ interface, an ActiveMQ interface, a Hadoop interface, and a universal data interface. The universal data interface accepts any data that has been processed by custom programming into graph entity or relationship object classes and inserts it into the database: after the underlying database layer receives the graph object classes, it parses the corresponding nodes, dimensions, attributes, and other information and writes them to the database according to the rules of the data storage subsystem. The Kafka interface is developed for connecting to the message middleware Kafka and supports both the older Kafka 0.10 and the newer Kafka 2.x versions. Its main parameters are topic, groupId, bootstrapServers, and elementGenerator, where elementGenerator is a data processing logic program, implementable by custom programming, that specifies how raw data is converted into graph structure data. The RocketMQ interface is developed for connecting to the message middleware RocketMQ. Its main parameters are topic, groupId, nameServerAddrs, and elementGenerator, where elementGenerator has the same role as above. The ActiveMQ interface is developed for connecting to the message middleware ActiveMQ. Its main parameters are topic, groupId, nameServerAddrs, and elementGenerator, where elementGenerator has the same role as above. 
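To make the role of the elementGenerator parameter concrete, the following Python sketch converts raw text records into Entity and Edge objects. The `Entity` and `Edge` classes, the record layout (`caller,callee,seconds`), and the configuration dictionary are all illustrative assumptions rather than the platform's real classes; no message broker is contacted.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Union


@dataclass
class Entity:
    """Hypothetical graph node: a dimension (group) plus a vertex id."""
    group: str
    vertex: str
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class Edge:
    """Hypothetical graph relationship between two vertex ids."""
    group: str
    source: str
    destination: str
    properties: Dict[str, object] = field(default_factory=dict)


def call_record_generator(lines: Iterable[str]) -> Iterator[Union[Entity, Edge]]:
    """Element generator: each 'caller,callee,seconds' line yields 2 entities and 1 edge."""
    for line in lines:
        caller, callee, seconds = line.strip().split(",")
        yield Entity("person", caller)
        yield Entity("person", callee)
        yield Edge("called", caller, callee, {"seconds": int(seconds)})


# Assumed parameter names, mirroring the Kafka interface parameters above.
consumer_config = {
    "topic": "calls",
    "groupId": "graph-loader",
    "bootstrapServers": "localhost:9092",
    "elementGenerator": call_record_generator,
}

# In a real deployment the lines would arrive from the message queue.
elements = list(consumer_config["elementGenerator"](["alice,bob,42"]))
```

Keeping the generator a pure function of the raw records means the same logic can be reused unchanged across the Kafka, RocketMQ, ActiveMQ, and file-based interfaces.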
The Hadoop interface is developed for connecting to the HDFS file system. As with the other interfaces, the data processing logic can be implemented through a programming interface with custom code, while the REST API automates the procedure. The main interface parameters are the schema name, the graph name, and the location of the raw data. The mapping between the raw data and a graph structure already present in the metadata database is obtained from the provided schema name parameter, and the raw data is imported into the graph named by the provided graph name parameter according to that mapping. The raw data may reside in either of two locations, HDFS or the local file system. After connecting to the raw data, the underlying storage places the data processing logic into Hadoop's MapReduce parallel processing framework to execute the bulk import offline, achieving fast import of large volumes of historical data.
The workflow interface unit of the embodiment comprises a graph mining algorithm description module, a graph mining algorithm operation module, a graph application algorithm module, a graph configuration module, a calculation task module, a graph operation statement module, a graph attribute module, a system state module, a system test module and a graph structure module.
The graph mining algorithm description module provides interfaces that return the descriptions and usage information of the various built-in graph algorithms.
The graph mining algorithm operation module comprises interfaces for acquiring all algorithm parameter descriptions, creating a graph computing resource pool, releasing a graph computing resource pool, submitting a graph algorithm task to AbutionGCS, monitoring AbutionGCS Jobs, starting the graph computing service, and stopping the graph computing service. The interface for acquiring all algorithm parameter descriptions retrieves the algorithm parameters specified in the graph mining algorithm description module. The interface for creating a graph computing resource pool uses the big data technology Spark to reserve the fixed amount of memory and number of CPU cores required by the algorithm computation. The interface for releasing a graph computing resource pool releases the resource pool created with Spark. The interface for submitting a graph algorithm task to AbutionGCS moves the queried data into the resource pool environment created by Spark, runs the specified graph mining algorithm task, and finally saves the computation result back to the database as a new graph. The interfaces for starting and stopping the graph computing service start and stop the Spark service.
The graph application algorithm module comprises graph algorithm interfaces for: edges directly associating two node sets, all elements on multi-hop paths between node pairs, attributes of elements of a designated dimension on multi-hop paths between node pairs, edges directly associated within a node set, a link algorithm, and a loop algorithm.
The graph configuration module comprises graph configuration interfaces for: obtaining the number of threads used by the graph, modifying the number of threads available to the graph, obtaining the available filter functions, obtaining the available transform functions, obtaining all serialized fields of a given Java class, obtaining all serialized fields of a given Java class together with their class types, obtaining the available data conversion programs, and obtaining the available filter functions for a given input class.
The calculation task module comprises computation interfaces for: acquiring the execution states of all Jobs, submitting a given query statement to the graph, acquiring the execution state of a submitted query statement, acquiring the execution result of a submitted query statement, and executing a given OperationChain on the graph as a task.
The graph operation statement module comprises graph operation interfaces for: acquiring all functional operations, acquiring detailed information on a specified operation class, acquiring a JSON example of a specified operation class, acquiring all compatible next-step operations for a specified operation, permanently deleting a hidden graph, permanently deleting an online graph, acquiring the names of all hidden graphs, deleting Elements, updating Element attributes, deleting Elements under a specified graph, returning all next-step operations supported by a functional operation, submitting a query and returning the results asynchronously in chunks, returning all functional operations supported as the first step of a query, hiding a graph, exposing a graph, loading CSV data into a graph, loading HDFS data into a graph, acquiring the command for loading HDFS data, synchronizing a local graph instance to the REST API, and executing a given operation on a graph. Executing a given operation on a graph further covers the other graph query statement interfaces, such as AddElements, GetElements, GetAdjacentIds, GetAllElements, GetWalks, Join, Reduce, ForEach, If, While, Map, Filter, Transform, Aggregate, GetAllGraphIds, Limit, and Count. These query statements can be freely combined according to business requirements.
The graph attribute module holds configuration information about the graphs that is reported to the user, such as the data mapping store, the table structure store, the graph computation store, the graph store, the template file path, and the official website of the enterprise.
The system state module comprises a state interface: and returning to the current service state and emptying the cache of all the graphs.
The system test module imports test data for a functional integrity test during initial installation and comprises test interfaces for loading the system test graph and deleting the system test graph.
The graph structure module builds graph structures and data mapping relations and comprises graph structure interfaces for: merging the Schemas of several graphs, acquiring the names of all data aggregation functions, merging the Schemas of all graphs, acquiring the first 15 rows of a specified file, acquiring the names of all supported data types, acquiring the time formatting options, acquiring a Schema, deleting a Schema, acquiring the JSON representation of a specified function, acquiring the description of a specified function, loading CSV data into a graph, storing a Schema and its mapping information, acquiring the names of all Schemas, acquiring the JSON representation of the serializer for a specified data type, and acquiring the names of all data validation functions.
Fourth, the graph mining algorithm library of this embodiment comprises 13 algorithm categories: entity classification algorithms, entity clustering algorithms, graph coarsening algorithms, community division algorithms, network layering algorithms, connection prediction algorithms, whole-graph attribute calculation algorithms, edge attribute calculation algorithms, node attribute calculation algorithms, minimum spanning tree algorithms, NLP node vectorization algorithms, overlapping community detection algorithms, and shortest path algorithms.
The entity classification category implements the Metis and Spectral algorithms, which group entities with similar behavior in the network into one class. Metis partitions the graph using multilevel recursive bisection or multilevel k-way partitioning; Spectral uses local clustering with dimension reduction to classify data graphs that are contiguous and interleaved.
The entity clustering category implements the AffinityPropagation and MarkovCluster algorithms, which group regions of a certain local density in the network into a cluster. AffinityPropagation is a graph clustering algorithm based on the concept of affinity propagation between data points and does not require the number of clusters to be specified; MarkovCluster (Markov clustering) groups entities of the same social circle together according to the distance between points, that is, their circle of activity.
The graph coarsening category implements the LPCoarsening algorithm, which merges regions of a certain local density in the network into a single node and thereby shrinks the network. It is a graph coarsening algorithm based on label propagation.
The community division category implements the Louvain and Pscan algorithms, which divide sub-regions of a certain local cohesion in the network into a community, so that similar members are grouped together. Louvain is a modularity-based community detection algorithm for undirected, unweighted graphs and is suitable for large networks; Pscan is an advanced community detection algorithm based on the density structure of regions and is also suitable for large networks.
The network layering category implements the KDegenerate and KCore algorithms, which compute the hierarchy of nodes in the network. Applications include user classification, layering of similar users, finding the core network, and distinguishing harassment, takeaway, and express-delivery users at different levels. KDegenerate measures the sparseness of a graph by degenerating it: the parameter K is the number of neighbors considered, and the network can be simplified by removing nodes that have at most K neighbors. KCore is a layering algorithm for similar users that recursively peels off nodes whose degree is smaller than k. The innermost layer has the largest k (the strongest cohesion) and k decreases layer by layer outward, which yields the distribution of users across levels.
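The recursive peeling that KCore performs can be sketched as below. This is a plain-Python illustration of the standard k-core definition on an undirected graph, not the platform's distributed implementation; the function name and edge-list input format are assumptions.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Set, Tuple


def k_core(edges: Iterable[Tuple[Hashable, Hashable]], k: int) -> Set[Hashable]:
    """Return the nodes of the k-core: repeatedly peel nodes with degree < k."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)

    changed = True
    while changed:
        changed = False
        # Snapshot the low-degree nodes, then remove them one by one.
        for node in [n for n, nbrs in adjacency.items() if len(nbrs) < k]:
            for neighbor in adjacency.pop(node):
                adjacency[neighbor].discard(node)
            changed = True
    return set(adjacency)


# A triangle (a, b, c) with one pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
core_2 = k_core(toy_edges, 2)
core_3 = k_core(toy_edges, 3)
```

Running the function for increasing k and recording the largest k at which each node survives gives the layer (core number) of every node, which is the layering described above.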
The connection prediction category implements the LinkPredictorByAdamicAdar and LinkPredictorByCommonNeighbours algorithms, which define node similarity from network structure information and compute the similarity of node pairs (the source and destination nodes of an edge). The LinkPredictorByCommonNeighbours algorithm computes the similarity of a connected node pair (src -> dst) from the connection information of their common neighbors; the LinkPredictorByAdamicAdar algorithm uses the degree of each common neighbor as a weight when computing the similarity of a connected node pair (src -> dst), so that common neighbors with small degree contribute more than common neighbors with large degree.
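The two similarity measures can be written down directly from their standard definitions: the common-neighbors score is the size of the shared neighborhood, and the Adamic-Adar score sums 1 / log(degree) over the shared neighbors. The sketch below assumes an undirected graph given as an edge list; the helper names are illustrative only.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Adjacency = Dict[Hashable, Set[Hashable]]


def build_adjacency(edges: Iterable[Tuple[Hashable, Hashable]]) -> Adjacency:
    """Build an undirected adjacency map from an edge list."""
    adjacency: Adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def common_neighbours_score(adjacency: Adjacency, u: Hashable, v: Hashable) -> int:
    """Number of neighbors shared by u and v."""
    return len(adjacency.get(u, set()) & adjacency.get(v, set()))


def adamic_adar_score(adjacency: Adjacency, u: Hashable, v: Hashable) -> float:
    """Sum of 1 / log(degree) over shared neighbors; low-degree neighbors weigh more."""
    shared = adjacency.get(u, set()) & adjacency.get(v, set())
    # A neighbor of degree 1 cannot be shared by two distinct nodes, but guard log(1) anyway.
    return sum(1.0 / math.log(len(adjacency[z])) for z in shared if len(adjacency[z]) > 1)


# a and b share x (degree 2) and y (degree 4).
toy_adjacency = build_adjacency(
    [("a", "x"), ("b", "x"), ("a", "y"), ("b", "y"), ("y", "c"), ("y", "d")]
)
cn = common_neighbours_score(toy_adjacency, "a", "b")
aa = adamic_adar_score(toy_adjacency, "a", "b")
```

In the toy graph the low-degree neighbor x contributes 1 / log 2 while the hub y contributes only 1 / log 4, which shows why small-degree common neighbors dominate the Adamic-Adar score.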
The whole-graph attribute calculation category implements the EigenTrust, NetworkRandomization, and KTruss algorithms, which compute on the complete graph rather than on individual nodes or edges. The EigenTrust algorithm is a trust model: each returned vertex carries a credibility value eigenTrust, and each edge carries a normalized local trust value localTrustValue. The NetworkRandomization algorithm randomly rewires the network nodes, and edge attributes are lost in the process. The KTruss algorithm extracts the subset of data that has local density: the entity attribute 'isKTruss' indicates whether the node belongs to the cohesive network, and an edge attribute 'kTruss' of 0 indicates that the edge is unimportant and can be filtered out;
the edge attribute calculation category implements the SVDPlusPlus, CommonNeighbours, AdamicAdar, TrianglesCount, and EdgesBetweenness algorithms, which compute edge-related attributes using the edge of a node pair (src -> dst) as the basic statistical unit. The SVDPlusPlus algorithm is a technique for detecting missing parts of a graph and completing knowledge; it predicts the similarity between users and commodities, or between users, in order to recommend items of interest to a user (the algorithm model cannot be expressed as a graph). The CommonNeighbours algorithm computes the number of common neighbors of a node pair, that is, the number of friends two friends have in common, and the edge direction can be set to true or false. The AdamicAdar algorithm computes the important edges in the network, namely edges with local cohesion; it is complementary to the VertexEmbeddedness algorithm, one computing nodes and the other computing edges. The TrianglesCount algorithm computes the number of common neighbors of a node pair, that is, the number of friends two friends have in common, and the edge direction can be Either, Out, or In. The EdgesBetweenness algorithm computes how busy each edge in the network is; for example, an edge connecting two groups has higher busyness;
the node attribute calculation category implements OptimalBetweenness, KBetweenness, LocalClusterCoef, Eigenvector, Degree, VertexEmbeddedness, Closeness, HitsHubAuthority, EdmondsBetweenness, and NeighborhoodConnectivity, which are methods for computing the position of nodes in the network. The OptimalBetweenness algorithm is a new, approximately optimal distributed algorithm for computing betweenness centrality; its performance is 3 times that of Edmonds and its complexity is only O(V). The KBetweenness algorithm restricts the path length so that it scales to very large networks and computes betweenness centrality over paths within range K, where K is a parameter. The LocalClusterCoef (local clustering coefficient) algorithm determines the local clustering degree of each node in the graph from triangle counts. The Eigenvector algorithm computes the eigenvector centrality of each node in the graph. The Degree algorithm computes the in-degree and out-degree of each node in the graph. The VertexEmbeddedness algorithm computes the importance of nodes in the network, namely nodes with local cohesion; it is complementary to the AdamicAdar algorithm, one computing nodes and the other computing edges. The Closeness algorithm computes the closeness centrality of each node in the graph. HitsHubAuthority is a web page importance analysis algorithm in which the hub value (Hub) of a page is the sum of the authority values of all pages it links to, and the authority value (Authority) is the sum of the hub values of all pages linking to it. The EdmondsBetweenness algorithm searches for optimum branchings and spanning trees and is a space-parallel algorithm for computing centrality in sparse networks. The NeighborhoodConnectivity algorithm computes the average connectivity of a vertex's neighbors, that is, how richly its neighbors are connected;
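The hub and authority definitions given for HitsHubAuthority translate into a short power iteration. The following Python sketch follows the textbook HITS update with L2 normalization after each round; the function name, the fixed iteration count, and the edge-list input are assumptions, and the platform's own implementation may differ.

```python
from typing import Dict, Hashable, List, Tuple


def hits(
    edges: List[Tuple[Hashable, Hashable]], iterations: int = 50
) -> Tuple[Dict[Hashable, float], Dict[Hashable, float]]:
    """Return (hub, authority) scores for a directed edge list of (src, dst) pairs."""
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Authority of a node: sum of hub scores of nodes linking to it.
        authority = {node: 0.0 for node in nodes}
        for src, dst in edges:
            authority[dst] += hub[src]
        # Hub of a node: sum of authority scores of nodes it links to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += authority[dst]
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in authority.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        authority = {node: v / a_norm for node, v in authority.items()}
        hub = {node: v / h_norm for node, v in new_hub.items()}
    return hub, authority


# h1 links to a and b; h2 links only to a.
hub_scores, authority_scores = hits([("h1", "a"), ("h2", "a"), ("h1", "b")])
```

In this toy graph, page a receives links from both hubs and therefore ends with the highest authority, while h1 links to more authoritative pages than h2 and ends with the higher hub score.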
the minimum spanning tree category implements the GreedyTree, Kruskal, and Boruvka algorithms. The purpose of a minimum spanning tree is to create the most economical connected subgraph, that is, the most efficient set of routes. GreedyTree is a greedy minimum spanning tree algorithm with a specified starting point, for example finding the shortest route by which a postman leaving the post office passes through every street at least once. Kruskal is similar to the Boruvka algorithm and remains effective when the graph contains edges of equal weight; for example, it can plan a road access project that connects any two villages in a whole province. The Boruvka algorithm combines the Prim and Kruskal algorithms and suits applications such as inter-city traffic systems and oil pipeline planning;
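As an illustration of the minimum spanning tree idea, the sketch below implements the classic Kruskal procedure with a union-find structure: edges are taken in order of increasing weight and kept only if they join two previously unconnected components. The function name and the `(weight, u, v)` edge format are assumptions for this example.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

WeightedEdge = Tuple[float, Hashable, Hashable]


def kruskal(nodes: Iterable[Hashable], edges: Iterable[WeightedEdge]) -> List[WeightedEdge]:
    """Return the edges of a minimum spanning forest of an undirected weighted graph."""
    parent: Dict[Hashable, Hashable] = {node: node for node in nodes}

    def find(x: Hashable) -> Hashable:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree: List[WeightedEdge] = []
    for weight, u, v in sorted(edges, key=lambda edge: edge[0]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so it creates no cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Four villages and the cost of building each candidate road.
villages = ["A", "B", "C", "D"]
roads = [(1, "A", "B"), (2, "B", "C"), (3, "A", "C"), (4, "C", "D")]
mst = kruskal(villages, roads)
total_cost = sum(weight for weight, _, _ in mst)
```

The road A-C is skipped because A and C are already connected through B, which is exactly the "most economical connected subgraph" property described above.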
the NLP node vectorization category implements a random walk algorithm, namely the n-order random walk at the core of the node2vec algorithm. Breadth-first search and depth-first search are introduced into the generation of the random walk sequence: BFS explores structural properties of the graph, while DFS explores content similarity (similarity between adjacent nodes). Structurally similar nodes are not necessarily directly connected and may even be far apart. The random walk algorithm accepts parameters for the order, the upper limit on neighbors per step, the longest walk length, the direction of the walk, and the edge weight.
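A second-order biased walk in the style of node2vec can be sketched as follows. The return parameter `p` and in-out parameter `q` come from the published node2vec formulation and are an assumption here, since the text above lists only order, neighbor limit, path length, direction, and edge weight as parameters; the graph is unweighted and undirected for simplicity.

```python
import random
from typing import Dict, Hashable, List, Optional, Set


def node2vec_walk(
    adjacency: Dict[Hashable, Set[Hashable]],
    start: Hashable,
    length: int,
    p: float = 1.0,
    q: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[Hashable]:
    """Generate one biased random walk of at most `length` nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(adjacency.get(current, ()))
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is uniform
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to where we came from
            elif candidate in adjacency.get(previous, ()):
                weights.append(1.0)      # stay close to the previous node (BFS-like)
            else:
                weights.append(1.0 / q)  # move further away (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


toy_graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
sample_walk = node2vec_walk(toy_graph, "a", 10, p=0.5, q=2.0, rng=random.Random(7))
```

A small q pushes the walk outward in a depth-first manner, while a large q keeps it near the starting neighborhood in a breadth-first manner; the resulting walks are then fed to a word-embedding model to obtain node vectors.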
The overlapping community detection category implements the BigNMFA and Copra algorithms, which assume that in real society a user may belong to zero, one, or several communities, that is, people tend to take part in n community activities to different degrees. The BigNMFA non-negative matrix factorization algorithm computes, for the nodes of a large real network, the several communities each node belongs to at the same time together with the degree of membership in each, and finds both overlapping and non-overlapping community structures more accurately than most prior art. The Copra algorithm finds overlapping communities in the network through label propagation and can process a very large and dense network in a short time, but it does not provide community membership degree information;
the shortest path category implements all-pairs (full graph) shortest path calculation, single-source shortest path calculation, shortest path length calculation, path usage calculation, and the like; these algorithms compute the shortest path between two or more points.
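For the single-source case, a minimal sketch using Dijkstra's algorithm is shown below. It assumes non-negative edge weights and a directed adjacency map of the form `{node: {neighbor: weight}}`; the text does not state which algorithm the platform uses, so this is only one standard way to compute such paths.

```python
import heapq
from typing import Dict, Hashable

WeightedAdjacency = Dict[Hashable, Dict[Hashable, float]]


def single_source_shortest_paths(
    adjacency: WeightedAdjacency, source: Hashable
) -> Dict[Hashable, float]:
    """Return the shortest distance from source to every reachable node (Dijkstra)."""
    distances: Dict[Hashable, float] = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        distance, node = heapq.heappop(heap)
        if distance > distances.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in adjacency.get(node, {}).items():
            candidate = distance + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distances


toy_roads = {
    "a": {"b": 1, "c": 4},
    "b": {"c": 2, "d": 6},
    "c": {"d": 1},
}
shortest = single_source_shortest_paths(toy_roads, "a")
```

Running the same function once per node gives the all-pairs (full graph) variant on small graphs; node names are assumed to be mutually comparable so that heap ties can be broken.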
The deep learning docking unit of this embodiment connects, at the programming language level, to deep learning frameworks developed in the same programming language. The docking interface converts from the Java programming language to the Python programming language, so that graph data can be queried from Python. It is built on the interface module AbutionGRS and uses the urllib toolkit to translate all interfaces into Python interfaces. The deep learning docking unit includes the modules Core, Connector, Config, Function, Operation, Predicates, and Types. The Core module mainly converts the formats of basic graph elements (such as Entity, Edge, and JSON interface data); the Connector module connects to a graph instance on the AbutionGRS server and sends processing operation requests; the Config module converts graph configuration; the Function module converts built-in database functions; the Operation module converts graph query statements; the Predicates module converts the built-in predicate functions of the database; the Types module converts the graph structure and the data types of graph data.
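How a Connector-style module might wrap a graph query into an HTTP request with urllib can be sketched as follows. The endpoint path, the JSON field names (`class`, `graph`, `operations`, `resultLimit`), and the base URL are all hypothetical placeholders, since the text does not specify the wire format; the sketch only builds the request and does not send it.

```python
import json
import urllib.request
from typing import Dict, List


def build_operation_request(
    base_url: str, graph_name: str, operations: List[Dict[str, object]]
) -> urllib.request.Request:
    """Serialize an operation chain to JSON and wrap it in a POST request."""
    payload = {
        "class": "OperationChain",  # assumed field names, for illustration only
        "graph": graph_name,
        "operations": operations,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/graph/operations/execute",  # hypothetical path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_operation_request(
    "http://localhost:8080/rest",
    "testGraph",
    [{"class": "GetAllElements"}, {"class": "Limit", "resultLimit": 10}],
)

# Sending would be a single call, left commented out because no server is assumed:
# with urllib.request.urlopen(request) as response:
#     elements = json.loads(response.read())
```

Separating request construction from sending keeps the conversion logic (the job of the Operation and Core modules) testable without a running server.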
Sixth, the cluster resource monitoring unit of this embodiment comprises a graph metadata storage and management module, a distributed graph instance storage module, a user authority management module, and a cluster resource monitoring and display module. All of these modules are built on the Java distributed cache technology Hazelcast.
The graph metadata storage and management module caches the graph structures of the database in memory and keeps the in-memory graph structures synchronized with those in the database. The distributed graph instance storage module stores graph instances, once initialized, in the distributed cache system and keeps them synchronized with the graph instances in the database. The user authority management module stores serialized user names in the distributed cache system, keeps the user authority information in the cache synchronized with that in the database, and verifies the user's authority whenever the user submits a database operation; if verification passes, the user can operate the graph data within their jurisdiction. The cluster resource monitoring and display module is embedded in each database node, mainly exchanges metadata and user authority storage and management information, and monitors the CPU and memory usage of each database server while it is running.
The primitive data storage and management module caches and stores the graph structures in the database. Graph structures held in memory can be read and used within milliseconds, but in-memory data is lost when the system restarts, so the cached graph structures are stored synchronously in metadata tables in the database, and after a cluster restart they are synchronized from the metadata tables back into memory. The module provides management operations such as adding, deleting, modifying, and querying, and these operations are also synchronized to the database for storage. The database table schemas distributed stores hidden graph structure information, the database table schemas expanded stores the graph structure information that is exposed to the user, and the database table schemas Maping stores the field mapping relations between the original data and the graph structure;
the distributed graph instance storage module stores graph instances, that is, the Java objects produced by graph initialization, in a distributed cache system. The distributed cache system runs at least one service on each machine of the cluster, and the services can dynamically expand the cluster through tcp-ip addresses or through service discovery. When a graph instance is obtained for use, no connection needs to be initialized with the database account and password; information such as the database account, password, and graph name only needs to be checked against the existing instances in the distributed cache. If the check is consistent, the graph instance can be obtained by connecting to any one of the services through a client program, and any graph operation can be executed on the graph through that instance. The distributed cache system also provides management operations such as adding, deleting, and modifying graph instances, and these operations are synchronized to the database for storage;
the user authority management module serializes user names and stores them in the distributed cache system, with the user name as the Key of the stored set and the graphs that the user administers as the Value. Each time a user submits a database operation, the user's authority is verified first, and then whether the user owns the graph to be queried is checked; if both checks pass, the corresponding graph operation is executed. The user authority information in the database is kept synchronized with that in the distributed cache. In the database, the metadata of graph data is stored in groups with the user name as the RowKey, so that the graph structure information belonging to a user can be located directly when the user's information is queried. The user authority management system is only a basic authority management platform: it does not itself store additional user information, but it implements an interface for storing such information, and that interface interacts with an LDAP database. User information is carried in the URL submitted through the REST API; after receiving the URL, the AbsutionGRS interface platform parses out the implicit user information and verifies it against the LDAP database, and the corresponding graph operation can be executed once authority verification passes. Users who do not need user authority management can ignore the LDAP database part and use the system default user or the authority management in the distributed cache.
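The Key/Value authority check described above can be pictured with the following minimal sketch. It is an assumption-laden stand-in: plain Python dictionaries replace both the Hazelcast distributed cache and the database table, and the class and method names (`PermissionCache`, `grant`, `check`, `run`) are hypothetical.

```python
class PermissionCache:
    """Hypothetical in-memory stand-in for the cached user authority map."""

    def __init__(self, store):
        # `store` stands in for the database table: user name -> set of graphs.
        self.store = store
        # Load the cache from the store, as after a cluster restart.
        self.cache = {user: set(graphs) for user, graphs in store.items()}

    def grant(self, user, graph):
        """Write to the backing store and the cache so both stay synchronized."""
        self.store.setdefault(user, set()).add(graph)
        self.cache.setdefault(user, set()).add(graph)

    def check(self, user, graph):
        """Return True if the user name is a known key that owns the graph."""
        return graph in self.cache.get(user, set())

    def run(self, user, graph, operation):
        """Verify authority first, then execute the graph operation."""
        if not self.check(user, graph):
            raise PermissionError(f"{user!r} may not operate on graph {graph!r}")
        return operation(graph)


database_table = {"alice": {"sales_graph"}}
permissions = PermissionCache(database_table)
permissions.grant("bob", "risk_graph")

allowed = permissions.run("alice", "sales_graph", lambda g: f"queried {g}")
try:
    permissions.run("bob", "sales_graph", lambda g: f"queried {g}")
    denied = False
except PermissionError:
    denied = True
```

Checking against the cache rather than the database on every request is what makes the lookup cheap; the cost is that every grant must be written to both places.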
The cluster resource monitoring and display module is embedded in each database program, and its main interaction is with the storage and management of metadata and user permissions. When the database interface platform AbsutionGRS is initialized, the Hazelcast service is started at the same time, so that resource monitoring of the current server can be managed. The service and the resource monitoring are started in a Tomcat container; for as long as the program keeps running, the monitoring service continuously collects the indicators and displays them in real time.
Seventhly, the graph structure building unit of this embodiment comprises a visualization UI of the graph structure and a construction UI of the graph structure;
the visualization UI of the graph structure is used for displaying existing graph structures by parsing the graph structure stored in the database into a node-relationship diagram, and also provides the functions of deleting a graph structure and of using a graph structure to load a specified file into a specified graph.
In the construction UI of the graph structure, attributes are first assigned to the entities and relationships to which they belong; a dimension of an entity or relationship is then created; column data of the CSV data is mapped to the attributes of a given dimension of the entity or relationship by selecting from or filling in a table; and each attribute is assigned an ETL data processing program, an attribute aggregation program, a data verification program, and the Java data type in which it is finally stored.
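The per-attribute mapping from CSV columns to graph attributes can be sketched as follows. This is an illustration under stated assumptions: the `AttributeMapping` record, the sample column and attribute names, and the use of Python callables in place of the Java data type and the ETL and verification programs are all hypothetical, and the attribute aggregation program is omitted.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable


@dataclass
class AttributeMapping:
    """Hypothetical record tying one CSV column to one graph attribute."""
    column: str                        # source column in the CSV data
    attribute: str                     # target attribute in the graph dimension
    etl: Callable[[str], str]          # stand-in for the ETL processing program
    cast: Callable[[str], object]      # stand-in for the stored data type
    validate: Callable[[object], bool] # stand-in for the verification program


def map_row(row, mappings):
    """Apply ETL, type conversion, and validation to one CSV row."""
    entity = {}
    for mapping in mappings:
        value = mapping.cast(mapping.etl(row[mapping.column]))
        if not mapping.validate(value):
            raise ValueError(f"invalid value for {mapping.attribute}: {value!r}")
        entity[mapping.attribute] = value
    return entity


CSV_TEXT = "name,age\n Alice ,30\nBob,41\n"
MAPPINGS = [
    AttributeMapping("name", "person_name", str.strip, str, lambda v: len(v) > 0),
    AttributeMapping("age", "person_age", str.strip, int, lambda v: v >= 0),
]

entities = [map_row(row, MAPPINGS) for row in csv.DictReader(io.StringIO(CSV_TEXT))]
```

Keeping the mapping as data rather than code mirrors the non-programming intent of the UI: a new CSV layout only needs a new list of mapping records.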
In summary, with the technical scheme of this embodiment, data from different data sources can be stored in a hybrid storage mode that combines dynamic and static time-series data; when data in the database is queried, it can be analyzed by calling graph mining algorithms and deep learning algorithms through the docked interfaces; and the storage and use of the database can be monitored. The user is thereby provided with an integrated big data platform that covers everything from time-series dynamic storage to data computation and analysis. The specific beneficial effects are as follows:
1. Dynamic attribute graphs can be updated while storage of static attribute graphs remains supported;
2. Time-series data and time-series dynamic attribute graphs can be stored;
3. Multi-dimensional data storage: each entity and relationship in the graph can store its attributes in any number of dimensions;
4. Graph node IDs need not be stored as an int/long type and can be any string or number;
5. Cross-graph/multi-graph fused associative queries: different graphs are merged to execute a query statement, and the results obtained may come from different graphs;
6. Distributed graph instances: a mechanism for sharing graph instances among the machines of a cluster, which also makes the Web services highly available;
7. A user isolation mechanism that keeps each user's graph data confidential from other users;
8. A task mechanism that records the information of each query (user, time, statement, returned result, etc.);
9. An embedded distributed cluster resource management system that dynamically monitors the real-time resource usage of each server;
10. Integration of the big data processing technology Flink as a real-time warehousing framework for multi-source data, supporting extension to any data source;
11. Seamless docking with the big data processing technology Spark, on which a graph mining algorithm library of 13 categories and 60 algorithms is implemented;
12. Use of a Python API to dock the graph data to any deep learning framework.
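Effects 4 and 5 in the list above (string node IDs and cross-graph fused queries) can be pictured with a small sketch. It is not the platform's query engine: graphs are modeled as plain adjacency dictionaries, and the function names `fuse` and `neighbours_within` as well as the sample node IDs are assumptions made for this example.

```python
def fuse(*graphs):
    """Merge several adjacency maps into one; node IDs are arbitrary strings."""
    merged = {}
    for graph in graphs:
        for node, neighbours in graph.items():
            merged.setdefault(node, set()).update(neighbours)
    return merged


def neighbours_within(graph, start, hops):
    """Return every node reachable from `start` in at most `hops` steps."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        # Expand one hop and drop nodes that were already visited.
        frontier = {n for node in frontier for n in graph.get(node, ())} - seen
        seen |= frontier
    return seen - {start}


# Two separate graphs that share the node "user:bob".
social = {"user:alice": {"user:bob"}}
payments = {"user:bob": {"account:42"}}

single_graph_result = neighbours_within(social, "user:alice", 2)
fused_result = neighbours_within(fuse(social, payments), "user:alice", 2)
```

Queried alone, the social graph stops at `user:bob`; on the fused graph the same query continues into the payments graph, so the result mixes nodes from both sources.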
Example two
The embodiment provides a data application method taking a time sequence dynamic graph as a core, which is suitable for a data application platform taking the time sequence dynamic graph as the core in the first embodiment, as shown in fig. 2, and includes the following steps:
S1, constructing, in a database, a data mixed storage structure with the time sequence dynamic graph as its core;
S2, in the data storage process, adopting the big data real-time processing technology Flink as the database data access technology, accessing various data sources, converting the various data obtained according to a conversion rule, and storing the converted data into the database according to the data mixed storage mode;
S3, in the data use process, calling the required data in the database through various standard workflow interfaces according to the query request of a function call, and, during the call, analyzing the data with a graph mining algorithm and/or a deep learning algorithm to obtain the various analyzed service indexes;
and S4, monitoring the CPU/memory usage indexes of each server of the database while the data is being stored and used.
The data application method of this embodiment is applicable to the data application platform of the first embodiment and shares the same content as the first embodiment, which is not repeated here.
Those of ordinary skill in the art will appreciate that the various illustrative steps or elements described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that various illustrative steps or elements have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present application, it should be understood that the division of the step or unit is only one logical function division, and there may be other division manners in actual implementation, for example, multiple steps or units may be combined into one step or unit, one step or unit may be split into multiple steps or units, or some features may be omitted, etc.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (10)

1. A data application method taking a time sequence dynamic map as a core is characterized by comprising the following steps:
constructing a data mixing storage structure taking a time sequence dynamic map as a core in a database;
in the data storage process, a big data real-time processing technology Flink is adopted as a database data access technology, various data sources are accessed, the obtained various data are converted according to a conversion rule, and the converted data are stored in a database according to a data mixed storage mode;
in the data use process, calling required data in a database according to a query request of function calling through various standard workflow interfaces, and in the calling process, analyzing the data according to a graph mining algorithm and/or analyzing the data through a deep learning algorithm to obtain various analyzed service indexes;
and in the process of storing and using the data, monitoring the use index of the CPU/memory of each server of the database.
2. A time-series dynamic graph-based data application platform, which is suitable for the time-series dynamic graph-based data application method according to claim 1, and comprises:
the database is used for storing data according to a data mixing storage mode taking the time sequence dynamic map as a core;
the data access unit is used for connecting various data sources through a data interface by using a big data real-time processing technology Flink as a database data access technology, converting data from the various data sources according to a conversion rule and storing the data into a database;
the workflow interface unit is used for providing various transmission interfaces, and each application realizes the function call of the data in the database through a transmission interface;
the graph mining algorithm library is used for storing a plurality of graph mining algorithms so that a user can call the graph mining algorithms to perform data analysis on data in the database in the function calling process;
the deep learning docking unit is used for docking the database to a deep learning framework by adopting Python API so as to call data in the database when a user performs data analysis through AI algorithm;
the cluster resource monitoring unit is used for caching data so that a user can connect to the database and execute data queries within milliseconds, and for simultaneously monitoring the resource utilization index of each server;
and the graph structure construction unit is used for constructing a data mixed storage structure taking the time sequence dynamic graph as a core in the database, updating the existing graph data in the database by using the data mixed storage structure, and realizing automatic data updating and data importing of the original data.
3. The time series dynamic graph-centric data application platform according to claim 2, wherein the database comprises a data storage subsystem, a metadata storage subsystem, a user isolation subsystem, and a data index monitoring subsystem;
the data storage subsystem is used for storing relational graph data according to a data mixed storage mode taking the time sequence dynamic graph as a core;
the metadata storage subsystem is used for storing the graph structure and the graph configuration of each graph; the graph structure comprises the entities, entity dimensions, entity attributes, relationships, relationship dimensions, relationship attributes, and data types in the graph, and the graph configuration comprises the key information for connecting to the database when each graph is instantiated and the system resources allocated to each graph;
the user isolation subsystem is used for carrying out data separation storage by taking the user authority as an identifier; when the database receives the operation of a user, the authority inquiry is triggered, and if the user authority is consistent with the corresponding user authority in the memory, the user can operate the graph data in the administration range;
and the data index monitoring subsystem is used for activating a real-time collecting and monitoring program when the system is initialized, and recording and displaying various operation indexes.
4. The data application platform taking the time-series dynamic graph as the core as claimed in claim 3, wherein the data hybrid storage mode supports simultaneous storage of multiple kinds of data, including time-series data, geospatial data, relational data, text data, and combined dynamic and static graph data.
5. The time-series dynamic graph-centric data application platform according to claim 4, wherein the data sources comprise a Kafka data source, a RocketMQ data source, an ActiveMQ data source, an HDFS data source, and a File data source;
the data interface comprises a Kafka interface, a RocketMQ interface, an ActiveMQ interface, a Hadoop interface and a universal data interface.
6. The time-series dynamic graph-based data application platform of claim 5, wherein the data of the data source comprises real-time data and offline data, and the access comprises real-time data access and offline data access; the real-time data access is used for accessing data from various message components in real time and storing the real-time data into a database after executing the user-defined data processing logic; the offline data access is used for accessing data from a local file or a distributed file system HDFS in a large batch, and storing the offline data into a database after executing a user-defined data processing logic.
7. The time-series dynamic graph-based data application platform of claim 6, wherein the graph mining algorithm library comprises 13 types of algorithm types, and the 13 types of algorithm types comprise an entity classification algorithm, an entity clustering algorithm, a graph coarsening algorithm, a community division algorithm, a network layering algorithm, a connection prediction algorithm, a complete network graph attribute calculation algorithm, an edge-in-network attribute calculation algorithm, a node-in-network attribute calculation algorithm, a minimum spanning tree algorithm, an NLP node vectorization algorithm, an overlapping community detection algorithm, and a shortest path algorithm.
8. The data application platform with the time-series dynamic graph as the core according to claim 7, wherein adopting the Python API to dock the database to a deep learning framework specifically comprises: docking the graph data in the database, at the programming-language level, with a deep learning framework developed in the same type of programming language, the docking interface implementing the conversion from the Java programming language to the Python programming language.
9. The data application platform taking the time-series dynamic graph as the core as claimed in claim 8, wherein the cluster resource monitoring unit comprises a primitive data storage and management module, a distributed graph instance storage module, a user authority management module and a cluster resource monitoring and display module;
the primitive data storage and management module is used for caching the graph structure in the database into a memory, and the graph structure in the memory is synchronous with the graph structure in the database;
the distributed graph instance storage module is used for storing the graph instance after graph initialization into a distributed cache system, and the graph instance in the distributed cache system is synchronous with the graph instance in the database;
the user authority management module is used for storing the serialized user names into a distributed cache system, synchronizing the user authority information in the distributed cache system with the user authority information in the database, verifying the user authority when the user submits the database operation, and if the verification is passed, the user can operate the graph data in the administrative range;
and the cluster resource monitoring and displaying module is used for accessing each database in an embedded mode, mainly interacting metadata and user permission storage and management, and monitoring the use index of the CPU/memory of each server in the running process of each server of the database.
10. The time-series dynamic graph-based data application platform of claim 9, wherein the graph structure building unit comprises a graph structure visualization UI and a graph structure building UI;
the visual UI of the graph structure is used for displaying the existing graph structure, analyzing the graph structure stored in the database into a node relation graph and displaying the node relation graph, and also has the functions of deleting the graph structure and loading a specified file to the specified graph by using the graph structure;
and the construction UI of the graph structure is used for depicting a graph structure according to the original data in a non-programming mode, finishing the whole process of mapping the original data into graph structure data and realizing automatic data updating and data importing of the original data.
CN202010343863.4A 2020-04-27 2020-04-27 Data application method and platform with time sequence dynamic map as core Pending CN111523003A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010343863.4A CN111523003A (en) 2020-04-27 2020-04-27 Data application method and platform with time sequence dynamic map as core

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010343863.4A CN111523003A (en) 2020-04-27 2020-04-27 Data application method and platform with time sequence dynamic map as core

Publications (1)

Publication Number Publication Date
CN111523003A true CN111523003A (en) 2020-08-11

Family

ID=71903977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010343863.4A Pending CN111523003A (en) 2020-04-27 2020-04-27 Data application method and platform with time sequence dynamic map as core

Country Status (1)

Country Link
CN (1) CN111523003A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160026677A1 (en) * 2014-07-23 2016-01-28 Battelle Memorial Institute System and method of storing and analyzing information
CN109828965A (en) * 2019-01-09 2019-05-31 北京小乘网络科技有限公司 A kind of method and electronic equipment of data processing
CN110245158A (en) * 2019-06-10 2019-09-17 上海理想信息产业(集团)有限公司 A kind of multi-source heterogeneous generating date system and method based on Flink stream calculation technology
CN110688495A (en) * 2019-12-09 2020-01-14 武汉中科通达高新技术股份有限公司 Method and device for constructing knowledge graph model of event information and storage medium

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949664A (en) * 2020-08-18 2020-11-17 上海伽易信息技术有限公司 Subway high-speed data network route analysis method based on double-route protection and tendency aggregation degree graph calculation
WO2022052374A1 (en) * 2020-09-09 2022-03-17 北京邮电大学 Recursive timing knowledge graph completion method and apparatus
CN111930862A (en) * 2020-09-17 2020-11-13 国网浙江省电力有限公司 SQL interactive analysis method and system based on big data platform
CN112241401A (en) * 2020-10-16 2021-01-19 中国民用航空华东地区空中交通管理局 Knowledge graph-based digital quality management system and method
CN112433999A (en) * 2020-11-05 2021-03-02 北京浪潮数据技术有限公司 Traversal method for Janus graph client and related components
CN112433999B (en) * 2020-11-05 2023-12-22 北京浪潮数据技术有限公司 Janusgraph client traversing method and related components
CN112395512A (en) * 2020-11-06 2021-02-23 中山大学 Method for constructing complex attribute network representation model based on path aggregation
CN112559453A (en) * 2020-12-09 2021-03-26 恒安嘉新(北京)科技股份公司 Data storage method and device, electronic equipment and storage medium
CN112860914A (en) * 2021-03-02 2021-05-28 中国电子信息产业集团有限公司第六研究所 Network data analysis system and method of multi-element identification
CN113553350B (en) * 2021-05-27 2023-07-18 四川大学 Traffic flow partition model for similar evolution mode clustering and dynamic time zone division
CN113553350A (en) * 2021-05-27 2021-10-26 四川大学 Traffic flow partition model for similar evolution mode clustering and dynamic time zone partitioning
CN113569104A (en) * 2021-09-27 2021-10-29 成都市维思凡科技有限公司 Data tracking method, device, equipment and medium based on graphic data
CN114003791A (en) * 2021-12-30 2022-02-01 之江实验室 Depth map matching-based automatic classification method and system for medical data elements
CN114780497A (en) * 2022-04-22 2022-07-22 湖南长银五八消费金融股份有限公司 Batch file processing method, apparatus, computer device, medium, and program product
CN114676169A (en) * 2022-05-27 2022-06-28 富算科技(上海)有限公司 Data query method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination