WO2015058500A1 - Data storage method and device

Info

Publication number: WO2015058500A1
Application number: PCT/CN2014/075570
Authority: WO
Inventors: 刘志容; 李川
Original assignee: 华为技术有限公司
Priority date: 2013-10-23
Filing date: 2014-04-17
Publication date: 2015-04-30
Also published as: CN104572740A; CN104572740B

Abstract

Disclosed are a data storage method and device. The method comprises: acquiring an original data set; extracting, from the original data set, information representing an information network graph structure, the information at least comprising node information, node attribute information, edge information and edge attribute information, wherein the node information at least comprises a node identifier and a node attribute key, the node attribute key corresponding to the node attribute information; the edge information at least comprises an edge identifier and an edge attribute key, the edge attribute key corresponding to the edge attribute information; and an edge describes the relationship between nodes; and storing the extracted node information, node attribute information, edge information and edge attribute information. With the solution provided in the embodiments of the present invention, researchers can also observe the relationships between nodes.

Description

Method and device for storing data

The present application claims priority to Chinese Patent Application No. 201310505069.5, entitled "A Method and Apparatus for Storing Data", filed on October 23, 2013, the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the field of data storage, and in particular, to a method and apparatus for storing data.

Background

The concept of Information Networks is a general abstraction of massive, multidimensional, and complex structural data in real space. Information networks are of great value in the fields of community network analysis, partner network analysis, traffic network capacity calculation, protein network receiving component analysis, and criminal network analysis.

In the information network environment, the subject of interest to users evolves from a simple numerical metric (such as total sales volume or profit) to a complex network. In a sales network, for example, each node (vertex) represents a commodity, and the connections between nodes (i.e., edges) represent the co-sale relationship between different types of items; see the sales network shown in Figure 1.

The classic online analytical processing (OLAP) data warehouse model is a multidimensional data model, which treats data as a multidimensional space. "Dimensions" are the different angles from which people observe data and can be used to represent different attributes of something. For example, analyzing product sales data involves a time dimension, a product dimension, and a regional dimension. There is no unified multidimensional data model at this stage, but there are three classic OLAP data warehouse models: the star schema, the snowflake schema, and the constellation schema.

The star schema is the basic structure of the multidimensional data model and consists of a central fact table and dimension tables. The central fact table is the core table of the star schema; it stores the measures of the facts and the keys of the respective dimension tables. Each dimension table maintains dimension information, that is, the dimension members, including the attribute information of the dimension. The central fact table is linked to each dimension table through the dimension key values it stores. The snowflake schema is a variant of the star schema that further decomposes some dimension tables of a star schema. The constellation schema can be regarded as a collection of star schemas in which multiple fact tables share certain dimension tables, thereby achieving multi-subject modeling.

As shown in Figure 2, the star schema organizes classic product sales data well. Sales data can be considered along four dimensions: Time, Item, Branch, and Location. The schema contains one central fact table (Sales), which holds the keys of the four dimensions (Time_Key, Branch_Key, Item_Key, and Location_Key in Figure 2) and two measures (Dollars_sold and Unit_sold in Figure 2).
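For illustration only, the star schema described above can be sketched as relational tables. The following minimal sketch uses Python's built-in sqlite3 module; the table names, the non-key dimension columns, and all sample rows are assumptions made for this example and are not taken from Figure 2.

```python
import sqlite3

# Hypothetical in-memory star schema for the Sales example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE item_dim     (item_key     INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch_dim   (branch_key   INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);
-- Central fact table: one key per dimension plus the two measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
-- Invented sample rows.
INSERT INTO time_dim VALUES (1, 2013);
INSERT INTO item_dim VALUES (1, 'item A'), (2, 'item B');
INSERT INTO branch_dim VALUES (1, 'branch 1');
INSERT INTO location_dim VALUES (1, 'Beijing'), (2, 'Shanghai');
INSERT INTO sales VALUES (1, 1, 1, 1, 100.0, 2),
                         (1, 2, 1, 1,  50.0, 1),
                         (1, 1, 1, 2,  30.0, 3);
""")

# Aggregate the two measures along the Location dimension.
sales_by_city = conn.execute("""
    SELECT l.city, SUM(s.dollars_sold), SUM(s.units_sold)
    FROM sales AS s
    JOIN location_dim AS l ON s.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(sales_by_city)
```

The sketch stores only numerical measures per combination of dimension keys; nothing in it records a relationship between two items, which is the limitation the following paragraphs discuss.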

The star schema and the snowflake schema are only suitable for modeling a single subject and cannot model multiple subjects. The constellation schema lets multiple fact tables share some dimension tables to realize multi-subject modeling. However, the subject data in an information network evolves into a complex graph network, for which both the information dimension and the topology dimension information must be saved at the same time. None of these schemas is therefore suitable for modeling online graph processing.

In traditional OLAP, researchers focus on numerical measures, such as the quantity of goods sold in a mall or the sales amount. The multidimensional data model was proposed for traditional OLAP and does not apply to the data organization in an information network. Researchers now pay more attention to the co-sale relationship between goods, which involves modeling the connections between objects. More and more data appears in the form of network graphs, such as social networks, partner networks, and protein networks, and in these networks researchers care most about the connections between entities. The traditional multidimensional data model cannot reasonably store and represent the relationships in network graph data, and therefore cannot adequately capture the connections between entities.

Summary of the invention

The embodiment of the invention provides a method and a device for storing data, which overcomes the problem that the traditional multidimensional data model cannot reasonably store and represent the network graph data relationship.

A first aspect of the embodiments of the present invention provides a method for storing data, where the method includes:

Acquiring an original data set;

Extracting information representing the information network graph structure from the original data set, where the information representing the information network graph structure includes at least: node information, node attribute information, edge information, and edge attribute information;

The node information includes at least: a node identifier and a node attribute key;

The node attribute key has a corresponding relationship with the node attribute information;

The edge information includes at least: an edge identifier and an edge attribute key;

The edge attribute key has a corresponding relationship with the edge attribute information;

The edge is used to describe the relationship between nodes;

And storing the extracted node information, node attribute information, edge information, and edge attribute information.

In a first possible implementation manner of the first aspect of the embodiments of the present invention, the node information further includes: a node metric value;

The edge information further includes: an edge metric value.

With reference to the first aspect of the embodiments of the present invention or the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect of the embodiments of the present invention,

The extracted node information is stored in a node fact table;

The extracted edge information is stored in an edge fact table;

The extracted node attribute information is stored in a topology dimension table;

The extracted edge attribute information is stored in an information dimension table;

Because the edge is used to describe the relationship between nodes, the information in the node fact table has a corresponding relationship with the information in the edge fact table;

Because the node attribute key has a corresponding relationship with the node attribute information, the information in the topology dimension table has a corresponding relationship with the information in the node fact table;

Because the edge attribute key has a corresponding relationship with the edge attribute information, the information in the information dimension table has a corresponding relationship with the information in the edge fact table.

In a third possible implementation manner of the first aspect of the embodiments of the present invention, after the storing of the extracted node information, node attribute information, edge information, and edge attribute information, the method further includes: locating the data to be queried in the stored node information, node attribute information, edge information, and edge attribute information;

After the locating, performing the query starting from one of the node information, the node attribute information, the edge information, or the edge attribute information.

In a fourth possible implementation manner of the first aspect of the embodiments of the present invention, after the storing of the extracted node information, node attribute information, edge information, and edge attribute information, the method further includes: performing an online graph processing operation according to the extracted node information, node attribute information, edge information, and edge attribute information.

With reference to the fourth possible implementation manner of the first aspect of the embodiments of the present invention, in a fifth possible implementation manner of the first aspect, the online graph processing operation includes:

One of information dimension roll-up (I-OLGP), topology dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slicing, dicing, and data viewing.

With reference to the fifth possible implementation manner of the first aspect of the embodiments of the present invention, in a sixth possible implementation manner of the first aspect, if the extracted edge attribute information is stored in the information dimension table, the information dimension roll-up specifically includes:

Rolling up the information of one attribute, or of multiple attributes, of the edge stored in the information dimension table.

With reference to the fifth possible implementation manner of the first aspect of the embodiments of the present invention, in a seventh possible implementation manner of the first aspect, if the extracted node attribute information is stored in the topology dimension table, the topology dimension roll-up specifically includes:

Rolling up the information of one attribute, or of multiple attributes, of the node stored in the topology dimension table.
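The two roll-up operations are not defined in detail at this point, so the following Python sketch shows only one plausible reading, stated here as an assumption: rolling up an edge attribute aggregates the edge metric over that attribute, and rolling up a node attribute merges nodes that share the attribute value and aggregates the edge metrics between the resulting groups. The function names, the row layout, and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge fact rows:
# (node 1, node 2, year, venue, number of co-authored papers).
edge_facts = [
    ("Ann", "Bob", 2012, "Conf A", 1),
    ("Ann", "Bob", 2012, "Conf B", 2),
    ("Ann", "Cai", 2013, "Conf A", 1),
]

# Hypothetical node attribute (topology dimension): affiliation per author.
affiliation = {"Ann": "Univ X", "Bob": "Univ Y", "Cai": "Univ X"}


def information_rollup(rows):
    """Roll up the 'venue' edge attribute: nodes stay unchanged and the
    edge metric is summed per (node 1, node 2, year)."""
    totals = defaultdict(int)
    for n1, n2, year, _venue, count in rows:
        totals[(n1, n2, year)] += count
    return dict(totals)


def topology_rollup(rows, group_of):
    """Roll up a node attribute: nodes with the same attribute value are
    merged into one group node and edge metrics are summed per group pair."""
    totals = defaultdict(int)
    for n1, n2, _year, _venue, count in rows:
        pair = tuple(sorted((group_of[n1], group_of[n2])))
        totals[pair] += count
    return dict(totals)


by_pair_and_year = information_rollup(edge_facts)
by_affiliation = topology_rollup(edge_facts, affiliation)
print(by_pair_and_year)
print(by_affiliation)
```

Under this reading, the first operation changes only the granularity of the edge attributes, while the second changes the set of nodes itself.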

A second aspect of the embodiments of the present invention provides an apparatus for storing data, where the apparatus includes: an acquiring unit, an extracting unit, and a storage unit;

The acquiring unit is configured to acquire an original data set;

The extracting unit is configured to extract, from the original data set, information representing the information network graph structure, where the information representing the information network graph structure includes at least: node information, node attribute information, edge information, and edge attribute information; the node information includes at least: a node identifier and a node attribute key; the node attribute key has a corresponding relationship with the node attribute information; the edge information includes at least: an edge identifier and an edge attribute key; the edge attribute key has a corresponding relationship with the edge attribute information; the edge is used to describe the relationship between nodes;

The storage unit is configured to store the extracted node information, node attribute information, edge information, and edge attribute information.

In a first implementation manner of the second aspect of the embodiments of the present invention, the node information further includes: a node metric value;

The edge information further includes: an edge metric value.

With reference to the first implementation manner of the second aspect of the embodiments of the present invention, in a second implementation manner of the second aspect, the extracted node information is stored in a node fact table;

The extracted edge information is stored in an edge fact table;

Because the edge is used to describe the relationship between nodes, the information in the node fact table has a corresponding relationship with the information in the edge fact table;

Because the edge attribute key has a corresponding relationship with the edge attribute information, the information in the information dimension table has a corresponding relationship with the information in the edge fact table.

In a third implementation manner of the second aspect of the embodiments of the present invention, the device further includes: a positioning unit and a query unit;

The positioning unit is configured to locate the data to be queried in the stored node information, node attribute information, edge information, and edge attribute information;

The query unit is configured to perform, after the locating, the query starting from one of the node information, the node attribute information, the edge information, or the edge attribute information.

In a fourth implementation manner of the second aspect of the embodiments of the present invention, the device further includes: a graph processing unit;

The graph processing unit is configured to perform an online graph processing operation according to the extracted node information, node attribute information, edge information, and edge attribute information.

With reference to the fourth implementation manner of the second aspect of the embodiments of the present invention, in a fifth implementation manner of the second aspect, the online graph processing operation in the graph processing unit includes:

One of information dimension roll-up (I-OLGP), topology dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slicing, dicing, and data viewing.

With reference to the fifth implementation manner of the second aspect of the embodiments of the present invention, in a sixth implementation manner of the second aspect, if the extracted edge attribute information is stored in the information dimension table, the information dimension roll-up in the graph processing unit specifically includes:

Rolling up the information of one attribute, or of multiple attributes, of the edge stored in the information dimension table.

With reference to the fifth implementation manner of the second aspect of the embodiments of the present invention, in a seventh implementation manner of the second aspect, if the extracted node attribute information is stored in the topology dimension table, the topology dimension roll-up specifically includes:

Rolling up the information of one attribute, or of multiple attributes, of the node stored in the topology dimension table.

In the method for storing data provided by the embodiments of the present invention, an original data set is acquired, and node information, node attribute information, edge information, and edge attribute information are extracted from it. The node information includes at least a node identifier and a node attribute key; the node attribute key has a corresponding relationship with the node attribute information; the edge information includes at least an edge identifier and an edge attribute key; the edge attribute key has a corresponding relationship with the edge attribute information; and the edge is used to describe the relationship between nodes. The extracted node information, node attribute information, edge information, and edge attribute information are then stored. Because the extracted pieces of information are linked to one another, the required data can be located quickly and accurately in subsequent operations on the data.

Compared with the existing OLAP multidimensional data warehouse model, the information stored by the solution provided by the embodiments of the present invention includes the same node information and node attribute information as the prior art, so that researchers can still focus on the nodes. It additionally includes edge information and edge attribute information, which the prior art cannot capture, so that researchers can also focus on the relationships between nodes.

Description of the drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without any inventive effort.

FIG. 1 is a schematic diagram of a sales network in the prior art;

FIG. 2 is a multidimensional data model of a star schema in the prior art;

FIG. 3 is a schematic diagram of an information network provided by an embodiment of the present invention;

FIG. 4 is a method for storing data according to Embodiment 1 of the present invention;

FIG. 5 is a schematic diagram of the association between node attribute information, edge information, and edge attribute information according to an embodiment of the present invention (or a multidimensional information network data warehouse model);

FIG. 6 is a schematic diagram of a network of research collaborators;

FIG. 7 is a schematic diagram of a method for storing data according to Embodiment 2 of the present invention;

FIG. 8 is a multidimensional information network data warehouse model;

FIG. 9 shows the conversion of the edge fact table into an edge fact relation table;

FIG. 10 shows the conversion of the node fact table into a node fact relation table;

FIG. 11 shows the conversion of the information dimension table into an information dimension relation table;

FIG. 12 shows the conversion of the topology dimension table into a topology dimension relation table;

FIG. 13 shows the keyword-collaborator multidimensional information network data warehouse model;

FIG. 14 shows the film actor cooperation network;

FIG. 15 is a method for storing data according to Embodiment 2 of the present invention;

FIG. 16 is a film actor cooperation multidimensional information network data warehouse model;

FIG. 17 is a data storage device according to Embodiment 4 of the present invention;

FIG. 18 is a data storage device according to Embodiment 5 of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

In an information network, the user's center of interest rises from a numerical measure to a graph or network whose structure consists of nodes and edges. Nodes and edges each have related attributes, namely node attributes and edge attributes. The attributes associated with edges are referred to as information dimensions, and the attributes associated with nodes are referred to as topological dimensions. An edge represents the connection between two nodes. In the information network graph shown in Figure 3, for example, circles represent nodes, and each edge and each node has its own attributes.

In an information network, researchers pay more attention to the connections between objects. The objects mentioned here can be understood as nodes, so the connections of interest are those between nodes. Most researchers work on link prediction in graph-structured social networks, traffic hub node discovery, community trend evolution, protein structure analysis, and similar tasks, all of which are carried out on graph-structured data. However, the prior art lacks a general and efficient underlying data organization model for storing such data in a way that facilitates its analysis.

Therefore, the embodiments of the present invention provide a general storage scheme for graph data in an information network, that is, a method, a device, and a system for storing data. The scheme organizes graph-structured data to support research on upper-layer algorithms and to ease the analysis and use of the data. It represents the relationships between objects on the basis of graphs, simplifies complex information storage formats, and eliminates redundancy, and it uses a relational database to store these relationships so that users can perform efficient structured query operations. Here, a relational database is a database built on the relational model, which processes the data in the database by means of mathematical concepts and methods such as set algebra.

The present solution will be described in detail with reference to specific embodiments as follows.

Embodiment 1

An embodiment of the present invention provides a method for storing data. As shown in FIG. 4, the method includes:

Step 101: Acquire an original data set;

The original data set can be understood as the collection of all the data gathered by the user, which is disorganized and hard to analyze. The original data set acquired in step 101 may be unstructured raw text input into the executing device.

Step 102: Extract information representing the information network graph structure from the original data set. The information representing the information network graph structure includes at least: node information, node attribute information, edge information, and edge attribute information. The node information includes at least: a node identifier and a node attribute key; the node attribute key has a corresponding relationship with the node attribute information; the edge information includes at least: an edge identifier and an edge attribute key; the edge attribute key has a corresponding relationship with the edge attribute information; the edge is used to describe the relationship between nodes;

Because nodes are related to node attributes, edges are related to edge attributes, and edges describe the relationships between nodes, the relationships among the extracted node information, node attribute information, edge information, and edge attribute information can be conveniently represented by a graph structure (see Figure 5, Figure 8, and Figure 16 described later).

Referring to FIG. 3, the information representing the information network graph structure may include: node information (e.g., a node identifier (VertexID)), node attribute information (e.g., Attribute1, Attribute2), edge information (e.g., an edge identifier (EdgeID)), edge attribute information (e.g., Attribute1, Attribute2), and so on. The number of node attributes, the number of edge attributes, and the numbers of nodes and edges vary with the specific information network, and so does the structure. The information network graph structure shown in Figure 3 is only a simple example for ease of understanding and does not limit the embodiments of the present invention.

Often, the original data sets acquired by the device are cluttered and inconvenient to analyze. Therefore, after obtaining the original data set, the device extracts from it, according to the information network graph structure, the information representing that structure, including node information, node attribute information, edge information, and edge attribute information.

It should be understood that the extracted node information, node attribute information, edge information, and edge attribute information are related to one another. Referring to FIG. 5, the information representing the information network graph structure extracted in step 102, including node information, node attribute information, edge information, and edge attribute information, may be represented in the form of tables. For example, the extracted node information is stored in a node fact table (VFT), the extracted edge information in an edge fact table (EFT), the extracted node attribute information in a topology dimension table (TDT), and the extracted edge attribute information in an information dimension table (IDT). Because nodes are related to node attributes and edges to edge attributes, the tables are associated with one another (the associations are shown as connections between the tables in Figure 5).
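As a rough illustration of how the four tables could be linked, the following sketch builds them with Python's built-in sqlite3 module. The lower-case table names follow the abbreviations VFT, EFT, TDT, and IDT used above; all column names, the key layout, and the sample rows are assumptions made for this example and do not reproduce Figure 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Topology dimension table: node attribute information.
CREATE TABLE tdt (node_attr_key INTEGER PRIMARY KEY,
                  attr_name TEXT, attr_value TEXT);
-- Information dimension table: edge attribute information.
CREATE TABLE idt (edge_attr_key INTEGER PRIMARY KEY,
                  attr_name TEXT, attr_value TEXT);
-- Node fact table: node identifier, node attribute key, node metric.
CREATE TABLE vft (
    vertex_id     INTEGER PRIMARY KEY,
    node_attr_key INTEGER REFERENCES tdt(node_attr_key),
    node_metric   INTEGER
);
-- Edge fact table: the edge is identified by its two node identifiers.
CREATE TABLE eft (
    vertex_id_1   INTEGER REFERENCES vft(vertex_id),
    vertex_id_2   INTEGER REFERENCES vft(vertex_id),
    edge_attr_key INTEGER REFERENCES idt(edge_attr_key),
    edge_metric   INTEGER,
    PRIMARY KEY (vertex_id_1, vertex_id_2, edge_attr_key)
);
-- Invented sample rows.
INSERT INTO tdt VALUES (10, 'affiliation', 'Univ X'),
                       (11, 'affiliation', 'Univ Y');
INSERT INTO idt VALUES (20, 'place', 'Beijing');
INSERT INTO vft VALUES (1, 10, 5), (2, 11, 3);
INSERT INTO eft VALUES (1, 2, 20, 2);
""")

# Follow the keys from one edge to both nodes, their attributes,
# and the edge attribute.
row = conn.execute("""
    SELECT e.vertex_id_1, t1.attr_value, e.vertex_id_2, t2.attr_value,
           i.attr_value, e.edge_metric
    FROM eft AS e
    JOIN vft AS v1 ON v1.vertex_id = e.vertex_id_1
    JOIN vft AS v2 ON v2.vertex_id = e.vertex_id_2
    JOIN tdt AS t1 ON t1.node_attr_key = v1.node_attr_key
    JOIN tdt AS t2 ON t2.node_attr_key = v2.node_attr_key
    JOIN idt AS i  ON i.edge_attr_key = e.edge_attr_key
""").fetchone()
print(row)
```

In this sketch the attribute keys play the role of foreign keys, so a query can start from any of the four tables and reach the others through joins.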

As shown in FIG. 5, the extracted information of a node includes: a node identifier (i.e., a node ID), a node attribute key, and/or a node metric. The specific meaning of a node depends on the definition of the information network: in a multidimensional information network of collaborators a node may represent an author, and in a multidimensional information network of film actor collaboration a node may represent an actor. It should be understood that the node metric included in the node information may be a numerical representation of information related to the node; for example, in the collaborator network, the node metric may be the number of articles published by the author. The node metric is a preferred, optional feature rather than a required one.

The node attribute key has a corresponding relationship with the node attribute information; that is, the node attribute key included in the node information is the link between the node information and the node attribute information. The detailed information corresponding to a node attribute key may be stored in the topology dimension table. For example, when the node is an actor, the node attribute key may be the film company to which the actor belongs, and the specific information corresponding to that key is the node attribute information, i.e., the particular film companies, such as Huayi Brothers Film Production Company and Tianyu Film Company.

The edge information includes at least an edge identifier and an edge attribute key, and may also include an edge metric. For example, as shown in FIG. 5, because an edge connects two nodes, the edge identifier (EdgeID) can be represented by the identifiers of the two nodes; in FIG. 5, node 1 and node 2 together represent the edge. There may be multiple edge attribute keys, each representing one type of attribute. For example, if the nodes are authors in a collaborator information network, an edge represents the collaboration of two authors, and the edge attribute keys may be the articles the two co-authored, and/or the year of the collaboration, and/or the place of the collaboration. It should also be understood that the edge metric included in the edge information may be a numerical representation of information related to the edge; for example, in the collaborator network, the edge metric may be the number of times the two authors have collaborated (e.g., Co-Frequence). The edge attribute key has a corresponding relationship with the edge attribute information; that is, the edge attribute key included in the edge information is the link between the edge information and the edge attribute information. The edge attribute information may be stored in the information dimension table.

If the edge attribute key is the co-authored articles, the specific information in the edge attribute information (which may be held in the information dimension table) may be the names of all the articles the collaborators co-authored, such as "Rainwater" and "Snowflake". If the edge attribute key is the place of collaboration, the specific information in the edge attribute information (which may be held in the information dimension table) may be all the places where the collaborators worked together, such as Beijing and Shanghai.
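The relationship between an edge record, its attribute keys, and the information dimension table can be sketched in a few lines of Python. The article titles and places come from the example above; the class name, the field names, the key strings, the author names, and the metric value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeRecord:
    node_1: str          # the two node identifiers together act as the EdgeID
    node_2: str
    attr_keys: tuple     # edge attribute keys (links into the dimension table)
    co_frequence: int    # edge metric: number of collaborations (assumed value)


# Information dimension table: edge attribute key -> (attribute type, values).
info_dimension = {
    "K_ARTICLE": ("co-authored articles", ["Rainwater", "Snowflake"]),
    "K_PLACE": ("place of collaboration", ["Beijing", "Shanghai"]),
}

edge = EdgeRecord("author_A", "author_B", ("K_ARTICLE", "K_PLACE"), 2)


def edge_attributes(record, idt):
    """Resolve every attribute key of an edge to its detailed information."""
    return {idt[key][0]: idt[key][1] for key in record.attr_keys}


resolved = edge_attributes(edge, info_dimension)
print(resolved)
```

Keeping only the keys in the edge record means the detailed attribute values are stored once in the dimension table, regardless of how many edges refer to them.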

Step 103: Store the extracted node information, node attribute information, edge information, and edge attribute information.

The node information, node attribute information, edge information, and edge attribute information may be stored in the form of tables, that is, in a node fact table, a topology dimension table, an edge fact table, and an information dimension table corresponding to the respective information. Storage in the form of tables is one possible manner and does not limit the embodiments of the present invention; other storage forms are possible.
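Steps 101 to 103 can be sketched end to end as follows. This is a minimal sketch that assumes the original data set has already been parsed into publication records (the text notes that it may be unstructured); the record layout, the key formats, the function name, and the sample data are all hypothetical, and plain dictionaries stand in for the four tables.

```python
from itertools import combinations


def extract(raw_records):
    """Step 102 (sketch): derive the four kinds of information from raw records."""
    vft = {}  # node id -> {"attr_key": node attribute key, "metric": papers}
    tdt = {}  # node attribute key -> node attribute information
    eft = {}  # (node id, node id) -> {"attr_keys": set, "metric": joint papers}
    idt = {}  # edge attribute key -> edge attribute information
    for rec in raw_records:
        for name, aff in rec["authors"]:
            node_key = f"aff:{aff}"
            tdt[node_key] = {"affiliation": aff}
            node = vft.setdefault(name, {"attr_key": node_key, "metric": 0})
            node["metric"] += 1
        edge_key = f"venue:{rec['venue']}"
        idt[edge_key] = {"venue": rec["venue"]}
        names = sorted(name for name, _ in rec["authors"])
        # Every pair of co-authors yields (or updates) one edge.
        for pair in combinations(names, 2):
            edge = eft.setdefault(pair, {"attr_keys": set(), "metric": 0})
            edge["attr_keys"].add(edge_key)
            edge["metric"] += 1
    return vft, tdt, eft, idt


# Step 101 (sketch): acquire the original data set (invented sample records).
raw = [
    {"authors": [("Ann", "Univ X"), ("Bob", "Univ Y")], "venue": "Conf A"},
    {"authors": [("Ann", "Univ X"), ("Bob", "Univ Y"), ("Cai", "Univ X")],
     "venue": "Conf B"},
]

# Step 103 (sketch): store the extracted information, here in a dictionary.
vft, tdt, eft, idt = extract(raw)
store = {"VFT": vft, "TDT": tdt, "EFT": eft, "IDT": idt}
print(store["EFT"][("Ann", "Bob")])
```

A relational database could replace the dictionaries without changing the extraction logic, since each dictionary corresponds to one of the four tables.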

The method for storing data is provided by the first embodiment of the present invention. The method extracts node information, node attribute information, side information, and edge attribute information from the original data set by acquiring the original data set. The node information includes at least: a node identifier. And a key attribute of the node attribute; the node attribute key has a corresponding relationship with the node attribute information; the side information at least includes: an edge identifier and an edge attribute key; the edge attribute key has a correspondence with the edge attribute information Relationship; due to the relationship between the node and the node attribute, the relationship between the edge and the edge attribute, the connection between the node and the node is an edge, so that the extracted node information, the node attribute information, the side information, and the edge attribute information are related, and the above is stored. Extracted node information, node attribute information, side information, and edge attribute information. Since there is a connection between the extracted information, it is possible to quickly and accurately locate the required data when the data is subsequently manipulated. At the same time, compared with the existing OLAP multi-dimensional data warehouse model, the information stored by the solution provided by the embodiment of the present invention includes not only the same node information and node attribute information as the prior art, so that the researcher can focus on the node as the center. In addition, the information stored in the solution provided by the embodiment of the present invention further includes side information and edge attribute information that cannot be focused on by the prior art, so that the researcher can also pay attention to the relationship between the nodes.

Further, the method for storing data is provided in the first embodiment of the present invention, and the existing method is solved. In the OLAP multi-dimensional data warehouse model, the redundancy problem exists in the original data set, and the solution provided by the embodiment of the invention has the advantages of flexible query, high efficiency, and flexible subject extraction.

Furthermore, the method for storing data provided by the first embodiment of the present invention better meets the modeling requirements of real social networks and facilitates the design of efficient OLGP algorithms. The model is easy to convert into traditional relational tables and is easy for people to understand in real-world terms.

Moreover, in the solution provided by the embodiment of the present invention, the connection between nodes and edges is established according to the connections between nodes, so the node information and node attribute information are directly linked to the side information and edge attribute information. The solution can therefore attend to the relationships between nodes, by way of the edges that record those relationships, while requiring only small changes to the prior art.

Embodiment 2

The embodiment of the present invention provides a method for storing data that is similar to the method provided in the first embodiment, except that it is an example of the storage method as specifically applied to a research partner information network.

The research partner network records how researchers in a field collaborate in publishing papers, and is a typical example of an information network. As shown in Figure 6, each node represents an author. If two people collaborate on a published article, there is an edge between the two nodes. The attributes of the edge record the number of articles the two collaborators published at a specific time and at a specific conference. The following takes the partner network in the data set of the ACM (Association for Computing Machinery) as an example to elaborate and demonstrate how the multidimensional information network data warehouse model is implemented.

As shown in Figure 7, the method includes:

Step 201: Acquire an original data set.

At present, most researchers use unprocessed, disorganized data sets for research and analysis. For the classic partner network, many versions of the data set exist. Typical ones are the XML-based DBLP (Digital Bibliography & Library Project) data set and the ACM data set. In the ACM raw data set, the data format of the XML version is organized as follows:

The original data set may be stored as unstructured text, which does not lend itself to efficient query and analysis by the user. This solution extracts information from the acquired ACM data set and classifies and stores it, so that query and analysis operations can be performed efficiently.

Step 202: Extract information representing the structure of the information network graph from the original data set. The information representing the structure of the information network graph includes at least node information, node attribute information, side information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information; the side information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. An edge is used to describe the relationship between two nodes.

Because a node is linked to its attributes, an edge is linked to its attributes, and an edge is the connection between two nodes, the extracted node information, node attribute information, side information, and edge attribute information are all related to one another.

In the partner network, the extracted node information can be stored in a node fact table (VFT, Vertex Fact Table). The stored node information may include a node ID and a node attribute key, and may also include a metric of the node. In the partner network, a node represents an author.

The extracted side information can be stored in an edge fact table (EFT, Edge Fact Table). The stored side information can include the IDs of the two author nodes, id1 and id2 (which together represent the edge identifier), and the edge attribute keys, for example the paper key (Paper_key), the time key (Time_key), and the venue key (Venue_key). The side information can also include a measure of the edge.

The node attribute information is the specific information corresponding to the node attribute key. The node attribute information may be stored in a topology dimension table (TDT, Topology Dimension Table), and there may be one or more topology dimension tables. For example, if the node attribute key in the node information is the institution key (Institution_ID), the topology dimension table can store the names of the institutions at which all authors (i.e., nodes) have worked. The edge attribute information is the specific information corresponding to the edge attribute key, and may be stored in an information dimension table (IDT, Information Dimension Table). For example, the edge attribute information corresponding to the paper key (Paper_key), time key (Time_key), and venue key (Venue_key) above can be stored in the Paper dimension table, the Time dimension table, and the Venue dimension table, respectively. The information dimension tables record, for example, where a paper was published, when it was published, and the title of the paper. For example, the Paper dimension table can contain Paper_key and Paper_name.

FIG. 8 shows the multidimensional information network data warehouse model. In the method provided by the embodiment of the present invention, the extracted information representing the structure of the information network graph, which includes at least node information, node attribute information, side information, and edge attribute information, is stored according to the multidimensional information network data warehouse model shown in FIG. 8, and the corresponding information may specifically be stored in the form of tables.

Step 203: Store extracted node information, node attribute information, side information, and edge attribute information, where the foregoing information is stored by using a node fact table, a topology dimension table, an edge fact table, and an information dimension table.
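The four kinds of tables named in step 203 can be sketched as a relational schema. The following is a minimal illustration using Python's standard sqlite3 module, not the schema of the embodiment: the column types, the Year and Decade columns, the Institution_name column, and the helper names create_warehouse and table_names are assumptions made for this example.

```python
import sqlite3

# Hypothetical DDL for the partner-network warehouse; column types are assumptions.
SCHEMA = """
CREATE TABLE Institution (            -- topology dimension table (TDT)
    Institution_id   INTEGER PRIMARY KEY,
    Institution_name TEXT
);
CREATE TABLE Paper (                  -- information dimension table (IDT)
    Paper_key  INTEGER PRIMARY KEY,
    Paper_name TEXT
);
CREATE TABLE Time (                   -- information dimension table (IDT)
    Time_key INTEGER PRIMARY KEY,
    Year     INTEGER,
    Decade   TEXT
);
CREATE TABLE Venue (                  -- information dimension table (IDT)
    Venue_key  INTEGER PRIMARY KEY,
    Venue_name TEXT
);
CREATE TABLE VFT (                    -- node fact table
    Author_id      INTEGER PRIMARY KEY,
    Author_name    TEXT,
    Institution_id INTEGER REFERENCES Institution(Institution_id),
    Paper_Num      INTEGER            -- node metric
);
-- The text treats (Author1_id, Author2_id) as the edge identifier. No uniqueness
-- constraint is declared here, so one pair may appear with several papers
-- (an assumption of this sketch).
CREATE TABLE EFT (                    -- edge fact table
    Author1_id   INTEGER REFERENCES VFT(Author_id),
    Author2_id   INTEGER REFERENCES VFT(Author_id),
    Paper_key    INTEGER REFERENCES Paper(Paper_key),
    Time_key     INTEGER REFERENCES Time(Time_key),
    Venue_key    INTEGER REFERENCES Venue(Venue_key),
    Co_frequence INTEGER              -- edge metric
);
"""


def create_warehouse() -> sqlite3.Connection:
    """Create an empty in-memory warehouse with the tables above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def table_names(conn: sqlite3.Connection) -> list:
    """List the tables that exist in the warehouse."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return sorted(r[0] for r in rows)
```

In this sketch the dimension tables are created before the fact tables only for readability; the foreign-key clauses document the links between fact tables and dimension tables described in the text.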

In order to more clearly understand the information of the edge fact table, the node fact table, the information dimension table, and the topology dimension table, detailed description will be made below with reference to the specific drawings.

First, the edge fact table

The edge fact table (EFT) of the partner network consists of the IDs of the two author nodes (Author1_id, Author2_id), the key of each edge attribute (Paper_key, Time_key, Venue_key; the specific information of the edge attributes is stored in the information dimension tables), and the metric (which can be Co_frequence). Author1_id and Author2_id together constitute the primary key of the edge fact table of the partner network and can locate an edge (that is, they can represent the edge identifier). The connection between the edge fact table and each information dimension table is made through Paper_key, Time_key, and Venue_key. One edge corresponds to an edge fact table. The specific information carried in the edge fact table can be represented by an edge fact relation table. As shown in FIG. 9, the edge fact table is converted into the edge fact relation table. The table on the left side of FIG. 9 shows only the header of the edge fact table, that is, the important edge-related information in the edge fact table, such as the edge identifier and the edge attribute keys. The table on the right side of FIG. 9 holds the specific information of the edge identifier and the edge attribute keys, which can be understood as their specific values.

For example, in the first row of the table on the right in Figure 9, Author1_id has the value 0 and Author2_id has the value 1.

Paper_key has the value 1, which refers to the specific information of the paper on which the authors with IDs 0 and 1 cooperated; see the record with key 1 in the corresponding information dimension table for the paper's details.

Time_key has the value 1, which refers to the specific time at which the authors with IDs 0 and 1 cooperated; see the record with key 1 in the corresponding information dimension table for the time details.

Venue_key has the value 1, which refers to the specific venue at which the authors with IDs 0 and 1 cooperated; see the record with key 1 in the corresponding information dimension table for the venue details.

Paper_key, Time_key, and Venue_key are the edge attribute keys included in the side information, and each key value refers to a record in the corresponding information dimension table.

Co_frequence has the value 1. It is the measure of the edge included in the side information and is usually a specific number; here the value 1 can be understood to mean that the authors with IDs 0 and 1 have cooperated once.

Second, the node fact table

The node fact table (VFT) of the partner network consists of the node information (specifically, the node ID or the author ID) and the node attribute key, and may also include the metric of the node.

The node information includes a node ID and/or an author ID; that is, the node information may be the node ID alone, the node ID and the author ID together, or the author ID alone. The author ID (Author_id) uniquely represents a node and serves as the primary key of the node fact table.

The node attribute key may be the primary key of a topology dimension table (the primary key can be understood as the subject of the information recorded in the topology dimension table; for example, the primary key Institution_id of the institution topology dimension table records the identifier of the institution). There can be multiple topology dimension tables, each of which reflects one attribute of a node.

The metric of the node may be the number of articles published by the author that the node represents (i.e., Paper_Num); other node metrics may also be used.

There is usually one node fact table. The link between the node fact table and the topology dimension table can be implemented through the primary key of the topology dimension table (i.e., Institution_id). The specific information carried in the node fact table may be represented by a node fact relation table. As shown in FIG. 10, the node fact table is converted into a node fact relation table. The table on the left side of FIG. 10 shows only the header of the node fact table, that is, the important node-related information in the node fact table, such as the author identifier, the author name, the institution code, and the number of papers published by the author. The table on the right side of FIG. 10 holds the specific information of the node identifier and the node attribute key, which can be understood as their specific values.

For example, the first row in the table on the right in Figure 10 has the author ID 0, the author name Janwei Han, the institution code 1, and 15 papers published by the author.
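The link between the node fact table and a topology dimension table can be illustrated with a small join. The sketch below loads the example row from the text into an in-memory SQLite database; the institution name 'Institution A' is a placeholder (the text gives only the code 1), and the function name author_profile is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Institution (Institution_id INTEGER PRIMARY KEY, Institution_name TEXT);
CREATE TABLE VFT (Author_id INTEGER PRIMARY KEY, Author_name TEXT,
                  Institution_id INTEGER, Paper_Num INTEGER);
-- Row taken from the example in the text.
INSERT INTO VFT VALUES (0, 'Janwei Han', 1, 15);
-- Placeholder name: the text only states that the institution code is 1.
INSERT INTO Institution VALUES (1, 'Institution A');
""")


def author_profile(conn, author_id):
    """Resolve a node's attribute key through the topology dimension table."""
    return conn.execute(
        """SELECT v.Author_name, i.Institution_name, v.Paper_Num
           FROM VFT AS v
           JOIN Institution AS i ON v.Institution_id = i.Institution_id
           WHERE v.Author_id = ?""",
        (author_id,),
    ).fetchone()
```

Calling author_profile(conn, 0) follows Institution_id from the node fact table into the topology dimension table and returns the author name, the institution name, and the paper count as one tuple.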

Third, the information dimension table

The information dimension table (IDT) consists of a primary key that identifies the information dimension table (the primary key can be understood as the subject of the information recorded in the information dimension table) and some related attributes of the information dimension. There can be multiple information dimensions, and each dimension has a relational table associated with it, called a dimension table, which further describes the dimension. The information dimensions in the partner network include the Paper dimension table, the Time dimension table, and the Venue dimension table. A dimension table is set by the user according to the actual situation, or is automatically generated and adjusted according to the data distribution. The conversion of the information dimension tables into relational information dimension tables is shown in Figure 11:

In the converted information dimension tables on the right side of FIG. 11, the Paper_key value 1 uniquely identifies the Paper record whose Paper_name is FP tree and whose Paper_classify is ap311 aper i; the Paper_key values 2, 3, and 4 are understood in the same way.

The Time_key value 1 uniquely identifies the Time record for the year 1967 and the decade of the 1960s. The Time_key values 2, 3, and 4 are understood in the same way.

The Venue_key value 1 uniquely identifies the Venue record whose Venue_name is VLDB and whose Venue_area is DB. The Venue_key values 2, 3, and 4 are understood in the same way.
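To show how the keys in an edge fact row lead to the dimension records just described, the following sketch resolves the example edge (authors 0 and 1, all keys equal to 1) against in-memory dimension tables. It is an illustration only: the dictionary layout, the field names Year, Decade, and Venue_area, and the function name describe_edge are assumptions of this example.

```python
# Edge fact row from the example in the text.
EFT = [
    {"Author1_id": 0, "Author2_id": 1, "Paper_key": 1,
     "Time_key": 1, "Venue_key": 1, "Co_frequence": 1},
]

# Dimension records for key 1 as described in the text.
# The field names Year, Decade and Venue_area are assumed for this sketch.
PAPER = {1: {"Paper_name": "FP tree"}}
TIME = {1: {"Year": 1967, "Decade": "1960s"}}
VENUE = {1: {"Venue_name": "VLDB", "Venue_area": "DB"}}


def describe_edge(edge):
    """Replace each edge attribute key by the record it refers to."""
    return {
        "authors": (edge["Author1_id"], edge["Author2_id"]),
        "paper": PAPER[edge["Paper_key"]]["Paper_name"],
        "year": TIME[edge["Time_key"]]["Year"],
        "venue": VENUE[edge["Venue_key"]]["Venue_name"],
        "co_frequence": edge["Co_frequence"],
    }
```

The edge fact row itself stores only small integer keys and the metric; the descriptive values are held once in the dimension tables and are looked up when needed.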

Fourth, the topology dimension table

The topological dimension determines the edge set and node set of the information network, that is, it determines the topology of the graph in the information network, and in turn the granularity of the unit that each node represents. The topological dimension of the partner network is the institution. The topology dimension table (TDT) consists of a primary key that uniquely identifies the topology dimension table and some related attributes of the topological dimension. There can be more than one topology dimension table. The conversion of each topology dimension table into a relational topology dimension table is shown in Fig. 12; that is, the content of a topology dimension table can be stored in the form of the relational topology dimension table on the right side of Fig. 12.

To facilitate understanding of the information network data warehouse model, the concepts involved are further explained as follows.

Information dimension: Let the graph structure be G(V, E) = G(V, f(ID)), where V is the set of nodes in the graph, E is the set of edges, and the function f is the side information decision function of graph G. Let the variable ID = {I1, I2, ..., Im} be the set of dimensions to be investigated in OLGP. The dimension set formed by these m information attributes can only determine the edge set of the graph and cannot change the topological structure of the graph. ID is called the information dimension set.

Topological dimension: Let the variable TD = {T1, T2, ..., Tn} be the set of dimensions in OLGP that map to the topological structure. A graph can be expressed as G(V, E) = G(Φ(TD), δ(TD)), where the function Φ is the node topology decision function and the function δ is the edge topology decision function. The dimension set formed by the topological attributes determines the node set and edge set of the graph, and thereby the topological structure of the graph. TD is called the topological dimension set.

Information network data warehouse model: Let ROLGP(EFT, VFT, S(IDT), S(TDT), F) be the relational OLGP data cube, where EFT is the edge fact table, VFT is the node fact table, S(IDT) is the set of information dimension tables, IDT is an information dimension table, S(TDT) is the set of topology dimension tables, TDT is a topology dimension table, and F is the set of dependencies between the tables. The following constraints must be met:

(1) IDT is connected to EFT through a foreign key, TDT is connected to VFT through a foreign key, and EFT is connected to VFT through the node ID.

(2) EFT, VFT, IDT, and TDT are each relational tables, that is, each satisfies the definition R(U, D, Dom, F'), where R is the relation name, U is the set of attribute names that make up the relation, D is the set of domains from which the attributes in U take their values, Dom is the mapping from attributes to domains, and F' is the set of data dependencies between attributes.
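Constraint (1) can be checked mechanically on a set of tables. The following sketch assumes each table is held as a list of dictionaries and each dimension table is summarized by the set of its primary-key values; the function names missing_references and check_constraint_1, and the default column names, are hypothetical.

```python
def missing_references(rows, column, valid_keys):
    """Values of rows[column] that do not appear in valid_keys."""
    return sorted({row[column] for row in rows} - set(valid_keys))


def check_constraint_1(eft, vft, idt_keys, tdt_keys, node_id="Author_id",
                       endpoints=("Author1_id", "Author2_id")):
    """Report dangling keys for constraint (1).

    idt_keys maps an EFT foreign-key column to the primary keys of its IDT;
    tdt_keys maps a VFT foreign-key column to the primary keys of its TDT.
    An empty result means every link required by constraint (1) resolves.
    """
    problems = {}
    for column, keys in idt_keys.items():          # IDT <-> EFT
        bad = missing_references(eft, column, keys)
        if bad:
            problems["EFT." + column] = bad
    for column, keys in tdt_keys.items():          # TDT <-> VFT
        bad = missing_references(vft, column, keys)
        if bad:
            problems["VFT." + column] = bad
    node_ids = {row[node_id] for row in vft}       # EFT <-> VFT via node ID
    for column in endpoints:
        bad = missing_references(eft, column, node_ids)
        if bad:
            problems["EFT." + column] = bad
    return problems
```

For example, an edge fact row whose Venue_key has no matching record in the Venue dimension table would be reported under the entry "EFT.Venue_key".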

Similar to traditional OLAP modeling, OLGP-based information networks are modeled with fact tables and dimension tables. The difference is that the fact tables comprise the edge fact table (EFT) and the node fact table (VFT), and the dimension tables comprise the information dimension table (IDT) and the topology dimension table (TDT). The OLGP information network is modeled on relational data. Nodes and edges are stored in the node fact table and the edge fact table, respectively. The attributes related to edges are stored in the information dimension tables, and the attributes related to nodes are stored in the topology dimension tables.

The second embodiment of the present invention provides a method for storing data. The method acquires an original data set and extracts node information, node attribute information, side information, and edge attribute information from it. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information; the side information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. Because a node is linked to its attributes, an edge is linked to its attributes, and an edge is the connection between two nodes, the extracted node information, node attribute information, side information, and edge attribute information are all related to one another, and the extracted information is stored. Since the extracted information is interrelated, the required data can be located quickly and accurately when the data is subsequently operated on. In addition, compared with the existing OLAP multi-dimensional data warehouse model, the information stored by the solution provided by the embodiment of the present invention includes the same node information and node attribute information as the prior art, so that the researcher can still carry out node-centered analysis, and further includes side information and edge attribute information, which the prior art cannot focus on, so that the researcher can also pay attention to the relationships between nodes.

Further, in the solution provided by the embodiment of the present invention, the connection between nodes and edges is established according to the connections between nodes, so the node information and node attribute information are directly linked to the side information and edge attribute information. The solution can therefore attend to the relationships between nodes, by way of the edges that record those relationships, while requiring only small changes to the prior art.

Preferably, because of the storage method provided by the embodiment of the present invention, subsequent query operations on the stored data can be performed quickly and accurately. The method may further include the following step:

Step 204: For the data to be queried, locate it among the stored node information, node attribute information, side information, and edge attribute information, and then perform the query within the located one of the node information, node attribute information, side information, or edge attribute information.

That is, it is first determined whether the data to be queried belongs to the node information, the node attribute information, the side information, or the edge attribute information, and the query is then performed only within the information so determined, which greatly narrows the scope of the query.

For example, consider querying the number of papers published at different conferences in the partner network. Because the storage method of steps 201 to 203 above is used, the query involves only the EFT and the Venue table (the venue table among the information dimension tables) in the multidimensional information network data warehouse model, and the edge attribute key Venue_key in the EFT establishes the connection with the Venue table. The number of papers published at different conferences can then be queried. The specific query operation can be as follows:

Structured Query Language (SQL):

SELECT EFT.Paper_key
FROM EFT, Venue
WHERE EFT.Venue_key = Venue.Venue_key AND Venue.Venue_name = 'meeting name'

With step 204 added, when data needs to be queried, it can be determined which one or more of the edge fact table, the node fact table, the information dimension tables, and the topology dimension tables in the multidimensional information network data warehouse the required information belongs to. A large amount of redundant information can thus be excluded, making the query efficient and saving time. A query for a specific problem involves join operations on only some of the tables.
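The venue query above can be run end to end on a small in-memory database. The sketch below is illustrative only: the sample rows (including the venue name 'SIGMOD') are invented for the example, and the function names papers_at and paper_counts are hypothetical. The second function adds a GROUP BY, which is one possible way to obtain the paper count per conference and is not taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Venue (Venue_key INTEGER PRIMARY KEY, Venue_name TEXT);
CREATE TABLE EFT (Author1_id INTEGER, Author2_id INTEGER, Paper_key INTEGER,
                  Time_key INTEGER, Venue_key INTEGER, Co_frequence INTEGER);
-- Sample rows invented for this sketch.
INSERT INTO Venue VALUES (1, 'VLDB'), (2, 'SIGMOD');
INSERT INTO EFT VALUES (0, 1, 1, 1, 1, 1), (0, 2, 2, 1, 1, 1), (1, 2, 3, 2, 2, 1);
""")


def papers_at(conn, venue_name):
    """The query from the text, with the meeting name passed as a parameter."""
    rows = conn.execute(
        "SELECT DISTINCT EFT.Paper_key FROM EFT, Venue "
        "WHERE EFT.Venue_key = Venue.Venue_key AND Venue.Venue_name = ? "
        "ORDER BY EFT.Paper_key",
        (venue_name,),
    )
    return [r[0] for r in rows]


def paper_counts(conn):
    """Number of distinct papers per venue (an assumed extension of the query)."""
    return conn.execute(
        "SELECT Venue.Venue_name, COUNT(DISTINCT EFT.Paper_key) "
        "FROM EFT JOIN Venue ON EFT.Venue_key = Venue.Venue_key "
        "GROUP BY Venue.Venue_name ORDER BY Venue.Venue_name"
    ).fetchall()
```

Only the EFT and the Venue table take part in the join; the node fact table, the topology dimension tables, and the other information dimension tables are never touched by this query.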

Preferably, the method further comprises the following steps:

Step 205: Perform online graph processing (OLGP, Online Graph Processing) operations based on the extracted node information, node attribute information, side information, and edge attribute information.

OLGP operations can include, but are not limited to: informational roll-up (I-OLGP), topological roll-up (T-OLGP), asynchronous roll-up, drill-down, slice, dice, and pivot.

An informational roll-up (I-OLGP) can be performed on the partner network. Specifically, a roll-up through the levels year (year) → decade (decade) → all (all) is performed on the time dimension of the information dimensions, going from the number of papers published in different years, to the number of papers published in different decades, and then to the number of papers published over all time.
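The year → decade → all roll-up on the time dimension can be sketched as a grouping step over the Time dimension table. In the example below, the Time records with keys 2 and 3 and the (Paper_key, Time_key) pairs are sample data invented for illustration, and the function name roll_up is hypothetical.

```python
from collections import defaultdict

# Time dimension table; key 1 follows the text, keys 2 and 3 are sample data.
TIME = {
    1: {"year": 1967, "decade": "1960s"},
    2: {"year": 1968, "decade": "1960s"},
    3: {"year": 1971, "decade": "1970s"},
}

# (Paper_key, Time_key) pairs as they would appear in edge fact rows (sample data).
EDGE_PAPERS = [(1, 1), (2, 1), (3, 2), (4, 3)]


def roll_up(level):
    """Count distinct papers at the level 'year', 'decade' or 'all'."""
    groups = defaultdict(set)
    for paper_key, time_key in EDGE_PAPERS:
        group = "all" if level == "all" else TIME[time_key][level]
        groups[group].add(paper_key)
    return {group: len(papers) for group, papers in groups.items()}
```

With this sample data, roll_up("year") gives one count per year, roll_up("decade") merges 1967 and 1968 into the 1960s, and roll_up("all") collapses everything into a single total. The node set of the graph is not affected at any level.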

A topological roll-up (T-OLGP) can also be performed on the partner network. Specifically, a roll-up through the levels individual author → institution (Institution) → all (all) is performed on the institution dimension in the topology dimension table, going from the cooperation between different authors to the cooperation between different institutions.
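A topological roll-up from authors to institutions can be sketched as merging nodes that share an institution and summing the edge metric between the merged nodes. The sample data and the function name topological_roll_up below are assumptions, as is the choice to drop cooperation inside a single institution rather than keep it as a self-loop.

```python
from collections import Counter

# Author_id -> Institution_id, as stored in the node fact table (sample data).
AUTHOR_INSTITUTION = {0: 1, 1: 2, 2: 2, 3: 3}

# (Author1_id, Author2_id, Co_frequence) edges (sample data).
AUTHOR_EDGES = [(0, 1, 1), (0, 2, 2), (1, 2, 5), (2, 3, 1)]


def topological_roll_up(edges, node_to_group):
    """Merge nodes by group and sum the edge metric between different groups."""
    rolled = Counter()
    for a, b, weight in edges:
        group_a, group_b = node_to_group[a], node_to_group[b]
        if group_a == group_b:
            continue  # cooperation inside one institution is dropped in this sketch
        rolled[tuple(sorted((group_a, group_b)))] += weight
    return dict(rolled)
```

Unlike the informational roll-up, this operation changes both the node set and the edge set of the graph: four author nodes become three institution nodes, and the edges between them are recomputed.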

It should be understood that a roll-up operation generalizes low-level detail data into higher-level summary data along a certain dimension. For example, for the information dimension (time dimension), rolling up from year to decade yields data aggregated by decade, and rolling up again from decade to all yields data aggregated over all years. Because of the relationships among the extracted node information, node attribute information, side information, and edge attribute information, when an online graph processing (OLGP) operation is performed on the stored information, the different classes of information can be processed separately, for example by operating only on the edge attribute information stored in the information dimension tables, or only on the node attribute information stored in the topology dimension tables.

Further, with the storage method provided by the embodiment of the present invention, multi-topic modeling can be performed by sharing information dimensions, so that the underlying data needs little restructuring and existing dimension tables are shared as much as possible. For example, in the keyword-partner network model, since the keyword network and the partner network both include the Paper, Time, and Venue dimensions, the keyword-partner network can be constructed by sharing these three information dimensions. As shown in Figure 13, the keyword fact table and the collaborator fact table share the Venue, Paper, and Time dimensions to build a keyword-collaborator multidimensional information network data warehouse model.

In the keyword-collaborator multidimensional information network data warehouse model shown in Figure 13, the nodes stored in the four columns on the left represent terms (Term), and the edges represent the relationships between terms; the four columns on the left store the node information, node attribute information, side information, and edge attribute information obtained by performing steps 201 to 203 above. The nodes stored in the four columns on the right represent authors, and the edges represent the relationships between authors; the four columns on the right likewise store the node information, node attribute information, side information, and edge attribute information obtained by performing steps 201 to 203 above. That is, the topics stored on the left and on the right are different (on the left the nodes are terms, and on the right the nodes are authors).

The Co-IDT in the middle can serve as the information dimension table of the warehouse on the left and also as the information dimension table of the warehouse on the right. That is, the multidimensional information network data warehouses on the left and right share the information dimension tables, so the edge attribute information stored in the two warehouses is the same.

Therefore, even when the topics represented by the nodes differ, data for multiple topics stored with the storage method provided by the embodiment of the present invention can be modeled together by sharing information dimensions, so that the underlying data needs little restructuring and existing dimension tables are shared as much as possible.
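Sharing an information dimension between two topics can be sketched with two edge fact tables that reference the same Venue table. The table names Author_EFT and Term_EFT, the sample rows, and the function name edges_at_venue are assumptions made for this illustration; only the idea of a shared dimension table comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Venue (Venue_key INTEGER PRIMARY KEY, Venue_name TEXT);  -- shared IDT
CREATE TABLE Author_EFT (Author1_id INTEGER, Author2_id INTEGER, Venue_key INTEGER);
CREATE TABLE Term_EFT (Term1_id INTEGER, Term2_id INTEGER, Venue_key INTEGER);
-- Sample rows invented for this sketch.
INSERT INTO Venue VALUES (1, 'VLDB');
INSERT INTO Author_EFT VALUES (0, 1, 1), (0, 2, 1);
INSERT INTO Term_EFT VALUES (10, 11, 1);
""")

FACT_TABLES = ("Author_EFT", "Term_EFT")


def edges_at_venue(conn, fact_table, venue_name):
    """Count the edges of one topic at a venue, using the shared Venue table."""
    if fact_table not in FACT_TABLES:
        raise ValueError("unknown fact table: " + fact_table)
    sql = (
        "SELECT COUNT(*) FROM " + fact_table + " AS f "
        "JOIN Venue AS v ON f.Venue_key = v.Venue_key "
        "WHERE v.Venue_name = ?"
    )
    return conn.execute(sql, (venue_name,)).fetchone()[0]
```

Adding the keyword topic here requires only a new fact table; the Venue table is reused unchanged by both topics.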

Embodiment 3

The embodiment of the present invention provides a method for storing data that is similar to the method provided by the foregoing embodiments, except that it is another specific application of the storage method, in which the storage method is applied to a movie actor cooperative network.

The movie actor cooperative network is also a kind of information network. When the user needs to pay attention to the cooperation between actors, a node represents an actor and an edge represents the cooperation between two actors. The movie actor cooperative network is shown in Figure 14. The node description includes the actor's name, gender, age, and affiliated film company; the edge description includes the movie name and release time. As shown in Figure 15, the method includes:

Step 301: Acquire an original data set. The original data set of the movie actor cooperative network is usually a disordered collection of actor names, genders, movie names, release times, and the like, which is difficult to search and to perform OLGP operations on.

Step 302: Extract information representing the structure of the information network graph from the original data set. The information representing the structure of the information network graph includes at least node information, node attribute information, side information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information; the side information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. An edge is used to describe the relationship between two nodes.

Because a node is linked to its attributes, an edge is linked to its attributes, and an edge describes the relationship between two nodes, the links among the extracted node information, node attribute information, side information, and edge attribute information can be represented conveniently by a graph structure.

In the movie actor cooperative network, the extracted node information may be stored in a node fact table (VFT, Vertex Fact Table), where the node information may include a node ID and a node attribute key, and may also include a metric of the node. As shown in Figure 15, in the VFT the node ID is the actor ID (Actor_id) together with the actor's name, the node attribute key is the code of the actor's film company (Film_Comany_id), and the node's metric is the number of films the actor has appeared in (Film_Num).

The extracted side information can be stored in an edge fact table (EFT, Edge Fact Table). The stored side information can include the IDs of the two actor nodes, id1 and id2 (which together represent the edge identifier), and the edge attribute keys, for example the cooperative movie key (Film_key) and the release time key (Release_Time_key). The side information can also include the measure of the edge (i.e., Co_Frequence).

The node attribute information is the specific information corresponding to the node attribute key. The node attribute information may be stored in a topology dimension table (TDT, Topology Dimension Table), and there may be one or more topology dimension tables. For example, if the node attribute key in the node information is Film_Comany_id, the topology dimension table can store the name of the film company to which the actor (i.e., node) belongs.

The edge attribute information is the specific information corresponding to the edge attribute key, and may be stored in an information dimension table (IDT, Information Dimension Table). For example, the edge attribute information corresponding to the cooperative movie key (Film_key) and the release time key (Release_Time_key) above may be stored in the movie dimension table and the release time dimension table, respectively. The movie dimension table records the movie name, movie type, and other information; the release time dimension table records the year, the decade, and other information.
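The extraction in step 302 can be sketched as a single pass over raw film records that assigns keys and emits one edge per pair of actors in the same film. The shape of the raw records, the sample values, and the function name extract are assumptions made for this illustration; the column names follow the identifiers used in the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records; the real original data set is not specified in the text.
RAW = [
    {"film": "Film A", "year": 2001, "cast": ["Actor X", "Actor Y", "Actor Z"]},
    {"film": "Film B", "year": 2003, "cast": ["Actor X", "Actor Y"]},
]


def extract(records):
    """Build node facts, edge facts and dimension keys from raw film records."""
    actor_ids, film_keys, time_keys = {}, {}, {}
    film_num = Counter()
    eft = []
    for rec in records:
        # Assign dimension keys on first sight (keys start at 1).
        film_key = film_keys.setdefault(rec["film"], len(film_keys) + 1)
        time_key = time_keys.setdefault(rec["year"], len(time_keys) + 1)
        # Assign node IDs on first sight (IDs start at 0).
        ids = [actor_ids.setdefault(name, len(actor_ids)) for name in rec["cast"]]
        for actor_id in ids:
            film_num[actor_id] += 1
        # One edge fact row per pair of actors who appear in the same film.
        for a, b in combinations(sorted(ids), 2):
            eft.append({"Actor1_id": a, "Actor2_id": b, "Film_key": film_key,
                        "Release_Time_key": time_key, "Co_Frequence": 1})
    vft = [{"Actor_id": actor_id, "Actor_name": name, "Film_Num": film_num[actor_id]}
           for name, actor_id in actor_ids.items()]
    return vft, eft, film_keys, time_keys
```

The returned film_keys and time_keys mappings play the role of the movie dimension table and the release time dimension table; the topology dimension (film company) is omitted from this sketch because the sample records carry no company information.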

Step 303: Store the extracted node information, node attribute information, edge information, and edge attribute information. This information is stored by using a node fact table, a topology dimension table, an edge fact table, and an information dimension table. As shown in FIG. 16, the node fact table, the topology dimension table, the edge fact table, and the information dimension table together form a multi-dimensional information network data warehouse model.

In the method for storing data provided by the third embodiment of the present invention, an original data set is acquired, and node information, node attribute information, edge information, and edge attribute information are extracted from it. The node information includes at least a node identifier and a node attribute key, and the node attribute key has a corresponding relationship with the node attribute information; the edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key has a corresponding relationship with the edge attribute information. Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the connection between two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, and they are stored together with these relationships. Since the extracted information is interrelated, the required data can be located quickly and accurately when the data is subsequently operated on.

At the same time, compared with the existing OLAP multi-dimensional data warehouse model, the information stored by the solution provided by the embodiment of the present invention includes not only the same node information and node attribute information as the prior art, so that a researcher can focus on node-centered facts, but also edge information and edge attribute information, which the prior art cannot focus on, so that the researcher can also pay attention to the relationships between nodes.
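The four-table layout of step 303 can be illustrated with a minimal sketch using Python's built-in sqlite3 module. All table and column names here (Actor_TDT, Film_count, Source_node, and so on) are assumptions chosen for the movie actor example; the specification itself does not prescribe a schema.

```python
import sqlite3

# Hypothetical schema; names are illustrative assumptions, not taken from the patent.
SCHEMA = """
CREATE TABLE Actor_TDT (            -- topology dimension table: node attribute information
    Actor_key INTEGER PRIMARY KEY,
    Name      TEXT,
    Company   TEXT
);
CREATE TABLE Release_Time (         -- information dimension table: edge attribute information
    Release_Time_key INTEGER PRIMARY KEY,
    Year   INTEGER,
    Decade INTEGER
);
CREATE TABLE NFT (                  -- node fact table: node identifier, attribute key, metric
    Node_id    INTEGER PRIMARY KEY,
    Actor_key  INTEGER REFERENCES Actor_TDT(Actor_key),
    Film_count INTEGER
);
CREATE TABLE EFT (                  -- edge fact table: edge identifier, endpoints, attribute keys
    Edge_id          INTEGER PRIMARY KEY,
    Source_node      INTEGER REFERENCES NFT(Node_id),
    Target_node      INTEGER REFERENCES NFT(Node_id),
    Film_key         INTEGER,
    Release_Time_key INTEGER REFERENCES Release_Time(Release_Time_key)
);
"""

def build_warehouse():
    """Create an empty in-memory warehouse with the four kinds of tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

conn = build_warehouse()
tables = sorted(row[0] for row in
                conn.execute("SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)
```

In this sketch the edge fact table references the node fact table through its two endpoint columns, and each fact table references its dimension table through an attribute key, which mirrors the relationships among the four kinds of information described above.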

Step 304: For the data to be queried, locate it within the stored node information, node attribute information, edge information, and edge attribute information, and then perform the query in the located one of these. That is, first determine whether the data to be queried belongs to the node information, the node attribute information, the edge information, or the edge attribute information; the query is then performed only in the located information, which narrows the scope of the query.
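As a rough illustration of the locating step, the sketch below maps each requested field to the table that holds it, so that a query touches only those tables. The OWNER mapping and its field names are hypothetical and stand in for whatever catalog an implementation would keep.

```python
# Hypothetical catalog: which table stores which (non-key) field.
OWNER = {
    "Film_count": "NFT",          # node fact table
    "Name": "Actor_TDT",          # topology dimension table
    "Company": "Actor_TDT",
    "Film_key": "EFT",            # edge fact table
    "Year": "Release_Time",       # information dimension table
    "Decade": "Release_Time",
}

def locate(fields):
    """Return the sorted list of tables that a query over `fields` must touch."""
    return sorted({OWNER[f] for f in fields})

needed = locate(["Film_key", "Year"])
print(needed)
```

A field that is not in the catalog raises a KeyError in this sketch; a real implementation would need its own policy for unknown fields.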

For example, suppose the number of movies released in each year is to be queried in the movie actor collaboration network. Because the storage method of steps 301 to 303 is used, the query involves only the EFT and the release time table (the Release_Time table among the information dimension tables) in the multi-dimensional information network data warehouse model. The edge attribute key Release-Time-key in the EFT establishes a join relationship with the information dimension table, that is, the Release_Time table, so the number of movies released in each year can be queried. The specific query operation can be as follows:

Structured Query Language (SQL):

SELECT EFT.Film_key

FROM EFT, Release_Time

WHERE EFT.Release_Time_key = Release_Time.Release_Time_key AND

Release_Time.Year = "Year"

By adding step 304, when the data to be queried is queried, it can be determined which one or more of the edge fact table, the node fact table, the information dimension tables, and the topology dimension tables in the multi-dimensional information network data warehouse the required information belongs to. This avoids processing a large amount of redundant information, so the query is efficient and saves time. A query for a specific problem involves join operations on only some of the tables.
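The query in this example can be tried end to end with a small sketch using Python's sqlite3 module. The sample rows are invented for illustration, and the GROUP BY form is one assumed way of obtaining the per-year counts; only the EFT and Release_Time tables are joined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Release_Time (Release_Time_key INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE EFT (Edge_id INTEGER PRIMARY KEY, Film_key INTEGER, Release_Time_key INTEGER);
-- Invented sample data: film 100 appears on two edges (two actor pairs).
INSERT INTO Release_Time VALUES (1, 2010), (2, 2011);
INSERT INTO EFT VALUES (1, 100, 1), (2, 100, 1), (3, 101, 1), (4, 102, 2);
""")

def films_per_year(conn):
    """Count distinct films per release year by joining EFT with Release_Time."""
    rows = conn.execute("""
        SELECT rt.Year, COUNT(DISTINCT e.Film_key)
        FROM EFT AS e
        JOIN Release_Time AS rt ON e.Release_Time_key = rt.Release_Time_key
        GROUP BY rt.Year
        ORDER BY rt.Year
    """)
    return dict(rows.fetchall())

counts = films_per_year(conn)
print(counts)
```

COUNT(DISTINCT ...) is used because one film can appear on several edges, one per pair of cooperating actors.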

Preferably, the method further comprises the following steps:

Step 305: Perform an online graph processing (OLGP, Online Graph Processing) operation according to the relationships among the extracted node information, node attribute information, edge information, and edge attribute information. OLGP operations may include, but are not limited to: information dimension roll-up (I-OLGP), topology dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slice, dice, and pivot.

The information dimension roll-up (I-OLGP) can be applied to the collaboration network. Specifically, a roll-up of the form year (year) → decade (decade) → all (all) is performed at different levels of the time dimension of the information dimension table, going from the number of movies released in each year, to the number of movies released in each decade, and then to the number of movies released over all time.
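The year → decade → all roll-up can be sketched in a few lines of Python. The film list is invented sample data and the roll_up helper is a hypothetical name; the sketch only shows how counts at one level are summed into the next coarser level.

```python
from collections import Counter

# Invented sample of edge attribute rows: (film key, release year).
films = [(100, 1994), (101, 1999), (102, 2003), (103, 2003), (104, 2010)]

def roll_up(counts_by_key, parent_of):
    """Aggregate counts from one level of a dimension to its parent level."""
    rolled = Counter()
    for key, n in counts_by_key.items():
        rolled[parent_of(key)] += n
    return dict(rolled)

by_year = dict(Counter(year for _, year in films))        # finest level
by_decade = roll_up(by_year, lambda y: y // 10 * 10)      # year -> decade
overall = roll_up(by_decade, lambda d: "all")             # decade -> all

print(by_year, by_decade, overall)
```

Each level is computed from the level below it rather than from the raw rows, which is what makes the operation a roll-up along the dimension hierarchy.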

The topology dimension roll-up (T-OLGP) can also be applied to the collaboration network. Specifically, a roll-up of the form actor (Actor) → film company → all (all) is performed at different levels of the organization dimension of the topology dimension table, going from the cooperation relationships between different actors to the cooperation relationships between different film companies.

Because the extracted node information, node attribute information, edge information, and edge attribute information are related, different categories of information can be processed separately when an online graph processing (OLGP) operation is performed on the stored information; for example, only the edge attribute information stored in the information dimension table is operated on, or only the node attribute information stored in the topology dimension table is operated on. Further, with the storage method provided by the embodiment of the present invention, multi-topic modeling can be performed by sharing information dimensions, so that little of the underlying data needs to be reconstructed and existing dimension tables can be shared as much as possible.
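The topology dimension roll-up from actors to film companies described above can be sketched as follows. The actor names, company names, and edge weights are invented, and dropping edges whose endpoints fall in the same company is an assumption of this sketch rather than something the specification states.

```python
from collections import Counter

# Hypothetical topology dimension: actor -> film company.
company_of = {"A1": "StudioX", "A2": "StudioX", "A3": "StudioY", "A4": "StudioZ"}

# Hypothetical edge facts: (actor, actor, number of films made together).
actor_edges = [("A1", "A2", 3), ("A1", "A3", 2), ("A2", "A3", 1), ("A3", "A4", 4)]

def topology_roll_up(edges, parent_of):
    """Merge nodes into their parent group and sum edge weights between groups."""
    rolled = Counter()
    for u, v, weight in edges:
        pu, pv = parent_of[u], parent_of[v]
        if pu == pv:
            continue  # assumption: cooperation inside one company is not kept as an edge
        rolled[tuple(sorted((pu, pv)))] += weight
    return dict(rolled)

company_edges = topology_roll_up(actor_edges, company_of)
print(company_edges)
```

Unlike the information dimension roll-up, this operation changes the graph itself: the node set shrinks from actors to companies and parallel edges are merged.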

Embodiment 4

An embodiment of the present invention provides a data storage device. As shown in FIG. 17, the device includes: an obtaining unit 401, an extracting unit 402, and a storage unit 403;

The obtaining unit 401 is configured to acquire an original data set.

The original data set can be understood as the collection of all data gathered by the user; it is disordered and not well suited to analysis. The original data set acquired by the obtaining unit may be raw, unstructured text data input into the execution device.

The extracting unit 402 is configured to extract information representing the structure of the information network graph from the original data set, where the information representing the structure of the information network graph includes at least: node information, node attribute information, edge information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key has a corresponding relationship with the node attribute information; the edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key has a corresponding relationship with the edge attribute information; an edge is used to describe the relationship between two nodes.

Because nodes are related to node attributes, edges are related to edge attributes, and an edge describes the relationship between two nodes, the relationships among the extracted node information, node attribute information, edge information, and edge attribute information can be conveniently represented by a graph structure (see FIG. 5, FIG. 8, and FIG. 16 above).

The storage unit 403 is configured to store the extracted node information, node attribute information, edge information, and edge attribute information.

The storage unit may store the extracted node information, node attribute information, edge information, and edge attribute information in the form of tables, that is, correspondingly in a node fact table, a topology dimension table, an edge fact table, and an information dimension table. Storage in the form of tables is only one possible manner and is not a limitation of the embodiment of the present invention; other storage forms may also be used.
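The division of work among the obtaining unit, the extracting unit, and the storage unit can be sketched as three Python functions feeding one another. The record format "actor;actor;film;year", the Warehouse class, and the choice of reusing the node and edge identifiers as attribute keys are all assumptions made for this illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Warehouse:
    node_fact: dict = field(default_factory=dict)     # node id -> node attribute key
    topology_dim: dict = field(default_factory=dict)  # node attribute key -> node attributes
    edge_fact: list = field(default_factory=list)     # (edge id, source, target, edge attribute key)
    info_dim: dict = field(default_factory=dict)      # edge attribute key -> edge attributes

def acquire():
    """Obtaining unit: return the original data set (hypothetical text records)."""
    return ["Ann;Bob;Film A;2010", "Ann;Cid;Film B;2011"]

def extract(raw):
    """Extracting unit: derive nodes and edges from the raw records."""
    nodes, edges = {}, []
    for line in raw:
        a, b, film, year = line.split(";")
        for name in (a, b):
            nodes.setdefault(name, len(nodes) + 1)    # assign node ids in order of appearance
        edges.append((nodes[a], nodes[b], film, int(year)))
    return nodes, edges

def store(nodes, edges):
    """Storage unit: place the extracted information into the four tables."""
    wh = Warehouse()
    for name, node_id in nodes.items():
        wh.node_fact[node_id] = node_id               # assumption: attribute key equals node id
        wh.topology_dim[node_id] = {"Name": name}
    for edge_id, (src, dst, film, year) in enumerate(edges, start=1):
        wh.info_dim[edge_id] = {"Film": film, "Year": year}
        wh.edge_fact.append((edge_id, src, dst, edge_id))
    return wh

wh = store(*extract(acquire()))
print(wh.edge_fact)
```

Keeping the three stages as separate functions mirrors the unit structure of the device; a table-based store such as a relational database could replace the in-memory Warehouse without changing the first two stages.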

In the device for storing data provided by this embodiment of the present invention, an original data set is acquired, and node information, node attribute information, edge information, and edge attribute information are extracted from it. The node information includes at least a node identifier and a node attribute key, and the node attribute key has a corresponding relationship with the node attribute information; the edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key has a corresponding relationship with the edge attribute information. Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the connection between two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, and they are stored together with these relationships. Since the extracted information is interrelated, the required data can be located quickly and accurately when the data is subsequently operated on.

At the same time, compared with the existing OLAP multi-dimensional data warehouse model, the information stored by the solution provided by the embodiment of the present invention includes not only the same node information and node attribute information as the prior art, so that a researcher can focus on node-centered facts, but also edge information and edge attribute information, which the prior art cannot focus on, so that the researcher can also pay attention to the relationships between nodes.

Further, the apparatus for storing data provided by this embodiment of the present invention solves the redundancy problem of the original data set that exists in the existing OLAP multi-dimensional data warehouse model. The solution provided by the embodiment of the present invention has the advantages of flexible query, high efficiency, and flexible subject extraction.

Furthermore, the device for storing data provided by this embodiment of the present invention better matches the modeling requirements of real social networks and facilitates the design of efficient OLGP algorithms. The model is also easy to convert into traditional relational tables, which helps people understand real-world information.

Preferably, in the solution, the node information further includes: a node metric value; the edge information further includes: an edge metric value.

Preferably, in the solution, the extracted node information is stored in a node fact table;

The extracted edge information is stored in an edge fact table;

The extracted node attribute information is stored in a topology dimension table;

The extracted edge attribute information is stored in an information dimension table;

Since the edge is used to describe a relationship between a node and a node, the node fact table has a relationship with the edge fact table;

Since the node attribute key has a corresponding relationship with the node attribute information, the topology dimension table has a relationship with the node fact table; since the edge attribute key has a corresponding relationship with the edge attribute information, the information dimension table has a relationship with the edge fact table.

Preferably, the device further includes: a positioning unit 404 and a query unit 405;

The positioning unit 404 is configured to locate the data to be queried in the stored node information, node attribute information, edge information, and edge attribute information;

The query unit 405 is configured to perform the query in the located one of the node information, the node attribute information, the edge information, or the edge attribute information.

By adding the positioning unit 404 and the query unit 405, when the data to be queried is queried, it can be determined which one or more of the edge fact table, the node fact table, the information dimension tables, and the topology dimension tables in the multi-dimensional information network data warehouse the required information belongs to. This avoids processing a large amount of redundant information, so the query is efficient and saves time. A query for a specific problem involves join operations on only some of the tables.

Preferably, in this solution, the apparatus further includes: a graph processing unit 406;

The graph processing unit 406 is configured to perform an online graph processing operation according to the extracted node information, node attribute information, edge information, and edge attribute information.

Preferably, in the solution, the online graph processing operation in the graph processing unit 406 includes at least one of: information dimension roll-up (I-OLGP), topology dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slice, dice, and pivot.

Preferably, in the solution, if the extracted edge attribute information is stored in the information dimension table, the information dimension roll-up specifically includes: performing a roll-up operation on the information of one attribute, or of more than one attribute, of the edge stored in the information dimension table.

Preferably, in the solution, if the extracted node attribute information is stored in the topology dimension table, the topology dimension roll-up specifically includes: performing a roll-up operation on the information of one attribute, or of more than one attribute, of the node stored in the topology dimension table.

Because the extracted node information, node attribute information, edge information, and edge attribute information are related, different categories of information can be processed separately when an online graph processing (OLGP) operation is performed on the stored information; for example, only the edge attribute information stored in the information dimension table is operated on, or only the node attribute information stored in the topology dimension table is operated on.

Further, with the storage method provided by the embodiment of the present invention, multi-topic modeling can be performed by sharing information dimensions, so that little of the underlying data needs to be reconstructed and existing dimension tables can be shared as much as possible.

Embodiment 5

The embodiment of the present invention provides a data storage device. As shown in FIG. 18, the device includes: a memory 40, a processor 41, an input device 43, and an output device 44, each connected to a bus, wherein the memory 40 is used to store data input from the input device 43, and also to store the processor

The processor 41 is configured to extract information representing the structure of the information network graph from the original data set, where the information representing the structure of the information network graph includes at least: node information, node attribute information, edge information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key has a corresponding relationship with the node attribute information; the edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key has a corresponding relationship with the edge attribute information; an edge is used to describe the relationship between two nodes;

The memory 40 is further configured to store the extracted node information, node attribute information, edge information, and edge attribute information.

In the device for storing data provided by this embodiment of the present invention, an original data set is acquired, and node information, node attribute information, edge information, and edge attribute information are extracted from it. The node information includes at least a node identifier and a node attribute key, and the node attribute key has a corresponding relationship with the node attribute information; the edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key has a corresponding relationship with the edge attribute information. Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the connection between two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, and they are stored together with these relationships. Since the extracted information is interrelated, the required data can be located quickly and accurately when the data is subsequently operated on.

At the same time, compared with the existing OLAP multi-dimensional data warehouse model, the information stored by the solution provided by the embodiment of the present invention includes not only the same node information and node attribute information as the prior art, so that a researcher can focus on node-centered facts, but also edge information and edge attribute information, which the prior art cannot focus on, so that the researcher can also pay attention to the relationships between nodes.

Further, the apparatus for storing data provided by this embodiment of the present invention solves the redundancy problem of the original data set that exists in the existing OLAP multi-dimensional data warehouse model. The solution provided by the embodiment of the present invention has the advantages of flexible query, high efficiency, and flexible subject extraction.

Preferably, the node information processed by the processor 41 further includes: a node metric value; the edge information further includes: an edge metric value.

Preferably, the node information extracted by the processor 41 is stored in a node fact table; the extracted edge information is stored in an edge fact table; the extracted node attribute information is stored in a topology dimension table; and the extracted edge attribute information is stored in an information dimension table. Since the edge is used to describe a relationship between a node and a node, the node fact table has a relationship with the edge fact table; since the node attribute key has a corresponding relationship with the node attribute information, the topology dimension table has a relationship with the node fact table; and since the edge attribute key has a corresponding relationship with the edge attribute information, the information dimension table has a relationship with the edge fact table.

Preferably, the processor 41 is further configured to: locate the data to be queried in the stored node information, node attribute information, edge information, and edge attribute information; and perform the query in the located one of the node information, the node attribute information, the edge information, or the edge attribute information.

When the data to be queried is queried, it can be determined which one or more of the edge fact table, the node fact table, the information dimension tables, and the topology dimension tables in the multi-dimensional information network data warehouse the required information belongs to. This avoids processing a large amount of redundant information, so the query is efficient and saves time. A query for a specific problem involves join operations on only some of the tables.

Preferably, the processor 41 is further configured to perform an online graph processing operation according to the extracted node information, node attribute information, edge information, and edge attribute information.

Preferably, the online graph processing operation performed by the processor 41 includes at least one of: information dimension roll-up (I-OLGP), topology dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slice, dice, and pivot.

Preferably, if the extracted edge attribute information is stored in the information dimension table, the information dimension roll-up performed by the processor 41 specifically includes: performing a roll-up operation on the information of one attribute, or of more than one attribute, of the edge stored in the information dimension table.

Preferably, if the extracted node attribute information is stored in the topology dimension table, the topology dimension roll-up performed by the processor 41 specifically includes: performing a roll-up operation on the information of one attribute, or of more than one attribute, of the node stored in the topology dimension table.

Because the extracted node information, node attribute information, edge information, and edge attribute information are related, different categories of information can be processed separately when an online graph processing (OLGP) operation is performed on the stored information; for example, only the edge attribute information stored in the information dimension table is operated on, or only the node attribute information stored in the topology dimension table is operated on.

A person skilled in the art can understand that all or part of the steps of the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and the storage medium can be a read-only memory, a magnetic disk, a compact disc, or the like.

The method and apparatus for storing data provided by the present invention are described in detail above. A person skilled in the art may, following the idea of the embodiments of the present invention, make changes to the specific implementation manners and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method of storing data, characterized in that the method includes:

Get the original data set;

Extract information representing the structure of the information network graph from the original data set; wherein, the information representing the structure of the information network graph at least includes: node information, node attribute information, edge information, and edge attribute information; the node information at least includes: node identifier and node attribute key code;

The node attribute key code has a corresponding relationship with the node attribute information;

The edge information at least includes: edge identifier and edge attribute key code;

The edge attribute key code has a corresponding relationship with the edge attribute information;

The edges are used to describe the connections between nodes;

Store the extracted node information, node attribute information, edge information, and edge attribute information.

2. The method according to claim 1, characterized in that,

The node information also includes: node metric value;

The edge information also includes: edge metric value.

3. The method according to claim 1 or 2, characterized in that,

The extracted node information is stored in the node fact table;

The extracted edge information is stored in the edge fact table;

The extracted node attribute information is stored in the topology dimension table;

Since the edge is used to describe the connection between nodes, the information in the node fact table has a corresponding relationship with the information in the edge fact table;

Since the node attribute key code has a corresponding relationship with the node attribute information, the information in the topology dimension table has a corresponding relationship with the information in the node fact table;

Since the edge attribute key code has a corresponding relationship with the edge attribute information, the information in the information dimension table has a corresponding relationship with the information in the edge fact table.

4. The method according to claim 1, characterized in that, after storing the extracted node information, node attribute information, edge information, and edge attribute information, the method further includes:

For the data that needs to be queried, locate it in the stored node information, node attribute information, edge information, and edge attribute information; perform the query in the located one of the node information, node attribute information, edge information, or edge attribute information.

5. The method according to claim 1, characterized in that, after storing the extracted node information, node attribute information, edge information, and edge attribute information, the method further includes:

According to the extracted node information, node attribute information, edge information, and edge attribute information, an online graph processing operation is performed.

6. The method according to claim 5, characterized in that the online graph processing operation at least includes:

One of information dimension roll-up (I-OLGP), topology dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slicing, dicing, and data pivot.

7. The method according to claim 6, characterized in that if the extracted edge attribute information is stored in an information dimension table, then the information dimension roll-up specifically includes:

Perform a roll-up operation on the information of one attribute of the edge stored in the information dimension table, or the information of more than one attribute.

8. The method according to claim 6, characterized in that, if the extracted node attribute information is stored in a topology dimension table, then the topology dimension aggregation operation specifically includes:

Perform a roll-up operation on information about one attribute of nodes stored in the topological dimension table, or information about more than one attribute.

9. A device for storing data, characterized in that the device includes: an acquisition unit, an extraction unit, and a storage unit;

The acquisition unit is used to acquire the original data set;

The extraction unit is used to extract information representing the structure of the information network graph from the original data set; wherein the information representing the structure of the information network graph at least includes: node information, node attribute information, edge information, and edge attribute information; the node information at least includes: a node identifier and a node attribute key code; the node attribute key code has a corresponding relationship with the node attribute information; the edge information at least includes: an edge identifier and an edge attribute key code; the edge attribute key code has a corresponding relationship with the edge attribute information; the edge is used to describe the connection between nodes;

The storage unit is used to store the extracted node information, node attribute information, edge information, and edge attribute information.

10. The device according to claim 9, characterized in that,

The node information also includes: node metric value;

The edge information also includes: edge metric value.

11. The device according to claim 9 or 10, characterized in that,

The extracted node information is stored in the node fact table;

The extracted edge information is stored in the edge fact table;

Since the edge attribute key code has a corresponding relationship with the edge attribute information, the information in the information dimension table has a corresponding relationship with the information in the edge fact table.

12. The device according to claim 9, characterized in that the device further includes: a positioning unit, and a query unit;

The positioning unit is used to locate the data that needs to be queried in the stored node information, node attribute information, edge information, and edge attribute information;

The query unit is configured to query from one of the positioned node information, node attribute information, edge information, or edge attribute information.

13. The device according to claim 9, characterized in that the device further includes: a graph processing unit;

The graph processing unit is configured to perform online graph processing operations based on the extracted node information, node attribute information, edge information, and edge attribute information.

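14. The device according to claim 13, wherein the online graph processing operation in the graph processing unit at least includes:

One of information dimension roll-up (I-OLGP), topology dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slicing, dicing, and data pivot.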

15. The device according to claim 14, characterized in that, in the graph processing unit, if the extracted edge attribute information is stored in the information dimension table, then the information dimension roll-up specifically includes: performing a roll-up operation on information of one attribute of the edge, or information of more than one attribute, stored in the information dimension table.

16. The device according to claim 14, characterized in that, in the graph processing unit, if the extracted node attribute information is stored in a topological dimension table, then the topological dimension aggregation operation specifically includes: performing a roll-up operation on the information of one attribute of the node, or the information of more than one attribute.