WO2015058500A1 - Data storage method and device - Google Patents

Info

Publication number
WO2015058500A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
node
edge
attribute
attribute information
Application number
PCT/CN2014/075570
Inventor
刘志容
李川
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G06F16/9024 Graphs; Linked lists
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

Disclosed are a method and a device for storing data. The method comprises: acquiring an original data set; extracting, from the original data set, information representing an information network graph structure, the information at least comprising node information, node attribute information, edge information, and edge attribute information, wherein the node information at least comprises a node identifier and a node attribute key, the node attribute key corresponding to the node attribute information; the edge information at least comprises an edge identifier and an edge attribute key, the edge attribute key corresponding to the edge attribute information; and an edge describes the relationship between nodes; and storing the extracted node information, node attribute information, edge information, and edge attribute information. With the solution provided in the embodiments of the present invention, researchers can also examine the relationships between nodes.

Description

Method and device for storing data
This application claims priority to Chinese Patent Application No. 201310505069.5, filed with the Chinese Patent Office on October 23, 2013 and entitled "Method and device for storing data", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of data storage, and in particular to a method and a device for storing data.
Background
The concept of information networks is a general abstraction of massive, multidimensional, complexly structured data in the real world. Information networks are of significant value in fields such as community network analysis, collaborator network analysis, transportation network capacity calculation, protein network component analysis, and criminal network analysis.
In an information network environment, the subject information that users care about has evolved from simple numerical measures (such as total sales or profit) into complex networks. An example is a sales network in which each node (vertex) represents a product and each connection between nodes (an edge) represents a co-sales relationship between different kinds of products; see the sales network shown in FIG. 1.
The classic online analytical processing (OLAP) data warehouse model is the multidimensional data model. A multidimensional data model is a multidimensional space in which a "dimension" is an angle from which people observe data and can represent a particular attribute of a thing. For example, analyzing product sales data involves a time dimension, a product dimension, a region dimension, and so on. There is currently no unified multidimensional data model. There are three classic OLAP data warehouse schemas: the star schema, the snowflake schema, and the constellation schema.
The star schema is the basic structure of the multidimensional data model and consists of a central fact table and dimension tables. The central fact table is the core table of the star schema and stores the measures of the facts together with the keys of the dimension tables. A dimension table holds the information of one dimension, that is, each dimension member, including the attribute information of the dimension. The central fact table is joined to each dimension table through the stored key values of that dimension table. The snowflake schema is a variant of the star schema in which some dimension tables are further normalized. The constellation schema can be regarded as a collection of star schemas: it allows multiple fact tables to share certain dimension tables and thereby supports multi-subject modeling.
As shown in FIG. 2, the star schema organizes classic product sales data well. Sales data can be considered along four dimensions: time (Time), item (Item), branch (Branch), and location (Location). The schema contains one central fact table (Sales), which holds the keys of the four dimensions (Time_key, Branch_key, Item_key, and Location_key in FIG. 2) and two measures (Dollars_sold and Unit_sold in FIG. 2).
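For illustration only, the star schema of FIG. 2 might be sketched in Python with the standard sqlite3 module as follows. The fact-table column names follow the figure; the dimension-table names, their attribute columns (year, item_name, branch_name, city), and the sample rows are assumptions introduced for this sketch and do not appear in the application.

```python
import sqlite3


def build_sales_star(conn):
    """Create four dimension tables and the central Sales fact table."""
    conn.executescript("""
        CREATE TABLE time_dim     (Time_key     INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE item_dim     (Item_key     INTEGER PRIMARY KEY, item_name TEXT);
        CREATE TABLE branch_dim   (Branch_key   INTEGER PRIMARY KEY, branch_name TEXT);
        CREATE TABLE location_dim (Location_key INTEGER PRIMARY KEY, city TEXT);
        -- The fact table stores one key per dimension plus the two measures.
        CREATE TABLE sales (
            Time_key     INTEGER REFERENCES time_dim(Time_key),
            Item_key     INTEGER REFERENCES item_dim(Item_key),
            Branch_key   INTEGER REFERENCES branch_dim(Branch_key),
            Location_key INTEGER REFERENCES location_dim(Location_key),
            Dollars_sold REAL,
            Unit_sold    INTEGER
        );
    """)


def load_sample(conn):
    """Insert a few hypothetical rows so the schema can be queried."""
    conn.executemany("INSERT INTO time_dim VALUES (?, ?)", [(1, 2013), (2, 2014)])
    conn.execute("INSERT INTO item_dim VALUES (1, 'pen')")
    conn.execute("INSERT INTO branch_dim VALUES (1, 'B1')")
    conn.execute("INSERT INTO location_dim VALUES (1, 'L1')")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
        [(1, 1, 1, 1, 10.0, 2), (1, 1, 1, 1, 20.0, 3), (2, 1, 1, 1, 5.0, 1)],
    )


def sales_by_year(conn):
    """Join the fact table to the time dimension and total both measures per year."""
    return conn.execute("""
        SELECT t.year, SUM(s.Dollars_sold), SUM(s.Unit_sold)
        FROM sales AS s
        JOIN time_dim AS t ON s.Time_key = t.Time_key
        GROUP BY t.year
        ORDER BY t.year
    """).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_sales_star(connection)
    load_sample(connection)
    print(sales_by_year(connection))
```

The sketch shows why this schema suits numerical measures: every question is an aggregate over the fact table, and nothing in the tables records how one item relates to another.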
The star schema and the snowflake schema are suitable only for modeling a single subject and cannot model multiple subjects. The constellation schema allows multiple fact tables to share certain dimension tables and thereby supports multi-subject modeling. However, subject data in an information network has evolved into a complex graph, for which information-dimension and topological-dimension information must be stored together, so the constellation schema is not suitable for modeling online graph processing either.
In traditional OLAP, researchers focus on numerical measures, such as the sales volume and sales revenue of goods in a store. The multidimensional data model was proposed for traditional OLAP and is not suited to organizing graph-structured data in an information network. Researchers now pay more attention to the co-sales relationships between products, which raises the problem of modeling the connections between objects. More and more data now appears in the form of network graphs, such as social networks, collaborator networks, and protein networks, and in these networks researchers are more interested in the connections between entities. The traditional multidimensional data model cannot adequately store and represent the relationships in network graph data, and therefore cannot adequately capture the connections between entities.
Summary
Embodiments of the present invention provide a method and a device for storing data, which overcome the problem that the traditional multidimensional data model cannot adequately store and represent the relationships in network graph data.
A first aspect of the embodiments of the present invention provides a method for storing data, the method comprising: acquiring an original data set;
extracting, from the original data set, information representing an information network graph structure, wherein the information representing the information network graph structure at least comprises node information, node attribute information, edge information, and edge attribute information; the node information at least comprises a node identifier and a node attribute key;
the node attribute key corresponds to the node attribute information;
the edge information at least comprises an edge identifier and an edge attribute key;
the edge attribute key corresponds to the edge attribute information;
an edge describes the relationship between nodes; and
storing the extracted node information, node attribute information, edge information, and edge attribute information.
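As a minimal sketch of the extraction step of the first aspect, the four kinds of information could be held in plain Python structures as shown below. The raw record layout (src, dst, src_attrs, dst_attrs, edge_attrs), the class names, and the source and target fields on an edge are assumptions made for this example; the application only requires an identifier and an attribute key for each node and edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    node_id: str
    node_attr_key: int  # key into the node attribute records


@dataclass(frozen=True)
class EdgeInfo:
    edge_id: int
    source: str         # assumed: endpoints are kept with the edge
    target: str
    edge_attr_key: int  # key into the edge attribute records


def extract_graph(raw_records):
    """Split hypothetical raw records into the four kinds of information."""
    nodes, node_attrs, edges, edge_attrs = {}, {}, [], {}
    for record in raw_records:
        endpoints = (
            (record["src"], record["src_attrs"]),
            (record["dst"], record["dst_attrs"]),
        )
        for name, attrs in endpoints:
            if name not in nodes:  # the first sighting defines the node
                key = len(node_attrs) + 1
                node_attrs[key] = dict(attrs)
                nodes[name] = NodeInfo(name, key)
        key = len(edge_attrs) + 1
        edge_attrs[key] = dict(record["edge_attrs"])
        edges.append(EdgeInfo(len(edges) + 1, record["src"], record["dst"], key))
    return nodes, node_attrs, edges, edge_attrs
```

Keeping the attributes in separate records and referring to them by key is what lets the same attribute record be looked up from either a node or an edge without duplicating it.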
In a first possible implementation of the first aspect, the node information further comprises a node measure;
and the edge information further comprises an edge measure.
With reference to the first aspect, in a second possible implementation of the first aspect,
the extracted node information is stored in a node fact table;
the extracted edge information is stored in an edge fact table;
the extracted node attribute information is stored in a topological dimension table;
the extracted edge attribute information is stored in an information dimension table;
because an edge describes the relationship between nodes, the information in the node fact table corresponds to the information in the edge fact table;
because the node attribute key corresponds to the node attribute information, the information in the topological dimension table corresponds to the information in the node fact table; and
because the edge attribute key corresponds to the edge attribute information, the information in the information dimension table corresponds to the information in the edge fact table.
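One way to picture the four tables and their correspondences is the relational sketch below, written with Python's sqlite3 module. All table and column names (node_fact, edge_fact, topology_dim, info_dim, label, relation, source_id, target_id) and the sample rows are assumptions for illustration; the application does not fix a column layout at this point.

```python
import sqlite3

SCHEMA = """
CREATE TABLE topology_dim (          -- node attribute information
    node_attr_key INTEGER PRIMARY KEY,
    label         TEXT
);
CREATE TABLE info_dim (              -- edge attribute information
    edge_attr_key INTEGER PRIMARY KEY,
    relation      TEXT
);
CREATE TABLE node_fact (             -- node information
    node_id       INTEGER PRIMARY KEY,
    node_attr_key INTEGER NOT NULL REFERENCES topology_dim(node_attr_key)
);
CREATE TABLE edge_fact (             -- edge information
    edge_id       INTEGER PRIMARY KEY,
    source_id     INTEGER NOT NULL REFERENCES node_fact(node_id),
    target_id     INTEGER NOT NULL REFERENCES node_fact(node_id),
    edge_attr_key INTEGER NOT NULL REFERENCES info_dim(edge_attr_key)
);
"""


def create_store():
    """Open an in-memory database holding the four tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def load_sample(conn):
    """Insert two hypothetical nodes joined by one hypothetical edge."""
    conn.executemany("INSERT INTO topology_dim VALUES (?, ?)", [(10, "Alice"), (11, "Bob")])
    conn.executemany("INSERT INTO node_fact VALUES (?, ?)", [(1, 10), (2, 11)])
    conn.execute("INSERT INTO info_dim VALUES (100, 'co-author')")
    conn.execute("INSERT INTO edge_fact VALUES (1000, 1, 2, 100)")


def describe_edges(conn):
    """Follow every correspondence: edge to nodes, node to attributes, edge to attributes."""
    return conn.execute("""
        SELECT e.edge_id, sa.label, ta.label, i.relation
        FROM edge_fact AS e
        JOIN node_fact    AS s  ON e.source_id = s.node_id
        JOIN topology_dim AS sa ON s.node_attr_key = sa.node_attr_key
        JOIN node_fact    AS t  ON e.target_id = t.node_id
        JOIN topology_dim AS ta ON t.node_attr_key = ta.node_attr_key
        JOIN info_dim     AS i  ON e.edge_attr_key = i.edge_attr_key
        ORDER BY e.edge_id
    """).fetchall()
```

Each join in describe_edges follows one of the correspondences listed above, so a single query can return an edge together with the attributes of both of its endpoints.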
In a third possible implementation of the first aspect, after the storing of the extracted node information, node attribute information, edge information, and edge attribute information, the method further comprises: locating the data to be queried within the stored node information, node attribute information, edge information, and edge attribute information;
and performing the query on whichever one of the node information, node attribute information, edge information, or edge attribute information was located.
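The locate-then-query idea can be sketched as follows, under the assumption that each of the four kinds of stored information is a list of records and that locating means finding the part whose records carry the requested field. The function names, the part names, and the field-based locating rule are hypothetical and are not specified in the application.

```python
def locate(store, field):
    """Return the name of the first stored part whose records contain `field`."""
    for part, rows in store.items():
        if any(field in row for row in rows):
            return part
    raise KeyError(field)


def query(store, field, value):
    """Locate the part that holds `field`, then search only that part."""
    part = locate(store, field)
    return part, [row for row in store[part] if row.get(field) == value]
```

Because a key such as node_attr_key appears in two parts, this sketch simply returns the first part that carries it; a real implementation would need an explicit rule for that case.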
In a fourth possible implementation of the first aspect, after the storing of the extracted node information, node attribute information, edge information, and edge attribute information, the method further comprises: performing an online graph processing operation according to the extracted node information, node attribute information, edge information, and edge attribute information.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the online graph processing operation comprises at least
one of: information-dimension roll-up (I-OLGP), topological-dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slice, dice, and pivot.
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, if the extracted edge attribute information is stored in an information dimension table, the information-dimension roll-up specifically comprises:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the edges stored in the information dimension table.
With reference to the fifth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, if the extracted node attribute information is stored in a topological dimension table, the topological-dimension roll-up (aggregation) operation specifically comprises:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the nodes stored in the topological dimension table.
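The application does not define the roll-up semantics at this point, so the following is only one plausible reading, sketched under stated assumptions: an information-dimension roll-up counts edges per value of one edge attribute, and a topological-dimension roll-up merges nodes that share a node attribute value and counts the edges between the merged groups. The function names and the count-based aggregation are hypothetical.

```python
from collections import Counter


def rollup_edges(edge_attr_rows, attribute):
    """Assumed information-dimension roll-up: count edges per value of one edge attribute."""
    return dict(Counter(row[attribute] for row in edge_attr_rows))


def rollup_nodes(node_attrs, edges, attribute):
    """Assumed topological roll-up: merge nodes sharing an attribute value,
    then count the edges between (or within) the merged groups."""
    summary = Counter()
    for source, target in edges:
        groups = sorted((node_attrs[source][attribute], node_attrs[target][attribute]))
        summary[tuple(groups)] += 1  # undirected: ("X", "Y") equals ("Y", "X")
    return dict(summary)
```

In a collaborator network, for instance, rolling nodes up by an institution attribute would turn author-to-author edges into institution-to-institution counts, while rolling edges up by year would summarize collaboration volume over time.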
A second aspect of the embodiments of the present invention provides a device for storing data, the device comprising an acquiring unit, an extracting unit, and a storage unit;
the acquiring unit is configured to acquire an original data set;
the extracting unit is configured to extract, from the original data set, information representing an information network graph structure, wherein the information representing the information network graph structure at least comprises node information, node attribute information, edge information, and edge attribute information; the node information at least comprises a node identifier and a node attribute key; the node attribute key corresponds to the node attribute information; the edge information at least comprises an edge identifier and an edge attribute key; the edge attribute key corresponds to the edge attribute information; and an edge describes the relationship between nodes;
the storage unit is configured to store the extracted node information, node attribute information, edge information, and edge attribute information.
In a first implementation of the second aspect, the node information further comprises a node measure;
and the edge information further comprises an edge measure.
With reference to the first implementation of the second aspect, in a second implementation of the second aspect, the extracted node information is stored in a node fact table;
the extracted edge information is stored in an edge fact table;
the extracted node attribute information is stored in a topological dimension table;
the extracted edge attribute information is stored in an information dimension table;
because an edge describes the relationship between nodes, the information in the node fact table corresponds to the information in the edge fact table;
because the node attribute key corresponds to the node attribute information, the information in the topological dimension table corresponds to the information in the node fact table; and
because the edge attribute key corresponds to the edge attribute information, the information in the information dimension table corresponds to the information in the edge fact table.
In a third implementation of the second aspect, the device further comprises a locating unit and a query unit;
the locating unit is configured to locate the data to be queried within the stored node information, node attribute information, edge information, and edge attribute information;
the query unit is configured to perform the query on whichever one of the node information, node attribute information, edge information, or edge attribute information was located.
In a fourth implementation of the second aspect, the device further comprises a graph processing unit;
the graph processing unit is configured to perform an online graph processing operation according to the extracted node information, node attribute information, edge information, and edge attribute information.
With reference to the fourth implementation of the second aspect, in a fifth implementation of the second aspect, the online graph processing operation in the graph processing unit comprises at least
one of: information-dimension roll-up (I-OLGP), topological-dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slice, dice, and pivot.
With reference to the fifth implementation of the second aspect, in a sixth implementation of the second aspect, if the extracted edge attribute information is stored in an information dimension table, the information-dimension roll-up in the graph processing unit specifically comprises:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the edges stored in the information dimension table.
With reference to the fifth implementation of the second aspect, in a seventh implementation of the second aspect, if the extracted node attribute information is stored in a topological dimension table, the topological-dimension roll-up (aggregation) operation in the graph processing unit specifically comprises:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the nodes stored in the topological dimension table.
The embodiments of the present invention described above provide a method for storing data. The method acquires an original data set and extracts from it node information, node attribute information, edge information, and edge attribute information. The node information at least comprises a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information at least comprises an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. An edge describes the relationship between nodes. The extracted node information, node attribute information, edge information, and edge attribute information are then stored. Because the extracted pieces of information are related to one another, the required data can be located quickly and accurately in subsequent operations on the data. Moreover, compared with the existing OLAP multidimensional data warehouse model, the information stored by the solution provided in the embodiments includes not only node information and node attribute information, as in the prior art, so that researchers can focus on node-centered facts, but also edge information and edge attribute information, which the prior art cannot capture, so that researchers can also focus on the relationships between nodes.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a sales network in the prior art;
FIG. 2 shows a star-schema multidimensional data model in the prior art;
FIG. 3 is a schematic diagram of an information network according to an embodiment of the present invention;
FIG. 4 shows a method for storing data according to Embodiment 1 of the present invention;
FIG. 5 is a schematic diagram of the relationships among the node attribute information, edge information, and edge attribute information according to an embodiment of the present invention (also referred to as a multidimensional information network data warehouse model);
FIG. 6 is a schematic diagram of a research collaborator network;
FIG. 7 shows a method for storing data according to Embodiment 2 of the present invention;
FIG. 8 shows a multidimensional information network data warehouse model;
FIG. 9 shows the conversion of an edge fact table into an edge fact relational table;
FIG. 10 shows the conversion of a node fact table into a node fact relational table;
FIG. 11 shows the conversion of an information dimension into a relational information dimension table;
FIG. 12 shows the conversion of a topological dimension into a relational topological dimension table;
FIG. 13 shows a keyword-collaborator multidimensional information network data warehouse model;
FIG. 14 shows a film actor collaboration network;
FIG. 15 shows a method for storing data according to Embodiment 2 of the present invention;
FIG. 16 shows a film actor collaboration multidimensional information network data warehouse model;
FIG. 17 shows a data storage device according to Embodiment 4 of the present invention;
FIG. 18 shows a data storage device according to Embodiment 5 of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In an information network, the center of the user's attention rises from a numerical measure to a graph or network, whose structure consists of nodes and edges. Nodes and edges each have associated attributes, namely node attributes and edge attributes. The attributes associated with edges may be called the information dimension, and the attributes associated with nodes may be called the topological dimension. An edge represents the connection between two nodes. For example, in the information network shown in FIG. 3, circles represent nodes, and each edge and each node has its own attributes.
In an information network, researchers pay more attention to the connections between objects; an object here can be understood as a node, so these are the connections between nodes. Most researchers work on tasks such as link prediction in graph-structured social networks, discovery of transportation hub nodes, evolution of community trends, and protein structure analysis, all of which are carried out on graph-structured data. However, the prior art lacks a general and efficient underlying data organization model for storing such data in a way that facilitates its analysis.
Therefore, the embodiments of the present invention provide a general storage scheme for graph data in an information network, namely a method, device, and system for storing data, which organizes graph-structured data so as to facilitate upper-layer algorithm research and the analysis and use of the data. The scheme addresses the modeling of relationships between objects in a graph structure, simplifies complex information storage formats, and eliminates redundancy. It stores the relationships in a relational database so that users can run efficient structured queries. A relational database is a database built on the relational model, which processes the data in the database using mathematical concepts and methods such as set algebra.
The solution is described in detail below with reference to specific embodiments.
实施例一  Embodiment 1
本发明实施例提供了一种存储数据的方法, 如图 4所示, 该方法包括: 步骤 101, 获取原始数据集;  An embodiment of the present invention provides a method for storing data. As shown in FIG. 4, the method includes: Step 101: Acquire an original data set;
其中,原始数据集可以理解为用户收集的所有数据的集合, 这些数据是杂 乱, 不利于分析的。 步骤 101中获取的原始数据集可以是输入到该执行设备中 的非结构化文本的原始数据。  Among them, the original data set can be understood as a collection of all the data collected by the user, which is messy and unfavorable for analysis. The raw data set obtained in step 101 may be raw data of unstructured text input into the execution device.
步骤 102, 从原始数据集中提取表示信息网络图结构的信息; 其中, 表示 信息网络图结构的信息至少包括: 节点信息, 节点属性信息, 边信息, 和边属 性信息; 节点信息至少包括: 节点标识和节点属性关键码; 所述节点属性关键 码与所述节点属性信息具有对应关系; 边信息至少包括: 边标识和边属性关键 码; 所述边属性关键码与所述边属性信息具有对应关系; 所述边用于描述节点 与节点之间的联系;  Step 102: Extract information representing a structure of the information network graph from the original data set. The information indicating the structure of the information network graph includes at least: node information, node attribute information, side information, and edge attribute information. The node information includes at least: a node identifier. And a node attribute key; the node attribute key has a corresponding relationship with the node attribute information; the side information at least includes: an edge identifier and an edge attribute key; the edge attribute key has a corresponding relationship with the edge attribute information ; the edge is used to describe the relationship between the node and the node;
由于节点与节点属性的关系, 边与边属性的关系, 所述边用于描述节点与 节点之间的联系,可以容易用图结构体现上述提取的节点信息,节点属性信息, 边信息, 和边属性信息之间的联系 (参见后续说明的图 5、 图 8、 图 16 )。  Due to the relationship between the node and the node attribute, the relationship between the edge and the edge attribute, the edge is used to describe the relationship between the node and the node, and the extracted node information, node attribute information, side information, and edge can be easily represented by the graph structure. The relationship between the attribute information (see Figure 5, Figure 8, Figure 16 of the subsequent description).
As shown in FIG. 3, the information representing the information network graph structure may include node information (for example, a node identifier (VertexID)), node attribute information (for example, Attribute1 and Attribute2), edge information (for example, an edge identifier (EdgeID)), edge attributes (for example, Attribute1 and Attribute2), and so on. The number of node attributes, the number of edge attributes, and the numbers of nodes and edges vary with the specific information network, as does the structure. The simple information network graph structure shown in FIG. 3 is only an example given for ease of understanding and does not limit the embodiments of the present invention.
Typically, the original data set acquired by the device is disorganized and inconvenient to analyze and use. Therefore, after acquiring the original data set, the device extracts from it, in accordance with the form of the information network graph structure, the information representing that structure, namely the node information, node attribute information, edge information, and edge attribute information.
It should be understood that the extracted node information, node attribute information, edge information, and edge attribute information are related to one another. As shown in FIG. 5, the information extracted in step 102 may be represented in the form of tables. For example, the extracted node information is stored in a node fact table (VFT), the extracted edge information is stored in an edge fact table (EFT), the extracted node attribute information is stored in a topology dimension table (TDT), and the extracted edge attribute information is stored in an information dimension table (IDT). Because of the relationship between nodes and node attributes and between edges and edge attributes, the tables are associated with one another; in FIG. 5 these associations appear as the lines connecting the tables.
As shown in FIG. 5, when the information of one node is extracted, it includes a node identifier (that is, a node ID), a node attribute key, and/or a node measure. The meaning of a node depends on how the information network is defined: in a co-author multidimensional information network a node may represent an author, and in an actor-collaboration multidimensional information network a node may represent an actor. The node measure included in the node information expresses node-related information as a numeric value; for example, in a co-author network the node measure may be the number of articles the author has published. The node measure is a preferred option and is not required by this solution.
The node attribute key corresponds to the node attribute information; that is, the node attribute key included in the node information is the link between the node information and the node attribute information. The detailed information corresponding to a node attribute key may be stored in a topology dimension table. For example, when a node is an actor, the node attribute key may be the film company to which the actor belongs, and the specific information corresponding to that key is the node attribute information, that is, the individual film companies (for example, Huayi Brothers Film Production Company, Tianyu Film Company, and so on).
The edge information includes at least an edge identifier and an edge attribute key, and may also include an edge measure. As shown in FIG. 5, because an edge is the line connecting two nodes, the edge identifier (EdgeID) can be expressed by the identifiers of the two nodes; in FIG. 5, node 1 and node 2 together denote the edge. There may be multiple edge attribute keys, each representing one class of attribute. For example, if the nodes are authors in a co-author information network, an edge represents the collaboration between two authors, and the edge attribute keys may be the co-authored articles, and/or the time of the collaboration, and/or the place of the collaboration. The edge measure included in the edge information expresses edge-related information as a numeric value; for example, in a co-author network the edge measure may be the number of times the two authors have collaborated (for example, Co_Frequence).
The edge attribute key corresponds to the edge attribute information; that is, the edge attribute key included in the edge information is the link between the edge information and the edge attribute information. The edge attribute information may be stored in an information dimension table.
If the edge attribute key is the co-authored articles, the specific information in the edge attribute information (for example, in an information dimension table) may be the titles of all articles co-authored by the collaborators, for example "Rainwater" and "Snowflake". If the edge attribute key is the place of collaboration, the specific information in the edge attribute information (for example, in an information dimension table) may be all the places where the collaborators worked together, for example Beijing and Shanghai.
Step 103: Store the extracted node information, node attribute information, edge information, and edge attribute information.
The extracted node information, node attribute information, edge information, and edge attribute information may be stored in the form of tables, that is, stored correspondingly in a node fact table, a topology dimension table, an edge fact table, and an information dimension table. Storage in table form is one implementation and does not limit the embodiments of the present invention; other storage forms may also be used.
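The table-based storage of step 103 can be sketched as follows. This is only a minimal illustration, assuming that in-memory Python dictionaries stand in for the four tables; the class names, field names, and sample values (`VertexFact`, `EdgeFact`, `GraphStore`, "Institution A") are hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class VertexFact:
    """One row of the node fact table (VFT)."""
    vertex_id: int
    attr_key: int                  # node attribute key, points into the TDT
    measure: Optional[int] = None  # optional node measure


@dataclass
class EdgeFact:
    """One row of the edge fact table (EFT)."""
    src_id: int
    dst_id: int
    attr_keys: Dict[str, int]      # edge attribute keys, each points into an IDT
    measure: Optional[int] = None  # optional edge measure


@dataclass
class GraphStore:
    vft: Dict[int, VertexFact] = field(default_factory=dict)
    eft: Dict[Tuple[int, int], EdgeFact] = field(default_factory=dict)
    tdt: Dict[int, str] = field(default_factory=dict)             # key -> node attribute information
    idt: Dict[str, Dict[int, str]] = field(default_factory=dict)  # dimension -> key -> edge attribute information

    def add_vertex(self, v: VertexFact) -> None:
        self.vft[v.vertex_id] = v

    def add_edge(self, e: EdgeFact) -> None:
        # The edge identifier is the pair of node identifiers it connects.
        self.eft[(e.src_id, e.dst_id)] = e

    def vertex_attribute(self, vertex_id: int) -> str:
        # Follow the node attribute key from the VFT into the TDT.
        return self.tdt[self.vft[vertex_id].attr_key]

    def edge_attribute(self, src_id: int, dst_id: int, dimension: str) -> str:
        # Follow the edge attribute key from the EFT into the chosen IDT.
        key = self.eft[(src_id, dst_id)].attr_keys[dimension]
        return self.idt[dimension][key]


store = GraphStore()
store.tdt[1] = "Institution A"                       # placeholder attribute value
store.idt["venue"] = {1: "Beijing", 2: "Shanghai"}
store.add_vertex(VertexFact(vertex_id=0, attr_key=1))
store.add_vertex(VertexFact(vertex_id=1, attr_key=1))
store.add_edge(EdgeFact(src_id=0, dst_id=1, attr_keys={"venue": 2}, measure=1))
```

In this sketch the fact tables hold only identifiers, keys, and measures, while the descriptive values live in the dimension tables and are reached through the keys.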
In the method for storing data provided by Embodiment 1 of the present invention, an original data set is acquired, and node information, node attribute information, edge information, and edge attribute information are extracted from it. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the line connecting two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, and they are then stored. Because the extracted information is interrelated, the required data can be located quickly and accurately in subsequent operations on the data. In addition, compared with the existing OLAP multidimensional data warehouse model, the information stored by this solution includes not only node information and node attribute information, as in the prior art, so that researchers can focus on node-centered facts, but also edge information and edge attribute information, which the prior art cannot capture, so that researchers can also focus on the relationships between nodes.
Further, the method for storing data provided by Embodiment 1 of the present invention resolves the redundancy that exists in the original data set under the existing OLAP multidimensional data warehouse model. The solution offers flexible and efficient querying and flexible topic extraction.
Furthermore, the method for storing data provided by Embodiment 1 of the present invention better matches the modeling requirements of real-world social networks and facilitates the design of efficient OLGP algorithms. The model is also easy to convert into conventional relational tables, which helps people understand real-world information.
Moreover, in the solution provided by this embodiment, the relationship between nodes and edges is established on the basis that an edge is the line connecting two nodes, which directly links the node information and node attribute information with the edge information and edge attribute information. By recognizing this important relationship between edges and nodes, the solution makes it possible to focus on the relationships between nodes while requiring only minor changes to the prior art.
Embodiment 2
This embodiment of the present invention provides a method for storing data that is similar to the method of Embodiment 1. The difference is that the method of this embodiment is an example of a storage method applied specifically to a scientific research co-author information network.
A scientific research co-author network records the researchers in a field and the papers they have published, and is a typical example of an information network. As shown in FIG. 6, each node represents an author, and an edge exists between two nodes if the two authors have co-authored an article. The attributes of an edge record the number of articles the two co-authors published at a specific time and a specific conference. The following uses the co-author network in the Association for Computing Machinery (ACM) data set as an example to describe and illustrate in detail the implementation flow of the multidimensional information network data warehouse model.
As shown in FIG. 7, the method includes:
Step 201: Acquire an original data set.
At present, most researchers conduct their research and analysis on unprocessed, disorganized data sets. For the classic co-author network, data sets exist in many versions. Typical examples are the XML-based Digital Bibliography & Library Project (DBLP) data set and the ACM data set. In the original ACM data set, the XML version of the data is organized as follows:
<author> ... </author>
<Institute> ... </institute>
<author> ... </author>
<Institute> ... </institute>
<author> ... </author>
<Institute> ... </institute>
<title>...</title>
<year>...</year>
<journal> ... </journal>
The original data set may be stored as unstructured text, which hinders efficient query and analysis by users. This solution extracts information from the acquired ACM data set and stores it by category, so that query and analysis operations can be performed efficiently.
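One way such a record could be turned into nodes and edges is sketched below. The sketch makes several assumptions that are not stated in the source: each publication is wrapped in a hypothetical `<article>` element, the institute tag is written consistently in lowercase, every author is followed by exactly one institute, and all sample values are invented placeholders.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical record: the <article> wrapper and all values are assumptions.
RECORD = """
<article>
  <author>A. Author</author><institute>Inst X</institute>
  <author>B. Author</author><institute>Inst Y</institute>
  <author>C. Author</author><institute>Inst X</institute>
  <title>Sample Title</title>
  <year>2010</year>
  <journal>Sample Journal</journal>
</article>
"""


def extract(record_xml):
    """Return (nodes, edges) for one publication record.

    nodes: list of (author, institute) pairs
    edges: list of (author1, author2, attributes), one per co-author pair
    """
    root = ET.fromstring(record_xml)
    authors = [e.text.strip() for e in root.findall("author")]
    institutes = [e.text.strip() for e in root.findall("institute")]
    # Pair the i-th author with the i-th institute (assumed one-to-one).
    nodes = list(zip(authors, institutes))
    attrs = {
        "title": root.findtext("title"),
        "year": root.findtext("year"),
        "journal": root.findtext("journal"),
    }
    # Every pair of co-authors of the same paper yields one edge.
    edges = [(a, b, attrs) for a, b in combinations(authors, 2)]
    return nodes, edges


nodes, edges = extract(RECORD)
```

Here the authors and institutes become candidate node and node attribute information, while the title, year, and journal become candidate edge attribute information shared by every co-author pair of the paper.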
Step 202: Extract, from the original data set, information representing an information network graph structure. The information representing the information network graph structure includes at least node information, node attribute information, edge information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. An edge describes the relationship between two nodes.
Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the line connecting two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are related to one another.
In the co-author network, the extracted node information may be stored in a node fact table (VFT, Vertex Fact Table). The node information may include a node ID and a node attribute key, and may also include a node measure. In the co-author network, a node represents an author.
The extracted edge information may be stored in an edge fact table (EFT, Edge Fact Table). The stored edge information may include the IDs of the two author nodes, id1 and id2 (which together serve as the edge identifier), and the edge attribute keys (for example, a paper key (Paper_key), a time key (Time_key), and a venue key (Venue_key)); the edge information may also include an edge measure.
The node attribute information is the specific information corresponding to a node attribute key and may be stored in a topology dimension table (TDT, Topology Dimension Table); there may be one or more topology dimension tables. For example, if the node attribute key in the node information is an institution key (Institution_ID), the topology dimension table may store the names of the institutions at which all authors (that is, nodes) have worked.
The edge attribute information is the specific information corresponding to an edge attribute key and may be stored in an information dimension table (IDT, Information Dimension Table). For example, the edge attribute information corresponding to the paper key (Paper_key), time key (Time_key), and venue key (Venue_key) above may be stored in a Paper dimension table, a Time dimension table, and a Venue dimension table, respectively. The information dimension tables make it possible to record a paper's conference, publication time, paper ID, paper title, and so on; for example, the Paper dimension table may contain Paper_key and Paper_name. In the method provided by this embodiment, in accordance with the multidimensional information network data warehouse model shown in FIG. 8, the extracted information representing the information network graph structure includes at least node information, node attribute information, edge information, and edge attribute information. The extracted information is then stored, for example in the form of tables.
Step 203: Store the extracted node information, node attribute information, edge information, and edge attribute information, specifically by using a node fact table, topology dimension tables, an edge fact table, and information dimension tables.
To make the contents of the edge fact table, node fact table, information dimension tables, and topology dimension tables clearer, they are described in detail below with reference to the drawings.
1. Edge fact table
The edge fact table (EFT) of the co-author network consists of the IDs of the two author nodes (Author1_id, Author2_id), the keys of the edge attributes (Paper_key, Time_key, Venue_key; the specific edge attribute information is stored in the information dimension tables), and a measure (for example, the number of collaborations, Co_Frequence). Author1_id and Author2_id together form the primary key of the co-author network's edge fact relational table; this primary key locates one edge (that is, it serves as the edge identifier). The edge fact table is joined to the individual information dimension tables through Paper_key, Time_key, and Venue_key. One edge corresponds to one edge fact table. The specific information carried in the edge fact table can be expressed as an edge fact relational table. FIG. 9 shows the conversion of the edge fact table into the edge fact relational table: the table on the left of FIG. 9 shows only the header of the edge fact table, that is, the important edge-related fields such as the edge identifier and the edge attribute keys, while the table on the right gives the specific values of the edge identifier and the edge attribute keys.
For example, in the first row of the right-hand table in FIG. 9, Author1_id is 0 and Author2_id is 1.
Paper_key is 1, which means that the details of the paper co-authored by authors 0 and 1 are found in the entry with key 1 of the information dimension table for papers.
Time_key is 1, which means that the details of the time at which authors 0 and 1 collaborated are found in the entry with key 1 of the information dimension table for time.
Venue_key is 1, which means that the details of the place where authors 0 and 1 collaborated are found in the entry with key 1 of the information dimension table for venues.
Paper_key, Time_key, and Venue_key are the edge attribute keys included in the edge information; each value of such a key corresponds to an entry in the corresponding information dimension table.
Co_frequence is 1. This is the edge measure included in the edge information, and its value is normally a number; here it means that authors 0 and 1 have collaborated once.
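The edge measure described above can be illustrated with a short sketch of how a collaboration count might be derived. It assumes, hypothetically, that each paper is available as a list of author IDs and that a co-author edge is undirected, so each pair is stored in sorted order; the `papers` sample and the `co_frequency` helper are illustrative and are not part of the embodiment.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each paper is given as the list of its author IDs.
papers = [
    [0, 1],
    [1, 0, 2],
    [2, 3],
]


def co_frequency(papers):
    """Count, for every unordered author pair, how many papers they share."""
    counts = Counter()
    for author_ids in papers:
        # sorted() normalizes the pair so that (1, 0) and (0, 1) are the same edge.
        for pair in combinations(sorted(set(author_ids)), 2):
            counts[pair] += 1
    return counts


freq = co_frequency(papers)
```

Each key of `freq` plays the role of an edge identifier formed from two node IDs, and each value plays the role of the edge measure.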
2. Node fact table
The node fact table (VFT) of the co-author network consists of node information (specifically a node ID, or an author ID) and the keys of the node attributes, and may also include a node measure.
The node information includes a node ID and/or an author ID; that is, a node may be identified by a node ID alone, by a node ID together with an author ID, or by an author ID alone. The author ID (Author_id) uniquely identifies a node and serves as the primary key of the node fact relational table.
A node attribute key may be the primary key of a topology dimension table. This primary key can be understood as identifying the subject of the information recorded in the topology dimension table; for example, the primary key (Institution_id) of the institution topology dimension table records information such as the identifier of the institution. There may be multiple topology dimension tables, each reflecting one attribute of the nodes.
The node measure may be the number of articles published by the author that the node represents (Paper_Num); other node measures may also be used.
There is usually one node fact table. The node fact table is linked to the topology dimension table through the primary key of the institution topology dimension table (Institution_id). The specific information carried in the node fact table can be expressed as a node fact relational table. FIG. 10 shows the conversion of the node fact table into the node fact relational table: the table on the left of FIG. 10 shows only the header of the node fact table, that is, the important node-related fields such as the author identifier, the author name, the author's institution, and the number of papers the author has published, while the table on the right gives the specific values of the node identifier and the node attribute keys.
For example, in the first row of the right-hand table in FIG. 10, the author identifier is 0, the author name is Janwei Han, the code of the author's institution is 1, and the number of papers published by the author is 15.
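The link between the node fact table and the institution topology dimension table can be sketched with an in-memory SQLite database. The table name `Institution`, the column `Institution_name`, and the value 'Institution A' are assumptions made for illustration; only the fact row (0, Janwei Han, 1, 15) is taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Institution (            -- topology dimension table (TDT)
    Institution_id   INTEGER PRIMARY KEY,
    Institution_name TEXT
);
CREATE TABLE VFT (                    -- node fact table
    Author_id      INTEGER PRIMARY KEY,
    Author_name    TEXT,
    Institution_id INTEGER REFERENCES Institution(Institution_id),
    Paper_Num      INTEGER            -- node measure
);
""")
conn.execute("INSERT INTO Institution VALUES (1, 'Institution A')")  # placeholder name
conn.execute("INSERT INTO VFT VALUES (0, 'Janwei Han', 1, 15)")      # row quoted in the text

# Resolve the node attribute key through the topology dimension table.
row = conn.execute("""
    SELECT v.Author_name, i.Institution_name, v.Paper_Num
    FROM VFT AS v JOIN Institution AS i USING (Institution_id)
    WHERE v.Author_id = 0
""").fetchone()
```

The fact row stores only the key 1; the institution name is obtained by joining on Institution_id.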
3. Information dimension tables
An information dimension table (IDT) consists of a primary key that identifies the table (the primary key can be understood as identifying the subject of the information recorded in the table) and a number of related attributes. There may be multiple information dimensions; each dimension has an associated relational table, called a dimension table, which further describes the dimension. In the co-author network, the information dimensions comprise the Paper dimension table, the Time dimension table, and the Venue dimension table. Dimension tables are defined by the user according to actual needs, or are generated and adjusted automatically according to the data distribution. The conversion of the information dimensions into relational information dimension tables is shown in FIG. 11.
In the relational information dimension tables on the right of FIG. 11, Paper_key = 1 uniquely identifies the Paper record whose Paper_name is FP tree and whose Paper_classify is TP311; the records with Paper_key 2, 3, and 4 are read in the same way.
Time_key = 1 uniquely identifies the Time record for the year 1967 in the decade of the 1960s; the records with Time_key 2, 3, and 4 are read in the same way.
Venue_key = 1 uniquely identifies the Venue record whose Venue_name is VLDB and whose Venue_area is DB; the records with Venue_key 2, 3, and 4 are read in the same way.
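The way an edge fact row is resolved through the three information dimension tables can be sketched as a join. The table names `Paper_dim`, `Time_dim`, and `Venue_dim` and the column names `Year` and `Decade` are assumptions; the row values are the ones quoted in the text for key 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Paper_dim (Paper_key INTEGER PRIMARY KEY, Paper_name TEXT, Paper_classify TEXT);
CREATE TABLE Time_dim  (Time_key  INTEGER PRIMARY KEY, Year INTEGER, Decade TEXT);
CREATE TABLE Venue_dim (Venue_key INTEGER PRIMARY KEY, Venue_name TEXT, Venue_area TEXT);
CREATE TABLE EFT (
    Author1_id   INTEGER,
    Author2_id   INTEGER,
    Paper_key    INTEGER REFERENCES Paper_dim(Paper_key),
    Time_key     INTEGER REFERENCES Time_dim(Time_key),
    Venue_key    INTEGER REFERENCES Venue_dim(Venue_key),
    Co_frequence INTEGER,
    PRIMARY KEY (Author1_id, Author2_id)   -- the two node IDs identify the edge
);
INSERT INTO Paper_dim VALUES (1, 'FP tree', 'TP311');
INSERT INTO Time_dim  VALUES (1, 1967, '1960s');
INSERT INTO Venue_dim VALUES (1, 'VLDB', 'DB');
INSERT INTO EFT       VALUES (0, 1, 1, 1, 1, 1);
""")

# Follow each edge attribute key into its information dimension table.
row = conn.execute("""
    SELECT p.Paper_name, t.Year, v.Venue_name, e.Co_frequence
    FROM EFT AS e
    JOIN Paper_dim AS p ON p.Paper_key = e.Paper_key
    JOIN Time_dim  AS t ON t.Time_key  = e.Time_key
    JOIN Venue_dim AS v ON v.Venue_key = e.Venue_key
    WHERE e.Author1_id = 0 AND e.Author2_id = 1
""").fetchone()
```

Because the edge fact row holds only keys and a measure, adding a new descriptive attribute to a dimension table does not change the fact table.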
4. Topology dimension tables
The topology dimensions determine the edge set and node set of the information network, that is, the topology of the graph in the information network, and thereby determine the granularity of the unit that a node represents. In the co-author network, the topology dimension is the institution. A topology dimension table (TDT) consists of a primary key that uniquely identifies the table and a number of related attributes. As with information dimension tables, there may be multiple topology dimension tables. The conversion of each topology dimension into a relational topology dimension table is shown in FIG. 12; that is, the data in a topology dimension table may be stored in the form of the relational topology dimension table on the right of FIG. 12.
To aid understanding of the information network data warehouse model, the underlying concepts are defined below.
Information dimension: Let the graph structure be G(V, E) = G(V, f(ID)), where V is the set of vertices of the graph, E is the set of edges, and the function f is the edge-information decision function of graph G. Let ID = {I1, I2, ..., Im} be the set of dimensions to be examined in OLGP. The dimension set formed by these m information attributes determines only the edge set of the graph and cannot change the topology of the graph; ID is called the information dimension set.
Topology dimension: Let TD = {T1, T2, ..., Tn} be a set that characterizes the topology of the graph-centered measure in OLGP. A graph can be expressed as G(V, E) = G(Φ(TD), δ(TD)), where the function Φ is the vertex topology decision function and the function δ is the edge topology decision function. The topology dimensions formed by these n topological attributes determine the vertex set and the edge set of the graph and hence its topology; TD is called the topology dimension set.
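The difference between the two kinds of dimension can be illustrated with a small sketch. It assumes a toy co-author graph in which the year is an information dimension and the institution is a topology dimension; the sample data, the function names, and the choice to drop edges that fall inside a single institution are all assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical sample data.
institution = {0: "X", 1: "X", 2: "Y"}               # author ID -> topology attribute
edges = [(0, 1, 2009), (0, 2, 2010), (1, 2, 2010)]   # (author1, author2, year)


def slice_by_year(edges, year):
    """Information dimension: selects edges, the vertex set is unchanged."""
    return set(institution), [(a, b) for a, b, y in edges if y == year]


def roll_up_by_institution(edges):
    """Topology dimension: authors collapse into institutions, so V and E both change."""
    vertices = set(institution.values())
    weights = Counter()
    for a, b, _ in edges:
        u, v = sorted((institution[a], institution[b]))
        if u != v:                       # assumption: ignore edges inside one institution
            weights[(u, v)] += 1
    return vertices, dict(weights)


v_info, e_info = slice_by_year(edges, 2010)
v_topo, e_topo = roll_up_by_institution(edges)
```

Selecting on the year leaves the three author vertices in place and only filters the edges, whereas grouping by institution replaces the author vertices with two institution vertices and aggregates the edges between them.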
Information network data warehouse model: Let ROLGP(EFT, VFT, S(IDT), S(TDT), F) be a relational OLGP data cube, where EFT is the edge fact table, VFT is the node fact table, S(IDT) is the set of information dimension tables, IDT is an information dimension table, S(TDT) is the set of topology dimension tables, TDT is a topology dimension table, and F is the set of dependencies between tables. The following constraints must be satisfied:
(1) IDT 通过外键与 EFT 连接, TDT通过外键与 VFT 连接, EFT 与 VFT通过节点 ID连接。 (2) EFT, VFT , IDT , TDT 满足关系表, 即满足 以下定义: R(U, D, Dom, F').R 为关系表, U 为组成该关系的属性名集合, D 为 属性组 U 中属性所来自的域, Dom为属性向域的集合, F'为属性间数据的依 赖关系集合。  (1) IDT is connected to EFT through a foreign key, TDT is connected to VFT through a foreign key, and EFT is connected to VFT through node ID. (2) EFT, VFT, IDT, TDT satisfy the relationship table, that is, the following definitions are satisfied: R(U, D, Dom, F').R is the relational table, U is the set of attribute names that make up the relationship, and D is the attribute group. The domain from which attributes are derived from U, Dom is the set of attributes to the domain, and F' is the set of dependencies of the data between attributes.
As in traditional OLAP modeling, an OLGP-based information network model has fact tables and dimension tables. The difference is that the fact tables consist of an edge fact table (EFT) and a node fact table (VFT), and the dimension tables consist of information dimension tables (IDT) and topological dimension tables (TDT). The OLGP information network is modeled on relational data: nodes and edges are stored in the node fact table and the edge fact table respectively, attributes related to the edge fact table are stored in information dimension tables, and attributes related to nodes are stored in topological dimension tables.
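The constraints above can be sketched as a relational schema. The following is only a minimal illustration using Python's built-in sqlite3 module; the dimension table names Institution and Venue, the measure Paper_num, and all other column names are assumptions chosen for a collaborator network and are not fixed by the text.

```python
import sqlite3

# Hypothetical schema: one topological dimension table (Institution), one
# information dimension table (Venue), a node fact table and an edge fact table.
SCHEMA = """
CREATE TABLE Institution (              -- TDT, joined to VFT by foreign key
    Institution_key  INTEGER PRIMARY KEY,
    Institution_name TEXT
);
CREATE TABLE Venue (                    -- IDT, joined to EFT by foreign key
    Venue_key  INTEGER PRIMARY KEY,
    Venue_name TEXT
);
CREATE TABLE VFT (                      -- node fact table
    Author_id       INTEGER PRIMARY KEY,
    Institution_key INTEGER REFERENCES Institution(Institution_key),
    Paper_num       INTEGER             -- node measure (assumed)
);
CREATE TABLE EFT (                      -- edge fact table, joined to VFT by node ID
    Author_id1 INTEGER REFERENCES VFT(Author_id),
    Author_id2 INTEGER REFERENCES VFT(Author_id),
    Venue_key  INTEGER REFERENCES Venue(Venue_key),
    Paper_key  INTEGER
);
"""

def build_warehouse() -> sqlite3.Connection:
    """Create an empty in-memory warehouse with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

conn = build_warehouse()
tables = sorted(row[0] for row in
                conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"))
```

With foreign keys enforced, an edge row that names an unknown node or an unknown dimension key is rejected, which is one way to realize constraint (1).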
Embodiment 2 of the present invention thus provides a method for storing data. The method acquires an original data set and extracts from it node information, node attribute information, edge information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the link between two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, and they are stored as such. Because the extracted information is interrelated, the required data can be located quickly and accurately in later operations. In addition, compared with the existing OLAP multidimensional data warehouse model, the stored information includes not only the node information and node attribute information found in the prior art, which lets researchers focus on node-centric facts, but also edge information and edge attribute information, which the prior art cannot capture, so that researchers can also examine relationships between nodes.
Furthermore, the solution provided by this embodiment links nodes to edges on the basis that an edge is the connection between two nodes, so that node information, node attribute information, edge information, and edge attribute information are directly related. By exploiting this link between edges and nodes, the solution makes relationships between nodes observable while requiring only small changes to the prior art.
Preferably, because of the storage method provided by this embodiment, subsequent queries on the stored data are fast and accurate. The method may further include:
Step 204: for the data to be queried, locate it within the stored node information, node attribute information, edge information, and edge attribute information, and then run the query against whichever one of these the data was located in.
That is, determine whether the data to be queried belongs to the node information, the node attribute information, the edge information, or the edge attribute information, and query only that information. This greatly narrows the scope of the query.
For example, consider querying the number of papers published at different conferences in the collaborator network. With the storage method of steps 201 to 203, the query involves only the EFT and the Venue table (one of the information dimension tables) in the multidimensional information network data warehouse model. The edge attribute key Venue_key in the EFT is joined to the Venue information dimension table, and the number of papers published at each conference can then be obtained. The query can be written as follows:
Structured Query Language (SQL) statement:
SELECT EFT.Paper_key
FROM EFT, Venue
WHERE EFT.Venue_key = Venue.Venue_key AND Venue.Venue_name = 'conference name'
With step 204 added, when a query is issued it can be determined which one or more of the edge fact table, node fact table, information dimension tables, and topological dimension tables in the multidimensional information network data warehouse hold the requested information. This removes a large amount of redundant information, and queries are efficient and save time: a query on a specific question involves joining only some of the tables.
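As a rough illustration of a query that touches only the EFT and the Venue table, the sketch below counts papers per conference with Python's sqlite3 module. The sample rows, the column names Author_id1 and Author_id2, and the use of COUNT(DISTINCT ...) with GROUP BY are assumptions added for the example; the source statement selects the paper keys for one named conference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Venue (Venue_key INTEGER PRIMARY KEY, Venue_name TEXT);
CREATE TABLE EFT (Author_id1 INTEGER, Author_id2 INTEGER,
                  Venue_key INTEGER, Paper_key INTEGER);
INSERT INTO Venue VALUES (1, 'ConfA'), (2, 'ConfB');
-- Paper 100 has three authors, so it contributes more than one edge row.
INSERT INTO EFT VALUES (10, 11, 1, 100), (10, 12, 1, 100),
                       (11, 12, 1, 101), (10, 12, 2, 102);
""")

def papers_per_venue(conn):
    """Count distinct papers per conference by joining only EFT and Venue."""
    rows = conn.execute("""
        SELECT Venue.Venue_name, COUNT(DISTINCT EFT.Paper_key)
        FROM EFT JOIN Venue ON EFT.Venue_key = Venue.Venue_key
        GROUP BY Venue.Venue_name
        ORDER BY Venue.Venue_name
    """).fetchall()
    return dict(rows)

counts = papers_per_venue(conn)
```

Counting distinct paper keys matters here because one paper with several co-author pairs appears on several edges; neither the node fact table nor any topological dimension table is read.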
Preferably, the method further includes the following step:
Step 205: perform online graph processing (OLGP, Online Graph Processing) operations based on the extracted node information, node attribute information, edge information, and edge attribute information.
OLGP operations may include, but are not limited to: roll-up (informational roll-up (I-OLGP), topological roll-up (T-OLGP), and asynchronous roll-up), drill-down, slice, dice, and pivot.
For the collaborator network, an informational roll-up (I-OLGP) can be performed. Specifically, along the time dimension of the information dimensions, roll up through the levels year → decade → all: from the number of papers published in each year, to the number published in each decade, to the number published over all time.
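The year → decade → all hierarchy can be sketched as three GROUP BY levels over the same join. This is a minimal sketch with Python's sqlite3 module; the Time table layout (with a precomputed Decade column), the sample rows, and the helper name roll_up are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (Time_key INTEGER PRIMARY KEY, Year INTEGER, Decade INTEGER);
CREATE TABLE EFT (Author_id1 INTEGER, Author_id2 INTEGER,
                  Time_key INTEGER, Paper_key INTEGER);
INSERT INTO Time VALUES (1, 1998, 1990), (2, 1999, 1990), (3, 2004, 2000);
INSERT INTO EFT VALUES (10, 11, 1, 100), (10, 12, 2, 101),
                       (11, 12, 2, 102), (10, 11, 3, 103);
""")

# Grouping expression for each level of the time hierarchy (fixed whitelist).
LEVELS = {"year": "Time.Year", "decade": "Time.Decade", "all": "'all'"}

def roll_up(conn, level):
    """Count distinct papers at one level of the time hierarchy."""
    sql = (f"SELECT {LEVELS[level]} AS bucket, COUNT(DISTINCT EFT.Paper_key) "
           "FROM EFT JOIN Time ON EFT.Time_key = Time.Time_key "
           "GROUP BY bucket ORDER BY bucket")
    return dict(conn.execute(sql).fetchall())

by_year = roll_up(conn, "year")
by_decade = roll_up(conn, "decade")
overall = roll_up(conn, "all")
```

Only the edge fact table and one information dimension table take part; the graph's nodes and topology are untouched, which is what distinguishes an informational roll-up from a topological one.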
For the collaborator network, a topological roll-up can also be performed. Specifically, along the institution dimension in the topological dimension table, roll up through the topological levels individual author → author's institution (Institution) → all, for example from the collaboration relationships between authors to the collaboration relationships between institutions.
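A topological roll-up changes the graph itself: nodes are merged into groups and edges are re-aggregated between groups. The sketch below shows one possible reading in plain Python; the sample data, the function name topological_roll_up, and the decision to drop collaborations inside a single institution (rather than keep them as self-loops) are all assumptions.

```python
from collections import Counter

# Hypothetical sample data: the node fact table maps each author to an
# institution; edge rows are (author_1, author_2, number_of_joint_papers).
author_institution = {"a1": "InstX", "a2": "InstX", "a3": "InstY", "a4": "InstZ"}
author_edges = [("a1", "a2", 3), ("a1", "a3", 2), ("a2", "a3", 1), ("a3", "a4", 4)]

def topological_roll_up(edges, node_to_group):
    """Merge nodes by group and sum edge weights between distinct groups."""
    rolled = Counter()
    for u, v, weight in edges:
        gu, gv = node_to_group[u], node_to_group[v]
        if gu == gv:
            continue  # assumption: intra-group edges vanish at the coarser level
        rolled[tuple(sorted((gu, gv)))] += weight
    return dict(rolled)

institution_edges = topological_roll_up(author_edges, author_institution)
```

Rolling up once more to the "all" level would map every institution to a single group, leaving one node and, under the same assumption, no edges.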
Note that a roll-up can be understood as summarizing low-level detail data into higher-level aggregate data along some dimension. For example, rolling up the time information dimension from year to decade yields per-decade aggregates, and rolling up again from decade to all years yields an aggregate over all years.
Because the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, OLGP operations on the stored information can target a single category of information, for example operating only on the edge attribute information stored in the information dimension tables, or only on the node attribute information stored in the topological dimension tables.
Furthermore, with the storage method provided by this embodiment, multiple subjects can be modeled by sharing information dimensions, so the underlying data rarely needs to be restructured and existing dimension tables are shared as far as possible. For example, in the keyword-collaborator network model, both the keyword network and the collaborator network contain the Paper, Time, and Venue dimensions, so a keyword-collaborator network can be built by sharing these three information dimensions. As shown in Figure 13, the keyword fact tables and the collaborator fact tables share the Venue, Paper, and Time dimensions to form the keyword-collaborator multidimensional information network data warehouse model.
In the keyword-collaborator multidimensional information network data warehouse model of Figure 13, the nodes in the four columns on the left represent terms (Term) and the edges represent relationships between terms; these columns store node information, node attribute information, edge information, and edge attribute information by performing steps 201 to 203 described above. The nodes in the four columns on the right represent authors and the edges represent relationships between authors; these columns store the same four kinds of information, again by performing steps 201 to 203. The two sides therefore store different subjects: on the left the nodes represent terms, and on the right the nodes represent authors.
The Co_IDT in the middle serves as the information dimension table for both the left-hand and the right-hand warehouse. In other words, the two multidimensional information network data warehouses share information dimension tables, and the edge attribute information stored in the two warehouses is the same.
Therefore, when the nodes represent different subjects, data for multiple subjects stored with the method of this embodiment can be modeled together by sharing information dimensions, so the underlying data rarely needs to be restructured and existing dimension tables are shared as far as possible.
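Sharing a dimension table between two subjects can be sketched as two edge fact tables whose foreign keys point at the same table. The example below uses Python's sqlite3 module; the table names Author_EFT and Term_EFT, the columns, and the sample rows are hypothetical and stand in for the two sides of Figure 13.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared information dimension table.
CREATE TABLE Venue (Venue_key INTEGER PRIMARY KEY, Venue_name TEXT);
-- Two subjects, each with its own edge fact table referencing the same Venue.
CREATE TABLE Author_EFT (Author_id1 INTEGER, Author_id2 INTEGER,
                         Venue_key INTEGER REFERENCES Venue(Venue_key));
CREATE TABLE Term_EFT (Term_id1 INTEGER, Term_id2 INTEGER,
                       Venue_key INTEGER REFERENCES Venue(Venue_key));
INSERT INTO Venue VALUES (1, 'ConfA'), (2, 'ConfB');
INSERT INTO Author_EFT VALUES (10, 11, 1), (10, 12, 2);
INSERT INTO Term_EFT VALUES (500, 501, 1), (500, 502, 1), (501, 502, 2);
""")

def edges_per_venue(conn, fact_table):
    """Count edges per venue for either subject, reusing the shared Venue table."""
    if fact_table not in ("Author_EFT", "Term_EFT"):
        raise ValueError(f"unknown fact table: {fact_table}")
    sql = (f"SELECT Venue.Venue_name, COUNT(*) FROM {fact_table} AS f "
           "JOIN Venue ON f.Venue_key = Venue.Venue_key "
           "GROUP BY Venue.Venue_name")
    return dict(conn.execute(sql).fetchall())

author_counts = edges_per_venue(conn, "Author_EFT")
term_counts = edges_per_venue(conn, "Term_EFT")
```

Adding a further subject under this layout means adding fact tables only; the Venue rows are stored once and never duplicated.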
Embodiment 3
This embodiment of the present invention provides a data storage method similar to the method of Embodiment 2. The difference is that this embodiment is a further concrete application example: the storage method is applied to a film actor collaboration network.
A film actor collaboration network is also a kind of information network. When the user is interested in collaboration between actors, each node represents an actor and each edge indicates that two actors have worked together. The film actor collaboration network is shown in Figure 14. A node is described by actor name, gender, age, and film company; an edge is described by film title and release time. As shown in Figure 15, the method includes:
Step 301: acquire the original data set. For a film actor collaboration network, the original data set is typically an unordered jumble of actor names, genders, film titles, release times, and so on, which makes lookups and OLGP operations inconvenient.
Step 302: extract from the original data set the information that represents the graph structure of the information network. This information includes at least node information, node attribute information, edge information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. An edge describes the link between two nodes.
Because nodes are related to node attributes, edges are related to edge attributes, and edges describe links between nodes, the relationships among the extracted node information, node attribute information, edge information, and edge attribute information can readily be represented as a graph structure.
In the film actor collaboration network, the extracted node information may be stored in a node fact table (VFT, Vertex Fact Table). The node information may include a node ID and node attribute keys, and may also include node measures. In the VFT shown in Figure 15, the node ID is the actor ID (Actor_id) together with the actor name, the node attribute key is the key of the actor's film company (Film_Company_id), and the node measure is the number of films the actor has appeared in (Film_Num).
The extracted edge information may be stored in an edge fact table (EFT, Edge Fact Table). The stored edge information may include the two actor node IDs id1 and id2 (which together identify the edge) and the edge attribute keys, such as the key of the film the actors worked on together (Film_key) and the release time key (Release_Time_key). The edge information may also include an edge measure, namely the number of collaborations (Co_Frequence).
Node attribute information is the specific information that a node attribute key refers to. It may be stored in topological dimension tables (TDT, Topology Dimension Table), of which there may be one or more. For example, if the node attribute key in the node information is the film company key (Film_Company_id), the topological dimension table can store the name of the film company to which the actor (the node) belongs.
Edge attribute information is the specific information that an edge attribute key refers to. It may be stored in information dimension tables (IDT, Information Dimension Table). For example, the edge attribute information for the film key (Film_key) and the release time key (Release_Time_key) above may be stored in a film dimension table and a release time dimension table respectively. The film dimension table records information such as film title and film genre; the release time dimension table records information such as release year and decade.
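The extraction described above can be illustrated with a small in-memory sketch in Python. The raw record layout (one row per actor appearance), the sample values, and the helper names key_for and extract are assumptions; the sketch emits one edge row per film and pair of actors and omits the Co_Frequence measure for brevity.

```python
from itertools import combinations

# Hypothetical raw records: one row per (actor, film) appearance.
raw = [
    {"actor": "Ann", "company": "StudioA", "film": "Film1", "year": 2001},
    {"actor": "Bob", "company": "StudioA", "film": "Film1", "year": 2001},
    {"actor": "Cy",  "company": "StudioB", "film": "Film1", "year": 2001},
    {"actor": "Ann", "company": "StudioA", "film": "Film2", "year": 2005},
    {"actor": "Bob", "company": "StudioA", "film": "Film2", "year": 2005},
]

def key_for(table, value):
    """Return the surrogate key of value, adding it to the dimension table if new."""
    return table.setdefault(value, len(table) + 1)

def extract(records):
    tdt, film_idt, time_idt = {}, {}, {}   # dimension tables: value -> key
    vft, cast = {}, {}                     # node fact rows; actors per film
    for r in records:
        node = vft.setdefault(
            r["actor"],
            {"Film_Company_id": key_for(tdt, r["company"]), "Film_Num": 0})
        node["Film_Num"] += 1              # node measure: films appeared in
        film = (key_for(film_idt, r["film"]), key_for(time_idt, r["year"]))
        cast.setdefault(film, []).append(r["actor"])
    # Edge fact rows: every pair of actors who appear in the same film.
    eft = [(a, b, film_key, time_key)
           for (film_key, time_key), actors in cast.items()
           for a, b in combinations(sorted(actors), 2)]
    return vft, eft, tdt, film_idt, time_idt

vft, eft, tdt, film_idt, time_idt = extract(raw)
```

The fact rows hold only identifiers, keys, and measures, while the names and years they refer to are kept once in the dimension tables.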
Step 303: store the extracted node information, node attribute information, edge information, and edge attribute information, specifically in a node fact table, topological dimension tables, an edge fact table, and information dimension tables. Figure 16 shows the film actor collaboration multidimensional information network data warehouse model formed by these stored tables.
Embodiment 3 of the present invention thus provides a method for storing data. The method acquires an original data set and extracts from it node information, node attribute information, edge information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the link between two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, and they are stored as such. Because the extracted information is interrelated, the required data can be located quickly and accurately in later operations. In addition, compared with the existing OLAP multidimensional data warehouse model, the stored information includes not only the node information and node attribute information found in the prior art, which lets researchers focus on node-centric facts, but also edge information and edge attribute information, which the prior art cannot capture, so that researchers can also examine relationships between nodes.
Furthermore, the solution provided by this embodiment links nodes to edges on the basis that an edge is the connection between two nodes, so that node information, node attribute information, edge information, and edge attribute information are directly related. By exploiting this link between edges and nodes, the solution makes relationships between nodes observable while requiring only small changes to the prior art.
Preferably, because of the storage method provided by this embodiment, subsequent queries on the stored data are fast and accurate. The method may further include:
Step 304: for the data to be queried, locate it within the stored node information, node attribute information, edge information, and edge attribute information, and then run the query against whichever one of these the data was located in. That is, determine whether the data to be queried belongs to the node information, the node attribute information, the edge information, or the edge attribute information, and query only the located information, which narrows the scope of the query.
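One simple way to picture the locating step is a catalog that records which stored table holds each queryable attribute. The sketch below is an assumption about how such a lookup might be organized, not the mechanism described by the source; the catalog contents and the attribute names Film_Company_name and Film_name are hypothetical.

```python
# Hypothetical catalog: which stored table holds each queryable attribute.
CATALOG = {
    "Actor_id": "VFT",
    "Film_Num": "VFT",
    "Co_Frequence": "EFT",
    "Film_Company_name": "Film_Company",   # topological dimension table
    "Film_name": "Film",                   # information dimension table
    "Year": "Release_Time",                # information dimension table
}

# Category of information held by each table.
KIND = {
    "VFT": "node information",
    "EFT": "edge information",
    "Film_Company": "node attribute information",
    "Film": "edge attribute information",
    "Release_Time": "edge attribute information",
}

def locate(attributes):
    """Return the set of tables a query over these attributes must read."""
    unknown = [a for a in attributes if a not in CATALOG]
    if unknown:
        raise KeyError(f"not stored in the warehouse: {unknown}")
    return {CATALOG[a] for a in attributes}

tables = locate(["Year", "Co_Frequence"])
kinds = {KIND[t] for t in tables}
```

A query over release year and collaboration count is located in two tables, so the node fact table and the topological dimension tables never need to be scanned.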
For example, consider querying the number of films released in different years in the film actor collaboration network. With the storage method of steps 301 to 303, the query involves only the EFT and the release time table (the Release_Time table among the information dimension tables) in the multidimensional information network data warehouse model. The edge attribute key Release_Time_key in the EFT is joined to the Release_Time information dimension table, and the number of films released in each year can then be obtained. The query can be written as follows:
Structured Query Language (SQL) statement:
SELECT EFT.Film_key
FROM EFT, Release_Time
WHERE EFT.Release_Time_key = Release_Time.Release_Time_key AND
Release_Time.Year = 'year'
With step 304 added, when a query is issued it can be determined which one or more of the edge fact table, node fact table, information dimension tables, and topological dimension tables in the multidimensional information network data warehouse hold the requested information. This removes a large amount of redundant information, and queries are efficient and save time: a query on a specific question involves joining only some of the tables.
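The release-year query can be made runnable with Python's sqlite3 module as follows. The sample rows and the function name films_released_in are assumptions, and the statement is extended with COUNT(DISTINCT ...) and a bound parameter so that it returns a number for a given year rather than a list of film keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Release_Time (Release_Time_key INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE EFT (id1 INTEGER, id2 INTEGER,
                  Film_key INTEGER, Release_Time_key INTEGER);
INSERT INTO Release_Time VALUES (1, 2001), (2, 2005);
-- Film 10 has three actors and therefore three edge rows.
INSERT INTO EFT VALUES (1, 2, 10, 1), (1, 3, 10, 1), (2, 3, 10, 1),
                       (1, 2, 11, 1), (1, 2, 12, 2);
""")

def films_released_in(conn, year):
    """Count distinct films on edges whose release year matches."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT EFT.Film_key) "
        "FROM EFT, Release_Time "
        "WHERE EFT.Release_Time_key = Release_Time.Release_Time_key "
        "AND Release_Time.Year = ?", (year,)).fetchone()
    return row[0]

n_2001 = films_released_in(conn, 2001)
n_2005 = films_released_in(conn, 2005)
n_1990 = films_released_in(conn, 1990)
```

One caveat of counting through the edge fact table is that a film with a single credited actor produces no edge and so would not be counted by this sketch.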
Preferably, the method further includes the following step:
Step 305: perform online graph processing (OLGP, Online Graph Processing) operations based on the relationships among the extracted node information, node attribute information, edge information, and edge attribute information.
OLGP operations may include, but are not limited to: roll-up (informational roll-up (I-OLGP), topological roll-up (T-OLGP), and asynchronous roll-up), drill-down, slice, dice, and pivot.
For the collaboration network, an informational roll-up (I-OLGP) can be performed. Specifically, along the time dimension of the information dimensions, roll up through the levels year → decade → all: from the number of films released in each year, to the number released in each decade, to the number released over all time.
For the collaboration network, a topological roll-up can also be performed. Specifically, along the institution dimension in the topological dimension table, roll up through the topological levels actor (Actor) → film company (Film_Company) → all, for example from the collaboration relationships between actors to the collaboration relationships between film companies.
Because the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, OLGP operations on the stored information can target a single category of information, for example operating only on the edge attribute information stored in the information dimension tables, or only on the node attribute information stored in the topological dimension tables.
Furthermore, with the storage method provided by this embodiment, multiple subjects can be modeled by sharing information dimensions, so the underlying data rarely needs to be restructured and existing dimension tables are shared as far as possible.
Embodiment 4
This embodiment of the present invention provides a data storage apparatus. As shown in Figure 17, the apparatus includes an acquisition unit 401, an extraction unit 402, and a storage unit 403.
The acquisition unit 401 is configured to acquire an original data set.
The original data set can be understood as the collection of all data gathered by the user; this data is disorganized and ill-suited to analysis. The original data set acquired by the acquisition unit may be raw unstructured text input to the executing device.
The extraction unit 402 is configured to extract from the original data set the information that represents the graph structure of the information network. This information includes at least node information, node attribute information, edge information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. An edge describes the link between two nodes.
Because nodes are related to node attributes, edges are related to edge attributes, and edges describe links between nodes, the relationships among the extracted node information, node attribute information, edge information, and edge attribute information can readily be represented as a graph structure (see Figures 5, 8, and 16 described above).
The storage unit 403 is configured to store the extracted node information, node attribute information, edge information, and edge attribute information.
The storage unit may store the extracted node information, node attribute information, edge information, and edge attribute information in tabular form, that is, in a node fact table, topological dimension tables, an edge fact table, and information dimension tables respectively. Tabular storage is one implementation and does not limit the embodiments of the present invention; other storage forms may also be used.
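The division of work among the three units can be pictured with a short Python sketch. The class names, the GraphInfo container, and the five-field record layout are all hypothetical, and attribute values double as their own keys to keep the example small; this is one possible arrangement rather than the apparatus itself.

```python
from dataclasses import dataclass, field

@dataclass
class GraphInfo:
    nodes: dict = field(default_factory=dict)       # node id -> node attribute key
    node_attrs: dict = field(default_factory=dict)  # node attribute key -> attribute info
    edges: list = field(default_factory=list)       # (id1, id2, edge attribute key)
    edge_attrs: dict = field(default_factory=dict)  # edge attribute key -> attribute info

class AcquisitionUnit:
    """Stands in for unit 401: acquires the original data set."""
    def acquire(self, source):
        return list(source)

class ExtractionUnit:
    """Stands in for unit 402: extracts the graph-structure information."""
    def extract(self, records):
        g = GraphInfo()
        for a, company_a, b, company_b, film in records:
            for actor, company in ((a, company_a), (b, company_b)):
                g.nodes[actor] = company
                g.node_attrs.setdefault(company, {"name": company})
            g.edges.append((a, b, film))
            g.edge_attrs.setdefault(film, {"title": film})
        return g

class StorageUnit:
    """Stands in for unit 403: stores the extracted information as four tables."""
    def __init__(self):
        self.tables = {}

    def store(self, g):
        self.tables = {"VFT": g.nodes, "TDT": g.node_attrs,
                       "EFT": g.edges, "IDT": g.edge_attrs}

# Hypothetical records: (actor_a, company_a, actor_b, company_b, film).
records = [("Ann", "StudioA", "Bob", "StudioA", "Film1"),
           ("Ann", "StudioA", "Cy", "StudioB", "Film1")]
storage = StorageUnit()
storage.store(ExtractionUnit().extract(AcquisitionUnit().acquire(records)))
```

Keeping storage behind its own unit is what lets the tabular form be swapped for another storage form without touching acquisition or extraction.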
This embodiment of the present invention thus provides an apparatus for storing data. The apparatus acquires an original data set and extracts from it node information, node attribute information, edge information, and edge attribute information. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the link between two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are related to one another, and they are stored as such. Because the extracted information is interrelated, the required data can be located quickly and accurately in later operations. In addition, compared with the existing OLAP multidimensional data warehouse model, the stored information includes not only the node information and node attribute information found in the prior art, which lets researchers focus on node-centric facts, but also edge information and edge attribute information, which the prior art cannot capture, so that researchers can also examine relationships between nodes.
Further, the apparatus for storing data provided by Embodiment 1 of the present invention solves the problem of redundancy in the original data set under the existing OLAP multidimensional data warehouse model. The solution provided by the embodiments of the present invention offers flexible queries, high efficiency, and flexible subject extraction.
Furthermore, the apparatus for storing data provided by Embodiment 1 of the present invention better matches the modeling requirements of real-world social networks and facilitates the design of efficient OLGP algorithms. The model is also easy to convert into traditional relational tables, which helps people understand real-world information.
Moreover, in the solution provided by the embodiments of the present invention, the link between nodes and edges is established from the fact that an edge is the connection between two nodes, so the node information, node attribute information, edge information, and edge attribute information are directly linked. Because the solution exploits this link between edges and nodes, it makes it possible to focus on the relationships between nodes while requiring only minor changes to the prior art.
Preferably, in this solution, the node information further includes a node metric value, and the edge information further includes an edge metric value.
Preferably, in this solution, the extracted node information is stored in a node fact table;
the extracted edge information is stored in an edge fact table;
the extracted node attribute information is stored in a topology dimension table;
the extracted edge attribute information is stored in an information dimension table;
because an edge describes the connection between nodes, the node fact table is linked to the edge fact table;
because the node attribute key corresponds to the node attribute information, the topology dimension table is linked to the node fact table;
because the edge attribute key corresponds to the edge attribute information, the information dimension table is linked to the edge fact table.
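The three links listed above (edge to nodes, node attribute key to topology dimension table, edge attribute key to information dimension table) can be sketched with in-memory records. The class names, field names, and sample values below are hypothetical and serve only to show how the keys tie the four tables together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeFact:
    node_id: int
    node_attr_key: int   # points into the topology dimension table


@dataclass(frozen=True)
class EdgeFact:
    edge_id: int
    src: int             # assumed: endpoints are stored as node identifiers
    dst: int
    edge_attr_key: int   # points into the information dimension table


# Dimension tables: key -> attribute values (sample data, not from the source).
topology_dim = {10: {"city": "Chengdu"}, 11: {"city": "Shenzhen"}}
info_dim = {20: {"relation": "coauthor"}}

# Fact tables, indexed by identifier.
node_fact = {1: NodeFact(1, 10), 2: NodeFact(2, 11)}
edge_fact = {100: EdgeFact(100, 1, 2, 20)}


def describe_edge(edge_id: int) -> dict:
    """Follow the keys from an edge to its own attributes and its endpoints' attributes."""
    edge = edge_fact[edge_id]
    return {
        "edge": info_dim[edge.edge_attr_key],
        "src": topology_dim[node_fact[edge.src].node_attr_key],
        "dst": topology_dim[node_fact[edge.dst].node_attr_key],
    }
```

Calling `describe_edge(100)` walks edge fact table, then information dimension table for the edge attributes, and node fact table, then topology dimension table for each endpoint.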
Preferably, the apparatus further includes a positioning unit 404 and a query unit 405.
The positioning unit 404 is configured to locate the data to be queried within the stored node information, node attribute information, edge information, and edge attribute information.
The query unit 405 is configured to perform the query on whichever of the node information, node attribute information, edge information, or edge attribute information was located.
With the positioning unit 404 and the query unit 405 added, when data is queried it can be determined which one or more of the edge fact table, node fact table, information dimension table, and topology dimension table in the multidimensional information network data warehouse the requested information belongs to. This avoids touching a large amount of redundant information and makes queries efficient and fast: a query for a specific problem involves join operations on only some of the tables.
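One way to picture the locate-then-query step is a lookup from requested fields to the tables that hold them, so that only those tables are joined. This is a sketch under assumptions: the column catalogue and the `locate` helper are hypothetical and are not taken from the source.

```python
from typing import Dict, List, Set

# Hypothetical catalogue of which columns live in which table.
TABLE_COLUMNS: Dict[str, Set[str]] = {
    "node_fact": {"node_id", "node_attr_key", "node_measure"},
    "edge_fact": {"edge_id", "src_node_id", "dst_node_id", "edge_attr_key", "edge_measure"},
    "topology_dim": {"node_attr_key", "city", "occupation"},
    "info_dim": {"edge_attr_key", "relation"},
}

# Key columns appear in both a fact table and a dimension table,
# so they are ignored when deciding where the requested data lives.
KEY_COLUMNS: Set[str] = {"node_attr_key", "edge_attr_key"}


def locate(requested: Set[str]) -> List[str]:
    """Return, in sorted order, the tables holding at least one requested non-key column."""
    wanted = requested - KEY_COLUMNS
    return sorted(
        table for table, columns in TABLE_COLUMNS.items() if columns & wanted
    )
```

For example, a question about edge relation types alone locates only `info_dim`, while a question that also needs the edge metric value locates `edge_fact` and `info_dim`; the node tables are never touched in either case.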
Preferably, in this solution, the apparatus further includes a graph processing unit 406.
The graph processing unit 406 is configured to perform online graph processing operations based on the extracted node information, node attribute information, edge information, and edge attribute information.
Preferably, in this solution, the online graph processing operations in the graph processing unit 406 include at least one of: roll-up (information dimension roll-up (I-OLGP), topological dimension roll-up (T-OLGP), or asynchronous roll-up), drill-down, slice, dice, and pivot.
Preferably, in this solution, if the extracted edge attribute information is stored in the information dimension table, the information dimension roll-up in the graph processing unit 406 specifically includes:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the edges stored in the information dimension table.
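As a rough illustration of an information dimension roll-up, the sketch below aggregates edge metric values to a coarser level of one edge attribute while leaving the nodes unchanged. The source does not specify the aggregation function or the attribute hierarchy, so summing the metric values and the `month`/`year` levels are assumptions, as are all names and sample data.

```python
from collections import defaultdict

# Information dimension table: edge_attr_key -> edge attributes (sample data).
info_dim = {
    1: {"month": "2013-01", "year": 2013},
    2: {"month": "2013-02", "year": 2013},
    3: {"month": "2014-01", "year": 2014},
}

# Edge fact rows: (src, dst, edge_attr_key, edge_measure).
edge_fact = [
    ("a", "b", 1, 2.0),
    ("a", "b", 2, 3.0),
    ("a", "b", 3, 1.0),
    ("b", "c", 1, 4.0),
]


def info_rollup(edges, dim, level):
    """Sum edge metric values per (src, dst, attribute value at the given level)."""
    totals = defaultdict(float)
    for src, dst, key, measure in edges:
        totals[(src, dst, dim[key][level])] += measure
    return dict(totals)


by_year = info_rollup(edge_fact, info_dim, "year")
# by_year == {("a", "b", 2013): 5.0, ("a", "b", 2014): 1.0, ("b", "c", 2013): 4.0}
```

Only the edge fact table and the information dimension table are read here; the node tables play no part in this operation.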
Preferably, in this solution, if the extracted node attribute information is stored in the topology dimension table, the topological dimension aggregation operation in the graph processing unit 406 specifically includes:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the nodes stored in the topology dimension table.
The extracted node information, node attribute information, edge information, and edge attribute information are linked to one another. Therefore, when online graph processing (OLGP) operations are performed on the stored information, different categories of information can be processed separately, for example by operating only on the edge attribute information stored in the information dimension table, or only on the node attribute information stored in the topology dimension table.
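The topological dimension roll-up mentioned above can be sketched as merging nodes that share a value of one node attribute into a single group and aggregating the edges between groups. This is one plausible reading rather than the source's stated algorithm: the grouping attribute `city`, the summing of edge metric values, and all names and data are assumptions.

```python
from collections import defaultdict

# Topology dimension table: node_attr_key -> node attributes (sample data).
topology_dim = {10: {"city": "Chengdu"}, 11: {"city": "Shenzhen"}}

# Node fact table: node_id -> node_attr_key.
node_fact = {"a": 10, "b": 10, "c": 11}

# Edge fact rows: (src, dst, edge_measure).
edge_fact = [("a", "b", 1.0), ("a", "c", 2.0), ("b", "c", 3.0)]


def topo_rollup(nodes, dim, edges, attr):
    """Group nodes by one attribute and sum edge metric values between groups."""
    group_of = {node: dim[key][attr] for node, key in nodes.items()}
    totals = defaultdict(float)
    for src, dst, measure in edges:
        totals[(group_of[src], group_of[dst])] += measure
    return dict(totals)


by_city = topo_rollup(node_fact, topology_dim, edge_fact, "city")
# by_city == {("Chengdu", "Chengdu"): 1.0, ("Chengdu", "Shenzhen"): 5.0}
```

In contrast to an information dimension roll-up, this operation changes the set of nodes: three nodes collapse into two city-level groups.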
Furthermore, with the storage method provided by the embodiments of the present invention, multi-subject modeling can be performed by sharing information dimensions, so the underlying data rarely needs to be restructured and existing dimension tables are shared as far as possible.
Embodiment 5
An embodiment of the present invention provides a data storage apparatus. As shown in FIG. 18, the apparatus includes a memory 40, a processor 41, an input device 43, and an output device 44, each connected to a bus, wherein the memory 40 is used to store data input from the input device 43 and may also store the processor ...
[Figure imgf000028_0001]
The processor 41 is configured to extract, from the original data set, information representing the structure of an information network graph, wherein the information representing the structure of the information network graph includes at least node information, node attribute information, edge information, and edge attribute information; the node information includes at least a node identifier and a node attribute key; the node attribute key corresponds to the node attribute information; the edge information includes at least an edge identifier and an edge attribute key; the edge attribute key corresponds to the edge attribute information; and the edge describes the connection between nodes.
The memory 40 is further configured to store the extracted node information, node attribute information, edge information, and edge attribute information.
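The extract-then-store flow can be sketched as follows. The shape of the raw records, the `extract` function, and the way keys are assigned are all assumptions made for illustration; the source states only which four kinds of information are extracted and that keys link them.

```python
def extract(records):
    """Split raw edge records into node/edge facts and deduplicated attribute tables."""
    topology_dim = {}   # node attribute values -> node attribute key
    info_dim = {}       # edge attribute values -> edge attribute key
    node_fact = {}      # node identifier -> node attribute key
    edge_fact = []      # (edge identifier, src, dst, edge attribute key)

    def key_for(table, attrs):
        # Reuse the key if these attribute values were seen before, else assign the next one.
        return table.setdefault(attrs, len(table) + 1)

    for edge_id, rec in enumerate(records, start=1):
        for end in ("src", "dst"):
            name, city = rec[end]
            node_fact[name] = key_for(topology_dim, (city,))
        edge_key = key_for(info_dim, (rec["relation"],))
        edge_fact.append((edge_id, rec["src"][0], rec["dst"][0], edge_key))

    return node_fact, topology_dim, edge_fact, info_dim


# Hypothetical raw data set: each record is one relationship between two people.
raw = [
    {"src": ("ann", "Chengdu"), "dst": ("bob", "Chengdu"), "relation": "coauthor"},
    {"src": ("bob", "Chengdu"), "dst": ("cat", "Shenzhen"), "relation": "coauthor"},
]
node_fact, topology_dim, edge_fact, info_dim = extract(raw)
```

In this sample, "Chengdu" and "coauthor" each occur several times in the raw records but are stored once in a dimension table, which is the kind of redundancy reduction the description attributes to the model.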
With the apparatus for storing data provided by Embodiment 1 of the present invention, the apparatus obtains an original data set and extracts node information, node attribute information, edge information, and edge attribute information from it. The node information includes at least a node identifier and a node attribute key, and the node attribute key corresponds to the node attribute information. The edge information includes at least an edge identifier and an edge attribute key, and the edge attribute key corresponds to the edge attribute information. Because nodes are related to node attributes, edges are related to edge attributes, and an edge is the connection between two nodes, the extracted node information, node attribute information, edge information, and edge attribute information are linked to one another, and the apparatus stores them. Because the extracted information is linked, the required data can be located quickly and accurately when the data is operated on later. In addition, compared with the existing OLAP multidimensional data warehouse model, the information stored by the solution provided by the embodiments of the present invention includes not only the node information and node attribute information found in the prior art, so that researchers can focus on node-centered facts, but also the edge information and edge attribute information that the prior art cannot capture, so that researchers can also focus on the relationships between nodes.
Further, the apparatus for storing data provided by Embodiment 1 of the present invention solves the problem of redundancy in the original data set under the existing OLAP multidimensional data warehouse model. The solution provided by the embodiments of the present invention offers flexible queries, high efficiency, and flexible subject extraction.
Furthermore, the apparatus for storing data provided by Embodiment 1 of the present invention better matches the modeling requirements of real-world social networks and facilitates the design of efficient OLGP algorithms. The model is also easy to convert into traditional relational tables, which helps people understand real-world information.
Moreover, in the solution provided by the embodiments of the present invention, the link between nodes and edges is established from the fact that an edge is the connection between two nodes, so the node information, node attribute information, edge information, and edge attribute information are directly linked. Because the solution exploits this link between edges and nodes, it makes it possible to focus on the relationships between nodes while requiring only minor changes to the prior art.
Preferably, the node information processed by the processor 41 further includes a node metric value, and the edge information further includes an edge metric value.
Preferably, the node information extracted by the processor 41 is stored in a node fact table; the extracted edge information is stored in an edge fact table; the extracted node attribute information is stored in a topology dimension table; and the extracted edge attribute information is stored in an information dimension table. Because an edge describes the connection between nodes, the node fact table is linked to the edge fact table; because the node attribute key corresponds to the node attribute information, the topology dimension table is linked to the node fact table; and because the edge attribute key corresponds to the edge attribute information, the information dimension table is linked to the edge fact table.
Preferably, the processor 41 is further configured to locate the data to be queried within the stored node information, node attribute information, edge information, and edge attribute information, and to perform the query on whichever of the node information, node attribute information, edge information, or edge attribute information was located.
When data is queried, it can be determined which one or more of the edge fact table, node fact table, information dimension table, and topology dimension table in the multidimensional information network data warehouse the requested information belongs to. This avoids touching a large amount of redundant information and makes queries efficient and fast: a query for a specific problem involves join operations on only some of the tables.
Preferably, the processor 41 is further configured to perform online graph processing operations based on the extracted node information, node attribute information, edge information, and edge attribute information.
Preferably, the online graph processing operations performed by the processor 41 include at least one of: roll-up (information dimension roll-up (I-OLGP), topological dimension roll-up (T-OLGP), or asynchronous roll-up), drill-down, slice, dice, and pivot.
Preferably, if the extracted edge attribute information is stored in the information dimension table, the information dimension roll-up performed by the processor 41 specifically includes:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the edges stored in the information dimension table.
Preferably, if the extracted node attribute information is stored in the topology dimension table, the topological dimension aggregation operation performed by the processor 41 specifically includes:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the nodes stored in the topology dimension table.
The extracted node information, node attribute information, edge information, and edge attribute information are linked to one another. Therefore, when online graph processing (OLGP) operations are performed on the stored information, different categories of information can be processed separately, for example by operating only on the edge attribute information stored in the information dimension table, or only on the node attribute information stored in the topology dimension table.
Furthermore, with the storage method provided by the embodiments of the present invention, multi-subject modeling can be performed by sharing information dimensions, so the underlying data rarely needs to be restructured and existing dimension tables are shared as far as possible.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The method and apparatus for storing data provided by the present invention have been described in detail above. A person of ordinary skill in the art may, following the ideas of the embodiments of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for storing data, characterized in that the method comprises:
obtaining an original data set;
extracting, from the original data set, information representing the structure of an information network graph, wherein the information representing the structure of the information network graph includes at least node information, node attribute information, edge information, and edge attribute information; the node information includes at least a node identifier and a node attribute key;
the node attribute key corresponds to the node attribute information;
the edge information includes at least an edge identifier and an edge attribute key;
the edge attribute key corresponds to the edge attribute information;
the edge describes the connection between nodes; and
storing the extracted node information, node attribute information, edge information, and edge attribute information.
2. The method according to claim 1, characterized in that:
the node information further includes a node metric value; and
the edge information further includes an edge metric value.
3. The method according to claim 1 or 2, characterized in that:
the extracted node information is stored in a node fact table;
the extracted edge information is stored in an edge fact table;
the extracted node attribute information is stored in a topology dimension table;
the extracted edge attribute information is stored in an information dimension table;
because the edge describes the connection between nodes, the information in the node fact table corresponds to the information in the edge fact table;
because the node attribute key corresponds to the node attribute information, the information in the topology dimension table corresponds to the information in the node fact table; and
because the edge attribute key corresponds to the edge attribute information, the information in the information dimension table corresponds to the information in the edge fact table.
4. The method according to claim 1, characterized in that, after the storing of the extracted node information, node attribute information, edge information, and edge attribute information, the method further comprises:
locating the data to be queried within the stored node information, node attribute information, edge information, and edge attribute information; and
performing the query on whichever of the node information, node attribute information, edge information, or edge attribute information was located.
5. The method according to claim 1, characterized in that, after the storing of the extracted node information, node attribute information, edge information, and edge attribute information, the method further comprises:
performing online graph processing operations based on the extracted node information, node attribute information, edge information, and edge attribute information.
6. The method according to claim 5, characterized in that the online graph processing operations include at least one of:
information dimension roll-up (I-OLGP), topological dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slice, dice, and pivot.
7. The method according to claim 6, characterized in that, if the extracted edge attribute information is stored in an information dimension table, the information dimension roll-up specifically comprises:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the edges stored in the information dimension table.
8. The method according to claim 6, characterized in that, if the extracted node attribute information is stored in a topology dimension table, the topological dimension aggregation operation specifically comprises:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the nodes stored in the topology dimension table.
9. An apparatus for storing data, characterized in that the apparatus comprises an acquisition unit, an extraction unit, and a storage unit;
the acquisition unit is configured to obtain an original data set;
the extraction unit is configured to extract, from the original data set, information representing the structure of an information network graph, wherein the information representing the structure of the information network graph includes at least node information, node attribute information, edge information, and edge attribute information; the node information includes at least a node identifier and a node attribute key; the node attribute key corresponds to the node attribute information; the edge information includes at least an edge identifier and an edge attribute key; the edge attribute key corresponds to the edge attribute information; and the edge describes the connection between nodes; and
the storage unit is configured to store the extracted node information, node attribute information, edge information, and edge attribute information.
10. The apparatus according to claim 9, characterized in that:
the node information further includes a node metric value; and
the edge information further includes an edge metric value.
11. The apparatus according to claim 9 or 10, characterized in that:
the extracted node information is stored in a node fact table;
the extracted edge information is stored in an edge fact table;
the extracted node attribute information is stored in a topology dimension table;
the extracted edge attribute information is stored in an information dimension table;
because the edge describes the connection between nodes, the information in the node fact table corresponds to the information in the edge fact table;
because the node attribute key corresponds to the node attribute information, the information in the topology dimension table corresponds to the information in the node fact table; and
because the edge attribute key corresponds to the edge attribute information, the information in the information dimension table corresponds to the information in the edge fact table.
12. The apparatus according to claim 9, characterized in that the apparatus further comprises a positioning unit and a query unit;
the positioning unit is configured to locate the data to be queried within the stored node information, node attribute information, edge information, and edge attribute information; and
the query unit is configured to perform the query on whichever of the node information, node attribute information, edge information, or edge attribute information was located.
13. The apparatus according to claim 9, characterized in that the apparatus further comprises a graph processing unit;
the graph processing unit is configured to perform online graph processing operations based on the extracted node information, node attribute information, edge information, and edge attribute information.
14. The apparatus according to claim 13, characterized in that the online graph processing operations in the graph processing unit include at least one of:
information dimension roll-up (I-OLGP), topological dimension roll-up (T-OLGP), asynchronous roll-up, drill-down, slice, dice, and pivot.
15. The apparatus according to claim 14, characterized in that, if the extracted edge attribute information is stored in an information dimension table, the information dimension roll-up in the graph processing unit specifically comprises:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the edges stored in the information dimension table.
16. The apparatus according to claim 14, characterized in that, if the extracted node attribute information is stored in a topology dimension table, the topological dimension aggregation operation in the graph processing unit specifically comprises:
performing a roll-up operation on the information of one attribute, or of more than one attribute, of the nodes stored in the topology dimension table.
PCT/CN2014/075570 2013-10-23 2014-04-17 Data storage method and device WO2015058500A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310505069.5A CN104572740B (en) 2013-10-23 2013-10-23 A kind of method and apparatus of storing data
CN201310505069.5 2013-10-23

Publications (1)

Publication Number Publication Date
WO2015058500A1 true WO2015058500A1 (en) 2015-04-30

Family

ID=52992190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/075570 WO2015058500A1 (en) 2013-10-23 2014-04-17 Data storage method and device

Country Status (2)

Country Link
CN (1) CN104572740B (en)
WO (1) WO2015058500A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106325756B (en) * 2015-06-15 2020-04-24 阿里巴巴集团控股有限公司 Data storage method, data calculation method and equipment
CN110019357B (en) * 2017-09-29 2021-06-29 北京国双科技有限公司 Database query script generation method and device
CN109446362B (en) * 2018-09-05 2021-07-23 深圳神图科技有限公司 Graph database structure based on external memory, graph data storage method and device
CN110737805B (en) * 2019-10-18 2022-07-19 网易(杭州)网络有限公司 Method and device for processing graph model data and terminal equipment
CN110933101B (en) * 2019-12-10 2022-11-04 腾讯科技(深圳)有限公司 Security event log processing method, device and storage medium
CN112948447A (en) * 2020-12-28 2021-06-11 福建票付通信息科技有限公司 User information efficient retrieval method based on mesh structure
CN114077680B (en) * 2022-01-07 2022-05-17 支付宝(杭州)信息技术有限公司 Graph data storage method, system and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080288524A1 (en) * 2007-05-18 2008-11-20 Microsoft Corporation Filtering of multi attribute data via on-demand indexing
US20090248715A1 (en) * 2008-03-31 2009-10-01 Microsoft Corporation Optimizing hierarchical attributes for olap navigation
CN102982103A (en) * 2012-11-06 2013-03-20 东南大学 On-line analytical processing (OLAP) massive multidimensional data dimension storage method
CN103164222A (en) * 2013-02-25 2013-06-19 用友软件股份有限公司 Multidimensional modeling system and multidimensional modeling method
CN103235793A (en) * 2013-04-01 2013-08-07 华为技术有限公司 On-line data processing method, equipment and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093495B (en) * 2006-06-22 2011-08-17 国际商业机器公司 Data processing method and system based on network relation dimension
US7774227B2 (en) * 2007-02-23 2010-08-10 Saama Technologies, Inc. Method and system utilizing online analytical processing (OLAP) for making predictions about business locations

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NIE, ZHANGYAN ET AL.: "Design of Multi-Dimensional Information Network Datawarehouse Model for Online Graph Processing. CNKI digital publishing platform", JOURNAL OF FRONTIERS OF COMPUTER SCIENCE AND TECHNOLOGY, 30 August 2013 (2013-08-30), pages 51 - 60 *
NIE, ZHANGYAN ET AL.: "Design of Multi-Dimensional Information Network Datawarehouse Model for Online Graph Processing. CNKI digital publishing platform", JOURNAL OF FRONTIERS OF COMPUTER SCIENCE AND TECHNOLOGY, 30 August 2013 (2013-08-30), pages 51 - 60 *
XU, HONGYU ET AL.: "On-Line Graphic Processing: Information Network Oriented On-Line Analytical Processing.", JOURNAL OF FRONTIERS OF COMPUTER SCIENCE AND TECHNOLOGY, vol. 9, 2012, pages 797 - 809 *

Also Published As

Publication number Publication date
CN104572740A (en) 2015-04-29
CN104572740B (en) 2019-09-13

Similar Documents

Publication Publication Date Title
WO2015058500A1 (en) Data storage method and device
Moniruzzaman et al. Nosql database: New era of databases for big data analytics-classification, characteristics and comparison
CN104881424B (en) Method for acquiring, storing and analyzing electric power big data based on regular expressions
CN101201822B (en) Method for searching visual lens based on contents
Ribeiro et al. Data modeling and data analytics: a survey from a big data perspective
US9785725B2 (en) Method and system for visualizing relational data as RDF graphs with interactive response time
CN104850601B (en) Police service based on chart database analyzes application platform and its construction method in real time
Mohammed et al. A review of big data environment and its related technologies
Ahmed et al. A literature review on NoSQL database for big data processing
Loudcher et al. Combining OLAP and information networks for bibliographic data analysis: a survey
Cuzzocrea et al. Semantics-aware advanced OLAP visualization of multidimensional data cubes
CN113535788A (en) Retrieval method, system, equipment and medium for marine environment data
Hashem et al. Evaluating NoSQL document oriented data model
Kanchi et al. Challenges and Solutions in Big Data Management--An Overview
US10628421B2 (en) Managing a single database management system
Vazhkudai et al. Constellation: A science graph network for scalable data and knowledge discovery in extreme-scale scientific collaborations
Kang et al. Distributed graph cube generation using Spark framework
Jakawat et al. Olap on information networks: A new framework for dealing with bibliographic data
Suri et al. A comparative study between the performance of relational & object oriented database in Data Warehousing
Ghrab et al. Topograph: an end-to-end framework to build and analyze graph cubes
CN112765490A (en) Information recommendation method and system based on knowledge graph and graph convolution network
CN115309789B (en) Method for generating associated data graph in real time based on intelligent dynamic business object
Akid et al. Towards NoSQL graph data warehouse for big social data analysis
Jakawat et al. Graphs enriched by cubes for OLAP on bibliographic networks
CN111399838A (en) Data modeling method and device based on Spark SQL and materialized view

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 14855623; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 14855623; Country of ref document: EP; Kind code of ref document: A1