WO2014051455A1 - Method and system for storing graph data - Google Patents

Method and system for storing graph data

Info

Publication number
WO2014051455A1
WO2014051455A1 PCT/RU2012/000783 RU2012000783W WO2014051455A1 WO 2014051455 A1 WO2014051455 A1 WO 2014051455A1 RU 2012000783 W RU2012000783 W RU 2012000783W WO 2014051455 A1 WO2014051455 A1 WO 2014051455A1
Authority
WO
WIPO (PCT)
Prior art keywords
tree
predicate
arguments
storing
data
Prior art date
Application number
PCT/RU2012/000783
Other languages
French (fr)
Inventor
Maxim Viktorovich TSYPLIAEV
Petr Evgenievich VOLYNSKY
Original Assignee
Obschestvo S Ogranichennoy Otvetstvennostyu "Colloware"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Obschestvo S Ogranichennoy Otvetstvennostyu "Colloware"
Priority to EA201500264A (patent EA201500264A1)
Publication of WO2014051455A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention is related to storage of graph data in a B-tree that substantially obviates one or several disadvantages of the related art.
  • the graph data can represent a workflow of a business process.
  • the business process can be controlled by a state machine.
  • the workflow data can be represented by a triple consisting of stages of the process (i.e., states of the state machine), a task and an assignee of the task. These triples are stored in the B-tree.
  • the data is stored in a form of triples, quadruples, quintuples, etc. in a B-tree.
  • the B-trees are data structures that allow operations on dynamic data sets.
  • the operations can be search, search for minimum and maximum values, insert, delete, reference to parent or child directory.
  • the tree can be used as a dictionary or as a prioritized chain.
  • the running time of tree operations is proportional to the height of the tree.
  • the data is read in blocks from the same location. When a tree node is loaded into main memory, the whole allocated memory block is loaded at once, so the operation executes very quickly.
  • Data is typically stored in databases.
  • the most common databases are relational databases.
  • a relational database is a set of tables consisting of columns and rows. All tables are connected (i.e., related) to each other via a primary key.
  • the data is retrieved from the database by, for example, SQL queries that return data from one or several tables.
  • One SQL query can retrieve data from several tables by using the SQL JOIN operator, which connects tables using the primary key.
  • a relational database system is very complex. For example, a simple SELECT statement for selecting the data can be executed in a variety of ways.
  • the relational database engine finds an optimal way of selecting and retrieving data.
  • the database engine applies optimization algorithms in order to return data in the fastest way possible.
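As a rough illustration of the JOIN-based retrieval described above, the following sketch uses Python's standard sqlite3 module. The table names (task, assignee), their columns and the sample rows are hypothetical and are not taken from the patent.

```python
import sqlite3

def tasks_with_assignees():
    """Join two hypothetical tables on a key and return (task, assignee) rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assignee (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE task (id INTEGER PRIMARY KEY, title TEXT,
                           assignee_id INTEGER REFERENCES assignee(id));
        INSERT INTO assignee VALUES (1, 'John'), (2, 'Alex');
        INSERT INTO task VALUES (1, 'Task 1', 1), (2, 'Task 2', 2);
    """)
    # One query touches both tables; the engine decides how to execute it.
    rows = conn.execute("""
        SELECT task.title, assignee.name
        FROM task JOIN assignee ON task.assignee_id = assignee.id
        ORDER BY task.id
    """).fetchall()
    conn.close()
    return rows
```

Calling tasks_with_assignees() returns one row per task together with the name of its assignee.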
  • FIG. 1 illustrates a graph, in accordance with the exemplary embodiment
  • FIG. 2 illustrates a database with a B-tree structure for storing triples, in accordance with the exemplary embodiment
  • FIG. 3 illustrates an example of storing predicates with corresponding attributes, in accordance with the exemplary embodiment
  • FIG. 4 illustrates an example of storing predicates with their attributes by using additional links
  • FIG. 5 illustrates an example of a B-tree of an order of 5
  • FIG. 6 illustrates another example of a graph that can be written in a form of triples
  • FIG. 7 illustrates a schematic of an exemplary computer system that can be used for implementation of the invention.
  • the workflow data can be represented by a triple (or, more generally, n-tuple) representing stages of the process (i.e., states of the state machine), a task and an assignee of the task.
  • the conditions of relationships between the state machine's states can serve as predicates of the triple.
  • a workflow can have multiple tasks and assignees for each state of the process.
  • the data can be represented, for example, by quadruples or quintuples, as well.
  • in order to speed up storage and retrieval of workflow data, the data is stored in the form of triples, quadruples, quintuples, etc. in a B-tree.
  • the trees are data structures that allow operations on dynamic data sets. The operations can be search, search for minimum and maximum values, insert, delete, reference to parent or child directory.
  • Balanced trees have minimized height. For example, the height of a balanced binary tree with n nodes is approximately log2 n.
  • B-trees are balanced trees and the speed of operations within the B-trees is proportional to their height.
  • the B-trees work effectively with a disk memory, because they minimize I/O calls.
  • triples are stored in a triplestore.
  • the triplestore is a special database for storage and retrieval of the triples.
  • the triple is a data entity composed of subject-predicate-object, like "John is 35" or "John knows Helen".
  • the information is stored in the triplestore and retrieved by a query statement.
  • the triplestore is optimized for the storage and retrieval of the triples.
  • the triples can be imported/exported using Resource Description Framework (RDF) and other formats.
  • the triplestores can store billions of triples.
  • the triplestores can be built as proprietary database engines or they can be built on top of the existing commercial relational database engines (i.e., SQL-based database engines).
  • Online analytical processing (OLAP) databases use triplestores integrated into existing database engines.
  • proprietary (native) triplestores have performance advantages.
  • a difficulty with implementing triplestores over SQL is that a graph-based RDF model is hard to query efficiently, because queries in SPARQL (an RDF query language designed to retrieve and manipulate data stored in Resource Description Framework format) must be mapped onto SQL queries.
  • the Semantic Web takes the solution further. It involves publishing in languages specifically designed for data: Resource Description Framework (RDF), Web Ontology Language (OWL), and Extensible Markup Language (XML).
  • the HTML describes documents and the links between them.
  • the RDF, the OWL, and the XML, by contrast, can describe arbitrary things such as people, meetings, car parts, etc.
  • Ontologies are considered to be the pillars of the semantic network, although they do not have a universally accepted definition.
  • a (semantic network) vocabulary can be considered as a special form of (usually light-weight) ontology, or sometimes merely as a collection of URIs with a (usually informally) described meaning.
  • Ontologies are usually assumed to be accompanied by some document in a formal ontology language, though some ontologies do not use standardized formats for that purpose.
  • ontology languages are formal languages used to construct ontologies. They allow the encoding of knowledge about specific domains and often include reasoning rules that support the processing of that knowledge.
  • the ontology languages are usually declarative languages, are almost always generalizations of frame languages, and are commonly based either on a first-order logic or on description logic.
  • content can manifest itself as descriptive data stored in Web-accessible databases, or as a markup within documents, particularly, in Extensible HTML (XHTML) interspersed with XML, or, more often, purely in XML, with layout or rendering cues stored separately.
  • the machine-readable descriptions enable content managers to add meaning to the content, i.e., to describe the structure of the available knowledge about that content. In this way, a machine can process knowledge itself, instead of a text, using processes similar to a human deductive reasoning and inference, thereby obtaining more meaningful results and helping computers to perform automated information gathering and research.
  • the Resource Description Framework is a family of specifications originally designed as a metadata data model.
  • the RDF has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.
  • the RDF data model is similar to classic conceptual modeling approaches, such as entity-relationship or class diagrams, as it is based upon the idea of making statements about resources (in particular Web resources) in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology.
  • the subject denotes the resource
  • the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object.
  • For example, one way to represent the notion "The sky has the color blue" in RDF is as the triple: a subject denoting "the sky," a predicate denoting "has the color," and an object denoting "blue." Therefore, the RDF swaps object for subject that would be used in the classical notation of an entity-attribute-value model within an object-oriented design (i.e., an object "sky," an attribute "color" and a value "blue").
  • the RDF is an abstract model with several serialization formats (i.e., file formats), and so the particular way in which a resource or triple is encoded varies from format to format.
  • This mechanism for describing resources is a major component in what is proposed by the semantic network activity: an evolutionary stage of the World Wide Web in which automated software can store, exchange, and use machine-readable information distributed throughout the Web, in turn enabling users to deal with the information with greater efficiency and certainty.
  • the RDF simple data model and ability to model disparate, abstract concepts has also led to its increasing use in knowledge management applications unrelated to the Semantic Web activity.
  • a collection of RDF statements intrinsically represents a labeled directed multi-graph.
  • an RDF-based data model is more naturally suited to certain kinds of knowledge representation than the relational model and other ontological models.
  • the RDF data is often persisted in a relational database or native representations also called triplestores or quadstores, if context (i.e., the named graph) is also persisted for each RDF triple.
  • Notation 3 is a non-XML serialization of RDF models designed to be easier to write by hand, and in some cases, easier to follow. Because it is based on a tabular notation, it makes the underlying triples encoded in the documents more easily recognizable compared to the XML serialization. N3 is closely related to the Turtle (Terse RDF Triple Language) and N-Triples formats.
  • N3 is a shorthand non-XML serialization of Resource Description Framework models, designed with human-readability in mind.
  • the N3 is much more compact and readable than XML RDF notation.
  • the N3 has several features that go beyond a serialization for RDF models, such as support for RDF-based rules. Turtle is a simplified RDF-only subset of the N3.
  • Turtle (Terse RDF Triple Language) is a serialization format for Resource Description Framework (RDF) graphs.
  • Turtle examples are also valid Notation 3.
  • the subject of the RDF statement is either a Uniform Resource Identifier (URI) or a blank node, both of which denote resources. Resources indicated by blank nodes are called anonymous resources. They are not directly identifiable from the RDF statement.
  • the predicate is a URI, which also indicates a resource, representing a relationship.
  • the object is a URI, blank node or a Unicode string literal.
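The constraints on the three positions of an RDF statement can be sketched as a small check. The class names URI, BlankNode and Literal and the function is_valid_statement are hypothetical helpers introduced only for this illustration; they are not part of any RDF library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    text: str

def is_valid_statement(subject, predicate, obj):
    """Check which kinds of term are allowed in each position of a statement."""
    return (
        isinstance(subject, (URI, BlankNode))              # URI or blank node
        and isinstance(predicate, URI)                     # always a URI
        and isinstance(obj, (URI, BlankNode, Literal))     # URI, blank node or literal
    )
```

For example, a statement whose subject is a literal is rejected, while a statement with a blank-node subject and a literal object is accepted.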
  • In semantic network (semantic web) applications, and in relatively popular applications of the RDF like RSS and FOAF (Friend of a Friend), resources tend to be represented by the URIs that intentionally denote, and can be used to access, the actual data on the World Wide Web.
  • the RDF in general, is not limited to the description of Internet-based resources.
  • the URI that names a resource does not have to be de-referenceable at all.
  • a URI that begins with "http:" and is used as the subject of an RDF statement does not necessarily have to represent a resource that is accessible via HTTP, nor does it need to represent a tangible, network-accessible resource; such a URI could represent absolutely anything.
  • a bare URI (without a # symbol) which returns a 300-level coded response when used in an HTTP GET request should be treated as denoting the Internet resource that it succeeds in accessing.
  • Triples can represent business objects and rules implemented in any programming language or expression language. According to the exemplary embodiment, special names can be used for transition rules for a business process.
  • the names can be:
  • the business rules can be stored as triples, as well.
  • the triples can be used for tasks generated by a state machine.
  • a task is an activity that needs to be accomplished (or executed) within a defined period of time.
  • a task can be assigned to a responsible person.
  • the task has a start and end date (time).
  • a task is a unit that is used to track user activities in terms of completing certain objectives defined by the task description.
  • An example of a task is a task within MS Outlook.
  • Other examples of tasks can be the tasks such as fixing a bug in a computer code, report generation, replacing a car part, cargo transport, writing an executable computer module, etc.
  • a computer code bug needs to be fixed.
  • the process detects a bug and creates an object/item "bug."
  • the object is processed according to a workflow that includes stages of object processing represented by states.
  • An item within the workflow is a configurable unit that can be transitioned from state to state and used for tracking certain business process. Item transition through the workflow can generate tasks at required steps (states) and assign them to certain users.
  • the exemplary workflow for fixing bugs operates as follows. If a bug is detected, an order is given to a bug fixer (i.e., developer) to fix the bug. After that, the bug is considered fixed.
  • the object "bug" can have several states such as "bug detected," "bug is under correction," "bug is corrected."
  • the bug handling process moves to a "bug detected" step.
  • the process orders the bug fixer (i.e., developer) to fix the detected bug.
  • a task for the bug fixer "fix bug" is generated, and the bug object is transitioned into "bug is under correction" state.
  • After the bug fixer (developer) corrects the bug, the bug fixer closes his task, and the state of the bug object transitions into the state "bug is corrected."
  • An assignee of the task closes the task upon completion.
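The bug-fixing workflow described above can be sketched as a small state machine that emits triples as it moves between states. The state names come from the text; the class BugWorkflow, its method names and the predicate names "state", "owner" and "status" are assumptions made for this illustration.

```python
# States of the "bug" object, as named in the text.
DETECTED = "bug detected"
UNDER_CORRECTION = "bug is under correction"
CORRECTED = "bug is corrected"

class BugWorkflow:
    """Hypothetical state machine for one "bug" item."""

    def __init__(self, bug_id):
        self.bug_id = bug_id
        self.state = DETECTED
        self.triples = [(bug_id, "state", DETECTED)]

    def _move(self, new_state):
        # Record every state transition as a triple.
        self.state = new_state
        self.triples.append((self.bug_id, "state", new_state))

    def assign(self, task_id, developer):
        """Generate the "fix bug" task and move the bug under correction."""
        if self.state != DETECTED:
            raise ValueError("a fix can only be ordered for a detected bug")
        self.triples.append((task_id, "owner", developer))
        self.triples.append((task_id, "status", "opened"))
        self._move(UNDER_CORRECTION)

    def close_task(self, task_id):
        """The assignee closes the task; the bug becomes corrected."""
        if self.state != UNDER_CORRECTION:
            raise ValueError("there is no open fix task for this bug")
        self.triples.append((task_id, "status", "closed"))
        self._move(CORRECTED)
```

Each transition leaves a trail of triples that could later be stored and queried.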
  • the task can have several parameters. For example, a status (open/closed), a task assignee, a generation date, a closing date, a task name, etc.
  • the task parameters can be described as triples.
  • Each task can have a unique task ID (i.e., for example a number).
  • task ID can be generated by hashing task name or creation date.
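One possible way to derive such an ID is sketched below. The text says only that the task name or the creation date can be hashed; combining both fields, using SHA-256 and truncating the digest to 12 hexadecimal characters are assumptions of this sketch, and make_task_id is a hypothetical helper.

```python
import hashlib

def make_task_id(task_name, created_at):
    """Derive a short, repeatable task ID from a name and a creation date string."""
    payload = f"{task_name}|{created_at}".encode("utf-8")
    # Truncation keeps the ID short; it also raises the chance of collisions.
    return hashlib.sha256(payload).hexdigest()[:12]
```

The same inputs always give the same ID, so the ID can be recomputed instead of being looked up.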
  • the task status can be described as a triple: 001 status opened, 002 status opened.
  • the task ID is the subject, status is the predicate and opened is the object of the triple.
  • “003 status closed” indicates that task 003 has been closed.
  • upon an operation with a task (for example, opening or closing the task), the corresponding triples are added to a database. Storing the triples in the database allows for retrieval of the task-related data for future use. For example, task status and assignment can be retrieved by a database query.
  • An exemplary query "?subject status opened" retrieves all subjects (task IDs) with an open status.
  • the IDs of the tasks 001 and 002 will be retrieved by this query.
  • the queries can be based on objects, predicates and subjects.
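The exemplary query can be illustrated with a minimal in-memory sketch. The three status triples come from the text; the function match and its convention of using None as the "?" wildcard are assumptions of this sketch, not the query language of the embodiment.

```python
TRIPLES = [
    ("001", "status", "opened"),
    ("002", "status", "opened"),
    ("003", "status", "closed"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every position that is not None."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]

# "?subject status opened": leave the subject open, fix predicate and object.
open_tasks = [s for s, _, _ in match(TRIPLES, predicate="status", obj="opened")]
```

The same function answers queries by subject or by predicate alone, since any position can be left open.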
  • storing triples in a database is inefficient. According to the exemplary embodiment, the triples are stored in B-trees or tables, which are discussed in detail herein.
  • graphs can be represented by the triples.
  • FIG. 1 illustrates an exemplary graph. This graph can be defined by a number of triples:
  • Task 1 - owner (assignee) is John;
  • Task 2 - owner (assignee) is Alex;
  • the "owner" and the "status" are predicates (P)
  • the Task 1 and the Task 2 are subjects (S)
  • John and Alex are objects (O) of the triples.
  • a B-tree is used for storing triples, quadruples (where predicate depends on three attributes), quintuples (where predicate depends on four attributes), etc. represented by graphs. Note that the actual graphs are not stored, but the corresponding triples (quadruples, quintuples, etc.) are stored in the B-tree.
  • graph (G) represents a combination of two sets: vertices V (for example, Task 1, John and Alex) and edges E (predicates 1, 2, 3, etc.). The two sets are combined based on an instance (for example, a state of a workflow).
  • the vertex v' and the edge e are considered incident, and the vertices v' and v" are considered adjacent.
  • graph G is an (n,m) graph, where n is a power of the graph (i.e., triple, quadruple, quintuple, etc.) and m is the graph size.
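The relationship between triples and the two sets of a graph can be sketched as follows. The two owner triples are those listed for FIG. 1; the functions build_graph and adjacent are hypothetical helpers, and treating each triple as one labeled edge between its subject and object is an assumption of this sketch.

```python
TRIPLES = [
    ("Task 1", "owner", "John"),
    ("Task 2", "owner", "Alex"),
]

def build_graph(triples):
    """Collect the vertex set and the list of labeled edges from triples."""
    vertices = set()
    edges = []
    for subject, predicate, obj in triples:
        vertices.update((subject, obj))          # both ends are vertices
        edges.append((subject, predicate, obj))  # the predicate labels the edge
    return vertices, edges

def adjacent(edges, v1, v2):
    """Two vertices are adjacent when some edge joins them."""
    return any({s, o} == {v1, v2} for s, _, o in edges)
```

Only the triples need to be stored; the vertex and edge sets can be rebuilt from them on demand.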
  • FIG. 2 illustrates a database with a B-tree structure for storing triples depicted in FIG. 1.
  • predicates, objects and subjects are stored in a database in a hashed form.
  • task, status and owner values can be hashed prior to storage in the table. Hashing speeds up the search process, because the hash values can be sorted within the database.
  • a key table in the database is a table containing relations between the predicates, the subjects and the objects.
  • the predicates in the key table are sorted.
  • Table of predicates 210 contains predicates P1...Pn used for selecting the subjects and the objects.
  • N1S Link ... NnS Link are values in the predicate table that represent links to tables N1S ... NnS (220 and 240, respectively).
  • the tables 220 and 240 represent relations between the subjects and the objects.
  • for the predicate name P2, a different set of tables N1O ... NnO is created.
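A loose sketch of this layout is given below: a sorted table of hashed predicates, each linked to a sorted table of hashed (subject, object) pairs. It does not reproduce the exact tables of FIG. 2. The class KeyTable, the helper h, the use of SHA-1 and the 8-character truncation are all assumptions made for the illustration.

```python
import bisect
import hashlib

def h(value):
    """Hypothetical hashing step applied before a value is stored."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()[:8]

class KeyTable:
    def __init__(self):
        self.predicates = []  # sorted hashed predicates (in the role of table 210)
        self.links = {}       # hashed predicate -> sorted (hashed subject, hashed object) rows

    def add(self, subject, predicate, obj):
        hp = h(predicate)
        if hp not in self.links:
            bisect.insort(self.predicates, hp)
            self.links[hp] = []
        bisect.insort(self.links[hp], (h(subject), h(obj)))

    def objects(self, subject, predicate):
        """Return the hashed objects stored for a subject under a predicate."""
        rows = self.links.get(h(predicate), [])
        hs = h(subject)
        # (hs,) sorts before every (hs, x), so this finds the first matching row.
        i = bisect.bisect_left(rows, (hs,))
        found = []
        while i < len(rows) and rows[i][0] == hs:
            found.append(rows[i][1])
            i += 1
        return found
```

Because the rows are kept sorted by hash, the lookup is a binary search followed by a short scan.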
  • the above example uses the triples that use only two related values (attributes). Any node of the graph can be used as an attribute. In cases when more than two related attributes are used, such as P1 (a1, a2, a3), an alternative method for storing is employed. A B-tree is also used.
  • X is a name of a variable. If X value is "darling", then "write" will print the word "darling."
  • Both a line of symbols and a variable are attributes of the predicate "write."
  • the attributes can be mixed in an arbitrary manner according to syntax rules. For example, write("Today is the ", N, "th day of ", M, ", a ", D, ".").
  • this predicate has seven attributes: four lines of symbols and the variables N, M and D.
  • the predicate has a power of seven.
  • FIG. 3 illustrates an example of storing predicates with their attributes in a B-tree, in accordance with the exemplary embodiment.
  • a table of predicates 310 contains predicates sorted in a particular order.
  • the table of predicates 310 also contains links to corresponding stored attributes for each of the predicates.
  • the first set of the predicate P1 has a power of three (i.e., the predicate has three attributes). Therefore, when the first set is written into the table 310, the table 310 contains four links to the stored attributes.
  • the first link (Link1) references a table 320 containing all first attributes (i.e., the attributes placed first in all of the sets).
  • the second link (Link2) points to a table 330 containing all second attributes of all the sets.
  • the third link (Link3) points to a table 340 which contains all of the third attributes.
  • the fourth link (Link4) is an additional link that points to a table 350, which contains a terminator.
  • the terminator is a constant value (numerical or text).
  • the terminator is used for retrieving the triples from the B-tree 300.
  • the triples are identified within the B-tree by the terminator.
  • the search continues until the terminator corresponding to a particular triple is found.
  • Each set of attributes along with a corresponding predicate is assigned an identification number upon being written into the B-tree 300. The next identification number is incremented by one. Therefore, the data sets are ordered according to when they are written into the B-tree 300. Thus, each set (a predicate, attributes and a terminator) has a corresponding unique identifier also stored in the B- tree 300.
  • each attribute has a position id, which identifies the position of the attribute in a set (i.e., first, second, third, etc.).
  • the first set contains an attribute "a1" in the first position.
  • the attribute "a1" is written into the table 320 with the identifier "1" indicating that the attribute belongs to the first set.
  • An attribute "a2" is second in the first set. Therefore, it is written into the table 330 with the identifier "1."
  • An attribute "a3" is third in the first set. Therefore, it is written into the table 340 with the identifier "1."
  • the terminator with the identifier "1" is written into the table 350. The terminator "1" indicates that there are no more attributes of the first set.
  • An attribute "a1" is first in the second set. Therefore, it is written into the table 320 with the identifier "2", which indicates that the attribute "a1" belongs to the second set.
  • the second set has only one attribute. Thus, the terminator with the identifier "2" is written into the table 330.
  • the predicates, the attributes and the terminators are stored in the same manner.
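The write path described above can be sketched with one table per attribute position, where each table maps a stored value to the identifiers of the sets that contain it. This is a simplified in-memory model, not the B-tree layout of FIG. 3. The class AttributeStore, the terminator value "#END" and the use of Python dictionaries are assumptions of the sketch.

```python
TERMINATOR = "#END"  # hypothetical constant; the text only requires a constant value

class AttributeStore:
    def __init__(self):
        self.predicates = {}  # set identifier -> predicate (in the role of table 310)
        self.tables = []      # tables[i]: value at position i -> identifiers (320, 330, ...)
        self.next_id = 1

    def insert(self, predicate, attributes):
        """Write one set and return its identifier; identifiers grow by one."""
        set_id = self.next_id
        self.next_id += 1
        self.predicates[set_id] = predicate
        # The terminator goes into the table right after the last attribute.
        for position, value in enumerate(list(attributes) + [TERMINATOR]):
            while len(self.tables) <= position:
                self.tables.append({})
            self.tables[position].setdefault(value, []).append(set_id)
        return set_id
```

Writing P1(a1, a2, a3) and then P1(a1) reproduces the placement described in the text: the second terminator lands in the second table, next to "a2".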
  • the tables 320 - 350 can contain the links to other tables instead of the actual identifiers. This example is depicted in FIG. 4.
  • the table 310 contains the link (Table 2Link) to the table 320.
  • the table 320 contains links (Table3Link and Table4Link) that point to tables 420 and 430, which contain the identifiers.
  • the tables depicted in FIGs. 4 and 5 belong to the B-tree. In other words, the tables serve as branches of the B-tree. If a particular data set (the predicate with the attributes) needs to be retrieved from the B-tree, the predicate, the attributes and the terminator are searched for based on the corresponding identifier.
  • the first set "P1" is found in the table 310. Then, the corresponding arguments are read from the tables 320, 330, 340 and 350. The attributes with the identifier "1" are found in the tables 320-350. Therefore, the attributes are retrieved in the order in which they are placed in the original set (i.e., the first attribute is placed first, the second is placed second, etc.). The terminator is found in order to make sure that all the attributes from the set have been retrieved.
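Retrieval by identifier can be sketched as a walk over the position tables that stops at the terminator. The tables below hold the two example sets from the text, P1(a1, a2, a3) with identifier 1 and P1(a1) with identifier 2; the dictionary layout, the terminator value "#END" and the function read_set are assumptions of this sketch.

```python
TERMINATOR = "#END"  # hypothetical terminator value

PREDICATES = {1: "P1", 2: "P1"}      # in the role of table 310
TABLES = [                           # in the role of tables 320, 330, 340, 350
    {"a1": [1, 2]},
    {"a2": [1], TERMINATOR: [2]},
    {"a3": [1]},
    {TERMINATOR: [1]},
]

def read_set(set_id):
    """Rebuild (predicate, attributes) for one identifier, in original order."""
    attributes = []
    for table in TABLES:
        # Find the value stored for this identifier at this position.
        value = next((v for v, ids in table.items() if set_id in ids), None)
        if value is None:
            break
        if value == TERMINATOR:
            return PREDICATES[set_id], attributes
        attributes.append(value)
    raise LookupError(f"no terminator found for set {set_id}")
```

Reaching the terminator is what confirms that every attribute of the set has been read.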
  • the first identifier is "3", which means that argument a4 is located at the third predicate in the B-tree (and possibly in subsequent ones as well).
  • the set of identifiers for argument a4 becomes the master for the search for intersecting identifiers.
  • the set of identifiers for argument a1 becomes a dependent set of identifiers, and the set of identifiers for the terminator is also dependent, and is after the a1 identifiers.
  • the set of identifiers for a1 is read from table 320 and the set of identifiers for the terminators is read from table 340. Then the set of identifiers for the first argument (here, a2) is read from table 330, and the intersecting identifiers are determined, as discussed earlier.
  • after the intersections are determined, the set of identifiers for the next argument is read from table 330 and all intersections are also determined.
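The idea of intersecting identifier sets can be sketched as follows. This is a simplified reading of the search described above: one identifier list is read per argument position plus one for the terminator, and the lists are intersected. Picking the shortest list as the master set, the table layout and the function find_sets are assumptions of this sketch, not the procedure of the embodiment.

```python
TERMINATOR = "#END"  # hypothetical terminator value

TABLES = [                           # value at each position -> set identifiers
    {"a1": [1, 2]},
    {"a2": [1], TERMINATOR: [2]},
    {"a3": [1]},
    {TERMINATOR: [1]},
]

def find_sets(tables, attributes):
    """Return identifiers of the sets whose attributes equal the given list."""
    wanted = list(attributes) + [TERMINATOR]
    if len(wanted) > len(tables):
        return []
    id_lists = [tables[pos].get(value, []) for pos, value in enumerate(wanted)]
    # Assumption: the shortest list serves as the master, the rest are dependent.
    master, *dependents = sorted(id_lists, key=len)
    result = set(master)
    for ids in dependents:
        result &= set(ids)
    return sorted(result)
```

Including the terminator in the pattern is what distinguishes the one-attribute set P1(a1) from the longer set that also starts with a1.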
  • FIG. 5 illustrates an example of a B-tree of order 5.
  • the maximum number of children that a node can have is 5 (so that 4 is the maximum number of keys).
  • each leaf node must contain at least 2 keys.
  • the B-trees usually have orders much larger than 5.
  • the B-tree is a tree data structure that contains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time.
  • the B-tree is a generalization of a binary search tree in which a node can have more than two children.
  • internal (non-leaf) nodes can have a variable number of child nodes within a predefined range. When data is inserted or removed from a node, the number of child nodes changes. In order to maintain the pre-defined range, the internal nodes can be joined or split. Because a certain range of the child nodes is allowed, the B-trees do not need to be re-balanced, as frequently as other self-balancing search trees. However, they may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, in a 2-3 B-tree (often simply referred to as a 2-3 tree), each internal node may have only 2 or 3 child nodes.
  • Each internal node of a B-tree contains a number of keys. In practice, the keys take up the most space in the node. Usually, the number of keys is chosen to vary between d and 2d. The factor of 2 guarantees that the nodes can be split or combined. If an internal node has 2d keys, then adding a key to that node can be accomplished by splitting the 2d key node into two d key nodes and adding the key to the parent node. Each split node has the required minimum number of keys. Similarly, if an internal node and its neighbor each have d keys, then a key may be deleted from the internal node by combining it with its neighbor. Deleting the key would make the internal node have d-1 keys. Joining the neighbor would add d keys plus one more key brought down from the neighbor's parent. The result is an entirely full node of 2d keys.
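The arithmetic of splitting and combining can be checked with a short sketch that works on sorted key lists only and ignores child pointers. The function names split_full_node and merge_with_neighbor are hypothetical, and promoting the median of the 2d+1 keys to the parent is the usual convention, assumed here.

```python
def split_full_node(keys, new_key, d):
    """Split a node holding 2d keys after one more key arrives.

    Returns (left keys, key promoted to the parent, right keys).
    """
    if len(keys) != 2 * d:
        raise ValueError("only a full node with 2d keys is split")
    merged = sorted(keys + [new_key])            # 2d + 1 keys in total
    return merged[:d], merged[d], merged[d + 1:]

def merge_with_neighbor(node_keys, neighbor_keys, separator, d):
    """Combine a node left with d-1 keys, its d-key neighbor and the parent's separator."""
    if len(node_keys) != d - 1 or len(neighbor_keys) != d:
        raise ValueError("expected d-1 keys in the node and d keys in the neighbor")
    return sorted(node_keys + [separator] + neighbor_keys)   # (d-1) + 1 + d = 2d keys
```

With d = 2, adding 25 to the full node [10, 20, 30, 40] yields [10, 20] and [30, 40] with 25 moving up, and the reverse merge yields a full node of four keys.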
  • the number of branches (or child nodes) from a node is one more than the number of keys stored in the node.
  • in a 2-3 B-tree, the internal nodes store either one key (with two child nodes) or two keys (with three child nodes).
  • the B-tree can be described with parameters (d+1) - (2d+1), or simply with the highest branching order (2d+1).
  • the B-tree is kept balanced by requiring that all leaf nodes are located at the same depth.
  • the depth increases slowly as elements are added to the tree, but an increase in the overall depth is infrequent and results in all leaf nodes being one more node further away from the root.
  • the B-trees have substantial advantages over alternative implementations when node access times far exceed access times within the nodes, because the cost of accessing the node can be amortized over multiple operations within the node. This usually occurs when the nodes are in secondary storage such as disk drives.
  • a maximum number of child nodes depends on the data that must be stored for each child node, and the size of a full disk block or an analogous size in a secondary storage. While 2-3 B-trees are easier to explain, practical B-trees using secondary storage require a higher number of child nodes to improve performance.
  • each node of a B-tree can have a variable number of keys and children.
  • the keys are stored in a non-decreasing order.
  • Each key has an associated child that is the root of a sub-tree containing all nodes with the keys less than or equal to the key, but greater than the preceding key.
  • a node also has an additional rightmost child that is the root for a sub-tree containing all keys greater than any keys in the node.
  • because each node tends to have a large branching factor (a large number of children), it is necessary to traverse relatively few nodes before locating the desired key. If access to each node requires a disk access, then a B-tree will minimize the number of disk accesses required.
  • the minimization factor is usually chosen so that the total size of each node corresponds to a multiple of the block size of the underlying storage device. This choice simplifies and optimizes disk access. Consequently, a B-tree is an ideal data structure for situations where all data cannot reside in a primary storage and accesses to a secondary storage are comparatively expensive (or time consuming).
  • the worst case height is O(log n). Since the "branchiness" of a B-tree can be relatively large compared to many other balanced tree structures, the base of the logarithm tends to be large. Therefore, the number of nodes visited during a search tends to be smaller than required by other tree structures. Although this does not affect the asymptotic worst case height, the B-trees tend to have smaller heights than other trees with the same asymptotic height.
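The effect of the base of the logarithm can be made concrete with a short calculation. The sketch relies on the standard textbook bound that a B-tree of height h in which every non-root node holds at least t-1 keys contains at least 2t^h - 1 keys, with height counted as the number of edges from the root to a leaf. The function max_height and the sample sizes are assumptions chosen for illustration.

```python
def max_height(n_keys, t):
    """Largest height h for which 2 * t**h - 1 <= n_keys still holds."""
    if n_keys < 1 or t < 2:
        raise ValueError("need at least one key and t >= 2")
    h = 0
    # Grow h while the minimum key count of the next height still fits.
    while 2 * t ** (h + 1) - 1 <= n_keys:
        h += 1
    return h

narrow = max_height(1_000_000, 2)     # small branching factor
wide = max_height(1_000_000, 1000)    # large branching factor
```

For one million keys the bound drops from 18 levels below the root with t = 2 to a single level with t = 1000, which is why wide nodes reduce the number of disk accesses.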
  • the B-tree must satisfy the following conditions:
  • Each node contains at least one key.
  • the keys at each node are ordered.
  • the root node contains from 1 to 2t-1 keys. Any other node contains from t-1 to 2t-1 keys (the leaves are not excluded).
  • t is a tree parameter that is no less than 2 (typically in the range 50 to 2000).
  • The leaves do not have children. Any other node having keys K1, ..., Kn contains n+1 children, where:
  • the i-th child and all its descendants contain the keys from the interval (K_(i-1), K_i);
  • the second condition can be stated as: each node of the B-tree (except for the leaves) can be considered as an ordered array consisting of the keys and references to the children in an alternating order.
  • if the key is contained in the root, it is easily found. Otherwise, the interval is determined and a corresponding child is found. These operations are repeated until the process reaches a leaf.
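The search procedure just described can be sketched on an in-memory tree. The class Node and the function search are hypothetical; a real B-tree would read each node from a disk block, whereas this sketch keeps all nodes in memory and uses the standard bisect module to locate the interval.

```python
import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys K1 .. Kn
        self.children = children or []    # n + 1 children, or none for a leaf

def search(node, key):
    """Descend from the given node until the key is found or a leaf is exhausted."""
    while True:
        # Index of the first key that is not smaller than the search key.
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:             # reached a leaf without a match
            return False
        node = node.children[i]           # child covering the interval around the key

root = Node([10, 20], [Node([1, 5]), Node([12, 15]), Node([25, 30])])
```

Each loop iteration visits one node, so the number of nodes read equals the depth at which the key is found or the search gives up.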
  • FIG. 6 illustrates an exemplary graph that can be written in a form of triples.
  • the exemplary graph can be written in form of ten triples:
  • John is a man (relation 620 to 610);
  • Bill is a man (relation 650 to 610);
  • Alex is a man (relation 660 to 610);
  • John position manager (relation 620 to 670);
  • Bill position developer (relation 650 to 680);
  • Alex position CEO (relation 660 to 690).
  • an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 20 or a server, including a processing unit 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21.
  • the system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • the system memory includes read-only memory (ROM) 24 and random access memory (RAM) 25.
  • ROM read-only memory
  • RAM random access memory
  • the computer 20 may further include a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD-ROM, DVD-ROM or other optical media.
  • the hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively.
  • the drives and their associated computer-readable media provide non-volatile storage of computer readable instructions, data structures, program modules and other data for the computer 20.
  • exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media that can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read-only memories (ROMs) and the like may also be used in the exemplary operating environment.
  • RAMs random access memories
  • ROMs read-only memories
  • a number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35.
  • the computer 20 includes a file system 36 associated with or included within the operating system 35, one or more application programs 37, 37', other program modules 38 and program data 39.
  • a user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and pointing device 42.
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner or the like.
  • serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or universal serial bus (USB).
  • a monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48.
  • personal computers typically include other peripheral output devices (not shown), such as speakers and printers.
  • the computer 20 may operate in a networked environment using logical connections to one or more remote computers 49.
  • the remote computer (or computers) 49 may be another computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated.
  • the logical connections include a local area network (LAN) 51 and a wide area network (WAN) 52.
  • LAN local area network
  • WAN wide area network
  • When used in a LAN networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet.
  • the modem 54 which may be internal or external, is connected to the system bus 23 via the serial port interface 46.
  • program modules depicted relative to the computer 20, or portions thereof may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and system for storing complex graph data. The graph data is represented by triples, quadruples, quintuples, etc. In order to speed up storage and retrieval of graph data, the data is stored in a form of triples, quadruples, quintuples, etc. in a B-tree. The B-trees are data structures that allow operations on dynamic data sets. The operations can be search, search for minimum and maximum values, insert, delete, reference to parent or child directory. The tree can be used as a dictionary or as a prioritized chain. The speed of tree operations is proportional to the height. The data is read as blocks from the same location. If a tree node is moved to an operational memory, an allocated memory block is moved and the operation executes very fast.

Description

METHOD AND SYSTEM FOR STORING GRAPH DATA
The present invention is related to storage of graph data in a B-tree that substantially obviates one or several disadvantages of the related art.
A method and system for storing complex graph data is provided. The graph data can represent a workflow of a business process. The business process can be controlled by a state machine. Then, the workflow data can be represented by a triple consisting of stages of the process (i.e., states of the state machine), a task and an assignee of the task. These triples are stored in the B-tree.
In order to speed up storage and retrieval of data, the data is stored in a form of triples, quadruples, quintuples, etc. in a B-tree. The B-trees are data structures that allow operations on dynamic data sets. The operations can be search, search for minimum and maximum values, insert, delete, reference to parent or child directory. The tree can be used as a dictionary or as a prioritized chain. The speed of tree operations is proportional to the height. The data is read as blocks from the same location. If a tree node is moved to an operational memory, an allocated memory block is moved and the operation executes very fast.
Additional features and advantages of the invention will be set forth in the description that follows, and in part will be apparent from the description, or may be learned by practice of the invention. The advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Data is typically stored in databases. The most common databases are relational databases. A relational database is a set of tables consisting of columns and rows. All tables are connected (i.e., related) to each other via a primary key. The data is retrieved from the database by, for example, SQL queries that return data from one or several tables.
One SQL query can retrieve data from several tables by using the SQL JOIN operator, which connects tables using the primary key. A relational database system is very complex. For example, a simple SELECT statement for selecting the data can be executed in a variety of ways. The relational database engine finds an optimal way of selecting and retrieving data. The database engine applies optimization algorithms in order to return data in the fastest way possible.
However, in cases of hundreds of thousands or even millions of records, data retrieval can be quite slow. The data can be stored and retrieved faster if it is stored in a form of triples, quadruples or combinations of an arbitrary number of elements. Conventional relational databases can also operate on triple- or quadruple-type data; however, this approach has a number of disadvantages, such as data redundancy, which in turn leads to anomalies and data inconsistencies. Processing and searching through triples in conventional databases is also a computationally intensive process, which leads to slowdowns when working with relational databases.
Accordingly, there is a need in the art for a system and method for efficient storage and retrieval of complex graph data formed into triples, quadruples, quintuples, etc. The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
In the drawings:
FIG. 1 illustrates a graph, in accordance with the exemplary embodiment;
FIG. 2 illustrates a database with a B-tree structure for storing triples, in accordance with the exemplary embodiment;
FIG. 3 illustrates an example of storing predicates with corresponding attributes, in accordance with the exemplary embodiment;
FIG. 4 illustrates an example of storing predicates with their attributes by using additional links;
FIG. 5 illustrates an example of a B-tree of an order of 5;
FIG. 6 illustrates another example of a graph that can be written in a form of triples;
FIG. 7 illustrates a schematic of an exemplary computer system that can be used for implementation of the invention.
A method and a system for storing complex graph data are provided. Graph data can represent a workflow of a business process. The business process can be controlled by a state machine. Then, the workflow data can be represented by a triple (or, more generally, n-tuple) representing stages of the process (i.e., states of the state machine), a task and an assignee of the task. The conditions of relationships between the state machine's states can serve as predicates of the triple. These triples are stored in the B-tree. A workflow can have multiple tasks and assignees for each state of the process. The data can be represented, for example, by quadruples or quintuples, as well.
According to an exemplary embodiment, in order to speed up storage and retrieval of workflow data, the data is stored in a form of triples, quadruples, quintuples, etc. in a B-tree. The trees are data structures that allow operations on dynamic data sets. The operations can be search, search for minimum and maximum values, insert, delete, reference to parent or child directory. The tree can be used as a dictionary or as a prioritized chain.
The running time of tree operations is proportional to the tree height. Balanced trees have minimized height. For example, the height of a balanced binary tree with n nodes is approximately log2 n. B-trees are balanced trees, and the running time of operations within the B-trees is proportional to their height. The B-trees work effectively with a disk memory, because they minimize I/O calls. The data is read as blocks from the same location.
If a tree node is moved to an operational memory, an allocated memory block is moved and the operation executes very fast. Thus, the server load is small and the waiting time is minimal. Therefore, a use of a B-tree is advantageous for storing complex data.
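As a rough illustration of the height argument, the following Python sketch compares the height of a balanced binary tree with the worst-case height of a B-tree holding the same number of keys. It relies on the standard bound that a B-tree of height h with tree parameter t holds at least 2t^h - 1 keys. The function names and the sample values (one million keys, t = 100) are assumptions chosen only for illustration.

```python
def balanced_binary_height(n):
    """Smallest height h (counted in edges) such that a binary tree can hold n keys."""
    h = 0
    # A binary tree of height h holds at most 2**(h + 1) - 1 keys.
    while 2 ** (h + 1) - 1 < n:
        h += 1
    return h


def btree_max_height(n, t):
    """Largest height h a B-tree with parameter t (t >= 2) can have for n keys."""
    h = 0
    # A B-tree of height h holds at least 2 * t**h - 1 keys.
    while 2 * t ** (h + 1) - 1 <= n:
        h += 1
    return h


n_keys = 1_000_000
binary_height = balanced_binary_height(n_keys)
btree_height = btree_max_height(n_keys, t=100)
print(binary_height, btree_height)
```

Each level of the tree corresponds to one node read, so a smaller height translates into fewer disk block reads per search.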
According to the exemplary embodiment, triples are stored in a triplestore. The triplestore is a special database for storage and retrieval of the triples. The triple is a data entity composed of subject-predicate-object, like "John is 35" or "John knows Helen". Much like a relational database, the information is stored in the triplestore and retrieved by a query statement. Unlike a relational database, the triplestore is optimized for the storage and retrieval of the triples. In addition to queries, the triples can be imported/exported using Resource Description Framework (RDF) and other formats.
The triplestores can store billions of triples. The triplestores can be built as proprietary database engines or they can be built on top of the existing commercial relational database engines (i.e., SQL-based database engines). Online analytical processing (OLAP) databases used triplestores integrated into existing database engines. However, proprietary (native) triplestores have performance advantages.
A difficulty with implementing triplestores over SQL is that efficient querying of a graph-based RDF model is hard to achieve, because it requires mapping SPARQL (an RDF query language, that is, a query language designed to retrieve and manipulate data stored in Resource Description Framework format) onto SQL queries. The Semantic Web takes the solution further. It involves publishing in languages specifically designed for data: Resource Description Framework (RDF), Web Ontology Language (OWL), and Extensible Markup Language (XML). HTML describes documents and the links between them. RDF, OWL, and XML, by contrast, can describe arbitrary things such as people, meetings, car parts, etc.
Ontologies are considered to be the pillars of the semantic network, although they do not have a universally accepted definition. A (semantic network) vocabulary can be considered as a special form of (usually light-weight) ontology, or sometimes merely as a collection of URIs with a (usually informally) described meaning.
Ontologies are usually assumed to be accompanied by some document in a formal ontology language, though some ontologies do not use standardized formats for that purpose. In computer science and artificial intelligence, ontology languages are formal languages used to construct ontologies. They allow the encoding of knowledge about specific domains and often include reasoning rules that support the processing of that knowledge. The ontology languages are usually declarative languages, are almost always generalizations of frame languages, and are commonly based either on a first-order logic or on description logic.
These technologies are combined in order to provide descriptions that supplement or replace the content of Web documents. Thus, content can manifest itself as descriptive data stored in Web-accessible databases, or as a markup within documents, particularly, in Extensible HTML (XHTML) interspersed with XML, or, more often, purely in XML, with layout or rendering cues stored separately. The machine-readable descriptions enable content managers to add meaning to the content, i.e., to describe the structure of the available knowledge about that content. In this way, a machine can process knowledge itself, instead of a text, using processes similar to human deductive reasoning and inference, thereby obtaining more meaningful results and helping computers to perform automated information gathering and research.
The Resource Description Framework (RDF) is a family of specifications originally designed as a metadata data model. The RDF has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.
The RDF data model is similar to classic conceptual modeling approaches, such as entity-relationship or class diagrams, as it is based upon the idea of making statements about resources (in particular Web resources) in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, one way to represent the notion "The sky has the color blue" in RDF is as the triple: a subject denoting "the sky," a predicate denoting "has the color," and an object denoting "blue." Therefore, the RDF swaps object for subject that would be used in the classical notation of an entity-attribute-value model within an object-oriented design (i.e., an object "sky," an attribute "color" and a value "blue"). The RDF is an abstract model with several serialization formats (i.e., file formats), and so the particular way in which a resource or triple is encoded varies from format to format.
This mechanism for describing resources is a major component in what is proposed by the semantic network activity: an evolutionary stage of the World Wide Web in which automated software can store, exchange, and use machine-readable information distributed throughout the Web, in turn enabling users to deal with the information with greater efficiency and certainty. The RDF simple data model and ability to model disparate, abstract concepts has also led to its increasing use in knowledge management applications unrelated to the Semantic Web activity.
A collection of RDF statements intrinsically represents a labeled directed multi-graph. As such, an RDF-based data model is more naturally suited to certain kinds of knowledge representation than the relational model and other ontological models. However, in practice, the RDF data is often persisted in a relational database or native representations also called triplestores or quadstores, if context (i.e., the named graph) is also persisted for each RDF triple. As the RDFS and the OWL demonstrate, additional ontology languages can be built upon the RDF.
Two common serialization formats are in use. The first is an XML format. This format is often called simply the RDF (Resource Description Framework). Notation 3 (or N3) is a non-XML serialization of RDF models designed to be easier to write by hand, and in some cases, easier to follow. Because it is based on a tabular notation, it makes the underlying triples encoded in the documents more easily recognizable compared to the XML serialization. N3 is closely related to the Turtle (Terse RDF Triple Language) and N-Triples formats.
Notation3, or N3, as it is more commonly known, is a shorthand non-XML serialization of Resource Description Framework models, designed with human-readability in mind. The N3 is much more compact and readable than XML RDF notation.
The N3 has several features that go beyond a serialization for RDF models, such as support for RDF-based rules. Turtle is a simplified RDF-only subset of the N3.
Turtle (Terse RDF Triple Language) is a serialization format for Resource Description Framework (RDF) graphs.
Here is an example of the Turtle:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/stuff/1.0/> .
<http://www.w3.org/TR/rdf-syntax-grammar>
  dc:title "RDF/XML Syntax Specification (Revised)" ;
  ex:editor [
    ex:fullname "Dave Beckett" ;
    ex:homePage <http://purl.org/net/dajobe/>
  ] .
Turtle examples are also valid Notation 3. The subject of the RDF statement is either a Uniform Resource Identifier (URI) or a blank node, both of which denote resources. Resources indicated by blank nodes are called anonymous resources. They are not directly identifiable from the RDF statement. The predicate is a URI, which also indicates a resource, representing a relationship. The object is a URI, blank node or a Unicode string literal.
In semantic network (semantic web) applications, and in relatively popular applications of the RDF like RSS and FOAF (Friend of a Friend), resources tend to be represented by the URIs that intentionally denote, and can be used to access, the actual data on the World Wide Web. But the RDF, in general, is not limited to the description of Internet-based resources. In fact, the URI that names a resource does not have to be de-referenceable at all. For example, a URI that begins with "http:" and is used as the subject of an RDF statement does not necessarily have to represent a resource that is accessible via HTTP, nor does it need to represent a tangible, network-accessible resource. Such a URI could represent absolutely anything. However, there is broad agreement that a bare URI (without a # symbol) which returns a 300-level coded response when used in an HTTP GET request should be treated as denoting the Internet resource that it succeeds in accessing.
Triples (or quadruples) can represent business objects and rules implemented in any programming language or expression language. According to the exemplary embodiment, special names can be used for transition rules of a business process. The names can be:
$property, that identifies an object;
$$now, indicating the current date and time; and
$$currentUser, identifying a current user.
The business rules can be stored as triples, as well. The triples can be used for tasks generated by a state machine. The conditions of relationships between the state machine's states can serve as predicates of the triple. In project management, a task is an activity that needs to be accomplished (or executed) within a defined period of time. A task can be assigned to a responsible person. The task has a start and end date (time).
A task is a unit that is used to track user activities in terms of completing certain objectives defined by the task description. An example of a task is a task within MS Outlook. Other examples of tasks can be the tasks such as fixing a bug in a computer code, report generation, replacing a car part, cargo transport, writing an executable computer module, etc.
For example, a computer code bug needs to be fixed. The process detects a bug and creates an object/item "bug." The object is processed according to a workflow that includes stages of object processing represented by states. An item within the workflow is a configurable unit that can be transitioned from state to state and used for tracking a certain business process. Item transition through the workflow can generate tasks at required steps (states) and assign them to certain users. The exemplary workflow for fixing bugs operates as follows. If a bug is detected, an order is given to a bug fixer (i.e., developer) to fix the bug. After that, the bug is considered fixed. The object "bug" can have several states such as "bug detected," "bug is under correction," "bug is corrected." When the bug is detected, the bug handling process moves to a "bug detected" step. The process orders the bug fixer (i.e., developer) to fix the detected bug. A task for the bug fixer "fix bug" is generated, and the bug object is transitioned into the "bug is under correction" state.
After the bug fixer (developer) corrects the bug, the bug fixer closes his task, and the state of the bug object transitions into the state "bug is corrected." An assignee of the task closes the task upon completion. Thus, the task can have several parameters. For example, a status (open/closed), a task assignee, a generation date, a closing date, a task name, etc.
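The bug-fixing workflow described above can be sketched as a small state machine in Python. This is only an illustration: the class name BugWorkflow, its method names and the dictionary used for the task are assumptions, while the state names and the "fix bug" task name are taken from the text.

```python
class BugWorkflow:
    """Hypothetical state machine for the 'bug' item described in the text."""

    def __init__(self):
        # The item starts in the state entered when a bug is detected.
        self.state = "bug detected"
        self.task = None

    def order_fix(self, assignee):
        """Generate the 'fix bug' task and move to 'bug is under correction'."""
        if self.state != "bug detected":
            raise ValueError("a fix can only be ordered for a detected bug")
        self.task = {"name": "fix bug", "assignee": assignee, "status": "opened"}
        self.state = "bug is under correction"

    def close_task(self):
        """The assignee closes the task; the item moves to 'bug is corrected'."""
        if self.state != "bug is under correction":
            raise ValueError("there is no open fix task to close")
        self.task["status"] = "closed"
        self.state = "bug is corrected"


workflow = BugWorkflow()
workflow.order_fix("developer")
state_after_order = workflow.state
workflow.close_task()
print(state_after_order, "->", workflow.state)
```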
According to the exemplary embodiment, the task parameters can be described as triples. Each task can have a unique task ID (for example, a number). The first task can have an ID=001, the second task an ID=002, etc. Note that the task ID can be generated by hashing the task name or creation date. The task status can be described as a triple: 001 status opened, 002 status opened.
In this case, the task ID is the subject, "status" is the predicate and "opened" is the object of the triple. For example, "003 status closed" indicates that task 003 has been closed. The triple "001 owner John" indicates that the task with the ID=001 is assigned to John. The triple "002 date 12 dec" indicates that the task with the ID=002 was created on December 12. Whenever an operation with a task is performed (for example, opening or closing the task), the corresponding triples are added to a database. Storing the triples in the database allows for retrieval of the task-related data for future use. For example, task status and assignment can be retrieved by a database query.
An exemplary query "?subject status opened" retrieves all subjects (task IDs) with an open status. In the above example, the IDs of the tasks 001 and 002 will be retrieved by this query. Note that the queries can be based on objects, predicates and subjects. However, storing triples in a conventional database is inefficient. According to the exemplary embodiment, the triples are stored in B-trees or tables, which are discussed in detail herein.
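The query pattern above can be sketched in Python as follows. This is a minimal illustration that scans a plain list of the example triples; the function name match and the use of None as the "?" wildcard are assumptions and do not represent the storage method of the exemplary embodiment.

```python
# Example triples taken from the text: (subject, predicate, object).
triples = [
    ("001", "status", "opened"),
    ("002", "status", "opened"),
    ("003", "status", "closed"),
    ("001", "owner", "John"),
    ("002", "date", "12 dec"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None plays the role of '?'."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# "?subject status opened": all task IDs with an open status.
open_tasks = [s for (s, _, _) in match(triples, predicate="status", obj="opened")]
print(open_tasks)
```

The same function answers queries by subject or by object, for example match(triples, subject="001") for everything known about task 001.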
According to the exemplary embodiment, graphs can be represented by the triples. FIG. 1 illustrates an exemplary graph. This graph can be defined by a number of triples:
Task 1 - owner (assignee) is John;
Task 1 status - closed;
Task 2 - owner (assignee) is Alex;
Task 2 status - opened;
Task 3 - owner (assignee) is Alex;
Task 3 status - closed.
In this example, the "owner" and the "status" are predicates (P), the Task 1 and the Task 2 are subjects (S), and John and Alex are objects (O) of the triples.
The above triples can be written as:
S1 P1 O1,
S1 P2 O2,
S2 P1 O4,
S2 P2 O3,
S3 P1 O4,
S3 P2 O2.
The triples, written in the above form, represent the relations between predicates, objects and subjects. These triples can be written as functions of the predicates:
P1(S1, O1),
P1(S2, O4),
P1(S3, O4),
P2(S1, O2),
P2(S2, O3),
P2(S3, O2).
Then, if all the tasks assigned to Alex need to be found, all the triples need to be searched through. If the number of triples exceeds, for example, a thousand, the search through triples becomes very slow and requires enormous computational resources. According to the exemplary embodiment, a B-tree is used for storing triples, quadruples (where the predicate depends on three attributes), quintuples (where the predicate depends on four attributes), etc. represented by graphs. Note that the actual graphs are not stored, but the corresponding triples (quadruples, quintuples, etc.) are stored in the B-tree.
In the example depicted in FIG. 1, predicates serve as edges of the graph, and objects and subjects serve as peaks (nodes) of the graph. In FIG. 1, graph (G) represents a combination of two sets: peaks V (for example, Task 1, John and Alex) and edges E (predicates 1, 2, 3, etc.). The two sets are combined based on an instance (for example, a state of a workflow).
Each edge e from the set E is incident to the peaks v' and v" that it connects. The peak v' and the edge e are considered incident, and the peaks v' and v" are considered adjacent. If |V(G)| = n and |E(G)| = m, then graph G is an (n, m) graph, where n is the power (order) of the graph (i.e., triple, quadruple, quintuple, etc.) and m is the graph size.
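To illustrate why ordered storage helps with the "tasks assigned to Alex" lookup, the Python sketch below contrasts a full scan with a lookup over sorted keys. A sorted list searched with the standard bisect module stands in for the B-tree here; the key order (predicate, object, subject) and the function names are assumptions made for this example.

```python
from bisect import bisect_left

# Triples of FIG. 1 as (subject, predicate, object).
triples = [
    ("Task 1", "owner", "John"), ("Task 1", "status", "closed"),
    ("Task 2", "owner", "Alex"), ("Task 2", "status", "opened"),
    ("Task 3", "owner", "Alex"), ("Task 3", "status", "closed"),
]


def scan(triples, predicate, obj):
    """Full scan: every triple is inspected."""
    return [s for (s, p, o) in triples if p == predicate and o == obj]


# Ordered keys: all entries sharing a predicate and an object are adjacent.
index = sorted((p, o, s) for (s, p, o) in triples)


def lookup(index, predicate, obj):
    """Binary search for the first matching key, then read the adjacent run."""
    i = bisect_left(index, (predicate, obj))
    subjects = []
    while i < len(index) and index[i][:2] == (predicate, obj):
        subjects.append(index[i][2])
        i += 1
    return subjects


print(scan(triples, "owner", "Alex"), lookup(index, "owner", "Alex"))
```

Both functions return the same tasks, but the ordered lookup touches only the matching run of keys instead of every triple.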
According to the exemplary embodiment, data storage in a form of a B-tree is used in order to make searches more efficient. The B-tree can be implemented as a set of tables. FIG. 2 illustrates a database with a B-tree structure for storing triples depicted in FIG. 1. According to the exemplary embodiment, predicates, objects and subjects are stored in a database in a hashed form. In other words, task, status and owner values can be hashed prior to storage in the table. Hashing allows for speeding up the search process by sorting the hash values within the database.
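A minimal Python sketch of storing hashed values in sorted order is given below. The text does not name a hash function, so the use of SHA-1 from the standard hashlib module is an assumption, as are the function names and the sample values.

```python
import hashlib
from bisect import bisect_left


def hash_value(text):
    """Fixed-width hex digest of a value (SHA-1 is an assumed choice)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


values = ["Task 1", "owner", "John", "status", "closed", "Alex"]

# The table is kept sorted by hash, so a lookup can use binary search.
table = sorted((hash_value(v), v) for v in values)
keys = [h for h, _ in table]


def find(value):
    """Return the stored value whose hash matches, or None if it is absent."""
    h = hash_value(value)
    i = bisect_left(keys, h)
    if i < len(keys) and keys[i] == h:
        return table[i][1]
    return None


print(find("Alex"), find("Helen"))
```

Because every digest has the same length, keys of very different original sizes are compared and sorted uniformly.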
The database search is performed by predicates in order to find relations between the graph peaks. Thus, a key table in the database is a table containing relations between the predicates, the subjects and the objects. The predicates in the key table are sorted. Table of predicates 210 contains predicates P1...Pn used for selecting the subjects and the objects.
Note that this method of storing data is not limited to triples. The data of quadruples, etc. can be stored in the key table. N1S Link ... NnS Link are values in the predicate table that represent links to tables N1S ... NnS (220 and 240 accordingly). The tables 220 and 240 represent relations between the subject and the objects. The predicates table 210 also contains links to tables N1O ... NnO (230 and 250 accordingly) that also represent relations between the subject and the objects. For some predicates (i.e., for predicates of one type), for example, P1=owner, a set of its own tables N1S ... NnS is created, and for other types of predicates, for example, P2=name, a different set of tables N1O ... NnO is created. Note that the above example uses the triples that use only two related values (attributes). Any node of the graph can be used as an attribute. In cases when more than two related attributes are used, such as P1(a1, a2, a3), an alternative method for storing is employed. A B-tree is also used.
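One possible reading of the key table is sketched below in Python: each predicate entry links to one table keyed by subject and one table keyed by object. This interpretation of the linked tables, the class name KeyTable and its methods are assumptions made for illustration, and plain dictionaries stand in for the B-tree tables.

```python
from collections import defaultdict


class KeyTable:
    """Predicate table; each entry links to a subject table and an object table."""

    def __init__(self):
        self.entries = {}

    def add(self, subject, predicate, obj):
        entry = self.entries.setdefault(
            predicate,
            {"by_subject": defaultdict(list), "by_object": defaultdict(list)},
        )
        entry["by_subject"][subject].append(obj)
        entry["by_object"][obj].append(subject)

    def predicates(self):
        # The predicates in the key table are kept in sorted order.
        return sorted(self.entries)

    def subjects(self, predicate, obj):
        entry = self.entries.get(predicate)
        return list(entry["by_object"].get(obj, [])) if entry else []

    def objects(self, predicate, subject):
        entry = self.entries.get(predicate)
        return list(entry["by_subject"].get(subject, [])) if entry else []


key_table = KeyTable()
for s, p, o in [
    ("Task 1", "owner", "John"), ("Task 1", "status", "closed"),
    ("Task 2", "owner", "Alex"), ("Task 2", "status", "opened"),
    ("Task 3", "owner", "Alex"), ("Task 3", "status", "closed"),
]:
    key_table.add(s, p, o)

alex_tasks = key_table.subjects("owner", "Alex")
print(alex_tasks)
```

With this layout, a search starts from the predicate and follows a single link, rather than visiting every stored triple.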
An example of a predicate with three attributes can be a statement:
"a cup color red yesterday" or "color(a cup, red, yesterday)". In other words, "color" is the predicate, "cup" is the subject, "red" is the object, and "yesterday" is a circumstance (an additional attribute) of this predicate. An example of a predicate with four attributes can be a statement: Europe("France," "Germany," "Spain," "Italy"). All attributes: France, Germany, Spain and Italy are countries. A common relation is that they are the countries in Europe. Thus, Europe is the predicate. A predicate "write" built into a programming language, such as Prolog, can be used. This predicate is responsible for passing data to a computer screen. Such predicates do not require special description in the code and can be easily used: write("A synonym for 'brave' is ")
The quotes in this statement are used for defining a line of symbols: "A synonym for 'brave' is". The predicate "write" can contain names of variables. In this case the quotes are not required.
A simple example is: write(X),
where X is a name of a variable. If X value is "darling", then "write" will print the word "darling."
Both a line of symbols and a variable are attributes of the predicate "write." The attributes can be mixed in an arbitrary manner according to syntax rules. For example, write("Today is the ",N,"th day of ",M,", a ",D,".").
This predicate will print "Today is the 19th day of August, a Tuesday.", if the values of the variables N, M and D are "19", "August" and "Tuesday," accordingly.
In this example "write" is a predicate and:
"Today is the "
N
"th day of "
M
", a "
D
"."
are the seven attributes of this predicate. In other words, the predicate has a power of seven.
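The behavior of the "write" predicate in this example can be imitated with a variadic Python function. This is only an analogy, not Prolog itself: the Python function named write and the tuple of attributes are assumptions used to show that the predicate receives seven attributes, string literals and variables mixed.

```python
def write(*attributes):
    """Concatenate the attributes in order, as the example predicate does for output."""
    return "".join(str(a) for a in attributes)


N, M, D = "19", "August", "Tuesday"
attributes = ("Today is the ", N, "th day of ", M, ", a ", D, ".")

line = write(*attributes)
print(line)              # Today is the 19th day of August, a Tuesday.
print(len(attributes))   # 7
```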
Consider a B-tree that needs to store the following sets:
a first set P1(a1, a2, a3);
a second set P1(a1);
a third set P1(a1, a4);
a fourth set P1(S1, O4);
a fifth set P1(a1, a5).
FIG. 3 illustrates an example of storing predicates with their attributes in a B-tree, in accordance with the exemplary embodiment. A table of predicates 310 contains predicates sorted in a particular order. The table of predicates 310 also contains links to corresponding stored attributes for each of the predicates.
Writing of the predicates and their corresponding attributes into a B-tree 300 is implemented as follows: the first set of the predicate P1 has a power of three (i.e., the predicate has three attributes). Therefore, when the first set is written into the table 310, the table 310 contains four links to the stored attributes. The first link (Link1) references a table 320 containing all first attributes (i.e., the attributes placed first in all of the sets). The second link (Link2) points to a table 330 containing all second attributes of all the sets. The third link (Link3) points to a table 340 which contains all of the third attributes. The fourth link (Link4) is an additional link that points to a table 350, which contains a terminator.
According to the exemplary embodiment, the terminator is a constant value (numerical or text). The terminator is used for retrieving the triples from the B-tree 300. The triples are identified within the B-tree by the terminator. When the attributes are read from the B-tree 300, the search continues until the terminator corresponding to a particular triple is found.
Each set of attributes along with a corresponding predicate is assigned an identification number upon being written into the B-tree 300. Each subsequent identification number is incremented by one. Therefore, the data sets are ordered according to when they are written into the B-tree 300. Thus, each set (a predicate, attributes and a terminator) has a corresponding unique identifier also stored in the B-tree 300.
Additionally, each attribute has a position id, which identifies the position of the attribute in a set (i.e., first, second, third, etc.). In the example depicted in FIG. 3, the first set contains an attribute "a1" in the first position. Thus, the attribute "a1" is written into the table 320 with the identifier "1" indicating that the attribute belongs to the first set.
An attribute "a2" is second in the first set. Therefore, it is written into the table 330 with the identifier "1." An attribute "a3" is third in the first set. Therefore, it is written into the table 340 with the identifier "1." The terminator with the identifier "1" is written into the table 350. The terminator "1" indicates that there are no more attributes of the first set.
An attribute "a1" is first in the second set. Therefore, it is written into the table 320 with the identifier "2", which indicates that the attribute "a1" belongs to the second set. The second set has only one attribute. Thus, the terminator with the identifier "2" is written into the table 330.
For all other sets, the predicates, the attributes and the terminators are stored in the same manner. Note that the tables 320 - 350 can contain links to other tables instead of the actual identifiers. This example is depicted in FIG. 4. The table 310 contains the link (Table2Link) to the table 320. The table 320 contains links (Table3Link and Table4Link) that point to tables 420 and 430 that contain the identifiers.
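The write procedure described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the implementation of the embodiment: plain dictionaries stand in for the B-tree branches, and the names `PredicateStore`, `write` and `TERMINATOR` are assumptions introduced here. Only the first three example sets are written.

```python
from collections import defaultdict

TERMINATOR = "<END>"  # hypothetical constant; the text only requires a constant value


class PredicateStore:
    """Toy model of tables 310-350; dicts stand in for B-tree branches."""

    def __init__(self):
        self.next_id = 1
        # predicate -> list of position tables (320, 330, ...);
        # each position table maps a value to the ascending ids of the sets holding it
        self.tables = defaultdict(list)

    def write(self, predicate, *args):
        set_id = self.next_id          # ids grow by one per written set
        self.next_id += 1
        positions = self.tables[predicate]
        # one entry per argument position, plus one entry for the terminator
        for pos, value in enumerate(list(args) + [TERMINATOR]):
            if pos == len(positions):
                positions.append(defaultdict(list))
            positions[pos][value].append(set_id)
        return set_id


store = PredicateStore()
store.write("P1", "a1", "a2", "a3")   # set 1
store.write("P1", "a1")               # set 2
store.write("P1", "a1", "a4")         # set 3
```

After these three writes, the first position table holds "a1" with identifiers 1, 2 and 3, and the terminator of the second set sits in the second position table, as in the walkthrough above.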
Note that the tables depicted in FIGs. 4 and 5 belong to the B-tree. In other words, the tables serve as branches of the B-tree. If a particular data set (the predicate with the attributes) needs to be retrieved from the B-tree, the predicate, the attributes and the terminator are searched for based on the corresponding identifier.
For example, if the first set needs to be retrieved, the predicate "P1" of the first set is found in the table 310. Then, the corresponding arguments are read from the tables 320, 330, 340 and 350. The attributes with the identifier "1" are found in the tables 320-350. Therefore, the attributes are retrieved in the order in which they are placed in the original set (i.e., the first attribute is placed first, the second is placed second, etc.). The terminator is found in order to make sure that all the attributes from the set have been retrieved.
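Retrieval by identifier can be illustrated with a short Python sketch. It assumes the position tables already hold the first three example sets (identifiers 1 to 3); the literal `position_tables`, the function `read_set` and the `TERMINATOR` constant are hypothetical names used only for this illustration.

```python
TERMINATOR = "<END>"  # hypothetical constant terminator value

# Position tables after writing P1(a1, a2, a3), P1(a1) and P1(a1, a4)
# as sets 1, 2 and 3 (plain dicts stand in for B-tree branches).
position_tables = [
    {"a1": [1, 2, 3]},                         # table 320: first attributes
    {"a2": [1], TERMINATOR: [2], "a4": [3]},   # table 330: second attributes
    {"a3": [1], TERMINATOR: [3]},              # table 340: third attributes
    {TERMINATOR: [1]},                         # table 350: terminator of set 1
]


def read_set(tables, set_id):
    """Collect the attributes of one set, position by position, until its terminator."""
    args = []
    for table in tables:
        value = next((v for v, ids in table.items() if set_id in ids), None)
        if value is None:
            raise KeyError(f"set {set_id} has no entry at this position")
        if value == TERMINATOR:
            return args            # all attributes of the set have been read
        args.append(value)
    raise ValueError(f"terminator for set {set_id} not found")


first_set = read_set(position_tables, 1)
```

Reading stops at the terminator, so sets of different power can share the same position tables.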
When reading a set from the B-tree, the following approach can be used (for triplets, quadruplets, quintuplets, etc.):
As an example, consider a case where we need to find the set P1(a1, a4). Attribute a1 is in the first place, and attribute a4 is in the second. Therefore, the search is done in Table 2 (320) and in Table 3 (330). Also, Table 4 (340), where the terminator is written as described above, needs to be searched.
From Table 320, argument a1 is read, and the identifiers of a1 are also read. Then, in Table 330, argument a4 is found, and the identifiers of a4 are also found. Then, Table 340 is read for the identifiers of the terminator.
Then, a search is performed for intersecting identifiers for al, a4 and the terminator.
Take the first identifier ("1") for a1, which means that argument a1 belongs to the first predicate written into the B-tree.
Then, considering the identifiers for argument a4, the first identifier is "3", which means that argument a4 is located at the third predicate in the B-tree (and possibly in subsequent ones as well). Then, the set of identifiers for argument a4 becomes the master for the search for intersecting identifiers. The set of identifiers for argument a1 becomes a dependent set of identifiers, and the set of identifiers for the terminator is also dependent, and is after the a1 identifiers.
Then, a search is performed for an identifier of argument a1 that is equal to "3". Then, an identifier equal to "3" needs to be found in the set of identifiers for the terminators. Once found, it means that P1(a1, a4) exists in the B-tree. The set of identifiers for the terminators becomes the master set, and the process of finding the identifiers continues in the same vein. Another example is as follows:
Consider a case where all the predicates P1(a1, ?) need to be found. "?" here means that we are looking for predicates of power 2, where the subject is argument a1, and the object is any other argument, e.g., a5, a0, a2, etc. Since the power of the predicate is 2 (as in the previous example), we need to search the tables 320, 330 and 340. As in the previous example, we need to read the set of identifiers for argument a1 and the set of identifiers for the terminators. The second argument is not known, therefore, the search process is as follows:
The set of identifiers for a1 is read from table 320 and the set of identifiers for the terminators is read from table 340. Then the set of identifiers for the first candidate argument (here, a2) is read from table 330, and the intersecting identifiers are determined, as discussed earlier.
Once all the intersections are determined, the set of identifiers for the next argument is read from table 330 and all intersections are also determined.
The process continues until all identifiers for all arguments are read from table 330, and compared to the set of identifiers for argument a1 and the terminators.
The search for remaining predicates is done in a similar manner.
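The two searches described above (an exact set and a set with an unknown second argument) both reduce to intersecting ascending lists of identifiers. The following Python sketch is one possible reading of that procedure, not the embodiment's code: the list holding the largest current identifier plays the role of the "master", and the other lists are advanced to catch up. The sample tables contain only the first three example sets, and all names are assumptions made for this illustration.

```python
TERMINATOR = "<END>"  # hypothetical constant terminator value

# Position tables for sets 1-3: P1(a1, a2, a3), P1(a1), P1(a1, a4).
position_tables = [
    {"a1": [1, 2, 3]},                         # table 320
    {"a2": [1], TERMINATOR: [2], "a4": [3]},   # table 330
    {"a3": [1], TERMINATOR: [3]},              # table 340
    {TERMINATOR: [1]},                         # table 350
]


def intersect_sorted(lists):
    """Return the ids present in every ascending list."""
    if not lists or any(not ids for ids in lists):
        return []
    idx = [0] * len(lists)
    result = []
    while True:
        # the largest current id comes from the "master" list
        target = max(ids[i] for ids, i in zip(lists, idx))
        all_equal = True
        for k, ids in enumerate(lists):
            while idx[k] < len(ids) and ids[idx[k]] < target:
                idx[k] += 1            # dependent lists catch up with the master
            if idx[k] == len(ids):
                return result
            if ids[idx[k]] != target:
                all_equal = False
        if all_equal:
            result.append(target)
            for k, ids in enumerate(lists):
                idx[k] += 1
                if idx[k] == len(ids):
                    return result


def find_exact(tables, *args):
    """Ids of sets whose attributes are exactly args, e.g. P1(a1, a4)."""
    pattern = list(args) + [TERMINATOR]
    if len(pattern) > len(tables):
        return []
    return intersect_sorted([tables[pos].get(v, []) for pos, v in enumerate(pattern)])


def find_second_arguments(tables, first_arg):
    """All sets of power 2 matching P1(first_arg, ?), grouped by second argument."""
    if len(tables) < 3:
        return {}
    firsts = tables[0].get(first_arg, [])
    ends = tables[2].get(TERMINATOR, [])
    matches = {}
    for candidate, ids in tables[1].items():
        if candidate == TERMINATOR:
            continue
        hit = intersect_sorted([firsts, ids, ends])
        if hit:
            matches[candidate] = hit
    return matches
```

With the sample tables, `find_exact(position_tables, "a1", "a4")` yields the identifier 3, matching the walkthrough, while the candidate a2 is rejected in the wildcard search because its set has no terminator in table 340.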
FIG. 5 illustrates an example of a B-tree of order 5. This means that (other than the root node) all internal nodes have at least ceil(5/2) = ceil(2.5) = 3 children (and hence at least 2 keys). Of course, the maximum number of children that a node can have is 5 (so that 4 is the maximum number of keys). According to condition 4 (see below), each leaf node must contain at least 2 keys. In practice, the B-trees usually have orders a lot larger than 5. The B-tree is a tree data structure that contains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in which a node can have more than two children.
In the B-trees, internal (non-leaf) nodes can have a variable number of child nodes within a predefined range. When data is inserted into or removed from a node, the number of child nodes changes. In order to maintain the predefined range, the internal nodes can be joined or split. Because a certain range of the child nodes is allowed, the B-trees do not need to be re-balanced as frequently as other self-balancing search trees. However, they may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, in a 2-3 B-tree (often simply referred to as a 2-3 tree), each internal node may have only 2 or 3 child nodes.
Each internal node of a B-tree contains a number of keys. In practice, the keys take up the most space in the node. Usually, the number of keys is chosen to vary between d and 2d. The factor of 2 guarantees that the nodes can be split or combined. If an internal node has 2d keys, then adding a key to that node can be accomplished by splitting the 2d key node into two d key nodes and adding the key to the parent node. Each split node has the required minimum number of keys. Similarly, if an internal node and its neighbor each have d keys, then a key may be deleted from the internal node by combining it with its neighbor. Deleting the key would make the internal node have d-1 keys. Joining the neighbor would add d keys plus one more key brought down from the neighbor's parent. The result is an entirely full node of 2d keys.
The number of branches (or child nodes) from a node is one more than the number of keys stored in the node. In a 2-3 B-tree, the internal nodes store either one key (with two child nodes) or two keys (with three child nodes). The B-tree can be described with parameters (d+1) - (2d+1), or simply with the highest branching order (2d+1).
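The arithmetic for an order-5 tree can be checked with a small Python helper. The function name `btree_bounds` is an assumption introduced for this illustration; the formulas are the ones stated above (at least ceil(m/2) children, at most m children, and one key fewer than children).

```python
import math


def btree_bounds(order):
    """Child and key limits for non-root internal nodes of a B-tree of the given order."""
    min_children = math.ceil(order / 2)
    return {
        "min_children": min_children,
        "min_keys": min_children - 1,   # a node has one key fewer than children
        "max_children": order,
        "max_keys": order - 1,
    }


order5 = btree_bounds(5)
```

For order 5 this gives 3 to 5 children and 2 to 4 keys per non-root internal node.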
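The key counts in the split and join cases can be traced with a minimal Python sketch. It works on bare key lists only (child pointers and parent updates are omitted), and the function names and sample keys are assumptions chosen for illustration.

```python
def split_on_insert(full_keys, new_key, d):
    """Split a node holding 2d keys when one more key arrives.

    Returns (left, median, right): two nodes of d keys each and the key
    that moves up into the parent node.
    """
    if len(full_keys) != 2 * d:
        raise ValueError("node is expected to hold exactly 2d keys")
    keys = sorted(full_keys + [new_key])          # 2d + 1 keys in total
    return keys[:d], keys[d], keys[d + 1:]


def merge_after_delete(node_keys, key, separator, neighbor_keys, d):
    """Delete a key from a d-key node and join it with a d-key neighbor.

    The separator is the key brought down from the parent; the result
    holds (d - 1) + 1 + d = 2d keys.
    """
    if len(node_keys) != d or len(neighbor_keys) != d:
        raise ValueError("both nodes are expected to hold exactly d keys")
    if key not in node_keys:
        raise KeyError(key)
    remaining = [k for k in node_keys if k != key]
    return sorted(remaining + [separator] + neighbor_keys)


left, median, right = split_on_insert([10, 20, 30, 40], 25, d=2)
merged = merge_after_delete([10, 20], 20, 25, [30, 40], d=2)
```

With d = 2, the split leaves two nodes of two keys each and sends one key to the parent, and the join produces a full node of four keys.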
The B-tree is kept balanced by requiring that all leaf nodes are located at the same depth. The depth increases slowly as elements are added to the tree, but an increase in the overall depth is infrequent and results in all leaf nodes being one more node further away from the root.
The B-trees have substantial advantages over alternative implementations when node access times far exceed access times within the nodes, because the cost of accessing the node can be amortized over multiple operations within the node. This usually occurs when the nodes are in secondary storage such as disk drives. By maximizing the number of child nodes within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced. In addition, rebalancing the tree is needed less often.
A maximum number of child nodes depends on the data that must be stored for each child node, and the size of a full disk block or an analogous size in a secondary storage. While 2-3 B-trees are easier to explain, practical B-trees using secondary storage require a higher number of child nodes to improve performance.
Unlike a binary tree, each node of a B-tree can have a variable number of keys and children. The keys are stored in a non-decreasing order. Each key has an associated child that is the root of a sub-tree containing all nodes with the keys less than or equal to the key, but greater than the preceding key. A node also has an additional rightmost child that is the root for a sub-tree containing all keys greater than any keys in the node.
Since each node tends to have a large branching factor (a large number of children), it is necessary to traverse relatively few nodes before locating the desired key. If access to each node requires a disk access, then a B-tree will minimize the number of disk accesses required. The minimization factor is usually chosen so that the total size of each node corresponds to a multiple of the block size of the underlying storage device. This choice simplifies and optimizes disk access. Consequently, a B-tree is an ideal data structure for situations where all data cannot reside in a primary storage and accesses to a secondary storage are comparatively expensive (or time consuming).
For n greater than or equal to one, the height h of an n-key B-tree T with a minimum degree t greater than or equal to 2 satisfies
$h \le \log_t \frac{n+1}{2}$
The worst-case height is O(log n). Since the "branchiness" of a B-tree can be relatively large compared to many other balanced tree structures, the base of the logarithm tends to be large. Therefore, the number of nodes visited during a search tends to be smaller than required by other tree structures. Although this does not affect the asymptotic worst case height, the B-trees tend to have smaller heights than other trees with the same asymptotic height.
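To give a feel for the numbers, the height bound can be evaluated with integer arithmetic. The sketch below assumes the bound is the standard one for a minimum degree t, namely that the height is at most the logarithm to base t of (n+1)/2, which is equivalent to 2·t^h ≤ n + 1; the function name `max_height` is hypothetical.

```python
def max_height(n_keys, t):
    """Largest height h with 2 * t**h <= n_keys + 1, i.e. h <= log_t((n + 1) / 2)."""
    if n_keys < 1 or t < 2:
        raise ValueError("requires n >= 1 and t >= 2")
    h = 0
    while 2 * t ** (h + 1) <= n_keys + 1:
        h += 1
    return h


small = max_height(7, 2)               # binary-like branching
large = max_height(10**9, 1000)        # wide nodes, as is typical on disk
```

Under these assumptions a tree with t = 1000 holding a billion keys has a height of at most 2, which is why a search touches so few nodes.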
The B-tree must satisfy the following conditions:
- Each node contains at least one key. The keys in each node are ordered. The root node contains from 1 to 2t-1 keys. Any other node, including the leaves, contains from t-1 to 2t-1 keys, where t is a tree parameter that is no less than 2 (typically in the range of 50 to 2000);
- The leaves do not have children. Any other node having keys K1, ..., Kn contains n+1 children, where:
- the first child and all its children have the keys from the interval (-infinity, K1);
- for 2 <= i <= n, the i-th child and all its children contain the keys from the interval (Ki-1, Ki);
- the (n+1)-th child and all its children contain the keys from the interval (Kn, infinity);
- all leaves have the same depth.
The second condition can be stated as follows: each node of the B-tree (except for the leaves) can be considered as an ordered array consisting of the keys and references to the children in an alternating order.
According to the exemplary embodiment, if the key is contained in the root, it is easily found. Otherwise, the interval is determined and a corresponding child is found. These operations are repeated until the process reaches the leaf.
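The lookup described above can be sketched as follows. This is a generic B-tree search over an in-memory node structure, offered as an illustration rather than the embodiment's code; the `Node` class and the sample tree are assumptions.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                      # ordered keys K1..Kn
    children: list = field(default_factory=list)    # n + 1 children, empty for a leaf


def search(node, key):
    """Descend from the root, choosing the child whose interval contains the key."""
    while True:
        i = bisect_left(node.keys, key)             # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True                             # key found in this node
        if not node.children:
            return False                            # reached a leaf without a match
        node = node.children[i]                     # interval (K_i, K_{i+1}) in 0-based terms


root = Node(
    keys=[20, 40],
    children=[Node([5, 10]), Node([25, 30]), Node([45, 50])],
)
found = search(root, 30)
missing = search(root, 35)
```

Each loop iteration visits one node, so the number of node accesses is bounded by the height of the tree plus one.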
FIG. 6 illustrates an exemplary graph that can be written in the form of ten triples:
John is a man (relation 620 to 610);
Michael is a man (relation 630 to 610);
Morgan is a man (relation 640 to 610);
Bill is a man (relation 650 to 610);
Alex is a man (relation 660 to 610);
John position manager (relation 620 to 670);
Michael position manager (relation 630 to 670);
Morgan position developer (relation 640 to 680);
Bill position developer (relation 650 to 680);
Alex position CEO (relation 660 to 690).
The same triples can be written as follows:
Is (John, a man);
Is (Michael, a man);
Is (Morgan, a man);
Is (Bill, a man);
Is (Alex, a man);
Position (John, manager);
Position (Michael, manager);
Position (Morgan, developer);
Position (Bill, developer);
Position (Alex, CEO).
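The conversion from subject-predicate-object triples to the predicate form listed above can be illustrated in Python. The tuple layout, the lower-case predicate names and the function `to_predicate_form` are assumptions made for this sketch.

```python
from collections import defaultdict

# The ten triples of FIG. 6 as (subject, predicate, object) tuples.
triples = [
    ("John", "is", "a man"),
    ("Michael", "is", "a man"),
    ("Morgan", "is", "a man"),
    ("Bill", "is", "a man"),
    ("Alex", "is", "a man"),
    ("John", "position", "manager"),
    ("Michael", "position", "manager"),
    ("Morgan", "position", "developer"),
    ("Bill", "position", "developer"),
    ("Alex", "position", "CEO"),
]


def to_predicate_form(rows):
    """Group triples by predicate, keeping (subject, object) as the argument list."""
    grouped = defaultdict(list)
    for subject, predicate, obj in rows:
        grouped[predicate].append((subject, obj))
    return dict(grouped)


by_predicate = to_predicate_form(triples)
```

Each entry of `by_predicate` corresponds to one predicate with its sets of two arguments, which is the shape in which the sets are written into the predicate and attribute tables.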
With reference to FIG. 7, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 20 or a server, including a processing unit 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21.
The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read-only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24.
The computer 20 may further include a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD-ROM, DVD-ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and their associated computer-readable media provide non-volatile storage of computer readable instructions, data structures, program modules and other data for the computer 20.
Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media that can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read-only memories (ROMs) and the like may also be used in the exemplary operating environment.
A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35. The computer 20 includes a file system 36 associated with or included within the operating system 35, one or more application programs 37, 37', other program modules 38 and program data 39. A user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner or the like.
These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers typically include other peripheral output devices (not shown), such as speakers and printers. The computer 20 may operate in a networked environment using logical connections to one or more remote computers 49. The remote computer (or computers) 49 may be another computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated. The logical connections include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, Intranets and the Internet.
When used in a LAN networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet.
The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

Claims

1. A computer-implemented method for storing graph data representing a workflow of a business process, the method comprising:
generating the workflow of the business process;
generating a graph having a plurality of peaks and connecting edges;
generating a graph data having a predicate and arguments corresponding to the peaks;
storing the predicate in a predicate table of a B-tree;
storing the arguments in separate tables of the B-tree;
storing relations between the arguments in a relations table;
generating a link to the relations table;
generating the links to the tables containing the arguments;
storing the link to the relations table and the links to the tables containing the arguments in the predicate table, wherein relations between the predicate and the arguments are determined by the edges of the graph.
2. The method of claim 1, wherein the graph data is represented by any of:
a triple;
a quadruple; and
a quintuple.
3. The method of claim 1, wherein the predicate has an arbitrary number of corresponding arguments.
4. The method of claim 1, further comprising assigning a unique identifier to the predicate and corresponding arguments.
5. The method of claim 4, further comprising sorting the predicates in the predicate table based on the identifier.
6. The method of claim 1, further comprising generating a terminator value and storing it in a terminator table within the B-tree.
7. The method of claim 6, wherein the terminator value indicates an end of the arguments corresponding to the predicate.
8. The method of claim 1, wherein the tables referenced by the links stored in the predicate table contain links to other tables containing arguments and terminators.
9. The method of claim 1, wherein the predicates are hashed prior to storing in the predicate table.
10. The method of claim 9, wherein the predicates are sorted by sorting corresponding hash values.
11. The method of claim 1, wherein the arguments are hashed prior to storing in the argument table.
12. The method of claim 11, wherein the arguments are sorted by sorting corresponding hash values.
13. The method of claim 1, wherein the search for predicates in the B-tree includes identifying intersections of attributes of the identifiers.
14. The method of claim 1, wherein the identifiers are sorted based on order of writing of sets of identifiers into the B-tree.
15. The method of claim 1, wherein the B-tree is a binary search tree that has a plurality of nodes.
16. The method of claim 1, wherein the arguments are subjects and objects.
17. A system for storing graph data representing a workflow of a business process, the system comprising:
a processor;
a memory connected to the processor;
a computer program logic stored in the memory and executed on the processor for implementing the steps of claim 1.
PCT/RU2012/000783 2012-09-26 2012-09-27 Method and system for storing graph data WO2014051455A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EA201500264A EA201500264A1 (en) 2012-09-26 2012-09-27 METHOD AND SYSTEM FOR STORAGE OF GRAPH DATA

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2012141077 2012-09-26
RU2012141077/08A RU2605387C2 (en) 2012-09-26 2012-09-26 Method and system for storing graphs data

Publications (1)

Publication Number Publication Date
WO2014051455A1 true WO2014051455A1 (en) 2014-04-03

Family

ID=50388704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2012/000783 WO2014051455A1 (en) 2012-09-26 2012-09-27 Method and system for storing graph data

Country Status (3)

Country Link
EA (1) EA201500264A1 (en)
RU (1) RU2605387C2 (en)
WO (1) WO2014051455A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6644767B2 (en) * 2014-08-29 2020-02-12 コーニンクレッカ フィリップス エヌ ヴェKoninklijke Philips N.V. Method and apparatus for hierarchical data analysis based on cross-correlation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080243770A1 (en) * 2007-03-29 2008-10-02 Franz Inc. Method for creating a scalable graph database

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9105367D0 (en) * 1991-03-13 1991-04-24 Univ Strathclyde Computerised information-retrieval database systems
RU2242048C2 (en) * 2003-02-18 2004-12-10 Онтос Аг Method for automated processing of text information materials
US7089266B2 (en) * 2003-06-02 2006-08-08 The Board Of Trustees Of The Leland Stanford Jr. University Computer systems and methods for the query and visualization of multidimensional databases
US7698267B2 (en) * 2004-08-27 2010-04-13 The Regents Of The University Of California Searching digital information and databases

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ABADI DANIEL J. ET AL.: "SW-Store: a vertically partitioned DBMS for Semantic Web data management.", THE VLDB JOURNAL, vol. 18, 2009, pages 385 - 406 *
COPELAND GEORGE.P. ET AL.: "A decomposition storage model.", PROCEEDINGS OF SIGMOD, 1985, pages 268 - 279 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017061901A1 (en) * 2015-10-08 2017-04-13 Общество с ограниченной ответственностью "Колловэар" System and method for processing graph data
CN106452884A (en) * 2016-10-21 2017-02-22 北京云图科瑞科技有限公司 A data distribution method and apparatus in a block chain system
CN106452884B (en) * 2016-10-21 2019-06-25 北京云图科瑞科技有限公司 Data distributing method and device in block catenary system

Also Published As

Publication number Publication date
RU2605387C2 (en) 2016-12-20
RU2012141077A (en) 2014-04-10
EA201500264A1 (en) 2015-06-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12885314

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 201500264

Country of ref document: EA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12885314

Country of ref document: EP

Kind code of ref document: A1