CN111538804A - HBase-based graph data processing method and equipment - Google Patents

HBase-based graph data processing method and equipment

Info

Publication number
CN111538804A
Authority
CN
China
Prior art keywords
point
data
edge
key
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010313655.XA
Other languages
Chinese (zh)
Inventor
谢翔
王光勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jinganjia New Technology Co ltd
Original Assignee
Beijing Jinganjia New Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jinganjia New Technology Co ltd filed Critical Beijing Jinganjia New Technology Co ltd
Priority to CN202010313655.XA priority Critical patent/CN111538804A/en
Publication of CN111538804A publication Critical patent/CN111538804A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a graph data processing method and device based on HBase, wherein the method comprises the following steps: receiving a data loading request sent by a user, and acquiring graph data to be loaded based on the data loading request, wherein the graph data to be loaded comprises point data and edge data; submitting the point data to a point data table according to a point key corresponding to the point data; submitting the edge data to an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data; and loading the graph data to be loaded into the HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table and the third HFile file corresponding to the point association table, thereby realizing the high-efficiency processing of the mass graph data based on the HBase.

Description

HBase-based graph data processing method and equipment
Technical Field
The present application relates to the field of HBase data processing technologies, and in particular, to a method and device for processing graph data based on HBase.
Background
HBase is a distributed, highly reliable, high-performance, scalable, column-oriented open-source database, mainly used for storing unstructured and semi-structured, loosely organized data. HBase can handle very large tables: by scaling horizontally across a cluster of inexpensive computers, it processes data tables consisting of more than 10 million rows and millions of columns.
A graph database is a type of non-relational database. A graph database focuses on the graph formed by association relationships, and its aim is to store and analyze the association relationships between entities in the real world: entities are abstracted as vertices, and associations between entities are abstracted as edges. The graph structure formed by the vertices and the edges expresses a world in which everything is interconnected in a visual and natural way, and also addresses the performance problem of deep retrieval over complex association relationships.
Due to the rapid development of networks, data is growing explosively, and the applicant finds that the following problem exists:
Although the HBase in the prior art can store massive data, it cannot store massive graph data in the manner of a graph database, and the data storage capacity supported by existing graph databases is limited.
Therefore, how to implement efficient processing of massive graph data based on HBase is a technical problem to be solved at present.
Disclosure of Invention
The invention provides a graph data processing method based on HBase, which is used for solving the technical problem that HBase cannot be used to efficiently process massive graph data in the prior art.
In some embodiments, the method comprises:
receiving a data loading request sent by a user, and acquiring graph data to be loaded based on the data loading request, wherein the graph data to be loaded comprises point data and edge data;
submitting the point data into a point data table according to a point key corresponding to the point data, wherein the point key is generated by inverting and hashing a point identifier of the point data, and the point data table is generated according to the corresponding relation between the point data and the point key;
submitting the edge data to an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data, wherein the edge key is generated by inverting and hashing the edge identifier, the edge data table is generated according to the corresponding relation between the edge data and the edge key, and the point association table is generated according to a first point identifier and a second point identifier corresponding to the edge identifier;
and loading the graph data to be loaded into HBase based on a first HFile file corresponding to the point data table, a second HFile file corresponding to the edge data table and a third HFile file corresponding to the point association table.
In some embodiments, before submitting the edge data into an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data, the method further includes:
determining the first point identifier and the second point identifier according to the edge identifier;
judging whether the point key of the first point identifier and the point key of the second point identifier exist in the point data table based on an application program interface of the HBase;
if so, determining the edge key according to the edge identifier;
if not, determining that the data loading fails.
In some embodiments, submitting the edge data to an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data specifically includes:
converting the edge data and the edge key into a first adding object based on an application program interface of the HBase;
submitting the edge data into the edge data table based on the first adding object;
converting the point key of the first point identifier and the point key of the second point identifier into a second adding object based on the application program interface of the HBase;
submitting the edge data into the point association table based on the second adding object.
In some embodiments, the method further comprises:
when a data retrieval request sent by a user is received, determining a retrieval item according to the data retrieval request, wherein the retrieval item is point data, edge data or path data;
if the retrieval item is the point data, determining a point data retrieval result based on the point identifier in the data retrieval request and the point data table;
if the retrieval item is the edge data, determining an edge data retrieval result based on the edge identifier in the data retrieval request and the edge data table;
and if the retrieval item is the path data, determining a path data retrieval result according to the point key of the path starting point, the point key of the path ending point and the point association table in the data retrieval request.
In some embodiments, determining a path data retrieval result according to the point key of the path starting point, the point key of the path ending point and the point association table in the data retrieval request specifically includes:
acquiring a point key of an associated point connected with the path starting point based on the point key of the path starting point and the point association table;
judging whether the point key of the path end point exists in the point keys of the associated points;
if yes, determining that the path data is successfully retrieved;
if not, performing recursive retrieval based on the point keys of the associated points and the point association table, and reducing the preset retrieval depth by one each time the recursive retrieval is performed, until the point key of the path end point is found and the path data retrieval is determined to be successful, or until the retrieval depth is less than or equal to zero and the path data retrieval is determined to have failed.
In some embodiments, submitting the point data to a point data table according to a point key corresponding to the point data specifically includes:
converting the point data and the point key into a third adding object according to an application program interface of the HBase;
and submitting the point data into a point data table based on the third adding object.
In some embodiments, the edge identifier further includes direction information of the first point identifier and the second point identifier, where the direction information is specifically unidirectional, non-directional, or bidirectional, and the first point identifier and the second point identifier are separated by the invisible character \1.
In some embodiments, the key of the point association table is the point key of the first point identifier, the column of the point association table is the edge key, the value of the point association table is the point key of the second point identifier, and the column family of the point association table is a preset default value.
In some embodiments, the first HFile file, the second HFile file, and the third HFile file are generated based on MapReduce.
Correspondingly, the invention also provides graph data processing equipment based on HBase, and the equipment comprises:
the receiving module is used for receiving a data loading request sent by a user, wherein the data loading request carries graph data to be loaded, and the graph data to be loaded comprises point data and edge data;
the first submitting module is used for submitting the point data to a point data table according to a point key corresponding to the point data, wherein the point key is generated by inverting and hashing a point identifier of the point data, and the point data table is generated according to the corresponding relation between the point data and the point key;
a second submission module, configured to submit the edge data to an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data, where the edge key is generated based on inverting and hashing the edge identifier, the edge data table is generated according to a corresponding relationship between the edge data and the edge key, and the point association table is generated according to a first point identifier and a second point identifier corresponding to the edge identifier;
and the loading module is used for loading the graph data to be loaded into the HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table and the third HFile file corresponding to the point association table.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses a graph data processing method and device based on HBase, wherein the method comprises the following steps: receiving a data loading request sent by a user, and acquiring graph data to be loaded based on the data loading request, wherein the graph data to be loaded comprises point data and edge data; submitting the point data to a point data table according to a point key corresponding to the point data; submitting the edge data to an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data; and loading the graph data to be loaded into the HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table and the third HFile file corresponding to the point association table, and generating a point key and an edge key through inversion and hash operations, so that a hot spot phenomenon of the HBase is avoided, and the high-efficiency processing of the mass graph data based on the HBase is realized.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic flow chart illustrating an HBase-based graph data processing method according to an embodiment of the present invention;
Fig. 2 shows a schematic structural diagram of an HBase-based graph data processing device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As described in the background art, although the HBase in the prior art can store mass data, it cannot store mass graph data in a graph database, and the data storage amount supported by the existing graph database is limited.
In order to solve the above problems, the embodiment of the present application provides an HBase-based graph data processing method, which submits point data to a point data table according to a point key corresponding to the point data; submitting the edge data to an edge data table and a point association table according to the edge key corresponding to the edge data and the edge identifier of the edge data; and loading the graph data to be loaded into the HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table and the third HFile file corresponding to the point association table, thereby realizing the high-efficiency processing of the mass graph data based on the HBase.
Fig. 1 shows a schematic flow chart of an HBase-based graph data processing method according to an embodiment of the present invention, where the method includes the following steps:
Step S101, receiving a data loading request sent by a user, and acquiring graph data to be loaded based on the data loading request, wherein the graph data to be loaded comprises point data and edge data.
Specifically, points in graph data generally refer to entities, such as people or accounts; edges in the graph data generally refer to relationships, such as friendships or transfer actions, and attributes, such as a person's name, a person's age, or a time of transfer, are also included in the graph data. As described above, the data loading request carries the graph data to be loaded, the graph data to be loaded includes point data and edge data, the point data may include a point identifier and a point attribute, and the edge data may include an edge identifier and an edge attribute.
Step S102, submitting the point data to a point data table according to the point key corresponding to the point data.
Specifically, the point data table is generated from the correspondence between the point data and the point keys, so that this correspondence is stored in the point data table. The point key is generated by inverting and hashing the point identifier of the point data; that is, the point key is the RowKey corresponding to the point data, and using it as the RowKey avoids the hot spot phenomenon in HBase. The hot spot phenomenon occurs when a large number of clients directly access one or a few nodes of the HBase cluster. Such heavy access may push the single machine hosting the hot region beyond its capacity, causing performance degradation and even region unavailability. The specific inversion and hash operations are prior art and are not described here.
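The application treats the inversion and hash operations as prior art and does not specify them. A minimal Python sketch of one possible reading follows, assuming that inversion means reversing the identifier string and that a short MD5 prefix serves as the hash. The function name `make_row_key`, the prefix length and the `_` separator are illustrative assumptions, not details from the application.

```python
import hashlib

def make_row_key(identifier: str, prefix_len: int = 4) -> str:
    """Build a RowKey by reversing the identifier and prepending a hash prefix.

    Assumption: this particular reversal-plus-prefix scheme is illustrative only.
    """
    reversed_id = identifier[::-1]
    digest = hashlib.md5(reversed_id.encode("utf-8")).hexdigest()
    # The hash prefix spreads sequential identifiers across the key space.
    return f"{digest[:prefix_len]}_{reversed_id}"

# Sequential identifiers no longer share a common leading prefix by construction.
keys = [make_row_key(f"user{n:04d}") for n in range(1, 4)]
```

Because the key is derived deterministically from the identifier, a later lookup can recompute the same RowKey from the point identifier alone.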
In order to more accurately submit point data to a point data table, in a preferred embodiment of the present application, the submitting the point data to the point data table according to a point key corresponding to the point data specifically includes:
converting the point data and the point key into a third adding object according to an application program interface of the HBase;
and submitting the point data into a point data table based on the third adding object.
As described above, the point data and the point key are converted into the third adding object, that is, the put object of the HBase, through the application program interface of the HBase, and the point data is submitted into the point data table based on the third adding object.
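The conversion into a put object can be pictured with a small in-memory model. The sketch below is not the HBase client API: the `Put` class, `point_to_put`, `submit`, the column family `"d"` and the choice of one column per attribute are all hypothetical stand-ins used only to show the shape of the data.

```python
from dataclasses import dataclass, field

@dataclass
class Put:
    """Stand-in for a put object: one row key plus (family, qualifier) -> value cells."""
    row: str
    cells: dict = field(default_factory=dict)

    def add_column(self, family: str, qualifier: str, value: str) -> "Put":
        self.cells[(family, qualifier)] = value
        return self

def point_to_put(point_key: str, attributes: dict, family: str = "d") -> Put:
    """Convert point data and its point key into a put (assumed: one column per attribute)."""
    put = Put(point_key)
    for name, value in attributes.items():
        put.add_column(family, name, str(value))
    return put

def submit(table: dict, put: Put) -> None:
    """Apply the put to an in-memory table keyed by row key."""
    table.setdefault(put.row, {}).update(put.cells)

point_table = {}
submit(point_table, point_to_put("k1", {"name": "Alice", "age": 30}))
```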
It should be noted that the scheme of the above preferred embodiment is only a specific implementation scheme proposed in the present application, and other ways of submitting the point data to the point data table according to the point key corresponding to the point data all belong to the protection scope of the present application.
Step S103, submitting the edge data to an edge data table and a point association table according to the edge key corresponding to the edge data and the edge identifier of the edge data.
Specifically, since each edge also involves the points at its two ends, the edge data needs to be stored in both the edge data table and the point association table, thereby ensuring reliable storage of the edge data. The edge data table is generated according to the correspondence between the edge data and the edge keys, so that this correspondence is stored in the edge data table. The point association table is generated according to the first point identifier and the second point identifier corresponding to the edge identifier, so that the correspondence between the two point identifiers of an edge is stored in the point association table. The edge key is generated by inverting and hashing the edge identifier; that is, the edge key is the RowKey corresponding to the edge data, and using it as the RowKey avoids the hot spot phenomenon in HBase. The specific inversion and hash operations are prior art and are not described here.
In order to generate an accurate edge key, in a preferred embodiment of the present application, before submitting the edge data into an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data, the method further includes:
determining the first point identifier and the second point identifier according to the edge identifier;
judging whether the point key of the first point identifier and the point key of the second point identifier exist in the point data table based on an application program interface of the HBase;
if so, determining the edge key according to the edge identifier;
if not, determining that the data loading fails.
Specifically, the first point identifier and the second point identifier corresponding to the edge identifier are determined according to the edge identifier, and the application program interface of the HBase is then used to judge whether the point key of the first point identifier and the point key of the second point identifier exist in the point data table. If they exist, the edge key is determined according to the edge identifier; if not, the data cannot be loaded, and the data loading is determined to have failed.
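A minimal sketch of this check, under stated assumptions: the point data table is modelled as a Python dict keyed by point key, `make_key` is a placeholder (plain reversal) for the unspecified inversion-and-hash step, and the separator "\1" is read as the control character U+0001. All names are hypothetical.

```python
SEPARATOR = "\x01"  # assumption: "\1" in the text denotes the control character U+0001

def make_key(identifier: str) -> str:
    """Placeholder key derivation (plain reversal); the real scheme is not specified."""
    return identifier[::-1]

def resolve_edge_key(edge_id: str, point_table: dict) -> str:
    """Return the edge key if both end points exist in the point data table, else fail."""
    first_vid, second_vid = edge_id.split(SEPARATOR)
    for vid in (first_vid, second_vid):
        if make_key(vid) not in point_table:
            raise LookupError(f"point {vid!r} not found; data loading fails")
    return make_key(edge_id)

point_table = {make_key("v1"): {}, make_key("v2"): {}}
edge_key = resolve_edge_key("v1" + SEPARATOR + "v2", point_table)
```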
In order to accurately submit edge data to an edge data table and a point association table, in a preferred embodiment of the present application, the submitting the edge data to the edge data table and the point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data specifically includes:
converting the edge data and the edge key into a first adding object based on an application program interface of the HBase;
submitting the edge data into the edge data table based on the first adding object;
converting the point key of the first point identifier and the point key of the second point identifier into a second adding object based on the application program interface of the HBase;
submitting the edge data into the point association table based on the second adding object.
Specifically, the edge data and the edge key are converted into a first adding object, that is, a first put object, based on the application program interface of the HBase, and the edge data is submitted into the edge data table based on the first adding object; the point key of the first point identifier and the point key of the second point identifier are converted into a second adding object, that is, a second put object, based on the application program interface of the HBase, and the edge data is submitted into the point association table based on the second adding object.
It should be noted that the above solution of the preferred embodiment is only a specific implementation solution proposed in the present application, and other ways of submitting edge data to the edge data table and the point association table according to the edge key and the edge identifier all belong to the protection scope of the present application.
In order to further improve the accuracy of the edge key, in a preferred embodiment of the present application, the edge identifier further includes direction information of the first point identifier and the second point identifier, where the direction information is specifically unidirectional, non-directional, or bidirectional, and the first point identifier and the second point identifier are separated by the invisible character \1.
Specifically, in a specific application scenario of the present application, an edge represents a relationship between two vertices, and its key is likewise designed around a unique edge identifier EID. The edge identifier EID may be built from the point identifiers VID: the two VIDs are separated by the invisible character \1, and the EID expresses the direction of its edge through the order in which the VIDs are named, for example:
a) VID2 points to VID1: the EID is VID2\1VID1;
b) VID1 and VID2 are non-directional or bidirectional: two records are stored at the same time, one with EID VID1\1VID2 and one with EID VID2\1VID1;
c) VID1 points to VID2: the EID is VID1\1VID2.
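The three cases can be sketched as a small helper. This is an illustration only: the function name `edge_ids`, the direction labels and the reading of "\1" as the control character U+0001 are assumptions.

```python
SEPARATOR = "\x01"  # assumption: "\1" denotes the control character U+0001

def edge_ids(vid1: str, vid2: str, direction: str) -> list:
    """Return the EID(s) to store for an edge between vid1 and vid2.

    direction: "forward" (vid1 -> vid2), "backward" (vid2 -> vid1),
    or "both" (non-directional or bidirectional, stored as two records).
    """
    forward = f"{vid1}{SEPARATOR}{vid2}"
    backward = f"{vid2}{SEPARATOR}{vid1}"
    if direction == "forward":
        return [forward]
    if direction == "backward":
        return [backward]
    if direction == "both":
        return [forward, backward]
    raise ValueError(f"unknown direction: {direction!r}")
```

Storing two records for the non-directional and bidirectional cases means that a lookup starting from either end point finds the edge without a separate reverse index.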
It should be noted that the above solution of the preferred embodiment is only one specific implementation solution proposed in the present application, and all the composition manners of other edge identifiers belong to the protection scope of the present application.
In order to more accurately store the corresponding relationship between the first point identifier and the second point identifier corresponding to the edge identifier based on the point association table, in a preferred embodiment of the present application, the key of the point association table is the point key of the first point identifier, the column of the point association table is the edge key, the value of the point association table is the point key of the second point identifier, and the column family of the point association table is a preset default value.
Each column in the HBase table belongs to a column family. The column family is part of the table schema (and the columns are not), and must be defined before the table can be used. Access control, disk and memory usage statistics are performed at the column family level.
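The layout of the point association table described above can be modelled in memory as follows. The sketch uses a nested dict in place of an HBase table; `DEFAULT_FAMILY`, `add_association` and `neighbours` are hypothetical names, and the family value `"f"` stands in for the unspecified preset default.

```python
DEFAULT_FAMILY = "f"  # assumption: stands in for the preset default column family

def add_association(table: dict, first_point_key: str, edge_key: str,
                    second_point_key: str) -> None:
    """Row = point key of the first point, column = edge key, value = point key of the second point."""
    row = table.setdefault(first_point_key, {})
    row[(DEFAULT_FAMILY, edge_key)] = second_point_key

def neighbours(table: dict, point_key: str) -> list:
    """All point keys reachable over one edge from the given point key."""
    return sorted(table.get(point_key, {}).values())

association_table = {}
add_association(association_table, "pA", "eAB", "pB")
add_association(association_table, "pA", "eAC", "pC")
```

With this layout, reading a single row returns every edge leaving a point together with the point at the other end, which is the access pattern that path retrieval relies on.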
Those skilled in the art can select other structures for the point association table according to actual needs, which does not affect the scope of protection of the present application.
Step S104, loading the graph data to be loaded into the HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table, and the third HFile file corresponding to the point association table.
Specifically, since HFile is the file format in which HBase organizes its stored data, the data needs to be loaded into the HBase through HFile files; that is, the graph data to be loaded is loaded into the HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table, and the third HFile file corresponding to the point association table.
In order to generate HFile files efficiently, in a preferred embodiment of the present application, the first HFile file, the second HFile file, and the third HFile file are generated based on MapReduce.
Specifically, MapReduce is a programming model for parallel computation over large-scale data sets (larger than 1 TB). An implementation specifies a Map function, which maps a set of key-value pairs into a new set of key-value pairs, and a concurrent Reduce function, which merges all mapped key-value pairs that share the same key.
The first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table and the third HFile file corresponding to the point association table are generated through MapReduce, so that the HFile files are generated efficiently.
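The Map and Reduce roles can be illustrated with a single-process model. The sketch below is not Hadoop code and does not write real HFiles: it only shows records being mapped to (row key, cell) pairs, grouped by key, and emitted in sorted row-key order. The record shape, the reversal used as a placeholder key, and all function names are assumptions.

```python
from collections import defaultdict

def map_record(record: dict):
    """Map: turn one input record into (row_key, (column, value)) pairs."""
    row_key = record["id"][::-1]  # placeholder for the point key derivation
    for column, value in record["attrs"].items():
        yield row_key, (column, value)

def reduce_row(row_key: str, cells: list):
    """Reduce: merge all cells that share a row key, sorted by column."""
    return row_key, sorted(cells)

def run_job(records: list) -> list:
    """Group mapped pairs by key (the shuffle step), then reduce in sorted key order."""
    groups = defaultdict(list)
    for record in records:
        for key, cell in map_record(record):
            groups[key].append(cell)
    return [reduce_row(key, groups[key]) for key in sorted(groups)]

rows = run_job([
    {"id": "v2", "attrs": {"name": "Bob"}},
    {"id": "v1", "attrs": {"name": "Alice", "age": "30"}},
])
```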
In order to efficiently search data, in a preferred embodiment of the present application, when a data search request sent by a user is received, a search item is determined according to the data search request, where the search item is point data, or edge data, or path data;
if the retrieval item is the point data, determining a point data retrieval result based on the point identifier in the data retrieval request and the point data table;
if the retrieval item is the edge data, determining an edge data retrieval result based on the edge identifier in the data retrieval request and the edge data table;
and if the retrieval item is the path data, determining a path data retrieval result according to the point key of the path starting point, the point key of the path ending point and the point association table in the data retrieval request.
Specifically, a data retrieval request of a user is received and a retrieval item is determined, if the retrieval item is point data, a corresponding point key is obtained according to a point identifier in the data retrieval request, whether the point data corresponding to the point key exists in a point data table is judged based on the point key, if yes, the retrieval is determined to be successful, and if not, the retrieval is failed;
if the retrieval item is edge data, obtaining a corresponding edge key based on an edge identifier in the data retrieval request, and judging whether the edge data corresponding to the edge key exists in an edge data table based on the edge key, if so, determining that the retrieval is successful, otherwise, failing to retrieve;
and if the retrieval item is the path data, determining a path data retrieval result according to the point key of the path starting point, the point key of the path ending point and the point association table in the data retrieval request.
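The three retrieval branches can be sketched as a simple dispatch. The request shape, the function names and the plain-reversal `make_key` placeholder are assumptions made for illustration; tables are modelled as dicts, and the path search is passed in as a callable rather than implemented here.

```python
def make_key(identifier: str) -> str:
    """Placeholder key derivation (plain reversal); the real scheme is not specified."""
    return identifier[::-1]

def retrieve(request: dict, point_table: dict, edge_table: dict, path_search=None):
    """Dispatch a retrieval request; returns None when a point or edge is not found."""
    item = request["item"]
    if item == "point":
        return point_table.get(make_key(request["id"]))
    if item == "edge":
        return edge_table.get(make_key(request["id"]))
    if item == "path":
        if path_search is None:
            raise ValueError("path retrieval needs a path_search callable")
        return path_search(make_key(request["start"]), make_key(request["end"]))
    raise ValueError(f"unknown retrieval item: {item!r}")

point_table = {make_key("v1"): {"name": "Alice"}}
edge_table = {make_key("v1\x01v2"): {"since": "2020"}}
```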
In order to determine an accurate path data retrieval result, in a preferred embodiment of the present application, a path data retrieval result is determined according to a point key of a path starting point, a point key of a path ending point and the point association table in the data retrieval request, specifically:
acquiring a point key of an associated point connected with the path starting point based on the point key of the path starting point and the point association table;
judging whether the point key of the path end point exists in the point keys of the associated points;
if yes, determining that the path data is successfully retrieved;
if not, performing recursive retrieval based on the point keys of the associated points and the point association table, and reducing the preset retrieval depth by one each time the recursive retrieval is performed, until the point key of the path end point is found and the path data retrieval is determined to be successful, or until the retrieval depth is less than or equal to zero and the path data retrieval is determined to have failed.
Specifically, the point association table is queried based on the point key of the path starting point to obtain the point keys of the associated points connected with the path starting point, and it is then judged whether the point key of the path end point exists among the point keys of the associated points. If yes, the path data retrieval is determined to be successful. Otherwise, recursive retrieval is performed based on the point keys of the associated points and the point association table; the recursive retrieval sequentially retrieves each point on the paths leading from the associated points. In order to improve the retrieval efficiency, a retrieval depth is set in advance and is reduced by one each time the recursive retrieval is performed. The recursion continues until the point key of the path end point is found, in which case the path data retrieval is determined to be successful, or until the retrieval depth is less than or equal to zero, in which case the path data retrieval is determined to have failed.
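The recursive retrieval can be sketched as a depth-limited search. This is a minimal illustration, assuming the point association table is modelled as a dict from a point key to the set of point keys of its associated points, and assuming the search stops once the remaining depth reaches zero; `path_exists` is a hypothetical name.

```python
def path_exists(table: dict, start_key: str, end_key: str, depth: int) -> bool:
    """Return True if end_key is reachable from start_key within `depth` edges."""
    if depth <= 0:
        return False  # retrieval depth exhausted: retrieval fails
    associated = table.get(start_key, ())
    if end_key in associated:
        return True  # the end point is directly associated: retrieval succeeds
    # Otherwise recurse from each associated point with the depth reduced by one.
    return any(path_exists(table, key, end_key, depth - 1) for key in associated)

# A small cyclic graph: a -> b -> c -> a
table = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
```

The depth limit also guarantees termination on cyclic graphs such as the one above, since each recursive step consumes one unit of depth regardless of whether a point has been visited before.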
It should be noted that the above solution of the preferred embodiment is only a specific implementation solution proposed in the present application, and other ways of determining the path data retrieval result according to the point key of the path starting point, the point key of the path ending point, and the point association table all belong to the protection scope of the present application.
By applying the above technical solution, a data loading request sent by a user is received, and the graph data to be loaded, which comprises point data and edge data, is obtained based on the data loading request; the point data is submitted to a point data table according to the point key corresponding to the point data; the edge data is submitted to an edge data table and a point association table according to the edge key corresponding to the edge data and the edge identifier of the edge data; and the graph data to be loaded is loaded into HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table, and the third HFile file corresponding to the point association table. Because the point keys and edge keys are generated through inversion and hash operations, the hot spot phenomenon of HBase is avoided, and efficient processing of massive graph data based on HBase is achieved.
In order to further illustrate the technical idea of the present invention, the technical solution of the present invention will now be described with reference to specific application scenarios.
The embodiment of the invention provides an HBase-based graph data processing method, which achieves efficient processing of massive graph data based on HBase by generating point keys for point data and edge keys for edge data, and by performing graph data loading and graph data retrieval based on the point keys and the edge keys.
In the embodiment of the present application, before graph data processing is performed, it is also necessary to determine the structure of a key corresponding to graph data, which is also equivalent to determining the structure of a RowKey of HBase, and the structure of a key is described below:
i. Vertex. A vertex is the main information-carrying entity. The key design for a point is mainly intended to guarantee the uniqueness of point data in the graph database, and the unique point identifier VID plays an important role in the application, for example in judging whether a point exists, quickly looking up point information, and maintaining points. Because of the RowKey characteristics of the HBase database, and in order to avoid the hot spot phenomenon of HBase, the point identifier VID is first subjected to value-to-HASH to obtain the point key HVID, and the correspondence HVID <-> point data is stored in the point data table.
Here, RowKey is the primary key used to retrieve records, as in other NoSQL databases. There are only three ways to access rows in an HBase table: 1. access via a single RowKey; 2. a range scan over RowKeys; 3. a full table scan.
The RowKey (row key) can be any character string (the maximum length is 64 KB; in practical applications the length is generally 10-100 bytes), and it is stored as a byte array in HBase. Data is stored sorted by the lexicographic order (byte order) of the RowKey.
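To illustrate the ordering rule above, the following minimal Python sketch sorts a few row keys by byte-wise lexicographic comparison. The key values are hypothetical and chosen only for illustration; the point is that keys sharing a prefix end up adjacent, which is why sequential identifiers tend to concentrate in one region.

```python
# Hypothetical row keys, held as byte arrays in the way HBase stores them.
row_keys = [b"user10", b"user2", b"user1", b"admin7"]

# Python compares bytes objects lexicographically, byte by byte,
# which mirrors the RowKey ordering described above.
sorted_keys = sorted(row_keys)

for key in sorted_keys:
    print(key.decode("utf-8"))
```

Note that "user10" sorts before "user2", because the comparison is made byte by byte and not numerically.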
In order to avoid hot spots, the embodiment of the present application introduces "value-to-HASH", that is, the data is subjected to an inversion (reversal) operation followed by a hashing operation.
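The value-to-HASH operation can be sketched as follows. This is only an illustration under stated assumptions: the text does not name a hash function, so MD5 is used here as a stand-in, and the function name value_to_hash and the sample identifiers are hypothetical.

```python
import hashlib

def value_to_hash(identifier: str) -> str:
    """Reverse the identifier, then hash it (MD5 is an assumed choice)."""
    reversed_id = identifier[::-1]
    return hashlib.md5(reversed_id.encode("utf-8")).hexdigest()

# Sequential identifiers share a long prefix and would sort next to each other.
vids = ["vertex0001", "vertex0002", "vertex0003"]

# The derived keys no longer carry that shared prefix structure.
hvids = [value_to_hash(v) for v in vids]

for vid, hvid in zip(vids, hvids):
    print(vid, "->", hvid)
```

The same function is applied later to edge identifiers, so that point keys and edge keys are derived in a consistent way.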
ii. Edge. An edge represents the relationship between two vertices. Its key design likewise centers on the unique edge identifier EID, which can be built on the basis of the point identifiers VID: the two point identifiers are separated by the invisible character \1, and the order in which they appear in the EID determines the direction of the edge, for example:
a) if VID2 points to VID1, the EID is VID2\1VID1;
b) if VID1 and VID2 are connected without direction or in both directions, two records are stored at the same time, namely EID = VID1\1VID2 and EID = VID2\1VID1;
c) if VID1 points to VID2, the EID is VID1\1VID2.
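The naming rule in cases a) to c) can be sketched as below. The helper names make_edge_ids and split_edge_id are hypothetical, and the sketch assumes that point identifiers never contain the separator character themselves.

```python
from typing import List, Tuple

SEPARATOR = "\x01"  # the invisible character written as \1 in the text

def make_edge_ids(source_vid: str, target_vid: str, directed: bool = True) -> List[str]:
    """Return the EID(s) for an edge between two points.

    A directed edge yields one EID (source first). An undirected or
    bidirectional edge yields two EIDs, one per direction.
    """
    forward = source_vid + SEPARATOR + target_vid
    if directed:
        return [forward]
    return [forward, target_vid + SEPARATOR + source_vid]

def split_edge_id(eid: str) -> Tuple[str, str]:
    """Recover the two point identifiers from an EID."""
    source_vid, target_vid = eid.split(SEPARATOR)
    return source_vid, target_vid

print(make_edge_ids("VID2", "VID1"))                  # case a)
print(make_edge_ids("VID1", "VID2", directed=False))  # case b)
print(make_edge_ids("VID1", "VID2"))                  # case c)
```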
In the HBase storage design, the edge key HEID is obtained in the same way as the point key, namely by value-to-HASH, and the correspondence HEID <-> edge data is stored in the edge data table.
The graph data processing method based on the HBase in the embodiment of the application can be divided into a graph data loading process and a graph data retrieval process, wherein the graph data loading process comprises the following specific steps:
Step 1, receiving a data loading request sent by a user, and acquiring graph data to be loaded based on the data loading request, wherein the graph data to be loaded comprises point data and edge data.
Step 2, submitting the point data to a point data table according to the point key HVID corresponding to the point data.
Step 2 further comprises the following specific steps:
a) acquiring the point identifier VID of a vertex, and obtaining the point key HVID through value-to-HASH;
b) converting the point data and the point key HVID into a corresponding put object through the application programming interface (API) of HBase;
c) submitting the point data to the point data table graph_vertex based on the put object.
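Steps a) to c) can be sketched as follows. No HBase client is used here: a plain dictionary stands in for the table graph_vertex and another dictionary stands in for the put object, so this only illustrates the data flow. MD5 as the hash, the column family "f" for point data, and the helper names are assumptions.

```python
import hashlib

def value_to_hash(identifier):
    # Reverse, then hash; MD5 is an assumed choice, not named in the text.
    return hashlib.md5(identifier[::-1].encode("utf-8")).hexdigest()

graph_vertex = {}  # in-memory stand-in for the point data table

def build_put(row_key, properties, family="f"):
    """Stand-in for a put object: a row key plus family:qualifier cells."""
    cells = {family + ":" + name: value for name, value in properties.items()}
    return {"row": row_key, "cells": cells}

def submit_point(vid, properties):
    hvid = value_to_hash(vid)                                     # step a)
    put = build_put(hvid, properties)                             # step b)
    graph_vertex.setdefault(put["row"], {}).update(put["cells"])  # step c)
    return hvid

key = submit_point("VID1", {"name": "Alice"})
print(key, graph_vertex[key])
```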
Step 3, submitting the edge data to an edge data table and a point association table according to the edge key corresponding to the edge data and the edge identifier of the edge data.
Step 3 further comprises the following specific steps:
a) acquiring the edge identifier EID, and extracting from it the point identifiers VID of the two corresponding vertices;
b) subjecting each of the two point identifiers VID to value-to-HASH to obtain the corresponding point keys HVID;
c) judging, through the API of HBase, whether the point keys HVID exist in the point data table graph_vertex;
d) if both point keys HVID exist, subjecting the edge identifier EID to value-to-HASH to obtain the edge key HEID, converting the edge data and the edge key HEID into a corresponding put object through the API of HBase, and submitting the edge data to the edge data table graph_edge based on the put object; and storing the point keys HVID of the two corresponding vertices in the point association table graph_v_v, wherein the key is the first point key HVID, the column family may be the default f, the column is the edge key HEID, and the value is the second point key HVID;
e) if either point corresponding to the edge identifier does not exist, the loading of the edge data fails.
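The edge loading steps a) to e), including the row layout of the point association table, can be sketched as follows. Dictionaries stand in for the three HBase tables, MD5 is an assumed hash, and the function name submit_edge is hypothetical.

```python
import hashlib

SEPARATOR = "\x01"

def value_to_hash(identifier):
    # Reverse, then hash; MD5 is an assumed choice, not named in the text.
    return hashlib.md5(identifier[::-1].encode("utf-8")).hexdigest()

# In-memory stand-ins for the three tables.
graph_vertex = {}  # HVID -> point data
graph_edge = {}    # HEID -> edge data
graph_v_v = {}     # first HVID -> {"f:" + HEID: second HVID}

def submit_edge(eid, properties):
    """Load one edge; return False if either endpoint is missing."""
    vid1, vid2 = eid.split(SEPARATOR)                            # step a)
    hvid1, hvid2 = value_to_hash(vid1), value_to_hash(vid2)      # step b)
    if hvid1 not in graph_vertex or hvid2 not in graph_vertex:   # step c)
        return False                                             # step e)
    heid = value_to_hash(eid)                                    # step d)
    graph_edge[heid] = {"f:" + k: v for k, v in properties.items()}
    graph_v_v.setdefault(hvid1, {})["f:" + heid] = hvid2
    return True

# Both endpoints must already be present in the point data table.
graph_vertex[value_to_hash("A")] = {"f:name": "A"}
graph_vertex[value_to_hash("B")] = {"f:name": "B"}

print(submit_edge("A" + SEPARATOR + "B", {"type": "knows"}))  # endpoints exist
print(submit_edge("A" + SEPARATOR + "Z", {"type": "knows"}))  # Z was never loaded
```

Keying the association row by the first point key means that all outgoing neighbors of a point can be read from a single row, which is what the path retrieval relies on.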
Step 4, loading the graph data to be loaded into HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table, and the third HFile file corresponding to the point association table.
Specifically, step 4 is performed in MapReduce mode; HFile is the file format in which HBase stores data.
The process of generating the HFile files may be as follows: a configuration object is created with new Configuration() in the main function, and a Job instance is then created from it with new Job(). The input path, output path, mapper class, reducer class, and the key and value types of the map output are set on this instance. The HBase configuration is then set via the set() function, for example to specify the ZooKeeper nodes of the cluster. Next, an HBase table is created according to this configuration, and finally the output is set to HFile files via HFileOutputFormat.
The first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table, and the third HFile file corresponding to the point association table are generated through MapReduce and are then loaded into HBase to complete the loading of the graph data.
The graph data retrieval process can be divided into point data retrieval, edge data retrieval, and path data retrieval. The point data retrieval may comprise the following steps:
a) acquiring the point identifier VID, and obtaining the point key HVID through value-to-HASH;
b) acquiring the point data corresponding to the point key HVID from the point data table graph_vertex through the API of HBase.
The edge data retrieval may comprise the following steps:
a) acquiring the edge identifier EID, and obtaining the edge key HEID through value-to-HASH;
b) acquiring the edge data corresponding to the edge key HEID from the edge data table graph_edge through the API of HBase.
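Both retrievals reduce to a single-row lookup by a derived key, as the following sketch shows. Dictionaries again stand in for the HBase tables, the sample rows are made up, MD5 is an assumed hash, and get_point and get_edge are hypothetical helper names.

```python
import hashlib

def value_to_hash(identifier):
    # Reverse, then hash; MD5 is an assumed choice, not named in the text.
    return hashlib.md5(identifier[::-1].encode("utf-8")).hexdigest()

# Stand-ins for graph_vertex and graph_edge, pre-filled with sample rows.
graph_vertex = {value_to_hash("A"): {"f:name": "A"}}
graph_edge = {value_to_hash("A\x01B"): {"f:type": "knows"}}

def get_point(vid):
    """Point retrieval: VID -> HVID -> row in the point data table."""
    return graph_vertex.get(value_to_hash(vid))

def get_edge(eid):
    """Edge retrieval: EID -> HEID -> row in the edge data table."""
    return graph_edge.get(value_to_hash(eid))

print(get_point("A"))
print(get_edge("A\x01B"))
print(get_point("missing"))  # no such row, so None is returned
```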
The path data retrieval is illustrated by the following example.
Suppose the path data to be retrieved is a path from point A to point D, and the default retrieval depth is 5. The path retrieval then comprises the following steps:
a) acquiring the point key HVID of point A, and querying the point association table graph_v_v to obtain the points connected to vertex A; in this example they are only B and C, and do not include D;
b) recursively searching from B and C until D is found, reducing the retrieval depth by 1 on each recursion, and exiting the recursion when the depth is less than or equal to 1, which indicates that the path cannot be reached.
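The A-to-D example can be sketched as a depth-limited recursive search over the point association table. This is a minimal illustration and not the patented implementation: a dictionary stands in for graph_v_v, MD5 is an assumed hash, the extra edge B to D is added so that a path exists, and the names add_edge and path_exists are hypothetical.

```python
import hashlib

def value_to_hash(identifier):
    # Reverse, then hash; MD5 is an assumed choice, not named in the text.
    return hashlib.md5(identifier[::-1].encode("utf-8")).hexdigest()

# Stand-in for graph_v_v: row key = first point key,
# column = edge key, value = second point key.
graph_v_v = {}

def add_edge(vid1, vid2):
    eid = vid1 + "\x01" + vid2
    row = graph_v_v.setdefault(value_to_hash(vid1), {})
    row["f:" + value_to_hash(eid)] = value_to_hash(vid2)

def path_exists(start_key, end_key, depth=5):
    """Return True if end_key is reachable from start_key within depth hops."""
    neighbours = set(graph_v_v.get(start_key, {}).values())
    if end_key in neighbours:
        return True
    if depth <= 1:
        # Depth exhausted: report the path as unreachable.
        return False
    # Recurse into each associated point with the depth reduced by one.
    return any(path_exists(n, end_key, depth - 1) for n in neighbours)

for source, target in [("A", "B"), ("A", "C"), ("B", "D")]:
    add_edge(source, target)

a, d = value_to_hash("A"), value_to_hash("D")
print(path_exists(a, d))           # D is two hops from A
print(path_exists(a, d, depth=1))  # only direct neighbours are checked
print(path_exists(d, a))           # D has no outgoing edges
```

The sketch keeps no record of visited points, so a cyclic graph may be revisited; termination is guaranteed only by the decreasing depth, which is the role the retrieval depth plays in the description.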
Corresponding to the method for processing graph data based on HBase in the embodiment of the present application, an embodiment of the present application further provides a device for processing graph data based on HBase, as shown in fig. 2, where the device includes:
a receiving module 201, configured to receive a data loading request sent by a user, where the data loading request carries graph data to be loaded, and the graph data to be loaded includes point data and edge data;
a first submitting module 202, configured to submit the point data to a point data table according to a point key corresponding to the point data, where the point key is generated based on inverting and hashing a point identifier of the point data, and the point data table is generated according to a corresponding relationship between the point data and the point key;
a second submitting module 203, configured to submit the edge data to an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data, where the edge key is generated based on inverting and hashing the edge identifier, the edge data table is generated according to a corresponding relationship between the edge data and the edge key, and the point association table is generated according to a first point identifier and a second point identifier corresponding to the edge identifier;
a loading module 204, configured to load the graph data to be loaded into the HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table, and the third HFile file corresponding to the point association table.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A graph data processing method based on HBase is characterized by comprising the following steps:
receiving a data loading request sent by a user, and acquiring graph data to be loaded based on the data loading request, wherein the graph data to be loaded comprises point data and edge data;
submitting the point data into a point data table according to a point key corresponding to the point data, wherein the point key is generated by inverting and hashing a point identifier of the point data, and the point data table is generated according to the corresponding relation between the point data and the point key;
submitting the edge data to an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data, wherein the edge key is generated based on inverting and hashing the edge identifier, the edge data table is generated according to the corresponding relation between the edge data and the edge key, and the point association table is generated according to a first point identifier and a second point identifier corresponding to the edge identifier;
and loading the graph data to be loaded into HBase based on a first HFile file corresponding to the point data table, a second HFile file corresponding to the edge data table and a third HFile file corresponding to the point association table.
2. The method of claim 1, further comprising, before submitting the edge data into an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data:
determining the first point identifier and the second point identifier according to the edge identifier;
judging whether the point key of the first point identifier and the point key of the second point identifier exist in the point data table based on an application program interface of the HBase;
if so, determining the edge key according to the edge identifier;
if not, determining that the data loading fails.
3. The method according to claim 2, wherein submitting the edge data into an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data specifically comprises:
converting the edge data and the edge key into a first adding object based on an application program interface of the HBase;
submitting the edge data into the edge data table based on the first adding object;
converting the point key of the first point identifier and the point key of the second point identifier into a second adding object based on the application program interface of the HBase;
submitting the edge data into the point association table based on the second adding object.
4. The method of claim 1, wherein the method further comprises:
when a data retrieval request sent by a user is received, determining a retrieval item according to the data retrieval request, wherein the retrieval item is point data, edge data or path data;
if the retrieval item is the point data, determining a point data retrieval result based on the point identifier in the data retrieval request and the point data table;
if the retrieval item is the edge data, determining an edge data retrieval result based on the edge identifier in the data retrieval request and the edge data table;
and if the retrieval item is the path data, determining a path data retrieval result according to the point key of the path starting point, the point key of the path ending point and the point association table in the data retrieval request.
5. The method according to claim 4, wherein determining the path data retrieval result according to the point key of the path starting point, the point key of the path ending point and the point association table in the data retrieval request specifically comprises:
acquiring a point key of an associated point connected with the path starting point based on the point key of the path starting point and the point association table;
judging whether the point key of the path end point exists in the point keys of the associated points;
if yes, determining that the path data is successfully retrieved;
if not, performing recursive retrieval based on the point keys of the associated points and the point association table, and reducing the preset retrieval depth by one each time the recursive retrieval is performed, until the point key of the path end point is found and the path data retrieval is determined to be successful, or until the retrieval depth falls to or below its lower limit and the path data retrieval is determined to have failed.
6. The method of claim 1, wherein submitting the point data into a point data table according to a point key corresponding to the point data specifically comprises:
converting the point data and the point key into a third adding object according to an application program interface of the HBase;
and submitting the point data into a point data table based on the third adding object.
7. The method of claim 1, wherein the edge identifier further comprises direction information of the first point identifier and the second point identifier, the direction information is specifically unidirectional, non-directional, or bidirectional, and the first point identifier and the second point identifier are separated by the invisible character \1.
8. The method of claim 2, wherein the key of the point association table is the point key of the first point identifier, the column of the point association table is the edge key, the value of the point association table is the point key of the second point identifier, and the column family of the point association table is a preset default value.
9. The method of claim 1, wherein the first, second, and third HFile files are generated based on MapReduce.
10. An HBase-based graph data processing apparatus, characterized in that the apparatus comprises:
the receiving module is used for receiving a data loading request sent by a user, wherein the data loading request carries graph data to be loaded, and the graph data to be loaded comprises point data and edge data;
the first submitting module is used for submitting the point data to a point data table according to a point key corresponding to the point data, wherein the point key is generated by inverting and hashing a point identifier of the point data, and the point data table is generated according to the corresponding relation between the point data and the point key;
a second submission module, configured to submit the edge data to an edge data table and a point association table according to an edge key corresponding to the edge data and an edge identifier of the edge data, where the edge key is generated based on inverting and hashing the edge identifier, the edge data table is generated according to a corresponding relationship between the edge data and the edge key, and the point association table is generated according to a first point identifier and a second point identifier corresponding to the edge identifier;
and the loading module is used for loading the graph data to be loaded into the HBase based on the first HFile file corresponding to the point data table, the second HFile file corresponding to the edge data table and the third HFile file corresponding to the point association table.
CN202010313655.XA 2020-04-20 2020-04-20 HBase-based graph data processing method and equipment Pending CN111538804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010313655.XA CN111538804A (en) 2020-04-20 2020-04-20 HBase-based graph data processing method and equipment

Publications (1)

Publication Number Publication Date
CN111538804A true CN111538804A (en) 2020-08-14

Family

ID=71979424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010313655.XA Pending CN111538804A (en) 2020-04-20 2020-04-20 HBase-based graph data processing method and equipment

Country Status (1)

Country Link
CN (1) CN111538804A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090837A (en) * 2022-01-11 2022-02-25 智器云南京信息科技有限公司 Graph data query method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9576020B1 (en) * 2012-10-18 2017-02-21 Proofpoint, Inc. Methods, systems, and computer program products for storing graph-oriented data on a column-oriented database
CN110263225A (en) * 2019-05-07 2019-09-20 南京智慧图谱信息技术有限公司 Data load, the management, searching system of a kind of hundred billion grades of knowledge picture libraries
CN110275969A (en) * 2019-06-13 2019-09-24 南京智慧图谱信息技术有限公司 A kind of data presentation technique in hundred billion grades of knowledge picture libraries

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination