CN115794965A - Data management system and method

Info

Publication number: CN115794965A
Application number: CN202211530471.4A
Authority: CN
Inventors: 杨传真; 甘志雄; 李欣明; 陶刚; 杨绍平; 李淳; 杨帆; 余洋; 李立刚; 罗晖; 唐俊; 赵桂艳; 唐峻
Original assignee: China Tobacco Yunnan Industrial Co Ltd
Current assignee: China Tobacco Yunnan Industrial Co Ltd
Priority date: 2022-11-30
Filing date: 2022-11-30
Publication date: 2023-03-14

Abstract

The application discloses a data management system and method. The data management system comprises a data service module, a database change data capture module, an original database, a graph database and a search engine. The data service module receives written data through a data writing interface, processes the data and transmits the processed data to the original database for storage, wherein the written data comprises only entity data and relationship data. The database change data capture module monitors the original database, obtains change data and writes the change data into an internal message queue. The graph database and the search engine store change messages obtained from the internal message queue. Data are abstracted into entities and relationships and stored in the original database; by monitoring the original database, writes to the graph database and the search engine are linked, which ensures data consistency.

Description

Data management system and method

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data management system and method.

Background

For information systems, an original database is typically employed for storage. With the development of informatization in recent years, the following requirements often arise in practical application scenarios:

1. Information systems often need to describe complex relationships between different objects. When a large number of nodes and relationships must be stored, the retrieval efficiency of the original database is low, so graph databases such as Neo4j are commonly used in the industry to store nodes and relationships for convenient retrieval.

2. Users often need to perform fast full-text searches, such as finding objects whose fields contain certain keywords, and to run paginated or statistical queries. However, the original database is inefficient at full-text retrieval, so a search engine (such as Elasticsearch) is commonly used in the industry to implement full-text retrieval of semi-structured data.

In practice, a graph database and a search engine are often used together, so the two storage and analysis technologies must keep their data consistent. However, no data linkage mechanism exists between the graph database and the search engine, and the data of each must be maintained separately.

Disclosure of Invention

The application provides a data management system and a data management method. Data are abstracted into entities and relationships and stored in an original database; by monitoring the original database, writes to the graph database and the search engine are linked, which ensures data consistency.

The application provides a data management system, which comprises a data service module, a database change data capturing module, an original database, a graph database and a search engine;

the data service module receives the written data through the data writing interface, processes the data and transmits the processed data to the original database for storage; wherein, the written data only comprises entity data and relation data;

the database change data capturing module monitors an original database, obtains change data and writes the change data into an internal message queue;

the graph database and the search engine store change messages obtained from the internal message queue.

Preferably, the entity data and the relationship data are stored in the graph database, and the search engine only stores the entity data required to support the search function.

Preferably, the graph database has a relational query/graph query interface and the data service module has a retrieval interface, the data service module forwarding data received by the retrieval interface to the search engine.

Preferably, the data management system further comprises at least one customized message queue, and the customized message queue performs information interaction with an external user;

the customized message queue monitors the internal message queue and forms a change message adaptive to the requirement of the corresponding external user for the external user to use.

Preferably, the original database, the graph database and/or the search engine are a clustered original database, a clustered graph database and/or a clustered search engine.

The application also provides a data management method, which comprises the following steps:

the data service module receives first write data through the data writing interface and performs first data processing to obtain second write data; wherein the first write data includes only entity data and relationship data;

the original database receives the second write data;

the database change data capture module monitors the original database, obtains change data and writes the change data into an internal message queue;

the graph database and the search engine write a first change message obtained by performing second data processing on the change data in the internal message queue.

Preferably, structured data information of an entity is written to the graph database, and unstructured text fields of the entity are written to the search engine.

Preferably, the data management method further includes:

performing third data processing on the change data in the internal message queue according to the requirement of an external user to obtain a second change message, and outputting the second change message to a customized message queue that communicates with the external user.

Preferably, the first data processing includes checking whether both ends of the relationship data have entities, and checking whether each field of the entity data and the relationship data conforms to a predefined value range.

Preferably, the second data processing includes:

distinguishing whether the change data is a change of entity data or a change of relationship data to obtain an attribute change result, and determining the target database to be written according to the attribute change result, wherein the target database comprises the graph database and/or the search engine;

distinguishing whether the changed data belongs to added data, deleted data or modified data to obtain a change mode;

and converting the format of the change data into a data format matched with the target database according to the change mode to obtain the first change message.

Further features of the present application and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which is to be read in connection with the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic structural diagram of a first embodiment of a data management system provided in the present application;

FIG. 2 is a schematic structural diagram of a second embodiment of a data management system provided in the present application;

FIG. 3 is a flowchart of a data management method provided in the present application.

Detailed Description

Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

Example one

As shown in FIG. 1, the data management system includes a data service module, a database change data capture module, an original database, a graph database, and a search engine.

As shown in fig. 1, the data service module includes a data write interface (shown in fig. 1 as a data write API) and a retrieval interface (shown in fig. 1 as a full text retrieval API).

The data service module receives the written data through the data writing interface, processes the data and transmits the processed data to the original database for storage.

In the present application, data objects are abstracted as "entities" and "relationships". An "entity" is an object that has its own attribute values; for example, "user", "organization", "server", "cluster" and "application" are all entities. A "relationship" describes a state such as the interaction between one entity and another, e.g., which organization a user "belongs to" or which cluster a server "belongs to". Thus, the written data includes only entity data and relationship data.

Before data is written, an administrator needs to define the entity types and relationship types in the data service module in advance, including the fields of each entity type and relationship type and the value ranges allowed for those fields. When relationship data is written, the data service module checks whether the entities at both ends of the relationship exist. When entity data or relationship data is written, the data service module checks whether each field conforms to its predefined value range.
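By way of illustration only, the two checks described above can be sketched as follows. The registry layout, the type and field names, and the `ValidationError` class are hypothetical and are not taken from the application; a value range is modeled here simply as a set of allowed values.

```python
# Hypothetical type registry: field name -> set of allowed values (None = any value).
ENTITY_TYPES = {
    "user": {"name": None, "status": {"active", "inactive"}},
    "organization": {"name": None, "kind": {"administrative", "party"}},
}
RELATION_TYPES = {
    "belongs_to": {"role": {"member", "leader"}},
}


class ValidationError(ValueError):
    """Raised when written data violates a predefined type or value range."""


def check_fields(schema, fields):
    """Check every written field against the value range predefined for it."""
    for name, value in fields.items():
        if name not in schema:
            raise ValidationError(f"undefined field: {name}")
        allowed = schema[name]
        if allowed is not None and value not in allowed:
            raise ValidationError(f"{name}={value!r} is outside the allowed range")


def validate_entity(entity_type, fields):
    """Validate entity data before it is passed on to the original database."""
    if entity_type not in ENTITY_TYPES:
        raise ValidationError(f"undefined entity type: {entity_type}")
    check_fields(ENTITY_TYPES[entity_type], fields)


def validate_relation(relation_type, source_id, target_id, fields, existing_ids):
    """Validate relationship data: both end entities must already exist."""
    if relation_type not in RELATION_TYPES:
        raise ValidationError(f"undefined relationship type: {relation_type}")
    for end in (source_id, target_id):
        if end not in existing_ids:
            raise ValidationError(f"entity {end} does not exist")
    check_fields(RELATION_TYPES[relation_type], fields)
```

In this sketch a write is rejected by raising an exception before anything reaches the original database; how a real data service module reports such failures is not specified in the application.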

As shown by the dotted-line box in the upper left corner of fig. 1, data input through the data writing interface enters the ingress message queue, and the written data in the ingress message queue is processed by the message processing logic into the message format required by the user before being transmitted to the original database. The message processing logic here is customized according to the business functions of the data management system, and therefore differs between business systems.

The retrieval interface is used for receiving data (mainly comprising keywords) input by a retrieval user and forwarding the data to a search engine to realize functions of full-text retrieval and the like.

The original database receives and stores the written data transmitted by the data service module; correspondingly, the data in the original database comprises only relationship data and entity data. It should be noted that the original database must support extraction of change data using change data capture techniques. As an example, the original database may be a relational database (e.g., MySQL) or a non-relational database (e.g., Cassandra).

A database Change Data Capture (CDC) module (shown in figure 1 as CDC software) monitors the original database, obtains change data, and writes the change data into the internal message queue. As one example, the CDC software may be Debezium or Oracle GoldenGate, both of which provide change data capture.
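The flow from the original database to the internal message queue can be sketched as follows, for illustration only. CDC software normally reads the database's change log; this sketch substitutes a simple comparison of two table snapshots so that it stays self-contained, and the event shape (`op`, `key`, `before`, `after`) and function names are hypothetical.

```python
import queue


def capture_changes(before, after):
    """Compare two snapshots (primary key -> row dict) and return change events.

    A snapshot diff stands in for log-based capture in this sketch.
    """
    events = []
    for key in sorted(after.keys() - before.keys()):
        events.append({"op": "add", "key": key, "before": None, "after": after[key]})
    for key in sorted(before.keys() - after.keys()):
        events.append({"op": "delete", "key": key, "before": before[key], "after": None})
    for key in sorted(before.keys() & after.keys()):
        if before[key] != after[key]:
            events.append(
                {"op": "modify", "key": key, "before": before[key], "after": after[key]}
            )
    return events


def publish(events, internal_queue):
    """Write each captured change into the internal message queue."""
    for event in events:
        internal_queue.put(event)


# Usage: two snapshots of a hypothetical "entity" table.
old_rows = {"U1": {"name": "Li"}, "U2": {"name": "Wang"}}
new_rows = {"U1": {"name": "Li Ming"}, "U3": {"name": "Zhao"}}
internal_queue = queue.Queue()
publish(capture_changes(old_rows, new_rows), internal_queue)
```

Downstream consumers then read from `internal_queue` without touching the original database directly.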

The graph database and the search engine store change messages obtained from the internal message queue. Specifically, a change log processing module processes the change data in the internal message queue to obtain change messages for the graph database and the search engine respectively: it distinguishes whether entity data or relationship data has changed, distinguishes whether the change is an addition, deletion or modification, converts the change data into the formats required by the graph database and the search engine, and writes the converted data into the graph database and the search engine. It should be noted that the same change data must be written to the graph database and the search engine together, which ensures data consistency; writing to only the graph database or only the search engine must not occur.

As one embodiment, structured data information of an entity is written to the graph database, and unstructured text fields of the entity are written to the search engine.

The graph database stores entity data and relationship data, and the search engine stores only the entity data needed to support the search function. Some fields that occupy a relatively large space have little value in a graph database and are therefore written only to the search engine. For example, when an article is an entity, its body text is long and occupies a large space; in that case the body text field is stored only in the search engine, and the graph database stores only information such as the article's identification number, title and author, without the body text.
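The article example above can be sketched as a per-type field placement table, for illustration only. The table `FIELD_PLACEMENT`, the entity types and the field names are hypothetical; the application does not prescribe how the placement is configured.

```python
# Hypothetical placement table: which fields of each entity type go to which store.
FIELD_PLACEMENT = {
    "article": {"graph": {"title", "author"}, "search": {"title", "body"}},
    "server": {"graph": {"hostname"}, "search": set()},
}


def split_entity(entity_type, entity_id, fields):
    """Return (graph_record, search_document) for one entity.

    search_document is None when the entity type needs no search support.
    """
    placement = FIELD_PLACEMENT[entity_type]
    graph_record = {"id": entity_id, "type": entity_type}
    graph_record.update({k: v for k, v in fields.items() if k in placement["graph"]})

    search_document = None
    if placement["search"]:
        search_document = {"id": entity_id, "type": entity_type}
        search_document.update(
            {k: v for k, v in fields.items() if k in placement["search"]}
        )
    return graph_record, search_document


# Usage: the long body text reaches only the search document.
graph_record, search_document = split_entity(
    "article", "art-1", {"title": "Notes", "author": "Li", "body": "long text ..."}
)
```

Both records carry the same `id`, so a search hit can be followed back to the corresponding node in the graph database.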

When writing data, it is necessary to ensure that changes to entity data are applied before changes to relationship data. For example, when a user moves to a different work department, the user entity and the department entity must already exist in the data management system before the relationship between the user and the department can be added.
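A minimal sketch of this ordering rule, assuming that changes arrive in batches and that each change carries a hypothetical `kind` field with the value `"entity"` or `"relationship"`:

```python
def order_changes(changes):
    """Return the batch with entity changes ahead of relationship changes.

    sorted() is stable, so changes of the same kind keep their original order.
    """
    return sorted(changes, key=lambda change: 0 if change["kind"] == "entity" else 1)


# Usage: the relationship arrives first but is applied last.
batch = [
    {"kind": "relationship", "data": {"from": "U1", "type": "belongs_to", "to": "D2"}},
    {"kind": "entity", "data": {"id": "U1"}},
    {"kind": "entity", "data": {"id": "D2"}},
]
ordered = order_changes(batch)
```

The sketch covers only the case named in the text, where entities must exist before a relationship is added; the application does not state how deletions should be ordered.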

The graph database has a relational query/graph query interface that supports query statements for querying complex relational conditions. The data service module is provided with a retrieval interface, the data service module forwards the data received by the retrieval interface to a search engine, and the search engine supports retrieval according to keywords.

Because the graph database and the search engine are both non-relational databases, they support dynamic adjustment of data structures and dynamic changes to fields, which gives the system high flexibility.

Preferably, the data management system further comprises at least one customized message queue, and the customized message queue exchanges information with an external user. The customized message queue monitors the internal message queue and forms change messages adapted to the requirements of the corresponding external user (such as a Hadoop-based big data platform) for that user to consume. Specifically, as shown in fig. 1, the change data in the internal message queue is processed into a change message adapted to the requirement of the external user by message processing logic that is customized for each service output requirement. For example, the message processing logic corresponding to a data lake converts the change data in the internal message queue into the message format required by the data lake.
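For illustration only, the fan-out from the internal message queue to customized message queues can be sketched as below. The two output formats, the function names and the shape of a change record (`kind`, `op`, `data`) are hypothetical; each external user would supply its own message processing logic.

```python
import json
import queue


def to_data_lake_format(change):
    """Hypothetical format for a data lake consumer: one JSON string per change."""
    return json.dumps(
        {"table": change["kind"], "action": change["op"].upper(), "payload": change["data"]},
        sort_keys=True,
    )


def to_audit_format(change):
    """Hypothetical format for a second consumer: a short text line."""
    return f'{change["op"]} {change["kind"]} {change["data"]["id"]}'


def fan_out(internal_queue, subscribers):
    """Drain the internal queue and feed every customized queue.

    subscribers is a list of (processing_logic, customized_queue) pairs.
    """
    while True:
        try:
            change = internal_queue.get_nowait()
        except queue.Empty:
            break
        for processing_logic, customized_queue in subscribers:
            customized_queue.put(processing_logic(change))


# Usage: one internal change reaches two external users in two formats.
internal_queue = queue.Queue()
internal_queue.put({"kind": "entity", "op": "add", "data": {"id": "U1", "name": "Li"}})
lake_queue, audit_queue = queue.Queue(), queue.Queue()
fan_out(internal_queue, [(to_data_lake_format, lake_queue), (to_audit_format, audit_queue)])
```

Adding a new external user in this sketch means adding one more pair to `subscribers`, without changing the internal queue or the other consumers.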

As an example, when the data volume of the data management system is large (e.g., on the order of hundreds of millions or billions), the data management system supports replacing some of its key components so that it can carry the large data volume.

As an embodiment, a non-relational database (for example, Cassandra) is adopted as the original database; this type of non-relational database supports increasing data carrying capacity and concurrent processing capacity by adding nodes horizontally.

As another example, the original database, the graph database, and/or the search engine may be a clustered original database, a clustered graph database, and/or a clustered search engine.

As shown in fig. 2, the data management system includes a clustered original database formed by a plurality of relational databases, a graph database cluster formed by a plurality of graph databases, and an Elasticsearch cluster formed by a plurality of Elasticsearch search engines. Correspondingly, each relational database corresponds to one data service module.

As one embodiment, an open-source clustered graph database such as Nebula is adopted, so that the number of nodes and edges the graph database can hold can be raised to the order of billions.

As an embodiment, a container technology is used to extend the data query module, and a Redis-based memory cache may be added if necessary.

Based on this architecture, entity data and relationship data can be classified at multiple levels. For a given entity object, each data service module specifies the fields it contains, including but not limited to field names and value ranges. When data is written through the API, each data service module automatically checks whether the written entity data and relationship data conform to the predefined rules, for example whether the value of a field falls within its predefined range, so that the written data are classified automatically.

Therefore, the data carrying capacity and the concurrent processing capacity of the data management system are improved, and the queries per second (QPS) index is improved. In addition, the data query interface can further increase query speed and performance by using in-memory caching technologies such as Redis and Memcached.
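The caching idea can be sketched with a small in-process cache, for illustration only. A plain dictionary with expiry times stands in for Redis or Memcached here, and the class name `TTLCache`, its methods and the `loader` callback are hypothetical.

```python
import time


class TTLCache:
    """Cache-aside helper: serve repeated queries from memory until they expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expiry time, cached value)

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._store[key] = (now + self._ttl, value)
        return value


def slow_graph_query(user_id):
    """Hypothetical stand-in for a query against the graph database."""
    return {"user": user_id, "organizations": ["A1", "B1"]}


# Usage: the second call is answered from memory without running the query again.
cache = TTLCache(ttl_seconds=30)
first = cache.get_or_load("U1", slow_graph_query)
second = cache.get_or_load("U1", slow_graph_query)
```

A time-based expiry is only one possible policy; the application does not say how cached results are invalidated when the underlying data changes.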

As an example, an enterprise contains different types of organizations, such as administrative organizations and party building organizations. A user may belong to different types of organizations at the same time (e.g., user U1 belongs administratively to organization A1 and, as a party member, also belongs to party organization B1), and may have different kinds of relationships with different organizations (user U1 belongs administratively to A1, is seconded (borrowed) to A2, and holds a concurrent post in A3). The three organizations A1, A2 and A3 belong to a higher-level organization A0. The entity information of a user may include the user's work history, date of birth, and so on.

Thus, the enterprise includes 3 entity types (administrative organizations, party building organizations and users) and 3 relationship types (affiliation, secondment (borrowing) and concurrent post).

Thus, in creating the data management system for the enterprise, the following process is included:

1. The above-mentioned 3 entity types and 3 relationship types are defined in the data service module.

2. The 5 organization entities A0, A1, A2, A3 and B1 and the user entity U1 are written through the data write API.

3. The relationships among the entities are written through the data write API: A1, A2 and A3 are each affiliated to A0, and U1 is affiliated to A1, seconded (borrowed) to A2, holds a concurrent post in A3, and is affiliated to B1.

After the data information of the entities and the relations is written in through the API, the data can be automatically stored in a graph database and a search engine, and then the following queries can be carried out through a query interface of the graph database and the search engine:

1. Query all staff who belong to at least one party department under a given organization, counting both the affiliation and the secondment (borrowing) relationship types.

2. Query all staff under a given party organization whose work experience contains the keyword "xx factory", sorted by date of birth.
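For illustration only, the enterprise example and queries of these two kinds can be worked through in plain Python, with a list of triples standing in for the graph database and a substring match standing in for the search engine. The attribute values and the second user U2 are hypothetical, as are all function names.

```python
# Entities from the example; attribute values and user U2 are hypothetical.
ENTITIES = {
    "A0": {"type": "administrative"},
    "A1": {"type": "administrative"},
    "A2": {"type": "administrative"},
    "A3": {"type": "administrative"},
    "B1": {"type": "party"},
    "U1": {"type": "user", "birth_date": "1985-03-01",
           "work_history": "xx factory, workshop technician"},
    "U2": {"type": "user", "birth_date": "1979-07-15",
           "work_history": "xx factory, quality inspector"},
}

# Relationships as (source, relationship type, target) triples.
RELATIONS = [
    ("A1", "affiliation", "A0"), ("A2", "affiliation", "A0"), ("A3", "affiliation", "A0"),
    ("U1", "affiliation", "A1"), ("U1", "secondment", "A2"),
    ("U1", "concurrent_post", "A3"), ("U1", "affiliation", "B1"),
    ("U2", "affiliation", "B1"),
]


def sub_organizations(root):
    """Collect every organization affiliated, directly or indirectly, to root."""
    found, frontier = set(), [root]
    while frontier:
        org = frontier.pop()
        for src, rel, dst in RELATIONS:
            is_org = ENTITIES[src]["type"] != "user"
            if rel == "affiliation" and dst == org and is_org and src not in found:
                found.add(src)
                frontier.append(src)
    return found


def staff_under(root, relation_types):
    """Graph-style query: users tied to root or its sub-organizations."""
    orgs = sub_organizations(root) | {root}
    return sorted({
        src for src, rel, dst in RELATIONS
        if ENTITIES[src]["type"] == "user" and rel in relation_types and dst in orgs
    })


def search_staff(org, keyword):
    """Full-text-style query: members of org whose work history has the keyword."""
    members = staff_under(org, {"affiliation"})
    hits = [u for u in members if keyword in ENTITIES[u]["work_history"]]
    return sorted(hits, key=lambda u: ENTITIES[u]["birth_date"])


print(staff_under("A0", {"affiliation", "secondment"}))  # ['U1']
print(search_staff("B1", "xx factory"))                  # ['U2', 'U1']
```

The first function walks relationships, which is the kind of work the graph database takes on; the second filters on text and sorts, which is the kind of work the search engine takes on.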

Example two

Based on the data management system, the application also provides a data management method. As shown in fig. 3, the data management method includes:

S310: the data service module receives first write data through the data writing interface and performs first data processing to obtain second write data; wherein the first write data includes only entity data and relationship data.

Specifically, the first data processing includes checking whether both ends of the relationship data have entities, and checking whether each field of the entity data and the relationship data conforms to a predefined value range.

S320: the original database receives the second write data.

S330: the database change data capturing module monitors the original database and obtains change data, and writes the change data into the internal message queue.

S340: the graph database and the search engine write a first change message obtained by performing second data processing on the change data in the internal message queue.

Specifically, the second data processing includes:

P1: distinguishing whether the change data is a change of entity data or a change of relationship data to obtain an attribute change result, and determining the target database to be written according to the attribute change result, wherein the target database comprises the graph database and/or the search engine. For example, if entity data has changed, the target databases are the graph database and the search engine; if relationship data has changed, the target database is the graph database.

P2: identifying whether the change data is added data, deleted data or modified data to obtain a change mode.

P3: converting the format of the change data into a data format matched with the target database according to the change mode to obtain a first change message.
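Steps P1 to P3 can be sketched as a single routing function, for illustration only. The change record shape (`kind`, `op`, `data`) and both output message formats are hypothetical; a real implementation would emit whatever the chosen graph database and search engine accept.

```python
GRAPH_ACTIONS = {"add": "CREATE", "modify": "UPDATE", "delete": "DELETE"}
SEARCH_ACTIONS = {"add": "index", "modify": "index", "delete": "delete"}


def process_change(change):
    """Turn one change record into a list of (target, message) pairs."""
    # P1: entity changes go to both stores, relationship changes to the graph only.
    if change["kind"] == "entity":
        targets = ["graph", "search"]
    elif change["kind"] == "relationship":
        targets = ["graph"]
    else:
        raise ValueError(f"unknown kind: {change['kind']}")

    # P2: determine the change mode (added, deleted or modified data).
    mode = change["op"]
    if mode not in GRAPH_ACTIONS:
        raise ValueError(f"unknown change mode: {mode}")

    # P3: convert the change into the format each target expects.
    messages = []
    for target in targets:
        if target == "graph":
            message = {
                "action": GRAPH_ACTIONS[mode],
                "kind": change["kind"],
                "data": change["data"],
            }
        else:
            message = {
                "action": SEARCH_ACTIONS[mode],
                "id": change["data"]["id"],
                "doc": None if mode == "delete" else change["data"],
            }
        messages.append((target, message))
    return messages


# Usage: a modified entity yields one message per store.
result = process_change({"kind": "entity", "op": "modify", "data": {"id": "U1", "name": "Li"}})
```

Returning all target messages from one call keeps the two writes for an entity change together, so a caller can apply both or neither.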

Preferably, the data management method further comprises:

S350: performing third data processing on the change data in the internal message queue according to the requirement of the external user to obtain a second change message, and outputting the second change message to a customized message queue that communicates with the external user.

The beneficial effects of this application are as follows:

1. Universality: the data management system has only two data types, "entity" and "relationship", and is highly abstract. It therefore provides a unified underlying data architecture that can both express complex relationships and support full-text retrieval. It can be applied to a wide range of data scenarios and reused across them, which reduces development complexity and increases development speed.

2. Both graph queries and full-text retrieval are supported: the graph database is efficient for complex relationship queries such as iterative queries, and the search engine is efficient for full-text retrieval. A user can switch seamlessly between graph query and full-text retrieval according to the needs of the actual scenario, which suits most scenarios.

3. Write once, use in many places: data is written only through the original database, and the subsequent writes to the graph database and the search engine are driven by the change information of the original database, so the two stores need not be operated separately and data consistency is guaranteed.

Although some specific embodiments of the present application have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present application. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present application. The scope of the application is defined by the appended claims.

Claims

1. A data management system is characterized by comprising a data service module, a database change data capture module, an original database, a graph database and a search engine;

the data service module receives written data through a data writing interface, processes the written data and transmits the processed data to the original database for storage; wherein the written data only comprises entity data and relationship data;

the database change data capturing module monitors the original database, obtains change data and writes the change data into an internal message queue;

the graph database and the search engine store change messages obtained from the internal message queue.

2. The data management system of claim 1, wherein the graph database stores entity data and relationship data, and the search engine stores only entity data needed to support a search function.

3. The data management system of claim 1, wherein the graph database has a relational query/graph query interface and the data service module has a retrieval interface, the data service module forwarding data received by the retrieval interface to the search engine.

4. The data management system of claim 1, further comprising at least one custom message queue, the custom message queue for information interaction with external users;

and the customized message queue monitors the internal message queue and forms a change message adaptive to the requirement of the corresponding external user for the external user to use.

5. The data management system of claim 1, wherein the original database, the graph database, and/or the search engine are a clustered original database, a clustered graph database, and/or a clustered search engine.

6. A method for managing data, comprising:

the original database receives the second written data;

a database change data capturing module monitors the original database, obtains change data and writes the change data into an internal message queue;

and writing a first change message obtained by performing second data processing on the change data in the internal message queue into the graph database and the search engine.

7. The data management method of claim 6, wherein structured data information of an entity is written to the graph database and unstructured text fields of the entity are written to the search engine.

8. The data management method of claim 6, further comprising:

performing third data processing on the change data in the internal message queue according to the requirement of an external user to obtain a second change message, and outputting the second change message to a customized message queue that communicates with the external user.

9. The data management method according to claim 6, wherein the first data processing includes checking whether both ends of the relationship data have entities, and checking whether each field of the entity data and the relationship data conforms to a predefined value range.

10. The data management method according to claim 6, wherein the second data processing includes:

distinguishing whether the change data is a change of entity data or a change of relationship data to obtain an attribute change result, and determining the target database to be written according to the attribute change result, wherein the target database comprises the graph database and/or the search engine;

and converting the format of the changed data into a data format adaptive to the target database according to the changing mode to obtain the first changing message.