CN114661832A - Multi-mode heterogeneous data storage method and system based on data quality - Google Patents

Multi-mode heterogeneous data storage method and system based on data quality

Info

Publication number
CN114661832A
Authority
CN
China
Prior art keywords
data
database
original
relational
multimedia
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210281261.XA
Other languages
Chinese (zh)
Inventor
李冬
张志钧
单晓欢
宋宝燕
陈廷伟
王俊陆
纪婉婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University
Priority to CN202210281261.XA
Publication of CN114661832A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/289 Object oriented databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Abstract

The invention relates to a multi-modal heterogeneous data storage method and system based on data quality, comprising the following steps: 1) storing original text data in key-value format, in a distributed manner, in an original database; 2) modeling original multimedia data and storing it as files, in a distributed manner, in a file database; 3) converting the key-value data into relational data and constructing a relational database; 4) constructing a graph database from the relationships between the entities in the relational database; 5) modeling the activity data of the entities as a chain structure to construct a chain database; 6) converting the multimedia data into text data and storing the text data in a multimedia database and the original database according to data type; 7) linking the entity data of the sub-databases by constructing a multi-level index structure; 8) constructing a log file maintenance system for the multi-modal database, covering the data integration methods and each sub-database. The method can greatly reduce the time required to query data and helps the relevant personnel use the data efficiently.

Description

Multi-mode heterogeneous data storage method and system based on data quality
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data quality-based multi-modal database heterogeneous storage method and system.
Background
Users now generate a large amount of user behavior data on different network platforms, and the data is no longer single text or picture data, but contains multi-modal data of texts, images, videos and the like from different platforms, including structured data, semi-structured data and unstructured data. Structured data refers to data that can be represented and stored using a relational database, represented in a two-dimensional form, and generally characterized by: the data is in row units, one row of data represents information of one entity, and the attribute of each row of data is the same; semi-structured data is a form of structured data that does not conform to the data model structure associated with a relational database or other data table form, but contains relevant tags to separate semantic elements and to layer records and fields, and is therefore also referred to as a self-describing structure; the unstructured data is data without a fixed structure, various documents, pictures, videos, audios and the like belong to unstructured data, and the unstructured data are generally directly and integrally stored.
In recent years, with the appearance of massive multi-modal data, the cost of data storage is increased, and how to construct a good and efficient multi-modal database becomes a problem which needs to be solved by most computer industry personnel together.
Disclosure of Invention
In order to solve the technical problem, the invention provides a multi-mode heterogeneous data storage method and system based on data quality.
In order to achieve the purpose, the invention adopts the following technical scheme:
A multi-modal heterogeneous data storage method based on data quality, characterized by comprising the following steps:
1) for original data from internet data sources (including original text data and original multimedia data), storing the original text data in an original database in key-value format in a distributed manner;
2) performing data modeling on the original multimedia data from the Internet, and storing it as files in a file database in a distributed manner;
3) converting the original text data into relational data by using data integration methods such as event extraction, entity linking and incomplete data filling, modeling the relational data, and constructing a relational database;
4) modeling the entities in the relational database that are associated with one another, together with the relationships among them, to construct a graph database;
5) since the activity data of each entity in the relational database has typical time-series characteristics, modeling the activity data as a chain structure to construct a chain database;
6) converting the video data and audio data in the multimedia data into text data by a data conversion method, storing the text data as files in a multimedia database, and storing the text data in key-value format in the original database;
7) optimizing the different distributed databases according to data quality, and linking the entity data of the sub-databases by constructing a multi-level index structure, so that data consistency is guaranteed;
8) constructing a log file maintenance system for the multi-modal database, covering the data integration methods and each sub-database.
In another aspect, the present invention provides a data quality-based multi-modal heterogeneous data storage system, including:
an original database: used for storing original data derived from internet data, in key-value format;
a relational database: used for converting the key-value data in the original database into relational data, and modeling and storing the relational data;
a graph database: used for storing, in graph form, the related entities in the relational database and the relationships among them;
a multimedia database: used for storing the video data and audio data that have been converted into text format;
a chain database: used for storing the activity data of each entity in the relational database in a chain structure.
A computer-readable storage medium cluster having stored thereon a computer program which, when executed by a processor, implements the five distributed sub-databases of the data quality-based multi-modal heterogeneous data storage method.
Further, the specific method for storing the raw data in step 1) is as follows:
1.1) the MongoDB database system is used as the database system for key-value data storage. Data crawled from the Internet is saved as JSON files and stored in the MongoDB database; MongoDB automatically generates a unique key for each piece of data as its unique identifier, and each specific piece of data can be located through this key;
1.2) a MongoDB Replica Set is used as the distributed storage solution for the original database. Following the MongoDB Replica Set deployment rules, the computer-readable storage medium cluster is configured with 1 primary node, 1 replica node and 1 arbiter node: the primary node receives all requests, the replica node maintains the same data set as the primary node and can take part in primary elections, and the arbiter node votes in elections.
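Step 1.1) can be illustrated with a minimal in-memory sketch. It does not talk to MongoDB; the class `OriginalStore`, its methods and the sample record are hypothetical, and a generated UUID merely stands in for the unique key that the database assigns to each record.

```python
import json
import uuid


class OriginalStore:
    """Hypothetical stand-in for the original (key-value) database."""

    def __init__(self):
        self._docs = {}

    def insert(self, raw_json: str) -> str:
        # Parse the crawled JSON text and assign a unique key, mimicking
        # the automatically generated identifier described in step 1.1).
        doc = json.loads(raw_json)
        key = uuid.uuid4().hex
        self._docs[key] = doc
        return key

    def find(self, key: str) -> dict:
        # Locate one specific record through its unique key.
        return self._docs[key]


store = OriginalStore()
key = store.insert('{"entity": "Liaoning University", "type": "news"}')
record = store.find(key)
```

The returned `key` is the only handle a caller needs in order to read the original record back later.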
Further, the specific method for modeling the multimedia data in step 2) is as follows: storing multimedia data including video, audio and picture data into a distributed file system according to a specific rule; wherein, the specific rule refers to determining the distributed file system node stored by the multimedia data according to the data source of the multimedia data. The multimedia data including video, audio or picture data are stored into the distributed file system according to the storage nodes corresponding to the data sources.
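The source-based placement rule can be sketched as a lookup table from data source to storage node. The source names, node names and path layout below are assumptions for illustration only; the text does not specify them.

```python
# Hypothetical mapping from data source to distributed file system node.
SOURCE_TO_NODE = {
    "siteA": "dfs-node-1",
    "siteB": "dfs-node-2",
}


def storage_location(source: str, file_name: str) -> str:
    """Return 'node:/source/file' for a multimedia file from `source`."""
    try:
        node = SOURCE_TO_NODE[source]
    except KeyError:
        # An unknown source has no configured node; fail loudly rather
        # than silently scattering its files.
        raise ValueError(f"no storage node configured for source {source!r}")
    return f"{node}:/{source}/{file_name}"


location = storage_location("siteA", "clip001.mp4")
```

Keeping all files of one source on one node means that a source-level query only has to visit a single node.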
Further, the specific method for constructing the relational data model in the step 3) is as follows:
3.1) using a data integration method, including event extraction, entity linking and incomplete data filling, and converting original text data into structured data; the event extraction mainly comprises the steps of carrying out data labeling on original text data through a specific rule to form a data set, training an event extraction model by using the data set, and storing an obtained result in a structured form; the entity link is mainly used for disambiguating certain specific entities between the result obtained by event extraction and the database, and storing the disambiguated structured data; incomplete data filling mainly fills missing parts in the converted structured data by using a missing data filling method, so that the integrity of the data is ensured;
3.2) the MySQL database system is used as the database system for relational data storage. The structured data produced by data integration is stored in the MySQL relational database;
3.3) MySQL Cluster is used as the distributed storage solution for the relational database. Following the MySQL Cluster deployment rules, the computer-readable storage medium cluster is configured with 1 management node, 2 data nodes and 1 application node: the management node manages the related configuration files, the data nodes store the data in a distributed manner, and the application node performs read and write operations.
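The incomplete data filling of step 3.1) is not tied to a particular algorithm in the text. As one possible illustration, the sketch below fills a missing field with the most frequent observed value; this strategy, the function name `fill_missing` and the sample rows are assumptions, not the method of the invention.

```python
from collections import Counter


def fill_missing(rows, field):
    """Return copies of `rows` with None in `field` replaced by the mode."""
    observed = [r[field] for r in rows if r.get(field) is not None]
    if not observed:
        return rows  # nothing to learn from; leave the rows unchanged
    default = Counter(observed).most_common(1)[0][0]
    return [
        {**r, field: default if r.get(field) is None else r[field]}
        for r in rows
    ]


events = [
    {"event_id": 1, "location": "Shenyang"},
    {"event_id": 2, "location": None},
    {"event_id": 3, "location": "Shenyang"},
]
filled = fill_missing(events, "location")
```

The input rows are left untouched, so the unfilled version remains available for comparison or logging.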
Further, the specific method for constructing the graph database storage model in the step 4) is as follows:
4.1) HBase is used as the underlying graph data storage scheme. Elements in the relational database that have specific relationships, together with those relationships, are extracted and stored in HBase, where data is stored as rows addressed by rowkey; following the HBase deployment rules, the computer-readable storage medium cluster is configured with 1 master node, 1 slave node and 1 standby node;
4.2) Neo4j is used as the visual query scheme of the graph database. Part of the data in HBase is exported with Hive and stored in Neo4j to construct a knowledge graph that can meet different query requirements. A mapping between HBase and Hive is established, the HBase data is restored to relational-style data, and the relationships in the data are established through Neo4j;
4.3) the entities in the relational database are modeled through the relationships among them as a set of nodes and edges, with entities represented as nodes and relationships as edges, and are visualized through Neo4j.
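The node-and-edge modeling of step 4.3) can be sketched without any graph database. The rows, relation names and the helper `neighbours` below are hypothetical; in the described system the data would live in HBase and Neo4j.

```python
from collections import defaultdict

# Hypothetical relational rows: (subject entity, relation, object entity).
relation_rows = [
    ("Alice", "works_at", "Liaoning University"),
    ("Bob", "works_at", "Liaoning University"),
    ("Alice", "knows", "Bob"),
]

nodes = set()
edges = defaultdict(list)  # adjacency list: node -> [(relation, neighbour)]
for subject, relation, obj in relation_rows:
    nodes.update((subject, obj))            # every entity becomes a node
    edges[subject].append((relation, obj))  # every relation becomes an edge


def neighbours(entity, relation):
    """Entities reachable from `entity` over edges labelled `relation`."""
    return [target for rel, target in edges.get(entity, []) if rel == relation]


colleagues_employer = neighbours("Alice", "works_at")
```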
Further, the chain database in step 5) takes the form of an alliance chain and a private chain: the alliance chain stores structured data, and the private chain stores semi-structured and unstructured data, including text, pictures and videos. The alliance chain stores its structured data in a MySQL database, and the private chain stores its semi-structured and unstructured data in HDFS.
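A chain structure over time-ordered activity data can be sketched as follows. Linking each record to its predecessor through a SHA-256 hash is an assumption made for illustration; the text only states that activity data is modeled as a chain, and the field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first block's predecessor


def make_block(activity: dict, prev_hash: str) -> dict:
    payload = json.dumps(activity, sort_keys=True) + prev_hash
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"activity": activity, "prev_hash": prev_hash, "hash": digest}


def build_chain(activities):
    """Order activities by time and link each block to the previous one."""
    chain, prev = [], GENESIS
    for act in sorted(activities, key=lambda a: a["time"]):
        block = make_block(act, prev)
        chain.append(block)
        prev = block["hash"]
    return chain


def verify(chain) -> bool:
    """Recompute every hash; any edited or reordered block breaks the chain."""
    prev = GENESIS
    for block in chain:
        expected = make_block(block["activity"], prev)["hash"]
        if block["prev_hash"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True


chain = build_chain([
    {"entity_id": "E1", "time": "2022-03-02", "action": "published report"},
    {"entity_id": "E1", "time": "2022-03-01", "action": "held meeting"},
])
```

Because each hash covers the previous hash, changing an early record invalidates every later block, which makes the time order of the activity data checkable.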
Further, the specific method for storing the multimedia data in the step 6) is as follows:
6.1) crawling relevant multimedia data from the Internet according to multimedia data sources, wherein the relevant multimedia data comprise video, audio, images, texts and the like; designing a multimedia data index table according to data attributes, positioning the specific position of multimedia data through the index table according to attributes such as a data source, a data type, a storage node, a storage path, a file name and the like, and storing the index table into a relational database in a structured data form;
6.2) designing a data conversion storage model: video data is converted into text data through a "video -> audio -> text" process, audio data through an "audio -> text" process, and image data through an "image -> text" process; the resulting text data is stored in the original database and the multimedia database.
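The multimedia index table of step 6.1) can be sketched with an in-memory SQLite database standing in for the relational database. The table name, column names and sample row are assumptions derived from the attributes listed in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE multimedia_index (
           source TEXT, data_type TEXT, storage_node TEXT,
           storage_path TEXT, file_name TEXT)"""
)
conn.execute(
    "INSERT INTO multimedia_index VALUES (?, ?, ?, ?, ?)",
    ("siteA", "video", "node-1", "/media/siteA", "clip001.mp4"),
)


def locate(source, data_type):
    """Return the full location of every file matching source and type."""
    rows = conn.execute(
        "SELECT storage_node, storage_path, file_name FROM multimedia_index "
        "WHERE source = ? AND data_type = ?",
        (source, data_type),
    ).fetchall()
    return [f"{node}:{path}/{name}" for node, path, name in rows]


found = locate("siteA", "video")
```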
Further, in step 7), data quality is defined in terms of accuracy, completeness, consistency and relevance:
7.1) accuracy means that, in data integration methods such as event extraction, data filling, and data consistency detection and conversion, the correctness of data conversion is ensured through metrics such as the accuracy of the conversion models and methods;
7.2) completeness means that, for the original text data of a given entity, the multi-modal heterogeneous data storage system holds the structured data produced by data conversion as well as the semi-structured data in key-value format and the unstructured data in document format, stored in both the original database and the relational database; likewise, for multimedia data, the system holds the structured data produced by multimedia data conversion as well as the semi-structured data in key-value format and the unstructured data in multimedia file form, stored in both the original database and the multimedia database;
7.3) consistency means that the data of the same entity in each sub-database is checked through data consistency detection and conversion, covering dimension consistency, consistency of representation, consistency of data values and the like, so that the related data stored in each sub-database agrees with the original text data and the original multimedia data;
7.4) relevance means that the sub-databases are associated with one another through an entity id or entity name, so that the data of the same entity is updated synchronously in every sub-database and can be traced through the entity id or entity name.
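A simple form of the consistency check in 7.3) compares the copies of one entity held by different sub-databases and reports the fields whose values disagree. The function name, the sub-database labels and the sample records are hypothetical, and only data value consistency is covered here.

```python
def inconsistent_fields(copies: dict) -> dict:
    """`copies` maps sub-database name -> record of the same entity.

    Returns field -> {sub-database: value} for every field whose values
    differ between the sub-databases that store it.
    """
    fields = set().union(*(rec.keys() for rec in copies.values()))
    report = {}
    for field in sorted(fields):
        values = {db: rec[field] for db, rec in copies.items() if field in rec}
        if len(set(values.values())) > 1:
            report[field] = values
    return report


report = inconsistent_fields({
    "original": {"entity_id": "E1", "name": "Liaoning University"},
    "relational": {"entity_id": "E1", "name": "Liaoning Univ."},
    "graph": {"entity_id": "E1"},
})
```

Here the shared `entity_id` is what associates the three copies, and the report singles out `name` as the field to reconcile.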
On the other hand, the invention provides a multi-modal heterogeneous data optimization method based on data quality, which comprises a multi-level index structure and a log file maintenance module;
the multi-level index and dynamic maintenance module comprises a global index part, a local index part and a dynamic maintenance part:
the global index constructs primary/foreign key indexes among the sub-databases of the multi-modal database, effectively linking the sub-databases so that related data can be queried across them;
the local index constructs an independent index structure in each sub-database of the multi-modal database to index the content of that sub-database, and comprises the following modules:
the original database local index module establishes an index for each key of the data and sets a shard key on the indexed field, improving query efficiency through the index;
the relational database local index module establishes indexes on commonly used fields of the data (for example, the entity name is a commonly used field of entity data), improving query efficiency through the index;
the graph database local index module builds secondary indexes through Apache Phoenix and establishes a mapping between Phoenix and the HBase tables, so that the HBase tables can be operated on from Phoenix, improving query efficiency through the index;
the chain database local index module mainly comprises name indexing, ordering and dynamic incremental updating: name indexes are established on specific fields, the alliance chain is built in time order, and data is updated incrementally;
the multimedia database local index module constructs a local index structure from the basic information of the multimedia data, including information such as storage node information, path, file name, extension name and the like of the data, stores the local index structure in a relational database, and can be positioned to the specific position of the multimedia data through an index table according to attributes such as data source, data type, storage node, storage path, file name and the like.
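The role of the global index can be sketched as a mapping from an entity identifier to the local key under which that entity is stored in each sub-database. The structure, function names and sample keys below are assumptions for illustration, not the index layout of the invention.

```python
from collections import defaultdict

# entity_id -> {sub-database name: local key in that sub-database}
global_index = defaultdict(dict)


def register(entity_id, sub_db, local_key):
    global_index[entity_id][sub_db] = local_key


def find_entity(entity_id):
    """Return every sub-database location known for one entity."""
    return dict(global_index.get(entity_id, {}))


register("E001", "original", "doc-5f2a")
register("E001", "relational", "entity_basic_info:17")
register("E001", "graph", "rowkey:E001")
locations = find_entity("E001")
```

A query that starts in one sub-database can use such a mapping to reach the same entity in the others, after which each sub-database's local index takes over.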
The log file maintenance module comprises log file maintenance of the multi-modal database and log file maintenance of data integration. The log file maintenance of the multi-mode database comprises log file maintenance of a relational database, log file maintenance of a graph database, log file maintenance of a chained database and log file maintenance of an original database; the log file maintenance of data integration comprises log file maintenance of event extraction, log file maintenance of entity link, log file maintenance of incomplete data filling and log file maintenance of data consistency;
8.1) log file maintenance of the multi-modal database is maintained by each sub-database through the log files of the relevant system or scheme used by it;
8.2) log file maintenance for data integration means that, whenever a data integration operation changes data, the whole process of the operation is recorded and stored as a log file. For each type of data integration method, the log content includes: the time at which the data integration operation occurred (Timestamp), the data integration type, the log level (INFO, WARNING, ERROR, etc.), and content elements designed specifically for that data integration method;
8.3) the types of data integration methods used by the invention are Event Extraction (EE), Entity Linking (EL), incomplete Data Filling (DF) and data consistency Detection (DC). The log level is divided into five types: critical errors that cause the application to exit (FATAL); errors that occur but do not prevent the system from continuing to run (ERROR); potentially harmful situations (WARNING); coarse-grained messages that highlight the progress of the application (INFO); and fine-grained messages that are most useful for debugging the application (DEBUG);
8.4) an event extraction log file record is composed of: [Timestamp] [EE] [log level] [event type code] [event ID] [event time];
8.5) an entity linking log file record is composed of: [Timestamp] [EL] [log level] [entity link type code] [data value of the unique primary key] [table name] [data value of composite primary key 1 used when linking to the master table] [data value of composite primary key 2 used when linking to the master table] [data value of composite primary key 3 used when linking to the master table];
8.6) an incomplete data filling log file record is composed of: [Timestamp] [DF] [log level] [operation content] [operation result];
8.7) a data consistency detection log file record is composed of: [Timestamp] [DC] [log level] [type of data consistency detection] [operation content] [operation result].
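The bracketed record layouts above can be produced and read back with a few lines of code. This is a sketch under assumptions: the timestamp format, the sample field values and the level name FATAL for critical errors are not fixed by the text, and fields containing brackets are not supported.

```python
import re

METHODS = {"EE", "EL", "DF", "DC"}
LEVELS = {"FATAL", "ERROR", "WARNING", "INFO", "DEBUG"}


def format_record(timestamp, method, level, *fields):
    """Build '[Timestamp] [type] [level] [field] ...' for one operation."""
    if method not in METHODS or level not in LEVELS:
        raise ValueError("unknown data integration type or log level")
    parts = (timestamp, method, level, *fields)
    return " ".join(f"[{part}]" for part in parts)


def parse_record(line):
    # Each bracketed group is one field of the record.
    return re.findall(r"\[([^\]]*)\]", line)


line = format_record("2022-03-22 10:15:00", "DF", "INFO",
                     "fill missing location", "success")
fields = parse_record(line)
```

Since the second field always names the data integration type, a reader of the log can dispatch on it to interpret the method-specific fields that follow.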
The beneficial effects created by the invention are as follows: according to the scheme, data are collected from different data sources, the original data are stored in an original database in a key-value form, the key-value data are subjected to data conversion through data integration methods such as event extraction, entity linking and incomplete data filling, the key-value data are converted into relational data and stored in a relational database, the data in the relational database are stored in a graph database in an entity-relational form, the activity data with typical time sequence characteristics in the relational database are stored in a chained database, the multimedia data are stored in a multimedia database, the data are processed according to the accuracy, integrity, consistency and relevance of the data, and finally heterogeneous storage and database optimization of the data are achieved. The method has the advantages that various database models are designed, and data in different formats are stored according to the storage characteristics of each database to form the multi-mode database. The query efficiency is greatly improved through the index structure, fault recovery can be carried out on the multi-mode database through log file maintenance, and the operation process of each data integration method can be checked. Through the steps, a high-data-quality multi-mode heterogeneous distributed database system is finally obtained. According to the multi-mode database heterogeneous storage method and system, the data are stored in different distributed databases according to different data forms, and data query optimization is performed, so that the time required for querying the data can be greatly reduced, and the efficiency of related personnel in using the data is ensured.
Drawings
FIG. 1 is a diagram of a multi-modal heterogeneous data storage system architecture based on data quality;
FIG. 2 is a diagram of a raw database architecture;
FIG. 3 is a relational database architecture diagram;
FIG. 4 is a flow chart of graph database data presentation;
FIG. 5 is a flow chart of a chained database;
FIG. 6 is a flow chart of a global index structure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and in addition, the embodiments described below are only one embodiment of the present invention, not all embodiments.
The invention provides a multi-modal heterogeneous data storage method based on data quality, whose design is as follows. First, data is collected from different data sources; original text data is stored in key-value form and multimedia data is stored as files. Second, data modeling is performed: original text data is converted into relational data through data integration methods and stored in a relational database; elements with specific relationships in the relational database, together with those relationships, are extracted and stored in HBase; part of the data in HBase is exported with Hive and stored in Neo4j to construct a knowledge graph that can meet different query requirements; and the original data and compressed multimedia data are stored in a chain database. Finally, the database is optimized by establishing a global index structure, local index structures and a log file maintenance module, forming the multi-modal heterogeneous database.
The framework of the multi-modal heterogeneous data storage system based on data quality designed according to this method is shown in FIG. 1. The system comprises: an original database, a relational database, a graph database, a chain database and a multimedia database.
The functions of the databases are as follows:
original database: used for storing original data derived from internet data, in key-value format;
relational database: used for converting the key-value data in the original database into relational data, and modeling and storing the relational data;
graph database: used for storing, in graph form, the related entities in the relational database and the relationships among them;
multimedia database: used for storing the video data and audio data that have been converted into text format;
chain database: used for storing the activity data of each entity in the relational database in a chain structure.
The system is adopted to realize a multi-mode heterogeneous data storage method based on data quality, and the steps are as follows:
1) the method comprises the steps of crawling original data from the Internet, and storing the original data into an original database in a key-value form;
the method specifically comprises the following steps:
1.1) storing relevant data which is crawled from the Internet in a JSON file form, and storing the data into a MongoDB database;
1.2) in the invention, the original database uses a MongoDB Replica Set + Sharding cluster to implement distributed storage. In this cluster mode the database is composed of three nodes; the original database architecture is shown in FIG. 2. Viewed vertically, the three nodes act as three servers, and each server runs a routing process, a configuration server process and the corresponding shards. When a store or read operation is issued against the database, the routing process receives the instruction sent by the client and forwards the request to the corresponding shard, while the configuration server is responsible for storing the metadata of the database. Viewed horizontally, the invention uses three shards, and each shard uses a Replica Set to form a primary, standby and arbiter arrangement across the three nodes.
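The forwarding step performed by the routing process can be illustrated with a toy router that maps a shard key value to one of three shards. This is a deliberately simplified sketch: hashing the key and taking it modulo the shard count is an assumption for illustration and is not how MongoDB itself assigns data to shards; the shard names are hypothetical.

```python
import hashlib

SHARDS = ["shard1", "shard2", "shard3"]  # three shards, as in the text


def route(shard_key_value: str) -> str:
    """Pick the shard that should receive a request for this key value."""
    digest = hashlib.md5(shard_key_value.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


target = route("entity-E001")
```

The property that matters is determinism: the same key value always reaches the same shard, so reads find what earlier writes stored.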
2) Converting original text data into relational data through a data integration method and storing the relational data into a relational database;
the method specifically comprises the following steps:
2.1) converting original text data into relational data through data integration methods such as event extraction, entity linking, incomplete data filling and the like, and storing the relational data into a MySQL database;
2.2) in the invention, the relational database uses MySQL Cluster to implement its distributed storage model. In this cluster mode the database is composed of four nodes (one management node, two data nodes and one application node); the relational database architecture is shown in FIG. 3. In the relational database model provided by the invention, a client performs basic database operations by connecting to the application node, and data is stored in structured form. After the client's operation completes, the two data nodes automatically replicate the same data synchronously to keep the data safe. The management node can monitor the state of the other nodes at any time, and can add and configure new nodes.
3) Extracting elements with specific relations in data in a relational database and relations among the elements, storing the elements into a graph database and displaying the elements;
the method specifically comprises the following steps:
3.1) extracting the elements with specific relations in the relational database and the relations between the elements with specific relations in the relational database, and storing the elements with specific relations in the HBase;
3.2) part of the data in HBase is exported with Hive and stored in Neo4j to construct a knowledge graph that can meet different query requirements. A mapping between HBase and Hive is established, the HBase data is restored to relational-style data, and the relationships in the data are established through Neo4j;
3.3) for visual display with Neo4j, the relevant data is extracted from HBase and stored in a Neo4j graph. The graph database data display flow is shown in FIG. 4. The graph database uses a four-node cluster to implement distributed storage, with HBase storing the underlying graph data.
4) Storing the original data and the compressed multimedia data into a chained database;
the flow chart of the chained database in step 4) is shown in fig. 5:
In the alliance chain, the JSON text file is parsed, and its attributes are stored in the corresponding fields of the corresponding table in the MySQL database. In the private chain, the HDFS distributed file system is used to store the original detailed content of events. Each event corresponds to text files, pictures, videos and the like in the local file system; these are packaged and compressed into a compressed package, a hash value is then calculated over the compressed package, the hash value is stored in the corresponding hash field in MySQL, and the compressed package is uploaded to the HDFS distributed file system.
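The package-then-hash step of the private chain can be sketched as follows. The patent does not name the archive format or the hash algorithm, so ZIP and SHA-256 are assumptions here, as are the function names `package_event` and `verify_package`. Writing the digest to MySQL and uploading the archive to HDFS are left out.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def package_event(event_dir: Path, archive_path: Path) -> str:
    """Zip every file under event_dir and return the archive's SHA-256 hex digest.

    archive_path must lie outside event_dir, otherwise the archive would be
    picked up as one of its own members.
    """
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(event_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the event directory.
                zf.write(path, arcname=path.relative_to(event_dir).as_posix())
    digest = hashlib.sha256()
    with archive_path.open("rb") as fh:
        # Read in 1 MiB chunks so large video archives are not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(archive_path: Path, expected_hash: str) -> bool:
    """Recompute the digest of a stored archive and compare it with the saved one."""
    return hashlib.sha256(archive_path.read_bytes()).hexdigest() == expected_hash


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        event_dir = root / "event_0001"  # hypothetical event folder
        event_dir.mkdir()
        (event_dir / "report.txt").write_text("event description", encoding="utf-8")
        archive = root / "event_0001.zip"
        stored_hash = package_event(event_dir, archive)
        print(stored_hash, verify_package(archive, stored_hash))
```

Keeping the digest in the relational record and the archive in the file system means a later reader can detect whether the archive fetched back differs from the one that was hashed.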
5) A global index is constructed according to the data attributes shared by the sub-databases. The invention links the related data across the relational database, the graph database, the chained database, the multimedia database and the original database through specific index fields, and provides three functions: first, a query function over related data; second, a tracing function from the other databases back to the original data in the original database, realized through the construction of the index; third, after the video, audio and image data in the multimedia data are converted into text data, a copy management function over the multiple copies of the text data stored in the relational database, the original database and the multimedia database. Through the construction of the multi-modal global index structure, the linking and tracing of related data are realized. A flow chart of the global index structure is shown in Fig. 6:
5.1) in the relational database, an entity basic information table contains entity basic information attributes such as entity ID, entity name, object ID and the like, and the entity of the relational database is linked to the entity corresponding to the original data in the JSON format in the original database through the entity ID attribute, so that the tracing from the relational data to the original data is realized;
5.2) by using the entity ID attribute in each entity service data table in the relational database as a foreign key, referring to the basic information data of the entity in the entity basic information table, realizing the correlation query function from the basic information data to the service data;
5.3) through the entity ID attribute of the entity information table in the relational database, linking to the multimedia index table of the corresponding entity, which is stored in the relational database, wherein the multimedia index table contains data such as the storage node information, paths, file names and extension names of the video, audio, image and text data, thereby realizing associated modification and deletion from the relational database to the multimedia database;
5.4) through the multimedia index table, the attribute information of the multimedia files (video, audio, image and text data), such as storage node information, path, file name and extension name, is combined and used together, so that the multimedia files stored on each node can be queried through the multimedia index table;
5.5) in the chained database, taking the entity name as the keyword and storing the event information of the entity in the chain structure of an alliance chain and a private chain. The relational data is linked to the chained database through the entity name in the entity basic information table of the relational database, so that the data of the entity in the chained database can be queried;
5.6) in the graph database, taking the entity name as the keyword, and storing and displaying the link relations among entities, with the entity as the central node, by constructing entity information triples. The relational data is linked to the graph database through the entity name in the entity basic information table of the relational database, so that the data of the entity in the graph database and the association relations among entities can be queried;
5.7) in the data conversion provided by the invention, the text data converted from the video, audio and image data is stored in the multimedia database; in addition, in order to provide richer data interfaces to the outside, the text data is also stored as original data in the original database, and is associated with the JSON format file in the original database through the text ID, so that the text data in the multimedia database is linked to the original data in the original database.
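The entity-ID links described in steps 5.1) to 5.4) can be sketched with an in-memory SQLite database standing in for the MySQL Cluster relational database. All table names, column names and sample rows below are hypothetical; the patent only names the attributes (entity ID, entity name, storage node, path, file name, extension name) and not a schema.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE entity_basic_info (
    entity_id   INTEGER PRIMARY KEY,
    entity_name TEXT NOT NULL UNIQUE
);
CREATE TABLE entity_business (
    record_id INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL
        REFERENCES entity_basic_info(entity_id) ON DELETE CASCADE,
    detail    TEXT
);
CREATE TABLE multimedia_index (
    file_id      INTEGER PRIMARY KEY,
    entity_id    INTEGER NOT NULL
        REFERENCES entity_basic_info(entity_id) ON DELETE CASCADE,
    storage_node TEXT,
    path         TEXT,
    file_name    TEXT,
    extension    TEXT
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the three linked tables and load one sample entity."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO entity_basic_info VALUES (1, 'Example Entity')")
    conn.execute("INSERT INTO entity_business VALUES (1, 1, 'sample business record')")
    conn.execute(
        "INSERT INTO multimedia_index VALUES (1, 1, 'node-1', '/media/video', 'clip', 'mp4')"
    )
    conn.commit()
    return conn


def locate_media(conn: sqlite3.Connection, entity_name: str) -> List[str]:
    """Resolve an entity name to the storage locations of its multimedia files."""
    rows = conn.execute(
        "SELECT m.storage_node, m.path, m.file_name, m.extension "
        "FROM entity_basic_info e "
        "JOIN multimedia_index m ON m.entity_id = e.entity_id "
        "WHERE e.entity_name = ? ORDER BY m.file_id",
        (entity_name,),
    ).fetchall()
    return [f"{node}:{path}/{name}.{ext}" for node, path, name, ext in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(locate_media(conn, "Example Entity"))
    # Deleting the entity removes its index rows too (associated deletion).
    conn.execute("DELETE FROM entity_basic_info WHERE entity_id = 1")
    print(locate_media(conn, "Example Entity"))
```

The cascade on the foreign key is one way to obtain the associated deletion mentioned in step 5.3); removing the multimedia files themselves from their storage nodes would still be a separate operation.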

Claims (10)

1. A multi-modal heterogeneous data storage method based on data quality is characterized by comprising the following steps:
1) carrying out distributed storage on original data derived from internet data in a key-value format in an original database; the original data comprises original text data and original multimedia data;
2) performing data modeling on original multimedia data, and performing distributed storage in a file database in a file form;
3) performing data conversion on the original text data through the data integration methods of event extraction, entity linking and incomplete data filling, so as to convert the original text data into relational data, modeling the relational data, and constructing a relational database;
4) modeling entities having incidence relations among the entities in a relational database and relations among the entities to construct a graph database;
5) the activity data of each entity in the relational database has typical time sequence characteristics, and the activity data is subjected to data modeling in a chain structure to construct a chain database;
6) converting video data and audio data in the multimedia data into text data by a data conversion method, storing the text data in a multimedia database in a file form, and storing the text data in an original database in a key-value format;
7) according to the data quality, database optimization is carried out on different distributed databases, entity data of each sub-database are linked through constructing a multi-level index structure, and the consistency of the data is guaranteed;
8) constructing a log file maintenance system of the multi-modal database for the data integration methods and each sub-database.
2. The method for multi-modal heterogeneous data storage based on data quality as claimed in claim 1, wherein in step 1), the specific method for performing distributed storage on the original text data in the original database in a key-value format is as follows:
2.1) using the MongoDB database system as a database system for key-value data storage;
2.2) using the MongoDB Replica Set in MongoDB as the distributed storage solution of the original database.
3. The multi-modal heterogeneous data storage method based on data quality as claimed in claim 1, wherein in the step 2), the multimedia data is stored into the distributed file system according to the data source type, that is, the video, audio or picture data is stored according to the storage node corresponding to the data source.
4. The multi-modal heterogeneous data storage method based on data quality according to claim 1, wherein in the step 3), a specific method for constructing a relational database is as follows:
3.1) converting original text data into relational data by three data integration methods of relational extraction, entity linkage and incomplete data filling;
3.2) using the MySQL database system as a database system of the relational data storage;
3.3) using MySQL Cluster as the distributed storage solution of the relational database.
5. The method for storing multimodal heterogeneous data based on data quality according to claim 1, wherein in the step 4), a concrete method for constructing a graph database is as follows:
4.1) using HBase as the storage scheme for the underlying graph data;
4.2) using Neo4j as the visual query scheme of the graph database;
4.3) modeling the entities in the relational database through the relations among the entities.
6. The multi-modal heterogeneous data storage method based on data quality as claimed in claim 1, wherein in the step 5), a specific method for constructing a chain database is as follows:
the chain database uses MySQL and HDFS as its data storage schemes, and the data is stored in the chain database; the alliance chain stores structured data by adopting MySQL, and the private chain stores semi-structured and unstructured data by adopting HDFS.
7. The method for multimodal heterogeneous data storage based on data quality as claimed in claim 1, wherein the specific method for storing multimedia data in step 6) is as follows:
6.1) relevant multimedia data including video, audio, images, texts and the like are crawled from the Internet according to multimedia data sources; designing a multimedia data index table according to data attributes, positioning the specific position of multimedia data through the index table according to attributes such as a data source, a data type, a storage node, a storage path, a file name and the like, and storing the index table into a relational database in a structured data form;
6.2) designing a data conversion storage model, and converting video data into text data through a process of 'video- > audio- > text'; converting the audio data into text data through an audio-text process; converting the image data into text data through an image-text process; and stored in the raw database and the multimedia database.
8. The multi-modal heterogeneous data storage method based on data quality as claimed in claim 1, wherein in the step 7), the multi-level index structure is composed of a global index and a local index; the dynamic maintenance process is as follows:
the global index constructs a primary/foreign key index among the original database, the relational database, the graph database, the chain database and the multimedia database, effectively linking all sub-databases to realize query operations over related data; the local index constructs an independent index structure within each database to realize local indexing of the content of each sub-database;
each sub-database index module is as follows: the original database local index module establishes an index for each key of the data, sets a shard key for the index field, and improves the query efficiency through the index;
the relational database local index module is used for establishing indexes for common fields in the data, for example, the common fields of certain entity data are entity names, and the query efficiency is improved through the indexes;
the graph database local index module is used for constructing a secondary index through Apache Phoenix, and establishing a mapping between Phoenix and the HBase table, so that the HBase table can be operated on from Phoenix, and the query efficiency is improved through the index;
the local index module of the chain database mainly comprises a name index part, a sorting establishment part, a dynamic incremental update part and the like: name indexes are established according to specific fields, the alliance chain is established in time order, and the data is updated dynamically and incrementally;
the local index module of the multimedia database constructs a local index structure of the basic information of the multimedia data, including information such as storage node information, path, file name, extension name and the like of the data, and stores the local index structure in the relational database, and the local index module can be positioned to the specific position of the multimedia data through an index table according to attributes such as a data source, a data type, a storage node, a storage path, a file name and the like.
9. The method for multi-modal heterogeneous data storage based on data quality as claimed in claim 1, wherein in the step 8), the specific method for maintaining the log file is as follows:
the log file maintenance is divided into log file maintenance of a multi-mode database and log file maintenance of data integration; the log file maintenance of the multi-mode database comprises log file maintenance of a relational database, log file maintenance of a graph database, log file maintenance of a chained database and log file maintenance of an original database; the log file maintenance of data integration comprises log file maintenance of event extraction, log file maintenance of entity link, log file maintenance of incomplete data filling and log file maintenance of data consistency.
10. A multimodal heterogeneous data storage system based on data quality, comprising:
an original database: used for storing original data derived from internet data, the storage format being the key-value format;
a relational database: used for converting the key-value data in the original database into relational data, and for modeling and storing the relational data;
a graph database: used for modeling as a graph and storing the related entities in the relational database and the relations among them;
a multimedia database: used for storing the video data and audio data that have been converted into text format;
a chain database: used for storing, in a chain structure, the activity data of each entity in the relational database.
CN202210281261.XA 2022-03-22 2022-03-22 Multi-mode heterogeneous data storage method and system based on data quality Pending CN114661832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210281261.XA CN114661832A (en) 2022-03-22 2022-03-22 Multi-mode heterogeneous data storage method and system based on data quality

Publications (1)

Publication Number Publication Date
CN114661832A true CN114661832A (en) 2022-06-24

Family

ID=82031071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210281261.XA Pending CN114661832A (en) 2022-03-22 2022-03-22 Multi-mode heterogeneous data storage method and system based on data quality

Country Status (1)

Country Link
CN (1) CN114661832A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910126A (en) * 2023-09-14 2023-10-20 国网山东省电力公司营销服务中心(计量中心) System and method for conveniently storing, classifying and inquiring massive daily clear electric quantity data
CN116910126B (en) * 2023-09-14 2023-11-24 国网山东省电力公司营销服务中心(计量中心) System and method for conveniently storing, classifying and inquiring massive daily clear electric quantity data
CN117290457A (en) * 2023-11-22 2023-12-26 湖南省第一测绘院 Multi-mode data management system for geographic entity, database and time sequence management method
CN117290457B (en) * 2023-11-22 2024-03-08 湖南省第一测绘院 Multi-mode data management system for geographic entity, database and time sequence management method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination