CN115587082A - Multi-modal data storage management method and system - Google Patents

Info

Publication number
CN115587082A
Authority
CN
China
Prior art keywords
metadata
data
file
client
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211240474.4A
Other languages
Chinese (zh)
Inventor
张静逸
江波
张浩博
雷旸
王梦童
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 32 Research Institute
Original Assignee
CETC 32 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 32 Research Institute
Priority to CN202211240474.4A
Publication of CN115587082A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for multi-modal data storage management, comprising the following steps: describing multi-source heterogeneous data uniformly; obtaining metadata with a unified structure, extracting the features of each kind of heterogeneous data, analyzing and storing them, and linking the features of the multi-source heterogeneous data; establishing an efficient access mechanism on the client/server architecture of a distributed file system, and designing a client cache layer and a server cache layer to provide two-level access acceleration; and analyzing, prefetching, and caching file metadata to reduce the number of metadata access requests in the system, thereby optimizing the metadata access flow in the distributed file system and improving metadata access efficiency. The invention can manage multi-source heterogeneous data while effectively accumulating historical data, achieves a unified description of multi-source heterogeneous data, provides integrated data storage and access services for multi-source data, and thereby helps optimize the comprehensive data governance system.

Description

Method and system for multi-modal data storage management
Technical Field
The invention relates to the technical field of multi-modal data storage, in particular to a multi-modal data storage management method and system.
Background
The convergence of information technology with the economy and society has caused data volumes to grow rapidly, and data has become a fundamental national strategic resource. Multi-modal data storage is the basis for upper-layer services such as data fusion management and analysis. Because data in practical applications is massive, complex, and multi-source heterogeneous, research into theoretical methods and key technologies, such as storage models for massive, uncertain, heterogeneous data, is a necessary precondition for analyzing, sharing, and exploiting that data. Organizing and managing multi-source heterogeneous data is an important research topic in the big data era. On the one hand, as user data keeps increasing, data acquisition channels keep multiplying and the scale of data grows without bound. On the other hand, information carriers are diversifying, from text to graphics, images, and sound, and from structured to semi-structured and unstructured data, so the number of data types also grows without bound.
As information systems develop toward the intelligent stage, a new challenge has emerged: managing massive multi-source heterogeneous data in a diversified and standardized way, so that high-quality data can support accurate management and rapid decision-making. Much research has been conducted in China and abroad on storage structures and management models for massive multi-modal data. Research abroad on integrated systems for multi-modal data has developed rapidly, and several representative integrated systems have been built.
James Dixon, chief technology officer of Pentaho, proposed the data lake as a mechanism for storing, processing, and sharing big data. A data lake is a storage architecture that keeps data in its original format. It stores all structured and unstructured data in a centralized repository and supports distributed storage of massive structured, semi-structured, and unstructured data. It allows data to scale to any size while saving the time otherwise spent defining data structures, schemas, and transformations. Given the multi-source heterogeneous nature of big data in the relevant fields, building a data lake to store multi-modal data allows projects to proceed quickly.
Leading foreign cloud computing and artificial intelligence companies such as Amazon and Microsoft have introduced AWS Lake Formation and Azure Data Lake, respectively, in response to the technical requirements of data lakes. Amazon Simple Storage Service (Amazon S3) is a high-performance object storage service suitable for structured and unstructured data. Data stored in Amazon S3 is designed for 99.999999999% durability, and the service can be used to build a data lake. A data lake built on Amazon S3 can use native AWS services to run big data analytics, artificial intelligence (AI), machine learning (ML), high-performance computing (HPC), and media data processing applications to obtain key information from multi-modal datasets.
The Multibase system, developed by Computer Corporation of America (CCA), is an integrated heterogeneous distributed database system for unified access to multi-source, heterogeneous, distributed databases. The system hides the differences between database management systems, languages, and data models, gives users a uniform global schema and a single high-level query language, and lets each local database retain autonomy over its updates.
IBM developed the Garlic system, originally designed as a heterogeneous database system able to integrate data held in different database systems and in various non-database data servers. Such integration must preserve the independence of the data servers without creating copies of their data. Because most data is naturally modeled as objects, the system presents an object-oriented schema to applications, supports object queries, creates query fragments and sends them to the appropriate data servers, and assembles the query results for return to the application.
TSIMMIS, developed at Stanford University, is a system for integrating heterogeneous information sources. It targets mainly structured and unstructured data: it extracts attribute component objects from unstructured data, converts the information into a common object model, combines information from multiple sources, and allows information to be browsed and constraints to be managed across heterogeneous sites. Its advantages are that it adapts to any data source and that different data can be handled by different programs.
With socio-economic development and the application of various big data technologies, multi-source organizational data has become an important part of socio-economic development. Making good use of such data helps implement the national directive that data is a new factor of production and promotes the development of China's data management and service industry. Research in China on digital storage technology and multi-modal big data fusion started late. However, with the emphasis on independently controllable software and hardware, growing domestic demand for fast multi-modal data storage systems, and the emphasis on informatized, modernized, and intelligent development, many enterprises and research institutions in various fields have made great progress in developing the related technologies. To address multi-modal data storage, analysis, and management, Huawei Cloud, Alibaba Cloud (Aliyun), and Tencent Cloud in China have each released their own data lakes and data storage services.
CoXML V1.0, developed by Beijing University, is an information application system based on the Extensible Markup Language (XML) that supports the collection, management, and sharing of data. The system builds a collaborative query-response framework on top of a relational database and implements a query-response mechanism between the system and other databases and data sources. It can serve as a general platform, based on the collaborative query-response mechanism, for integrating, managing, and sharing massive multi-source heterogeneous data.
Nanjing NARI (Nanrui) Group has built a Hadoop-based storage technology for multi-source heterogeneous power distribution and consumption data. Under this technology the data is more standardized and is stored in a distributed manner; the storage layer comprises two main parts, data preprocessing and NoSQL. Data preprocessing converts the different structured data schemas into a uniform form, the uniform standardized schema makes multi-modal data storage and retrieval easier, and NoSQL stores the data in a distributed manner.
In view of the related technologies above, the inventors consider that the effective management and storage of massive multi-source heterogeneous data remains a problem, so a new technical solution is needed to address it.
Disclosure of Invention
In view of the defects in the prior art, the present invention aims to provide a method and a system for multimodal data storage management.
According to the invention, the method for multimodal data storage management comprises the following steps:
step S1: describing multi-source heterogeneous data uniformly, and standardizing and driving the various data access processes on the basis of metadata;
step S2: from the unified description of the multi-source heterogeneous data, obtaining metadata with a unified structure, extracting the features of each kind of heterogeneous data, analyzing and storing them, linking the features of the multi-source heterogeneous data, and performing semantic analysis and internal data integration across the heterogeneous data;
step S3: establishing an efficient access mechanism on the client/server architecture of a distributed file system, and designing a client cache layer and a server cache layer to provide two-level access acceleration; by analyzing, prefetching, and caching file metadata, the number of metadata access requests in the system is reduced, which optimizes the metadata access flow in the distributed file system and improves metadata access efficiency.
Preferably, the step S1 includes the steps of:
step S1.1: studying template-based extraction of multi-source data, combining rule-based and machine-learning-based template extraction methods, normalizing the metadata of the multi-source heterogeneous data and loading it into the repository, with particular attention to the unified description of unstructured data;
step S1.2: generating, according to naming rules, id fields for audio/video and images to serve as data management identifiers, inserting the id fields into the extended attributes of the metadata, and using the metadata to give heterogeneous data sources a unified logical representation without changing the storage structure of the original data;
step S1.3: storing all data as objects in a flat namespace;
step S1.4: storing the related information in the extended attribute space of the metadata.
Preferably, the object in step S1.3 comprises an identifier (id), binary data, and metadata consisting of name/value pairs.
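The object model of steps S1.2 to S1.4 can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the class names, the in-memory dictionary standing in for the flat namespace, and the naming rule (a modality prefix plus a truncated SHA-256 content hash). The patent does not specify its naming rules.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: an id, the raw bytes, and name/value metadata."""
    object_id: str
    data: bytes
    metadata: dict[str, str] = field(default_factory=dict)


def make_object_id(modality: str, payload: bytes) -> str:
    """Hypothetical naming rule: modality prefix plus a short content hash."""
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"{modality}-{digest}"


class FlatObjectStore:
    """Flat namespace: a single id -> object mapping, with no directory tree."""

    def __init__(self) -> None:
        self._objects: dict[str, StoredObject] = {}

    def put(self, modality: str, payload: bytes, **attrs: str) -> str:
        object_id = make_object_id(modality, payload)
        # The id and related information go into the extended attributes;
        # the payload itself is stored unchanged.
        metadata = {"id": object_id, "modality": modality, **attrs}
        self._objects[object_id] = StoredObject(object_id, payload, metadata)
        return object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]


store = FlatObjectStore()
image_id = store.put("image", b"\x89PNG...", source="camera-01")
obj = store.get(image_id)
```

In this sketch the original bytes are never rewritten; only the name/value metadata carries the unified description, which mirrors the statement that the storage structure of the original data is left unchanged.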
Preferably, step S2 extracts the relevant features of files that have been read and performs feature analysis on files that are about to be read. Characters in the extracted features are converted into numeric ids according to a dictionary with a fixed order, and the ids are concatenated in a set order to obtain a file feature vector that can be used in computation. The feature vectors of the files already read serve as the reference for deciding whether to prefetch; the feature vectors obtained for the candidate files are compared with these reference vectors to obtain a file association degree, which determines whether each file's metadata is prefetched; finally, the resulting sequence of metadata to prefetch is output.
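The encoding and comparison just described might look like the following Python sketch. The feature names, the fixed field width, the association measure (share of non-padding positions whose ids agree), and the 0.8 threshold are all assumptions chosen for illustration; the patent does not state which features or which measure it uses.

```python
import string

# Fixed-order dictionary: each character maps to a stable numeric id (0 = padding).
ALPHABET = string.ascii_lowercase + string.digits + "/._-"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

FEATURE_ORDER = ("directory", "extension", "owner")  # hypothetical feature names
FIELD_WIDTH = 8  # hypothetical fixed width per feature


def encode_feature(value: str) -> list[int]:
    """Map characters to ids, truncating or zero-padding to FIELD_WIDTH."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in value.lower()[:FIELD_WIDTH]]
    return ids + [0] * (FIELD_WIDTH - len(ids))


def file_feature_vector(features: dict[str, str]) -> list[int]:
    """Concatenate the encoded features in the set order."""
    vector: list[int] = []
    for name in FEATURE_ORDER:
        vector.extend(encode_feature(features.get(name, "")))
    return vector


def association_degree(a: list[int], b: list[int]) -> float:
    """Share of non-padding positions whose ids agree (a stand-in measure)."""
    compared = [(x, y) for x, y in zip(a, b) if x or y]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def select_prefetch(reference, candidates, threshold=0.8):
    """Return the names of candidates whose association degree meets the threshold."""
    ref_vec = file_feature_vector(reference)
    return [
        name
        for name, feats in candidates.items()
        if association_degree(ref_vec, file_feature_vector(feats)) >= threshold
    ]


reference = {"directory": "/video", "extension": "mp4", "owner": "alice"}
candidates = {
    "clip2.mp4": {"directory": "/video", "extension": "mp4", "owner": "alice"},
    "notes.txt": {"directory": "/docs", "extension": "txt", "owner": "bob"},
}
prefetch_sequence = select_prefetch(reference, candidates)
```

Here `clip2.mp4` shares every feature with the reference file and is selected, while `notes.txt` agrees only on the leading `/` and is not.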
Preferably, in step S3, when a user initiates a file access operation, the read request for the file's metadata first reaches the client through the file system. The client looks up the target file's metadata in its local cache layer. On a hit, the client serves the operation's subsequent metadata requests from the local cache and returns the corresponding file metadata to the upper layer. On a miss, the client forwards the read request over the network to the metadata servers (MDSs). When the request reaches one of the MDSs, the metadata prefetching module on that server looks up, in the server cache layer, the metadata of the target file and of its associated files according to the result given by the association analysis model, then packs all the metadata found and returns it to the client, so that the client need not request the other associated metadata from the MDS in subsequent metadata accesses. After processing the metadata of these files, the client interacts with the corresponding OSD using the data index information in the metadata and completes the file read.
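The read path above can be sketched in Python as two cooperating caches. The class names, the dictionary-backed stores, and the precomputed `related` table (standing in for the output of the association analysis model) are hypothetical; eviction, consistency, and the network layer are left out.

```python
class MetadataServer:
    """Stand-in for an MDS with a server cache layer and an association table."""

    def __init__(self, backing_store, related):
        self.backing_store = backing_store  # path -> metadata dict
        self.related = related              # path -> paths judged associated
        self.cache = {}                     # server cache layer
        self.requests = 0                   # round trips received from clients

    def _load(self, path):
        # Fill the server cache from the backing store on first use.
        if path not in self.cache:
            self.cache[path] = self.backing_store[path]
        return self.cache[path]

    def fetch_with_prefetch(self, path):
        """Return metadata for the target and its associated files in one reply."""
        self.requests += 1
        bundle = {path: self._load(path)}
        for other in self.related.get(path, ()):
            bundle[other] = self._load(other)
        return bundle


class Client:
    def __init__(self, mds):
        self.mds = mds
        self.cache = {}  # client cache layer

    def get_metadata(self, path):
        if path in self.cache:                       # hit: served locally
            return self.cache[path]
        bundle = self.mds.fetch_with_prefetch(path)  # miss: one round trip
        self.cache.update(bundle)                    # keep prefetched entries too
        return self.cache[path]


backing = {
    "/video/a.mp4": {"osd": "osd-1", "size": 10},
    "/video/a.srt": {"osd": "osd-2", "size": 1},
}
mds = MetadataServer(backing, related={"/video/a.mp4": ["/video/a.srt"]})
client = Client(mds)
first = client.get_metadata("/video/a.mp4")
second = client.get_metadata("/video/a.srt")
```

The second lookup is answered from the client cache because the server packed the associated file's metadata into the first reply; the `osd` field plays the role of the data index information the client would then use to reach the OSD.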
The invention also provides a system for multi-modal data storage management, which comprises the following modules:
module M1: describing multi-source heterogeneous data uniformly, and standardizing and driving the various data access processes on the basis of metadata;
module M2: from the unified description of the multi-source heterogeneous data, obtaining metadata with a unified structure, extracting the features of each kind of heterogeneous data, analyzing and storing them, linking the features of the multi-source heterogeneous data, and performing semantic analysis and internal data integration across the heterogeneous data;
module M3: establishing an efficient access mechanism on the client/server architecture of a distributed file system, and designing a client cache layer and a server cache layer to provide two-level access acceleration; by analyzing, prefetching, and caching file metadata, the number of metadata access requests in the system is reduced, which optimizes the metadata access flow in the distributed file system and improves metadata access efficiency.
Preferably, the module M1 comprises the following modules:
module M1.1: studying template-based extraction of multi-source data, combining rule-based and machine-learning-based template extraction methods, normalizing the metadata of the multi-source heterogeneous data and loading it into the repository, with particular attention to the unified description of unstructured data;
module M1.2: generating, according to naming rules, id fields for audio/video and images to serve as data management identifiers, inserting the id fields into the extended attributes of the metadata, and using the metadata to give heterogeneous data sources a unified logical representation without changing the storage structure of the original data;
module M1.3: storing all data as objects in a flat namespace;
module M1.4: storing the related information in the extended attribute space of the metadata.
Preferably, the object in module M1.3 comprises an identifier (id), binary data, and metadata consisting of name/value pairs.
Preferably, module M2 extracts the relevant features of files that have been read and performs feature analysis on files that are about to be read. Characters in the extracted features are converted into numeric ids according to a dictionary with a fixed order, and the ids are concatenated in a set order to obtain a file feature vector that can be used in computation. The feature vectors of the files already read serve as the reference for deciding whether to prefetch; the feature vectors obtained for the candidate files are compared with these reference vectors to obtain a file association degree, which determines whether each file's metadata is prefetched; finally, the resulting sequence of metadata to prefetch is output.
Preferably, in module M3, when a user initiates a file access operation, the read request for the file's metadata first reaches the client through the file system. The client looks up the target file's metadata in its local cache layer. On a hit, the client serves the operation's subsequent metadata requests from the local cache and returns the corresponding file metadata to the upper layer. On a miss, the client forwards the read request over the network to the metadata servers (MDSs). When the request reaches one of the MDSs, the metadata prefetching module on that server looks up, in the server cache layer, the metadata of the target file and of its associated files according to the result given by the association analysis model, then packs all the metadata found and returns it to the client, so that the client need not request the other associated metadata from the MDS in subsequent metadata accesses. After processing the metadata of these files, the client interacts with the corresponding OSD using the data index information in the metadata and completes the file read.
Compared with the prior art, the invention has the following beneficial effects:
1. Taking massive multi-source heterogeneous data as its research object, the invention provides a distributed intelligent storage technology for multi-modal data and, based on it, designs a multi-modal data optimization management system. The system can manage multi-source heterogeneous data while effectively accumulating historical data, achieves a unified description of multi-source heterogeneous data, provides integrated data storage and access services for multi-source data, and thereby helps optimize the comprehensive data governance system;
2. The invention applies metadata normalization to the multi-source heterogeneous data and loads the normalized metadata into the repository, so that the information contained in the heterogeneous data is better understood, data exchange and unification across multi-source heterogeneous systems are achieved, and data sharing is facilitated;
3. The invention uses association analysis to learn, from file access features, the hidden relationships between the files whose features have been extracted and the files to be analyzed, extracts the file association features and merges them into a feature vector, and then prefetches metadata based on that vector. Prefetching and caching file metadata optimizes the metadata access flow for associated files and improves metadata access performance;
4. The invention uses an efficient access method with a client cache layer and a server cache layer to provide two-level access acceleration. Analyzing, prefetching, and caching file metadata markedly reduces the number of metadata access requests in the system, optimizes the metadata access flow in the distributed file system, and improves metadata access efficiency.
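To make the claimed reduction in metadata requests concrete, the toy Python sketch below counts server round trips for one access trace with and without prefetching of associated metadata. The trace, the association table, and the unbounded client cache are invented for illustration and are not measurements from the patent.

```python
def count_mds_requests(trace, related, prefetch):
    """Count server round trips for a trace of metadata reads."""
    client_cache = set()
    requests = 0
    for path in trace:
        if path in client_cache:
            continue                 # served from the client cache
        requests += 1                # miss: one request to the metadata server
        client_cache.add(path)
        if prefetch:
            # The reply also carries metadata of associated files.
            client_cache.update(related.get(path, ()))
    return requests


related = {"a": ["b", "c"], "d": ["e"]}   # hypothetical association results
trace = ["a", "b", "c", "a", "d", "e"]    # hypothetical access order
without_prefetch = count_mds_requests(trace, related, prefetch=False)
with_prefetch = count_mds_requests(trace, related, prefetch=True)
```

On this trace the request count drops from 5 to 2. The actual benefit depends on how well the association analysis predicts the files accessed next.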
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a diagram of a multimodal data optimization management system architecture according to the present invention;
FIG. 2 is a diagram of a metadata structure according to the present invention;
FIG. 3 is a flow chart of metadata prefetching according to the present invention;
FIG. 4 is a metadata access flow diagram of the present invention.
Detailed Description
The present invention is described in detail below with reference to specific examples. The following examples will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the concept of the invention, all of which fall within the scope of protection of the invention.
Example 1:
The method for multi-modal data storage management provided by the invention comprises the following steps:
step S1: describing multi-source heterogeneous data uniformly, and standardizing and driving the various data access processes on the basis of metadata;
step S1.1: studying template-based extraction of multi-source data, combining rule-based and machine-learning-based template extraction methods, normalizing the metadata of the multi-source heterogeneous data and loading it into the repository, with particular attention to the unified description of unstructured data;
step S1.2: generating, according to naming rules, id fields for audio, video, and images to serve as data management identifiers, inserting the id fields into the extended attributes of the metadata, and using the metadata to give heterogeneous data sources a unified logical representation without changing the storage structure of the original data;
step S1.3: storing all data as objects in a flat namespace; an object contains an identifier (id), binary data, and metadata consisting of name/value pairs;
step S1.4: storing the related information in the extended attribute space of the metadata.
step S2: from the unified description of the multi-source heterogeneous data, obtaining metadata with a unified structure, extracting the features of each kind of heterogeneous data, analyzing and storing them, linking the features of the multi-source heterogeneous data, and performing semantic analysis and internal data integration across the heterogeneous data. The relevant features of files that have been read are extracted, and feature analysis is performed on files that are about to be read. Characters in the extracted features are converted into numeric ids according to a dictionary with a fixed order, and the ids are concatenated in a set order to obtain a file feature vector that can be used in computation. The feature vectors of the files already read serve as the reference for deciding whether to prefetch; the feature vectors obtained for the candidate files are compared with these reference vectors to obtain a file association degree, which determines whether each file's metadata is prefetched; finally, the resulting sequence of metadata to prefetch is output.
step S3: establishing an efficient access mechanism on the client/server architecture of a distributed file system, and designing a client cache layer and a server cache layer to provide two-level access acceleration; by analyzing, prefetching, and caching file metadata, the number of metadata access requests in the system is reduced, which optimizes the metadata access flow in the distributed file system and improves metadata access efficiency. When a user initiates a file access operation, the read request for the file's metadata first reaches the client through the file system. The client looks up the target file's metadata in its local cache layer. On a hit, the client serves the operation's subsequent metadata requests from the local cache and returns the corresponding file metadata to the upper layer. On a miss, the client forwards the read request over the network to the metadata servers (MDSs). When the request reaches one of the MDSs, the metadata prefetching module on that server looks up, in the server cache layer, the metadata of the target file and of its associated files according to the result given by the association analysis model, then packs all the metadata found and returns it to the client, so that the client need not request the other associated metadata from the MDS in subsequent metadata accesses. After processing the metadata of these files, the client interacts with the corresponding OSD using the data index information in the metadata and completes the file read.
Example 2:
Example 2 is a preferred example of Example 1 and describes the invention in more detail.
The invention also provides a system for multi-modal data storage management, which comprises the following modules:
module M1: describing multi-source heterogeneous data uniformly, and standardizing and driving the various data access processes on the basis of metadata;
module M1.1: studying template-based extraction of multi-source data, combining rule-based and machine-learning-based template extraction methods, normalizing the metadata of the multi-source heterogeneous data and loading it into the repository, with particular attention to the unified description of unstructured data;
module M1.2: generating, according to naming rules, id fields for audio/video and images to serve as data management identifiers, inserting the id fields into the extended attributes of the metadata, and using the metadata to give heterogeneous data sources a unified logical representation without changing the storage structure of the original data;
module M1.3: storing all data as objects in a flat namespace; an object contains an identifier (id), binary data, and metadata consisting of name/value pairs;
module M1.4: storing the related information in the extended attribute space of the metadata.
module M2: from the unified description of the multi-source heterogeneous data, obtaining metadata with a unified structure, extracting the features of each kind of heterogeneous data, analyzing and storing them, linking the features of the multi-source heterogeneous data, and performing semantic analysis and internal data integration across the heterogeneous data. The relevant features of files that have been read are extracted, and feature analysis is performed on files that are about to be read. Characters in the extracted features are converted into numeric ids according to a dictionary with a fixed order, and the ids are concatenated in a set order to obtain a file feature vector that can be used in computation. The feature vectors of the files already read serve as the reference for deciding whether to prefetch; the feature vectors obtained for the candidate files are compared with these reference vectors to obtain a file association degree, which determines whether each file's metadata is prefetched; finally, the resulting sequence of metadata to prefetch is output.
module M3: establishing an efficient access mechanism on the client/server architecture of a distributed file system, and designing a client cache layer and a server cache layer to provide two-level access acceleration; by analyzing, prefetching, and caching file metadata, the number of metadata access requests in the system is reduced, which optimizes the metadata access flow in the distributed file system and improves metadata access efficiency. When a user initiates a file access operation, the read request for the file's metadata first reaches the client through the file system. The client looks up the target file's metadata in its local cache layer. On a hit, the client serves the operation's subsequent metadata requests from the local cache and returns the corresponding file metadata to the upper layer. On a miss, the client forwards the read request over the network to the metadata servers (MDSs). When the request reaches one of the MDSs, the metadata prefetching module on that server looks up, in the server cache layer, the metadata of the target file and of its associated files according to the result given by the association analysis model, then packs all the metadata found and returns it to the client, so that the client need not request the other associated metadata from the MDS in subsequent metadata accesses. After processing the metadata of these files, the client interacts with the corresponding OSD using the data index information in the metadata and completes the file read.
Example 3:
Example 3 is a preferred example of Example 1 and describes the present invention in more detail.
The invention provides a distributed intelligent storage technology for multi-modal data and, based on it, a multi-modal data optimization management system that addresses the effective management and storage of massive multi-source heterogeneous data. The system supports consistent storage of massive multi-modal data and makes data access and storage efficient, safe and reliable, thereby meeting the large-scale, distributed storage requirements of cloud computing and big data applications for high performance, low latency, high availability, high scalability and high security. It advances the goal of mass data storage, turns data advantages into decision advantages, delivers practical improvements in data processing efficiency and capability, establishes a multi-modal data interworking system, and lays a solid foundation for constructing knowledge graphs.
Referring to FIG. 1, the invention provides a distributed intelligent storage technology for multi-modal data and, based on it, a multi-modal data optimization management system that describes multi-source heterogeneous data uniformly and provides integrated data storage and access services for multi-source data. Based on an analysis of the actual requirements of the business system combined with expert domain knowledge, the system provides three functions: metadata normalization processing, relevance analysis, and an efficient access mechanism. The overall architecture of the system is presented, its core modules are analyzed, and its data processing flow is laid out. Metadata normalization is applied to the multi-source heterogeneous data; analyzing the relevance of file metadata gives a better understanding of the information contained in the heterogeneous data; and prefetch caching is completed while accumulated historical data is written to storage. Together these unify the data of multi-source heterogeneous systems and let them interact, improve access efficiency, and facilitate data sharing. Multi-modal data is stored using three storage functions, namely block device storage, file system storage and object storage, which gives a hybrid storage mode that combines simple and complex data organization structures, offers rich data operation interfaces, and ensures storage performance. The technical framework and storage modes of the multi-modal hybrid storage system are laid out and a multi-modal data storage technology based on semantic analysis is studied, which provides a reference scheme for efficient storage of multi-source heterogeneous data and lays a foundation for correctly understanding, quickly processing and effectively using multi-modal data.
The metadata normalization processing module: this module describes multi-source heterogeneous data uniformly, standardizes and drives the various data access processes based on metadata, and provides integrated data access services for multi-source data. Templated extraction of multi-source data is studied, combining rule-based and various machine-learning-based templated extraction methods; metadata normalization and warehousing are performed on the multi-source heterogeneous data, with a focus on the uniform description of unstructured data. Id fields for audio, video and images are generated according to naming rules, serve as unique identifiers for data management, and are inserted into the extended attributes of the metadata. The metadata thus gives heterogeneous data sources a unified logical representation without changing the storage structure of the original data. This resolves the heterogeneity of the data sources, provides a unified basic structure for data integration, gives a better understanding of the information contained in the heterogeneous data, unifies the data of multi-source heterogeneous systems and lets them interact, and lays a foundation for subsequent data exchange, relevance analysis and file sharing.
Referring to FIG. 2, the distributed multi-modal data optimization management system stores all data as objects in a flat namespace; each object contains an id identifier, binary data, and metadata consisting of name/value pairs. Related information is stored in the extended attribute space of the metadata, which reduces the number of access requests sent to the metadata server and avoids modifying the metadata structure and interface functions. Because the normalization process does not change the storage structure of the raw data, accessing users do not need to know the specific details of, or the differences between, the various data sources.
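The object layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `StoredObject` and `FlatObjectStore`, the naming rule in `make_object_id` (modality prefix plus a short hash), and the `related` attribute key are hypothetical and are not specified by the patent.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object in the flat namespace: id, binary payload, name/value metadata."""
    object_id: str
    data: bytes
    metadata: dict = field(default_factory=dict)  # base name/value pairs
    xattrs: dict = field(default_factory=dict)    # extended attribute space


def make_object_id(modality: str, source: str, name: str) -> str:
    """Hypothetical naming rule: modality prefix plus a short hash of source and name."""
    digest = hashlib.sha256(f"{source}/{name}".encode("utf-8")).hexdigest()[:12]
    return f"{modality}-{digest}"


class FlatObjectStore:
    """Flat namespace: a single mapping keyed by object id, with no directory tree."""

    def __init__(self):
        self._objects = {}

    def put(self, modality, source, name, data, related_ids=()):
        object_id = make_object_id(modality, source, name)
        self._objects[object_id] = StoredObject(
            object_id=object_id,
            data=data,
            metadata={"name": name, "source": source, "modality": modality},
            # The id and the related-file information go into the extended
            # attributes, so the base metadata structure stays unchanged.
            xattrs={"id": object_id, "related": list(related_ids)},
        )
        return object_id

    def get(self, object_id):
        return self._objects[object_id]
```

For example, storing a video clip and then an audio track with `related_ids=[video_id]` lets a reader of the audio object's extended attributes find the video without a second lookup by name.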
The relevance analysis module: heterogeneous data from different sources originally do not interact, so they cannot be analyzed jointly or interconnected. Once the multi-source heterogeneous data has been uniformly described and metadata with a uniform structure has been obtained, the complementary strengths of the heterogeneous data can be exploited: the characteristics of each type of heterogeneous data are extracted, analyzed and stored, and the characteristics of the multi-source heterogeneous data are linked together, which makes semantic analysis and internal data integration across heterogeneous data possible.
To handle the association between metadata, a method based on computing hidden feature vectors is used. If two other files are fetched after a given file is fetched, those two files are related or similar to it to some extent, so their feature vectors should lie closer to that of the fetched file in the vector space. If certain files are not fetched after a given file is fetched, those files are unrelated to it, so their feature vectors should lie relatively far from that of the fetched file. Several typical characteristic elements are selected and encoded as the basis for computing the hidden feature vector, so that the vector expresses the relevance more fully: the frequent access order, the same-directory storage relationship, the access order within an application, and the order of direct reads by the user.
To handle metadata prefetching, the relevance analysis model is designed with two considerations in mind. First, the model should learn from file reads the hidden relationship between a fetched file and the files to be analyzed: if File1 and File2 are both related to the currently fetched file, the model should treat File1 and File2 as similar to the fetched file to some extent. Second, the model should be able to analyze files on which no read operation has yet been performed and prefetch their corresponding metadata.
Referring to FIG. 3, the prefetch algorithm receives two inputs: the relevant characteristics extracted from files that have been read, and the characteristic analysis of the pre-read files. To encode the four extracted data characteristics into a computable feature vector, the characters in each characteristic are converted into the corresponding numeric ids according to a dictionary with a fixed order, so that they can be concatenated in a set order to yield a file feature vector usable for calculation. The feature vectors of the read files then serve as the criterion for prefetching. The vectors obtained for the pre-read files are compared with these criterion vectors to obtain the file association degree, which determines whether the metadata of each file is prefetched, and the resulting prefetch metadata sequence is finally output.
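A minimal sketch of this encode-and-compare flow follows. The patent does not state how the association degree is computed, so cosine similarity and the 0.9 threshold are assumptions made here for illustration, as are the alphabet, the field width of 8, and the four feature key names.

```python
import math
import string

# Fixed-order dictionary: each allowed character maps to a stable numeric id.
# Id 0 is reserved for padding and for characters outside the dictionary.
ALPHABET = string.ascii_lowercase + string.digits + "/._-"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

# Set concatenation order of the four characteristics (key names are hypothetical).
FEATURE_ORDER = ("access_order", "directory", "app_order", "user_order")
FIELD_WIDTH = 8  # assumed fixed width per characteristic


def encode_field(text):
    """Convert one characteristic to a fixed-width list of numeric ids."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in text.lower()[:FIELD_WIDTH]]
    return ids + [0] * (FIELD_WIDTH - len(ids))


def encode_file(features):
    """Concatenate the encoded characteristics in the set order."""
    vector = []
    for key in FEATURE_ORDER:
        vector.extend(encode_field(features.get(key, "")))
    return vector


def association_degree(a, b):
    """Cosine similarity, used here as a stand-in for the association degree."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prefetch_sequence(read_file, candidates, threshold=0.9):
    """Return candidate names whose association degree meets the threshold,
    most strongly associated first."""
    criterion = encode_file(read_file)
    scored = []
    for name, feats in candidates.items():
        degree = association_degree(criterion, encode_file(feats))
        if degree >= threshold:
            scored.append((degree, name))
    return [name for degree, name in sorted(scored, reverse=True)]
```

Comparing raw character ids is a crude measure of similarity; a learned hidden feature vector, as the description suggests, would replace `encode_file` while the thresholding step could stay the same.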
The efficient access mechanism: based on the client/server (C/S) architecture of the distributed file system, a client cache layer and a server cache layer are designed to provide two-level access acceleration. Analyzing, prefetching and caching file metadata significantly reduces the number of metadata access requests in the system, optimizes the metadata access flow in the distributed file system, and improves metadata access efficiency.
Referring to FIG. 4, when a user initiates a file access operation, the read request for the file metadata first reaches the client through the file system, and the client first looks up the metadata of the target file in its local cache layer. On a hit, the client can serve the subsequent metadata requests of the operation from the local cache and return the corresponding file metadata to the upper layer. Otherwise, the client forwards the read request over the network to the metadata servers (MDSs). When the request reaches one of the MDSs, the metadata prefetching module on that server looks up the metadata of the target file and of its related files in the server's own cache layer according to the result given by the association analysis model, packs all the metadata found, and returns it to the client, so the client does not need to request the related metadata from the MDS during subsequent metadata accesses. After processing the metadata, the client uses the data index information it contains to interact with the corresponding OSD and completes the file read operation. When the file access includes a write request, the association analysis model in the client is activated: the access characteristics of the file are analyzed, extracted and integrated into a feature vector, which is stored in the client's cache layer in a specified organizational form. After the client completes the write request, all information about the file is synchronized to the corresponding MDS and the metadata version is replaced, which ensures the reliability and consistency of the system data.
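The read path above can be sketched as two small classes. This is an illustrative sketch, not the patented implementation: `MetadataServer` and `Client` are hypothetical names, the `related` mapping stands in for the output of the association analysis model, and the network hop and the OSD interaction are omitted.

```python
class MetadataServer:
    """Stand-in for one MDS with its own cache layer and a prefetch module."""

    def __init__(self, backing_store, related):
        self.backing_store = backing_store  # path -> metadata dict
        self.related = related              # path -> related paths (model output)
        self.cache = {}                     # server cache layer
        self.requests = 0                   # number of requests received

    def _lookup(self, path):
        # Fill the server cache from the backing store on first use.
        if path not in self.cache:
            self.cache[path] = self.backing_store[path]
        return self.cache[path]

    def read_metadata(self, path):
        """Return the target metadata bundled with that of its related files."""
        self.requests += 1
        bundle = {path: self._lookup(path)}
        for other in self.related.get(path, ()):
            if other in self.backing_store:
                bundle[other] = self._lookup(other)
        return bundle


class Client:
    """Client with a local cache layer in front of the MDS."""

    def __init__(self, mds):
        self.mds = mds
        self.cache = {}

    def read_metadata(self, path):
        if path in self.cache:          # local hit: no request leaves the client
            return self.cache[path]
        # Miss: one request returns the target and its prefetched neighbours.
        self.cache.update(self.mds.read_metadata(path))
        return self.cache[path]
```

With `/b` registered as related to `/a`, reading `/a` and then `/b` costs a single MDS request, because the metadata of `/b` arrived in the bundle returned for `/a`.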
Taking massive multi-source heterogeneous data as its research object, the invention provides a distributed intelligent storage technology for multi-modal data and designs a multi-modal data optimization management system based on it. The system manages multi-source heterogeneous data while accumulating historical data effectively, describes multi-source heterogeneous data uniformly, provides integrated data storage and access services for multi-source data, and further promotes the optimization of the comprehensive data governance system.
With the metadata normalization processing method, metadata normalization and warehousing are performed on the multi-source heterogeneous data, which gives a better understanding of the information the heterogeneous data contains, unifies the data of multi-source heterogeneous systems and lets them interact, and facilitates data sharing.
With the relevance analysis method, the hidden relationship between a fetched file and the files to be analyzed is learned from file access characteristics; the relevant characteristics of each file are extracted and integrated into a feature vector, and metadata prefetching is then performed on the basis of that vector. Prefetching and caching file metadata optimizes the metadata access flow for associated files and improves metadata access performance.
With the efficient access method, a client cache layer and a server cache layer are designed to provide two-level access acceleration. Analyzing, prefetching and caching file metadata significantly reduces the number of metadata access requests in the system, optimizes the metadata access flow in the distributed file system, and improves metadata access efficiency.
Multimodal: in the information field, a modality can be understood as the form in which data exists, such as text, audio, image or video. The co-occurrence of several kinds of single-modality information is collectively called multi-modal information, and such information is unstructured.
Multi-source heterogeneous data: a complex type of data that is similar to multi-modal data but covers many more data types.
Multi-source: the data as a whole has multiple holders and multiple sources.
Heterogeneous: the data as a whole has different components, content types and characteristics, and includes both discrete and mixed data, that is, both structured and unstructured data.
Those skilled in the art can understand this embodiment as a more specific description of embodiments 1 and 2.
Those skilled in the art will appreciate that, besides implementing the system and its devices, modules and units as pure computer-readable program code, the same functions can be implemented entirely in hardware by logically programming the method steps into logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. The system and its devices, modules and units can therefore be regarded as a hardware component, and the devices, modules and units it contains for implementing the various functions can be regarded as structures within that hardware component; they can also be regarded both as software modules implementing the method and as structures within the hardware component.
The foregoing description has described specific embodiments of the present invention. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method of multimodal data storage management, the method comprising the steps of:
step S1: uniformly describing multi-source heterogeneous data, and standardizing and driving various data access processes based on metadata;
step S2: uniformly describing multi-source heterogeneous data to obtain metadata with a uniform structure, extracting the characteristics of each heterogeneous data, analyzing and storing, connecting the characteristics of the multi-source heterogeneous data in series, and performing semantic analysis and internal data integration spanning the heterogeneous data;
step S3: establishing an efficient access mechanism based on the client/server architecture of a distributed file system, designing a client cache layer and a server cache layer, and providing two-level access acceleration; analyzing, prefetching and caching the file metadata so as to reduce the number of metadata access requests in the system and to optimize the metadata access flow and the metadata access efficiency in the distributed file system.
2. The method for multimodal data storage management according to claim 1, wherein the step S1 comprises the steps of:
step S1.1: researching multi-source data templated extraction, combining rules and various machine learning-based templated extraction methods, performing metadata normalization processing and storage in a warehouse on multi-source heterogeneous data, and paying attention to the unified description of unstructured data;
step S1.2: naming according to rules to generate id fields of audio, video and images as identifiers of data management, inserting the id fields into extended attributes of metadata, and performing unified logic representation on heterogeneous data sources by using the metadata without changing the storage structure of original data;
step S1.3: storing all data as objects in a flat namespace;
step S1.4: storing the related information in the extended attribute space of the metadata.
3. The method of multimodal data storage management according to claim 2, wherein the object in step S1.3 contains an id identifier, binary data, and metadata consisting of name/value pairs.
4. The method for multimodal data storage management according to claim 1, wherein the step S2 comprises: extracting the relevant characteristics of files that have been read and performing characteristic analysis on the pre-read files; converting characters in the extracted data characteristics into corresponding numeric ids according to a dictionary with a fixed order, and concatenating them in a set order to obtain file feature vectors usable for calculation; using the feature vectors of the read files as the criterion for deciding whether to prefetch; comparing the feature vectors obtained for the pre-read files with the criterion vectors to obtain the file association degree; determining whether to prefetch the metadata of each file; and finally outputting the resulting prefetch metadata sequence.
5. The method according to claim 1, wherein when the user initiates a file access operation in step S3, a read request operation for file metadata first reaches the client through the file system, and then the client searches for metadata of a target file in its own local cache layer, and if the metadata is hit, the client processes a metadata request subsequent to the operation in the local cache, and then returns corresponding file metadata information to the upper layer; otherwise, the client forwards the read request operation to the MDSs through the network, when the read request operation reaches one of the MDSs, the metadata prefetching module on the server searches the metadata of the target file and the related files in the cache layer of the server according to the result given by the association analysis model, then packs all the searched metadata and returns the packed metadata to the client, the client does not need to request other related metadata from the MDS during subsequent metadata access, and interacts with the corresponding OSD through the data index information in the metadata after the client processes the metadata of the files, and finally completes the read operation of the files.
6. A system for multimodal data storage management, the system comprising:
a module M1: uniformly describing multi-source heterogeneous data, and standardizing and driving various data access processes based on metadata;
a module M2: uniformly describing multi-source heterogeneous data to obtain metadata with a uniform structure, extracting the characteristics of each heterogeneous data, analyzing and storing, connecting the characteristics of the multi-source heterogeneous data in series, and performing semantic analysis and internal data integration spanning the heterogeneous data;
a module M3: establishing an efficient access mechanism based on the client/server architecture of a distributed file system, designing a client cache layer and a server cache layer, and providing two-level access acceleration; analyzing, prefetching and caching the file metadata so as to reduce the number of metadata access requests in the system and to optimize the metadata access flow and the metadata access efficiency in the distributed file system.
7. The system for multimodal data storage management according to claim 6, wherein the module M1 comprises the following modules:
module M1.1: researching multi-source data templated extraction, combining rules and various machine learning-based templated extraction methods, performing metadata normalization processing and warehousing storage on multi-source heterogeneous data, and paying attention to the unified description of unstructured data;
module M1.2: naming according to rules to generate id fields of audio, video and images as identifiers of data management, inserting the id fields into extended attributes of metadata, and performing unified logic representation on heterogeneous data sources by using the metadata without changing the storage structure of original data;
module M1.3: storing all data as objects in a flat namespace;
module M1.4: storing the related information in the extended attribute space of the metadata.
8. The system for multimodal data storage management according to claim 7, wherein the object in the module M1.3 contains an id identifier, binary data, and metadata consisting of name/value pairs.
9. The system for multimodal data storage management according to claim 6, wherein the module M2: extracts the relevant characteristics of files that have been read and performs characteristic analysis on the pre-read files; converts characters in the extracted data characteristics into corresponding numeric ids according to a dictionary with a fixed order, and concatenates them in a set order to obtain file feature vectors usable for calculation; uses the feature vectors of the read files as the criterion for deciding whether to prefetch; compares the feature vectors obtained for the pre-read files with the criterion vectors to obtain the file association degree; determines whether to prefetch the metadata of each file; and finally outputs the resulting prefetch metadata sequence.
10. The system for multimodal data storage management as claimed in claim 6, wherein in the module M3, when a user initiates a file access operation, a read request operation for file metadata will first reach the client through the file system, and then the client searches metadata of a target file in its own local cache layer, and if the metadata is hit, the client processes a subsequent metadata request of this operation in the local cache, and then returns corresponding file metadata information to the upper layer; otherwise, the client forwards the read request operation to the MDSs through the network, when the read request operation reaches one of the MDSs, the metadata prefetching module on the server searches the metadata of the target file and the related files in the cache layer of the server according to the result given by the association analysis model, then packs all the searched metadata and returns the packed metadata to the client, the client does not need to request other related metadata from the MDS during subsequent metadata access, and interacts with the corresponding OSD through the data index information in the metadata after the client processes the metadata of the files, and finally completes the read operation of the files.
CN202211240474.4A 2022-10-11 2022-10-11 Multi-modal data storage management method and system Pending CN115587082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211240474.4A CN115587082A (en) 2022-10-11 2022-10-11 Multi-modal data storage management method and system

Publications (1)

Publication Number Publication Date
CN115587082A true CN115587082A (en) 2023-01-10

Family

ID=84779926

Country Status (1)

Country Link
CN (1) CN115587082A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116450596A (en) * 2023-06-19 2023-07-18 北京大数据先进技术研究院 Digital object storage method, digital object storage device, electronic equipment and readable storage medium
CN116450596B (en) * 2023-06-19 2023-10-03 北京大数据先进技术研究院 Digital object storage method, digital object storage device, electronic equipment and readable storage medium
CN116976808A (en) * 2023-07-21 2023-10-31 中国矿业大学(北京) Multisource heterogeneous coal mine geologic data management system, method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Ali et al. Comparison between SQL and NoSQL databases and their relationship with big data analytics
CN1845104B (en) System and method for intelligent retrieval and processing of information
CN105989150B (en) A kind of data query method and device based on big data environment
Chung et al. JackHare: a framework for SQL to NoSQL translation using MapReduce
CN115587082A (en) Multi-modal data storage management method and system
CN110716952A (en) Multi-source heterogeneous data processing method and device and storage medium
Guo et al. Manu: a cloud native vector database management system
Bellare et al. Woo: A scalable and multi-tenant platform for continuous knowledge base synthesis
Mostajabi et al. A Systematic Review of Data Models for the Big Data Problem
Ghotiya et al. Migration from relational to NoSQL database
Kang et al. Research on construction methods of big data semantic model
Liu et al. AUDR: an advanced unstructured data repository
El Alami et al. Supply of a key value database redis in-memory by data from a relational database
Shakhovska et al. Big Data Model" Entity and Features"
Erkimbaev et al. Standardization of Storage and Retrieval of Semi-structured Thermophysical Data in JSON-documents Associated with the Ontology
Pivert NoSQL data models: trends and challenges
Valduriez Principles of distributed data management in 2020?
Almutairi et al. An Analysis of Data Integration Challenges from Heterogeneous Databases
Samal et al. Big data processing: Big challenges and opportunities
Pan et al. Research on Mass Image Data Storage Method for Data Center
Zhang et al. Managing a large shared bank of unstructured data by using free-table
Ren et al. Intelligent visualization system for big multi-source medical data based on data lake
Morishima et al. A data modeling and query processing scheme for integration of structured document repositories and relational databases
Manta-Caro et al. Advances in real-time indexing models and techniques for the web of things
Hao et al. Research of hadoop-based digital library data service system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination