CN117076495B - Distributed storage method, device and equipment for multi-mode literature data - Google Patents

Distributed storage method, device and equipment for multi-mode literature data

Info

Publication number
CN117076495B
Authority
CN
China
Prior art keywords
metadata
data
document
full
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311336096.4A
Other languages
Chinese (zh)
Other versions
CN117076495A (en
Inventor
陆矜菁
严笑然
厉燕
刘洋
陈一家
侯炜华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311336096.4A
Publication of CN117076495A
Application granted
Publication of CN117076495B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present invention relates to the field of data processing, and in particular, to a distributed storage method, device and equipment for multimodal literature data. The method comprises the following steps: collecting full text data of a document and storing the full text data into a distributed file system; extracting document metadata of the document full-text data and storing the document metadata into a structured database; extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database; constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database; and constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database. The invention can integrate the full text data of the documents, is beneficial to the utilization and management of the full text data of the documents, and is convenient for the search query of the document data of each mode.

Description

Distributed storage method, device and equipment for multi-mode literature data
Technical Field
The present invention relates to the field of data processing, and in particular, to a distributed storage method, device and equipment for multimodal literature data.
Background
Existing document databases provide basic information such as bibliographic details, citations and abstracts, but they do not provide the image data contained in the documents, so these unstructured data resources are wasted. Integrating the document data of the major platforms according to actual requirements is therefore of considerable value.
For document data storage, even when the full-text data of each major platform is acquired by means such as web crawlers, problems remain: management is disorganized, analysis is difficult, and single-machine storage space is insufficient. Unstructured data cannot be managed in the same way as structured data, and is therefore hard to query and analyze. For large-scale document data, especially full-text documents and document images, storing and querying the unstructured and structured data in a unified way is thus an important problem.
In addition, large-scale data storage requires a large amount of space and makes data security difficult to guarantee. It is therefore of great significance to construct a distributed storage and query system for multimodal literature data that both accommodates the data scale and ensures data security to a certain extent.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a distributed storage method, apparatus, computer device, and storage medium for multimodal document data.
In a first aspect, an embodiment of the present invention provides a distributed storage method for multimodal literature data, where the method includes:
collecting full text data of a document and storing the full text data into a distributed file system;
extracting document metadata of the document full-text data and storing the document metadata into a structured database;
extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database;
constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database;
and constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database.
In an embodiment, the extracting the document metadata of the document full text data and storing in the structured database includes:
extracting full text metadata of the full text data of the document; and
extracting the quotation metadata of the full-text data of the document and outputting quotation relation data;
and merging the full text metadata and the quotation metadata to obtain document metadata, and storing the document metadata and the quotation relation data respectively into the structured database.
In an embodiment, the extracting the full text metadata of the document full text data includes:
extracting initial full text metadata of the full text data of the document;
performing standardization processing on the initial full text metadata;
setting a first identifier of the initial full text metadata after standardized processing, and marking a storage position of the full text data of the corresponding document in the distributed file system to obtain the full text metadata.
In an embodiment, the extracting the quotation metadata of the document full-text data and outputting the quotation relationship data comprises:
extracting initial quotation metadata of the full text data of the document and labeling the first identifier of the citing document full-text data;
carrying out standardization processing on the initial quotation metadata, and setting a second identifier of the cited document full-text data to obtain quotation relation data;
deleting the first identifier of the citing document full-text data, and deduplicating according to the second identifier to obtain the quotation metadata.
In an embodiment, the building a knowledge-graph based on the document metadata and the image metadata includes:
setting an ontology, attributes and relations for constructing a knowledge graph;
extracting entities of points based on the ontology and the attribute of the knowledge graph, and extracting entities of edges according to the relation of the knowledge graph;
and constructing a model of the knowledge graph based on the ontology, the attribute and the relation, and constructing the knowledge graph based on the model, the entity of the point and the entity of the side.
In an embodiment, the method further comprises:
and searching and inquiring full-text document data, image data, metadata and a knowledge graph by using the distributed storage system.
In a second aspect, an embodiment of the present invention proposes a distributed storage device for multimodal literature data, the device including:
the data acquisition module is used for acquiring full-text data of the literature and storing the full-text data into the distributed file system;
the first extraction module is used for extracting the document metadata of the full-text data of the document and storing the document metadata into the structured database;
the second extraction module is used for extracting the image data in the full-text data of the literature, extracting the image metadata of the image data, storing the image data into a distributed file system and storing the image metadata into a structured database;
the graph construction module is used for constructing a knowledge graph based on the document metadata and the image metadata and storing the knowledge graph into a distributed graph database;
and the system construction module is used for constructing and obtaining a distributed storage system based on the distributed file system, the structured database and the distributed graph database.
In one embodiment, the system building module comprises:
the first search query interface is used for directly searching the metadata and indirectly searching the full-text document data and the image data;
and the second search query interface is used for direct search of the knowledge graph and indirect search of metadata, full-text document data and image data.
In a third aspect, an embodiment of the present invention proposes a computer device comprising a memory storing a computer program and a processor executing the steps of the first aspect.
In a fourth aspect, an embodiment of the present invention proposes a computer readable storage medium on which a computer program is stored, the computer program implementing the steps of the first aspect when executed by a processor.
Compared with the prior art, the method, the device, the computer equipment and the storage medium collect full-text data of the literature and store the full-text data into a distributed file system; extracting document metadata of the document full-text data and storing the document metadata into a structured database; extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database; constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database; and constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database. The invention can integrate the full text data of the documents, is beneficial to the utilization and management of the full text data of the documents, and is convenient for the search query of the document data of each mode.
Drawings
FIG. 1 is a schematic diagram of a terminal in an embodiment;
FIG. 2 is a flow chart of a method for distributed storage of multimodal literature data in one embodiment;
FIG. 3 is a flowchart illustrating the step S204 in one embodiment;
FIG. 4 is a flowchart of obtaining full text metadata in an embodiment;
FIG. 5 is a flowchart of obtaining quotation metadata in an embodiment;
FIG. 6 is a flowchart illustrating the step S208 in one embodiment;
FIG. 7 is a schematic diagram of a distributed storage system according to an embodiment;
FIG. 8 is a schematic diagram of module connection of a distributed storage device facing multi-modal document data in one embodiment;
FIG. 9 is a schematic structural diagram of a computer device in an embodiment.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present invention, and it is apparent to those of ordinary skill in the art that the present invention may be applied to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
As used in the specification and in the claims, the terms "a," "an," "the," and/or "the" are not specific to a singular, but may include a plurality, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the steps and elements are explicitly identified, and they do not constitute an exclusive list, as other steps or elements may be included in a method or apparatus.
While the present invention makes various references to certain modules in an apparatus according to embodiments of the invention, any number of different modules may be used and run on a computing device and/or processor. The modules are merely illustrative and different aspects of the apparatus and method may use different modules.
It will be understood that when an element or module is referred to as being "connected," "coupled" to another element, module, or block, it can be directly connected or coupled or in communication with the other element, module, or block, or intervening elements, modules, or blocks may be present unless the context clearly dictates otherwise. The term "and/or" as used herein may include any and all combinations of one or more of the associated listed items.
The distributed storage method for multimodal document data can be applied to the terminal shown in fig. 1. As shown in fig. 1, the terminal may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA). The processor 102 may be deployed on a Linux system. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a distributed storage method for multimodal document data in the present embodiment, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
As shown in fig. 2, an embodiment of the present invention provides a distributed storage method for multimodal document data, which is illustrated by using the method applied to a terminal in fig. 1 as an example, and includes the following steps:
S202: collecting full text data of a document and storing the full text data into a distributed file system;
S204: extracting document metadata of the document full-text data and storing the document metadata into a structured database;
S206: extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database;
S208: constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database;
S210: constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database.
In this embodiment, the unstructured data and structured data of the document full-text data are stored and queried in a unified manner, which alleviates to a certain extent the problems of difficult storage and disordered management of multimodal data. Unstructured data such as images and document full-text data cannot be stored in a database in the way structured data can, so it is stored in the distributed file system, which facilitates the utilization of large-scale multimodal document data resources.
The distributed architecture supports the data scale and ensures data security to a certain extent. For example, in the field of astronomy, document data on the scale of millions of documents can be stored and queried with the distributed storage and query method for multimodal document data, which avoids the excessive space overhead and insufficient file security of local storage.
In step S202, documents are crawled, downloaded and uploaded in batches by an automated program; documents that are difficult to crawl automatically but are specifically required are downloaded and uploaded manually.
The document full-text data collected by the automated program is uploaded directly to the distributed file system, and the manually downloaded document full-text data is uploaded to the server with shell commands.
In step S204, as shown in fig. 3, the extracting the document metadata of the document full text data and storing the document metadata in the structured database includes:
S302: extracting full text metadata of the full text data of the document; extracting the quotation metadata of the full-text data of the document and outputting quotation relation data;
S304: merging the full text metadata and the quotation metadata to obtain document metadata, and storing the document metadata and the quotation relation data respectively into the structured database.
In one embodiment, as shown in fig. 4, the extracting the full text metadata of the full text data of the document includes:
S402: extracting initial full text metadata of the document full text data.
Taking document full-text data in PDF format as an example, the metadata is extracted in batches with a Python library such as PyPDF2. The initial full text metadata mainly comprises the document title, author, subject, date and the like; at the same time, the storage location pdf_path of the document full-text data is recorded.
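A minimal sketch of this extraction step, assuming the PDF document-information dictionary carries the standard `/Title`, `/Author`, `/Subject` and `/CreationDate` keys. The function names and the record layout are hypothetical and are not taken from the patent; PyPDF2 is imported lazily so that the mapping function can be used without it.

```python
from typing import Mapping, Optional


def to_initial_metadata(info: Optional[Mapping[str, object]], pdf_path: str) -> dict:
    """Map a PDF document-information dictionary to an initial metadata record."""
    info = info or {}

    def text(key: str) -> Optional[str]:
        value = info.get(key)
        return str(value) if value is not None else None

    return {
        "title": text("/Title"),
        "author": text("/Author"),
        "subject": text("/Subject"),
        "date": text("/CreationDate"),
        "pdf_path": pdf_path,  # storage location of the full-text file
    }


def extract_initial_metadata(pdf_path: str) -> dict:
    """Read one PDF with PyPDF2 (third-party) and build its initial record."""
    from PyPDF2 import PdfReader  # imported lazily; assumed to be installed

    reader = PdfReader(pdf_path)
    # reader.metadata is dictionary-like and may be None when the PDF has no info.
    return to_initial_metadata(reader.metadata, pdf_path)
```

Missing fields are kept as `None` rather than guessed, so that they can be picked up by the later manual labeling step.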
S404: and carrying out standardization processing on the initial full text metadata.
Metadata extracted from document full-text data of different sources may be inconsistent in letter case, spacing and the like, so the different data formats need to be normalized. Records missing key metadata (such as the title or author) are labeled manually, as are the small number of documents that are not in PDF format (CAJ files and the like). The normalized data is then de-duplicated by title.
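The standardization step can be sketched as follows. This is an illustrative sketch only: the choice of whitespace collapsing plus case-insensitive title matching is an assumption, and the function names are hypothetical.

```python
import re


def normalize_text(value):
    """Collapse runs of whitespace; keep None for missing or empty values."""
    if value is None:
        return None
    cleaned = re.sub(r"\s+", " ", value).strip()
    return cleaned or None


def standardize(records):
    """Return (clean_records, records_needing_manual_labeling)."""
    clean, manual, seen_titles = [], [], set()
    for rec in records:
        rec = dict(rec,
                   title=normalize_text(rec.get("title")),
                   author=normalize_text(rec.get("author")))
        # Records missing key information are set aside for manual labeling.
        if rec["title"] is None or rec["author"] is None:
            manual.append(rec)
            continue
        key = rec["title"].casefold()  # case-insensitive duplicate detection
        if key in seen_titles:
            continue
        seen_titles.add(key)
        clean.append(rec)
    return clean, manual
```

The first occurrence of each title is kept; a real pipeline might prefer the most complete record instead.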
S406: setting a first identifier of the initial full text metadata after standardized processing, and marking a storage position of the full text data of the corresponding document in the distributed file system to obtain the full text metadata.
A first identifier paper_id is set for the document full-text data. The following is a format example of the full text metadata A:
In one embodiment, as shown in fig. 5, the extracting the quotation metadata of the document full-text data and outputting the quotation relation data includes:
S502: extracting initial quotation metadata of the document full text data, and labeling the first identifier of the citing document full-text data.
The initial quotation metadata of the document full-text data is extracted in batches with a Python library such as PyPDF2. The extracted content mainly comprises the title, author and the like of each cited document; at the same time, the paper_id of the citing document full-text data is labeled.
S504: carrying out standardization processing on the initial quotation metadata, and setting a second identifier of the cited document full-text data to obtain quotation relation data.
The data formats of the key information (title and author) are normalized, for example to resolve inconsistent letter case and spacing; then a second identifier r_paper_id is set for each cited document according to its title, thereby obtaining the quotation relation data, referred to below as the reference relation data B. The format of the reference relation data B is as follows:
S506: deleting the first identifier of the citing document full-text data, and deduplicating according to the second identifier to obtain the quotation metadata.
The paper_id column of the reference relation data B is deleted, and the result is de-duplicated by r_paper_id to obtain all the quotation metadata C.
The quotation metadata C is added to the full text metadata A, and the result is de-duplicated by document title. If a document p appears in both the quotation metadata C and the full text metadata A, the quotation metadata C is used to fill in the columns missing from the full text metadata A; where the two are inconsistent, the full text metadata A takes precedence. Each newly added record is assigned a paper_id, and for records that have a pdf_path the corresponding full-text document is renamed to {paper_id}.pdf. Finally, the document metadata D is obtained, and the format of the document metadata D is as follows:
Finally, the reference relation data B and the document metadata D are stored separately in the distributed structured database to support querying, retrieval, data analysis and other work in subsequent steps.
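The derivation of B, C and D described above can be sketched in plain Python. The column names other than paper_id, r_paper_id and pdf_path, the matching of titles without regard to case, and the way new identifiers are numbered are all assumptions made for illustration.

```python
def build_reference_relations(citations):
    """citations: rows with paper_id (citing document) plus the title and
    author of the cited document. One r_paper_id per distinct cited title."""
    ids, rows = {}, []
    for c in citations:
        r_id = ids.setdefault(c["title"].casefold(), len(ids) + 1)
        rows.append({"paper_id": c["paper_id"], "r_paper_id": r_id,
                     "title": c["title"], "author": c.get("author")})
    return rows  # reference relation data B


def build_citation_metadata(relations):
    """Drop the paper_id column and de-duplicate by r_paper_id."""
    seen, rows = set(), []
    for r in relations:
        if r["r_paper_id"] not in seen:
            seen.add(r["r_paper_id"])
            rows.append({k: v for k, v in r.items() if k != "paper_id"})
    return rows  # quotation metadata C


def merge_metadata(full_text, citation):
    """Merge C into A by title: A wins conflicts, C only fills missing values."""
    merged = {rec["title"].casefold(): dict(rec) for rec in full_text}
    next_id = max((rec["paper_id"] for rec in full_text), default=0) + 1
    for c in citation:
        fields = {k: v for k, v in c.items() if k != "r_paper_id"}
        target = merged.get(c["title"].casefold())
        if target is not None:
            for col, value in fields.items():
                if target.get(col) is None:
                    target[col] = value
        else:
            # Cited-only document: new paper_id, no stored full text yet.
            merged[c["title"].casefold()] = dict(fields, paper_id=next_id, pdf_path=None)
            next_id += 1
    return list(merged.values())  # document metadata D
```

Usage: `B = build_reference_relations(citations)`, `C = build_citation_metadata(B)`, `D = merge_metadata(A, C)`.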
In step S206, the image data in the collected document full-text data is extracted in batches by a program, mainly using the fitz module of PyMuPDF. The images, as unstructured data, are imported into the distributed file system for storage, and the storage path is recorded as image_path.
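A hedged sketch of batch image extraction with PyMuPDF follows. The sequential numbering, the `paper_id` and `page` fields, and the function names are illustrative assumptions. The sketch writes PNG files to a local directory, and uploading them to the distributed file system is left out.

```python
def image_record(image_id: int, paper_id: int, page_number: int, image_dir: str) -> dict:
    """Metadata row for one extracted image; the file is named {image_id}.png."""
    return {
        "image_id": image_id,
        "paper_id": paper_id,
        "page": page_number,
        "image_path": f"{image_dir.rstrip('/')}/{image_id}.png",
    }


def extract_images(pdf_path: str, paper_id: int, image_dir: str,
                   first_image_id: int = 1) -> list:
    """Save every embedded image of one PDF as PNG and return the metadata rows."""
    import fitz  # PyMuPDF (third-party); imported lazily

    records = []
    next_id = first_image_id
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            for img in page.get_images(full=True):
                xref = img[0]  # cross-reference number of the image object
                pix = fitz.Pixmap(doc, xref)
                if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB before saving as PNG
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                rec = image_record(next_id, paper_id, page_index + 1, image_dir)
                pix.save(rec["image_path"])
                records.append(rec)
                next_id += 1
    return records
```

Passing `first_image_id` lets a batch job continue the numbering across documents so that each image identifier stays unique.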
The image metadata of the image data is imported into the structured database for storage, in a table named image_info. When the image metadata is generated, each image is numbered; the number serves as the unique identifier image_id of the image metadata, and the image file is named {image_id}.png. The metadata format of the image is as follows:
In step S208, as shown in fig. 6, the constructing a knowledge graph based on the document metadata and the image metadata includes:
S602: setting an ontology, attributes and relations for constructing the knowledge graph.
The advantage of a knowledge graph for scientific literature is that the graph structure makes more related information available. The problems it can mainly solve and analyze include finding the documents related to a given document, finding co-authors, and so on. The design is therefore as follows:
(1) The ontologies are document and author; (2) the attributes of a document include title, author, publication time, keywords, abstract and the like, and the attributes of an author include name, organization, region and the like; (3) the relations include the "written" relation between authors and documents and the "cited" relation between documents.
S604: extracting entities of points based on the ontology and the attributes of the knowledge graph, and extracting entities of edges according to the relations of the knowledge graph.
For the entities of points, the unique identifier node_id of each entity point is derived from the generated table paper_info.
(a) Document entity extraction: all columns of the table paper_info are extracted, with the first identifier paper_id serving as the unique identifier node_id; the range of values is [1, a], where a is the number of documents.
(b) Author entity extraction: the author-related attribute columns (e.g., author name, organization, region) and the paper_id column of the table paper_info are extracted. A new node_id column holds the unique identifier node_id, whose range is [a+1, n], where n is the number of all entity points. At the same time, a new table author_info is created in the structured database, and the format of the table author_info is:
For the entities of edges, extraction mainly uses the generated tables reference_info, paper_info and author_info.
(a) Extracting document citation relation entities: the table reference_info is joined with the table paper_info to obtain the paper_id of each of the two entity points, generating the entities of the "cited" edges;
(b) Extracting author-writes-document relation entities: the paper_id (the node_id of the document) and the node_id (the node_id of the author) in the table author_info are extracted, generating the entities of the "written" edges.
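The point and edge extraction above can be sketched with in-memory rows standing in for the tables. Identifying authors by name alone and joining reference rows to paper_info by title are simplifying assumptions, and the row layouts are hypothetical.

```python
def build_entities(papers, authorships, references):
    """papers: rows with paper_id in [1, a] and a title.
    authorships: rows {"name", "paper_id"} taken from the author columns.
    references: rows {"paper_id": citing document, "title": cited title}."""
    a = len(papers)
    paper_nodes = [dict(p, node_id=p["paper_id"], label="paper") for p in papers]

    author_ids, edges = {}, []
    for row in authorships:
        # Authors are numbered after the documents: a+1, a+2, ..., n.
        node_id = author_ids.setdefault(row["name"], a + 1 + len(author_ids))
        edges.append({"from": node_id, "to": row["paper_id"], "label": "written"})
    author_nodes = [{"node_id": nid, "name": name, "label": "author"}
                    for name, nid in author_ids.items()]

    # Join the reference rows with the document table to resolve both end points.
    id_by_title = {p["title"].casefold(): p["paper_id"] for p in papers}
    for ref in references:
        cited = id_by_title.get(ref["title"].casefold())
        if cited is not None:
            edges.append({"from": ref["paper_id"], "to": cited, "label": "cited"})
    return paper_nodes, author_nodes, edges
```

With this numbering every point has a distinct node_id, so document and author points can be loaded into the same graph without collisions.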
S606: and constructing a model of the knowledge graph based on the ontology, the attribute and the relation, and constructing the knowledge graph based on the model, the entity of the point and the entity of the side.
First, the Schema model of the graph is constructed from the ontology, attributes and relations of the knowledge graph set in step S602; then the datamap is constructed from the Schema and the entities of the points and edges extracted in step S604; finally, the Schema model, the datamap and the graph entities are written into the JanusGraph distributed graph database for storage.
The Schema model mainly defines the basic structure and indexes of the knowledge graph. The Schema model defines "propertyKeys", the list of property keys, comprising the names, data types and the like of the extracted attributes; "vertexLabels", the point labels, i.e., the name of each ontology; and "edgeLabels", the edge labels, i.e., the names of the relations.
The datamap is a mapping file for the entity files of points and edges. The "vertexMap" defines, for the entity csv file of each point, the corresponding ontology and the property key corresponding to each column; the "edgeMap" defines, for the entity csv file of each edge, the corresponding relation and the ontologies of the edge's end points.
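To make the two files concrete, the sketch below builds a Schema model and a datamap as Python dictionaries and checks that they agree. The top-level key spellings ("vertexLabels", "vertexMap", "edgeMap"), the inner field names, the csv file names and the data type strings are assumptions for illustration; the exact layout expected by a given JanusGraph import tool may differ.

```python
import json

# Ontology, attribute and relation names follow the design in step S602.
schema = {
    "propertyKeys": [
        {"name": "node_id", "dataType": "Integer"},
        {"name": "title", "dataType": "String"},
        {"name": "name", "dataType": "String"},
    ],
    "vertexLabels": [{"name": "paper"}, {"name": "author"}],
    "edgeLabels": [{"name": "written"}, {"name": "cited"}],
}

datamap = {
    "vertexMap": [
        {"file": "paper.csv", "label": "paper",
         "columns": {"node_id": "node_id", "title": "title"}},
        {"file": "author.csv", "label": "author",
         "columns": {"node_id": "node_id", "name": "name"}},
    ],
    "edgeMap": [
        {"file": "written.csv", "label": "written", "from": "author", "to": "paper"},
        {"file": "cited.csv", "label": "cited", "from": "paper", "to": "paper"},
    ],
}


def validate(schema, datamap):
    """Return the names used by the datamap that the Schema does not define."""
    keys = {k["name"] for k in schema["propertyKeys"]}
    vertices = {v["name"] for v in schema["vertexLabels"]}
    edge_names = {e["name"] for e in schema["edgeLabels"]}
    problems = []
    for v in datamap["vertexMap"]:
        if v["label"] not in vertices:
            problems.append(v["label"])
        problems += [k for k in v["columns"].values() if k not in keys]
    for e in datamap["edgeMap"]:
        if e["label"] not in edge_names:
            problems.append(e["label"])
        problems += [end for end in (e["from"], e["to"]) if end not in vertices]
    return problems


schema_json = json.dumps(schema, indent=2)  # text that would be written to a file
```

Running such a consistency check before loading catches a misspelled label or property key early, before any entity file is imported.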
In step S210, the distributed storage system is used to store multimodal data and provides different storage structures for data of different modalities. The distributed storage system is composed of the distributed file system, the structured database and the distributed graph database, and the underlying layers of the structured database and the distributed graph database are both based on the distributed file system. The distributed storage system stores the multimodal data by category: the distributed file system stores the document full-text data and image data; the structured database stores the various metadata; and the distributed graph database stores the knowledge graph.
As shown in fig. 7, the distributed storage system has search query interfaces, which mainly provide a search query function and allow a user to search the metadata of the documents and the constructed knowledge graph.
The first search query interface is used for direct search of metadata and indirect search of document full-text data and image data, mainly by retrieving the structured data stored in Hive with HiveQL, scripting languages and the like. To retrieve metadata, the collected metadata is retrieved directly. To retrieve a full-text document, the metadata is retrieved to obtain the storage address of the corresponding document full-text data (pdf), and the full-text data is then fetched from that address. To retrieve image data, the image metadata is retrieved to obtain the storage address of the corresponding extracted image data (png), and the image data is then fetched from that address.
The second search query interface is used for direct search of the knowledge graph and indirect search of metadata, document full-text data and image data. To retrieve the knowledge graph, the generated knowledge graph is queried directly in the Gremlin language. To retrieve metadata, Gremlin first retrieves the node_id of one or more points, and HiveQL then retrieves the metadata of those points from the structured database. To retrieve full-text documents and image data, the corresponding addresses stored in the metadata are obtained, and the full-text data and image data are fetched from the distributed file system.
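The direct and indirect retrieval paths of the two interfaces can be illustrated with a small in-memory model. The dictionaries below are hypothetical stand-ins for the Hive tables and the graph edges; in the described system the lookups would be HiveQL and Gremlin queries instead.

```python
# In-memory stand-ins (assumptions) for paper_info, image_info and "cited" edges.
PAPER_INFO = {
    1: {"title": "Graph Stores", "pdf_path": "/papers/1.pdf"},
    2: {"title": "Query Engines", "pdf_path": "/papers/2.pdf"},
}
IMAGE_INFO = {
    1: {"paper_id": 1, "image_path": "/images/1.png"},
    2: {"paper_id": 2, "image_path": "/images/2.png"},
}
CITED_EDGES = [(2, 1)]  # (citing node_id, cited node_id)


def search_metadata(keyword):
    """First interface, direct search: paper_ids whose title contains the keyword."""
    needle = keyword.casefold()
    return [pid for pid, row in PAPER_INFO.items() if needle in row["title"].casefold()]


def resolve_paths(paper_ids):
    """Indirect search: follow the metadata to the stored full text and images."""
    pdfs = [PAPER_INFO[pid]["pdf_path"] for pid in paper_ids]
    images = [row["image_path"] for row in IMAGE_INFO.values()
              if row["paper_id"] in paper_ids]
    return pdfs, images


def citing_papers(node_id):
    """Second interface, direct search: points with a "cited" edge to node_id."""
    return [src for src, dst in CITED_EDGES if dst == node_id]
```

For example, `resolve_paths(citing_papers(1))` first walks the graph to the documents that cite document 1 and then uses their metadata to locate the stored files, mirroring the graph, metadata and file system chain of the second interface.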
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of steps or stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of execution of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least a part of other steps or stages.
In one embodiment, as shown in fig. 8, the present invention provides a distributed storage device for multimodal literature data, the device comprising:
the data acquisition module 802 is configured to acquire full-text data of a document and store the full-text data in a distributed file system;
a first extraction module 804, configured to extract document metadata of the document full-text data, and store the document metadata in a structured database;
a second extraction module 806, configured to extract image data in the full text data of the document, extract image metadata of the image data, store the image data in a distributed file system, and store the image metadata in a structured database;
the graph construction module 808 is configured to construct a knowledge graph based on the document metadata and the image metadata, and store the knowledge graph in a distributed graph database;
the system building module 810 is configured to build a distributed storage system based on the distributed file system, the structured database, and the distributed graph database.
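The order in which the five modules hand data to one another can be sketched as a single function. This is a simplified illustration only: the extraction steps are reduced to copying fields from an already-parsed input record, and the function name `run_pipeline`, the input field names, and the path layout are all hypothetical.

```python
def run_pipeline(raw_doc: dict) -> dict:
    """Run one already-parsed document through stand-ins for modules 802 to 810."""
    dfs, doc_table, image_table = {}, [], []
    graph = {"vertices": [], "edges": []}
    doc_id = raw_doc["doc_id"]

    # Module 802: store the full text in the (stand-in) distributed file system.
    pdf_path = f"/literature/pdf/{doc_id}.pdf"
    dfs[pdf_path] = raw_doc["pdf_bytes"]

    # Module 804: store document metadata, including the full-text address.
    doc_table.append(
        {"doc_id": doc_id, "title": raw_doc["title"], "storage_path": pdf_path}
    )

    # Module 806: store each image in the file system and its metadata in a table.
    for index, image in enumerate(raw_doc["images"], start=1):
        image_id = f"{doc_id}_fig{index}"
        image_path = f"/literature/png/{image_id}.png"
        dfs[image_path] = image["png_bytes"]
        image_table.append(
            {
                "image_id": image_id,
                "doc_id": doc_id,
                "caption": image["caption"],
                "storage_path": image_path,
            }
        )

    # Module 808: build graph vertices and edges from the two metadata tables.
    graph["vertices"].append({"node_id": doc_id, "label": "document"})
    for row in image_table:
        graph["vertices"].append({"node_id": row["image_id"], "label": "image"})
        graph["edges"].append((doc_id, "has_image", row["image_id"]))

    # Module 810: expose the three stores together as one storage system.
    return {
        "distributed_file_system": dfs,
        "structured_database": {
            "document_metadata": doc_table,
            "image_metadata": image_table,
        },
        "distributed_graph_database": graph,
    }
```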
For specific limitations on the distributed storage device for multimodal literature data, reference may be made to the above limitations on the distributed storage method for multimodal literature data, which are not repeated here. Each of the above modules in the distributed storage device for multimodal literature data may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, the present invention provides a computer device, which may be a server, and whose internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing motion detection data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the steps of any of the above embodiments of the distributed storage method for multimodal literature data.
It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, the present invention provides a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of any of the above embodiments of the distributed storage method for multimodal literature data.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a non-transitory computer readable storage medium and which, when executed, may comprise the steps of the above-described method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.
The foregoing examples represent only a few embodiments of the present application. They are described in relative detail, but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. A distributed storage method for multimodal literature data, the method comprising:
collecting full text data of a document and storing the full text data into a distributed file system;
extracting full-text metadata of the document full-text data; extracting citation metadata of the document full-text data and outputting citation relation data; merging the full-text metadata and the citation metadata to obtain document metadata, and storing the document metadata and the citation relation data separately in a structured database;
extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database;
constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database;
constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database;
wherein the extracting citation metadata of the document full-text data and outputting citation relation data comprises:
extracting initial citation metadata of the document full-text data and labeling it with a first identifier of the citing document full-text data;
standardizing the initial citation metadata, and setting a second identifier of the cited document full-text data to obtain the citation relation data;
deleting the first identifier of the citing document full-text data, and deduplicating according to the second identifier to obtain the citation metadata.
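The three sub-steps at the end of claim 1 can be illustrated with a short sketch. It rests on assumptions the claim does not state: the standardization shown (whitespace and case folding of the title, trimming of the year) and the way the second identifier is derived (a truncated SHA-1 hash of the standardized fields) are hypothetical choices, as are all function and field names.

```python
import hashlib


def normalize(record: dict) -> dict:
    """Hypothetical standardization: collapse whitespace, lowercase, trim the year."""
    return {
        "title": " ".join(record["title"].split()).lower(),
        "year": str(record["year"]).strip(),
    }


def second_identifier(normalized: dict) -> str:
    """Hypothetical identifier of the cited document, derived from its metadata."""
    key = f"{normalized['title']}|{normalized['year']}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def extract_citations(references_by_doc: dict):
    """references_by_doc maps a citing document's first identifier to its references."""
    # Sub-step 1: initial citation metadata, labeled with the citing document's
    # first identifier.
    initial = [
        {"first_id": first_id, **ref}
        for first_id, refs in references_by_doc.items()
        for ref in refs
    ]

    relations = []          # citation relation data: (citing, cited) pairs
    citation_metadata = {}  # keyed by second identifier, so duplicates collapse

    for record in initial:
        # Sub-step 2: standardize, then assign the cited document's second identifier.
        normalized = normalize(record)
        second_id = second_identifier(normalized)
        relations.append({"citing_id": record["first_id"], "cited_id": second_id})
        # Sub-step 3: drop the first identifier and deduplicate by second identifier.
        citation_metadata.setdefault(second_id, {"second_id": second_id, **normalized})

    return relations, list(citation_metadata.values())
```

With this arrangement, two documents that cite the same work yield two relation rows but only one citation metadata row, which is the effect the deduplication step is meant to achieve.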
2. The method of claim 1, wherein the extracting full text metadata of the document full text data comprises:
extracting initial full text metadata of the full text data of the document;
performing standardization processing on the initial full text metadata;
setting a first identifier of the initial full text metadata after standardized processing, and marking a storage position of the full text data of the corresponding document in the distributed file system to obtain the full text metadata.
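The steps of claim 2 can be sketched as one function that standardizes the extracted fields, assigns a first identifier, and records where the full text is stored. The field names, the standardization rules, and the use of a name-based UUID derived from the storage path as the first identifier are all assumptions made for illustration.

```python
import uuid


def build_full_text_metadata(initial: dict, storage_path: str) -> dict:
    """Turn initial full-text metadata into a full-text metadata record."""
    # Standardization (hypothetical rules): tidy whitespace, drop empty authors,
    # and store the year as an integer.
    standardized = {
        "title": " ".join(initial["title"].split()),
        "authors": [a.strip() for a in initial["authors"] if a.strip()],
        "year": int(initial["year"]),
    }
    # First identifier (hypothetical scheme): deterministic UUID from the path,
    # so re-processing the same stored file yields the same identifier.
    first_id = str(uuid.uuid5(uuid.NAMESPACE_URL, storage_path))
    # Mark the storage location of the full text in the distributed file system.
    return {"first_id": first_id, "storage_path": storage_path, **standardized}
```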
3. The method of claim 1, wherein constructing a knowledge-graph based on the document metadata and image metadata comprises:
setting an ontology, attributes and relations for constructing a knowledge graph;
extracting vertex entities based on the ontology and the attributes of the knowledge graph, and extracting edge entities according to the relations of the knowledge graph;
and constructing a model of the knowledge graph based on the ontology, the attributes and the relations, and constructing the knowledge graph based on the model, the vertex entities and the edge entities.
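The construction in claim 3 can be illustrated by declaring a small model first and then extracting vertices and edges that conform to it. The ontology below (document, author, and image vertices with `wrote` and `has_image` relations) is a hypothetical example chosen for the sketch; the claim does not specify which ontology, attributes, or relations are used.

```python
# Model of the graph: vertex labels with their attributes, and relations with
# the labels allowed at each end. All entries are illustrative assumptions.
ONTOLOGY = {
    "document": ["title", "year"],
    "author": ["name"],
    "image": ["caption"],
}
RELATIONS = {
    "wrote": ("author", "document"),
    "has_image": ("document", "image"),
}


def build_knowledge_graph(doc_rows, image_rows):
    """Build vertices and edges from metadata rows, checked against the model."""
    vertices, edges = {}, []

    def add_vertex(vertex_id, label, source):
        # Keep only the attributes that the ontology defines for this label.
        vertices[vertex_id] = {"label": label, **{k: source[k] for k in ONTOLOGY[label]}}

    def add_edge(src, relation, dst):
        src_label, dst_label = RELATIONS[relation]
        actual = (
            vertices.get(src, {}).get("label"),
            vertices.get(dst, {}).get("label"),
        )
        if actual != (src_label, dst_label):
            raise ValueError(f"edge {src}-{relation}->{dst} does not fit the model")
        edges.append((src, relation, dst))

    for doc in doc_rows:
        add_vertex(doc["first_id"], "document", doc)
        for name in doc["authors"]:
            author_id = f"author:{name}"
            add_vertex(author_id, "author", {"name": name})
            add_edge(author_id, "wrote", doc["first_id"])

    for image in image_rows:
        add_vertex(image["image_id"], "image", image)
        add_edge(image["doc_id"], "has_image", image["image_id"])

    return {"vertices": vertices, "edges": edges}
```

Checking each edge against the declared relations is one way to keep the extracted entities consistent with the model before they are loaded into a graph database.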
4. The method according to claim 1, wherein the method further comprises:
performing search queries on document full-text data, image data, metadata and the knowledge graph by using the distributed storage system.
5. A distributed storage device for multimodal literature data, the device comprising:
the data acquisition module is used for acquiring full-text data of the literature and storing the full-text data into the distributed file system;
the first extraction module is used for extracting full-text metadata of the document full-text data; extracting citation metadata of the document full-text data and outputting citation relation data; merging the full-text metadata and the citation metadata to obtain document metadata, and storing the document metadata and the citation relation data separately in a structured database; wherein the extracting citation metadata of the document full-text data and outputting citation relation data comprises: extracting initial citation metadata of the document full-text data and labeling it with a first identifier of the citing document full-text data; standardizing the initial citation metadata, and setting a second identifier of the cited document full-text data to obtain the citation relation data; deleting the first identifier of the citing document full-text data, and deduplicating according to the second identifier to obtain the citation metadata;
the second extraction module is used for extracting the image data in the full-text data of the literature, extracting the image metadata of the image data, storing the image data into a distributed file system and storing the image metadata into a structured database;
the graph construction module is used for constructing a knowledge graph based on the document metadata and the image metadata and storing the knowledge graph into a distributed graph database;
and the system construction module is used for constructing and obtaining a distributed storage system based on the distributed file system, the structured database and the distributed graph database.
6. The apparatus of claim 5, wherein the system construction module comprises:
the first search query interface is used for directly searching the metadata and indirectly searching the full-text document data and the image data;
and the second search query interface is used for direct search of the knowledge graph and indirect search of metadata, full-text document data and image data.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 4.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 4.
CN202311336096.4A 2023-10-16 2023-10-16 Distributed storage method, device and equipment for multi-mode literature data Active CN117076495B (en)

Publications (2)

Publication Number Publication Date
CN117076495A CN117076495A (en) 2023-11-17
CN117076495B true CN117076495B (en) 2024-02-13

Family

ID=88713769

Country Status (1)

Country Link
CN (1) CN117076495B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729432A (en) * 2013-12-27 2014-04-16 河海大学 Method for analyzing and sequencing academic influence of theme literature in citation database
CN109192321A (en) * 2018-09-26 2019-01-11 北京理工大学 The construction method and calculating storage device of drug knowledge mapping
CN110990662A (en) * 2019-11-22 2020-04-10 北京市科学技术情报研究所 Domain expert selection method based on citation network and scientific research cooperation network
CN112434168A (en) * 2020-11-09 2021-03-02 广西壮族自治区图书馆 Knowledge graph construction method and fragmentized knowledge generation method based on library
CN113961528A (en) * 2021-10-27 2022-01-21 上海交通大学 Knowledge graph-based file semantic association storage system and method
CN116244344A (en) * 2022-11-25 2023-06-09 中国农业科学院农业信息研究所 Retrieval method and device based on user requirements and electronic equipment
CN116881436A (en) * 2023-08-09 2023-10-13 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Knowledge graph-based document retrieval method, system, terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140351678A1 (en) * 2013-05-22 2014-11-27 European Molecular Biology Organisation Method and System for Associating Data with Figures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
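Construction of a deep parsing and retrieval platform for large-scale scientific and technical literature; Wu Suyan; Wu Jiangrui; Li Wenbo; 现代情报 (Journal of Modern Information) (No. 01); 112-117 *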

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant