CN117076495B

CN117076495B - Distributed storage method, device and equipment for multi-mode literature data

Info

Publication number: CN117076495B
Application number: CN202311336096.4A
Authority: CN
Inventors: 陆矜菁; 严笑然; 厉燕; 刘洋; 陈一家; 侯炜华
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-10-16
Filing date: 2023-10-16
Publication date: 2024-02-13
Anticipated expiration: 2043-10-16
Also published as: CN117076495A

Abstract

The present invention relates to the field of data processing, and in particular, to a distributed storage method, device and equipment for multimodal literature data. The method comprises the following steps: collecting full text data of a document and storing the full text data into a distributed file system; extracting document metadata of the document full-text data and storing the document metadata into a structured database; extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database; constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database; and constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database. The invention can integrate the full text data of the documents, is beneficial to the utilization and management of the full text data of the documents, and is convenient for the search query of the document data of each mode.

Description

Distributed storage method, device and equipment for multi-mode literature data

Technical Field

The present invention relates to the field of data processing, and in particular, to a distributed storage method, device and equipment for multimodal literature data.

Background

Although existing literature databases provide basic information, citations, abstracts and the like for documents, they do not provide the image data contained in the documents, so these unstructured data resources go to waste. It is therefore worthwhile to integrate literature data from the major platforms according to actual requirements.

As for the storage of literature data, even if the full-text data of the major platforms is acquired by crawlers or similar means, problems remain: disordered management, difficult analysis, insufficient single-machine storage space and the like. Unstructured data cannot be managed in the same way as structured data and is difficult to query and analyze. For large-scale literature data, especially full-text documents and document image data, it is therefore important to solve the problem of uniformly storing and querying both the unstructured and the structured data.

In addition, large-scale data storage requires a large amount of space and makes security difficult to guarantee. It is therefore of great significance to construct a distributed storage and query system for multimodal literature data that both accommodates the data scale and ensures data security to a certain extent.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a distributed storage method, apparatus, computer device, and storage medium for multimodal document data.

In a first aspect, an embodiment of the present invention provides a distributed storage method for multimodal literature data, where the method includes:

collecting full text data of a document and storing the full text data into a distributed file system;

extracting document metadata of the document full-text data and storing the document metadata into a structured database;

extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database;

constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database;

and constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database.

In an embodiment, the extracting the document metadata of the document full text data and storing in the structured database includes:

extracting full text metadata of the full text data of the document; and

extracting quotation metadata of the document full-text data and outputting quotation relation data;

and merging the full text metadata and the quotation metadata to obtain document metadata, and storing the document metadata and the quotation relation data separately into the structured database.

In an embodiment, the extracting the full text metadata of the document full text data includes:

extracting initial full text metadata of the full text data of the document;

performing standardization processing on the initial full text metadata;

setting a first identifier of the initial full text metadata after standardized processing, and marking a storage position of the full text data of the corresponding document in the distributed file system to obtain the full text metadata.

In an embodiment, the extracting the quotation metadata of the document full-text data and outputting the quotation relationship data comprises:

extracting initial quotation metadata of the document full-text data, and labeling the first identifier of the citing document full-text data;

carrying out standardization processing on the initial quotation metadata, and setting a second identifier of the cited document full-text data to obtain the quotation relation data;

deleting the first identifier of the citing document full-text data, and de-duplicating according to the second identifier to obtain the quotation metadata.

In an embodiment, the building a knowledge-graph based on the document metadata and the image metadata includes:

setting an ontology, attributes and relations for constructing a knowledge graph;

extracting entities of points based on the ontology and the attribute of the knowledge graph, and extracting entities of edges according to the relation of the knowledge graph;

and constructing a model of the knowledge graph based on the ontology, the attributes and the relations, and constructing the knowledge graph based on the model, the entities of the points and the entities of the edges.

In an embodiment, the method further comprises:

searching and querying document full-text data, image data, metadata and the knowledge graph by using the distributed storage system.

In a second aspect, an embodiment of the present invention proposes a distributed storage device for multimodal literature data, the device including:

the data acquisition module is used for acquiring full-text data of the literature and storing the full-text data into the distributed file system;

the first extraction module is used for extracting the document metadata of the full-text data of the document and storing the document metadata into the structured database;

the second extraction module is used for extracting the image data in the full-text data of the literature, extracting the image metadata of the image data, storing the image data into a distributed file system and storing the image metadata into a structured database;

the graph construction module is used for constructing a knowledge graph based on the document metadata and the image metadata and storing the knowledge graph into a distributed graph database;

and the system construction module is used for constructing and obtaining a distributed storage system based on the distributed file system, the structured database and the distributed graph database.

In one embodiment, the system building module comprises:

the first search query interface is used for directly searching the metadata and indirectly searching the full-text document data and the image data;

and the second search query interface is used for direct search of the knowledge graph and indirect search of metadata, full-text document data and image data.

In a third aspect, an embodiment of the present invention provides a computer device comprising a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium on which a computer program is stored, the computer program implementing the steps of the first aspect when executed by a processor.

Compared with the prior art, the method, the device, the computer equipment and the storage medium collect full-text data of the literature and store the full-text data into a distributed file system; extracting document metadata of the document full-text data and storing the document metadata into a structured database; extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database; constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database; and constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database. The invention can integrate the full text data of the documents, is beneficial to the utilization and management of the full text data of the documents, and is convenient for the search query of the document data of each mode.

Drawings

FIG. 1 is a schematic diagram of a terminal in an embodiment;

FIG. 2 is a flow chart of a method for distributed storage of multimodal literature data in one embodiment;

FIG. 3 is a flowchart illustrating the step S204 in one embodiment;

FIG. 4 is a flowchart of obtaining full text metadata in an embodiment;

FIG. 5 is a flowchart of obtaining quotation metadata in an embodiment;

FIG. 6 is a flowchart illustrating the step S208 in one embodiment;

FIG. 7 is a schematic diagram of a distributed storage system according to an embodiment;

FIG. 8 is a schematic diagram of module connection of a distributed storage device facing multi-modal document data in one embodiment;

FIG. 9 is a schematic structural diagram of a computer device in an embodiment.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present invention, and it is apparent to those of ordinary skill in the art that the present invention may be applied to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.

As used in the specification and in the claims, the terms "a," "an," and/or "the" are not limited to the singular, but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.

While the present invention makes various references to certain modules in an apparatus according to embodiments of the invention, any number of different modules may be used and run on a computing device and/or processor. The modules are merely illustrative and different aspects of the apparatus and method may use different modules.

It will be understood that when an element or module is referred to as being "connected," "coupled" to another element, module, or block, it can be directly connected or coupled or in communication with the other element, module, or block, or intervening elements, modules, or blocks may be present unless the context clearly dictates otherwise. The term "and/or" as used herein may include any and all combinations of one or more of the associated listed items.

The distributed storage method for multimodal literature data can be applied to a terminal as shown in fig. 1. As shown in fig. 1, the terminal may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA). The processor 102 may be deployed on a Linux system. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.

The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a distributed storage method for multimodal document data in the present embodiment, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

As shown in fig. 2, an embodiment of the present invention provides a distributed storage method for multimodal document data, which is illustrated by using the method applied to a terminal in fig. 1 as an example, and includes the following steps:

S202: collecting full text data of a document and storing the full text data into a distributed file system;

S204: extracting document metadata of the document full-text data and storing the document metadata into a structured database;

S206: extracting image data in the full text data of the document, extracting image metadata of the image data, storing the image data into a distributed file system, and storing the image metadata into a structured database;

S208: constructing a knowledge graph based on the document metadata and the image metadata, and storing the knowledge graph into a distributed graph database;

S210: constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database.

In this embodiment, the unstructured data and the structured data of the document full-text data are stored and queried uniformly, which alleviates to a certain extent the difficult storage and disordered management of multi-modal data. Unstructured data such as images and document full text, which cannot be managed in the same way as structured data, is stored in the distributed file system, which facilitates the use of large-scale multi-modal literature data resources.

The distributed architecture supports large data scales and ensures data security to a certain extent. For example, in the field of astronomy, where the literature data runs to millions of records, the distributed storage and query method for multimodal literature data can fully meet the storage and query requirements, and it avoids the excessive space overhead and insufficient file security of local storage.

In step S202, documents are crawled, downloaded and uploaded in batches by an automated program; documents that are difficult to crawl automatically but are specifically required are downloaded and uploaded manually.

Document full-text data collected by the automated program is uploaded directly to the distributed file system, while manually collected document full-text data is uploaded to the server using shell commands.

In step S204, as shown in fig. 3, the extracting the document metadata of the document full text data and storing the document metadata in the structured database includes:

S302: extracting full text metadata of the document full-text data; extracting quotation metadata of the document full-text data and outputting quotation relation data;

S304: merging the full text metadata and the quotation metadata to obtain document metadata, and storing the document metadata and the quotation relation data separately into the structured database.

In one embodiment, as shown in fig. 4, the extracting the full text metadata of the full text data of the document includes:

S402: extracting initial full text metadata of the document full text data.

Taking document full-text data in PDF format as an example, the metadata of the document full-text data is extracted in batches using a Python library such as PyPDF2. The initial full text metadata mainly includes the document title, author, subject, date and the like; at the same time, the storage location pdf_path of the document full-text data is recorded.
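
A minimal sketch of step S402 follows, assuming PyPDF2 version 2.x or later, in which PdfReader(path).metadata returns a dict-like object keyed by PDF information names such as /Title and /Author. The function names and the record layout are illustrative assumptions and are not taken from the format tables of this description.

```python
from typing import Mapping, Optional


def to_initial_record(info: Optional[Mapping[str, object]], pdf_path: str) -> dict:
    """Map a PDF information dictionary to an initial full text metadata record."""
    info = info or {}

    def field(key: str) -> Optional[str]:
        # Treat absent and empty values alike so that missing key
        # information can be picked up later for manual labeling.
        value = info.get(key)
        return str(value) if value not in (None, "") else None

    return {
        "title": field("/Title"),
        "author": field("/Author"),
        "subject": field("/Subject"),
        "date": field("/CreationDate"),
        "pdf_path": pdf_path,  # storage location of the full text
    }


def extract_initial_metadata(pdf_path: str) -> dict:
    """Read one PDF with PyPDF2 (imported lazily; assumed to be installed)."""
    from PyPDF2 import PdfReader

    return to_initial_record(PdfReader(pdf_path).metadata, pdf_path)
```

Separating the mapping from the file access keeps the record layout testable without any PDF files; batch extraction would simply call extract_initial_metadata for every collected path.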

S404: performing standardization processing on the initial full text metadata.

For document full-text data from different sources, the extracted metadata may be inconsistent in letter case, spacing and the like, so the different data formats need to be normalized. Records missing key metadata (such as the title or author) are labeled manually, as is the small number of documents not in PDF format (CAJ files and the like). The normalized data is then de-duplicated by title.
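
The standardization in step S404 can be sketched as follows. This is one possible reading, assuming that records are dictionaries with title and author fields, that titles are compared ignoring case and extra whitespace, and that records missing key information are set aside for manual labeling; the function names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def normalize_text(value: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()


def title_key(title: str) -> str:
    """Key that ignores case and spacing differences, used for de-duplication."""
    return normalize_text(title).casefold()


def standardize(records: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Return (clean records, records that need manual labeling)."""
    clean, manual, seen = [], [], set()
    for rec in records:
        rec = dict(rec)  # do not modify the caller's record
        for column in ("title", "author"):
            if rec.get(column):
                rec[column] = normalize_text(rec[column])
        if not rec.get("title") or not rec.get("author"):
            manual.append(rec)  # key information missing
            continue
        key = title_key(rec["title"])
        if key in seen:
            continue  # duplicate title, keep the first occurrence
        seen.add(key)
        clean.append(rec)
    return clean, manual
```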

S406: setting a first identifier of the initial full text metadata after standardized processing, and marking a storage position of the full text data of the corresponding document in the distributed file system to obtain the full text metadata.

A first identifier paper_id is set for the document full-text data. The following is a format example of the full text metadata A:

In one embodiment, as shown in fig. 5, the extracting the quotation metadata of the document full-text data and outputting the quotation relationship data includes:

S502: extracting initial quotation metadata of the document full text data, and labeling the first identifier of the citing document full-text data.

The initial quotation metadata of the document full-text data is extracted in batches using a Python library such as PyPDF2. The extracted content mainly includes the title, author and the like of each cited document; at the same time, the first identifier paper_id of the citing document full-text data is labeled.

S504: carrying out standardization processing on the initial quotation metadata, and setting a second identifier of the cited document full-text data to obtain quotation relation data.

The data formats of the key information (title and author) are normalized to resolve problems such as inconsistent letter case and spacing. A second identifier r_paper_id is then set for each cited document according to its title, thereby obtaining the quotation relation data, referred to below as reference relation data B. The format of the reference relation data B is as follows:

S506: deleting the first identifier of the citing document full-text data, and de-duplicating according to the second identifier to obtain the quotation metadata.

The paper_id column of the reference relation data B is deleted, and the data is de-duplicated by r_paper_id to obtain all the quotation metadata C.
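
Steps S504 and S506 can be sketched as below. The sketch assumes that each initial quotation row carries the paper_id of the citing document and the title and author of the cited document, and that r_paper_id values are assigned as consecutive integers per distinct normalized title; the numbering scheme and function names are assumptions, since the description does not specify them.

```python
from typing import Dict, List


def _key(title: str) -> str:
    """Title key that ignores case and spacing differences."""
    return " ".join(title.split()).casefold()


def build_relation_data(citations: List[dict]) -> List[dict]:
    """Set r_paper_id per distinct cited title, yielding relation data B."""
    ids: Dict[str, int] = {}
    relation = []
    for row in citations:
        # A title seen before keeps its identifier; a new one gets the next integer.
        r_id = ids.setdefault(_key(row["title"]), len(ids) + 1)
        clean_title = " ".join(row["title"].split())
        relation.append({**row, "title": clean_title, "r_paper_id": r_id})
    return relation


def build_quotation_metadata(relation: List[dict]) -> List[dict]:
    """Drop paper_id and de-duplicate by r_paper_id, yielding metadata C."""
    unique: Dict[int, dict] = {}
    for row in relation:
        cited = {k: v for k, v in row.items() if k != "paper_id"}
        unique.setdefault(cited["r_paper_id"], cited)  # keep first occurrence
    return list(unique.values())
```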

The quotation metadata C is then merged into the full text metadata A, and the result is de-duplicated by document title. If a document p appears in both the quotation metadata C and the full text metadata A, the quotation metadata C is used to fill in the columns missing from the full text metadata A; where the two are inconsistent, the value in the full text metadata A is taken. Newly added records are assigned a paper_id, and for records that have a pdf_path, the corresponding full-text file is renamed {paper_id}.pdf. This yields the document metadata D, whose format is as follows:

Finally, the reference relation data B and all the document metadata D are stored separately in the distributed structured database, to support querying, retrieval, data analysis and other work in subsequent steps.
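
As an illustration only (it is not the format table referred to above), the merge rules can be sketched as follows. The sketch assumes that records are dictionaries, that titles are matched ignoring case and spacing, that new paper_id values continue from the largest existing one, and that the r_paper_id column is not carried into D; these choices and the function names are assumptions.

```python
from typing import List, Optional


def _key(title: str) -> str:
    """Title key that ignores case and spacing differences."""
    return " ".join(title.split()).casefold()


def merge_metadata(full_text: List[dict], quotations: List[dict]) -> List[dict]:
    """Merge quotation metadata C into full text metadata A, yielding D."""
    merged = {_key(r["title"]): dict(r) for r in full_text}
    next_id = max((r["paper_id"] for r in full_text), default=0) + 1
    for cited in quotations:
        cited = {k: v for k, v in cited.items() if k != "r_paper_id"}
        key = _key(cited["title"])
        if key in merged:
            target = merged[key]
            for column, value in cited.items():
                # C only fills gaps; on conflict the value from A is kept.
                if target.get(column) in (None, ""):
                    target[column] = value
        else:
            # A document known only from citations: new paper_id, no full text.
            merged[key] = {**cited, "paper_id": next_id, "pdf_path": None}
            next_id += 1
    return list(merged.values())


def renamed_pdf_name(record: dict) -> Optional[str]:
    """File name {paper_id}.pdf for records that have a stored full text."""
    return f"{record['paper_id']}.pdf" if record.get("pdf_path") else None
```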

In step S206, the image data in the collected document full-text data is extracted in batches by a program, mainly using the fitz module of the PyMuPDF library. As unstructured data, the images are imported into the distributed file system for storage, with the storage path recorded as image_path.

The image metadata of the image data is imported into the structured database for storage, in a table named image_info. When the image metadata is generated, each image is numbered; the number serves as the unique identifier image_id of the image metadata, and the image file is named {image_id}.png. The metadata format of the image is as follows:

In step S208, as shown in fig. 6, the constructing a knowledge graph based on the document metadata and the image metadata includes:
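
As an illustration only (it is not the format table referred to above), image extraction and numbering can be sketched as follows, assuming PyMuPDF is installed and importable as fitz. The paper_id and page columns, the local output directory and the function names are hypothetical; only image_id, image_path and the {image_id}.png naming come from the description.

```python
from typing import List


def image_record(image_id: int, paper_id: int, page: int, image_dir: str) -> dict:
    """Build one image_info row; the image file is named {image_id}.png."""
    return {
        "image_id": image_id,  # unique identifier of the image metadata
        "paper_id": paper_id,  # hypothetical link back to the source document
        "page": page,          # hypothetical: 1-based page number
        "image_path": f"{image_dir.rstrip('/')}/{image_id}.png",
    }


def extract_images(pdf_path: str, paper_id: int, image_dir: str,
                   first_id: int = 1) -> List[dict]:
    """Save every embedded image of one PDF as PNG and return its rows."""
    import fitz  # PyMuPDF, imported lazily and assumed to be installed

    rows: List[dict] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                pix = fitz.Pixmap(doc, img[0])  # img[0] is the image xref
                if pix.n - pix.alpha > 3:
                    # More than three colour channels (e.g. CMYK): convert to RGB
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                row = image_record(first_id + len(rows), paper_id,
                                   page.number + 1, image_dir)
                pix.save(row["image_path"])
                rows.append(row)
    return rows
```

The saved files would then be uploaded to the distributed file system and the returned rows loaded into the image_info table.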

S602: setting an ontology, attributes and relations for constructing the knowledge graph.

The advantage of a knowledge graph for scientific literature is that the graph structure yields more related information. The problems it can mainly solve and analyze include finding the documents related to a given document, finding co-authors, and so on. The design is therefore as follows:

(1) The ontologies are: document and author; (2) the attributes of a document include title, author, publication time, keywords, abstract, etc., and the attributes of an author include name, organization, region, etc.; (3) the relations include the "written" relation between authors and documents and the "cited" relation between documents.

S604: extracting entities of points based on the ontology and the attributes of the knowledge graph, and extracting entities of edges according to the relations of the knowledge graph.

To extract the point entities, the entities and their unique identifiers node_id are derived from the generated table paper_info.

(a) Document entity extraction: all columns of the table paper_info are extracted, with the first identifier paper_id used as the unique identifier node_id; the value range is [1, a], where a is the number of documents.

(b) Author entity extraction: the author-related attribute columns (e.g., author name, organization, region) and the paper_id column of the table paper_info are extracted. A new node_id column holds the unique identifier node_id, with range [a+1, n], where n is the number of all point entities. A new table author_info is also created in the structured database, with the following format:

To extract the edge entities, the generated tables reference_info, paper_info and author_info are mainly used.

(a) Document citation relation entity extraction: the table reference_info is joined with the table paper_info to obtain the paper_id of each of the two point entities, generating the entities of the "cited" edges;

(b) Author-document relation entity extraction: the paper_id (node_id of the document) and the node_id (node_id of the author) in the table author_info are extracted, generating the entities of the "written" edges.
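
The point and edge extraction of step S604 can be sketched as follows. The sketch assumes that paper_id values already run from 1 to a, that each paper_info row lists its authors in a hypothetical authors field, that authors are identified by name, and that reference_info rows carry the citing paper_id and the cited title; none of these details are fixed by the description.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, str, int]  # (source node_id, relation, target node_id)


def extract_points(paper_info: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Document points take node_id = paper_id; author points follow from a+1."""
    papers = [
        {**{k: v for k, v in row.items() if k != "authors"}, "node_id": row["paper_id"]}
        for row in paper_info
    ]
    a = len(paper_info)
    author_ids: Dict[str, int] = {}
    author_info = []
    for row in paper_info:
        for name in row["authors"]:
            # The first appearance of a name allocates the next free node_id.
            node_id = author_ids.setdefault(name, a + len(author_ids) + 1)
            author_info.append({"node_id": node_id, "name": name,
                                "paper_id": row["paper_id"]})
    return papers, author_info


def extract_edges(reference_info: List[dict], paper_info: List[dict],
                  author_info: List[dict]) -> List[Edge]:
    """Build "cited" edges by joining on title, and "written" edges from author_info."""
    by_title = {row["title"].casefold(): row["paper_id"] for row in paper_info}
    edges: List[Edge] = []
    for ref in reference_info:
        target = by_title.get(ref["title"].casefold())
        if target is not None:
            edges.append((ref["paper_id"], "cited", target))
    edges += [(a["node_id"], "written", a["paper_id"]) for a in author_info]
    return edges
```

With authors identified by name, one author node can carry several "written" edges, which is what makes co-author queries possible on the resulting graph.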

S606: constructing a model of the knowledge graph based on the ontology, the attributes and the relations, and constructing the knowledge graph based on the model, the entities of the points and the entities of the edges.

First, the Schema model of the graph is constructed from the ontology, attributes and relations of the knowledge graph set in step S602; then the datamap is constructed from the Schema and the point and edge entities extracted in step S604; finally, the Schema model, the datamap and the graph entities are written into the JanusGraph distributed graph database for storage.

The Schema model mainly defines the basic structure and indexes of the knowledge graph. The Schema model defines: the "propertyKeys" property key list, which contains the names, data types and the like of the extracted attributes; the "vertex labels" point labels, i.e., the names of the ontologies; and the "edgeLabels" edge labels, i.e., the names of the relations.

The datamap is the mapping file for the point and edge entity files. The "vertex map" defines, for the entity csv file of each point, the corresponding ontology and the attribute key corresponding to each column; the "edge map" defines, for the entity csv file of each edge, the corresponding relation and the ontologies of the edge's end points.
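
A sketch of how the Schema model and the datamap might be assembled as JSON-serializable structures is given below. The tool that loads them into JanusGraph is not named in this description, so the key spellings vertexLabels, vertexMap and edgeMap, the nested field names, the data type names and the csv file names are all assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, Tuple


def build_schema(properties: Dict[str, str], vertex_labels: Iterable[str],
                 edge_labels: Iterable[str]) -> dict:
    """Property keys with data types, point labels and edge labels."""
    return {
        "propertyKeys": [{"name": n, "dataType": t} for n, t in properties.items()],
        "vertexLabels": [{"name": n} for n in vertex_labels],
        "edgeLabels": [{"name": n} for n in edge_labels],
    }


def build_datamap(vertex_files: Dict[str, Tuple[str, Dict[str, str]]],
                  edge_files: Dict[str, Tuple[str, str, str]]) -> dict:
    """Map each entity csv file to its ontology or relation."""
    return {
        # csv file -> ontology label and column-to-property-key mapping
        "vertexMap": {f: {"label": label, "columns": cols}
                      for f, (label, cols) in vertex_files.items()},
        # csv file -> relation label and the ontologies of its two end points
        "edgeMap": {f: {"label": label, "from": src, "to": dst}
                    for f, (label, src, dst) in edge_files.items()},
    }


schema = build_schema(
    {"node_id": "Integer", "title": "String", "name": "String"},
    ["paper", "author"],
    ["written", "cited"],
)
datamap = build_datamap(
    {"paper.csv": ("paper", {"paper_id": "node_id", "title": "title"}),
     "author.csv": ("author", {"node_id": "node_id", "name": "name"})},
    {"written.csv": ("written", "author", "paper"),
     "cited.csv": ("cited", "paper", "paper")},
)
schema_json = json.dumps(schema, indent=2)  # text that could be written to a file
```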

In step S210, the distributed storage system is used to store multi-modal data and provides different storage structures for data of different modalities. It is composed of a distributed file system, a structured database and a distributed graph database, and the underlying layers of both the structured database and the distributed graph database are based on the distributed file system. The distributed storage system stores the multi-modal data by category: the distributed file system stores the document full-text data and image data; the structured database stores the various metadata; and the distributed graph database stores the knowledge graph.

As shown in fig. 7, the distributed storage system has search query interfaces, which mainly provide a search query function and allow a user to search the document metadata and the constructed knowledge graph.

The first search query interface is used for direct retrieval of metadata and indirect retrieval of document full-text data and image data, mainly by querying the structured data stored in Hive with HiveQL, scripting languages and the like. To retrieve metadata, the stored metadata is queried directly. To retrieve a full-text document, the retrieved metadata is used to obtain the storage address of the corresponding document full-text data (pdf), from which the full-text data is fetched. To retrieve image data, the image metadata is queried to obtain the storage address of the corresponding extracted image data (png), from which the image data is fetched.

The second search query interface is used for direct retrieval of the knowledge graph and indirect retrieval of metadata, document full-text data and image data. To retrieve the knowledge graph, the generated knowledge graph is queried directly in the Gremlin language. To retrieve metadata, Gremlin first retrieves the node_id of one or more points, and HiveQL then queries the metadata stored in the structured database to obtain the metadata of the corresponding points. To retrieve full-text documents and image data, the corresponding addresses stored in the distributed file system are obtained from that metadata, from which the full-text data and image data are fetched.
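
The direct and indirect retrieval pattern of the first search query interface can be illustrated with the sketch below. To keep the example runnable without a cluster, Python's built-in sqlite3 module stands in for Hive, so the SQL shown is SQLite rather than HiveQL; the title column, the paper_id column of image_info and the sample rows are assumptions.

```python
import sqlite3
from typing import List, Tuple


def open_demo_store() -> sqlite3.Connection:
    """In-memory stand-in for the structured database, with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE paper_info (paper_id INTEGER, title TEXT, pdf_path TEXT);
        CREATE TABLE image_info (image_id INTEGER, paper_id INTEGER, image_path TEXT);
        INSERT INTO paper_info VALUES (1, 'Graph Stores', '/fs/pdf/1.pdf');
        INSERT INTO image_info VALUES (10, 1, '/fs/img/10.png');
        INSERT INTO image_info VALUES (11, 1, '/fs/img/11.png');
    """)
    return conn


def find_metadata(conn: sqlite3.Connection, title: str) -> List[Tuple]:
    """Direct retrieval: query the stored metadata itself."""
    sql = "SELECT paper_id, title, pdf_path FROM paper_info WHERE title = ?"
    return conn.execute(sql, (title,)).fetchall()


def find_full_text_paths(conn: sqlite3.Connection, title: str) -> List[str]:
    """Indirect retrieval: metadata first, then the stored pdf address."""
    return [row[2] for row in find_metadata(conn, title)]


def find_image_paths(conn: sqlite3.Connection, paper_id: int) -> List[str]:
    """Indirect retrieval: image metadata first, then the stored png address."""
    sql = "SELECT image_path FROM image_info WHERE paper_id = ? ORDER BY image_id"
    return [row[0] for row in conn.execute(sql, (paper_id,))]
```

The returned addresses would then be used to read the pdf or png files from the distributed file system.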
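
The two-stage lookup of the second search query interface can be illustrated with the sketch below. Plain Python structures stand in for both stores: the edge list plays the role of the graph that would be traversed with Gremlin, and the list of rows plays the role of the metadata that would be queried with HiveQL. The edge layout (citing node_id, relation, cited node_id) and the function names are assumptions.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, str, int]  # (source node_id, relation, target node_id)


def citing_node_ids(edges: Iterable[Edge], node_id: int) -> List[int]:
    """Stage 1 (graph side): node_ids of the documents citing a given document."""
    return sorted(src for src, label, dst in edges
                  if label == "cited" and dst == node_id)


def resolve_metadata(node_ids: Iterable[int], paper_info: List[dict]) -> List[dict]:
    """Stage 2 (structured side): metadata rows for the retrieved node_ids."""
    by_id = {row["paper_id"]: row for row in paper_info}
    return [by_id[i] for i in node_ids if i in by_id]


def full_text_addresses(rows: Iterable[dict]) -> List[str]:
    """Stage 3: storage addresses of the corresponding full-text data."""
    return [row["pdf_path"] for row in rows if row.get("pdf_path")]
```

Keeping the stages separate mirrors the description: the graph answers which points are related, and only then is the structured metadata, and through it the file system, consulted.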

It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of steps or stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of execution of the steps or stages is not necessarily sequential, but may be performed in turn or alternately with at least a part of other steps or stages.

In one embodiment, as shown in fig. 8, the present invention provides a distributed storage device for multimodal literature data, the device comprising:

the data acquisition module 802 is configured to acquire full-text data of a document and store the full-text data in a distributed file system;

a first extraction module 804, configured to extract document metadata of the document full-text data, and store the document metadata in a structured database;

a second extraction module 806, configured to extract image data in the full text data of the document, extract image metadata of the image data, store the image data in a distributed file system, and store the image metadata in a structured database;

the map construction module 808 is configured to construct a knowledge map based on the document metadata and the image metadata, and store the knowledge map in a distributed map database;

the system building module 810 is configured to build a distributed storage system based on the distributed file system, the structured database, and the distributed graph database.

For specific limitations on the distributed storage device for multimodal literature data, reference may be made to the above limitations on the distributed storage method for multimodal literature data, which are not repeated here. The modules of the device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

In one embodiment, the embodiment of the present invention provides a computer device, which may be a server, and an internal structure diagram thereof may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is for storing motion detection data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the steps of any of the above-described embodiments of a distributed storage method for multimodal literature data.

It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, the present invention provides a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of any of the above embodiments of the distributed storage method for multimodal literature data.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a non-transitory computer readable storage medium and which, when executed, may carry out the steps of the above-described method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.

The foregoing examples represent only a few embodiments of the present application. They are described in relative detail, but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A distributed storage method for multimodal literature data, the method comprising:

extracting full text metadata of the full text data of the document; extracting the quotation metadata of the full-text data of the document and outputting quotation relation data; merging the full text metadata and the quotation metadata to obtain document metadata, and respectively storing the document metadata and the quotation relation data into a structured database;

constructing a distributed storage system based on the distributed file system, the structured database and the distributed graph database;

wherein the extracting the quotation metadata of the full-text data of the document and outputting the quotation relation data comprises the following steps:

2. The method of claim 1, wherein the extracting full text metadata of the document full text data comprises:

extracting initial full text metadata of the full text data of the document;

performing standardization processing on the initial full text metadata;

3. The method of claim 1, wherein constructing a knowledge-graph based on the document metadata and image metadata comprises:

4. The method according to claim 1, wherein the method further comprises:

5. A distributed storage device for multimodal literature data, the device comprising:

the first extraction module is used for extracting full text metadata of the full text data of the document; extracting the quotation metadata of the full-text data of the document and outputting quotation relation data; merging the full text metadata and the quotation metadata to obtain document metadata, and respectively storing the document metadata and the quotation relation data into a structured database; wherein the extracting the quotation metadata of the full-text data of the document and outputting the quotation relation data comprises the following steps: extracting initial quotation metadata of the full text data of the document and labeling a first identifier of the citing document full-text data; carrying out standardization processing on the initial quotation metadata, and setting a second identifier of the cited document full-text data to obtain quotation relation data; deleting the first identifier of the citing document full-text data, and deduplicating according to the second identifier to obtain the quotation metadata;
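The quotation-processing steps recited above (label the citing document, standardize, set an identifier for the cited document, then drop the citing identifier and deduplicate) can be sketched as follows. This is an illustrative sketch only: the function name `process_quotations`, the record fields `title` and `year`, and the rule that builds the second identifier from the normalized title and year are assumptions, not details given in the claims.

```python
def process_quotations(raw_quotations):
    """Return (quotation relation data, quotation metadata).

    raw_quotations: list of (citing_doc_id, reference_record) pairs, where
    reference_record is a dict with hypothetical fields 'title' and 'year'.
    """
    # Step 1: label each initial quotation record with the first identifier,
    # i.e. the identifier of the citing document.
    labeled = [{"first_id": citing_id, **ref} for citing_id, ref in raw_quotations]

    # Step 2: standardize each record and set the second identifier, i.e. the
    # identifier of the cited document. The key format is an assumption.
    standardized = []
    relations = []
    for rec in labeled:
        title = " ".join(rec["title"].lower().split())  # normalize case and spacing
        second_id = f"{title}|{rec['year']}"
        standardized.append({
            "first_id": rec["first_id"],
            "second_id": second_id,
            "title": title,
            "year": rec["year"],
        })
        relations.append((rec["first_id"], second_id))  # quotation relation data

    # Step 3: delete the first identifier and deduplicate by second identifier,
    # keeping the first record seen for each cited document.
    metadata = {}
    for rec in standardized:
        entry = {k: v for k, v in rec.items() if k != "first_id"}
        metadata.setdefault(entry["second_id"], entry)

    return relations, list(metadata.values())


raw = [
    ("A", {"title": "Deep  Learning", "year": 2015}),
    ("B", {"title": "deep learning", "year": 2015}),
]
relations, quotation_metadata = process_quotations(raw)
```

Here two citing documents reference the same work, so the relation data keeps both citing-to-cited pairs while the deduplicated quotation metadata holds a single record for the cited work.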

6. The apparatus of claim 5, wherein the system building module comprises:

7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 4.

8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 4.