CN117076474B - Method, device, equipment and medium for updating offline multi-mode literature data - Google Patents

Publication number
CN117076474B
Authority
CN
China
Prior art keywords
metadata
document
new
data
full text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311336095.XA
Other languages
Chinese (zh)
Other versions
CN117076474A (en)
Inventor
陆矜菁
严笑然
厉燕
刘洋
陈一家
侯炜华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311336095.XA
Publication of CN117076474A
Application granted
Publication of CN117076474B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of data processing, and in particular to a method, an apparatus, a device, and a medium for updating offline multi-modal document data. The method comprises the following steps: collecting new document full-text data and extracting second document metadata from the new document full-text data; searching the structured database to determine whether the second document metadata exists in the structured database, determining whether the new document full-text data exists in the distributed file system, determining whether the second document metadata differs from the first document metadata, and generating a document update table; and, based on the document update table, updating the new document full-text data and second image data of the new document full-text data to the distributed file system and/or updating the second document metadata and second image metadata of the second image data to the structured database. The invention thereby updates the multi-modal document data held in a distributed storage system.

Description

Method, device, equipment and medium for updating offline multi-mode literature data
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for updating offline multi-modal document data.
Background
Literature databases are numerous and hold a growing volume of research literature, but each large literature database is relatively closed. Integrating the literature data of the major platforms on demand is therefore valuable. The first step is to construct a multi-modal document data storage and query system; updating the multi-modal document data stored in that system is the complementary second part, and it matters both for the development of scientific research and for expanding the scale of high-quality data sets.
Existing large document databases provide no method for updating multi-modal document data, so a method for updating multi-modal document data is needed.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, computer device, and storage medium for updating offline multi-modal document data.
In a first aspect, an embodiment of the present invention provides an offline multi-modal document data updating method, applied to a distributed storage system, where the distributed storage system includes a distributed file system for storing document full text data and first image data in the document full text data, and a structured database for storing first document metadata of the document full text data and first image metadata of the first image data, and the method includes:
collecting full text data of a new document, and extracting second document metadata of the full text data of the new document;
searching the structured database to determine whether the second document metadata exists in the structured database, determining whether the new document full text data exists in the distributed file system, determining whether the second document metadata is different from the first document metadata, and generating a document update table;
based on the document update table, updating the new document full text data and second image data of the new document full text data to the distributed file system and/or updating the second document metadata and second image metadata of the second image data to the structured database.
In an embodiment, the extracting the second document metadata of the new document full text data includes:
extracting full text metadata of the full text data of the new document; and
extracting the quotation metadata of the full text data of the new document;
generating a new document metadata table based on the full text metadata and the quotation metadata, wherein the new document metadata table is used for recording the second document metadata.
In one embodiment, the searching the structured database to determine whether the second document metadata exists in the structured database, determining whether the new document full text data exists in the distributed file system, determining whether the second document metadata is different from the first document metadata, and generating a document update table includes:
reading the new document metadata table and the document metadata table in the structured database;
searching, based on the new document metadata table and the document metadata table, whether the second document metadata exists in the document metadata table, and if it does not exist, marking it as a first update type; if it exists, searching whether the new document full text data exists in the distributed file system, and if it does not exist, marking it as a second update type; if it exists, determining whether the second document metadata is different from the first document metadata, and if it is different, marking it as a third update type;
and generating the document update table based on the result of the search.
In an embodiment, if the new document full text data does not exist, second image data of the new document full text data and second image metadata of the second image data are extracted.
In an embodiment, the updating the new document full text data and the second image data of the new document full text data to the distributed file system and/or updating the second document metadata and the second image metadata of the second image data to the structured database based on the document update table comprises:
determining an update type based on the document update table;
if the data is of the first update type, updating the second document metadata into the document metadata table, updating the second image metadata of the second image data into the image metadata table in the structured database, and updating the new document full text data into the distributed file system;
if the data is of the second update type, updating the second document metadata into the document metadata table, updating the second image metadata of the second image data into the image metadata table in the structured database, and updating the new document full text data and the second image data of the new document full text data into the distributed file system;
and if the data is of the third update type, correcting the first document metadata in the document metadata table based on the second document metadata.
In an embodiment, the updating the second document metadata into the document metadata table includes:
and updating the second document metadata into the document metadata table by adopting a direct updating method or a zipper table updating method.
In an embodiment, before the generating a new document metadata table based on the full text metadata and the quotation metadata, the method further comprises:
labeling the full text metadata and the quotation metadata;
normalizing the labeled full text metadata and quotation metadata;
setting a first identifier for the normalized full text metadata and quotation metadata, and generating the new document metadata table.
In a second aspect, an embodiment of the present invention proposes an offline multi-modal document data updating apparatus applied to a distributed storage system, the distributed storage system including a distributed file system for storing document full text data and first image data in the document full text data, a structured database for storing first document metadata of the document full text data and first image metadata of the first image data, the apparatus comprising:
the extraction module is used for collecting the full text data of the new document and extracting the second document metadata of the full text data of the new document;
the searching module is used for searching the structured database to judge whether the second document metadata exists in the structured database, judging whether the new document full-text data exists in the distributed file system and judging whether the second document metadata is different from the first document metadata or not, and generating a document update table;
an updating module, configured to update the new document full text data and second image data of the new document full text data to the distributed file system and/or update the second document metadata and second image metadata of the second image data to the structured database based on the document update table.
In a third aspect, an embodiment of the present invention proposes a computer device comprising a memory storing a computer program and a processor, the processor implementing the steps of the method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention proposes a computer readable storage medium on which a computer program is stored, the computer program implementing the steps of the method of the first aspect when executed by a processor.
Compared with the prior art, the method, the apparatus, the computer device, and the storage medium collect new document full text data and extract second document metadata of the new document full text data; search the structured database to determine whether the second document metadata exists in the structured database, determine whether the new document full text data exists in the distributed file system, determine whether the second document metadata is different from the first document metadata, and generate a document update table; and, based on the document update table, update the new document full text data and the second image data of the new document full text data to the distributed file system and/or update the second document metadata and the second image metadata of the second image data to the structured database, thereby updating the multi-modal document data of the distributed storage system.
Drawings
FIG. 1 is a schematic diagram of a distributed storage system according to an embodiment;
FIG. 2 is a flowchart of a method for updating offline multi-modal document data in an embodiment;
FIG. 3 is a flow chart illustrating a method for generating a metadata table of a new document according to an embodiment;
FIG. 4 is a flowchart of a new document metadata table generation method according to another embodiment;
FIG. 5 is a flow chart illustrating generation of a document update table in one embodiment;
FIG. 6 is a flowchart illustrating the step S206 in one embodiment;
FIG. 7 is a schematic diagram illustrating a module connection of an offline multi-modal document data update apparatus according to an embodiment;
FIG. 8 is a schematic structural diagram of a computer device in an embodiment.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present invention, and it is apparent to those of ordinary skill in the art that the present invention may be applied to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
As used in the specification and in the claims, the terms "a," "an," and/or "the" are not specific to the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, as other steps or elements may be included in a method or apparatus.
While the present invention makes various references to certain modules in an apparatus according to embodiments of the invention, any number of different modules may be used and run on a computing device and/or processor. The modules are merely illustrative and different aspects of the apparatus and method may use different modules.
It will be understood that when an element or module is referred to as being "connected," "coupled" to another element, module, or block, it can be directly connected or coupled or in communication with the other element, module, or block, or intervening elements, modules, or blocks may be present unless the context clearly dictates otherwise. The term "and/or" as used herein may include any and all combinations of one or more of the associated listed items.
The method for updating offline multi-modal document data can be applied to the distributed storage system shown in FIG. 1. As shown in FIG. 1, the distributed storage system includes a distributed file system 102 for storing document full text data and image data in the document full text data, and a structured database 104 for storing document metadata of the document full text data and first image metadata of the image data. The distributed storage system further includes a search query interface, specifically a first search query interface 106, for search queries of multi-modal data.
As shown in FIG. 2, an embodiment of the present invention provides a method for updating offline multi-modal document data. Taking the method as applied to the system in FIG. 1 as an example, it includes the following steps:
S202: Collect new document full text data, and extract second document metadata of the new document full text data.
New document full text data is collected mainly by using an automated program to crawl and download documents in batches from open-source academic websites. Some documents are difficult to crawl automatically, for example documents whose download is restricted by an anti-crawler mechanism but which are specifically required; these are downloaded to the local machine manually. The local folder path local_path is recorded.
S204: Search the structured database to determine whether the second document metadata exists in the structured database, determine whether the new document full text data exists in the distributed file system, determine whether the second document metadata is different from the first document metadata, and generate a document update table.
S206: Based on the document update table, update the new document full text data and second image data of the new document full text data to the distributed file system and/or update the second document metadata and second image metadata of the second image data to the structured database.
Steps S202-S206 thus collect the new document full text data and extract its second document metadata, check the structured database and the distributed file system to generate a document update table, and apply the updates recorded in that table to the distributed file system and/or the structured database, thereby updating the multi-modal document data of the distributed storage system.
As shown in FIG. 3, step S202 specifically includes the following steps:
S302: Extract full text metadata of the new document full text data, and extract quotation metadata of the new document full text data.
Extraction is performed in batches by a program, mainly using Python libraries such as PyPDF2, to extract the full text metadata of the new document full text data and its quotation metadata. The extracted content mainly includes the document title, author, subject, date, and the like, and the file name pdf_name of the new document full text data is recorded at the same time.
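The batch extraction of full text metadata can be sketched as follows. This is only an illustration under stated assumptions: the helper names metadata_row and extract_full_text_metadata and the row column names are hypothetical, the date is taken from the PDF creation-date entry, and quotation metadata extraction is not covered because the text does not describe how it is done.

```python
from pathlib import Path


def metadata_row(pdf_name, title=None, author=None, subject=None, date=None):
    """Build one row of extracted full text metadata (column names are assumed)."""
    return {
        "pdf_name": pdf_name,
        "title": title,
        "author": author,
        "subject": subject,
        "date": date,
    }


def extract_full_text_metadata(local_path):
    """Read the document information of every PDF found under local_path."""
    from PyPDF2 import PdfReader  # third-party library, imported only when called

    rows = []
    for pdf in sorted(Path(local_path).glob("*.pdf")):
        meta = PdfReader(str(pdf)).metadata  # None when the PDF carries no info
        if meta is None:
            rows.append(metadata_row(pdf.name))
        else:
            rows.append(
                metadata_row(
                    pdf.name,
                    title=meta.title,
                    author=meta.author,
                    subject=meta.subject,
                    date=meta.get("/CreationDate"),  # raw PDF date string
                )
            )
    return rows
```

Rows with a missing title or author are the ones that would later need manual labeling.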
S304: generating a new document metadata table based on the full text metadata and the quotation metadata, wherein the new document metadata table is used for recording the second document metadata.
In a further embodiment, before the generating a new document metadata table based on the full text metadata and the quotation metadata, as shown in FIG. 4, the method further comprises the following steps:
S402: Label the full text metadata and the quotation metadata.
Labeling is divided into batch labeling and manual labeling. Batch labeling adds a "full text data" column, labeling collected full text metadata with 1 and extracted quotation metadata with 0. Manual labeling fills in key metadata that the program failed to extract (such as a missing title or author) and annotates the small number of documents in non-PDF formats (CAJ files and the like).
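A minimal sketch of the labeling step is given below. It assumes that the "full text data" column is stored under the name pdf_existed and that rows are plain dictionaries; the function names label_batch and needs_manual_annotation are hypothetical.

```python
def label_batch(full_text_rows, quotation_rows):
    """Batch labeling: 1 for collected full text metadata, 0 for quotation metadata."""
    labeled = [{**row, "pdf_existed": 1} for row in full_text_rows]
    labeled += [{**row, "pdf_existed": 0} for row in quotation_rows]
    return labeled


def needs_manual_annotation(row):
    """Flag rows with missing key information or with a non-PDF source file."""
    missing_key_info = not row.get("title") or not row.get("author")
    name = row.get("pdf_name") or ""
    non_pdf = bool(name) and not name.lower().endswith(".pdf")
    return missing_key_info or non_pdf
```

The second function only selects candidates for manual labeling; the annotation itself remains a human task.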
S404: Normalize the labeled full text metadata and quotation metadata.
For example, for new document full text data from different sources, the extracted second document metadata may be inconsistent in letter case, number of spaces, and the like. The different data formats therefore need to be normalized so that their important information is consistent; for example, the title format is made consistent with the title format in the original document metadata table paper_info.
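As an illustration of such normalization, the sketch below collapses runs of whitespace and unifies letter case in a title. Lower-casing is an assumption made for the example; the actual target format is whatever paper_info uses, which the text does not specify. The function name normalize_title is hypothetical.

```python
import re


def normalize_title(title):
    """Collapse whitespace runs to one space, trim the ends, and lower-case."""
    collapsed = re.sub(r"\s+", " ", title).strip()
    return collapsed.lower()
```

Applying the same function to titles from both tables makes them comparable when matching documents later.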
S406: setting a first identifier of the full text metadata and the quotation metadata after the normalization processing, and generating a new document metadata table.
The method is novel literature full-text data collected in batches, so that the problems of partial literature repetition and citation repetition exist. The main way is to perform deduplication according to the document title of the normalized data, wherein for the "full text data" column in S402, the following formula is adopted for judgment:
P = P1∨P2∨P3……Pn ,
where P is the value of the "full text data" column, pn is the value of the "literature full text data present" (pdf_exposed) column of metadata for each repeated row. For other columns, the data is reserved by adopting a complementary method, like a document, the "subject" of the data in the a line is missing, but the data is acquired when the data exists in the b line, and the first line data is acquired when the data is the same or different.
Finally, the unique identifier new_paper_id is calibrated for the deduplicated data.
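The deduplication rule can be sketched as follows, assuming rows are dictionaries keyed by column name, that the title has already been normalized, and that new_paper_id is numbered from 1 (the starting value is an assumption). The function name deduplicate is hypothetical.

```python
def deduplicate(rows):
    """Merge rows that share a title and assign new_paper_id to the result."""
    merged = {}  # title -> merged row, in first-seen order
    for row in rows:
        key = row["title"]
        if key not in merged:
            merged[key] = dict(row)
            continue
        kept = merged[key]
        # P = P1 v P2 v ... v Pn: full text exists if any duplicate has it.
        kept["pdf_existed"] = kept["pdf_existed"] | row["pdf_existed"]
        # Complement missing values; otherwise the first row's value wins.
        for col, value in row.items():
            if col == "pdf_existed":
                continue
            if kept.get(col) in (None, "") and value not in (None, ""):
                kept[col] = value
    return [
        {"new_paper_id": new_paper_id, **row}
        for new_paper_id, row in enumerate(merged.values(), start=1)
    ]
```

For two rows with the same title where only the second has a subject and a full text, the merged row keeps the first row's author, takes the second row's subject, and has pdf_existed equal to 1.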
The second document metadata is imported into the structured database for storage: a new document metadata table new_paper_info is created in the structured database, and the processed second document metadata is imported into it.
The format of the new document metadata table new_paper_info is as follows:
The main purpose of step S204 is to search the existing distributed storage system, determine the update type of each document according to the second document metadata and the new document full text data, and generate a document update table so that the multi-modal data can be updated by category. As shown in FIG. 5, step S204 specifically comprises the following steps:
S502: Read the new document metadata table and the document metadata table in the structured database.
A connection is made to the structured database, and the original document metadata table paper_info and the new document metadata table new_paper_info are read. The original document metadata table paper_info has the following format:
The two tables are joined and the result is stored as a temporary table tmp.csv. The source tables are marked to distinguish them: the new document metadata table new_paper_info is T1 and the original document metadata table paper_info is T2. The temporary table tmp.csv has the following format:
The number of rows of the temporary table tmp.csv equals the number of rows of table T1; if the original document metadata table does not contain a document to be updated, all T2 attributes of that row are null.
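The row-count and null behaviour described above is what a left join of T1 onto T2 produces. The sketch below shows such a join on in-memory rows; using the title as the join key is an assumption, as is the "T1."/"T2." column prefix, and writing the result to tmp.csv is omitted. The function name left_join is hypothetical.

```python
def left_join(t1_rows, t2_rows, key="title"):
    """Keep every T1 row; attach the matching T2 row or nulls when none exists."""
    t2_by_key = {row[key]: row for row in t2_rows}
    t2_columns = sorted({col for row in t2_rows for col in row})
    joined = []
    for t1 in t1_rows:
        match = t2_by_key.get(t1[key])
        out = {f"T1.{col}": value for col, value in t1.items()}
        for col in t2_columns:
            out[f"T2.{col}"] = match.get(col) if match else None
        joined.append(out)
    return joined
```

A row whose T2.paper_id is null after the join corresponds to a document that the original metadata table does not yet contain.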
S504: Based on the new document metadata table and the document metadata table, search whether the second document metadata exists in the document metadata table; if it does not exist, mark it as a first update type. If it exists, search whether the new document full text data exists in the distributed file system; if it does not exist, mark it as a second update type. If it exists, determine whether the second document metadata is different from the first document metadata, and if it is different, mark it as a third update type.
For example, the first update type is labeled A, the second update type is labeled B, and the third update type is labeled C.
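The decision sequence of S504 can be written as a small function. The three boolean inputs stand for the results of the three checks; returning None when nothing needs updating is an assumption, since the text does not name that case. The function name classify is hypothetical.

```python
def classify(in_metadata_table, full_text_in_dfs, metadata_differs):
    """Return the update type label A, B, or C, or None if nothing changes."""
    if not in_metadata_table:
        return "A"  # document unknown to the structured database
    if not full_text_in_dfs:
        return "B"  # metadata known, full text missing from the file system
    if metadata_differs:
        return "C"  # both present, but the metadata needs correction
    return None
```

The checks are ordered, so the later inputs are only consulted when the earlier ones hold.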
S506: Generate the document update table based on the result of the search.
The columns holding the required document attributes are extracted from the temporary table tmp.csv: mainly the attributes of table T1, while only T2.paper_id is extracted from table T2. This yields a new table, which then replaces the new document metadata table new_paper_info in the structured database. The format of the updated new_paper_info is shown in the following table:
Based on the updated new_paper_info table, the update type and the update mode are confirmed; they are determined mainly by the two columns update_type and pdf_existed, and the document update table update_info is output.
The document update table update_info below indicates the corresponding update contents; D1, D2, and D3 respectively represent data of three modalities, and the update methods are described later. "Metadata" means that a new piece of document metadata is inserted into the original document metadata table paper_info; "metadata supplement" and "pdf information column in metadata" mean that the value of an attribute column of existing data in the original document metadata table paper_info is changed. pdf_existed indicates whether the new document full text data exists: rows with pdf_existed=1 are mostly collected new document full text data, and rows with pdf_existed=0 are extracted citations; some citations happen to be present in the collected new document full text data, and for these pdf_existed=1.
In an embodiment, if the new document full text data does not exist, second image data of the new document full text data and second image metadata of the second image data are extracted.
First, the required data is determined according to the document update table update_info and the updated new document metadata table new_paper_info.
Second, a connection is made to the structured database, the updated new document metadata table new_paper_info is read and the required data is extracted, and the new document full text data is read from the full text data storage address local_path/pdf_name according to the pdf_name provided by new_paper_info.
Third, the second image data of the new document full text data is extracted in batches by a program, specifically using the fitz library of PyMuPDF. The second image data is stored locally as unstructured data under the storage path image_local_path.
Fourth, second image metadata of the second image data is generated, stored as structured data, and imported into a new image metadata table new_image_info in the structured database.
Generating the second image metadata requires numbering each image; the number serves as the unique identifier of the second image metadata, and the image is named {new_image_id}.png. The image metadata format includes new_paper_id, which is obtained from the new document metadata table new_paper_info that was read.
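The image extraction and numbering can be sketched as follows. The record fields image_name and image_path, the starting number, and the function names are assumptions made for the example and do not reproduce the metadata format of the source; only new_image_id, new_paper_id, and the {new_image_id}.png naming come from the text.

```python
from pathlib import Path


def image_metadata(new_image_id, new_paper_id, image_local_path):
    """Build one image metadata record; the image file is {new_image_id}.png."""
    image_name = f"{new_image_id}.png"
    return {
        "new_image_id": new_image_id,
        "new_paper_id": new_paper_id,
        "image_name": image_name,
        "image_path": str(Path(image_local_path) / image_name),
    }


def extract_images(pdf_file, new_paper_id, image_local_path, first_image_id=1):
    """Save every embedded image of one PDF as PNG and return its metadata."""
    import fitz  # PyMuPDF, imported only when called

    Path(image_local_path).mkdir(parents=True, exist_ok=True)
    records = []
    next_id = first_image_id
    with fitz.open(pdf_file) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                pix = fitz.Pixmap(doc, img[0])  # img[0] is the image xref
                if pix.n - pix.alpha > 3:  # e.g. CMYK: convert to RGB for PNG
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                record = image_metadata(next_id, new_paper_id, image_local_path)
                pix.save(record["image_path"])
                records.append(record)
                next_id += 1
    return records
```

The returned records are what would be imported into new_image_info, while the PNG files remain under image_local_path until they are uploaded.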
As shown in FIG. 6, step S206 specifically includes the following steps:
S602: Determine the update type based on the document update table.
S604: If the data is of the first update type, update the second document metadata into the document metadata table, update the second image metadata of the second image data into the image metadata table in the structured database, and update the new document full text data into the distributed file system.
S606: If the data is of the second update type, update the second document metadata into the document metadata table, update the second image metadata of the second image data into the image metadata table in the structured database, and update the new document full text data and the second image data of the new document full text data into the distributed file system.
S608: If the data is of the third update type, correct the first document metadata in the document metadata table based on the second document metadata.
The first update type A means that the original document metadata table does not contain the new document, so the second document metadata, the new document full text data, and its second image data all need to be inserted. The second update type B means that the original document metadata table already holds metadata for the document but the distributed file system holds no full text data for it, so the new document full text data and image data need to be added, the pdf information column in the document metadata needs to be updated, and part of the metadata may need to be corrected or supplemented. The third update type C means that both the document metadata and the document full text data already exist, so no full text data needs to be added; only the missing parts of the original first document metadata are supplemented, or the first document metadata is corrected, using the second document metadata.
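One way to express the per-type actions is a lookup from update type to a list of operations. The operation names below are hypothetical labels, not functions of any library. Note that the sketch follows the explanatory paragraph in also uploading image data for type A, although the step listed for the first update type mentions only the full text data.

```python
UPDATE_PLAN = {
    # New document: everything is inserted.
    "A": (
        "insert_document_metadata",
        "insert_image_metadata",
        "upload_full_text",
        "upload_images",
    ),
    # Metadata known, full text missing: add files, refresh the pdf column.
    "B": (
        "update_document_metadata",
        "insert_image_metadata",
        "upload_full_text",
        "upload_images",
    ),
    # Metadata and full text known: only correct or supplement metadata.
    "C": ("correct_document_metadata",),
}


def plan_update(update_type):
    """Return the ordered operations for one row of the document update table."""
    try:
        return UPDATE_PLAN[update_type]
    except KeyError:
        raise ValueError(f"unknown update type: {update_type!r}") from None
```

Keeping the plan as data makes it easy to check that type C never touches the distributed file system.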
In an embodiment, the structured data, that is, the second document metadata and the second image metadata, may be updated by either a direct update method or a zipper table update method. The zipper table update method makes subsequent queries, deletions, data changes, and the like by update date convenient, and it saves considerable storage space compared with a full-snapshot table. In addition, the direct update method requires manual judgment of whether a document's metadata should be corrected, whereas the zipper table update method can supplement or correct the original metadata while retaining it, without manual judgment.
For the direct update method, the details are as follows:
For data with update_type=A, the structured data update requirement is to insert the second document metadata. This data does not exist in the original data, so its paper_id is null. The required columns are extracted from the new document metadata table new_paper_info using Python or the like, paper_id is assigned, the pdf_path column is constructed, and the second document metadata is then inserted directly with an INSERT INTO statement.
paper_id is assigned by numbering the selected data sequentially as n+1, n+2, ..., n+m, where n is the number of rows in the original document metadata table paper_info and m is the number of extracted rows; the paper_id information is also written back to the new document metadata table new_paper_info. The pdf_path column is constructed with the value {hfs_path}/{pdf_name}, where hfs_path is the folder in which the new document full text data is stored on the distributed file system.
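The numbering and path construction can be sketched as below, with rows held as dictionaries. The function name assign_paper_ids and the example folder value are hypothetical; the actual insertion into paper_info is left out.

```python
def assign_paper_ids(new_rows, n, hfs_path):
    """Number m new rows as n+1 .. n+m and build pdf_path for each."""
    labeled = []
    for offset, row in enumerate(new_rows, start=1):
        labeled.append(
            {
                **row,
                "paper_id": n + offset,
                "pdf_path": f"{hfs_path}/{row['pdf_name']}",
            }
        )
    return labeled
```

With n = 100 and two new rows, the assigned identifiers are 101 and 102, and a file a.pdf stored under an assumed folder /papers gets the pdf_path /papers/a.pdf.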
For data with update_type=B or C, the update requirement is "metadata supplement" or "update of the pdf information columns in the metadata". The corresponding rows are extracted from the new document metadata table new_paper_info with a command and matched against the original document metadata table paper_info by paper_id. For the "metadata correction" requirement, the original first document metadata is replaced only after manual checking and judgment.
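A minimal sketch of the supplement step for update types B and C follows, assuming each metadata row is a dict and that the two rows have already been matched by paper_id. The function name and the treatment of empty strings as missing values are assumptions. Differing non-empty values are not overwritten; they are returned separately so they can go through the manual check described above.

```python
def supplement_metadata(original, incoming):
    """Fill empty fields of `original` from `incoming`.

    Returns (merged_row, conflicts). `conflicts` maps a column name to the
    pair (old_value, new_value) for fields where both rows hold different
    non-empty values; these are left unchanged for manual review.
    """
    merged = dict(original)
    conflicts = {}
    for column, new_value in incoming.items():
        if new_value in (None, ""):
            continue  # nothing to contribute for this column
        old_value = original.get(column)
        if old_value in (None, ""):
            merged[column] = new_value  # metadata supplement
        elif old_value != new_value:
            conflicts[column] = (old_value, new_value)  # candidate correction
    return merged, conflicts
```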
The zipper table updating method works as follows:
Two columns, start_time and to_time, are added to the original document metadata table paper_info, and the dates of data updates are recorded through these two columns.
In the initial state, start_time is the date on which the data was imported and to_time is 9999/12/31; here the initial data import took place on 2023/01/01. A time column update_date is added to the new document metadata table new_paper_info and assigned the date of the data update, here 2023/02/01. The original document metadata table paper_info is modified as follows:
For data with update_type=A, the structured-data update requirement is to insert the second document metadata. The paper_id of this data is null, so the required columns are extracted from the new document metadata table new_paper_info using Python or a similar tool, the paper_id is labeled, the pdf_path column is constructed, and the paper_id values are written back to the new document metadata table new_paper_info.
The difference from the direct update method is that the zipper table has date columns. Assuming that the new data is inserted into the document metadata table paper_info on 2023/02/01 and that n is the number of original metadata rows, the structured data table paper_info after insertion is:
When update_type=B or C, the data of the required columns is first obtained, and for data with update_type=B the pdf_path is filled in as {hdfs_path}/{pdf_name}. The attribute values are then updated as required: after the row is matched in the document metadata table paper_info, the to_time of the matched original row is changed to the day before the update, and a new row is added whose start_time is the update date, 2023/02/01, and whose to_time is 9999/12/31.
For example, if paper1 has its pdf_path attribute value supplemented and paper2 has its Subject attribute value modified, the updated paper_info is as follows:
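The zipper table update described above can be sketched in Python as follows. The table is modeled as a list of dicts, and the function names and the OPEN_END constant are assumptions. The old version of a row is closed on the day before the update and the changed version is appended with an open-ended to_time, so earlier states remain queryable by date.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)  # to_time of the currently valid version


def zipper_update(table, paper_id, changes, update_date):
    """Close the current version of `paper_id` and append the changed one."""
    for row in table:
        if row["paper_id"] == paper_id and row["to_time"] == OPEN_END:
            current = row
            break
    else:
        raise KeyError(f"no open row for paper_id={paper_id}")
    new_row = {**current, **changes, "start_time": update_date, "to_time": OPEN_END}
    current["to_time"] = update_date - timedelta(days=1)  # previous day
    table.append(new_row)
    return new_row


def as_of(table, day):
    """Return the versions that were valid on `day`."""
    return [row for row in table if row["start_time"] <= day <= row["to_time"]]
```

With the dates used in the text, updating paper1 and paper2 on 2023/02/01 leaves four rows in the table: the two original versions, now ending on 2023/01/31, and the two new versions starting on 2023/02/01.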
For the update of unstructured data, the new document full-text data and the second image data are uploaded to the distributed file system in batches.
Uploading the new document full-text data to the distributed file system mainly comprises the following steps:
The document update table identifies which new document full-text data needs to be uploaded: the pdf_name column of the rows with pdf_exposed=1 and (update_type=A or B) is extracted from the new document metadata table new_paper_info, giving the names of all new document full-text files to be uploaded, recorded as pdf_name_list.
A program can upload the new document full-text data to the distributed file system in batches in two ways.
The first method: with the new document full-text data stored under local_path, the program executes a shell command in a loop to upload each file to {hdfs_path}/.
The second method: the full-text files named in pdf_name_list are first transferred to a new folder local_path_new and then uploaded directly with a shell command.
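Both upload methods can be sketched as below. The text only speaks of "a shell command"; this sketch assumes that the distributed file system is HDFS and that the command is `hdfs dfs -put`, and the function names are hypothetical. The commands are built as argument lists so they can be inspected before anything is executed. The second method copies the files into the staging folder, whereas the text says "transferred", so moving them would be an equally valid reading.

```python
import shutil
import subprocess
from pathlib import Path


def pdf_names_to_upload(new_paper_info):
    """Build pdf_name_list from the rows that need a full-text upload."""
    return [
        row["pdf_name"]
        for row in new_paper_info
        if row["pdf_exposed"] == 1 and row["update_type"] in ("A", "B")
    ]


def per_file_commands(pdf_name_list, local_path, hdfs_path):
    """Method 1: one upload command per file, to be run in a loop."""
    return [
        ["hdfs", "dfs", "-put", f"{local_path}/{name}", f"{hdfs_path}/"]
        for name in pdf_name_list
    ]


def stage_and_command(pdf_name_list, local_path, local_path_new, hdfs_path):
    """Method 2: copy the files to a staging folder, then upload in one command."""
    staging = Path(local_path_new)
    staging.mkdir(parents=True, exist_ok=True)
    for name in pdf_name_list:
        shutil.copy2(Path(local_path) / name, staging / name)
    sources = [str(staging / name) for name in pdf_name_list]
    return ["hdfs", "dfs", "-put", *sources, f"{hdfs_path}/"]


def run(commands):
    """Execute the prepared commands; requires the hdfs client on PATH."""
    for command in commands:
        subprocess.run(command, check=True)
```

The first method starts one client process per file, which is simple but slow for many small files; the second pays for an extra local copy in exchange for a single upload call.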
To update the second image data in the distributed storage system, the second image data (in png or a similar format) needs to be uploaded to the distributed file system and the second image metadata needs to be updated in the structured database. This mainly comprises the following steps:
In the first step, the information of the images to be updated is extracted from the updated new document metadata table new_paper_info, the original image metadata table image_info and the new image metadata table new_image_info, and stored as an image_tmp table, which provides the basis for uploading the second image data and updating the second image metadata. This mainly comprises the following:
The paper_id of the document corresponding to the second image data is obtained by matching against the new document metadata table new_paper_info.
Specifically, the document update table shows that the second image data to be uploaded belongs to the rows with pdf_exposed=1 and (update_type=A or B), and the updated new document metadata table new_paper_info contains the paper_id and new_paper_id of all new document full-text data. Therefore, the rows with pdf_exposed=1 and (update_type=A or B) are extracted from new_paper_info and joined with the generated new image metadata table on new_paper_id; the result is an image_tmp table with the following format:
The image_id of the second image data is then labeled according to the original image metadata table image_info.
An image_id column is added to the image_tmp table. The original image metadata table image_info is read and its number of rows m is recorded; the images from the new image metadata table new_image_info are then numbered in order, so that the image_id values of the second image data are m+1, m+2, ..., which serve as the unique identifiers subsequently written to the image metadata table.
The image_tmp table is as follows:
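A minimal sketch of how image_tmp could be assembled, assuming both metadata tables are available as lists of dicts. The function name and the parameter `image_info_count` (the row count m of image_info) are hypothetical. The join keeps only images whose document has pdf_exposed=1 and update type A or B, and the numbering continues from m.

```python
def build_image_tmp(new_paper_info, new_image_info, image_info_count):
    """Join new_image_info with new_paper_info and label image_id.

    `image_info_count` is m, the number of rows already in image_info.
    """
    # Map new_paper_id -> paper_id for the documents whose images are uploaded.
    paper_ids = {
        row["new_paper_id"]: row["paper_id"]
        for row in new_paper_info
        if row["pdf_exposed"] == 1 and row["update_type"] in ("A", "B")
    }
    image_tmp = []
    next_id = image_info_count + 1
    for image in new_image_info:
        paper_id = paper_ids.get(image["new_paper_id"])
        if paper_id is None:
            continue  # inner join: drop images without a matching document
        image_tmp.append({**image, "paper_id": paper_id, "image_id": next_id})
        next_id += 1
    return image_tmp
```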
In the second step, the locally stored second image data is uploaded to the distributed file system to complete the update of the second image data.
Each second image file is named {new_image_id}.png and stored locally under image_local_path. After the image_tmp table is read, the files {image_local_path}/{new_image_id}.png are uploaded in batches to the distributed file system, and the path of each file on the distributed file system is recorded as image_path.
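The path handling in this step can be sketched as follows. The target folder `image_hdfs_dir` and the helper column `local_file` are assumptions, since the text does not name the image folder on the distributed file system. The function only derives the paths; the upload itself would be done with the same kind of shell command as for the full-text data.

```python
def add_image_paths(image_tmp, image_local_path, image_hdfs_dir):
    """Derive the local file and the remote image_path for every image row."""
    rows = []
    for row in image_tmp:
        file_name = f"{row['new_image_id']}.png"
        rows.append({
            **row,
            "local_file": f"{image_local_path}/{file_name}",  # upload source
            "image_path": f"{image_hdfs_dir}/{file_name}",    # stored in image_info
        })
    return rows
```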
In the third step, the required columns of the generated image_tmp table are extracted and inserted into the original image metadata table to complete the update of the second image metadata.
First, an image_path column is added to the extracted image_tmp table; its value is the path of the uploaded file on the distributed file system.
Second, the image metadata table image_info is updated with the direct update method or the zipper table update method described above, which is not repeated here. The image metadata table image_info updated by the direct update method has the same format as the original, as follows:
The format of the image metadata table image_info updated by the zipper table update method is as follows:
It should be understood that, although the steps in the above-described flowcharts are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In an embodiment, as shown in fig. 7, the present invention provides an apparatus for updating offline multi-modal document data, the apparatus comprising:
an extraction module 702, configured to collect full text data of a new document, and extract second document metadata of the full text data of the new document;
a retrieving module 704, configured to retrieve the structured database to determine whether the second document metadata exists in the structured database and whether the new document full-text data exists in the distributed file system, and generate a document update table;
an updating module 706, configured to update the new document full text data and second image data of the new document full text data to the distributed file system and/or update the second document metadata and second image metadata of the second image data to the structured database based on the document update table.
For specific limitations on the updating device of the offline multi-modal document data, reference may be made to the above limitation on the updating method of the offline multi-modal document data, which is not described herein. The above-described respective modules in the offline multi-modal document data updating apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
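As one possible software realization of the three modules, the following sketch wires them together as plain callables. The class name, field names and signatures are hypothetical; the actual extraction, retrieval and update logic would be supplied by the method steps described earlier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class UpdatePipeline:
    """Chains the extraction, retrieving and updating modules."""
    extract: Callable[[Iterable[str]], List[dict]]   # files -> second document metadata
    retrieve: Callable[[List[dict]], List[dict]]     # metadata -> document update table
    update: Callable[[List[dict]], int]              # update table -> number of rows handled

    def run(self, pdf_files: Iterable[str]) -> int:
        metadata = self.extract(pdf_files)
        update_table = self.retrieve(metadata)
        return self.update(update_table)
```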
In one embodiment, the present invention provides a computer device, which may be a server, and whose internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is for storing motion detection data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the steps of any of the above embodiments of the method of updating offline multi-modal document data.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, an embodiment of the present invention provides a computer readable storage medium having a computer program stored thereon, where the computer program when executed by a processor implements the steps of any of the offline multi-modal document data update method embodiments described above.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a non-transitory computer readable storage medium and which, when executed, may perform the steps of the above-described method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples represent only a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (8)

1. An update method of offline multi-modal document data, applied to a distributed storage system, the distributed storage system including a distributed file system for storing document full text data and first image data in the document full text data, a structured database for storing first document metadata of the document full text data and first image metadata of the first image data, the method comprising:
collecting full text data of a new document, and extracting second document metadata of the full text data of the new document; it comprises the following steps: extracting full text metadata of the full text data of the new document; extracting the quotation metadata of the full-text data of the new document; generating a new document metadata table based on the full text metadata and the quotation metadata, the new document metadata table being used for recording the second document metadata;
searching the structured database to judge whether the second literature metadata exists in the structured database, judging whether the new literature full text data exists in the distributed file system and judging whether the second literature metadata is different from the first literature metadata, and generating a literature update table; it comprises the following steps: reading the new literature metadata table and the literature metadata table in the structured database; searching whether the new document metadata exists in the document metadata table or not based on the new document metadata table and the document metadata table, and if the new document metadata does not exist, marking the new document metadata as a first update type; if so, searching whether the full-text data of the new document exists in the distributed file system, and if not, marking as a second update type; if so, judging whether the second document metadata is different from the first document metadata, and marking the second document metadata as a third update type; generating the document update table based on the result of the search;
based on the document update table, updating the new document full text data and second image data of the new document full text data to the distributed file system and/or updating the second document metadata and second image metadata of the second image data to the structured database.
2. The method for updating offline multi-modal document data according to claim 1, wherein if the new document full-text data does not exist, second image data of the new document full-text data and second image metadata of the second image data are extracted.
3. The method of updating off-line multimodal literature data according to claim 2, wherein the updating the new literature full text data and the second image data of the new literature full text data to the distributed file system and/or updating the second literature metadata and the second image metadata of the second image data to the structured database based on the literature update table comprises:
determining an update category based on the document update table;
if the data is of the first updating type, updating the second document metadata into the document metadata table, updating the second image metadata of the second image data into the image metadata table in the structured database, and updating the new document full text data into the distributed file system;
if the data is of a second update type, updating the second document metadata into the document metadata table, updating the second image metadata of the second image data into the image metadata table in the structured database, and updating the new document full text data and the second image data of the new document full text data into the distributed file system;
and if the first document metadata is in the third updating category, correcting the first document metadata in the document metadata table based on the second document metadata.
4. The method for updating offline multi-modal literature data according to claim 3, wherein the updating the second literature metadata into the literature metadata table includes:
and updating the second document metadata into the document metadata table by adopting a direct updating method or a zipper table updating method.
5. The method of updating offline multimodal literature data according to claim 1, further comprising, prior to said generating a new literature metadata table based on said full text metadata and said quote metadata:
labeling the full text metadata and the quotation metadata;
carrying out standardization processing on the full text metadata and the quotation metadata after labeling;
setting a first identifier of the full text metadata and the quotation metadata after the normalization processing, and generating a new document metadata table.
6. An updating device of offline multi-modal document data, applied to a distributed storage system, the distributed storage system comprising a distributed file system for storing document full text data and first image data in the document full text data, a structured database for storing first document metadata of the document full text data and first image metadata of the first image data, the device comprising:
the extraction module is used for collecting the full text data of the new document and extracting the second document metadata of the full text data of the new document; it comprises the following steps: extracting full text metadata of the full text data of the new document; extracting the quotation metadata of the full-text data of the new document; generating a new document metadata table based on the full text metadata and the quotation metadata, the new document metadata table being used for recording the second document metadata;
the searching module is used for searching the structured database to judge whether the second document metadata exists in the structured database, judging whether the new document full-text data exists in the distributed file system and judging whether the second document metadata is different from the first document metadata or not, and generating a document update table; it comprises the following steps: reading the new literature metadata table and the literature metadata table in the structured database; searching whether the new document metadata exists in the document metadata table or not based on the new document metadata table and the document metadata table, and if the new document metadata does not exist, marking the new document metadata as a first update type; if so, searching whether the full-text data of the new document exists in the distributed file system, and if not, marking as a second update type; if so, judging whether the second document metadata is different from the first document metadata, and marking the second document metadata as a third update type; generating the document update table based on the result of the search;
an updating module, configured to update the new document full text data and second image data of the new document full text data to the distributed file system and/or update the second document metadata and second image metadata of the second image data to the structured database based on the document update table.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 5.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 5.
CN202311336095.XA 2023-10-16 2023-10-16 Method, device, equipment and medium for updating offline multi-mode literature data Active CN117076474B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311336095.XA CN117076474B (en) 2023-10-16 2023-10-16 Method, device, equipment and medium for updating offline multi-mode literature data

Publications (2)

Publication Number Publication Date
CN117076474A CN117076474A (en) 2023-11-17
CN117076474B true CN117076474B (en) 2024-03-12

Family

ID=88706427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311336095.XA Active CN117076474B (en) 2023-10-16 2023-10-16 Method, device, equipment and medium for updating offline multi-mode literature data

Country Status (1)

Country Link
CN (1) CN117076474B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant