CN116719783B - Method for extracting metadata specification of office-entering OFD archive file and filling archive copybook - Google Patents

Method for extracting metadata specification of office-entering OFD archive file and filling archive copybook

Info

Publication number
CN116719783B
Authority
CN
China
Prior art keywords
file
ofd
metadata
archive
xml file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310996987.6A
Other languages
Chinese (zh)
Other versions
CN116719783A (en)
Inventor
陆钰童
何冉冉
何中
王斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Zhongwei Technology Software System Co ltd
Original Assignee
Jiangsu Zhongwei Technology Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-08-09
Filing date
2023-08-09
Publication date
2023-10-20
Application filed by Jiangsu Zhongwei Technology Software System Co ltd
Priority to CN202310996987.6A
Publication of CN116719783A
Application granted
Publication of CN116719783B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/16 - File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 - File meta data generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/14 - Details of searching files based on file metadata
    • G06F16/148 - File search processing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 - Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 - Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 - Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 - Querying
    • G06F16/835 - Query processing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for the standardized extraction of metadata from incoming OFD archive files and for filling that metadata into archival description items, comprising the following steps: creating a data acquisition library in which indexes are built, the indexes corresponding one-to-one to the description items of the archive; establishing different recognition models, each of which corresponds to the data acquisition library; parsing the OFD file according to the OFD standard and storing the field metadata extracted from it in an xml file, handled differently depending on whether the OFD file contains an image; obtaining the metadata in that xml file, screening it further, and matching it against the recognition models to obtain the recognized and extracted information; and importing the recognition and extraction results into the data acquisition library and performing archival description according to the different matching results. The method follows the national standard GB/T 33190-2016, extracts and applies OFD metadata within that standard, and supports batch processing when description items are filled in as files enter the archive, which greatly reduces staff workload.

Description

Method for extracting metadata specification of office-entering OFD archive file and filling archive copybook
Technical Field
The invention relates to the technical field of digital archives, and in particular to a method for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items.
Background
With the advancement of the national big data development strategy and the Internet Plus action plan, the concepts, technologies, methods, and modes of archival work have been profoundly affected. Archives are increasingly a basic strategic resource of the nation; the field of archival work is broader, its content richer, and its demands more varied, and its status and role are increasingly important. As the number of archives grows, managing them becomes more complex. Description information must be configured for every file entering the archive. In traditional archival description, staff manually extract the scanned content of the incoming file, add new entry information in the archive system, manually obtain the corresponding description field information from the scanned content, complete the description of the entry, and finally link the entry information to the scanned content according to a linking template. This work consumes substantial human resources, and manual errors can cause description items to be recorded inaccurately.
Disclosure of Invention
The object of the present invention is to provide a method for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items, so as to solve one or more of the problems set forth in the background art.
To achieve the above object, the present invention provides the following technical solution: a method for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items, comprising the following steps:
step S1: creating a data acquisition library in which indexes are built, the indexes corresponding one-to-one to the description items of the archive and including title, responsible party, security classification, retention period, and copy recipient;
step S2: establishing different recognition models according to the OFD official document type, the document first-page style, the key field position range, the keyword position and structural order, and the recognition range offset, each recognition model corresponding to the data acquisition library;
step S3: parsing the OFD file according to the OFD standard, and storing the field metadata extracted from it in an xml file, handled differently depending on whether the OFD file contains an image;
step S4: obtaining the metadata in the xml file of step S3, screening it further, and matching it against the recognition models of step S2 to obtain the recognized and extracted information;
step S5: importing the recognition and extraction results into the data acquisition library created in step S1, and performing archival description according to the different matching results.
Preferably, the indexes in step S1 include title, responsible party, security classification, retention period, and copy recipient; the OFD official document types include orders, decisions, resolutions, instructions, bulletins, announcements, notices, circulars, reports, requests, replies, letters, and meeting minutes.
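As a minimal sketch of step S1, the data acquisition library can be pictured as a single table whose columns are the indexes, with one index per description item. The use of SQLite, the table name `acquisition`, and the English index names are assumptions made for illustration; the patent does not name a storage engine or a schema.

```python
import sqlite3

# Hypothetical mapping: index (column) name -> description item it fills.
# The names are illustrative renderings of the five indexes listed in step S1.
INDEX_TO_ITEM = {
    "title": "Title",
    "responsible_party": "Responsible party",
    "security_classification": "Security classification",
    "retention_period": "Retention period",
    "copy_recipient": "Copy recipient",
}


def create_acquisition_library(conn: sqlite3.Connection) -> None:
    """Create one table whose columns are exactly the indexes."""
    columns = ", ".join(f'"{name}" TEXT' for name in INDEX_TO_ITEM)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS acquisition (id INTEGER PRIMARY KEY, {columns})"
    )


def library_indexes(conn: sqlite3.Connection) -> list:
    """Read the index names back from the table definition."""
    rows = conn.execute("PRAGMA table_info(acquisition)").fetchall()
    return [row[1] for row in rows if row[1] != "id"]


conn = sqlite3.connect(":memory:")
create_acquisition_library(conn)
```

Because each dictionary key maps to exactly one description item, the one-to-one correspondence required by step S1 holds by construction.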
Preferably, in step S3, if the OFD file contains an image, the data in the image is extracted using OCR and the extracted field metadata is stored in an xml file within the OFD file; if the OFD file does not contain an image, the field metadata in the xml file is obtained directly.
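The branch in step S3 can be sketched as follows. The sketch assumes that an OFD file is a ZIP package of xml members, that images appear as elements named `ImageObject`, and that text appears in elements named `TextCode`; these reflect a common reading of GB/T 33190-2016 but should be checked against the standard. The `ocr` callable, the helper names, and the sample package are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace: '{http://...}TextCode' -> 'TextCode'."""
    return tag.rsplit("}", 1)[-1]


def iter_xml_roots(ofd_bytes: bytes):
    """Yield the parsed root of every .xml member of the package."""
    with zipfile.ZipFile(io.BytesIO(ofd_bytes)) as package:
        for name in package.namelist():
            if name.lower().endswith(".xml"):
                yield ET.fromstring(package.read(name))


def contains_image(ofd_bytes: bytes) -> bool:
    """True if any xml member holds an ImageObject element (assumed name)."""
    return any(
        local_name(element.tag) == "ImageObject"
        for root in iter_xml_roots(ofd_bytes)
        for element in root.iter()
    )


def read_text_fields(ofd_bytes: bytes) -> list:
    """Collect the text of every TextCode element (assumed name)."""
    return [
        element.text or ""
        for root in iter_xml_roots(ofd_bytes)
        for element in root.iter()
        if local_name(element.tag) == "TextCode"
    ]


def extract_field_metadata(ofd_bytes: bytes, ocr=None):
    """Return (source, fields): OCR for image-bearing files, xml text otherwise."""
    if contains_image(ofd_bytes):
        if ocr is None:
            raise ValueError("an OCR callable is required for image-bearing files")
        return "ocr", ocr(ofd_bytes)
    return "xml", read_text_fields(ofd_bytes)


def build_package(page_xml: str) -> bytes:
    """Build a tiny in-memory package, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("Doc_0/Pages/Page_0/Content.xml", page_xml)
    return buffer.getvalue()


NS = 'xmlns:ofd="http://www.ofdspec.org/2016"'  # namespace URI is an assumption
text_page = (
    f"<ofd:Page {NS}><ofd:TextObject>"
    '<ofd:TextCode X="0" Y="5">Request for instructions</ofd:TextCode>'
    "</ofd:TextObject></ofd:Page>"
)
image_page = f'<ofd:Page {NS}><ofd:ImageObject ResourceID="1"/></ofd:Page>'
```

Passing the OCR engine in as a callable keeps the sketch independent of any particular OCR library, which the patent does not specify.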
Preferably, the specific steps for screening and matching the metadata in the xml file in step S4 are as follows:
step S41: matching the recognition models against the field metadata in the xml file to obtain the document type information in the xml file;
step S42: finding the corresponding recognition model according to the document type information, and obtaining the keywords in that model;
step S43: continuing to traverse the field metadata in the xml file, and matching it against the recognition model obtained in step S42;
step S44: if a keyword is matched, obtaining the content of that keyword in the xml file, recalculating the keyword's offset in the recognition model according to the content of the previous keyword, resetting the parameter P describing the keyword's starting position and orientation, and then determining whether the obtained keyword content lies within the position range P; if it does, matching continues;
step S45: repeating steps S41 to S44 until all field metadata extracted from the OFD file has been screened, yielding the final extracted information.
Preferably, if the keyword content obtained in step S44 does not lie within the position range P, that field's metadata is removed from the xml file.
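The first two sub-steps (S41 and S42) amount to picking the recognition model whose type markers occur in the extracted text and then reading that model's keyword list. The following is a minimal sketch under that reading; the `RecognitionModel` class, the marker strings, and the keyword names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionModel:
    doc_type: str        # official document type the model recognizes
    type_markers: tuple  # strings whose presence signals that type (assumed)
    keywords: tuple      # keywords to look for afterwards, in order


# Two illustrative models; a real deployment would hold one per document type.
MODELS = (
    RecognitionModel(
        "request",
        ("Request for instructions",),
        ("Item number", "Sender", "Subject terms"),
    ),
    RecognitionModel("notice", ("Notice",), ("Item number", "Sender")),
)


def identify_model(fields, models=MODELS):
    """S41/S42: return the first model whose marker occurs in any field text."""
    for model in models:
        if any(marker in text for text in fields for marker in model.type_markers):
            return model
    return None


fields = ["Request for instructions on archive funding", "Item number: 12"]
model = identify_model(fields)
```

Substring matching is the simplest possible criterion; the patent also lists first-page style and position ranges as model inputs, which this sketch leaves out.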
Preferably, the specific operations of archival description in step S5 are as follows:
step S51: parsing the finally obtained OFD file to obtain an xml file of the recorded field data;
step S52: importing the metadata of the xml file into the data acquisition library, matching each field of each record in the xml file against each index in the data acquisition library, and determining whether the field and the index are identical;
step S53: handling the entry of description items in different ways according to the degree of match between fields and indexes;
step S54: after the description items have been matched, automatically pushing the data in the acquisition library to the description items in the archive system to complete the description work.
Preferably, in step S53, if a field and an index are identical, they are mapped to each other directly; if they differ, field calibration is selected manually and fields with similar semantics are mapped by hand; if an index exists in the data acquisition library but not in the xml of the OFD file, the data is entered manually or left unprocessed.
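The three cases of step S53 can be sketched as a small classification over field names. The function names, the sample record, and the manual mapping are hypothetical; the manual calibration step is modelled here as a dictionary that an operator would supply.

```python
def classify_fields(record: dict, indexes: list):
    """Split a record into the three cases of step S53."""
    index_set = set(indexes)
    # Case 1: field name identical to an index, mapped directly.
    identical = {name: value for name, value in record.items() if name in index_set}
    # Case 2: field with no identical index, needs manual calibration.
    unmatched_fields = [name for name in record if name not in index_set]
    # Case 3: index with no field in the xml, entered manually or left empty.
    unfilled_indexes = [index for index in indexes if index not in record]
    return identical, unmatched_fields, unfilled_indexes


def apply_manual_mapping(record: dict, indexes: list, mapping: dict) -> dict:
    """Build one library row; `mapping` is the operator's field -> index choice."""
    identical, unmatched_fields, unfilled_indexes = classify_fields(record, indexes)
    row = dict(identical)
    for field_name in unmatched_fields:
        target = mapping.get(field_name)
        if target in unfilled_indexes:
            row[target] = record[field_name]
    for index in indexes:
        row.setdefault(index, None)  # left unprocessed
    return row


INDEXES = [
    "title",
    "responsible_party",
    "security_classification",
    "retention_period",
    "copy_recipient",
]
record = {
    "title": "Request on funding",
    "author": "Finance Office",
    "retention_period": "30 years",
}
row = apply_manual_mapping(record, INDEXES, {"author": "responsible_party"})
```

Here `author` has no identical index, so it only reaches `responsible_party` through the operator's mapping, while the two indexes with no source field stay empty.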
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention follows the national standard GB/T 33190-2016 and extracts and applies OFD metadata within that standard;
(2) The invention changes the conventional order of archival description: key fields of the scanned document are extracted according to a preset model to obtain the field content the archive requires, archive entries are then generated automatically, and finally the extracted field content is filled automatically into the description items of the entry information. This assists digitization staff in completing the initial description of the archive and improves efficiency;
(3) When archive files enter the archive and description items are filled in, the filling is done in a standardized way and can also be processed in batches, which greatly reduces staff workload.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 shows an example of the recognition model for the request document type in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
Referring to FIG. 1, the present invention provides a method for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items, comprising the following steps:
S1: creating a data acquisition library in which indexes are built, the indexes corresponding one-to-one to the description items of the archive and including title, responsible party, security classification, retention period, copy recipient, and the like;
S2: establishing different recognition models according to the OFD official document type, the document first-page style, the key field position range, the keyword position and structural order, and the recognition range offset, each recognition model corresponding to the data acquisition library; the OFD official document types comprise orders, decisions, resolutions, instructions, notices, announcements, reports, requests, replies, letters, and meeting minutes;
S3: extracting the field metadata in the OFD file according to the type of the OFD file and whether it contains an image; if the OFD file contains an image, the data in the image is extracted using OCR and the extracted field metadata is stored in an xml file within the OFD file; if the OFD file does not contain an image, the field metadata in the xml file is obtained directly;
S4: obtaining the metadata in the xml file of step S3, screening it further, and matching it against the recognition models of step S2 to obtain the recognized and extracted information. Taking an OFD file of the request document type as an example, as shown in FIG. 2, to identify the document type and obtain the file's field metadata, the specific steps of screening and matching the metadata in the xml file are as follows:
S41: matching the field metadata representing the document type in the xml file against the plurality of recognition models; when the metadata in the xml file matches one of the recognition models, the document type information in the xml file is obtained according to that model, here identifying the file as a request-type document, whose recognition model is shown in FIG. 2;
S42: finding the corresponding recognition model according to the document type information and obtaining the keywords in that model. When the document is of the request type, its recognition model defines the parameters K, P, T, and O, where K denotes the keywords (for the request type these include the item number, the sender, subject terms, and others), P {(x1, y1), (x2, y2)} denotes the starting position range in which a keyword appears, T denotes the offset (the content length of the keyword), and O denotes the order in which the keyword appears in the OFD file;
S43: continuing to traverse the field metadata in the xml file and matching it against the recognition model of the request document type, recognizing the fields in sequence from top to bottom according to that model;
S44: if a keyword is matched, obtaining the content of that keyword in the xml file, recalculating the keyword's offsets f1, f2, f3, and f4 in the recognition model according to the content of the previous keyword, and resetting the parameter describing the keyword's starting position and orientation to P1 {(x1+f1, y1+f2), (x2+f3, y2+f4)}; then determining whether the obtained keyword content lies within the position range P1, and if it does, continuing to match;
S45: repeating steps S41 to S44 until all field metadata extracted from the OFD file has been screened, yielding the final extracted information; if the obtained keyword content does not lie within the position range P1, that field's metadata is removed from the xml file, which ensures that the archive is filled in a standardized way;
S5: importing the recognition and extraction results into the data acquisition library created in step S1, and performing archival description according to the different matching results, the specific operations being as follows:
S51: parsing the finally obtained OFD file to obtain an xml file of the recorded field data;
S52: importing the metadata of the xml file into the data acquisition library, matching each field of each record in the xml file against each index in the data acquisition library, and determining whether the field and the index are identical;
S53: handling the entry of description items in different ways according to the degree of match between fields and indexes: if a field and an index are identical, they are mapped to each other directly; if they differ, field calibration is selected manually and fields with similar semantics are mapped by hand; if an index exists in the data acquisition library but not in the xml of the OFD file, the data is entered manually or left unprocessed;
S54: after the description items have been matched, automatically pushing the data in the acquisition library to the description items in the archive system to complete the description work.
The method first recognizes the OFD files entering the archive, then extracts the recognition results and matches them to the description items; once matching is complete, the data in the data acquisition library is pushed automatically to the description items in the archive system, which greatly reduces staff workload.
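The position check of steps S42, S44, and S45 can be illustrated with a small worked example. The classes below model the parameters K, P, T, and O and the shifted range P1 {(x1+f1, y1+f2), (x2+f3, y2+f4)}. The patent does not state how f1 to f4 are derived from the previous keyword's content, so this sketch takes them as caller-supplied values; the class names, coordinates, and sample fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Position range {(x1, y1), (x2, y2)}; assumes x1 <= x2 and y1 <= y2."""
    x1: float
    y1: float
    x2: float
    y2: float

    def shifted(self, f1, f2, f3, f4) -> "Range":
        """P1 = {(x1 + f1, y1 + f2), (x2 + f3, y2 + f4)}."""
        return Range(self.x1 + f1, self.y1 + f2, self.x2 + f3, self.y2 + f4)

    def contains(self, x, y) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


@dataclass(frozen=True)
class Keyword:
    name: str     # K: the keyword
    start: Range  # P: starting position range
    length: int   # T: offset (content length of the keyword)
    order: int    # O: order of appearance in the OFD file


def screen_fields(fields, keywords, offsets):
    """Keep a field only if it lies inside its keyword's shifted range P1."""
    kept = []
    for keyword in sorted(keywords, key=lambda k: k.order):
        p1 = keyword.start.shifted(*offsets.get(keyword.name, (0, 0, 0, 0)))
        for field in fields:
            if field["keyword"] == keyword.name and p1.contains(field["x"], field["y"]):
                kept.append(field)
    return kept


sender = Keyword("Sender", Range(10, 20, 90, 30), length=20, order=1)
fields = [
    {"keyword": "Sender", "x": 15, "y": 32, "content": "Finance Office"},
    {"keyword": "Sender", "x": 15, "y": 50, "content": "stray footer text"},
]
kept = screen_fields(fields, [sender], {"Sender": (0, 8, 0, 8)})
```

With offsets (0, 8, 0, 8), P {(10, 20), (90, 30)} becomes P1 {(10, 28), (90, 38)}. The field at (15, 32) falls inside P1 and is kept, while the field at (15, 50) falls outside and is removed, as step S45 requires.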
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (5)

1. A method for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items, characterized by comprising the following steps:
step S1: creating a data acquisition library in which indexes are established, the indexes comprising title, responsible party, security classification, retention period, and copy recipient, and corresponding one-to-one to the description items of the archive;
step S2: establishing different recognition models according to the OFD official document type, the document first-page style, the key field position range, the keyword position and structural order, and the recognition range offset, each recognition model corresponding to the data acquisition library;
step S3: parsing the OFD file according to the OFD standard, and storing the field metadata extracted from it in an xml file, handled differently depending on whether the OFD file contains an image;
step S4: obtaining the metadata in the xml file of step S3, screening it further, and matching it against the recognition models of step S2 to obtain the recognized and extracted information, wherein the specific steps of screening and matching the metadata in the xml file are as follows: matching the recognition models against the field metadata in the xml file to obtain the document type information in the xml file; finding the corresponding recognition model according to the document type information, and obtaining the keywords in that model; continuing to traverse the field metadata in the xml file, and matching it against the recognition model so obtained; if a keyword is matched, obtaining the content of that keyword in the xml file, recalculating the keyword's offset in the recognition model according to the content of the previous keyword, resetting the parameter P describing the keyword's starting position and orientation, and determining whether the obtained keyword content lies within the position range P; if it does, continuing to match, and if it does not, removing that field's metadata from the xml file; repeating all of the above steps until all field metadata extracted from the OFD file has been screened, yielding the final extracted information;
step S5: importing the recognition and extraction results into the data acquisition library created in step S1, and performing archival description according to the different matching results.
2. The method according to claim 1 for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items, wherein: the OFD official document types comprise orders, decisions, resolutions, instructions, bulletins, announcements, notices, reports, requests, replies, letters, and meeting minutes.
3. The method according to claim 1 for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items, wherein: in step S3, if the OFD file contains an image, the data in the image is extracted using OCR and the extracted field metadata is stored in an xml file within the OFD file; if the OFD file does not contain an image, the field metadata in the xml file is obtained directly.
4. The method according to claim 1 for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items, wherein the specific operations of archival description in step S5 are as follows:
step S51: parsing the finally obtained OFD file to obtain an xml file of the recorded field data;
step S52: importing the metadata in the xml file into the data acquisition library, matching each field of each record in the xml file against each index in the data acquisition library, and determining whether the field and the index are identical;
step S53: handling the entry of description items in different ways according to the degree of match between fields and indexes;
step S54: after the description items have been matched, automatically pushing the data in the acquisition library to the description items in the archive system to complete the description work.
5. The method according to claim 4 for the standardized extraction of metadata from incoming OFD archive files and for filling it into archival description items, wherein: in step S53, if a field and an index are identical, they are mapped to each other directly; if they differ, field calibration is selected manually and fields with similar semantics are mapped by hand; if an index exists in the data acquisition library but not in the xml of the OFD file, the data is entered manually or left unprocessed.
CN202310996987.6A 2023-08-09 2023-08-09 Method for extracting metadata specification of office-entering OFD archive file and filling archive copybook Active CN116719783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310996987.6A CN116719783B (en) 2023-08-09 2023-08-09 Method for extracting metadata specification of office-entering OFD archive file and filling archive copybook

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310996987.6A CN116719783B (en) 2023-08-09 2023-08-09 Method for extracting metadata specification of office-entering OFD archive file and filling archive copybook

Publications (2)

Publication Number Publication Date
CN116719783A CN116719783A (en) 2023-09-08
CN116719783B true CN116719783B (en) 2023-10-20

Family

ID=87864705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310996987.6A Active CN116719783B (en) 2023-08-09 2023-08-09 Method for extracting metadata specification of office-entering OFD archive file and filling archive copybook

Country Status (1)

Country Link
CN (1) CN116719783B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180366013A1 (en) * 2014-08-28 2018-12-20 Ideaphora India Private Limited System and method for providing an interactive visual learning environment for creation, presentation, sharing, organizing and analysis of knowledge on subject matter
CN112364223A (en) * 2020-10-21 2021-02-12 贵州电网有限责任公司 Digital archive system
CN115994230A (en) * 2022-12-29 2023-04-21 南京烽火星空通信发展有限公司 Intelligent archive construction method integrating artificial intelligence and knowledge graph technology

Also Published As

Publication number Publication date
CN116719783A (en) 2023-09-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant