CN114756617A - Method, system, equipment and storage medium for extracting structured data of engineering archives - Google Patents

Method, system, equipment and storage medium for extracting structured data of engineering archives

Info

Publication number
CN114756617A
Authority
CN
China
Prior art keywords
data
text
engineering
extracting
archive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210455488.1A
Other languages
Chinese (zh)
Inventor
邹永增
魏宏俊
翁非
张望华
黄云飞
林衍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Fujian Electric Power Co Ltd
Original Assignee
State Grid Fujian Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Fujian Electric Power Co Ltd
Priority to CN202210455488.1A
Publication of CN114756617A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for extracting structured data of an engineering archive, which comprises the following steps: constructing an engineering archive rule base, wherein the engineering archive rule base comprises a plurality of rule element attributes; pre-training a text extraction model by collecting structured data of professional vocabulary from historical engineering archives as original data and performing iterative training with unsupervised learning to obtain the text extraction model; acquiring input data from an engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary; performing feature association and data cleaning on the text vocabulary to obtain text metadata; performing character matching on the text metadata to obtain a plurality of character attributes in the text metadata; matching the character attributes against the engineering archive rule base and determining the rule element attributes that match the text metadata, so as to determine the entities associated with the text metadata; and generating structured data according to the entities associated with the text metadata.

Description

Method, system, equipment and storage medium for extracting structured data of engineering archives
Technical Field
The invention relates to a method, a system, equipment and a storage medium for extracting structured data of an engineering archive, belonging to the technical field of digital processing of the engineering archive.
Background
With the rise and spread of artificial intelligence, repetitive work has been greatly reduced: models trained by deep learning can take over routine repetitive tasks, which lowers the cost of manual input and raises the value of the work. Artificial intelligence is now widely applied. Archive work generally requires preserving important historical data, and the way this data is stored has changed with the technology of each era, from the original paper files to today's digital data; a large amount of it must be analyzed and recorded manually. Engineering archives are one such type: they contain important data from the engineering construction process and are an important archival resource.
Engineering archive files are the basis of project completion, proof of project quality, and reliable material for scientific research, infrastructure construction and similar work, and they have important value for reuse, for reference and for promoting innovation. In engineering archive management, the archives are usually kept as signed paper documents or electronic scans, and they typically reach the database in unstructured form through engineering handover signatures, backup scans of engineering documents and electronic editions of the archives. As the data volume grows day by day, the archives cannot be organized or retrieved well, and the data generally has to be entered by hand, at a large labor cost. As artificial intelligence technology matures, using it to automatically record the unstructured data of engineering archives has become a feasible solution.
In the prior art, for example, the invention patent with application number "202110289665.9" discloses a method for the assisted generation of official documents, in the technical field of natural language generation. The method uses the large storage capacity, fast processing and convenient human-computer interaction of a computer to build a corpus-based computer-aided writing system, which interactively recommends to the user, in real time, sentence patterns and example sentences drawn from a real corpus and thereby assists the core act of composing sentences. According to that patent, this fills the technical gap in corpus-based computer-aided writing systems and, as a one-stop intelligent writing aid for official documents, solves the prior-art problems of inaccurate information, low efficiency and writing assistance that does not fully meet writing needs. However, this prior art is only a semi-automatic writing aid: it recommends sentence patterns and example sentences that the author then revises, so refining the wording still requires considerable manual effort.
As another example, the invention patent with application number "201811548852.9" discloses an automatic document writing method. According to a target title input by the user, the method matches title templates with the same article genre and core content from an article template library, parses the target title according to the core content to generate a target title template, obtains candidate title templates similar to the target title template, evaluates the candidates using the target title template and information such as the candidates' keywords, selects the article template corresponding to the highest-rated candidate title template as the target article template, and finally generates an article from the target title and the target article template. However, this prior art requires a large number of template libraries to be prepared manually in advance, and templates are selected only by simple keyword matching, so cases in which the selected template does not fit are hard to avoid.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for extracting the structured data of the engineering archive, which constructs an engineering archive rule base, combines text extraction and feature association on the basis of the engineering archive rule base, thereby effectively associating the required text entities in the engineering file, and then performs entity association storage through the engineering archive management rules, thereby realizing the extraction work of the unstructured data of the engineering archive and greatly improving the digitization efficiency of the archive.
The technical scheme of the invention is as follows:
in one aspect, the invention provides a method for extracting structured data of an engineering archive, which comprises the following steps:
constructing an engineering archive rule base according to a historical engineering archive and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;
pre-training a text extraction model: collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by using a data mining technology to form a reference model, performing unsupervised learning by using the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules to obtain the text extraction model;
acquiring input data from an engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary;
performing feature association and data cleaning processing on the extracted text vocabulary to obtain text metadata;
performing character matching on the text metadata to obtain a plurality of character attributes in the text metadata; matching the character attributes against the engineering archive rule base, and determining the rule element attributes that match the text metadata, so as to determine the entities associated with the text metadata; and generating structured data according to the entities associated with the text metadata.
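The pre-training step above is described only in outline, so the following is a minimal sketch of one possible reading, not the patented procedure: seed terms stand in for the mined rules, and each round labels tokens that lie outside the current rules but co-occur with known terms often enough. The function name `bootstrap_vocabulary`, the co-occurrence criterion and the parameters `min_count` and `max_rounds` are all assumptions made for illustration.

```python
from collections import Counter


def bootstrap_vocabulary(seed_terms, documents, min_count=2, max_rounds=5):
    """Iteratively grow a set of professional terms from tokenized documents.

    seed_terms stands in for the rules mined from historical archives
    (assumption); documents is a list of token lists.
    """
    known = set(seed_terms)
    for _ in range(max_rounds):
        counts = Counter()
        for tokens in documents:
            # Only documents that already contain a known term vote.
            if known.intersection(tokens):
                counts.update(t for t in set(tokens) if t not in known)
        # Label tokens outside the current rules that co-occur often enough.
        new_terms = {t for t, n in counts.items() if n >= min_count}
        if not new_terms:
            break  # nothing new was labeled, so the iteration has converged
        known |= new_terms
    return known


docs = [
    ["transformer", "substation"],
    ["transformer", "substation"],
    ["substation", "cable"],
    ["cable", "substation"],
    ["lunch", "menu"],
]
terms = bootstrap_vocabulary({"transformer"}, docs)
```

In this toy run "substation" is labeled in the first round and "cable" in the second, while the unrelated tokens are never added; a real system would replace the co-occurrence count with whatever model the reference model actually is.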
As a preferred embodiment, the method for performing feature association on the extracted text vocabulary specifically includes:
acquiring word vectors of text vocabularies;
calculating the cross entropy of the word vector;
inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;
and setting a softmax function behind an output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.
As a preferred embodiment, the method for performing data cleaning processing on the extracted text vocabulary specifically includes:
filtering out data in the text vocabulary that does not conform to the rules by means of regular expressions.
As a preferred embodiment, the method for acquiring the input data from the project file specifically comprises:
OCR recognition is carried out on the engineering file, and image data of the engineering file are obtained and used as input data;
the method for preprocessing the input data specifically comprises the following steps:
and performing Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image marginalization and Gaussian noise processing on the image data in sequence.
In another aspect, the present invention provides a system for generating structured data of an engineering archive, including:
the engineering archive rule base construction module is used for constructing an engineering archive rule base according to historical engineering archives and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;
the text extraction model training module is used for collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by using a data mining technology to form a reference model, performing unsupervised learning by using the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules to obtain a text extraction model;
the input module is used for acquiring input data from the engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model and extracting text vocabulary;
the association module is used for performing characteristic association on the extracted text vocabulary;
the cleaning module is used for performing data cleaning processing on the extracted text vocabulary to acquire text metadata;
the matching module is used for performing character matching on the text metadata to obtain a plurality of character attributes in the text metadata, matching the character attributes against the engineering archive rule base, and determining the rule element attributes that match the text metadata, so as to determine the entities associated with the text metadata;
and the data generation module is used for generating structured data according to the entity associated with the text metadata.
As a preferred embodiment, the method for performing feature association on the extracted text vocabulary by the association module specifically includes:
acquiring word vectors of text vocabularies;
calculating the cross entropy of the word vector;
inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;
and setting a softmax function behind an output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.
As a preferred embodiment, the method for performing data cleaning processing on the extracted text vocabulary by the cleaning module specifically includes:
filtering out data in the text vocabulary that does not conform to the rules by means of regular expressions.
As a preferred embodiment, the method for the input module to obtain the input data from the project file specifically comprises:
performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;
the method for preprocessing the input data by the input module specifically comprises the following steps:
and sequentially carrying out Gaussian filtering, mean value blurring, tone adjustment, contrast enhancement, image marginalization and Gaussian noise processing on the image data.
In another aspect, the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for extracting engineering archive structured data according to any embodiment of the present invention is implemented.
In still another aspect, a computer-readable storage medium stores thereon a computer program, which when executed by a processor implements the method for extracting structured data of an engineering archive according to any embodiment of the present invention.
The invention has the following beneficial effects:
The invention relates to a method for extracting structured data of an engineering archive, which constructs an engineering archive rule base and combines text extraction and feature association on the basis of that rule base, thereby effectively associating the required text entities from engineering files; entity-associated storage is then performed through the engineering archive management rules, which realizes the extraction of the unstructured data of the engineering archive and greatly improves the digitization efficiency of the archive.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of associating features with a text vocabulary according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
It should be understood that the step numbers used herein are only for convenience of description and are not used as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
Embodiment one:
referring to fig. 1, a method for extracting structured data of an engineering archive includes the following steps:
constructing an engineering archive rule base according to historical engineering archives and an engineering archive management method. In this embodiment the engineering archives are specifically the engineering project archives of the national power grid, so historical power grid engineering archives are collected, the engineering archive rule base is constructed according to the power grid engineering project management method and the power grid management method, structured data is extracted from a large amount of historical power grid engineering archive data, and a large number of rule element attributes are generated through manual verification and multiple rounds of iteration;
pre-training a text extraction model: collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by using a data mining technology to form a reference model, performing unsupervised learning by using the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules to obtain the text extraction model;
acquiring input data from an engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, extracting text vocabulary, and then retrieving the extracted text vocabulary in a common message format;
performing feature association and data cleaning processing on the extracted text vocabulary to obtain text metadata;
performing character matching on the text metadata, dividing the text metadata into a plurality of character attributes; matching the character attributes against the engineering archive rule base, and determining, according to the characteristics of the rule element attributes, the rule element attributes that match the text metadata, so as to determine the entities associated with the text metadata; and generating structured data according to the entities associated with the text metadata.
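The final matching step above can be pictured with a small sketch. It assumes, purely for illustration, that each rule element attribute is stored as a regular expression paired with an entity name; the `RuleElement` class, the three sample rules and the contract-number format are hypothetical and do not come from the source.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleElement:
    entity: str   # hypothetical entity name the rule maps to
    pattern: str  # regex describing the character attributes the rule expects


# Hypothetical rule base; a real one would be built from historical archives.
RULE_BASE = [
    RuleElement("voltage_level", r"\d+(?:\.\d+)?\s*kV"),
    RuleElement("document_date", r"\d{4}-\d{2}-\d{2}"),
    RuleElement("contract_number", r"HT-\d{6}"),
]


def match_entities(text_metadata, rule_base=RULE_BASE):
    """Map each metadata item to the first rule whose pattern it fully matches."""
    record = {}
    for item in text_metadata:
        for rule in rule_base:
            if re.fullmatch(rule.pattern, item):
                record.setdefault(rule.entity, []).append(item)
                break  # one entity per item in this sketch
    return record


structured = match_entities(["110kV", "2022-04-27", "HT-000123", "unknown"])
```

Items that match no rule, such as "unknown" here, are simply left out of the structured record; whether a real system should drop, log or queue them for manual review is a design choice the source does not address.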
Referring specifically to FIG. 2, as a preferred implementation of this embodiment, the method for performing feature association on the extracted text vocabulary specifically includes:
acquiring word vectors of text vocabularies;
calculating the cross entropy of the word vector;
inputting the cross entropy of the word vector into an input layer of a bidirectional LSTM neural network for feature extraction;
and setting a softmax function behind an output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.
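The bidirectional pass and the splicing step above can be sketched in plain Python. This is a toy under stated assumptions, not the trained network of the embodiment: the weights are random and untrained, biases are omitted, softmax is applied to each direction's output before the two are spliced (the source does not fix that order), and the cross-entropy step is left out because the source does not say what reference distribution it is computed against. The names `LSTMCell`, `run_direction` and `context_features` are illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class LSTMCell:
    """Single LSTM cell with random, untrained weights and no biases."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_size, hidden_size, rng):
        width = input_size + hidden_size
        self.hidden_size = hidden_size
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_size)]
            for gate in self.GATES
        }

    def _gate(self, name, joined, activation):
        return [activation(sum(w * v for w, v in zip(row, joined)))
                for row in self.weights[name]]

    def step(self, x, h, c):
        joined = list(x) + list(h)  # current input followed by previous hidden state
        i = self._gate("input", joined, sigmoid)
        f = self._gate("forget", joined, sigmoid)
        o = self._gate("output", joined, sigmoid)
        g = self._gate("candidate", joined, math.tanh)
        c_next = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_next = [oj * math.tanh(cj) for oj, cj in zip(o, c_next)]
        return h_next, c_next


def run_direction(cell, sequence):
    """Run the cell over the sequence and collect the hidden state at each step."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    states = []
    for x in sequence:
        h, c = cell.step(x, h, c)
        states.append(h)
    return states


def context_features(word_vectors, hidden_size=3, seed=0):
    rng = random.Random(seed)
    size = len(word_vectors[0])
    forward = LSTMCell(size, hidden_size, rng)
    backward = LSTMCell(size, hidden_size, rng)
    fwd = run_direction(forward, word_vectors)
    # The backward cell reads the sequence reversed; re-reverse to realign.
    bwd = run_direction(backward, word_vectors[::-1])[::-1]
    # Softmax each direction, then splice (concatenate) the two halves.
    return [softmax(f) + softmax(b) for f, b in zip(fwd, bwd)]


vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
features = context_features(vectors)
```

Each word ends up with a vector of length `2 * hidden_size` whose first half reflects the words before it and whose second half reflects the words after it, which is what makes the result a context feature.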
As a preferred implementation of this embodiment, the method for performing data cleansing processing on the extracted text vocabulary specifically includes:
filtering out data in the text vocabulary that does not conform to the rules by means of regular expressions; using regular expressions keeps the professional terminology in the data accurate and, on the basis of accurate recognition, removes dirty data (stray characters, garbled text, symbols and the like).
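A minimal sketch of such regex-based cleaning follows. The source does not give the actual rule, so the pattern is an assumption: a token is kept if it starts with a CJK character, a Latin letter or a digit and otherwise contains only those characters plus `.`, `/` and `-`.

```python
import re

# Hypothetical validity rule (assumption): CJK, Latin letters and digits,
# with '.', '/' and '-' allowed after the first character.
VALID_TOKEN = re.compile(r"[\u4e00-\u9fffA-Za-z0-9][\u4e00-\u9fffA-Za-z0-9./\-]*")


def clean_vocabulary(tokens):
    """Drop empty tokens and tokens that do not fully match the validity rule."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if token and VALID_TOKEN.fullmatch(token):
            cleaned.append(token)
    return cleaned


raw = ["110kV", "变电站", "@@##", "  ", "2022-04-27", "\ufffd\ufffd"]
kept = clean_vocabulary(raw)
```

Here the symbol run, the whitespace-only token and the pair of replacement characters (a typical trace of mis-decoded text) are removed, while the voltage rating, the Chinese term and the date survive.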
As a preferred implementation of this embodiment, the method for acquiring the input data from the project file specifically includes:
performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;
the method for preprocessing the input data specifically comprises the following steps:
and performing Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image marginalization and Gaussian noise processing on the image data in sequence.
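Part of this preprocessing chain can be sketched on a grayscale image held as a nested list of numbers. The sketch covers only three of the six listed operations (Gaussian filtering, mean blurring and contrast enhancement by linear stretching); the 3x3 kernels, the edge-replication border handling and the function names are assumptions, and a real pipeline would more likely use an image-processing library.

```python
GAUSSIAN_3X3 = [
    [1 / 16, 2 / 16, 1 / 16],
    [2 / 16, 4 / 16, 2 / 16],
    [1 / 16, 2 / 16, 1 / 16],
]
MEAN_3X3 = [[1 / 9] * 3 for _ in range(3)]


def convolve3x3(img, kernel):
    """Apply a 3x3 kernel to a grayscale image, replicating edge pixels."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    # Clamp neighbour coordinates to stay inside the image.
                    yy = min(max(y + ky - 1, 0), height - 1)
                    xx = min(max(x + kx - 1, 0), width - 1)
                    acc += img[yy][xx] * kernel[ky][kx]
            out[y][x] = acc
    return out


def stretch_contrast(img):
    """Linearly rescale pixel values to the range 0..255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [[0.0 for _ in row] for row in img]  # flat image: nothing to stretch
    return [[(v - lo) * 255.0 / (hi - lo) for v in row] for row in img]


def preprocess(img):
    img = convolve3x3(img, GAUSSIAN_3X3)  # Gaussian filtering
    img = convolve3x3(img, MEAN_3X3)      # mean blurring
    return stretch_contrast(img)          # contrast enhancement


# A 5x5 black image with one bright pixel in the centre.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 255.0
result = preprocess(image)
```

The two smoothing passes spread the single bright pixel over its neighbourhood, and the stretch then restores the full intensity range, with the centre remaining the brightest pixel.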
Embodiment two:
The embodiment provides a system for generating structured data of an engineering archive, comprising:
the engineering archive rule base construction module is used for constructing an engineering archive rule base according to historical engineering archives and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;
the text extraction model training module is used for collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by using a data mining technology to form a reference model, performing unsupervised learning by using the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules to obtain a text extraction model;
the input module is used for acquiring input data from the engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model and extracting text vocabulary;
the association module is used for performing characteristic association on the extracted text vocabulary;
the cleaning module is used for performing data cleaning processing on the extracted text vocabulary to acquire text metadata;
the matching module is used for performing character matching on the text metadata to obtain a plurality of character attributes in the text metadata, matching the character attributes against the engineering archive rule base, and determining the rule element attributes that match the text metadata, so as to determine the entities associated with the text metadata;
and the data generation module is used for generating the structured data according to the entities associated with the text metadata.
As a preferred embodiment, the method for performing feature association on the extracted text vocabulary by the association module specifically includes:
acquiring word vectors of text vocabularies;
calculating the cross entropy of the word vector;
inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;
and setting a softmax function behind an output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.
As a preferred embodiment, the method for performing data cleaning processing on the extracted text vocabulary by the cleaning module specifically includes:
filtering out data in the text vocabulary that does not conform to the rules by means of regular expressions.
As a preferred embodiment, the method for acquiring the input data from the engineering file by the input module specifically comprises the following steps:
OCR recognition is carried out on the engineering file, and image data of the engineering file are obtained and used as input data;
the method for preprocessing the input data by the input module specifically comprises the following steps:
and performing Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image marginalization and Gaussian noise processing on the image data in sequence.
Embodiment three:
the invention provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the method for extracting the structured data of the engineering file according to any embodiment of the invention.
Embodiment four:
a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a method for extracting structured data of an engineering archive according to any one of the embodiments of the present invention.
In the embodiments of the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, and means that there may be three relationships, for example, a and/or B, and may mean that a exists alone, a and B exist simultaneously, and B exists alone. Wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" and similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, and c may represent: a, b, c, a and b, a and c, b and c or a and b and c, wherein a, b and c can be single or multiple.
Those of ordinary skill in the art will appreciate that the various elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of electronic hardware and computer software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, any function, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other media capable of storing program codes.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for extracting structured data of engineering archives is characterized by comprising the following steps:
constructing an engineering archive rule base according to a historical engineering archive and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;
pre-training a text extraction model: collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by using a data mining technique to form a reference model, performing unsupervised learning by using the extracted rules, and iteratively training the reference model by labeling data not covered by the rules, to obtain the text extraction model;
acquiring input data from the engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary;
performing characteristic association and data cleaning processing on the extracted text vocabulary to obtain text metadata;
performing character matching on the text metadata to obtain a plurality of character attributes in the text metadata; matching the character attributes against the engineering archive rule base, and determining the rule element attributes matched with the text metadata so as to determine the entities associated with the text metadata; and generating the structured data according to the entities associated with the text metadata.
2. The method for extracting the structured data of the engineering archive according to claim 1, wherein the method for performing the feature association on the extracted text vocabulary specifically comprises:
acquiring word vectors of text vocabularies;
calculating the cross entropy of the word vector;
inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;
and placing a softmax function after the output layer of the bidirectional LSTM neural network, and performing feature concatenation to obtain a context feature vector of the word vector.
3. The method for extracting the structured data of the engineering archive according to claim 1, wherein the method for performing data cleaning processing on the extracted text vocabulary specifically comprises:
and filtering out, by means of regular expressions, data in the text vocabulary that does not conform to the rules.
4. The method for extracting the structured data of the engineering archive according to claim 1, wherein the method for acquiring the input data from the engineering file specifically comprises:
performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;
the method for preprocessing the input data specifically comprises the following steps:
and sequentially carrying out Gaussian filtering, mean value blurring, tone adjustment, contrast enhancement, image marginalization and Gaussian noise processing on the image data.
5. An engineering archive structured data extraction system, comprising:
an engineering archive rule base building module, used for building an engineering archive rule base according to a historical engineering archive and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;
a text extraction model training module, used for collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by using a data mining technique to form a reference model, performing unsupervised learning by using the extracted rules, and iteratively training the reference model by labeling data not covered by the rules, to obtain a text extraction model;
an input module, used for acquiring input data from the engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary;
an association module, used for performing feature association on the extracted text vocabulary;
a cleaning module, used for performing data cleaning processing on the extracted text vocabulary to acquire text metadata;
a matching module, used for performing character matching on the text metadata to obtain a plurality of character attributes in the text metadata, matching the character attributes against the engineering archive rule base, and determining the rule element attributes matched with the text metadata so as to determine the entities associated with the text metadata;
and the data generation module is used for generating structured data according to the entity associated with the text metadata.
6. The system for extracting structured data of an engineering archive according to claim 5, wherein the method for performing feature association on the extracted text vocabulary by the association module specifically comprises:
acquiring word vectors of text vocabularies;
calculating the cross entropy of the word vector;
inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;
and placing a softmax function after the output layer of the bidirectional LSTM neural network, and performing feature concatenation to obtain a context feature vector of the word vector.
7. The system for extracting structured data of an engineering archive according to claim 5, wherein the method for performing data cleaning processing on the extracted text vocabulary by the cleaning module specifically comprises:
and filtering out, by means of regular expressions, data in the text vocabulary that does not conform to the rules.
8. The system for extracting structured data of an engineering archive according to claim 5, wherein the method by which the input module acquires the input data from the engineering file is specifically as follows:
performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;
the method for preprocessing the input data by the input module specifically comprises the following steps:
and sequentially carrying out Gaussian filtering, mean value blurring, tone adjustment, contrast enhancement, image marginalization and Gaussian noise processing on the image data.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the program, implements the method for extracting the structured data of the engineering archive according to any one of claims 1 to 4.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the program, when executed by a processor, implements the method for extracting the structured data of the engineering archive according to any one of claims 1 to 4.
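
As a rough illustration of the data cleaning and rule matching steps recited in claims 1 and 3, the following Python sketch filters the extracted vocabulary with regular expressions and then matches the surviving text against a small rule base to produce structured data. The rule base contents, the entity names (`voltage_level`, `archive_number`, `date`), the patterns, and the helper functions are all hypothetical; the claims do not specify them, so this is a sketch under those assumptions rather than the claimed implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical rule element attribute in the rule base."""
    entity: str          # entity that matching text is associated with
    pattern: re.Pattern  # character-level pattern the text must satisfy


# Hypothetical engineering archive rule base (illustrative patterns only).
RULE_BASE = [
    Rule("voltage_level", re.compile(r"\d+(\.\d+)?\s?kV")),
    Rule("archive_number", re.compile(r"[A-Z]{2}-\d{4}-\d{3}")),
    Rule("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
]

# Assumed cleaning rule: tokens that are empty or consist only of
# punctuation, underscores, or whitespace are treated as OCR noise.
NOISE = re.compile(r"[\W_]*")


def clean(tokens):
    """Regex-based data cleaning: drop noise tokens, trim the rest."""
    return [t.strip() for t in tokens if not NOISE.fullmatch(t)]


def match_entities(metadata):
    """Match each text item against the rule base, grouping by entity."""
    structured = {}
    for text in metadata:
        for rule in RULE_BASE:
            if rule.pattern.fullmatch(text):
                structured.setdefault(rule.entity, []).append(text)
                break  # first matching rule wins in this sketch
    return structured


def extract_structured(tokens):
    """Cleaning followed by rule matching."""
    return match_entities(clean(tokens))
```

Calling `extract_structured(["220kV", "###", "FJ-2022-001"])` would group the first and third tokens under `voltage_level` and `archive_number` and discard the noise token. Text that matches no rule is dropped here; a real system might instead route it to manual labeling.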
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210455488.1A CN114756617A (en) 2022-04-24 2022-04-24 Method, system, equipment and storage medium for extracting structured data of engineering archives

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210455488.1A CN114756617A (en) 2022-04-24 2022-04-24 Method, system, equipment and storage medium for extracting structured data of engineering archives

Publications (1)

Publication Number Publication Date
CN114756617A true CN114756617A (en) 2022-07-15

Family

ID=82332170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210455488.1A Pending CN114756617A (en) 2022-04-24 2022-04-24 Method, system, equipment and storage medium for extracting structured data of engineering archives

Country Status (1)

Country Link
CN (1) CN114756617A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596395A (en) * 2023-05-29 2023-08-15 深圳市中联信信息技术有限公司 Operation quality control platform for engineering project evaluation unit guidance and detection
CN116596395B (en) * 2023-05-29 2023-12-01 深圳市中联信信息技术有限公司 Operation quality control platform for engineering project evaluation unit guidance and detection

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN109960804B (en) Method and device for generating topic text sentence vector
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN117076693A (en) Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN114048354A (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN111325019A (en) Word bank updating method and device and electronic equipment
CN110990003A (en) API recommendation method based on word embedding technology
CN112860898B (en) Short text box clustering method, system, equipment and storage medium
CN114756617A (en) Method, system, equipment and storage medium for extracting structured data of engineering archives
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN117149955A (en) Method, medium and system for automatically answering insurance clause consultation
CN116049376B (en) Method, device and system for retrieving and replying information and creating knowledge
CN112579666A (en) Intelligent question-answering system and method and related equipment
CN113901793A (en) Event extraction method and device combining RPA and AI
Liu et al. Suggestion mining from online reviews using random multimodel deep learning
CN113342949A (en) Matching method and system of intellectual library experts and topic to be researched
CN112632985A (en) Corpus processing method and device, storage medium and processor
CN112445904A (en) Knowledge retrieval method, knowledge retrieval device, knowledge retrieval equipment and computer readable storage medium
Liu IntelliExtract: An End-to-End Framework for Chinese Resume Information Extraction from Document Images
CN117608565B (en) Method and system for recommending AI type components in RPA (robotic process automation) based on screenshot analysis
CN115618968B (en) New idea discovery method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination