CN114756617A - Method, system, equipment and storage medium for extracting structured data of engineering archives

Info

Publication number: CN114756617A
Application number: CN202210455488.1A
Authority: CN
Inventors: 邹永增; 魏宏俊; 翁非; 张望华; 黄云飞; 林衍
Original assignee: State Grid Fujian Electric Power Co Ltd
Current assignee: State Grid Fujian Electric Power Co Ltd
Priority date: 2022-04-24
Filing date: 2022-04-24
Publication date: 2022-07-15

Abstract

The invention relates to a method for extracting structured data from an engineering archive, comprising the following steps: constructing an engineering archive rule base that comprises a plurality of rule element attributes; pre-training a text extraction model by collecting structured data of professional vocabulary from historical engineering archives as original data and training iteratively with unsupervised learning; acquiring input data from an engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary; performing feature association and data cleaning on the text vocabulary to obtain text metadata; performing character matching on the text metadata to obtain a plurality of character attributes of the text metadata; matching the character attributes against the engineering archive rule base to determine the rule element attributes that match the text metadata, and thereby the entities associated with the text metadata; and generating structured data according to the entities associated with the text metadata.

Description

Method, system, equipment and storage medium for extracting structured data of engineering archives

Technical Field

The invention relates to a method, a system, equipment and a storage medium for extracting structured data of an engineering archive, belonging to the technical field of digital processing of the engineering archive.

Background

With the rise and spread of artificial intelligence, the amount of repetitive work has been greatly reduced: a model can be trained by deep learning to handle routine repetitive tasks, which lowers the cost of manual input and raises the value of the work. Artificial intelligence is now widely applied. Archive work generally requires important historical data to be preserved, and the storage medium has changed with the technology of each era, from the original paper files to today's digital data. A large amount of this data must still be analyzed and recorded manually. Engineering archives are one such type: they contain important data from the engineering construction process and are important archive resources.

Engineering archive files are the basis for project completion, the evidence of project quality, and reliable data for scientific research, infrastructure construction and similar work; they are valuable for reuse, for reference and for promoting innovation. In engineering archive management, the archives are usually kept as signed paper documents or electronic scans, and are usually stored in the database in unstructured form, through engineering handover signatures, backup scans of engineering documents and electronic editions of the archives. As the data volume grows day by day, the engineering archives cannot be organized and retrieved well; manual entry is generally required, which demands a great deal of labor and incurs high labor cost. As artificial intelligence technology matures, using it to record the unstructured data of engineering archives automatically has become a feasible solution.

In the prior art, for example, the invention patent with application number '202110289665.9' discloses a method for the assisted generation of agency official documents, in the technical field of natural language generation. The method uses the large storage capacity, fast processing and convenient human-computer interaction of a computer to build a corpus-based computer-aided writing system, which recommends to the user, in real time and through human-computer interaction, sentence patterns and example sentences drawn from a real corpus, thereby assisting the core act of sentence making. According to that patent, this fills the technical gap of corpus-based computer-aided writing systems and, as a one-stop intelligent aid for official document writing, addresses the inaccurate information, low efficiency and insufficient writing assistance of earlier approaches. However, this prior art is only a semi-automatic writing aid: it recommends related sentence patterns and example sentences, which the author then revises, so a great deal of manual effort is still needed to refine the wording.

As another example, the invention patent with application number "201811548852.9" discloses an automatic document writing method. According to a target title input by the user, the method matches title templates with the same article genre and core content from an article template library; analyzes the target title according to the core content to generate a target title template; acquires candidate title templates similar to the target title template; evaluates the candidate title templates according to the target title template and information such as the keywords of the candidate title templates; selects the article template corresponding to the highest-rated candidate title template as the target article template for the target title; and finally generates an article from the target title and the target article template. However, this prior art requires a large template library to be prepared manually in advance, and templates are selected only by simple keyword matching, so cases in which the selected template is not applicable are hard to avoid.

Disclosure of Invention

To solve the problems in the prior art, the invention provides a method for extracting structured data from an engineering archive. The method constructs an engineering archive rule base and, on that basis, combines text extraction with feature association, so that the required text entities in an engineering file are associated effectively; the entities are then associated and stored according to the engineering archive management rules. This realizes the extraction of the unstructured data of the engineering archive and greatly improves the efficiency of archive digitization.

The technical scheme of the invention is as follows:

In one aspect, the invention provides a method for extracting structured data of an engineering archive, which comprises the following steps:

constructing an engineering archive rule base according to a historical engineering archive and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;

pre-training a text extraction model: collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by data mining to form a reference model, performing unsupervised learning with the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules, to obtain the text extraction model;

acquiring input data from an engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary;

performing feature association and data cleaning processing on the extracted text vocabulary to obtain text metadata;

performing character matching on the text metadata to obtain a plurality of character attributes of the text metadata; matching the character attributes against the engineering archive rule base to determine the rule element attributes that match the text metadata, and thereby the entities associated with the text metadata; and generating structured data according to the entities associated with the text metadata.
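The matching and generation steps can be sketched as follows. This is only an illustration under stated assumptions: the patent does not define the form of a rule element attribute or of a character attribute, so `RuleElement`, `char_attributes`, `match_rules`, `to_structured` and the sample rules are hypothetical names and values introduced here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleElement:
    """Hypothetical rule element attribute: an entity name plus surface constraints."""
    entity: str
    pattern: str            # regular expression the whole text must match
    max_length: int         # longest text the rule accepts
    needs_digit: bool = True


# Illustrative rule base; a real one would be built from historical archives.
RULE_BASE = [
    RuleElement("date", r"\d{4}-\d{2}-\d{2}", 10),
    RuleElement("voltage_level", r"\d+(\.\d+)?\s?kV", 10),
    RuleElement("project_code", r"[A-Z]{2}-\d{4,}", 16),
]


def char_attributes(text):
    """Split a metadata string into simple character attributes (assumed form)."""
    return {
        "length": len(text),
        "has_digit": any(ch.isdigit() for ch in text),
    }


def match_rules(text, rules=RULE_BASE):
    """Return the entity of the first rule the text satisfies, or None."""
    attrs = char_attributes(text)
    for rule in rules:
        if attrs["length"] > rule.max_length:
            continue
        if rule.needs_digit and not attrs["has_digit"]:
            continue
        if re.fullmatch(rule.pattern, text):
            return rule.entity
    return None


def to_structured(metadata):
    """Group metadata strings by matched entity; unmatched strings are dropped."""
    record = {}
    for item in metadata:
        entity = match_rules(item)
        if entity is not None:
            record.setdefault(entity, []).append(item)
    return record


print(to_structured(["2022-04-24", "220 kV", "FJ-20220", "???"]))
```

In this sketch the cheap attribute checks run before the regular expression, so most non-matching rules are rejected without a pattern match; whether the patented method orders its checks this way is not stated in the source.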

As a preferred embodiment, the method for performing feature association on the extracted text vocabulary specifically includes:

acquiring word vectors of text vocabularies;

calculating the cross entropy of the word vector;

inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;

and placing a softmax function after the output layer of the bidirectional LSTM neural network, and performing feature splicing (concatenation) to obtain a context feature vector of the word vector.
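For reference, the quantities named in these steps have standard textbook definitions, written out below. The symbols are assumptions introduced here for illustration: the patent does not state which distributions the cross entropy compares, so this is a notational sketch rather than the patent's own formula.

```latex
\begin{aligned}
H(p, q) &= -\sum_{i} p_i \log q_i
  && \text{(cross entropy of distributions } p \text{ and } q\text{)} \\
h_t &= \left[\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,\right]
  && \text{(feature splicing of forward and backward states)} \\
\operatorname{softmax}(z)_i &= \frac{e^{z_i}}{\sum_{j} e^{z_j}}
  && \text{(normalization of the output scores } z\text{)}
\end{aligned}
```

Here $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the hidden states of the forward and backward LSTM passes at position $t$, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, so $h_t$ carries context from both sides of the word.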

As a preferred embodiment, the method for performing data cleaning processing on the extracted text vocabulary specifically includes:

filtering out, by means of regular expressions, the data in the text vocabulary that does not conform to the rules.
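A minimal sketch of such a filter, assuming the rule is a whitelist of permitted characters. The pattern `ALLOWED` and the function `clean_vocabulary` are hypothetical; the patent does not give the actual expressions.

```python
import re

# Hypothetical whitelist: CJK ideographs, ASCII letters and digits, a few separators.
ALLOWED = re.compile(r"[\u4e00-\u9fffA-Za-z0-9.\-/ ]+")


def clean_vocabulary(tokens):
    """Keep tokens made entirely of allowed characters, with outer whitespace trimmed."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if token and ALLOWED.fullmatch(token):
            cleaned.append(token)
    return cleaned


# "\ufffd" is the Unicode replacement character that garbled text often contains.
print(clean_vocabulary(["变电站", " 110kV ", "\ufffd#@", "", "FJ-2022/04"]))
```

Using `fullmatch` rejects a token if any single character falls outside the whitelist, which is stricter than searching for one valid substring.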

As a preferred embodiment, the method for acquiring the input data from the project file specifically comprises:

OCR recognition is carried out on the engineering file, and image data of the engineering file are obtained and used as input data;

the method for preprocessing the input data specifically comprises the following steps:

and performing Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image edge processing and Gaussian noise processing on the image data in sequence.
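Three of these operations (Gaussian filtering, mean blurring and contrast enhancement) can be sketched on a small grayscale image held as nested lists. The kernel size, the value of sigma, the border handling and the linear contrast stretch are all assumptions chosen for illustration; the patent gives no parameters, and a production pipeline would normally use an image library instead. Tone adjustment, edge processing and noise processing are omitted here.

```python
import math


def gaussian_kernel(size=3, sigma=1.0):
    """Build a normalized size x size Gaussian kernel."""
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def mean_kernel(size=3):
    """Build a size x size box kernel (mean blur)."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


def convolve(image, kernel):
    """Apply a kernel to a grayscale image, clamping coordinates at the border."""
    h, w, half = len(image), len(image[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for kr, row in enumerate(kernel):
                for kc, weight in enumerate(row):
                    rr = min(max(r + kr - half, 0), h - 1)
                    cc = min(max(c + kc - half, 0), w - 1)
                    acc += weight * image[rr][cc]
            out[r][c] = acc
    return out


def stretch_contrast(image):
    """Linearly rescale pixel values to the range 0..255."""
    lo = min(map(min, image))
    hi = max(map(max, image))
    if hi - lo < 1e-9:          # flat image: nothing to stretch
        return [row[:] for row in image]
    return [[(v - lo) * 255.0 / (hi - lo) for v in row] for row in image]


def preprocess(image):
    """Gaussian filter, then mean blur, then contrast enhancement."""
    blurred = convolve(convolve(image, gaussian_kernel()), mean_kernel())
    return stretch_contrast(blurred)


page = [[0, 0, 0, 0], [0, 255, 255, 0], [0, 255, 255, 0], [0, 0, 0, 0]]
result = preprocess(page)
```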

In another aspect, the present invention provides a system for generating structured data of an engineering archive, including:

the engineering archive rule base construction module is used for constructing an engineering archive rule base according to historical engineering archives and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;

the text extraction model training module is used for collecting structured data of professional vocabulary from the historical engineering archives as original data, extracting rules from the original data by data mining to form a reference model, performing unsupervised learning with the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules, to obtain a text extraction model;

the input module is used for acquiring input data from the engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary;

the association module is used for performing characteristic association on the extracted text vocabulary;

the cleaning module is used for performing data cleaning processing on the extracted text vocabulary to acquire text metadata;

the matching module is used for carrying out character matching on the text metadata to obtain a plurality of character attributes in the text metadata; carrying out rule matching through character attributes to an engineering archive rule base, and determining rule element attributes matched with the text metadata so as to determine entities related to the text metadata;

and the data generation module is used for generating structured data according to the entity associated with the text metadata.
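One way to picture how the runtime modules fit together is as a chain of injected stages. The class `ExtractionSystem`, its field names and the stand-in lambdas below are hypothetical and serve only to show the data flow; the rule base construction and model training modules run ahead of time and are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ExtractionSystem:
    """Hypothetical wiring of the runtime modules; each stage is an injected callable."""
    extract: Callable[[str], List[str]]          # input module plus text extraction model
    associate: Callable[[List[str]], List[str]]  # association module
    clean: Callable[[List[str]], List[str]]      # cleaning module
    match: Callable[[str], Optional[str]]        # matching module: text -> entity or None

    def run(self, document: str) -> Dict[str, str]:
        """Data generation module: run every stage and emit entity -> text pairs."""
        vocabulary = self.extract(document)
        metadata = self.clean(self.associate(vocabulary))
        structured = {}
        for item in metadata:
            entity = self.match(item)
            if entity is not None:
                structured[entity] = item
        return structured


# Stand-in stages, for illustration only.
system = ExtractionSystem(
    extract=lambda doc: doc.split(),
    associate=lambda words: words,  # placeholder: no context features computed
    clean=lambda words: [w for w in words if w.isalnum()],
    match=lambda w: "year" if w.isdigit() and len(w) == 4 else None,
)
print(system.run("completed 2022 ## substation"))
```

Passing the stages in as callables keeps each module replaceable, for example swapping the placeholder association stage for a trained network, without changing the data generation logic.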

As a preferred embodiment, the method for performing feature association on the extracted text vocabulary by the association module specifically includes:

acquiring word vectors of text vocabularies;

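calculating the cross entropy of the word vector;

inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;

and placing a softmax function after the output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.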

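As a preferred embodiment, the method for performing data cleaning processing on the extracted text vocabulary by the cleaning module specifically includes:

filtering out, by means of regular expressions, the data in the text vocabulary that does not conform to the rules.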

As a preferred embodiment, the method for the input module to obtain the input data from the project file specifically comprises:

performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;

the method for preprocessing the input data by the input module specifically comprises the following steps:

and sequentially carrying out Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image edge processing and Gaussian noise processing on the image data.

In another aspect, the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for extracting engineering archive structured data according to any embodiment of the present invention is implemented.

In still another aspect, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for extracting structured data of an engineering archive according to any embodiment of the present invention.

The invention has the following beneficial effects:

The method for extracting structured data of an engineering archive constructs an engineering archive rule base and, on that basis, combines text extraction with feature association, so that the required text entities in engineering files are associated effectively; the entities are then associated and stored according to the engineering archive management rules. This realizes the extraction of the unstructured data of the engineering archive and greatly improves the efficiency of archive digitization.

Drawings

FIG. 1 is a flow chart of a method of an embodiment of the present invention;

FIG. 2 is a diagram illustrating an example of associating features with a text vocabulary according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.

It should be understood that the step numbers used herein are only for convenience of description and are not used as limitations on the order in which the steps are performed.

It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The term "and/or" refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

Embodiment one:

Referring to FIG. 1, a method for extracting structured data of an engineering archive includes the following steps:

constructing an engineering archive rule base according to historical engineering archives and an engineering archive management method. The engineering archives of this embodiment are specifically the engineering project archives of the national power grid, so historical power grid engineering archives are collected and the rule base is constructed according to the power grid engineering project management method and the power grid management method; structured data is extracted from a large volume of historical power grid engineering archive data, and a large number of rule element attributes are generated through manual verification and multiple rounds of iteration;

acquiring input data from an engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, extracting text vocabulary, and then returning the extracted text vocabulary in a common message format;

performing character matching on the text metadata and dividing the text metadata into a plurality of character attributes; matching the character attributes against the engineering archive rule base and, according to the characteristics of the rule element attributes, determining the rule element attributes that match the text metadata, and thereby the entities associated with the text metadata; and generating structured data according to the entities associated with the text metadata.
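The rule base construction step of this embodiment, in which rule element attributes are derived from historical records and then verified, might be approximated as below. This is a sketch under assumptions, not the patented procedure: generalizing values to character-class "shapes", the `min_support` threshold, and the `verify` callback standing in for manual verification are all hypothetical choices introduced here.

```python
import re
from collections import Counter


def shape(value):
    """Generalize a value to a character-class shape, e.g. 'FJ-2022' -> 'AA-9999'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def shape_to_regex(s):
    """Turn a shape into a regular expression with one class per character."""
    parts = []
    for ch in s:
        if ch == "9":
            parts.append(r"\d")
        elif ch == "A":
            parts.append(r"[^\W\d_]")   # any letter
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def build_rule_base(history, min_support=2, verify=lambda field, pattern: True):
    """Propose one regex per frequent shape of each field and keep those that pass review."""
    shapes = {}
    for field, value in history:
        shapes.setdefault(field, Counter())[shape(value)] += 1
    rules = {}
    for field, counter in shapes.items():
        kept = []
        for s, count in counter.items():
            pattern = shape_to_regex(s)
            if count >= min_support and verify(field, pattern):
                kept.append(pattern)
        if kept:
            rules[field] = kept
    return rules


# Made-up historical (field, value) pairs for illustration.
history = [
    ("project_code", "FJ-2021"), ("project_code", "FJ-2022"),
    ("project_code", "X1"), ("date", "2021-05-01"), ("date", "2022-04-24"),
]
rules = build_rule_base(history)
```

Repeating the call with a stricter `verify` function or a larger history corresponds loosely to the multiple rounds of iteration the embodiment mentions.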

Referring specifically to FIG. 2, as a preferred implementation of this embodiment, the method for performing feature association on the extracted text vocabulary specifically includes:

acquiring word vectors of the text vocabulary;

calculating the cross entropy of the word vector;

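inputting the cross entropy of the word vector into an input layer of a bidirectional LSTM neural network for feature extraction;

and placing a softmax function after the output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.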

As a preferred implementation of this embodiment, the method for performing data cleansing processing on the extracted text vocabulary specifically includes:

filtering out, by means of regular expressions, the data in the text vocabulary that does not conform to the rules; using regular expressions makes the domain-specific data more accurate and, on top of accurate recognition, removes dirty data (stray characters, garbled text, symbols and the like).

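As a preferred implementation of this embodiment, the method for acquiring the input data from the engineering file specifically includes:

performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;

the method for preprocessing the input data specifically includes:

performing Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image edge processing and Gaussian noise processing on the image data in sequence.

Embodiment two: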

The embodiment provides a system for generating structured data of an engineering archive, comprising:

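the engineering archive rule base construction module is used for constructing an engineering archive rule base according to historical engineering archives and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;

the text extraction model training module is used for collecting structured data of professional vocabulary from the historical engineering archives as original data, extracting rules from the original data by data mining to form a reference model, performing unsupervised learning with the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules, to obtain a text extraction model;

the input module is used for acquiring input data from the engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary;

the association module is used for performing feature association on the extracted text vocabulary;

the cleaning module is used for performing data cleaning processing on the extracted text vocabulary to acquire text metadata;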

the matching module is used for carrying out character matching on the text metadata to obtain a plurality of character attributes in the text metadata; performing rule matching through the character attributes to an engineering archive rule base, and determining rule element attributes matched with the text metadata so as to determine entities related to the text metadata;

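and the data generation module is used for generating the structured data according to the entities associated with the text metadata.

As a preferred implementation of this embodiment, the method for performing feature association on the extracted text vocabulary by the association module specifically includes:

acquiring word vectors of the text vocabulary;

calculating the cross entropy of the word vector;

inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;

and placing a softmax function after the output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.

As a preferred implementation of this embodiment, the method for performing data cleaning processing on the extracted text vocabulary by the cleaning module specifically includes:

filtering out, by means of regular expressions, the data in the text vocabulary that does not conform to the rules.

As a preferred implementation of this embodiment, the method for acquiring the input data from the engineering file by the input module specifically includes:

performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;

the method for preprocessing the input data by the input module specifically includes:

sequentially carrying out Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image edge processing and Gaussian noise processing on the image data.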

Embodiment three:

This embodiment provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method for extracting structured data of an engineering archive according to any embodiment of the invention.

Embodiment four:

This embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method for extracting structured data of an engineering archive according to any embodiment of the present invention.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b and c may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where a, b and c may each be single or multiple.

Those of ordinary skill in the art will appreciate that the various elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of electronic hardware and computer software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, any function that is implemented as a software functional unit and sold or used as a standalone product may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied as a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for extracting structured data of engineering archives is characterized by comprising the following steps:

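constructing an engineering archive rule base according to historical engineering archives and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;

pre-training a text extraction model: collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by data mining to form a reference model, performing unsupervised learning with the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules, to obtain the text extraction model;

acquiring input data from the engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary;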

performing feature association and data cleaning processing on the extracted text vocabulary to obtain text metadata;

performing character matching on the text metadata to obtain a plurality of character attributes of the text metadata; matching the character attributes against the engineering archive rule base to determine the rule element attributes that match the text metadata, and thereby the entities associated with the text metadata; and generating the structured data according to the entities associated with the text metadata.

2. The method for extracting the structured data of the engineering archive according to claim 1, wherein the method for performing the feature association on the extracted text vocabulary specifically comprises:

acquiring word vectors of text vocabularies;

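calculating the cross entropy of the word vector;

inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;

and placing a softmax function after the output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.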

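3. The method for extracting the structured data of the engineering archive according to claim 1, wherein the method for performing data cleaning processing on the extracted text vocabulary specifically comprises:

filtering out, by means of regular expressions, the data in the text vocabulary that does not conform to the rules.

4. The method for extracting the structured data of the engineering archive according to claim 1, wherein the method for acquiring the input data from the engineering file specifically comprises:

performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;

the method for preprocessing the input data specifically comprises:

performing Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image edge processing and Gaussian noise processing on the image data in sequence.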

5. An engineering archive structured data extraction system, comprising:

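the engineering archive rule base construction module is used for constructing an engineering archive rule base according to historical engineering archives and an engineering archive management method, wherein the engineering archive rule base comprises a plurality of rule element attributes;

the text extraction model training module is used for collecting structured data of professional vocabulary from historical engineering archives as original data, extracting rules from the original data by data mining to form a reference model, performing unsupervised learning with the extracted rules, and iteratively training the reference model by labeling the data not covered by the rules, to obtain a text extraction model;

the input module is used for acquiring input data from the engineering file, preprocessing the input data, inputting the preprocessed data into the pre-trained text extraction model, and extracting text vocabulary;

the association module is used for performing feature association on the extracted text vocabulary;

the cleaning module is used for performing data cleaning processing on the extracted text vocabulary to acquire text metadata;

the matching module is used for performing character matching on the text metadata to obtain a plurality of character attributes of the text metadata, and matching the character attributes against the engineering archive rule base to determine the rule element attributes that match the text metadata, and thereby the entities associated with the text metadata;

and the data generation module is used for generating structured data according to the entities associated with the text metadata.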

6. The system for extracting structured data of an engineering archive according to claim 5, wherein the method for performing feature association on the extracted text vocabulary by the association module specifically comprises:

acquiring word vectors of text vocabularies;

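calculating the cross entropy of the word vector;

inputting the cross entropy of the word vector into a bidirectional LSTM neural network for feature extraction;

and placing a softmax function after the output layer of the bidirectional LSTM neural network, and performing feature splicing to obtain a context feature vector of the word vector.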

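7. The system for extracting structured data of an engineering archive according to claim 5, wherein the method for performing data cleaning processing on the extracted text vocabulary by the cleaning module specifically comprises:

filtering out, by means of regular expressions, the data in the text vocabulary that does not conform to the rules.

8. The system for extracting structured data of an engineering archive according to claim 5, wherein the method for the input module to obtain the input data from the engineering file specifically comprises:

performing OCR recognition on the engineering file to obtain image data of the engineering file as input data;

the method for preprocessing the input data by the input module specifically comprises:

sequentially carrying out Gaussian filtering, mean blurring, tone adjustment, contrast enhancement, image edge processing and Gaussian noise processing on the image data.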

9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the program, implements the method for extracting the structured data of the engineering archive according to any one of claims 1 to 4.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the program, when executed by a processor, implements the method for extracting the structured data of the engineering archive according to any one of claims 1 to 4.