CN115034200A - Drawing information extraction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115034200A
Authority
CN
China
Prior art keywords
information, entity, text, area, preset
Legal status
Pending
Application number
CN202110239779.2A
Other languages
Chinese (zh)
Inventor
乔宇 (Qiao Yu)
罗翔弘 (Luo Xianghong)
卢广龙 (Lu Guanglong)
Current Assignee
Guangdong Bozhilin Software Technology Co., Ltd.
Original Assignee
Guangdong Bozhilin Robot Co Ltd
Application filed by Guangdong Bozhilin Robot Co., Ltd.
Priority to CN202110239779.2A
Publication of CN115034200A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/30: Semantic analysis
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10: File systems; File servers
    • G06F 16/11: File system administration, e.g. details of archiving or snapshots
    • G06F 16/116: Details of conversion of file system types or formats

Abstract

The application discloses a drawing information extraction method and device, an electronic device and a storage medium, wherein the method comprises the following steps: identifying an information area in a picture according to a target detection model, wherein the information area comprises a text area and/or a table area; extracting text data in the information area, and performing named entity recognition on the text data to obtain entity words; and extracting data in a predetermined format from the entity words. The drawing information extraction method provided by the embodiments of the application rapidly extracts drawing information based on optical character recognition, target detection, named entity recognition and dependency syntax analysis, and realizes automatic batch file import, parsing and information extraction.

Description

Drawing information extraction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a drawing information extraction method and apparatus, an electronic device, and a storage medium.
Background
In the related art, information extraction from drawings of a common BIM (Building Information Modeling) is mainly based on the IFC (Industry Foundation Classes) data format: the IFC-format BIM model is parsed either directly with open-source tools such as EDMdeveloperSeat IFC or IfcEngine DLL, or through secondary encapsulation of the Revit API (Application Programming Interface), so as to exchange structural information between Revit and platforms such as OpenSees and Marc and to extract structural attribute information. Alternatively, natural language word segmentation is used to find the attributes of the model, the required attribute information is obtained by constructing association relations among attributes, attribute value constraints and specifications, the information is then classified, and an information recognition and extraction method suitable for various categories is designed in combination with research results on attribute information expression.
However, the related art depends heavily on Revit software and is therefore severely limited; most attribute information is obtained through mapping, so a large amount of attribute information may be lost; only information in a predefined attribute set can be extracted, and the global content cannot be parsed and extracted; intelligent means for analyzing the information content are lacking, so the information semantics and important latent information cannot be analyzed; and the process depends on manual work and cannot be carried out efficiently, quickly and in batches.
Summary of the application
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first object of the present invention is to provide a drawing information extraction method that rapidly extracts drawing information based on optical character recognition, target detection, named entity recognition and dependency syntax analysis, and realizes automatic batch file import, parsing and information extraction, thereby solving the problems of the related art that heavy dependence on Revit software brings great limitations, that most attribute information is obtained through mapping so a large amount of attribute information may be lost, that only information in a predefined attribute set can be extracted while the global content cannot be parsed and extracted, that intelligent means for analyzing the information content are lacking so the information semantics and important latent information cannot be analyzed, and that the process depends on manual work and cannot be carried out efficiently, quickly and in batches.
The second purpose of the invention is to provide a drawing information extraction device.
A third object of the invention is to propose an electronic device.
A fourth object of the invention is to propose a computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a drawing information extraction method, including the following steps:
identifying an information area in the picture according to a target detection model, wherein the information area comprises a character area and/or a table area;
extracting text data in the information area, and carrying out named entity recognition on the text data to obtain entity words;
and extracting data in a preset format from the entity words.
In addition, the drawing information extraction method according to the above embodiment of the present invention may further have the following additional technical features:
optionally, the target detection model is obtained by training through the following steps:
acquiring a positive sample picture and a negative sample picture with a preset proportion, wherein the positive sample picture comprises the table area, and the negative sample picture does not comprise the table area;
marking the positive sample picture and the negative sample picture according to a preset marking mode;
and training the marked positive sample picture and the marked negative sample picture through a preset target algorithm to obtain the target detection model.
Optionally, the extracting text data in the information area includes:
segmenting the information area in the picture through the target detection model;
and performing text processing on the information area based on an optical character recognition model to obtain the text data.
Optionally, when the information region is a text region, the performing named entity recognition on the text data to obtain an entity word includes:
determining a text with entity labeling information according to a preset named entity recognition model and the text data, and acquiring the entity words from the text with the entity labeling information.
Optionally, when the information area is a table area, the performing named entity recognition on the text data to obtain an entity word includes:
extracting a table structure of the table area based on a preset table structure extraction model;
performing text processing on the table structure through the optical character recognition model to obtain text data in the table;
generating an editable table file based on the text data within the table and the table structure;
and determining a text with entity labeling information according to the preset named entity recognition model and the editable table file, and acquiring the entity word from the text with the entity labeling information.
Optionally, the extracting data in a predetermined format from the entity word includes:
judging whether the entity words meet preset complex sentence pattern conditions or not;
and if the preset complex sentence pattern condition is met, extracting the structured information of the entity words according to the dependency syntax analysis, updating the entity words according to the extraction result, and extracting data in a preset format from the updated entity words.
Optionally, the drawing information extraction method further includes:
and if the preset complex sentence pattern condition is not met, directly extracting data in a preset format from the entity words.
According to the drawing information extraction method, the information area in the picture can be identified according to the target detection model, the text data in the information area is extracted, named entity recognition is performed on the text data to obtain entity words, and data in a predetermined format is extracted from the entity words. In this way, the drawing information is rapidly extracted based on optical character recognition, target detection, named entity recognition and dependency syntax analysis, and automatic batch file import, parsing and information extraction are realized, thereby solving the problems of the related art that heavy dependence on Revit software brings great limitations, that most attribute information is obtained through mapping so a large amount of attribute information may be lost, that only information in a predefined attribute set can be extracted while the global content cannot be parsed and extracted, that intelligent means for analyzing the information content are lacking so the information semantics and important latent information cannot be analyzed, and that the process depends on manual work and cannot be carried out efficiently, quickly and in batches.
In order to achieve the above object, a drawing information extraction device according to a second aspect of the present application includes:
the identification module is used for identifying an information area in the picture according to the target detection model, wherein the information area comprises a character area and/or a table area;
the acquisition module is used for extracting text data in the information area and carrying out named entity recognition on the text data to acquire entity words;
and the extraction module is used for extracting data in a preset format from the entity words.
Optionally, the identification module is specifically configured to:
acquiring a positive sample picture and a negative sample picture in a preset proportion, wherein the positive sample picture comprises the table area, and the negative sample picture does not comprise the table area;
marking the positive sample picture and the negative sample picture according to a preset marking mode;
and training the marked positive sample picture and the marked negative sample picture through a preset target algorithm to obtain the target detection model.
Optionally, the obtaining module is specifically configured to:
segmenting the information area by the target detection model;
and performing text processing on the information area based on an optical character recognition model to obtain the text data.
Optionally, when the information region is a text region, the obtaining module includes:
determining a text with entity labeling information according to a preset named entity recognition model and the text data, and acquiring the entity words from the text with the entity labeling information.
Optionally, when the information area is a table area, the obtaining module includes:
extracting a table structure of the table area based on a preset table structure extraction model;
performing text processing on the table structure through the optical character recognition model to obtain text data in the table;
generating an editable table file based on the text data within the table and the table structure;
and determining a text with entity labeling information according to the preset named entity recognition model and the editable table file, and acquiring the entity word from the text with the entity labeling information.
Optionally, the extracting module is specifically configured to:
judging whether the entity words meet preset complex sentence pattern conditions or not;
and if the preset complex sentence pattern condition is met, extracting the structured information of the entity words according to the dependency syntax analysis, updating the entity words according to the extraction result, and extracting data in a preset format from the updated entity words.
Optionally, the drawing information extraction device further includes:
and if the preset complex sentence pattern condition is not met, directly extracting data in a preset format from the entity words.
According to the drawing information extraction device, the information area in the picture can be identified according to the target detection model, the text data in the information area is extracted, named entity recognition is performed on the text data to obtain entity words, and data in a predetermined format is extracted from the entity words. In this way, the drawing information is rapidly extracted based on optical character recognition, target detection, named entity recognition and dependency syntax analysis, and automatic batch file import, parsing and information extraction are realized, thereby solving the problems of the related art that heavy dependence on Revit software brings great limitations, that most attribute information is obtained through mapping so a large amount of attribute information may be lost, that only information in a predefined attribute set can be extracted while the global content cannot be parsed and extracted, that intelligent means for analyzing the information content are lacking so the information semantics and important latent information cannot be analyzed, and that the process depends on manual work and cannot be carried out efficiently, quickly and in batches.
To achieve the above object, an embodiment of a third aspect of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions configured to perform the drawing information extraction method according to the above embodiment.
In order to achieve the above object, a fourth aspect of the present application provides a computer-readable storage medium storing computer instructions for causing a computer to execute the drawing information extraction method according to the foregoing embodiment.
Therefore, the information area in the picture can be identified according to the target detection model, the text data in the information area is extracted, named entity recognition is performed on the text data to obtain entity words, and data in a predetermined format is extracted from the entity words. In this way, the drawing information is rapidly extracted based on optical character recognition, target detection, named entity recognition and dependency syntax analysis, and automatic batch file import, parsing and information extraction are realized, thereby solving the problems of the related art that heavy dependence on Revit software brings great limitations, that most attribute information is obtained through mapping so a large amount of attribute information may be lost, that only information in a predefined attribute set can be extracted while the global content cannot be parsed and extracted, that intelligent means for analyzing the information content are lacking so the information semantics and important latent information cannot be analyzed, and that the process depends on manual work and cannot be carried out efficiently, quickly and in batches.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a drawing information extraction method according to an embodiment of the present application;
FIG. 2 is an exemplary diagram of a form in a drawing sheet according to one embodiment of the present application;
FIG. 3 is a schematic diagram of a bidirectional high-order feature extraction algorithm model according to one embodiment of the present application;
FIG. 4 is a flowchart of a drawing information extraction method according to one embodiment of the present application;
fig. 5 is an exemplary diagram of a drawing information extraction apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The drawing information extraction method, apparatus, electronic device, and storage medium according to embodiments of the present invention are described below with reference to the accompanying drawings, and first, the drawing information extraction method according to embodiments of the present invention will be described with reference to the accompanying drawings.
Specifically, fig. 1 is a schematic flow chart of a drawing information extraction method provided in an embodiment of the present application.
As shown in fig. 1, the drawing information extraction method includes the following steps:
in step S101, an information area in the picture is identified according to the target detection model, wherein the information area includes a text area and/or a table area.
Optionally, in some implementations, the target detection model is trained by: acquiring a positive sample picture and a negative sample picture in a preset proportion, wherein the positive sample picture comprises a table area, and the negative sample picture does not comprise the table area; marking the positive sample picture and the negative sample picture according to a preset marking mode; and training the marked positive sample picture and the marked negative sample picture through a preset target algorithm to obtain a target detection model.
It can be understood that the picture may be, for example, a BIM drawing. Before the pictures and tables in the BIM drawing are segmented by the target detection model, the embodiment of the present application may obtain as many drawings and other project files as possible from a real design project. These project files are classified by format into two classes: files that can be processed directly (e.g. images, pdf, txt, Microsoft office files) and files that need preprocessing. Files that need preprocessing are converted into directly processable files by a preprocessing method; if a file cannot be converted, the sample is discarded. After the original samples are processed in this way, only directly processable files (images, pdf, txt and Microsoft office files) remain.
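As a minimal sketch of this format screening, the samples could be routed as follows; the extension lists and the separate conversion step are illustrative assumptions, not part of the original disclosure:

```python
from pathlib import Path

# Extensions the pipeline can consume directly; everything else must be
# converted first or discarded. Both lists are illustrative assumptions.
DIRECT = {".png", ".jpg", ".jpeg", ".pdf", ".txt", ".docx", ".xlsx", ".pptx"}
CONVERTIBLE = {".bmp", ".tif", ".dwg"}  # hypothetical formats needing preprocessing

def screen_samples(folder: str):
    direct, to_convert, discarded = [], [], []
    for path in Path(folder).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in DIRECT:
            direct.append(path)
        elif ext in CONVERTIBLE:
            to_convert.append(path)   # hand over to a format-conversion step
        else:
            discarded.append(path)    # cannot be converted, drop the sample
    return direct, to_convert, discarded
```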
Specifically, the embodiment of the present application detects the areas containing text or tables in a picture or pdf file, crops these areas out, and passes them to OCR (Optical Character Recognition) for processing. In the image or pdf file, only the areas likely to contain more important information are taken and passed to OCR, while cluttered areas or areas with little information are discarded. OCR therefore does not scan the whole image or pdf file but only a small part of it, which greatly reduces the OCR workload, memory usage and time consumption.
Further, in the embodiment of the application, the model can be trained on 2000 drawings. Each drawing is a picture and can be expressed as a three-dimensional matrix, where the first two dimensions represent the length and width of the drawing and the last dimension represents its color. The model input can be a batch of drawings with a batch length usually greater than 1, so a GPU (Graphics Processing Unit) can provide good acceleration during training, and multiple drawings can be fed into the model at once during prediction before the results are output. For ease of understanding, the model description below is based on a single drawing, but in practice the input may be multiple drawings.
Further, each drawing is uniformly divided into several parts, and each part is responsible for detecting the targets in its own region, for example a table on the drawing as shown in fig. 1. The drawing is passed through a series of convolutional neural network operations to extract drawing features and reduce the dimensions of the picture; the model then flattens each output feature map into a one-dimensional vector, and the result is output after one more fully connected layer. The result for each drawing can be represented as a three-dimensional matrix: the length and width represent the number of parts into which the model uniformly divided the drawing at the beginning, and the height covers the target detection frame and the target category information. For example, if the drawing is initially divided into nine parts, the output of the model is a 3 x 3 x 5 matrix, where 3 x 3 represents the nine parts into which the drawing was initially divided and 5 represents the xy coordinates, length and width of the target detection frame plus the target category information. Because the last layer of the network outputs a one-dimensional vector, the final result is obtained by a matrix transformation into a matrix of the correct dimensions.
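A small sketch of how such a flat output vector could be folded back into the 3 x 3 x 5 grid described above; the decoding conventions (cell-relative offsets, image-relative width and height) are assumptions rather than part of the original disclosure:

```python
import numpy as np

S, C = 3, 5                      # 3 x 3 grid cells, 5 values per cell (x, y, w, h, class)

def decode_output(flat: np.ndarray, img_w: int, img_h: int):
    """Reshape the one-dimensional network output into an S x S x C grid
    and turn each cell prediction into an absolute bounding box."""
    grid = flat.reshape(S, S, C)
    boxes = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, cls = grid[row, col]
            # assumed: x, y are offsets inside the cell, w/h are relative to the image
            cx = (col + x) / S * img_w
            cy = (row + y) / S * img_h
            boxes.append((cx, cy, w * img_w, h * img_h, int(cls)))
    return boxes

# usage: boxes = decode_output(model_output, img_w=832, img_h=832)
```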
Second, the target detection model needs to be trained.
Specifically, 2000 pieces of drawing data were collected, comprising 80% positive samples (containing chart information) and 20% negative samples (not containing chart information), and the samples were then labeled. The labels specify the category of the target (only one category, chart, is described here, but the model supports multiple categories), the xy coordinates of the centre point of the detection frame (the (0, 0) origin being the upper-left corner of the drawing picture), and the length and width of the frame. The frame completely encloses the chart, and a small gap is left between the frame and the outermost border of the chart, which facilitates the subsequent structural restoration of the chart.
Furthermore, the target detection model can be adjusted and trained with this data and its labels. Considering the training configuration requirements and the prediction accuracy, the target detection model of the embodiment of the present application is configured as follows: training iterations: 6000, batch size: 64, input picture size: 832*832. After the training of the target detection model is finished, the required information area can be cropped from an input image or pdf file with an accuracy of about 90%.
Therefore, in the image or pdf file, only the area possibly containing more important information is taken, so that the workload of OCR (optical character recognition), the memory and the time consumption are effectively reduced, the extraction time is reduced, and the extraction efficiency is ensured.
In step S102, text data in the information area is extracted, and named entity recognition is performed on the text data to obtain entity words.
Optionally, in some implementations, extracting text data in the information region includes: dividing an information area in the picture through a target detection model; and performing text processing on the information area based on the optical character recognition model to obtain text data.
It can be understood that, in the embodiment of the application, after the picture regions are cropped from the BIM drawing based on the target detection model, the regions can be processed by OCR to obtain the text data.
Specifically, OCR is largely divided into two parts: character detection, i.e. locating the text information in the picture, and character recognition, i.e. converting the located picture text into text information. The character detection model consists mainly of a U-shaped network and a post-processing part based on a differentiable binarization function. In the first part, the U-shaped segmentation network mainly comprises convolution layers and deconvolution layers: the convolution layers extract the context information of the image, and the deconvolution layers combine these features to generate the segmentation result, achieving image segmentation. The U-shaped network can only output a two-dimensional probability map, and if the text in the picture is to be located, the probability map needs to be binarized. Binarization can be obtained by setting a threshold, but different thresholds strongly affect the final result, so letting the network learn and automatically adjust this threshold during training greatly improves the final effect. The differentiable binarization function is:
$$B_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - A_{i,j})}}$$
where P is the probability map, A is the threshold map that can be obtained by network learning, k is an amplification factor, and B is the final binarization result.
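A numpy sketch of this differentiable binarization step; the amplification factor k = 50 and the 0.5 cut-off used at inference time are assumptions:

```python
import numpy as np

def differentiable_binarize(prob_map: np.ndarray, thresh_map: np.ndarray, k: float = 50.0):
    """Approximate binarization B = 1 / (1 + exp(-k * (P - A))).
    P is the probability map, A the learned threshold map; k sharpens the step."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

def to_text_mask(prob_map: np.ndarray, thresh_map: np.ndarray, cutoff: float = 0.5):
    # Hard mask used at inference time to locate text pixels.
    return (differentiable_binarize(prob_map, thresh_map) > cutoff).astype(np.uint8)
```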
Further, the character detection model outputs a two-dimensional localization map of the text in the image, and the located text is enlarged so that the subsequent character recognition model can conveniently extract the real text information. Character recognition uses a convolutional neural network and a recurrent neural network and comprises the following three parts: the image is first fed into the convolution layers, which are responsible for extracting the image features; the feature map output by the convolutional network cannot be fed directly into the recurrent neural network, so the output is serialized, the elements of the sequence being the pixel columns of the feature map, so that the recurrent neural network can predict the label distribution of each column of pixels; finally, the prediction result of the recurrent neural network is post-processed to obtain the text sequence (for example, pppersson is converted into person), for which the following formula is used:
$$p(c \mid x) = \sum_{s:\,T(s) = c} p(s \mid x)$$
where c is the output text, which is composed of the text labels, x is each element of the sequence (itself a vector), T is the conversion in the above example, and s is the original sequence before the T conversion. It should be noted that, because the binarization threshold is set automatically by the model, i.e. learned according to the actual situation, the text localization is accurate, which lays a solid foundation for the subsequent character recognition. Moreover, convolutional neural networks and recurrent neural networks each have data types they are good at processing: the former is generally used for picture-related recognition tasks, while the latter is generally applied to text, speech and anything involving a time dimension. OCR takes picture information as input and outputs text information, so the model combines a convolutional neural network and a recurrent neural network, making full use of the advantages of both to extract the text information from the picture.
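The conversion T can be illustrated by a greedy CTC-style decoder; this is a sketch, and the blank symbol is an assumption not stated in the original text:

```python
BLANK = "-"  # assumed blank symbol emitted by the recurrent network

def ctc_greedy_decode(frame_labels):
    """Collapse repeated labels and drop blanks,
    e.g. ['p','p','p','e','r','s','s','o','n'] -> 'person'."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

print(ctc_greedy_decode(list("pppersson")))  # person
```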
In this way, the characters in the file are obtained to serve as input data for named entity recognition, and the advantages of the convolutional and recurrent neural networks are fully exploited to extract the picture text information.
Optionally, in some implementations, when the information region is a text region, performing named entity recognition on the text data to obtain entity words includes: and determining a text with entity labeling information according to a preset named entity recognition model and text data, and acquiring entity words from the text with the entity labeling information.
It can be understood that, in the embodiment of the present application, after the characters recognized by OCR are obtained, algorithmic processing can be performed to extract information. For example, suppose the characters recognized by OCR are: "smoke exhaust fans and infrared curtains are installed on the tenth to twelfth floors of buildings No. 3 and No. 6 of the Bozhilin research center". If the information to be extracted is the building number, floor and equipment, the processing result is: "building: No. 3, No. 6; floor: tenth to twelfth; equipment: smoke exhaust fan, infrared curtain".
Optionally, in some implementations, when the information area is a table area, performing named entity recognition on the text data to obtain entity words includes: extracting a table structure of the table area based on a preset table structure extraction model; performing text processing on the table structure through an optical character recognition model to obtain text data in the table; generating an editable table file based on the text data and the table structure in the table; and determining a text with entity labeling information according to a preset named entity recognition model and an editable table file, and acquiring an entity word from the text with the entity labeling information.
It can be understood that, after the picture of a table is cropped from the BIM drawing based on the target detection model, the embodiment of the application can restore the table content to excel format, so that information can be extracted directly from the table during subsequent natural language processing. As shown in fig. 2, if the information about the persons responsible for design and verification needs to be extracted, only the cells next to the "design" and "verification" fields in the table need to be read, which effectively reduces the workload of named entity recognition.
Specifically, the table structure extraction model may be trained first, and the table structure extraction model is designed first.
Table construction and restoration uses image morphology to extract the table structure from the chart cropped by the target detection model and restores the chart to an excel file. The table structure consists mainly of horizontal and vertical lines, so each cell in the table can be located simply by identifying the intersections between these lines, and the horizontal and vertical lines themselves are identified according to morphological principles. For example, to identify vertical lines, the following matrix may be defined:
[[0,1,0],
[0,1,0],
[0,1,0]];
after the table picture is binarized, the matrix can be slid from the upper left corner of the image, whether all pixel values of the original image corresponding to the matrix are 1 or not is calculated each time, if yes, the central pixel keeps the original pixel value, and if not, the central pixel is set to be 0. The calculation can be applied to the detection cross-line as well, and is not described in detail herein to avoid redundancy. The result of the calculation can obtain two pictures, wherein the former highlights the vertical line, the latter highlights the horizontal line, and other noise pixel values are 0, namely, the two pictures are removed. And finally, carrying out logical AND operation on the two pictures to obtain the grids in the intersection point positioning table.
After the cells in the chart are located, they can be cropped out and fed into the OCR model, so that the text information of every cell in the chart is recognized; the text is then combined with the recognized table structure to generate an excel table file.
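A minimal sketch of this final assembly step with openpyxl, assuming the recognized cells are available as (row, column, text) triples derived from the intersection grid:

```python
from openpyxl import Workbook

def cells_to_excel(recognized_cells, out_path="restored_table.xlsx"):
    """Write OCR results back into the recovered table structure.
    `recognized_cells` is assumed to be a list of (row, col, text) tuples,
    with row/col derived from the intersections found above."""
    wb = Workbook()
    ws = wb.active
    for row, col, text in recognized_cells:
        ws.cell(row=row + 1, column=col + 1, value=text)  # openpyxl is 1-indexed
    wb.save(out_path)
    return out_path

# usage: cells_to_excel([(0, 0, "design"), (0, 1, "Zhang San")])
```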
It should be noted that the tables in drawings are generally clear, so extracting the table structure with morphology already achieves a high accuracy, which removes the need to train yet another model for this task. Model training, especially neural network training, requires a large amount of data, and data collection and labeling would therefore consume a large amount of manpower and material resources. Furthermore, the industry has not reached a consensus on what labels are needed for table structure recognition, so the embodiment of the present application uses morphology to solve the table structure extraction task.
The importance of restoring the tables in the drawing to excel format lies in structurally organizing the disordered information produced by object detection and OCR and highlighting the relationships between pieces of information. Although natural language processing techniques can train a model to judge the relationships between entities in a text, some relationships in drawing information cannot be judged by such a model; for example, if the structure of the table information is not restored, the relationship between a person's name and a role in the table cannot be identified. Therefore, structuring the information greatly improves the accuracy of the subsequent natural language processing.
Therefore, the form structure extraction task is solved by adopting morphology, the information is structurally arranged, and the extraction accuracy is improved.
In step S103, data in a predetermined format is extracted from the entity words. The data in the predetermined format may be data that can be recognized by a computer, and is not particularly limited herein.
It will be appreciated that named entity recognition belongs to the field of natural language processing, and the most prominent characteristic of natural language processing tasks is that context information must be considered. The named entity recognition model is therefore constructed as follows:
(1) when inputting data, not only the word vector of the word segmentation phrase is included, but also the position information is used as a part of the input data. The position information is embodied by two aspects: firstly, the position information of the word is coded to obtain a characteristic vector, and secondly, the position information (context) of the sentence in which the word is positioned is coded. Therefore, the named entity recognition model input should include three parts of word vector, position information and sentence context information.
(2) In order to give the model stronger context prediction capability, a masking mechanism is used during training to randomly mask part of the words, with a ratio of about 15%; once a word is selected for masking, it is directly replaced by a zero vector in 80% of cases, replaced by any other word in 10% of cases, and left unchanged in the remaining 10% (a sketch of this masking scheme is given after this list).
(3) Traditional recurrent and convolutional neural networks cannot solve the long-range dependency problem, so a self-attention mechanism is introduced into the algorithm to solve it.
(4) The model is designed with 12 layers, each containing multiple self-attention mechanisms, and in order to alleviate the overfitting problem brought by a deep network, a residual connection mechanism is introduced to reduce gradient vanishing.
Thus, a named entity recognition model is obtained, the structural details of which can be shown in fig. 3.
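A sketch of the masking scheme from item (2) above; operating on token ids and reserving id 0 for the zero-vector replacement are assumptions made for illustration:

```python
import random

ZERO_TOKEN = 0        # assumed id whose embedding is the zero vector

def mask_tokens(token_ids, vocab_size, mask_ratio=0.15, seed=None):
    """Randomly select ~15% of tokens; of those, 80% become the zero vector,
    10% become a random other token, 10% are left unchanged."""
    rng = random.Random(seed)
    masked = list(token_ids)
    targets = [-1] * len(token_ids)          # -1 = position not used in the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_ratio:
            continue
        targets[i] = tok                     # the model must reconstruct the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = ZERO_TOKEN
        elif r < 0.9:
            masked[i] = rng.randrange(vocab_size)
        # else: keep the original token unchanged
    return masked, targets
```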
Further, the named entity recognition model can be trained, and information can then be extracted from the text data based on it. Specifically, the model is used as a backbone network and trained on the SQuAD data set; after training, high-order feature vectors corresponding to the corpus are obtained, and these vectors can be used for various tasks such as text classification, intelligent dialogue and named entity recognition. For the present natural language processing task, a new model is obtained simply by adding a softmax normalization layer after the backbone, and incremental training for named entity recognition is performed on this new model. According to the business requirements, the entity categories to be recognized are determined as: building designation, floor designation, specialty, equipment and component (other categories may be added later according to business needs). Corpus data is then obtained from the cropped documents, and training data is obtained by labeling it according to these entity categories. In a specific implementation, 10000 labeled samples were collected, and the model parameters during training were: epochs (10), learning rate (0.00002), batch size (32), warm-up ratio (0.1), maximum sequence length (128). After training, the prediction accuracy is 85%. That is to say, the trained named entity recognition model can be used for prediction: a text is input, the prediction result, i.e. the text with entity labeling information, is obtained after the model computation, and finally the entities are taken from the text according to the labeling information.
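The last step, taking the entity words out of the labeled text, can be sketched as follows; the BIO tag scheme and the category names are assumptions chosen to match the categories listed above:

```python
def extract_entities(tokens, tags):
    """Collect entity words from a BIO-tagged sequence, e.g.
    tokens=['6','号','楼'], tags=['B-BUILDING','I-BUILDING','I-BUILDING'] -> [('BUILDING', '6号楼')]."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((label, "".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                entities.append((label, "".join(current)))
            current, label = [], None
    if current:
        entities.append((label, "".join(current)))
    return entities
```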
In this way, the text with entity labeling information is obtained, and the entity words are taken directly from that text according to the labels.
Optionally, in some implementations, extracting data in a predetermined format from the entity word includes: judging whether the entity words meet preset complex sentence pattern conditions or not; and if the preset complex sentence pattern condition is met, extracting the structured information of the entity words according to the dependency syntax analysis, updating the entity words according to the extraction result, and extracting data in a preset format from the updated entity words.
It can be understood that the result of the above processing is: "building: No. 3, No. 6; floor: tenth to twelfth; equipment: smoke exhaust fan, infrared curtain". However, the floor and building information is not yet detailed enough and still has to be processed into an expression that a computer can recognize directly, namely building: [3, 6], floor: [10, 11, 12]. Once stored in a database, this information can be used directly by any other software system, realizing fast and high-precision information extraction. Therefore, the embodiment of the present application needs to express this information purely numerically, which is easy for a computer to read, i.e. process it by dependency syntax analysis.
First, the embodiments of the present application can design a dependency parsing model. Dependency syntax analysis mainly learns the dependency characteristics among phrases, and the model construction steps are as follows:
(1) A nonlinear scoring model based on a neural network automatically learns long-distance dependency features (low-dimensional dense vector representations) by virtue of the feature learning capability of the model. The scoring model is based on a Multilayer Perceptron (MLP for short) and adopts a first-order factorization strategy, in the following form:
$$\mathrm{Score}(w_h, w_m; \phi) = \mathrm{MLP}(\Phi(w_h, w_m); \phi)$$
where $\Phi(w_h, w_m)$ is the feature representation of the dependency edge $(w_h, w_m)$, and $\phi$ are the network model parameters that need to be obtained through training.
(2) The feature vectors are randomly initialized; for features whose values are phrases or characters, word vectors are used instead. The feature vectors are trained together with the scoring model, and the interaction between features is learned through the nonlinear mechanism of the scoring model (the output layer uses a tanh activation function).
(3) Finally, feature learning is performed based on a bidirectional LSTM, and the input of the scoring model is directly defined as the concatenation of the bidirectional LSTM output vectors corresponding to the head word and the dependent word.
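A sketch of such a scorer in Python with PyTorch; the embedding and hidden sizes are assumptions, and only the head/dependent concatenation plus tanh-MLP scoring described above is reproduced:

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """Score(w_h, w_m) = MLP([BiLSTM(head); BiLSTM(dependent)]); sizes are illustrative."""
    def __init__(self, vocab_size, emb_dim=100, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden, hidden),   # concatenation of head and dependent vectors
            nn.Tanh(),                       # nonlinearity mentioned in step (2)
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens, head_idx, dep_idx):
        h, _ = self.encoder(self.emb(tokens))            # (batch, seq_len, 2*hidden)
        batch = torch.arange(h.size(0))
        pair = torch.cat([h[batch, head_idx], h[batch, dep_idx]], dim=-1)
        return self.mlp(pair).squeeze(-1)                # one score per (head, dependent) pair
```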
Second, editable information is generated from the information based on the dependency parsing model.
Specifically, conventional graph-based dependency parsing mostly adopts a max-margin training method, so the training principle of the present application may be as follows:
(1) Given a training instance $(x^{(i)}, y^{(i)}) \in D$, where $D$ denotes the training set and $y^{(i)}$ is the correct dependency tree of sentence $x^{(i)}$, let $T(x^{(i)})$ denote all possible dependency parse trees of sentence $x^{(i)}$.
(2) The goal of the max-margin principle is to find an optimal set of model parameters $\theta$ such that, for $x^{(i)}$, the score of the correct dependency tree is greater than the score of any incorrect dependency tree $\hat{y} \in T(x^{(i)})$ by at least a margin proportional to the structural loss $\Delta(y^{(i)}, \hat{y})$ of the incorrect tree, namely:

$$\mathrm{Score}(x^{(i)}, y^{(i)}; \theta) \geq \mathrm{Score}(x^{(i)}, \hat{y}; \theta) + \Delta(y^{(i)}, \hat{y}), \quad \forall \hat{y} \in T(x^{(i)})$$
the loss function is as follows:
$$L(\theta) = \sum_{(x^{(i)}, y^{(i)}) \in D} \max\Big(0,\; \max_{\hat{y} \in T(x^{(i)})} \big[\mathrm{Score}(x^{(i)}, \hat{y}; \theta) + \Delta(y^{(i)}, \hat{y})\big] - \mathrm{Score}(x^{(i)}, y^{(i)}; \theta)\Big)$$

$$\Delta(y^{(i)}, \hat{y}) = \sum_{j} \mathbb{1}\big[h(\hat{y}, j) \neq h(y^{(i)}, j)\big]$$

where $h(y, j)$ denotes the head of the $j$-th word in tree $y$, so the structural loss counts the number of words assigned an incorrect head.
(3) Adam is selected as the optimizer to carry out the model training.
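A sketch of this max-margin training step in PyTorch; `score_gold`, `score_pred` and `delta` stand for the quantities in the formulas above, and the learning rate is an assumption:

```python
import torch

def margin_loss(score_gold, score_pred, structural_loss):
    """Hinge loss max(0, Score(wrong) + Delta - Score(correct)), as in the
    max-margin objective above; `structural_loss` is Delta(y, y_hat)."""
    return torch.clamp(score_pred + structural_loss - score_gold, min=0).mean()

# A hypothetical training step with Adam as the optimizer (step (3)):
# optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)
# loss = margin_loss(score_gold, score_best_wrong, delta)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```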
Furthermore, after the model training is finished, the dependency relationship among phrases or words in the text can be predicted, and the accuracy is higher than 87%.
Therefore, the dependency relationship among phrases or words in the text is predicted, the accuracy is ensured, and meanwhile, the information is extracted quickly and accurately.
Optionally, in some implementations, the drawing information extraction method further includes: and if the preset complex sentence pattern condition is not met, directly extracting data in a preset format from the entity words.
Specifically, in the embodiment of the present application, for non-complex sentence patterns the target entity can be obtained directly, while for complex sentence patterns the target entity is obtained after relation extraction and storage. For example, after expressions such as "the third to fifth floors" and "buildings No. 7 and No. 8 of the Bozhilin research center" are processed, the concrete numeric content is obtained, namely floors (3, 4, 5) and building numbers (7, 8); a sketch of such numeric expansion is given below.
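A minimal sketch of the numeric expansion of range expressions; the numeral table and the separator set are assumptions, and genuinely complex sentence patterns would first go through the dependency parsing described above:

```python
import re

CN_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
             "六": 6, "七": 7, "八": 8, "九": 9, "十": 10, "十一": 11, "十二": 12}

def to_int(token: str) -> int:
    """Convert an Arabic or simple Chinese numeral (assumption: values up to 12)."""
    return int(token) if token.isdigit() else CN_DIGITS[token]

def expand_range(expr: str):
    """'三至五' or '10-12' -> [3, 4, 5] / [10, 11, 12]; a single number -> [n]."""
    m = re.match(r"(\S+?)[至到~-](\S+)", expr)
    if m:
        start, end = to_int(m.group(1)), to_int(m.group(2))
        return list(range(start, end + 1))
    return [to_int(expr)]

print(expand_range("三至五"))   # [3, 4, 5]
print(expand_range("10-12"))    # [10, 11, 12]
```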
In order to enable those skilled in the art to further understand the drawing information extraction method according to the embodiment of the present application, a description is given below with reference to a specific embodiment.
As shown in fig. 4, the drawing information extraction method includes the following steps:
s401, sample pretreatment.
During sample preprocessing, samples can be preprocessed by text format conversion and format screening, so that directly processable files (images, pdf, txt and Microsoft office files) are obtained.
S402, inputting a sample.
S403, judging whether the input sample is a pdf file or an image, if so, executing step S404, otherwise, executing step S407.
That is to say, the embodiment of the application checks whether the file is in pdf format or an image; if so, target detection combined with OCR technology is used to obtain the text and tables in the file; otherwise, the file is processed directly with open-source libraries such as openpyxl, python-docx and python-pptx, so that the complete extraction of the file content is achieved (a sketch of this format dispatch is given after the step list below).
And S404, detecting the target.
S405, the process is performed by OCR.
S406, extracting a table structure.
And S407, acquiring characters and tables.
S408, named entity recognition.
S409, judging whether the entity is a complex entity, if so, executing the step S410, otherwise, executing the step S412.
And S410, dependency syntax analysis.
S411, information extraction is successful.
S412, information extraction is successful.
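A sketch of the format dispatch mentioned in step S403; `detect_and_ocr` is a hypothetical placeholder for steps S404 to S407, and the extension sets are assumptions:

```python
from pathlib import Path
from openpyxl import load_workbook
from docx import Document          # python-docx
from pptx import Presentation      # python-pptx

def detect_and_ocr(path: str):
    # Placeholder for steps S404-S407 (target detection, OCR, table structure extraction).
    raise NotImplementedError("plug in the detection + OCR pipeline here")

def extract_file(path: str):
    """Route a sample by format: pdf/images go through detection + OCR,
    office and text files are read directly with open-source libraries."""
    ext = Path(path).suffix.lower()
    if ext in {".pdf", ".png", ".jpg", ".jpeg"}:
        return detect_and_ocr(path)
    if ext == ".xlsx":
        wb = load_workbook(path, data_only=True)
        return [[cell.value for cell in row] for ws in wb.worksheets for row in ws.iter_rows()]
    if ext == ".docx":
        return [p.text for p in Document(path).paragraphs]
    if ext == ".pptx":
        return [shape.text for slide in Presentation(path).slides
                for shape in slide.shapes if shape.has_text_frame]
    if ext == ".txt":
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"unsupported format: {ext}")
```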
According to the drawing information extraction method provided by the embodiment of the application, the information area in the picture can be identified according to the target detection model, the text data in the information area is extracted, named entity recognition is performed on the text data to obtain entity words, and data in a predetermined format is extracted from the entity words. In this way, the drawing information is rapidly extracted based on optical character recognition, target detection, named entity recognition and dependency syntax analysis, and automatic batch file import, parsing and information extraction are realized, thereby solving the problems of the related art that heavy dependence on Revit software brings great limitations, that most attribute information is obtained through mapping so a large amount of attribute information may be lost, that only information in a predefined attribute set can be extracted while the global content cannot be parsed and extracted, that intelligent means for analyzing the information content are lacking so the information semantics and important latent information cannot be analyzed, and that the process depends on manual work and cannot be carried out efficiently, quickly and in batches.
Next, a drawing information extraction device proposed according to an embodiment of the present application is described with reference to the drawings.
Fig. 5 is a block diagram schematically illustrating a drawing information extraction device according to an embodiment of the present application.
As shown in fig. 5, the drawing information extraction apparatus 10 includes: an identification module 100, an acquisition module 200 and an extraction module 300.
The identification module 100 is configured to identify an information area in the picture according to the target detection model, where the information area includes a text area and/or a table area;
the obtaining module 200 is configured to extract text data in the information area, and perform named entity recognition on the text data to obtain entity words;
the extraction module 300 is used for extracting data in a predetermined format from the entity words.
Optionally, the identification module 100 is specifically configured to:
acquiring a positive sample picture and a negative sample picture with a preset proportion, wherein the positive sample picture comprises a table area, and the negative sample picture does not comprise the table area;
marking the positive sample picture and the negative sample picture according to a preset marking mode;
and training the marked positive sample picture and the marked negative sample picture through a preset target algorithm to obtain a target detection model.
Optionally, the obtaining module 200 is specifically configured to:
segmenting an information area in the picture through a target detection model;
and performing text processing on the information area based on the optical character recognition model to obtain text data.
Alternatively, when the information region is a text region, the obtaining module 200 includes:
and determining a text with entity labeling information according to a preset named entity recognition model and text data, and acquiring entity words from the text with the entity labeling information.
Optionally, when the information area is a table area, the obtaining module 200 includes:
extracting a table structure of the table area based on a preset table structure extraction model;
performing text processing on the table structure through an optical character recognition model to obtain text data in the table;
generating an editable table file based on the text data and the table structure in the table;
and determining a text with entity labeling information according to a preset named entity recognition model and an editable table file, and acquiring an entity word from the text with the entity labeling information.
Optionally, the extraction module 300 further comprises:
judging whether the entity words meet preset complex sentence pattern conditions or not;
and if the preset complex sentence pattern condition is met, extracting the structured information of the entity words according to the dependency syntax analysis, updating the entity words according to the extraction result, and extracting data in a preset format from the updated entity words.
Optionally, the drawing information extraction device 10 further includes:
and if the preset complex sentence pattern condition is not met, directly extracting data in a preset format from the entity words.
It should be noted that the explanation of the embodiment of the drawing information extraction method is also applicable to the drawing information extraction device of the embodiment, and is not repeated here.
According to the drawing information extraction device provided by the embodiment of the application, the information area in the picture can be identified according to the target detection model, the text data in the information area is extracted, named entity recognition is performed on the text data to obtain entity words, and data in a predetermined format is extracted from the entity words. In this way, the drawing information is rapidly extracted based on optical character recognition, target detection, named entity recognition and dependency syntax analysis, and automatic batch file import, parsing and information extraction are realized, thereby solving the problems of the related art that heavy dependence on Revit software brings great limitations, that most attribute information is obtained through mapping so a large amount of attribute information may be lost, that only information in a predefined attribute set can be extracted while the global content cannot be parsed and extracted, that intelligent means for analyzing the information content are lacking so the information semantics and important latent information cannot be analyzed, and that the process depends on manual work and cannot be carried out efficiently, quickly and in batches.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
a memory 601, a processor 602, and a computer program stored on the memory 601 and executable on the processor 602.
The processor 602 executes the program to implement the drawing information extraction method provided in the above-described embodiment.
Further, the electronic device further includes:
a communication interface 603 for communication between the memory 601 and the processor 602.
The memory 601 is used for storing computer programs that can be run on the processor 602.
The memory 601 may comprise a high-speed RAM memory and may also include a non-volatile memory, such as at least one disk memory.
If the memory 601, the processor 602 and the communication interface 603 are implemented independently, the communication interface 603, the memory 601 and the processor 602 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
Optionally, in a specific implementation, if the memory 601, the processor 602, and the communication interface 603 are integrated on a chip, the memory 601, the processor 602, and the communication interface 603 may complete mutual communication through an internal interface.
The processor 602 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, characterized in that the program realizes the drawing information extraction method as above when executed by a processor.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Moreover, various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without being mutually inconsistent.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present application, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more N executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A drawing information extraction method is characterized by comprising the following steps:
identifying an information area in the picture according to a target detection model, wherein the information area comprises a character area and/or a table area;
extracting text data in the information area, and carrying out named entity recognition on the text data to obtain entity words;
and extracting data in a preset format from the entity words.
2. The method of claim 1, wherein the object detection model is trained by:
acquiring a positive sample picture and a negative sample picture in a preset proportion, wherein the positive sample picture comprises the table area, and the negative sample picture does not comprise the table area;
marking the positive sample picture and the negative sample picture according to a preset marking mode;
and training the marked positive sample picture and the marked negative sample picture through a preset target algorithm to obtain the target detection model.
3. The method of claim 1, wherein the extracting text data in the information area comprises:
segmenting the information area in the picture through the target detection model;
and performing text processing on the information area based on an optical character recognition model to obtain the text data.
4. The method according to claim 1, wherein when the information area is a character area, the performing named entity recognition on the text data to obtain entity words comprises:
and determining a text with entity labeling information according to a preset named entity recognition model and the text data, and acquiring the entity words from the text with the entity labeling information.
5. The method according to claim 1, wherein when the information area is a table area, the performing named entity recognition on the text data to obtain entity words comprises:
extracting a table structure of the table area based on a preset table structure extraction model;
performing text processing on the table structure through the optical character recognition model to obtain text data in the table;
generating an editable table file based on the text data within the table and the table structure;
and determining a text with entity labeling information according to the preset named entity recognition model and the editable table file, and acquiring the entity word from the text with the entity labeling information.
6. The method of claim 1, wherein the extracting data in a predetermined format from the entity words comprises:
judging whether the entity words meet preset complex sentence pattern conditions or not;
and if the preset complex sentence pattern condition is met, extracting the structured information of the entity words according to the dependency syntax analysis, updating the entity words according to the extraction result, and extracting data in a preset format from the updated entity words.
7. The method of claim 6, further comprising:
and if the preset complex sentence pattern condition is not met, directly marking the picture according to the entity words to obtain data in a preset format.
8. A drawing information extraction apparatus, characterized by comprising:
the identification module is used for identifying an information area in the picture according to the target detection model, wherein the information area comprises a character area and/or a table area;
the acquisition module is used for extracting text data in the information area and carrying out named entity recognition on the text data to acquire entity words;
and the extraction module is used for extracting data in a preset format from the entity words.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the drawing information extraction method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, the program being executed by a processor for implementing the drawing information extraction method according to any one of claims 1 to 7.
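
Purely as an illustrative companion to claims 6 and 7, the sketch below shows one possible organisation of the complex-sentence branch. The regular-expression sentence-pattern test and the use of a spaCy dependency parser (e.g. the separately installed zh_core_web_sm model) are assumptions made for this example and are not features recited by the claims.

    # Illustrative only: a hypothetical "preset complex sentence pattern" check
    # followed by dependency syntactic analysis; simple phrases are kept as-is.
    import re
    import spacy

    # Assumption: a spaCy pipeline with a dependency parser is available,
    # e.g. after `python -m spacy download zh_core_web_sm`.
    nlp = spacy.load("zh_core_web_sm")

    # Hypothetical pattern: treat entity words containing connectors or
    # clause punctuation as "complex sentences" needing structural analysis.
    COMPLEX_PATTERN = re.compile(r"[，；、]|并|且|及|和")

    def to_preset_format(entity_word: str) -> dict:
        """Return preset-format data for one entity word (claims 6 and 7 branch)."""
        if not COMPLEX_PATTERN.search(entity_word):
            # Condition not met: label the picture directly with the entity word.
            return {"value": entity_word, "source": "direct"}
        # Condition met: extract structured information via dependency analysis,
        # here as (token, dependency relation, head) triples.
        doc = nlp(entity_word)
        triples = [(tok.text, tok.dep_, tok.head.text) for tok in doc]
        return {"value": entity_word, "source": "dependency", "structure": triples}

For a simple label the function returns the word unchanged for direct marking, while a compound description is decomposed into head-dependent triples that can then be mapped to the data in the preset format.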
CN202110239779.2A 2021-03-04 2021-03-04 Drawing information extraction method and device, electronic equipment and storage medium Pending CN115034200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239779.2A CN115034200A (en) 2021-03-04 2021-03-04 Drawing information extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110239779.2A CN115034200A (en) 2021-03-04 2021-03-04 Drawing information extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115034200A true CN115034200A (en) 2022-09-09

Family

ID=83117674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110239779.2A Pending CN115034200A (en) 2021-03-04 2021-03-04 Drawing information extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115034200A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618852A (en) * 2022-11-22 2023-01-17 山东天成书业有限公司 Text digital automatic proofreading system
CN116167727A (en) * 2023-04-25 2023-05-26 公安部信息通信中心 Image analysis-based flow node identification and processing system
CN116758578A (en) * 2023-08-18 2023-09-15 上海楷领科技有限公司 Mechanical drawing information extraction method, device, system and storage medium
CN116758578B (en) * 2023-08-18 2023-11-07 上海楷领科技有限公司 Mechanical drawing information extraction method, device, system and storage medium
CN117611710A (en) * 2023-12-07 2024-02-27 南京云阶电力科技有限公司 Terminal strip drawing vectorization method and system based on deep learning and image processing

Similar Documents

Publication Publication Date Title
US10817717B2 (en) Method and device for parsing table in document image
CN115034200A (en) Drawing information extraction method and device, electronic equipment and storage medium
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
US20240013005A1 (en) Method and system for identifying citations within regulatory content
US20240012846A1 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
CN112434691A (en) HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
CN115424282A (en) Unstructured text table identification method and system
US20230267345A1 (en) Form structure extraction by predicting associations
CN113393370A (en) Method, system and intelligent terminal for migrating Chinese calligraphy character and image styles
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
Colter et al. Tablext: A combined neural network and heuristic based table extractor
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
CN112269872A (en) Resume analysis method and device, electronic equipment and computer storage medium
CN111461121A Electric meter number identification method based on YOLOv3 network
CN115082659A (en) Image annotation method and device, electronic equipment and storage medium
CN112949455B (en) Value-added tax invoice recognition system and method
CN111414889B (en) Financial statement identification method and device based on character identification
CN112613367A (en) Bill information text box acquisition method, system, equipment and storage medium
CN112149523B (en) Method and device for identifying and extracting pictures based on deep learning and parallel-searching algorithm
CN115937887A (en) Method and device for extracting document structured information, electronic equipment and storage medium
US20220172301A1 (en) System and method for clustering an electronic document that includes transaction evidence
CN113111869B (en) Method and system for extracting text picture and description thereof
Cherepanov et al. On automated workflow for fine-tuning deepneural network models for table detection in document images
CN115546801A (en) Method for extracting paper image data features of test document
CN113821555A (en) Unstructured data collection processing method of intelligent supervision black box

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230627

Address after: A4-04, Floor 4, Building A1, No. 1, Panpu Road, Country Garden Community, Beijiao Town, Shunde District, Foshan, Guangdong 528311

Applicant after: Guangdong Bozhilin Software Technology Co.,Ltd.

Address before: 528311 a2-05, 2nd floor, building A1, 1 Panpu Road, Biguiyuan community, Beijiao Town, Shunde District, Foshan City, Guangdong Province

Applicant before: GUANGDONG BOZHILIN ROBOT Co.,Ltd.

TA01 Transfer of patent application right