CN115035541A - Large-size complex pdf engineering drawing text detection and identification method

Info

Publication number: CN115035541A
Application number: CN202210735421.3A
Authority: CN
Inventors: 姚昊; 潘炼; 伍吉泽; 李武平; 沈祯杰; 刘忠良; 李清; 熊伟; 张永兴; 李强
Original assignee: CNNC Nuclear Power Operation Management Co Ltd
Current assignee: CNNC Nuclear Power Operation Management Co Ltd
Priority date: 2022-06-27
Filing date: 2022-06-27
Publication date: 2022-09-09

Abstract

The invention provides a text detection and identification method for large-size complex PDF engineering drawings, comprising the following steps: step S1: preprocessing the PDF engineering drawing to generate a corresponding high-resolution image; step S2: cutting the high-resolution image into a plurality of low-resolution sub-images, and recording the sequence number of each sub-image according to its position; step S3: performing a first text detection on each sub-image, preliminarily locating the text regions in the sub-image, and outputting the position coordinates of each region; step S4: mapping the position coordinates of the text regions in the sub-images to the original large image, removing duplicate data in the large image, and extracting the corresponding text region images according to the deduplicated position coordinates; step S5: performing a second text detection, accurately locating the text within each text region, and cutting out the corresponding text block; step S6: performing text recognition on each text block, and extracting the text content of the block together with its coordinate position. The method improves text recognition accuracy on complex drawings.

Description

Large-size complex pdf engineering drawing text detection and identification method

Technical Field

The invention relates to the technical field of drawing and text management for nuclear power plants, and in particular to a text detection and identification method for large-size complex PDF engineering drawings.

Background

In engineering, it is often necessary to establish the relationship between a drawing and its text content, so that information such as material codes and component numbers, along with the drawings in which they appear, can be queried quickly. In the past, most of this work was done manually, which is inefficient and extremely costly in labor when the text of a large number of drawings must be processed. A method that automatically recognizes the text content of drawings is therefore needed to replace manual work, so that text can be extracted from large numbers of PDF drawings at lower labor cost and with higher efficiency.

Currently, text recognition for drawings generally involves two steps: text detection and text recognition. Text detection finds the text regions in a drawing, locating the text in the image and outputting the position coordinates of each text region; text recognition outputs the text corresponding to each text region in the drawing.

Conventional drawing text extraction methods therefore suffer from high cost and low efficiency, and cope poorly with complex drawing content.

Disclosure of Invention

The invention aims to overcome the above shortcomings of the prior art by providing a low-cost, high-efficiency method for detecting and identifying text in large-size complex PDF engineering drawings.

In order to achieve the above purpose, the invention provides the following technical scheme:

A method for detecting and identifying text in a large-size complex PDF engineering drawing comprises the following steps:

step S1: preprocessing pdf engineering drawings to generate corresponding high-resolution images;

step S2: cutting the high-resolution image into a plurality of low-resolution subgraphs, and recording the corresponding sequence of the subgraphs according to the positions;

step S3: carrying out first sub-image text detection, preliminarily positioning a text region range in the sub-image, and outputting position coordinates corresponding to the range;

step S4: mapping the position coordinates of the text regions in the sub-image to the original large image, removing repeated data in the large image, and acquiring corresponding text region images according to the position coordinates after the repetition removal;

step S5: performing second text detection, accurately positioning the text in the text area, and cutting a corresponding text block;

step S6: performing text recognition on the text block, and extracting the text content of the text block and its corresponding coordinate position.

In step S2, the high resolution image is cut into several low resolution sub-images by using sliding window cropping.
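The sliding-window cropping of step S2 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `sliding_windows` and the tuple layout are hypothetical, and the image size (3680×2944), window size (736) and step (368) are the values given in the embodiment described later in this document. The sketch assumes the image dimensions minus the window size are exact multiples of the step, as they are for those values.

```python
from typing import Iterator, Tuple

Tile = Tuple[int, int, int, int, int, int]  # (i, j, left, top, right, bottom)

def sliding_windows(width: int, height: int, win_w: int, win_h: int,
                    step_x: int, step_y: int) -> Iterator[Tile]:
    """Yield crop boxes for a sliding window, indexed by slide counts (i, j).

    Assumption: (width - win_w) and (height - win_h) are multiples of the
    steps, so the windows reach the right and bottom edges exactly.
    """
    n_x = (width - win_w) // step_x + 1   # number of horizontal positions
    n_y = (height - win_h) // step_y + 1  # number of vertical positions
    for j in range(n_y):
        for i in range(n_x):
            left, top = i * step_x, j * step_y
            yield (i, j, left, top, left + win_w, top + win_h)

tiles = list(sliding_windows(3680, 2944, 736, 736, 368, 368))
print(len(tiles))   # 63 (9 horizontal x 7 vertical positions)
print(tiles[-1])    # (8, 6, 2944, 2208, 3680, 2944)
```

Each box could then be passed to an image library's crop routine; keeping `(i, j)` with the box is what later allows detections to be mapped back to the large image.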

In step S3, text detection on the sub-image is performed using the AdvancedEAST method, and the rough position information of the text regions in the sub-image is preliminarily obtained.

Step S4 includes:

step S41: mapping the coordinate position in the step S3 to the original high-resolution large image;

step S42: removing repeated data in the coordinate information;

step S43: cutting out the corresponding text region image according to the deduplicated position coordinates.
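Step S41 can be sketched as below, using the mapping given in the detailed description (large-image X equals i times the horizontal step plus the sub-image x, and likewise for Y with j and the vertical step). The function name `map_quad_to_large_image` and the example coordinates are hypothetical; only the formula comes from the source.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def map_quad_to_large_image(quad: List[Point], i: int, j: int,
                            step_x: float, step_y: float) -> List[Point]:
    """Shift the four vertices of a detected region from sub-image
    coordinates to large-image coordinates.

    i, j are the horizontal and vertical slide counts of the sub-image;
    step_x, step_y are the sliding steps (delta x, delta y).
    """
    return [(i * step_x + x, j * step_y + y) for x, y in quad]

# Hypothetical detection inside the sub-image at i=2, j=1, with step 368.
local_quad = [(10, 20), (110, 20), (110, 50), (10, 50)]
global_quad = map_quad_to_large_image(local_quad, i=2, j=1, step_x=368, step_y=368)
print(global_quad)  # [(746, 388), (846, 388), (846, 418), (746, 418)]
```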

In step S5, the text region image obtained in step S4 is subjected to a second text detection, the text is accurately positioned, and a corresponding text image is cut out.

In step S6, a PaddleOCR text recognition scheme is used to complete text recognition of the text image obtained in step S5, and finally the text content and the corresponding image area coordinates are output.

Compared with the prior art, the text detection and identification method for the large-size complex pdf engineering drawing provided by the invention has the following beneficial effects:

The method can accurately detect the valid text regions in a large-size complex PDF engineering drawing, including the coordinate information of both horizontal and vertical text regions, and accurately recognize the text content within them.

In addition, two consecutive text detections effectively prevent lines, patterns and similar interference from degrading recognition, improving text recognition accuracy on complex drawings.

Furthermore, sliding-window block processing allows the text detection and identification method to be applied to large-size drawings while avoiding the risk of cutting through continuous text.
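One way to see why overlapping windows reduce the risk of cutting text is the one-dimensional check below. It is an observation that follows from the window geometry, not a statement made in the source: with window width w and step Δx, any span no longer than w − Δx lies entirely inside at least one window. The function name `covering_window` is hypothetical, and the numbers are those of the embodiment (w = 736, Δx = 368, image width 3680).

```python
from typing import Optional

def covering_window(start: int, length: int, image_size: int,
                    win: int, step: int) -> Optional[int]:
    """Return the index of a window that fully contains the span
    [start, start + length], or None if every window cuts it."""
    n = (image_size - win) // step + 1
    for i in range(n):
        if i * step <= start and start + length <= i * step + win:
            return i
    return None

# Every span of length w - step = 368 is fully covered by some window.
ok = all(covering_window(a, 368, 3680, 736, 368) is not None
         for a in range(0, 3680 - 368 + 1))
print(ok)  # True

# A longer span can straddle every window that touches it.
print(covering_window(367, 400, 3680, 736, 368))  # None
```

Under these assumptions, a text line up to 368 pixels long would be seen whole by at least one sub-image; longer text may still be split, which is one reason the later merging step matters.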

Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a text detection and identification method for a large-size complex pdf engineering drawing according to an embodiment of the present invention.

Detailed Description

The following is a detailed description of the preferred embodiments.

The invention provides a text detection and identification method for large-size complex PDF engineering drawings, which can broadly be divided into four parts. First, PDF drawing processing: the PDF drawing is converted into a high-resolution image, which is split in order into sub-images of fixed size. Second, two-pass text detection to locate text regions accurately: the first pass detects each sub-image and finds the rough extent of the regions containing text; the second pass works on the regions found by the first pass, eliminates the interference present in them, and locates the text precisely. Third, text coordinate processing: coordinates in the sub-images are mapped to the high-resolution large image, and duplicate coordinate data are filtered out. Fourth, text recognition: the text content of each region is recognized from the text detection result, and the text content is output together with its coordinate position.

As shown in fig. 1, the method for detecting and identifying the text of the large-size complex pdf engineering drawing provided by the invention comprises the following steps:

step S1: preprocessing the PDF engineering drawing to generate a corresponding high-resolution image, for example a 3680×2944 image at the ten-megapixel level;

step S2: cutting the high-resolution image into a plurality of smaller sub-images by sliding-window cropping, and recording the sequence number of each sub-image according to the horizontal and vertical slide counts i and j of the cropping window. Specifically, for the 3680×2944 large image, the width w and height h of the sliding window, and hence of each sub-image $I_{i,j}$, are both 736, the horizontal step Δx and the vertical step Δy are both 368, and 63 sub-images are obtained in total (9 horizontal positions × 7 vertical positions);

step S3: performing text detection on each sub-image using the AdvancedEAST method to preliminarily obtain the rough position of each text region in the sub-image, expressed as the four vertices of a rectangular text region, i.e. 8 coordinate values $(x_0, y_0), \ldots, (x_3, y_3)$;

Step S4: mapping the position coordinates of the text region in the sub-graph to the original large graph, removing repeated data in the large graph, and acquiring a corresponding text region image according to the position coordinates after the repeated data are removed;

step S41: mapping the coordinate positions obtained in step S3 to the original high-resolution large image, using the following coordinate mapping formulas:

$$X_m = i \cdot \Delta x + x_m, \quad m = 0, 1, 2, 3;$$

$$Y_n = j \cdot \Delta y + y_n, \quad n = 0, 1, 2, 3;$$

step S42: removing duplicate data from the coordinate information. Because the sub-images are obtained by sliding-window cropping in step S2, the same text region may be detected several times, yielding multiple sets of coordinates that point to the same region of the original image; these duplicates need to be merged into a single set of coordinates. The criterion for merging duplicate data is as follows:

where $S_i$ denotes a detected text region: if one detected text region contains another, the two are merged by keeping the coordinates of the larger region and discarding those of the smaller region.

Step S5: performing a second text detection on the text region images obtained from the first text detection, accurately locating the text, and cutting out the corresponding text images. The second detection effectively removes interference from lines or patterns other than text in the regions found by the first detection, giving more precise text localization and safeguarding the accuracy of subsequent recognition.

Step S6: performing text recognition on the precise text regions obtained by text detection, using the PaddleOCR text recognition scheme, and finally outputting the text content and the coordinate position of the corresponding image region.
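The containment-based merging described in step S42 can be sketched as follows. This is a sketch under stated assumptions, not the patent's exact criterion: regions are reduced to axis-aligned boxes `(left, top, right, bottom)` in large-image coordinates, "contains" is tested on those boxes, and the helper names (`area`, `contains`, `drop_contained`) and sample boxes are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), large-image coords

def area(b: Box) -> int:
    return (b[2] - b[0]) * (b[3] - b[1])

def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer` (edges may coincide)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def drop_contained(boxes: List[Box]) -> List[Box]:
    """Keep the larger region whenever one region contains another.

    Boxes are visited from largest to smallest, so a box is discarded
    only if an already-kept (larger or identical) box contains it.
    """
    kept: List[Box] = []
    for box in sorted(set(boxes), key=lambda b: (-area(b), b)):
        if not any(contains(k, box) for k in kept):
            kept.append(box)
    return kept

detections = [
    (746, 388, 846, 418),    # region seen in one sub-image
    (746, 388, 846, 418),    # same region seen again in an overlapping sub-image
    (750, 390, 840, 415),    # smaller region inside the first one
    (1000, 100, 1200, 140),  # unrelated region
]
print(drop_contained(detections))
# [(1000, 100, 1200, 140), (746, 388, 846, 418)]
```

Partially overlapping boxes that do not contain one another are left untouched here, since the source only describes the containment case.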

Text detection uses the open-source AdvancedEAST scheme, which takes a VGG16 network as the backbone to extract pixel-level features from the drawing, fuses multi-channel features through upsampling, convolution and similar operations, and predicts text regions from the fused features. Text recognition uses the open-source PaddleOCR scheme, based on a CRNN model trained with CTC loss as the loss function.
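For readers unfamiliar with CTC, the sketch below shows greedy (best-path) decoding, a common way to turn the per-time-step predictions of a CTC-trained recognizer into a string: repeated labels are collapsed, then blank labels are dropped. It is illustrative only and does not reproduce PaddleOCR's code; the function name `ctc_greedy_decode`, the toy alphabet, and the choice of index 0 as the blank label are assumptions.

```python
from typing import List, Sequence

def ctc_greedy_decode(best_path: Sequence[int], alphabet: Sequence[str],
                      blank: int = 0) -> str:
    """Collapse consecutive repeats, then remove blanks.

    best_path holds the most likely label index at each time step
    (for example the argmax over the model's per-step outputs).
    """
    chars: List[str] = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            chars.append(alphabet[label])
        previous = label
    return "".join(chars)

# Toy alphabet: index 0 is the blank label (an assumption of this sketch).
alphabet = ["<blank>", "A", "B", "1", "2"]
path = [1, 1, 0, 1, 2, 2, 0, 3, 3, 4]
print(ctc_greedy_decode(path, alphabet))  # AAB12
```

The blank between the two runs of label 1 is what lets a genuinely doubled character ("AA") survive the collapse of repeats.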

The invention provides an application-oriented foundational technology. It solves text detection and identification for PDF engineering drawings that are large (note: the whole PDF drawing cannot be used directly as the input) and complex in content (note: horizontal text, vertical text, and text-like interfering lines or patterns are all present), and can provide technical support for applications involving specific text in large-size complex PDF engineering drawings, such as recognition of equipment codes or material codes, code error-correction recommendation, code location query, and code-to-file association.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A text detection and identification method for a large-size complex pdf engineering drawing is characterized by comprising the following steps:

step S1: preprocessing a pdf engineering drawing to generate a corresponding high-resolution image;

2. The method for detecting and recognizing text in large-sized complex pdf engineering drawing according to claim 1, wherein in step S2, the high-resolution image is cut into several low-resolution sub-images by using sliding window cropping.

3. The method for detecting and recognizing the text of the large-sized complex pdf engineering drawing according to claim 1, wherein in step S3, the text detection of the sub-image is completed by using the AdvancedEAST method, and rough position information of the text region in the sub-image is preliminarily obtained.

4. The method for detecting and identifying the text of the large-size complex pdf engineering drawing according to claim 1, wherein the step S4 comprises:

step S42: removing repeated data in the coordinate information;

5. The method for detecting and recognizing text in large-sized complex pdf engineering drawing according to claim 1, wherein in step S5, the text region image obtained in step S4 is subjected to a second text detection, the text is precisely located, and the corresponding text image is cut out.

6. The method for detecting and recognizing text in large-sized complex pdf engineering drawing according to claim 1, wherein in step S6, a PaddleOCR text recognition scheme is used to complete text recognition of the text image obtained in step S5, and finally the text content and the corresponding image area coordinates are output.