CN112580500A - Information extraction method and device for engineering reply file and electronic equipment - Google Patents

Info

Publication number
CN112580500A
Authority
CN
China
Prior art keywords
line
distance
edge line
vertical
horizontal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011495587.XA
Other languages
Chinese (zh)
Other versions
CN112580500B (en)
Inventor
樊蕊霞
宋俊国
李靖宇
刘自力
李莉
成功
闫键
朱丹
何云波
陈芸
李彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Original Assignee
Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Priority to CN202011495587.XA
Publication of CN112580500A
Application granted
Publication of CN112580500B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/174 Form filling; Merging
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an information extraction method and device for an engineering reply file, and electronic equipment. The method comprises: extracting a project name and the corresponding page number from the engineering reply file; setting a vertical edge line and a horizontal edge line of a target page, and extracting the horizontal line elements, vertical line elements, end points and effective intersection points in the target page; extracting the text elements in a target area; and forming a table file by taking the end points and the effective intersection points as the vertices of cells, filling the text elements into the corresponding cells, and outputting the filled table file. With the technical scheme provided by the embodiments of the invention, a table in the engineering reply file can be quickly converted into a table file, which saves labor. Because the magnitude relationship of the attribute information uniquely represents the relative position between elements, each element can be described with less attribute information, and the attribute information can be used directly, which improves processing efficiency.

Description

Information extraction method and device for engineering reply file and electronic equipment
Technical Field
The invention relates to the technical field of data processing, in particular to an information extraction method and device for engineering reply files, electronic equipment and a computer readable storage medium.
Background
At present, the initial reply file of an engineering project contains information such as the engineering construction content, the standard materials required by the project, and the materials to be dismantled. This information is of great value for compiling annual retirement plans, fully recovering waste materials and the like, so the required information needs to be extracted from the engineering reply file.
Since most engineering reply files are documents in CEB (Chinese E-paper Basic) format, the information currently has to be selected and extracted manually, which is time-consuming and labor-intensive. The standard materials required by the project, the dismantled materials and the like are presented as tables, and table contents are at present extracted mainly with OCR (Optical Character Recognition) technology or neural networks, which involves a large amount of processing and is inefficient.
Disclosure of Invention
In order to solve the existing technical problem, embodiments of the present invention provide an information extraction method and apparatus for an engineering reply file, an electronic device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present invention provides an information extraction method for an engineering reply file, including:
acquiring an engineering reply file to be processed, and extracting a project name and a page number corresponding to the project name from the engineering reply file;
taking the page corresponding to the page number in the engineering reply file as a target page, setting a vertical edge line and a horizontal edge line of the target page, and extracting horizontal line elements and vertical line elements in the target page, wherein the attribute information of the i-th horizontal line element includes: the distance Yi_H between the line and the horizontal edge line, the distance Xi_H1 between the left end of the line and the vertical edge line, and the distance Xi_H2 between the right end of the line and the vertical edge line; and the attribute information of the j-th vertical line element includes: the distance Xj_V between the line and the vertical edge line, the distance Yj_V1 between the lower end of the line and the horizontal edge line, and the distance Yj_V2 between the upper end of the line and the horizontal edge line;
determining the left end point P_Li and the right end point P_Ri of the horizontal line element and the lower end point P_Dj and the upper end point P_Uj of the vertical line element, and determining an effective intersection point P_ij between the horizontal line element and the vertical line element; wherein the left end point P_Li is at distance Yi_H from the horizontal edge line and Xi_H1 from the vertical edge line; the right end point P_Ri is at distance Yi_H from the horizontal edge line and Xi_H2 from the vertical edge line; the lower end point P_Dj is at distance Xj_V from the vertical edge line and Yj_V1 from the horizontal edge line; the upper end point P_Uj is at distance Xj_V from the vertical edge line and Yj_V2 from the horizontal edge line; and if Yi_H is between Yj_V1 and Yj_V2 and Xj_V is between Xi_H1 and Xi_H2, the effective intersection point P_ij is at distance Yi_H from the horizontal edge line and Xj_V from the vertical edge line;
determining a target area in the target page, and extracting text elements in the target area, wherein the attribute information of the a-th text element includes: the distance Xa_1 between its left side and the vertical edge line, the distance Xa_2 between its right side and the vertical edge line, the distance Ya_1 between its lower side and the horizontal edge line, and the distance Ya_2 between its upper side and the horizontal edge line; the target area is the area corresponding to the positions of the horizontal line elements and the vertical line elements;
forming, with the left end point P_Li, the right end point P_Ri, the lower end point P_Dj, the upper end point P_Uj and the effective intersection point P_ij as the vertices of cells, a table file that corresponds to the project name and comprises a plurality of cells, filling the text elements into the corresponding cells according to the attribute information of the text elements, and outputting the filled table file.
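The final step of the method (forming cells from the vertices and filling in the text elements) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: it assumes a regular grid without merged cells, and it assigns each text element to the cell that contains the centre of its bounding box. The function name build_table and the tuple layouts are hypothetical.

```python
from bisect import bisect_right


def build_table(vertices, texts):
    """Build a row-major grid of cell strings (top row first).

    vertices: iterable of (x, y), where x is the distance to the vertical
              edge line and y the distance to the horizontal edge line.
    texts:    iterable of (xa1, xa2, ya1, ya2, content), i.e. left, right,
              lower and upper distances of each text element.
    Assumption: a regular grid without merged cells.
    """
    xs = sorted({x for x, _ in vertices})  # column boundaries, left to right
    ys = sorted({y for _, y in vertices})  # row boundaries, bottom to top
    n_cols, n_rows = len(xs) - 1, len(ys) - 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for xa1, xa2, ya1, ya2, content in texts:
        # Use the centre of the text box to pick the enclosing cell.
        cx, cy = (xa1 + xa2) / 2, (ya1 + ya2) / 2
        col = bisect_right(xs, cx) - 1
        row_from_bottom = bisect_right(ys, cy) - 1
        if 0 <= col < n_cols and 0 <= row_from_bottom < n_rows:
            # Distances grow upwards, so flip to get top-first row order.
            row = n_rows - 1 - row_from_bottom
            table[row][col] += content
    return table
```

Because every attribute is a distance from the same two edge lines, cell lookup reduces to comparing magnitudes, which is why a binary search over the sorted boundaries is sufficient here. Text whose centre falls outside the grid is ignored.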
In a second aspect, an embodiment of the present invention further provides an information extraction apparatus for an engineering reply file, including:
a preprocessing module, configured to acquire an engineering reply file to be processed and extract a project name and a page number corresponding to the project name from the engineering reply file;
a line element extraction module, configured to take the page corresponding to the page number in the engineering reply file as a target page, set a vertical edge line and a horizontal edge line of the target page, and extract horizontal line elements and vertical line elements in the target page, wherein the attribute information of the i-th horizontal line element includes: the distance Yi_H between the line and the horizontal edge line, the distance Xi_H1 between the left end of the line and the vertical edge line, and the distance Xi_H2 between the right end of the line and the vertical edge line; and the attribute information of the j-th vertical line element includes: the distance Xj_V between the line and the vertical edge line, the distance Yj_V1 between the lower end of the line and the horizontal edge line, and the distance Yj_V2 between the upper end of the line and the horizontal edge line;
a point determination module, configured to determine the left end point P_Li and the right end point P_Ri of the horizontal line element and the lower end point P_Dj and the upper end point P_Uj of the vertical line element, and to determine an effective intersection point P_ij between the horizontal line element and the vertical line element; wherein the left end point P_Li is at distance Yi_H from the horizontal edge line and Xi_H1 from the vertical edge line; the right end point P_Ri is at distance Yi_H from the horizontal edge line and Xi_H2 from the vertical edge line; the lower end point P_Dj is at distance Xj_V from the vertical edge line and Yj_V1 from the horizontal edge line; the upper end point P_Uj is at distance Xj_V from the vertical edge line and Yj_V2 from the horizontal edge line; and if Yi_H is between Yj_V1 and Yj_V2 and Xj_V is between Xi_H1 and Xi_H2, the effective intersection point P_ij is at distance Yi_H from the horizontal edge line and Xj_V from the vertical edge line;
a text element extraction module, configured to determine a target area in the target page and extract text elements in the target area, wherein the attribute information of the a-th text element includes: the distance Xa_1 between its left side and the vertical edge line, the distance Xa_2 between its right side and the vertical edge line, the distance Ya_1 between its lower side and the horizontal edge line, and the distance Ya_2 between its upper side and the horizontal edge line; the target area is the area corresponding to the positions of the horizontal line elements and the vertical line elements;
a conversion module, configured to form, with the left end point P_Li, the right end point P_Ri, the lower end point P_Dj, the upper end point P_Uj and the effective intersection point P_ij as the vertices of cells, a table file that corresponds to the project name and comprises a plurality of cells, fill the text elements into the corresponding cells according to the attribute information of the text elements, and output the filled table file.
In a third aspect, an embodiment of the present invention provides an electronic device, including a bus, a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor, where the transceiver, the memory and the processor are connected via the bus, and the computer program, when executed by the processor, implements the steps of the information extraction method for an engineering reply file described in any one of the above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the information extraction method for an engineering reply file described in any one of the above.
According to the information extraction method and apparatus for an engineering reply file, the electronic device and the computer-readable storage medium provided by the embodiments of the present invention, once the line elements and text elements have been extracted, a table in the engineering reply file can be quickly converted into a table file that is convenient for subsequent processing; the table no longer has to be copied manually from the engineering reply file, which saves labor and improves efficiency. A vertical edge line and a horizontal edge line are preset, and the attribute information of the extracted line elements and text elements is determined with these edge lines as the reference. The magnitude relationship of the attribute information uniquely represents the relative position between elements, so each element can be described with less attribute information, and the attribute information can be used directly in subsequent processing without further conversion or calculation, which improves processing efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present invention, the drawings required to be used in the embodiments or the background art of the present invention will be described below.
Fig. 1 shows a flowchart of an information extraction method for an engineering reply file according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of extracting table elements in the information extraction method for an engineering reply file according to an embodiment of the present invention;
Fig. 3a is a schematic diagram of a table with omitted line elements in the information extraction method for an engineering reply file according to an embodiment of the present invention;
Fig. 3b is another schematic diagram of a table with omitted line elements in the information extraction method for an engineering reply file according to an embodiment of the present invention;
Fig. 4 is another schematic diagram of extracting table elements in the information extraction method for an engineering reply file according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an information extraction apparatus for an engineering reply file according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device for executing the information extraction method for an engineering reply file according to an embodiment of the present invention.
Detailed Description
In the description of the embodiments of the present invention, it should be apparent to those skilled in the art that the embodiments of the present invention can be embodied as methods, apparatuses, electronic devices, and computer-readable storage media. Thus, embodiments of the invention may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), a combination of hardware and software. Furthermore, in some embodiments, embodiments of the invention may also be embodied in the form of a computer program product in one or more computer-readable storage media having computer program code embodied in the medium.
The computer-readable storage medium described above may be any combination of one or more computer-readable storage media. The computer-readable storage medium includes: an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium include: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a Flash Memory, an optical fiber, a Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any combination thereof. In embodiments of the invention, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer program code embodied on the computer readable storage medium may be transmitted using any appropriate medium, including: wireless, wire, fiber optic cable, Radio Frequency (RF), or any suitable combination thereof.
Computer program code for carrying out operations of embodiments of the present invention may be written in assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, integrated circuit configuration data, or in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as C or similar programming languages. The computer program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer, or to an external computer, through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN).
The method, the device and the electronic equipment are described through the flow chart and/or the block diagram.
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions. These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner. Thus, the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The embodiments of the present invention will be described below with reference to the drawings.
Fig. 1 shows a flowchart of an information extraction method for an engineering reply file according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step 101: acquiring an engineering reply file to be processed, and extracting a project name and a page number corresponding to the project name from the engineering reply file.
In the embodiment of the invention, the engineering reply file to be processed is an engineering reply file from which required table information, such as the standard materials required by the project and the dismantled materials, needs to be extracted. An engineering reply file has extensive content, including a reply description, sub-project descriptions (each comprising a project name, a table of standard materials required by the project, and a table of dismantled materials) and the like. Each sub-project description begins with the corresponding project name, and project names generally follow a fixed naming format. One such format is 'Chinese numeral + enumeration comma (、) + city name + project', for example 'One, Shanxi Jincheng XX line modification project' or 'Two, Shanxi Changzhi XX line demolition project'. In this embodiment, content in such a naming format can be conveniently parsed with a regular expression, so that the project name and its corresponding page number are extracted. Because the project name is generally not far from the standard material table and the dismantled material table in the engineering reply file, either the page on which the project name appears together with the following page, or the following page alone, can be used as the page number corresponding to the project name; the choice can be set according to the actual situation.
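The regular-expression parsing of step 101 might look like the sketch below. It is an assumption-laden illustration: the pattern supposes that the original Chinese heading consists of a Chinese numeral, the enumeration comma "、", and a name ending in "工程" (project), and that the page texts have already been extracted into a list of strings. The names NAME_RE and find_project_pages are hypothetical.

```python
import re

# Assumed heading shape: Chinese numeral + "、" + name ending in "工程".
NAME_RE = re.compile(r"^([一二三四五六七八九十百]+)、(\S+?工程)\s*$", re.MULTILINE)


def find_project_pages(pages, include_name_page=True):
    """Return a list of (project_name, [page numbers]) pairs.

    pages: list of page texts; pages[0] is page 1.
    include_name_page: if True, use the page holding the name and the next
    page; if False, use only the next page (both options appear in the text).
    """
    results = []
    for idx, text in enumerate(pages):
        page_no = idx + 1
        for match in NAME_RE.finditer(text):
            if include_name_page:
                candidates = [page_no, page_no + 1]
            else:
                candidates = [page_no + 1]
            # Drop page numbers that run past the end of the document.
            valid = [p for p in candidates if p <= len(pages)]
            results.append((match.group(2), valid))
    return results
```

The include_name_page switch mirrors the two page-selection options described above; which one is appropriate depends on how the reply files at hand are laid out.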
Step 102: taking the page corresponding to the page number in the engineering reply file as a target page, setting a vertical edge line and a horizontal edge line of the target page, and extracting horizontal line elements and vertical line elements in the target page, wherein the attribute information of the i-th horizontal line element includes: the distance Yi_H between the line and the horizontal edge line, the distance Xi_H1 between the left end of the line and the vertical edge line, and the distance Xi_H2 between the right end of the line and the vertical edge line; and the attribute information of the j-th vertical line element includes: the distance Xj_V between the line and the vertical edge line, the distance Yj_V1 between the lower end of the line and the horizontal edge line, and the distance Yj_V2 between the upper end of the line and the horizontal edge line.
In the embodiment of the invention, the engineering reply file is generally stored in CEB format. A table can be extracted directly from the CEB-format engineering reply file, or extracted after the CEB-format file has been converted into another format. In this embodiment, only the page corresponding to the page number determined in step 101 is used as a target page from which a table needs to be extracted, that is, the target page is the page on which the project name appears, the following page, or the like. Before the table is extracted, the vertical edge line and the horizontal edge line of the target page are set in advance. In this embodiment, the vertical edge line is a vertical line located at the edge of the target page, for example the left or right edge of the page; similarly, the horizontal edge line is a horizontal line located at the edge of the target page, for example the upper or lower edge. When the attribute information of the extracted elements (line elements, text elements and the like) is subsequently determined, these edge lines serve as the reference. Because all elements then lie on the same side of each edge line, the positional relationship between elements is uniquely represented by the magnitude relationship of their attribute information, so the relative positions of elements can be determined conveniently and accurately. Preferably, the vertical edge line is the line at the left edge of the target page, and the horizontal edge line is the line at the lower edge of the target page.
Wherein different target pages may share the same vertical and horizontal edge lines.
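As a small illustration of using the edge lines as the reference, the sketch below converts a point given in a top-left-origin coordinate system into distances from a left vertical edge line and a bottom horizontal edge line. The top-left origin of the raw coordinates and the function name to_edge_distances are assumptions made for this example; the source does not specify how raw coordinates are obtained.

```python
def to_edge_distances(x_from_left, y_from_top, page_height):
    """Convert a top-left-origin point to edge-line distances.

    Returns (distance to the left vertical edge line,
             distance to the bottom horizontal edge line).
    Assumes the raw y coordinate grows downwards from the top of the page.
    """
    return (x_from_left, page_height - y_from_top)
```

With both edge lines at the page border, every distance is non-negative, so comparing two distances directly tells which element is further right or higher up.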
In the embodiment of the invention, the table in the target page comprises line elements and text elements, wherein the line elements form the frame of the table and the text elements are the contents of the cells. After line elements are extracted from the target page, they are divided into two categories according to direction: horizontal lines are horizontal line elements, and vertical lines are vertical line elements. Assuming that a table in the engineering reply file is as shown in fig. 2, the table has two rows and three columns; its horizontal line elements are AD, EH and IL, and its vertical line elements are AI, BJ, CK and DL. In this embodiment, each cell has four sides, each of which is a horizontal or vertical line; because several such lines may be collinear, they can be treated together as a single line element, for example, the horizontal lines AB, BC and CD can be treated as one horizontal line element AD.
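Treating collinear segments such as AB, BC and CD as one line element AD can be sketched as a simple merge of touching intervals. This is an illustrative sketch only: the tuple layout (y, x_left, x_right), the tolerance tol and the function name merge_horizontal are assumptions, and the same idea applies to vertical segments with the roles of x and y swapped.

```python
def merge_horizontal(segments, tol=0.5):
    """Merge collinear, touching or overlapping horizontal segments.

    segments: list of (y, x_left, x_right) tuples, all distances measured
              from the horizontal and vertical edge lines.
    tol:      assumed tolerance for treating values as equal.
    Returns merged (y, x_left, x_right) tuples sorted by y, then x_left.
    """
    merged = []
    for y, x1, x2 in sorted(segments):
        if merged:
            my, mx1, mx2 = merged[-1]
            # Same height and no gap between the segments: extend the last one.
            if abs(y - my) <= tol and x1 <= mx2 + tol:
                merged[-1] = (my, mx1, max(mx2, x2))
                continue
        merged.append((y, x1, x2))
    return merged
```

For example, three segments at height 100 spanning 10 to 50, 50 to 150 and 150 to 200 collapse into the single element (100, 10, 200).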
In addition, since a target page generally contains multiple line elements (horizontal or vertical), this embodiment distinguishes different line elements by their attribute information. For a horizontal line element, the positions of its left and right ends and its own position in the vertical direction are used as the attribute information. Specifically, the attribute information of the i-th horizontal line element includes: the distance Yi_H between the line and the horizontal edge line, the distance Xi_H1 between the left end of the line and the vertical edge line, and the distance Xi_H2 between the right end of the line and the vertical edge line. In fig. 2, the edge line OX at the bottom of the target page is taken as the horizontal edge line, and the edge line OY on the left side as the vertical edge line. If the horizontal line element EH is the i-th horizontal line element in the target page, then Yi_H is the distance between the horizontal line element EH and the horizontal edge line OX, Xi_H1 is the distance from point E to the vertical edge line OY, and Xi_H2 is the distance from point H to the vertical edge line OY. These three items of attribute information accurately describe the position of each horizontal line element.
Similarly, for a vertical line element, the positions of its upper and lower ends and its own position in the horizontal direction are used as the attribute information. Specifically, the attribute information of the j-th vertical line element includes: the distance Xj_V between the line and the vertical edge line, the distance Yj_V1 between the lower end of the line and the horizontal edge line, and the distance Yj_V2 between the upper end of the line and the horizontal edge line. As shown in fig. 2, if the vertical line element CK is the j-th vertical line element in the target page, then Xj_V is the distance between the vertical line element CK and the vertical edge line OY, Yj_V1 is the distance from point K to the horizontal edge line OX, and Yj_V2 is the distance from point C to the horizontal edge line OX. These three items of attribute information likewise accurately describe the position of each vertical line element. In the embodiment of the invention, line elements are not described in the traditional coordinate representation; instead, a vertical edge line and a horizontal edge line are set, and the three items of attribute information of each line element are determined with these two edge lines as the reference. Line elements can therefore be described with less attribute information, and the attribute information can be used directly in subsequent processing without further conversion or calculation, which facilitates subsequent processing.
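The three-attribute description of each line element can be illustrated with the sketch below, which classifies a raw segment as horizontal or vertical and derives its attributes. The class names HLine and VLine, the function classify, the tolerance tol and the input layout (xa, ya, xb, yb) are hypothetical; coordinates are assumed to be already expressed as distances from the two edge lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HLine:
    y: float   # Yi_H: distance to the horizontal edge line
    x1: float  # Xi_H1: left end, distance to the vertical edge line
    x2: float  # Xi_H2: right end, distance to the vertical edge line


@dataclass(frozen=True)
class VLine:
    x: float   # Xj_V: distance to the vertical edge line
    y1: float  # Yj_V1: lower end, distance to the horizontal edge line
    y2: float  # Yj_V2: upper end, distance to the horizontal edge line


def classify(segment, tol=0.5):
    """Turn a raw segment (xa, ya, xb, yb) into an HLine or a VLine.

    Returns None for segments that are neither horizontal nor vertical
    within the assumed tolerance tol.
    """
    xa, ya, xb, yb = segment
    if abs(ya - yb) <= tol:
        return HLine(y=ya, x1=min(xa, xb), x2=max(xa, xb))
    if abs(xa - xb) <= tol:
        return VLine(x=xa, y1=min(ya, yb), y2=max(ya, yb))
    return None
```

Each element carries three numbers instead of the four needed for two full coordinate pairs, which is the saving the description refers to.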
Step 103: determining the left end point P of the horizontal line elementLiRight end point PRiLower end point P of vertical line elementDjAnd an upper end point PUjAnd determining an effective intersection point P between the horizontal line element and the vertical line elementij(ii) a Wherein, the left end point PLiDistance between the edge line and the horizontal edge line is YiHAnd the distance from the vertical edge line is XiH1Right end point PRiDistance between the edge line and the horizontal edge line is YiHAnd the distance from the vertical edge line is XiH2Lower extreme point PDjAt a distance Xj from the vertical edge lineVAnd a distance Yj from the horizontal edge lineV1Upper end point PUjAt a distance Xj from the vertical edge lineVAnd a distance Yj from the horizontal edge lineV2(ii) a If YiHAt YjV1And YjV2And XjVAt XiH1And XiH2Between, then the effective intersection point PijDistance between the edge line and the horizontal edge line is YiHAnd a distance Xj from the vertical edge lineV
In the embodiment of the invention, the transverse line element has a left end point and a right end point, and the vertical line element has an upper end point and a lower end point. Specifically, the ith horizontal line element has a left end point PLiRight end point PRiThe jth vertical line element has a lower end point PDjAnd an upper end point PUj. Still taking the horizontal line element EH and the vertical line element CK in fig. 2 as an example, the left end point P of the horizontal line element EHLiIs point E, right end point PRiPoint H; meanwhile, the distances between the point E and the horizontal edge line OX and the vertical edge line OY are the distances Yi from the horizontal line element EH to the horizontal edge line OXHThe distance Xi between the left end of the transverse line element EH and the vertical edge lineH1So the left end point PLiDistance between the edge line and the horizontal edge line is YiHAnd the distance from the vertical edge line is XiH1. In the same way, it can be determined that the right end point P of the horizontal line element EHRiIs point H, the right end point PRiThe distance from the horizontal edge line OX is also YiHAnd a distance Xi from the vertical edge line OYH2
Likewise, the lower end point $P_{Dj}$ of the j-th vertical line element CK is point K and the upper end point $P_{Uj}$ is point C. The distance between point K and the vertical edge line OY is the distance $Xj_V$ between the vertical line element CK and the vertical edge line, and the distance between point K and the horizontal edge line OX is the distance $Yj_{V1}$ between the lower end of CK and the horizontal edge line; the lower end point $P_{Dj}$ is therefore at a distance $Xj_V$ from the vertical edge line and $Yj_{V1}$ from the horizontal edge line. By analogy, the upper end point $P_{Uj}$ is at a distance $Xj_V$ from the vertical edge line and $Yj_{V2}$ from the horizontal edge line.
In addition, intersection points may exist between the horizontal line elements and the vertical line elements; those that do not coincide with an end point (left, right, upper, or lower end point) are regarded as effective intersection points $P_{ij}$. Specifically, if the i-th horizontal line element lies between the upper and lower ends of the j-th vertical line element (i.e., $Yi_H$ lies between $Yj_{V1}$ and $Yj_{V2}$) and the j-th vertical line element lies between the left and right ends of the i-th horizontal line element (i.e., $Xj_V$ lies between $Xi_{H1}$ and $Xi_{H2}$), then the intersection point of the two is an effective intersection point $P_{ij}$, at a distance $Yi_H$ from the horizontal edge line and $Xj_V$ from the vertical edge line. As shown in fig. 2, the i-th horizontal line element EH and the j-th vertical line element CK satisfy $Yj_{V1} < Yi_H < Yj_{V2}$ and $Xi_{H1} < Xj_V < Xi_{H2}$, so their intersection point G is an effective intersection point, at a distance $Yi_H$ from the horizontal edge line and $Xj_V$ from the vertical edge line. By contrast, the intersection point of the horizontal line element AD and the vertical line element CK is C, which is not an effective intersection point because it is the upper end point of CK. In addition, if the table in the engineering reply file is not drawn to standard, an intersection point may exist near an end point. For example, if a small part of the upper end of the vertical line element CK in fig. 2 protrudes, so that the end point C exists and CK intersects AD at an intersection point C', the intersection point C' is not used as an effective intersection point; alternatively, the part protruding from the upper end of CK is removed in advance, that is, the distance between the upper end of CK and the horizontal edge line is adjusted to the distance from the horizontal line element AD to the horizontal edge line.
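The end-point and effective-intersection rules of step 103 can be illustrated with a minimal Python sketch. The class names `HLine` and `VLine`, their field names, and the sample coordinates are assumptions made for this illustration only; the layout loosely imitates the grid described for fig. 2, with the horizontal edge line at the bottom and the vertical edge line at the left. Strict inequalities exclude an intersection that coincides with an end point, and protruding line ends are assumed to have been trimmed beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HLine:
    y: float   # Yi_H: distance from the horizontal edge line
    x1: float  # Xi_H1: left end, distance from the vertical edge line
    x2: float  # Xi_H2: right end, distance from the vertical edge line


@dataclass(frozen=True)
class VLine:
    x: float   # Xj_V: distance from the vertical edge line
    y1: float  # Yj_V1: lower end, distance from the horizontal edge line
    y2: float  # Yj_V2: upper end, distance from the horizontal edge line


def endpoints(h_lines, v_lines):
    """All end points as (x, y) pairs; the set removes overlapping end points."""
    pts = set()
    for h in h_lines:
        pts.update({(h.x1, h.y), (h.x2, h.y)})
    for v in v_lines:
        pts.update({(v.x, v.y1), (v.x, v.y2)})
    return pts


def effective_intersections(h_lines, v_lines):
    """Intersections strictly inside both lines, i.e. not at any end point."""
    return {
        (v.x, h.y)
        for h in h_lines
        for v in v_lines
        if v.y1 < h.y < v.y2 and h.x1 < v.x < h.x2
    }


# Assumed layout: three horizontal lines and four vertical lines on a grid.
h_lines = [HLine(2, 0, 3), HLine(1, 0, 3), HLine(0, 0, 3)]
v_lines = [VLine(x, 0, 2) for x in (0, 1, 2, 3)]
print(sorted(effective_intersections(h_lines, v_lines)))
```

In this assumed layout only the middle horizontal line crosses the two inner vertical lines away from their ends, which corresponds to the points F and G of fig. 2.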
Step 104: determining a target area in the target page, and extracting text elements in the target area, wherein the attribute information of the a-th text element comprises: the distance $Xa_1$ between its left side and the vertical edge line, the distance $Xa_2$ between its right side and the vertical edge line, the distance $Ya_1$ between its lower side and the horizontal edge line, and the distance $Ya_2$ between its upper side and the horizontal edge line; the target area is an area corresponding to the positions of the horizontal line elements and the vertical line elements.
In the embodiment of the invention, the horizontal line elements and the vertical line elements together cover an area, and this area can be used as the target area from which text elements are extracted. Specifically, a non-inclined rectangular area that completely covers all the horizontal line elements and vertical line elements may be used as the target area of the target page; as shown in fig. 2, the rectangle ADLI (or a larger rectangle) may be used. In this embodiment, the target area may include a plurality of text elements, and the position of each text element is described by four items of attribute information. As shown in fig. 2, each text element corresponds to a rectangular region; if "text 1" is the a-th text element, its attribute information comprises the distance $Xa_1$ between its left side and the vertical edge line OY, the distance $Xa_2$ between its right side and OY, the distance $Ya_1$ between its lower side and the horizontal edge line OX, and the distance $Ya_2$ between its upper side and OX. The four distances $Xa_1$, $Xa_2$, $Ya_1$, and $Ya_2$ determine the rectangular area corresponding to the text element, i.e. the dashed box around "text 1" in fig. 2.
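As a rough illustration of the target area, the sketch below computes the smallest non-inclined rectangle covering all line elements and checks whether a text element lies inside it. The names `HLine`, `VLine`, `target_area`, and `inside`, and the tuple layouts, are hypothetical and chosen for this example; the source only requires that the rectangle cover all line elements and allows a larger one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HLine:
    y: float   # distance from the horizontal edge line
    x1: float  # left end, distance from the vertical edge line
    x2: float  # right end, distance from the vertical edge line


@dataclass(frozen=True)
class VLine:
    x: float   # distance from the vertical edge line
    y1: float  # lower end, distance from the horizontal edge line
    y2: float  # upper end, distance from the horizontal edge line


def target_area(h_lines, v_lines):
    """Return (x_min, x_max, y_min, y_max) covering every line element.

    Assumes at least one line element was extracted.
    """
    xs = [x for h in h_lines for x in (h.x1, h.x2)] + [v.x for v in v_lines]
    ys = [h.y for h in h_lines] + [y for v in v_lines for y in (v.y1, v.y2)]
    return min(xs), max(xs), min(ys), max(ys)


def inside(area, text_box):
    """True if a text element (Xa_1, Xa_2, Ya_1, Ya_2) lies within the area."""
    x_min, x_max, y_min, y_max = area
    xa1, xa2, ya1, ya2 = text_box
    return x_min <= xa1 and xa2 <= x_max and y_min <= ya1 and ya2 <= y_max
```

Text elements that fail the `inside` check would simply be ignored when the table is filled.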
In this embodiment, each text element may include one or more character elements; for example, the text element "text 1" (three characters in the original Chinese, "文本1") includes three character elements. The attribute information of a text element therefore has to be determined from the attribute information of its one or more character elements. Specifically, "extracting text elements in the target area" in step 104 includes:
Step A1: extracting character elements in the target area, wherein the attribute information of the k-th character element comprises: the distance $xk_1$ between its left side and the vertical edge line, the distance $xk_2$ between its right side and the vertical edge line, the distance $yk_1$ between its lower side and the horizontal edge line, and the distance $yk_2$ between its upper side and the horizontal edge line.
Step A2: clustering the character elements according to their attribute information, and determining text elements each containing one or more character elements.
Wherein the distance $Xa_1$ between the left side of the text element and the vertical edge line is the extreme value of the distances $xk_1$ between the left sides of the target character elements and the vertical edge line; the distance $Xa_2$ between the right side of the text element and the vertical edge line is the extreme value of the distances $xk_2$ between the right sides of the target character elements and the vertical edge line; the distance $Ya_1$ between the lower side of the text element and the horizontal edge line is the extreme value of the distances $yk_1$ between the lower sides of the target character elements and the horizontal edge line; the distance $Ya_2$ between the upper side of the text element and the horizontal edge line is the extreme value of the distances $yk_2$ between the upper sides of the target character elements and the horizontal edge line; here an extreme value is a maximum or a minimum, and the target character elements are the character elements contained in the text element.
In the embodiment of the present invention, when a text element contains only one character element, that character element is the text element; that is, character elements and text elements are essentially the same kind of element and differ only in the number of characters they contain. A character element can therefore also indicate its position by four items of attribute information, i.e. the attribute information of the k-th character element includes: the distance $xk_1$ between its left side and the vertical edge line, the distance $xk_2$ between its right side and the vertical edge line, the distance $yk_1$ between its lower side and the horizontal edge line, and the distance $yk_2$ between its upper side and the horizontal edge line.
After all the character elements in the target area are extracted, they are clustered according to their attribute information, so that the character elements are grouped into classes according to their positions, each class corresponding to one text element. The clustering may be based on an existing mature clustering algorithm; for example, two character elements whose distance is smaller than a preset value may be placed in the same class. In the embodiment of the present invention, since one text element may include a plurality of character elements, each of the four items of attribute information of the text element is an extreme value (maximum or minimum) of the corresponding item over all of its character elements; whether the maximum or the minimum applies depends on the positions of the horizontal edge line and the vertical edge line. For example, if the vertical edge line is the right edge line of the target page, the left side of the text element is farther from the vertical edge line, so $Xa_1$ is the maximum of the distances $xk_1$ between the left sides of the target character elements and the vertical edge line. Conversely, as shown in fig. 2, if the vertical edge line is the left edge line of the target page, the left side of the text element is closer to the vertical edge line, so $Xa_1$ is the minimum of the distances $xk_1$. At the same time, the distance $Xa_2$ between the right side of the text element and the vertical edge line is the maximum of the distances $xk_2$ between the right sides of the target character elements and the vertical edge line.
Likewise, in fig. 2 the distance $Ya_1$ between the lower side of the text element and the horizontal edge line is the minimum of the distances $yk_1$ between the lower sides of the target character elements and the horizontal edge line, and the distance $Ya_2$ between the upper side of the text element and the horizontal edge line is the maximum of the distances $yk_2$ between the upper sides of the target character elements and the horizontal edge line.
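Steps A1 and A2 can be sketched as follows. The description leaves the clustering algorithm open, so this example swaps in a simple single-linkage grouping: two character boxes are linked when the gap between them is below a preset value. The names `Box`, `gap`, `cluster`, and `max_gap` are hypothetical. The minimum and maximum choices assume the arrangement of fig. 2, with the vertical edge line on the left and the horizontal edge line at the bottom.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x1: float  # left side, distance from the vertical edge line
    x2: float  # right side, distance from the vertical edge line
    y1: float  # lower side, distance from the horizontal edge line
    y2: float  # upper side, distance from the horizontal edge line
    text: str = ""


def gap(a, b):
    """Distance between two boxes; 0 if they touch or overlap."""
    dx = max(a.x1 - b.x2, b.x1 - a.x2, 0)
    dy = max(a.y1 - b.y2, b.y1 - a.y2, 0)
    return math.hypot(dx, dy)


def merge(group):
    """Bounding box of a group: min of left/lower sides, max of right/upper."""
    ordered = sorted(group, key=lambda c: c.x1)
    return Box(
        min(c.x1 for c in group), max(c.x2 for c in group),
        min(c.y1 for c in group), max(c.y2 for c in group),
        "".join(c.text for c in ordered),
    )


def cluster(chars, max_gap):
    """Group character boxes whose gap is below max_gap (union-find)."""
    parent = list(range(len(chars)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            if gap(chars[i], chars[j]) < max_gap:
                parent[find(i)] = find(j)

    groups = {}
    for i, c in enumerate(chars):
        groups.setdefault(find(i), []).append(c)
    return [merge(g) for g in groups.values()]
```

The pairwise comparison is quadratic in the number of characters, which is usually acceptable for a single table page; a spatial index would be an alternative for larger inputs.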
Furthermore, as will be understood by those skilled in the art, all text elements (or character elements) in the target page may be extracted, and then it may be determined which text elements (or character elements) are text elements (or character elements) in the target area; or the target area may be determined first, and then only the text element (or the character element) in the target area is extracted, and the embodiment does not limit the timing, the manner, and the like for extracting the text element (or the character element).
Step 105: taking the left end point $P_{Li}$, the right end point $P_{Ri}$, the lower end point $P_{Dj}$, the upper end point $P_{Uj}$, and the effective intersection point $P_{ij}$ as the vertices of cells, forming a table file that corresponds to the project item name and comprises a plurality of cells, filling the text elements into the corresponding cells according to the attribute information of the text elements, and outputting the filled table file.
In the embodiment of the invention, after all the end points and effective intersection points in the engineering reply file are determined, a table file, such as an Excel file or a CSV file, can be regenerated from them. Because horizontal line elements and vertical line elements may meet at a shared end point, one point can be several kinds of end point at once, i.e. end points may overlap in position; in that case only one of them is retained. As shown in fig. 2, point A is both the left end point of the horizontal line element AD and the upper end point of the vertical line element AI, i.e. there are a left end point A and an upper end point A, and the duplicate can be eliminated. Fig. 2 identifies 12 points, A, B, ..., L, of which points F and G are effective intersection points and the remaining points are end points; for example, B and C are upper end points and E is a left end point.
In the embodiment of the invention, one cell can be determined by four adjacent points (end points and/or effective intersection points), and the four points can determine the corresponding position or area of the cell; meanwhile, the attribute information of the text element can also represent the position of the text element, so that which cell the text element corresponds to can be determined based on the attribute information of the text element, and the corresponding text element is filled into the corresponding cell. B, C, G, F are four adjacent dots that may form a rectangle, as shown in FIG. 2, corresponding to one cell; the position of the text element text 2 corresponds to the cell, so that the extracted text 2 can be filled into the cell, and a complete form file can be formed after all the text elements are filled into the corresponding cells. And then the table file is output, so that the user can edit the table file directly.
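One possible way to turn the vertices into cells and fill them is sketched below. It is a simplification under stated assumptions: vertices are (x, y) pairs measured from the left and bottom edge lines, a cell is the rectangle between neighbouring grid coordinates whose four corners are all known vertices, and a text element is assigned to the cell that contains its centre. The function names `build_cells` and `fill_cells` are hypothetical, and merged cells are not handled.

```python
def build_cells(points):
    """Map (row, column) to (left, right, bottom, top); row 0 is the top row."""
    pts = set(points)
    xs = sorted({x for x, _ in pts})
    ys = sorted({y for _, y in pts}, reverse=True)
    cells = {}
    for row, (top, bottom) in enumerate(zip(ys, ys[1:])):
        for col, (left, right) in enumerate(zip(xs, xs[1:])):
            corners = {(left, top), (right, top), (left, bottom), (right, bottom)}
            if corners <= pts:  # all four adjacent points exist
                cells[(row, col)] = (left, right, bottom, top)
    return cells


def fill_cells(cells, text_elements):
    """Assign each (Xa_1, Xa_2, Ya_1, Ya_2, text) to the cell holding its centre."""
    table = {}
    for xa1, xa2, ya1, ya2, text in text_elements:
        cx, cy = (xa1 + xa2) / 2, (ya1 + ya2) / 2
        for key, (left, right, bottom, top) in cells.items():
            if left <= cx <= right and bottom <= cy <= top:
                table[key] = text
                break
    return table


# Assumed 4 x 3 grid of vertices imitating the 12 points of fig. 2.
grid = [(x, y) for x in (0, 1, 2, 3) for y in (0, 1, 2)]
cells = build_cells(grid)
print(fill_cells(cells, [(1.2, 1.8, 1.3, 1.7, "text 2")]))
```

The resulting dictionary keyed by row and column could then be written out with any spreadsheet or CSV writer.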
With the information extraction method for the engineering reply file provided by the embodiment of the invention, once the line elements and text elements have been extracted, a table in the engineering reply file can be quickly converted into a table file that is convenient to process subsequently; the tables need not be copied manually, which saves labor and improves efficiency. The method presets a vertical edge line and a horizontal edge line and determines the attribute information of the extracted line elements and text elements with these edge lines as the reference. The magnitude relationships among the attribute information uniquely represent the relative positions of the elements, so each element can be described with little attribute information, and the attribute information can be used directly in subsequent processing without further conversion or calculation, which improves processing efficiency.
On the basis of the above embodiment, although the tables in engineering reply files are generally standard tables, some simplified tables still exist, which contain fewer line elements; see fig. 3a and fig. 3b. Because a simplified table omits some line elements, the extracted line elements are incomplete; to complete the missing line elements, this embodiment decides whether to add a line element according to adjacent text elements. Specifically, the boundary line of the whole table is determined in advance from the extracted horizontal line elements and vertical line elements, the boundary line comprising two horizontal line elements and two vertical line elements. Take the table shown in fig. 3b as an example, which contains only one horizontal line element and one vertical line element, i.e. the horizontal line element EH and the vertical line element BN in fig. 4. The boundary line of the table can be determined from the four end points B, N, E, and H; it comprises the horizontal line elements AD and MQ and the vertical line elements AM and DQ, and the corresponding end points A, D, Q, and M can be determined as well. The attribute information of the line elements in the boundary line follows directly from the existing line elements EH and BN (or the known end points B, N, E, and H) without further calculation.
In this embodiment, after "extracting the text element in the target region" in step 104, the method further includes:
Step B1: if no vertical line element exists whose distance $Xj_V$ from the vertical edge line lies between $Xa_2$ and $Xb_1$, a new vertical line element is set between the a-th text element and the b-th text element; the distance between the lower end of the new vertical line element and the horizontal edge line is the distance between the horizontal edge line and the horizontal line element closest to the lower side of the a-th or b-th text element, and the distance between the upper end of the new vertical line element and the horizontal edge line is the distance between the horizontal edge line and the horizontal line element closest to the upper side of the a-th or b-th text element; wherein $Xa_2$ is the distance between the right side of the a-th text element and the vertical edge line, $Xb_1$ is the distance between the left side of the b-th text element and the vertical edge line, and the b-th text element is the text element located to the right of and adjacent to the a-th text element.
In the embodiment of the present invention, if the a-th text element and the b-th text element are adjacent left and right, with the b-th text element on the right side of the a-th text element, and no vertical line element exists between them, a vertical line element needs to be generated. As shown in fig. 4, let "text 5" be the a-th text element and "text 6" the b-th text element; the distance between the right side of "text 5" and the vertical edge line OY is $Xa_2$, and the distance between the left side of "text 6" and OY is $Xb_1$. Since no $Xj_V$ currently lies between $Xa_2$ and $Xb_1$, it can be determined that no vertical line element exists between the two text elements, and a new vertical line element is formed between them; that is, the distance between the new vertical line element and the vertical edge line lies between $Xa_2$ and $Xb_1$ and is preferably the average of $Xa_2$ and $Xb_1$. Furthermore, the elements may be sorted in advance by their distance from the horizontal edge line, so that the horizontal line element closest to the lower side of the a-th or b-th text element and the horizontal line element closest to the upper side of the a-th or b-th text element can be determined. In fig. 4, the horizontal line element MQ is closest to the lower sides of the two text elements and the horizontal line element EH is closest to their upper sides, so the distance between the lower end (point P) of the new vertical line element and the horizontal edge line OX is the distance from MQ to OX, and the distance between the upper end (point G) of the new vertical line element, i.e. the vertical line element GP in fig. 4, and OX is the distance from EH to OX.
In addition, the new vertical line element is also a vertical line element in nature, and after the vertical line element GP is formed, there is a vertical line element between the texts 8 and 9, and the step B1 is not required to be repeated. In addition, a new vertical line element CG may be determined between the texts 2 and 3, and the new vertical line element CG and the vertical line element GP may be merged into the vertical line element CP.
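Step B1 might be sketched as below. The names `HLine`, `VLine`, `TextBox`, and `complete_vertical_line` are hypothetical, and the coordinates in the usage note are invented to imitate fig. 4. Following the wording of step B1, only the horizontal position of existing vertical lines is checked; the new line is placed at the average of $Xa_2$ and $Xb_1$, which the description names as the preferred choice.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HLine:
    y: float   # Yi_H
    x1: float  # Xi_H1
    x2: float  # Xi_H2


@dataclass(frozen=True)
class VLine:
    x: float   # Xj_V
    y1: float  # Yj_V1 (lower end)
    y2: float  # Yj_V2 (upper end)


@dataclass(frozen=True)
class TextBox:
    x1: float  # left side
    x2: float  # right side
    y1: float  # lower side
    y2: float  # upper side


def complete_vertical_line(
    a: TextBox, b: TextBox, h_lines: Sequence[HLine], v_lines: Sequence[VLine]
) -> Optional[VLine]:
    """Return the missing vertical line between a and b (b right of a), if any."""
    if any(a.x2 < v.x < b.x1 for v in v_lines):
        return None  # a vertical line already separates the two text elements
    # Horizontal lines at or below the lower sides, and at or above the upper sides.
    below = [h.y for h in h_lines if h.y <= min(a.y1, b.y1)]
    above = [h.y for h in h_lines if h.y >= max(a.y2, b.y2)]
    if not below or not above:
        return None  # no bounding horizontal lines; nothing sensible to generate
    return VLine(x=(a.x2 + b.x1) / 2, y1=max(below), y2=min(above))
```

With horizontal lines at heights 3, 2, and 0 (standing in for AD, EH, and MQ) and two text boxes in the row between heights 1 and 2, the function returns a line running from height 0 to height 2, which mirrors the vertical line element GP.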
Step B2: if no horizontal line element exists whose distance $Yi_H$ from the horizontal edge line lies between $Ya_1$ and $Yc_2$, a new horizontal line element is set between the a-th text element and the c-th text element; the distance between the left end of the new horizontal line element and the vertical edge line is the distance between the vertical edge line and the vertical line element closest to the left side of the a-th or c-th text element, and the distance between the right end of the new horizontal line element and the vertical edge line is the distance between the vertical edge line and the vertical line element closest to the right side of the a-th or c-th text element; wherein $Ya_1$ is the distance between the lower side of the a-th text element and the horizontal edge line, $Yc_2$ is the distance between the upper side of the c-th text element and the horizontal edge line, and the c-th text element is the text element located below and adjacent to the a-th text element.
Similarly, if the a-th text element and the c-th text element are adjacent above and below, with the c-th text element below the a-th text element, and no horizontal line element exists between them, a horizontal line element needs to be generated. As shown in fig. 4, if "text 5" is the a-th text element and "text 8" is the c-th text element, a new horizontal line element may be generated. The distance between the new horizontal line element and the horizontal edge line lies between $Ya_1$ and $Yc_2$ and is preferably the average of the two. The vertical line element closest to the left side of the a-th or c-th text element is the vertical line element BN, and the vertical line element closest to the right side of the a-th or c-th text element is the vertical line element DQ; the distance between the left end (point J) of the new horizontal line element and the vertical edge line OY is therefore the distance between BN and the vertical edge line, and the distance between the right end (point L) of the new horizontal line element and the vertical edge line is the distance between DQ and the vertical edge line, that is, the new horizontal line element is the horizontal line element JL in fig. 4.
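Step B2 mirrors step B1 with the axes exchanged. The sketch below uses the same hypothetical types; `complete_horizontal_line` and the sample coordinates are assumptions for illustration. The new line is placed at the average of $Ya_1$ and $Yc_2$ and is clipped to the nearest vertical lines on either side.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HLine:
    y: float   # Yi_H
    x1: float  # Xi_H1 (left end)
    x2: float  # Xi_H2 (right end)


@dataclass(frozen=True)
class VLine:
    x: float   # Xj_V
    y1: float  # Yj_V1
    y2: float  # Yj_V2


@dataclass(frozen=True)
class TextBox:
    x1: float  # left side
    x2: float  # right side
    y1: float  # lower side
    y2: float  # upper side


def complete_horizontal_line(
    a: TextBox, c: TextBox, h_lines: Sequence[HLine], v_lines: Sequence[VLine]
) -> Optional[HLine]:
    """Return the missing horizontal line between a and c (c below a), if any."""
    if any(c.y2 < h.y < a.y1 for h in h_lines):
        return None  # a horizontal line already separates the two text elements
    # Vertical lines at or left of the left sides, and at or right of the right sides.
    left = [v.x for v in v_lines if v.x <= min(a.x1, c.x1)]
    right = [v.x for v in v_lines if v.x >= max(a.x2, c.x2)]
    if not left or not right:
        return None  # no bounding vertical lines available
    return HLine(y=(a.y1 + c.y2) / 2, x1=max(left), x2=min(right))
```

Clipping to the nearest existing vertical lines is what keeps the generated line from running across a merged cell to its left or right.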
In the embodiment of the invention, the omitted transverse line elements or vertical line elements can be supplemented by the adjacent text elements, and the two ends of the new transverse line elements or the new vertical line elements are limited by the existing transverse line elements and vertical line elements, so that the line elements are prevented from not conforming to the actual situation due to the overlong newly generated line elements. For example, in the case where there is a merged cell, if the text 4 and the text 7 in fig. 4 are contents in one cell and the text 4 and the text 7 are clustered into one text element, only the horizontal line element JL can be generated when generating a new horizontal line element, and the horizontal line element IJ or IL is not generated.
Optionally, after "taking a page corresponding to the page number in the engineering replication document as a target page" in step 102, the method further includes:
step C1: when the bottom vertical line element exists in the target page, taking the page next to the target page as the target page; and the bottom vertical line element is a vertical line element of which the distance between the lower end of the line and the bottom of the page is smaller than a preset threshold value.
In the embodiment of the present invention, the same table in the engineering reply file may be split across several pages. When the distance between the lower end of a vertical line element in the target page and the bottom of the page (when the horizontal edge line is located at the bottom of the page, this is the distance between the lower end of the line and the horizontal edge line) is smaller than the preset threshold, the vertical line element is close to the bottom of the page and the table may continue on the next page. Such a vertical line element is referred to as a "bottom vertical line element"; the page following the current target page is then also taken as a target page, and the above step 102 and step 104 continue to be executed.
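The page-continuation rule of step C1 could look like the sketch below. The function name `target_pages`, the dictionary layout, and the threshold value used in the usage note are assumptions; the horizontal edge line is taken to be at the bottom of the page, so the lower-end distance $Yj_{V1}$ of a vertical line is its distance from the page bottom. Applying the rule repeatedly to each newly added page is one reading of the description.

```python
def target_pages(first_page, lower_ends_by_page, threshold):
    """Collect consecutive target pages for one table.

    lower_ends_by_page maps a page number to the lower-end distances (Yj_V1)
    of the vertical line elements found on that page.
    """
    pages = [first_page]
    while True:
        current = pages[-1]
        lower_ends = lower_ends_by_page.get(current, [])
        has_bottom_line = any(d < threshold for d in lower_ends)
        if not has_bottom_line or current + 1 not in lower_ends_by_page:
            return pages
        pages.append(current + 1)
```

For example, with `{3: [20.0], 4: [300.0], 5: [15.0]}` and a threshold of 50, starting at page 3 adds page 4 because a line on page 3 ends 20 units above the bottom, and stops there because the line on page 4 ends well above the threshold.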
Optionally, the step 105 of forming a table file containing a plurality of cells corresponding to the project item name includes:
Step D1: when the transverse number of the cells is smaller than the preset number, forming a table file of the standard materials required by the project.
Step D2: when the transverse number of the cells is not less than the preset number, forming a table file of the materials removed in the project.
In the embodiment of the invention, the engineering reply file comprises two tables, namely a table of the standard materials required by the project and a table of the materials removed in the project, and each table can be extracted by the above method. The two tables differ in their transverse number of cells (i.e., the number of columns): the table of the standard materials required by the project generally has 4 columns, as shown in table 1 below, and the table of the materials removed in the project generally has 5 columns, as shown in table 2 below.
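TABLE 1
[Table 1 appears as an image in the original publication: Figure BDA0002842062040000171]
TABLE 2
[Table 2 appears as images in the original publication: Figures BDA0002842062040000172 and BDA0002842062040000181]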
In this embodiment, the two tables can be distinguished simply and quickly by the transverse number of cells, so that the corresponding table file is generated. The table file of the standard materials required by the project and the table file of the materials removed in the project may be two separate Excel files or two worksheets of the same Excel file.
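Steps D1 and D2 amount to a single comparison, sketched below. The function name `classify_table` and the default preset number of 5 are assumptions; 5 is merely consistent with the 4-column and 5-column tables mentioned above, and the description does not fix the value.

```python
def classify_table(column_count: int, preset_number: int = 5) -> str:
    """Pick the table type from the transverse number of cells (columns)."""
    if column_count < preset_number:
        return "standard materials required by the project"
    return "materials removed in the project"


print(classify_table(4))
print(classify_table(5))
```

The returned label could serve as the file name or worksheet name of the generated table file.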
The above describes in detail the information extraction method for the engineering reply file provided by the embodiment of the present invention, and the method may also be implemented by a corresponding apparatus.
Fig. 5 shows a schematic structural diagram of an information extraction apparatus for an engineering reply file according to an embodiment of the present invention. As shown in fig. 5, the information extraction apparatus for the engineering reply file includes:
a preprocessing module 51, configured to acquire an engineering reply file to be processed, and extract a project item name and a page number corresponding to the project item name from the engineering reply file;
a line element extracting module 52, configured to take the page corresponding to the page number in the engineering reply file as a target page, set a vertical edge line and a horizontal edge line of the target page, and extract horizontal line elements and vertical line elements in the target page, wherein the attribute information of the i-th horizontal line element comprises: its distance $Yi_H$ from the horizontal edge line, the distance $Xi_{H1}$ between its left end and the vertical edge line, and the distance $Xi_{H2}$ between its right end and the vertical edge line; and the attribute information of the j-th vertical line element comprises: its distance $Xj_V$ from the vertical edge line, the distance $Yj_{V1}$ between its lower end and the horizontal edge line, and the distance $Yj_{V2}$ between its upper end and the horizontal edge line;
a point determining module 53, configured to determine the left end point $P_{Li}$ and the right end point $P_{Ri}$ of the horizontal line element and the lower end point $P_{Dj}$ and the upper end point $P_{Uj}$ of the vertical line element, and determine an effective intersection point $P_{ij}$ between the horizontal line element and the vertical line element; wherein the left end point $P_{Li}$ is at a distance $Yi_H$ from the horizontal edge line and $Xi_{H1}$ from the vertical edge line; the right end point $P_{Ri}$ is at a distance $Yi_H$ from the horizontal edge line and $Xi_{H2}$ from the vertical edge line; the lower end point $P_{Dj}$ is at a distance $Xj_V$ from the vertical edge line and $Yj_{V1}$ from the horizontal edge line; the upper end point $P_{Uj}$ is at a distance $Xj_V$ from the vertical edge line and $Yj_{V2}$ from the horizontal edge line; if $Yi_H$ lies between $Yj_{V1}$ and $Yj_{V2}$ and $Xj_V$ lies between $Xi_{H1}$ and $Xi_{H2}$, the effective intersection point $P_{ij}$ is at a distance $Yi_H$ from the horizontal edge line and $Xj_V$ from the vertical edge line;
a text element extracting module 54, configured to determine a target area in the target page, and extract text elements in the target area, wherein the attribute information of the a-th text element comprises: the distance $Xa_1$ between its left side and the vertical edge line, the distance $Xa_2$ between its right side and the vertical edge line, the distance $Ya_1$ between its lower side and the horizontal edge line, and the distance $Ya_2$ between its upper side and the horizontal edge line; the target area is an area corresponding to the positions of the horizontal line elements and the vertical line elements;
a conversion module 55, configured to take the left end point $P_{Li}$, the right end point $P_{Ri}$, the lower end point $P_{Dj}$, the upper end point $P_{Uj}$, and the effective intersection point $P_{ij}$ as the vertices of cells, form a table file that corresponds to the project item name and comprises a plurality of cells, fill the text elements into the corresponding cells according to the attribute information of the text elements, and output the filled table file.
On the basis of the above embodiment, the extracting, by the text element extracting module 54, the text element in the target region includes:
extracting character elements in the target area, wherein the attribute information of the k-th character element comprises: the distance $xk_1$ between its left side and the vertical edge line, the distance $xk_2$ between its right side and the vertical edge line, the distance $yk_1$ between its lower side and the horizontal edge line, and the distance $yk_2$ between its upper side and the horizontal edge line;
Clustering the character elements according to the attribute information of the character elements, and determining text elements containing one or more character elements;
wherein the distance $Xa_1$ between the left side of the text element and the vertical edge line is the extreme value of the distances $xk_1$ between the left sides of the target character elements and the vertical edge line; the distance $Xa_2$ between the right side of the text element and the vertical edge line is the extreme value of the distances $xk_2$ between the right sides of the target character elements and the vertical edge line; the distance $Ya_1$ between the lower side of the text element and the horizontal edge line is the extreme value of the distances $yk_1$ between the lower sides of the target character elements and the horizontal edge line; the distance $Ya_2$ between the upper side of the text element and the horizontal edge line is the extreme value of the distances $yk_2$ between the upper sides of the target character elements and the horizontal edge line; here an extreme value is a maximum or a minimum, and the target character elements are the character elements contained in the text element.
On the basis of the above embodiment, the apparatus further includes: a line element generation module;
after the text element extraction module 54 extracts the text element in the target region, the line element generation module is configured to:
if no vertical line element exists whose distance $Xj_V$ from the vertical edge line lies between $Xa_2$ and $Xb_1$, set a new vertical line element between the a-th text element and the b-th text element, wherein the distance between the lower end of the new vertical line element and the horizontal edge line is the distance between the horizontal edge line and the horizontal line element closest to the lower side of the a-th or b-th text element, and the distance between the upper end of the new vertical line element and the horizontal edge line is the distance between the horizontal edge line and the horizontal line element closest to the upper side of the a-th or b-th text element; wherein $Xa_2$ is the distance between the right side of the a-th text element and the vertical edge line, $Xb_1$ is the distance between the left side of the b-th text element and the vertical edge line, and the b-th text element is the text element located to the right of and adjacent to the a-th text element;
if no horizontal line element exists whose distance $Yi_H$ from the horizontal edge line lies between $Ya_1$ and $Yc_2$, set a new horizontal line element between the a-th text element and the c-th text element, wherein the distance between the left end of the new horizontal line element and the vertical edge line is the distance between the vertical edge line and the vertical line element closest to the left side of the a-th or c-th text element, and the distance between the right end of the new horizontal line element and the vertical edge line is the distance between the vertical edge line and the vertical line element closest to the right side of the a-th or c-th text element; wherein $Ya_1$ is the distance between the lower side of the a-th text element and the horizontal edge line, $Yc_2$ is the distance between the upper side of the c-th text element and the horizontal edge line, and the c-th text element is the text element located below and adjacent to the a-th text element.
On the basis of the above embodiment, the device further comprises a page-adding module;
after the line element extraction module 52 takes the page corresponding to the page number in the engineering reply file as a target page, the page-adding module is configured to:
when a bottom vertical line element exists in the target page, take the page following the target page as a target page; the bottom vertical line element being a vertical line element for which the distance between the lower end of the line and the bottom of the page is smaller than a preset threshold.
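As a non-limiting illustration, the page-adding check can be sketched as below. The function name `collect_table_pages`, the input layout (a mapping from page number to the gaps between each vertical line's lower end and the page bottom), the default threshold value, and the choice to repeat the check on each added page are all assumptions for this sketch; the embodiment states only that the following page becomes a target page when a bottom vertical line element exists.

```python
from typing import Dict, List


def collect_table_pages(start_page: int,
                        lower_gaps: Dict[int, List[float]],
                        threshold: float = 20.0) -> List[int]:
    """Return the target pages for one table, starting at start_page.

    lower_gaps maps a page number to the distances between the lower end
    of each vertical line on that page and the bottom of the page.
    """
    pages = [start_page]
    current = start_page
    # A vertical line ending near the page bottom suggests the table
    # continues, so the following page is treated as a target page too.
    while (any(gap < threshold for gap in lower_gaps.get(current, []))
           and (current + 1) in lower_gaps):
        current += 1
        pages.append(current)
    return pages
```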
On the basis of the above embodiment, the forming, by the conversion module 55, of a table file corresponding to the project name and comprising a plurality of cells includes:
when the number of cells in the horizontal direction is smaller than a preset number, forming a table file of the standard materials required by the project;
and when the number of cells in the horizontal direction is not smaller than the preset number, forming a table file of the materials removed in the project.
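As a non-limiting illustration, the choice between the two table types can be sketched as below. The helper names, the tolerance used to merge nearly equal coordinates, the preset number 8, and the returned labels are hypothetical; the embodiment does not state how the number of cells in the horizontal direction is counted, so counting distinct vertex positions is an assumption.

```python
from typing import Iterable


def count_columns(vertex_xs: Iterable[float], tol: float = 1.0) -> int:
    """Count cells in the horizontal direction from the distances between
    cell vertices and the vertical edge line (assumed counting method)."""
    distinct = []
    for x in sorted(vertex_xs):
        # Coordinates closer than tol are treated as the same grid line.
        if not distinct or x - distinct[-1] > tol:
            distinct.append(x)
    return max(len(distinct) - 1, 0)


def classify_table(num_columns: int, preset: int = 8) -> str:
    """Pick the table type from the column count (preset is hypothetical)."""
    if num_columns < preset:
        return "required_standard_materials"
    return "removed_materials"
```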
In addition, an embodiment of the present invention further provides an electronic device, which includes a bus, a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor, where the transceiver, the memory, and the processor are connected via the bus. When the computer program is executed by the processor, the processes of the above embodiment of the information extraction method for an engineering reply file are implemented and the same technical effects can be achieved; to avoid repetition, they are not described here again.
Specifically, referring to fig. 6, an embodiment of the present invention further provides an electronic device, which includes a bus 1110, a processor 1120, a transceiver 1130, a bus interface 1140, a memory 1150, and a user interface 1160.
In an embodiment of the present invention, the electronic device further includes: a computer program stored on the memory 1150 and executable on the processor 1120, the computer program implementing the processes of the above-mentioned information extraction method embodiment of the engineering reply file when being executed by the processor 1120.
A transceiver 1130 for receiving and transmitting data under the control of the processor 1120.
In embodiments of the invention in which a bus architecture (represented by bus 1110) is used, bus 1110 may include any number of interconnected buses and bridges, with bus 1110 connecting various circuits including one or more processors, represented by processor 1120, and memory, represented by memory 1150.
Bus 1110 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an Accelerated Graphics Port (AGP), and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include: an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Processor 1120 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be performed by integrated logic circuits in hardware or instructions in software in a processor. The processor described above includes: general purpose processors, Central Processing Units (CPUs), Network Processors (NPs), Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), Programmable Logic Arrays (PLAs), Micro Control Units (MCUs) or other Programmable Logic devices, discrete gates, transistor Logic devices, discrete hardware components. The various methods, steps and logic blocks disclosed in embodiments of the present invention may be implemented or performed. For example, the processor may be a single core processor or a multi-core processor, which may be integrated on a single chip or located on multiple different chips.
Processor 1120 may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be directly performed by a hardware decoding processor, or may be performed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), a register, and other readable storage media known in the art. The readable storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor.
The bus 1110 may also connect various other circuits, such as peripherals, voltage regulators, or power management circuits, and the bus interface 1140 provides an interface between the bus 1110 and the transceiver 1130. These are well known in the art and are therefore not described further in the embodiments of the present invention.
The transceiver 1130 may be one element or may be multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. For example: the transceiver 1130 receives external data from other devices, and the transceiver 1130 transmits data processed by the processor 1120 to other devices. Depending on the nature of the computer system, a user interface 1160 may also be provided, such as: touch screen, physical keyboard, display, mouse, speaker, microphone, trackball, joystick, stylus.
It is to be appreciated that in embodiments of the invention, the memory 1150 may further include memory located remotely with respect to the processor 1120, which may be coupled to a server via a network. One or more portions of the above-described network may be an ad hoc network, an intranet, an extranet, a Virtual Private Network (VPN), a Local Area Network (LAN), a Wireless Local Area Network (WLAN), a Wide Area Network (WAN), a Wireless Wide Area Network (WWAN), a Metropolitan Area Network (MAN), the Internet, a Public Switched Telephone Network (PSTN), a Plain Old Telephone Service (POTS) network, a cellular telephone network, a Wireless Fidelity (Wi-Fi) network, or a combination of two or more of the above. For example, the cellular telephone network and the wireless network may be a Global System for Mobile Communications (GSM) system, a Code Division Multiple Access (CDMA) system, a Worldwide Interoperability for Microwave Access (WiMAX) system, a General Packet Radio Service (GPRS) system, a Wideband Code Division Multiple Access (WCDMA) system, a Long Term Evolution (LTE) system, an LTE Frequency Division Duplex (FDD) system, an LTE Time Division Duplex (TDD) system, a Long Term Evolution-Advanced (LTE-A) system, a Universal Mobile Telecommunications System (UMTS), an enhanced Mobile Broadband (eMBB) system, a massive Machine Type Communication (mMTC) system, an Ultra-Reliable Low-Latency Communication (URLLC) system, or the like.
It is to be understood that the memory 1150 in embodiments of the present invention can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. Wherein the nonvolatile memory includes: Read-Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or Flash Memory.
The volatile memory includes Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as: Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 1150 of the electronic device described in the embodiments of the invention includes, but is not limited to, the above and any other suitable types of memory.
In an embodiment of the present invention, memory 1150 stores the following elements of operating system 1151 and application programs 1152: an executable module, a data structure, or a subset thereof, or an expanded set thereof.
Specifically, the operating system 1151 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 1152 include various applications, such as a media player and a browser, for implementing various application services. A program implementing a method of an embodiment of the invention may be included in the application programs 1152. The application programs 1152 include: applets, objects, components, logic, data structures, and other computer-system-executable instructions that perform particular tasks or implement particular abstract data types.
In addition, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements each process of the above embodiment of the information extraction method for an engineering reply file and can achieve the same technical effects; to avoid repetition, the details are not repeated here.
The computer-readable storage medium includes permanent and non-permanent, removable and non-removable media, and may be a tangible device that retains and stores instructions for use by an instruction execution apparatus. The computer-readable storage medium includes: electronic memory devices, magnetic memory devices, optical memory devices, electromagnetic memory devices, semiconductor memory devices, and any suitable combination of the foregoing. The computer-readable storage medium includes: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), non-volatile random access memory (NVRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, memory sticks, mechanically encoded devices (e.g., punched cards or raised structures in a groove having instructions recorded thereon), or any other non-transmission medium that can be used to store information accessible by a computing device. As defined in embodiments of the present invention, the computer-readable storage medium does not include transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses traveling through a fiber optic cable), or electrical signals transmitted through a wire.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, electronic device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to solve the problem to be solved by the embodiment of the invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence, or the part thereof contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (including a personal computer, a server, a data center, or other network devices) to execute all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes the various media capable of storing program code listed above.
The above description is only a specific implementation of the embodiments of the present invention, but the scope of the embodiments of the present invention is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present invention, and all such changes or substitutions should be covered by the scope of the embodiments of the present invention. Therefore, the protection scope of the embodiments of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An information extraction method for an engineering reply file, characterized by comprising the following steps:
acquiring an engineering reply file to be processed, and extracting a project name and a page number corresponding to the project name from the engineering reply file;
taking the page corresponding to the page number in the engineering reply file as a target page, setting a vertical edge line and a horizontal edge line of the target page, and extracting horizontal line elements and vertical line elements in the target page, wherein the attribute information of a horizontal line element comprises: the distance Yi_H from the horizontal edge line, the distance Xi_H1 between the left end of the line and the vertical edge line, and the distance Xi_H2 between the right end of the line and the vertical edge line; and the attribute information of a vertical line element comprises: the distance Xj_V from the vertical edge line, the distance Yj_V1 between the lower end of the line and the horizontal edge line, and the distance Yj_V2 between the upper end of the line and the horizontal edge line;
determining the left end point P_Li and the right end point P_Ri of the horizontal line element and the lower end point P_Dj and the upper end point P_Uj of the vertical line element, and determining an effective intersection point P_ij between the horizontal line element and the vertical line element; wherein the distance between the left end point P_Li and the horizontal edge line is Yi_H and its distance from the vertical edge line is Xi_H1; the distance between the right end point P_Ri and the horizontal edge line is Yi_H and its distance from the vertical edge line is Xi_H2; the distance between the lower end point P_Dj and the vertical edge line is Xj_V and its distance from the horizontal edge line is Yj_V1; the distance between the upper end point P_Uj and the vertical edge line is Xj_V and its distance from the horizontal edge line is Yj_V2; and if Yi_H lies between Yj_V1 and Yj_V2 and Xj_V lies between Xi_H1 and Xi_H2, the distance between the effective intersection point P_ij and the horizontal edge line is Yi_H and its distance from the vertical edge line is Xj_V;
determining a target area in the target page, and extracting text elements in the target area, wherein the attribute information of the a-th text element comprises: the distance Xa_1 between its left side and the vertical edge line, the distance Xa_2 between its right side and the vertical edge line, the distance Ya_1 between its lower side and the horizontal edge line, and the distance Ya_2 between its upper side and the horizontal edge line; the target area being an area corresponding to the positions of the horizontal line elements and the vertical line elements;
taking the left end point P_Li, the right end point P_Ri, the lower end point P_Dj, the upper end point P_Uj and the effective intersection point P_ij as cell vertices, forming a table file that corresponds to the project name and comprises a plurality of cells, filling each text element into the corresponding cell according to the attribute information of the text element, and outputting the filled table file.
2. The method of claim 1, wherein the extracting text elements in the target area comprises:
extracting character elements in the target area, wherein the attribute information of the k-th character element comprises: the distance xk_1 between its left side and the vertical edge line, the distance xk_2 between its right side and the vertical edge line, the distance yk_1 between its lower side and the horizontal edge line, and the distance yk_2 between its upper side and the horizontal edge line;
clustering the character elements according to the attribute information of the character elements, and determining text elements each containing one or more character elements;
wherein the distance Xa_1 between the left side of the text element and the vertical edge line is the minimum value of the distances xk_1 between the left sides of the target character elements and the vertical edge line; the distance Xa_2 between the right side of the text element and the vertical edge line is the maximum value of the distances xk_2 between the right sides of the target character elements and the vertical edge line; the distance Ya_1 between the lower side of the text element and the horizontal edge line is the minimum value of the distances yk_1 between the lower sides of the target character elements and the horizontal edge line; and the distance Ya_2 between the upper side of the text element and the horizontal edge line is the maximum value of the distances yk_2 between the upper sides of the target character elements and the horizontal edge line; the target character elements being the character elements contained in the text element.
3. The method according to claim 1, further comprising, after the extracting text elements in the target area:
if there is no vertical line element whose distance Xj_V from the vertical edge line lies between Xa_2 and Xb_1, setting a new vertical line element between the a-th text element and the b-th text element, wherein the distance between the lower end of the new vertical line element and the horizontal edge line is the distance between the horizontal edge line and the horizontal line element closest to the lower side of the a-th text element or the b-th text element, and the distance between the upper end of the new vertical line element and the horizontal edge line is the distance between the horizontal edge line and the horizontal line element closest to the upper side of the a-th text element or the b-th text element; wherein Xa_2 is the distance between the right side of the a-th text element and the vertical edge line, Xb_1 is the distance between the left side of the b-th text element and the vertical edge line, and the b-th text element is the text element located to the right of and adjacent to the a-th text element;
if there is no horizontal line element whose distance Yi_H from the horizontal edge line lies between Ya_1 and Yc_2, setting a new horizontal line element between the a-th text element and the c-th text element, wherein the distance between the left end of the new horizontal line element and the vertical edge line is the distance between the vertical edge line and the vertical line element closest to the left side of the a-th text element or the c-th text element, and the distance between the right end of the new horizontal line element and the vertical edge line is the distance between the vertical edge line and the vertical line element closest to the right side of the a-th text element or the c-th text element; wherein Ya_1 is the distance between the lower side of the a-th text element and the horizontal edge line, Yc_2 is the distance between the upper side of the c-th text element and the horizontal edge line, and the c-th text element is the text element located below and adjacent to the a-th text element.
4. The method according to claim 1, wherein after the taking the page corresponding to the page number in the engineering reply file as a target page, the method further comprises:
when a bottom vertical line element exists in the target page, taking the page following the target page as a target page; the bottom vertical line element being a vertical line element for which the distance between the lower end of the line and the bottom of the page is smaller than a preset threshold.
5. The method of claim 1, wherein the forming a table file that corresponds to the project name and comprises a plurality of cells comprises:
when the number of cells in the horizontal direction is smaller than a preset number, forming a table file of the standard materials required by the project;
and when the number of cells in the horizontal direction is not smaller than the preset number, forming a table file of the materials removed in the project.
6. An information extraction device for an engineering reply file, characterized by comprising:
a preprocessing module, configured to acquire an engineering reply file to be processed, and extract a project name and a page number corresponding to the project name from the engineering reply file;
a line element extraction module, configured to take the page corresponding to the page number in the engineering reply file as a target page, set a vertical edge line and a horizontal edge line of the target page, and extract horizontal line elements and vertical line elements in the target page, wherein the attribute information of a horizontal line element comprises: the distance Yi_H from the horizontal edge line, the distance Xi_H1 between the left end of the line and the vertical edge line, and the distance Xi_H2 between the right end of the line and the vertical edge line; and the attribute information of a vertical line element comprises: the distance Xj_V from the vertical edge line, the distance Yj_V1 between the lower end of the line and the horizontal edge line, and the distance Yj_V2 between the upper end of the line and the horizontal edge line;
a point determination module, configured to determine the left end point P_Li and the right end point P_Ri of the horizontal line element and the lower end point P_Dj and the upper end point P_Uj of the vertical line element, and determine an effective intersection point P_ij between the horizontal line element and the vertical line element; wherein the distance between the left end point P_Li and the horizontal edge line is Yi_H and its distance from the vertical edge line is Xi_H1; the distance between the right end point P_Ri and the horizontal edge line is Yi_H and its distance from the vertical edge line is Xi_H2; the distance between the lower end point P_Dj and the vertical edge line is Xj_V and its distance from the horizontal edge line is Yj_V1; the distance between the upper end point P_Uj and the vertical edge line is Xj_V and its distance from the horizontal edge line is Yj_V2; and if Yi_H lies between Yj_V1 and Yj_V2 and Xj_V lies between Xi_H1 and Xi_H2, the distance between the effective intersection point P_ij and the horizontal edge line is Yi_H and its distance from the vertical edge line is Xj_V;
a text element extraction module, configured to determine a target area in the target page and extract text elements in the target area, wherein the attribute information of the a-th text element comprises: the distance Xa_1 between its left side and the vertical edge line, the distance Xa_2 between its right side and the vertical edge line, the distance Ya_1 between its lower side and the horizontal edge line, and the distance Ya_2 between its upper side and the horizontal edge line; the target area being an area corresponding to the positions of the horizontal line elements and the vertical line elements;
a conversion module, configured to take the left end point P_Li, the right end point P_Ri, the lower end point P_Dj, the upper end point P_Uj and the effective intersection point P_ij as cell vertices, form a table file that corresponds to the project name and comprises a plurality of cells, fill each text element into the corresponding cell according to the attribute information of the text element, and output the filled table file.
7. The apparatus of claim 6, wherein the extracting, by the text element extraction module, of text elements in the target area comprises:
extracting character elements in the target area, wherein the attribute information of the k-th character element comprises: the distance xk_1 between its left side and the vertical edge line, the distance xk_2 between its right side and the vertical edge line, the distance yk_1 between its lower side and the horizontal edge line, and the distance yk_2 between its upper side and the horizontal edge line;
clustering the character elements according to the attribute information of the character elements, and determining text elements each containing one or more character elements;
wherein the distance Xa_1 between the left side of the text element and the vertical edge line is the minimum value of the distances xk_1 between the left sides of the target character elements and the vertical edge line; the distance Xa_2 between the right side of the text element and the vertical edge line is the maximum value of the distances xk_2 between the right sides of the target character elements and the vertical edge line; the distance Ya_1 between the lower side of the text element and the horizontal edge line is the minimum value of the distances yk_1 between the lower sides of the target character elements and the horizontal edge line; and the distance Ya_2 between the upper side of the text element and the horizontal edge line is the maximum value of the distances yk_2 between the upper sides of the target character elements and the horizontal edge line; the target character elements being the character elements contained in the text element.
8. The apparatus of claim 6, further comprising: a line element generation module;
after the text element extraction module extracts the text elements in the target area, the line element generation module is configured to:
if there is no vertical line element whose distance Xj_V from the vertical edge line lies between Xa_2 and Xb_1, set a new vertical line element between the a-th text element and the b-th text element, wherein the distance between the lower end of the new vertical line element and the horizontal edge line is the distance between the horizontal edge line and the horizontal line element closest to the lower side of the a-th text element or the b-th text element, and the distance between the upper end of the new vertical line element and the horizontal edge line is the distance between the horizontal edge line and the horizontal line element closest to the upper side of the a-th text element or the b-th text element; wherein Xa_2 is the distance between the right side of the a-th text element and the vertical edge line, Xb_1 is the distance between the left side of the b-th text element and the vertical edge line, and the b-th text element is the text element located to the right of and adjacent to the a-th text element;
if there is no horizontal line element whose distance Yi_H from the horizontal edge line lies between Ya_1 and Yc_2, set a new horizontal line element between the a-th text element and the c-th text element, wherein the distance between the left end of the new horizontal line element and the vertical edge line is the distance between the vertical edge line and the vertical line element closest to the left side of the a-th text element or the c-th text element, and the distance between the right end of the new horizontal line element and the vertical edge line is the distance between the vertical edge line and the vertical line element closest to the right side of the a-th text element or the c-th text element; wherein Ya_1 is the distance between the lower side of the a-th text element and the horizontal edge line, Yc_2 is the distance between the upper side of the c-th text element and the horizontal edge line, and the c-th text element is the text element located below and adjacent to the a-th text element.
9. An electronic device comprising a bus, a transceiver, a memory, a processor and a computer program stored on the memory and executable on the processor, the transceiver, the memory and the processor being connected via the bus, wherein the computer program, when executed by the processor, implements the steps of the information extraction method for an engineering reply file according to any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the information extraction method for an engineering reply file according to any one of claims 1 to 5.
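As a non-limiting illustration of the geometric tests recited in claims 1 and 2, the effective intersection point and the bounding box of a text element can be sketched as follows. The class and function names are hypothetical, "between" is read as inclusive, and the bounding box takes the minimum of the left and lower distances and the maximum of the right and upper distances, which assumes the vertical edge line is on the left of the page and the horizontal edge line is at the bottom.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (distance to vertical edge, distance to horizontal edge)


@dataclass(frozen=True)
class HLine:
    y: float   # Yi_H
    x1: float  # Xi_H1 (left end)
    x2: float  # Xi_H2 (right end)


@dataclass(frozen=True)
class VLine:
    x: float   # Xj_V
    y1: float  # Yj_V1 (lower end)
    y2: float  # Yj_V2 (upper end)


def effective_intersection(h: HLine, v: VLine) -> Optional[Point]:
    """Return P_ij = (Xj_V, Yi_H) if the two line elements cross, else None."""
    if v.y1 <= h.y <= v.y2 and h.x1 <= v.x <= h.x2:
        return (v.x, h.y)
    return None


def cell_vertices(hlines: List[HLine], vlines: List[VLine]) -> List[Point]:
    """Collect end points and effective intersections as candidate cell vertices."""
    points = set()
    for h in hlines:
        points.update({(h.x1, h.y), (h.x2, h.y)})  # P_Li, P_Ri
    for v in vlines:
        points.update({(v.x, v.y1), (v.x, v.y2)})  # P_Dj, P_Uj
    for h in hlines:
        for v in vlines:
            p = effective_intersection(h, v)
            if p is not None:
                points.add(p)                      # P_ij
    return sorted(points)


def text_bbox(chars: List[Tuple[float, float, float, float]]):
    """Bounding box (Xa_1, Xa_2, Ya_1, Ya_2) of one cluster of character
    elements, each given as (xk_1, xk_2, yk_1, yk_2)."""
    return (min(c[0] for c in chars), max(c[1] for c in chars),
            min(c[2] for c in chars), max(c[3] for c in chars))
```

Assembling cells from the vertices and assigning each text element to a cell would follow from these points and boxes; that step is not sketched here.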
Priority Applications (1)

Application Number: CN202011495587.XA
Publication: CN112580500B (en)
Priority Date: 2020-12-17
Filing Date: 2020-12-17
Title: Information extraction method and device for engineering reply file and electronic equipment
Status: Active

Publications (2)

Publication Number    Publication Date
CN112580500A          2021-03-30
CN112580500B          2023-07-11


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1906608A (en) * 2003-11-21 2007-01-31 新加坡科技研究局 Method and system for validating the content of technical documents
CN109635268A (en) * 2018-12-29 2019-04-16 南京吾道知信信息技术有限公司 The extracting method of form data in pdf document
US20190294399A1 (en) * 2018-03-26 2019-09-26 Abc Fintech Co., Ltd. Method and device for parsing tables in pdf document
CN110633660A (en) * 2019-08-30 2019-12-31 盈盛智创科技(广州)有限公司 Document identification method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhong Hui; Sun Shilan; Liu Qian: "Chinese Layout Analysis and Reconstruction", Journal of Shenyang Jianzhu University (Natural Science), no. 02 *

Also Published As

Publication number Publication date
CN112580500B (en) 2023-07-11

Similar Documents

Publication Publication Date Title
US9990347B2 (en) Borderless table detection engine
US9965444B2 (en) Vector graphics classification engine
US20130191732A1 (en) Fixed Format Document Conversion Engine
US10025979B2 (en) Paragraph property detection and style reconstruction engine
JP2022172381A (en) Text extraction method, text extraction model training method, device and equipment
US20180260376A1 (en) System and method to create searchable electronic documents
CN114596566B (en) Text recognition method and related device
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN111079944A (en) Method and device for realizing interpretation of transfer learning model, electronic equipment and storage medium
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
US10224958B2 (en) Computer-readable recording medium, encoding apparatus, and encoding method
CN115659917A (en) Document format restoration method and device, electronic equipment and storage equipment
US10643022B2 (en) PDF extraction with text-based key
KR20170134251A (en) Image processing apparatus that performs compression processing of document file and compression method of document file and storage medium
CN105373527A (en) Omission recovery method and question-answering system
CN113343658A (en) PDF file information extraction method and device and computer equipment
CN112580500A (en) Information extraction method and device for engineering reply file and electronic equipment
CN112416340A (en) Webpage generation method and system based on sketch
CN110276051B (en) Method and device for splitting font part
CN110309517B (en) Expression document processing method, device, system and storage medium
CN113988020A (en) Engineering technical label book compiling method, device, equipment and storage medium
JP5478936B2 (en) Information processing apparatus and information processing method
US20220178814A1 (en) Method for calculating a density of stem cells in a cell image, electronic device, and storage medium
CN109635681B (en) Document processing method and device
CN107562734A (en) Translation template determination, machine translation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant