CN115934928A - Information extraction method, device, equipment and storage medium - Google Patents
- Publication number
- CN115934928A, CN202211696330.XA
- Authority
- CN
- China
- Prior art keywords
- fund
- poster
- layout
- coordinates
- extraction
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the invention provide an information extraction method, apparatus, device, and storage medium. The method comprises: converting a fund poster into an editable file; performing layout identification on the fund poster to obtain coordinates of layout blocks; for each layout block, extracting each target field from the text of the editable file whose coordinates lie within that layout block; screening the target fields to obtain an extraction result; and combining the extraction results to obtain structured data, and presenting the structured data. The technical solution provided by the embodiments can improve working efficiency and save manual reading time.
Description
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to an information extraction method, an information extraction device, information extraction equipment and a storage medium.
Background
With the rapid development of the Chinese economy in recent years, the financial assets available for residents to allocate have increased, and the publicly offered fund industry has grown vigorously. Fund promotional material is a necessary means of promoting fund products, and the relevant personnel or fund sales organizations produce various promotional materials to improve the promotional effect. To protect the legitimate rights and interests of consumers and promote the healthy development of the market, promotional material is required to have objective content and truthful data.
Fund promotional material is usually presented in the form of a "fund poster" with no fixed format or content. Computer-based technical means are urgently needed to assist manual processing, so as to improve working efficiency and reduce labor cost. Information extraction is an important step in this processing, but methods in the related art have difficulty handling such posters.
Disclosure of Invention
The embodiment of the invention provides an information extraction method, an information extraction device, information extraction equipment and a storage medium, which can improve the working efficiency and save the manual reading time.
In a first aspect, an embodiment of the present invention provides an information extraction method, including:
converting the fund poster into an editable file;
performing layout identification on the fund poster to obtain coordinates of layout blocks;
for each layout block, extracting each target field from the text of the editable file whose coordinates lie within that layout block;
screening the target field to obtain an extraction result;
and combining the extraction results to obtain structured data, and presenting the structured data.
In a second aspect, an embodiment of the present invention provides an information extraction apparatus, including:
a conversion module, configured to convert the fund poster into an editable file;
an identification module, configured to perform layout identification on the fund poster to obtain coordinates of layout blocks;
an extraction module, configured to extract, for each layout block, each target field from the text of the editable file whose coordinates lie within that layout block;
a screening module, configured to screen the target fields to obtain an extraction result;
and a combining and presenting module, configured to combine the extraction results to obtain structured data and present the structured data.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method provided by the embodiments of the present invention.
In a fourth aspect, the embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are used to enable a processor to implement the method provided by the embodiment of the present invention when executed.
According to the technical solution provided by the embodiments of the invention, the fund poster is converted into an editable file; layout identification is performed on the fund poster to obtain coordinates of layout blocks; for each layout block, each target field is extracted from the text of the editable file whose coordinates lie within that layout block; the target fields are screened to obtain an extraction result; and the extraction results are combined to obtain structured data, which is then presented. Working efficiency can thereby be improved and manual reading time saved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an information extraction method according to an embodiment of the present invention;
Fig. 2a is a flowchart of an information extraction method according to an embodiment of the present invention;
Fig. 2b is a flowchart of an information extraction method according to an embodiment of the present invention;
Fig. 3 is a block diagram of an information extraction apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a flowchart of an information extraction method according to an embodiment of the present invention, where the embodiment is applicable to information extraction of poster material of a fund, and the method may be executed by an information extraction apparatus, which may be implemented in the form of hardware and/or software, and the apparatus may be configured in an electronic device. As shown in fig. 1, the method includes:
S110: Convert the fund poster into an editable file.
In the embodiment of the present invention, before S110, the method may further include dividing the key information to be identified into several target fields and configuring the target fields. That is, the key information to be identified is analyzed and divided into a plurality of target fields, and the target fields are set. After the setup is completed, the fund poster can be uploaded for processing.
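The field configuration step can be pictured with a minimal sketch. The source does not say how a target field is represented, so the `TargetField` class, the use of regular expressions, and both example patterns are assumptions made purely for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetField:
    """One configured target field (hypothetical representation)."""

    name: str
    pattern: str  # regular expression used for extraction; an assumption

    def find_all(self, text: str) -> list[str]:
        """Return every substring of `text` that matches this field's pattern."""
        return [m.group(0) for m in re.finditer(self.pattern, text)]


# Hypothetical configuration: the key information is split into two target fields.
TARGET_FIELDS = [
    # One or more capitalized words followed by the word "Fund".
    TargetField("fund_name", r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)* Fund"),
    # A four-digit year in the 2000s.
    TargetField("year", r"\b20\d{2}\b"),
]
```

Under this assumed design, changing what is extracted only requires editing `TARGET_FIELDS`; the rest of the processing stays the same.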
In an implementation of the embodiment of the present invention, optionally, converting the fund poster into an editable file includes: converting fund posters in various formats into single-layer PDF files; and converting each single-layer PDF file into a double-layer PDF file through optical character recognition, and recording the coordinates of each character. This normalizes the content of the fund posters. Specifically, fund posters in various formats, such as pictures, documents, and scanned copies, are converted into single-layer PDF files and stored. The text in the single-layer PDF file is then recognized by Optical Character Recognition (OCR) and added to the PDF file to form a double-layer PDF file, and the coordinates of each character are recorded.
S120: Perform layout identification on the fund poster to obtain the coordinates of layout blocks.
In the embodiment of the present invention, a fund poster is generally laid out in columns, but OCR cannot identify the column boundaries. This easily causes errors in the reading order of the fund poster (for example, two sentences that do not belong to the same paragraph may be joined together), which in turn causes target fields to be extracted incorrectly. Layout identification is therefore needed to obtain each layout block, so that target fields are extracted within each block.
In an implementation of the embodiment of the present invention, optionally, performing layout identification on the fund poster to obtain the coordinates of layout blocks includes: binarizing the fund poster; and performing line scanning on the binarized fund poster to obtain the coordinates of the layout blocks. Optionally, binarizing the fund poster includes: converting the color picture of the fund poster into a black-and-white picture of the fund poster; correspondingly, performing line scanning on the binarized fund poster to obtain the coordinates of the layout blocks includes: performing line scanning on the black-and-white picture using OpenCV to obtain the coordinates of the layout blocks. OpenCV line scanning divides the picture into "islands" that contain information, thereby dividing the picture into blocks and yielding the coordinates of the layout blocks.
Therefore, by carrying out layout identification on the fund poster, the coordinates of each layout block are obtained, so that the target field can be conveniently and correctly extracted subsequently, and the information can be correctly extracted.
S130: for each layout block, each target field is extracted from the text of the editable file with coordinates located in each layout block.
In the embodiment of the invention, because the coordinates of each character in the editable file have been recorded and the coordinates of each layout block have been obtained, the text of the editable file whose coordinates lie within each layout block can be determined, and each target field is extracted from that text.
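A minimal sketch of this per-block extraction is given below. It assumes each recorded character is a `(text, x, y)` triple, each layout block is an `(x0, y0, x1, y1)` box, and target fields are regular expressions; none of these representations is specified in the source, and all names are hypothetical.

```python
import re
from typing import NamedTuple


class Char(NamedTuple):
    """One recognized character and its recorded coordinates (assumed shape)."""

    text: str
    x: float
    y: float


def text_in_block(chars, block):
    """Join the characters whose coordinates fall inside `block`, in reading order."""
    x0, y0, x1, y1 = block
    inside = [c for c in chars if x0 <= c.x <= x1 and y0 <= c.y <= y1]
    # Top-to-bottom, then left-to-right. Real OCR output would need some
    # tolerance on y to group characters into lines; omitted here.
    inside.sort(key=lambda c: (c.y, c.x))
    return "".join(c.text for c in inside)


def extract_fields(chars, blocks, patterns):
    """Run every field pattern over the text of every layout block."""
    results = []
    for index, block in enumerate(blocks):
        text = text_in_block(chars, block)
        for name, pattern in patterns.items():
            for match in re.finditer(pattern, text):
                results.append(
                    {"block": index, "field": name, "value": match.group(0)}
                )
    return results
```

Restricting the text to one block at a time is what keeps two side-by-side columns from being read as a single line, which is the reading-order error described for S120.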
S140: Screen the target fields to obtain an extraction result.
In the embodiment of the invention, some extracted target fields do not meet the requirements, so the target fields need to be screened, that is, checked a second time, to obtain the extraction result.
In an implementation of the embodiment of the present invention, optionally, a database is queried to determine whether each target field meets the requirements, and the target fields that do not meet the requirements are filtered out to obtain the extraction result. Specifically, each target field is matched against the database to judge whether it is a field that needs to be extracted and meets the requirements. If a target field does not meet the requirements, it is filtered out; the target fields that meet the requirements are retained to form the extraction result. For example, to extract fund names, the target field is configured to extract text of the form "XX fund". During extraction, all fields of the form "XX fund" are obtained, but a field such as "essential fund" may not be a fund name. The database is therefore queried to judge whether "essential fund" is a fund name. If the query shows that it is not, the field is filtered out, and the target fields that are real fund names are retained to form the extraction result.
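The database check can be sketched with an in-memory SQLite table. The source does not name a database engine or schema, so the `funds` table, its single `name` column, and both function names are assumptions for illustration only.

```python
import sqlite3


def build_reference_db(fund_names):
    """Create an in-memory table of known fund names (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE funds (name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO funds (name) VALUES (?)", [(name,) for name in fund_names]
    )
    return conn


def screen_fund_names(conn, candidates):
    """Keep only the candidates that exist in the reference table."""
    kept = []
    for candidate in candidates:
        row = conn.execute(
            "SELECT 1 FROM funds WHERE name = ?", (candidate,)
        ).fetchone()
        if row is not None:
            kept.append(candidate)
    return kept


conn = build_reference_db(["Alpha Growth Fund"])
result = screen_fund_names(conn, ["Alpha Growth Fund", "Essential Fund"])
```

Here "Essential Fund" matches the extraction pattern but is absent from the reference table, so it is dropped, mirroring the "essential fund" example in the text. Exact string equality is the simplest possible match rule; a production check might need normalization or fuzzy matching.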
S150: Combine the extraction results to obtain structured data, and present the structured data.
In an implementation of the embodiment of the present invention, optionally, combining the extraction results to obtain the structured data includes: combining the extraction results based on the positional relationship of the layout blocks to obtain the structured data. Specifically, among the extraction results, fields in the same layout block that have a specified relationship are combined, and structured data is formed based on the relationships of belonging. For example, in the same layout block, one group consisting of fund performance x and the year 2021 is found, and another group consisting of fund performance y and the year 2022 is found. Both groups belong to the fund performance, and the fund performance belongs to fund A. Fund A, the fund performance, and the two groups under the fund performance can therefore form structured data.
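The combination step for the fund A example can be sketched as below. The source does not define how related fields are paired, so this sketch assumes extraction results are dictionaries with `block`, `field`, and `value` keys and pairs years with performance values by order of appearance within a block; both choices are assumptions.

```python
def combine(results):
    """Group extraction results by layout block and nest them under each fund."""
    by_block = {}
    for item in results:
        fields = by_block.setdefault(item["block"], {})
        fields.setdefault(item["field"], []).append(item["value"])

    structured = []
    for block in sorted(by_block):
        fields = by_block[block]
        for fund in fields.get("fund_name", []):
            # Pair the n-th year with the n-th performance value (assumed rule).
            pairs = zip(fields.get("year", []), fields.get("performance", []))
            structured.append(
                {
                    "fund": fund,
                    "performance": [{"year": y, "value": v} for y, v in pairs],
                }
            )
    return structured


example = [
    {"block": 0, "field": "fund_name", "value": "Fund A"},
    {"block": 0, "field": "performance", "value": "x"},
    {"block": 0, "field": "year", "value": "2021"},
    {"block": 0, "field": "performance", "value": "y"},
    {"block": 0, "field": "year", "value": "2022"},
]
structured = combine(example)
```

The nested result places the two year and performance groups under the fund performance of fund A, which is the shape of structured data the example describes.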
According to the technical solution provided by the embodiments of the invention, the fund poster is converted into an editable file; layout identification is performed on the fund poster to obtain coordinates of layout blocks; for each layout block, each target field is extracted from the text of the editable file whose coordinates lie within that layout block; the target fields are screened to obtain an extraction result; and the extraction results are combined to obtain structured data, which is then presented. Extraction efficiency can thereby be improved and manual reading time saved.
Fig. 2a is a flowchart of an information extraction method provided in an embodiment of the present invention, where in this embodiment, optionally, the performing layout identification on the fund poster to obtain coordinates of layout blocks includes:
binarizing the fund poster;
and performing line scanning on the binarized fund poster to obtain the coordinates of the layout blocks.
Optionally, the converting the fund poster into an editable file comprises:
converting the fund posters in various formats into single-layer PDF files;
and converting the single-layer PDF file into a double-layer PDF file through optical character recognition, and recording the coordinate of each character.
The screening of the target field to obtain an extraction result comprises:
judging whether the target field meets the requirement or not by inquiring a database;
and filtering the target fields which do not meet the requirements to obtain an extraction result.
As shown in fig. 2a, the technical solution provided by the embodiment of the present invention includes:
S210: Convert fund posters in various formats into single-layer PDF files.
S220: Convert the single-layer PDF file into a double-layer PDF file through optical character recognition, and record the coordinates of each character.
S230: Binarize the fund poster.
S240: Perform line scanning on the binarized fund poster to obtain the coordinates of the layout blocks.
S250: For each layout block, extract each target field from the text of the editable file whose coordinates lie within that layout block.
S260: Determine whether each target field meets the requirements by querying a database.
S270: Filter out the target fields that do not meet the requirements to obtain an extraction result.
S280: Combine the extraction results to obtain structured data, and present the structured data.
For details of S210 to S280, refer to the description of the above embodiments.
The technical solution provided by the embodiment of the present invention is also illustrated in Fig. 2b. As shown in Fig. 2b, the method includes:
Setting fields: analyze the key information to be identified and divide it into a plurality of target fields.
Data normalization: convert fund posters in various formats, such as pictures, documents, and scanned copies, into PDF files and store them locally.
OCR processing: recognize the characters in picture-type content, add the characters to the PDF file, and record the coordinates of the characters.
Binarization processing: convert the color fund poster picture into black and white.
Layout identification: obtain the coordinates of the layout blocks using OpenCV line scanning.
Fuzzy extraction: for each layout block, extract each target field from the text whose coordinates are within the layout block.
Back-check extraction: query the database to match the extracted information and obtain an accurate extraction result.
Relation extraction: combine the extracted information according to the positional relationship of the layout blocks.
Result presentation: present the extraction result according to the combined structure.
According to the technical solution provided by the embodiment of the invention, the layout of the fund poster and the characters in the fund poster are identified automatically, and specific key information is extracted on that basis. The information is combined according to its relationships to obtain structured data for the key information of the fund poster, and the structured data is displayed to the user according to the structure of the information. The solution supports fund posters in any file format and identifies them efficiently, so the key information in a fund poster can be read in a short time, which saves manual reading time. The solution is also accurate, improves processing efficiency, and is extensible: different extraction rules can be satisfied simply by adjusting the field settings.
Fig. 3 is a block diagram of an information extraction apparatus according to an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a conversion module 310, an identification module 320, an extraction module 330, a screening module 340, and a combining and presenting module 350.
The conversion module 310 is configured to convert the fund poster into an editable file;
the identification module 320 is configured to perform layout identification on the fund poster to obtain coordinates of layout blocks;
the extraction module 330 is configured to extract, for each layout block, each target field from the text of the editable file whose coordinates lie within that layout block;
the screening module 340 is configured to screen the target fields to obtain an extraction result;
the combining and presenting module 350 is configured to combine the extraction results to obtain structured data, and present the structured data.
Optionally, performing layout identification on the fund poster to obtain the coordinates of layout blocks includes:
binarizing the fund poster;
and performing line scanning on the binarized fund poster to obtain the coordinates of the layout blocks.
Optionally, the binarizing processing of the fund poster includes:
converting the color picture of the fund poster into a black and white picture of the fund poster;
correspondingly, the line scanning is carried out on the fund poster subjected to binarization processing to obtain the coordinates of the layout blocks, and the method comprises the following steps:
and performing line scanning on the black and white picture by adopting OpenCV to obtain coordinates of layout blocks.
Optionally, the converting the fund poster into an editable file includes:
converting the fund posters in various formats into single-layer PDF files;
and converting the single-layer PDF file into a double-layer PDF file through optical character recognition, and recording the coordinate of each character.
Optionally, the screening the target field to obtain an extraction result includes:
judging whether the target field meets the requirement or not by querying a database;
and filtering the target fields which do not meet the requirements to obtain an extraction result.
Optionally, the combining the extraction results to obtain structured data includes:
and combining the extracted results based on the position relation of the layout blocks to obtain structured data.
Optionally, the apparatus further includes a setting module, configured to:
dividing key information to be identified into a plurality of target fields, and configuring the target fields.
The apparatus provided by the embodiment of the invention can execute the method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to executing that method.
FIG. 4 shows a schematic block diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the information extraction method.
In some embodiments, the information extraction method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the information extraction method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the information extraction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order, and no limitation is imposed herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. An information extraction method, comprising:
converting the fund poster into an editable file;
performing layout identification on the fund poster to obtain coordinates of layout blocks;
for each layout block, extracting each target field from the text of the editable file whose coordinates lie within the layout block;
screening the target fields to obtain an extraction result;
and combining the extraction results to obtain structured data, and presenting the structured data.
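As a non-limiting illustration of the extraction step of claim 1, the following Python sketch selects the characters whose recorded coordinates fall inside a layout block and applies one pattern per target field. The `Char` and `Block` types, the `TARGET_FIELDS` names and patterns, and the assumption that y increases downward are all hypothetical and are not taken from the disclosure.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """One recognized character with its recorded coordinates (hypothetical type)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Block:
    """Axis-aligned layout block given by two corner coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, c: Char) -> bool:
        return self.x0 <= c.x <= self.x1 and self.y0 <= c.y <= self.y1


# Hypothetical target fields; a real configuration would be supplied separately.
TARGET_FIELDS = {
    "fund_code": re.compile(r"\b\d{6}\b"),
    "annual_return": re.compile(r"\d+(?:\.\d+)?%"),
}


def text_in_block(chars, block):
    """Join the characters inside the block in reading order (top to bottom, left to right)."""
    inside = sorted((c for c in chars if block.contains(c)), key=lambda c: (c.y, c.x))
    return "".join(c.text for c in inside)


def extract_fields(chars, blocks):
    """For each layout block, return the matches of every target field in that block's text."""
    results = []
    for block in blocks:
        text = text_in_block(chars, block)
        found = {name: pattern.findall(text) for name, pattern in TARGET_FIELDS.items()}
        results.append((block, found))
    return results
```

Restricting the search to one block at a time keeps a pattern from matching text that merely happens to be nearby in another part of the poster.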
2. The method of claim 1, wherein performing layout identification on the fund poster to obtain coordinates of layout blocks comprises:
performing binarization processing on the fund poster;
and performing line scanning on the binarized fund poster to obtain the coordinates of the layout blocks.
3. The method according to claim 2, wherein performing binarization processing on the fund poster comprises:
converting the color picture of the fund poster into a black-and-white picture of the fund poster;
correspondingly, performing line scanning on the binarized fund poster to obtain the coordinates of the layout blocks comprises:
performing line scanning on the black-and-white picture using OpenCV to obtain the coordinates of the layout blocks.
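By way of illustration only, binarization followed by line scanning can be sketched as below. Claim 3 names OpenCV; this sketch instead uses plain Python lists as a stand-in so that it runs without dependencies (with OpenCV, the binarization would typically go through `cv2.cvtColor` and `cv2.threshold`). The threshold of 128, the luma weights, and the rule that blank rows separate blocks are assumptions, not details from the disclosure.

```python
def binarize(image, threshold=128):
    """Map each RGB pixel to 1 (ink, darker than threshold) or 0 (background)."""
    return [
        [1 if 0.299 * r + 0.587 * g + 0.114 * b < threshold else 0 for (r, g, b) in row]
        for row in image
    ]


def row_runs(binary):
    """Return inclusive (start, end) ranges of consecutive rows that contain ink."""
    runs, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new run of inked rows begins
        elif not has_ink and start is not None:
            runs.append((start, y - 1))    # a blank row closes the current run
            start = None
    if start is not None:
        runs.append((start, len(binary) - 1))
    return runs


def layout_blocks(binary):
    """Scan row by row and return one (x0, y0, x1, y1) box per run of inked rows."""
    blocks = []
    for y0, y1 in row_runs(binary):
        cols = [x for y in range(y0, y1 + 1) for x, v in enumerate(binary[y]) if v]
        blocks.append((min(cols), y0, max(cols), y1))
    return blocks
```

A scan of this kind only splits the poster into horizontal bands; separating side-by-side blocks would need a second scan along the columns of each band.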
4. The method of claim 1, wherein converting the fund poster into an editable file comprises:
converting fund posters in various formats into single-layer PDF files;
and converting each single-layer PDF file into a double-layer PDF file through optical character recognition, and recording the coordinates of each character.
5. The method of claim 1, wherein screening the target fields to obtain an extraction result comprises:
determining, by querying a database, whether each target field meets the requirements;
and filtering out the target fields that do not meet the requirements to obtain the extraction result.
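A minimal sketch of such database-backed screening follows, assuming that "meeting the requirements" means the extracted value appears in a reference table. The in-memory SQLite database, the `funds` table, and both function names are hypothetical; the disclosure does not specify the database or the criterion.

```python
import sqlite3


def build_reference_db(known_codes):
    """Create an in-memory reference table of valid fund codes (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE funds (code TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO funds (code) VALUES (?)", [(c,) for c in known_codes])
    conn.commit()
    return conn


def screen_fund_codes(conn, candidates):
    """Keep only the candidates found in the reference table, preserving input order."""
    kept = []
    for code in candidates:
        row = conn.execute("SELECT 1 FROM funds WHERE code = ?", (code,)).fetchone()
        if row is not None:
            kept.append(code)
    return kept
```

Checking candidates against known values removes strings that match a field pattern by coincidence, such as a six-digit date mistaken for a fund code.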
6. The method of claim 1, wherein combining the extraction results to obtain structured data comprises:
combining the extraction results based on the positional relationship of the layout blocks to obtain the structured data.
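One possible reading of combining by positional relationship is to order the per-block results in reading order before merging them. The sketch below assumes that reading, with blocks given as `(x0, y0, x1, y1)` tuples and JSON as the presentation format; neither the ordering rule nor the output format is specified in the disclosure, and `combine` and `to_json` are hypothetical names.

```python
import json


def combine(results):
    """Order per-block results top to bottom, then left to right, and build records.

    `results` is a list of ((x0, y0, x1, y1), {field: value}) pairs.
    """
    ordered = sorted(results, key=lambda item: (item[0][1], item[0][0]))
    return [
        {"order": index, "block": list(block), "fields": fields}
        for index, (block, fields) in enumerate(ordered)
    ]


def to_json(records):
    """Serialize the structured data for presentation."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```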
7. The method of claim 1, further comprising:
dividing key information to be identified into a plurality of target fields, and configuring the target fields.
8. An information extraction apparatus, characterized by comprising:
the conversion module is used for converting the fund poster into an editable file;
the identification module is used for performing layout identification on the fund poster to obtain coordinates of layout blocks;
the extraction module is used for extracting, for each layout block, each target field from the text of the editable file whose coordinates lie within the layout block;
the screening module is used for screening the target fields to obtain an extraction result;
and the combination and presentation module is used for combining the extraction results to obtain structured data and presenting the structured data.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to perform the method of any one of claims 1-7 when executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211696330.XA CN115934928A (en) | 2022-12-28 | 2022-12-28 | Information extraction method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211696330.XA CN115934928A (en) | 2022-12-28 | 2022-12-28 | Information extraction method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115934928A (en) | 2023-04-07 |
Family
ID=86557569
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211696330.XA Pending CN115934928A (en) | 2022-12-28 | 2022-12-28 | Information extraction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115934928A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113159969A (en) * | 2021-05-17 | 2021-07-23 | 广州故新智能科技有限责任公司 | Financial long text rechecking system |
CN118550891A (en) * | 2024-05-10 | 2024-08-27 | 北京度友信息技术有限公司 | Portable file format document processing method, portable file format document processing device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||