CN118070789A - Information extraction method and device - Google Patents
- Publication number: CN118070789A
- Application number: CN202410256448.3A
- Authority
- CN
- China
- Prior art keywords
- information
- text
- instruction file
- instruction
- image information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/279 — Recognition of textual entities (natural language analysis)
- G06F40/295 — Named entity recognition
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Neural network learning methods
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V30/147 — Determination of region of interest (character recognition; image acquisition)
- G06V30/412 — Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
Abstract
The invention provides an information extraction method and device. The method comprises the following steps: performing text recognition on an instruction file to be processed to obtain text information, wherein the instruction file comprises a plurality of instructions in a preset format; performing visual analysis on the instruction file to obtain image information; and processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in the preset format. According to embodiments of the invention, the image information and the text information of the instruction file can be effectively fused to extract the key fields of a hosted instruction, solving the problem in the related art that the complex information hierarchy of hosted instruction files leads to poor text recognition accuracy.
Description
Technical Field
The present invention relates to the field of data processing, and in particular, to an information extraction method and apparatus.
Background
The asset hosting business brings a bank a large number of high-quality clients and low-cost deposits, and is one of the hot businesses in bank financial management, yet its degree of digitization remains markedly low. At present, hosting banks still receive large volumes of files such as contracts, instructions and orders from asset managers through paper fax, e-mail and other channels, then manually determine the file types and sort them for manual processing, with the files circulating on paper through the entire hosting workflow. This mode of management is inefficient and risky, and seriously affects the overall operating level of the hosting department. An important step in moving hosting business management from offline to online is to build a complete data processing pipeline covering file receiving, order entry, review and post-by-post re-examination, thereby realizing electronic management of hosting information. However, most hosted instruction files are pictures or scans, from which digitized instruction information is difficult to obtain directly.
In a bank asset hosting scenario, the asset manager transmits money transfer information to the hosting party through a hosting instruction, which is examined under a strict workflow: several auditors in the hosting department must check the transfer time and amount and verify the account information, so as to ensure the compliance of the instruction, the completeness of its elements, the reasonableness of the amount and the accuracy of the accounts. The accuracy of the extraction results is therefore the key to improving the efficiency of a new-generation intelligent extraction service for hosted instructions. However, extracting information from hosted instructions faces three difficulties: 1. each institution follows its own instruction-filling conventions, so instruction files differ widely in format; 2. each hosted instruction contains many pieces of information, some of them optional, so each page of an instruction file has a complex information hierarchy; 3. the number of instructions per page is not fixed, so parsing results are prone to misalignment and confusion.
In the related art, character information in a hosted file is acquired through optical character recognition (Optical Character Recognition, abbreviated as OCR). Because of the complexity of such files, however, the characters obtained by OCR's direct horizontal scanning do not follow the correct reading order, and the resulting instruction information is inaccurate.
For the above-described problems in the related art, no solution has been proposed.
Disclosure of Invention
Embodiments of the invention provide an information extraction method and device, which at least solve the problem in the related art that the complex information hierarchy of hosted instruction files leads to poor text recognition accuracy.
According to an embodiment of the present invention, there is provided an information extraction method including: carrying out text recognition on an instruction file to be processed to obtain text information, wherein the instruction file comprises a plurality of instructions in a preset format; performing visual analysis on the instruction file to obtain image information; and processing the text information and the image information through a pre-trained plane layout algorithm model to obtain target instruction information in a preset format.
Optionally, performing text recognition on the instruction file to be processed to obtain text information includes: performing optical character recognition on the instruction file to obtain an optical character recognition result composed of a plurality of characters; carrying out layout knowledge analysis on the instruction file to obtain two-dimensional absolute position information of the plurality of characters in the instruction file, and correcting the optical character recognition result based on the two-dimensional absolute position information; carrying out named entity recognition on the corrected optical character recognition result to obtain a plurality of named entities; wherein the text information includes the plurality of characters and the plurality of named entities, and the two-dimensional absolute position information includes coordinate positions of four boundary points.
Optionally, after performing named entity recognition on the corrected optical character recognition result to obtain a plurality of named entities, the method further includes: performing relative position coding on the plurality of characters and the plurality of named entities to obtain relative position information of each character or named entity in the text information, wherein the text information further includes the relative position information, and the relative position information includes a head vector and a tail vector.
Optionally, performing visual analysis on the instruction file to obtain image information, including: inputting the instruction file into a pre-trained regional convolution neural network model to obtain the image information output by the regional convolution neural network model, wherein the image information comprises regional characteristics of one or more regions of interest in the instruction file.
Optionally, inputting the instruction file into a pre-trained regional convolutional neural network model to obtain the image information output by the regional convolutional neural network model, including: capturing text position information in the instruction file through the area convolution neural network model, segmenting the instruction file into a plurality of slicing areas, determining one or more areas of interest from the plurality of slicing areas according to the text position information, and outputting area characteristics of the one or more areas of interest, wherein the area characteristics comprise: region identification, text direction, and region location.
Optionally, processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format, including: inputting the text information into the planar layout algorithm model through a vector matrix formed by seven layers of embedded vectors; inputting the image information into the planar layout algorithm model through an image embedding vector; and carrying out joint modeling and semantic alignment on the text information and the image information based on the seven-layer character embedding vector and the image embedding vector through the plane layout algorithm model to obtain target instruction information, wherein the target instruction information comprises field information of a plurality of preset fields, the text information is used for indicating field information and field positions of the plurality of fields, and the image information is used for indicating the position of the target instruction information in the instruction file.
Optionally, the text information includes: the method comprises the steps of providing a character, a named entity, two-dimensional absolute position information of the character or the named entity in an instruction file, and relative position information of the character or the named entity in the instruction file, wherein the two-dimensional absolute position information comprises coordinate positions of four boundary points, and the relative position information comprises a head vector and a tail vector; the image information includes: region identification, text direction, and region location of one or more regions of interest.
Optionally, the target instruction information includes field information of a plurality of preset fields, and after the text information and the image information are processed through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format, the method further includes: judging whether the confidence coefficient of the field information of each preset field in the target instruction information is larger than or equal to a preset confidence coefficient threshold value or not through a preset text classification model.
According to another embodiment of the present invention, there is also provided an information extraction apparatus including:
The identification module is used for carrying out text identification on an instruction file to be processed to obtain text information, wherein the instruction file comprises a plurality of instructions in a preset format;
the analysis module is used for carrying out visual analysis on the instruction file to obtain image information;
And the processing module is used for processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format.
Optionally, the identification module includes:
The optical character recognition unit is used for carrying out optical character recognition on the instruction file to obtain an optical character recognition result composed of a plurality of characters;
The layout knowledge analysis unit is used for carrying out layout knowledge analysis on the instruction file to obtain two-dimensional absolute position information of the plurality of characters in the instruction file, and correcting the optical character recognition result based on the two-dimensional absolute position information;
The named entity recognition unit is used for carrying out named entity recognition on the corrected optical character recognition result to obtain a plurality of named entities; wherein the text information includes the plurality of characters and the plurality of named entities, and the two-dimensional absolute position information includes coordinate positions of four boundary points.
Optionally, the identification module further comprises: the coding unit is used for carrying out relative position coding on the plurality of characters and the plurality of named entities to obtain relative position information of each character or the named entity in the text information, wherein the text information also comprises the relative position information, and the relative position information comprises a head vector and a tail vector.
Optionally, the parsing module includes: the visual analysis unit is used for inputting the instruction file into a pre-trained regional convolution neural network model to obtain the image information output by the regional convolution neural network model, wherein the image information comprises regional characteristics of one or more regions of interest in the instruction file.
Optionally, the visual analysis unit is configured to capture text position information in the instruction file through the area convolutional neural network model, segment the instruction file into a plurality of slice areas, determine one or more regions of interest from the plurality of slice areas according to the text position information, and output area features of the one or more regions of interest, where the area features include: region identification, text direction, and region location.
Optionally, the processing module includes:
An input unit for inputting the text information into the planar layout algorithm model through a vector matrix composed of seven layers of embedded vectors; inputting the image information into the planar layout algorithm model through an image embedding vector;
The algorithm processing unit is used for carrying out joint modeling and semantic alignment on the text information and the image information based on the seven-layer character embedding vector and the image embedding vector through the plane layout algorithm model to obtain target instruction information, wherein the target instruction information comprises field information of a plurality of preset fields, the text information is used for indicating field information and field positions of the plurality of fields, and the image information is used for indicating the position of the target instruction information in the instruction file.
Optionally, the text information includes: the method comprises the steps of providing a character, a named entity, two-dimensional absolute position information of the character or the named entity in an instruction file, and relative position information of the character or the named entity in the instruction file, wherein the two-dimensional absolute position information comprises coordinate positions of four boundary points, and the relative position information comprises a head vector and a tail vector; the image information includes: region identification, text direction, and region location of one or more regions of interest.
Optionally, the target instruction information includes field information of a plurality of preset fields.
Optionally, the apparatus further comprises: the judging module is used for judging whether the confidence coefficient of the field information of each preset field in the target instruction information is larger than or equal to a preset confidence coefficient threshold value through a preset text classification model.
According to a further embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program, when executed by a processor, performs the steps of any of the method embodiments described above.
According to a further embodiment of the invention, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the invention, the image information and the text information of an instruction file can be effectively fused to extract the key fields of a hosted instruction. This solves the problem in the related art that the complex information hierarchy of hosted instruction files leads to poor text recognition accuracy, and by improving extraction accuracy it further improves the working efficiency of the new-generation hosted instruction service.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a block diagram of a hardware configuration of a computer terminal of an information extraction method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of information extraction according to an embodiment of the invention;
FIG. 3 is a schematic diagram of the FLAT algorithm in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a planar layout algorithm model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a business process of a hosted file information extraction service according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a hosted file system according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a hosted instruction intelligent recognition model according to an embodiment of the present invention;
Fig. 8 is a block diagram of an information extraction apparatus according to an embodiment of the present invention.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Example 1
The method according to the first embodiment of the present application may be implemented in a mobile terminal, a computer terminal or a similar computing device. Taking a computer terminal as an example, fig. 1 is a block diagram of a hardware structure of a computer terminal of the information extraction method according to an embodiment of the present application, as shown in fig. 1, the computer terminal may include one or more (only one is shown in fig. 1) processors 102 (the processors 102 may include, but are not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like) and a memory 104 for storing data, and optionally, the computer terminal may further include a transmission device 106 for a communication function and an input-output device 108. It will be appreciated by those skilled in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the computer terminal described above. For example, the computer terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the information extraction method in the embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, there is provided an information extraction method running on the above computer terminal or network architecture, and fig. 2 is a flowchart of the information extraction method according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
step S202, carrying out text recognition on an instruction file to be processed to obtain text information, wherein the instruction file comprises a plurality of instructions in a preset format;
Step S204, performing visual analysis on the instruction file to obtain image information;
Step S206, processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format.
In the embodiment of the invention, through steps S202 to S206, the image information and the text information of an instruction file can be effectively fused to extract the key fields of a hosted instruction. This solves the problem in the related art that the complex information hierarchy of hosted instruction files leads to poor text recognition accuracy, and by improving extraction accuracy it further improves the working efficiency of the new-generation hosted instruction service.
In this embodiment, the instruction file includes a plurality of instructions in a preset format; the specific instruction format, which depends on the filling convention each financial institution follows, is not limited.
In this embodiment, a planar layout algorithm model (FLAT-Layout) is designed by integrating document layout information into the FLAT (Flat-Lattice Transformer) algorithm, which strengthens the algorithm's ability to interpret documents with complex structure, table documents in particular.
Optionally, step S202 performs text recognition on the instruction file to be processed to obtain text information, including the following steps:
step S2022, performing optical character recognition (Optical Character Recognition, abbreviated as OCR) on the instruction file to obtain an optical character recognition result composed of a plurality of characters;
Step S2024, performing layout knowledge analysis on the instruction file to obtain two-dimensional absolute position information of the plurality of characters in the instruction file, and correcting the optical character recognition result based on the two-dimensional absolute position information;
In step S2026, the corrected optical character recognition result is subjected to Named entity recognition (Named-Entity Recognition, NER for short) to obtain a plurality of Named entities.
In this embodiment, the text information includes the plurality of characters and the plurality of named entities, and the two-dimensional absolute position information includes coordinate positions of four boundary points.
In some embodiments, named entity recognition may be based on rules, statistics, or deep learning, as the invention is not limited in this regard.
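As an illustration of the layout-based correction of step S2024, the following minimal sketch re-orders OCR tokens by their two-dimensional absolute positions. It is a sketch only: `OcrToken`, `correct_reading_order` and the `row_tolerance` heuristic are hypothetical names introduced here, not part of the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OcrToken:
    text: str                       # recognized character(s)
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1), origin at top-left

def correct_reading_order(tokens: List[OcrToken],
                          row_tolerance: int = 10) -> List[OcrToken]:
    """Group tokens whose vertical centers lie within row_tolerance
    pixels into one visual row, then read rows top-to-bottom and
    tokens within each row left-to-right."""
    rows: List[List[OcrToken]] = []
    for tok in sorted(tokens, key=lambda t: (t.box[1] + t.box[3]) / 2):
        cy = (tok.box[1] + tok.box[3]) / 2
        if rows:
            first = rows[-1][0]
            if abs(cy - (first.box[1] + first.box[3]) / 2) <= row_tolerance:
                rows[-1].append(tok)
                continue
        rows.append([tok])
    return [t for row in rows for t in sorted(row, key=lambda r: r.box[0])]
```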
Optionally, after performing named entity recognition on the corrected optical character recognition result in step S2026 to obtain a plurality of named entities, the method further includes: step S2028, performing relative position coding on the plurality of characters and the plurality of named entities to obtain relative position information of each of the characters or the named entities in the text information, where the text information further includes the relative position information, and the relative position information includes a head vector and a tail vector.
In this embodiment, relative position coding constructs two positional representations, a head vector (Head) and a tail vector (Tail), for each character or named entity; these are, respectively, the start index and the end index of that character or entity in the character sequence.
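For example, the head and tail indices can be derived as follows; this is a sketch under the assumption that lexicon/entity matches are given as (word, start-index) pairs, with all names chosen for illustration.

```python
from typing import List, Tuple

def head_tail_spans(chars: List[str],
                    matches: List[Tuple[str, int]]) -> List[Tuple[int, int]]:
    """Each character occupies one position (head == tail); each matched
    named entity spans its start..end character indices."""
    spans = [(i, i) for i in range(len(chars))]   # characters
    for word, start in matches:                   # named entities / words
        spans.append((start, start + len(word) - 1))
    return spans

# e.g. head_tail_spans(list("北京银行"), [("北京", 0), ("银行", 2)])
# -> [(0, 0), (1, 1), (2, 2), (3, 3), (0, 1), (2, 3)]
```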
Optionally, performing visual analysis on the instruction file in step S204 to obtain image information may include: inputting the instruction file into a pre-trained regional convolutional neural network model (Region-based Convolutional Neural Network, R-CNN for short) to obtain the image information output by the model, wherein the image information comprises regional features of one or more regions of interest (Region of Interest, ROI for short) in the instruction file.
In an exemplary embodiment, the pre-trained regional convolutional neural network model may be a Faster R-CNN model, but the present application is not limited thereto.
In some embodiments, inputting the instruction file into a pre-trained regional convolutional neural network model to obtain the image information output by the regional convolutional neural network model may include: capturing text position information in the instruction file through the area convolution neural network model, segmenting the instruction file into a plurality of slicing areas, determining one or more areas of interest from the plurality of slicing areas according to the text position information, and outputting area characteristics of the one or more areas of interest, wherein the area characteristics comprise: region identification, text direction, and region location.
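The sketch below shows what such region-of-interest extraction could look like with an off-the-shelf detector. The torchvision Faster R-CNN and its pretrained weights are stand-ins (assumptions); the patent does not name a concrete implementation, and a real system would be fine-tuned on labeled instruction pages.

```python
import torch
import torchvision

# Generic detector standing in for the patent's regional CNN.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

page = torch.rand(3, 1024, 768)   # stand-in for a scanned page, (C, H, W) in [0, 1]
with torch.no_grad():
    pred = model([page])[0]       # dict with 'boxes', 'labels', 'scores'

keep = pred["scores"] > 0.5       # keep confident detections as ROIs
roi_boxes = pred["boxes"][keep]   # (N, 4) region locations (x0, y0, x1, y1)
```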
In some embodiments, the visual analysis of the instruction file in step S204 may also classify the file's typesetting, including but not limited to horizontal-layout instructions, vertical-layout instructions, single-instruction files and multi-instruction files.
In some embodiments, the image information obtained by performing visual analysis on the instruction file in step S204 may further include visual information such as a font, a text direction, and a color.
Optionally, processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format, including: inputting the text information into the planar layout algorithm model through a vector matrix formed by seven layers of embedded vectors; inputting the image information into the planar layout algorithm model through an image embedding vector; and carrying out joint modeling and semantic alignment on the text information and the image information based on the seven-layer character embedding vector and the image embedding vector through the plane layout algorithm model to obtain target instruction information, wherein the target instruction information comprises field information of a plurality of preset fields, the text information is used for indicating field information and field positions of the plurality of fields, and the image information is used for indicating the position of the target instruction information in the instruction file.
Optionally, the text information includes: the method comprises the steps of providing a character, a named entity, two-dimensional absolute position information of the character or the named entity in an instruction file, and relative position information of the character or the named entity in the instruction file, wherein the two-dimensional absolute position information comprises coordinate positions of four boundary points, and the relative position information comprises a head vector and a tail vector; the image information includes: region identification, text direction, and region location of one or more regions of interest.
In some embodiments, the seven-layer character-embedded vector may include: text vectors (character/field values of named entities), two relative position vectors (head and tail vectors), and four absolute position vectors. For example, a coordinate system may be constructed with the upper left corner of the instruction file as the origin, with four absolute position vectors (X0, Y0, X1, Y1) defining the boundary positions of the character/named entity in the instruction file.
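A minimal sketch of assembling these seven per-token inputs into a vector matrix, assuming token ids, head/tail spans and bounding boxes are already available (function and variable names are illustrative):

```python
import torch

def token_feature_matrix(token_ids, spans, boxes):
    """One row per character/named entity:
    [text id, head, tail, x0, y0, x1, y1],
    with coordinates measured from the page's top-left corner."""
    rows = [[tid, head, tail, x0, y0, x1, y1]
            for tid, (head, tail), (x0, y0, x1, y1)
            in zip(token_ids, spans, boxes)]
    return torch.tensor(rows, dtype=torch.long)   # shape (seq_len, 7)
```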
Optionally, the target instruction information includes field information of a plurality of preset fields, and after the text information and the image information are processed through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format, the method further includes: judging whether the confidence coefficient of the field information of each preset field in the target instruction information is larger than or equal to a preset confidence coefficient threshold value or not through a preset text classification model.
In an exemplary embodiment, the preset fields include, but are not limited to: payment time, payer account, payer issuer, payer payment system number, uppercase amount, lowercase amount, payment use, etc.
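A sketch of this confidence check, assuming the classification model returns a (value, confidence) pair per field; the 0.9 threshold is an illustrative value, not one given in the patent.

```python
CONFIDENCE_THRESHOLD = 0.9   # illustrative; the patent only calls it "preset"

def fields_needing_review(extracted):
    """Given {field_name: (value, confidence)}, return the fields whose
    confidence falls below the threshold for routing to manual review."""
    return {name: value
            for name, (value, conf) in extracted.items()
            if conf < CONFIDENCE_THRESHOLD}

# fields_needing_review({"uppercase_amount": ("壹万元整", 0.97),
#                        "payer_account": ("6222 0000 0000 0000", 0.62)})
# -> {"payer_account": "6222 0000 0000 0000"}
```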
According to embodiments of the invention, document layout information (contained in the image information) can be introduced into the FLAT algorithm so that text and document layout are modeled jointly. This solves the problem in the related art that the complex information hierarchy of hosted instruction files leads to poor text recognition accuracy, and by improving extraction accuracy it further improves the working efficiency of the new-generation hosted instruction service.
FIG. 3 is a schematic diagram of the FLAT algorithm according to an embodiment of the present invention. As shown in FIG. 3, FLAT is a dynamic structure that incorporates lexical information on top of a Transformer encoder.
In this embodiment, the FLAT algorithm constructs a head vector and a tail vector for each character and each word (i.e., named entity), and related words share the corresponding position information. The FLAT algorithm thereby directly models the interactions between characters and all matching lexical items, and characterizes the relative position of each character or word.
In this embodiment, the Transformer encoder may tag each character in the character sequence to mark its position within a named entity, for example B-LOC (beginning), I-LOC (inside) or E-LOC (end) for a location entity.
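For example, a three-character location span could be tagged as follows (a hypothetical illustration):

```python
# Tagging the location "北京市" (Beijing) inside a character sequence:
tags = [("北", "B-LOC"), ("京", "I-LOC"), ("市", "E-LOC")]
```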
In this embodiment, the FLAT algorithm can capture the structural information of documents with complex structure (such as table documents) while understanding the semantics of the hosted instruction; it supports parallelized computation, greatly improving inference speed and allowing key fields to be identified quickly.
In this embodiment, interactions between spans are characterized using relative position encoding; the spans of different lexical items have different lengths, and according to their start and end positions two spans can stand in three relations: intersection, containment and separation. To characterize both the relation between two spans and their distance, a dense vector may be used to encode the relationship between two characters (or words) $x_i$ and $x_j$, built from four relative distances:

$d_{ij}^{(hh)} = head[i] - head[j]$, the distance from the head vector of $x_i$ to the head vector of $x_j$;
$d_{ij}^{(ht)} = head[i] - tail[j]$, the distance from the head vector of $x_i$ to the tail vector of $x_j$;
$d_{ij}^{(th)} = tail[i] - head[j]$, the distance from the tail vector of $x_i$ to the head vector of $x_j$;
$d_{ij}^{(tt)} = tail[i] - tail[j]$, the distance from the tail vector of $x_i$ to the tail vector of $x_j$.
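These four distance matrices can be computed directly from the head/tail index vectors, as in this minimal sketch (names illustrative):

```python
import torch

def relative_distances(heads, tails):
    """Return the four (n, n) matrices d_hh, d_ht, d_th, d_tt used by
    the relative position encoding above."""
    h = torch.tensor(heads).unsqueeze(1)   # (n, 1) head indices
    t = torch.tensor(tails).unsqueeze(1)   # (n, 1) tail indices
    return (h - h.T,   # d_hh: head(i) - head(j)
            h - t.T,   # d_ht: head(i) - tail(j)
            t - h.T,   # d_th: tail(i) - head(j)
            t - t.T)   # d_tt: tail(i) - tail(j)
```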
FIG. 4 is a schematic structural diagram of the planar layout algorithm model according to an embodiment of the present invention. As shown in FIG. 4, the FLAT-Layout model takes the FLAT algorithm as its backbone and adds two-dimensional absolute position information and image information, capturing respectively the position of each character (token) within the document and visual information such as font, text direction and color.
The FLAT algorithm can introduce external knowledge, exploit the Transformer's modeling of long-range dependencies and the advantages of pre-trained models to improve NER performance, and employs an ingeniously designed relative position encoding. Its use of document layout information, however, is weak. The embodiment of the invention therefore designs the FLAT-Layout algorithm by integrating document layout information into the FLAT algorithm, strengthening the algorithm's ability to interpret table-type documents.
In an exemplary embodiment, the two-dimensional absolute position information may be represented by four position embedding vectors: the document page is treated as a coordinate system with its origin at the upper-left corner, each boundary box is defined by (x0, y0, x1, y1), and four position embedding layers (Position Embeddings) sharing the same embedding dimensions are added.
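A sketch of those four coordinate embedding layers; `max_coord` and `hidden` are assumed hyperparameters, not values taken from the patent.

```python
import torch
import torch.nn as nn

class AbsolutePositionEmbeddings(nn.Module):
    """Embed (x0, y0, x1, y1) box coordinates and sum them into one
    2D layout embedding per token."""
    def __init__(self, max_coord: int = 1024, hidden: int = 768):
        super().__init__()
        self.x0 = nn.Embedding(max_coord, hidden)
        self.y0 = nn.Embedding(max_coord, hidden)
        self.x1 = nn.Embedding(max_coord, hidden)
        self.y1 = nn.Embedding(max_coord, hidden)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (seq_len, 4) long tensor of page coordinates
        return (self.x0(boxes[:, 0]) + self.y0(boxes[:, 1])
                + self.x1(boxes[:, 2]) + self.y1(boxes[:, 3]))
```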
In an exemplary embodiment, the image information may be characterized by image embedding vectors (Image Embeddings): the document page image is segmented into a sequence of small pictures, and feature tokens for the regions of interest (ROI) of the whole picture are modeled based on Faster R-CNN, yielding the image embedding vectors.
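To turn pooled ROI features into image embedding tokens, they can be projected to the model's hidden size; the 1024 and 768 dimensions below are assumptions made for illustration.

```python
import torch
import torch.nn as nn

roi_features = torch.randn(5, 1024)          # pooled features of 5 ROIs
to_image_embedding = nn.Linear(1024, 768)    # project to model hidden size
image_embeddings = to_image_embedding(roi_features)   # (5, 768) image tokens
```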
In this embodiment, the FLAT-Layout algorithm thus introduces document layout information on top of the FLAT algorithm's external knowledge and its lattice-based, relative-position-encoded parallelism, thereby realizing joint modeling of text and document layout.
In another embodiment of the invention, a hosted instruction intelligent recognition model is built on the information extraction method above, and a business process for the hosted file information extraction service is designed around the characteristics of hosted instruction files.
FIG. 5 is a schematic diagram of a business process of a hosted file information extraction service according to an embodiment of the present invention. As shown in FIG. 5, the business process includes the following steps:
Step S501, receive mail or fax, and download and store the attachments.
Step S502, sort the attachments, distinguishing hosting instruction files, transaction lists, attachments, public attachments, invalid files, and the like.
Step S503, invoke OCR on the hosting instruction file, recognizing characters and their positions through general-purpose OCR that returns coordinates, and recognizing stamps through stamp detection.
Step S504, extract the key information through the hosting instruction intelligent recognition model. Based on the characteristics of hosted instruction files, the model distinguishes the text and table areas of the instruction file, acquires the hierarchical structure of the tables through the semantic location model, and thereby obtains the key-value pairs of the preset fields.
Step S505, check the accuracy of the extracted preset fields through the text classification model, for example by judging whether each field's confidence is greater than or equal to the preset confidence threshold, and return the results to the hosting service system. The preset fields include transfer time, payer account, payer issuing bank, payer payment system number, uppercase amount, lowercase amount, payment purpose, and the like.
Step S506, send the extracted fields to service personnel for review and archive the file.
In this embodiment, the hosting instruction intelligent recognition model of step S504 is the key step of the whole business service; its information extraction may be performed according to the steps in any of the above method embodiments. The accuracy of this model directly determines the working efficiency of the new-generation hosted intelligent extraction service.
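A compact sketch of how steps S501 to S505 chain together; the patent defines no API, so every function name and stub body below is a hypothetical stand-in.

```python
from typing import Dict, Optional, Tuple

def classify_attachment(path: str) -> str:                           # S502
    return "hosting_instruction"                                     # stub

def run_ocr(path: str) -> list:                                      # S503
    return []                                                        # stub

def extract_fields(tokens: list) -> Dict[str, Tuple[str, float]]:    # S504
    return {"payer_account": ("6222 0000 0000 0000", 0.62)}          # stub

def process(path: str, threshold: float = 0.9) -> Optional[dict]:
    """Route a downloaded attachment (S501) through the pipeline and
    flag low-confidence fields for manual review (S505)."""
    if classify_attachment(path) != "hosting_instruction":
        return None                   # non-instruction attachments skip extraction
    fields = extract_fields(run_ocr(path))
    needs_review = {k: v for k, (v, c) in fields.items() if c < threshold}
    return {"fields": fields, "needs_review": needs_review}
```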
According to embodiments of the invention, the image information and the text information of an instruction file can be effectively fused to extract the key fields of a hosted instruction. This solves the problem in the related art that the complex information hierarchy of hosted instruction files leads to poor text recognition accuracy, and by improving extraction accuracy it further improves the working efficiency of the new-generation hosted instruction service.
In another embodiment of the present invention, the business process of the hosted file information extraction service may be implemented by a separate business system, or may be implemented on top of an existing file hosting system.
FIG. 6 is a schematic diagram of a hosted file system according to an embodiment of the present invention. As shown in FIG. 6, the system may include the following modules:
The file receiving module 61 is configured to receive and store files arriving by mail or fax.
The file sorting module 62 is configured to distinguish file types.
The OCR service module 63 is configured to recognize text in images.
The information extraction module 64 is configured to extract key information through the hosting instruction intelligent recognition model.
The information verification module 65 is configured to determine, based on a confidence model, the confidence of the extraction result for each field.
In this embodiment, the information extraction module 64 extracts information according to the steps in any of the above method embodiments, and is mainly implemented with the FLAT-Layout algorithm.
The hosted file system of the embodiment of the invention provides, on top of an online processing scheme, an algorithm architecture that combines document structure information with visual information: image features bring the visual appearance of the characters into the algorithm, strengthening its ability to recognize and understand hosted instructions. Furthermore, the invention provides the Transformer-based FLAT-Layout algorithm, which fuses lexical information in a dynamic structure; its character-word lattice representation of position information supports parallelized computation, greatly improving algorithm speed, while document layout information is introduced so that text and document layout are modeled jointly.
According to embodiments of the invention, the hosted business can be converted from offline manual entry to online automatic processing, realizing adaptive recognition of different instruction formats, supporting various data verification modes and reducing the workload of manual review. Recognition quality can also be improved through model updates and rapid iterative fixes; any field on a document can be recognized, with no limit on the number of fields. Furthermore, the online processing flow records the operations of every step and allows data to be traced back when necessary.
FIG. 7 is a schematic structural diagram of the hosted instruction intelligent recognition model according to an embodiment of the present invention. As shown in FIG. 7, the algorithm structure of the model is mainly divided into the following parts: visual parsing, optical character recognition, layout knowledge analysis, named entity recognition, and the FLAT algorithm.
In this embodiment, the hosting instruction file is first scanned by OCR to resolve the text and text boxes and obtain their coordinates. OCR parses the page as a linear sweep, but a hosted instruction is usually laid out as a table, so its text cannot simply be arranged "top to bottom, left to right" as in an ordinary document's reading order. To obtain the hierarchical structure of the hosted instruction file, layout knowledge analysis is therefore further performed to correct the OCR scanning results.
In this embodiment, to further improve the accuracy of the model, visual analysis may be performed on the instruction file to divide files by typesetting into horizontal-layout instructions, vertical-layout instructions, single-instruction files and multi-instruction files.
In this embodiment, on the basis of the corrected OCR recognition result, key field information such as time, account, bank name, and amount may be recognized by a named entity recognition algorithm.
In this embodiment, an instruction file contains both payer information and payee information, so fields such as account, bank name, account number and payment system number cannot be identified by a named entity recognition algorithm alone and must also be analyzed with the assistance of layout knowledge. It is therefore also necessary to acquire the text embeddings (relative position information) and position embeddings (two-dimensional absolute position information) corresponding to these fields and input them into the FLAT algorithm (here including the FLAT-Layout algorithm).
In this embodiment, the FLAT algorithm is based on the Transformer and jointly models text and layout information, effectively improving the model service's semantic alignment and layout analysis capabilities on complex table files such as instruction files.
In this embodiment, once the key field information and positions in the instruction file have been accurately parsed, the field extraction results may be post-processed; for example, information such as the number of instructions can be obtained through instruction alignment.
According to embodiments of the invention, layout information and text information can be effectively fused on the basis of the FLAT algorithm to extract the key fields of hosted instructions. Moreover, because hosted instructions come in many varieties and templates with large differences in format, the intelligent recognition model cannot rely on a single algorithm; combining visual parsing, OCR, layout knowledge analysis, semantic understanding and layout knowledge enhancement improves the accuracy of the instruction recognition results.
The algorithm architecture of the embodiments has at least the following advantage: hosted instructions exhibit multimodal information, containing document structure information and visual information in addition to text. Within the document structure, the positional relations of the text carry rich semantics: key-value pairs are typically arranged side by side or top to bottom and stand in a specific type relationship; likewise, in a form document the text is usually arranged in a grid, with headers in the first row or column. Through pre-training, this position information, naturally aligned with the text, provides richer semantics for downstream information extraction tasks. As for visual information, in rich-text documents the visual presentation of the text can also aid downstream tasks beyond the positional relations of the text itself.
According to another embodiment of the present invention, there is also provided an information extraction apparatus.
Fig. 8 is a block diagram of the information extraction device according to the embodiment of the present invention, and as shown in fig. 8, the information extraction device includes the following structure:
the identifying module 82 is configured to perform text identification on an instruction file to be processed to obtain text information, where the instruction file includes a plurality of instructions in a preset format;
the parsing module 84 is configured to perform visual parsing on the instruction file to obtain image information;
The processing module 86 is configured to process the text information and the image information through a pre-trained planar layout algorithm model, so as to obtain target instruction information in a preset format.
Optionally, the identification module includes:
The optical character recognition unit is used for carrying out optical character recognition on the instruction file to obtain an optical character recognition result composed of a plurality of characters;
The layout knowledge analysis unit is used for carrying out layout knowledge analysis on the instruction file to obtain two-dimensional absolute position information of the plurality of characters in the instruction file, and correcting the optical character recognition result based on the two-dimensional absolute position information;
The named entity recognition unit is used for carrying out named entity recognition on the corrected optical character recognition result to obtain a plurality of named entities; wherein the text information includes the plurality of characters and the plurality of named entities, and the two-dimensional absolute position information includes coordinate positions of four boundary points.
Optionally, the identification module further comprises: the coding unit is used for carrying out relative position coding on the plurality of characters and the plurality of named entities to obtain relative position information of each character or the named entity in the text information, wherein the text information also comprises the relative position information, and the relative position information comprises a head vector and a tail vector.
Optionally, the parsing module includes: the visual analysis unit is used for inputting the instruction file into a pre-trained regional convolution neural network model to obtain the image information output by the regional convolution neural network model, wherein the image information comprises regional characteristics of one or more regions of interest in the instruction file.
Optionally, the visual analysis unit is configured to capture text position information in the instruction file through the area convolutional neural network model, segment the instruction file into a plurality of slice areas, determine one or more regions of interest from the plurality of slice areas according to the text position information, and output area features of the one or more regions of interest, where the area features include: region identification, text direction, and region location.
Optionally, the processing module includes:
An input unit for inputting the text information into the planar layout algorithm model through a vector matrix composed of seven layers of embedded vectors; inputting the image information into the planar layout algorithm model through an image embedding vector;
The algorithm processing unit is used for carrying out joint modeling and semantic alignment on the text information and the image information based on the seven-layer character embedding vector and the image embedding vector through the plane layout algorithm model to obtain target instruction information, wherein the target instruction information comprises field information of a plurality of preset fields, the text information is used for indicating field information and field positions of the plurality of fields, and the image information is used for indicating the position of the target instruction information in the instruction file.
Optionally, the text information includes: the method comprises the steps of providing a character, a named entity, two-dimensional absolute position information of the character or the named entity in an instruction file, and relative position information of the character or the named entity in the instruction file, wherein the two-dimensional absolute position information comprises coordinate positions of four boundary points, and the relative position information comprises a head vector and a tail vector; the image information includes: region identification, text direction, and region location of one or more regions of interest.
Optionally, the target instruction information includes field information of a plurality of preset fields.
Optionally, the apparatus further comprises: the judging module is used for judging whether the confidence coefficient of the field information of each preset field in the target instruction information is larger than or equal to a preset confidence coefficient threshold value through a preset text classification model.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; or the above modules may be located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, performs the steps of any of the method embodiments described above.
Alternatively, in this embodiment, the above storage medium may be configured to store a computer program for performing the following steps:
S1, performing text recognition on an instruction file to be processed to obtain text information, where the instruction file includes a plurality of instructions in a preset format;
S2, performing visual analysis on the instruction file to obtain image information;
S3, processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format.
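Read as a pipeline, steps S1 to S3 chain together as in the sketch below; the three helpers are placeholder stubs standing in for the OCR/NER stage, the regional CNN stage, and the planar layout model, respectively.

```python
def recognize_text(path: str) -> dict:
    # S1 placeholder: OCR, layout analysis, and named entity recognition
    return {"tokens": [], "entities": []}

def parse_visual(path: str) -> dict:
    # S2 placeholder: regional CNN producing region-of-interest features
    return {"regions": []}

def layout_model_infer(text_info: dict, image_info: dict) -> dict:
    # S3 placeholder: planar layout model fusing both modalities
    return {"fields": {}}

def extract_instruction_info(path: str) -> dict:
    """Orchestrates steps S1-S3 over a single instruction file."""
    return layout_model_infer(recognize_text(path), parse_visual(path))
```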
Alternatively, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing a computer program.
An embodiment of the invention also provides an electronic device comprising a memory and a processor, where the memory stores a computer program and the processor is arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic device may further include a transmission device and an input/output device, where both the transmission device and the input/output device are connected to the processor.
Alternatively, in this embodiment, the above processor may be configured to perform the following steps by means of a computer program:
S1, performing text recognition on an instruction file to be processed to obtain text information, where the instruction file includes a plurality of instructions in a preset format;
S2, performing visual analysis on the instruction file to obtain image information;
S3, processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format.
Alternatively, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, and details are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices; they may be implemented in program code executable by computing devices, so that the code may be stored in a storage device and executed by the computing devices; in some cases the steps shown or described may be performed in an order different from that given here; and the modules or steps may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the principle of the present invention shall fall within its protection scope.
Claims (11)
1. An information extraction method, comprising:
performing text recognition on an instruction file to be processed to obtain text information, wherein the instruction file comprises a plurality of instructions in a preset format;
performing visual analysis on the instruction file to obtain image information;
and processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format.
2. The method of claim 1, wherein performing text recognition on the instruction file to be processed to obtain the text information comprises:
performing optical character recognition on the instruction file to obtain an optical character recognition result composed of a plurality of characters;
performing layout knowledge analysis on the instruction file to obtain two-dimensional absolute position information of the plurality of characters in the instruction file, and correcting the optical character recognition result based on the two-dimensional absolute position information;
performing named entity recognition on the corrected optical character recognition result to obtain a plurality of named entities;
wherein the text information includes the plurality of characters and the plurality of named entities, and the two-dimensional absolute position information includes the coordinate positions of four boundary points.
3. The method of claim 2, wherein after performing named entity recognition on the corrected optical character recognition result to obtain a plurality of named entities, the method further comprises:
performing relative position encoding on the plurality of characters and the plurality of named entities to obtain relative position information of each character or named entity in the text information, wherein the text information further includes the relative position information, and the relative position information includes a head vector and a tail vector.
4. The method of claim 1, wherein performing visual analysis on the instruction file to obtain the image information comprises:
inputting the instruction file into a pre-trained regional convolutional neural network model to obtain the image information output by the regional convolutional neural network model, wherein the image information includes region features of one or more regions of interest in the instruction file.
5. The method of claim 4, wherein inputting the instruction file into a pre-trained regional convolutional neural network model to obtain the image information output by the regional convolutional neural network model comprises:
capturing text position information in the instruction file through the regional convolutional neural network model, segmenting the instruction file into a plurality of slice regions, determining one or more regions of interest from the plurality of slice regions according to the text position information, and outputting the region features of the one or more regions of interest, wherein the region features include: a region identifier, a text direction, and a region location.
6. The method according to claim 1, wherein processing the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format comprises:
inputting the text information into the planar layout algorithm model as a vector matrix composed of seven layers of embedding vectors;
inputting the image information into the planar layout algorithm model as an image embedding vector;
and performing joint modeling and semantic alignment on the text information and the image information through the planar layout algorithm model, based on the seven layers of character embedding vectors and the image embedding vector, to obtain the target instruction information, wherein the target instruction information includes field information of a plurality of preset fields, the text information indicates the field information and field positions of the plurality of fields, and the image information indicates the position of the target instruction information in the instruction file.
7. The method of claim 6, wherein:
the text information includes: the characters, the named entities, the two-dimensional absolute position information of each character or named entity in the instruction file, and the relative position information of each character or named entity in the instruction file, wherein the two-dimensional absolute position information includes the coordinate positions of four boundary points, and the relative position information includes a head vector and a tail vector;
the image information includes: the region identifier, text direction, and region location of the one or more regions of interest.
8. The method according to claim 1, wherein the target instruction information includes field information of a plurality of preset fields, and after the text information and the image information are processed by the pre-trained planar layout algorithm model, the method further comprises:
judging, through a preset text classification model, whether the confidence of the field information of each preset field in the target instruction information is greater than or equal to a preset confidence threshold.
9. An information extraction apparatus, comprising:
an identification module configured to perform text recognition on an instruction file to be processed to obtain text information, wherein the instruction file includes a plurality of instructions in a preset format;
an analysis module configured to perform visual analysis on the instruction file to obtain image information;
and a processing module configured to process the text information and the image information through a pre-trained planar layout algorithm model to obtain target instruction information in a preset format.
10. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program, when executed by a processor, performs the method of any one of claims 1 to 8.
11. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program and the processor is arranged to run the computer program to perform the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410256448.3A CN118070789A (en) | 2024-03-06 | 2024-03-06 | Information extraction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118070789A (en) | 2024-05-24
Family
ID=91105496
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410256448.3A Pending CN118070789A (en) | 2024-03-06 | 2024-03-06 | Information extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118070789A (en) |
Similar Documents
Publication | Title
---|---
US10943105B2 (en) | Document field detection and parsing
US11501061B2 (en) | Extracting structured information from a document containing filled form images
CN109543690B (en) | Method and device for extracting information
CN112508011A (en) | OCR (optical character recognition) method and device based on neural network
US20210366055A1 (en) | Systems and methods for generating accurate transaction data and manipulation
WO2007080642A1 (en) | Sheet slip processing program and sheet slip program device
CN113469067B (en) | Document analysis method, device, computer equipment and storage medium
CN113033269B (en) | Data processing method and device
CN109446345A (en) | Nuclear power file verification processing method and system
CN112883980B (en) | Data processing method and system
CN114612921B (en) | Form recognition method and device, electronic equipment and computer readable medium
CN114971294A (en) | Data acquisition method, device, equipment and storage medium
CN114724156A (en) | Form identification method and device and electronic equipment
CN113673528B (en) | Text processing method, text processing device, electronic equipment and readable storage medium
CN115880702A (en) | Data processing method, device, equipment, program product and storage medium
CN115294593A (en) | Image information extraction method and device, computer equipment and storage medium
CN115512340A (en) | Intention detection method and device based on picture
CN115223182A (en) | Document layout identification method and related device
Bhatt et al. | Text Extraction & Recognition from Visiting Cards
CN118070789A (en) | Information extraction method and device
CN113901817A (en) | Document classification method and device, computer equipment and storage medium
Rahul et al. | Deep reader: Information extraction from document images via relation extraction and natural language
JP2021033743A (en) | Information processing apparatus, document identification method, and information processing system
CN114780773B (en) | Document picture classification method and device, storage medium and electronic equipment
Ylisiurunen | Extracting semi-structured information from receipts
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |