CN112800848A - Structured extraction method, device and equipment of information after bill identification - Google Patents

Structured extraction method, device and equipment of information after bill identification Download PDF

Info

Publication number
CN112800848A
CN112800848A CN202011628351.9A
Authority
CN
China
Prior art keywords
text
information
template
bill
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011628351.9A
Other languages
Chinese (zh)
Inventor
刘渊
张科
梁扩战
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Jinxin Software Co Ltd
Original Assignee
Zhongdian Jinxin Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Jinxin Software Co Ltd filed Critical Zhongdian Jinxin Software Co Ltd
Priority to CN202011628351.9A priority Critical patent/CN112800848A/en
Publication of CN112800848A publication Critical patent/CN112800848A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/418Document matching, e.g. of document images

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Character Input (AREA)

Abstract

The application provides a structured extraction method, device and equipment for information after bill identification, wherein the method comprises the following steps: acquiring image information of a bill to be identified; analyzing the image information, and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill; classifying the text information, and selecting a semantically matched target data template from a preset template library according to the classification result; and extracting text data from the text information according to the text information, the position information and the target data template. The method and the device realize template alignment through dual matching of coordinates and semantic concepts, so that templates remain aligned even when the number of lines or words of the text changes dynamically; the components of the information are determined based on the template, the precision of structured information extraction for bills with complex layouts is improved, and the accuracy of data identification is ultimately improved.

Description

Structured extraction method, device and equipment of information after bill identification
Technical Field
The application relates to the technical field of data identification, in particular to a structured extraction method, a structured extraction device and structured extraction equipment for information after bill identification.
Background
OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method. OCR technology is widely used in related fields such as handwriting recognition, print recognition, and text image recognition, with applications including document recognition, bank card recognition, and advertisement and poster recognition, and it greatly simplifies data processing.
In the field of bill recognition, a bill image is first input into an OCR model, and unstructured data is output. After bill recognition, unstructured data is converted into structured data, and the structured data is generally formed by matching bills with templates and extracting data from the unstructured data according to data extraction rules in the templates.
However, a common method in the prior art is alignment by optical anchor points. If the number of lines or words of the text changes dynamically, it is difficult to determine from the template which contents belong to which region, so the robustness of template alignment under such dynamic changes is poor.
Disclosure of Invention
The embodiments of the present application aim to provide a method, a device and equipment for structured extraction of information after bill identification, in which template alignment is realized through dual matching of coordinates and semantic concepts, so that templates remain aligned even when the number of lines or words of the text changes dynamically, and the accuracy of data identification is improved.
The first aspect of the embodiments of the present application provides a bill identification method, including: acquiring image information of a bill to be identified; analyzing the image information, and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill; classifying the text information, and selecting a semantically matched target data template from a preset template library according to the classification result; and extracting text data from the text information according to the text information, the position information and the target data template.
In an embodiment, the analyzing the image information and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill includes: recognizing the image information and generating a text library of the bill, wherein the text library comprises the full text content of the bill and the coordinate information of each character on the bill; and selecting, from the text library, the target text content pointed to by each preset field as the text information of that preset field, wherein the position information is the target coordinate range in which the target text content is located.
In an embodiment, the classifying the text information and selecting a target data template with semantic matching from a preset template library according to a classification result includes: identifying target semantic information of the target text content aiming at each preset field; and selecting the target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
In one embodiment, the target data template includes a plurality of labeling boxes marked with semantic labels and position labels, and the extracting text data from the text information according to the text information, the position information and the target data template comprises: for each preset field, respectively calculating the overlapping rate of the position information and the position label of each labeling box in the target data template, and taking the labeling boxes whose overlapping rate is larger than a preset threshold as candidate labeling boxes; among the candidate labeling boxes, respectively calculating the semantic similarity between the text information under the same preset field and the semantic label in each candidate labeling box, and selecting the candidate labeling box with the largest semantic similarity as the template labeling box of that preset field; and extracting the text data of the text information labeled by the template labeling box.
In an embodiment, the extracting the text data of the text information labeled by the template labeling box includes: for each preset field, calling the data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
A second aspect of the embodiments of the present application provides a bill identification device, including: an acquisition module for acquiring the image information of a bill to be identified; a parsing module for analyzing the image information and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill; a matching module for classifying the text information and selecting a semantically matched target data template from a preset template library according to the classification result; and an extraction module for extracting text data from the text information according to the text information, the position information and the target data template.
In one embodiment, the parsing module is configured to: identifying the image information, and generating a text library of the bill, wherein the text library comprises: the whole text content of the bill and the coordinate information of each character on the bill; and selecting target text content pointed by each preset field from the text library as the text information of the preset field, wherein the position information is the target coordinate range where the target text content is located.
In one embodiment, the matching module is configured to: identifying target semantic information of the target text content aiming at each preset field; and selecting the target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
In one embodiment, the target data template includes a plurality of labeling boxes marked with semantic labels and position labels, and the extraction module is configured to: for each preset field, respectively calculate the overlapping rate of the position information and the position label of each labeling box in the target data template, and take the labeling boxes whose overlapping rate is larger than a preset threshold as candidate labeling boxes; among the candidate labeling boxes, respectively calculate the semantic similarity between the text information under the same preset field and the semantic label in each candidate labeling box, and select the candidate labeling box with the largest semantic similarity as the template labeling box of that preset field; and extract the text data of the text information labeled by the template labeling box.
In an embodiment, the extracting the text data of the text information labeled by the template labeling box includes: for each preset field, calling the data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
A third aspect of the embodiments of the present application provides an electronic device, including: a memory to store a computer program; and a processor configured to perform the method of the first aspect of the embodiments of the present application, and any embodiment thereof, to identify text data in a bill.
According to the method, device and equipment for structured extraction of information after bill identification, the image information of the bill to be identified is first acquired; the text information of the bill and its coordinates on the bill are then obtained by analyzing the image information; and finally, the text information of the bill is matched against a target data template through dual matching of coordinates and semantic concepts, realizing template alignment, so that the text data of the bill is extracted from the text information and the accuracy of data identification is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings based on these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a schematic view of a ticket according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a bill identification method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a bill identification method according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a target data template according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a bill identifying device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12, one processor being exemplified in fig. 1. Processor 11 and memory 12 are connected by bus 10, and memory 12 stores instructions executable by processor 11, and the instructions are executed by processor 11 to cause electronic device 1 to perform all or part of the flow of the method in the embodiments described below to identify text data in a ticket.
In an embodiment, the electronic device 1 may be a mobile phone, a notebook computer, a desktop computer, or a computing system composed of multiple computers.
Please refer to fig. 2, which is a schematic diagram of a bill 2 according to an embodiment of the present application. The bill 2 may be a contract, a receipt, an invoice, a ticket, an agreement, a note, etc., and contains various information such as the bill type, the bill date, and the bill number. Taking a contract as an example, one contract contains a variety of information; in actual contract management, to facilitate the management of contract data, the specific attribute information of a contract is generally filed and stored in a targeted manner to form structured data for subsequent queries. For example, a contract contains the bill type "contract", the bill date (the contract date), and the bill number (the contract number), and during contract recognition these different types of contract information need to be identified and stored.
Please refer to fig. 3, which shows a bill identification method according to an embodiment of the present application; the method can be executed by the electronic device 1 shown in fig. 1 and can be applied in the bill recognition scenario shown in fig. 2 to recognize text data in a bill. The method comprises the following steps:
step 301: and acquiring the image information of the bill to be identified.
In this step, the bill may be a contract, a receipt, an invoice, a ticket, an agreement, a note, or the like, and the image information may be a photo or a scanned copy of the bill. The image information of the bill to be identified can be obtained by taking a picture on site or retrieved from a preset database.
Step 302: analyzing the image information, and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill.
In this step, the text information of the bill may be the various texts recorded in the bill. Each text in a bill has specific position information, which may be represented in the form of coordinates. By analyzing the image information of the bill, for example through image recognition, the text information of the bill and its position information on the bill can be obtained.
Step 303: and classifying the text information, and selecting a target data template with semantic matching from a preset template library according to a classification result.
In this step, various types of data templates are pre-stored in the preset template library, each configured with a data extraction rule; the data templates and extraction rules can be customized based on the actual needs of the user. When recognizing the bill to be identified, the text information of the bill is classified, and a semantically matched target data template is selected from the preset template library according to the classification result.
Step 304: and extracting text data in the text information according to the text information, the position information and the target data template.
In this step, the text information and the position information of the text information on the bill are comprehensively considered and matched with the target data template, and the text data is then extracted from the text information; the text data can be stored in a structured form to facilitate data management.
According to the above bill identification method, the image information of the bill to be identified is first acquired; the text information of the bill and its coordinates on the bill are then obtained by analyzing the image information; and finally, the text information of the bill is matched against the target data template through dual matching of coordinates and semantic concepts, realizing template alignment, so that the text data of the bill is extracted from the text information and the accuracy of data identification is improved.
Please refer to fig. 4, which shows a bill identification method according to an embodiment of the present application; the method can be executed by the electronic device 1 shown in fig. 1 and can be applied in the bill recognition scenario shown in fig. 2 to recognize text data in a bill. The method comprises the following steps:
step 401: and acquiring the image information of the bill to be identified. See the description of step 301 in the above embodiments for details.
Step 402: recognizing image information and generating a text library of the bill, wherein the text library comprises: the full text content of the ticket and the coordinate information of each character on the ticket.
In this step, the coordinate information may be a coordinate range. The image information of the bill can be recognized using OCR technology: for example, the image information of the bill to be recognized is input into an OCR recognition model, which outputs a recognition result in JSON format. The recognition result includes a text library of the bill, and the text library contains at least the text contents of the bill and the coordinate range corresponding to each character of each text content on the bill. The coordinate ranges are real (absolute) coordinate values.
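To make the text library concrete, the sketch below builds one from a hypothetical JSON recognition result of the kind described above. The field names `text` and `box`, and the sample contents, are illustrative assumptions rather than the actual output schema of any particular OCR model.

```python
import json

# Hypothetical OCR output: one entry per recognized line, each with its
# text content and a coordinate range [x1, y1, x2, y2] on the bill image.
ocr_json = """
[
  {"text": "Purchase Contract", "box": [120, 40, 480, 90]},
  {"text": "Contract No.: HT-2020-001", "box": [60, 120, 420, 160]},
  {"text": "Contract Date: 2020-12-31", "box": [60, 180, 400, 220]}
]
"""

def build_text_library(raw_json):
    """Build a text library: full text content plus per-line coordinate ranges."""
    lines = json.loads(raw_json)
    # Recognize line by line from top to bottom (sort by the top y coordinate).
    lines.sort(key=lambda item: item["box"][1])
    full_text = "\n".join(item["text"] for item in lines)
    return {"full_text": full_text, "lines": lines}

library = build_text_library(ocr_json)
print(library["full_text"])
```

A real text library would additionally carry per-character coordinates, as the specification notes; this sketch keeps per-line ranges for brevity.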
Step 403: and selecting the target text content pointed by each preset field from a text library as the text information of the preset field, wherein the position information is the target coordinate range where the target text content is located.
In this step, a preset field may be a designated field set based on user requirements. Taking a contract as the bill to be recognized, the preset fields may include: the bill type "contract", the contract date, the contract number, and so on. The text library contains all the text content in the contract and the corresponding coordinate information; in an actual scenario, to reduce the amount of data to process, the target text content corresponding to each preset field can be selected as the text information used for data identification, and the position information is the target coordinate range in which the target text content is located. Taking a contract as the bill to be identified, the JSON-formatted text information and position information of the preset fields may be as follows:
(The JSON-formatted text information and position information are shown in the original specification as figures BDA0002879623510000071 and BDA0002879623510000081, which are reproduced only as images and are not rendered here.)
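Because those figures are available only as images, the sketch below shows a hypothetical example of what such per-field JSON text and position information might look like; every field name and value here is an illustrative assumption, not the content of the original figures.

```python
import json

# Hypothetical per-field recognition result: for each preset field, the
# selected target text content and the target coordinate range it occupies.
preset_fields = {
    "bill_type":   {"text": "Contract",    "range": [120, 40, 480, 90]},
    "bill_number": {"text": "HT-2020-001", "range": [230, 120, 420, 160]},
    "bill_date":   {"text": "2020-12-31",  "range": [220, 180, 400, 220]},
}

print(json.dumps(preset_fields["bill_number"]))
```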
step 404: and identifying target semantic information of the target text content aiming at each preset field.
In this step, semantic recognition is performed on the target text content of each preset field, and the ontology concept (target semantic information) of the text content can be determined according to the part-of-speech tagging model trained by the predefined concept corpus.
In an embodiment, semantic recognition can be realized using automatic part-of-speech tagging and Named Entity Recognition (NER, which refers to recognizing entities with specific meanings in text, including person names, place names, organization names, proper nouns, etc.) based on NLP (Natural Language Processing) technology. Taking a contract as an example, word segmentation and part-of-speech tagging can be performed on the full-text recognition result (the text library): for example, the text library is segmented with the word segmentation tool jieba, a part-of-speech tagging model is trained using HanLP (Han Language Processing package) combined with a preset concept vocabulary, and the ontology concept property of each word in the text library is then tagged with this part-of-speech tagging model.
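The tagging idea can be made concrete as follows. This sketch deliberately replaces jieba and a HanLP-trained model with a tiny hand-written concept vocabulary and greedy longest-match lookup, purely to illustrate how words get mapped to ontology concepts; the vocabulary entries and concept names are assumptions.

```python
# Minimal stand-in for the jieba + HanLP pipeline: a concept vocabulary
# mapping words to ontology concepts, applied with greedy longest match.
CONCEPT_VOCAB = {
    "contract number": "BILL_NUMBER",
    "contract date": "BILL_DATE",
    "contract": "BILL_TYPE",
}

def tag_concepts(text):
    """Tag each vocabulary word found in the text with its ontology concept."""
    tags = []
    lowered = text.lower()
    # Try longer vocabulary entries first so "contract date" beats "contract".
    for word in sorted(CONCEPT_VOCAB, key=len, reverse=True):
        if word in lowered:
            tags.append((word, CONCEPT_VOCAB[word]))
            lowered = lowered.replace(word, " " * len(word))  # consume the span
    return tags

print(tag_concepts("Contract Date: 2020-12-31"))
```

In a production system the vocabulary lookup would be replaced by the trained part-of-speech tagging model the specification describes.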
Step 405: and selecting a target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
In this step, a plurality of data templates are pre-stored in the template library. The semantic similarity between the target semantic information corresponding to the current preset field and each data template can be calculated respectively, and the data template with the largest similarity is then selected as the target data template for that preset field. As shown in fig. 5, the target data template 5 records in advance the labeling boxes and the additional information of each labeling box, where the additional information may include the text concept corresponding to each labeling box and the position information of the labeling box. A labeling box may record its position through two percentage coordinates, one for its upper-left corner and one for its lower-right corner.
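One simple way to realize "select the template with the largest similarity" is to score each template by the Jaccard overlap between its concept set and the concepts recognized on the bill. The template names, concept sets, and the choice of Jaccard similarity below are all illustrative assumptions; the specification does not fix a particular similarity measure.

```python
# Hypothetical template library: each template is summarized by the set of
# ontology concepts its labeling boxes carry.
TEMPLATE_LIBRARY = {
    "purchase_contract": {"BILL_TYPE", "BILL_DATE", "BILL_NUMBER", "PARTY_A", "PARTY_B"},
    "vat_invoice":       {"BILL_TYPE", "BILL_DATE", "TAX_ID", "AMOUNT"},
    "receipt":           {"BILL_TYPE", "AMOUNT", "PAYER"},
}

def select_target_template(target_concepts):
    """Return the template whose concept set is most similar to the target's."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    return max(TEMPLATE_LIBRARY,
               key=lambda name: jaccard(target_concepts, TEMPLATE_LIBRARY[name]))

best = select_target_template({"BILL_TYPE", "BILL_DATE", "BILL_NUMBER"})
print(best)
```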
Step 406: the target data template comprises a plurality of labeling boxes marked with semantic labels and position labels; for each preset field, the overlapping rate of the position information and the position label of each labeling box in the target data template is respectively calculated, and the labeling boxes whose overlapping rate is larger than a preset threshold are taken as candidate labeling boxes.
In this step, the target data template includes, but is not limited to, a plurality of labeling boxes marked with semantic labels and position labels. The preset threshold is a threshold on the overlapping rate of the position labels: when the overlapping rate exceeds the preset threshold, the target coordinate range in which the target text content of the preset field is located essentially coincides with the position on the bill corresponding to the candidate labeling box in the target data template. The preset threshold can be obtained from historical statistics of the actual scenario.
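The overlapping rate in step 406 can be computed, for example, as the intersection area of the two rectangles divided by the area of the recognized text's coordinate range. Whether the denominator is the text range, the labeling box, or their union is a design choice the specification leaves open, so the variant below is an assumption.

```python
def overlap_rate(text_range, box_range):
    """Intersection area of two rectangles divided by the text range's area.

    Rectangles are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    ax1, ay1, ax2, ay2 = text_range
    bx1, by1, bx2, by2 = box_range
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    text_area = (ax2 - ax1) * (ay2 - ay1)
    return (inter_w * inter_h) / text_area if text_area else 0.0

# A text range fully inside a labeling box overlaps at rate 1.0.
print(overlap_rate((10, 10, 30, 20), (0, 0, 40, 40)))
```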
Step 407: and in the candidate labeling boxes, respectively calculating the semantic similarity between the text information under the same preset field and the semantic label in each candidate labeling box, and selecting the candidate labeling box with the maximum semantic similarity as the template labeling box of the preset field.
In this step, after step 406 a preset field may still have a plurality of candidate labeling boxes. To further determine the most accurate labeling box, the semantic similarity between the text information of the same preset field and the semantic label in each candidate labeling box is calculated, and the candidate labeling box with the largest semantic similarity is selected as the template labeling box of that preset field. In this way, the ontology concept and the coordinate range of the text content of the preset field are respectively matched against the text concept and the coordinate range of the labeling boxes in the target data template: if the overlapping rate between the coordinate range of the target text content and the coordinate range of a labeling box exceeds the preset threshold, and the semantic concept of the target text content matches the semantic concept corresponding to that labeling box, the target text content of the preset field is confirmed to correspond to that template labeling box, thereby achieving template alignment.
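The dual matching of steps 406 and 407 can be sketched together as below. The threshold value, the crude exact-match similarity, and the sample labeling boxes are all assumptions for illustration; a real system would use the trained semantic similarity the specification describes.

```python
def rect_overlap(a, b):
    """Overlap rate: intersection area over the area of rectangle a."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area = (a[2] - a[0]) * (a[3] - a[1])
    return iw * ih / area if area else 0.0

def concept_similarity(a, b):
    """Crude stand-in for semantic similarity: 1.0 iff concepts are identical."""
    return 1.0 if a == b else 0.0

def match_template_box(field_range, field_concept, boxes, threshold=0.7):
    """Step 406: keep boxes whose overlap rate exceeds the threshold;
    step 407: among those candidates, pick the most semantically similar label."""
    candidates = [b for b in boxes if rect_overlap(field_range, b["range"]) > threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda b: concept_similarity(field_concept, b["label"]))

# Hypothetical labeling boxes of a target data template.
boxes = [
    {"label": "BILL_NUMBER", "range": (60, 110, 430, 170)},
    {"label": "BILL_DATE",   "range": (60, 170, 430, 230)},
]
chosen = match_template_box((70, 120, 420, 160), "BILL_NUMBER", boxes)
print(chosen["label"])
```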
Step 408: for each preset field, calling the data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
In this step, according to the data extraction rule of the template labeling box in the target data template, the corresponding content is extracted from the target text content of the preset field to form structured data. For example, the structured data may be as follows:
(The structured-data example is shown in the original specification as figures BDA0002879623510000101 and BDA0002879623510000111, which are reproduced only as images and are not rendered here.)
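Since the structured result is likewise available only as an image, the sketch below illustrates the idea with hypothetical regex-based extraction rules and a resulting structured record; the rule names, patterns, and field texts are all assumptions, not the rules of the original specification.

```python
import json
import re

# Hypothetical data extraction rules, one per template labeling box:
# each rule pulls the value part out of the recognized field text.
EXTRACTION_RULES = {
    "bill_number": re.compile(r"Contract No\.?:?\s*(?P<value>[A-Z0-9-]+)"),
    "bill_date":   re.compile(r"Contract Date:?\s*(?P<value>\d{4}-\d{2}-\d{2})"),
}

def extract_structured(field_texts):
    """Apply each field's extraction rule and collect a structured record."""
    record = {}
    for field, text in field_texts.items():
        match = EXTRACTION_RULES[field].search(text)
        if match:
            record[field] = match.group("value")
    return record

record = extract_structured({
    "bill_number": "Contract No.: HT-2020-001",
    "bill_date": "Contract Date: 2020-12-31",
})
print(json.dumps(record))
```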
the bill identification method determines the components of the information based on the template, solves the problem of low accuracy of template matching based on image pixel similarity at present, improves the information structured extraction precision of the bill with the complex layout, and finally improves the data identification accuracy.
Please refer to fig. 6, which shows a bill identification device 600 according to an embodiment of the present application. The device is applied to the electronic device 1 shown in fig. 1 and can be used in the bill recognition scenario shown in fig. 2 to identify text data in a bill. The device includes: an acquisition module 601, a parsing module 602, a matching module 603 and an extraction module 604, where the relationships among the modules are as follows:
the acquiring module 601 is configured to acquire image information of a to-be-identified bill. See the description of step 301 in the above embodiments for details.
The parsing module 602 is configured to parse the image information and identify, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill. See the description of step 302 in the above embodiments for details.
And the matching module 603 is configured to classify the text information, and select a target data template with semantic matching from a preset template library according to a classification result. See the description of step 303 in the above embodiments for details.
And an extracting module 604, configured to extract text data in the text information according to the text information, the location information, and the target data template. See the description of step 304 in the above embodiments for details.
In one embodiment, the parsing module 602 is configured to: recognize the image information and generate a text library of the bill, wherein the text library comprises the full text content of the bill and the coordinate information of each character on the bill; and select, from the text library, the target text content pointed to by each preset field as the text information of that preset field, wherein the position information is the target coordinate range in which the target text content is located. See the description of steps 402 to 403 in the above embodiments for details.
In one embodiment, the matching module 603 is configured to: and identifying target semantic information of the target text content aiming at each preset field. And selecting a target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information. See the above embodiments for a detailed description of steps 404 to 405.
In one embodiment, the target data template includes: a plurality of labeling boxes marked with semantic labels and position labels; the extraction module 604 is configured to: and respectively calculating the overlapping rate of the position information and the position label of each marking frame in the target data template aiming at each preset field, and taking the marking frame with the overlapping rate larger than a preset threshold value as a candidate marking frame. And in the candidate labeling boxes, respectively calculating the semantic similarity between the text information under the same preset field and the semantic label in each candidate labeling box, and selecting the candidate labeling box with the maximum semantic similarity as the template labeling box of the preset field. And extracting text data of the text information marked by the template marking box. See the description of step 406 to step 407 in the above embodiments in detail.
In one embodiment, extracting the text data of the text information labeled by the template labeling box includes: for each preset field, invoking the data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on that rule. For details, see the description of step 408 in the above embodiments.
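One plausible form of such a per-field data extraction rule is a regular expression that pulls the payload out of the field's raw text. The field names and patterns below are illustrative assumptions, not rules from the disclosure:

```python
import re

# Hypothetical rule table: each template labeling box's field maps to a rule.
EXTRACTION_RULES = {
    "invoice_no": re.compile(r"(\d{8,})"),                 # long digit run
    "amount":     re.compile(r"([0-9]+(?:\.[0-9]{1,2})?)"),  # decimal amount
}

def extract(field, raw_text):
    """Apply the field's extraction rule to the labeled text information;
    fall back to the stripped raw text when no rule is registered."""
    rule = EXTRACTION_RULES.get(field)
    if rule is None:
        return raw_text.strip()
    m = rule.search(raw_text)
    return m.group(1) if m else None
```

Keeping rules per labeling box lets the same template library cover bills whose fields embed their values in different surrounding text.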
For a detailed description of the bill identification apparatus 600, refer to the description of the related method steps in the above embodiments.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium for an electronic device, including a program that, when run on the electronic device, causes the electronic device to perform all or part of the procedures of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD), or a combination of such memories.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method of bill identification, comprising:
acquiring image information of a bill to be identified;
analyzing the image information, and identifying, line by line from top to bottom, at least one item of text information in the bill and position information of each item of text information on the bill;
classifying the text information, and selecting a target data template with semantic matching from a preset template library according to a classification result;
and extracting text data in the text information according to the text information, the position information and the target data template.
2. The method of claim 1, wherein the analyzing the image information and identifying, line by line from top to bottom, at least one item of text information in the bill and position information of each item of text information on the bill comprises:
identifying the image information, and generating a text library of the bill, wherein the text library comprises: the whole text content of the bill and the coordinate information of each character on the bill;
and selecting target text content pointed by each preset field from the text library as the text information of the preset field, wherein the position information is the target coordinate range where the target text content is located.
3. The method according to claim 2, wherein the classifying the text information and selecting the semantically matched target data template from a preset template library according to the classification result comprises:
for each preset field, identifying target semantic information of the target text content;
and selecting the target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
4. The method of claim 3, wherein the target data template comprises: a plurality of labeling boxes marked with semantic labels and position labels; the extracting text data in the text information according to the text information, the position information and the target data template comprises:
for each preset field, calculating an overlap rate between the position information and the position label of each labeling box in the target data template, and taking labeling boxes whose overlap rate is greater than a preset threshold as candidate labeling boxes;
in the candidate labeling boxes, respectively calculating semantic similarity between the text information under the same preset field and a semantic label in each candidate labeling box, and selecting the candidate labeling box with the largest semantic similarity as a template labeling box of the preset field;
and extracting text data of the text information marked by the template marking box.
5. The method of claim 4, wherein the extracting text data of the text information labeled by the template labeling box comprises:
for each preset field, invoking a data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
6. A bill identifying apparatus, comprising:
the acquisition module is used for acquiring the image information of the bill to be identified;
the analysis module is used for analyzing the image information and identifying, line by line from top to bottom, at least one item of text information in the bill and position information of each item of text information on the bill;
the matching module is used for classifying the text information and selecting a target data template with semantic matching from a preset template library according to a classification result;
and the extraction module is used for extracting the text data in the text information according to the text information, the position information and the target data template.
7. The apparatus of claim 6, wherein the analysis module is configured to:
identifying the image information, and generating a text library of the bill, wherein the text library comprises: the whole text content of the bill and the coordinate information of each character on the bill;
and selecting target text content pointed by each preset field from the text library as the text information of the preset field, wherein the position information is the target coordinate range where the target text content is located.
8. The apparatus of claim 7, wherein the matching module is configured to:
for each preset field, identifying target semantic information of the target text content;
and selecting the target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
9. The apparatus of claim 8, wherein the target data template comprises: a plurality of labeling boxes marked with semantic labels and position labels; the extraction module is configured to:
for each preset field, calculating an overlap rate between the position information and the position label of each labeling box in the target data template, and taking labeling boxes whose overlap rate is greater than a preset threshold as candidate labeling boxes;
in the candidate labeling boxes, respectively calculating semantic similarity between the text information under the same preset field and a semantic label in each candidate labeling box, and selecting the candidate labeling box with the largest semantic similarity as a template labeling box of the preset field;
extracting text data of the text information marked by the template marking box;
the extracting the text data of the text information labeled by the template labeling box comprises the following steps:
for each preset field, invoking a data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
10. An electronic device, comprising:
a memory to store a computer program;
a processor configured to perform the method of any one of claims 1 to 5 to extract text data from a bill.
CN202011628351.9A 2020-12-31 2020-12-31 Structured extraction method, device and equipment of information after bill identification Pending CN112800848A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011628351.9A CN112800848A (en) 2020-12-31 2020-12-31 Structured extraction method, device and equipment of information after bill identification


Publications (1)

Publication Number Publication Date
CN112800848A true CN112800848A (en) 2021-05-14

Family

ID=75807956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011628351.9A Pending CN112800848A (en) 2020-12-31 2020-12-31 Structured extraction method, device and equipment of information after bill identification

Country Status (1)

Country Link
CN (1) CN112800848A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100235165A1 (en) * 2009-03-13 2010-09-16 Invention Machine Corporation System and method for automatic semantic labeling of natural language texts
CN107622255A (en) * 2017-10-12 2018-01-23 江苏鸿信系统集成有限公司 Bill images field localization method and system based on situation template and semantic template
CN110263694A (en) * 2019-06-13 2019-09-20 泰康保险集团股份有限公司 A kind of bank slip recognition method and device
CN110457973A (en) * 2018-05-07 2019-11-15 北京中海汇银财税服务有限公司 A kind of method and system of bank slip recognition
CN111275037A (en) * 2020-01-09 2020-06-12 上海知达教育科技有限公司 Bill identification method and device
CN112036304A (en) * 2020-08-31 2020-12-04 平安医疗健康管理股份有限公司 Medical bill layout identification method and device and computer equipment
CN112132016A (en) * 2020-09-22 2020-12-25 平安科技(深圳)有限公司 Bill information extraction method and device and electronic equipment


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469005A (en) * 2021-06-24 2021-10-01 金蝶软件(中国)有限公司 Recognition method of bank receipt, related device and storage medium
CN113343663A (en) * 2021-06-29 2021-09-03 广州智选网络科技有限公司 Bill structuring method and device
CN114969266A (en) * 2021-07-20 2022-08-30 支付宝(杭州)信息技术有限公司 Bill processing method and device
CN113592571A (en) * 2021-07-27 2021-11-02 北京沃东天骏信息技术有限公司 Bill issuing early warning method, device, equipment and computer readable medium
CN113821555A (en) * 2021-08-26 2021-12-21 陈仲永 Unstructured data collection processing method of intelligent supervision black box
CN113723347B (en) * 2021-09-09 2023-11-07 京东科技控股股份有限公司 Information extraction method and device, electronic equipment and storage medium
CN113723347A (en) * 2021-09-09 2021-11-30 京东科技控股股份有限公司 Information extraction method and device, electronic equipment and storage medium
CN114116616A (en) * 2022-01-26 2022-03-01 上海朝阳永续信息技术股份有限公司 Method, apparatus and medium for mining PDF files
CN114997137A (en) * 2022-06-16 2022-09-02 壹沓科技(上海)有限公司 Document information extraction method, device and equipment and readable storage medium
CN115775391A (en) * 2022-11-08 2023-03-10 北京博望华科科技有限公司 Enterprise financial information processing method, system and computer storage medium
CN116152833A (en) * 2022-12-30 2023-05-23 北京百度网讯科技有限公司 Training method of form restoration model based on image and form restoration method
CN116152833B (en) * 2022-12-30 2023-11-24 北京百度网讯科技有限公司 Training method of form restoration model based on image and form restoration method
CN116403203A (en) * 2023-06-06 2023-07-07 武汉精臣智慧标识科技有限公司 Label generation method, system, electronic equipment and storage medium
CN116403203B (en) * 2023-06-06 2023-08-29 武汉精臣智慧标识科技有限公司 Label generation method, system, electronic equipment and storage medium
CN117315705A (en) * 2023-10-10 2023-12-29 河北神玥软件科技股份有限公司 Universal card identification method, device and system, electronic equipment and storage medium
CN117315705B (en) * 2023-10-10 2024-04-30 河北神玥软件科技股份有限公司 Universal card identification method, device and system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112800848A (en) Structured extraction method, device and equipment of information after bill identification
US11580763B2 (en) Representative document hierarchy generation
CN111680490B (en) Cross-modal document processing method and device and electronic equipment
US9552516B2 (en) Document information extraction using geometric models
KR101769918B1 (en) Recognition device based deep learning for extracting text from images
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
US8208737B1 (en) Methods and systems for identifying captions in media material
US11379690B2 (en) System to extract information from documents
CN112464927B (en) Information extraction method, device and system
CN114821612B (en) Method and system for extracting information of PDF document in securities future scene
Mathew et al. Asking questions on handwritten document collections
CN113762100B (en) Method, device, computing equipment and storage medium for extracting and standardizing names in medical notes
Vitadhani et al. Detection of clickbait thumbnails on YouTube using tesseract-OCR, face recognition, and text alteration
CN112800771B (en) Article identification method, apparatus, computer readable storage medium and computer device
KR20180126352A (en) Recognition device based deep learning for extracting text from images
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
CN114254231A (en) Webpage content extraction method
CN116822634A (en) Document visual language reasoning method based on layout perception prompt
CN115004261A (en) Text line detection
Vishwanath et al. Deep reader: Information extraction from document images via relation extraction and natural language
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
Dhivya et al. Tablet identification using support vector machine based text recognition and error correction by enhanced n‐grams algorithm
CN116324910A (en) Method and system for performing image-to-text conversion on a device
CN114067343A (en) Data set construction method, model training method and corresponding device
Gupta et al. Table detection and metadata extraction in document images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514