CN112800848A - Structured extraction method, device and equipment of information after bill identification - Google Patents

Structured extraction method, device and equipment of information after bill identification Download PDF

Info

Publication number
CN112800848A
CN112800848A CN202011628351.9A
Authority
CN
China
Prior art keywords
text
information
template
bill
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011628351.9A
Other languages
Chinese (zh)
Inventor
刘渊
张科
梁扩战
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Jinxin Software Co Ltd
Original Assignee
Zhongdian Jinxin Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Jinxin Software Co Ltd filed Critical Zhongdian Jinxin Software Co Ltd
Priority to CN202011628351.9A priority Critical patent/CN112800848A/en
Publication of CN112800848A publication Critical patent/CN112800848A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/418Document matching, e.g. of document images

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Character Input (AREA)

Abstract

The application provides a structured extraction method, device and equipment for information after bill identification, wherein the method comprises the following steps: acquiring image information of a bill to be identified; analyzing the image information, and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill; classifying the text information, and selecting a semantically matched target data template from a preset template library according to the classification result; and extracting text data from the text information according to the text information, the position information and the target data template. The method and the device realize template alignment through dual matching of coordinates and semantic concepts, so that templates remain aligned even when the number of lines or words of the text changes dynamically; the components of the information are determined based on the template, the precision of structured information extraction for bills with complex layouts is improved, and the accuracy of data identification is ultimately improved.

Description

Structured extraction method, device and equipment of information after bill identification
Technical Field
The application relates to the technical field of data identification, in particular to a structured extraction method, a structured extraction device and structured extraction equipment for information after bill identification.
Background
OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method. OCR technology is widely used in related fields such as handwriting recognition, print recognition, and text image recognition, with applications including document recognition, bank card recognition, and advertisement and poster recognition, and it greatly simplifies data processing.
In the field of bill recognition, a bill image is first input into an OCR model, and unstructured data is output. After bill recognition, unstructured data is converted into structured data, and the structured data is generally formed by matching bills with templates and extracting data from the unstructured data according to data extraction rules in the templates.
However, a common method in the prior art is alignment by optical anchor points. If the number of lines or words of the text changes dynamically, it is difficult to determine from the template which contents belong to which region, so the robustness of template alignment under such dynamic changes is poor.
Disclosure of Invention
The embodiments of the present application aim to provide a method, a device and equipment for structured extraction of information after bill identification, in which template alignment is realized through dual matching of coordinates and semantic concepts, so that templates remain aligned even when the number of lines or words of the text changes dynamically, and the accuracy of data identification is improved.
The first aspect of the embodiments of the present application provides a bill identification method, including: acquiring image information of a bill to be identified; analyzing the image information, and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill; classifying the text information, and selecting a semantically matched target data template from a preset template library according to the classification result; and extracting text data from the text information according to the text information, the position information and the target data template.
In an embodiment, the analyzing the image information and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill includes: recognizing the image information and generating a text library of the bill, wherein the text library comprises the full text content of the bill and the coordinate information of each character on the bill; and selecting, from the text library, the target text content pointed to by each preset field as the text information of that preset field, wherein the position information is the target coordinate range in which the target text content is located.
In an embodiment, the classifying the text information and selecting a target data template with semantic matching from a preset template library according to a classification result includes: identifying target semantic information of the target text content aiming at each preset field; and selecting the target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
In one embodiment, the target data template includes a plurality of labeling boxes marked with semantic labels and position labels, and the extracting text data from the text information according to the text information, the position information and the target data template comprises: for each preset field, respectively calculating the overlapping rate of the position information and the position label of each labeling box in the target data template, and taking the labeling boxes whose overlapping rate is larger than a preset threshold as candidate labeling boxes; among the candidate labeling boxes, respectively calculating the semantic similarity between the text information under the same preset field and the semantic label in each candidate labeling box, and selecting the candidate labeling box with the largest semantic similarity as the template labeling box of that preset field; and extracting the text data of the text information labeled by the template labeling box.
In an embodiment, the extracting the text data of the text information labeled by the template labeling box includes: for each preset field, calling the data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
A second aspect of the embodiments of the present application provides a bill identification device, including: an acquisition module for acquiring the image information of a bill to be identified; a parsing module for analyzing the image information and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill; a matching module for classifying the text information and selecting a semantically matched target data template from a preset template library according to the classification result; and an extraction module for extracting text data from the text information according to the text information, the position information and the target data template.
In one embodiment, the parsing module is configured to: identifying the image information, and generating a text library of the bill, wherein the text library comprises: the whole text content of the bill and the coordinate information of each character on the bill; and selecting target text content pointed by each preset field from the text library as the text information of the preset field, wherein the position information is the target coordinate range where the target text content is located.
In one embodiment, the matching module is configured to: identifying target semantic information of the target text content aiming at each preset field; and selecting the target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
In one embodiment, the target data template includes a plurality of labeling boxes marked with semantic labels and position labels, and the extraction module is configured to: for each preset field, respectively calculate the overlapping rate of the position information and the position label of each labeling box in the target data template, and take the labeling boxes whose overlapping rate is larger than a preset threshold as candidate labeling boxes; among the candidate labeling boxes, respectively calculate the semantic similarity between the text information under the same preset field and the semantic label in each candidate labeling box, and select the candidate labeling box with the largest semantic similarity as the template labeling box of that preset field; and extract the text data of the text information labeled by the template labeling box.
In an embodiment, the extracting the text data of the text information labeled by the template labeling box includes: for each preset field, calling the data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
A third aspect of the embodiments of the present application provides an electronic device, including: a memory to store a computer program; and a processor configured to perform the method of the first aspect of the embodiments of the present application, and any embodiment thereof, to identify text data in a bill.
According to the method, device and equipment for structured extraction of information after bill identification, the image information of the bill to be identified is first acquired; the text information of the bill and its coordinates on the bill are then obtained by analyzing the image information; and finally, the text information of the bill is matched against a target data template through dual matching of coordinates and semantic concepts, realizing template alignment, so that the text data of the bill is extracted from the text information and the accuracy of data identification is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings based on these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a schematic view of a ticket according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a bill identification method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a bill identification method according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a target data template according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a bill identifying device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12, one processor being exemplified in fig. 1. Processor 11 and memory 12 are connected by bus 10, and memory 12 stores instructions executable by processor 11, and the instructions are executed by processor 11 to cause electronic device 1 to perform all or part of the flow of the method in the embodiments described below to identify text data in a ticket.
In an embodiment, the electronic device 1 may be a mobile phone, a notebook computer, a desktop computer, or a computing system composed of multiple computers.
Please refer to fig. 2, which is a schematic diagram of a bill 2 according to an embodiment of the present application. The bill 2 may be a contract, a receipt, an invoice, a ticket, an agreement, a note, etc., and contains various information such as the bill type, the bill date, and the bill number. Taking a contract as an example, one contract contains a variety of information; in actual contract management, to facilitate the management of contract data, the specific attribute information of a contract is generally filed and stored in a targeted manner to form structured data for subsequent queries. For example, a contract contains the bill type "contract", the bill date (the contract date), and the bill number (the contract number), and during contract recognition these different types of contract information need to be identified and stored.
Please refer to fig. 3, which shows a bill identification method according to an embodiment of the present application; the method can be executed by the electronic device 1 shown in fig. 1 and can be applied in the bill recognition scenario shown in fig. 2 to recognize text data in a bill. The method comprises the following steps:
step 301: and acquiring the image information of the bill to be identified.
In this step, the bill may be a contract, a receipt, an invoice, a ticket, an agreement, a note, or the like, and the image information may be a photo or a scanned copy of the bill. The image information of the bill to be identified can be obtained by taking a picture on site or retrieved from a preset database.
Step 302: analyzing the image information, and identifying, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill.
In this step, the text information of the bill may be the various texts recorded in the bill. Each text in a bill has specific position information, which may be represented in the form of coordinates. By analyzing the image information of the bill, for example through image recognition, the text information of the bill and its position information on the bill can be obtained.
Step 303: and classifying the text information, and selecting a target data template with semantic matching from a preset template library according to a classification result.
In this step, various types of data templates are pre-stored in the preset template library, each configured with a data extraction rule; the data templates and extraction rules can be customized based on the actual needs of the user. When recognizing the bill to be identified, the text information of the bill is classified, and a semantically matched target data template is selected from the preset template library according to the classification result.
Step 304: and extracting text data in the text information according to the text information, the position information and the target data template.
In this step, the text information and the position information of the text information on the bill are comprehensively considered and matched with the target data template, and the text data is then extracted from the text information; the text data can be stored in a structured form to facilitate data management.
According to the above bill identification method, the image information of the bill to be identified is first acquired; the text information of the bill and its coordinates on the bill are then obtained by analyzing the image information; and finally, the text information of the bill is matched against the target data template through dual matching of coordinates and semantic concepts, realizing template alignment, so that the text data of the bill is extracted from the text information and the accuracy of data identification is improved.
Please refer to fig. 4, which shows a bill identification method according to an embodiment of the present application; the method can be executed by the electronic device 1 shown in fig. 1 and can be applied in the bill recognition scenario shown in fig. 2 to recognize text data in a bill. The method comprises the following steps:
step 401: and acquiring the image information of the bill to be identified. See the description of step 301 in the above embodiments for details.
Step 402: recognizing image information and generating a text library of the bill, wherein the text library comprises: the full text content of the ticket and the coordinate information of each character on the ticket.
In this step, the coordinate information may be a coordinate range. The image information of the bill can be recognized using OCR technology: for example, the image information of the bill to be recognized is input into an OCR recognition model, which outputs a recognition result in JSON format. The recognition result includes a text library of the bill, and the text library contains at least the text contents of the bill and the coordinate range corresponding to each character of each text content on the bill. The coordinate ranges are real (absolute) coordinate values.
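To make the text library concrete, the sketch below builds one from a hypothetical JSON recognition result of the kind described above. The field names `text` and `box`, and the sample contents, are illustrative assumptions rather than the actual output schema of any particular OCR model.

```python
import json

# Hypothetical OCR output: one entry per recognized line, each with its
# text content and a coordinate range [x1, y1, x2, y2] on the bill image.
ocr_json = """
[
  {"text": "Purchase Contract", "box": [120, 40, 480, 90]},
  {"text": "Contract No.: HT-2020-001", "box": [60, 120, 420, 160]},
  {"text": "Contract Date: 2020-12-31", "box": [60, 180, 400, 220]}
]
"""

def build_text_library(raw_json):
    """Build a text library: full text content plus per-line coordinate ranges."""
    lines = json.loads(raw_json)
    # Recognize line by line from top to bottom (sort by the top y coordinate).
    lines.sort(key=lambda item: item["box"][1])
    full_text = "\n".join(item["text"] for item in lines)
    return {"full_text": full_text, "lines": lines}

library = build_text_library(ocr_json)
print(library["full_text"])
```

A real text library would additionally carry per-character coordinates, as the specification notes; this sketch keeps per-line ranges for brevity.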
Step 403: and selecting the target text content pointed by each preset field from a text library as the text information of the preset field, wherein the position information is the target coordinate range where the target text content is located.
In this step, a preset field may be a designated field set based on user requirements. Taking a contract as the bill to be recognized, the preset fields may include: the bill type "contract", the contract date, the contract number, and so on. The text library contains all the text content in the contract and the corresponding coordinate information; in an actual scenario, to reduce the amount of data to process, the target text content corresponding to each preset field can be selected as the text information used for data identification, and the position information is the target coordinate range in which the target text content is located. Taking a contract as the bill to be identified, the JSON-formatted text information and position information of the preset fields may be as follows:
(The JSON-formatted text information and position information are shown in the original specification as figures BDA0002879623510000071 and BDA0002879623510000081, which are reproduced only as images and are not rendered here.)
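Because those figures are available only as images, the sketch below shows a hypothetical example of what such per-field JSON text and position information might look like; every field name and value here is an illustrative assumption, not the content of the original figures.

```python
import json

# Hypothetical per-field recognition result: for each preset field, the
# selected target text content and the target coordinate range it occupies.
preset_fields = {
    "bill_type":   {"text": "Contract",    "range": [120, 40, 480, 90]},
    "bill_number": {"text": "HT-2020-001", "range": [230, 120, 420, 160]},
    "bill_date":   {"text": "2020-12-31",  "range": [220, 180, 400, 220]},
}

print(json.dumps(preset_fields["bill_number"]))
```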
step 404: and identifying target semantic information of the target text content aiming at each preset field.
In this step, semantic recognition is performed on the target text content of each preset field, and the ontology concept (target semantic information) of the text content can be determined according to the part-of-speech tagging model trained by the predefined concept corpus.
In an embodiment, semantic recognition can be realized using automatic part-of-speech tagging and Named Entity Recognition (NER, which refers to recognizing entities with specific meanings in text, including person names, place names, organization names, proper nouns, etc.) based on NLP (Natural Language Processing) technology. Taking a contract as an example, word segmentation and part-of-speech tagging can be performed on the full-text recognition result (the text library): for example, the text library is segmented with the word segmentation tool jieba, a part-of-speech tagging model is trained using HanLP (Han Language Processing package) combined with a preset concept vocabulary, and the ontology concept property of each word in the text library is then tagged with this part-of-speech tagging model.
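The tagging idea can be made concrete as follows. This sketch deliberately replaces jieba and a HanLP-trained model with a tiny hand-written concept vocabulary and greedy longest-match lookup, purely to illustrate how words get mapped to ontology concepts; the vocabulary entries and concept names are assumptions.

```python
# Minimal stand-in for the jieba + HanLP pipeline: a concept vocabulary
# mapping words to ontology concepts, applied with greedy longest match.
CONCEPT_VOCAB = {
    "contract number": "BILL_NUMBER",
    "contract date": "BILL_DATE",
    "contract": "BILL_TYPE",
}

def tag_concepts(text):
    """Tag each vocabulary word found in the text with its ontology concept."""
    tags = []
    lowered = text.lower()
    # Try longer vocabulary entries first so "contract date" beats "contract".
    for word in sorted(CONCEPT_VOCAB, key=len, reverse=True):
        if word in lowered:
            tags.append((word, CONCEPT_VOCAB[word]))
            lowered = lowered.replace(word, " " * len(word))  # consume the span
    return tags

print(tag_concepts("Contract Date: 2020-12-31"))
```

In a production system the vocabulary lookup would be replaced by the trained part-of-speech tagging model the specification describes.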
Step 405: and selecting a target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
In this step, a plurality of data templates are pre-stored in the template library. The semantic similarity between the target semantic information corresponding to the current preset field and each data template can be calculated respectively, and the data template with the largest similarity is then selected as the target data template for that preset field. As shown in fig. 5, the target data template 5 records in advance the labeling boxes and the additional information of each labeling box, where the additional information may include the text concept corresponding to each labeling box and the position information of the labeling box. A labeling box may record its position through two percentage coordinates, one for its upper-left corner and one for its lower-right corner.
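One simple way to realize "select the template with the largest similarity" is to score each template by the Jaccard overlap between its concept set and the concepts recognized on the bill. The template names, concept sets, and the choice of Jaccard similarity below are all illustrative assumptions; the specification does not fix a particular similarity measure.

```python
# Hypothetical template library: each template is summarized by the set of
# ontology concepts its labeling boxes carry.
TEMPLATE_LIBRARY = {
    "purchase_contract": {"BILL_TYPE", "BILL_DATE", "BILL_NUMBER", "PARTY_A", "PARTY_B"},
    "vat_invoice":       {"BILL_TYPE", "BILL_DATE", "TAX_ID", "AMOUNT"},
    "receipt":           {"BILL_TYPE", "AMOUNT", "PAYER"},
}

def select_target_template(target_concepts):
    """Return the template whose concept set is most similar to the target's."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    return max(TEMPLATE_LIBRARY,
               key=lambda name: jaccard(target_concepts, TEMPLATE_LIBRARY[name]))

best = select_target_template({"BILL_TYPE", "BILL_DATE", "BILL_NUMBER"})
print(best)
```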
Step 406: the target data template comprises a plurality of labeling boxes marked with semantic labels and position labels; for each preset field, the overlapping rate of the position information and the position label of each labeling box in the target data template is respectively calculated, and the labeling boxes whose overlapping rate is larger than a preset threshold are taken as candidate labeling boxes.
In this step, the target data template includes, but is not limited to, a plurality of labeling boxes marked with semantic labels and position labels. The preset threshold is a threshold on the overlapping rate of the position labels: when the overlapping rate exceeds the preset threshold, the target coordinate range in which the target text content of the preset field is located essentially coincides with the position on the bill corresponding to the candidate labeling box in the target data template. The preset threshold can be obtained from historical statistics of the actual scenario.
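The overlapping rate in step 406 can be computed, for example, as the intersection area of the two rectangles divided by the area of the recognized text's coordinate range. Whether the denominator is the text range, the labeling box, or their union is a design choice the specification leaves open, so the variant below is an assumption.

```python
def overlap_rate(text_range, box_range):
    """Intersection area of two rectangles divided by the text range's area.

    Rectangles are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    ax1, ay1, ax2, ay2 = text_range
    bx1, by1, bx2, by2 = box_range
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    text_area = (ax2 - ax1) * (ay2 - ay1)
    return (inter_w * inter_h) / text_area if text_area else 0.0

# A text range fully inside a labeling box overlaps at rate 1.0.
print(overlap_rate((10, 10, 30, 20), (0, 0, 40, 40)))
```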
Step 407: and in the candidate labeling boxes, respectively calculating the semantic similarity between the text information under the same preset field and the semantic label in each candidate labeling box, and selecting the candidate labeling box with the maximum semantic similarity as the template labeling box of the preset field.
In this step, after step 406 a preset field may still have a plurality of candidate labeling boxes. To further determine the most accurate labeling box, the semantic similarity between the text information of the same preset field and the semantic label in each candidate labeling box is calculated, and the candidate labeling box with the largest semantic similarity is selected as the template labeling box of that preset field. In this way, the ontology concept and the coordinate range of the text content of the preset field are respectively matched against the text concept and the coordinate range of the labeling boxes in the target data template: if the overlapping rate between the coordinate range of the target text content and the coordinate range of a labeling box exceeds the preset threshold, and the semantic concept of the target text content matches the semantic concept corresponding to that labeling box, the target text content of the preset field is confirmed to correspond to that template labeling box, thereby achieving template alignment.
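The dual matching of steps 406 and 407 can be sketched together as below. The threshold value, the crude exact-match similarity, and the sample labeling boxes are all assumptions for illustration; a real system would use the trained semantic similarity the specification describes.

```python
def rect_overlap(a, b):
    """Overlap rate: intersection area over the area of rectangle a."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area = (a[2] - a[0]) * (a[3] - a[1])
    return iw * ih / area if area else 0.0

def concept_similarity(a, b):
    """Crude stand-in for semantic similarity: 1.0 iff concepts are identical."""
    return 1.0 if a == b else 0.0

def match_template_box(field_range, field_concept, boxes, threshold=0.7):
    """Step 406: keep boxes whose overlap rate exceeds the threshold;
    step 407: among those candidates, pick the most semantically similar label."""
    candidates = [b for b in boxes if rect_overlap(field_range, b["range"]) > threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda b: concept_similarity(field_concept, b["label"]))

# Hypothetical labeling boxes of a target data template.
boxes = [
    {"label": "BILL_NUMBER", "range": (60, 110, 430, 170)},
    {"label": "BILL_DATE",   "range": (60, 170, 430, 230)},
]
chosen = match_template_box((70, 120, 420, 160), "BILL_NUMBER", boxes)
print(chosen["label"])
```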
Step 408: for each preset field, calling the data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
In this step, according to the data extraction rule of the template labeling box in the target data template, the corresponding content is extracted from the target text content of the preset field to form structured data. For example, the structured data may be as follows:
(The structured-data example is shown in the original specification as figures BDA0002879623510000101 and BDA0002879623510000111, which are reproduced only as images and are not rendered here.)
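Since the structured result is likewise available only as an image, the sketch below illustrates the idea with hypothetical regex-based extraction rules and a resulting structured record; the rule names, patterns, and field texts are all assumptions, not the rules of the original specification.

```python
import json
import re

# Hypothetical data extraction rules, one per template labeling box:
# each rule pulls the value part out of the recognized field text.
EXTRACTION_RULES = {
    "bill_number": re.compile(r"Contract No\.?:?\s*(?P<value>[A-Z0-9-]+)"),
    "bill_date":   re.compile(r"Contract Date:?\s*(?P<value>\d{4}-\d{2}-\d{2})"),
}

def extract_structured(field_texts):
    """Apply each field's extraction rule and collect a structured record."""
    record = {}
    for field, text in field_texts.items():
        match = EXTRACTION_RULES[field].search(text)
        if match:
            record[field] = match.group("value")
    return record

record = extract_structured({
    "bill_number": "Contract No.: HT-2020-001",
    "bill_date": "Contract Date: 2020-12-31",
})
print(json.dumps(record))
```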
the bill identification method determines the components of the information based on the template, solves the problem of low accuracy of template matching based on image pixel similarity at present, improves the information structured extraction precision of the bill with the complex layout, and finally improves the data identification accuracy.
Please refer to fig. 6, which shows a bill identification device 600 according to an embodiment of the present application. The device is applied to the electronic device 1 shown in fig. 1 and can be used in the bill recognition scenario shown in fig. 2 to identify text data in a bill. The device includes: an acquisition module 601, a parsing module 602, a matching module 603 and an extraction module 604, where the relationships among the modules are as follows:
the acquiring module 601 is configured to acquire image information of a to-be-identified bill. See the description of step 301 in the above embodiments for details.
The parsing module 602 is configured to parse the image information and identify, line by line from top to bottom, at least one piece of text information in the bill and the position information of each piece of text information on the bill. See the description of step 302 in the above embodiments for details.
And the matching module 603 is configured to classify the text information, and select a target data template with semantic matching from a preset template library according to a classification result. See the description of step 303 in the above embodiments for details.
And an extracting module 604, configured to extract text data in the text information according to the text information, the location information, and the target data template. See the description of step 304 in the above embodiments for details.
In one embodiment, the parsing module 602 is configured to: recognize the image information and generate a text library of the bill, wherein the text library comprises the full text content of the bill and the coordinate information of each character on the bill; and select, from the text library, the target text content pointed to by each preset field as the text information of that preset field, wherein the position information is the target coordinate range in which the target text content is located. See the description of steps 402 to 403 in the above embodiments for details.
In one embodiment, the matching module 603 is configured to: and identifying target semantic information of the target text content aiming at each preset field. And selecting a target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information. See the above embodiments for a detailed description of steps 404 to 405.
In one embodiment, the target data template includes: a plurality of labeling boxes marked with semantic labels and position labels; the extraction module 604 is configured to: and respectively calculating the overlapping rate of the position information and the position label of each marking frame in the target data template aiming at each preset field, and taking the marking frame with the overlapping rate larger than a preset threshold value as a candidate marking frame. And in the candidate labeling boxes, respectively calculating the semantic similarity between the text information under the same preset field and the semantic label in each candidate labeling box, and selecting the candidate labeling box with the maximum semantic similarity as the template labeling box of the preset field. And extracting text data of the text information marked by the template marking box. See the description of step 406 to step 407 in the above embodiments in detail.
In one embodiment, extracting the text data of the text information labeled by the template labeling box includes: for each preset field, invoking the data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on that rule. For details, see the description of step 408 in the above embodiments.
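One plausible form of such a per-field data extraction rule is a regular expression that pulls the payload out of the field's raw text. The field names and patterns below are illustrative assumptions, not rules from the disclosure:

```python
import re

# Hypothetical rule table: each template labeling box's field maps to a rule.
EXTRACTION_RULES = {
    "invoice_no": re.compile(r"(\d{8,})"),                 # long digit run
    "amount":     re.compile(r"([0-9]+(?:\.[0-9]{1,2})?)"),  # decimal amount
}

def extract(field, raw_text):
    """Apply the field's extraction rule to the labeled text information;
    fall back to the stripped raw text when no rule is registered."""
    rule = EXTRACTION_RULES.get(field)
    if rule is None:
        return raw_text.strip()
    m = rule.search(raw_text)
    return m.group(1) if m else None
```

Keeping rules per labeling box lets the same template library cover bills whose fields embed their values in different surrounding text.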
For a detailed description of the bill identification apparatus 600, refer to the description of the related method steps in the above embodiments.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium for an electronic device, including a program that, when run on the electronic device, causes the electronic device to perform all or part of the procedures of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD), or a combination of such memories.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method of bill identification, comprising:
acquiring image information of a bill to be identified;
analyzing the image information, and identifying, line by line from top to bottom, at least one item of text information in the bill and position information of each item of text information on the bill;
classifying the text information, and selecting a target data template with semantic matching from a preset template library according to a classification result;
and extracting text data in the text information according to the text information, the position information and the target data template.
2. The method of claim 1, wherein the analyzing the image information and identifying, line by line from top to bottom, at least one item of text information in the bill and position information of each item of text information on the bill comprises:
identifying the image information, and generating a text library of the bill, wherein the text library comprises: the whole text content of the bill and the coordinate information of each character on the bill;
and selecting target text content pointed by each preset field from the text library as the text information of the preset field, wherein the position information is the target coordinate range where the target text content is located.
3. The method according to claim 2, wherein the classifying the text information and selecting the semantically matched target data template from a preset template library according to the classification result comprises:
for each preset field, identifying target semantic information of the target text content;
and selecting the target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
4. The method of claim 3, wherein the target data template comprises: a plurality of labeling boxes marked with semantic labels and position labels; the extracting text data in the text information according to the text information, the position information and the target data template comprises:
for each preset field, calculating an overlap rate between the position information and the position label of each labeling box in the target data template, and taking labeling boxes whose overlap rate is greater than a preset threshold as candidate labeling boxes;
in the candidate labeling boxes, respectively calculating semantic similarity between the text information under the same preset field and a semantic label in each candidate labeling box, and selecting the candidate labeling box with the largest semantic similarity as a template labeling box of the preset field;
and extracting text data of the text information marked by the template marking box.
5. The method of claim 4, wherein the extracting text data of the text information labeled by the template labeling box comprises:
for each preset field, invoking a data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
6. A bill identifying apparatus, comprising:
the acquisition module is used for acquiring the image information of the bill to be identified;
the analysis module is used for analyzing the image information and identifying, line by line from top to bottom, at least one item of text information in the bill and position information of each item of text information on the bill;
the matching module is used for classifying the text information and selecting a target data template with semantic matching from a preset template library according to a classification result;
and the extraction module is used for extracting the text data in the text information according to the text information, the position information and the target data template.
7. The apparatus of claim 6, wherein the analysis module is configured to:
identifying the image information, and generating a text library of the bill, wherein the text library comprises: the whole text content of the bill and the coordinate information of each character on the bill;
and selecting target text content pointed by each preset field from the text library as the text information of the preset field, wherein the position information is the target coordinate range where the target text content is located.
8. The apparatus of claim 7, wherein the matching module is configured to:
for each preset field, identifying target semantic information of the target text content;
and selecting the target data template with the maximum similarity between the template semantic information and the target semantic information from the template library based on the target semantic information.
9. The apparatus of claim 8, wherein the target data template comprises: a plurality of labeling boxes marked with semantic labels and position labels; the extraction module is configured to:
for each preset field, calculating an overlap rate between the position information and the position label of each labeling box in the target data template, and taking labeling boxes whose overlap rate is greater than a preset threshold as candidate labeling boxes;
in the candidate labeling boxes, respectively calculating semantic similarity between the text information under the same preset field and a semantic label in each candidate labeling box, and selecting the candidate labeling box with the largest semantic similarity as a template labeling box of the preset field;
extracting text data of the text information marked by the template marking box;
the extracting the text data of the text information labeled by the template labeling box comprises the following steps:
for each preset field, invoking a data extraction rule corresponding to the template labeling box, and extracting the text data from the text information based on the data extraction rule.
10. An electronic device, comprising:
a memory to store a computer program;
a processor configured to perform the method of any one of claims 1 to 5 to extract text data from a bill.
CN202011628351.9A 2020-12-31 2020-12-31 Structured extraction method, device and equipment of information after bill identification Pending CN112800848A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011628351.9A CN112800848A (en) 2020-12-31 2020-12-31 Structured extraction method, device and equipment of information after bill identification


Publications (1)

Publication Number Publication Date
CN112800848A true CN112800848A (en) 2021-05-14

Family

ID=75807956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011628351.9A Pending CN112800848A (en) 2020-12-31 2020-12-31 Structured extraction method, device and equipment of information after bill identification

Country Status (1)

Country Link
CN (1) CN112800848A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100235165A1 (en) * 2009-03-13 2010-09-16 Invention Machine Corporation System and method for automatic semantic labeling of natural language texts
CN107622255A (en) * 2017-10-12 2018-01-23 江苏鸿信系统集成有限公司 Bill images field localization method and system based on situation template and semantic template
CN110263694A (en) * 2019-06-13 2019-09-20 泰康保险集团股份有限公司 A kind of bank slip recognition method and device
CN110457973A (en) * 2018-05-07 2019-11-15 北京中海汇银财税服务有限公司 A kind of method and system of bank slip recognition
CN111275037A (en) * 2020-01-09 2020-06-12 上海知达教育科技有限公司 Bill identification method and device
CN112036304A (en) * 2020-08-31 2020-12-04 平安医疗健康管理股份有限公司 Medical bill layout identification method and device and computer equipment
CN112132016A (en) * 2020-09-22 2020-12-25 平安科技(深圳)有限公司 Bill information extraction method and device and electronic equipment


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469005A (en) * 2021-06-24 2021-10-01 金蝶软件(中国)有限公司 Recognition method of bank receipt, related device and storage medium
CN113343663A (en) * 2021-06-29 2021-09-03 广州智选网络科技有限公司 Bill structuring method and device
CN114969266A (en) * 2021-07-20 2022-08-30 支付宝(杭州)信息技术有限公司 Bill processing method and device
CN113592571A (en) * 2021-07-27 2021-11-02 北京沃东天骏信息技术有限公司 Bill issuing early warning method, device, equipment and computer readable medium
CN113821555A (en) * 2021-08-26 2021-12-21 陈仲永 Unstructured data collection processing method of intelligent supervision black box
CN113723347B (en) * 2021-09-09 2023-11-07 京东科技控股股份有限公司 Information extraction method and device, electronic equipment and storage medium
CN113723347A (en) * 2021-09-09 2021-11-30 京东科技控股股份有限公司 Information extraction method and device, electronic equipment and storage medium
CN114116616A (en) * 2022-01-26 2022-03-01 上海朝阳永续信息技术股份有限公司 Method, apparatus and medium for mining PDF files
CN114997137A (en) * 2022-06-16 2022-09-02 壹沓科技(上海)有限公司 Document information extraction method, device and equipment and readable storage medium
CN115775391A (en) * 2022-11-08 2023-03-10 北京博望华科科技有限公司 Enterprise financial information processing method, system and computer storage medium
CN116152833A (en) * 2022-12-30 2023-05-23 北京百度网讯科技有限公司 Training method of form restoration model based on image and form restoration method
CN116152833B (en) * 2022-12-30 2023-11-24 北京百度网讯科技有限公司 Training method of form restoration model based on image and form restoration method
CN116403203A (en) * 2023-06-06 2023-07-07 武汉精臣智慧标识科技有限公司 Label generation method, system, electronic equipment and storage medium
CN116403203B (en) * 2023-06-06 2023-08-29 武汉精臣智慧标识科技有限公司 Label generation method, system, electronic equipment and storage medium
CN117315705A (en) * 2023-10-10 2023-12-29 河北神玥软件科技股份有限公司 Universal card identification method, device and system, electronic equipment and storage medium
CN117315705B (en) * 2023-10-10 2024-04-30 河北神玥软件科技股份有限公司 Universal card identification method, device and system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112800848A (en) Structured extraction method, device and equipment of information after bill identification
US11580763B2 (en) Representative document hierarchy generation
CN111680490B (en) Cross-modal document processing method and device and electronic equipment
US9552516B2 (en) Document information extraction using geometric models
KR101769918B1 (en) Recognition device based deep learning for extracting text from images
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
US8208737B1 (en) Methods and systems for identifying captions in media material
US11379690B2 (en) System to extract information from documents
CN112464927B (en) Information extraction method, device and system
CN114821612B (en) Method and system for extracting information of PDF document in securities future scene
Mathew et al. Asking questions on handwritten document collections
CN113762100B (en) Method, device, computing equipment and storage medium for extracting and standardizing names in medical notes
Vitadhani et al. Detection of clickbait thumbnails on YouTube using tesseract-OCR, face recognition, and text alteration
CN112800771B (en) Article identification method, apparatus, computer readable storage medium and computer device
KR20180126352A (en) Recognition device based deep learning for extracting text from images
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
CN114254231A (en) Webpage content extraction method
CN116822634A (en) Document visual language reasoning method based on layout perception prompt
CN115004261A (en) Text line detection
Vishwanath et al. Deep reader: Information extraction from document images via relation extraction and natural language
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
Dhivya et al. Tablet identification using support vector machine based text recognition and error correction by enhanced n‐grams algorithm
CN116324910A (en) Method and system for performing image-to-text conversion on a device
CN114067343A (en) Data set construction method, model training method and corresponding device
Gupta et al. Table detection and metadata extraction in document images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514