CN115578736A - Certificate information extraction method, device, storage medium and equipment

Info

Publication number: CN115578736A
Application number: CN202211329886.5A
Authority: CN
Inventors: 孙立奋; 杨珉; 马艳艳; 刘金铭; 罗远辉; 李敬昭; 崔伟
Original assignee: Tianyi Digital Life Technology Co Ltd
Current assignee: Tianyi Shilian Technology Co ltd
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2023-01-06

Abstract

The application provides a certificate information extraction method, a certificate information extraction device, a storage medium and equipment. The method comprises the following steps: performing text detection on the acquired certificate picture, forming a plurality of text areas in the certificate picture, and intercepting picture blocks corresponding to the text areas from the certificate picture; carrying out character recognition on the picture block to obtain characters contained in the picture block; clustering and coding characters contained in each picture block according to the central coordinates of the picture blocks to obtain coded text information; and inputting the coded text information into an information structured model to obtain structured information of the certificate. By the method, the same type of information with similar distances in the certificate is combined and marked, the problem that the certificate information cannot be identified as the same type of information due to being divided into multiple lines is avoided, and the accuracy of structured extraction of the certificate information is improved.

Description

Certificate information extraction method, device, storage medium and equipment

Technical Field

The invention relates to the technical field of text information extraction, and in particular to a certificate information extraction method, device, storage medium and equipment.

Background

In the process of handling business, the corresponding certificate information needs to be extracted from certificate pictures uploaded by users, and existing certificate information extraction is based on deep learning technology. Going from certificate picture samples to structured extraction of certificate information requires multiple steps, such as picture quality evaluation, certificate area detection and correction, text detection, character recognition and structured extraction of information. The accuracy of each step affects the final structured extraction result, and the result is inaccurate when the certificate picture shot by the user is of poor quality, when text regions are missed during detection, or when characters are misrecognized as similar-looking characters.

To achieve structured extraction of certificate information, existing solutions are based on keyword matching. A certificate generally carries information such as name, gender, year and month of birth, nationality, address, certificate number, issuing authority and effective date. The conventional scheme uses an optical character recognition model to recognize the certificate text and checks whether the text contains these keywords, so that the information attribute of the text can be judged and structured extraction of the certificate information is realized. However, this scheme depends heavily on the accuracy of character recognition: when characters are recognized incorrectly or cannot be recognized, keyword matching becomes more difficult, and the related certificate information may not be identified at all.

In addition, with the development of natural language processing technology, structured information extraction based on text semantics has received more and more attention. However, in the structured extraction of certificate information it is difficult to obtain good results by directly applying an information structured extraction model based on text semantics. For example, the text of an address is long and is often divided into multiple lines, so the semantic information of the text is easily damaged, resulting in a poor structured extraction effect.

Disclosure of Invention

On this basis, the present invention provides a certificate information extraction method, device, storage medium and equipment, which extract certificate information by combining position coding with a text semantic model and thereby improve the accuracy of structured extraction of the certificate information.

In a first aspect, the present invention provides a method for extracting credential information, including:

performing text detection on the acquired certificate picture, forming a plurality of text areas in the certificate picture, and intercepting picture blocks corresponding to the text areas from the certificate picture;

carrying out character recognition on the picture block to obtain characters contained in the picture block;

clustering and coding characters contained in each picture block according to the central coordinates of the picture blocks to obtain coded text information;

and inputting the coded text information into an information structured model to obtain structured information of the certificate.

In a second aspect, the present invention provides a certificate information extraction apparatus, including:

the picture block intercepting module is used for carrying out text detection processing on the acquired certificate picture, forming a plurality of text areas in the certificate picture and intercepting picture blocks corresponding to the text areas from the certificate picture;

the character recognition module is used for carrying out character recognition on the picture block to obtain characters contained in the picture block;

the text coding module is used for clustering and coding characters contained in each picture block according to the central coordinates of the picture blocks to obtain coded text information;

and the information extraction module is used for inputting the coded text information into an information structured model to obtain structured information of the certificate.

In a third aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of any one of the credential information extraction methods of the first aspect.

In a fourth aspect, the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to perform any one of the credential information extraction methods of the first aspect.

The beneficial effects of adopting the above technical scheme are as follows: in the certificate information extraction method, the recognized characters are classified, sorted and combined according to their position information to form coded text information, and the coded text information is processed by an information structured model to obtain structured information of the certificate. The method combines information of the same type that lies close together, avoids the loss of extracted structured information that occurs when certificate information is divided into multiple lines and cannot be identified as the same type of information, and improves the accuracy of structured extraction of the certificate information.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below.

FIG. 1 is a schematic diagram of a certificate information extraction method in one embodiment of the application;

FIG. 2 is a schematic diagram of a text region mark of a certificate picture in one embodiment of the application;

FIG. 3 is a schematic diagram of clustering the characters contained in picture blocks in one embodiment of the application;

FIG. 4 is a schematic diagram of encoding the characters contained in picture blocks in one embodiment of the application;

FIG. 5 is a schematic diagram illustrating training of an information extraction model according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a credential information extraction device in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. In order to explain the present invention in more detail, the following describes the certificate information extraction method, device, storage medium and apparatus provided by the present invention in detail with reference to the accompanying drawings.

As more and more business is handled online, most certificates (such as identity cards, driver licenses and the like) are uploaded as pictures, and the information of the certificates is extracted as text through a series of steps. Going from certificate pictures to structured certificate information requires a number of steps, including but not limited to: certificate picture quality evaluation, certificate area detection and correction, text detection, character recognition and structured extraction of information. However, information of the same type in a certificate may occupy several lines, which easily causes that information to be split into several scattered pieces during extraction, so that the finally extracted certificate information is fragmented and does not match the semantics. On this basis, the application provides a certificate information extraction method, a certificate information extraction device, a storage medium and equipment. The certificate information extraction method only improves the processes of text detection, character recognition and structured extraction of information, and does not limit certificate picture quality evaluation or certificate area detection and correction.

The structured information extracted from the certificate picture refers to the text information extracted from the certificate picture, and is analyzed and then decomposed into a plurality of mutually associated components, and each component has a clear hierarchical structure.

The embodiment of the application provides a specific application scenario of the certificate information extraction method. The application scenario includes the terminal device provided in the embodiment, where the terminal device includes, but is not limited to, a smartphone and a computer device, and the computer device may be at least one of a desktop computer, a portable computer, a laptop computer, a tablet computer, and the like. The user operates the terminal device, and the terminal device executes the certificate information extraction method of the invention. The method embodiment is described in detail below with reference to fig. 1.

Step S101: text detection is carried out on the obtained certificate picture, a plurality of text areas are formed in the certificate picture, and picture blocks corresponding to the text areas are intercepted from the certificate picture.

The document of the present application is a document that can be stored in a picture format and contains text format information, including but not limited to an identification card, a driving license, a pass and a passport. The certificate includes a plurality of text fields such as name, gender, year and month of birth, address, certificate number, etc.

Specifically, several text regions are formed in the acquired certificate picture. The text area is an area including text content on the certificate. As shown in fig. 2, taking the id card as an example, the text area included in the front picture of the id card is the area where the name, gender, ethnicity, year and month of birth, address, and id card number information are located. Specifically, open source text detection models such as a DBNet model, a PSENet model, a PAN model and the like can be adopted to detect text information in the certificate picture, so as to obtain a plurality of text regions in the certificate picture.

Further, not only the text region in the certificate picture is obtained in the text detection process, but also the corresponding coordinates of the text region are obtained. And according to the corresponding coordinates of the text area, the picture block corresponding to the text area can be intercepted from the certificate picture. The certificate picture comprises a plurality of text areas in the text detection process, a plurality of picture blocks are correspondingly intercepted from the certificate picture, and the intercepted picture blocks correspond to the text areas one by one.

By identifying the text area in the certificate picture and intercepting the picture block corresponding to the text area, the interference of irrelevant information content in the certificate picture can be greatly reduced, and the subsequent character identification process is only carried out on the intercepted picture block, so that the character identification efficiency is greatly improved.
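The interception of picture blocks from the coordinates of the text regions can be illustrated with a minimal sketch. It assumes, purely for illustration, that the image is a nested list of pixel values and that each text region is reported as a list of corner points; the names `crop_picture_block` and `crop_all` are hypothetical and do not come from the application or from any text detection library.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]      # (x, y) pixel coordinate of a region corner
Image = List[List[int]]      # rows of pixel values; a stand-in for a real image array


def crop_picture_block(image: Image, box: Sequence[Point]) -> Image:
    """Crop the axis-aligned rectangle that encloses one detected text region."""
    xs = [x for x, _ in box]
    ys = [y for _, y in box]
    # Clamp the rectangle to the image so a slightly oversized box cannot fail.
    x_min, x_max = max(min(xs), 0), min(max(xs), len(image[0]) - 1)
    y_min, y_max = max(min(ys), 0), min(max(ys), len(image) - 1)
    return [row[x_min:x_max + 1] for row in image[y_min:y_max + 1]]


def crop_all(image: Image, boxes: Sequence[Sequence[Point]]) -> List[Image]:
    """One picture block per text region, in the same order as the regions."""
    return [crop_picture_block(image, box) for box in boxes]


# Toy 4 x 6 image whose pixel value encodes its position as 10 * y + x.
toy_image = [[10 * y + x for x in range(6)] for y in range(4)]
toy_blocks = crop_all(toy_image, [[(1, 1), (3, 1), (3, 2), (1, 2)]])
```

Because each block is produced from exactly one region, the one-to-one correspondence between picture blocks and text regions described above is preserved by construction.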

Step S102: and carrying out character recognition on the picture block to obtain characters contained in the picture block.

Specifically, character recognition is performed on each of the intercepted picture blocks to obtain the characters contained in each picture block. For example, an RCNN model can be used to recognize the characters contained in the picture block, and the recognized characters are output in character format.

Step S103: and clustering and coding the characters contained in each picture block according to the central coordinates of the picture blocks to obtain coded text information.

Going from the characters contained in the picture blocks to the coded text information requires two steps, clustering and coding of the recognized characters, specifically:

firstly, with reference to fig. 3, clustering processing needs to be performed on the characters contained in each picture block according to the center coordinates of the picture block, and the method specifically includes the following steps:

step S201: the center coordinates and the height of the picture block are acquired.

Obtaining the central abscissa of the picture block according to the maximum value and the minimum value of the abscissa of the picture block; and obtaining the central ordinate of the picture block and the height of the picture block according to the maximum value and the minimum value of the ordinate of the picture block.
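As a small sketch of step S201, the center coordinates and height can be computed from the extreme coordinates of a picture block. Taking the center as the midpoint of the minimum and maximum values is an assumption (the text only says the center is obtained "according to" those values), and the function name `block_geometry` is hypothetical.

```python
from typing import Sequence, Tuple


def block_geometry(box: Sequence[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (center_x, center_y, height) of a picture block given its corner points."""
    xs = [x for x, _ in box]
    ys = [y for _, y in box]
    center_x = 0.5 * (min(xs) + max(xs))   # assumed: midpoint of the abscissa range
    center_y = 0.5 * (min(ys) + max(ys))   # assumed: midpoint of the ordinate range
    height = max(ys) - min(ys)             # vertical extent of the block
    return center_x, center_y, height


example_geometry = block_geometry([(10, 20), (110, 20), (110, 50), (10, 50)])
```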

Step S202: and if the clustering region in the certificate picture is not empty, for any picture block, comparing the central coordinate of the picture block with the central coordinate of the clustering region.

Step S203: if the difference value between the center coordinate of the picture block and the center coordinate of the clustering region is smaller than a first set threshold value, the center coordinate and the height of the clustering region are updated, and characters contained in the picture block are added to the content of the clustering region.

Step S204: if the clustering area in the certificate picture is empty or the difference value between the center coordinate of the picture block and the center coordinate of each clustering area is larger than a first set threshold value, generating a new cluster, recording the center coordinate and the height of the picture block as the center coordinate and the height of the clustering area, and recording the characters contained in the picture block as the content of the clustering area.

The difference value between the center coordinate of the clustering region and the center coordinate of the picture block is the difference between the ordinates of the two center coordinates; accordingly, the first set threshold is half of the sum of the height of the clustering region and the height of the picture block. Whether the characters contained in a picture block belong to an existing clustering region is judged by comparing this ordinate distance with the first set threshold. If they do, the characters contained in the picture block are added to the content of the corresponding clustering region. If, after traversing all existing clustering regions, the characters contained in the picture block are found not to belong to any of them, a new cluster is constructed and the characters contained in the picture block are recorded as the content of the new clustering region. In this way the recognized characters are assigned to the proper clustering regions.

In addition, the center coordinates of the clustering region are updated according to the original center coordinates of the clustering region and the center coordinates of the picture block. The updated center coordinates of the clustering region are $(0.5\,(x_1+x_2),\ 0.5\,(y_1+y_2))$, where $(x_1, y_1)$ is the original center coordinate of the clustering region and $(x_2, y_2)$ is the center coordinate of the picture block. The height of the clustering region is updated according to the original height of the clustering region and the height of the picture block. The updated height of the clustering region is $0.5\,(h_1+h_2)$, where $h_1$ is the original height of the clustering region and $h_2$ is the height of the picture block.

The clustering of the characters contained in the picture blocks may specifically proceed as follows:

acquiring the center coordinate $(x_2, y_2)$ and the height $h_2$ of the picture block;

if the clustering region in the certificate picture is not empty, comparing the center coordinate $(x_2, y_2)$ of the picture block with the center coordinate $(x_1, y_1)$ of the clustering region;

if the difference value between the center coordinate of the picture block and the center coordinate of the clustering region is smaller than the first set threshold value, namely $|y_2 - y_1| < 0.5\,(h_1+h_2)$, the picture block is close to the clustering region, so the characters contained in the picture block can be merged with the characters in the clustering region, and the center coordinate of the clustering region is updated to $(0.5\,(x_1+x_2),\ 0.5\,(y_1+y_2))$ and its height to $0.5\,(h_1+h_2)$;

If the clustering region in the certificate picture is empty or the difference value between the center coordinate of the picture block and the center coordinate of each clustering region is larger than a first set threshold value, the fact that the distance between the picture block and the existing clustering region is not similar means that characters contained in the picture block cannot be merged into any existing clustering region, a new cluster needs to be additionally generated, the characters contained in the picture block serve as the content of the new clustering region, and the center coordinate and the height of the picture block are recorded as the center coordinate and the height of the clustering region.
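The clustering steps above can be sketched as follows. This is an illustrative reading, not the applicant's implementation: the class and function names are hypothetical, a block is merged into the first clustering region that satisfies the threshold (the text does not say which region wins when several qualify), and each region keeps the abscissa of every merged block so that its content can later be ordered from left to right.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    cx: float       # center abscissa x2 of the picture block
    cy: float       # center ordinate y2 of the picture block
    height: float   # height h2 of the picture block
    text: str       # characters recognized in the picture block


@dataclass
class Cluster:
    cx: float
    cy: float
    height: float
    # (block center abscissa, text) pairs; keeping x is an assumption made
    # so that the later encoding step can sort the content by abscissa.
    items: List[Tuple[float, str]] = field(default_factory=list)


def cluster_blocks(blocks: List[Block]) -> List[Cluster]:
    clusters: List[Cluster] = []
    for b in blocks:
        for c in clusters:
            # First set threshold: half the sum of the two heights.
            # Equality is treated as "not close"; the text leaves that case open.
            if abs(b.cy - c.cy) < 0.5 * (c.height + b.height):
                c.cx = 0.5 * (c.cx + b.cx)
                c.cy = 0.5 * (c.cy + b.cy)
                c.height = 0.5 * (c.height + b.height)
                c.items.append((b.cx, b.text))
                break
        else:
            # No clustering region exists yet, or none is close enough.
            clusters.append(Cluster(b.cx, b.cy, b.height, [(b.cx, b.text)]))
    return clusters


demo_clusters = cluster_blocks([
    Block(120, 21, 20, "Zhang San"),
    Block(40, 100, 20, "Address"),
    Block(40, 20, 20, "Name"),
    Block(150, 101, 20, "Some Road"),
])
```

In this toy input the blocks arrive in arbitrary order, and the two blocks on each visual row end up in the same clustering region; their left-to-right order is not yet restored at this stage.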

Further, after completing the clustering of the characters contained in the picture block, the method for coding the characters contained in the picture block specifically includes the following steps in combination with fig. 4:

step S401: and acquiring the center coordinates of the clustering region and the content of the clustering region.

Step S402: and arranging the content of each clustering area in an ascending order according to the abscissa of the central coordinate to obtain the character strings of each clustering area.

Step S403: and sorting and combining the character strings of each clustering area according to the vertical coordinates of the central coordinates of the clustering areas to obtain coded text information.

The encoding process can be realized using Python's built-in sorted function. The content of a clustering region is the characters of that clustering region. The recognized characters in a clustering region are not sorted in the previous stage; arranging them in ascending order of abscissa yields the character string of each clustering region, in which the recognized characters run from the smallest abscissa to the largest and are therefore readable. The character strings of the clustering regions are then sorted by ordinate and combined to obtain the recognized characters arranged from top to bottom of the certificate picture, namely the coded text information.
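A minimal sketch of steps S401 to S403 using the built-in `sorted` function is shown below. The input layout (a center ordinate plus a list of `(abscissa, text)` pairs per clustering region), the function name `encode_clusters`, and the separators used when joining strings are all assumptions; the text does not state how the strings are joined.

```python
from typing import List, Tuple

# One clustering region: (center ordinate, [(block center abscissa, text), ...])
ClusterContent = Tuple[float, List[Tuple[float, str]]]


def encode_clusters(clusters: List[ClusterContent],
                    item_sep: str = "",
                    line_sep: str = "\n") -> str:
    """Order text left to right inside each region, then regions top to bottom."""
    lines = []
    # S403: regions sorted by the ordinate of their center coordinate.
    for _, items in sorted(clusters, key=lambda c: c[0]):
        # S402: content of one region sorted in ascending order of abscissa.
        ordered = sorted(items, key=lambda item: item[0])
        lines.append(item_sep.join(text for _, text in ordered))
    return line_sep.join(lines)


encoded_demo = encode_clusters(
    [
        (100.5, [(40, "Address"), (150, "Some Road")]),
        (20.5, [(120, "Zhang San"), (40, "Name")]),
    ],
    item_sep=" ",
)
```

Sorting by an explicit key rather than by the tuples themselves keeps the ordering well defined even when two entries share the same coordinate, because `sorted` is stable.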

Step S104: and inputting the coded text information into an information structured model to obtain structured information of the certificate.

Specifically, the structured information of the certificate includes that the characters contained in the certificate picture are analyzed and then decomposed into a plurality of mutually associated components, and each component has a definite hierarchical structure. The information structured model is a model trained in advance and is used for obtaining structured information from the coded text information, and with reference to fig. 5, the training process of the information structured model is as follows:

step S501: the method comprises the steps of obtaining a plurality of tagged text information data sets, and dividing the text information data sets into a training data set, a testing data set and a verification data set. In this embodiment, the dividing ratio of the text information data set is training data set, test data set, and verification data set = 8: 1.

The coded text information is labeled according to text type and starting position to obtain the labeled text information. Specifically, a labeling tool such as doccano is used to label the coded text information. Taking the front side of the identity card as an example, the coded text information is a character string in which the characters recognized on the front side of the identity card are arranged in ascending order of coordinates, and doccano is used to label this string according to name, gender, year and month of birth, nationality, address and identity card number to obtain the labeled text information.

Step S502: the information extraction model UIE (Universal Information Extraction) is adjusted (fine-tuned) using the training data set. The information extraction model UIE is used to extract structured information from unstructured or structured text.

Step S503: and verifying the adjusted information extraction model UIE by using a verification data set, recording the current iteration times and the accuracy of the information extraction model UIE on the verification data set, and storing the information extraction model UIE.

Step S504: and judging whether the current iteration number is greater than a preset threshold or whether the accuracy of the information extraction model UIE on the verification data set meets an early stop condition.

The early stop condition is that the change in the accuracy of the information extraction model UIE on the verification data set over the most recent several evaluations does not exceed a set threshold value. In this embodiment, the early stop condition is specifically that the change in the accuracy of the information extraction model UIE on the verification data set over the last three evaluations is not more than 0.1.

Step S505: and if the current iteration times are larger than a preset threshold value or the accuracy of the information extraction model UIE on the verification data set meets the early stop condition, stopping the model training.

Step S506: and if the current iteration number is less than the preset threshold value and the accuracy of the information extraction model UIE on the verification data set does not meet the early stop condition, returning to the step S503.

Step S507: and testing the stored information extraction model UIE by using the test data set to obtain the information extraction model UIE with the highest accuracy, and taking the information extraction model UIE with the highest accuracy as an information structured model.

It should be understood that although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless otherwise indicated herein, the steps are not limited to being performed in the exact order described and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The above embodiment of the present disclosure describes a method for extracting credential information in detail, and the above method of the present disclosure can be implemented by various types of devices, so the present disclosure also discloses a device for extracting credential information corresponding to the above method, and a detailed description is given below with reference to fig. 6.

The picture block intercepting module 601 is configured to perform text detection on an acquired certificate picture, form a plurality of text regions in the certificate picture, and intercept, from the certificate picture, a picture block corresponding to the text region.

A character recognition module 602, configured to perform character recognition on the picture block to obtain characters included in the picture block.

The text encoding module 603 is configured to perform clustering and encoding on the characters included in each picture block according to the center coordinates of the picture block to obtain encoded text information.

And an information extraction module 604, configured to input the encoded text information into an information structured model to obtain structured information of the certificate.

For specific definition of the credential information extraction device, reference may be made to the above definition of the method, which is not described in detail here. The various modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or be independent from a processor of the terminal device, and can also be stored in a memory of the terminal device in a software form, so that the processor calls and executes operations corresponding to the modules.
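When the modules are implemented in software, their composition can be sketched as four interchangeable callables wired in sequence. The class name, field names and stub functions below are hypothetical placeholders; real text detection, character recognition and information structured models would be substituted for the stubs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class CertificateInfoExtractor:
    intercept_blocks: Callable[[Any], List[Any]]    # picture block intercepting module 601
    recognize: Callable[[Any], str]                 # character recognition module 602
    encode: Callable[[List[Any], List[str]], str]   # text encoding module 603
    extract: Callable[[str], Dict[str, str]]        # information extraction module 604

    def run(self, certificate_picture: Any) -> Dict[str, str]:
        blocks = self.intercept_blocks(certificate_picture)
        texts = [self.recognize(block) for block in blocks]
        encoded = self.encode(blocks, texts)
        return self.extract(encoded)


# Stub modules so the wiring can be exercised without any model.
stub_extractor = CertificateInfoExtractor(
    intercept_blocks=lambda picture: list(picture),
    recognize=lambda block: str(block),
    encode=lambda blocks, texts: " ".join(texts),
    extract=lambda encoded: {"text": encoded},
)
stub_result = stub_extractor.run(["Name", "Zhang San"])
```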

In one embodiment, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the certificate information extraction method of the first aspect.

The computer readable storage medium may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM (erasable programmable read only memory), a hard disk, or a ROM. Alternatively, the computer-readable storage medium includes a non-transitory computer-readable storage medium. The computer readable storage medium has storage space for program code for performing any of the method steps of the above-described method. These program codes can be read from or written to one or more computer program products, which can be compressed in a suitable form.

In one embodiment, the invention provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor executes the steps of the certificate information extraction method when executing the computer program.

The computer device includes a memory, one or more processors, and one or more computer programs, wherein the one or more computer programs can be stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to perform the above-described certificate information extraction method.

A processor may include one or more processing cores. The processor connects the various parts of the computer device through various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory. Alternatively, the processor may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor can integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It is to be understood that the modem may instead be implemented by a separate communication chip without being integrated into the processor.

The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store an instruction, a program, code, a set of codes, or a set of instructions. The memory may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The stored data area may store data created by the terminal device in use, and the like.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A certificate information extraction method is characterized by comprising the following steps:

2. The method for extracting certificate information as claimed in claim 1, wherein said clustering the characters contained in each of said picture blocks according to the center coordinates of the picture block comprises:

acquiring the center coordinates and the height of each picture block;

if the clustering area in the certificate picture is not empty, for any picture block, comparing the center coordinate of the picture block with the center coordinate of the clustering area;

if the difference value between the center coordinate of the picture block and the center coordinate of the clustering area is smaller than a first set threshold value, the center coordinate and the height of the clustering area are updated, and characters contained in the picture block are added to the content of the clustering area.

3. The method for extracting certificate information as claimed in claim 2, wherein said clustering process of the characters contained in each of said picture blocks according to the center coordinates of the picture block further comprises:

and if the clustering area in the certificate picture is empty, or the difference value between the center coordinate of the picture block and the center coordinate of every clustering area is larger than the first set threshold value, generating a new clustering area, recording the center coordinate and the height of the picture block as the center coordinate and the height of the new clustering area, and recording the characters contained in the picture block as the content of the new clustering area.
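The clustering rule of claims 2 and 3 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the claims do not say how the "difference value" is measured or how the clustering area is updated, so the sketch assumes the difference is the absolute gap between vertical center coordinates and that the update is a running average. The names `Block`, `Cluster`, and `cluster_blocks` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A recognized picture block (hypothetical structure)."""
    text: str
    cx: float      # center abscissa
    cy: float      # center ordinate
    height: float


@dataclass
class Cluster:
    """A clustering area: running center ordinate, height, and content."""
    cy: float
    height: float
    members: list = field(default_factory=list)


def cluster_blocks(blocks, threshold):
    """Assign each block to the first clustering area whose center is close enough.

    Assumption: "difference" means |block.cy - cluster.cy|. The claims leave
    the equality case open; here it starts a new clustering area.
    """
    clusters = []
    for block in blocks:
        target = None
        for cluster in clusters:
            if abs(block.cy - cluster.cy) < threshold:
                target = cluster
                break
        if target is None:
            # No clustering area yet, or none close enough: open a new one.
            clusters.append(Cluster(cy=block.cy, height=block.height, members=[block]))
        else:
            # Assumed update rule: running average of center ordinate and height.
            n = len(target.members)
            target.cy = (target.cy * n + block.cy) / (n + 1)
            target.height = (target.height * n + block.height) / (n + 1)
            target.members.append(block)
    return clusters
```

With a threshold of 10, blocks centered at ordinates 100 and 102 fall into one clustering area, while a block at ordinate 40 opens a second one. A real system might instead scale the threshold by the block height or compare both coordinates.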

4. The method as claimed in claim 3, wherein said encoding the text included in each of the picture blocks according to the center coordinates of the picture block comprises:

acquiring the center coordinates of the clustering areas and the content of the clustering areas;

arranging the content of each clustering area in an ascending order according to the abscissa of the central coordinate to obtain the character string of each clustering area;

and sorting and combining the character strings of each clustering area according to the vertical coordinates of the central coordinates of the clustering areas to obtain coded text information.
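The coding step of claim 4 can be sketched as two sorts. The claim fixes neither the separators nor the direction of the vertical sort, so this sketch assumes ascending ordinate (top to bottom in image coordinates), an empty separator inside a clustering area, and a newline between clustering areas. The function name `encode_clusters` and the dictionary layout are hypothetical.

```python
def encode_clusters(clusters, item_sep="", line_sep="\n"):
    """Build the coded text from clustering areas.

    clusters: list of dicts with
        "cy"    - center ordinate of the clustering area
        "items" - list of (center_abscissa, text) pairs
    Separators are assumptions; the claim does not specify them.
    """
    lines = []
    # Order clustering areas by center ordinate (assumed ascending).
    for cluster in sorted(clusters, key=lambda c: c["cy"]):
        # Inside a clustering area, order content by ascending abscissa.
        ordered = sorted(cluster["items"], key=lambda item: item[0])
        lines.append(item_sep.join(text for _, text in ordered))
    return line_sep.join(lines)


if __name__ == "__main__":
    demo = [
        {"cy": 101, "items": [(200, "Example Road"), (50, "Address:")]},
        {"cy": 40, "items": [(50, "Name:"), (150, "Zhang San")]},
    ]
    print(encode_clusters(demo, item_sep=" "))
```

The demo data is invented for illustration only.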

5. The method of claim 1, wherein determining the center coordinates of the picture block comprises:

obtaining the central abscissa of the picture block according to the maximum value and the minimum value of the abscissa of the picture block;

and obtaining the central vertical coordinate of the picture block and the height of the picture block according to the maximum value and the minimum value of the vertical coordinate of the picture block.
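Claim 5 only states that the center and height are derived from the coordinate extremes. A natural reading, assumed here, is the midpoint of the extremes for the center and their difference for the height. The function name `block_geometry` and the corner-point input format are hypothetical.

```python
def block_geometry(points):
    """Return (center_x, center_y, height) of a picture block.

    points: iterable of (x, y) corner coordinates of the text area.
    Assumption: center = midpoint of min and max, height = max_y - min_y.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    center_x = (max(xs) + min(xs)) / 2
    center_y = (max(ys) + min(ys)) / 2
    height = max(ys) - min(ys)
    return center_x, center_y, height
```

For a box with corners (10, 20), (110, 20), (110, 50), and (10, 50), this yields a center of (60, 35) and a height of 30.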

6. The method of claim 1, wherein the training process of the information structuring model comprises:

acquiring a plurality of pieces of labeled text information to form a text information data set, and dividing the text information data set into a training data set, a testing data set and a verification data set;

adjusting a preset information extraction model by using a training data set;

verifying the adjusted information extraction model by using a verification data set, recording the iteration times and the accuracy of the adjusted information extraction model on the verification data set, and storing the adjusted information extraction model;

if the iteration times are larger than a preset threshold value or the accuracy of the adjusted information extraction model on the verification data set meets an early stop condition, stopping model training;

and testing the saved adjusted information extraction models by using the test data set to obtain the information extraction model with the highest accuracy, and taking the information extraction model with the highest accuracy as the information structured model.

7. The method of claim 6, wherein the early stop condition comprises:

the change in accuracy of the adjusted information extraction model on the verification data set does not exceed a set threshold value over the last several iterations.
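The training procedure of claims 6 and 7 can be sketched as a generic loop with early stopping. This is a framework-agnostic illustration, not the applicant's code: `adjust` and `evaluate` are hypothetical callables supplied by the caller, and `max_iters`, `patience`, and `min_delta` are assumed names for the preset iteration threshold, the number of recent iterations considered, and the set accuracy-change threshold.

```python
def train_with_early_stop(adjust, evaluate, model, train_set, val_set, test_set,
                          max_iters=100, patience=3, min_delta=1e-3):
    """Fine-tune, validate, checkpoint, stop early, then pick the best on test.

    adjust(model, data)   -> new adjusted model (must not mutate old checkpoints)
    evaluate(model, data) -> accuracy as a float
    """
    checkpoints = []   # every adjusted model is saved
    val_history = []   # validation accuracy per iteration
    iteration = 0
    while iteration < max_iters:          # at most max_iters iterations
        model = adjust(model, train_set)
        iteration += 1
        val_history.append(evaluate(model, val_set))
        checkpoints.append(model)
        # Early stop: accuracy changed by no more than min_delta
        # across each of the last `patience` consecutive steps.
        recent = val_history[-(patience + 1):]
        if len(recent) == patience + 1 and all(
            abs(b - a) <= min_delta for a, b in zip(recent, recent[1:])
        ):
            break
    # Select the saved model with the highest accuracy on the test set.
    best = max(checkpoints, key=lambda m: evaluate(m, test_set))
    return best, iteration, val_history
```

The returned model plays the role of the information structured model. In practice `adjust` would wrap one epoch of fine-tuning of a pretrained information extraction model and checkpoints would be written to disk rather than kept in memory.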

8. A certificate information extraction device, characterized by comprising:

the picture block intercepting module is used for carrying out text detection on the acquired certificate picture, forming a plurality of text areas in the certificate picture and intercepting picture blocks corresponding to the text areas from the certificate picture;

and the information extraction module is used for inputting the coded text information into an information structured model to obtain structured information of the certificate.

9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the certificate information extraction method as claimed in any one of claims 1 to 7.

10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor executes the computer program to perform the certificate information extraction method of any one of claims 1 to 7.