CN116665224A

CN116665224A - Certificate information extraction method and device based on OCR (optical character recognition) technology

Info

Publication number: CN116665224A
Application number: CN202310554961.6A
Authority: CN
Inventors: 张洁; 许康; 翟铖杰
Original assignee: Nanjing Zhiying Artificial Intelligence Research Institute Co ltd; Nanjing Xuanying Network Technology Co ltd
Current assignee: Nanjing Zhiying Artificial Intelligence Research Institute Co ltd; Nanjing Xuanying Network Technology Co ltd
Priority date: 2023-05-17
Filing date: 2023-05-17
Publication date: 2023-08-29

Abstract

The application discloses a certificate information extraction method and device based on OCR technology. The device comprises a photographing module, a recognition module, an extraction module, a search module, a personnel information updating module and a personnel information adding module. The photographing module issues photographing commands at regular intervals, stores the captured photos in the device and uploads them to the recognition module. The recognition module recognizes each photo by OCR and passes the result to the extraction module. The extraction module determines whether the certificate is an identity card or a lawyer's license and extracts the key information of that certificate. The search module receives the data passed by the extraction module and decides which information to search for, and the adding module inserts and stores the data passed by the search module. With the device, lawyers and parties can register quickly, other court services can extract information quickly with the described extraction method, and manual entry time is reduced.

Description

Certificate information extraction method and device based on OCR (optical character recognition) technology

Technical Field

The application belongs to the technical field of image recognition, and particularly relates to a certificate information extraction method and device based on an OCR (optical character recognition) technology.

Background

With the growing legal awareness of the public in recent years, the number of cases accepted by courts has increased year by year, and a large number of parties and lawyers need to enter the courts every day. Conventional manual registration, and the certificate information extraction services offered by the prior art, cannot meet the needs of the courts.

At present, to improve registration efficiency, many courts have introduced visitor registration machines that read identity card information with an identity card reader, which reduces manual registration. However, a lawyer's license has no chip and cannot be read quickly, so lawyers must still be registered by manual entry. A party who has no original identity card and only a photocopy must also be entered manually. Both cases increase the workload of registration staff and the waiting time of some parties and lawyers.

Disclosure of Invention

In view of this, the application provides a certificate information extraction method and device based on OCR technology, so that lawyers and parties can register quickly with the device and other court services can extract information quickly with the described extraction method, thereby reducing manual entry time.

The application provides a certificate information extraction device based on OCR (optical character recognition) technology, which comprises a photographing module, a recognition module, an extraction module, a search module, a personnel information updating module and a personnel information adding module;

the photographing module comprises a high-speed document camera whose photo size and photographing speed are adjustable; it issues photographing commands at regular intervals, stores the captured photos in the device and uploads them to the recognition module;

the recognition module consists of an offline OCR service exposed through an API (application programming interface); after a photo captured by the photographing module is submitted to it, the recognition module recognizes the photo by OCR and passes the result to the extraction module for processing;

the extraction module determines whether the certificate is an identity card or a lawyer's license, extracts the key information of that certificate and, once extraction is complete, passes the personnel information to the search module;

the search module connects to the database, receives the data passed by the extraction module and decides which information to search for;

the adding module connects to the database and inserts and stores the data passed by the search module;

the updating module connects to the database and updates the stored information according to the certificate information provided by the search module.

Further, the device also comprises a preprocessing module, wherein the preprocessing module is used for receiving the picture of the photographing module, optimizing the picture quality and format and transmitting the picture to the identification module.

Further, an error detection and correction module is included for correcting errors that may be introduced by the OCR engine.

Further, the system also comprises an encryption module which is used for taking data security and privacy protection measures when processing and storing the sensitive information.

The certificate information extraction method based on OCR technology, applicable to the certificate information extraction device described above, comprises the following steps:

step S0: the preprocessing module performs the following operations on the picture: denoising, rotation, cropping, resizing, graying and binarization;

step S1: performing OCR on the picture, extracting the text information in the picture, and returning the text information in JSON form;

step S2: correcting errors possibly introduced by the OCR engine through the error detection and correction module;

step S3: parsing the JSON information; if the certificate is recognized as an identity card, searching by identity card number, and if it is recognized as a lawyer's license, searching by lawyer's license number;

step S4: when no relevant record is found in step S3, calling the adding module, which stores the new information for the next recognition;

step S5: when a relevant record is found in step S3, calling the updating module, which overwrites the old information with the current information and stores it;

step S6: taking data security and privacy protection measures when processing and storing sensitive information (such as identity card numbers and lawyer's license numbers);

step S7: when a recognition error occurs or information needs to be updated, providing the user with a convenient feedback mechanism so that they can correct the error or update the information.
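Steps S1 and S3 can be illustrated with a short Python sketch. The JSON layout (a `lines` array of objects with a `text` field), the keyword lists and the function names are assumptions made for illustration only; the application does not specify the schema returned by the OCR service.

```python
import json

# Hypothetical keyword lists; the application only states that keywords are used.
ID_CARD_KEYWORDS = ("居民身份证", "公民身份号码")
LAWYER_KEYWORDS = ("律师执业证", "执业证号", "执业机构")


def parse_ocr_json(payload: str) -> list[str]:
    """Return the recognized text lines from an assumed {"lines": [{"text": ...}]} payload."""
    data = json.loads(payload)
    return [item["text"].strip() for item in data.get("lines", []) if item.get("text")]


def classify_certificate(lines: list[str]) -> str:
    """Return 'lawyer_license', 'id_card' or 'unknown' based on keyword hits."""
    text = "".join(lines)
    # Lawyer keywords are checked first because a lawyer's license
    # also lists an identity card number.
    if any(keyword in text for keyword in LAWYER_KEYWORDS):
        return "lawyer_license"
    if any(keyword in text for keyword in ID_CARD_KEYWORDS):
        return "id_card"
    return "unknown"


def retrieval_key(doc_type: str) -> str:
    """Step S3: choose which number is used to query the database."""
    return {"id_card": "id_number", "lawyer_license": "license_number"}.get(doc_type, "")
```

An unrecognized document yields an empty retrieval key, which a caller could treat as a signal to fall back to manual entry.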

Further, step S3 comprises the following steps:

step S31: determining whether the OCR text contains the resident identity card keyword; traversing the OCR results and locating the identity card number with a matching rule (for example, a regular expression); once found, removing the number from the OCR results to reduce the checks needed in subsequent extraction;

step S32: locating the name marker; if the marker is followed by other characters that match a common name and have a valid length, taking them as the cardholder's name; otherwise checking whether the entry following the marker matches a common name and has a valid length, and if so taking it as the cardholder's name; when a name is found, removing it from the OCR results, otherwise returning an empty value;

step S33: first locating the citizen identity number keyword, then locating the address keyword; the content between the two markers is the address; if a keyword cannot be found because of inaccurate character recognition, matching province-level region keywords instead to locate the address;

step S34: locating the gender keyword; if it is followed by "male", the gender is male, and if it is followed by "female", the gender is female; if neither follows, checking whether the next entry in the text list is "male" or "female"; if the gender is still not found, traversing the whole text and, if it contains the "male" or "female" keyword, setting the gender accordingly.
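As an illustration of step S31, the following Python sketch locates an 18-character identity card number in a list of OCR text lines and removes it from the list. The regular expression and the function name are assumptions for illustration; the application does not give the exact rule it uses.

```python
import re

# 17 digits followed by a digit or X, not embedded in a longer run of digits.
ID_NUMBER_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def extract_id_number(lines: list[str]) -> tuple[str, list[str]]:
    """Return (id_number, remaining_lines); id_number is '' when nothing matches."""
    for index, line in enumerate(lines):
        compact = line.replace(" ", "")
        match = ID_NUMBER_RE.search(compact)
        if match:
            # Strip the matched number so later steps do not examine it again.
            rest = (compact[:match.start()] + compact[match.end():]).strip()
            remaining = lines[:index] + ([rest] if rest else []) + lines[index + 1:]
            return match.group().upper(), remaining
    return "", list(lines)
```

Removing the number early keeps the long digit string from being mistaken for part of an address or name in the later steps.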

Further, step S3 also comprises the following steps:

step S321: determining whether the practice institution keyword is present; if so, checking whether the next line ends with the expected word; if it does, that line is taken as the legal address; if not, continuing with the following line, checking at most two lines;

step S322: traversing the whole text; first removing the content related to the qualification number, which contains English letters; then matching the all-digit license number with a regular expression; because the all-digit license number differs in length from the identity card number, it will not be confused with the identity card number; after successful recognition, removing the matched content;

step S323: matching the identity card number with a regular expression and, on success, removing the matched content; if the match fails, locating the identity card number keyword and checking whether the next one or two lines contain the identity card number; if not, checking the previous line, which guards against lines being recognized out of order;

step S324: locating the license holder marker; first checking whether the content following the marker matches a common name and has a valid length, and if so taking it as the name; otherwise checking whether the next entry matches a common name and has a valid length, and if so taking it as the name; if not, checking the entry preceding the license holder marker.
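The order of operations in step S322 can be sketched in Python as follows. The patterns are assumptions for illustration: the qualification number is taken to be a Latin letter followed by at least eight digits, and the license number is taken to be a run of exactly 17 digits. Note that the look-ahead also rejects a trailing "X", since otherwise the first 17 digits of an identity card number ending in "X" would match.

```python
import re

# Assumed formats, not taken from the application.
QUALIFICATION_RE = re.compile(r"[A-Za-z]\d{8,}")
LICENSE_NUMBER_RE = re.compile(r"(?<!\d)\d{17}(?![\dXx])")


def extract_license_number(lines: list[str]) -> tuple[str, list[str]]:
    """Return (license_number, remaining_lines) following the order in step S322."""
    # 1. Drop lines that carry a qualification number (letter plus digits).
    candidates = [line for line in lines if not QUALIFICATION_RE.search(line)]
    # 2. Match the all-digit license number in what is left and remove that line.
    for index, line in enumerate(candidates):
        match = LICENSE_NUMBER_RE.search(line.replace(" ", ""))
        if match:
            return match.group(), candidates[:index] + candidates[index + 1:]
    return "", candidates
```

The identity card line is left in the remaining list so that step S323 can process it afterwards.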

Further, the key information of the identity card comprises the name, gender, address and identity card number, and the key information of the lawyer's license comprises the name, license number, gender, identity card number and practice institution.

Further, the specific steps of step S7 are as follows:

step S71: creating a feedback form;

step S72: associating the identification result with a feedback form;

step S73: receiving and processing feedback;

step S74: the OCR model is optimized.

Compared with the prior art, the application has the beneficial effects that:

1. With the certificate information extraction method and device based on OCR technology provided by the application, lawyers and parties can register quickly with the device, other court services can extract information quickly with the described extraction method, and manual entry time is reduced.

2. The application describes in detail the logic for extracting identity card and lawyer's license information, so the logic can be reproduced quickly in any programming language to implement key information extraction.

3. The application greatly improves the efficiency of court security staff, since lawyer and identity card information no longer has to be entered by hand. It is also convenient for parties and lawyers: even if a party forgets the identity card or a lawyer forgets the license and only a photocopy is available, registration can be completed quickly by OCR recognition, which improves public satisfaction with the work of the court.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a use flow of the device;

FIG. 2 is a block diagram of the overall module of the apparatus of the present application;

FIG. 3 is a step diagram of the present application;

FIGS. 4 to 7 are diagrams illustrating embodiments of the present application.

Description of the embodiments

The present application will be described in further detail with reference to the accompanying drawings, but embodiments of the present application are not limited thereto.

As shown in fig. 1, the whole device consists of a photographing module, a recognition module, an extraction module, a search module, a personnel information updating module and a personnel information adding module;

Photographing module: the main hardware is a high-speed document camera whose resolution can be adjusted to change the photo size and photographing speed. Its main task is to have the device issue photographing commands at regular intervals and to store the captured photos in the device.

Recognition module: mainly an offline OCR recognition API service. When the device submits a photo captured by the photographing module, the recognition module performs OCR, extracts all text information in the photo, returns it in JSON format and passes it to the extraction module for processing.

Extraction module: after receiving the JSON information returned by the recognition module, it parses the information to obtain all the text recognized by OCR, determines from the extraction logic whether the certificate is an identity card or a lawyer's license, extracts the key information of that certificate and, once extraction is complete, passes the personnel information to the search module.

Search module: connects to the database and, from the data passed by the extraction module, determines whether to search for lawyer information or party information. A lawyer is searched by lawyer's license number and a party by identity card number. Different modules are called according to the search result: if no relevant record is found, the adding module is called; if a relevant record is found, the updating module is called.

Adding module: connects to the database and inserts and stores the data passed by the search module.

Updating module: connects to the database and overwrites the old certificate information with the certificate information provided by the search module.
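The dispatch between the search, adding and updating modules can be sketched with Python's built-in `sqlite3` module. The table name `person`, its columns and the function names are hypothetical; the application does not name a database or a schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the hypothetical `person` table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person ("
        "key_type TEXT, key_value TEXT, name TEXT, gender TEXT, "
        "PRIMARY KEY (key_type, key_value))"
    )
    return conn


def register(conn: sqlite3.Connection, info: dict) -> str:
    """Search by license number (lawyer) or ID number (party), then insert or update."""
    is_lawyer = bool(info.get("license_number"))
    key_type = "license_number" if is_lawyer else "id_number"
    key_value = info[key_type]
    found = conn.execute(
        "SELECT 1 FROM person WHERE key_type = ? AND key_value = ?",
        (key_type, key_value),
    ).fetchone()
    if found is None:
        # Adding module: store the record for the next recognition.
        conn.execute(
            "INSERT INTO person (key_type, key_value, name, gender) VALUES (?, ?, ?, ?)",
            (key_type, key_value, info.get("name"), info.get("gender")),
        )
        action = "inserted"
    else:
        # Updating module: overwrite the old certificate information.
        conn.execute(
            "UPDATE person SET name = ?, gender = ? WHERE key_type = ? AND key_value = ?",
            (info.get("name"), info.get("gender"), key_type, key_value),
        )
        action = "updated"
    conn.commit()
    return action
```

Parameterized queries are used throughout so that text produced by OCR is never interpolated into SQL.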

step S0: the preprocessing module optimizes the picture quality and format by performing the following operations: denoising, rotation, cropping, resizing, graying and binarization;

Denoising: a smoothing filter, such as a Gaussian or median filter, is applied to reduce noise in the picture; implementations of these filters are available in the OpenCV library, as shown in fig. 4;

Rotation: if the picture is not properly aligned, it is corrected with a rotation operation. To rotate automatically, straight lines in the picture are detected with the Hough transform and the rotation angle is calculated from them; a code example follows:

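import cv2
import numpy as np

# denoised_image is the output of the denoising step above.
height, width = denoised_image.shape[:2]
gray_image = cv2.cvtColor(denoised_image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray_image, 50, 150)
lines = cv2.HoughLines(edges, 1, np.pi / 180, 100)
# HoughLines returns None when no line is found. theta is the angle of the
# line normal, so a horizontal text line has theta = 90 degrees.
average_angle = 0.0
if lines is not None:
    skews = [np.degrees(theta) - 90 for rho, theta in lines[:, 0]]
    skews = [s for s in skews if abs(s) < 45]  # ignore near-vertical lines
    if skews:
        average_angle = float(sum(skews) / len(skews))
rotation_matrix = cv2.getRotationMatrix2D((width / 2, height / 2), average_angle, 1)
rotated_image = cv2.warpAffine(denoised_image, rotation_matrix, (width, height))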

Cropping: edge detection and contour search are used to determine the boundary of the text region and crop it out; a code example follows:

gray_image = cv2.cvtColor(rotated_image, cv2.COLOR_BGR2GRAY)
_, threshold_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Two return values as in OpenCV 4.x.
contours, _ = cv2.findContours(threshold_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# Keep the bounding box of the largest contour; fall back to the full image if none is found.
cropped_image = rotated_image
max_area = 0
for contour in contours:
    area = cv2.contourArea(contour)
    if area > max_area:
        max_area = area
        x, y, w, h = cv2.boundingRect(contour)
        cropped_image = rotated_image[y:y + h, x:x + w]

Resizing: the cropped picture is resized to a standard size to simplify subsequent processing; a code example follows:

standard_width, standard_height = 600, 400
resized_image = cv2.resize(cropped_image, (standard_width, standard_height))

Graying: the color picture is converted to a grayscale picture to reduce computation and simplify subsequent processing; a code example follows:

gray_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2GRAY)

Binarization: the grayscale picture is converted to a binary picture with a threshold method (such as Otsu's method) to facilitate text recognition; a code example follows:

_, binary_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

After these preprocessing operations are completed, step S1 is performed.

step S2: an error detection and correction module is introduced to correct errors possibly introduced by the OCR engine, as follows:

A function is created to check the validity of the extracted identity card number and lawyer's license number. The function should apply the following rules:

a. Length check: ensure that the length of the extracted number matches the expected length. For example, a Chinese resident identity card number has 18 characters.

b. Format check: verify that the extracted number meets the required format. For example, the first 17 characters of an identity card number must be digits, and the last character may be a digit or the letter "X".

c. Check code verification: compute and verify the check code with the corresponding algorithm. For example, the last character of an identity card number is a check code that can be computed from the first 17 digits.

If an invalid identity card number or lawyer's license number is detected, automatic correction is attempted. For example, for a common OCR error (such as the digit "0" recognized as the letter "O"), the suspect character is replaced and the validity of the number is checked again.

If automatic correction fails, the correct number is entered manually. The system can compare the number entered by the user with the extracted number to further optimize the OCR model. An example is shown in fig. 5.
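The three validity rules and the substitution-based correction can be sketched in Python as follows. The weights and check characters are those of the national citizen identity number standard (GB 11643-1999); the table of look-alike characters and the function names are assumptions chosen for illustration, and the sketch only repairs the first 17 positions.

```python
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"
# Hypothetical table of letters that OCR tends to confuse with digits.
OCR_CONFUSIONS = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})


def is_valid_id_number(number: str) -> bool:
    """Apply the length, format and check-code rules to an 18-character ID number."""
    if len(number) != 18:
        return False
    body, check = number[:17], number[17].upper()
    if not (body.isascii() and body.isdigit()) or check not in "0123456789X":
        return False
    total = sum(int(digit) * weight for digit, weight in zip(body, ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == check


def correct_id_number(number: str) -> str:
    """Return a valid number, after substitution if needed, or '' to request manual entry."""
    candidate = number.strip().replace(" ", "")
    if is_valid_id_number(candidate):
        return candidate.upper()
    # Only the first 17 positions must be digits; the last may legitimately be 'X'.
    fixed = candidate[:17].translate(OCR_CONFUSIONS) + candidate[17:].upper()
    return fixed if is_valid_id_number(fixed) else ""
```

Because the corrected candidate must still pass the check code, a wrong substitution is very unlikely to be accepted silently; an empty return value signals that manual entry is needed.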

step S4: when no relevant record is found in step S3, the adding module is called, and it stores the new information for the next recognition;

step S5: when a relevant record is found in step S3, the updating module overwrites the old information with the current information and stores it, so that the information is updated.

step S6: data security and privacy protection measures are taken when processing and storing sensitive information such as identity card numbers and lawyer's license numbers. For example, data may be encrypted, access rights set and secure transmission protocols used;

As a specific embodiment, sensitive data is encrypted with a strong encryption algorithm (such as AES) to prevent unauthorized access and leakage. Encrypting and decrypting data with public and private keys may also be considered, as shown in fig. 6 (1);

Setting access rights: access to sensitive data is restricted by means such as access control lists (ACL) or role-based access control (RBAC), ensuring that only authenticated and authorized users can access, modify or delete sensitive information, as shown in fig. 6 (2);
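A minimal sketch of role-based access control in plain Python is given below. The role names, the permission set and the function names are hypothetical and are not taken from the application or its figures.

```python
from enum import Enum


class Permission(Enum):
    READ = "read"
    UPDATE = "update"
    DELETE = "delete"


# Hypothetical roles for a court registration desk.
ROLE_PERMISSIONS = {
    "registrar": {Permission.READ, Permission.UPDATE},
    "administrator": {Permission.READ, Permission.UPDATE, Permission.DELETE},
    "visitor": set(),
}


def is_allowed(role: str, permission: Permission) -> bool:
    """Unknown roles get no permissions (deny by default)."""
    return permission in ROLE_PERMISSIONS.get(role, set())


def read_record(role: str, record: dict) -> dict:
    """Return the record only to roles that hold the READ permission."""
    if not is_allowed(role, Permission.READ):
        raise PermissionError(f"role {role!r} may not read sensitive records")
    return record
```

Denying by default means that a misspelled or newly introduced role cannot reach sensitive data until it is explicitly granted a permission.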

step S7: when a recognition error occurs or information needs to be updated, a convenient feedback mechanism is provided for the user so that they can correct the error or update the information. This helps to improve the accuracy and timeliness of the system. The specific steps are as follows:

step S71: creating a feedback form:

a simple user interface is designed so that the user can enter the correct information and submit it. The form should contain all information fields that may need correction or updating, such as an identification number, a lawyer number, etc.

Step S72: associating the recognition result with a feedback form:

when the identification result is displayed, a feedback button is provided for the user, and the user can click the button to directly enter the feedback form. The original recognition result should be automatically filled in the form so that the user only needs to modify the wrong part.

Step S73: receiving and processing feedback:

after the user submits the feedback, the back-end server receives the feedback data and verifies the feedback data. After the verification is passed, the relevant records in the database are updated.

Step S74: optimizing an OCR model:

The user's feedback data is compiled into a training set for optimizing the OCR model. Through iterative training and model updates, the recognition accuracy is improved.

Fig. 7 (1) and (2) show a simple example that implements the feedback form function using Flask and HTML.
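Independently of any web framework, the server-side handling in steps S73 and S74 can be sketched as a pure Python function. The field names, the `validators` mapping and the `FeedbackResult` structure are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackResult:
    accepted: dict = field(default_factory=dict)        # fields to write back to the database
    rejected: dict = field(default_factory=dict)        # fields that failed validation
    training_pairs: list = field(default_factory=list)  # (recognized, corrected) pairs


def process_feedback(recognized: dict, submitted: dict, validators: dict) -> FeedbackResult:
    """Compare the pre-filled recognition result with the user's submission."""
    result = FeedbackResult()
    for name, new_value in submitted.items():
        old_value = recognized.get(name, "")
        if new_value == old_value:
            continue  # the user left this field unchanged
        # Fields without a dedicated validator only need to be non-blank.
        check = validators.get(name, lambda value: bool(value.strip()))
        if check(new_value):
            result.accepted[name] = new_value
            result.training_pairs.append((old_value, new_value))
        else:
            result.rejected[name] = new_value
    return result
```

The accepted fields would then be written to the database, and the collected pairs could be added to the training set mentioned in step S74.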

As a specific embodiment, step S3 comprises the following steps:

step S31: determining whether the OCR text contains the resident identity card keyword; traversing the OCR results and locating the identity card number with a matching rule (for example, a regular expression); once found, removing the number from the OCR results to reduce the checks needed in subsequent extraction;

step S32: locating the name marker; if the marker is followed by other characters that match a common name and have a valid length, taking them as the cardholder's name; otherwise checking whether the entry following the marker matches a common name and has a valid length, and if so taking it as the cardholder's name; when a name is found, removing it from the OCR results, otherwise returning an empty value;

step S33: first locating the citizen identity number keyword, then locating the address keyword; the content between the two markers is the address; if a keyword cannot be found because of inaccurate character recognition, matching province-level region keywords instead to locate the address;

step S34: locating the gender keyword; if it is followed by "male", the gender is male, and if it is followed by "female", the gender is female; if neither follows, checking whether the next entry in the text list is "male" or "female"; if the gender is still not found, traversing the whole text and, if it contains the "male" or "female" keyword, setting the gender accordingly.
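The three fallbacks of step S34 can be sketched in Python as follows, assuming the OCR output is a list of Chinese text lines and that the gender keyword is "性别" with values "男" (male) and "女" (female); the function name is hypothetical.

```python
def extract_gender(lines: list[str]) -> str:
    """Return '男', '女' or '' following the three fallbacks of step S34."""
    for index, line in enumerate(lines):
        if "性别" not in line:
            continue
        # 1. The character right after the keyword on the same OCR line.
        after = line.split("性别", 1)[1].lstrip(" :：")
        if after[:1] in ("男", "女"):
            return after[0]
        # 2. The next entry of the text list.
        if index + 1 < len(lines):
            following = lines[index + 1].strip()
            if following[:1] in ("男", "女"):
                return following[0]
        break
    # 3. Last resort: scan the whole text for either character.
    for line in lines:
        for char in line:
            if char in ("男", "女"):
                return char
    return ""
```

The whole-text scan is deliberately the last fallback, because either character could also appear in an unrelated field.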

As a specific embodiment, step S3 further comprises the following steps:

step S321: determining whether the practice institution keyword is present; if so, checking whether the next line ends with the expected word; if it does, that line is taken as the legal address; if not, continuing with the following line, checking at most two lines;

step S322: traversing the whole text; first removing the content related to the qualification number, which contains English letters; then matching the all-digit license number with a regular expression; because the all-digit license number differs in length from the identity card number, it will not be confused with the identity card number; after successful recognition, removing the matched content;

step S323: matching the identity card number with a regular expression and, on success, removing the matched content; if the match fails, locating the identity card number keyword and checking whether the next one or two lines contain the identity card number; if not, checking the previous line, which guards against lines being recognized out of order;

step S324: locating the license holder marker; first checking whether the content following the marker matches a common name and has a valid length, and if so taking it as the name; otherwise checking whether the next entry matches a common name and has a valid length, and if so taking it as the name; if not, checking the entry preceding the license holder marker; a valid length means that the content does not exceed 4 characters.
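The search order of step S324 can be sketched in Python as follows. The marker text "持证人", the short sample of common surnames and the function names are assumptions for illustration; a real system would load a fuller surname list.

```python
# Hypothetical sample of common surnames.
COMMON_SURNAMES = set("王李张刘陈杨黄赵吴周徐孙马朱胡郭何林高罗")
HOLDER_KEYWORD = "持证人"


def looks_like_name(text: str) -> bool:
    """A plausible name starts with a common surname and has 2 to 4 characters."""
    text = text.strip(" :：")
    return 2 <= len(text) <= 4 and text[0] in COMMON_SURNAMES


def extract_holder_name(lines: list[str]) -> str:
    """Try the text after the marker, then the next entry, then the previous entry."""
    for index, line in enumerate(lines):
        if HOLDER_KEYWORD not in line:
            continue
        candidates = [line.split(HOLDER_KEYWORD, 1)[1]]
        if index + 1 < len(lines):
            candidates.append(lines[index + 1])
        if index > 0:
            candidates.append(lines[index - 1])
        for candidate in candidates:
            if looks_like_name(candidate):
                return candidate.strip(" :：")
        break
    return ""
```

Checking the previous entry last covers the case where OCR emits the name before its label.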

As a specific embodiment, the key information of the identity card comprises the name, gender, address and identity card number, and the key information of the lawyer's license comprises the name, license number, gender, identity card number and practice institution.

The application consolidates and unifies the method for extracting key certificate information, and adds a fault-tolerance mechanism based on the actual photographing results and recognition error rates observed during registration, which further improves the accuracy of the recognition results. It greatly improves the efficiency of court security staff, since lawyer and identity card information no longer has to be entered by hand, and it is convenient for parties and lawyers: even if a party forgets the identity card or a lawyer forgets the license, registration can be completed quickly by OCR recognition, which improves public satisfaction with the work of the court. Data security is also protected while certificates are recognized, which avoids leakage of personal information.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims

1. A certificate information extraction device based on OCR technology, characterized by comprising a photographing module, a recognition module, an extraction module, a search module, a personnel information updating module and a personnel information adding module;

the photographing module has an adjustable resolution for changing the photo size and photographing speed, issues photographing commands at regular intervals to capture pictures, stores the captured pictures in the device and uploads them;

the recognition module consists of an offline OCR recognition API service; after a photo captured by the photographing module is submitted to the recognition module, the recognition module recognizes it by OCR and passes the result to the extraction module for processing;

the extraction module is written in Java code according to the key information extraction logic for identity cards and lawyer's licenses, determines whether the certificate is an identity card or a lawyer's license, extracts the key information of that certificate and, once extraction is complete, passes the personnel information to the search module;

2. The apparatus for extracting information from a document based on OCR recognition technology as recited in claim 1, further comprising a preprocessing module for receiving a picture of the photographing module, optimizing a picture quality and a format, and transmitting the picture to the recognition module.

3. The apparatus for extracting information from a document based on OCR recognition technology of claim 1, further comprising an error detection and correction module for correcting errors that may be introduced by the OCR engine.

4. The apparatus for extracting information from a document based on OCR recognition technology of claim 1, further comprising an encryption module for taking data security and privacy protection measures when processing and storing sensitive information.

5. A method for extracting certificate information based on an OCR recognition technology, which is applicable to the device for extracting certificate information based on an OCR recognition technology as set forth in claims 1 to 4, and is characterized by comprising the following steps:

step S6: data security and privacy protection measures are taken when sensitive information is processed and stored;

step S7: when an error is identified or information needs to be updated, a convenient feedback mechanism is provided for the user to correct the error and update the information.

6. The method for extracting information from a document based on the OCR recognition technology as recited in claim 5, wherein the step S3 includes the steps of:

7. The method for extracting information from a document based on the OCR recognition technology as recited in claim 5, wherein the step S3 further comprises:

8. The method for extracting information from a document based on the OCR recognition technology as recited in claim 5, wherein the key information of the identification card includes a name, a gender, an address, and an identification card number, and the key information of the lawyer's certificate includes a name, a license number, a gender, an identification card number, and a license agency.

9. The method for extracting certificate information based on OCR technology as recited in claim 5, wherein step S7 comprises the following specific steps:

step S71: creating a feedback form;

step S72: associating the identification result with a feedback form;

step S73: receiving and processing feedback;

step S74: the OCR model is optimized.