CN116665224A - Certificate information extraction method and device based on OCR (optical character recognition) technology - Google Patents

Certificate information extraction method and device based on OCR (optical character recognition) technology

Info

Publication number
CN116665224A
Authority
CN
China
Prior art keywords
module
information
identification
ocr
judging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310554961.6A
Other languages
Chinese (zh)
Inventor
张洁
许康
翟铖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhiying Artificial Intelligence Research Institute Co ltd
Nanjing Xuanying Network Technology Co ltd
Original Assignee
Nanjing Zhiying Artificial Intelligence Research Institute Co ltd
Nanjing Xuanying Network Technology Co ltd
Application filed by Nanjing Zhiying Artificial Intelligence Research Institute Co ltd and Nanjing Xuanying Network Technology Co ltd
Priority to CN202310554961.6A
Publication of CN116665224A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means
    • G06V30/191: Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/94: Hardware or software architectures specially adapted for image or video understanding
    • G06V10/95: Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/98: Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/16: Image preprocessing
    • G06V30/162: Quantising the image signal
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Character Discrimination (AREA)

Abstract

The application discloses a certificate information extraction method and device based on OCR technology. The device comprises a photographing module, an identification module, an extraction module, a searching module, a personnel information updating module and a personnel information adding module. The photographing module sends photographing commands at regular intervals to take pictures, stores the pictures in the device and uploads them to the identification module. The identification module recognizes each picture by OCR and passes the result to the extraction module. The extraction module determines whether the certificate is an identity card or a lawyer's license and extracts the key information of the corresponding certificate. The searching module receives the data passed by the extraction module and decides which information to search for, and the adding module inserts and stores the data passed by the searching module. With the device, lawyers and parties can register quickly, other court services can extract information quickly using the disclosed extraction method, and manual entry time is reduced.

Description

Certificate information extraction method and device based on OCR (optical character recognition) technology
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to a certificate information extraction method and device based on an OCR (optical character recognition) technology.
Background
As public legal awareness has grown in recent years, the number of cases accepted by courts has increased year by year, and a large number of parties and lawyers need to enter the courts every day. Conventional manual registration and the certificate information extraction services provided by the prior art cannot meet the courts' needs.
At present, to improve registration efficiency, many courts have introduced visitor registration systems that use an identity card reader to acquire identity card information, which reduces manual registration work. However, a lawyer's license has no chip and cannot be read quickly, so its information must still be entered manually. A party who has only a photocopy rather than the original identity card must also be registered manually. This increases the workload of registration staff and the waiting time of some parties and lawyers.
Disclosure of Invention
In view of this, the application provides a certificate information extraction method and device based on OCR technology, so that lawyers and parties can register quickly with the device and other court services can extract information quickly using the disclosed extraction method, thereby reducing manual entry time.
The application provides a certificate information extraction device based on OCR (optical character recognition) technology, which comprises a photographing module, an identification module, an extraction module, a searching module, a personnel information updating module and a personnel information adding module;
the photographing module comprises a high-speed document camera whose photo size and photographing speed are adjustable; it sends photographing commands at regular intervals to take pictures, stores the pictures in the device and uploads them to the identification module;
the identification module consists of an offline OCR service exposed through an API (application programming interface); after a picture taken by the photographing module is submitted to it, the identification module recognizes the picture by OCR and passes the result to the extraction module for processing;
the extraction module determines whether the certificate is an identity card or a lawyer's license, extracts the key information of the corresponding certificate, and passes the personnel information to the searching module once extraction is complete;
the searching module connects to the database, receives the data passed by the extraction module and decides which information to search for;
the adding module connects to the database, and inserts and stores the data passed by the searching module;
the updating module connects to the database and updates the stored information according to the certificate information provided by the searching module.
Further, the device also comprises a preprocessing module, wherein the preprocessing module is used for receiving the picture of the photographing module, optimizing the picture quality and format and transmitting the picture to the identification module.
Further, an error detection and correction module is included for correcting errors that may be introduced by the OCR engine.
Further, the system also comprises an encryption module which is used for taking data security and privacy protection measures when processing and storing the sensitive information.
The certificate information extraction method based on OCR technology applies to the certificate information extraction device described above and comprises the following steps:
step S0: the preprocessing module performs the following operations: denoising, rotation, cropping, resizing, graying and binarization;
step S1: performing OCR on the picture, extracting the text information in it, and returning the text information in JSON form;
step S2: correcting errors that may have been introduced by the OCR engine through the error detection and correction module;
step S3: parsing the JSON information; if an identity card is identified, searching by identity card number, and if a lawyer's license is identified, searching by license number;
step S4: when no relevant information is found in step S3, calling the adding module, which stores the new information for the next recognition;
step S5: when relevant information is found in step S3, calling the updating module, which overwrites the old information with the current information and stores it;
step S6: taking data security and privacy protection measures when processing and storing sensitive information (such as identity card numbers and lawyer's license numbers);
step S7: when a recognition error occurs or information needs to be updated, providing the user with a convenient feedback mechanism so that the user can correct the error or update the information.
Further, the step S3 includes the steps of:
step S31: determining whether the OCR text contains the resident identity card keyword; traversing the OCR result and finding the identity card number in the text with a regular expression; once found, removing that content from the OCR result to reduce the work of subsequent extraction steps;
step S32: locating the name keyword; if the text following the keyword matches a common name pattern and has a compliant length, taking it as the cardholder's name; otherwise checking whether the next entry matches a common name pattern and has a compliant length, and if so taking it as the cardholder's name; once the name is found, removing it from the OCR result, otherwise returning an empty value;
step S33: first locating the citizen identity number keyword, then locating the address keyword; the content between the two positions is the address; if a keyword cannot be found because of inaccurate character recognition, matching province-level region keywords to find the address;
step S34: locating the gender keyword; if it is followed by "male", the gender is male, and if it is followed by "female", the gender is female; if neither follows, checking whether the next entry of the text list is "male" or "female"; if it is still not found, traversing the whole text and, if a "male" or "female" keyword is present, setting the gender accordingly.
Further, the step S3 further includes the following steps:
step S321: determining whether the practice institution keyword is present; if so, checking whether the next line ends with the expected word; if it does, the next line is the practice institution (law firm); if not, checking the line after it, examining at most two lines;
step S322: traversing the whole text; first removing qualification certificate numbers that contain English letters, then matching the all-digit license number with a regular expression; because the all-digit license number differs in length from the identity card number, it is not confused with the identity card number; after a successful match, removing the matched content;
step S323: matching the identity card number with a regular expression and removing the content after a successful match; if the match fails, locating the identity card number keyword and checking whether either of the next two lines contains identity card content, and if not, checking the previous line, to guard against lines that were recognized out of order;
step S324: locating the license holder keyword; first checking whether the content following it matches a common name pattern and has a compliant length, and if so taking it as the name; otherwise checking whether the content at the next position matches a common name pattern and has a compliant length, and if so taking it as the name; if not, checking the content at the position before the license holder keyword.
Further, the key information of the identity card comprises the name, gender, address and identity card number, and the key information of the lawyer's license comprises the name, license number, gender, identity card number and practice institution.
Further, the specific steps of step S7 are as follows:
step S71: creating a feedback form;
step S72: associating the identification result with a feedback form;
step S73: receiving and processing feedback;
step S74: the OCR model is optimized.
Compared with the prior art, the application has the beneficial effects that:
1. With the certificate information extraction method and device based on OCR technology provided by the application, lawyers and parties can register quickly, other court services can extract information quickly using the disclosed extraction method, and manual entry time is reduced.
2. The application describes the logic for extracting identity card and lawyer's license information in detail, so that the extraction of key information can be reproduced quickly in any programming language simply by following that logic.
3. The application improves the working efficiency of court entrance staff, because lawyer's license and identity card information no longer has to be filled in manually. It is also convenient for parties and lawyers: even a party or lawyer who has forgotten the original identity card or license and has only a photocopy can register quickly through OCR recognition, which improves public satisfaction with the work of the courts.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a use flow of the device;
FIG. 2 is a block diagram of the overall module of the apparatus of the present application;
FIG. 3 is a step diagram of the present application;
FIGS. 4-7 are diagrams illustrating embodiments of the present application.
Description of the embodiments
The present application will be described in further detail with reference to the accompanying drawings, but embodiments of the present application are not limited thereto.
The device shown in FIG. 1 consists of a photographing module, an identification module, an extraction module, a searching module, a personnel information updating module and a personnel information adding module.
Photographing module: the main hardware is a high-speed document camera whose resolution can be adjusted to change the photo size and photographing speed. Its main task is to take pictures in response to photographing commands that the device sends at regular intervals and to store the pictures in the device.
Identification module: it mainly consists of an offline OCR recognition API service. When the device submits a picture taken by the photographing module, the identification module recognizes it by OCR, extracts all text in the picture, returns the text in JSON format, and passes it to the extraction module for processing.
Extraction module: after receiving the JSON returned by the identification module, it parses the JSON to obtain all the recognized text, determines from the extraction logic whether the certificate is an identity card or a lawyer's license, extracts the key information of the corresponding certificate, and passes the personnel information to the searching module once extraction is complete.
Searching module: it connects to the database and, from the data passed by the extraction module, decides whether to search for lawyer information or party information. A lawyer is searched for by lawyer's license number and a party by identity card number. Different modules are called depending on the search result: if no relevant information is found, the adding module is called; if relevant information is found, the updating module is called.
Adding module: it connects to the database, and inserts and stores the data passed by the searching module.
Updating module: it connects to the database and overwrites the old certificate information with the certificate information provided by the searching module.
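The hand-off from the identification module to the extraction module can be sketched as follows. This is a minimal illustration only: the JSON layout (a `texts` list of strings), the function name `classify_certificate` and the keyword strings are assumptions, since the patent does not specify the schema or the exact keywords.

```python
import json

# Hypothetical keyword lists; the patent does not enumerate them.
ID_CARD_KEYWORDS = ("公民身份号码", "居民身份证")
LAWYER_KEYWORDS = ("执业证号", "执业机构", "律师执业证")


def classify_certificate(ocr_json: str) -> str:
    """Return 'lawyer_license', 'id_card' or 'unknown' for one OCR result."""
    texts = json.loads(ocr_json).get("texts", [])  # assumed schema
    joined = "".join(texts)
    # Check the lawyer's license first, because it may also print an
    # identity card number field.
    if any(keyword in joined for keyword in LAWYER_KEYWORDS):
        return "lawyer_license"
    if any(keyword in joined for keyword in ID_CARD_KEYWORDS):
        return "id_card"
    return "unknown"
```

The returned label would then decide whether the searching module looks up the person by license number or by identity card number.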
The certificate information extraction method based on the OCR technology is suitable for the certificate information extraction device based on the OCR technology, and comprises the following steps:
step S0: the preprocessing module optimizes the picture quality and format, and executes the following operations: denoising, rotating, cutting, adjusting the size, graying and binarizing;
Denoising: a smoothing filter, such as a Gaussian or median filter, is applied to reduce noise in the picture. The OpenCV library provides implementations of these filters, as shown in FIG. 4.
Rotation: if the picture is not properly aligned, it is corrected with a rotation. To rotate automatically, straight lines in the picture are detected with the Hough transform and the rotation angle is computed from them. A code example follows:
import cv2
import numpy as np
# denoised_image is the BGR picture produced by the denoising step
height, width = denoised_image.shape[:2]
gray_image = cv2.cvtColor(denoised_image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray_image, 50, 150)
lines = cv2.HoughLines(edges, 1, np.pi / 180, 100)
rotated_image = denoised_image
if lines is not None:  # HoughLines returns None when no line is found
    # theta is the angle of each line's normal; subtracting 90 degrees gives the tilt
    # of the line itself (this assumes the detected lines are mostly horizontal)
    average_angle = np.mean([np.degrees(theta) for rho, theta in lines[:, 0]])
    rotation_matrix = cv2.getRotationMatrix2D((width / 2, height / 2), average_angle - 90, 1)
    rotated_image = cv2.warpAffine(denoised_image, rotation_matrix, (width, height))
Cropping: edge detection and contour search determine the boundary of the text region, which is then cropped out. A code example follows:
gray_image = cv2.cvtColor(rotated_image, cv2.COLOR_BGR2GRAY)
_, threshold_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# OpenCV 4.x returns (contours, hierarchy)
contours, _ = cv2.findContours(threshold_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cropped_image = rotated_image  # fall back to the full picture if no contour is found
if contours:
    # keep the bounding box of the contour with the largest area
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)
    cropped_image = rotated_image[y:y + h, x:x + w]
Resizing: the cropped picture is resized to a standard size to simplify subsequent processing. A code example follows:
standard_width, standard_height = 600, 400
resized_image = cv2.resize(cropped_image, (standard_width, standard_height))
Graying: the color picture is converted to a grayscale picture to reduce the amount of computation and simplify subsequent processing. A code example follows:
gray_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2GRAY)
Binarization: the grayscale picture is converted to a binary picture with a thresholding method (such as Otsu's method) to facilitate text recognition. A code example follows:
_, binary_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
After these preprocessing operations are completed, step S1 is performed.
Step S1: performing OCR on the picture, extracting the text information in it, and returning the text information in JSON form;
step S2: introducing an error detection and correction module to correct errors that may have been introduced by the OCR engine, as follows:
A function is created to check the validity of the extracted identity card number and lawyer's license number. The function should apply the following rules:
a. Length check: ensure that the length of the extracted number matches the expected length. For example, a Chinese resident identity card number has 18 characters.
b. Format check: verify that the extracted number meets the required format. For example, the first 17 characters of an identity card number must be digits, and the last character may be a digit or the letter "X".
c. Check code verification: compute and verify the check code with the corresponding algorithm. For example, the last character of an identity card number is a check code that can be computed from the first 17 digits.
If an invalid identity card number or lawyer's license number is detected, automatic correction is attempted. For example, for a common OCR error (such as the digit "0" recognized as the letter "O"), the suspect character is replaced and the validity of the number is checked again.
If automatic correction fails, the user enters the correct number manually. The system can compare the number entered by the user with the extracted number to further optimize the OCR model. An example is shown in FIG. 5.
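The three checks and the automatic correction described above can be sketched for the identity card number as follows. The check code uses the weights and check characters of the national standard for citizen identity numbers (GB 11643-1999). The function names and the table of look-alike characters are assumptions for illustration; the patent's own example is in FIG. 5 and is not reproduced here.

```python
import re

# Weights and check characters defined by GB 11643-1999.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"

# Hypothetical table of letters that OCR commonly confuses with digits.
OCR_FIXES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "B": "8", "S": "5"})


def is_valid_id_number(number: str) -> bool:
    """Apply the length, format and check code rules to an ID number."""
    # Length and format: exactly 17 digits followed by a digit or 'X'.
    if not re.fullmatch(r"\d{17}[\dX]", number):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(number[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == number[17]


def correct_id_number(raw: str):
    """Return a valid number after look-alike substitution, or None."""
    raw = raw.strip()
    if is_valid_id_number(raw):
        return raw
    # Replace look-alike letters, and normalise a lowercase 'x' check code.
    candidate = raw[:17].translate(OCR_FIXES) + raw[17:].translate(OCR_FIXES).upper()
    return candidate if is_valid_id_number(candidate) else None
```

When `correct_id_number` returns `None`, the number would be handed to the manual entry path described above.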
Step S3: parsing the JSON information; if an identity card is identified, searching by identity card number, and if a lawyer's license is identified, searching by license number;
step S4: when no relevant information is found in step S3, calling the adding module, which stores the new information for the next recognition;
step S5: when relevant information is found in step S3, calling the updating module, which overwrites the old information with the current information and stores it.
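Steps S3 to S5 amount to a lookup followed by either an insert or an overwrite. The sketch below illustrates that flow with Python's built-in `sqlite3` module. The table layout, column names and the helpers `open_database` and `save_person` are hypothetical; the patent does not name a database or a schema.

```python
import sqlite3


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with an assumed 'person' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person ("
        "id_type TEXT, number TEXT, name TEXT, "
        "PRIMARY KEY (id_type, number))"
    )
    return conn


def save_person(conn, id_type: str, number: str, name: str) -> str:
    """Search by certificate number, then insert (S4) or overwrite (S5)."""
    row = conn.execute(
        "SELECT 1 FROM person WHERE id_type = ? AND number = ?",
        (id_type, number),
    ).fetchone()
    if row is None:
        # Step S4: nothing found, so the adding module stores a new record.
        conn.execute(
            "INSERT INTO person (id_type, number, name) VALUES (?, ?, ?)",
            (id_type, number, name),
        )
        action = "inserted"
    else:
        # Step S5: a record exists, so the updating module overwrites it.
        conn.execute(
            "UPDATE person SET name = ? WHERE id_type = ? AND number = ?",
            (name, id_type, number),
        )
        action = "updated"
    conn.commit()
    return action
```

Keying the table on the certificate type together with the number keeps a lawyer's license number from colliding with an identity card number.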
Step S6: taking data security and privacy protection measures when processing and storing sensitive information (such as identity card numbers and lawyer's license numbers). For example, data may be encrypted, access rights may be set, and secure transmission protocols may be used.
As a specific embodiment, sensitive data is encrypted with a strong encryption algorithm (for example, AES) to prevent unauthorized access and leakage. Encrypting and decrypting data with public and private keys may also be considered, as shown in FIG. 6 (1).
Setting access rights: access to sensitive data is restricted through access control lists (ACL) or role-based access control (RBAC), ensuring that only authenticated and authorized users can access, modify or delete sensitive information, as shown in FIG. 6 (2).
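The role-based access control mentioned above can be illustrated with a very small permission table. This is only a sketch under stated assumptions: the role names, the permission sets and the helpers `is_allowed` and `read_record` are hypothetical and do not reproduce FIG. 6 (2).

```python
# Hypothetical roles and the actions each role may perform.
ROLE_PERMISSIONS = {
    "registrar": {"read", "update"},
    "administrator": {"read", "update", "delete"},
    "visitor": set(),
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role is known and grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


def read_record(role: str, record: dict) -> dict:
    """Return the record only to roles that hold the 'read' permission."""
    if not is_allowed(role, "read"):
        raise PermissionError(f"role {role!r} may not read sensitive data")
    return record
```

Unknown roles fall through to an empty permission set, so access is denied by default.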
Step S7: when a recognition error occurs or information needs to be updated, the user is given a convenient feedback mechanism for correcting the error or updating the information. This helps improve the accuracy and timeliness of the system. The steps are as follows:
step S71: creating a feedback form:
A simple user interface is designed so that the user can enter the correct information and submit it. The form should contain every field that may need correction or updating, such as the identity card number and the lawyer's license number.
Step S72: associating the recognition result with the feedback form:
When the recognition result is displayed, a feedback button is provided; clicking it takes the user directly to the feedback form. The original recognition result should be filled into the form automatically so that the user only needs to modify the wrong parts.
Step S73: receiving and processing feedback:
After the user submits the feedback, the back-end server receives the feedback data and validates it. Once validation passes, the relevant records in the database are updated.
Step S74: optimizing the OCR model:
The users' feedback data is collected into a training set for optimizing the OCR model. Iterative training and model updates improve the recognition accuracy.
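The server-side part of steps S73 and S74 can be sketched without any web framework. In this illustration the field whitelist, the function name `process_feedback` and the shape of the training log are assumptions; the patent leaves the validation rules and the training data format open.

```python
# Hypothetical set of fields that a user is allowed to correct.
EDITABLE_FIELDS = {"name", "gender", "address", "id_number", "license_number"}


def process_feedback(record: dict, feedback: dict, training_log: list) -> dict:
    """Apply validated corrections to a copy of the record (S73) and log
    every changed field as (field, recognized, corrected) for S74."""
    updated = dict(record)
    for field, value in feedback.items():
        # Ignore unknown fields and empty or non-text values.
        if field not in EDITABLE_FIELDS or not isinstance(value, str):
            continue
        value = value.strip()
        if not value or updated.get(field) == value:
            continue
        training_log.append((field, updated.get(field), value))
        updated[field] = value
    return updated
```

Only fields that actually changed are logged, so the training set contains just the cases where the recognition result was wrong.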
FIG. 7 (1) and (2) show a simple example that implements the feedback form using Flask and HTML.
As a specific embodiment, the step S3 includes the following steps:
step S31: determining whether the OCR text contains the resident identity card keyword; traversing the OCR result and finding the identity card number in the text with a regular expression; once found, removing that content from the OCR result to reduce the work of subsequent extraction steps;
step S32: locating the name keyword; if the text following the keyword matches a common name pattern and has a compliant length, taking it as the cardholder's name; otherwise checking whether the next entry matches a common name pattern and has a compliant length, and if so taking it as the cardholder's name; once the name is found, removing it from the OCR result, otherwise returning an empty value;
step S33: first locating the citizen identity number keyword, then locating the address keyword; the content between the two positions is the address; if a keyword cannot be found because of inaccurate character recognition, matching province-level region keywords to find the address;
step S34: locating the gender keyword; if it is followed by "male", the gender is male, and if it is followed by "female", the gender is female; if neither follows, checking whether the next entry of the text list is "male" or "female"; if it is still not found, traversing the whole text and, if a "male" or "female" keyword is present, setting the gender accordingly.
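Steps S31 and S34 can be sketched over a list of recognized text lines as follows. The regular expression, the Chinese keyword and character literals, and the function names are assumptions chosen for illustration; the patent describes the logic but not the exact patterns.

```python
import re

# 17 digits plus a digit or X, not embedded in a longer digit run.
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
GENDERS = ("男", "女")  # male, female


def extract_id_number(lines):
    """S31: find the ID number and remove it from the OCR result."""
    for i, line in enumerate(lines):
        match = ID_PATTERN.search(line)
        if match:
            rest = lines[:i] + [line.replace(match.group(), "")] + lines[i + 1:]
            return match.group().upper(), [s for s in rest if s.strip()]
    return None, list(lines)


def extract_gender(lines):
    """S34: keyword first, then the next entry, then a full-text scan."""
    for i, line in enumerate(lines):
        pos = line.find("性别")  # assumed gender keyword
        if pos == -1:
            continue
        after = line[pos + 2:].strip()
        if after[:1] in GENDERS:
            return after[0]
        # Fall back to the next entry of the text list.
        if i + 1 < len(lines) and lines[i + 1].strip()[:1] in GENDERS:
            return lines[i + 1].strip()[0]
        break
    # Last resort: the first gender character anywhere in the text.
    for ch in "".join(lines):
        if ch in GENDERS:
            return ch
    return None
```

The final full-text scan is deliberately coarse, as in the described logic; it can misfire when a name or address happens to contain one of the two characters.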
As a specific embodiment, step S3 further includes the following steps:
step S321: determining whether the practice institution keyword is present; if so, checking whether the next line ends with the expected word; if it does, the next line is the practice institution (law firm); if not, checking the line after it, examining at most two lines;
step S322: traversing the whole text; first removing qualification certificate numbers that contain English letters, then matching the all-digit license number with a regular expression; because the all-digit license number differs in length from the identity card number, it is not confused with the identity card number; after a successful match, removing the matched content;
step S323: matching the identity card number with a regular expression and removing the content after a successful match; if the match fails, locating the identity card number keyword and checking whether either of the next two lines contains identity card content, and if not, checking the previous line, to guard against lines that were recognized out of order;
step S324: locating the license holder keyword; first checking whether the content following it matches a common name pattern and has a compliant length (a length not exceeding 4), and if so taking it as the name; otherwise checking whether the content at the next position matches a common name pattern and has a compliant length, and if so taking it as the name; if not, checking the content at the position before the license holder keyword.
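The ordering in steps S322 and S323 (drop letter-bearing qualification numbers, match the all-digit license number, then match the identity card number) can be sketched as below. The 17-digit license length, the shape of the qualification number and the function name `extract_numbers` are assumptions; the patent only states that the license number is all digits and differs in length from the identity card number.

```python
import re

# Assumed shape: one letter followed by a run of digits.
QUALIFICATION_PATTERN = re.compile(r"(?<![\dA-Za-z])[A-Za-z]\d{6,}")
# Assumed 17-digit license number. The lookahead also rejects a trailing X,
# so the first 17 digits of an ID number ending in X are not mistaken for it.
LICENSE_PATTERN = re.compile(r"(?<!\d)\d{17}(?![\dXx])")
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def extract_numbers(text: str) -> dict:
    """Return the license number and ID number found in the OCR text."""
    # S322: remove qualification numbers that contain English letters.
    text = QUALIFICATION_PATTERN.sub(" ", text)
    result = {"license_number": None, "id_number": None}
    match = LICENSE_PATTERN.search(text)
    if match:
        result["license_number"] = match.group()
        # Remove the matched content before looking for the ID number.
        text = text.replace(match.group(), " ", 1)
    # S323: match the identity card number.
    match = ID_PATTERN.search(text)
    if match:
        result["id_number"] = match.group().upper()
    return result
```

The keyword-based fallback of step S323, which looks at neighbouring lines when the pattern does not match, is omitted from this sketch.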
As a specific embodiment, the key information of the identity card comprises the name, gender, address and identity card number, and the key information of the lawyer's license comprises the name, license number, gender, identity card number and practice institution.
The method consolidates and unifies the extraction of key certificate information, and adds a fault-tolerance mechanism based on actual photographing results at registration and on the observed recognition error rate, which further improves the accuracy of the recognition result. The working efficiency of court entrance staff is improved because lawyer's license and identity card information no longer has to be filled in manually. It is convenient for parties and lawyers: even a party or lawyer who has forgotten the identity card or lawyer's license can be registered quickly through OCR recognition, which improves public satisfaction with the work of the courts. Data security is protected during certificate recognition, and leakage of personal information is avoided.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (9)

1. The certificate information extraction device based on the OCR technology is characterized by comprising a photographing module, an identification module, an extraction module, a searching module, a personnel information updating module and a personnel information adding module;
the photographing module can adjust definition so as to adjust the size of the photo and the shooting speed, is used for issuing shooting commands at timed intervals to take pictures, and stores the captured pictures in the device and uploads them;
the identification module consists of an offline OCR recognition API service; after a photo taken by the photographing module is submitted to the identification module, the photo is recognized by OCR and the result is submitted to the extraction module for processing;
the extraction module is written in Java according to the key information extraction logic for the identity card and the lawyer's certificate, and is used for determining whether the certificate is an identity card or a lawyer's certificate, extracting the key information of the corresponding certificate, and transmitting the personnel information to the searching module after the extraction is completed;
the searching module is connected to the database and is used for searching for information matching the data transmitted by the extraction module;
the adding module is connected to the database and is used for inserting and storing the data transmitted by the searching module;
the updating module is connected to the database and is used for updating information according to the certificate information provided by the searching module.
2. The apparatus for extracting information from a document based on OCR recognition technology as recited in claim 1, further comprising a preprocessing module for receiving a picture from the photographing module, optimizing the picture quality and format, and transmitting the picture to the identification module.
3. The apparatus for extracting information from a document based on OCR recognition technology of claim 1, further comprising an error detection and correction module for correcting errors that may be introduced by the OCR engine.
4. The apparatus for extracting information from a document based on OCR recognition technology of claim 1, further comprising an encryption module for taking data security and privacy protection measures when processing and storing sensitive information.
5. A method for extracting certificate information based on an OCR recognition technology, which is applicable to the device for extracting certificate information based on an OCR recognition technology as set forth in claims 1 to 4, and is characterized by comprising the following steps:
step S0: the preprocessing module performs the following operations: denoising, rotation, cropping, resizing, graying and binarization;
step S1: performing OCR recognition, extracting the text information in the picture, and returning the text information in JSON form;
step S2: correcting errors possibly introduced by the OCR engine through the error detection and correction module;
step S3: parsing the JSON information; if the certificate is identified as an identity card, performing retrieval by identity card number; if the certificate is identified as a lawyer's certificate, performing retrieval by lawyer's certificate number;
step S4: when no relevant information is found in step S3, calling the adding module, which stores the new information for use in the next identification;
step S5: when relevant information is found in step S3, overwriting the old information with the current information and storing it through the updating module, thereby updating the information;
step S6: data security and privacy protection measures are taken when sensitive information is processed and stored;
step S7: when an error is identified or information needs to be updated, a convenient feedback mechanism is provided for the user to correct the error and update the information.
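The search, add and update flow of steps S3 to S5 can be sketched as below. This is only an illustration under assumptions: a Python dict stands in for the database, and the JSON field names (`type`, `id_number`, `license_number`) and the function name `register` are hypothetical, since the claims do not specify the record layout.

```python
import json


def register(record_json, db):
    """Look up the record (S3), then add it (S4) or overwrite it (S5).

    `db` is an in-memory dict used here as a stand-in for the database.
    Returns "added" or "updated" so the caller can see which branch ran.
    """
    record = json.loads(record_json)
    # S3: choose the lookup key according to the certificate type
    if record["type"] == "id_card":
        key = ("id_card", record["id_number"])
    else:
        key = ("lawyer_certificate", record["license_number"])
    # S4 / S5: insert when absent, otherwise overwrite the old information
    action = "updated" if key in db else "added"
    db[key] = record
    return action
```

For example, registering the same lawyer's certificate twice would add the record on the first call and overwrite it on the second, leaving a single entry in the store.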
6. The method for extracting information from a document based on the OCR recognition technology as recited in claim 5, wherein the step S3 includes the steps of:
step S31: judging whether the OCR text information contains the resident identity card keyword; traversing the OCR results and finding the identity card number in the text information by regular-expression matching; after the identity card number is found, removing that content from the OCR recognition results so as to reduce the work of subsequent extraction judgments;
step S32: locating the name keyword marker; if the content following the name keyword matches a common name and has a compliant length, taking it as the name of the cardholder; if not, judging whether the content at the next position after the marker matches a common name and has a compliant length; if a name is found, removing the identified content, otherwise returning an empty value;
step S33: first locating the keyword marker of the citizen identity number, then locating the address keyword marker, the content between the two markers being the address; if a keyword cannot be found because of inaccurate character recognition, matching provincial-region keywords instead to find the address;
step S34: locating the gender keyword; if it is followed by "male", the gender is male; if it is followed by "female", the gender is female; if neither, judging whether the content at the next index of the text list is "male" or "female"; if still not found, traversing the whole text to detect whether it contains the "male" or "female" keyword and, if so, setting it as the gender.
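Steps S31 and S34 can be illustrated with a short sketch. It is offered under assumptions rather than as the claimed implementation: the pattern assumes the 18-character citizen identity number (17 digits followed by a digit or X), the keyword `性别` and the function names are hypothetical, and the sample number in the usage note is made up.

```python
import re

# 17 digits followed by a digit or X, not embedded in a longer digit run
ID_NUMBER = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
GENDER_KEYWORD = "性别"  # assumed keyword text
GENDERS = ("男", "女")


def extract_id_number(lines):
    """S31: return (id_number, remaining_lines) with the match removed."""
    for i, line in enumerate(lines):
        m = ID_NUMBER.search(line)
        if m:
            rest = lines[:i] + [line.replace(m.group(), "")] + lines[i + 1:]
            return m.group(), rest
    return "", list(lines)


def extract_gender(lines):
    """S34: keyword first, then the next list entry, then the whole text."""
    for i, line in enumerate(lines):
        if GENDER_KEYWORD in line:
            tail = line.split(GENDER_KEYWORD, 1)[1].strip()
            for g in GENDERS:
                if tail.startswith(g):
                    return g
            if i + 1 < len(lines) and lines[i + 1].strip() in GENDERS:
                return lines[i + 1].strip()
            break
    # Last resort: scan every line for either gender character
    for line in lines:
        for g in GENDERS:
            if g in line:
                return g
    return ""
```

With the lines `["姓名 张三", "性别 男 民族 汉", "公民身份号码 11010519900307123X"]`, the first function would return the number and strip it from the third line, and the second would return `男` from the keyword line. Removing the number early mirrors the stated aim of reducing later judgments.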
7. The method for extracting information from a document based on the OCR recognition technology as recited in claim 5, wherein the step S3 further comprises:
step S321: judging whether the practice institution keyword is contained; if so, judging whether the next line ends with the word; if so, the next line is the legal address; if not, continuing to judge the following line, with at most two lines being judged;
step S322: traversing the whole text; first removing the content related to the qualification number, which contains English letters; then matching the all-digit license number with a regular expression; because the length of the all-digit license number differs from that of the identity card number, the license number will not be matched as the identity card number; removing the content after successful identification;
step S323: matching the identity card number with a regular expression and removing the content on success; if the match fails, locating the identity card number keyword and judging whether the next one or two lines contain the identity card number; if not, judging the previous line, so as to guard against lines being recognized out of order;
step S324: locating the keyword marker of the licensor; first judging whether the content following the marker matches a common name and has a compliant length, and if so, taking it as the name; otherwise judging whether the content at the next position matches a common name and has a compliant length, and if so, taking it as the name; if not, judging the content at the position preceding the licensor marker.
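The ordering in steps S322 and S323 (remove the letter-bearing qualification number, match the all-digit license number, then match the identity card number) can be sketched as follows. The patterns are assumptions for illustration: the claims say only that the license number is all digits and differs in length from the identity card number, so the 17-digit length, the letter-plus-digits shape of the qualification number and the function name `extract_numbers` are hypothetical. The keyword fallback of step S323 is omitted.

```python
import re

QUALIFICATION = re.compile(r"[A-Za-z]\d+")               # assumed: letter then digits
LICENSE_NUMBER = re.compile(r"(?<!\d)\d{17}(?![\dXx])")  # assumed: exactly 17 digits
ID_NUMBER = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")     # 17 digits + digit or X


def first_match(pattern, lines):
    """Return the first match of `pattern` and remove it from `lines` in place."""
    for i, line in enumerate(lines):
        m = pattern.search(line)
        if m:
            lines[i] = line.replace(m.group(), "")
            return m.group()
    return ""


def extract_numbers(lines):
    """Return (license_number, id_number, remaining_lines)."""
    # S322: drop qualification numbers so their digits cannot be mistaken
    remaining = [QUALIFICATION.sub("", line) for line in lines]
    license_no = first_match(LICENSE_NUMBER, remaining)
    # S323: the identity card number is longer, so the patterns do not collide
    id_no = first_match(ID_NUMBER, remaining)
    return license_no, id_no, remaining
```

Stripping the qualification number first matters in this sketch because a letter followed by 17 digits would otherwise satisfy the license pattern.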
8. The method for extracting information from a document based on the OCR recognition technology as recited in claim 5, wherein the key information of the identification card includes a name, a gender, an address, and an identification card number, and the key information of the lawyer's certificate includes a name, a license number, a gender, an identification card number, and a license agency.
9. The method for extracting certificate information based on OCR technology as recited in claim 5, wherein step S7 comprises the following specific steps:
step S71: creating a feedback form;
step S72: associating the identification result with a feedback form;
step S73: receiving and processing feedback;
step S74: the OCR model is optimized.
CN202310554961.6A 2023-05-17 2023-05-17 Certificate information extraction method and device based on OCR (optical character recognition) technology Pending CN116665224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310554961.6A CN116665224A (en) 2023-05-17 2023-05-17 Certificate information extraction method and device based on OCR (optical character recognition) technology

Publications (1)

Publication Number Publication Date
CN116665224A true CN116665224A (en) 2023-08-29

Family

ID=87727117

Country Status (1)

Country Link
CN (1) CN116665224A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination