CN116665224A - Certificate information extraction method and device based on OCR (optical character recognition) technology - Google Patents
- Publication number
- CN116665224A (application number CN202310554961.6)
- Authority
- CN
- China
- Prior art keywords
- module
- information
- identification
- ocr
- judging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
- G06V10/95—Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/98—Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/16—Image preprocessing
- G06V30/162—Quantising the image signal
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Character Discrimination (AREA)
Abstract
The application discloses a certificate information extraction method and device based on OCR technology. The device comprises a photographing module, an identification module, an extraction module, a searching module, a personnel information updating module and a personnel information adding module. The photographing module sends photographing commands at fixed intervals to take pictures, stores the pictures in the device and uploads them to the identification module; the identification module recognizes the text in each picture through OCR and passes the result to the extraction module; the extraction module determines whether the certificate is an identity card or a lawyer card and extracts the key information of the corresponding certificate; the searching module receives the data transmitted by the extraction module and determines which information to search for; and the adding module inserts and stores the data transmitted by the searching module. With this device, lawyers and parties can register quickly, other court services can extract information quickly using the same extraction method, and manual entry time is reduced.
Description
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to a certificate information extraction method and device based on an OCR (optical character recognition) technology.
Background
With growing public legal awareness in recent years, the number of cases accepted by courts has increased year by year. A large number of parties and lawyers enter the courts every day, and conventional manual registration, together with the certificate information extraction services offered by the prior art, can no longer meet the courts' needs.
At present, to improve registration efficiency, many courts have introduced visitor registration machines that use an identity card reader to acquire identity card information, which reduces manual registration work. However, because lawyer cards have no chip and cannot be read in this way, lawyers' information must still be entered manually; likewise, parties who bring only a photocopy rather than the original identity card must be entered manually. This increases the workload of registration staff and the waiting time of some parties and lawyers.
Disclosure of Invention
In view of this, the application provides a certificate information extraction method and device based on OCR technology, so that lawyers and parties can register quickly using the device, and other court services can extract information quickly using the information extraction method described in this patent, thereby reducing manual entry time.
The application provides a certificate information extraction device based on OCR (optical character recognition) technology, which comprises a photographing module, an identification module, an extraction module, a searching module, a personnel information updating module and a personnel information adding module;
the photographing module comprises a high-speed document camera whose resolution can be adjusted to change the photo size and photographing speed; the module sends photographing commands at fixed intervals to take pictures, stores the pictures in the device and uploads them to the identification module;
the identification module consists of an offline OCR recognition API service; after a photo taken by the photographing module is submitted to the identification module, the identification module recognizes it through OCR and passes the result to the extraction module for processing;
the extraction module is used for determining whether the certificate is an identity card or a lawyer card, extracting the key information of the corresponding certificate, and transmitting the personnel information to the searching module after the extraction is completed;
the searching module is connected to the database and determines which information to search for based on the data transmitted by the extraction module;
the adding module is connected to the database and inserts and stores the data transmitted by the searching module;
the updating module is connected to the database and updates the stored information according to the certificate information provided by the searching module.
Further, the device also comprises a preprocessing module, which receives the picture from the photographing module, optimizes the picture quality and format, and transmits the picture to the identification module.
Further, the device comprises an error detection and correction module for correcting errors that may be introduced by the OCR engine.
Further, the device comprises an encryption module for applying data security and privacy protection measures when sensitive information is processed and stored.
The certificate information extraction method based on the OCR technology is suitable for the certificate information extraction device based on the OCR technology, and comprises the following steps:
step S0: the preprocessing module performs the following operations: denoising, rotating, cropping, resizing, graying and binarizing;
step S1: performing OCR on the picture, extracting the text information in the picture, and returning the text information in JSON form;
step S2: correcting errors possibly introduced by the OCR engine through the error detection and correction module;
step S3: parsing the JSON information; if the certificate is identified as an identity card, searching by identity card number, and if it is identified as a lawyer card, searching by lawyer card number;
step S4: when no relevant information is found in step S3, calling the adding module, which stores the new information for use in the next identification;
step S5: when relevant information is found in step S3, overwriting the old information with the current information through the updating module and storing it, so that the information is updated;
step S6: taking data security and privacy protection measures when processing and storing sensitive information (such as identity card numbers and lawyer card numbers);
step S7: when a recognition error occurs or information needs to be updated, providing the user with a convenient feedback mechanism so that they can correct the error or update the information.
Further, the step S3 includes the steps of:
step S31: determining whether the OCR text information contains the resident identity card keyword; traversing the OCR result and finding the identity card number in the text using a regular expression; after it is found, removing it from the OCR result so as to reduce the work of subsequent extraction steps;
step S32: locating the name keyword; if the keyword is followed by further characters that match a common name and have a compliant length, taking those characters as the cardholder's name; otherwise checking whether the content at the next position matches a common name and has a compliant length, and if so taking it as the cardholder's name; once the name is found, removing it from the recognition result; otherwise returning an empty value;
step S33: locating the address keyword and the citizen identity number keyword, the content between the two positions being the address; if a keyword cannot be found because of inaccurate character recognition, matching province-level region names instead to locate the address;
step S34: locating the gender keyword; if it is followed by "male", the gender is male, and if it is followed by "female", the gender is female; if neither follows, checking whether the content at the next index of the text list is "male" or "female"; if the gender is still not found, traversing the whole text to detect whether it contains the keyword "male" or "female", and if so setting that value as the gender.
Further, the step S3 further includes the following steps:
step S321: determining whether the practice institution keyword is present; if so, checking whether the next line ends with the designated word; if it does, the next line is the practice institution (law firm); if not, continuing with the following line, checking at most two lines;
step S322: traversing the whole text; first removing qualification certificate numbers, which contain English letters; then matching the all-digit license number with a regular expression; because the length of the all-digit license number differs from that of the identity card number, there is no risk of matching the identity card number by mistake; removing the content after successful recognition;
step S323: matching the identity card number with a regular expression and removing the content on success; if matching fails, locating the identity card number keyword and checking whether the next one or two lines contain the identity card number, and if not, checking the previous line, so as to guard against misaligned lines in the recognition result;
step S324: locating the license holder keyword; first checking whether the content following the keyword matches a common name and has a compliant length, and if so taking it as the name; otherwise checking whether the content at the next position matches a common name and has a compliant length, and if so taking it as the name; if not, checking the content at the position before the license holder keyword.
Further, the key information of the identity card comprises the name, gender, address and identity card number, and the key information extracted from the lawyer card comprises the name, license number, gender, identity card number and practice institution.
Further, the specific steps of step S7 are as follows:
step S71: creating a feedback form;
step S72: associating the identification result with a feedback form;
step S73: receiving and processing feedback;
step S74: the OCR model is optimized.
Compared with the prior art, the application has the beneficial effects that:
1. With the certificate information extraction method and device based on OCR technology provided by the application, lawyers and parties can register quickly using the device, other court services can extract information quickly using the information extraction method described in this patent, and manual entry time is reduced.
2. The application describes in detail the logic for extracting identity card and lawyer card information, so that the method can be reproduced quickly in different programming languages simply by following this logic, thereby achieving key information extraction.
3. The application greatly improves the working efficiency of court security staff, since lawyer card and identity card information no longer needs to be filled in manually. It is also convenient for parties and lawyers: even if a party or lawyer has forgotten the original identity card or lawyer card and has only a photocopy, quick registration can still be performed based on OCR recognition, which improves public satisfaction with the court's work.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a use flow of the device;
FIG. 2 is a block diagram of the overall module of the apparatus of the present application;
FIG. 3 is a step diagram of the present application;
FIGS. 4-7 are diagrams illustrating embodiments of the present application.
Description of the embodiments
The present application will be described in further detail with reference to the accompanying drawings, but embodiments of the present application are not limited thereto.
The whole device shown in fig. 1 consists of a photographing module, an identification module, an extraction module, a searching module, a personnel information updating module and a personnel information adding module;
Photographing module: the main hardware is a high-speed document camera whose resolution can be adjusted to change the photo size and photographing speed. Its main task is to take pictures when the device sends photographing commands at fixed intervals, and to store the pictures in the device.
Identification module: mainly consists of an offline OCR recognition API service. When the device submits a photo taken by the photographing module, the identification module recognizes it through OCR, extracts all text information in the photo, returns the text information in JSON format, and passes it to the extraction module for processing.
Extraction module: after receiving the JSON information returned by the identification module, parses it to obtain all the text recognized by OCR, determines according to the extraction logic whether the certificate is an identity card or a lawyer card, extracts the key information of the corresponding certificate, and transmits the personnel information to the searching module once extraction is finished.
Searching module: connects to the database and, based on the data transmitted by the extraction module, determines whether to search for lawyer information or party information. A lawyer is searched by lawyer card number and a party by identity card number. Different modules are then called according to the search result: if no relevant information is found, the adding module is called; if relevant information is found, the updating module is called.
Adding module: connects to the database, and inserts and stores the data transmitted by the searching module.
Updating module: connects to the database and overwrites the old certificate information with the certificate information provided by the searching module, so that the information is updated.
The certificate information extraction method based on the OCR technology is suitable for the certificate information extraction device based on the OCR technology, and comprises the following steps:
step S0: the preprocessing module optimizes the picture quality and format, and executes the following operations: denoising, rotating, cutting, adjusting the size, graying and binarizing;
Denoising: a smoothing filter, such as a Gaussian or median filter, is applied to reduce noise in the picture. Implementations of these filters are provided by the OpenCV library, as shown in fig. 4;
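Since fig. 4 is not reproduced here, the following is a minimal sketch of what a median filter does, written in plain Python rather than OpenCV so that it is self-contained. The function name `median_filter_3x3` and the nested-list image representation are assumptions made for illustration; in OpenCV the corresponding calls would typically be `cv2.medianBlur` or `cv2.GaussianBlur`.

```python
from statistics import median


def median_filter_3x3(image):
    """Replace every interior pixel of a grayscale image (a list of rows of
    ints) with the median of its 3x3 neighbourhood. Border pixels are copied
    unchanged to keep the sketch short."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            result[y][x] = int(median(window))
    return result


# A single bright "salt" pixel in an otherwise uniform region is removed.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
]
cleaned = median_filter_3x3(noisy)
```

A median filter is often preferred for scanned certificates because it removes isolated speckles while keeping character edges sharper than an averaging filter would.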
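Rotation: if the picture is not properly aligned, it is corrected using a rotation operation. To realize automatic rotation, straight lines in the picture are detected using the Hough transform and the rotation angle is calculated from them; a code example is as follows:
import cv2
import numpy as np

# Detect edges and then straight lines in the denoised picture
gray_image = cv2.cvtColor(denoised_image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray_image, 50, 150)
lines = cv2.HoughLines(edges, 1, np.pi / 180, 100)
height, width = denoised_image.shape[:2]
average_angle = 0.0
if lines is not None:
    # theta is the angle of the line's normal, so a horizontal text line has
    # theta = 90 degrees and (theta - 90) is the skew of that line
    angles = [np.degrees(theta) - 90 for rho, theta in lines[:, 0]]
    # Ignore near-vertical lines such as the left and right card edges
    near_horizontal = [a for a in angles if abs(a) < 45]
    if near_horizontal:
        average_angle = float(np.mean(near_horizontal))
# Rotate about the picture centre by the average skew
rotation_matrix = cv2.getRotationMatrix2D((width / 2, height / 2), average_angle, 1)
rotated_image = cv2.warpAffine(denoised_image, rotation_matrix, (width, height))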
Cropping: thresholding and contour lookup are used to determine the boundary of the text region, which is then cropped out; a code example is as follows:
gray_image = cv2.cvtColor(rotated_image, cv2.COLOR_BGR2GRAY)
_, threshold_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(threshold_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# Fall back to the full picture if no contour is found
x, y = 0, 0
h, w = rotated_image.shape[:2]
max_area = 0
for contour in contours:
    area = cv2.contourArea(contour)
    if area > max_area:
        # Keep the bounding box of the largest contour seen so far
        max_area = area
        x, y, w, h = cv2.boundingRect(contour)
cropped_image = rotated_image[y:y + h, x:x + w]
Resizing: the cropped picture is scaled to a standard size to facilitate subsequent processing; a code example is as follows:
standard_width, standard_height = 600, 400
resized_image = cv2.resize(cropped_image, (standard_width, standard_height))
graying: converting the color picture into a gray picture to reduce the amount of computation and simplify subsequent processing, the process code examples being as follows;
gray_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2GRAY)
binarization: converting the gray picture into a binary picture by using a threshold method (such as Otsu method) so as to facilitate text recognition, wherein an example of a process code is as follows;
_, binary_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
after these preprocessing operations are completed, step S1 is performed.
Step S1: performing OCR on the picture, extracting the text information in the picture, and returning the text information in JSON form;
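The patent does not specify the JSON layout returned by the OCR service, so the sketch below assumes a hypothetical shape, `{"results": [{"text": ...}, ...]}`, purely for illustration. It shows how the returned JSON might be flattened into a list of text lines and how the certificate type could be decided from keywords. The function names and the Chinese keyword strings (the labels printed on the certificates) are likewise assumptions, not taken from the source.

```python
import json

# Assumed keywords; the labels printed on the respective certificates.
ID_CARD_KEYWORDS = ("居民身份证", "公民身份号码")
LAWYER_CARD_KEYWORDS = ("律师执业证", "执业机构", "执业证号")


def parse_ocr_json(payload):
    """Return the non-empty recognized text lines from the (assumed) JSON."""
    data = json.loads(payload)
    lines = []
    for item in data.get("results", []):
        text = item.get("text", "").strip()
        if text:
            lines.append(text)
    return lines


def classify_certificate(lines):
    """Decide the certificate type from the keywords found in the text."""
    joined = "".join(lines)
    # The lawyer card is checked first because it also carries an ID number.
    if any(keyword in joined for keyword in LAWYER_CARD_KEYWORDS):
        return "lawyer_card"
    if any(keyword in joined for keyword in ID_CARD_KEYWORDS):
        return "id_card"
    return "unknown"
```

Returning `"unknown"` rather than guessing lets the caller fall back to manual entry when neither keyword set is recognized.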
Step S2: an error detection and correction module is introduced to correct errors possibly introduced by the OCR engine. Specifically:
A function is created to check the validity of the extracted identity card number and lawyer card number. This function should include the following rules:
a. Length check: ensure that the length of the extracted number matches the expected length. For example, a Chinese resident identity card number has 18 characters.
b. Format check: verify that the extracted number meets the specific format requirements. For example, the first 17 characters of the identity card number must be digits, and the last character may be a digit or the letter "X".
c. Check code verification: calculate and verify the check code according to the corresponding check algorithm. For example, the last character of the identity card number is a check code that can be calculated from the first 17 digits.
If an invalid identity card number or lawyer card number is detected, automatic correction is attempted. For example, for common OCR errors (such as the digit "0" recognized as the letter "O"), the suspect characters can be replaced and the validity of the number rechecked.
If automatic correction fails, the correct number is entered manually. The system may compare the correct number entered by the user with the extracted number to further optimize the OCR model. An example is shown in fig. 5.
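Rules a to c and the automatic correction step can be sketched as follows for the 18-character resident identity card number. The weights and check characters follow the standard check-code scheme for that number (weighted sum modulo 11). The function names and the particular table of OCR confusions are assumptions made for illustration, and the sample number in the usage lines is chosen only because its check code works out.

```python
import re

# Weights for the first 17 digits and the check character for each remainder.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"
ID_FORMAT = re.compile(r"[0-9]{17}[0-9X]")

# Assumed table of letters that OCR commonly confuses with digits.
OCR_CONFUSIONS = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1",
                                "B": "8", "S": "5", "Z": "2"})


def is_valid_id_number(number):
    """Apply the length check, the format check and the check-code check."""
    if len(number) != 18:                      # a. length
        return False
    if not ID_FORMAT.fullmatch(number):        # b. format
        return False
    total = sum(int(d) * w for d, w in zip(number[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == number[17]   # c. check code


def try_autocorrect(number):
    """Replace likely OCR confusions and return the corrected number,
    or None when the result is still invalid (manual entry is then needed)."""
    raw = number.strip()
    if len(raw) != 18:
        return None
    body = raw[:17].translate(OCR_CONFUSIONS)
    last = raw[17].upper()
    if last != "X":
        last = last.translate(OCR_CONFUSIONS)
    candidate = body + last
    return candidate if is_valid_id_number(candidate) else None


valid = is_valid_id_number("11010519491231002X")
fixed = try_autocorrect("11O1O519491231002X")   # letter O read instead of 0
```

Because the corrected candidate must still pass the check code, a wrong substitution is unlikely to be accepted silently; numbers that cannot be repaired return `None` and fall through to manual entry.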
Step S3: parsing the JSON information; if the certificate is identified as an identity card, searching by identity card number, and if it is identified as a lawyer card, searching by lawyer card number;
Step S4: when no relevant information is found in step S3, calling the adding module, which stores the new information for use in the next identification;
Step S5: when relevant information is found in step S3, overwriting the old information with the current information through the updating module and storing it, so that the information is updated.
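The search, add and update flow of steps S3 to S5 can be sketched with an in-memory SQLite database. The table layout, column names and the `register` function are hypothetical; the patent only states that the modules connect to a database.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open the (hypothetical) registration database and create its table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person ("
        " kind TEXT NOT NULL, key_number TEXT NOT NULL,"
        " name TEXT, gender TEXT,"
        " PRIMARY KEY (kind, key_number))"
    )
    return conn


def register(conn, kind, info):
    """Search by the number that matches the certificate type (S3), then
    insert a new record (S4) or overwrite the old one (S5)."""
    key = info["id_number"] if kind == "id_card" else info["license_number"]
    row = conn.execute(
        "SELECT 1 FROM person WHERE kind = ? AND key_number = ?", (kind, key)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO person (kind, key_number, name, gender)"
            " VALUES (?, ?, ?, ?)",
            (kind, key, info.get("name", ""), info.get("gender", "")),
        )
        action = "inserted"
    else:
        conn.execute(
            "UPDATE person SET name = ?, gender = ?"
            " WHERE kind = ? AND key_number = ?",
            (info.get("name", ""), info.get("gender", ""), kind, key),
        )
        action = "updated"
    conn.commit()
    return action
```

Keying the table on both the certificate type and the number keeps lawyer and party records separate even though they share one table in this sketch.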
Step S6: data security and privacy protection measures are taken when processing and storing sensitive information such as identity card numbers and lawyer card numbers. For example, data may be encrypted, access rights set, and secure transmission protocols used;
As a specific embodiment, sensitive data is encrypted using a strong encryption algorithm (e.g., AES) to prevent unauthorized access and leakage. Encrypting and decrypting data with public and private keys may also be considered, as shown in fig. 6 (1);
Setting access rights: access to sensitive data is restricted by means such as access control lists (ACL) or role-based access control (RBAC), ensuring that only authenticated and authorized users can access, modify or delete sensitive information, as shown in fig. 6 (2);
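As fig. 6 is not reproduced here, the following minimal sketch illustrates the role-based access control idea only. The role names, the permission table and the helper functions are hypothetical; the patent does not define concrete roles.

```python
# Hypothetical roles and the actions each may perform on sensitive records.
ROLE_PERMISSIONS = {
    "registrar": {"read", "create", "update"},
    "auditor": {"read"},
    "administrator": {"read", "create", "update", "delete"},
}


def is_allowed(role, action):
    """Return True only when the role is known and grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


def delete_record(role, records, key):
    """Delete a record, refusing when the caller's role lacks permission."""
    if not is_allowed(role, "delete"):
        raise PermissionError(f"role {role!r} may not delete records")
    records.pop(key, None)
```

Unknown roles receive an empty permission set, so access is denied by default rather than granted by omission.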
Step S7: when a recognition error occurs or information needs to be updated, the user is provided with a convenient feedback mechanism so that they can correct the error or update the information. This helps keep the system accurate and up to date. Specifically:
Step S71: creating a feedback form:
A simple user interface is designed so that the user can enter the correct information and submit it. The form should contain all information fields that may need correction or updating, such as the identity card number and the lawyer card number.
Step S72: associating the recognition result with the feedback form:
When the recognition result is displayed, a feedback button is provided; clicking it takes the user directly to the feedback form. The original recognition result should be filled into the form automatically so that the user only needs to modify the wrong part.
Step S73: receiving and processing feedback:
After the user submits the feedback, the back-end server receives the feedback data and verifies it. After verification passes, the relevant records in the database are updated.
Step S74: optimizing the OCR model:
The users' feedback data is collected into a training set for optimizing the OCR model. Through iterative training and model updating, the recognition accuracy is improved.
Fig. 7 (1) and (2) show a simple example that implements the feedback form function using Flask and HTML.
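Independently of the web framework used for the form, the back-end handling in steps S73 and S74 can be sketched with plain Python data structures. The function name, the dictionary-based record store and the shape of the training samples are assumptions made for illustration.

```python
def process_feedback(records, training_samples, record_id, corrected):
    """Verify a feedback submission, update the stored record (S73) and keep
    each (OCR text, corrected text) pair as training data (S74).
    Returns the fields that were actually changed."""
    record = records.get(record_id)
    if record is None:
        return {}
    changes = {}
    for field, value in corrected.items():
        value = value.strip()
        # Accept only known fields with a non-empty, different value.
        if field in record and value and value != record[field]:
            changes[field] = value
    for field, value in changes.items():
        training_samples.append(
            {"field": field, "ocr_text": record[field], "correct_text": value}
        )
        record[field] = value
    return changes
```

Fields that the user left unchanged produce no training sample, so the collected set contains only genuine recognition errors.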
As a specific embodiment, for an identity card, step S3 includes the following steps:
Step S31: determining whether the OCR text information contains the resident identity card keyword; traversing the OCR result and finding the identity card number in the text using a regular expression; after it is found, removing it from the OCR result so as to reduce the work of subsequent extraction steps;
Step S32: locating the name keyword; if the keyword is followed by further characters that match a common name and have a compliant length, taking those characters as the cardholder's name; otherwise checking whether the content at the next position matches a common name and has a compliant length, and if so taking it as the cardholder's name; once the name is found, removing it from the recognition result; otherwise returning an empty value;
Step S33: locating the address keyword and the citizen identity number keyword, the content between the two positions being the address; if a keyword cannot be found because of inaccurate character recognition, matching province-level region names instead to locate the address;
Step S34: locating the gender keyword; if it is followed by "male", the gender is male, and if it is followed by "female", the gender is female; if neither follows, checking whether the content at the next index of the text list is "male" or "female"; if the gender is still not found, traversing the whole text to detect whether it contains the keyword "male" or "female", and if so setting that value as the gender.
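Steps S31 to S34 can be sketched as follows, taking the list of OCR text lines as input. The Chinese keyword strings are the labels printed on the identity card and are assumptions here, since the translated text does not give them. The "common name" test is simplified to a check for two to four Chinese characters, and the province hints are a small illustrative subset.

```python
import re

ID_NUMBER = re.compile(r"[0-9]{17}[0-9Xx]")
NAME = re.compile(r"[\u4e00-\u9fa5·]{2,4}")        # simplified name check
PROVINCE_HINTS = ("省", "自治区", "北京市", "上海市", "天津市", "重庆市")
GENDERS = ("男", "女")                              # male, female


def extract_id_card(lines):
    lines = [line.strip() for line in lines if line.strip()]
    info = {"name": "", "gender": "", "address": "", "id_number": ""}

    # S31: find the ID number with a regular expression and remove it.
    for i, line in enumerate(lines):
        match = ID_NUMBER.search(line)
        if match:
            info["id_number"] = match.group().upper()
            lines[i] = line.replace(match.group(), "")
            break

    # S32: name after the keyword, otherwise at the next position.
    for i, line in enumerate(lines):
        if "姓名" in line:
            after = line.split("姓名", 1)[1].strip()
            following = lines[i + 1] if i + 1 < len(lines) else ""
            info["name"] = next(
                (c for c in (after, following) if NAME.fullmatch(c)), "")
            break

    # S33: address lies between the address and ID-number keywords.
    start = next((i for i, l in enumerate(lines) if "住址" in l), None)
    end = next((i for i, l in enumerate(lines) if "公民身份号码" in l), None)
    if start is not None and end is not None and end > start:
        first = lines[start].split("住址", 1)[1]
        info["address"] = "".join([first] + lines[start + 1:end])
    elif start is not None:
        info["address"] = lines[start].split("住址", 1)[1]
    else:  # keyword lost by OCR: fall back to a province-level match
        info["address"] = next(
            (l for l in lines if any(p in l for p in PROVINCE_HINTS)), "")

    # S34: gender after the keyword, then next index, then whole text.
    for i, line in enumerate(lines):
        if "性别" in line:
            after = line.split("性别", 1)[1].strip()
            following = lines[i + 1] if i + 1 < len(lines) else ""
            info["gender"] = next(
                (c[:1] for c in (after, following) if c[:1] in GENDERS), "")
            break
    if not info["gender"]:
        info["gender"] = next(
            (ch for ch in "".join(lines) if ch in GENDERS), "")
    return info
```

The whole-text fallback for gender follows the description literally and can misfire when a name or address happens to contain one of the two characters, which is why it runs only after the keyword-based checks fail.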
As a specific embodiment, for a lawyer card, step S3 further includes the following steps:
Step S321: determining whether the practice institution keyword is present; if so, checking whether the next line ends with the designated word; if it does, the next line is the practice institution (law firm); if not, continuing with the following line, checking at most two lines;
Step S322: traversing the whole text; first removing qualification certificate numbers, which contain English letters; then matching the all-digit license number with a regular expression; because the length of the all-digit license number differs from that of the identity card number, there is no risk of matching the identity card number by mistake; removing the content after successful recognition;
Step S323: matching the identity card number with a regular expression and removing the content on success; if matching fails, locating the identity card number keyword and checking whether the next one or two lines contain the identity card number, and if not, checking the previous line, so as to guard against misaligned lines in the recognition result;
Step S324: locating the license holder keyword; first checking whether the content following the keyword matches a common name and has a compliant length, and if so taking it as the name; otherwise checking whether the content at the next position matches a common name and has a compliant length, and if so taking it as the name; if not, checking the content at the position before the license holder keyword; in each case the length of the name must not exceed 4 characters.
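A simplified sketch of steps S321 to S324 is given below. Several details are assumptions rather than statements from the source: the Chinese keyword strings, a 17-digit all-numeric license number, the character "所" as the word that ends a practice institution name, and the pattern used for qualification numbers. The keyword-based fallback of step S323 and the gender field are omitted to keep the example short.

```python
import re

QUALIFICATION = re.compile(r"[A-Za-z][0-9]{8,}")           # assumed format
LICENSE_NUMBER = re.compile(r"(?<![0-9])[0-9]{17}(?![0-9Xx])")
ID_NUMBER = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])")
NAME = re.compile(r"[\u4e00-\u9fa5·]{2,4}")                 # at most 4 chars
FIELD_WORDS = ("性别", "执业", "身份证", "律师", "持证人")    # not names


def _after(line, keyword):
    """Text that follows a keyword on the same line, without separators."""
    return line.split(keyword, 1)[1].strip(" :：")


def _take(lines, pattern):
    """Return the first match of pattern and remove it from its line."""
    for i, line in enumerate(lines):
        match = pattern.search(line)
        if match:
            lines[i] = line.replace(match.group(), "")
            return match.group().upper()
    return ""


def extract_lawyer_card(lines):
    lines = [line.strip() for line in lines if line.strip()]
    info = {"name": "", "license_number": "", "id_number": "", "agency": ""}

    # S321: practice institution on the keyword line or the next two lines.
    for i, line in enumerate(lines):
        if "执业机构" in line:
            candidates = [_after(line, "执业机构")] + lines[i + 1:i + 3]
            info["agency"] = next(
                (c for c in candidates if c.endswith("所")), "")
            break

    # S322: drop qualification numbers, then match the all-digit license.
    lines = [QUALIFICATION.sub("", line) for line in lines]
    info["license_number"] = _take(lines, LICENSE_NUMBER)

    # S323: match the identity card number (regular-expression path only).
    info["id_number"] = _take(lines, ID_NUMBER)

    # S324: holder name after the keyword, then next line, then previous.
    for i, line in enumerate(lines):
        if "持证人" in line:
            previous = lines[i - 1] if i > 0 else ""
            candidates = [_after(line, "持证人")] + lines[i + 1:i + 2] + [previous]
            info["name"] = next(
                (c for c in candidates
                 if NAME.fullmatch(c)
                 and not any(word in c for word in FIELD_WORDS)), "")
            break
    return info
```

The lookarounds in the two number patterns keep the 17-digit license pattern from matching the first 17 digits of an identity card number, which mirrors the length-based separation described in step S322.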
As a specific embodiment, the key information of the identity card comprises the name, gender, address and identity card number, and the key information of the lawyer card comprises the name, license number, gender, identity card number and practice institution.
The method consolidates and unifies the extraction of key certificate information, and adds fault tolerance based on actual photographing results at registration and on the observed recognition error rate, further improving the accuracy of the recognition result. The working efficiency of court security staff is greatly improved, since lawyer card and identity card information no longer needs to be filled in manually. It is convenient for parties and lawyers: even if a party or lawyer has forgotten the identity card or lawyer card, quick registration can still be performed based on OCR recognition, which improves public satisfaction with the court's work. Data security is also protected while certificates are being recognized, avoiding leakage of personal information.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
Claims (9)
1. A certificate information extraction device based on OCR recognition technology, characterized by comprising a photographing module, a recognition module, an extraction module, a search module, a personnel information updating module and a personnel information adding module;
the photographing module can adjust the resolution in order to adjust the photo size and shooting speed, triggers shooting at fixed intervals to capture pictures, and stores the captured pictures in the device and uploads them;
the recognition module consists of an offline OCR recognition API service; after a photo taken by the photographing module is submitted to the recognition module, it is recognized by OCR and the result is passed to the extraction module for processing;
the extraction module is written in Java according to the key information extraction logic for identity cards and lawyer certificates; it determines whether the certificate is an identity card or a lawyer certificate, extracts the key information of the corresponding certificate, and passes the personnel information to the search module once extraction is complete;
the search module is connected to the database and, on receiving the data passed by the extraction module, searches for matching information;
the adding module is connected to the database and inserts and stores the data passed by the search module;
the updating module is connected to the database and updates information according to the certificate information provided by the search module.
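The interplay of the search, adding and updating modules can be sketched as a lookup followed by either an insert or an overwrite. The example below is an assumption-laden illustration: it uses Python's built-in `sqlite3` in place of the unspecified database, and the table name `person` and its columns are hypothetical.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the (hypothetical) person table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person ("
        "id_number TEXT PRIMARY KEY, name TEXT, gender TEXT, address TEXT)"
    )
    return conn

def search(conn, id_number):
    """Search module: return the stored row for this ID number, or None."""
    return conn.execute(
        "SELECT id_number, name, gender, address FROM person WHERE id_number = ?",
        (id_number,),
    ).fetchone()

def add(conn, info):
    """Adding module: insert a person that was not found by the search."""
    conn.execute(
        "INSERT INTO person (id_number, name, gender, address) "
        "VALUES (:id_number, :name, :gender, :address)",
        info,
    )
    conn.commit()

def update(conn, info):
    """Updating module: overwrite the stored fields with the new ones."""
    conn.execute(
        "UPDATE person SET name = :name, gender = :gender, address = :address "
        "WHERE id_number = :id_number",
        info,
    )
    conn.commit()

def register(conn, info):
    """Route extracted information to the adding or updating module."""
    if search(conn, info["id_number"]) is None:
        add(conn, info)
        return "added"
    update(conn, info)
    return "updated"
```

A first call to `register` with a new ID number inserts the record; a second call with the same ID number overwrites it.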
2. The certificate information extraction device based on OCR recognition technology according to claim 1, further comprising a preprocessing module for receiving pictures from the photographing module, optimizing picture quality and format, and passing the pictures to the recognition module.
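Graying and binarizing are two of the operations the method assigns to this module. A minimal sketch of both is given below on a picture held as nested lists of RGB tuples; the fixed threshold of 128 is an assumption, the luminance weights are the common BT.601 values rather than anything stated in the application, and a real system would more likely use an imaging library.

```python
def to_gray(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]

def binarize(gray, threshold=128):
    """Map each gray value to 255 (at or above threshold) or 0 (below it)."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray]
```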
3. The certificate information extraction device based on OCR recognition technology according to claim 1, further comprising an error detection and correction module for correcting errors that may be introduced by the OCR engine.
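The claim does not say how errors are detected. One check that is available for 18-character resident identity numbers is the check character defined by GB 11643-1999 (ISO 7064 MOD 11-2). The sketch below uses it as an illustrative detector and pairs it with a small, hypothetical table of letter-for-digit confusions as a corrector; neither is taken from the application.

```python
import re

WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"  # indexed by weighted sum mod 11

# Hypothetical OCR confusions (applied after upper-casing): letter -> digit.
CONFUSIONS = str.maketrans({"O": "0", "I": "1", "L": "1", "B": "8", "S": "5"})

def check_char(first17):
    """Compute the 18th character from the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]

def is_valid_id(number):
    """True if number has 17 digits plus a matching check character."""
    number = number.upper()
    if not re.fullmatch(r"[0-9]{17}[0-9X]", number):
        return False
    return check_char(number[:17]) == number[17]

def repair_id(raw):
    """Return a checksum-valid ID number, or None if it cannot be repaired."""
    candidate = raw.strip().upper()
    if is_valid_id(candidate):
        return candidate
    # Only the first 17 positions must be digits, so only they are translated.
    fixed = candidate[:17].translate(CONFUSIONS) + candidate[17:]
    return fixed if is_valid_id(fixed) else None
```

Returning `None` instead of a best guess leaves the decision to a later step, such as manual correction by the user.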
4. The certificate information extraction device based on OCR recognition technology according to claim 1, further comprising an encryption module for applying data security and privacy protection measures when sensitive information is processed and stored.
5. A certificate information extraction method based on OCR recognition technology, applicable to the certificate information extraction device based on OCR recognition technology according to claims 1 to 4, characterized by comprising the following steps:
step S0: the preprocessing module performs the following operations: denoising, rotation, cropping, resizing, graying and binarization;
step S1: the text information in the picture is recognized and extracted by OCR and returned in JSON form;
step S2: errors that may be introduced by the OCR engine are corrected by the error detection and correction module;
step S3: the JSON information is parsed; if an identity card is recognized, a search is made by identity card number, and if a lawyer certificate is recognized, a search is made by lawyer certificate number;
step S4: when no related information is found in step S3, the adding module is called, which stores the new information for use in the next recognition;
step S5: when related information is found in step S3, the updating module overwrites the old information with the current information and stores it, thereby updating the information;
step S6: data security and privacy protection measures are taken when sensitive information is processed and stored;
step S7: when a recognition error occurs or information needs to be updated, a convenient feedback mechanism is provided so that the user can correct the error and update the information.
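Steps S1 and S3 together amount to parsing the OCR response and choosing a search key by certificate type. A minimal sketch follows; the JSON shape (`results` holding objects with a `text` field), the Chinese keyword strings and the returned labels are all assumptions for illustration, since the application does not specify them.

```python
import json

# Assumed keyword strings as they would appear in the OCR text.
LAWYER_KEYWORDS = ("执业机构", "执业证号")
ID_CARD_KEYWORDS = ("居民身份证", "公民身份号码")

def parse_ocr_json(payload):
    """Return the recognized text lines from a (hypothetical) OCR response."""
    data = json.loads(payload)
    return [item["text"].strip() for item in data.get("results", []) if item.get("text")]

def classify(lines):
    """Decide which certificate the text belongs to.

    Lawyer keywords are tested first because a lawyer certificate also
    carries an identity card number (a design choice of this sketch).
    """
    text = "".join(lines)
    if any(keyword in text for keyword in LAWYER_KEYWORDS):
        return "lawyer_certificate"
    if any(keyword in text for keyword in ID_CARD_KEYWORDS):
        return "id_card"
    return "unknown"

def lookup_field(doc_type):
    """Name of the field used for the database search in step S3."""
    return {"id_card": "id_number", "lawyer_certificate": "license_number"}.get(doc_type)
```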
6. The certificate information extraction method based on OCR recognition technology according to claim 5, wherein step S3 includes the following steps:
step S31: determining whether the OCR text information contains the resident identity card keyword; traversing the OCR results and finding the identity card number in the text with a regular expression; once the number is found, removing that content from the OCR results so that later extraction steps have less text to check;
step S32: locating the name label; if the label is followed by other characters that match a common name and have a compliant length, taking them as the name; otherwise checking whether the content at the next index position matches a common name and has a compliant length; if a name is found, taking it as the cardholder's name and removing the recognized content; otherwise returning an empty value;
step S33: first locating the citizen identity number keyword, then locating the address keyword, the content between the two positions being the address; if a keyword cannot be found because of inaccurate character recognition, matching against province-level region names to find the address;
step S34: locating the gender keyword; if it is followed by "male", the gender is male, and if it is followed by "female", the gender is female; if neither follows, checking whether the content at the next index of the text list is "male" or "female"; if it is not, traversing the whole text and checking whether it contains the "male" or "female" keyword, and if so setting that value as the gender.
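Steps S31, S33 and S34 can be sketched as three small functions over the list of OCR text lines. This is an illustration under assumptions: the label strings (`性别`, `住址`, `公民身份号码`), the abbreviated province list, and the simplified fallback that returns only the first line containing a province name are not taken from the application.

```python
import re

ID_NUMBER_RE = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9Xx])")
PROVINCES = ("北京", "上海", "广东", "浙江", "江苏", "四川")  # hypothetical, abbreviated

def pop_id_number(lines):
    """S31: return (id_number, lines with the number removed); '' if absent."""
    for i, line in enumerate(lines):
        compact = line.replace(" ", "")
        match = ID_NUMBER_RE.search(compact)
        if match:
            rest = compact[:match.start()] + compact[match.end():]
            remaining = lines[:i] + ([rest] if rest else []) + lines[i + 1:]
            return match.group().upper(), remaining
    return "", list(lines)

def extract_address(lines):
    """S33: text between the address label and the ID number label."""
    text = "".join(lines)
    start = text.find("住址")
    end = text.find("公民身份号码")
    if start != -1 and end > start:
        return text[start + 2:end].strip()
    # Fallback: first line that contains a province-level name.
    for line in lines:
        for province in PROVINCES:
            idx = line.find(province)
            if idx != -1:
                return line[idx:].strip()
    return ""

def extract_gender(lines):
    """S34: same line, then next entry, then anywhere in the text."""
    genders = ("男", "女")
    for i, line in enumerate(lines):
        if "性别" not in line:
            continue
        after = line.split("性别", 1)[1].lstrip(" :：")
        if after[:1] in genders:
            return after[:1]
        if i + 1 < len(lines) and lines[i + 1].strip()[:1] in genders:
            return lines[i + 1].strip()[:1]
        break
    for ch in "".join(lines):
        if ch in genders:
            return ch
    return ""
```

Removing the identity number before the other fields are extracted follows step S31; it keeps the long digit string from interfering with later matching.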
7. The certificate information extraction method based on OCR recognition technology according to claim 5, wherein step S3 further comprises:
step S321: determining whether the practice institution keyword is present; if so, checking whether the next line ends with the designated word; if it does, that line is taken as the law firm's address; if it does not, continuing with the following line, checking at most two lines;
step S322: traversing the whole text; first removing qualification number content that contains English letters, then matching the all-digit license number with a regular expression; because the all-digit license number differs in length from the identity card number, it is not confused with the identity card number; removing the content after successful recognition;
step S323: matching the identity card number with a regular expression and, on success, removing the matched content; if the match fails, locating the identity card number keyword and checking whether the next one or two lines contain identity card number content, and if they do not, checking the previous line, which guards against recognition results whose lines have run out of order;
step S324: locating the certificate holder label; first checking whether the content following the label matches a common name and has a compliant length, and if so taking it as the name; otherwise checking whether the content at the next index position matches a common name and has a compliant length, and if so taking it as the name; if not, checking the content at the index position before the label.
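Steps S322 and S323 rely on the license number and the identity card number having different lengths, and on a looser search near the keyword when the strict pattern fails. The sketch below illustrates both; the 17-digit license length, the shape of the qualification number, the label `身份证号`, and the loose pattern used for "identity card number content" are assumptions made for the example.

```python
import re

QUALIFICATION_RE = re.compile(r"(?<![0-9A-Za-z])[A-Za-z][0-9]{6,}")  # assumed shape
LICENSE_RE = re.compile(r"(?<![0-9])[0-9]{17}(?![0-9Xx])")           # assumed 17 digits
ID_RE = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9Xx])")
LOOSE_ID_RE = re.compile(r"[0-9Xx]{15,18}")  # tolerant of dropped characters
ID_KEYWORD = "身份证号"                        # assumed label text

def pop_license_number(lines):
    """S322: drop lettered qualification numbers, then take the all-digit license."""
    cleaned = [QUALIFICATION_RE.sub("", line.replace(" ", "")) for line in lines]
    for i, line in enumerate(cleaned):
        match = LICENSE_RE.search(line)
        if match:
            cleaned[i] = line[:match.start()] + line[match.end():]
            return match.group(), cleaned
    return "", cleaned

def find_id_number(lines):
    """S323: strict match anywhere, else a loose match near the keyword."""
    for line in lines:
        match = ID_RE.search(line.replace(" ", ""))
        if match:
            return match.group().upper()
    for i, line in enumerate(lines):
        if ID_KEYWORD not in line:
            continue
        # Next line, the line after that, then the previous line.
        for j in (i + 1, i + 2, i - 1):
            if 0 <= j < len(lines):
                match = LOOSE_ID_RE.search(lines[j].replace(" ", ""))
                if match:
                    return match.group().upper()
    return ""
```

Because the strict identity pattern requires 18 characters and the license pattern refuses a trailing digit or `X`, neither pattern matches the other's number. A value returned by the loose fallback may be incomplete and would need further validation.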
8. The certificate information extraction method based on OCR recognition technology according to claim 5, wherein the key information of the identity card includes the name, gender, address and identity card number, and the key information of the lawyer certificate includes the name, license number, gender, identity card number and practice institution.
9. The method for extracting certificate information based on OCR technology as recited in claim 5, wherein step S7 comprises the following specific steps:
step S71: creating a feedback form;
step S72: associating the identification result with a feedback form;
step S73: receiving and processing feedback;
step S74: the OCR model is optimized.
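Steps S71 to S73 can be pictured as a small record that ties a recognition result to the user's corrections. The class and field names below are hypothetical, and step S74 (optimizing the OCR model) is not sketched; the example only shows how corrected pairs could be collected for later use.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackForm:
    """S71/S72: a feedback form associated with one recognition result."""
    record_id: str
    recognized: dict
    corrections: dict = field(default_factory=dict)

    def submit(self, field_name, value):
        """S73: accept a correction for a field that was actually recognized."""
        if field_name not in self.recognized:
            raise KeyError(field_name)
        self.corrections[field_name] = value

    def corrected(self):
        """Recognition result with the user's corrections applied."""
        return {**self.recognized, **self.corrections}

    def changed_pairs(self):
        """(field, OCR value, corrected value) for fields the user changed."""
        return [
            (name, self.recognized[name], value)
            for name, value in self.corrections.items()
            if value != self.recognized[name]
        ]
```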
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310554961.6A CN116665224A (en) | 2023-05-17 | 2023-05-17 | Certificate information extraction method and device based on OCR (optical character recognition) technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310554961.6A CN116665224A (en) | 2023-05-17 | 2023-05-17 | Certificate information extraction method and device based on OCR (optical character recognition) technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116665224A (en) | 2023-08-29 |
Family
ID=87727117
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310554961.6A Pending CN116665224A (en) | 2023-05-17 | 2023-05-17 | Certificate information extraction method and device based on OCR (optical character recognition) technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116665224A (en) |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US11151369B2 (en) | Systems and methods for classifying payment documents during mobile image processing | |
US11170248B2 (en) | Video capture in data capture scenario | |
US9576272B2 (en) | Systems, methods and computer program products for determining document validity | |
US8326041B2 (en) | Machine character recognition verification | |
US6886136B1 (en) | Automatic template and field definition in form processing | |
US11227154B2 (en) | Ledger recognition system | |
US20070217692A1 (en) | Property record document data verification systems and methods | |
CN110781460A (en) | Copyright authentication method, device, equipment, system and computer readable storage medium | |
WO2021259096A1 (en) | Identity authentication method, apparatus, electronic device, and storage medium | |
CN108805787A (en) | A kind of method and apparatus that paper document distorts Jianzhen | |
CN111625798A (en) | System and method for handheld identity card authentication user real-name registration | |
CN111539414B (en) | Method and system for character recognition and character correction of OCR (optical character recognition) image | |
CN111683202B (en) | Bill stamping method, device, equipment and storage medium | |
CN116665224A (en) | Certificate information extraction method and device based on OCR (optical character recognition) technology | |
CN112132693A (en) | Transaction verification method, transaction verification device, computer equipment and computer-readable storage medium | |
CN113705560A (en) | Data extraction method, device and equipment based on image recognition and storage medium | |
CN101727572A (en) | Method for ensuring image integrity by using file characteristics | |
Dhanva et al. | Cheque image security enhancement in online banking | |
KR102523598B1 (en) | Unmaned entrance system | |
CN115375998B (en) | Certificate identification method and device, electronic equipment and storage medium | |
CN116189181B (en) | Image normalization method and system for identity card OCR | |
US11238686B2 (en) | Item validation and image evaluation system with feedback loop | |
WO2021248912A1 (en) | Picture audit method and device, computing device and storage medium | |
JP3360030B2 (en) | Character recognition device, character recognition method, and recording medium recording character recognition method in program form | |
Borse et al. | Smart Vehicle Identification And Surveillance System Using OCR |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||