WO2018188199A1 - Character recognition method, apparatus, server and storage medium for claim documents - Google Patents

Character recognition method, apparatus, server and storage medium for claim documents

Info

Publication number
WO2018188199A1
WO2018188199A1 (PCT/CN2017/091363)
Authority
WO
WIPO (PCT)
Prior art keywords
model
character recognition
segmentation
document
training
Prior art date
Application number
PCT/CN2017/091363
Other languages
English (en)
French (fr)
Inventor
金飞虎
薛燕
米艺
李欢欢
仇一
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Priority to SG11201900263SA
Priority to EP17899230.1A (EP3432197B1)
Priority to AU2017408799A (AU2017408799B2)
Priority to KR1020187023693A (KR102171220B1)
Priority to US16/084,244 (US10650231B2)
Priority to JP2018536430A (JP6710483B2)
Publication of WO2018188199A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/046Forward inferencing; Production systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • G06V30/1801Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V30/18019Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
    • G06V30/18038Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
    • G06V30/18048Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
    • G06V30/18057Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08Insurance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • The present invention relates to the field of computer technology, and in particular to a character recognition method, apparatus, server, and computer-readable storage medium for claim documents.
  • OCR (Optical Character Recognition)
  • Existing schemes that apply OCR technology to claim document images use a single recognition engine to uniformly recognize the characters in the entire image. They consider neither the influence of the claim document's frame format on recognition accuracy nor the interference of the frame lines with character recognition, so their recognition accuracy is low and considerable manpower and material resources must be spent on verification.
  • The main object of the present invention is to provide a character recognition method, apparatus, server and computer-readable storage medium for claim documents, aiming to improve recognition accuracy for claim documents.
  • a first aspect of the present invention provides a character recognition method for a claim document, the method comprising the following steps:
  • After receiving an image of a claim document whose characters are to be recognized, the server performs region segmentation according to the frame-line layout of the claim document's format to obtain one or more segmented regions;
  • A predetermined analysis model is invoked to analyze each segmented region, and each analyzed region is then subjected to character recognition using a predetermined recognition rule, so as to recognize the characters in each region.
  • a second aspect of the present invention provides a character recognition apparatus for a claim document, the character recognition apparatus comprising:
  • a segmentation module, configured to perform region segmentation according to the frame-line layout of the claim document's format after receiving an image of a claim document whose characters are to be recognized, to obtain one or more segmented regions;
  • a recognition module, configured to invoke a predetermined analysis model to analyze each segmented region, and to perform character recognition on each analyzed region using a predetermined recognition rule, so as to recognize the characters in each region.
  • A third aspect of the present invention provides a character recognition server for claim documents, comprising a memory and a processor, wherein the memory stores a character recognition program for claim documents which, when executed by the processor, implements the following steps:
  • After receiving an image of a claim document whose characters are to be recognized, the server performs region segmentation according to the frame-line layout of the claim document's format to obtain one or more segmented regions;
  • A predetermined analysis model is invoked to analyze each segmented region, and each analyzed region is then subjected to character recognition using a predetermined recognition rule, so as to recognize the characters in each region.
  • A fourth aspect of the present invention provides a computer-readable storage medium storing a character recognition program for claim documents, the program being executable by at least one processor to implement the following steps:
  • After receiving an image of a claim document whose characters are to be recognized, the server performs region segmentation according to the frame-line layout of the claim document's format to obtain one or more segmented regions;
  • A predetermined analysis model is invoked to analyze each segmented region, and each analyzed region is then subjected to character recognition using a predetermined recognition rule, so as to recognize the characters in each region.
  • In the character recognition method, apparatus, server and computer-readable storage medium for claim documents provided by the present invention, before character recognition is performed on a claim document image, the image is segmented into regions according to the frame-line layout of the claim document's format, and each region of the claim document is recognized separately using a predetermined recognition rule.
  • Because the influence of the claim document's frame format on recognition accuracy is taken into account, and segmentation along the frame lines precedes character recognition, the interference of the frame lines with character recognition that occurs when the entire claim document image is recognized uniformly is avoided, which effectively improves the recognition accuracy of the characters in claim documents.
  • FIG. 1 is a schematic flow chart of a first embodiment of a character recognition method for a claim document according to the present invention
  • FIG. 2 is a schematic flow chart of a second embodiment of a character recognition method for a claim document according to the present invention.
  • FIG. 3 is a schematic diagram of functional modules of a first embodiment of a character recognition apparatus for a claim document according to the present invention
  • FIG. 4 is a schematic diagram of a first embodiment of a character recognition server of a claim document of the present invention.
  • the invention provides a character recognition method for a claim document.
  • FIG. 1 is a schematic flowchart diagram of a first embodiment of a character recognition method for a claim document according to the present invention.
  • the character recognition method of the claim document includes:
  • Step S10: after receiving an image of a claim document whose characters are to be recognized, the server performs region segmentation according to the frame-line layout of the claim document's format to obtain one or more segmented regions;
  • The server may receive a character recognition request for a claim document image containing the characters to be recognized, sent by a user through a terminal such as a mobile phone, a tablet computer, or a self-service terminal device, for example via a client pre-installed on such a terminal.
  • After receiving the image of the claim document whose characters are to be recognized, the server performs region segmentation according to the frame-line layout of the claim document's format. A claim document image is laid out with horizontal or vertical frame lines according to its format, which form the various input fields in which users fill in the relevant information; region segmentation follows this frame-line layout and yields one or more segmented regions.
  • Document templates corresponding to each document type may be obtained in advance according to the type of document uploaded by the user (different insurance products may use different document formats), and the image is then segmented according to the template: the document template corresponding to the claim document image is found, and region segmentation is performed according to that template.
  • A segmented region is a minimal-unit area enclosed by the frame lines of the claim document's format, and it contains no frame lines itself, so that the frame lines cannot affect recognition accuracy during the subsequent character recognition of each region. A segmented region is analogous to a single cell of an Excel spreadsheet: each cell is the smallest area that contains no borders.
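The patent does not specify how the frame lines are located, so the following is only a minimal sketch of one common approach: project dark pixels onto the rows and columns of a binarized image, treat rows/columns that are almost entirely dark as frame lines, and return the frame-free cell interiors between consecutive lines. The single-pixel-wide lines and the 0.9 coverage ratio are assumptions for illustration.

```python
def find_lines(profile, length, ratio=0.9):
    """Indices whose dark-pixel count covers at least `ratio` of the axis,
    i.e. candidate frame lines."""
    return [i for i, count in enumerate(profile) if count >= ratio * length]

def segment_cells(image):
    """Split a binarized image (1 = dark pixel) into frame-free cell regions.

    Returns (top, bottom, left, right) bounds of each cell interior,
    excluding the frame lines themselves so that later character
    recognition is not disturbed by them.
    """
    h, w = len(image), len(image[0])
    row_profile = [sum(row) for row in image]
    col_profile = [sum(image[y][x] for y in range(h)) for x in range(w)]
    rows = find_lines(row_profile, w)   # y-coordinates of horizontal lines
    cols = find_lines(col_profile, h)   # x-coordinates of vertical lines
    cells = []
    for top, bottom in zip(rows, rows[1:]):
        for left, right in zip(cols, cols[1:]):
            if bottom - top > 1 and right - left > 1:  # non-empty interior
                cells.append((top + 1, bottom - 1, left + 1, right - 1))
    return cells
```

On a 7x7 grid with lines at rows/columns 0, 3 and 6, this yields the four interior cells, mirroring the "Excel cell" analogy above.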
  • Step S20: a predetermined analysis model is invoked to analyze each segmented region, and each analyzed region is then subjected to character recognition using a predetermined recognition rule, so as to recognize the characters in each region.
  • After obtaining the segmented regions, the predetermined analysis model may be invoked to analyze them, and the predetermined recognition rule is used to perform character recognition on each region, thereby recognizing the characters in each region, that is, the characters in the claim document image.
  • For example, the predetermined analysis model may determine which recognition model or recognition method is suitable for each segmented region; according to this analysis result, each region is then recognized with the model or method suited to it, which improves the accuracy of character recognition.
  • Recognition may be performed by an optical character recognition engine, or by other recognition engines or trained recognition models; this is not limited here. Once the characters in each region have been recognized, they are automatically filled into the corresponding input fields of the electronic claim form associated with the claim document image.
  • In this embodiment, before character recognition is performed on the claim document image, the image is segmented into regions according to the frame-line layout of the claim document's format, and each region of the claim document is recognized separately using a predetermined recognition rule. Because the influence of the claim document's frame format on recognition accuracy is taken into account and segmentation along the frame lines precedes character recognition, the interference of the frame lines with character recognition that occurs when the entire claim document image is recognized uniformly is avoided, which effectively improves the recognition accuracy of the characters in claim documents.
  • the second embodiment of the present invention provides a character recognition method for a claim document.
  • the step S20 includes:
  • Step S201: invoking the predetermined analysis model to analyze each segmented region, so as to distinguish first segmented regions, which can be recognized by an optical character recognition engine, from second segmented regions, which cannot;
  • Step S202: performing character recognition on each first segmented region using a predetermined optical character recognition engine to recognize its characters, and invoking a predetermined recognition model to perform character recognition on each second segmented region to recognize its characters.
  • In this embodiment, the predetermined analysis model is invoked to analyze each segmented region so as to distinguish first segmented regions, which do not require deep recognition, from second segmented regions, which do. Taking a current OCR character recognition engine as an example: a region that the engine can recognize correctly, or with a high recognition rate, is treated as a region that does not require deep recognition, i.e., the characters in it can be recognized correctly by the current OCR engine without resorting to other recognition methods; a region that the engine cannot recognize, or recognizes with a low rate, is treated as a region that requires deep recognition, i.e., the current OCR engine cannot recognize its characters correctly and another recognition method, such as a trained recognition model, is needed.
  • Once the first and second segmented regions have been distinguished, different recognition methods are applied to each. Character recognition is performed on each first segmented region using the predetermined OCR character recognition engine, which correctly recognizes its characters. Character recognition is performed on each second segmented region by invoking the predetermined recognition model, which may be a recognition model trained on a large number of segmented-region samples, or a recognition engine that is more sophisticated and has a better recognition effect than the OCR character recognition engine; this is not limited here.
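The two-branch routing of steps S201/S202 can be sketched as a simple dispatch. All four callables below (`analyzer`, `ocr_engine`, `deep_model`, and the region representation) are placeholders for the patent's components, not a concrete implementation.

```python
def recognize_regions(regions, analyzer, ocr_engine, deep_model):
    """Route each segmented region either to the OCR engine or to the
    trained deep model, following the scheme of steps S201/S202.

    `analyzer` returns True when the OCR engine is expected to recognize
    the region correctly (a "first" region), False otherwise.
    """
    results = {}
    for name, region in regions.items():
        if analyzer(region):
            results[name] = ocr_engine(region)   # first segmented region
        else:
            results[name] = deep_model(region)   # second segmented region
    return results
```

In use, `analyzer` would be the trained CNN analysis model described below, and `deep_model` the trained LSTM recognition model.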
  • the predetermined analysis model is a Convolutional Neural Network (CNN) model
  • CNN Convolutional Neural Network
  • A. For a predetermined claim document frame format, obtain a preset number (for example, 500,000) of claim document image samples based on that format;
  • If the verification pass rate is greater than or equal to a preset threshold (for example, 98%), the training is complete; otherwise, the number of claim document image samples is increased and steps A, B, C, D and E are repeated until the verification pass rate reaches the preset threshold.
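The stopping rule above can be sketched as a small loop. `run_round` stands in for the patent's steps A-E (gather samples, train, verify) and simply returns the verification tally; doubling the sample count on failure is an illustrative choice, since the patent only says the number of samples is increased.

```python
def training_complete(passed, total, threshold=0.98):
    """Training may stop once the verification pass rate reaches the
    preset threshold (98% in the patent's example)."""
    return total > 0 and passed / total >= threshold

def train_until_threshold(run_round, sample_count, threshold=0.98, max_rounds=10):
    """Repeat the sample-gathering/training/verification round, enlarging
    the sample set whenever the pass rate falls short.

    `run_round` takes a sample count and returns (passed, total) on the
    verification set; it is a placeholder for steps A-E.
    """
    for _ in range(max_rounds):
        passed, total = run_round(sample_count)
        if training_complete(passed, total, threshold):
            return sample_count
        sample_count *= 2  # increase the number of image samples and retry
    raise RuntimeError("pass rate never reached the threshold")
```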
  • A convolutional neural network model trained on a large number of claim document image samples is thus used to analyze the segmented regions, accurately distinguishing, among the regions of a claim document, the first segmented regions whose characters the OCR character recognition engine can recognize correctly from the second segmented regions whose characters it cannot. Different recognition methods can then be applied to the first and second segmented regions respectively to perform accurate character recognition, improving the character recognition accuracy for claim documents.
  • the predetermined recognition model is a Long Short-Term Memory (LSTM) model
  • the training process of the predetermined recognition model is as follows:
  • The region samples may be segmented-region samples taken from historical data, obtained by segmenting a number of claim documents into regions according to their frame formats.
  • The font in each segmented-region sample may be uniformly set to black and the background to white to facilitate character recognition.
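The black-font/white-background normalization can be sketched as a simple threshold over grayscale pixel values; the threshold of 128 is an assumption, as the source does not specify one.

```python
def normalize_region(pixels, threshold=128):
    """Normalize a grayscale segmented-region sample (0 = black, 255 = white)
    so the font is uniformly black (0) and the background white (255),
    as suggested for easing character recognition."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]
```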
  • Each segmented-region sample is labeled; for example, each sample file may be named after the characters the sample contains. The labeled samples are divided into a first data set and a second data set at a preset ratio (for example, 8:2); the model is trained using the first data set and tested using the second data set to evaluate the effect of the currently trained model.
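A minimal sketch of the 8:2 split into first (training) and second (testing) data sets; the shuffle and fixed seed are illustrative assumptions.

```python
import random

def split_samples(samples, ratio=(8, 2), seed=0):
    """Shuffle labeled segmented-region samples and divide them into a
    first (training) and second (testing) data set at the preset ratio,
    8:2 by default as in the patent's example."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * ratio[0] // sum(ratio)
    return shuffled[:cut], shuffled[cut:]
```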
  • That is, the trained model performs character recognition on the segmented-region samples in the second data set, and its recognition results are compared with the samples' labels to calculate the error between the trained model's recognition results and the labels.
  • The edit distance may be used as the error measure. The edit distance, also known as the Levenshtein distance, is the minimum number of single-character edit operations required to transform one string into the other; the smaller the edit distance, the greater the similarity between the two strings. Therefore, when the edit distance is used to calculate the error between the trained model's recognition result and a sample's label, a smaller error indicates greater similarity between the recognition result and the label, and conversely a larger error indicates smaller similarity.
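The Levenshtein distance described above has a standard dynamic-programming formulation:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning string a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute if unequal
        prev = cur
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` is 3, and a distance of 0 means the recognition result matches the label exactly.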
  • The error calculated between the trained model's recognition results and the samples' labels is the error between the model's recognition results and the characters actually contained in the samples, and thus reflects the error between the characters recognized by the trained model and the correct characters. The error of each test run of the trained model on the second data set is recorded and its trend analyzed. If the analysis shows that the model's recognition error on the segmented-region samples diverges, training parameters such as the activation function, the number of LSTM layers, and the input and output dimensions are adjusted and the model is retrained, so that its recognition error on the segmented-region samples converges.
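The patent does not say how the error trend is analyzed, so the following is only a crude stand-in: compare the average of the earliest and latest recorded errors and report divergence when the recent average is higher.

```python
def error_diverges(errors, window=3):
    """Crude trend check on the recorded test errors: report divergence
    when the average of the last `window` errors exceeds the average of
    the first `window`. A stand-in for a real convergence analysis."""
    if len(errors) < 2 * window:
        return False          # too few test runs to judge a trend
    head = sum(errors[:window]) / window
    tail = sum(errors[-window:]) / window
    return tail > head
```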
  • When the recognition error converges, model training ends and the model generated by training is used as the trained recognition model.
  • In this embodiment, the trained LSTM model is used for the regions that the OCR character recognition engine cannot recognize. Since the LSTM model has been trained on a large number of segmented-region samples until its recognition error converged, and since it can exploit the long-term information it remembers, such as context, when recognizing the characters in a segmented region, it can recognize those characters more accurately, further improving the recognition accuracy for claim documents.
  • The present invention further provides a character recognition apparatus for claim documents.
  • FIG. 3 is a schematic diagram of functional modules of a first embodiment of a character recognition apparatus for a claim document according to the present invention.
  • the character recognition device of the claim document includes:
  • The segmentation module 01 is configured to perform region segmentation according to the frame-line layout of the claim document's format after receiving an image of a claim document whose characters are to be recognized, to obtain one or more segmented regions;
  • The server may receive a character recognition request for a claim document image containing the characters to be recognized, sent by a user through a terminal such as a mobile phone, a tablet computer, or a self-service terminal device, for example via a client pre-installed on such a terminal.
  • After receiving the image of the claim document whose characters are to be recognized, the server performs region segmentation according to the frame-line layout of the claim document's format. A claim document image is laid out with horizontal or vertical frame lines according to its format, which form the various input fields in which users fill in the relevant information; region segmentation follows this frame-line layout and yields one or more segmented regions.
  • Document templates corresponding to each document type may be obtained in advance according to the type of document uploaded by the user (different insurance products may use different document formats), and the image is then segmented according to the template: the document template corresponding to the claim document image is found, and region segmentation is performed according to that template.
  • A segmented region is a minimal-unit area enclosed by the frame lines of the claim document's format, and it contains no frame lines itself, so that the frame lines cannot affect recognition accuracy during the subsequent character recognition of each region. A segmented region is analogous to a single cell of an Excel spreadsheet: each cell is the smallest area that contains no borders.
  • The recognition module 02 is configured to invoke a predetermined analysis model to analyze each segmented region, and to perform character recognition on each analyzed region using a predetermined recognition rule, so as to recognize the characters in each region.
  • After obtaining the segmented regions, the predetermined analysis model may be invoked to analyze them, and the predetermined recognition rule is used to perform character recognition on each region, thereby recognizing the characters in each region, that is, the characters in the claim document image.
  • For example, the predetermined analysis model may determine which recognition model or recognition method is suitable for each segmented region; according to this analysis result, each region is then recognized with the model or method suited to it, which improves the accuracy of character recognition.
  • Recognition may be performed by an optical character recognition engine, or by other recognition engines or trained recognition models; this is not limited here. Once the characters in each region have been recognized, they are automatically filled into the corresponding input fields of the electronic claim form associated with the claim document image.
  • In this embodiment, before character recognition is performed on the claim document image, the image is segmented into regions according to the frame-line layout of the claim document's format, and each region of the claim document is recognized separately using a predetermined recognition rule. Because the influence of the claim document's frame format on recognition accuracy is taken into account and segmentation along the frame lines precedes character recognition, the interference of the frame lines with character recognition that occurs when the entire claim document image is recognized uniformly is avoided, which effectively improves the recognition accuracy of the characters in claim documents.
  • the foregoing identification module 02 is further configured to:
  • In this embodiment, the predetermined analysis model is invoked to analyze each segmented region so as to distinguish first segmented regions, which do not require deep recognition, from second segmented regions, which do. Taking a current OCR character recognition engine as an example: a region that the engine can recognize correctly, or with a high recognition rate, is treated as a region that does not require deep recognition, i.e., the characters in it can be recognized correctly by the current OCR engine without resorting to other recognition methods; a region that the engine cannot recognize, or recognizes with a low rate, is treated as a region that requires deep recognition, i.e., the current OCR engine cannot recognize its characters correctly and another recognition method, such as a trained recognition model, is needed.
  • Once the first and second segmented regions have been distinguished, different recognition methods are applied to each. Character recognition is performed on each first segmented region using the predetermined OCR character recognition engine, which correctly recognizes its characters. Character recognition is performed on each second segmented region by invoking the predetermined recognition model, which may be a recognition model trained on a large number of segmented-region samples, or a recognition engine that is more sophisticated and has a better recognition effect than the OCR character recognition engine; this is not limited here.
  • the predetermined analysis model is a Convolutional Neural Network (CNN) model
  • A. For a predetermined claim document frame format, obtain a preset number (for example, 500,000) of claim document image samples based on that frame format;
  • B. Segment each claim document image sample into regions according to the frame-line layout of the claim document frame format, and determine, in each sample, the third segmentation regions that the OCR character recognition engine recognizes incorrectly and the fourth segmentation regions that it recognizes correctly;
  • C. Assign all third segmentation regions to a first training set and all fourth segmentation regions to a second training set;
  • D. Extract a first preset proportion (for example, 80%) of the segmentation regions from each of the two training sets as segmentation regions to be trained on, and take the remaining segmentation regions as segmentation regions to be verified;
  • E. Perform model training with the extracted segmentation regions to be trained on, so as to generate the predetermined analysis model, and verify the generated predetermined analysis model with the segmentation regions to be verified;
  • F. If the verification pass rate is greater than or equal to a preset threshold (for example, 98%), the training is completed; or, if the verification pass rate is less than the preset threshold, increase the number of claim document image samples and repeat steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
  • In this embodiment, a convolutional neural network model trained on a large number of claim document image samples performs the segmentation-region analysis, so that, among the segmentation regions of a claim document, the first segmentation regions whose characters the OCR character recognition engine can correctly recognize are accurately distinguished from the second segmentation regions whose characters it cannot. Different recognition methods are then applied to the first and second segmentation regions respectively for accurate character recognition, improving the accuracy of character recognition for claim documents.
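Steps A through F above amount to an outer loop that keeps enlarging the sample pool until the verification pass rate clears the threshold. A minimal sketch of that loop follows, with hypothetical callables `fetch_samples` (step A) and `train_and_validate` (steps B-E, returning the pass rate); the 50,000-sample growth step and the round limit are assumptions of this sketch, not values from the text.

```python
def train_until_pass(fetch_samples, train_and_validate,
                     initial_count=500_000, threshold=0.98,
                     growth=50_000, max_rounds=10):
    """Repeat steps A-E until the verification pass rate (step F)
    reaches the preset threshold, growing the sample pool each round."""
    count = initial_count
    for _ in range(max_rounds):
        samples = fetch_samples(count)           # step A
        pass_rate = train_and_validate(samples)  # steps B-E
        if pass_rate >= threshold:               # step F: done
            return pass_rate
        count += growth                          # add samples, repeat A-E
    raise RuntimeError("verification pass rate never reached threshold")
```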
  • the predetermined recognition model is a Long Short-Term Memory (LSTM) model
  • the training process of the predetermined recognition model is as follows:
  • A preset number (for example, 100,000) of region samples is obtained; these may be segmentation-region samples from historical data, produced by segmenting a number of claim documents according to the frame-line layouts of their frame formats.
  • The font in the segmentation-region samples can be uniformly set to black and the background to white to facilitate character recognition.
  • Each segmentation-region sample is annotated; for example, each sample can be named with the characters it contains, as its annotation.
  • The preset number of segmentation-region samples is divided into a first data set and a second data set at a preset ratio (for example, 8:2); the first data set serves as the training set and the second as the test set, the first data set's share of samples being greater than or equal to the second's.
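The 8:2 split into training and test sets can be sketched as follows; `samples` is any list of (image, annotation) pairs, and the integer-ratio form of the parameter is an assumption of this sketch.

```python
def split_dataset(samples, ratio=(8, 2)):
    """Split annotated segmentation-region samples into a training set
    and a test set at the preset ratio (8:2 by default), so the
    training share is at least as large as the test share."""
    train_part, test_part = ratio
    cut = len(samples) * train_part // (train_part + test_part)
    return samples[:cut], samples[cut:]
```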
  • The first data set is fed into the LSTM network for model training, and at preset intervals (for example, every 30 minutes or every 1,000 iterations) the model is tested with the second data set to evaluate the effect of the model trained so far.
  • During testing, the trained model can perform character recognition on the segmentation-region samples of the second data set, and the model's character recognition results are compared with the samples' annotations to compute the error between the trained model's character recognition results and the annotations of the segmentation-region samples.
  • When computing the error, edit distance can be used as the metric. Edit distance, also known as Levenshtein distance, is the minimum number of edit operations required to transform one string into the other; the permitted operations are replacing one character with another, inserting a character, and deleting a character.
  • In general, the smaller the edit distance, the greater the similarity between the two strings. Therefore, when edit distance is used as the metric to compute the error between the character recognition result of the trained model and the annotation of a segmentation-region sample, a smaller computed error indicates greater similarity between the trained model's recognition result and the sample's annotation; conversely, a larger computed error indicates less similarity between them.
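The edit-distance metric described here is standard Levenshtein distance and can be computed with the usual dynamic program; this is a generic sketch, not code from the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions and
    deletions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[len(b)]
```

A recognition result identical to the annotation thus yields error 0, and each wrong, missing or extra character adds one.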
  • Since the annotation of a segmentation-region sample is the sample's name, that is, the characters the sample contains, the computed error between the trained model's character recognition result and the sample's annotation is the error between the recognition result and the characters contained in the sample, and so reflects the error between the characters recognized by the trained model and the correct characters.
  • The error of each test of the trained model on the second data set is recorded and its trend analyzed. If the character recognition error of the model under test on the segmentation-region samples diverges, training parameters such as the activation function, the number of LSTM layers, and the input and output variable dimensions are adjusted and the model is retrained, so that the error can converge.
  • Once the character recognition error of the model under test on the segmentation-region samples converges, the model training is ended and the generated model is taken as the trained recognition model.
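The divergence check on the recorded test errors can be any trend heuristic. A minimal sketch: treating three strictly increasing consecutive errors as divergence is an assumption of this sketch, not a rule from the patent.

```python
def error_diverges(errors, window=3):
    """Return True when the last `window` recorded test errors are
    strictly increasing, taken here as the sign that the model's
    character recognition error is diverging."""
    if len(errors) < window:
        return False
    tail = errors[-window:]
    return all(tail[k] < tail[k + 1] for k in range(window - 1))
```

When this returns True, training parameters (activation function, LSTM layer count, variable dimensions) would be adjusted and training restarted.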
  • In this embodiment, regions that the OCR character recognition engine cannot recognize are recognized with the trained LSTM model.
  • Because the LSTM model has been trained on a large number of segmentation-region samples, with its character recognition error on those samples converged, and because of the LSTM's own long-term memory capability, the model can exploit remembered long-range information such as context when recognizing the characters in a segmentation region, recognizing them more accurately and further improving the accuracy of character recognition for claim documents.
  • The present invention further provides a character recognition server for claim documents.
  • FIG. 4 is a schematic diagram of a first embodiment of the character recognition server for claim documents according to the present invention.
  • The character recognition server for claim documents includes a memory 11, a processor 12, a communication bus 13 and a network interface 14.
  • The communication bus 13 implements connection and communication among these components.
  • The memory 11 includes an internal memory and at least one type of readable storage medium.
  • The internal memory provides a cache for the operation of the character recognition server for claim documents;
  • the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card or a card-type memory.
  • In some embodiments, the readable storage medium may be an internal storage unit of the character recognition server for claim documents, such as the server's hard disk or memory.
  • In other embodiments, the readable storage medium may also be an external storage device of the character recognition server for claim documents, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the server.
  • The readable storage medium of the memory 11 is generally used to store the application software installed on the character recognition server for claim documents and various kinds of data, such as the character recognition program for claim documents.
  • The memory 11 can also be used to temporarily store data that has been or is about to be output.
  • The processor 12 may be a Central Processing Unit (CPU), microprocessor or other data processing chip, used to run program code stored in the memory 11 or to process data.
  • The network interface 14 may include a standard wired interface and a wireless interface (such as a Wi-Fi interface).
  • FIG. 4 shows only a character recognition server for claim documents with components 11-14, but it should be understood that implementing all of the illustrated components is not required; more or fewer components may be implemented instead.
  • The character recognition server for claim documents may further include a user interface.
  • The user interface may include standard wired and wireless interfaces,
  • for example input units such as a keyboard, wired or wireless headset ports, an external power (or battery charger) port, wired or wireless data ports, memory card ports, ports for connecting devices having an identification module, audio input/output (I/O) ports, video I/O ports, earphone ports, and so on.
  • The user interface can be used to receive input from external devices (for example, data information or power) and to transmit the received input to one or more components of the terminal.
  • The character recognition server for claim documents may further include a display, which may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like.
  • The display is used to show information processed in the character recognition server for claim documents and to present a visualized user interface.
  • The memory 11 may contain a character recognition program for claim documents, and when the processor 12 executes the character recognition program for claim documents stored in the memory 11, the following steps are implemented:
  • After a claim document image containing characters to be recognized is received, region segmentation is performed according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;
  • a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
  • the step of invoking the predetermined analysis model to analyze each of the obtained segmentation regions comprises:
  • the predetermined analysis model is a convolutional neural network model
  • the training process of the predetermined analysis model is as follows:
  • A. For a predetermined claim document frame format, obtain a preset number of claim document image samples based on that frame format;
  • F. If the verification pass rate is greater than or equal to the preset threshold, the training is completed; or, if the verification pass rate is less than the preset threshold, increase the number of claim document image samples and repeat the above steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
  • The predetermined recognition model is a long short-term memory (LSTM) model, and the training process of the predetermined recognition model is as follows:
  • If the error of the characters recognized by the trained model diverges, the preset training parameters are adjusted and the model is retrained until the error of the characters recognized by the trained model can converge;
  • the model training is ended, and the generated model is used as the trained recognition model.
  • The segmentation region is the minimum-unit area enclosed by the frame lines of the claim document frame format, and the segmentation region is an area that contains no frame lines.
  • the invention further provides a computer readable storage medium.
  • The computer-readable storage medium stores a character recognition program for claim documents, and the character recognition program for claim documents can be executed by at least one processor to implement the following steps:
  • After a claim document image containing characters to be recognized is received, region segmentation is performed according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;
  • a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
  • The step of invoking a predetermined analysis model to analyze each obtained segmentation region includes:
  • the predetermined analysis model is a convolutional neural network model
  • the training process of the predetermined analysis model is as follows:
  • A. For a predetermined claim document frame format, obtain a preset number of claim document image samples based on that frame format;
  • F. If the verification pass rate is greater than or equal to the preset threshold, the training is completed; or, if the verification pass rate is less than the preset threshold, increase the number of claim document image samples and repeat the above steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
  • The predetermined recognition model is a long short-term memory (LSTM) model, and the training process of the predetermined recognition model is as follows:
  • If the error of the characters recognized by the trained model diverges, the preset training parameters are adjusted and the model is retrained until the error of the characters recognized by the trained model can converge;
  • the model training is ended, and the generated model is used as the trained recognition model.
  • The segmentation region is the minimum-unit area enclosed by the frame lines of the claim document frame format, and the segmentation region is an area that contains no frame lines.
  • From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Finance (AREA)
  • Biophysics (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The present invention discloses a character recognition method, apparatus, server and storage medium for claim documents. The method includes: after receiving a claim document image containing characters to be recognized, the server performs region segmentation according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions; a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region. Because the present invention takes into account the effect of the claim document frame format on recognition accuracy, region segmentation according to the frame-line layout of the claim document frame format is performed before character recognition, and recognition is then carried out per segmentation region. This avoids the influence and interference of the document's frame lines on character recognition that occur when the characters of the entire claim document image are recognized uniformly, and can effectively improve the accuracy of character recognition for claim documents.

Description

Character recognition method, apparatus, server and storage medium for claim documents

Priority Claim

Under the Paris Convention, this application claims priority to the Chinese patent application No. CN201710233613.3, filed on April 11, 2017 and entitled "Character recognition method and server for claim documents", the entire content of which is incorporated herein by reference.
Technical Field

The present invention relates to the field of computer technology, and in particular to a character recognition method, apparatus and server for claim documents, and a computer-readable storage medium.
Background

As public awareness of insurance grows and the customer base purchasing insurance expands substantially, insurance companies must handle more and more customer claim applications, and their operators must enter more and more claim document images, straining the manpower available for data entry; at the same time, entry errors occur frequently. To effectively reduce entry errors and improve entry efficiency, some insurance companies have introduced OCR (Optical Character Recognition) technology into the data-entry workflow, automatically recognizing the characters in claim document images and filling them into the corresponding input fields.

However, existing OCR-based schemes for recognizing the characters of claim document images rely solely on their own recognition engine to recognize the characters of the entire image uniformly. They consider neither the effect of the claim document frame format on recognition accuracy nor the interference of the document's frame lines with character recognition, so their recognition accuracy is low and substantial manpower and material resources must be spent on verification.
Summary of the Invention

The main object of the present invention is to provide a character recognition method, apparatus and server for claim documents, and a computer-readable storage medium, aiming to improve the recognition accuracy of claim documents.

To achieve the above object, a first aspect of the present invention provides a character recognition method for claim documents, the method including the following steps:

after receiving a claim document image containing characters to be recognized, the server performs region segmentation according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;

a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
A second aspect of the present invention further provides a character recognition apparatus for claim documents, the apparatus including:

a segmentation module configured to, after a claim document image containing characters to be recognized is received, perform region segmentation according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;

a recognition module configured to invoke a predetermined analysis model to analyze each obtained segmentation region, and to perform character recognition on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
A third aspect of the present invention provides a character recognition server for claim documents, including a memory and a processor, the memory storing a character recognition program for claim documents which, when executed by the processor, implements the following steps:

after receiving a claim document image containing characters to be recognized, the server performs region segmentation according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;

a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
A fourth aspect of the present invention provides a computer-readable storage medium storing a character recognition program for claim documents, the program being executable by at least one processor to implement the following steps:

after a claim document image containing characters to be recognized is received, region segmentation is performed according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;

a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
Compared with the prior art, the character recognition method, apparatus and server for claim documents and the computer-readable storage medium proposed by the present invention segment the claim document image into regions according to the frame-line layout of the claim document frame format before performing character recognition on it, and perform character recognition separately on each segmentation region of the claim document using predetermined recognition rules, so as to recognize the characters in each segmentation region. Because the effect of the claim document frame format on recognition accuracy is taken into account, region segmentation according to the frame-line layout precedes character recognition, and recognition is then carried out per segmentation region; this avoids the influence and interference of the document's frame lines on character recognition that occur when the characters of the entire claim document image are recognized uniformly, and can effectively improve the accuracy of character recognition for claim documents.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a first embodiment of the character recognition method for claim documents according to the present invention;

FIG. 2 is a schematic flowchart of a second embodiment of the character recognition method for claim documents according to the present invention;

FIG. 3 is a schematic diagram of the functional modules of a first embodiment of the character recognition apparatus for claim documents according to the present invention;

FIG. 4 is a schematic diagram of a first embodiment of the character recognition server for claim documents according to the present invention.
The realization of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description of the Embodiments

To make the technical problems to be solved, the technical solutions and the beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the present invention and are not intended to limit it.
The present invention provides a character recognition method for claim documents.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a first embodiment of the character recognition method for claim documents according to the present invention.

In the first embodiment, the character recognition method for claim documents includes:

Step S10: after receiving a claim document image containing characters to be recognized, the server performs region segmentation according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions.
In this embodiment, the server may receive a character recognition request, sent by a user, that contains a claim document image with characters to be recognized: for example, a request sent from a terminal such as a mobile phone, tablet computer or self-service terminal, such as one sent from a client application pre-installed on such a terminal, or from a browser system on such a terminal.

After receiving the claim document image with characters to be recognized, the server performs region segmentation according to the frame-line layout of the claim document frame format. In a claim document image, horizontal and vertical frame lines are arranged according to its frame format, forming the input fields in which users fill in the relevant information. In this embodiment, region segmentation according to the frame-line layout of the claim document frame format yields one or more segmentation regions. For example, in one implementation, since different types of insurance generally correspond to different document format templates, the corresponding document template may be obtained in advance according to the document type uploaded by the user (different insurance products may have different document formats), and segmentation is then performed according to the format of the template. For instance, the document template corresponding to the received claim document image may be located according to the image's document type, and region segmentation then performed according to that template. The segmentation region is the minimum-unit area enclosed by the frame lines of the claim document frame format, and it contains no frame lines, so that frame lines will not interfere with or affect recognition accuracy when character recognition is subsequently performed on each segmentation region. A segmentation region is analogous to a cell of an Excel spreadsheet, each cell being a minimal area containing no frame lines.
Step S20: a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.

After the claim document image has been segmented into one or more segmentation regions according to the frame-line layout of the claim document frame format, a predetermined analysis model may be invoked to analyze each obtained segmentation region, and character recognition performed on each segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region, that is, the characters in the claim document image. For example, the predetermined analysis model may determine which recognition model or recognition method suits each segmentation region, and character recognition may then be performed on each region with the model or method suited to it, improving the accuracy of character recognition. For different segmentation regions, the analysis may indicate recognition by an optical character recognition engine, or by another recognition engine or a trained recognition model; no limitation is imposed here. Once the characters in each segmentation region are recognized, they may also be automatically filled into the corresponding input fields of the electronic claim document corresponding to the claim document image.
In this embodiment, before character recognition is performed on the claim document image, the image is segmented into regions according to the frame-line layout of the claim document frame format, and character recognition is performed separately on each segmentation region of the claim document using predetermined recognition rules, so as to recognize the characters in each segmentation region. Because the effect of the claim document frame format on recognition accuracy is taken into account, region segmentation according to the frame-line layout precedes character recognition, and recognition is then carried out per segmentation region; this avoids the influence and interference of the document's frame lines on character recognition that occur when the characters of the entire claim document image are recognized uniformly, and can effectively improve the accuracy of character recognition for claim documents.
As shown in FIG. 2, a second embodiment of the present invention provides a character recognition method for claim documents. On the basis of the above embodiment, the step S20 includes:

Step S201: invoking a predetermined analysis model to analyze each obtained segmentation region, so as to distinguish first segmentation regions that can be recognized by an optical character recognition engine from second segmentation regions that cannot;

Step S202: performing character recognition on each first segmentation region with a predetermined optical character recognition engine, so as to recognize the characters in each first segmentation region, and invoking a predetermined recognition model to perform character recognition on each second segmentation region, so as to recognize the characters in each second segmentation region.
In this embodiment, after region segmentation according to the frame-line layout of the claim document frame format yields one or more segmentation regions, and before the obtained regions are recognized, a predetermined analysis model is further invoked to analyze each obtained segmentation region, so as to distinguish first segmentation regions that require no deep recognition from second segmentation regions that do. For example, taking an OCR character recognition engine as the current recognition engine for illustration: regions that the OCR character recognition engine can recognize correctly, or with a high recognition rate, are treated as regions needing no deep recognition, meaning the characters in such a region can be recognized correctly by the current OCR character recognition engine alone, without other recognition means. Regions that the OCR character recognition engine cannot recognize, or recognizes at a low rate, are treated as regions requiring deep recognition, meaning the current OCR character recognition engine cannot correctly recognize the characters in such a region, and other recognition means, such as a trained recognition model, are needed for character recognition.

After the first segmentation regions that the OCR character recognition engine can correctly recognize and the second segmentation regions that it cannot have been identified in the claim document image, different recognition methods can be applied to the first and second segmentation regions for character recognition. The predetermined OCR character recognition engine performs character recognition on each first segmentation region, so as to correctly recognize the characters in each first segmentation region. A predetermined recognition model is invoked to perform character recognition on each second segmentation region, so as to correctly recognize the characters in each second segmentation region; this predetermined recognition model may be a recognition model trained on a large number of segmentation-region samples, or it may be a recognition engine that is more sophisticated than the OCR character recognition engine and recognizes better; no limitation is imposed here.
Further, in other embodiments, the predetermined analysis model is a Convolutional Neural Network (CNN) model, and the training process of the predetermined analysis model is as follows:

A. For a predetermined claim document frame format, obtain a preset number (for example, 500,000) of claim document image samples based on that frame format;

B. Segment each claim document image sample into regions according to the frame-line layout of the claim document frame format, and determine, in each sample, the third segmentation regions that the OCR character recognition engine recognizes incorrectly and the fourth segmentation regions that it recognizes correctly;

C. Assign all third segmentation regions to a first training set and all fourth segmentation regions to a second training set;

D. Extract a first preset proportion (for example, 80%) of the segmentation regions from each of the first and second training sets as segmentation regions to be trained on, and take the remaining segmentation regions of the two sets as segmentation regions to be verified;

E. Perform model training with the extracted segmentation regions to be trained on, so as to generate the predetermined analysis model, and verify the generated predetermined analysis model with the segmentation regions to be verified;

F. If the verification pass rate is greater than or equal to a preset threshold (for example, 98%), the training is completed; or, if the verification pass rate is less than the preset threshold, increase the number of claim document image samples and repeat the steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
In this embodiment, a convolutional neural network model trained on a large number of claim document image samples performs the segmentation-region analysis, accurately distinguishing, among the segmentation regions of a claim document, the first segmentation regions whose characters the OCR character recognition engine can correctly recognize from the second segmentation regions whose characters it cannot, so that different recognition methods can subsequently be applied to the first and second segmentation regions for accurate character recognition, improving the accuracy of character recognition for claim documents.
Further, in other embodiments, the predetermined recognition model is a Long Short-Term Memory (LSTM) model, and the training process of the predetermined recognition model is as follows:

A preset number (for example, 100,000) of region samples is obtained; these may be segmentation-region samples from historical data, produced by segmenting a number of claim documents according to the frame-line layouts of their frame formats. In one implementation, the font in the segmentation-region samples can be uniformly set to black and the background to white to facilitate character recognition. Each segmentation-region sample is annotated; for example, each sample can be named with the characters it contains, as its annotation.

The preset number of segmentation-region samples is divided into a first data set and a second data set at a preset ratio (for example, 8:2); the first data set serves as the training set and the second as the test set, the first data set's share of samples being greater than or equal to the second's.

The first data set is fed into an LSTM network for model training, and at preset intervals (for example, every 30 minutes or every 1,000 iterations) the model is tested with the second data set to evaluate the effect of the model trained so far. For example, during testing, the trained model can perform character recognition on the segmentation-region samples of the second data set, and the model's character recognition results are compared with the samples' annotations to compute the error between the trained model's recognition results and the annotations. Specifically, when computing the error, edit distance can be used as the metric: edit distance, also known as Levenshtein distance, is the minimum number of edit operations required to transform one string into the other, the permitted operations being replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity between the two strings. Therefore, when edit distance is used as the metric to compute the error between the trained model's character recognition result and a sample's annotation, a smaller computed error indicates greater similarity between the recognition result and the annotation; conversely, a larger computed error indicates less similarity between them.

Since the annotation of a segmentation-region sample is the sample's name, that is, the characters the sample contains, the computed error between the trained model's character recognition result and the sample's annotation is the error between the recognition result and the characters contained in the sample, and so reflects the error between the characters recognized by the trained model and the correct characters. The error of each test of the trained model on the second data set is recorded and its trend analyzed. If the character recognition error of the model under test on the segmentation-region samples diverges, training parameters such as the activation function, the number of LSTM layers and the input and output variable dimensions are adjusted and the model is retrained, so that the error can converge. Once the character recognition error of the model under test on the segmentation-region samples converges, the model training is ended and the generated model is taken as the trained predetermined recognition model.
In this embodiment, regions that the OCR character recognition engine cannot recognize are recognized with the trained LSTM model. Because the LSTM model has been trained on a large number of segmentation-region samples, with its character recognition error on those samples converged, and because of the LSTM's own long-term memory capability, the model can exploit remembered long-range information such as context when recognizing the characters in a segmentation region, recognizing them more accurately and further improving the accuracy of character recognition for claim documents.
The present invention further provides a character recognition apparatus for claim documents.

Referring to FIG. 3, FIG. 3 is a schematic diagram of the functional modules of a first embodiment of the character recognition apparatus for claim documents according to the present invention.

In the first embodiment, the character recognition apparatus for claim documents includes:

a segmentation module 01 configured to, after a claim document image containing characters to be recognized is received, perform region segmentation according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions.
In this embodiment, the server may receive a character recognition request, sent by a user, that contains a claim document image with characters to be recognized: for example, a request sent from a terminal such as a mobile phone, tablet computer or self-service terminal, such as one sent from a client application pre-installed on such a terminal, or from a browser system on such a terminal.

After receiving the claim document image with characters to be recognized, the server performs region segmentation according to the frame-line layout of the claim document frame format. In a claim document image, horizontal and vertical frame lines are arranged according to its frame format, forming the input fields in which users fill in the relevant information. In this embodiment, region segmentation according to the frame-line layout of the claim document frame format yields one or more segmentation regions. For example, in one implementation, since different types of insurance generally correspond to different document format templates, the corresponding document template may be obtained in advance according to the document type uploaded by the user (different insurance products may have different document formats), and segmentation is then performed according to the format of the template. For instance, the document template corresponding to the received claim document image may be located according to the image's document type, and region segmentation then performed according to that template. The segmentation region is the minimum-unit area enclosed by the frame lines of the claim document frame format, and it contains no frame lines, so that frame lines will not interfere with or affect recognition accuracy when character recognition is subsequently performed on each segmentation region. A segmentation region is analogous to a cell of an Excel spreadsheet, each cell being a minimal area containing no frame lines.
A recognition module 02 is configured to invoke a predetermined analysis model to analyze each obtained segmentation region, and to perform character recognition on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.

After the claim document image has been segmented into one or more segmentation regions according to the frame-line layout of the claim document frame format, a predetermined analysis model may be invoked to analyze each obtained segmentation region, and character recognition performed on each segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region, that is, the characters in the claim document image. For example, the predetermined analysis model may determine which recognition model or recognition method suits each segmentation region, and character recognition may then be performed on each region with the model or method suited to it, improving the accuracy of character recognition. For different segmentation regions, the analysis may indicate recognition by an optical character recognition engine, or by another recognition engine or a trained recognition model; no limitation is imposed here. Once the characters in each segmentation region are recognized, they may also be automatically filled into the corresponding input fields of the electronic claim document corresponding to the claim document image.

In this embodiment, before character recognition is performed on the claim document image, the image is segmented into regions according to the frame-line layout of the claim document frame format, and character recognition is performed separately on each segmentation region of the claim document using predetermined recognition rules, so as to recognize the characters in each segmentation region. Because the effect of the claim document frame format on recognition accuracy is taken into account, region segmentation according to the frame-line layout precedes character recognition, and recognition is then carried out per segmentation region; this avoids the influence and interference of the document's frame lines on character recognition that occur when the characters of the entire claim document image are recognized uniformly, and can effectively improve the accuracy of character recognition for claim documents.
Further, on the basis of the above embodiment, the recognition module 02 is further configured to:

invoke a predetermined analysis model to analyze each obtained segmentation region, so as to distinguish first segmentation regions that can be recognized by an optical character recognition engine from second segmentation regions that cannot;

perform character recognition on each first segmentation region with a predetermined optical character recognition engine, so as to recognize the characters in each first segmentation region, and invoke a predetermined recognition model to perform character recognition on each second segmentation region, so as to recognize the characters in each second segmentation region.
In this embodiment, after region segmentation according to the frame-line layout of the claim document frame format yields one or more segmentation regions, and before the obtained regions are recognized, a predetermined analysis model is further invoked to analyze each obtained segmentation region, so as to distinguish first segmentation regions that require no deep recognition from second segmentation regions that do. For example, taking an OCR character recognition engine as the current recognition engine for illustration: regions that the OCR character recognition engine can recognize correctly, or with a high recognition rate, are treated as regions needing no deep recognition, meaning the characters in such a region can be recognized correctly by the current OCR character recognition engine alone, without other recognition means. Regions that the OCR character recognition engine cannot recognize, or recognizes at a low rate, are treated as regions requiring deep recognition, meaning the current OCR character recognition engine cannot correctly recognize the characters in such a region, and other recognition means, such as a trained recognition model, are needed for character recognition.

After the first segmentation regions that the OCR character recognition engine can correctly recognize and the second segmentation regions that it cannot have been identified in the claim document image, different recognition methods can be applied to the first and second segmentation regions for character recognition. The predetermined OCR character recognition engine performs character recognition on each first segmentation region, so as to correctly recognize the characters in each first segmentation region. A predetermined recognition model is invoked to perform character recognition on each second segmentation region, so as to correctly recognize the characters in each second segmentation region; this predetermined recognition model may be a recognition model trained on a large number of segmentation-region samples, or it may be a recognition engine that is more sophisticated than the OCR character recognition engine and recognizes better; no limitation is imposed here.
Further, in other embodiments, the predetermined analysis model is a Convolutional Neural Network (CNN) model, and the training process of the predetermined analysis model is as follows:

A. For a predetermined claim document frame format, obtain a preset number (for example, 500,000) of claim document image samples based on that frame format;

B. Segment each claim document image sample into regions according to the frame-line layout of the claim document frame format, and determine, in each sample, the third segmentation regions that the OCR character recognition engine recognizes incorrectly and the fourth segmentation regions that it recognizes correctly;

C. Assign all third segmentation regions to a first training set and all fourth segmentation regions to a second training set;

D. Extract a first preset proportion (for example, 80%) of the segmentation regions from each of the first and second training sets as segmentation regions to be trained on, and take the remaining segmentation regions of the two sets as segmentation regions to be verified;

E. Perform model training with the extracted segmentation regions to be trained on, so as to generate the predetermined analysis model, and verify the generated predetermined analysis model with the segmentation regions to be verified;

F. If the verification pass rate is greater than or equal to a preset threshold (for example, 98%), the training is completed; or, if the verification pass rate is less than the preset threshold, increase the number of claim document image samples and repeat the steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
In this embodiment, a convolutional neural network model trained on a large number of claim document image samples performs the segmentation-region analysis, accurately distinguishing, among the segmentation regions of a claim document, the first segmentation regions whose characters the OCR character recognition engine can correctly recognize from the second segmentation regions whose characters it cannot, so that different recognition methods can subsequently be applied to the first and second segmentation regions for accurate character recognition, improving the accuracy of character recognition for claim documents.
Further, in other embodiments, the predetermined recognition model is a Long Short-Term Memory (LSTM) model, and the training process of the predetermined recognition model is as follows:

A preset number (for example, 100,000) of region samples is obtained; these may be segmentation-region samples from historical data, produced by segmenting a number of claim documents according to the frame-line layouts of their frame formats. In one implementation, the font in the segmentation-region samples can be uniformly set to black and the background to white to facilitate character recognition. Each segmentation-region sample is annotated; for example, each sample can be named with the characters it contains, as its annotation.

The preset number of segmentation-region samples is divided into a first data set and a second data set at a preset ratio (for example, 8:2); the first data set serves as the training set and the second as the test set, the first data set's share of samples being greater than or equal to the second's.

The first data set is fed into an LSTM network for model training, and at preset intervals (for example, every 30 minutes or every 1,000 iterations) the model is tested with the second data set to evaluate the effect of the model trained so far. For example, during testing, the trained model can perform character recognition on the segmentation-region samples of the second data set, and the model's character recognition results are compared with the samples' annotations to compute the error between the trained model's recognition results and the annotations. Specifically, when computing the error, edit distance can be used as the metric: edit distance, also known as Levenshtein distance, is the minimum number of edit operations required to transform one string into the other, the permitted operations being replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity between the two strings. Therefore, when edit distance is used as the metric to compute the error between the trained model's character recognition result and a sample's annotation, a smaller computed error indicates greater similarity between the recognition result and the annotation; conversely, a larger computed error indicates less similarity between them.

Since the annotation of a segmentation-region sample is the sample's name, that is, the characters the sample contains, the computed error between the trained model's character recognition result and the sample's annotation is the error between the recognition result and the characters contained in the sample, and so reflects the error between the characters recognized by the trained model and the correct characters. The error of each test of the trained model on the second data set is recorded and its trend analyzed. If the character recognition error of the model under test on the segmentation-region samples diverges, training parameters such as the activation function, the number of LSTM layers and the input and output variable dimensions are adjusted and the model is retrained, so that the error can converge. Once the character recognition error of the model under test on the segmentation-region samples converges, the model training is ended and the generated model is taken as the trained predetermined recognition model.
In this embodiment, regions that the OCR character recognition engine cannot recognize are recognized with the trained LSTM model. Because the LSTM model has been trained on a large number of segmentation-region samples, with its character recognition error on those samples converged, and because of the LSTM's own long-term memory capability, the model can exploit remembered long-range information such as context when recognizing the characters in a segmentation region, recognizing them more accurately and further improving the accuracy of character recognition for claim documents.
The present invention further provides a character recognition server for claim documents.

Referring to FIG. 4, FIG. 4 is a schematic diagram of a first embodiment of the character recognition server for claim documents according to the present invention.

In the first embodiment, the character recognition server for claim documents includes a memory 11, a processor 12, a communication bus 13 and a network interface 14, where the communication bus 13 implements connection and communication among these components.
The memory 11 includes an internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the character recognition server for claim documents; the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the character recognition server for claim documents, such as the server's hard disk or memory. In other embodiments, the readable storage medium may also be an external storage device of the character recognition server for claim documents, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the server.

In this embodiment, the readable storage medium of the memory 11 is generally used to store the application software installed on the character recognition server for claim documents and various kinds of data, such as the character recognition program for claim documents. The memory 11 can also be used to temporarily store data that has been or is about to be output.
The processor 12 may in some embodiments be a Central Processing Unit (CPU), microprocessor or other data processing chip, used to run program code stored in the memory 11 or to process data.

The network interface 14 may include a standard wired interface and a wireless interface (such as a Wi-Fi interface).
FIG. 4 shows only a character recognition server for claim documents with components 11-14, but it should be understood that implementing all of the illustrated components is not required; more or fewer components may be implemented instead.

Optionally, the character recognition server for claim documents may further include a user interface, which may include standard wired and wireless interfaces, for example input units such as a keyboard, wired or wireless headset ports, an external power (or battery charger) port, wired or wireless data ports, memory card ports, ports for connecting devices having an identification module, audio input/output (I/O) ports, video I/O ports, earphone ports, and so on. The user interface can be used to receive input from external devices (for example, data information or power) and to transmit the received input to one or more components of the terminal.

Optionally, the character recognition server for claim documents may further include a display, which may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display is used to show information processed in the character recognition server for claim documents and to present a visualized user interface.
In the embodiment of the character recognition server for claim documents shown in FIG. 4, the memory 11 may contain a character recognition program for claim documents, and when the processor 12 executes the character recognition program stored in the memory 11, the following steps are implemented:

after a claim document image containing characters to be recognized is received, region segmentation is performed according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;

a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
Preferably, the step of invoking a predetermined analysis model to analyze each obtained segmentation region includes:

invoking a predetermined analysis model to analyze each obtained segmentation region, so as to distinguish first segmentation regions that can be recognized by an optical character recognition engine from second segmentation regions that cannot;

the step of performing character recognition on each analyzed segmentation region using predetermined recognition rules further includes:

performing character recognition on each first segmentation region with a predetermined optical character recognition engine, so as to recognize the characters in each first segmentation region, and invoking a predetermined recognition model to perform character recognition on each second segmentation region, so as to recognize the characters in each second segmentation region.
Preferably, the predetermined analysis model is a convolutional neural network model, and the training process of the predetermined analysis model is as follows:

A. For a predetermined claim document frame format, obtain a preset number of claim document image samples based on that frame format;

B. Segment each claim document image sample into regions according to the frame-line layout of the claim document frame format, and determine, in each sample, the third segmentation regions that the optical character recognition engine recognizes incorrectly and the fourth segmentation regions that it recognizes correctly;

C. Assign all third segmentation regions to a first training set and all fourth segmentation regions to a second training set;

D. Extract a first preset proportion of the segmentation regions from each of the first and second training sets as segmentation regions to be trained on, and take the remaining segmentation regions of the two sets as segmentation regions to be verified;

E. Perform model training with the extracted segmentation regions to be trained on, so as to generate the predetermined analysis model, and verify the generated predetermined analysis model with the segmentation regions to be verified;

F. If the verification pass rate is greater than or equal to a preset threshold, the training is completed; or, if the verification pass rate is less than the preset threshold, increase the number of claim document image samples and repeat the above steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
Preferably, the predetermined recognition model is a long short-term memory (LSTM) model, and the training process of the predetermined recognition model is as follows:

obtain a preset number of segmentation-region samples and annotate each segmentation-region sample with the characters it contains;

divide the preset number of segmentation-region samples into a first data set and a second data set at a preset ratio, taking the first data set as the training set and the second data set as the test set;

feed the first data set into an LSTM network for model training and, at preset intervals, use the trained model to perform character recognition on the segmentation-region samples of the second data set, comparing the recognized characters with the samples' annotations to compute the error between the recognized characters and the annotations;

if the error of the characters recognized by the trained model diverges, adjust the preset training parameters and retrain until the error of the characters recognized by the trained model can converge;

if the error of the characters recognized by the trained model converges, end the model training and take the generated model as the trained predetermined recognition model.

Preferably, the segmentation region is the minimum-unit area enclosed by the frame lines of the claim document frame format, and the segmentation region is an area that contains no frame lines.
The specific implementation of the character recognition server for claim documents of the present invention is substantially the same as that of the character recognition method for claim documents described above, and is therefore not repeated here.
The present invention further provides a computer-readable storage medium.

The computer-readable storage medium stores a character recognition program for claim documents, executable by at least one processor to implement the following steps:

after a claim document image containing characters to be recognized is received, region segmentation is performed according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;

a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
Preferably, the step of invoking a predetermined analysis model to analyze each obtained segmentation region includes:

invoking a predetermined analysis model to analyze each obtained segmentation region, so as to distinguish first segmentation regions that can be recognized by an optical character recognition engine from second segmentation regions that cannot;

the step of performing character recognition on each analyzed segmentation region using predetermined recognition rules further includes:

performing character recognition on each first segmentation region with a predetermined optical character recognition engine, so as to recognize the characters in each first segmentation region, and invoking a predetermined recognition model to perform character recognition on each second segmentation region, so as to recognize the characters in each second segmentation region.
Preferably, the predetermined analysis model is a convolutional neural network model, and the training process of the predetermined analysis model is as follows:

A. For a predetermined claim document frame format, obtain a preset number of claim document image samples based on that frame format;

B. Segment each claim document image sample into regions according to the frame-line layout of the claim document frame format, and determine, in each sample, the third segmentation regions that the optical character recognition engine recognizes incorrectly and the fourth segmentation regions that it recognizes correctly;

C. Assign all third segmentation regions to a first training set and all fourth segmentation regions to a second training set;

D. Extract a first preset proportion of the segmentation regions from each of the first and second training sets as segmentation regions to be trained on, and take the remaining segmentation regions of the two sets as segmentation regions to be verified;

E. Perform model training with the extracted segmentation regions to be trained on, so as to generate the predetermined analysis model, and verify the generated predetermined analysis model with the segmentation regions to be verified;

F. If the verification pass rate is greater than or equal to a preset threshold, the training is completed; or, if the verification pass rate is less than the preset threshold, increase the number of claim document image samples and repeat the above steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
Preferably, the predetermined recognition model is a long short-term memory (LSTM) model, and the training process of the predetermined recognition model is as follows:

obtain a preset number of segmentation-region samples and annotate each segmentation-region sample with the characters it contains;

divide the preset number of segmentation-region samples into a first data set and a second data set at a preset ratio, taking the first data set as the training set and the second data set as the test set;

feed the first data set into an LSTM network for model training and, at preset intervals, use the trained model to perform character recognition on the segmentation-region samples of the second data set, comparing the recognized characters with the samples' annotations to compute the error between the recognized characters and the annotations;

if the error of the characters recognized by the trained model diverges, adjust the preset training parameters and retrain until the error of the characters recognized by the trained model can converge;

if the error of the characters recognized by the trained model converges, end the model training and take the generated model as the trained predetermined recognition model.

Preferably, the segmentation region is the minimum-unit area enclosed by the frame lines of the claim document frame format, and the segmentation region is an area that contains no frame lines.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as that of the character recognition method for claim documents described above, and is therefore not repeated here.

It should be noted that, as used herein, the terms "include", "comprise" and any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or apparatus. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that includes the element.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present invention.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, without thereby limiting the scope of the present invention. The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one here.

Those skilled in the art can implement the present invention in many variant ways without departing from its scope and spirit; for example, the features of one embodiment can be used in another embodiment to obtain yet another embodiment. Any modification, equivalent replacement or improvement made within the technical concept of the present invention shall fall within the scope of the claims of the present invention.

Claims (20)

  1. A character recognition method for claim documents, characterized in that the method includes the following steps:
    after receiving a claim document image containing characters to be recognized, the server performs region segmentation according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;
    a predetermined analysis model is invoked to analyze each obtained segmentation region, and character recognition is performed on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
  2. The character recognition method for claim documents according to claim 1, characterized in that the step of invoking a predetermined analysis model to analyze each obtained segmentation region includes:
    invoking a predetermined analysis model to analyze each obtained segmentation region, so as to distinguish first segmentation regions that can be recognized by an optical character recognition engine from second segmentation regions that cannot;
    the step of performing character recognition on each analyzed segmentation region using predetermined recognition rules further includes:
    performing character recognition on each first segmentation region with a predetermined optical character recognition engine, so as to recognize the characters in each first segmentation region, and invoking a predetermined recognition model to perform character recognition on each second segmentation region, so as to recognize the characters in each second segmentation region.
  3. The character recognition method for claim documents according to claim 2, characterized in that the predetermined analysis model is a convolutional neural network model, and the training process of the predetermined analysis model is as follows:
    A. for a predetermined claim document frame format, obtaining a preset number of claim document image samples based on that frame format;
    B. segmenting each claim document image sample into regions according to the frame-line layout of the claim document frame format, and determining, in each sample, the third segmentation regions that the optical character recognition engine recognizes incorrectly and the fourth segmentation regions that it recognizes correctly;
    C. assigning all third segmentation regions to a first training set and all fourth segmentation regions to a second training set;
    D. extracting a first preset proportion of the segmentation regions from each of the first and second training sets as segmentation regions to be trained on, and taking the remaining segmentation regions of the two sets as segmentation regions to be verified;
    E. performing model training with the extracted segmentation regions to be trained on, so as to generate the predetermined analysis model, and verifying the generated predetermined analysis model with the segmentation regions to be verified;
    F. if the verification pass rate is greater than or equal to a preset threshold, completing the training; or, if the verification pass rate is less than the preset threshold, increasing the number of claim document image samples and repeating the above steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
  4. The character recognition method for claim documents according to claim 2 or 3, characterized in that the predetermined recognition model is a long short-term memory (LSTM) model, and the training process of the predetermined recognition model is as follows:
    obtaining a preset number of segmentation-region samples and annotating each segmentation-region sample with the characters it contains;
    dividing the preset number of segmentation-region samples into a first data set and a second data set at a preset ratio, taking the first data set as the training set and the second data set as the test set;
    feeding the first data set into an LSTM network for model training and, at preset intervals, using the trained model to perform character recognition on the segmentation-region samples of the second data set, comparing the recognized characters with the samples' annotations to compute the error between the recognized characters and the annotations;
    if the error of the characters recognized by the trained model diverges, adjusting the preset training parameters and retraining until the error of the characters recognized by the trained model can converge;
    if the error of the characters recognized by the trained model converges, ending the model training and taking the generated model as the trained predetermined recognition model.
  5. The character recognition method for claim documents according to claim 1, characterized in that the segmentation region is the minimum-unit area enclosed by the frame lines of the claim document frame format, and the segmentation region is an area that contains no frame lines.
  6. A character recognition apparatus for claim documents, characterized in that the character recognition apparatus includes:
    a segmentation module configured to, after a claim document image containing characters to be recognized is received, perform region segmentation according to the frame-line layout of the claim document frame format to obtain one or more segmentation regions;
    a recognition module configured to invoke a predetermined analysis model to analyze each obtained segmentation region, and to perform character recognition on each analyzed segmentation region using predetermined recognition rules, so as to recognize the characters in each segmentation region.
  7. 如权利要求6所述的理赔单据的字符识别装置,其特征在于,所述识别模块还用于:
    调用预先确定的分析模型对获得的各个分割区域进行分析,以分析出可利用光学字符识别引擎识别的第一分割区域和不可利用光学字符识别引擎识别的第二分割区域;
    利用预先确定的光学字符识别引擎对各个所述第一分割区域进行字符识别,以识别出各个所述第一分割区域中的字符,并调用预先确定的识别模型对各个所述第二分割区域进行字符识别,以识别出各个所述第二分割区域中的字符。
  8. 如权利要求7所述的理赔单据的字符识别装置,其特征在于,所述预先确定的分析模型为卷积神经网络模型,所述预先确定的分析模型的训练过程如下:
    A、针对预先确定的理赔单据框架格式,获取预设数量的基于该理赔单据框架格式的理赔单据影像样本;
    B、对每一个理赔单据影像样本按照该理赔单据框架格式的框线排布进行区域分割,并确定出各个理赔单据影像样本中利用光学字符识别引擎识别错误的第三分割区域和利用光学字符识别引擎识别正确的第四分割区域;
    C、将所有第三分割区域归入第一训练集,将所有第四分割区域归入第二训练集;
    D、分别从所述第一训练集和所述第二训练集中提取出第一预设比例的分割区域作为待训练的分割区域,并将所述第一训练集和所述第二训练集中剩余的分割区域作为待验证的分割区域;
    E、利用提取的各个待训练的分割区域进行模型训练,以生成所述预先确定的分析模型,并利用各个待验证的分割区域对生成的所述预先确定的分析模型进行验证;
    F、若验证通过率大于或等于预设阈值,则训练完成,或者,若验证通过率小于预设阈值,则增加理赔单据影像样本的数量,并重复执行上述步骤A、B、C、D、E,直至验证通过率大于或等于预设阈值。
  9. 如权利要求7或8所述的理赔单据的字符识别装置,其特征在于,所述预先确定的识别模型为长短期记忆LSTM模型,所述预先确定的识别模型的训练过程如下:
    获取预设数量的分割区域样本,对各个分割区域样本以该分割区域样本所含字符来进行标注;
    将预设数量的分割区域样本按照预设比例分为第一数据集和第二数据集,并将所述第一数据集作为训练集,将所述第二数据集作为测试集;
    将所述第一数据集送入LSTM网络进行模型训练,每隔预设时间,使用训练得到的模型对所述第二数据集中的分割区域样本进行字符识别,并将识别的字符与该分割区域样本的标注进行比对,以计算识别的字符和标注的误差;
    若训练得到的模型识别字符的误差出现发散,则调整预设的训练参数并重新训练,直至使得训练得到的模型识别字符的误差能够收敛;
    若训练得到的模型识别字符的误差收敛,则结束模型训练,将生成的模型作为训练好的所述预先确定的识别模型。
  10. 如权利要求6所述的理赔单据的字符识别装置,其特征在于,所述分割区域是由该理赔单据框架格式的框线所围成的最小单位的区域,且所述分割区域为不包含框线的区域。
  11. A character recognition server for claim documents, wherein the character recognition server comprises a memory and a processor, a character recognition program for claim documents is stored on the memory, and the character recognition program, when executed by the processor, implements the following steps:
    after receiving an image of a claim document whose characters are to be recognized, the server performs region segmentation according to the frame line arrangement of the frame format of the claim document, obtaining one or more segmented regions;
    invoking a predetermined analysis model to analyze each obtained segmented region, and performing character recognition on each analyzed segmented region by using predetermined recognition rules, so as to recognize the characters in each segmented region.
  12. The character recognition server for claim documents according to claim 11, wherein the step of invoking a predetermined analysis model to analyze each obtained segmented region comprises:
    invoking the predetermined analysis model to analyze each obtained segmented region, so as to distinguish first segmented regions that can be recognized by an optical character recognition engine from second segmented regions that cannot be recognized by the optical character recognition engine;
    the step of performing character recognition on each analyzed segmented region by using predetermined recognition rules further comprises:
    performing character recognition on each of the first segmented regions by using a predetermined optical character recognition engine so as to recognize the characters in each of the first segmented regions, and invoking a predetermined recognition model to perform character recognition on each of the second segmented regions so as to recognize the characters in each of the second segmented regions.
  13. The character recognition server for claim documents according to claim 12, wherein the predetermined analysis model is a convolutional neural network model, and the training process of the predetermined analysis model is as follows:
    A. for a predetermined claim document frame format, acquiring a preset number of claim document image samples based on the claim document frame format;
    B. performing region segmentation on each claim document image sample according to the frame line arrangement of the claim document frame format, and determining, in each claim document image sample, third segmented regions recognized incorrectly by the optical character recognition engine and fourth segmented regions recognized correctly by the optical character recognition engine;
    C. assigning all third segmented regions to a first training set and all fourth segmented regions to a second training set;
    D. extracting a first preset proportion of segmented regions from the first training set and from the second training set as segmented regions to be trained, and taking the segmented regions remaining in the first training set and the second training set as segmented regions to be verified;
    E. performing model training with the extracted segmented regions to be trained so as to generate the predetermined analysis model, and verifying the generated predetermined analysis model with the segmented regions to be verified;
    F. if the verification pass rate is greater than or equal to a preset threshold, the training is completed; otherwise, if the verification pass rate is less than the preset threshold, increasing the number of claim document image samples and repeating steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
  14. The character recognition server for claim documents according to claim 12 or 13, wherein the predetermined recognition model is a long short-term memory (LSTM) model, and the training process of the predetermined recognition model is as follows:
    acquiring a preset number of segmented region samples, and annotating each segmented region sample with the characters it contains;
    dividing the preset number of segmented region samples into a first data set and a second data set according to a preset ratio, taking the first data set as a training set and the second data set as a test set;
    feeding the first data set into an LSTM network for model training, and, at preset time intervals, performing character recognition on the segmented region samples in the second data set with the trained model and comparing the recognized characters with the annotations of the segmented region samples, so as to calculate the error between the recognized characters and the annotations;
    if the character recognition error of the trained model diverges, adjusting the preset training parameters and retraining until the character recognition error of the trained model converges;
    if the character recognition error of the trained model converges, ending the model training and taking the generated model as the trained predetermined recognition model.
  15. The character recognition server for claim documents according to claim 11, wherein each segmented region is a minimum-unit region enclosed by the frame lines of the claim document frame format, and the segmented region is a region containing no frame lines.
  16. A computer-readable storage medium, wherein a character recognition program for claim documents is stored on the computer-readable storage medium, and the character recognition program for claim documents is executable by at least one processor to implement the following steps:
    after receiving an image of a claim document whose characters are to be recognized, a server performs region segmentation according to the frame line arrangement of the frame format of the claim document, obtaining one or more segmented regions;
    invoking a predetermined analysis model to analyze each obtained segmented region, and performing character recognition on each analyzed segmented region by using predetermined recognition rules, so as to recognize the characters in each segmented region.
  17. The computer-readable storage medium according to claim 16, wherein the step of invoking a predetermined analysis model to analyze each obtained segmented region comprises:
    invoking the predetermined analysis model to analyze each obtained segmented region, so as to distinguish first segmented regions that can be recognized by an optical character recognition engine from second segmented regions that cannot be recognized by the optical character recognition engine;
    the step of performing character recognition on each analyzed segmented region by using predetermined recognition rules further comprises:
    performing character recognition on each of the first segmented regions by using a predetermined optical character recognition engine so as to recognize the characters in each of the first segmented regions, and invoking a predetermined recognition model to perform character recognition on each of the second segmented regions so as to recognize the characters in each of the second segmented regions.
  18. The computer-readable storage medium according to claim 17, wherein the predetermined analysis model is a convolutional neural network model, and the training process of the predetermined analysis model is as follows:
    A. for a predetermined claim document frame format, acquiring a preset number of claim document image samples based on the claim document frame format;
    B. performing region segmentation on each claim document image sample according to the frame line arrangement of the claim document frame format, and determining, in each claim document image sample, third segmented regions recognized incorrectly by the optical character recognition engine and fourth segmented regions recognized correctly by the optical character recognition engine;
    C. assigning all third segmented regions to a first training set and all fourth segmented regions to a second training set;
    D. extracting a first preset proportion of segmented regions from the first training set and from the second training set as segmented regions to be trained, and taking the segmented regions remaining in the first training set and the second training set as segmented regions to be verified;
    E. performing model training with the extracted segmented regions to be trained so as to generate the predetermined analysis model, and verifying the generated predetermined analysis model with the segmented regions to be verified;
    F. if the verification pass rate is greater than or equal to a preset threshold, the training is completed; otherwise, if the verification pass rate is less than the preset threshold, increasing the number of claim document image samples and repeating steps A, B, C, D and E until the verification pass rate is greater than or equal to the preset threshold.
  19. The computer-readable storage medium according to claim 17 or 18, wherein the predetermined recognition model is a long short-term memory (LSTM) model, and the training process of the predetermined recognition model is as follows:
    acquiring a preset number of segmented region samples, and annotating each segmented region sample with the characters it contains;
    dividing the preset number of segmented region samples into a first data set and a second data set according to a preset ratio, taking the first data set as a training set and the second data set as a test set;
    feeding the first data set into an LSTM network for model training, and, at preset time intervals, performing character recognition on the segmented region samples in the second data set with the trained model and comparing the recognized characters with the annotations of the segmented region samples, so as to calculate the error between the recognized characters and the annotations;
    if the character recognition error of the trained model diverges, adjusting the preset training parameters and retraining until the character recognition error of the trained model converges;
    if the character recognition error of the trained model converges, ending the model training and taking the generated model as the trained predetermined recognition model.
  20. The computer-readable storage medium according to claim 16, wherein each segmented region is a minimum-unit region enclosed by the frame lines of the claim document frame format, and the segmented region is a region containing no frame lines.
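The recognition flow of claims 1, 2 and 5 (segment the document image along the frame lines of its frame format, let an analysis model route each region, then dispatch recognizable regions to an OCR engine and the rest to a recognition model) can be sketched as follows. This is a minimal illustration rather than the patented implementation: `segment_by_frame_lines`, `is_ocr_recognizable`, `ocr_engine` and `lstm_model` are hypothetical stand-ins, and the "image" is reduced to rows of labelled cells.

```python
def segment_by_frame_lines(image_rows):
    """Split a document 'image' (here: rows of cells) into the minimum-unit
    regions enclosed by frame lines (claims 1 and 5)."""
    return [cell for row in image_rows for cell in row]

def is_ocr_recognizable(region):
    # Stand-in for the analysis model of claim 2: here a region counts as
    # OCR-recognizable simply when it is flagged as printed text.
    return region["printed"]

def ocr_engine(region):
    # Stand-in for the predetermined OCR engine (first segmented regions).
    return region["text"].upper()

def lstm_model(region):
    # Stand-in for the predetermined recognition model (second segmented regions).
    return region["text"]

def recognize_claim_document(image_rows):
    """Route each segmented region to the OCR engine or the recognition model."""
    results = {}
    for region in segment_by_frame_lines(image_rows):
        if is_ocr_recognizable(region):
            results[region["id"]] = ocr_engine(region)
        else:
            results[region["id"]] = lstm_model(region)
    return results

doc = [
    [{"id": "name", "printed": True,  "text": "zhang san"}],
    [{"id": "note", "printed": False, "text": "hand-written remark"}],
]
print(recognize_claim_document(doc))
```

In a real system the routing decision would come from the trained convolutional network of claim 3 and the second branch from the trained LSTM of claim 4; only the dispatch structure is shown here.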
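Steps A to F of the analysis-model training (claims 3, 8, 13 and 18) form a loop: split the pooled regions by a preset proportion into regions to be trained and regions to be verified, train, measure the verification pass rate, and enlarge the sample set until the rate reaches a preset threshold. A minimal sketch of that loop, with the convolutional network replaced by a memorizing stub; `train_stub`, `more_samples` and the sample dictionaries are illustrative assumptions, not the patent's implementation:

```python
import random

def train_stub(train_set):
    # Stand-in for training the convolutional neural network of step E:
    # it simply memorizes the labels seen during training.
    seen = {s["x"]: s["label"] for s in train_set}
    return lambda s: seen.get(s["x"], "ocr_ok")

def train_until_pass_rate(samples, more_samples, split_ratio=0.5, threshold=0.9):
    rng = random.Random(0)                       # fixed seed: repeatable sketch
    pool = list(samples)
    while True:
        rng.shuffle(pool)
        k = int(len(pool) * split_ratio)         # step D: preset proportion
        train, verify = pool[:k], pool[k:]
        model = train_stub(train)                # step E: train, then verify
        pass_rate = sum(model(s) == s["label"] for s in verify) / len(verify)
        if pass_rate >= threshold:               # step F: threshold reached
            return pass_rate
        pool += more_samples.pop(0)              # step F: add image samples, repeat

samples = [{"x": i, "label": "ocr_ok"} for i in range(10)]
print(train_until_pass_rate(samples, []))        # -> 1.0
```

The claimed method labels regions by whether the OCR engine recognized them correctly (third vs. fourth segmented regions); the stub collapses that to a toy classification so the control flow of steps D to F stays visible.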
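The recognition-model training of claims 4, 9, 14 and 19 evaluates the character-recognition error on the test set at intervals and branches on its behaviour: retrain with adjusted parameters on divergence, stop on convergence. In this sketch the whole train-and-evaluate cycle is abstracted into `error_curves`, a hypothetical list of evaluation errors observed under successive parameter settings; the LSTM itself is not modelled:

```python
def train_recognition_model(error_curves, window=3):
    """Sketch of the claimed loop: inspect the periodic evaluation errors,
    restart with adjusted parameters when the error diverges, and return
    once a setting converges."""
    for attempt, curve in enumerate(error_curves):
        recent = curve[-window:]
        # "Diverges" here means the error kept growing over the last window.
        diverging = all(b > a for a, b in zip(recent, recent[1:]))
        if not diverging:
            # Error converged: keep this model (and its final error).
            return attempt, recent[-1]
        # Otherwise: adjust the preset training parameters and retrain,
        # i.e. move on to the next error curve.
    raise RuntimeError("no parameter setting converged")

# First setting diverges (error keeps growing); the second converges.
curves = [[0.9, 1.2, 1.8, 2.9], [0.9, 0.5, 0.31, 0.30]]
print(train_recognition_model(curves))   # -> (1, 0.3)
```

In practice the error at each checkpoint would come from recognizing the annotated segmented region samples of the test set and comparing the output against the annotations, as the claims describe.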
PCT/CN2017/091363 2017-04-11 2017-06-30 Character recognition method, device and server for claim documents, and storage medium WO2018188199A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
SG11201900263SA SG11201900263SA (en) 2017-04-11 2017-06-30 Method, device and server for recognizing characters of claim document, and storage medium
EP17899230.1A EP3432197B1 (en) 2017-04-11 2017-06-30 Method and device for identifying characters of claim settlement bill, server and storage medium
AU2017408799A AU2017408799B2 (en) 2017-04-11 2017-06-30 Method, device and server for recognizing characters for claim document, and storage medium
KR1020187023693A KR102171220B1 (ko) 2017-04-11 2017-06-30 클레임 서류의 문자 인식 방법, 장치, 서버 및 저장매체
US16/084,244 US10650231B2 (en) 2017-04-11 2017-06-30 Method, device and server for recognizing characters of claim document, and storage medium
JP2018536430A JP6710483B2 (ja) 2017-04-11 2017-06-30 損害賠償請求書類の文字認識方法、装置、サーバ及び記憶媒体

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710233613.3A 2017-04-11 2017-04-11 Character recognition method and server for claim documents CN107220648B (zh)
CN201710233613.3 2017-04-11

Publications (1)

Publication Number Publication Date
WO2018188199A1 true WO2018188199A1 (zh) 2018-10-18


Country Status (9)

Country Link
US (1) US10650231B2 (zh)
EP (1) EP3432197B1 (zh)
JP (1) JP6710483B2 (zh)
KR (1) KR102171220B1 (zh)
CN (1) CN107220648B (zh)
AU (1) AU2017408799B2 (zh)
SG (1) SG11201900263SA (zh)
TW (1) TWI621077B (zh)
WO (1) WO2018188199A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115981798A (zh) * 2023-03-21 2023-04-18 北京探境科技有限公司 File parsing method and apparatus, computer device, and readable storage medium

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798299B * 2017-10-09 2020-02-07 平安科技(深圳)有限公司 Bill information recognition method, electronic device, and readable storage medium
CN107766809B * 2017-10-09 2020-05-19 平安科技(深圳)有限公司 Electronic device, bill information recognition method, and computer-readable storage medium
CN108319641A * 2017-12-21 2018-07-24 无锡雅座在线科技股份有限公司 Dish information entry method and device
CN108198591A * 2017-12-28 2018-06-22 泰康保险集团股份有限公司 Method and device for remote review of documents
CN110135225B * 2018-02-09 2021-04-09 北京世纪好未来教育科技有限公司 Sample annotation method and computer storage medium
CN108595519A * 2018-03-26 2018-09-28 平安科技(深圳)有限公司 Hot event classification method, device, and storage medium
CN110321760A * 2018-03-29 2019-10-11 北京和缓医疗科技有限公司 Medical document recognition method and device
CN108564035B 2018-04-13 2020-09-25 杭州睿琪软件有限公司 Method and system for recognizing information recorded on documents
WO2019241897A1 * 2018-06-21 2019-12-26 Element Ai Inc. Data extraction from short business documents
CN109241857A * 2018-08-13 2019-01-18 杭州睿琪软件有限公司 Document information recognition method and system
CN109190594A * 2018-09-21 2019-01-11 广东蔚海数问大数据科技有限公司 Optical character recognition system and information extraction method
CN110569700B * 2018-09-26 2020-11-03 创新先进技术有限公司 Method and device for optimizing damage recognition results
CN109492549A * 2018-10-24 2019-03-19 杭州睿琪软件有限公司 Training sample set processing and model training method and system
EP3673412A1 * 2018-11-02 2020-07-01 Alibaba Group Holding Limited Monitoring multiple system indicators
CN109344838B * 2018-11-02 2023-11-24 长江大学 Method, system, and device for automatic and rapid recognition of invoice information
TWI684950B * 2018-12-12 2020-02-11 全友電腦股份有限公司 Species data analysis method, system, and computer program product
TWI703508B * 2018-12-19 2020-09-01 洽吧智能股份有限公司 Character image recognition method and system
CN109784341A * 2018-12-25 2019-05-21 华南理工大学 Medical document recognition method based on an LSTM neural network
JP2020027598A * 2018-12-27 2020-02-20 株式会社シグマクシス Character recognition device, character recognition method, and character recognition program
CN109903172A * 2019-01-31 2019-06-18 阿里巴巴集团控股有限公司 Claim information extraction method and device, and electronic equipment
CN110084704A * 2019-03-15 2019-08-02 北京水滴互联科技有限公司 Mutual aid guarantee server, system, and mutual aid guarantee method
SG10201904825XA * 2019-05-28 2019-10-30 Alibaba Group Holding Ltd Automatic optical character recognition (OCR) correction
CN110610175A * 2019-08-06 2019-12-24 深圳市华付信息技术有限公司 Method for cleaning mislabeled OCR data
US11481605B2 2019-10-25 2022-10-25 Servicenow Canada Inc. 2D document extractor
CN111291742B * 2020-02-10 2023-08-04 北京百度网讯科技有限公司 Object recognition method and device, electronic equipment, and storage medium
CN111539424A * 2020-04-21 2020-08-14 北京云从科技有限公司 OCR-based image processing method, system, equipment, and medium
US11972489B1 2020-04-24 2024-04-30 State Farm Mutual Automobile Insurance Company Claims process assistance using models
CN111259873B * 2020-04-26 2021-02-26 江苏联著实业股份有限公司 Table data extraction method and device
CN112686262A * 2020-12-28 2021-04-20 广州博士信息技术研究院有限公司 Method for extracting structured data from manuals and rapidly archiving them based on image recognition technology

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567764A * 2012-01-13 2012-07-11 中国工商银行股份有限公司 Bill voucher and system for improving electronic image recognition efficiency
CN103258198A * 2013-04-26 2013-08-21 四川大学 Method for extracting characters from table document images
US20140085669A1 * 2012-09-27 2014-03-27 Kyocera Document Solutions Inc. Image Processing Apparatus That Generates Hyperlink Structure Data
CN106557747A * 2016-11-15 2017-04-05 平安科技(深圳)有限公司 Method and device for recognizing insurance policy numbers

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04304586A 1991-04-01 1992-10-27 Mitsubishi Electric Corp Character recognition device
JP2003256772A * 2002-03-06 2003-09-12 Ricoh Co Ltd Character recognition device and recording medium
TW200802137A 2006-06-16 2008-01-01 Univ Nat Chiao Tung Serial-type license plate recognition system
TWI355853B 2008-04-25 2012-01-01 Hon Hai Prec Ind Co Ltd Image capturing device and image arranging method
KR101028670B1 * 2008-10-22 2011-04-12 엔에이치엔(주) Method, system, and computer-readable recording medium for recognizing a character string contained in a document by using a language model and OCR
JP4856235B2 * 2009-12-15 2012-01-18 富士通株式会社 Form recognition method and form recognition device
US8625113B2 * 2010-09-24 2014-01-07 Ricoh Company Ltd System and method for distributed optical character recognition processing
US9800895B2 2013-06-27 2017-10-24 Qualcomm Incorporated Depth oriented inter-view motion vector prediction
JP6773400B2 * 2014-09-30 2020-10-21 メディア株式会社 Form recognition device, form recognition system, program for the form recognition system, control method for the form recognition system, and recording medium storing the form recognition system program
US9659213B2 * 2015-07-03 2017-05-23 Cognizant Technology Solutions India Pvt. Ltd. System and method for efficient recognition of handwritten characters in documents
CN105654072B * 2016-03-24 2019-03-01 哈尔滨工业大学 System and method for automatic character extraction and recognition from low-resolution medical bill images
CN106446881B * 2016-07-29 2019-05-21 北京交通大学 Method for extracting test result information from medical laboratory report images
WO2018071403A1 * 2016-10-10 2018-04-19 Insurance Services Office, Inc. Systems and methods for optical character recognition for low-resolution documents
JP6401806B2 * 2017-02-14 2018-10-10 株式会社Pfu Date identification device, date identification method, and date identification program



Also Published As

Publication number Publication date
CN107220648B (zh) 2018-06-22
EP3432197B1 (en) 2022-07-06
US10650231B2 (en) 2020-05-12
EP3432197A4 (en) 2019-06-19
EP3432197A1 (en) 2019-01-23
KR102171220B1 (ko) 2020-10-29
KR20190026641A (ko) 2019-03-13
US20190147239A1 (en) 2019-05-16
TWI621077B (zh) 2018-04-11
JP6710483B2 (ja) 2020-06-17
JP2019520615A (ja) 2019-07-18
SG11201900263SA (en) 2019-02-27
CN107220648A (zh) 2017-09-29
TW201837788A (zh) 2018-10-16
AU2017408799B2 (en) 2019-10-10
AU2017408799A1 (en) 2018-11-08


Legal Events

ENP Entry into the national phase: Ref document number: 2018536430; Country of ref document: JP; Kind code of ref document: A
ENP Entry into the national phase: Ref document number: 20187023693; Country of ref document: KR; Kind code of ref document: A
WWE Wipo information: entry into national phase: Ref document number: 2017899230; Country of ref document: EP
ENP Entry into the national phase: Ref document number: 2017899230; Country of ref document: EP; Effective date: 20181018
ENP Entry into the national phase: Ref document number: 2017408799; Country of ref document: AU; Date of ref document: 20170630; Kind code of ref document: A
121 Ep: the epo has been informed by wipo that ep was designated in this application: Ref document number: 17899230; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase: Ref country code: DE