CN112926577B

CN112926577B - Medical bill image structuring method and device and computer readable medium

Info

Publication number: CN112926577B
Application number: CN202110193283.6A
Authority: CN
Inventors: 康帅兵; 褚一平; 陈建勇; 郑义; 朱华山; 郁星星; 张雪妮; 陈士春; 潘翔; 赵小敏; 郑河荣
Original assignee: Hangzhou Hailiang Information Technology Co ltd
Current assignee: Hangzhou Hailiang Information Technology Co ltd
Priority date: 2021-02-20
Filing date: 2021-02-20
Publication date: 2021-11-26
Anticipated expiration: 2041-02-20
Also published as: CN112926577A

Abstract

The invention discloses a medical bill image structuring method and device based on mean value clustering and character recognition, and a computer readable medium. The method comprises the following steps: step S1, performing OCR character recognition on the obtained medical bill image to obtain full-text character string information of the bill; step S2, performing KMeans clustering on the full-text character string information of the bill; step S3, determining the title position according to the clustering result, and extracting the entry data of the corresponding column according to the title position information; and step S4, performing validity check and correction on the entry data to obtain the structured data of the medical bill. The technical scheme of the invention can greatly improve the bill structuring effect.

Description

Medical bill image structuring method and device and computer readable medium

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a medical bill image structuring method and device based on mean value clustering and character recognition and a computer readable medium.

Background

In recent years, with the continuing development of medical informatization in China, the electronization of medical bills has become a trend. However, the reimbursing unit cannot directly obtain the user's detailed medical information, so the user must submit the original medical documents for reimbursement; reimbursement staff then enter them into the system manually and, after checking item by item, reimburse according to a specific reimbursement proportion and reimbursement amount. Manual entry has many drawbacks. On the one hand, entry errors and omissions are inevitable; on the other hand, a large amount of human resources must be devoted to highly repetitive work. This not only puts great pressure on medical staff but also makes the reimbursement process time-consuming, laborious and inefficient.

For automatic bill recognition, the character information in the image is recognized by OCR technology, and the text recognition result is then structured according to the structural information of the bill to form a detailed medical bill result. However, existing table recognition technology mainly relies on features such as table lines for segmentation to obtain the table structure information, and many medical bills have no table lines. Therefore, existing methods cannot complete the structuring process for such bills.

Disclosure of Invention

The invention aims to solve the technical problem of providing a medical bill image structuring method and device based on mean value clustering and character recognition, which can greatly improve the bill structuring effect.

In order to achieve the purpose, the invention adopts the following technical scheme:

a medical bill image structuring method based on mean value clustering and character recognition comprises the following steps:

step S1, performing OCR character recognition on the obtained medical bill image to obtain full-text character string information of the bill;

step S2, performing KMeans clustering on the full-text character string information of the bill;

step S3, determining title position information according to the clustering result, and extracting entry data of the corresponding column according to the title position information;

and step S4, carrying out validity check and correction on the entry data to obtain the structured data of the medical bill.

Preferably, the obtaining of the full-text character string information in step S1 includes:

preprocessing the medical bill image;

calculating the rotation angle of the preprocessed medical bill image;

performing rotation correction on the preprocessed medical bill image according to the rotation angle;

performing OCR recognition on the corrected medical bill image to obtain the full-text character string information of the bill, which comprises: character string content, character string coordinate positions, recognition confidence of each character string, and candidate characters;

and filtering the full text character string information of the bill.
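The rotation-angle calculation in the steps above can be illustrated with a small sketch. This passage does not say how the angle is computed (the detailed description only names a histogram method), so the code below assumes a projection-histogram search: for each candidate angle, foreground pixel coordinates are de-rotated and binned by row, and the angle giving the most concentrated histogram is chosen. The function name `estimate_skew_degrees`, the scoring rule and the synthetic data are all assumptions for illustration.

```python
import math

def estimate_skew_degrees(points, candidate_angles):
    """Return the candidate angle whose de-rotation gives the sharpest row histogram."""
    best_angle, best_score = None, -1
    for angle in candidate_angles:
        t = math.radians(angle)
        cos_t, sin_t = math.cos(t), math.sin(t)
        histogram = {}
        for x, y in points:
            # Row index of the pixel after rotating the image by -angle.
            row = round(y * cos_t - x * sin_t)
            histogram[row] = histogram.get(row, 0) + 1
        # The sum of squared bin counts is largest when pixels pile up in few rows.
        score = sum(count * count for count in histogram.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic "text lines": three baselines skewed by 3 degrees (illustrative data).
slope = math.tan(math.radians(3))
points = [(x, y0 + x * slope) for y0 in (0, 40, 80) for x in range(0, 200, 2)]
angle = estimate_skew_degrees(points, range(-5, 6))
```

Under these assumptions, the image would then be rotated by the negative of the estimated angle before OCR.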

Preferably, the clustering of the full-text character string information of the bill in step S2 includes:

step 2.1, extracting all the full-text character string information of the bill obtained in step S1, and initializing the vector of each character string with the left position of that character string;

step 2.2, initializing k = 10 center points, where k is the number of clusters in the final clustering result;

step 2.3, randomly selecting k points as initial clustering centers, calculating the distance from each character string vector to each clustering center,

wherein x and y are the coordinate values of the left position of the character,

step 2.4, comparing the distances from each character string vector to the cluster centers, and assigning the vector to the cluster whose center is closest;

step 2.5, recalculating each cluster center until convergence, and outputting the clustering result, wherein the objective function is as follows:

where u_i is the mean of all points in S_i, and S_i denotes the i-th cluster.

Preferably, in step S3, column segmentation is performed according to the clustering result and the OCR full-text recognition data, and the attributes corresponding to each column of data are counted, the attributes including the number of Chinese characters, the number of digits, and the sum of the numbers; semantic analysis is then performed on the attributes to determine the title position information.

Preferably, in step S3, extracting the entry data of the corresponding column specifically includes: sequentially extracting data in the vertical direction according to the coordinates of the left and right boundaries of the title; matching, for each extracted item name position, the amount information at the nearest vertical distance, so as to extract the unit price or quantity; and merging item name data that wraps onto a new line and meets the merging condition, to obtain the specific entry data.

A medical bill image structuring device based on mean value clustering and character recognition comprises:

the recognition module is used for carrying out OCR character recognition on the acquired medical bill image to obtain the full-text character string information of the bill;

the clustering module is used for performing KMeans clustering on the full-text character string information of the bill;

the extraction module is used for determining title position information according to the clustering result and extracting the entry data of the corresponding column according to the title position information;

and the correction module is used for carrying out validity check and correction on the entry data to obtain the structured data of the medical bill.

A computer readable medium having stored thereon instructions which, when executed by a processor, implement the steps of the medical bill image structuring method based on mean value clustering and character recognition.

The invention first obtains character strings using OCR technology, then performs cluster analysis on the character strings according to their coordinates to determine the column data. At the same time, according to the semantic features of the medical bill, including title keywords, Chinese character statistics and numerical statistics, four fields are extracted from the medical bill: item name, unit price, quantity and total price. In addition, to ensure that the structured output data is accurate, the invention adds multiple check rules, automatically identifies character information with lower confidence based on the internal logical relationships between the fields, and heuristically corrects data that may be wrong according to that internal logic. Finally, by integrating these approaches, the data can be checked and corrected quickly, providing a complete, fast and accurate data basis for medical insurance reimbursement.

Drawings

FIG. 1 is a flow chart of the medical bill image structuring method of the present invention;

FIG. 2 is a flow chart of string clustering based on left position X coordinates;

FIG. 3 is a title location flow diagram;

FIG. 4 is a digital adaptive correction flow chart;

FIG. 5 is a schematic structural diagram of the medical bill image structuring device of the present invention.

Detailed Description

In order to better explain the technical scheme of the invention, the invention is further described in detail by combining the drawings and the specific embodiment. It should be noted that the embodiments described herein are only for illustrating and explaining the present invention, and are not to be construed as limiting the present invention.

As shown in fig. 1, an embodiment of the present invention provides a medical bill image structuring method based on mean value clustering and character recognition, including the following steps:

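step S1, performing OCR character recognition on the obtained medical bill image to obtain full-text character string information of the bill;

step S2, performing KMeans clustering on the full-text character string information of the bill;

step S3, determining the title position according to the clustering result, and extracting the entry data of the corresponding column according to the title position information;

step S4, performing validity check and correction on the entry data to obtain the structured data of the medical bill.

Further, the obtaining of the full-text character string information of the bill in step S1 includes: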

step 1.1, preprocessing the medical bill image, including cropping, binarization and scaling: the image is cropped to prevent black borders from affecting the calculation of the rotation angle; adaptive threshold binarization is applied to the cropped image; and the binarized image is scaled to one quarter of the size of the cropped image, which further increases the image processing speed;

step 1.2, calculating the rotation angle of the preprocessed medical bill image by a histogram method;

step 1.3, performing rotation correction on the preprocessed medical bill image according to the rotation angle;

step 1.4, performing OCR recognition on the corrected medical bill image to obtain the full-text character string information of the bill, which comprises: character string content, character string coordinate positions, recognition confidence of each character string, and candidate characters;

step 1.5, filtering the full-text character string information of the bill, wherein the filtering comprises removing the subtotal, total and grand total character strings.
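As a minimal sketch of steps 1.4 and 1.5, the record below holds the four pieces of information listed for each recognized character string, and the filter drops summary strings. The class name `OcrString`, its field layout, and the Chinese keywords used to detect subtotal, total and grand total strings are assumptions made for illustration; the text does not specify them.

```python
from dataclasses import dataclass, field

@dataclass
class OcrString:
    """One recognized character string (hypothetical layout)."""
    text: str                 # character string content
    left: float               # coordinate position: bounding box
    top: float
    right: float
    bottom: float
    confidence: float         # recognition confidence of the string
    candidates: list = field(default_factory=list)  # candidate characters per position

# Assumed keywords for subtotal, total and grand total strings.
SUMMARY_KEYWORDS = ("小计", "合计", "总计")

def filter_summary_strings(strings):
    """Drop strings that contain any summary keyword (step 1.5)."""
    return [s for s in strings if not any(k in s.text for k in SUMMARY_KEYWORDS)]

# Illustrative data only.
ocr_strings = [
    OcrString("血常规检查", 40, 100, 160, 120, 0.97),
    OcrString("20.00", 400, 100, 450, 120, 0.99),
    OcrString("合计", 40, 300, 80, 320, 0.98),
]
kept = filter_summary_strings(ocr_strings)
```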

Further, in step S2, because the columns of a medical bill are clearly separated from one another, setting k to 10 when initializing the clustering yields a good clustering effect even if the header contains some interference information. The clustering sample is the x coordinate of the center point of each character string recognized by OCR, and the initial number of clusters is set to 10. The clustering result is the class label of each character string, and the cluster mean is the center point coordinate of the corresponding class.

Clustering the full-text character string information of the bill, as shown in fig. 2, specifically comprises the following steps:

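step 2.1, extracting all the character strings obtained in step S1, and initializing the vector of each character string with the left position of that character string;

step 2.2, initializing k = 10 center points, where k is the number of clusters in the final clustering result;

step 2.3, randomly selecting k points as initial cluster centers, and calculating the distance from each character string vector to each cluster center, where x and y are the coordinate values of the left position of the character string;

step 2.4, comparing the distances from each character string vector to the cluster centers, and assigning the vector to the cluster whose center is closest;

step 2.5, recalculating each cluster center until convergence, and outputting the clustering result, where u_i is the mean vector of all points in S_i, and S_i denotes the i-th cluster;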
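The clustering procedure can be sketched in a few lines of plain Python. The distance and objective formulas are not reproduced in this text, so the sketch assumes the usual K-Means choices: Euclidean distance between a point and a cluster center, and minimization of the summed squared distance of each point to the mean u_i of its cluster S_i. The function name `kmeans`, the `seed` and `max_iter` parameters and the sample coordinates are illustrative assumptions. The y value is fixed at 0 in the example because the text states that the x coordinate is the clustering sample.

```python
import math
import random

def kmeans(points, k=10, seed=0, max_iter=100):
    """Plain K-Means on (x, y) tuples; returns (labels, centers)."""
    rng = random.Random(seed)
    k = min(k, len(points))            # cannot have more clusters than points
    centers = rng.sample(points, k)    # randomly select k points as initial centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign each point to the cluster with the nearest center
        # (Euclidean distance is an assumption).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:     # converged: centers no longer move
            break
        centers = new_centers
    return labels, centers

# Left x positions of strings from three well-separated columns (illustrative data).
points = [(12.0, 0.0), (10.0, 0.0), (11.0, 0.0),
          (200.0, 0.0), (203.0, 0.0), (198.0, 0.0),
          (400.0, 0.0), (402.0, 0.0), (399.0, 0.0)]
labels, centers = kmeans(points, k=3)
```

Because the initial centers are chosen at random, a run can settle in a local optimum; starting with more clusters than expected columns, as the text does with k = 10, is one way to reduce the chance of merging two columns.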

Further, in step S3, column segmentation is performed according to the clustering result and the OCR full-text recognition data; that is, the attributes corresponding to each column (class) of data are counted, the attributes including the number of Chinese characters, the number of digits, and the sum of the numbers. Semantic analysis is then performed on the segmentation information to determine the title position information. Two cases are distinguished: the bill has a title row, or it does not.

In the first case, where title row information exists, the title position is searched for using the title candidate characters and is verified against the clustering result, namely that the item name column contains the most Chinese characters and that an amount column contains more digits than Chinese characters. If the title position located by keyword differs substantially from the clustering result, the corresponding adjacent column is verified and judged again.

In the second case, where no title row information exists, the item name and the amount are determined according to the class attributes: the column with the most Chinese characters is the item name column, and the column with the most amounts is the amount column. When these statistics are computed, the influence of subtotal, total and grand total strings is excluded. After the item name and amount positions are located, the data in the same row is checked, and the unit price or quantity column is obtained by checking the numeric part of the same row.

Finally, after the title position information is located, the left and right boundaries of the title are determined from the information of the corresponding column: all data in the column is traversed to find the widest entry, which must also lie within 30 pixels of the cluster center.
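The boundary rule can be sketched as follows. The text does not say which point of an entry is compared with the cluster center, so this sketch assumes the entry's left edge is used; the function name `column_bounds` and the sample extents are hypothetical.

```python
def column_bounds(entries, cluster_center_x, max_offset=30):
    """entries: (left, right) x extents of the strings in one column.

    Returns the extents of the widest entry whose left edge lies within
    max_offset pixels of the cluster center, or None if there is none.
    """
    near = [(left, right) for left, right in entries
            if abs(left - cluster_center_x) <= max_offset]
    if not near:
        return None
    return max(near, key=lambda e: e[1] - e[0])

# Illustrative data: the last entry belongs to another column and is ignored.
bounds = column_bounds([(100, 180), (102, 260), (98, 150), (300, 380)], 100)
```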

Determining the title position in step S3, as shown in FIG. 3, specifically includes the following steps:

step 3.1, classifying the full-text results according to the clustering result, and recording the mean value of each class;

step 3.2, counting the attributes corresponding to each class of data, the attributes including the number of Chinese characters, the number of digits and the sum of the numbers, and at the same time obtaining the title keywords of each column, which are used to identify the semantics of the column;

step 3.3, locating the title keywords in the full-text result, where the title keywords include the item name and item code words; a column containing such a keyword is defined as an item name candidate column, and a column containing an amount keyword is defined as an amount candidate column. Typically, the sum of the amounts is much larger than the sum of the unit prices, so columns that cannot be semantically recognized by keyword are identified according to their numerical sums, which distinguishes the amount column;

step 3.4, matching the column data obtained by semantic recognition according to the consistency of row positions, to obtain the item name, amount, and unit price or quantity of each row.
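The attribute statistics of step 3.2 and the fallback used when no title keyword is available can be sketched as below. The function names, the regular expressions and the sample column contents are assumptions for illustration. The rule applied is the one stated in the text: the column with the most Chinese characters is taken as the item name column, and among the numeric columns the one with the largest numerical sum is taken as the amount column.

```python
import re

CHINESE = re.compile(r"[\u4e00-\u9fff]")
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def column_attributes(texts):
    """Count Chinese characters and digits, and sum the purely numeric strings."""
    chinese = sum(len(CHINESE.findall(t)) for t in texts)
    digits = sum(ch in "0123456789" for t in texts for ch in t)
    numeric_sum = sum(float(t) for t in texts if NUMBER.fullmatch(t))
    return {"chinese": chinese, "digits": digits, "sum": numeric_sum}

def assign_roles(columns):
    """columns: {cluster label: [texts]}; returns (item name column, amount column)."""
    attrs = {cid: column_attributes(texts) for cid, texts in columns.items()}
    name_col = max(attrs, key=lambda c: attrs[c]["chinese"])
    # A numeric column has more digits than Chinese characters.
    numeric = [c for c in attrs
               if c != name_col and attrs[c]["digits"] > attrs[c]["chinese"]]
    amount_col = max(numeric, key=lambda c: attrs[c]["sum"]) if numeric else None
    return name_col, amount_col

# Illustrative clustered columns (contents are made up).
columns = {
    0: ["项目名称", "阿莫西林胶囊", "血常规检查"],
    1: ["单价", "12.50", "20.00"],
    2: ["数量", "2", "1"],
    3: ["金额", "25.00", "20.00"],
}
roles = assign_roles(columns)
```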

Extracting the entry data of the corresponding column in step S3 specifically includes: first, sequentially extracting data in the vertical direction according to the coordinates of the left and right boundaries of the title; then, for each extracted item name position, matching the amount information at the nearest vertical distance so as to extract the unit price or quantity; and finally, identifying item name data that wraps onto a new line and merging the data that meets the merging condition, to obtain the specific entry data.
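A sketch of the nearest-distance matching and the merging of wrapped item names follows. The merging condition is not spelled out in the text, so this sketch assumes one: an item name line with no amount within `row_tolerance` pixels vertically is treated as the continuation of the previous item name. The function name, the tolerance value and the sample data are hypothetical.

```python
def build_entries(names, amounts, row_tolerance=10):
    """names, amounts: lists of (y, text), sorted from top to bottom."""
    entries = []
    for y, text in names:
        # Amount whose vertical position is closest to this item name.
        nearest = min(amounts, key=lambda a: abs(a[0] - y), default=None)
        if nearest is not None and abs(nearest[0] - y) <= row_tolerance:
            entries.append({"name": text, "amount": nearest[1]})
        elif entries:
            # Assumed merging condition: no amount on this line, so the text
            # is a wrapped continuation of the previous item name.
            entries[-1]["name"] += text
        # A leading line without an amount and without a predecessor is skipped.
    return entries

# Illustrative data: the second name line is a wrapped continuation.
names = [(100, "阿莫西林"), (118, "胶囊"), (140, "血常规检查")]
amounts = [(101, "25.00"), (141, "20.00")]
entries = build_entries(names, amounts)
```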

Further, performing the validity check and matching correction on the entry data in step S4 includes:

step 4.1, checking the number of decimal places of the numbers in the extracted entry data. The amount, unit price and quantity columns each have a consistent number of decimal places. The algorithm therefore first counts the digits after the decimal point and determines the precision of the numbers in each column. If no decimal point is recognized in an entry, a decimal point is inserted according to the number of digits, which improves recognition accuracy. As shown in FIG. 4, specifically: first, Chinese characters, dates and special characters are removed; second, the number of occurrences of the decimal point in the column is counted, and when it exceeds one half of the total number of rows, the position of the decimal point within each character string (counted from right to left) is tallied; then, the average of the first half of the tallied counts (sorted from largest to smallest) is taken as the decimal point position of the column, and the decimal point is filled in for data that lacks it;
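The decimal-point check can be sketched as follows. As a simplification, this sketch takes the most frequent decimal position in the column instead of the averaging rule described above; that substitution, the function name `fill_decimal_points` and the sample values are assumptions. The input is assumed to have already been stripped of Chinese characters, dates and special characters.

```python
from collections import Counter

def fill_decimal_points(values):
    """values: numeric strings from one column; returns a corrected copy."""
    # Number of digits after the decimal point, i.e. its position from the right.
    positions = [len(v) - v.index(".") - 1 for v in values if "." in v]
    if len(positions) * 2 <= len(values):
        return list(values)  # decimal points appear in at most half the rows
    places = Counter(positions).most_common(1)[0][0]
    fixed = []
    for v in values:
        if "." not in v and v.isdigit() and places > 0 and len(v) > places:
            v = v[:-places] + "." + v[-places:]  # insert the missing decimal point
        fixed.append(v)
    return fixed

# Illustrative data: "2000" lost its decimal point during recognition.
corrected = fill_decimal_points(["12.50", "2000", "3.20", "45.00"])
```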

step 4.2, if the amount, the unit price and the quantity are all recognized in the current row, performing a consistency check; if the check fails, selecting the two values with the highest confidence according to the summed recognition confidence of each character string, and back-calculating the remaining value from them to ensure data consistency;
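A sketch of the consistency check follows. The text does not state the relationship being checked, so the sketch assumes amount = unit price × quantity, compared after rounding to two decimal places. The function name `reconcile`, the rounding mode and the sample values are illustrative assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def reconcile(amount, price, quantity):
    """Each argument is a (Decimal value, confidence) pair.

    Assumes amount = price * quantity. If the check fails, the value with the
    lowest confidence is recomputed from the other two.
    """
    (a, ca), (p, cp), (q, cq) = amount, price, quantity
    if (p * q).quantize(CENT, ROUND_HALF_UP) == a.quantize(CENT, ROUND_HALF_UP):
        return a, p, q  # already consistent
    weakest = min(("amount", ca), ("price", cp), ("quantity", cq),
                  key=lambda item: item[1])[0]
    if weakest == "amount":
        a = (p * q).quantize(CENT, ROUND_HALF_UP)
    elif weakest == "price" and q != 0:
        p = (a / q).quantize(CENT, ROUND_HALF_UP)
    elif weakest == "quantity" and p != 0:
        q = a / p
    return a, p, q

# Illustrative data: the unit price was misread and has the lowest confidence.
fixed = reconcile((Decimal("25.00"), 0.99), (Decimal("72.50"), 0.60), (Decimal("2"), 0.95))
```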

step 4.3, matching and checking the recognition result against a medical bill semantic dictionary (the dictionary contains medical-specific titles and common combination information), and correcting the data of specific titles by combining the candidate characters recognized by OCR.

The invention combines mean value clustering with common OCR recognition technology to accurately locate the title position and then extract the corresponding entry data; combined with a multi-rule matching engine, it can accurately recognize the name, amount, unit price and quantity of each item. This saves entry time for medical staff, avoids a large amount of repetitive labor, increases entry speed, achieves high recognition accuracy, avoids errors caused by manual entry, improves work efficiency and saves social resources.
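The dictionary-based correction can be sketched as a search over the OCR candidate characters. The data layout (one non-empty, best-first list of candidate characters per position), the function name, the `max_variants` limit and the tiny dictionary are assumptions for illustration; the text does not describe the dictionary format or the matching procedure.

```python
from itertools import product

def correct_with_dictionary(candidates_per_char, dictionary, max_variants=10000):
    """Return a dictionary word spelled by the candidates, else the top reading.

    candidates_per_char: for each character position, a non-empty list of
    candidate characters with the best candidate first.
    """
    recognized = "".join(c[0] for c in candidates_per_char)
    if recognized in dictionary:
        return recognized
    total = 1
    for c in candidates_per_char:
        total *= len(c)
    if total > max_variants:
        return recognized  # too many combinations to enumerate
    for combo in product(*candidates_per_char):
        word = "".join(combo)
        if word in dictionary:
            return word
    return recognized

# Illustrative data: the second character was misread, but the correct
# character is among its candidates.
title_dictionary = {"项目名称", "金额", "单价", "数量"}
title = correct_with_dictionary([["项"], ["日", "目"], ["名"], ["称", "秤"]], title_dictionary)
```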

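As shown in FIG. 5, an embodiment of the present invention further provides a medical bill image structuring device based on mean value clustering and character recognition, including:

the recognition module, used for performing OCR character recognition on the acquired medical bill image to obtain the full-text character string information of the bill;

the clustering module, used for performing KMeans clustering on the full-text character string information of the bill;

the extraction module, used for determining title position information according to the clustering result and extracting the entry data of the corresponding column according to the title position information;

the correction module, used for performing validity check and correction on the entry data to obtain the structured data of the medical bill.

Embodiments of the present invention also provide a computer readable medium having stored thereon instructions that, when executed by a processor, implement the steps of the medical bill image structuring method based on mean value clustering and character recognition of the present invention.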

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed by a computer, they cause the computer to perform, in whole or in part, the procedures or functions described in the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on a computer readable medium or transmitted from one computer readable medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer readable medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It will be understood by those skilled in the art that all or part of the steps of the method in the above embodiments may be implemented by a program, and the program may be stored in a computer-readable medium, where the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disc, or any combination thereof.

The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A medical bill image structuring method based on mean value clustering and character recognition is characterized by comprising the following steps:

step S1, performing OCR character recognition on the acquired medical bill image to obtain full-text character string information of the bill;

step S2, performing KMeans clustering on the full-text character string information of the bill, comprising the following steps:

step 2.2, initializing k = 10 center points, where k is the number of clusters in the final clustering result,

where u_i is the mean of all points in S_i, and S_i denotes the i-th cluster;

step S3, determining title position information according to the clustering result, and extracting entry data where the corresponding column is located according to the title position information;

in step S3, column segmentation is performed according to the clustering result and the OCR full-text recognition data, and the attributes corresponding to each column of data are counted, the attributes including the number of Chinese characters, the number of digits, and the sum of the numbers; semantic analysis is performed on the attributes to determine the title position information;

in step S3, extracting the entry data of the corresponding column specifically includes: sequentially extracting data in the vertical direction according to the coordinates of the left and right boundaries of the title; matching, for each extracted item name position, the amount information at the nearest vertical distance, so as to extract the unit price or quantity; and merging item name data that wraps onto a new line and meets the merging condition, to obtain the specific entry data;

2. The medical bill image structuring method based on mean value clustering and character recognition according to claim 1, wherein the obtaining of the full-text character string information in the step S1 comprises:

preprocessing the medical bill image;

calculating the rotation angle of the preprocessed medical bill image;

performing OCR recognition on the corrected medical bill image to obtain the full-text character string information of the bill, which comprises: character string content, character string coordinate positions, recognition confidence of each character string, and candidate characters;

and filtering the full text character string information of the bill.

3. A medical bill image structuring device based on mean value clustering and character recognition is characterized by comprising:

the clustering module is used for performing KMeans clustering on the full-text character string information of the bill, specifically by:

extracting all the full-text character string information of the bill, and initializing the vector of each character string with the left position of that character string;

initializing k = 10 center points, where k is the number of clusters in the final clustering result,

randomly selecting k points as initial cluster centers, and calculating the distance from each character string vector to each cluster center,

comparing the distances from each character string vector to the cluster centers, and assigning the vector to the cluster whose center is closest;

recalculating each cluster center until convergence, and outputting the clustering result, wherein the objective function is as follows:

where u_i is the mean of all points in S_i, and S_i denotes the i-th cluster;

the extraction module is used for determining title position information according to the clustering result and extracting the entry data of the corresponding column according to the title position information; column segmentation is performed according to the clustering result and the OCR full-text recognition data, and the attributes corresponding to each column of data are counted, the attributes including the number of Chinese characters, the number of digits, and the sum of the numbers; semantic analysis is performed on the attributes to determine the title position information; the entry data of the corresponding column is extracted by: sequentially extracting data in the vertical direction according to the coordinates of the left and right boundaries of the title; matching, for each extracted item name position, the amount information at the nearest vertical distance, so as to extract the unit price or quantity; and merging item name data that wraps onto a new line and meets the merging condition, to obtain the specific entry data;

4. A computer readable medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the method of any of claims 1-2.