CN113705733A - Medical bill image processing method and device, electronic device and storage medium - Google Patents

Medical bill image processing method and device, electronic device and storage medium

Info

Publication number
CN113705733A
Authority
CN
China
Prior art keywords
text
medical bill
model
bill picture
target medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111148275.6A
Other languages
Chinese (zh)
Inventor
杨紫崴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ping An Medical Health Technology Service Co Ltd
Original Assignee
Ping An Medical and Healthcare Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Medical and Healthcare Management Co Ltd filed Critical Ping An Medical and Healthcare Management Co Ltd
Priority to CN202111148275.6A
Publication of CN113705733A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to the fields of artificial intelligence and intelligent medical treatment, and discloses a medical bill image processing method and device, an electronic device, and a storage medium. The medical bill image processing method includes: acquiring text position information and text content information of the text in a target medical bill picture; and inputting the target medical bill picture, the text content information, and the text position information into a self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and the field type corresponding to each structured text field. The self-attention mechanism model is obtained by training a machine translation model based on the self-attention mechanism together with a document understanding pre-training model on a medical bill picture sample set containing multiple layouts. The invention thereby addresses the technical problems in the related art, such as the low efficiency and low accuracy of existing schemes for extracting information from medical bills caused by their complicated and varied layouts.

Description

Medical bill image processing method and device, electronic device and storage medium
Technical Field
The invention relates to the fields of artificial intelligence and intelligent medical treatment, and in particular to a medical bill image processing method and device, an electronic device, and a storage medium.
Background
With the development and progress of science and technology, artificial intelligence technology has gradually matured, and automatic identification of medical invoices in commercial insurance claims and medical insurance reimbursement has become a mainstream direction in the industry. Automatic identification of medical invoices effectively reduces labor costs and improves service efficiency.
In current medical invoice recognition systems, after text detection and text recognition are performed based on deep learning, customized data structuring must be developed for the different layouts used in different regions, and each layout requires its own set of data post-processing methods. This customized approach to data structuring has significant drawbacks in terms of data volume, research and development manpower, annotation manpower, and development cycle. For example, where data is lacking, it is difficult to develop a system that covers layouts nationwide; covering the medical bill formats released in real time across regions requires a large investment of research and development manpower and incurs great cost. Moreover, when a traditional medical invoice identification scheme encounters problems such as unstable text positions or poor image quality in a layout, the accuracy of its data structuring is low.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
The embodiments of the invention provide a medical bill image processing method and device, an electronic device, and a storage medium, which at least solve the technical problems in the related art, such as the low efficiency and low accuracy of existing schemes for extracting information from medical bills caused by their complicated and varied layouts.
According to an embodiment of the invention, a medical bill image processing method is provided, including: acquiring text position information and text content information of the text in a target medical bill picture; and inputting the target medical bill picture, the text content information, and the text position information into a self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and the field type corresponding to each structured text field. The self-attention mechanism model is obtained by training a machine translation model based on the self-attention mechanism together with a document understanding pre-training model on a medical bill picture sample set containing multiple layouts.
Optionally, acquiring the text position information and text content information of the text in the target medical bill picture includes: inputting the target medical bill picture into a pre-trained text detection model for text box detection, determining at least one text box containing text in the target medical bill picture, and marking the text box in the target medical bill picture; and inputting the marked target medical bill picture into a pre-trained text recognition model for text recognition to obtain the text position information and text content information corresponding to the text box.
Optionally, the text recognition model includes a convolutional neural network, a long short-term memory (LSTM) model, and a neural-network-based temporal classification (CTC) model. Inputting the marked target medical bill picture into the pre-trained text recognition model to obtain the text position information and text content information corresponding to the text box includes: inputting the marked target medical bill picture into the convolutional neural network for feature extraction to obtain image convolution features corresponding to the marked target medical bill picture; inputting the image convolution features into the LSTM model for feature extraction to obtain sequence features corresponding to the image convolution features; and inputting the sequence features into the CTC model for text alignment to obtain the text content information in the text box and the text position information corresponding to each piece of text content.
Optionally, before the target medical bill picture, the text content information, and the text position information are input into the self-attention mechanism model for feature learning to obtain the at least one structured text field and the field type corresponding to each structured text field, the method further includes: collecting the medical bill picture sample set containing multiple layouts; sequentially inputting the medical bill picture sample set into a pre-trained text detection model and a pre-trained text recognition model, and extracting the text position information and text content information of the text in each medical bill picture in the sample set; and inputting the medical bill picture sample set, together with the text position information and text content information of the text in each medical bill picture, into a self-attention-based pre-training model for training to obtain the self-attention mechanism model. The self-attention-based pre-training model is obtained by connecting a preset machine translation model and a preset document understanding pre-training model in sequence.
Optionally, inputting the target medical bill picture, the text content information, and the text position information into the self-attention mechanism model for feature learning to obtain the at least one structured text field and its field type includes: inputting the target medical bill picture, the text content information, and the text position information into the self-attention mechanism model; performing feature extraction on them with the machine translation model based on the self-attention mechanism to generate a feature combination vector corresponding to the target medical bill picture; and performing a pre-training task on the feature combination vector with the document understanding pre-training model to obtain the at least one structured text field in the target medical bill picture and the field type corresponding to each structured text field.
Optionally, performing feature extraction on the target medical bill picture, the text content information, and the text position information with the machine translation model based on the self-attention mechanism to generate the feature combination vector includes: extracting an image feature vector corresponding to the target medical bill picture using global average pooling and linear projection; extracting a text content feature vector corresponding to the text content information using text splitting; extracting a text position feature vector corresponding to the text position information, together with a relative position feature vector between different texts, by constructing a coordinate system for the text box corresponding to the text position information; and combining the image feature vector, the text content feature vector, the text position feature vector, and the relative position feature vector to obtain the feature combination vector.
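The construction of the feature combination vector can be sketched in NumPy. This is an illustrative approximation under stated assumptions, not the patent's implementation: a random matrix stands in for the learned linear projection, box coordinates are normalized to a 0-1000 grid in the style of layout-aware pre-training models, and plain concatenation stands in for the feature combination.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_feature(img, proj):
    """Global average pooling over H and W, then a linear projection."""
    pooled = img.mean(axis=(0, 1))          # (C,)
    return pooled @ proj                    # (d_model,)

def position_feature(box, img_w, img_h):
    """Normalize an (x0, y0, x1, y1) text box to a 0-1000 grid,
    as in layout-aware document pre-training models (an assumption)."""
    x0, y0, x1, y1 = box
    return np.array([1000 * x0 / img_w, 1000 * y0 / img_h,
                     1000 * x1 / img_w, 1000 * y1 / img_h])

d_model = 8
img = rng.random((32, 100, 3))              # toy bill image, H x W x C
proj = rng.random((3, d_model))             # stand-in for the learned projection
text_vec = rng.random(d_model)              # stand-in for a token embedding
pos_vec = position_feature((10, 4, 60, 20), img_w=100, img_h=32)

# Feature combination: concatenate image, text-content, and position features.
combined = np.concatenate([image_feature(img, proj), text_vec, pos_vec])
```

A relative position feature between two boxes could be formed the same way from coordinate differences; the patent does not fix the exact combination operator.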
Optionally, performing the pre-training task on the feature combination vector with the document understanding pre-training model to obtain the at least one structured text field and its field type includes: inputting the feature combination vector into the document understanding pre-training model; obtaining matching information by judging whether the text content information matches the target medical bill picture, and/or obtaining text blackening judgment information by judging whether the text content information is blackened, and/or obtaining text occlusion judgment information by judging whether the text content information is occluded; fusing at least one of the matching information, the text blackening judgment information, and the text occlusion judgment information with the feature combination vector to obtain a feature fusion vector; learning the context of the target medical bill picture from the feature fusion vector to obtain a modal alignment relationship between the at least one structured text field and its field type; and outputting the at least one structured text field and the field type corresponding to each structured text field.
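One way to read the fusion step is as attaching the auxiliary task signals (text-picture matching, blackening, occlusion) to the feature combination vector. The sketch below is a deliberate simplification under assumed details (a real model would fuse learned task representations rather than raw 0/1 flags); it is offered only to make the "and/or" data flow concrete.

```python
import numpy as np

def fuse_features(combo_vec, is_matched=None, is_blackened=None, is_occluded=None):
    """Append whichever of the three pre-training task signals are
    available ('and/or' in the text) to the feature combination
    vector, yielding the feature fusion vector."""
    extras = [float(flag) for flag in (is_matched, is_blackened, is_occluded)
              if flag is not None]
    return np.concatenate([combo_vec, np.array(extras)])

combo = np.zeros(6)
# Text matches the picture and is not blackened; occlusion was not judged.
fused = fuse_features(combo, is_matched=True, is_blackened=False)
```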
According to an embodiment of the present invention, a medical bill image processing apparatus is provided, including: an acquisition module configured to acquire the text position information and text content information of the text in a target medical bill picture; and a learning module configured to input the target medical bill picture, the text content information, and the text position information into a self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and the field type corresponding to each structured text field. The self-attention mechanism model is obtained by training a machine translation model based on the self-attention mechanism together with a document understanding pre-training model on a medical bill picture sample set containing multiple layouts.
Optionally, the acquisition module includes: a determining unit configured to input the target medical bill picture into a pre-trained text detection model for text box detection, determine at least one text box containing text in the target medical bill picture, and mark the text box in the target medical bill picture; and a first learning unit configured to input the marked target medical bill picture into a pre-trained text recognition model for text recognition to obtain the text position information and text content information corresponding to the text box.
Optionally, the text recognition model includes a convolutional neural network, a long short-term memory (LSTM) model, and a neural-network-based temporal classification (CTC) model, and the first learning unit includes: a first extraction subunit configured to input the marked target medical bill picture into the convolutional neural network for feature extraction to obtain image convolution features corresponding to the marked target medical bill picture; a second extraction subunit configured to input the image convolution features into the LSTM model for feature extraction to obtain sequence features corresponding to the image convolution features; and a processing subunit configured to input the sequence features into the CTC model for text alignment to obtain the text content information in the text box and the text position information corresponding to each piece of text content.
Optionally, the apparatus further includes: a collection module configured to collect the medical bill picture sample set containing multiple layouts before the learning module inputs the target medical bill picture, the text content information, and the text position information into the self-attention mechanism model for feature learning; an extraction module configured to sequentially input the medical bill picture sample set into a pre-trained text detection model and a pre-trained text recognition model, and extract the text position information and text content information of the text in each medical bill picture in the sample set; and a training module configured to input the medical bill picture sample set, together with the text position information and text content information of the text in each medical bill picture, into a self-attention-based pre-training model for training to obtain the self-attention mechanism model. The self-attention-based pre-training model is obtained by connecting a preset machine translation model and a preset document understanding pre-training model in sequence.
Optionally, the learning module includes: an input unit configured to input the target medical bill picture, the text content information, and the text position information into the self-attention mechanism model; a generating unit configured to perform feature extraction on them with the machine translation model based on the self-attention mechanism to generate a feature combination vector corresponding to the target medical bill picture; and an execution unit configured to perform a pre-training task on the feature combination vector with the document understanding pre-training model to obtain the at least one structured text field in the target medical bill picture and the field type corresponding to each structured text field.
Optionally, the generating unit includes: a third extraction subunit configured to extract an image feature vector corresponding to the target medical bill picture using global average pooling and linear projection, extract a text content feature vector corresponding to the text content information using text splitting, and extract a text position feature vector corresponding to the text position information, together with a relative position feature vector between different texts, by constructing a coordinate system for the text box corresponding to the text position information; and a feature combination subunit configured to combine the image feature vector, the text content feature vector, the text position feature vector, and the relative position feature vector to obtain the feature combination vector.
Optionally, the execution unit includes: an input subunit configured to input the feature combination vector into the document understanding pre-training model; a judging subunit configured to obtain matching information by judging whether the text content information matches the target medical bill picture, and/or obtain text blackening judgment information by judging whether the text content information is blackened, and/or obtain text occlusion judgment information by judging whether the text content information is occluded; a feature fusion subunit configured to fuse at least one of the matching information, the text blackening judgment information, and the text occlusion judgment information with the feature combination vector to obtain a feature fusion vector; a learning subunit configured to learn the context of the target medical bill picture from the feature fusion vector to obtain a modal alignment relationship between the at least one structured text field and its field type; and an output subunit configured to output the at least one structured text field and the field type corresponding to each structured text field.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps in any of the above method embodiments when executed.
By the above method, a machine translation model based on the self-attention mechanism and a document understanding pre-training model are trained with medical bills of multiple layouts to generate the self-attention mechanism model; the medical bill picture to be identified, together with the text content information and text position information in the picture, is then input into the self-attention mechanism model for feature learning to obtain at least one structured text field and the field type corresponding to each structured text field. Unstructured medical bill pictures of various layouts can thus be accurately converted into structured text fields without customized data-structuring development for medical bills of each layout. In addition, because the self-attention computation is highly parallelizable, the model's training time is reduced and the efficiency of data structuring is greatly improved, solving the technical problems in the related art, such as the low efficiency and low accuracy of existing schemes for extracting information from medical bills caused by their complicated and varied layouts.
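The parallelizability credited to the self-attention mechanism here comes from its scaled dot-product form: every position attends to every other position through one batched matrix product, rather than step by step as in a recurrent network. A generic NumPy rendering (not the patent's model) is:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product self-attention:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    All T positions are processed in a single matrix product,
    which is what makes the computation highly parallelizable."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (T, d_v)

rng = np.random.default_rng(0)
X = rng.random((5, 4))          # 5 token features of dimension 4
out = self_attention(X, X, X)   # self-attention: Q = K = V = X
```

Each output row is a convex combination of the value rows, so outputs stay within the range of the inputs; a full Transformer block would add multiple heads, projections, and residual connections on top of this core.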
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure of a medical bill image processing method applied to a computer terminal according to an embodiment of the present invention;
FIG. 2 is a flow chart of a medical ticket image processing method according to an embodiment of the invention;
fig. 3 is a block diagram of a medical bill image processing device according to an embodiment of the invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method provided by the embodiments of the present application may be executed on a mobile terminal, a server, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, fig. 1 is a block diagram of the hardware structure of a computer terminal to which the medical bill image processing method according to an embodiment of the present invention is applied. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. Those skilled in the art will understand that the structure shown in fig. 1 is only an illustration and does not limit the structure of the computer terminal; for example, the computer terminal may include more or fewer components than shown in fig. 1, or have a different configuration.
The memory 104 can be used for storing computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the medical ticket image processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory, and may also include volatile memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
Fig. 2 is a flowchart of a medical ticket image processing method according to an embodiment of the present invention, and as shown in fig. 2, the flowchart includes the following steps:
step S202, acquiring text position information and text content information of a text in a target medical bill picture;
the medical bill is a receipt issued by a nonprofit medical health institution to provide medical services to patients in outpatients, emergency services, first aid, hospitalization, physical examination, etc., and to receive medical income, and is also referred to as a "medical charging bill". The medical bills in different areas have different page types, and mainly comprise outpatient service charging bills and hospitalization charging bills; for example, the basic contents of the hospitalization charging bill include information such as bill name, bill code, business serial number, hospital type, billing time, name, gender, medical insurance type, medical insurance payment mode, social security number, item, amount, total, prepaid amount, subsidized amount, refund amount, medical insurance block payment, personal account payment, other medical insurance payment, self-fee, collection unit, payee and the like.
In an optional implementation of this scheme, the target medical bill picture is input into a pre-trained text detection model for text box detection, at least one text box containing text in the target medical bill picture is determined, and the text box is marked in the target medical bill picture; the marked target medical bill picture is then input into a pre-trained text recognition model for text recognition to obtain the text position information and text content information corresponding to the text box.
Preferably, the text detection model uses the DBNet algorithm (DB, for Differentiable Binarization). Specifically, text detection on the target medical bill picture with DBNet includes the following steps: inputting the target medical bill picture into a feature pyramid for feature extraction; upsampling the pyramid features to the same size and combining them into a feature map F; predicting a probability map P and a threshold map T from the feature map F, and computing an approximate binary map from the probability map P and the threshold map T; in the training phase, supervising the probability map, the threshold map, and the approximate binary map, where the probability map and the approximate binary map share the same supervision; and in inference, obtaining text bounding boxes from the approximate binary map or the probability map through a box formation module, so that the text boxes containing text in the medical bill are accurately located.
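The approximate binary map described above can be made concrete with a short sketch. This is not code from the patent: it is a minimal NumPy illustration of the differentiable binarization step of the DB algorithm, B = 1 / (1 + exp(-k(P - T))), where the amplification factor k (50 in the DB formulation) and the toy probability and threshold maps are chosen here only for illustration.

```python
import numpy as np

def approximate_binary_map(prob_map, thresh_map, k=50.0):
    """Differentiable binarization from the DB algorithm:
    B = 1 / (1 + exp(-k * (P - T))).

    prob_map:   per-pixel text probability P, shape (H, W)
    thresh_map: per-pixel adaptive threshold T, shape (H, W)
    k:          amplification factor (DB uses k = 50)
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Toy 1x4 maps: pixels well above the threshold saturate toward 1,
# pixels below it toward 0, yet the map stays differentiable in P and T,
# which is what allows the binarization to be supervised during training.
P = np.array([[0.9, 0.6, 0.4, 0.1]])
T = np.full_like(P, 0.5)
B = approximate_binary_map(P, T)
```

At inference time, text bounding boxes are then formed from this approximate binary map (or directly from the probability map).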
Preferably, the text recognition model includes a convolutional neural network, a long short-term memory (LSTM) model, and a neural-network-based temporal classification (CTC) model. Specifically, the marked target medical bill picture is input into the convolutional neural network for feature extraction to obtain image convolution features corresponding to the marked picture; the image convolution features are input into the LSTM model for feature extraction to obtain the corresponding sequence features; and the sequence features are input into the CTC model for text alignment to obtain the text content information in the text box and the text position information corresponding to each piece of text content.
In an optional example of the above embodiment, the text recognition model uses the CRNN + CTC algorithm, i.e., CNN (Convolutional Neural Network) + RNN (Recurrent Neural Network) + CTC (Connectionist Temporal Classification).
Firstly, determining the size (such as 32,100,3), (height, width, channel) form of the image of the input medical bill marked with the text box, namely (height, width, channel);
then, using the over-convolution layers (convolution feature maps) to extract the convolution feature maps of the input pictures, and converting the picture size (32,100,3) into a convolution feature matrix with the size of (1,25, 512); further, let the image scale to [32 × W × 3] size (W represents an arbitrary width) with a fixed aspect ratio; then changing the CNN into [1 × (W/4) × 512 ]; setting [ T ═ W/4 ] for LSTM, i.e., inputting the features into LSTM;
further, character sequence features are extracted on the basis of convolution features by using a current layers; wherein, the Current layers is a deep bidirectional LSTM network.
Then, transcription layers are used to convert the per-frame sequence features into the output character sequence.
In addition, the embodiment of the invention introduces the CTC loss in place of the common Softmax loss, so that training samples do not need to be aligned frame by frame; by introducing a blank character, positions that contain no character can also be handled.
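The blank character mentioned above drives the standard CTC decoding rule: merge consecutive repeated symbols, then drop blanks. A minimal sketch of that collapse step, with `"-"` as an assumed blank token:

```python
def ctc_collapse(path, blank="-"):
    """CTC decoding rule: merge consecutive repeats, then remove blanks.
    The blank token lets the model emit 'no character' at a timestep and
    separates genuine double letters ('l', '-', 'l' -> 'll', while
    'l', 'l' -> 'l')."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Per-timestep argmax outputs collapse to the target string,
# so no frame-level alignment is needed during training.
decoded = ctc_collapse(["h", "h", "-", "e", "l", "-", "l", "o", "o"])
```

Because many timestep paths collapse to the same string, the CTC loss sums over all of them, which is exactly why training samples need no frame-by-frame alignment.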
According to the above example, the text content information and the text position information in the text box are output based on the above CNN + RNN + CTC algorithm.
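The tensor shape flow described in the steps above can be summarized in a small sketch. The shapes come directly from the text ((32, W, 3) input, (1, W/4, 512) convolution features, T = W/4 timesteps); the divisibility assumption is mine.

```python
def crnn_shapes(width):
    """Shape flow for the CRNN described above: the input is resized to
    height 32 with arbitrary width W; the CNN reduces it to a
    1 x (W/4) x 512 feature matrix; the LSTM then reads it as a
    sequence of T = W/4 feature vectors of size 512."""
    assert width % 4 == 0, "width assumed divisible by 4 for this sketch"
    input_shape = (32, width, 3)            # (height, width, channels)
    conv_feature_shape = (1, width // 4, 512)
    timesteps = width // 4                  # LSTM sequence length T
    return input_shape, conv_feature_shape, timesteps

inp, feat, T = crnn_shapes(100)
```

With the example width 100 from the text, this reproduces the (32, 100, 3) input and (1, 25, 512) feature matrix, i.e., 25 LSTM timesteps.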
In another alternative embodiment of the present invention, the text content and text position data can be obtained and processed based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Step S204, inputting the target medical bill picture, the text content information and the text position information into a self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field;
the self-attention mechanism model is obtained by training a self-attention mechanism-based machine translation model and a document understanding pre-training model by using a medical bill picture sample set containing multiple layouts.
In this embodiment, the preferred document understanding pre-training model is the LayoutLMv2 model. LayoutLMv2 is a new-generation multi-modal document understanding pre-training model: image information is introduced directly at the input stage, and the multi-modal pre-training framework performs joint learning on the text, the image and the text position, so that local invariance information of different document template types can be learned. When the model needs to be migrated to another template type, the text in the new template can be structured by labeling only a small number of samples, thereby extracting the structured text fields in the medical bill and the field type of each structured text field.
In an optional embodiment of the scheme, before at least one structured text field in a target medical bill picture and a field type corresponding to each structured text field are obtained, a medical bill picture sample set comprising multiple layouts is collected; sequentially inputting the medical bill picture sample set into a pre-trained text detection model and a pre-trained text recognition model, and extracting text position information and text content information of a text in each medical bill picture in the medical bill picture sample set; inputting the medical bill picture sample set and text position information and text content information of a text in each medical bill picture into a pre-training model based on a self-attention mechanism for training to obtain a self-attention mechanism model; the pre-training model based on the self-attention mechanism is obtained by sequentially connecting a preset machine translation model and a preset document understanding pre-training model.
By the embodiment of the invention, the machine translation model based on the self-attention mechanism and the document understanding pre-training model are trained with medical bills of various layouts to generate the self-attention mechanism model. The medical bill picture to be identified, together with the text content information and text position information in the picture, is input into the self-attention mechanism model for feature learning to obtain at least one structured text field and the field type corresponding to each structured text field. In this way, unstructured medical bill pictures of various layouts can be accurately converted into structured text fields, without customized data-structuring development for medical bills of different layouts. In addition, because the self-attention mechanism is highly parallelizable, model training time is reduced and the processing efficiency of data structuring is greatly improved, solving the technical problems of the prior art, such as the low efficiency and low accuracy of extracting information from medical bills caused by their complicated layouts.
In an optional embodiment of the present disclosure, step S204 specifically includes: inputting the target medical bill picture, the text content information and the text position information into the self-attention mechanism model; performing feature extraction on the target medical bill picture, the text content information and the text position information by using the machine translation model based on the self-attention mechanism to generate a feature combination vector corresponding to the target medical bill picture; and performing a pre-training task on the feature combination vector by using the document understanding pre-training model to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field.
In this embodiment, the text content information, the text position information and the target medical bill picture are input into the self-attention mechanism model for feature extraction, where the self-attention mechanism model can be obtained by training a machine translation model based on the self-attention mechanism together with a document understanding pre-training model. A machine translation (transformer) model is constructed based on the self-attention mechanism, which reduces the distance between any two positions in a sequence to a constant. The transformer model is applied to the post-processing stage of automatic medical bill picture recognition and trained to obtain a data structuring model (namely the self-attention mechanism model), so that the results of text detection and text recognition are structured by the data structuring model, finally yielding structured medical bill data. The document understanding pre-training model is used to learn local invariance information across bills of different layouts and to output structured text fields labeled with field types, so that the unstructured information in the bill picture is converted into structured information.
Further, in an optional scheme of the embodiment of the present invention, an image feature vector corresponding to the target medical bill picture is extracted using global average pooling and linear projection; a text content feature vector corresponding to the text content information is extracted using text segmentation; a text position feature vector corresponding to the text position information and relative position feature vectors between different texts are extracted by constructing a coordinate system of the text boxes corresponding to the text position information; and the image feature vector, the text content feature vector, the text position feature vector and the relative position feature vectors are combined to obtain the feature combination vector.
In this embodiment, the transformer model performs feature extraction on the text content information, the text position information (layout) and the medical bill image, converting them into the corresponding text content feature vectors, text position feature vectors, image feature vectors and relative position feature vectors (i.e., the relative position relationships between different text blocks in the image). Each feature vector is then delivered to the encoder network in the transformer model, which splices the text content feature vectors, text position feature vectors, image feature vectors and relative position feature vectors (that is, splices the image and text sequences) to obtain the feature combination vector.
In a possible implementation manner of the present disclosure, the extraction of the text content feature vector includes: segmenting the text content with the WordPiece tokenizer; adding [CLS] and [SEP] markers and padding to a fixed length with [PAD] to obtain the text input sequence; and combining the word vectors, one-dimensional position vectors and segment vectors of the text input sequence to obtain the text content feature vector.
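The text input sequence described above can be sketched as follows. Real WordPiece segmentation is vocabulary-driven, so pre-split tokens are assumed here; the maximum length of 8 and the segment label "A" are illustrative choices, not values from this document.

```python
def build_text_input(tokens, max_len=8):
    """Assemble a BERT-style text input sequence:
    [CLS] + WordPiece tokens + [SEP], padded with [PAD] to max_len.
    Returns the sequence plus its 1-D position ids and segment labels."""
    seq = ["[CLS]"] + list(tokens) + ["[SEP]"]
    seq += ["[PAD]"] * (max_len - len(seq))
    positions = list(range(len(seq)))   # one-dimensional position vector ids
    segments = ["A"] * len(seq)         # text tokens live in segment A
    return seq, positions, segments

# "##" marks a WordPiece continuation piece (hypothetical tokens).
seq, pos, seg = build_text_input(["invoice", "no", "##013"])
```

The word, position and segment components are each embedded and summed in the real model; this sketch only shows how the three parallel sequences line up.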
In one possible implementation manner of the present disclosure, the extraction of the image feature vector includes: extracting a feature map of the target medical bill picture; average-pooling the feature map to a fixed size (W × H); flattening the pooled feature map row by row; obtaining an image feature sequence corresponding to the picture through linear projection; and adding the image feature sequence, the one-dimensional position vector and the segment vector to obtain the final image feature vector.
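The pool-flatten-project pipeline above can be sketched with NumPy. The grid size (2 × 4), embedding dimension (6) and random projection matrix are stand-ins for the model's learned values.

```python
import numpy as np

def image_feature_sequence(feature_map, out_h=2, out_w=4, dim=6):
    """Turn a CNN feature map (channels, H, W) into a token sequence:
    average-pool to a fixed (out_h, out_w) grid, flatten row by row,
    then linearly project each grid cell to the embedding size."""
    c, h, w = feature_map.shape
    pooled = feature_map.reshape(
        c, out_h, h // out_h, out_w, w // out_w).mean(axis=(2, 4))
    cells = pooled.reshape(c, out_h * out_w).T   # row-major flatten -> (tokens, channels)
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((c, dim))         # stand-in for the learned projection
    return cells @ proj                          # (out_h * out_w, dim)

tokens = image_feature_sequence(np.ones((3, 8, 16)))
```

Each of the 8 grid cells becomes one visual token; in the full model the one-dimensional position vector and segment vector are then added to each token.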
In a possible implementation manner of the present disclosure, the extraction of the text position feature vector includes: constructing a coordinate system aligned with the text boxes corresponding to the text position information; representing the position of each text box, and the relative position between different text blocks, by the four boundary coordinate values of the text box plus its width and height; and finally outputting the layout information of the text, namely the text position feature vector and the relative position feature vectors.
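A minimal sketch of the layout features just described: four boundary coordinates plus width and height per box, and an offset between two boxes as the relative position. The (x0, y0, x1, y1) box convention and the normalization to a 1000-unit page are assumptions, not details from this document.

```python
def layout_features(box, page_w=1000, page_h=1000):
    """Layout inputs for one text box: the four boundary coordinates
    plus width and height, normalized to the page size so that scans
    of different resolutions are comparable."""
    x0, y0, x1, y1 = box
    return (x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h,
            (x1 - x0) / page_w, (y1 - y0) / page_h)

def relative_position(box_a, box_b):
    """Relative offset between two text blocks' top-left corners."""
    return (box_b[0] - box_a[0], box_b[1] - box_a[1])

feats = layout_features((100, 200, 300, 250))
offset = relative_position((100, 200, 300, 250), (100, 400, 300, 450))
```

In the full model these numbers index learned coordinate embeddings rather than being fed in raw, but the inputs are the same six quantities per box.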
Further, the feature combination vector is input into an encoder network, the encoder network being configured to: (1) encode the picture or text content; (2) encode the 2D (two-dimensional) position information; (3) encode the 1D (one-dimensional) position information; and (4) encode the segment classification used to identify which modality each segment belongs to, where picture segments are marked C and text segments are marked A.
In an optional embodiment of the present disclosure, performing the pre-training task on the feature combination vector by using the document understanding pre-training model to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field specifically includes: inputting the feature combination vector into the document understanding pre-training model; obtaining matching information between the text content information and the target medical bill picture by judging whether the text content information matches the target medical bill picture; and/or obtaining text blackening judgment information by judging whether the text content information is blackened out; and/or obtaining text occlusion judgment information by judging whether the text content information is occluded; performing feature fusion on at least one of the matching information, the text blackening judgment information and the text occlusion judgment information together with the feature combination vector to obtain a feature fusion vector; learning the context of the target medical bill picture and the feature fusion vector to obtain the modal alignment relationship between the at least one structured text field and the field type of the at least one structured text field; and outputting the at least one structured text field and the field type corresponding to each structured text field.
In this embodiment, after the image feature vector, text content feature vector, text position feature vector and relative position feature vectors are spliced into the feature combination vector, the document understanding pre-training model performs pre-training tasks on the combination vector, specifically: judging whether the image matches the text information, whether the text content is blackened out, and whether the text content is occluded, thereby fusing the image pixel information with the text position, content information and inter-text relational features; performing feature fusion on at least one of the text-image matching information, the text blackening judgment information and the occluded-text prediction information together with the feature combination vector to obtain a feature fusion vector for each sample; and then, based on a self-supervised learning method, predicting the probability of the next word from the preceding word sequence to learn context-dependent representations, so that occluded sentences or words and disordered word sequences can be reconstructed, improving text recognition accuracy.
For example, in one example of text-picture alignment, some lines of text on the image are randomly covered, a word-level binary classification is performed using the text-side output of the model, and whether each word is covered is predicted so as to align the position information of the text and the picture. For instance, given Text1 with field content T1 and location box boxT1, the output should be True; if the input is T1 with boxT3, the output should be False.
In one example of text-image matching, a document-level binary classification of the model is used to predict whether the image and the text match, so as to align the content information of the text and the image.
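The word-level alignment target from the example above (T1 with boxT1 is True, T1 with boxT3 is False) can be sketched as label construction for the binary classifier. The word-id/box-id mapping is a hypothetical toy representation.

```python
def alignment_label(word_id, box_id, ground_truth):
    """Word-level text-image alignment target: True when a word is
    paired with its own location box, False when paired with another
    word's box. `ground_truth` maps word ids to their true box ids."""
    return ground_truth.get(word_id) == box_id

truth = {"T1": "boxT1", "T2": "boxT2", "T3": "boxT3"}
matched = alignment_label("T1", "boxT1", truth)      # expected True
mismatched = alignment_label("T1", "boxT3", truth)   # expected False
```

During pre-training the model never sees `truth` directly; it must predict these labels from the fused image and text features, which is what forces the two modalities to align.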
In another example, in order to judge whether the pixel information and the text content information correspond correctly, the features of the pixel information and the content are trained jointly and gradually: for pixel information paired with matching text information, the weight after the transformer is large, and for mismatched pairs the weight after the transformer is small.
Further, the feature fusion vector is input into LayoutLMv2. The LayoutLMv2 model learns the modal alignment relationship between text position and text semantics (i.e., field type) from the context of the bill picture, the feature fusion vector and the inferred occluded vocabulary, thereby obtaining the modal alignment relationship between text fields and field types and producing structured text fields carrying field types. For example, "4/19/2021" is labeled as the visit date and "0013853632" as the invoice number; through the pre-training tasks on the text fields, the classification of each text segment (i.e., the final structured data) is finally obtained.
The scheme can also extract structured text fields from the medical bill picture according to predefined key information entities, such as names, prices and quantities.
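Given the model's structured output, filtering it down to predefined key entities is a simple selection step. The (text, field_type) pair representation and the sample values below are hypothetical illustrations, not output formats specified by this document.

```python
def extract_key_entities(structured_fields, wanted=("name", "price", "quantity")):
    """Keep only the structured text fields whose field type is one of
    the predefined key information entities."""
    return {ftype: text for text, ftype in structured_fields if ftype in wanted}

# Hypothetical structured fields as the model might emit them.
fields = [("4/19/2021", "visit date"),
          ("0013853632", "invoice number"),
          ("Amoxicillin", "name"),
          ("12.50", "price"),
          ("2", "quantity")]
keys = extract_key_entities(fields)
```

Because the model, not a per-layout rule set, assigns the field types, the same selection code works unchanged across bill layouts.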
The existing medical invoice identification scheme needs to redesign the post-processing for each invoice layout, that is, each kind of invoice requires its own post-processing algorithm. The disadvantages of that method are that the layout information of the medical invoice must be clearly obtained before subsequent development can proceed, and when the medical invoice layout of a certain region cannot be obtained, no training sample is available and development becomes difficult. In this scheme, the traditional post-processing part is replaced by modeling with the LayoutLMv2 model: layouts are no longer distinguished, no separate post-processing needs to be matched to each layout, and a single model adapts to the layouts of all regions, solving the nationwide medical invoice recognition problem, saving research and development manpower and improving development efficiency.
Based on the above embodiments, the medical invoice identification method based on the self-attention mechanism provided by this scheme replaces the traditional post-processing development with a modeling method, and a unified model adapts to medical invoice layouts across the country. Update iteration is fast: when new versions of medical bills are rolled out in various regions, new layouts can be adapted rapidly without additional research and development investment, greatly saving research and development manpower and improving development efficiency.
This scheme applies the transformer algorithm from natural language processing to the field of visual algorithms, solves the problem of text classification after OCR (Optical Character Recognition), reduces the errors caused by text position changes and picture quality in traditional post-processing methods, and improves the accuracy of data structuring.
Based on the medical bill image processing method provided in the foregoing embodiments, based on the same inventive concept, the present embodiment further provides a medical bill image processing apparatus, which is used for implementing the foregoing embodiments and preferred embodiments, and the descriptions that have been already made are omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 3 is a block diagram of a medical bill image processing apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes: the acquisition module 30 is used for acquiring text position information and text content information of a text in the target medical bill picture; the learning module 32 is used for inputting the target medical bill picture, the text content information and the text position information into the self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field; the self-attention mechanism model is obtained by training a machine translation model based on the self-attention mechanism and a document understanding pre-training model by using a medical bill picture sample set containing multiple layouts.
Optionally, the obtaining module 30 includes: the determining unit is used for inputting the target medical bill picture into a pre-trained text detection model for text box detection, determining at least one text box containing a text in the target medical bill picture, and marking the text box in the target medical bill picture; and the first learning unit is used for inputting the marked target medical bill picture into a pre-trained text recognition model for text recognition to obtain text position information and text content information corresponding to the text box.
Optionally, the text recognition model includes a convolutional neural network, a long-short term memory model, and a time-series class classification model based on the neural network; the first learning unit includes: the first extraction subunit is used for inputting the marked target medical bill picture into a convolutional neural network for feature extraction to obtain an image convolution feature corresponding to the marked target medical bill picture; the second extraction subunit is used for inputting the image convolution characteristics into the long-term and short-term memory model for characteristic extraction to obtain sequence characteristics corresponding to the image convolution characteristics; and the processing subunit is used for inputting the sequence characteristics into a time sequence class classification model based on a neural network to perform text alignment, so as to obtain text content information in the text box and text position information corresponding to each text content.
Optionally, the apparatus further comprises: the collection module is used for collecting a medical bill picture sample set containing multiple layouts before the learning module inputs the target medical bill picture, the text content information and the text position information into the self-attention mechanism model for feature learning to obtain at least one structured text field and the field type corresponding to each structured text field; the extraction module is used for sequentially inputting the medical bill picture sample set into a pre-trained text detection model and a pre-trained text recognition model and extracting text position information and text content information of the text in each medical bill picture in the medical bill picture sample set; the training module is used for inputting the medical bill picture sample set and the text position information and text content information of the text in each medical bill picture into a pre-training model based on the self-attention mechanism for training to obtain the self-attention mechanism model; the pre-training model based on the self-attention mechanism is obtained by sequentially connecting a preset machine translation model and a preset document understanding pre-training model.
Optionally, the learning module 32 includes: the input unit is used for inputting the target medical bill picture, the text content information and the text position information into the self-attention mechanism model; the generating unit is used for performing feature extraction on the target medical bill picture, the text content information and the text position information by using a machine translation model based on a self-attention mechanism to generate a feature combination vector corresponding to the target medical bill picture; and the execution unit is used for performing a pre-training task on the feature combination vector by using the document understanding pre-training model to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field.
Optionally, the generating unit includes: the third extraction subunit is used for extracting the image feature vector corresponding to the target medical bill image by utilizing a global average pooling principle and a linear projection principle; extracting text content characteristic vectors corresponding to the text content information by using a text splitting principle; extracting text position characteristic vectors corresponding to the text position information and relative position characteristic vectors between different texts by constructing a coordinate system of a text box corresponding to the text position information; and the characteristic combination subunit is used for carrying out characteristic combination on the image characteristic vector, the text content characteristic vector, the text position characteristic vector and the relative position characteristic vector to obtain a characteristic combination vector.
Optionally, the execution unit includes: the input subunit is used for inputting the feature combination vector into the document understanding pre-training model; the judging subunit is used for obtaining matching information between the text content information and the target medical bill picture by judging whether the text content information matches the target medical bill picture; and/or obtaining text blackening judgment information by judging whether the text content information is blackened out; and/or obtaining text occlusion judgment information by judging whether the text content information is occluded; the feature fusion subunit is used for performing feature fusion on at least one of the matching information, the text blackening judgment information and the text occlusion judgment information together with the feature combination vector to obtain a feature fusion vector; the learning subunit is used for learning the context of the target medical bill picture and the feature fusion vector to obtain the modal alignment relationship between the at least one structured text field and the field type of the at least one structured text field; and the output subunit is used for outputting the at least one structured text field and the field type corresponding to each structured text field.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring text position information and text content information of the text in the target medical bill picture;
s2, inputting the target medical bill picture, the text content information and the text position information into a self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field; the self-attention mechanism model is obtained by training a self-attention mechanism-based machine translation model and a document understanding pre-training model by using a medical bill picture sample set containing multiple layouts.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Based on the above embodiment of the method shown in fig. 2 and the apparatus shown in fig. 3, in order to achieve the above object, the present application further provides an electronic device, as shown in fig. 4, including a memory 42 and a processor 41, where the memory 42 and the processor 41 are both disposed on a bus 43, the memory 42 stores a computer program, and the processor 41 implements the medical bill image processing method shown in fig. 2 when executing the computer program.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a memory (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions to enable an electronic device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the implementation scenarios of the present application.
Optionally, the device may also be connected to a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
It will be understood by those skilled in the art that the structure of an electronic device provided in the present embodiment does not constitute a limitation of the physical device, and may include more or less components, or some components in combination, or a different arrangement of components.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A medical bill image processing method is characterized by comprising the following steps:
acquiring text position information and text content information of a text in a target medical bill picture;
inputting the target medical bill picture, the text content information and the text position information into a self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field;
the self-attention mechanism model is obtained by training a self-attention mechanism-based machine translation model and a document understanding pre-training model by using a medical bill picture sample set containing multiple layouts.
2. The method of claim 1, wherein the obtaining text position information and text content information of the text in the target medical ticket picture comprises:
inputting the target medical bill picture into a pre-trained text detection model for text box detection, determining at least one text box containing a text in the target medical bill picture, and marking the text box in the target medical bill picture;
and inputting the marked target medical bill picture into a pre-trained text recognition model for text recognition to obtain text position information and text content information corresponding to the text box.
3. The method of claim 2, wherein the text recognition model comprises a convolutional neural network, a long-short term memory model, and a neural network-based time-series class classification model; the step of inputting the marked target medical bill picture into the pre-trained text recognition model for text recognition to obtain text position information and text content information corresponding to the text box comprises the following steps:
inputting the marked target medical bill picture into the convolutional neural network for feature extraction to obtain image convolution features corresponding to the marked target medical bill picture;
inputting the image convolution features into the long short-term memory model for feature extraction to obtain sequence features corresponding to the image convolution features;
and inputting the sequence features into the neural-network-based temporal classification model for text alignment to obtain the text content information in the text box and the text position information corresponding to each piece of text content.
4. The method of claim 1, wherein before the inputting the target medical bill picture, the text content information and the text position information into the self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field, the method further comprises:
collecting the medical bill picture sample set containing various layouts;
sequentially inputting the medical bill picture sample set into a pre-trained text detection model and a pre-trained text recognition model, and extracting text position information and text content information of a text in each medical bill picture in the medical bill picture sample set;
inputting the medical bill picture sample set, together with the text position information and text content information of the text in each medical bill picture, into a pre-training model based on the self-attention mechanism for training to obtain the self-attention mechanism model; wherein the pre-training model based on the self-attention mechanism is obtained by sequentially connecting a preset machine translation model and a preset document understanding pre-training model.
5. The method of claim 1, wherein the inputting the target medical bill picture, the text content information and the text position information into the self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field comprises:
inputting the target medical bill picture, the text content information and the text position information into the self-attention mechanism model;
performing feature extraction on the target medical bill picture, the text content information and the text position information by using the machine translation model based on the self-attention mechanism to generate a feature combination vector corresponding to the target medical bill picture;
and performing a pre-training task on the feature combination vector by using the document understanding pre-training model to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field.
6. The method of claim 5, wherein the performing feature extraction on the target medical bill picture, the text content information and the text position information by using the machine translation model based on the self-attention mechanism to generate a feature combination vector corresponding to the target medical bill picture comprises:
extracting an image feature vector corresponding to the target medical bill picture by using global average pooling and linear projection; extracting a text content feature vector corresponding to the text content information by splitting the text; extracting a text position feature vector corresponding to the text position information and relative position feature vectors between different texts by constructing a coordinate system for the text boxes corresponding to the text position information;
and combining the image feature vector, the text content feature vector, the text position feature vector and the relative position feature vectors to obtain the feature combination vector.
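The feature combination in claim 6 can be sketched with numpy: global average pooling plus a linear projection for the image, a (random, untrained) embedding table for the text, and normalized box coordinates as position features, all concatenated into one vector. The dimensions, projection matrix, and token ids below are arbitrary illustrations, not the patent's actual parameters.

```python
# Combine image, text-content, and text-position features into one vector.
import numpy as np

rng = np.random.default_rng(0)
D = 16                                   # shared embedding width (assumed)

def image_feature(feat_map: np.ndarray, W: np.ndarray) -> np.ndarray:
    pooled = feat_map.mean(axis=(1, 2))  # global average pooling over H, W
    return pooled @ W                    # linear projection to D dims

def text_feature(token_ids, table: np.ndarray) -> np.ndarray:
    return table[token_ids].mean(axis=0) # average of token embeddings

def position_feature(box, width, height) -> np.ndarray:
    x0, y0, x1, y1 = box                 # text-box corners in pixels
    return np.array([x0 / width, y0 / height, x1 / width, y1 / height])

feat_map = rng.normal(size=(32, 7, 7))   # C x H x W conv feature map
W_proj = rng.normal(size=(32, D))        # untrained projection (illustrative)
emb_table = rng.normal(size=(100, D))    # untrained embeddings (illustrative)

combined = np.concatenate([
    image_feature(feat_map, W_proj),
    text_feature([5, 17, 42], emb_table),
    position_feature((40, 20, 300, 48), width=600, height=800),
])
print(combined.shape)  # (36,) = 16 image + 16 text + 4 position dims
```

Normalizing box coordinates by the picture's width and height makes the position features layout-scale invariant, which matters for a sample set spanning multiple bill layouts.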
7. The method of claim 5, wherein the performing a pre-training task on the feature combination vector by using the document understanding pre-training model to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field comprises:
inputting the feature combination vector into the document understanding pre-training model;
obtaining matching information by judging whether the text content information matches the target medical bill picture; and/or obtaining text blackout judgment information by judging whether the text content information is blacked out; and/or obtaining text occlusion judgment information by judging whether the text content information is occluded;
performing feature fusion on at least one of the matching information, the text blackout judgment information and the text occlusion judgment information with the feature combination vector to obtain a feature fusion vector;
learning the context of the target medical bill picture together with the feature fusion vector to obtain a modal alignment relation between the at least one structured text field and the field type of the at least one structured text field;
and outputting the at least one structured text field and the field type corresponding to each structured text field.
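Claim 7's pre-training signals are three binary judgments per sample: does the text match the picture, is it blacked out, is it occluded. A minimal way to fuse them with the feature combination vector is concatenation, sketched below; the fusion-by-concatenation choice and the vector sizes are assumptions for illustration, not the patent's specified mechanism.

```python
# Fuse binary pre-training judgments with the feature combination vector.
import numpy as np

def fuse(feature_vec: np.ndarray, matches: int,
         blacked_out: int, occluded: int) -> np.ndarray:
    # Append the three 0/1 judgment signals to the feature vector.
    signals = np.array([matches, blacked_out, occluded], dtype=float)
    return np.concatenate([feature_vec, signals])

fv = np.zeros(8)                               # placeholder feature vector
fused = fuse(fv, matches=1, blacked_out=0, occluded=0)
print(fused.shape)  # (11,) = 8 feature dims + 3 judgment signals
```

In practice such signals are usually supplied as auxiliary training losses rather than literal extra dimensions, but concatenation keeps the sketch self-contained.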
8. A medical bill image processing apparatus characterized by comprising:
the acquisition module is used for acquiring text position information and text content information of a text in the target medical bill picture;
the learning module is used for inputting the target medical bill picture, the text content information and the text position information into a self-attention mechanism model for feature learning to obtain at least one structured text field in the target medical bill picture and a field type corresponding to each structured text field;
wherein the self-attention mechanism model is obtained by training a machine translation model based on the self-attention mechanism and a document understanding pre-training model on a medical bill picture sample set containing multiple layouts.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202111148275.6A 2021-09-29 2021-09-29 Medical bill image processing method and device, electronic device and storage medium Pending CN113705733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111148275.6A CN113705733A (en) 2021-09-29 2021-09-29 Medical bill image processing method and device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113705733A true CN113705733A (en) 2021-11-26

Family

ID=78662220


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241407A (en) * 2021-12-10 2022-03-25 电子科技大学 Close-range screen monitoring method based on deep learning
CN114334114A (en) * 2022-03-16 2022-04-12 武汉楚精灵医疗科技有限公司 Preoperative reminding method and device for endoscopy and storage medium
CN114495113A (en) * 2022-02-18 2022-05-13 北京百度网讯科技有限公司 Text classification method and training method and device of text classification model
CN114842478A (en) * 2022-04-22 2022-08-02 平安国际智慧城市科技股份有限公司 Text area identification method, device, equipment and storage medium
CN116704529A (en) * 2023-06-12 2023-09-05 南方电网数字平台科技(广东)有限公司 Work ticket auditing system based on image recognition technology
JP7425147B2 (en) 2022-02-25 2024-01-30 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Image processing method, text recognition method and device
CN118152548A (en) * 2024-05-13 2024-06-07 杭州律途科技有限公司 Medical insurance data tracing method and system based on question-answer type picture text extraction model


Similar Documents

Publication Publication Date Title
CN113705733A (en) Medical bill image processing method and device, electronic device and storage medium
CN111026842B (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN112232149B (en) Document multimode information and relation extraction method and system
CN111931664A (en) Mixed note image processing method and device, computer equipment and storage medium
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
US20230401828A1 (en) Method for training image recognition model, electronic device and storage medium
CN111709240A (en) Entity relationship extraction method, device, equipment and storage medium thereof
CN111615702A (en) Method, device and equipment for extracting structured data from image
CN113627395A (en) Text recognition method, text recognition device, text recognition medium and electronic equipment
CN110766460A (en) User portrait drawing method and device, storage medium and computer equipment
CN114818718A (en) Contract text recognition method and device
CN114372532A (en) Method, device, equipment, medium and product for determining label marking quality
CN116484836B (en) Questionnaire generation system and method based on NLP model, electronic equipment and medium
CN112966676A (en) Document key information extraction method based on zero sample learning
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN110889717A (en) Method and device for filtering advertisement content in text, electronic equipment and storage medium
CN110852103A (en) Named entity identification method and device
CN110232328A (en) A kind of reference report analytic method, device and computer readable storage medium
CN115063784A (en) Bill image information extraction method and device, storage medium and electronic equipment
CN114638973A (en) Target image detection method and image detection model training method
CN114067343A (en) Data set construction method, model training method and corresponding device
CN111275035B (en) Method and system for identifying background information
CN114692715A (en) Sample labeling method and device
CN111723188A (en) Sentence display method and electronic equipment based on artificial intelligence for question-answering system
CN117953313B (en) Method and system for realizing anomaly identification of mine data based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220520

Address after: 518000 China Aviation Center 2901, No. 1018, Huafu Road, Huahang community, Huaqiang North Street, Futian District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Ping An medical and Health Technology Service Co.,Ltd.

Address before: Room 12G, Area H, 666 Beijing East Road, Huangpu District, Shanghai 200001

Applicant before: PING AN MEDICAL AND HEALTHCARE MANAGEMENT Co.,Ltd.