CN115984890A - Bill text recognition method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN115984890A
Authority
CN
China
Prior art keywords
text
bill
preset
recognition
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211559252.9A
Other languages
Chinese (zh)
Inventor
郭喜亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Health Insurance Company of China Ltd
Original Assignee
Ping An Health Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Health Insurance Company of China Ltd filed Critical Ping An Health Insurance Company of China Ltd
Priority to CN202211559252.9A priority Critical patent/CN115984890A/en
Publication of CN115984890A publication Critical patent/CN115984890A/en
Pending legal-status Critical Current


Abstract

The embodiments of the application belong to the technical field of bill text recognition in artificial intelligence, and relate to a bill text recognition method comprising the steps of: acquiring a bill image to be recognized; performing text recognition on the bill image through a preset recognition positioning model; performing named entity extraction on the text information through a preset multimodal transformer model; constructing a candidate set of entity pairs by combining the plurality of named entities with the layout information based on a preset pairing rule; judging, through a preset association judgment model, whether each entity pair has an association; and merging the entity pairs judged to be associated. The application also provides a bill recognition device, computer equipment and a storage medium. In addition, the application relates to blockchain technology: the user's bill image, text information and the like can be stored in a blockchain. The method and the device improve the accuracy of bill recognition under conditions such as folding of the bill and line wrapping of some fields.

Description

Bill text recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a bill text recognition method and apparatus, a computer device, and a storage medium.
Background
A medical bill is a document that must be provided for medical insurance reimbursement; its face contains key fields such as the patient name, invoice number, total amount, expense information, overall-fund payment, and visit date. At present, medical bills of various formats exist across the country, and the positions and forms of these key fields are not uniform. Even though electronic billing is being promoted nationwide, a considerable proportion of hospitals still do not issue electronic invoices, and the information printed in the "other information" area of electronic invoices is inconsistent across hospitals. As a result, when processing medical insurance reimbursement, entry personnel must focus on different information for invoices of different formats based on an understanding of the business.
Structured recognition in a medical billing scenario typically has several solutions: performing text recognition on the medical invoice with an OCR model and extracting the full text with NLP techniques; extracting the required key fields based on fixed field slices or fixed areas; performing block-wise recognition and matching based on multiple detection and segmentation models; or defining a large number of custom parsing templates and routing different invoice types to the corresponding parsing processes.
In practice, however, the following problems often occur: the paper is thin and easily folded or bent, so multiple pieces of information belonging to the same expense item do not lie on the same horizontal line, and fields such as name and amount cannot be matched one-to-one; and some fields wrap onto a new line, so only the information on the first line can be extracted.
The prior art does not specifically optimize for characteristics of medical bills such as easy folding and line-wrapped printing of some fields, so recognition results are frequently wrong, more manual intervention is required, the processing period is lengthened, and the informatization cost of claim settlement and reimbursement also rises.
Disclosure of Invention
The embodiments of the application aim to provide a bill text recognition method, a bill text recognition device, a computer device and a storage medium, so as to solve the problem in the prior art that recognition errors easily occur when a bill is folded, some fields wrap onto a new line, and the like.
In order to solve the above technical problem, embodiments of the present application provide a method, an apparatus, a computer device, and a storage medium for identifying a bill text, which adopt the following technical solutions:
a bill text recognition method comprises the following steps:
acquiring a bill image to be identified;
performing text recognition on the bill image through a preset recognition positioning model to obtain text information and corresponding layout information;
performing named entity extraction on the text information through a preset multimodal transformer model to obtain a plurality of corresponding named entities;
based on a preset pairing rule, combining the multiple named entities and the layout information to construct a candidate set of entity pairs;
judging whether each entity pair has association or not through a preset association judgment model;
and merging the entity pairs with the association as a judgment result to obtain a merged text.
Further, before the step of performing text recognition on the bill image through a preset recognition positioning model, the method further includes:
judging whether the bill image is deflected;
and if the bill image is deflected, rotating the bill image to obtain a forward bill image.
Further, before the step of performing text recognition on the bill image through a preset recognition positioning model, the method further includes:
inputting the bill image into a preset semantic segmentation model to obtain a corresponding mask image;
extracting the boundary of the connected domain of the mask image, and setting the minimum circumscribed rectangular area of the boundary;
and carrying out white filling processing on the area outside the rectangular area of the bill image.
Further, before the step of extracting the named entity from the text information through a preset multi-modal transformer model, the method further includes:
judging whether synonyms exist in the text information or not by inquiring a preset synonym library;
and when judging that the synonym exists, carrying out synonym replacement on the text information.
Further, the step of extracting the named entity from the text information through a preset multi-modal transformer model specifically includes:
and inputting the bill image, the text information and the layout information into the preset multi-modal transformer model to extract the named entities to obtain the named entities.
Further, after the step of merging the entity pairs whose determination result is that there is an association, the method further includes:
and generating a target judgment result of the reliability of the merged text based on a preset judgment rule.
Further, after the step of combining the entity pairs whose determination result is that there is an association to obtain a combined text, the method further includes:
inputting the bill image into the recognition positioning model for text recognition to obtain a text confidence rate;
inputting the text information into the multi-modal transformer model for named entity extraction to obtain a named confidence rate;
inputting the entity pairs into the association judgment model to judge whether an association exists, so as to obtain a judgment confidence rate;
performing averaging or weighted averaging on at least two of the text confidence rate, the naming confidence rate and the judgment confidence rate to generate a target judgment result of the credibility of the merged text;
and when the target judgment result exceeds a preset threshold value, directly outputting the merged text.
In order to solve the above technical problem, an embodiment of the present application further provides a device for identifying a bill text, which adopts the following technical solutions:
the acquisition module is used for acquiring a bill image to be identified;
the recognition module is used for carrying out text recognition on the bill image through a preset recognition positioning model to obtain text information and corresponding layout information;
the extraction module is used for extracting named entities from the text information through a preset multi-modal transformer model to obtain a plurality of corresponding named entities;
the construction module is used for constructing a candidate set of entity pairs by combining the plurality of named entities and the layout information based on a preset pairing rule;
the judging module is used for judging whether each entity pair has association or not through a preset association judging model;
and the merging module is used for merging the entity pairs with the judgment result of association to obtain a merged text.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
a computer device comprising a memory having computer readable instructions stored therein and a processor which, when executing the computer readable instructions, implements the steps of the bill text recognition method described above.
In order to solve the foregoing technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the bill text recognition method described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects: according to the method and the device, text recognition is carried out on the bill image, entity extraction is carried out on the text information obtained through recognition, a candidate set of entity pairs is constructed by combining a plurality of named entities and layout information obtained through extraction, and whether each entity pair has correlation or not is judged, and the entity pairs with correlation are combined, so that text information with high correlation can be effectively combined, and the accuracy of the text recognition under the conditions of folding, line changing of partial fields and the like of the bill is improved.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the description below are some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a bill text recognition method according to the present application;
FIG. 3 is a flow diagram of another embodiment of a bill text recognition method according to the present application;
FIG. 4 is a flow diagram of another embodiment of a bill text recognition method according to the present application;
FIG. 5 is a schematic structural diagram of one embodiment of a bill text recognition device according to the present application;
FIG. 6 is a block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the bill text recognition method provided in the embodiments of the present application is generally executed by the server/terminal device, and accordingly, the bill text recognition apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a bill text recognition method according to the present application is shown. The bill text recognition method comprises the following steps:
step S201, acquiring a bill image to be identified.
In this embodiment, the electronic device (for example, the server/terminal device shown in fig. 1) on which the bill text recognition method runs may respond to a bill-image upload request from a client through a wired or wireless connection, and receive the bill image uploaded from the client. It should be noted that the wireless connection manners may include, but are not limited to, 3G/4G/5G connection, Wi-Fi connection, Bluetooth connection, WiMAX connection, ZigBee connection, UWB (ultra wideband) connection, and other wireless connection manners now known or developed in the future.
Step S202, performing text recognition on the bill image through a preset recognition positioning model to obtain text information and corresponding layout information.
In this embodiment, the recognition positioning model may adopt any existing model with recognition and positioning functions to perform text recognition and positioning, such as an Optical Character Recognition (hereinafter, OCR) model.
Specifically, the document image may be used as an input of the OCR model, and corresponding text information (e.g., a plurality of text information slices) in the document image and layout information (e.g., coordinate information) corresponding to each text information may be output.
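As an illustration, the output of such a recognition positioning model can be sketched as text slices paired with coordinate boxes. The `TextSlice` type and the sample values below are hypothetical stand-ins, not the API of any specific OCR engine:

```python
from dataclasses import dataclass

@dataclass
class TextSlice:
    text: str    # recognized text fragment
    box: tuple   # layout info: (x_min, y_min, x_max, y_max)

def recognize(bill_image):
    # Stand-in for a real OCR/positioning model; on a folded bill the
    # amount's box is shifted off the item name's horizontal line.
    return [
        TextSlice("Examination fee", (40, 100, 180, 120)),
        TextSlice("120.00", (300, 132, 360, 152)),
    ]

slices = recognize(None)
# Slices can later be joined with a [SEP] flag before entity extraction.
joined = " [SEP] ".join(s.text for s in slices)
```

The coordinate boxes are what the later pairing rules consume when deciding which entities to consider as candidate pairs.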
And step S203, conducting named entity extraction on the text information through a preset multi-modal transformer model to obtain a plurality of corresponding named entities.
In this embodiment, the overall structure of the multimodal transformer model is an encoder-decoder framework: each encoder may include a plurality of encoder sub-modules, each decoder may include a plurality of decoder sub-modules, and the model may be pre-trained on a special vocabulary built from terms in fields such as medical bills and insurance bills (for example, medical-bill terms such as name, invoice number, expense item name, and expense item amount).
In one embodiment, the text information can be directly used as an input of a preset multi-modal transformer model, and a plurality of named entities (each named entity includes an entity type and multi-modal information corresponding to the entity) are obtained through output.
In another embodiment, three kinds of information, namely the bill image, the text information and the layout information, may be input into the preset multimodal transformer model, which outputs the plurality of named entities. Inputting all three kinds of information into the preset multimodal transformer model provides multi-dimensional references of image, text and layout for entity extraction, improving the accuracy of entity extraction.
Here, multimodal information refers to information of multiple modalities, including text, images, video, audio, and so on. Building on the above embodiments: when the input to the transformer model is the single text modality, the output is also in the text modality; when the input is the bill image, the text and the layout, the output can be a fused multimodal representation of image, text and layout.
In one embodiment, the different slices of text information may be connected by a [ SEP ] flag; the layout information may be coordinates of each text information slice acquired based on the recognition positioning model.
Taking the medical bill as an example, by performing named entity extraction on a plurality of text information slices obtained by performing text recognition on the medical bill image based on step S202, entity types such as name, invoice number, charge item name, charge item amount, and the like, and multi-modal information corresponding to each entity can be extracted.
Step S204, based on a preset pairing rule, combining a plurality of named entities and corresponding layout information to construct a candidate set of entity pairs.
In this embodiment, based on different application scenarios, different pairing rules may be preset, and taking a medical bill as an example, entity pairs of the same entity type, such as name, number, unit or amount, and coinciding in the X-axis direction may be added to the candidate relationship set; and adding entity pairs with different expense item entity types but closer in the Y-axis direction into the candidate relationship set.
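A minimal sketch of such a pairing rule, assuming each entity carries an `(entity_type, box)` tuple with `(x_min, y_min, x_max, y_max)` coordinates; the `y_gap` threshold is an illustrative assumption:

```python
def x_overlap(a, b):
    # Horizontal overlap between two boxes (x_min, y_min, x_max, y_max).
    return min(a[2], b[2]) - max(a[0], b[0]) > 0

def build_candidate_pairs(entities, y_gap=30):
    """Sketch of the preset pairing rule for medical bills:
    - same entity type and coinciding in the X direction, or
    - different expense-item types but close in the Y direction."""
    pairs = []
    for i, (type_i, box_i) in enumerate(entities):
        for type_j, box_j in entities[i + 1:]:
            same_type_aligned = type_i == type_j and x_overlap(box_i, box_j)
            cross_type_close = (type_i != type_j and
                                abs(box_i[1] - box_j[1]) <= y_gap)
            if same_type_aligned or cross_type_close:
                pairs.append(((type_i, box_i), (type_j, box_j)))
    return pairs
```

A line-wrapped item name would satisfy the first branch (same type, stacked in X), while a name and its amount on roughly one line satisfy the second.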
Step S205, determining whether each entity pair has an association through a preset association determination model.
In this embodiment, the association determination model may be a classification model with a dual affine (biaffine) attention mechanism. The plurality of named entities output by the multimodal transformer model (each comprising an entity type and the entity's corresponding multimodal information) may be used as input to the association determination model to obtain a classification result on whether each entity pair is associated. Specifically, the head-node multimodal information, head-node entity type, tail-node multimodal information and tail-node entity type of each entity pair may be concatenated and then input into the association determination model. For example, any two named entities output by the multimodal transformer model form an entity pair in which one entity's multimodal information is "text + image + layout" with entity type "expense item name", and the other's multimodal information is "text + image + layout" with entity type "expense item amount". Concatenating their information yields head-node multimodal information "text + image + layout", head-node entity type "expense item name", tail-node multimodal information "text + image + layout" and tail-node entity type "expense item amount"; the concatenated information is input into the determination model, which outputs a judgment on whether the entity pair is associated.
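The dual affine scoring itself can be sketched in a few lines. The parameter shapes and the sigmoid squashing below are illustrative assumptions about a generic biaffine scorer, not the patent's exact formulation:

```python
import math

def biaffine_score(h1, h2, U, w, b):
    """Minimal biaffine (dual affine) scorer for an entity pair:
    s = h1^T U h2 + w . [h1; h2] + b, squashed to a probability.
    Shapes: h1 (d,), h2 (d,), U (d x d), w (2d,), b scalar."""
    bilinear = sum(h1[i] * U[i][j] * h2[j]
                   for i in range(len(h1)) for j in range(len(h2)))
    linear = sum(wi * x for wi, x in zip(w, list(h1) + list(h2)))
    return 1.0 / (1.0 + math.exp(-(bilinear + linear + b)))
```

In a trained model, `h1` and `h2` would be the concatenated head-node and tail-node representations, and `U`, `w`, `b` would be learned parameters.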
And step S206, merging the entity pairs with the judgment result that the association exists to obtain a merged text.
In this embodiment, entity pairs whose determination result in step S205 is that an association exists are merged, while entity pairs without an association are not merged. For example, line-wrapped information may be merged, and the information (such as name, quantity, unit, and amount) of the same expense item may be integrated.
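One way to realize the merge, sketched here with a small union-find so that transitively associated entities (for example, a field wrapped over several lines) collapse into one group; joining group members with a space is an illustrative choice:

```python
def merge_associated(entities, associated_pairs):
    """Group entities whose pairs were judged associated (transitively),
    then join each group's text into one merged text."""
    parent = list(range(len(entities)))

    def find(i):
        # Find the group root, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in associated_pairs:
        parent[find(i)] = find(j)   # union the two groups

    groups = {}
    for idx, text in enumerate(entities):
        groups.setdefault(find(idx), []).append(text)
    return [" ".join(g) for g in groups.values()]
```

With pairs (0,1) and (1,2) judged associated, entities 0-2 collapse into a single merged text while unassociated entities stay separate.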
According to the method and the device, text recognition is carried out on the bill image, entity extraction is carried out on the text information obtained through recognition, a candidate set of entity pairs is constructed by combining a plurality of named entities and layout information obtained through extraction, and whether each entity pair has correlation or not is judged, and the entity pairs with correlation are combined, so that text information with high correlation can be effectively combined, and the accuracy of the text recognition under the conditions of folding, line changing of partial fields and the like of the bill is improved.
With continued reference to FIG. 3, a flow diagram of another embodiment of a ticket text recognition method according to the present application is shown. In some optional implementation manners of this embodiment, after acquiring the document image to be recognized in step 201, before performing text recognition on the document image through a preset recognition positioning model in step 202, the electronic device may further perform the following steps:
step S207, determining whether the bill image has a deflection.
In step S208, if there is a deflection, the sheet image is rotated to obtain a positive sheet image.
In this embodiment, the rotation of the bill image may adopt any existing image processing method or a neural network model for direction classification, such as a residual neural network model. Specifically, the bill image may be input into the neural network model, which outputs a four-way classification result for the image's direction; if the result is judged to be non-forward, the image is rotated to obtain a forward bill image.
By rotating the bill images, the application reduces the requirement on the orientation of the acquired bill images, avoids text recognition errors caused by orientation problems, and improves both the applicability of the bill text recognition method and the accuracy of overall text recognition.
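The four-way direction classification above reduces to a tiny post-processing step; the class order 0/90/180/270 degrees is an assumption of this sketch:

```python
def correct_orientation(cls_probs):
    """Given four class probabilities for 0/90/180/270-degree rotations,
    return the angle to rotate the image back to the forward direction."""
    directions = [0, 90, 180, 270]
    predicted = directions[max(range(4), key=lambda i: cls_probs[i])]
    return (360 - predicted) % 360  # inverse rotation to apply
```

For example, an image classified as rotated 90 degrees should be rotated by a further 270 degrees to become forward.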
In some optional implementation manners of this embodiment, after acquiring the document image to be recognized in step 201, before performing text recognition on the document image through a preset recognition positioning model in step 202, the electronic device may further perform the following steps:
and inputting the bill image into a preset semantic segmentation model to obtain a corresponding mask image.
And extracting the boundary of the connected domain of the mask map, and setting the minimum circumscribed rectangular area of the boundary.
And performing white filling processing on the area outside the rectangular area of the bill image.
In this embodiment, acquired image noise, frame background and the like, or the presence of multiple bills in the same bill image (for example, several consecutively numbered bills captured together), generally affect the precision of subsequent text recognition, merging and other steps. The bill image may therefore be preprocessed with the above steps to obtain a bill image with background information removed and, where applicable, split into separate bill images.
According to the method and the device, the bill images are input into the preset semantic segmentation model for preprocessing, background information in the bill images can be removed, and the condition that a plurality of bills exist in the same bill image is split, so that convenience is brought to subsequent text recognition, and the whole text recognition precision is improved.
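The boundary extraction, minimum circumscribed rectangle and white-fill steps above can be sketched on a toy binary mask; lists of 0/1 stand in for the semantic segmentation model's output, and an axis-aligned rectangle is used for simplicity (a real pipeline would operate on image arrays and could use a rotated minimum rectangle):

```python
def min_bounding_rect(mask):
    """Smallest axis-aligned rectangle enclosing all foreground (1) pixels,
    returned as (row_min, col_min, row_max, col_max)."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]

def whiten_outside(image, rect, white=255):
    """Fill every pixel outside the rectangle with white, keeping the bill."""
    r0, c0, r1, c1 = rect
    return [[px if r0 <= r <= r1 and c0 <= c <= c1 else white
             for c, px in enumerate(row)]
            for r, row in enumerate(image)]
```

When a mask contains several disjoint connected domains (several bills in one photo), the same two steps would be applied per domain to split the image.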
With continued reference to FIG. 4, a flow diagram of another embodiment of a bill text recognition method according to the present application is shown. In some optional implementations of this embodiment, after performing text recognition on the bill image through the preset recognition positioning model in step S202, and before performing named entity extraction on the text information through the preset multimodal transformer model in step S203, the electronic device may further perform the following steps:
step S209, by querying a preset synonym library, determine whether there is a synonym in the text message.
And step S210, when judging that the synonym exists, carrying out synonym replacement on the text information.
In this embodiment, whether a preset synonym exists in each piece of text information of the entity to be extracted can be judged by querying the preset synonym library. When a synonym exists, synonym replacement is performed on the text information: for example, "payment date" and "settlement date" are replaced by "charging date". Replacing synonyms in each piece of text information allows the named entities to be extracted more accurately later and reduces the influence of the long-tail effect. Specifically, a replacement vocabulary template base can be established in advance to record different expressions of the same replacement field; in practice, this template base can be continuously expanded based on the various bills entered online, improving its adaptability to replacement across diverse text information. In addition, a preset neural network model can be used to perform the synonym replacement.
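A dictionary lookup is one minimal way to realize the synonym-library query; the entries and the canonical field name "charging date" below merely echo the example above and are not an exhaustive library:

```python
SYNONYM_LIBRARY = {
    # Hypothetical entries: each variant maps to one canonical field name.
    "payment date": "charging date",
    "settlement date": "charging date",
}

def replace_synonyms(text_slices, library=SYNONYM_LIBRARY):
    """Replace each text slice found in the preset synonym library
    with its canonical form; leave other slices unchanged."""
    return [library.get(t.lower(), t) for t in text_slices]
```

In production the library would be the continuously expanded template base described above rather than a hard-coded dictionary.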
In some optional implementation manners of this embodiment, in step S206, after the step of merging the entity pairs whose determination result is that there is an association, the electronic device may further perform the following steps:
and generating a target judgment result of the reliability of the merged text based on a preset judgment rule.
In this embodiment, the preset rules may be set according to the application scenario, for example: judging whether the patient name is consistent with the name of the emergency contact; judging whether the multiple extracted dates agree; and judging whether the sum of the expense items matches the total amount. If the checks pass, the determination result may be true and a confidence level of 1 may be assigned.
In the embodiment, the high-confidence field can be ignored during manual verification by generating the target judgment result of the confidence level of the merged text, so that the manual verification cost is reduced.
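Such rule checks can be sketched as follows; the field names, the all-or-nothing confidence, and the reading that repeated dates must agree are illustrative assumptions:

```python
def rule_check(record):
    """Illustrative preset judgment rules for a merged medical-bill record:
    the patient name matches the emergency contact's name, all extracted
    dates agree, and the itemized amounts sum to the stated total."""
    checks = [
        record["patient_name"] == record["emergency_name"],
        len(set(record["dates"])) == 1,                         # dates agree
        abs(sum(record["item_amounts"]) - record["total"]) < 0.01,
    ]
    return 1.0 if all(checks) else 0.0
```

A record passing every rule is assigned confidence 1 and can be skipped during manual verification.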
In some optional implementations of this embodiment, after the step in S206 of merging the entity pairs whose determination result is that an association exists, the electronic device may further perform the following steps:
and inputting the bill image into the recognition positioning model for text recognition to obtain a text confidence rate.
And inputting the text information into a multi-modal transformer model for named entity extraction to obtain a naming confidence rate.
And inputting the entity pairs into the association judgment model to judge whether an association exists, so as to obtain a judgment confidence rate.
And performing averaging or weighted averaging on at least two of the text confidence rate, the naming confidence rate and the judgment confidence rate to generate a target judgment result of the credibility of the merged text.
And when the target judgment result exceeds a preset threshold value, directly outputting the merged text.
In this embodiment, the confidence rates of multiple models may be combined to generate a target judgment result of the text credibility. In one embodiment, besides combining at least two of the text confidence rate, the naming confidence rate and the judgment confidence rate, the confidence rates output by the other models used in the bill text recognition process, such as the semantic segmentation model and the direction classification model mentioned in the above embodiments, may also be included when assigning credibility.
In this embodiment, by generating a target judgment result of the reliability of the merged text, the merged text may be output directly when the target judgment result (the reliability) exceeds a preset threshold. When it does not exceed the preset threshold, a verification-result trigger interface may subsequently be displayed to the user, so that the user can make a corresponding trigger selection after further verifying the target judgment result: if a verification-passing signal triggered by the user is received, the merged text is output; if a verification-failing signal is received, the merged text is deleted and the two named entities before merging are output separately. In this way, fields with high reliability can be skipped during subsequent manual verification, reducing the manual verification cost. In addition, for fields that cannot be determined based on the preset judgment rules, reliability may be assigned by averaging or weighted averaging at least the text confidence rate, the naming confidence rate and the judgment confidence rate output by the recognition positioning model, the multi-modal transformer model and the association judgment model, respectively.
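The confidence fusion and threshold check described above can be sketched as follows. The equal default weights and the 0.9 threshold are illustrative assumptions; this application only specifies that at least two confidence rates are averaged or weighted-averaged and compared against a preset threshold.

```python
def fuse_confidences(scores, weights=None, threshold=0.9):
    """Average (or weighted-average) at least two model confidence rates
    and decide whether the merged text can be output without manual review.
    `scores` may hold the text, naming and judgment confidence rates, and
    optionally confidences from other models (segmentation, direction)."""
    assert len(scores) >= 2, "at least two confidence rates are fused"
    if weights is None:
        fused = sum(scores) / len(scores)          # plain average
    else:
        fused = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return fused, fused > threshold                # (reliability, auto-output?)
```

When the returned flag is false, the merged text would instead go to the user-facing verification interface described above.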
It should be emphasized that, in order to further ensure the privacy and security of the bill image, text information and the like, the bill image and text information may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiments of the present application may acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing and machine learning/deep learning. Those skilled in the art will understand that all or part of the processes of the methods of the above embodiments can be implemented by instructing relevant hardware with computer readable instructions, which can be stored in a computer readable storage medium; when executed, the instructions may perform the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential: they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
With further reference to fig. 5, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a ticket text recognition apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is specifically applicable to various electronic devices.
As shown in fig. 5, the ticket text recognition apparatus 400 according to the present embodiment includes: an acquisition module 401, an identification module 402, an extraction module 403, a construction module 404, a judgment module 405, and a merging module 406. Wherein:
the acquiring module 401 is used for acquiring a bill image to be identified.
The recognition module 402 is configured to perform text recognition on the ticket image through a preset recognition positioning model to obtain text information and corresponding layout information.
The extracting module 403 is configured to perform named entity extraction on the text information through a preset multi-modal transformer model to obtain a plurality of corresponding named entities.
The building module 404 is configured to build a candidate set of entity pairs based on a preset pairing rule in combination with a plurality of named entities and layout information.
The judging module 405 is configured to judge whether each entity pair has an association through a preset association judgment model.
The merging module 406 is configured to merge the entity pairs whose judgment result is that an association exists, so as to obtain a merged text.
In this embodiment, the text information may be used directly as the input of the preset multi-modal transformer model; alternatively, the bill image, the text information and the layout information may all be input into the preset multi-modal transformer model, which outputs the plurality of named entities. Inputting these three kinds of information into the preset multi-modal transformer model provides reference information of more dimensions for entity extraction, thereby improving the accuracy of entity extraction.
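As a rough illustration of how the three kinds of information might be fused before entering a multi-modal transformer, the sketch below sums per-token text, layout and image-patch embeddings, in the style of document-understanding transformers. The random projection standing in for a learned 2-D position embedding is purely illustrative; a real model learns this mapping, and the actual fusion used by this application is not specified at this level of detail.

```python
import numpy as np

def build_multimodal_inputs(token_emb, bbox, image_patch_emb):
    """Fuse text, layout and image features into one input sequence.
    token_emb:       (n_tokens, d) text embeddings
    bbox:            (n_tokens, 4) per-token (x0, y0, x1, y1) coordinates
    image_patch_emb: (n_tokens, d) visual features aligned to the tokens"""
    rng = np.random.default_rng(0)
    d = token_emb.shape[1]
    W = rng.standard_normal((4, d)) * 0.02   # toy stand-in for a learned 2-D embedding
    layout_emb = bbox @ W
    return token_emb + layout_emb + image_patch_emb
```

The fused sequence would then be fed to the transformer encoder, whose token-level outputs are decoded into named-entity labels.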
The bill text recognition device can preset different pairing rules for different application scenarios. Taking a medical bill as an example, entity pairs that have the same entity type, such as name, quantity, unit or amount, and overlap in the X-axis direction may be added to the candidate relationship set; entity pairs of different fee-item entity types that are close in the Y-axis direction may also be added to the candidate relationship set.
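The two pairing rules above can be sketched as follows. The bounding-box representation, the Y-distance threshold and the sample entities are illustrative assumptions; real thresholds would be tuned per bill layout.

```python
def build_candidate_pairs(entities):
    """Each entity is (text, entity_type, (x0, y0, x1, y1)).
    Rule A: same-type entities that overlap on the X axis (same column).
    Rule B: different-type fee-item entities close on the Y axis (same row)."""
    pairs = []
    for i, (ta, ka, ba) in enumerate(entities):
        for tb, kb, bb in entities[i + 1:]:
            x_overlap = min(ba[2], bb[2]) - max(ba[0], bb[0]) > 0
            y_close = abs(ba[1] - bb[1]) < 10   # illustrative pixel threshold
            if ka == kb and x_overlap:
                pairs.append((ta, tb))          # Rule A: column alignment
            elif ka != kb and y_close:
                pairs.append((ta, tb))          # Rule B: row proximity
    return pairs
```

Each candidate pair would then be passed to the association judgment model, which decides whether the two fragments belong to one logical field.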
In this embodiment, text recognition is performed on the bill image, entity extraction is performed on the recognized text information, a candidate set of entity pairs is constructed by combining the extracted named entities with the layout information, and entity pairs judged to have an association are merged. In this way, highly associated text information can be merged effectively, improving the accuracy of text recognition when the bill is folded, when some fields wrap across lines, and in similar situations.
In some optional implementations of this embodiment, the bill text recognition device further includes a deflection determination module and a rotation module.
And the deflection judging module is used for judging whether the bill image has deflection or not.
And the rotation module is used for rotating the bill image if deflection exists, so as to obtain an upright bill image.
By rotating the bill image, this embodiment reduces the requirement on the orientation of the acquired bill image and avoids text recognition errors caused by incorrectly oriented bills, improving the applicability of the bill text recognition method and the accuracy of overall text recognition.
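When the deflection detected (for example, by the direction classification model mentioned elsewhere in this application) is a multiple of 90 degrees, the rotation step can be sketched as below. Arbitrary small-angle deskew would instead need an affine warp (e.g. OpenCV's `warpAffine`); that variant is not shown here.

```python
import numpy as np

def deskew(image, quadrant):
    """Rotate the bill image by the detected deflection (0, 90, 180 or 270
    degrees, as a direction classifier would output) so the text reads
    upright. `image` is an H x W (x C) array; rotation is counterclockwise."""
    assert quadrant in (0, 90, 180, 270)
    return np.rot90(image, k=quadrant // 90)
```

The upright image is then handed to the recognition positioning model for text recognition.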
In some optional implementations of this embodiment, the ticket text recognition device further includes a preprocessing module, a setting module, and a filling module.
The preprocessing module is used for inputting the bill image into a preset semantic segmentation model to obtain a corresponding mask image.
The setting module is used for extracting the boundary of the connected domain of the mask image and setting the minimum circumscribed rectangular area of the boundary.
And the filling module is used for performing white filling processing on the area outside the rectangular area of the bill image.
In this embodiment, the bill image is preprocessed through the preset semantic segmentation model, which can remove background information in the bill image and split cases where one image contains multiple bills, facilitating subsequent text recognition and improving the accuracy of overall text recognition.
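The white-filling step can be sketched as follows, under a simplifying assumption: this sketch uses the minimal axis-aligned bounding rectangle of the mask, whereas the application describes the minimum circumscribed rectangle of the connected-domain boundary, which may be rotated.

```python
import numpy as np

def whiten_outside_ticket(image, mask):
    """Keep only the ticket region. `mask` is the binary output of the
    semantic segmentation model (1 = ticket pixel). Everything outside the
    bounding rectangle of the mask is filled with white (255)."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    out = np.full_like(image, 255)                       # all-white canvas
    out[y0:y1 + 1, x0:x1 + 1] = image[y0:y1 + 1, x0:x1 + 1]
    return out
```

With a rotated rectangle, one would instead rasterize the rectangle into a fill mask and whiten pixels outside it; the axis-aligned case above conveys the same idea.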
In some optional implementations of this embodiment, the ticket text recognition apparatus further includes a replacement determination module and a replacement module.
And the replacement judging module is used for judging whether the synonym exists in the text information by inquiring a preset synonym library.
And the replacing module is used for replacing synonyms for the text information when judging that the synonyms exist.
In this embodiment, a synonym library may be pre-established to record different expressions of the same field to be replaced; alternatively, synonym replacement may be performed on the text information using a preset neural network model.
By standardizing the keywords through replacement, this embodiment allows named entities to be extracted more accurately in subsequent steps and reduces the influence of the long-tail effect.
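The library-based variant of this normalization can be sketched as follows; the sample entries are illustrative, as a real synonym library would be built per business domain.

```python
# Illustrative synonym library: variant expression -> canonical field name.
SYNONYMS = {
    "amt.": "amount",
    "patient nm": "patient name",
    "reg. fee": "registration fee",
}

def normalize_text(text, synonyms=SYNONYMS):
    """Replace known synonymous expressions with their canonical form
    before entity extraction. Longer variants are replaced first so that
    a short variant never clobbers part of a longer one."""
    for variant, canonical in sorted(synonyms.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(variant, canonical)
    return text
```

The normalized text is then passed to the multi-modal transformer model for named entity extraction.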
In some optional implementation manners of this embodiment, the bill text recognition device further includes a first calculation module, configured to generate a target judgment result of the reliability of the merged text based on a preset judgment rule.
By generating a target judgment result of the reliability of the merged text, this embodiment allows fields with high reliability to be skipped during subsequent manual verification, reducing the manual verification cost.
In some optional implementation manners of this embodiment, the ticket text recognition device further includes a text obtaining module, a name obtaining module, a judgment obtaining module, a second calculating module, and a text output module.
And the text acquisition module is used for inputting the bill image into the recognition positioning model to perform text recognition so as to obtain the text confidence rate.
And the naming acquisition module is used for inputting the text information into the multi-modal transformer model to extract the named entities to obtain the naming confidence rate.
And the judgment acquisition module is used for inputting the entity pairs into the association judgment model to judge whether an association exists, so as to obtain a judgment confidence rate.
And the second calculation module is used for averaging or weighted averaging at least two of the text confidence rate, the naming confidence rate and the judgment confidence rate to generate a target judgment result of the reliability of the merged text.
And the text output module is used for directly outputting the merged text when the target judgment result exceeds a preset threshold value.
In this embodiment, for fields that cannot be determined based on the preset judgment rules, reliability may be assigned by averaging or weighted averaging the confidence rates output by the recognition positioning model, the multi-modal transformer model, the association judgment model, and the like. In one embodiment, the plurality of models may further include the semantic segmentation model, the direction classification model and the like mentioned in the above embodiments.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 6, fig. 6 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62 and a network interface 63, which are communicatively connected to each other via a system bus. It is noted that only a computer device 6 having the components 61-63 is shown in the figure, but it should be understood that not all of the shown components are required; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing an operating system installed in the computer device 6 and various types of application software, such as computer readable instructions of a bill identification method. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute computer readable instructions stored in the memory 61 or process data, such as computer readable instructions for executing the bill identification method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
In this application, text recognition is performed on the bill image, entity extraction is performed on the recognized text information, a candidate set of entity pairs is constructed by combining the extracted named entities with the layout information, each entity pair is judged for association, and associated entity pairs are merged. In this way, highly associated text information can be merged effectively, improving the accuracy of text recognition when the bill is folded, when some fields wrap across lines, and in similar situations.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the ticket identification method as described above.
In this application, text recognition is performed on the bill image, entity extraction is performed on the recognized text information, a candidate set of entity pairs is constructed by combining the extracted named entities with the layout information, each entity pair is judged for association, and associated entity pairs are merged. In this way, highly associated text information can be merged effectively, improving the precision of bill text recognition when the bill is folded, when some fields wrap across lines, and in similar situations.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and that the appended drawings show preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. Any equivalent structure made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, falls within the protection scope of the present application.

Claims (10)

1. A bill text recognition method is characterized by comprising the following steps:
acquiring a bill image to be identified;
performing text recognition on the bill image through a preset recognition positioning model to obtain text information and corresponding layout information;
conducting named entity extraction on the text information through a preset multi-modal transformer model to obtain a plurality of corresponding named entities;
based on a preset pairing rule, combining the plurality of named entities and the layout information to construct a candidate set of entity pairs;
judging whether each entity pair has association or not through a preset association judgment model;
and merging the entity pairs with the judgment results of the existence of the association to obtain a merged text.
2. The method for recognizing bill text according to claim 1, further comprising, before the step of performing text recognition on the bill image by a preset recognition positioning model:
judging whether the bill image has deflection or not;
and if the bill image has deflection, rotating the bill image to obtain an upright bill image.
3. The bill text recognition method according to claim 1 or 2, further comprising, before the step of text recognition of the bill image by a preset recognition positioning model:
inputting the bill image into a preset semantic segmentation model to obtain a corresponding mask image;
extracting the boundary of the connected domain of the mask image, and setting the minimum circumscribed rectangular area of the boundary;
and carrying out white filling processing on the area outside the rectangular area of the bill image.
4. The bill text recognition method according to claim 1 or 2, wherein the step of conducting named entity extraction on the text information through a preset multi-modal transformer model is preceded by the following steps:
judging whether synonyms exist in the text information or not by inquiring a preset synonym library;
and when judging that the synonym exists, carrying out synonym replacement on the text information.
5. The method for bill text recognition according to claim 1 or 2, wherein the step of conducting named entity extraction on the text information through a preset multi-modal transformer model specifically comprises:
and inputting the bill image, the text information and the layout information into the preset multi-modal transformer model to extract the named entities to obtain the named entities.
6. The method for identifying the bill text according to claim 1 or 2, wherein after the step of combining the entity pairs whose determination result is that there is an association to obtain the combined text, the method further comprises:
and generating a target judgment result of the reliability of the merged text based on a preset judgment rule.
7. The method for recognizing bill texts according to claim 1 or 2, wherein after the step of merging the entity pairs whose determination result is that there is an association, the method further comprises:
inputting the bill image into the recognition positioning model for text recognition to obtain a text confidence rate;
inputting the text information into the multi-modal transformer model for named entity extraction to obtain a named confidence rate;
inputting the entity pairs into the association judgment model to judge whether an association exists, so as to obtain a judgment confidence rate;
averaging or weighted averaging at least two of the text confidence rate, the naming confidence rate and the judgment confidence rate to generate a target judgment result of the reliability of the merged text; and
and when the target judgment result exceeds a preset threshold value, directly outputting the merged text.
8. A bill identifying apparatus, comprising:
the acquisition module is used for acquiring a bill image to be identified;
the recognition module is used for performing text recognition on the bill image through a preset recognition positioning model to obtain text information and corresponding layout information;
the extraction module is used for extracting named entities from the text information through a preset multi-modal transformer model to obtain a plurality of corresponding named entities;
the construction module is used for constructing a candidate set of entity pairs by combining the plurality of named entities and the layout information based on a preset pairing rule;
the judging module is used for judging whether each entity pair has association or not through a preset association judging model;
and the merging module is used for merging the entity pairs with the judgment result of association to obtain a merged text.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor which when executed implements the steps of the ticket text recognition method of any of claims 1 to 7.
10. A computer readable storage medium, having computer readable instructions stored thereon, which when executed by a processor, implement the steps of the ticket text recognition method of any of claims 1 to 7.
CN202211559252.9A 2022-12-06 2022-12-06 Bill text recognition method and device, computer equipment and storage medium Pending CN115984890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211559252.9A CN115984890A (en) 2022-12-06 2022-12-06 Bill text recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115984890A true CN115984890A (en) 2023-04-18

Family

ID=85974926

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination