CN112613367A

CN112613367A - Bill information text box acquisition method, system, equipment and storage medium

Info

Publication number: CN112613367A
Application number: CN202011471091.9A
Authority: CN
Inventors: 王丹; 屈舜中
Original assignee: Pacific Century Bill Service Shenzhen Co ltd
Current assignee: Pacific Century Bill Service Shenzhen Co ltd
Priority date: 2020-12-14
Filing date: 2020-12-14
Publication date: 2021-04-06

Abstract

The invention discloses a method, a system, equipment and a storage medium for acquiring a bill information text box, wherein the method comprises the following steps: acquiring a bill picture to be identified, and identifying all text boxes contained in the bill picture to be identified; extracting the text box characteristics of each text box in all the text boxes; judging the prediction probability value of each text box as a target text box by adopting a preset detection model according to the text box characteristics of each text box; and determining the text box with the maximum prediction probability value as the target text box of the corresponding type. The method solves the problems that the extraction rule in the existing bill identification method is too complex, is difficult to maintain and is easy to generate rule conflict, improves the acquisition precision of the text box in bill identification and ensures the accuracy of bill information extraction.

Description

Bill information text box acquisition method, system, equipment and storage medium

Technical Field

The invention relates to the technical field of image recognition, in particular to a method, a system, equipment and a storage medium for acquiring a bill information text box.

Background

In the bill identification process, the characters on a bill picture are mainly recognized by means of OCR technology to obtain individual text blocks, and key information such as the bill number, the drawing date, the expiration date, the full name of the drawer, the full name of the payee, the name of the acceptor, the account number and opening bank of the acceptor, and the bill amount is extracted from the text blocks. The bills issued by each organization need to contain the same key content items, but their formats may differ, which increases the difficulty of extracting the key information of a bill from a picture.

Existing bill identification methods generally set extraction rules using regular expressions, the adjacency relations between text blocks, the cell relations of tables and the like, traverse each text block on the picture, and treat the text blocks that conform to the rules as the key information to be extracted. Such methods require complex rules to be configured, and their extraction accuracy depends on the accuracy of table recognition. For example, CN110427853A, titled "a method for extracting and processing intelligent ticket information", preprocesses a bill picture (cuts out the content area), classifies the bill, and sets different extraction rules for different bill types to extract the key information. Setting the extraction rules therefore requires manual summarization with a heavy workload, and a large number of samples must be accommodated before a complete rule set can be summarized. The resulting rule set is overly complex and difficult to maintain, and rule conflicts occur easily (for example, adding a new rule causes an old rule to fail).

Disclosure of Invention

The embodiment of the application provides a method, a system, equipment and a storage medium for acquiring a bill information text box, and aims to solve the problems that an extraction rule in the existing bill identification method is too complex, is difficult to maintain and is easy to generate rule conflict.

The embodiment of the application provides a method for acquiring a bill information text box, which comprises the following steps:

acquiring a bill picture to be identified, and identifying all text boxes contained in the bill picture to be identified;

extracting the text box characteristics of each text box in all the text boxes;

judging the prediction probability value of each text box as a target text box by adopting a preset detection model according to the text box characteristics of each text box; the preset detection model comprises a plurality of classification models of different types;

and determining the text box with the maximum prediction probability value as the target text box of the corresponding type.

In an embodiment, the acquiring a to-be-recognized ticket picture includes:

acquiring an original bill picture;

and preprocessing the original bill picture to obtain the bill picture to be identified.

In an embodiment, the identifying all text boxes included in the to-be-identified bill picture includes:

and acquiring four vertex coordinates of a rectangular area corresponding to each text message in the bill picture to be recognized, and connecting the four vertex coordinates according to a preset sequence to obtain a text box corresponding to each text message.

In an embodiment, the extracting the text box feature of each text box in all the text boxes includes:

acquiring a first text box characteristic of each text box in all text boxes and a second text box characteristic of a text box adjacent to each text box;

the first text box feature and the second text box feature of the text box adjacent to each text box are taken together as the text box feature.

In an embodiment, before the determining, by using a preset detection model, the prediction probability value of each text box as a target text box according to the text box feature of each text box, the method includes:

acquiring a plurality of training bill pictures, and preprocessing each training bill picture to obtain a plurality of preprocessed pictures;

identifying all training text boxes contained in the plurality of preprocessed pictures, and extracting the training text box characteristics of each training text box in all the training text boxes;

determining the type of each training text box according to the training text information in each training text box;

marking the type of each training text box, and associating the type of each training text box with the training text box characteristics of each training text box;

selecting a first preset number of first positive sample data and a second preset number of first negative sample data from all the training text boxes;

and training an original classification model of the same type as the first positive sample data by adopting the first positive sample data of the first preset quantity and the first negative sample data of the second preset quantity to obtain a classification model of the same type as the first positive sample data.

In an embodiment, the selecting a first preset number of first positive sample data and a second preset number of first negative sample data from all training textboxes includes:

labeling, with a first label, the training text box features of those training text boxes, among all the training text boxes, that are of the same type and equal in number to the preprocessed pictures, to obtain labeled training text boxes equal in number to the preprocessed pictures;

determining the marked training text boxes with the same number as the preprocessed pictures as the first positive sample data with the first preset number;

marking the training text box characteristics of a second preset number of training text boxes with different types in all the training text boxes by adopting a second label to obtain the marked training text boxes with the second preset number;

and determining the labeled training text boxes with the second preset number as the first negative sample data with the second preset number.
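
The sample selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary layout of a training text box, the function name `select_samples`, the use of 1 and 0 as the first and second labels, and the random choice of negatives are all assumptions made for the example.

```python
import random

def select_samples(training_boxes, target_type, negative_count, seed=0):
    """Split training text boxes into labeled positive and negative samples.

    Positives: every box of target_type (one per preprocessed picture in the
    example below), labeled 1 (assumed "first label").
    Negatives: negative_count boxes of other types, labeled 0 (assumed
    "second label"), picked at random with a fixed seed.
    """
    positives = [(b["features"], 1) for b in training_boxes if b["type"] == target_type]
    others = [b for b in training_boxes if b["type"] != target_type]
    rng = random.Random(seed)
    chosen = rng.sample(others, min(negative_count, len(others)))
    negatives = [(b["features"], 0) for b in chosen]
    return positives, negatives

# Two hypothetical preprocessed pictures, each containing three typed boxes.
training_boxes = [
    {"picture": 1, "type": "drawer", "features": [0.1]},
    {"picture": 1, "type": "account", "features": [0.2]},
    {"picture": 1, "type": "date", "features": [0.3]},
    {"picture": 2, "type": "drawer", "features": [0.4]},
    {"picture": 2, "type": "account", "features": [0.5]},
    {"picture": 2, "type": "date", "features": [0.6]},
]
positives, negatives = select_samples(training_boxes, "drawer", negative_count=3)
```

With two pictures, the drawer type yields two positive samples, matching the statement that the number of positive samples equals the number of preprocessed pictures.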

In an embodiment, the method for acquiring a ticket information text box further includes:

when the text box feature of at least one text box contained in the bill picture to be recognized cannot be recognized, predicting the type of the unrecognizable text box to obtain the type of its text box feature, and selecting, from the bill pictures to be recognized, the text boxes that are of the same type and equal in number to the bill pictures to be recognized, together with a third preset number of text boxes of different types;

increasing the number of the unidentifiable text boxes with the same type and the same number as the number of the to-be-identified bill pictures to a fourth preset number, labeling the text box characteristics of the unidentifiable text boxes with the fourth preset number by adopting the first label to obtain second positive sample data with the fourth preset number, and labeling the text box characteristics of the text boxes with the third preset number by adopting the second label to obtain second negative sample data with the third preset number;

and retraining the original classification model of any type according to the first negative sample data of the second preset quantity, the second negative sample data of the third preset quantity and the second positive sample data of the fourth preset quantity to obtain the classification model of the same type as the second positive sample data.

In addition, in order to achieve the above object, the present invention further provides a system for acquiring a ticket information text box, including:

the image acquisition module is used for acquiring a bill picture to be identified and identifying all text boxes contained in the bill picture to be identified;

the feature extraction module is used for extracting the text box features of each text box in all the text boxes;

the text box prediction module is used for judging the prediction probability value of each text box as a target text box by adopting a preset detection model according to the text box characteristics of each text box; the preset detection model comprises a plurality of classification models of different types;

and the text box determining module is used for determining the text box with the maximum prediction probability value as the target text box of the corresponding type.

In addition, in order to achieve the above object, the present invention further provides equipment for acquiring a bill information text box, including: a memory, a processor and a bill information text box acquisition program which is stored on the memory and can run on the processor, wherein the bill information text box acquisition program, when executed by the processor, implements the steps of the above bill information text box acquisition method.

Further, to achieve the above object, the present invention also provides a storage medium having stored thereon a ticket information text box acquiring program which, when executed by a processor, realizes the steps of the above ticket information text box acquiring method.

The technical scheme of the method, the system, the equipment and the storage medium for acquiring the bill information text box provided by the embodiment of the application at least has the following technical effects or advantages:

by acquiring a bill picture to be recognized, recognizing all text boxes contained in it, extracting the text box feature of each text box, using the preset detection model to determine from those features the prediction probability value of each text box being a target text box, and determining the text box with the maximum prediction probability value as the target text box of the corresponding type, the technical scheme solves the problems that the extraction rules in existing bill recognition methods are overly complex, difficult to maintain and prone to rule conflicts. It thereby improves the precision with which text boxes are acquired in bill recognition and helps ensure the accuracy of bill information extraction.

Drawings

FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for acquiring a ticket information textbox according to a first embodiment of the present invention;

FIG. 3 is a flowchart illustrating a method for acquiring a ticket information textbox according to a second embodiment of the present invention;

FIG. 4 is a schematic diagram of the layout of a portion of a text box;

FIG. 5 is a flowchart illustrating a method for acquiring a ticket information textbox according to a third embodiment of the present invention;

FIG. 6 is a flowchart illustrating a method for acquiring a ticket information textbox according to a fourth embodiment of the present invention;

FIG. 7 is a functional block diagram of the ticket information text box acquiring system of the present invention.

Detailed Description

For a better understanding of the above technical solutions, exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The invention provides a bill information text box acquisition device. FIG. 1 is a schematic structural diagram of the hardware operating environment of this device according to an embodiment of the present invention.

As shown in fig. 1, the ticket information text box acquiring apparatus may include: a processor 1001 (such as a CPU), a memory 1005, a user interface 1003, a network interface 1004 and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Optionally, the ticket information textbox acquiring device may further include an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like.

Those skilled in the art will appreciate that the ticket information text box acquiring device structure shown in fig. 1 is not intended to be limiting of the ticket information text box acquiring device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a ticket information text box acquiring program. The operating system is a program for managing and controlling hardware and software resources of the ticket information text box acquiring device, the ticket information text box acquiring program, and the operation of other software or programs.

In the ticket information textbox acquiring apparatus shown in fig. 1, the user interface 1003 is mainly used for connecting a terminal and performing data communication with the terminal; the network interface 1004 is mainly used for connecting a background server and performing data communication with the background server; the processor 1001 may be used to invoke a ticket information text box acquisition program stored in the memory 1005.

In this embodiment, the ticket information text box acquiring apparatus includes: a memory 1005, a processor 1001 and a ticket information text box acquisition program stored on the memory and executable on the processor, wherein:

when the processor 1001 calls the ticket information text box acquisition program stored in the memory 1005, the following operations are performed:

extracting the text box characteristics of each text box in all the text boxes;

When the processor 1001 calls the ticket information text box acquisition program stored in the memory 1005, the following operations are also performed:

acquiring an original bill picture;

It should be noted that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that shown. The method for acquiring a ticket information text box is applied to the recognition of text boxes in a picture.

As shown in fig. 2, in a first embodiment of the present application, a method for acquiring a ticket information text box of the present application includes the following steps:

step S210: acquiring a bill picture to be identified, and identifying all text boxes contained in the bill picture to be identified.

In this embodiment, the bill picture to be recognized is the original bill picture after preprocessing and can be used directly as the processing object. After the bill picture to be recognized is acquired, it is traversed, and a corresponding text box is drawn one by one for each region of the picture that contains text information, so that the text information is contained in the text box; each text box is rectangular.

Further, as shown in fig. 3, in the second embodiment of the present application, step S210 specifically includes:

step S211: and acquiring an original bill picture.

The original bill picture may be an online bill screenshot, a PDF file of a bill or the like, or an offline bill photograph, a scanned copy of a bill or the like.

Step S212: and preprocessing the original bill picture to obtain the bill picture to be identified.

After the original bill picture is obtained, it is preprocessed, where the preprocessing is noise reduction processing and/or restoration processing. Noise reduction removes the stray pixels that cover the regions containing text information and the regions not containing text information in the original bill picture. Restoration repairs the missing or damaged areas in the original bill picture. Preprocessing the original bill picture helps improve the accuracy of text box recognition.
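
As an illustration of the noise reduction step, the sketch below removes isolated stray pixels with a 3x3 median filter. The patent does not name a denoising algorithm, so the median filter, the function name `median_denoise` and the representation of a picture as a list of rows of grayscale values are assumptions made for the example.

```python
from statistics import median

def median_denoise(image):
    """Return a copy of a grayscale image (list of rows of ints) in which each
    pixel is replaced by the median of its 3x3 neighbourhood.
    Coordinates outside the image are clamped to the nearest edge pixel."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = int(median(window))
    return out

# A white 5x5 patch with one isolated dark stray pixel in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
clean = median_denoise(noisy)
```

In this toy input the single dark pixel is outvoted by its eight white neighbours, so the filtered patch is uniformly white again.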

Step S213: and acquiring four vertex coordinates of a rectangular area corresponding to each text message in the bill picture to be recognized, and connecting the four vertex coordinates according to a preset sequence to obtain a text box corresponding to each text message.

After the bill picture to be recognized is obtained, the regions of the picture that do not contain text information are hidden and each region containing text information is displayed as a rectangle. The four vertex coordinates of each region containing text information are then obtained, and the four vertices are connected in sequence according to the preset sequence to obtain the text box corresponding to each piece of text information. The preset sequence may be clockwise or counterclockwise. For example, if the four vertices of a region containing text information are the upper left point A, the upper right point B, the lower right point C and the lower left point D, they are connected in the order A-B-C-D-A or D-C-B-A-D to obtain the text box.
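
The vertex connection in step S213 can be sketched as below. The function names and the tuple representation of a vertex are hypothetical, and the sketch assumes an axis-aligned rectangle in image coordinates, where y grows downward.

```python
def order_clockwise(vertices):
    """Order four (x, y) vertices as A (upper left), B (upper right),
    C (lower right), D (lower left) in image coordinates."""
    by_y = sorted(vertices, key=lambda p: (p[1], p[0]))
    top = sorted(by_y[:2])      # two smallest y values: upper edge, left to right
    bottom = sorted(by_y[2:])   # two largest y values: lower edge, left to right
    a, b = top
    d, c = bottom
    return [a, b, c, d]

def text_box_outline(vertices, clockwise=True):
    """Return the closed outline A-B-C-D-A, or D-C-B-A-D when clockwise is False."""
    a, b, c, d = order_clockwise(vertices)
    return [a, b, c, d, a] if clockwise else [d, c, b, a, d]

# Vertices supplied in arbitrary order.
outline = text_box_outline([(40, 30), (10, 10), (40, 10), (10, 30)])
reverse_outline = text_box_outline([(40, 30), (10, 10), (40, 10), (10, 30)], clockwise=False)
```

Sorting first by row and then by column makes the result independent of the order in which the detector reports the four vertices.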

Step S220: and extracting the text box characteristics of each text box in all the text boxes.

In this embodiment, the text box feature of each text box is the combination of the feature of the text box itself and the features of the adjacent text boxes. It serves as the input of the preset detection model, which determines whether the text box is the target text box of the corresponding type.

Further, step S220 specifically includes: acquiring a first text box feature of each text box in all text boxes and a second text box feature of a text box adjacent to each text box, and taking the first text box feature and the second text box feature of the text box adjacent to each text box as the text box features.

The text box feature of each text box is composed of the sub-features of at least five text boxes, namely the first text box feature of the text box itself and the second text box feature of each text box adjacent to it. If the text box feature is composed of the sub-features of exactly five text boxes, it consists of the first text box feature of the text box and the second text box features of its adjacent left, right, upper and lower text boxes. The left, right, upper and lower text boxes are, in each direction, the text box that overlaps the text box in question and has the smallest center distance to it. As shown in fig. 4, the text boxes corresponding to text information 10, 11, 12, 20 and 30 are text box 10, text box 11, text box 12, text box 20 and text box 30, respectively. In the horizontal direction, text boxes 20, 21, 22, 30, 31 and 32 all overlap text box 10, but the center of text box 20 is closer to the center of text box 10 than the centers of text boxes 21 and 22 are, and the center of text box 30 is closer to the center of text box 10 than the centers of text boxes 31 and 32 are. Text boxes 20 and 30 are therefore selected, so that the text box feature of text box 10 is the combination of the first text box feature of text box 10 and the second text box features of text boxes 11, 12, 20 and 30. The manner of determining the adjacent text boxes in the vertical direction is the same as that in the horizontal direction.
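
The neighbor selection just described can be sketched as follows. The sketch assumes that "overlap" means the two boxes share part of the same row (for left and right neighbors) or the same column (for upper and lower neighbors); that reading, the `(x0, y0, x1, y1)` box layout and all function names are assumptions made for the example.

```python
from math import hypot

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def spans_overlap(a0, a1, b0, b1):
    """True when the intervals [a0, a1] and [b0, b1] share more than a point."""
    return min(a1, b1) > max(a0, b0)

def nearest_neighbors(target, others):
    """Return the nearest left/right/upper/lower box by center distance,
    or None for a direction in which no box overlaps the target."""
    cx, cy = center(target)
    best = {"left": None, "right": None, "upper": None, "lower": None}
    best_dist = dict.fromkeys(best, float("inf"))
    for box in others:
        bx, by = center(box)
        same_row = spans_overlap(target[1], target[3], box[1], box[3])
        same_col = spans_overlap(target[0], target[2], box[0], box[2])
        if same_row and bx != cx:
            side = "left" if bx < cx else "right"
        elif same_col and by != cy:
            side = "upper" if by < cy else "lower"
        else:
            continue
        dist = hypot(bx - cx, by - cy)
        if dist < best_dist[side]:
            best[side], best_dist[side] = box, dist
    return best

target = (100, 100, 200, 120)
others = [
    (210, 100, 300, 120),  # right, near
    (310, 100, 400, 120),  # right, far
    (100, 60, 200, 80),    # upper
    (100, 140, 200, 160),  # lower, near
    (100, 180, 200, 200),  # lower, far
]
neighbors = nearest_neighbors(target, others)
```

In this hypothetical layout the target has no left neighbor, so that direction stays `None` and would later be filled with zeros.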

Specifically, a text box has at least 9 sub-features of its own, which constitute the first text box feature, and each adjacent text box has at least 10 sub-features, which constitute a second text box feature. Since the text box feature of each text box is composed of the sub-features of at least five text boxes, it includes at least 49 sub-features (9 + 4 × 10). If the text box feature consists of the sub-features of exactly five text boxes, it consists of 49 sub-features: the text box itself has 9 sub-features (a 10th sub-feature, an 11th sub-feature, ...), and each adjacent text box has 10 sub-features (a 20th sub-feature, a 21st sub-feature, ...). It should be noted that the text boxes adjacent to the text box may further include those at its upper left, lower left, upper right and lower right corners, in which case the text box feature of each text box is composed of at least 89 sub-features (9 + 8 × 10). More adjacent text boxes may also be added, which is not specifically limited herein. The more text boxes are included, the more sub-features the text box feature of each text box contains, and the more accurately the target text box is identified.
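
The sub-feature counts above follow from simple arithmetic, which the short sketch below reproduces. The constant and function names are hypothetical.

```python
OWN_SUBFEATURES = 9        # sub-features of the text box itself
NEIGHBOR_SUBFEATURES = 10  # sub-features contributed by each adjacent text box

def feature_length(neighbor_count):
    """Length of the combined text box feature for a given number of neighbors."""
    return OWN_SUBFEATURES + NEIGHBOR_SUBFEATURES * neighbor_count

four_way = feature_length(4)   # left, right, upper, lower
eight_way = feature_length(8)  # plus the four corner neighbors
```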

The 9 sub-features of the first text box feature and the first 9 sub-features of the second text box feature are of the same types, namely: the ratio of the horizontal coordinate of each vertex of the text box to the width of the bill picture to be recognized; the ratio of the vertical coordinate of each vertex of the text box to the height of the bill picture to be recognized; the ratio of the width of the text box to the width of the bill picture to be recognized; the ratio of the height of the text box to the height of the bill picture to be recognized; the ratio of the position in the character set of the first character of the text information to the total number of characters in the character set; the same ratio for the second character of the text information; the same ratio for the last character of the text information; the same ratio for the second-to-last character of the text information; and the ratio of the total number of characters in the text information to a preset value. When the text information has fewer than 2 leading or trailing characters, 0 is used as padding so that 2 leading and 2 trailing values are always available; the preset value is preferably 100, but may be another value.

The 10th sub-feature of the second text box feature is the ratio of the shortest distance between the text box and the adjacent text box to the larger of the height and the width of the bill picture to be recognized. Referring to fig. 4, if the width of the text box 10 is greater than its height, the 10th sub-features are the ratios of the shortest distances between text box 10 and text boxes 11, 12, 20 and 30, respectively, to the width of the text box 10. It should be noted that, if a text box has no adjacent text box in a certain direction, or has no adjacent text boxes at all, 0 is used in place of the missing text box, so that the second text box feature of each missing adjacent text box is 0.
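
A minimal sketch of how the nine sub-features of a text box and the ten sub-features of an adjacent text box might be computed is given below. Several choices are assumptions made for the example: only the upper left vertex is used for the coordinate ratios, character positions are counted from 1 so that 0 can serve as padding, characters outside the character set also map to 0, and the distance is normalized by the larger dimension of the picture. The function names and the `(x0, y0, x1, y1)` box layout are hypothetical.

```python
def own_subfeatures(box, text, image_w, image_h, charset, length_norm=100):
    """Nine sub-features of one text box; box is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box

    def char_ratio(ch):
        # 1-based position in the character set divided by the set size;
        # 0.0 for padding (None) and, as an assumption, for unknown characters.
        if ch is None or ch not in charset:
            return 0.0
        return (charset.index(ch) + 1) / len(charset)

    head = (list(text[:2]) + [None, None])[:2]        # first and second character
    tail = (list(text[::-1][:2]) + [None, None])[:2]  # last and second-to-last
    return [
        x0 / image_w,
        y0 / image_h,
        (x1 - x0) / image_w,
        (y1 - y0) / image_h,
        char_ratio(head[0]),
        char_ratio(head[1]),
        char_ratio(tail[0]),
        char_ratio(tail[1]),
        len(text) / length_norm,
    ]

def box_gap(a, b):
    """Shortest distance between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5

def neighbor_subfeatures(target, neighbor, neighbor_text, image_w, image_h, charset):
    """Ten sub-features of an adjacent box; all zeros when the neighbor is missing."""
    if neighbor is None:
        return [0.0] * 10
    base = own_subfeatures(neighbor, neighbor_text, image_w, image_h, charset)
    return base + [box_gap(target, neighbor) / max(image_w, image_h)]

charset = "0123456789"  # toy character set for the example
own = own_subfeatures((100, 50, 300, 100), "42", 1000, 500, charset)
right = neighbor_subfeatures((100, 50, 300, 100), (350, 50, 450, 100), "9", 1000, 500, charset)
missing = neighbor_subfeatures((100, 50, 300, 100), None, "", 1000, 500, charset)
```

Concatenating `own` with the four neighbor vectors would give the 49-element text box feature used as model input.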

Step S230: and judging the prediction probability value of each text box as a target text box by adopting a preset detection model according to the text box characteristics of each text box.

In this embodiment, the preset detection model includes a plurality of classification models of different types. The classification models are obtained through pre-training, and each classification model stores comparison feature data for identifying text box features of the corresponding type. Specifically, the text box features of the text boxes are used as input data. After the input data are input into the preset detection model, the model uses each classification model to compare the text box features of the text boxes one by one, and outputs, for each text box, the prediction probability value of that text box being the target text box of the same type as the classification model. Referring to fig. 4, suppose for example that the text information 10 is a "drawer", the text information 11 is an "account" and the text information 12 is an "opening bank". After the text box features of text box 10, text box 11 and text box 12 are input into the preset detection model, the model applies the drawer type classification model, the account type classification model, the opening bank type classification model and the classification models of other types to the three text box features one by one. The drawer type classification model outputs a prediction probability value A1 of text box 10 being the drawer type text box, a prediction probability value A2 of text box 11 being the drawer type text box, and a prediction probability value A3 of text box 12 being the drawer type text box. Similarly, after the account type classification model, the opening bank type classification model and the classification models of other types compare the text box features of text boxes 10, 11 and 12, they output the prediction probability values of text boxes 10, 11 and 12 being the target text boxes of the corresponding types.
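
The structure of step S230, one classification model per type applied to every text box, can be sketched as below. The patent does not specify the kind of classifier, so the toy logistic models, their weights, the two-element feature vectors and the function names are all assumptions made for the example; a real text box feature would have at least 49 elements.

```python
from math import exp

def make_logistic_model(weights, bias):
    """Build a toy binary classifier: feature vector -> probability in (0, 1)."""
    def predict(features):
        score = bias + sum(w * f for w, f in zip(weights, features))
        return 1.0 / (1.0 + exp(-score))
    return predict

def predict_all(detection_model, box_features):
    """Return {box_type: {box_id: probability}} for every model and every box."""
    return {
        box_type: {box_id: model(feats) for box_id, feats in box_features.items()}
        for box_type, model in detection_model.items()
    }

# Hypothetical detection model holding one classifier per text box type.
detection_model = {
    "drawer": make_logistic_model([4.0, -4.0], 0.0),
    "account": make_logistic_model([-4.0, 4.0], 0.0),
}
# Toy features keyed by the text box numbers used in fig. 4.
box_features = {10: [1.0, 0.0], 11: [0.0, 1.0], 12: [0.5, 0.5]}
probabilities = predict_all(detection_model, box_features)
```

Each classifier scores every box independently, so a box receives one probability per type rather than a single multi-class label.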

Step S240: and determining the text box with the maximum prediction probability value as the target text box of the corresponding type.

In this embodiment, after the classification models of different types have compared the text box features of each text box and output the prediction probability value of each text box being the target text box of the same type as the classification model, the prediction probability values are ranked, and the text box with the highest prediction probability value is determined as the target text box of the same type as the classification model. For example, if the drawer type classification model outputs a prediction probability value A1 = 1 for text box 10, A2 = 0.75 for text box 11 and A3 = 0.5 for text box 12, each as the drawer type text box, then since A1 > A2 > A3, text box 10 is the drawer type text box.
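
Selecting the target text box in step S240 amounts to taking, for each type, the box with the largest predicted probability. The sketch below uses the drawer probabilities from the example above; the account probabilities and the function name are hypothetical.

```python
def select_target_boxes(probabilities):
    """For each box type, pick the box id with the highest predicted probability."""
    return {
        box_type: max(scores, key=scores.get)
        for box_type, scores in probabilities.items()
        if scores
    }

probabilities = {
    "drawer": {10: 1.0, 11: 0.75, 12: 0.5},   # values from the example in the text
    "account": {10: 0.1, 11: 0.9, 12: 0.3},   # hypothetical values
}
targets = select_target_boxes(probabilities)
```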

According to the technical scheme, the method comprises the steps of obtaining a bill picture to be recognized, recognizing all text boxes contained in the bill picture to be recognized, extracting the text box characteristics of each text box in all the text boxes, judging the prediction probability value of each text box serving as a target text box by adopting a preset detection model according to the text box characteristics of each text box, and determining the text box with the maximum prediction probability value as the target text box of the corresponding type, so that the obtaining precision of the text box in bill recognition is improved, and the extracting accuracy of bill information is ensured.

As shown in fig. 5, in the third embodiment of the present application, the ticket information text box acquiring method of the present application includes the following steps before step S230:

step S310: the method comprises the steps of obtaining a plurality of training bill pictures, and preprocessing each training bill picture to obtain a plurality of preprocessed pictures.

The plurality of training bill pictures are bill pictures of the same type; they may be online bill screenshots, bill PDF files and the like, or offline bill photographs, scanned copies of bills and the like. After noise reduction and/or restoration processing is performed on each training bill picture, preprocessed pictures equal in number to the training bill pictures are obtained.

Step S320: identifying all training text boxes contained in the plurality of preprocessed pictures, and extracting the training text box features of each training text box in all the training text boxes.

Each training text information area in each preprocessed picture is acquired, and a rectangular training text box is drawn for each training text information area, so that each piece of training text information is contained within one training text box. Further, after all the training text boxes are obtained, the training text box features of each training text box are extracted in the same way as the text box features of each text box are extracted in step S220, and the specific process is not repeated here.

Step S330: determining the type of each training text box according to the training text information in each training text box.

The type of each piece of training text information is determined according to its specific content, and that type is taken as the type of the training text box containing the training text information. For example, if the content of a piece of training text information in one of the preprocessed pictures is "account-opening bank", the training text information is of the account-opening bank type, and the training text box containing "account-opening bank" is also of the account-opening bank type.

Step S340: marking the type of each training text box, and associating the type of each training text box with the training text box features of each training text box.

The type of each training text box is marked with an identifier, such as a number, a letter, or a number-letter combination, and training text boxes of the same type are marked with the same identifier. For example, the "account-opening bank" type training text boxes in the preprocessed pictures are all marked with the same identifier, such as K1. Further, after the training text box features of each training text box are extracted, the features are marked in a similar way, using an identifier that shares a common part with the identifier of the training text box; for example, the identifier of the training text box feature of an "account-opening bank" type training text box is K1-10. The identifiers of the training text boxes are then associated with the identifiers of the training text box features, that is, training text boxes of a given type are associated with the training text box features of that type. Specifically, the type of the training text box corresponding to a training text box feature can be determined by recognizing the identifier of the training text box feature.
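The identifier scheme described here can be sketched as below. This is only an illustration under assumptions: the type names `opening_bank` and `drawer` are placeholders (with `opening_bank` standing for the type marked K1 in the text), the identifier K2 is invented for the example, and the suffix is assumed to be a running number of the box.

```python
from dataclasses import dataclass

# Hypothetical mapping from type name to type identifier; only K1 appears in the text.
TYPE_IDS = {"opening_bank": "K1", "drawer": "K2"}

@dataclass
class TrainingBox:
    box_no: int      # assumed running number of the training text box
    box_type: str    # type determined from the training text information
    features: list   # training text box features

def feature_identifier(box):
    """Build the feature identifier from the type identifier plus the box number."""
    return f"{TYPE_IDS[box.box_type]}-{box.box_no}"

def type_from_feature_identifier(identifier):
    """Recover the training text box type from a feature identifier."""
    type_id = identifier.split("-", 1)[0]
    reverse = {ident: name for name, ident in TYPE_IDS.items()}
    return reverse[type_id]

box = TrainingBox(box_no=10, box_type="opening_bank", features=[0.1, 0.2])
print(feature_identifier(box))                # K1-10
print(type_from_feature_identifier("K1-10"))  # opening_bank
```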

Step S350: selecting a first preset number of first positive sample data and a second preset number of first negative sample data from all the training text boxes.

Specifically, among all the training text boxes, the training text box features of the training text boxes of one same type, equal in number to the preprocessed pictures, are labeled with a first label to obtain labeled training text boxes equal in number to the preprocessed pictures; these labeled training text boxes are determined as the first preset number of first positive sample data. The type of the first positive sample data is the same as the type of the original classification model to be trained; that is, the training text box features of the training text boxes whose type matches the original classification model to be trained are the first positive sample data. The first label marks the first positive sample data and may be a number, a letter, or the like; for example, it may be set to 1, indicating that the probability that the first positive sample data is the target text box is 100%. Each preprocessed picture contains one first positive sample data. For example, 1000 preprocessed pictures yield 1000 first positive sample data, so the first preset number is 1000 and all 1000 first positive sample data are labeled 1.

Further, the training text box features of a second preset number of training text boxes of different types among all the training text boxes are labeled with a second label to obtain a second preset number of labeled training text boxes, which are determined as the second preset number of first negative sample data. The second label may also be a number, a letter, or the like, but must differ from the first label; for example, it may be set to 0, indicating that the probability that the first negative sample data is the target text box is 0. The type of the first negative sample data differs from that of the first positive sample data, and the first negative sample data are the training text boxes in each preprocessed picture other than the first positive sample data. For example, 1000 preprocessed pictures contain more than 1000 candidate first negative sample data; if 5 first negative sample data are randomly selected from each preprocessed picture, there are 5000 first negative sample data, so the second preset number is 5000 and all 5000 first negative sample data are labeled 0.
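The selection of positive and negative samples can be sketched as follows. This is a minimal illustration, assuming each preprocessed picture is represented as a list of (type, features) pairs; the function name `select_samples`, the data layout and the synthetic pictures are hypothetical, while the labels 1 and 0 and the counts 1000 and 5000 follow the example in the text.

```python
import random

def select_samples(pictures, target_type, negatives_per_picture=5, seed=0):
    """Select labeled samples for one type.

    pictures: one list per preprocessed picture, each holding (box_type, features) pairs.
    Returns (positives, negatives) as lists of (features, label) pairs.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for boxes in pictures:
        same = [f for t, f in boxes if t == target_type]
        other = [f for t, f in boxes if t != target_type]
        # One positive per picture, marked with the first label (1).
        positives.extend((f, 1) for f in same[:1])
        # Randomly chosen boxes of other types, marked with the second label (0).
        k = min(negatives_per_picture, len(other))
        negatives.extend((f, 0) for f in rng.sample(other, k))
    return positives, negatives

# Synthetic data: 1000 pictures, each with one "drawer" box and seven boxes of other types.
pictures = [
    [("drawer", [i, 0.0])] + [(f"other{j}", [i, float(j)]) for j in range(1, 8)]
    for i in range(1000)
]
pos, neg = select_samples(pictures, "drawer")
print(len(pos), len(neg))  # 1000 5000
```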

Step S360: training an original classification model of the same type as the first positive sample data with the first preset number of first positive sample data and the second preset number of first negative sample data to obtain a classification model of the same type as the first positive sample data.

All the selected first positive sample data and first negative sample data are used as input data to train an original classification model of the same type as the first positive sample data, yielding the classification model of that type. For example, 1000 first positive sample data of the issuer type, and 1000 first negative sample data of each of the drawer type, the full-name type, the account number type, the bill amount type and the acceptor type are selected; the resulting 6000 sample data are used as input data to train the original classification model of the issuer type, and the classification model of the issuer type is obtained once training is complete.
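Training one per-type binary classifier on positives labeled 1 and negatives labeled 0 can be sketched as below. This passage does not name the model family, so logistic regression fitted by stochastic gradient descent is used purely as a stand-in; the two-dimensional toy features and all function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_type_classifier(samples, epochs=500, lr=0.5):
    """Fit weights and a bias by stochastic gradient descent on the log loss.

    samples: list of (feature_vector, label), label 1 for positive, 0 for negative.
    """
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict_probability(model, x):
    """Probability that feature vector x belongs to the model's type."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

# Toy features (hypothetical normalized x, y position of each box).
positives = [([0.1, 0.1], 1), ([0.2, 0.1], 1), ([0.1, 0.2], 1)]
negatives = [([0.9, 0.8], 0), ([0.8, 0.9], 0), ([0.9, 0.9], 0)]
issuer_model = train_type_classifier(positives + negatives)
print(predict_probability(issuer_model, [0.15, 0.15]))
```

In the example from the text, the same call would receive 1000 positive and 5000 negative samples rather than this six-sample toy set.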

In the above technical scheme, a plurality of training bill pictures are obtained and each is preprocessed to obtain a plurality of preprocessed pictures; all training text boxes contained in the preprocessed pictures are identified and the training text box features of each are extracted; the type of each training text box is determined according to the training text information in it; the type of each training text box is marked and associated with its training text box features; a first preset number of first positive sample data and a second preset number of first negative sample data are selected from all the training text boxes; and an original classification model of the same type as the first positive sample data is trained with these sample data to obtain a classification model of that type. By this technical means, classification models of different types can be obtained.

As shown in fig. 6, in a fourth embodiment of the present application, the method for acquiring a ticket information text box of the present application further includes the following steps:

Step S410: when the text box features of at least one text box contained in the to-be-recognized bill picture cannot be recognized, predicting the type of the unrecognizable text box to obtain the type of the text box features of the unrecognizable text box, and selecting, from the to-be-recognized bill picture, unrecognizable text boxes of the same type, equal in number to the to-be-recognized bill pictures, together with a third preset number of text boxes of different types.

The format of the to-be-recognized bill picture may be updated during actual use, that is, a text box may be added to or deleted from the original bill. For example, the updated bill has one new text box, referred to as the newly added text box, while the other text boxes are consistent with the original bill. When the updated bill picture is recognized, the text box features of the newly added text box may not be recognized. In that case, the text information in the newly added text box is first translated, the type of the text information is then determined from the translated text information, and the type of the newly added text box is predicted from the type of the text information, thereby determining the type of the text box features of the newly added text box. The text box features of the newly added text box are the text box features of the unrecognizable text box. Further, unrecognizable text boxes of the same type, equal in number to the to-be-recognized bill pictures, and a third preset number of text boxes of different types are selected from the to-be-recognized bill picture. If there is only one to-be-recognized bill picture, there is also one unrecognizable text box of the same type, while there may be several text boxes of different types, for example 10, in which case the third preset number is 10.

Step S420: increasing the number of the unrecognizable text boxes of the same type, which are equal in number to the to-be-recognized bill pictures, to a fourth preset number; labeling the text box features of the fourth preset number of unrecognizable text boxes with the first label to obtain a fourth preset number of second positive sample data; and labeling the text box features of the third preset number of text boxes with the second label to obtain a third preset number of second negative sample data.

Specifically, the number of unrecognizable text boxes of the same type is increased to a fourth preset number; for example, 5 unrecognizable text boxes of the same type are increased to 2000 unrecognizable text boxes, that is, the fourth preset number is 2000. The text box features of the fourth preset number of unrecognizable text boxes are labeled with the first label to obtain the fourth preset number of second positive sample data, and the text box features of the third preset number of text boxes are labeled with the second label to obtain the third preset number of second negative sample data. The labeling process for both groups is the same as the labeling process in step S350 and is not repeated here.

Step S430: retraining an original classification model of any type according to the second preset number of first negative sample data, the third preset number of second negative sample data and the fourth preset number of second positive sample data to obtain a classification model of the same type as the second positive sample data.

Assume there are 5000 first negative sample data (the second preset number is 5000), 10 second negative sample data (the third preset number is 10) and 2000 second positive sample data (the fourth preset number is 2000), for a total of 7010 sample data. These 7010 sample data are used to retrain an original classification model of any type, for example the account type, to obtain a classification model of the same type as the second positive sample data. If the text information in the newly added text box is the application date, the newly added text box is of the application date type, the second positive sample data is of the application date type, and the classification model obtained by retraining is of the application date type. The classification models of the types obtained by the previous training are retained; that is, the application date type classification model is added to the previously trained classification models. For example, if there were previously 20 classification models, there are now 21.
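The sample counts in this example can be checked with a short sketch. The text does not say how the unrecognizable text boxes are increased from 5 to 2000, so plain repetition is assumed here; the function name, the feature values and the model registry are hypothetical placeholders.

```python
def build_retraining_set(first_negatives, second_negatives, seed_positives,
                         fourth_preset_number):
    """Assemble labeled samples for a newly added type.

    seed_positives are repeated cyclically up to fourth_preset_number;
    repetition is an assumption, not the method stated in the text.
    """
    if not seed_positives:
        raise ValueError("need at least one unrecognizable text box")
    positives = [
        (seed_positives[i % len(seed_positives)], 1)  # first label
        for i in range(fourth_preset_number)
    ]
    negatives = [(f, 0) for f in first_negatives]      # second label
    negatives += [(f, 0) for f in second_negatives]
    return positives + negatives

samples = build_retraining_set(
    first_negatives=[[0.9, 0.9]] * 5000,
    second_negatives=[[0.5, 0.5]] * 10,
    seed_positives=[[0.1, 0.1 + 0.01 * i] for i in range(5)],
    fourth_preset_number=2000,
)
print(len(samples))  # 7010

# Previously trained models are kept; the new type is added alongside them.
models = {f"type_{i}": None for i in range(20)}
models["application_date"] = None  # placeholder for the retrained model
print(len(models))  # 21
```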

According to the above technical scheme, in this embodiment, when the text box features of at least one text box contained in the to-be-recognized bill picture cannot be recognized, text boxes of one same type, equal in number to the to-be-recognized bill pictures, and a third preset number of text boxes of different types are selected from the to-be-recognized bill picture. The text box features of the former are labeled with the first label to obtain second positive sample data equal in number to the to-be-recognized bill pictures, and the text box features of the latter are labeled with the second label to obtain the third preset number of second negative sample data. The classification model of the same type as the first positive sample data is then retrained according to the first preset number of first positive sample data, the second preset number of first negative sample data, the second positive sample data and the third preset number of second negative sample data to obtain an updated classification model of that type. This technical means adds classification models and helps the preset detection model adapt to and detect more types of text boxes.

As shown in fig. 7, the present application provides a ticket information textbox acquiring system, including:

the image acquisition module 510 is configured to acquire a to-be-identified bill picture and identify all text boxes included in the to-be-identified bill picture;

a feature extraction module 520, configured to extract a text box feature of each text box in all text boxes;

the text box prediction module 530 is configured to determine, according to text box characteristics of each text box, a prediction probability value of each text box serving as a target text box by using a preset detection model; the preset detection model comprises a plurality of classification models of different types;

and a text box determining module 540, configured to determine the text box with the largest prediction probability value as the target text box of the corresponding type.
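The way the four modules fit together can be sketched as a single function. This is an illustration only, assuming text boxes are already detected, that feature extraction is supplied as a callable, and that each classification model is a callable returning a probability; the names `acquire_target_boxes`, `extract_features` and `type_models` are hypothetical.

```python
def acquire_target_boxes(boxes, extract_features, type_models):
    """Pick, for every type, the text box with the largest predicted probability.

    boxes: text boxes identified in the bill picture (module 510).
    extract_features: callable mapping a box to its feature vector (module 520).
    type_models: dict of type name -> callable(features) -> probability (module 530).
    Returns dict of type name -> (best box, probability) (module 540).
    """
    if not boxes:
        return {}
    features = [extract_features(b) for b in boxes]
    result = {}
    for type_name, model in type_models.items():
        probs = [model(f) for f in features]
        best = max(range(len(boxes)), key=probs.__getitem__)
        result[type_name] = (boxes[best], probs[best])
    return result

# Toy usage with stand-in models that score on a single hypothetical feature.
boxes = [{"id": 10, "x": 0.1}, {"id": 11, "x": 0.5}, {"id": 12, "x": 0.9}]
models = {"drawer": lambda f: 1.0 - f[0], "amount": lambda f: f[0]}
targets = acquire_target_boxes(boxes, lambda b: [b["x"]], models)
print(targets["drawer"][0]["id"], targets["amount"][0]["id"])  # 10 12
```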

Further, the image obtaining module 510 specifically includes, in terms of obtaining a to-be-identified ticket image:

the original image acquisition unit is used for acquiring an original bill image;

and the original picture processing unit is used for preprocessing the original bill picture to obtain the bill picture to be identified.

Further, the image obtaining module 510 is specifically configured to, in terms of identifying all text boxes included in the to-be-identified bill picture, obtain four vertex coordinates of a rectangular area corresponding to each text information in the to-be-identified bill picture, and connect the four vertex coordinates according to a preset sequence to obtain a text box corresponding to each text information.
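Connecting four vertex coordinates in a preset sequence can be sketched as below. The preset sequence is not specified in this passage, so a clockwise order starting from the top-left vertex is assumed, along with image coordinates in which y grows downward and an axis-aligned rectangle; the function name is hypothetical.

```python
def text_box_from_vertices(vertices):
    """Order four (x, y) vertices and connect them into a closed rectangle.

    Assumed order: top-left, top-right, bottom-right, bottom-left.
    Returns (ordered_vertices, edges), each edge being a pair of vertices.
    """
    if len(vertices) != 4:
        raise ValueError("expected exactly four vertices")
    by_y = sorted(vertices, key=lambda p: (p[1], p[0]))
    top = sorted(by_y[:2])      # the two vertices with the smallest y, left first
    bottom = sorted(by_y[2:])   # the two vertices with the largest y, left first
    ordered = [top[0], top[1], bottom[1], bottom[0]]
    # Connect each vertex to the next, and the last back to the first.
    edges = [(ordered[i], ordered[(i + 1) % 4]) for i in range(4)]
    return ordered, edges

ordered, edges = text_box_from_vertices([(50, 30), (10, 10), (50, 10), (10, 30)])
print(ordered)  # [(10, 10), (50, 10), (50, 30), (10, 30)]
```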

Further, the feature extraction module 520 includes:

the text box acquiring unit is used for acquiring a first text box characteristic of each text box in all the text boxes and a second text box characteristic of a text box adjacent to each text box;

a feature determination unit configured to use the first text box feature and the second text box feature of the text box adjacent to each text box as the text box features together.
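Combining a text box's own feature with that of an adjacent text box can be sketched as follows. The content of the features and the definition of adjacency are not given in this passage, so the sketch assumes boxes given as (x, y, width, height), position and size as the first text box feature, and the nearest box by center distance as the adjacent one; all names are hypothetical.

```python
import math

def own_features(box):
    """Assumed first text box feature: position and size of the box."""
    x, y, w, h = box
    return [x, y, w, h]

def nearest_neighbor(index, boxes):
    """Index of the box whose center is closest to boxes[index], or None."""
    x0, y0, w0, h0 = boxes[index]
    cx, cy = x0 + w0 / 2, y0 + h0 / 2
    best, best_d = None, math.inf
    for j, (x, y, w, h) in enumerate(boxes):
        if j == index:
            continue
        d = math.hypot(x + w / 2 - cx, y + h / 2 - cy)
        if d < best_d:
            best, best_d = j, d
    return best

def combined_features(index, boxes):
    """Concatenate the box's own feature with its neighbor's feature."""
    j = nearest_neighbor(index, boxes)
    # With no neighbor available, pad with zeros so the length stays fixed.
    neighbor = own_features(boxes[j]) if j is not None else [0.0] * 4
    return own_features(boxes[index]) + neighbor

boxes = [(0, 0, 10, 5), (12, 0, 10, 5), (100, 100, 10, 5)]
print(combined_features(0, boxes))  # [0, 0, 10, 5, 12, 0, 10, 5]
```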

Further, the system for acquiring the ticket information textbox further comprises:

the training picture acquisition unit is used for acquiring a plurality of training bill pictures and preprocessing each training bill picture to obtain a plurality of preprocessed pictures;

the text box recognition unit is used for recognizing all training text boxes contained in the plurality of preprocessed pictures and extracting the training text box characteristics of each training text box in all the training text boxes;

the type determining unit is used for determining the type of each training text box according to the training text information in each training text box;

the feature association unit is used for marking the type of each training text box and associating the type of each training text box with the training text box feature of each training text box;

the sample acquisition unit is used for selecting a first preset number of first positive sample data and a second preset number of first negative sample data from all the training text boxes;

and the model training unit is used for training an original classification model of the same type as the first positive sample data by adopting the first positive sample data of the first preset quantity and the first negative sample data of the second preset quantity to obtain a classification model of the same type as the first positive sample data.

Further, the sample acquiring unit includes:

the first feature labeling subunit is used for labeling the training text box features of the training text boxes with the same type and the same number as the preprocessed pictures in all the training text boxes by adopting first labels to obtain the labeled training text boxes with the same number as the preprocessed pictures;

a first sample determining subunit, configured to determine the labeled training text boxes with the same number as the pre-processed pictures as the first positive sample data of the first preset number;

the second feature labeling subunit is configured to label, by using a second label, the training text box features of a second preset number of training text boxes with different types in all the training text boxes to obtain a second preset number of labeled training text boxes;

and the second sample determining subunit is configured to determine the labeled training text boxes of the second preset number as the first negative sample data of the second preset number.

Further, the ticket information text box acquiring system further includes:

the feature detection unit is used for, when the text box features of at least one text box contained in the to-be-recognized bill picture cannot be recognized, predicting the type of the unrecognizable text box to obtain the type of the text box features of the unrecognizable text box, and selecting, from the to-be-recognized bill picture, unrecognizable text boxes of the same type, equal in number to the to-be-recognized bill pictures, together with a third preset number of text boxes of different types;

the sample updating unit is used for increasing the number of the unidentifiable text boxes with the same type and the same number as the number of the to-be-identified bill pictures to a fourth preset number, marking the text box characteristics of the unidentifiable text boxes with the fourth preset number by adopting the first label to obtain second positive sample data with the fourth preset number, and marking the text box characteristics of the text boxes with the third preset number by adopting the second label to obtain second negative sample data with the third preset number;

and the model updating unit is used for retraining any type of original classification model according to the second preset number of first negative samples, the third preset number of second negative sample data and the fourth preset number of second positive sample data to obtain a classification model of the same type as the second positive sample data.

The specific implementation of the system for acquiring the ticket information text box of the present invention is basically the same as that of each embodiment of the method for acquiring the ticket information text box, and is not described herein again.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for acquiring a bill information text box is characterized by comprising the following steps:

extracting the text box characteristics of each text box in all the text boxes;

2. The method of claim 1, wherein the obtaining of the ticket image to be recognized comprises:

acquiring an original bill picture;

3. The method of claim 1, wherein the identifying all text boxes contained in the to-be-identified ticket picture comprises:

4. The method of claim 1, wherein extracting text box features for each of all text boxes comprises:

5. The method as claimed in claim 1, wherein before determining the predicted probability value of each text box as the target text box according to the text box characteristics of each text box by using a preset detection model, the method comprises:

6. The method of claim 5, wherein said selecting a first preset number of first positive sample data and a second preset number of first negative sample data from all training textboxes comprises:

7. The method of claim 5, wherein the ticket information text box acquiring method further comprises:

and retraining any type of original classification model according to the second preset number of the first negative sample data, the third preset number of the second negative sample data and the fourth preset number of the second positive sample data to obtain a classification model of the same type as the second positive sample data.

8. A ticket information textbox acquisition system, comprising:

9. A ticket information textbox acquiring apparatus, comprising: a memory, a processor and a ticket information text box capture program stored on the memory and executable on the processor, the ticket information text box capture program when executed by the processor implementing the steps of the ticket information text box capture method of any of claims 1-7.

10. A storage medium, having stored thereon a ticket information text box acquisition program which, when executed by a processor, implements the steps of the ticket information text box acquisition method of any one of claims 1-7.