CN114067343A - Data set construction method, model training method and corresponding device - Google Patents

Data set construction method, model training method and corresponding device

Info

Publication number
CN114067343A
Authority
CN
China
Prior art keywords
entity
information
category
element entity
bill
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111421423.7A
Other languages
Chinese (zh)
Inventor
徐云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202111421423.7A
Publication of CN114067343A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The embodiments of the application relate to the field of data processing, and in particular disclose a data set construction method, a model training method, a device, an electronic device and a storage medium. The method comprises the following steps: acquiring bill images to be processed; for each bill image, performing OCR recognition on the bill image and determining the element information of each element entity and the position information of each element entity in the bill image, the element information comprising at least one of character information, table information and signature information; determining the category of each element entity according to the element information and the position information; for each element entity, labeling the element entity with the label corresponding to its category; and determining the element set formed by the labeled element entities in the bill images to be processed as the data set. The method improves the accuracy of the data set constructed from the collected bill images, so that applying the data set to bill recognition improves the accuracy of bill recognition.

Description

Data set construction method, model training method and corresponding device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method for constructing a data set, a method for training a model, a corresponding apparatus, an electronic device, and a storage medium.
Background
The bill is an important text carrier of structured information. With social development, the variety of bill styles has kept growing and types with different layouts have emerged. When the relevant departments process reimbursements, several or even more than ten different types of bills need to be audited, and some of these bills have very similar structures.
In the prior art, a large corpus of bill data is collected and correspondingly processed to recognize the bills. The accuracy of bill data collection therefore directly affects recognition accuracy.
Disclosure of Invention
The embodiments of the application provide a data set construction method, a model training method, a corresponding device, an electronic device and a storage medium, which are used to improve the accuracy of a data set constructed from collected bill images and, by applying that data set to bill recognition, to improve the accuracy of bill recognition.
In a first aspect, an embodiment of the present application provides a method for constructing a data set, including:
acquiring a bill image to be processed;
performing OCR recognition on the bill image aiming at each bill image, and determining element information of each element entity and position information of each element entity in the bill image; wherein the element information comprises at least one of character information, table information and signature information;
for each element entity, determining the category of the element entity according to the element information and the position information;
based on each element entity, applying a label corresponding to the category of the element entity to label the element entity;
and determining an element set formed by each marked element entity in each to-be-processed bill image as a data set.
In some exemplary embodiments, the determining the category of the element entity according to the element information and the position information includes:
if the element information does not include preset keyword information, determining an adjacent element entity of the element entity according to the position information of the element information, and determining the category of the element entity according to the adjacent element entity; wherein the distance between the adjacent element entity and the element entity on the bill image is smaller than a preset distance threshold value.
In some exemplary embodiments, the labeling the element entity by applying the label corresponding to the category based on the category of each element entity includes:
if the category of the element entity is a simple category, marking the element entity as a label of the element entity; wherein the number of the element entities in the area where the element entities of the simple category are located is one;
if the category of the element entity is a composite category, marking the element entity as a composite label; the composite label comprises labels of all element entities in the composite elements and corresponding element values; the number of the element entities in the area where the element entities of the composite category are located is at least two.
In some exemplary embodiments, before the labeling the element entity by applying the label corresponding to the category based on the category of each element entity, the method further includes:
and displaying each element entity according to a preset display form.
In some exemplary embodiments, if the preset presentation form is an html form, the presenting each element entity according to the preset presentation form includes:
determining an element text box, and displaying the element text box in an html form; wherein, the element text box comprises attribute information of elements to be displayed;
if the preset display form is a json form, displaying each element entity according to the preset display form comprises:
displaying the element key value pairs according to the determined corresponding relation of the element key value pairs; the key value in the key value pair is a nested json string of the elements to be displayed.
In some exemplary embodiments, before determining the element text box and presenting the element text box in html form, the method further comprises:
if the elements needing to be displayed are elements in the table, determining attribute labels of the element text boxes;
and displaying the attribute tag as the attached display information of the element text box.
In a second aspect, an embodiment of the present application provides a method for training a bill recognition model, including:
obtaining a training data set, wherein the training data comprises a data set obtained by applying the method of the first aspect;
and training a pre-constructed neural network model by using the data set until the neural network model converges to obtain a bill identification model.
In a third aspect, an embodiment of the present application provides an apparatus for constructing a data set, including:
the image acquisition module is used for acquiring a bill image to be processed;
the image recognition module is used for performing OCR recognition on the bill images aiming at each bill image and determining the element information of each element entity and the position information of each element entity in the bill images; wherein the element information comprises at least one of character information, table information and signature information;
a category determination module, configured to determine, for each element entity, a category of the element entity according to the element information and the location information;
the labeling module is used for labeling each element entity by applying a label corresponding to the category of the element entity based on the element entity;
and the data set determining module is used for determining that the element set formed by each marked element entity in each to-be-processed bill image is a data set.
In some exemplary embodiments, the category determination module is specifically configured to:
if the element information does not include preset keyword information, determining an adjacent element entity of the element entity according to the position information of the element information, and determining the category of the element entity according to the adjacent element entity; wherein the distance between the adjacent element entity and the element entity on the bill image is smaller than a preset distance threshold value.
In some exemplary embodiments, the labeling module is specifically configured to:
if the category of the element entity is a simple category, marking the element entity as a label of the element entity; wherein the number of the element entities in the area where the element entities of the simple category are located is one;
if the category of the element entity is a composite category, marking the element entity as a composite label; the composite label comprises labels of all element entities in the composite elements and corresponding element values; the number of the element entities in the area where the element entities of the composite category are located is at least two.
In some exemplary embodiments, the apparatus further includes a presentation module, where the presentation module is configured to, before the element entity is labeled by applying the label corresponding to its category based on the category of each element entity:
and displaying each element entity according to a preset display form.
In some exemplary embodiments, if the preset presentation form is an html form, the presentation module is specifically configured to:
determining an element text box, and displaying the element text box in an html form; wherein, the element text box comprises attribute information of elements to be displayed;
if the preset display form is a json form, the display module is specifically configured to:
displaying the element key value pairs according to the determined corresponding relation of the element key value pairs; the key value in the key value pair is a nested json string of the elements to be displayed.
In some exemplary embodiments, the presentation module is further configured to, prior to determining the element text box and presenting the element text box in html form:
if the elements needing to be displayed are elements in the table, determining attribute labels of the element text boxes;
and displaying the attribute tag as the attached display information of the element text box.
In a fourth aspect, an embodiment of the present application provides a training apparatus for a bill recognition model, including:
a data set obtaining module, configured to obtain a training data set, where the training data includes a data set obtained by the method of the first aspect;
and the training module is used for training a pre-constructed neural network model by using the data set until the neural network model is converged to obtain a bill identification model.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the methods in the first or second aspects when executing the computer program.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, on which computer program instructions are stored, which, when executed by a processor, implement the steps of any one of the methods of the first or second aspects.
In a seventh aspect, an embodiment of the present application provides a computer program product, where the computer program product includes a computer program, the computer program is stored in a computer-readable storage medium, and at least one processor of the device reads and executes the computer program from the computer-readable storage medium, so that the device performs the method shown in any one of the embodiments of the first aspect or the second aspect.
According to the method and the device, after the bill images to be processed are obtained, OCR recognition is carried out on each bill image, and the element information of each element entity and the position information of each element entity in the bill images are determined; in this way, the type of the corresponding element entity can be determined by using the element information and the position information, and further, in the process of data tagging, the element entity can be tagged by applying a tag corresponding to the type of the element entity, and an element set formed by each tagged element entity is a data set. The data set not only comprises element information, but also comprises position information, so that the accuracy of the bill recognition model obtained by training by using the data set as a sample is improved, and the accuracy of bill recognition is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of ticket recognition according to an embodiment of the present application;
FIG. 2 is a schematic view of a form of a document according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for constructing a data set according to an embodiment of the present application;
fig. 4 is a diagram illustrating an effect after identifying a factor entity according to an embodiment of the present application;
FIG. 5 is a schematic illustration of an annotation process provided in an embodiment of the present application;
fig. 6 is a schematic diagram of an excel label according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an integrated annotation result provided in accordance with an embodiment of the present application;
FIG. 8 is a flowchart of a method for training a bill identification model according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a data set constructing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a training apparatus for a bill identification model according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below:
(1) OCR (Optical Character Recognition): analyzing an image file to obtain the character information and layout information in the image.
(2) NLP (Natural Language Processing): the study of theories and methods that enable effective communication between humans and computers in natural language.
(3) Sequence labeling: a relatively simple NLP task used to solve a series of token classification problems, such as word segmentation, part-of-speech tagging, named entity recognition and relation extraction.
Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
In a specific practice process, the bill is an important text carrier of structured information. With social development, the variety of bill styles has kept growing and types with different layouts have emerged. When the relevant departments process reimbursements, several or even more than ten different types of bills need to be audited, and some of these bills have very similar structures. In the prior art, a large corpus of bill data is collected and processed to recognize the bills, so the accuracy of bill data collection directly affects recognition accuracy.
In addition, natural-language data sets are generally produced by sequence labeling, and such data loses the position information of the characters on the bill. For labeling image text, the character information on the image has to be entered manually before entity labeling can be performed, which is time-consuming and labor-intensive.
In particular, in current application scenarios natural language processing is still oriented toward question-answering robots: most data sets are used to analyze the semantics of sentences, the parts of speech of words and the semantic relations in context, so that corresponding answers can be given. Such labeled data is entirely sufficient for a question-answering robot whose goal is dialogue, but for bill information extraction, where the position of the characters themselves is also an extraction element, simple sequence labeling does not reflect the position of an entity on the bill well, and additional labeling is required, which is cumbersome.
Therefore, the method utilizes not only the element information of the element entities, but also the position information of each element entity, so that the types of the element entities can be determined according to the element information and the position information, the element entities are labeled according to labels corresponding to the types of the element entities, and the element sets formed by the labeled element entities are data sets. The bill recognition model obtained by training by using the data set as a sample has high accuracy, and the accuracy of subsequent bill recognition is also high.
After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
Reference is made to fig. 1, which is a schematic view of an application scenario of ticket recognition according to an embodiment of the present application. The application scenario includes a plurality of terminal devices 101 (including terminal device 101-1, terminal device 101-2, … … terminal device 101-n), server 102. The terminal device 101 and the server 102 are connected via a wireless or wired network, and the terminal device 101 includes but is not limited to a desktop computer, a mobile phone, a mobile computer, a tablet computer, a media player, a smart wearable device, a smart television, and other electronic devices. The server 102 may be a server, a server cluster composed of several servers, or a cloud computing center. The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
Each terminal device collects bill images and sends them to the server for processing; the server obtains the labeled element entities, namely the data set, through the data set construction method in the embodiments of the present application. Of course, the process of constructing the data set may also be performed by the terminal device.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
To further illustrate the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide method steps as shown in the following embodiments or figures, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the embodiments of the present application.
The following describes the technical solution provided in the embodiments of the present application with reference to the application scenario shown in fig. 1. Fig. 2 shows a schematic diagram of a bill form including a plurality of elements; when there are a large number of bills, the form, type and other attributes of each bill need to be recognized automatically.
Referring to fig. 3, an embodiment of the present application provides a method for constructing a data set, including the following steps:
s301, acquiring a to-be-processed bill image.
S302, performing OCR recognition on the bill image aiming at each bill image, and determining the element information of each element entity and the position information of each element entity in the bill image; wherein the element information includes at least one of character information, table information, and signature information.
S303, for each element entity, determining the category of the element entity according to the element information and the position information.
And S304, based on each element entity, applying a label corresponding to the category of the element entity to label the element entity.
S305, determining that the element set formed by each marked element entity in each bill image to be processed is a data set.
According to the method and the device, after the bill images to be processed are obtained, OCR recognition is carried out on each bill image, and the element information of each element entity and the position information of each element entity in the bill images are determined; in this way, the type of the corresponding element entity can be determined by using the element information and the position information, and further, in the process of data tagging, the element entity can be tagged by applying a tag corresponding to the type of the element entity, and an element set formed by each tagged element entity is a data set. The data set not only comprises element information, but also comprises position information, so that the accuracy of the bill recognition model obtained by training by using the data set as a sample is improved, and the accuracy of bill recognition is improved.
Referring to S301, to obtain a sample participating in model training, a ticket image to be processed is first acquired. In the process, a large number of bill images of different types can be obtained, and if the bill images are electronic bills, the bill images can be directly captured; if the bill is a paper bill, the paper bill can be shot to obtain a bill image.
Referring to S302, OCR recognition technology is applied to each bill image to recognize the element information and the position information of each element entity in the image. In the bill field, the element information includes at least one of character information, table information and signature information; different bill types contain different element entities, and the above is only an example rather than a specific limitation. Specifically, the position information may be determined by outputting the coordinates of the element information with the OCR recognition technology.
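As an illustrative sketch only (the patent does not prescribe a data structure or name an OCR engine), the element information and position information kept for each element entity could be organized as follows; the `ocr_engine.detect()` call and the field names are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ElementEntity:
    """One element entity recognized from a bill image."""
    text: str                        # element information (character, table or signature content)
    bbox: Tuple[int, int, int, int]  # position information: (x_min, y_min, x_max, y_max) in pixels
    kind: str = "text"               # "text", "table" or "signature"

def recognize_bill(image_path: str, ocr_engine) -> List[ElementEntity]:
    """Run OCR on one bill image and keep both the content and the coordinates.

    `ocr_engine` stands in for whatever OCR service is actually used; `detect()`
    is a hypothetical call assumed to yield dictionaries with "text" and "bbox" keys.
    """
    entities = []
    for block in ocr_engine.detect(image_path):
        entities.append(ElementEntity(text=block["text"],
                                      bbox=tuple(block["bbox"]),
                                      kind=block.get("kind", "text")))
    return entities
```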
Because bills are produced by different issuers, their layouts differ, but the element entity information to be expressed is basically consistent, and the positions of most element entities are relatively fixed. Combining the character information, the keywords around the characters and the position information of the characters on the bill therefore covers most elements of a bill well and increases the confidence of element extraction.
Referring to S303, since a plurality of element entities are included in one ticket image, a category of the element entity can be determined from the element information and the position information for each element entity.
Specifically, the category of the element entity may be determined by determining whether the element information includes preset keyword information. The preset keyword information represents the type of the element or data capable of representing the type of the element, for example, the preset keyword information may be "date", "drawer", or the like, so that the category of the element entity may be determined by the preset keyword.
In the first case, the element information includes a preset keyword.
In this case, a preset keyword included in the element information may be directly extracted to determine a category of an element entity to which the element information corresponds.
In the second case, the element information does not include a predetermined keyword.
In this case, the category of the element entity cannot be determined from the element information alone. Therefore, the adjacent element entities of the element entity can be determined from the position information, where the adjacency relation is determined by calculating distances on the bill image. In a specific example, if the current element information does not include the preset keyword information, an element entity whose distance from the current element entity on the bill image is smaller than a preset distance threshold is determined as an adjacent element entity, so that the category of the current element entity can be determined from the preset keyword contained in the adjacent element entity.
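A minimal sketch of this two-case decision, reusing the `ElementEntity` structure sketched above; the keyword set and the pixel threshold are assumed example values, not ones given in the patent:

```python
import math

PRESET_KEYWORDS = {"drawer", "billing date", "invoice"}  # hypothetical preset keyword set
DISTANCE_THRESHOLD = 80                                   # hypothetical pixel threshold

def _center(bbox):
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def _distance(a, b):
    (ax, ay), (bx, by) = _center(a.bbox), _center(b.bbox)
    return math.hypot(ax - bx, ay - by)

def category_of(entity, all_entities):
    """Decide the category of one element entity (keyword first, neighbours second)."""
    # Case 1: the element information itself contains a preset keyword.
    for keyword in PRESET_KEYWORDS:
        if keyword in entity.text:
            return keyword
    # Case 2: no keyword -- look at adjacent entities within the distance threshold.
    neighbours = [other for other in all_entities
                  if other is not entity and _distance(entity, other) < DISTANCE_THRESHOLD]
    for neighbour in neighbours:
        for keyword in PRESET_KEYWORDS:
            if keyword in neighbour.text:
                return keyword
    return "unknown"
```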
Therefore, for the bill information labeling task, OCR character recognition can well replace the step of manually entering the image characters and can extract all the position information and content information of the characters in the image. Labeled data that retains position information or neighboring-element information can be learned as a weight during training. In bill images with complex layouts, and especially when there is no keyword information around an element, this weight is very important: the position of a given element and the layout of the elements around it are highly similar across bills of the same type, and using this information allows the corresponding element information to be extracted better from that type of bill image.
Referring to S304, after the category of each element entity is determined, based on each element entity, the element entity is labeled by applying a label corresponding to the category.
Specifically, for a bill image, the labeling process is different according to different extractable information, and the labeling process includes the following conditions:
in the first case, the category of the element entity is a simple category.
The number of element entities in the area where a simple-category element entity is located is one. Therefore, when the category of the element entity is a simple category, the element entity is directly labeled with its element label; for example, the label of the element entity may be "invoice" or the like.
For a simple-category element of the form "keyword: element value", or a bare element value, the element is directly marked with its element label, and an attribute is added to store the actual element value of the label.
In a specific example, in fig. 4, an element text box (also called a span box) containing only "invoice" is a simple-category element, and its element label is the bill title.
In the second case, the category of the element entity is a composite category.
The number of element entities in the area where a composite-category element entity is located is at least two. Therefore, when the category of the element entity is the composite category, a composite label is applied to label the current element entity. The composite label includes the label and the corresponding element value of each element entity in the composite element.
A composite label covers content of the form "keyword 1: element value 1 {keyword 2: element value 2 ...}" or "element value 1 {element value 2 ...}". The element is marked with the composite label, and an entry value stores the label and element value of each sub-element, for example entry: "{label 1: element value 1}{label 2: element value 2} ...".
In a specific example, still referring to fig. 4, the span box "drawer: XXX billing date: 2021-06-01" contains two pieces of information, the drawer and the billing date of the invoice, so a composite label COMPLEX needs to be defined to cover this situation.
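The two labeling cases can be sketched as follows; the record layout and the way a span box is split into (label, value) pairs are illustrative assumptions, with only the COMPLEX composite label taken from the text:

```python
def label_entity(entity, sub_entities):
    """Build one annotation record for an element text box (span box).

    `sub_entities` is a list of (label, element value) pairs found in the box;
    one pair means a simple category, two or more mean a composite category.
    """
    if len(sub_entities) == 1:
        label, value = sub_entities[0]
        return {"label": label, "value": value, "bbox": entity.bbox}
    # Composite category: keep the label and element value of every sub-element.
    return {"label": "COMPLEX",
            "value": [{"label": lbl, "value": val} for lbl, val in sub_entities],
            "bbox": entity.bbox}

# Roughly corresponding to the span box of fig. 4 (values are illustrative):
# label_entity(box, [("drawer", "XXX"), ("billing date", "2021-06-01")])
```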
For example, fig. 5 shows a schematic diagram of an annotation process, wherein a region 501 is an annotation of a simple category of element entities, and a region 502 is an annotation of a complex category of element entities.
In addition, the characters extracted by OCR character recognition can be subjected to label selection and element value filling by using excel, as shown in fig. 6.
The OCR recognition result may also be annotated further in combination with the OCR recognition technology: after the bill image is imported into the system, OCR character recognition is performed and the recognized result is displayed on the page; for each character recognition box on the page, an element label can be selected and a specific element value can be annotated, as shown in fig. 7. Specifically, elements of the simple category may be labeled with only the text box, while elements of the composite category are labeled with both labels and element values.
In summary, by analogy with part-of-speech tagging in sequence labeling, the element label corresponds to the part of speech and the element entity corresponds to a word in the sentence. By classifying and labeling the information on a type of bill with a limited set of element labels, the feature set contained in each label can be obtained; the larger the set, the richer the feature information it contains and the more accurately the model learns. Distinguishing simple-category elements from composite-category elements, and labeling the composite-category elements specially, improves the accuracy of the constructed data set.
The above label may also be replaced by the part of speech of the element information contained in the element entity; specifically, the part-of-speech tags of sequence labeling may be applied to determine the label of each element entity. By tagging element entity values with element labels determined in this way, a data set can likewise be constructed.
In step S305, after labeling each element entity in each document image, an element set composed of each labeled element entity in each document image to be processed is determined as a data set. The data set can be used as a training sample to obtain a bill identification model, and the bill to be identified is identified by applying the bill identification model, so that the accuracy rate is high.
The data set for bill image information extraction is thus constructed by combining OCR character recognition with NLP sequence labeling. With this method, the character information and table information of the bill image can be fully extracted by the OCR recognition technology, and the element entities in the bill are classified and labeled by annotating the OCR recognition result, so that the position information and the entity information of the characters in the bill image are retained at the same time and the learned information is more comprehensive and complete.
In the actual application process, in order to improve the labeling effect, before labeling each element entity, the identification result is displayed through a page according to a preset display form, that is, each element entity is displayed, so that the labeling can be performed with reference to the preset display form.
In a specific example, different preset presentation forms correspond to different presentation effects.
If the preset display form is the html form, displaying each element entity according to the preset display form comprises: determining an element text box, and displaying the element text box in an html form; wherein, the element text box comprises the attribute information of the element to be displayed.
The element text box contains the attribute information of the elements to be displayed, that is, attribute information such as the font size, position and color of a line of characters. Therefore, when the preset presentation form is the html form, the element text box is determined and is presented in html form.
In addition, if the elements to be displayed are elements in the table, the attribute label of the element text box can be determined and displayed as the attached attribute information of the element text box. The attribute tag may be a table tag indicating that a line of text is in the table.
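A minimal sketch of such an html presentation; the markup, class name and `data-table` attribute are illustrative assumptions rather than the actual page implementation:

```python
def span_box_html(entity, in_table=False):
    """Render one element text box as an html fragment.

    The style attributes (font size, position, colour) stand in for the element's
    attribute information; `data-table` is an illustrative attached attribute tag
    marking that the line of text lies inside a table.
    """
    x0, y0, _, _ = entity.bbox
    style = f"position:absolute; left:{x0}px; top:{y0}px; font-size:14px; color:#333"
    table_tag = ' data-table="true"' if in_table else ""
    return f'<span class="element-box" style="{style}"{table_tag}>{entity.text}</span>'
```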
If the preset display form is a json form, displaying each element entity according to the preset display form comprises: displaying the element key value pair according to the determined corresponding relation of the element key value pair; the key in the key value pair is the serial number of the element to be displayed, and the value in the key value pair is the nested json string of the element to be displayed.
Json (JavaScript Object Notation) is a lightweight data exchange format, and when the preset display form is the Json form, the element key value pairs are displayed according to the corresponding relation of the determined element key value pairs. Illustratively, a key in a key-value pair is a serial number of an element to be displayed, and a value in a key-value pair is a nested json string of the element to be displayed.
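A sketch of the json presentation form, assuming the key is the element's serial number and the value is the nested json string described above:

```python
import json

def to_json_view(entities):
    """Build the json presentation: key = element serial number, value = nested json string."""
    view = {}
    for serial, entity in enumerate(entities, start=1):
        nested = {"text": entity.text, "bbox": list(entity.bbox), "kind": entity.kind}
        view[str(serial)] = json.dumps(nested, ensure_ascii=False)  # nested json string per element
    return json.dumps(view, ensure_ascii=False, indent=2)
```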
In a specific example, referring to fig. 4, a display effect diagram after identifying an element entity is shown, where a region 401 is an html display part, and a region 402 is a json display part.
After a labeled data set has been obtained from a plurality of to-be-processed bill images, the data set is used as the training data set to train a bill recognition model. Referring to fig. 8, a schematic diagram of the training method of the bill recognition model is shown.
S801, acquiring a training data set, wherein the training data comprises the data set obtained by applying the method of any one of claims 1-6.
S802, training the pre-constructed neural network model by using the data set until the neural network model converges to obtain a bill identification model.
In the embodiment of the present application, the training data set is obtained by the method for constructing a data set in the foregoing embodiment, and therefore, the data set is used to train a pre-constructed neural network model until the neural network model converges, so as to obtain a bill recognition model. The neural network model may be a convolutional neural network model or the like. The data set participating in training is obtained by applying the data set construction method in the application, so that the neural network model obtained by training is more accurate, and the recognition accuracy is high when the neural network model is applied to bill recognition.
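The patent does not fix a network architecture or a framework; the following PyTorch-style sketch only illustrates training until convergence, with the model, the data loader built from the constructed data set, and the convergence tolerance all assumed:

```python
from torch import nn, optim

def train_bill_model(model: nn.Module, train_loader, max_epochs: int = 50, tol: float = 1e-4):
    """Train a pre-constructed network until the average loss stops improving."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    previous_loss = float("inf")
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for features, labels in train_loader:       # samples come from the constructed data set
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(train_loader)
        if abs(previous_loss - epoch_loss) < tol:   # crude convergence test
            break
        previous_loss = epoch_loss
    return model
```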
In addition, in the embodiment of the present application, the labeling process of the data set may also be performed in the following manner:
when labeling is carried out for the first time, pure manual labeling is needed, and a data set for initial model training is provided; after the initial model is trained, the system automatically calls the bill extraction model to perform mechanical marking. Meanwhile, in order to improve the accuracy of labeling, the result of mechanical labeling needs to be manually corrected, and errors in model labeling are corrected. Therefore, besides the first manual marking, the same type note marking data with the same label can be selected and combined, the same type note marking data and the same type note marking data are trained together, a new extraction model is generated and compared with the previous extraction model, and the extraction model with the optimal index is selected to replace the note extraction model. The bill extraction model is for data annotation, unlike the bill identification model.
Therefore, a smaller label number set can be obtained by initially labeling a class of bills, an initial model can be obtained by training the set, and when the model is used for mechanically labeling subsequent bills, the model has fewer possible identified element values due to the small sample amount, but the workload of manual labeling can be greatly reduced. In continuous iterative training, the effect of the model can be rapidly improved, the accuracy of model identification can be determined in the manual correction process, the weak points of model training, namely sparse elements, can be found in a targeted mode, and targeted labeling training is carried out. Moreover, as the consistency of the labeling data of the similar bills with the same label is very high, only the data set needs to be replaced, and the automatic training can be carried out; even if the same type of bills with different labels are available, the combined training can be carried out by deleting the labels or replacing the label information.
The marking process can well meet the requirement of reducing the workload of data marking. Through continuous iterative training, the mechanical labeling of the intermediate model is utilized, the labeling work of simple labels can be reduced, and the specific labeling is only carried out on complex labels and sparse elements, so that the method is more targeted. Meanwhile, due to the consistency of the labeled data of each round, in the aspect of training, the model can be trained only by replacing the data set, and the training cost of the model is simplified.
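The iterative labeling workflow can be sketched as follows; every function name here is a placeholder for one of the manual or automatic steps described above, not an API from the patent:

```python
def build_dataset_iteratively(bill_image_batches):
    """Bootstrap labeling: one manual round, then model-assisted rounds with manual correction.

    `manual_label`, `manual_correct`, `train_extraction_model` and `evaluate` are
    placeholders for the human and automatic steps; they are not defined here.
    """
    dataset = manual_label(bill_image_batches[0])        # first round: purely manual labeling
    extraction_model = train_extraction_model(dataset)
    for batch in bill_image_batches[1:]:
        draft = extraction_model.predict(batch)          # mechanical labeling by the current model
        dataset += manual_correct(draft)                 # human correction of model mistakes
        candidate = train_extraction_model(dataset)      # retrain on the merged, consistent data
        if evaluate(candidate) > evaluate(extraction_model):
            extraction_model = candidate                 # keep the model with the better metrics
    return dataset, extraction_model
```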
In summary, in the embodiments of the application, the character information and the position information in the bill image are extracted by OCR recognition; that is, the content of the characters is read while the position of the characters on the bill is retained. For an element entity without an obvious keyword, which can only be judged from its position or from neighboring element information, the position information or the neighboring-element information is the key information. Ordinary sequence labeling does not record the position of characters on the image, which makes such elements difficult to extract, whereas combining OCR recognition retains the key position information and the related information of the neighboring elements.
As shown in fig. 9, based on the same inventive concept as the above-mentioned data set constructing method, the embodiment of the present application further provides a data set constructing apparatus, which includes an image acquiring module 91, an image recognizing module 92, a category determining module 93, an annotating module 94 and a data set determining module 95.
The image acquisition module 91 is used for acquiring a bill image to be processed;
the image recognition module 92 is used for performing OCR recognition on the bill image aiming at each bill image and determining the element information of each element entity and the position information of each element entity in the bill image; wherein, the element information comprises at least one of character information, table information and signature information;
a category determining module 93, configured to determine, for each element entity, a category of the element entity according to the element information and the position information;
a labeling module 94, configured to label, based on each element entity, the element entity by applying a label corresponding to the category of the element entity;
and a data set determining module 95, configured to determine that the element set formed by each labeled element entity in each to-be-processed bill image is a data set.
In some exemplary embodiments, the category determining module 93 is specifically configured to:
if the element information does not include preset keyword information, determining an adjacent element entity of the element entity according to the position information of the element information, and determining the category of the element entity according to the adjacent element entity; and the distance between the adjacent element entity and the element entity on the bill image is smaller than a preset distance threshold value.
In some exemplary embodiments, the tagging module 94 is specifically configured to:
if the category of the element entity is a simple category, marking the element entity as a label of the element entity; the number of the element entities in the area where the element entities of the simple category are located is one;
if the category of the element entity is a composite category, marking the element entity as a composite label; the composite label comprises labels of all element entities in the composite elements and corresponding element values; the number of the element entities in the region where the element entities of the composite category are located is at least two.
In some exemplary embodiments, the apparatus further includes a presentation module, where the presentation module is configured to, before the element entity is labeled with the label corresponding to its category based on the category of each element entity:
and displaying each element entity according to a preset display form.
In some exemplary embodiments, if the preset presentation form is an html form, the presentation module is specifically configured to:
determining an element text box, and displaying the element text box in an html form; wherein, the element text box comprises attribute information of elements to be displayed;
if the preset display form is a json form, the display module is specifically configured to:
displaying the element key value pair according to the determined corresponding relation of the element key value pair; the key in the key value pair is the serial number of the element to be displayed, and the value in the key value pair is the nested json string of the element to be displayed.
In some exemplary embodiments, the presentation module is further configured to, before determining the element text box and presenting the element text box in html form:
if the elements needing to be displayed are elements in the table, determining attribute labels of the element text boxes;
and displaying the attribute label as the attached display information of the element text box.
The device for constructing the data set and the method for constructing the data set provided by the embodiment of the application adopt the same inventive concept, can obtain the same beneficial effects, and are not repeated herein.
As shown in fig. 10, based on the same inventive concept as the above-mentioned training method of the bill recognition model, the embodiment of the present application further provides a training apparatus for the bill recognition model, which includes a data set acquisition module 1001 and a training module 1002.
A data set obtaining module 1001, configured to obtain a training data set, where the training data includes a data set obtained by the method of the first aspect;
the training module 1002 is configured to train a pre-constructed neural network model by using a data set until the neural network model converges to obtain a bill recognition model.
The training device of the bill identification model and the training method of the bill identification model provided by the embodiment of the application adopt the same inventive concept, can obtain the same beneficial effects, and are not repeated herein.
Based on the same inventive concept as the construction method of the data set, the embodiment of the present application further provides an electronic device, which may be specifically a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 11, the electronic device may include a processor 111 and a memory 112.
The Processor 111 may be a general-purpose Processor, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The memory 112, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The Memory may include at least one type of storage medium, and may include, for example, a flash Memory, a hard disk, a multimedia card, a card-type Memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), a charged Erasable Programmable Read Only Memory (EEPROM), a magnetic Memory, a magnetic disk, an optical disk, and so on. The memory is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 112 in the embodiments of the present application may also be circuitry or any other device capable of performing a storage function for storing program instructions and/or data.
Based on the same inventive concept as the construction method of the data set, the embodiment of the present application further provides an electronic device, which may be specifically a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 12, the electronic device may include a processor 121 and a memory 122.
The Processor 121 may be a general-purpose Processor, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
Memory 122, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The Memory may include at least one type of storage medium, and may include, for example, a flash Memory, a hard disk, a multimedia card, a card-type Memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), a charged Erasable Programmable Read Only Memory (EEPROM), a magnetic Memory, a magnetic disk, an optical disk, and so on. The memory is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 122 in the embodiments of the present application may also be circuitry or any other device capable of performing a storage function for storing program instructions and/or data.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; the computer storage media may be any available media or data storage device that can be accessed by a computer, including but not limited to: various media that can store program codes include a removable Memory device, a Random Access Memory (RAM), a magnetic Memory (e.g., a flexible disk, a hard disk, a magnetic tape, a magneto-optical disk (MO), etc.), an optical Memory (e.g., a CD, a DVD, a BD, an HVD, etc.), and a semiconductor Memory (e.g., a ROM, an EPROM, an EEPROM, a nonvolatile Memory (NAND FLASH), a Solid State Disk (SSD)).
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present application. And the aforementioned storage medium includes: various media that can store program codes include a removable Memory device, a Random Access Memory (RAM), a magnetic Memory (e.g., a flexible disk, a hard disk, a magnetic tape, a magneto-optical disk (MO), etc.), an optical Memory (e.g., a CD, a DVD, a BD, an HVD, etc.), and a semiconductor Memory (e.g., a ROM, an EPROM, an EEPROM, a nonvolatile Memory (NAND FLASH), a Solid State Disk (SSD)).
In some possible embodiments, various aspects of the methods provided by the present disclosure may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of the methods according to various exemplary embodiments of the present disclosure described above in this specification when the program product is run on the computer device, for example, the computer device may perform the transaction information processing method described in the embodiments of the present disclosure. The program product may employ any combination of one or more readable media.
In the technical scheme, the data acquisition, transmission, use and the like all meet the requirements of relevant national laws and regulations.
The above embodiments are only used to describe the technical solutions of the present application in detail, but the above embodiments are only used to help understanding the method of the embodiments of the present application, and should not be construed as limiting the embodiments of the present application. Modifications and substitutions that may be readily apparent to those skilled in the art are intended to be included within the scope of the embodiments of the present application.

Claims (17)

1. A method of constructing a data set, comprising:
acquiring a bill image to be processed;
performing OCR recognition on the bill image aiming at each bill image, and determining element information of each element entity and position information of each element entity in the bill image; wherein the element information comprises at least one of character information, table information and signature information;
for each element entity, determining the category of the element entity according to the element information and the position information;
based on each element entity, applying a label corresponding to the category of the element entity to label the element entity;
and determining an element set formed by each marked element entity in each to-be-processed bill image as a data set.
2. The method of claim 1, wherein determining the category of the element entity according to the element information and the location information comprises:
if the element information does not include preset keyword information, determining an adjacent element entity of the element entity according to the position information of the element information, and determining the category of the element entity according to the adjacent element entity; wherein the distance between the adjacent element entity and the element entity on the bill image is smaller than a preset distance threshold value.
3. The method according to claim 1, wherein the labeling the element entity with the label corresponding to the category comprises:
if the category of the element entity is a simple category, labeling the element entity with the label of the element entity; wherein the number of element entities in the area where an element entity of the simple category is located is one;
if the category of the element entity is a composite category, labeling the element entity with a composite label; wherein the composite label comprises the labels of all element entities in the composite element and the corresponding element values; and the number of element entities in the area where an element entity of the composite category is located is at least two.
4. The method according to claim 1, wherein before the labeling the element entity with the label corresponding to the category, the method further comprises:
displaying each element entity according to a preset display form.
5. The method according to claim 4, wherein if the preset display form is an html form, the displaying each element entity according to the preset display form comprises:
determining an element text box, and displaying the element text box in the html form; wherein the element text box comprises attribute information of the element to be displayed;
if the preset display form is a json form, the displaying each element entity according to the preset display form comprises:
displaying element key-value pairs according to the determined correspondence of the element key-value pairs; wherein the value in each key-value pair is a nested json string of the element to be displayed.
6. The method of claim 5, wherein before the determining an element text box and displaying the element text box in the html form, the method further comprises:
if the element to be displayed is an element in a table, determining an attribute tag of the element text box;
and displaying the attribute tag as auxiliary display information of the element text box.
7. A method of training a bill recognition model, comprising:
acquiring a training data set, wherein the training data set comprises a data set obtained by applying the method of any one of claims 1-6;
and training a pre-constructed neural network model by using the data set until the neural network model converges, to obtain the bill recognition model.
8. An apparatus for constructing a data set, comprising:
the image acquisition module is used for acquiring a bill image to be processed;
the image recognition module is used for performing, for each bill image, OCR recognition on the bill image, and determining the element information of each element entity and the position information of each element entity in the bill image; wherein the element information comprises at least one of character information, table information and signature information;
a category determination module, configured to determine, for each element entity, a category of the element entity according to the element information and the location information;
the labeling module is used for labeling, based on the category of each element entity, the element entity with a label corresponding to the category;
and the data set determining module is used for determining an element set formed by each labeled element entity in each to-be-processed bill image as a data set.
9. The apparatus of claim 8, wherein the category determination module is specifically configured to:
if the element information does not include preset keyword information, determine an adjacent element entity of the element entity according to the position information of the element entity, and determine the category of the element entity according to the adjacent element entity; wherein the distance between the adjacent element entity and the element entity on the bill image is smaller than a preset distance threshold.
10. The apparatus of claim 8, wherein the tagging module is specifically configured to:
if the category of the element entity is a simple category, label the element entity with the label of the element entity; wherein the number of element entities in the area where an element entity of the simple category is located is one;
if the category of the element entity is a composite category, label the element entity with a composite label; wherein the composite label comprises the labels of all element entities in the composite element and the corresponding element values; and the number of element entities in the area where an element entity of the composite category is located is at least two.
11. The apparatus of claim 8, further comprising a display module, configured to, before the element entity is labeled with the label corresponding to the category:
display each element entity according to a preset display form.
12. The apparatus of claim 11, wherein if the preset display form is an html form, the display module is specifically configured to:
determine an element text box, and display the element text box in the html form; wherein the element text box comprises attribute information of the element to be displayed;
if the preset display form is a json form, the display module is specifically configured to:
display element key-value pairs according to the determined correspondence of the element key-value pairs; wherein the value in each key-value pair is a nested json string of the element to be displayed.
13. The apparatus of claim 12, wherein the display module is further configured to, before determining an element text box and displaying the element text box in the html form:
if the element to be displayed is an element in a table, determine an attribute tag of the element text box;
and display the attribute tag as auxiliary display information of the element text box.
14. An apparatus for training a bill recognition model, comprising:
a data set acquisition module, configured to acquire a training data set, wherein the training data set comprises a data set obtained by applying the method of any one of claims 1-6;
and a training module, configured to train a pre-constructed neural network model by using the data set until the neural network model converges, to obtain the bill recognition model.
15. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
16. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by a processor.
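
To make the claimed flow easier to follow, the sketches below illustrate, in Python, the main steps recited in the claims above. They are informal reading aids, not part of the disclosure: every helper name, data structure, and parameter value is a hypothetical placeholder. This first sketch strings together the overall data-set construction of claim 1, assuming an external OCR routine and the categorisation and labeling helpers sketched after it.

from dataclasses import dataclass

@dataclass
class ElementEntity:
    info: dict            # element information: character, table or signature content
    box: tuple            # position information: (x1, y1, x2, y2) on the bill image
    category: str = ""    # filled in by determine_category
    label: object = None  # filled in by label_entity

def build_dataset(bill_images, ocr_recognize, determine_category, label_entity):
    """Hypothetical sketch of claim 1: OCR -> categorise -> label -> collect."""
    dataset = []
    for image in bill_images:
        entities = ocr_recognize(image)              # element info + position info per entity
        for entity in entities:
            entity.category = determine_category(entity, entities)
            entity.label = label_entity(entity, entities)  # region grouping omitted for brevity
        dataset.append(entities)                     # labeled element set for this bill image
    return dataset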
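Claim 2 falls back to the nearest neighbouring entity when the element information contains no preset keyword. A minimal sketch of that rule, assuming axis-aligned bounding boxes, a Euclidean distance between box centres, and an illustrative keyword table and threshold (none of these specifics appear in the claim):

import math

KEYWORD_CATEGORIES = {"金额": "amount", "日期": "date"}  # illustrative keyword table
DISTANCE_THRESHOLD = 50.0                                # preset distance threshold (pixels, assumed)

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def determine_category(entity, all_entities):
    """Hypothetical category rule: keyword match first, nearest neighbour as fallback."""
    for keyword, category in KEYWORD_CATEGORIES.items():
        if keyword in str(entity.info):
            return category
    cx, cy = center(entity.box)
    nearest, best = None, float("inf")
    for other in all_entities:
        if other is entity:
            continue
        ox, oy = center(other.box)
        distance = math.hypot(cx - ox, cy - oy)
        if distance < best:
            nearest, best = other, distance
    if nearest is not None and best < DISTANCE_THRESHOLD:
        return nearest.category or "unknown"
    return "unknown"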
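Claim 3 labels an entity differently depending on whether the area it sits in contains one entity (simple category) or at least two (composite category). One possible encoding, using a plain string for simple labels and a dict of per-entity labels and values for composite ones (the label vocabulary and the "text" field are assumptions):

def label_entity(entity, entities_in_region):
    """Hypothetical labeling rule for claim 3; grouping entities by area is assumed upstream."""
    if len(entities_in_region) == 1:
        # simple category: the single entity in the area is labeled with its own label
        return entity.category
    # composite category: the composite label carries the label and value of every entity in the area
    return {e.category: e.info.get("text", "") for e in entities_in_region}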
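Claims 4 to 6 display the entities to the annotator either as html element text boxes or as nested json key-value pairs. A rough illustration of the json branch only (field names and nesting are placeholders):

import json

def to_json_presentation(entity):
    """Hypothetical json display form: element label -> nested json string describing the element."""
    nested = json.dumps(
        {"text": entity.info.get("text", ""),
         "box": list(entity.box),
         "in_table": entity.info.get("in_table", False)},
        ensure_ascii=False,
    )
    # the key is the element's label, the value is the nested json string to be displayed
    return {entity.category or "element": nested}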
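Claim 7 trains a pre-constructed neural network on the constructed data set until it converges. A minimal PyTorch-style loop, assuming the data set has already been wrapped in a DataLoader of image/label batches and that the model and loss choice are classification placeholders:

import torch
from torch import nn

def train_bill_recognition_model(model, loader, epochs=20, lr=1e-4, tol=1e-4):
    """Hypothetical training loop for claim 7 with a crude epoch-loss convergence check."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    previous_loss = float("inf")
    for _ in range(epochs):
        epoch_loss = 0.0
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if abs(previous_loss - epoch_loss) < tol:  # treat a flat epoch loss as convergence
            break
        previous_loss = epoch_loss
    return model
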
CN202111421423.7A 2021-11-26 2021-11-26 Data set construction method, model training method and corresponding device Pending CN114067343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111421423.7A CN114067343A (en) 2021-11-26 2021-11-26 Data set construction method, model training method and corresponding device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111421423.7A CN114067343A (en) 2021-11-26 2021-11-26 Data set construction method, model training method and corresponding device

Publications (1)

Publication Number Publication Date
CN114067343A (en) 2022-02-18

Family

ID=80276635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111421423.7A Pending CN114067343A (en) 2021-11-26 2021-11-26 Data set construction method, model training method and corresponding device

Country Status (1)

Country Link
CN (1) CN114067343A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637845A (en) * 2022-03-11 2022-06-17 上海弘玑信息技术有限公司 Model testing method, device, equipment and storage medium
CN114637845B (en) * 2022-03-11 2023-04-14 上海弘玑信息技术有限公司 Model testing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106649818B (en) Application search intention identification method and device, application search method and server
US10055391B2 (en) Method and apparatus for forming a structured document from unstructured information
EP3926531B1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
CN109902285B (en) Corpus classification method, corpus classification device, computer equipment and storage medium
CN110427487B (en) Data labeling method and device and storage medium
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN112800848A (en) Structured extraction method, device and equipment of information after bill identification
CN114648392B (en) Product recommendation method and device based on user portrait, electronic equipment and medium
CN113360699A (en) Model training method and device, image question answering method and device
CN107844531B (en) Answer output method and device and computer equipment
CN113641794A (en) Resume text evaluation method and device and server
Aralikatte et al. Fault in your stars: an analysis of android app reviews
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN111143556A (en) Software function point automatic counting method, device, medium and electronic equipment
CN114067343A (en) Data set construction method, model training method and corresponding device
CN117520503A (en) Financial customer service dialogue generation method, device, equipment and medium based on LLM model
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN111460808A (en) Synonymous text recognition and content recommendation method and device and electronic equipment
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN113283432A (en) Image recognition and character sorting method and equipment
CN116756281A (en) Knowledge question-answering method, device, equipment and medium
CN115880702A (en) Data processing method, device, equipment, program product and storage medium
CN114647682A (en) Exercise arrangement method and device, electronic equipment and storage medium
GB2608112A (en) System and method for providing media content
CN113569741A (en) Answer generation method and device for image test questions, electronic equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination