CN112434555B - Key value pair region identification method and device, storage medium and electronic equipment - Google Patents
- Publication number: CN112434555B
- Application number: CN202011114774.9A
- Authority: CN (China)
- Prior art keywords: key, text, value, region, area
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V30/40—Document-oriented image-based pattern recognition
- G06F18/253—Fusion techniques of extracted features
- G06N3/045—Combinations of networks
- G06V30/153—Segmentation of character regions using recognition of characters or words
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06V30/10—Character recognition
Abstract
The embodiment of the invention discloses a key-value pair region identification method comprising: obtaining a target picture; inputting the target picture into a key-value pair region identification network; the network identifying the key-value pair regions in the target picture and outputting text regions segmented according to key-value pair combinations, together with key regions and value regions divided according to text attributes within those text regions. The network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with the key regions and value regions divided according to text attributes within them, inputting the picture samples and the labeled text, key, and value regions into a preset network structure, and training. The network can thus automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a key value pair region identification method, a key value pair region identification device, a storage medium, and an electronic apparatus.
Background
Reimbursement and data organization for existing notes, receipts, and the like rely on manual entry, which is inefficient and costly.
OCR (Optical Character Recognition) algorithms mainly locate text positions on an invoice with a convolutional network and then recognize the text with a recurrent neural network or the like. After these steps, the isolated text positions in the image and the corresponding recognition results are obtained, but the relational logic between them is missing, and manual rules are required to distinguish the recognized contents. For bills with simpler, fixed formats, such as quota (fixed-amount) invoices and value-added tax invoices, current mainstream technology can reach an overall recognition rate above 90% when the image text is clearly visible; but for scenarios with more complex formats, or requiring special rules, such as bank receipts and insurance documents, the recognition accuracy is only about 60% at the same image quality.
In summary, OCR technology combined with manual rules struggles to recognize scenarios with complex formats, and still suffers from low efficiency and high cost.
Disclosure of Invention
In view of the above problems, a key-value pair region identification method, a key-value pair region identification device, a storage medium, and an electronic device are provided to address the difficulty of recognizing complex-format scenarios with OCR technology and manual rules, and the associated low efficiency and high cost.
According to one aspect of the present invention, there is provided a key value pair region identification method including:
obtaining a target picture;
inputting the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training;
and identifying, by the key-value pair region identification network, the key-value pair regions in the target picture, and outputting the text regions segmented according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
Optionally, the identifying, by the key value pair area identifying network, a key value pair area in the target picture, outputting a text area divided according to a key value pair combination, and the key area and the value area divided according to text attributes in the text area includes:
extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
generating text areas segmented according to key value pair combinations according to the feature map;
and dividing the text region to generate the key region and the value region.
Optionally, generating the text region segmented according to the key value pair combination according to the feature map includes:
generating a plurality of candidate areas for each pixel point on the feature map;
identifying a target candidate region of the plurality of candidate regions that matches the key-value pair combination;
and merging the target candidate areas to obtain the text area.
Optionally, extracting features of different scales from the target picture by using a convolutional neural network, and performing feature fusion, where obtaining a fused feature map includes:
performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
Optionally, the method further comprises:
text recognition is carried out on the key area and the value area, so that key information of the key attribute in the key area and value information of the value attribute in the value area are obtained;
the key information and the value information are provided.
Optionally, if the key area includes a plurality of key areas, before the text recognition is performed on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area, the method further includes:
detecting line information in the target picture;
determining position information of the key area and the value area according to the line information;
the providing the key information and the value information includes:
and generating structural information composed of the key information and the value information according to the position information.
Optionally, the target picture comprises at least one of user health data, a bank receipt and a financial invoice.
According to another aspect of the present invention, there is provided a key-value pair region identifying apparatus including:
the acquisition module is used for acquiring the target picture;
The input module is used for inputting the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training;
and the identification module is used for identifying, by the key-value pair region identification network, the key-value pair regions in the target picture, and outputting the text regions divided according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
Optionally, the identification module includes:
the feature extraction submodule is used for extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
the region generation sub-module is used for generating text regions segmented according to key value pair combinations according to the feature map;
and the segmentation sub-module is used for segmenting the text region and generating the key region and the value region.
Optionally, the region generating submodule includes:
a region generation unit, configured to generate a plurality of candidate regions for each pixel point on the feature map;
a region identifying unit configured to identify a target candidate region, of the plurality of candidate regions, that matches the key-value pair combination;
and the merging unit is used for merging the target candidate areas to obtain the text areas.
Optionally, the feature extraction submodule includes:
the sampling unit is used for performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and the superposition unit is used for superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
Optionally, the apparatus further comprises:
the text recognition module is used for carrying out text recognition on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area;
and the information providing module is used for providing the key information and the value information.
Optionally, if the key area includes a plurality of key areas, the apparatus further includes:
the detection module is used for detecting line information in the target picture before the text recognition is carried out on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area;
The information determining module is used for determining the position information of the key area and the value area according to the line information;
the information providing module includes:
and the information generation module is used for generating structural information consisting of the key information and the value information according to the position information.
Optionally, the target picture comprises at least one of user health data, a bank receipt and a financial invoice.
According to another aspect of the present invention, there is provided a storage medium comprising a stored program, wherein the program, when run, controls a device on which the storage medium resides to perform one or more methods as described above.
According to another aspect of the present invention, there is provided an electronic apparatus including: a memory, a processor, and executable instructions stored in the memory and executable in the processor, wherein the processor implements one or more of the methods described above when executing the executable instructions.
According to the embodiment of the invention, the target picture is input into the key-value pair region identification network, which identifies the key-value pair regions in the target picture and outputs the text regions segmented according to key-value pair combinations together with the key regions and value regions divided according to text attributes within those text regions. Because the network is trained by labeling picture samples with such text regions, key regions, and value regions and feeding the samples and labels into a preset network structure, it can automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a key-value pair region identification method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a key-value pair region identification method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a key-value versus region identification process;
fig. 4 is a block diagram showing a structure of a key value pair area identifying apparatus according to a third embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example 1
Referring to fig. 1, a flowchart of a key-value pair region identification method according to the first embodiment of the present invention is shown; the method may specifically include:
and step 101, acquiring a target picture.
The target picture may include information of key value pair combination, such as various notes, receipts, etc., or any other suitable picture, which is not limited in this embodiment of the present invention.
In one embodiment of the present invention, after the target picture is acquired, the input RGB (RGB color mode) image is preprocessed, including but not limited to sharpening and denoising.
For example, the network input is an RGB three-channel image, and the picture is scaled to 512 x 512 due to computational power and model inference speed requirements.
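By way of illustration, a minimal preprocessing sketch along these lines is given below, assuming OpenCV; the embodiment names sharpening and denoising but not the specific operators, so the filter choices here are assumptions.

```python
import cv2
import numpy as np

def preprocess(image_bgr):
    """Scale the RGB three-channel input to 512 x 512, then denoise and sharpen.
    The concrete filters are assumptions; the patent only names the operations."""
    img = cv2.resize(image_bgr, (512, 512), interpolation=cv2.INTER_LINEAR)
    img = cv2.fastNlMeansDenoisingColored(img, None, 3, 3, 7, 21)  # light denoising
    sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    img = cv2.filter2D(img, -1, sharpen_kernel)  # simple sharpening
    return img.astype(np.float32) / 255.0  # normalize for the network input
```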
Step 102: input the target picture into a key-value pair region identification network.
In the embodiment of the invention, the target picture is input into the trained key value pair area identification network, and the key value pair area identification network can automatically process the target picture and output the identification result.
In the embodiment of the invention, in the training process of the key value pair region identification network, the text region segmented according to the key value pair combination and the key region and the value region divided according to the text attribute in the text region are adopted to mark the picture sample.
In the embodiment of the invention, the picture samples may contain information in key-value pair combinations; for example, on an identity card, the field name "Name" and the content "Zhang San" form a key-value pair combination, where "Name" is the key and "Zhang San" is the value. The picture samples include various notes, receipts, etc., or any other suitable pictures, which is not limited in this embodiment of the present invention.
In the embodiment of the invention, in order to detect each text region containing a key-value pair combination as a target, all key-value pair text regions of the training samples are segmented during network training, and each key-value pair combination text region is then labeled; for example, each key-value pair combination text region is labeled 1, and non-key-value-pair text regions are labeled 0. To classify the text regions further, each is divided into a key region and a value region according to text attributes, where the text attributes comprise the key in the key-value pair combination and the value in the key-value pair combination. During network training, the text region of each key-value pair combination is divided into two text boxes, one labeled as the key region and one as the value region; for example, the two text boxes in a key-value region are labeled with the two attributes key and value, respectively.
And inputting the picture sample, the marked text region, the key region and the value region into a preset network structure, and training to obtain a key value pair region identification network.
In the embodiment of the invention, the picture samples and the labeled text regions, key regions, and value regions are input into a preset network structure. The preset network structure is a machine learning model that can recognize images after training: given an image, it can segment the picture and output the labels of all regions. The network obtained by training with the picture samples and the labeled text regions, key regions, and value regions is recorded as the key-value pair region identification network. This network can output the text regions in a picture segmented according to key-value pair combinations, as well as the key regions and value regions within those text regions divided according to text attributes.
In an embodiment of the present invention, the preset network structure comprises a network for object detection and a network for image classification. The network first performs object detection on the picture and outputs the key-value pair text regions in the picture; it then classifies the two parts within each key-value pair text region and outputs the key region and the value region. During training, the values of the object detection objective function and the classification objective function are minimized, yielding a key-value pair region identification network that meets the performance target.
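By way of illustration, the joint objective can be sketched as follows; the patent states only that the detection and classification objective functions are minimized together, so the specific losses (smooth-L1 box regression, cross-entropy classification) and the equal weighting are assumptions.

```python
import torch.nn.functional as F

def joint_loss(pred_boxes, gt_boxes, pred_labels, gt_labels, w_det=1.0, w_cls=1.0):
    """Combined training objective: detection loss for key-value pair text regions
    plus classification loss for the key/value attribute of each text box."""
    det_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)   # objective function of target detection
    cls_loss = F.cross_entropy(pred_labels, gt_labels)  # objective function of classification
    return w_det * det_loss + w_cls * cls_loss          # both minimized during training
```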
Step 103: the key-value pair region identification network identifies the key-value pair regions in the target picture and outputs the text regions segmented according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
In the embodiment of the invention, the key-value pair region identification network can identify the key-value pair regions in the target picture, which comprise the text regions segmented according to key-value pair combinations, and classify those text regions, namely into the key regions and value regions divided according to text attributes within the text regions.
For example, a convolutional neural network is used to detect the text regions of a picture and classify the key and value items within them. Specifically, multi-scale features are first extracted from the input picture by the convolutional neural network, the features of different scales are then fused, and two further operations are performed on the fused feature map.
According to the embodiment of the invention, the target picture is input into the key-value pair region identification network, which identifies the key-value pair regions in the target picture and outputs the text regions segmented according to key-value pair combinations together with the key regions and value regions divided according to text attributes within those text regions. Because the network is trained by labeling picture samples with such text regions, key regions, and value regions and feeding the samples and labels into a preset network structure, it can automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
Example two
Referring to fig. 2, a flowchart of a key-value pair region identification method according to the second embodiment of the present invention is shown; the method may specifically include:
Step 201: acquire a target picture.
Step 202: input the target picture into a key-value pair region identification network.
Step 203: extract features of different scales from the target picture using a convolutional neural network and perform feature fusion to obtain a fused feature map.
Step 204: generate text regions segmented according to key-value pair combinations from the feature map.
Step 205: segment the text regions to generate the key regions and the value regions.
In the embodiment of the invention, a convolutional neural network is constructed to perform text detection on the picture. The network mainly comprises three modules: the first performs convolution and fusion operations on the picture to obtain features of different scales; the second regresses the text regions containing key-value pair combinations from the fused features; and the third further classifies the text regions from the second module, dividing the text boxes in each text region into areas with the two attributes of key and value, recorded as the key region and the value region.
For example, features of different scales are extracted by a convolutional neural network such as VGG (Visual Geometry Group network) or ResNet (residual network), and the fused features are output. A schematic diagram of the key-value pair region identification process is shown in fig. 3.
The convolution pooling 1 block in feature extraction comprises 1 convolution layer and 1 pooling layer: 64 3×3 convolution kernels followed by 1 max-pooling layer.
The convolution pooling 2 block comprises 2 convolution layers and 1 pooling layer: 128 3×3 convolution kernels per layer followed by 1 max-pooling layer.
The convolution pooling 3 block comprises 3 convolution layers and 1 pooling layer: first 2 layers of 256 3×3 convolution kernels, then 1 layer of 256 1×1 convolutions and 1 max-pooling layer.
The convolution pooling 4 block comprises 3 convolution layers and 1 pooling layer: first 2 layers of 512 3×3 convolution kernels, then 1 layer of 512 1×1 convolutions and 1 max-pooling layer.
The convolution pooling 5 block comprises 3 convolution layers and 1 pooling layer: first 2 layers of 512 3×3 convolution kernels, then 1 layer of 512 1×1 convolutions and 1 max-pooling layer.
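These five blocks can be sketched in PyTorch as follows; this is a minimal sketch built only from the stated kernel counts, so the class names, ReLU activations, and 2×2 pooling windows are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n3x3, use_1x1=False):
    """One convolution-pooling block: n3x3 3x3 conv layers, an optional 1x1 conv
    layer, then one 2x2 max-pooling layer."""
    layers, ch = [], in_ch
    for _ in range(n3x3):
        layers += [nn.Conv2d(ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
        ch = out_ch
    if use_1x1:
        layers += [nn.Conv2d(ch, out_ch, 1), nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

class Backbone(nn.Module):
    """Feature extractor mirroring convolution pooling blocks 1-5 above."""
    def __init__(self):
        super().__init__()
        self.block1 = conv_block(3, 64, 1)                   # 64 3x3 kernels + max pool
        self.block2 = conv_block(64, 128, 2)                 # 128 3x3 kernels + max pool
        self.block3 = conv_block(128, 256, 2, use_1x1=True)  # 256 3x3 + 256 1x1 + max pool
        self.block4 = conv_block(256, 512, 2, use_1x1=True)  # 512 3x3 + 512 1x1 + max pool
        self.block5 = conv_block(512, 512, 2, use_1x1=True)  # 512 3x3 + 512 1x1 + max pool

    def forward(self, x):                 # x: (N, 3, 512, 512)
        f2 = self.block2(self.block1(x))  # (N, 128, 128, 128)
        f3 = self.block3(f2)              # (N, 256, 64, 64)
        f4 = self.block4(f3)              # (N, 512, 32, 32)
        f5 = self.block5(f4)              # (N, 512, 16, 16)
        return f2, f3, f4, f5             # multi-scale features for fusion

# e.g. f2, f3, f4, f5 = Backbone()(torch.randn(1, 3, 512, 512))
```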
In an optional embodiment of the present invention, extracting features of different scales from the target picture using a convolutional neural network and performing feature fusion to obtain the fused feature map may include: performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer; and superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
The main purpose of up-sampling (also known as upscaling or image interpolation) is to enlarge the original image so that it can be displayed on a higher-resolution display device. Scaling an image does not add information about the image, so image quality is inevitably affected; however, there are indeed some scaling methods that can increase the information of the image, so that the quality of the scaled image exceeds that of the original.
For example, as shown in fig. 3, an up-sampling operation is first performed on the output of the last pooling layer, restoring its size to that of the step preceding that pooling operation; the result is then superposed directly with the pooling output of block 4 to obtain a new feature map, and in the same manner this new feature map is fused with the feature maps of blocks 3 and 2 to obtain the fused feature map.
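A sketch of this top-down fusion is given below, reusing the Backbone sketch above; the patent says the up-sampled map is "superposed" with the previous pooling output, and element-wise addition after a 1×1 channel-matching convolution is an assumption made here.

```python
import torch.nn as nn
import torch.nn.functional as F

class Fusion(nn.Module):
    """Fuse the multi-scale features: upsample the deepest map, add it to the
    previous block's output, and repeat down to block 2."""
    def __init__(self):
        super().__init__()
        self.lat3 = nn.Conv2d(512, 256, 1)  # match p4 (512 ch) to f3 (256 ch)
        self.lat2 = nn.Conv2d(256, 128, 1)  # match p3 (256 ch) to f2 (128 ch)

    def forward(self, f2, f3, f4, f5):
        def up(x, ref):  # resize x to ref's spatial size
            return F.interpolate(x, size=ref.shape[2:], mode="bilinear",
                                 align_corners=False)
        p4 = f4 + up(f5, f4)             # superpose with pooling block 4 output
        p3 = f3 + self.lat3(up(p4, f3))  # then with block 3
        p2 = f2 + self.lat2(up(p3, f2))  # then with block 2
        return p2                        # fused feature map
```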
In an alternative embodiment of the present invention, an implementation of generating the text regions segmented according to key-value pair combinations from the feature map may include: generating a plurality of candidate regions for each pixel point on the feature map; identifying the target candidate regions, among the plurality of candidate regions, that match a key-value pair combination; and merging the target candidate regions to obtain the text regions.
As shown in fig. 3, the candidate region generation for key-value pair combinations produces 8 candidate boxes, i.e., candidate regions, of different sizes for each pixel point on the feature map output by the previous step. Candidate regions with the value 1 are then regressed out through threshold filtering; that is, the candidate regions among the plurality of candidate regions that match a key-value pair combination are identified and recorded as target candidate regions. These target candidate regions are then merged into the text regions of the key-value pair combinations using the NMS (Non-Maximum Suppression) algorithm.
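A generic sketch of the NMS merging step is shown below; the patent names the algorithm but not its parameters, so the IoU threshold and the [x1, y1, x2, y2] box format are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Non-maximum suppression: greedily keep the highest-scoring box and drop
    boxes that overlap it beyond the IoU threshold.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    order, keep = scores.argsort()[::-1], []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # keep only weakly-overlapping boxes
    return keep
```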
The next step is to classify the contents of each text region, which typically contains 2 text boxes. Within each text region, the text boxes of the key attribute and the value attribute are then obtained by a regression method.
According to the embodiment of the invention, the target picture is input into the key-value pair region identification network; features of different scales are extracted from the target picture by a convolutional neural network and fused to obtain a fused feature map; text regions segmented according to key-value pair combinations are generated from the feature map; and the text regions are segmented to generate the key regions and the value regions. The key-value pair region identification network can thus automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions. Compared with matching key and value rules under manual intervention, scenes with complex formats can also be identified accurately, so the method is more general, reduces the time spent on manual entry and checking, and saves a large amount of labor cost.
In an alternative embodiment of the invention, the method further comprises: text recognition is carried out on the key area and the value area, so that key information of the key attribute in the key area and value information of the value attribute in the value area are obtained; the key information and the value information are provided.
After the text regions, key regions, and value regions are obtained, text recognition yields the key information of the key attribute in the key region and the value information of the value attribute in the value region, and the key information and value information are then provided, for example as structured output.
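A sketch of such structured output is given below; `regions` and `ocr` are hypothetical interfaces introduced only for illustration and are not names from the patent.

```python
def structure_output(regions, ocr):
    """Pair each recognized key with its value and emit a dict.
    regions: list of (key_box, value_box) tuples from the identification network
    (hypothetical); ocr: a text-recognition callable (hypothetical)."""
    result = {}
    for key_box, value_box in regions:
        key_text = ocr(key_box)      # key information of the key attribute
        value_text = ocr(value_box)  # value information of the value attribute
        result[key_text] = value_text
    return result

# e.g. {"Name": "Zhang San", ...}
```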
In an optional embodiment of the present invention, if the key area includes a plurality of key areas, before the text recognition is performed on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area, the method further includes: detecting line information in the target picture; and determining position information of the key areas and the value areas according to the line information. One implementation of providing the key information and the value information then includes: generating structured information composed of the key information and the value information according to the position information.
If there is only one text region, the key-value pair text region is output by category. If there are multiple key regions, the text lines mostly resemble the header line of a table region, and such regions require the following operations: detect the straight line below the table header text line; restore the layout by image processing to obtain the position information of the table; match each value area in the table with the header text boxes; and output the text boxes row by row, setting the category of the header to table-key and the contents of the table to table-value. The key-value text boxes are then recognized to obtain the structured output result.
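One plausible way to detect the straight line below a table header with classical image processing is sketched here, assuming OpenCV; the Canny and Hough operators and all thresholds are assumptions, since the patent says only that straight lines are detected.

```python
import cv2
import numpy as np

def detect_table_lines(image_bgr):
    """Detect long, nearly horizontal ruling lines (e.g. the line under a table
    header) and return them as (x1, y1, x2, y2) segments."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                               minLineLength=200, maxLineGap=10)
    horizontal = []
    if segments is not None:
        for x1, y1, x2, y2 in segments[:, 0]:
            if abs(y2 - y1) <= 3:  # keep near-horizontal lines only
                horizontal.append((x1, y1, x2, y2))
    return horizontal
```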
In an alternative embodiment of the present invention, the target picture includes user health data, a bank receipt, a financial invoice, etc., or any other suitable picture, to which embodiments of the present invention are not limited.
The user health data comprise health image data such as physical examination reports, diagnosis records, various personal health indicators, and medical records. The key-value pair region identification network can identify the corresponding key regions and value regions in them. The network can be trained with sample data of at least one of user health data, bank receipts, and financial invoices: training with user health data samples yields a network that identifies user health data, training with bank receipt samples yields a network that identifies bank receipts, and training with financial invoice samples yields a network that identifies financial invoices.
Example III
Referring to fig. 4, a block diagram of a key value pair region identification apparatus in the third embodiment of the present invention is shown, which may specifically include:
an acquisition module 301, configured to acquire a target picture;
an input module 302, configured to input the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training;
and an identifying module 303, configured to identify, by the key-value pair region identification network, the key-value pair regions in the target picture, and to output the text regions divided according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
Optionally, the identification module includes:
the feature extraction submodule is used for extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
the region generation sub-module is used for generating text regions segmented according to key value pair combinations according to the feature map;
And the segmentation sub-module is used for segmenting the text region and generating the key region and the value region.
Optionally, the region generating submodule includes:
a region generation unit, configured to generate a plurality of candidate regions for each pixel point on the feature map;
a region identifying unit configured to identify a target candidate region, of the plurality of candidate regions, that matches the key-value pair combination;
and the merging unit is used for merging the target candidate areas to obtain the text areas.
Optionally, the feature extraction submodule includes:
the sampling unit is used for performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and the superposition unit is used for superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
Optionally, the apparatus further comprises:
the text recognition module is used for carrying out text recognition on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area;
and the information providing module is used for providing the key information and the value information.
Optionally, if the key area includes a plurality of key areas, the apparatus further includes:
the detection module is used for detecting line information in the target picture before the text recognition is carried out on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area;
the information determining module is used for determining the position information of the key area and the value area according to the line information;
the information providing module includes:
and the information generation module is used for generating structural information consisting of the key information and the value information according to the position information.
Optionally, the target picture comprises at least one of user health data, a bank receipt and a financial invoice.
According to the embodiment of the invention, the target picture is input into the key-value pair region identification network, which identifies the key-value pair regions in the target picture and outputs the text regions segmented according to key-value pair combinations together with the key regions and value regions divided according to text attributes within those text regions. Because the network is trained by labeling picture samples with such text regions, key regions, and value regions and feeding the samples and labels into a preset network structure, it can automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In an embodiment of the disclosure, the key-value pair region identification apparatus includes a processor and a memory. The above modules and sub-modules are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels can be provided. The kernel implements the key-value pair region identification by obtaining a target picture, inputting it into the key-value pair region identification network, identifying the key-value pair regions in the target picture, and outputting the text regions segmented according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions. Because the network is trained by labeling picture samples with such text regions, key regions, and value regions and feeding the samples and labels into a preset network structure, it can automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides a storage medium, on which a program is stored, which when executed by a processor, implements the key value pair region identification method.
The embodiment of the invention provides a processor which is used for running a program, wherein the key value pair region identification method is executed when the program runs.
The embodiment of the invention provides an electronic device, which comprises a processor, a memory and a program stored on the memory and capable of running on the processor, wherein the processor realizes the following steps when executing the program:
obtaining a target picture;
inputting the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training;
and identifying, by the key-value pair region identification network, the key-value pair regions in the target picture, and outputting the text regions segmented according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
Optionally, the identifying, by the key value pair area identifying network, a key value pair area in the target picture, outputting a text area divided according to a key value pair combination, and the key area and the value area divided according to text attributes in the text area includes:
extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
generating text areas segmented according to key value pair combinations according to the feature map;
and dividing the text region to generate the key region and the value region.
Optionally, generating the text region segmented according to the key value pair combination according to the feature map includes:
generating a plurality of candidate areas for each pixel point on the feature map;
identifying a target candidate region of the plurality of candidate regions that matches the key-value pair combination;
and merging the target candidate areas to obtain the text area.
Optionally, extracting features of different scales from the target picture by using a convolutional neural network, and performing feature fusion, where obtaining a fused feature map includes:
performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
Optionally, the method further comprises:
text recognition is carried out on the key area and the value area, so that key information of the key attribute in the key area and value information of the value attribute in the value area are obtained;
the key information and the value information are provided.
Optionally, if the key area includes a plurality of key areas, before the text recognition is performed on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area, the method further includes:
detecting line information in the target picture;
determining position information of the key area and the value area according to the line information;
the providing the key information and the value information includes:
and generating structural information composed of the key information and the value information according to the position information.
Optionally, the target picture comprises at least one of user health data, a bank receipt and a financial invoice.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The foregoing is merely illustrative of the present invention and does not limit it; any variations or substitutions readily conceivable by a person skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (9)
1. A key-value pair region identification method, comprising:
obtaining a target picture;
inputting the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training; during network training, the text region of each key-value pair combination is divided into two text boxes, labeled as a key region and a value region respectively;
the key value pair area identification network identifies the key value pair area in the target picture, and outputs a text area divided according to the key value pair combination and key areas and value areas divided according to text attributes in the text area;
Wherein the key value pair region recognition network recognizes a key value pair region in the target picture, outputting a text region divided according to a key value pair combination, and the key region and the value region divided according to text attributes in the text region include:
extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
generating text areas segmented according to key value pair combinations according to the feature map; wherein each text region containing a combination of key-value pairs is detected as a target;
and dividing the text region to generate the key region and the value region.
2. The method of claim 1, wherein generating text regions segmented by key-value-pair combinations from the feature map comprises:
generating a plurality of candidate areas for each pixel point on the feature map;
identifying a target candidate region of the plurality of candidate regions that matches the key-value pair combination;
and merging the target candidate areas to obtain the text area.
3. The method of claim 1, wherein extracting features of different scales from the target picture by using a convolutional neural network, and performing feature fusion to obtain a fused feature map comprises:
performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
4. The method according to claim 1, wherein the method further comprises:
text recognition is carried out on the key area and the value area, so that key information of the key attribute in the key area and value information of the value attribute in the value area are obtained;
the key information and the value information are provided.
5. The method of claim 4, wherein, if the key region includes a plurality of key regions, before the text recognition is performed on the key region and the value region to obtain key information of the key attribute in the key region and value information of the value attribute in the value region, the method further comprises:
detecting line information in the target picture;
determining position information of the key area and the value area according to the line information;
the providing the key information and the value information includes:
and generating structural information composed of the key information and the value information according to the position information.
6. The method of claim 1, wherein the target picture comprises at least one of user health data, a bank receipt, and a financial invoice.
7. A key-value pair region identifying apparatus, comprising:
an acquisition module, configured to acquire a target picture;
an input module, configured to input the target picture into a key-value pair region identification network; the key-value pair region identification network is obtained by annotating picture samples in advance with text regions divided by key-value pair combination and, within each text region, with key regions and value regions divided by text attribute, inputting the picture samples and the annotated text regions, key regions and value regions into a preset network structure, and training; during network training, the text region of each key-value pair combination is divided into two text boxes, which are annotated as a key region and a value region respectively;
an identification module, configured to identify, by the key-value pair region identification network, key-value pair regions in the target picture, and to output text regions divided by key-value pair combination and, within each text region, key regions and value regions divided by text attribute;
wherein the identification module comprises:
a feature extraction submodule, configured to extract features of different scales from the target picture by using a convolutional neural network and to perform feature fusion to obtain a fused feature map;
a region generation submodule, configured to generate, from the fused feature map, text regions segmented by key-value pair combination, wherein each text region containing one key-value pair combination is detected as a single target; and
a segmentation submodule, configured to segment each text region and generate the key region and the value region.
8. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of any one of claims 1 to 6.
9. An electronic device, comprising: a memory, a processor, and executable instructions stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 6 when executing the executable instructions.
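A minimal sketch of the up-sampling and superposition described in claim 3, assuming a PyTorch backbone; the function name `fuse_features`, the bilinear interpolation mode, and element-wise addition as the superposition are assumptions rather than disclosed details:

```python
import torch
import torch.nn.functional as F

def fuse_features(first_map: torch.Tensor, third_map: torch.Tensor) -> torch.Tensor:
    # Up-sample the deeper "first feature map" to the spatial size of the
    # preceding pooling layer's output (the "third feature map")...
    second_map = F.interpolate(
        first_map, size=third_map.shape[-2:], mode="bilinear", align_corners=False
    )
    # ...then superpose the two to obtain the fused "fourth feature map".
    # Element-wise addition is one plausible reading of "superposing";
    # channel-wise concatenation would be another.
    return second_map + third_map

# Toy usage: a 1/32-scale map fused into a 1/16-scale map.
fourth_map = fuse_features(torch.randn(1, 256, 10, 10), torch.randn(1, 256, 20, 20))
print(fourth_map.shape)  # torch.Size([1, 256, 20, 20])
```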
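One plausible reading of claim 2, sketched below: candidate boxes are generated for every feature-map cell, the candidates scored as key-value pair targets by a classification head (not shown) are kept, and overlapping targets are merged into one enclosing text region. The anchor sizes, stride, and thresholds are illustrative assumptions, not disclosed parameters:

```python
def candidates_for_pixel(x, y, stride=4, sizes=((64, 16), (128, 32), (256, 32))):
    """Candidate boxes (x1, y1, x2, y2) centred on feature-map cell (x, y)."""
    cx, cy = x * stride, y * stride
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2) for w, h in sizes]

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def merge_targets(boxes, scores, score_thr=0.5, iou_thr=0.3):
    """Keep boxes scored as key-value-pair targets; merge overlapping ones
    into enclosing text regions."""
    regions = []
    for box, score in zip(boxes, scores):
        if score < score_thr:
            continue
        for i, r in enumerate(regions):
            if iou(box, r) >= iou_thr:
                regions[i] = (min(r[0], box[0]), min(r[1], box[1]),
                              max(r[2], box[2]), max(r[3], box[3]))
                break
        else:
            regions.append(box)
    return regions
```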
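A toy sketch of the final step of claim 1: dividing a detected text region into a key box and a value box. A real system would predict the boundary from the network's segmentation output; the fixed split ratio here is a stand-in assumption:

```python
def split_region(region, split_ratio):
    """Split a horizontal text region (x1, y1, x2, y2) at `split_ratio` of its
    width, returning (key_region, value_region)."""
    x1, y1, x2, y2 = region
    xs = x1 + (x2 - x1) * split_ratio
    return (x1, y1, xs, y2), (xs, y1, x2, y2)

# e.g. a region for "Name: Alice" with the predicted boundary at 40% of the width:
key_box, value_box = split_region((100, 50, 300, 70), split_ratio=0.4)
# key_box == (100, 50, 180.0, 70); value_box == (180.0, 50, 300, 70)
```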
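An illustrative sketch of claims 4 and 5: when several key regions exist, detected line positions anchor each key to the value on the same line, and the resulting pairs are emitted as structured information. The line tolerance and the data layout are assumptions:

```python
def build_structured_info(keys, values, line_tol=10):
    """keys / values: lists of (text, (x1, y1, x2, y2)) produced by OCR. A key
    is paired with the nearest value whose vertical centre lies on the same
    detected line."""
    centre_y = lambda box: (box[1] + box[3]) / 2
    structured = {}
    for k_text, k_box in keys:
        same_line = [(v_text, v_box) for v_text, v_box in values
                     if abs(centre_y(v_box) - centre_y(k_box)) <= line_tol]
        if same_line:
            # Of the values on this line, take the one closest to the key's
            # right edge.
            v_text, _ = min(same_line, key=lambda v: abs(v[1][0] - k_box[2]))
            structured[k_text] = v_text
    return structured

info = build_structured_info(
    keys=[("Name", (20, 40, 80, 60)), ("Age", (20, 80, 60, 100))],
    values=[("Alice", (100, 42, 180, 58)), ("30", (100, 82, 140, 98))],
)
# info == {"Name": "Alice", "Age": "30"}
```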
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011114774.9A CN112434555B (en) | 2020-10-16 | 2020-10-16 | Key value pair region identification method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112434555A CN112434555A (en) | 2021-03-02 |
CN112434555B true CN112434555B (en) | 2024-04-09 |
Family
ID=74695658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011114774.9A Active CN112434555B (en) | 2020-10-16 | 2020-10-16 | Key value pair region identification method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112434555B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114092948B (en) * | 2021-11-24 | 2023-09-22 | 北京百度网讯科技有限公司 | Bill identification method, device, equipment and storage medium |
CN114724152A (en) * | 2022-02-22 | 2022-07-08 | 深圳职业技术学院 | Image-form-oriented shipping bill analysis method, device and equipment |
CN115116060B (en) * | 2022-08-25 | 2023-01-24 | 深圳前海环融联易信息科技服务有限公司 | Key value file processing method, device, equipment and medium |
CN115546488B (en) * | 2022-11-07 | 2023-05-19 | 北京百度网讯科技有限公司 | Information segmentation method, information extraction method and training method of information segmentation model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569361A (en) * | 2019-09-06 | 2019-12-13 | 腾讯科技(深圳)有限公司 | Text recognition method and equipment |
CN111177302A (en) * | 2019-12-16 | 2020-05-19 | 金蝶软件(中国)有限公司 | Business document processing method and device, computer equipment and storage medium |
CN111368527A (en) * | 2020-02-28 | 2020-07-03 | 上海汇航捷讯网络科技有限公司 | Key value matching method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10013643B2 (en) * | 2016-07-26 | 2018-07-03 | Intuit Inc. | Performing optical character recognition using spatial information of regions within a structured document |
US10628668B2 (en) * | 2017-08-09 | 2020-04-21 | Open Text Sa Ulc | Systems and methods for generating and using semantic images in deep learning for classification and data extraction |
EP3908971A1 (en) * | 2019-02-27 | 2021-11-17 | Google LLC | Identifying key-value pairs in documents |
- 2020-10-16: Application CN202011114774.9A filed in China; granted as CN112434555B (status: Active)
Non-Patent Citations (1)
Title |
---|
Chargrid: Towards Understanding 2D Documents; Anoop R Katti et al.; arXiv; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN112434555A (en) | 2021-03-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112434555B (en) | Key value pair region identification method and device, storage medium and electronic equipment | |
CN112381775B (en) | Image tampering detection method, terminal device and storage medium | |
CN111681273B (en) | Image segmentation method and device, electronic equipment and readable storage medium | |
CN107690657B (en) | Trade company is found according to image | |
US9679354B2 (en) | Duplicate check image resolution | |
CN111353491B (en) | Text direction determining method, device, equipment and storage medium | |
CN113963147B (en) | Key information extraction method and system based on semantic segmentation | |
CN110796145B (en) | Multi-certificate segmentation association method and related equipment based on intelligent decision | |
CN112883926B (en) | Identification method and device for form medical images | |
Zhao et al. | Automatic blur region segmentation approach using image matting | |
CN110738238A (en) | certificate information classification positioning method and device | |
CN112232336A (en) | Certificate identification method, device, equipment and storage medium | |
KR20230147130A (en) | Methods and apparatus for ranking images in a collection using image segmentation and image analysis | |
CN112884755B (en) | Method and device for detecting contraband | |
CN111178398A (en) | Method, system, storage medium and device for detecting tampering of image information of identity card | |
CN112200789B (en) | Image recognition method and device, electronic equipment and storage medium | |
CN113920434A (en) | Image reproduction detection method, device and medium based on target | |
CN112396060A (en) | Identity card identification method based on identity card segmentation model and related equipment thereof | |
CN112215266A (en) | X-ray image contraband detection method based on small sample learning | |
CN115035533B (en) | Data authentication processing method and device, computer equipment and storage medium | |
CN115393868B (en) | Text detection method, device, electronic equipment and storage medium | |
CN111242112A (en) | Image processing method, identity information processing method and device | |
CN115858695A (en) | Information processing method and device and storage medium | |
Sreelakshmy et al. | An improved method for copy-move forgery detection in digital forensic | |
Rani et al. | Object Detection in Natural Scene Images Using Thresholding Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |