CN112434555B - Key value pair region identification method and device, storage medium and electronic equipment - Google Patents
- Publication number: CN112434555B
- Application number: CN202011114774.9A
- Authority: CN (China)
- Prior art keywords: key, text, value, region, area
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V30/40—Document-oriented image-based pattern recognition
- G06F18/253—Fusion techniques of extracted features
- G06N3/045—Combinations of networks
- G06V30/153—Segmentation of character regions using recognition of characters or words
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06V30/10—Character recognition
Abstract
The embodiment of the invention discloses a key-value pair region identification method comprising: obtaining a target picture; inputting the target picture into a key-value pair region identification network; the network identifying the key-value pair regions in the target picture and outputting text regions segmented according to key-value pair combinations, together with key regions and value regions divided according to text attributes within those text regions. The network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with the key regions and value regions divided according to text attributes within them, inputting the picture samples and the labeled text, key, and value regions into a preset network structure, and training. The network can thus automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a key value pair region identification method, a key value pair region identification device, a storage medium, and an electronic apparatus.
Background
Reimbursement and data organization for existing notes, receipts, and the like rely on manual entry, which is inefficient and costly.
OCR (Optical Character Recognition) algorithms mainly locate text positions on an invoice with a convolutional network and then recognize the text with a recurrent neural network or the like. After these steps, the isolated text positions in the image and the corresponding recognition results are obtained, but the relational logic between them is missing, and manual rules are required to distinguish the recognized contents. For bills with simpler, fixed formats, such as quota (fixed-amount) invoices and value-added tax invoices, current mainstream technology can reach an overall recognition rate above 90% when the image text is clearly visible; but for scenarios with more complex formats, or requiring special rules, such as bank receipts and insurance documents, the recognition accuracy is only about 60% at the same image quality.
In summary, OCR technology combined with manual rules struggles to recognize scenarios with complex formats, and still suffers from low efficiency and high cost.
Disclosure of Invention
In view of the above problems, a key-value pair region identification method, a key-value pair region identification device, a storage medium, and an electronic device are provided to address the difficulty of recognizing complex-format scenarios with OCR technology and manual rules, and the associated low efficiency and high cost.
According to one aspect of the present invention, there is provided a key value pair region identification method including:
obtaining a target picture;
inputting the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training;
and identifying, by the key-value pair region identification network, the key-value pair regions in the target picture, and outputting the text regions segmented according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
Optionally, the identifying, by the key value pair area identifying network, a key value pair area in the target picture, outputting a text area divided according to a key value pair combination, and the key area and the value area divided according to text attributes in the text area includes:
extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
generating text areas segmented according to key value pair combinations according to the feature map;
and dividing the text region to generate the key region and the value region.
Optionally, generating the text region segmented according to the key value pair combination according to the feature map includes:
generating a plurality of candidate areas for each pixel point on the feature map;
identifying a target candidate region of the plurality of candidate regions that matches the key-value pair combination;
and merging the target candidate areas to obtain the text area.
Optionally, extracting features of different scales from the target picture by using a convolutional neural network, and performing feature fusion, where obtaining a fused feature map includes:
performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
Optionally, the method further comprises:
text recognition is carried out on the key area and the value area, so that key information of the key attribute in the key area and value information of the value attribute in the value area are obtained;
the key information and the value information are provided.
Optionally, if the key area includes a plurality of key areas, before the text recognition is performed on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area, the method further includes:
detecting line information in the target picture;
determining position information of the key area and the value area according to the line information;
the providing the key information and the value information includes:
and generating structural information composed of the key information and the value information according to the position information.
Optionally, the target picture comprises at least one of user health data, a bank receipt and a financial invoice.
According to another aspect of the present invention, there is provided a key-value pair region identifying apparatus including:
the acquisition module is used for acquiring the target picture;
The input module is used for inputting the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training;
and the identification module is used for identifying, by the key-value pair region identification network, the key-value pair regions in the target picture, and outputting the text regions divided according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
Optionally, the identification module includes:
the feature extraction submodule is used for extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
the region generation sub-module is used for generating text regions segmented according to key value pair combinations according to the feature map;
and the segmentation sub-module is used for segmenting the text region and generating the key region and the value region.
Optionally, the region generating submodule includes:
a region generation unit, configured to generate a plurality of candidate regions for each pixel point on the feature map;
a region identifying unit configured to identify a target candidate region, of the plurality of candidate regions, that matches the key-value pair combination;
and the merging unit is used for merging the target candidate areas to obtain the text areas.
Optionally, the feature extraction submodule includes:
the sampling unit is used for performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and the superposition unit is used for superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
Optionally, the apparatus further comprises:
the text recognition module is used for carrying out text recognition on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area;
and the information providing module is used for providing the key information and the value information.
Optionally, if the key area includes a plurality of key areas, the apparatus further includes:
the detection module is used for detecting line information in the target picture before the text recognition is carried out on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area;
The information determining module is used for determining the position information of the key area and the value area according to the line information;
the information providing module includes:
and the information generation module is used for generating structural information consisting of the key information and the value information according to the position information.
Optionally, the target picture comprises at least one of user health data, a bank receipt and a financial invoice.
According to another aspect of the present invention, there is provided a storage medium comprising a stored program, wherein the program, when run, controls a device on which the storage medium resides to perform one or more methods as described above.
According to another aspect of the present invention, there is provided an electronic apparatus including: a memory, a processor, and executable instructions stored in the memory and executable in the processor, wherein the processor implements one or more of the methods described above when executing the executable instructions.
According to the embodiment of the invention, the target picture is input into the key-value pair region identification network, which identifies the key-value pair regions in the target picture and outputs the text regions segmented according to key-value pair combinations together with the key regions and value regions divided according to text attributes within those text regions. Because the network is trained by labeling picture samples with such text regions, key regions, and value regions and feeding the samples and labels into a preset network structure, it can automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a key-value pair region identification method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a key-value pair region identification method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a key-value versus region identification process;
fig. 4 is a block diagram showing a structure of a key value pair area identifying apparatus according to a third embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example 1
Referring to fig. 1, a flowchart of a key-value pair region identification method according to the first embodiment of the present invention is shown; the method may specifically include:
and step 101, acquiring a target picture.
The target picture may include information of key value pair combination, such as various notes, receipts, etc., or any other suitable picture, which is not limited in this embodiment of the present invention.
In one embodiment of the present invention, after the target picture is acquired, the input RGB (RGB color mode) image is preprocessed, including but not limited to sharpening and denoising.
For example, the network input is an RGB three-channel image, and the picture is scaled to 512 x 512 due to computational power and model inference speed requirements.
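By way of illustration, a minimal preprocessing sketch along these lines is given below, assuming OpenCV; the embodiment names sharpening and denoising but not the specific operators, so the filter choices here are assumptions.

```python
import cv2
import numpy as np

def preprocess(image_bgr):
    """Scale the RGB three-channel input to 512 x 512, then denoise and sharpen.
    The concrete filters are assumptions; the patent only names the operations."""
    img = cv2.resize(image_bgr, (512, 512), interpolation=cv2.INTER_LINEAR)
    img = cv2.fastNlMeansDenoisingColored(img, None, 3, 3, 7, 21)  # light denoising
    sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    img = cv2.filter2D(img, -1, sharpen_kernel)  # simple sharpening
    return img.astype(np.float32) / 255.0  # normalize for the network input
```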
Step 102: input the target picture into a key-value pair region identification network.
In the embodiment of the invention, the target picture is input into the trained key value pair area identification network, and the key value pair area identification network can automatically process the target picture and output the identification result.
In the embodiment of the invention, in the training process of the key value pair region identification network, the text region segmented according to the key value pair combination and the key region and the value region divided according to the text attribute in the text region are adopted to mark the picture sample.
In the embodiment of the invention, the picture samples may contain information in key-value pair combinations; for example, on an identity card, the field name "Name" and the content "Zhang San" form a key-value pair combination, where "Name" is the key and "Zhang San" is the value. The picture samples include various notes, receipts, etc., or any other suitable pictures, which is not limited in this embodiment of the present invention.
In the embodiment of the invention, in order to detect each text region containing a key-value pair combination as a target, all key-value pair text regions of the training samples are segmented during network training, and each key-value pair combination text region is then labeled; for example, each key-value pair combination text region is labeled 1, and non-key-value-pair text regions are labeled 0. To classify the text regions further, each is divided into a key region and a value region according to text attributes, where the text attributes comprise the key in the key-value pair combination and the value in the key-value pair combination. During network training, the text region of each key-value pair combination is divided into two text boxes, one labeled as the key region and one as the value region; for example, the two text boxes in a key-value region are labeled with the two attributes key and value, respectively.
And inputting the picture sample, the marked text region, the key region and the value region into a preset network structure, and training to obtain a key value pair region identification network.
In the embodiment of the invention, the picture samples and the labeled text regions, key regions, and value regions are input into a preset network structure. The preset network structure is a machine learning model that can recognize images after training: given an image, it can segment the picture and output the labels of all regions. The network obtained by training with the picture samples and the labeled text regions, key regions, and value regions is recorded as the key-value pair region identification network. This network can output the text regions in a picture segmented according to key-value pair combinations, as well as the key regions and value regions within those text regions divided according to text attributes.
In an embodiment of the present invention, the preset network structure comprises a network for object detection and a network for image classification. The network first performs object detection on the picture and outputs the key-value pair text regions in the picture; it then classifies the two parts within each key-value pair text region and outputs the key region and the value region. During training, the values of the object detection objective function and the classification objective function are minimized, yielding a key-value pair region identification network that meets the performance target.
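By way of illustration, the joint objective can be sketched as follows; the patent states only that the detection and classification objective functions are minimized together, so the specific losses (smooth-L1 box regression, cross-entropy classification) and the equal weighting are assumptions.

```python
import torch.nn.functional as F

def joint_loss(pred_boxes, gt_boxes, pred_labels, gt_labels, w_det=1.0, w_cls=1.0):
    """Combined training objective: detection loss for key-value pair text regions
    plus classification loss for the key/value attribute of each text box."""
    det_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)   # objective function of target detection
    cls_loss = F.cross_entropy(pred_labels, gt_labels)  # objective function of classification
    return w_det * det_loss + w_cls * cls_loss          # both minimized during training
```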
Step 103: the key-value pair region identification network identifies the key-value pair regions in the target picture and outputs the text regions segmented according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
In the embodiment of the invention, the key-value pair region identification network can identify the key-value pair regions in the target picture, which comprise the text regions segmented according to key-value pair combinations, and classify those text regions, namely into the key regions and value regions divided according to text attributes within the text regions.
For example, a convolutional neural network is used to detect the text regions of a picture and classify the key and value items within them. Specifically, multi-scale features are first extracted from the input picture by the convolutional neural network, the features of different scales are then fused, and two further operations are performed on the fused feature map.
According to the embodiment of the invention, the target picture is input into the key-value pair region identification network, which identifies the key-value pair regions in the target picture and outputs the text regions segmented according to key-value pair combinations together with the key regions and value regions divided according to text attributes within those text regions. Because the network is trained by labeling picture samples with such text regions, key regions, and value regions and feeding the samples and labels into a preset network structure, it can automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
Example two
Referring to fig. 2, a flowchart of a key-value pair region identification method according to the second embodiment of the present invention is shown; the method may specifically include:
Step 201: acquire a target picture.
Step 202: input the target picture into a key-value pair region identification network.
Step 203: extract features of different scales from the target picture using a convolutional neural network and perform feature fusion to obtain a fused feature map.
Step 204: generate text regions segmented according to key-value pair combinations from the feature map.
Step 205: segment the text regions to generate the key regions and the value regions.
In the embodiment of the invention, a convolutional neural network is constructed to perform text detection on the picture. The network mainly comprises three modules: the first performs convolution and fusion operations on the picture to obtain features of different scales; the second regresses the text regions containing key-value pair combinations from the fused features; and the third further classifies the text regions from the second module, dividing the text boxes in each text region into areas with the two attributes of key and value, recorded as the key region and the value region.
For example, features of different scales are extracted by a convolutional neural network such as VGG (Visual Geometry Group network) or ResNet (residual network), and the fused features are output. A schematic diagram of the key-value pair region identification process is shown in fig. 3.
The convolution pooling 1 block in feature extraction comprises 1 convolution layer and 1 pooling layer: 64 3×3 convolution kernels followed by 1 max-pooling layer.
The convolution pooling 2 block comprises 2 convolution layers and 1 pooling layer: 128 3×3 convolution kernels per layer followed by 1 max-pooling layer.
The convolution pooling 3 block comprises 3 convolution layers and 1 pooling layer: first 2 layers of 256 3×3 convolution kernels, then 1 layer of 256 1×1 convolutions and 1 max-pooling layer.
The convolution pooling 4 block comprises 3 convolution layers and 1 pooling layer: first 2 layers of 512 3×3 convolution kernels, then 1 layer of 512 1×1 convolutions and 1 max-pooling layer.
The convolution pooling 5 block comprises 3 convolution layers and 1 pooling layer: first 2 layers of 512 3×3 convolution kernels, then 1 layer of 512 1×1 convolutions and 1 max-pooling layer.
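These five blocks can be sketched in PyTorch as follows; this is a minimal sketch built only from the stated kernel counts, so the class names, ReLU activations, and 2×2 pooling windows are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n3x3, use_1x1=False):
    """One convolution-pooling block: n3x3 3x3 conv layers, an optional 1x1 conv
    layer, then one 2x2 max-pooling layer."""
    layers, ch = [], in_ch
    for _ in range(n3x3):
        layers += [nn.Conv2d(ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
        ch = out_ch
    if use_1x1:
        layers += [nn.Conv2d(ch, out_ch, 1), nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

class Backbone(nn.Module):
    """Feature extractor mirroring convolution pooling blocks 1-5 above."""
    def __init__(self):
        super().__init__()
        self.block1 = conv_block(3, 64, 1)                   # 64 3x3 kernels + max pool
        self.block2 = conv_block(64, 128, 2)                 # 128 3x3 kernels + max pool
        self.block3 = conv_block(128, 256, 2, use_1x1=True)  # 256 3x3 + 256 1x1 + max pool
        self.block4 = conv_block(256, 512, 2, use_1x1=True)  # 512 3x3 + 512 1x1 + max pool
        self.block5 = conv_block(512, 512, 2, use_1x1=True)  # 512 3x3 + 512 1x1 + max pool

    def forward(self, x):                 # x: (N, 3, 512, 512)
        f2 = self.block2(self.block1(x))  # (N, 128, 128, 128)
        f3 = self.block3(f2)              # (N, 256, 64, 64)
        f4 = self.block4(f3)              # (N, 512, 32, 32)
        f5 = self.block5(f4)              # (N, 512, 16, 16)
        return f2, f3, f4, f5             # multi-scale features for fusion

# e.g. f2, f3, f4, f5 = Backbone()(torch.randn(1, 3, 512, 512))
```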
In an optional embodiment of the present invention, extracting features of different scales from the target picture using a convolutional neural network and performing feature fusion to obtain the fused feature map may include: performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer; and superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
The main purpose of up-sampling (also known as upscaling or image interpolation) is to enlarge the original image so that it can be displayed on a higher-resolution display device. Scaling an image does not add information about the image, so image quality is inevitably affected; however, there are indeed some scaling methods that can increase the information of the image, so that the quality of the scaled image exceeds that of the original.
For example, as shown in fig. 3, an up-sampling operation is first performed on the output of the last pooling layer, restoring its size to that of the step preceding that pooling operation; the result is then superposed directly with the pooling output of block 4 to obtain a new feature map, and in the same manner this new feature map is fused with the feature maps of blocks 3 and 2 to obtain the fused feature map.
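A sketch of this top-down fusion is given below, reusing the Backbone sketch above; the patent says the up-sampled map is "superposed" with the previous pooling output, and element-wise addition after a 1×1 channel-matching convolution is an assumption made here.

```python
import torch.nn as nn
import torch.nn.functional as F

class Fusion(nn.Module):
    """Fuse the multi-scale features: upsample the deepest map, add it to the
    previous block's output, and repeat down to block 2."""
    def __init__(self):
        super().__init__()
        self.lat3 = nn.Conv2d(512, 256, 1)  # match p4 (512 ch) to f3 (256 ch)
        self.lat2 = nn.Conv2d(256, 128, 1)  # match p3 (256 ch) to f2 (128 ch)

    def forward(self, f2, f3, f4, f5):
        def up(x, ref):  # resize x to ref's spatial size
            return F.interpolate(x, size=ref.shape[2:], mode="bilinear",
                                 align_corners=False)
        p4 = f4 + up(f5, f4)             # superpose with pooling block 4 output
        p3 = f3 + self.lat3(up(p4, f3))  # then with block 3
        p2 = f2 + self.lat2(up(p3, f2))  # then with block 2
        return p2                        # fused feature map
```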
In an alternative embodiment of the present invention, an implementation of generating the text regions segmented according to key-value pair combinations from the feature map may include: generating a plurality of candidate regions for each pixel point on the feature map; identifying the target candidate regions, among the plurality of candidate regions, that match a key-value pair combination; and merging the target candidate regions to obtain the text regions.
As shown in fig. 3, the candidate region generation for key-value pair combinations produces 8 candidate boxes, i.e., candidate regions, of different sizes for each pixel point on the feature map output by the previous step. Candidate regions with the value 1 are then regressed out through threshold filtering; that is, the candidate regions among the plurality of candidate regions that match a key-value pair combination are identified and recorded as target candidate regions. These target candidate regions are then merged into the text regions of the key-value pair combinations using the NMS (Non-Maximum Suppression) algorithm.
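A generic sketch of the NMS merging step is shown below; the patent names the algorithm but not its parameters, so the IoU threshold and the [x1, y1, x2, y2] box format are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Non-maximum suppression: greedily keep the highest-scoring box and drop
    boxes that overlap it beyond the IoU threshold.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    order, keep = scores.argsort()[::-1], []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # keep only weakly-overlapping boxes
    return keep
```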
The next step is to classify the contents of each text region, which typically contains 2 text boxes. Within each text region, the text boxes of the key attribute and the value attribute are then obtained by a regression method.
According to the embodiment of the invention, the target picture is input into the key-value pair region identification network; features of different scales are extracted from the target picture by a convolutional neural network and fused to obtain a fused feature map; text regions segmented according to key-value pair combinations are generated from the feature map; and the text regions are segmented to generate the key regions and the value regions. The key-value pair region identification network can thus automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions. Compared with matching key and value rules under manual intervention, scenes with complex formats can also be identified accurately, so the method is more general, reduces the time spent on manual entry and checking, and saves a large amount of labor cost.
In an alternative embodiment of the invention, the method further comprises: text recognition is carried out on the key area and the value area, so that key information of the key attribute in the key area and value information of the value attribute in the value area are obtained; the key information and the value information are provided.
After the text regions, key regions, and value regions are obtained, text recognition yields the key information of the key attribute in the key region and the value information of the value attribute in the value region, and the key information and value information are then provided, for example as structured output.
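A sketch of such structured output is given below; `regions` and `ocr` are hypothetical interfaces introduced only for illustration and are not names from the patent.

```python
def structure_output(regions, ocr):
    """Pair each recognized key with its value and emit a dict.
    regions: list of (key_box, value_box) tuples from the identification network
    (hypothetical); ocr: a text-recognition callable (hypothetical)."""
    result = {}
    for key_box, value_box in regions:
        key_text = ocr(key_box)      # key information of the key attribute
        value_text = ocr(value_box)  # value information of the value attribute
        result[key_text] = value_text
    return result

# e.g. {"Name": "Zhang San", ...}
```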
In an optional embodiment of the present invention, if the key area includes a plurality of key areas, before the text recognition is performed on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area, the method further includes: detecting line information in the target picture; and determining position information of the key areas and the value areas according to the line information. One implementation of providing the key information and the value information then includes: generating structured information composed of the key information and the value information according to the position information.
If there is only one text region, the key-value pair text region is output by category. If there are multiple key regions, the text lines mostly resemble the header line of a table region, and such regions require the following operations: detect the straight line below the table header text line; restore the layout by image processing to obtain the position information of the table; match each value area in the table with the header text boxes; and output the text boxes row by row, setting the category of the header to table-key and the contents of the table to table-value. The key-value text boxes are then recognized to obtain the structured output result.
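One plausible way to detect the straight line below a table header with classical image processing is sketched here, assuming OpenCV; the Canny and Hough operators and all thresholds are assumptions, since the patent says only that straight lines are detected.

```python
import cv2
import numpy as np

def detect_table_lines(image_bgr):
    """Detect long, nearly horizontal ruling lines (e.g. the line under a table
    header) and return them as (x1, y1, x2, y2) segments."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                               minLineLength=200, maxLineGap=10)
    horizontal = []
    if segments is not None:
        for x1, y1, x2, y2 in segments[:, 0]:
            if abs(y2 - y1) <= 3:  # keep near-horizontal lines only
                horizontal.append((x1, y1, x2, y2))
    return horizontal
```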
In an alternative embodiment of the present invention, the target picture includes user health data, a bank receipt, a financial invoice, etc., or any other suitable picture, to which embodiments of the present invention are not limited.
The user health data comprise health image data such as physical examination reports, diagnosis records, various personal health indicators, and medical records. The key-value pair region identification network can identify the corresponding key regions and value regions in them. The network can be trained with sample data of at least one of user health data, bank receipts, and financial invoices: training with user health data samples yields a network that identifies user health data, training with bank receipt samples yields a network that identifies bank receipts, and training with financial invoice samples yields a network that identifies financial invoices.
Example III
Referring to fig. 4, a block diagram of a key value pair region identification apparatus in the third embodiment of the present invention is shown, which may specifically include:
an acquisition module 301, configured to acquire a target picture;
an input module 302, configured to input the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training;
and an identifying module 303, configured to identify, by the key-value pair region identification network, the key-value pair regions in the target picture, and to output the text regions divided according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
Optionally, the identification module includes:
the feature extraction submodule is used for extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
the region generation sub-module is used for generating text regions segmented according to key value pair combinations according to the feature map;
And the segmentation sub-module is used for segmenting the text region and generating the key region and the value region.
Optionally, the region generating submodule includes:
a region generation unit, configured to generate a plurality of candidate regions for each pixel point on the feature map;
a region identifying unit configured to identify a target candidate region, of the plurality of candidate regions, that matches the key-value pair combination;
and the merging unit is used for merging the target candidate areas to obtain the text areas.
Optionally, the feature extraction submodule includes:
the sampling unit is used for performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and the superposition unit is used for superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
Optionally, the apparatus further comprises:
the text recognition module is used for carrying out text recognition on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area;
and the information providing module is used for providing the key information and the value information.
Optionally, if the key area includes a plurality of key areas, the apparatus further includes:
the detection module is used for detecting line information in the target picture before the text recognition is carried out on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area;
the information determining module is used for determining the position information of the key area and the value area according to the line information;
the information providing module includes:
and the information generation module is used for generating structural information consisting of the key information and the value information according to the position information.
Optionally, the target picture comprises at least one of user health data, a bank receipt and a financial invoice.
According to the embodiment of the invention, the target picture is input into the key-value pair region identification network, which identifies the key-value pair regions in the target picture and outputs the text regions segmented according to key-value pair combinations together with the key regions and value regions divided according to text attributes within those text regions. Because the network is trained by labeling picture samples with such text regions, key regions, and value regions and feeding the samples and labels into a preset network structure, it can automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In an embodiment of the disclosure, the key-value pair region identification apparatus includes a processor and a memory. The above modules and sub-modules are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels can be provided. The kernel implements the key-value pair region identification by obtaining a target picture, inputting it into the key-value pair region identification network, identifying the key-value pair regions in the target picture, and outputting the text regions segmented according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions. Because the network is trained by labeling picture samples with such text regions, key regions, and value regions and feeding the samples and labels into a preset network structure, it can automatically detect the text regions formed by key-value pair combinations and simultaneously classify them, automatically obtaining the key regions and the value regions.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides a storage medium, on which a program is stored, which when executed by a processor, implements the key value pair region identification method.
The embodiment of the invention provides a processor which is used for running a program, wherein the key value pair region identification method is executed when the program runs.
The embodiment of the invention provides an electronic device, which comprises a processor, a memory and a program stored on the memory and capable of running on the processor, wherein the processor realizes the following steps when executing the program:
obtaining a target picture;
inputting the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training;
and identifying, by the key-value pair region identification network, the key-value pair regions in the target picture, and outputting the text regions segmented according to key-value pair combinations and the key regions and value regions divided according to text attributes within the text regions.
Optionally, the identifying, by the key value pair area identifying network, a key value pair area in the target picture, outputting a text area divided according to a key value pair combination, and the key area and the value area divided according to text attributes in the text area includes:
extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
generating text areas segmented according to key value pair combinations according to the feature map;
and dividing the text region to generate the key region and the value region.
Optionally, generating the text region segmented according to the key value pair combination according to the feature map includes:
generating a plurality of candidate areas for each pixel point on the feature map;
identifying a target candidate region of the plurality of candidate regions that matches the key-value pair combination;
and merging the target candidate areas to obtain the text area.
Optionally, extracting features of different scales from the target picture by using a convolutional neural network, and performing feature fusion, where obtaining a fused feature map includes:
performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
Optionally, the method further comprises:
text recognition is carried out on the key area and the value area, so that key information of the key attribute in the key area and value information of the value attribute in the value area are obtained;
the key information and the value information are provided.
Optionally, if the key area includes a plurality of key areas, before the text recognition is performed on the key area and the value area to obtain key information of the key attribute in the key area and value information of the value attribute in the value area, the method further includes:
detecting line information in the target picture;
determining position information of the key area and the value area according to the line information;
the providing the key information and the value information includes:
and generating structural information composed of the key information and the value information according to the position information.
Optionally, the target picture comprises at least one of user health data, a bank receipt and a financial invoice.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The foregoing is merely illustrative of the present invention and does not limit it; any variations or substitutions readily conceivable by a person skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (9)
1. A key-value pair region identification method, comprising:
obtaining a target picture;
inputting the target picture into a key-value pair region identification network; wherein the key-value pair region identification network is obtained by labeling picture samples in advance with text regions segmented according to key-value pair combinations and with key regions and value regions divided according to text attributes within the text regions, inputting the picture samples and the labeled text regions, key regions, and value regions into a preset network structure, and training; during network training, the text region of each key-value pair combination is divided into two text boxes, labeled as a key region and a value region respectively;
the key value pair area identification network identifies the key value pair area in the target picture, and outputs a text area divided according to the key value pair combination and key areas and value areas divided according to text attributes in the text area;
Wherein the key value pair region recognition network recognizes a key value pair region in the target picture, outputting a text region divided according to a key value pair combination, and the key region and the value region divided according to text attributes in the text region include:
extracting features of different scales from the target picture by using a convolutional neural network, and carrying out feature fusion to obtain a fused feature map;
generating text areas segmented according to key value pair combinations according to the feature map; wherein each text region containing a combination of key-value pairs is detected as a target;
and dividing the text region to generate the key region and the value region.
2. The method of claim 1, wherein generating text regions segmented by key-value-pair combinations from the feature map comprises:
generating a plurality of candidate areas for each pixel point on the feature map;
identifying a target candidate region of the plurality of candidate regions that matches the key-value pair combination;
and merging the target candidate areas to obtain the text area.
3. The method of claim 1, wherein extracting features of different scales from the target picture by using a convolutional neural network, and performing feature fusion to obtain a fused feature map comprises:
performing an up-sampling operation on the first feature map output by a pooling layer to obtain a second feature map with the same size as the output of the previous pooling layer;
and superposing the second feature map with the third feature map output by the previous pooling layer to obtain a fourth feature map.
4. The method according to claim 1, wherein the method further comprises:
text recognition is carried out on the key area and the value area, so that key information of the key attribute in the key area and value information of the value attribute in the value area are obtained;
the key information and the value information are provided.
5. The method of claim 4, wherein, if the key region includes a plurality of key regions, before the text recognition is performed on the key region and the value region to obtain key information of the key attribute in the key region and value information of the value attribute in the value region, the method further comprises:
detecting line information in the target picture;
determining position information of the key area and the value area according to the line information;
the providing the key information and the value information includes:
and generating structural information composed of the key information and the value information according to the position information.
6. The method of claim 1, wherein the target picture comprises at least one of user health data, a bank receipt, and a financial invoice.
7. A key-value pair region identifying apparatus, comprising:
an acquisition module, configured to acquire a target picture;
an input module, configured to input the target picture into a key-value pair region identification network; the key-value pair region identification network is obtained by annotating picture samples in advance with text regions divided by key-value pair combination and, within each text region, with key regions and value regions divided by text attribute, inputting the picture samples and the annotated text regions, key regions and value regions into a preset network structure, and training; during network training, the text region of each key-value pair combination is divided into two text boxes, which are annotated as a key region and a value region respectively;
an identification module, configured to identify, by the key-value pair region identification network, key-value pair regions in the target picture, and to output text regions divided by key-value pair combination and, within each text region, key regions and value regions divided by text attribute;
wherein the identification module comprises:
a feature extraction submodule, configured to extract features of different scales from the target picture by using a convolutional neural network and to perform feature fusion to obtain a fused feature map;
a region generation submodule, configured to generate, from the fused feature map, text regions segmented by key-value pair combination, wherein each text region containing one key-value pair combination is detected as a single target; and
a segmentation submodule, configured to segment each text region and generate the key region and the value region.
8. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of any one of claims 1 to 6.
9. An electronic device, comprising: a memory, a processor, and executable instructions stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 6 when executing the executable instructions.
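A minimal sketch of the up-sampling and superposition described in claim 3, assuming a PyTorch backbone; the function name `fuse_features`, the bilinear interpolation mode, and element-wise addition as the superposition are assumptions rather than disclosed details:

```python
import torch
import torch.nn.functional as F

def fuse_features(first_map: torch.Tensor, third_map: torch.Tensor) -> torch.Tensor:
    # Up-sample the deeper "first feature map" to the spatial size of the
    # preceding pooling layer's output (the "third feature map")...
    second_map = F.interpolate(
        first_map, size=third_map.shape[-2:], mode="bilinear", align_corners=False
    )
    # ...then superpose the two to obtain the fused "fourth feature map".
    # Element-wise addition is one plausible reading of "superposing";
    # channel-wise concatenation would be another.
    return second_map + third_map

# Toy usage: a 1/32-scale map fused into a 1/16-scale map.
fourth_map = fuse_features(torch.randn(1, 256, 10, 10), torch.randn(1, 256, 20, 20))
print(fourth_map.shape)  # torch.Size([1, 256, 20, 20])
```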
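One plausible reading of claim 2, sketched below: candidate boxes are generated for every feature-map cell, the candidates scored as key-value pair targets by a classification head (not shown) are kept, and overlapping targets are merged into one enclosing text region. The anchor sizes, stride, and thresholds are illustrative assumptions, not disclosed parameters:

```python
def candidates_for_pixel(x, y, stride=4, sizes=((64, 16), (128, 32), (256, 32))):
    """Candidate boxes (x1, y1, x2, y2) centred on feature-map cell (x, y)."""
    cx, cy = x * stride, y * stride
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2) for w, h in sizes]

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def merge_targets(boxes, scores, score_thr=0.5, iou_thr=0.3):
    """Keep boxes scored as key-value-pair targets; merge overlapping ones
    into enclosing text regions."""
    regions = []
    for box, score in zip(boxes, scores):
        if score < score_thr:
            continue
        for i, r in enumerate(regions):
            if iou(box, r) >= iou_thr:
                regions[i] = (min(r[0], box[0]), min(r[1], box[1]),
                              max(r[2], box[2]), max(r[3], box[3]))
                break
        else:
            regions.append(box)
    return regions
```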
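A toy sketch of the final step of claim 1: dividing a detected text region into a key box and a value box. A real system would predict the boundary from the network's segmentation output; the fixed split ratio here is a stand-in assumption:

```python
def split_region(region, split_ratio):
    """Split a horizontal text region (x1, y1, x2, y2) at `split_ratio` of its
    width, returning (key_region, value_region)."""
    x1, y1, x2, y2 = region
    xs = x1 + (x2 - x1) * split_ratio
    return (x1, y1, xs, y2), (xs, y1, x2, y2)

# e.g. a region for "Name: Alice" with the predicted boundary at 40% of the width:
key_box, value_box = split_region((100, 50, 300, 70), split_ratio=0.4)
# key_box == (100, 50, 180.0, 70); value_box == (180.0, 50, 300, 70)
```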
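An illustrative sketch of claims 4 and 5: when several key regions exist, detected line positions anchor each key to the value on the same line, and the resulting pairs are emitted as structured information. The line tolerance and the data layout are assumptions:

```python
def build_structured_info(keys, values, line_tol=10):
    """keys / values: lists of (text, (x1, y1, x2, y2)) produced by OCR. A key
    is paired with the nearest value whose vertical centre lies on the same
    detected line."""
    centre_y = lambda box: (box[1] + box[3]) / 2
    structured = {}
    for k_text, k_box in keys:
        same_line = [(v_text, v_box) for v_text, v_box in values
                     if abs(centre_y(v_box) - centre_y(k_box)) <= line_tol]
        if same_line:
            # Of the values on this line, take the one closest to the key's
            # right edge.
            v_text, _ = min(same_line, key=lambda v: abs(v[1][0] - k_box[2]))
            structured[k_text] = v_text
    return structured

info = build_structured_info(
    keys=[("Name", (20, 40, 80, 60)), ("Age", (20, 80, 60, 100))],
    values=[("Alice", (100, 42, 180, 58)), ("30", (100, 82, 140, 98))],
)
# info == {"Name": "Alice", "Age": "30"}
```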
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011114774.9A CN112434555B (en) | 2020-10-16 | 2020-10-16 | Key value pair region identification method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112434555A CN112434555A (en) | 2021-03-02 |
CN112434555B true CN112434555B (en) | 2024-04-09 |
Family
ID=74695658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011114774.9A Active CN112434555B (en) | 2020-10-16 | 2020-10-16 | Key value pair region identification method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112434555B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114092948B (en) * | 2021-11-24 | 2023-09-22 | 北京百度网讯科技有限公司 | Bill identification method, device, equipment and storage medium |
CN114724152A (en) * | 2022-02-22 | 2022-07-08 | 深圳职业技术学院 | Image-form-oriented shipping bill analysis method, device and equipment |
CN115116060B (en) * | 2022-08-25 | 2023-01-24 | 深圳前海环融联易信息科技服务有限公司 | Key value file processing method, device, equipment and medium |
CN115546488B (en) * | 2022-11-07 | 2023-05-19 | 北京百度网讯科技有限公司 | Information segmentation method, information extraction method and training method of information segmentation model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569361A (en) * | 2019-09-06 | 2019-12-13 | 腾讯科技(深圳)有限公司 | Text recognition method and equipment |
CN111177302A (en) * | 2019-12-16 | 2020-05-19 | 金蝶软件(中国)有限公司 | Business document processing method and device, computer equipment and storage medium |
CN111368527A (en) * | 2020-02-28 | 2020-07-03 | 上海汇航捷讯网络科技有限公司 | Key value matching method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10013643B2 (en) * | 2016-07-26 | 2018-07-03 | Intuit Inc. | Performing optical character recognition using spatial information of regions within a structured document |
US10628668B2 (en) * | 2017-08-09 | 2020-04-21 | Open Text Sa Ulc | Systems and methods for generating and using semantic images in deep learning for classification and data extraction |
EP3908971A1 (en) * | 2019-02-27 | 2021-11-17 | Google LLC | Identifying key-value pairs in documents |
- 2020-10-16: Application CN202011114774.9A filed in China; granted as CN112434555B (status: Active)
Non-Patent Citations (1)
Title |
---|
Chargrid: Towards Understanding 2D Documents; Anoop R Katti et al.; arXiv; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN112434555A (en) | 2021-03-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112434555B (en) | Key value pair region identification method and device, storage medium and electronic equipment | |
CN112381775B (en) | Image tampering detection method, terminal device and storage medium | |
CN111681273B (en) | Image segmentation method and device, electronic equipment and readable storage medium | |
CN107690657B (en) | Trade company is found according to image | |
US9679354B2 (en) | Duplicate check image resolution | |
CN111353491B (en) | Text direction determining method, device, equipment and storage medium | |
CN113963147B (en) | Key information extraction method and system based on semantic segmentation | |
CN110796145B (en) | Multi-certificate segmentation association method and related equipment based on intelligent decision | |
CN112883926B (en) | Identification method and device for form medical images | |
Zhao et al. | Automatic blur region segmentation approach using image matting | |
CN110738238A (en) | certificate information classification positioning method and device | |
CN112232336A (en) | Certificate identification method, device, equipment and storage medium | |
KR20230147130A (en) | Methods and apparatus for ranking images in a collection using image segmentation and image analysis | |
CN112884755B (en) | Method and device for detecting contraband | |
CN111178398A (en) | Method, system, storage medium and device for detecting tampering of image information of identity card | |
CN112200789B (en) | Image recognition method and device, electronic equipment and storage medium | |
CN113920434A (en) | Image reproduction detection method, device and medium based on target | |
CN112396060A (en) | Identity card identification method based on identity card segmentation model and related equipment thereof | |
CN112215266A (en) | X-ray image contraband detection method based on small sample learning | |
CN115035533B (en) | Data authentication processing method and device, computer equipment and storage medium | |
CN115393868B (en) | Text detection method, device, electronic equipment and storage medium | |
CN111242112A (en) | Image processing method, identity information processing method and device | |
CN115858695A (en) | Information processing method and device and storage medium | |
Sreelakshmy et al. | An improved method for copy-move forgery detection in digital forensic | |
Rani et al. | Object Detection in Natural Scene Images Using Thresholding Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |