CN114724156B - Form identification method and device and electronic equipment

Info

Publication number: CN114724156B (granted); published as CN114724156A
Application number: CN202210419150.0A
Authority: CN (China)
Prior art keywords: entity, text, category, association, units
Legal status: Active
Other languages: Chinese (zh)
Inventors: 李煜林, 钦夏孟, 章成全, 姚锟
Assignee (current and original): Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd, with priority to CN202210419150.0A

Classifications

    • G06F18/2415 — Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253 — Fusion techniques of extracted features
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Combinations of networks

Abstract

The disclosure provides a form recognition method, a form recognition apparatus and an electronic device, and relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision. A specific implementation scheme is as follows: acquiring an image to be identified, where the image to be identified includes the image content of a target form and the target form includes M text entity units; acquiring a first feature of the target form based on the image to be identified; performing entity classification on the M text entity units based on the first feature to obtain the entity category of each text entity unit; predicting, based on the first feature, the association relations between different text entity units among the M text entity units to obtain the relation information of the M text entity units, where the association relations characterize whether an association exists between different text entity units; and outputting form information of the target form based on the entity categories and the relation information.

Description

Form identification method and device and electronic equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision, and specifically relates to a form recognition method, a form recognition device and electronic equipment.
Background
Forms are semi-structured documents that are widely used in a variety of business, office and other contexts. In an automated office system, recognizing form information from a form image is one of the important functions of the system.
Currently, form recognition typically relies on a purpose-built set of layout analysis tools that divide a form image into modules of different types and process them from the top down.
Disclosure of Invention
The disclosure provides a form identification method, a form identification device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a form recognition method, including:
acquiring an image to be identified, wherein the image to be identified comprises image contents of a target form, the target form comprises M text entity units, and M is an integer greater than 1;
acquiring a first characteristic of the target form based on the image to be identified;
based on the first characteristics, carrying out entity classification on the M text entity units to obtain entity categories of each text entity unit;
based on the first characteristics, predicting association relations among different text entity units in the M text entity units to obtain relation information of the M text entity units, wherein the association relations are used for representing whether association exists among the different text entity units;
and outputting form information of the target form based on the entity category and the relation information.
According to a second aspect of the present disclosure, there is provided a form recognition apparatus including:
the first acquisition module is used for acquiring an image to be identified, wherein the image to be identified comprises image content of a target form, the target form comprises M text entity units, and M is an integer greater than 1;
the second acquisition module is used for acquiring the first characteristics of the target form based on the image to be identified;
the entity classification module is used for carrying out entity classification on the M text entity units based on the first characteristics to obtain entity categories of each text entity unit;
the relation prediction module is used for predicting association relations among different text entity units in the M text entity units based on the first characteristics to obtain relation information of the M text entity units, wherein the association relations are used for representing whether association exists among the different text entity units;
and the output module is used for outputting the form information of the target form based on the entity category and the relation information.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any of the methods of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements any of the methods of the first aspect.
This technique solves the problem of poor generality caused by form recognition that depends on a particular form style, and improves the robustness of form recognition.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a form identification method according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a form image in an example;
FIG. 3 is a schematic diagram of a form image in another example;
FIG. 4 is a schematic diagram of a form image in yet another example;
FIG. 5 is a schematic diagram of a form recognition device according to a second embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
First embodiment
As shown in fig. 1, the present disclosure provides a form recognition method, including the steps of:
step S101: and acquiring an image to be identified, wherein the image to be identified comprises image contents of a target form, and the target form comprises M text entity units.
Wherein M is an integer greater than 1.
In this embodiment, the form recognition method relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision, and can be widely applied to scenes such as optical character recognition (Optical Character Recognition, OCR). The form recognition method of the embodiment of the present disclosure may be performed by the form recognition apparatus of the embodiment of the present disclosure. The form recognition apparatus of the embodiment of the present disclosure may be configured in any electronic device to perform the form recognition method of the embodiment of the present disclosure. The electronic device may be a server or a terminal device, and is not particularly limited herein.
In this step, the image to be identified may be any image that includes the content of a form, and such an image may be referred to as a form image. The form may be a document in semi-structured form that includes key-value pairs.
As shown in FIG. 2, the image 200 includes the image content of a form that includes key-value pairs, such as key 201 and value 202. As shown in FIG. 3, the image 300 includes the image content of a document, which is a special form composed of key-value pairs: the text content to the left of the colon symbol ':' may be the key 301, and the text content to the right of the colon symbol ':' may be the value 302.
In an alternative embodiment, the form may include a table. As shown in FIG. 2, the form may include a table 203, and the table may include a header 2031 and a cell 2032, where a key-value relationship may exist between the header and the cell.
The target form may be any form and may include at least two text entity units, where a text entity unit may be a text line of uninterrupted characters; a break between characters may be caused by a space, a colon symbol ':', a cell boundary, and the like.
Where characters are broken in this way, text that a human reader perceives as a single line is treated by the model as different text lines during detection, that is, as different text entity units, and each text entity unit corresponds to a detection box. As shown in FIG. 4, in the form image 400, text entity unit 401 and text entity unit 402 are in a same-row relationship but are different text entity units.
The image to be identified may be obtained in various ways: the target form may be photographed or scanned in real time, an image to be identified sent by another electronic device may be received, an image may be downloaded from the Internet as the image to be identified, or a pre-stored image to be identified may be used.
Step S102: acquiring a first feature of the target form based on the image to be identified.
In this step, the first feature may be an image-related feature of the image to be identified for the target form, or may be a multi-modal feature, that is, a fusion of a visual feature of the image to be identified and a text feature of the target form; this is not specifically limited here.
In an alternative embodiment, the image to be identified may be input to a target model for feature processing to obtain the first feature, where the first feature may be an image-related feature of the image to be identified for the target form, and the target model may be a network model with a mixed structure of a convolutional neural network (Convolutional Neural Networks, CNN), such as ResNet-50, and a Transformer network.
In another alternative embodiment, the image to be identified may be input to a first model for feature extraction to obtain an image feature map of the image to be identified, where the first model may be a convolutional neural network (Convolutional Neural Networks, CNN) such as ResNet-50. Text recognition may be performed on the image to be identified to obtain the text content and position information of the target form in the image, a text feature map of the target form may be constructed based on the text content and position information, and the image feature map and the text feature map may be fused to obtain the first feature, which may be a multi-modal feature.
In yet another alternative embodiment, the image to be identified may be input to the first model for feature extraction to obtain an image feature map of the image to be identified; a region of interest (Region Of Interest, ROI) mapping operation is used to crop a region feature for each text entity unit from the image feature map; the region feature of each text entity unit is feature-encoded and mapped to a feature vector to generate an image feature sequence of the image to be identified for the text entity units; and the visual features, including the image feature sequence, are fused with the text features to obtain the first feature.
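As an illustration of this third alternative embodiment, the following is a minimal sketch, in Python with PyTorch and torchvision, of cropping region features for text entity units from a CNN feature map with an ROI mapping operation; the ResNet-50 backbone choice, input size, pooled output size and box coordinates are illustrative assumptions rather than details fixed by the disclosure.

    import torch
    import torchvision

    # ResNet-50 without its classification head; outputs a (B, 2048, H/32, W/32)
    # image feature map (sizes are assumptions for illustration).
    backbone = torch.nn.Sequential(
        *list(torchvision.models.resnet50(weights=None).children())[:-2]
    )

    image = torch.randn(1, 3, 512, 512)   # image to be identified (batch of 1)
    feature_map = backbone(image)         # image feature map

    # Hypothetical detection boxes of two text entity units, one row per box:
    # (batch index, x1, y1, x2, y2) in input-image coordinates.
    boxes = torch.tensor([[0.0, 10.0, 20.0, 200.0, 50.0],
                          [0.0, 10.0, 60.0, 150.0, 90.0]])

    # ROI mapping: pool each box into a fixed-size region feature, giving one
    # (2048, 7, 7) region feature per text entity unit.
    region_feats = torchvision.ops.roi_align(
        feature_map, boxes, output_size=(7, 7), spatial_scale=1 / 32
    )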
Step S103: performing entity classification on the M text entity units based on the first feature to obtain the entity category of each text entity unit.
In this step, the entity category may refer to a classification category of an entity constituting a form, wherein the form is different from a general document in that a text entity unit, that is, a text line as a whole, can be regarded as one entity by a model. The purpose of this step is to classify each text entity unit based on the first feature. Entity categories may include title, key, value, header, unit, etc.
In an alternative embodiment, a full connection layer may be used to perform feature mapping on the region features of the text entity unit in the first feature to obtain a feature vector, and a logistic regression model softmax function is used to perform entity classification based on the feature vector to obtain an entity class of the text entity unit.
In yet another alternative embodiment, the classification of each text entity unit may be performed by the fully connected network based on the first characteristic, resulting in an entity class for each text entity unit.
Step S104: based on the first feature, predicting association relations among different text entity units in the M text entity units to obtain relation information of the M text entity units, wherein the association relations are used for representing whether association exists among the different text entity units.
In this step, for every two different text entity units, the association relation between the two text entity units may be predicted, so as to predict whether an association exists between every two text entity units. That is, the association relation covers two cases: an association exists between the two text entity units, or no association exists between them.
The relationship information may include only the association relationship in which the association exists, or may include both the association relationship in which the association exists and the association relationship in which the association does not exist, and is not particularly limited herein.
In implementation, the features of two different text entity units, such as text entity unit i and text entity unit j, may be obtained based on the first feature; for example, the region features of the two units are cropped from the first feature and feature-mapped, the two mapped features are concatenated, and a fully connected layer classifies the result so as to predict whether an association exists between text entity unit i and text entity unit j.
Step S105: outputting form information of the target form based on the entity categories and the relation information.
In this step, the form information of the target form refers to the information output for the form; that is, given an input image to be identified, a form in a preset format, such as XML, may be output by recognizing the form and constructing its information.
The output form information may include key-value relationships, that is, pairings of keys and values. The relationship between the key and value entity categories is a key-value relationship, and the relationship between a header and a cell in a table may also be called a key-value relationship. The target forms of FIG. 2 through FIG. 4 each include key-value relationships.
In some practical scenarios, the form information of the target form may also include structured information, which may indicate that different cells in a table are in a same-row relationship, that the cell corresponding to a header includes multiple lines of text, that a key or value spans multiple lines of text, and so on. As shown in FIG. 2, the cell corresponding to header 2031 includes multiple lines of text; as shown in FIG. 3, value 302 spans multiple lines of text; and as shown in FIG. 4, text entity unit 401 and text entity unit 402 are in a same-row relationship.
Text entity units among the M text entity units that have an association can be selected based on the relation information, and the association category between two associated text entity units is determined based on their entity categories.
Form information of the target form is then output based on the association categories: when the entity categories of two associated text entity units are key and value respectively, the two units are characterized as having a key-value relationship; when the entity categories of both units are unit, the two units are characterized as being in a same-row relationship; and when the entity categories of both units are header, the two units are characterized as being located in the same cell, which includes multiple lines of text. In this way, recognition of both the text content and the structure of the target form can be achieved.
In this embodiment, by performing entity classification on the text entity units of the target form in the image to be identified, predicting the association relation between every two text entity units, and combining the entity categories with those association relations, the form structure is constructed and output in a bottom-up manner, which improves both the robustness and the efficiency of form recognition.
Optionally, the step S102 specifically includes:
performing text recognition on the image to be recognized to obtain position information and text contents of the M text entity units;
extracting features of the image to be identified to obtain image features of the image to be identified;
based on the position information, carrying out feature coding on the text content to obtain text features of the target form;
and fusing the text features and the image features to obtain first features of the target form.
In this embodiment, the first feature may be a multi-modal feature.
The position information of all text entity units of the target form in the image to be identified can be located by an OCR or PDF parsing tool, and the text content of each text entity unit can be recognized.
Feature extraction can be performed on the image to be identified through a CNN (convolutional neural network) such as ResNet-50 to obtain a feature map of the image to be identified, that is, the image features of the image to be identified, represented by L ∈ R^(w×h×d), where w, h and d are the width, height and depth of the feature map respectively.
For each text entity unit, feature encoding can be performed on each character based on the text content of the unit to obtain feature vectors, and based on the sequential relationship of the characters in the unit, the feature vectors of the characters are connected in series using a bidirectional long short-term memory (BiLSTM) network to obtain the text feature of the text entity unit, as shown in formula (1) below.
t_i = BiLSTM({c_ij}), j ∈ [1, k_i]  (1)
where, in formula (1) above, t_i is the text feature of text entity unit i, c_ij is the feature vector of the j-th character in text entity unit i, k_i is the number of characters in text entity unit i, and d is the dimension of the text feature vector.
The text features of the M text entity units are then fused according to the position information of the units to obtain the text features of the target form.
The text features and the image features of the target form are then fused to obtain the first feature of the target form. The text features and the image features may be fused by concatenation, and on that basis semantic enhancement may be performed on the concatenated features through a second model, so as to achieve an effective fused expression of the multi-modal features and further improve the recognition accuracy of the target form. The second model may be a Transformer network structure.
In this embodiment, the multi-modal features of the target form are obtained by fusing the text features and the image features, which better expresses the semantics of the form and improves the accuracy of form recognition.
Optionally, the fusing the text feature and the image feature to obtain a first feature of the target form includes:
splicing the text features and the image features to obtain second features;
And carrying out semantic enhancement on the second features to obtain first features of the target form.
In this embodiment, the text features and the image features may be concatenated, and on that basis the concatenated feature, that is, the second feature, is semantically enhanced through the second model, so as to achieve an effective fused expression of the multi-modal features and further improve the recognition accuracy of the target form.
In an alternative embodiment, the second model may be a Transformer network structure. When the second feature is semantically enhanced using a Transformer network, the Transformer network may be composed of a stack of N, e.g., 12, identical network layers, each layer consisting of a multi-head attention layer and a feed-forward network layer, with residual connection and layer normalization operations around the two sublayers. The feed-forward network layer is a fully connected layer, and the computation of the multi-head attention layer is shown in formulas (1), (2) and (3) below:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_m  (1)
head_i = Attention(Q W_q^i, K W_k^i, V W_v^i)  (2)
Attention(Q, K, V) = σ(Q K^T / √d) V  (3)
where W_m, W_q, W_k and W_v are parameter matrices; h is the number of attention heads (which may take a value of 8), and multi-head attention can extract the features of different subregions; σ is the softmax function of the logistic regression model; Q, K and V are the vector matrices of the second feature, that is, of the input sequence; and d is the vector dimension (which may take a value of 768). Through the computation of the attention mechanism, the attention of Q over V is obtained, that is, the salient semantic features of V based on Q, which may serve as the first feature.
That is, the second feature is further encoded using the Transformer network, and the semantic feature of the target form, that is, the first feature, can be obtained by using the second feature as an input to the Transformer network.
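For concreteness, the following is a minimal sketch of the multi-head attention computation of formulas (1) to (3), assuming the h = 8 heads and d = 768 dimension mentioned above; the per-head projection shapes follow standard Transformer conventions and are not details taken from this disclosure.

    import torch
    import torch.nn.functional as F

    def attention(q, k, v):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, formula (3)
        d = q.size(-1)
        return F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1) @ v

    class MultiHeadAttention(torch.nn.Module):
        def __init__(self, d=768, h=8):
            super().__init__()
            assert d % h == 0
            self.h, self.dk = h, d // h
            self.wq = torch.nn.Linear(d, d)   # parameter matrix W_q
            self.wk = torch.nn.Linear(d, d)   # parameter matrix W_k
            self.wv = torch.nn.Linear(d, d)   # parameter matrix W_v
            self.wm = torch.nn.Linear(d, d)   # output projection W_m

        def forward(self, x):
            b, n, d = x.shape
            # Project, then split into h heads of size d/h: (b, h, n, d/h).
            split = lambda t: t.view(b, n, self.h, self.dk).transpose(1, 2)
            q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
            heads = attention(q, k, v)                    # head_1 ... head_h, formula (2)
            out = heads.transpose(1, 2).reshape(b, n, d)  # Concat(head_1, ..., head_h)
            return self.wm(out)                           # formula (1)

    x = torch.randn(2, 100, 768)   # second feature as the input sequence
    y = MultiHeadAttention()(x)    # semantically enhanced features, shape (2, 100, 768)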
Optionally, the text recognition is performed on the image to be recognized to obtain location information and text content of the M text entity units, including:
performing position prediction on each text entity unit in the image to be identified to obtain a detection frame position of each text entity unit, wherein the position information comprises the detection frame position;
and aiming at each text entity unit, intercepting an image of the detection frame position of the text entity unit in the image to be identified, and carrying out text identification to obtain the text content of the text entity unit.
In this embodiment, text detection may be performed on the image to be identified using a text detection technique. In implementation, an OCR text detection model, such as an EAST model, may be used to regress, based on a deep learning algorithm, the predicted position of each text entity unit in the image to be identified, so as to obtain the set of all detection boxes of the M text entity units, represented by P = {p_i; i ∈ M*}, where p_i is the detection box position of text entity unit i and M* may represent {1, 2, …, M}.
For each detection box p_i, a corresponding rectangular image slice, represented by I_i, is cut out of the image to be identified, and a deep learning model such as CRNN is used to recognize the text sequence of I_i, represented by c_i. In this way, the detection box position p_i and text content c_i of each text entity unit can be obtained, thereby achieving text recognition of the image to be identified.
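A hedged sketch of this detect-then-recognize step follows; detect_text_boxes and crnn_recognize are hypothetical stand-ins for the EAST detector and CRNN recognizer named above, and only the cropping of rectangular image slices I_i follows directly from the description.

    from PIL import Image

    def recognize_entities(image, detect_text_boxes, crnn_recognize):
        # image: a PIL.Image.Image; detect_text_boxes returns the detection box
        # set P = {p_i; i in M*} as (x1, y1, x2, y2) tuples; both callables are
        # hypothetical.
        entities = []
        for p_i in detect_text_boxes(image):
            slice_i = image.crop(p_i)       # rectangular image slice I_i
            c_i = crnn_recognize(slice_i)   # recognized text content c_i
            entities.append({"box": p_i, "text": c_i})
        return entities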
Optionally, the feature encoding is performed on the text content based on the location information to obtain the text feature of the target form, including:
performing feature coding on text content of each text entity unit to obtain a third feature of the text entity unit;
constructing a target tensor, wherein the size of the target tensor is the same as the size of the image characteristic;
and embedding the third characteristic of each text entity unit into the target tensor according to the position information to obtain the text characteristic of the target form.
In this embodiment, for each text entity unit, feature encoding may be performed on each character of the unit based on its text content to obtain feature vectors, and based on the sequential relationship of the characters in the unit, the feature vectors of the characters are connected in series using the bidirectional long short-term memory (BiLSTM) model to obtain the text feature of the text entity unit, that is, the third feature.
For the text features, an all-zero tensor T with the same size as the image feature map L, that is, the target tensor, is constructed, and the text feature t_i of each text entity unit is embedded into the feature map T in turn according to the corresponding detection box position p_i, to obtain the text features of the target form. In this way, the sizes of the generated text features and image features of the target form match, so that they can be combined into the multi-modal feature I. The expression of the feature combination is shown in formula (2) below:
I = concat(L, T)  (2)
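The following is a minimal sketch of this construction, assuming per-character feature vectors are already available; the mean-pooling of the BiLSTM outputs into a single vector t_i and all tensor sizes are illustrative assumptions.

    import torch

    char_dim = 64   # dimension d of the text feature vectors (assumed)
    bilstm = torch.nn.LSTM(char_dim, char_dim // 2,
                           bidirectional=True, batch_first=True)

    def build_first_feature(image_feat, units):
        # image_feat: image feature map L of shape (d_img, h, w); units: list of
        # {"box": (x1, y1, x2, y2) in feature-map coordinates,
        #  "chars": (k_i, char_dim) per-character features c_ij}.
        _, h, w = image_feat.shape
        T = torch.zeros(char_dim, h, w)                  # all-zero target tensor
        for unit in units:
            out, _ = bilstm(unit["chars"].unsqueeze(0))  # BiLSTM({c_ij}), formula (1)
            t_i = out.mean(dim=1).squeeze(0)             # pooled text feature t_i (assumed)
            x1, y1, x2, y2 = unit["box"]
            T[:, y1:y2, x1:x2] = t_i.view(-1, 1, 1)      # embed t_i at box position p_i
        return torch.cat([image_feat, T], dim=0)         # I = concat(L, T), formula (2)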
Optionally, the step S103 specifically includes:
for each text entity unit, intercepting the regional characteristics of the text entity unit in the first characteristics to perform characteristic mapping to obtain first mapping characteristics of the text entity unit;
and carrying out entity classification on the text entity units based on the first mapping characteristics to obtain entity categories of the text entity units.
In this embodiment, the entity categories of text entity units, such as title, key, value, header and unit, may be predefined. For each text entity unit in the target form, based on the detection box position p_i of the unit, the region feature corresponding to that position in the first feature can be cropped using an ROI Pooling operation, and the region feature is mapped by a fully connected layer fc into a feature vector with the size of the predefined category set, represented by f_i, to obtain the first mapping feature.
Based on the first mapping feature, the mapping into a probability distribution using the softmax function is shown in formula (3) below:
scores = softmax(fc(f_i))  (3)
where scores is the mapped probability distribution and fc is the fully connected layer.
The predefined entity category with the highest probability value is taken as the entity category of the text entity unit, thereby achieving entity recognition and classification of each text entity unit based on the first feature, as represented by formula (4) below:
cls = argmax(scores)  (4)
where cls is the predicted entity category of the text entity unit.
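A minimal sketch of formulas (3) and (4) applied to a cropped region feature follows; the 2048 × 7 × 7 region size and the five predefined categories are assumptions for illustration.

    import torch
    import torchvision

    CLASSES = ["title", "key", "value", "header", "unit"]   # predefined categories
    fc = torch.nn.Linear(2048 * 7 * 7, len(CLASSES))        # fully connected layer fc

    def classify_entity(first_feature, box):
        # first_feature: (1, 2048, h, w); box: (1, 4) tensor holding the
        # detection box position p_i in feature-map coordinates.
        roi = torchvision.ops.roi_pool(first_feature, [box], output_size=(7, 7))
        f_i = fc(roi.flatten(1))                # first mapping feature f_i
        scores = torch.softmax(f_i, dim=-1)     # scores = softmax(fc(f_i)), formula (3)
        cls = scores.argmax(dim=-1)             # cls = argmax(scores), formula (4)
        return CLASSES[cls.item()]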
Optionally, the step S104 specifically includes:
aiming at every two text entity units in the M text entity units, intercepting the regional characteristics of each text entity unit in the two text entity units in the first characteristics to perform characteristic mapping to obtain second mapping characteristics and third mapping characteristics of the two text entity units;
splicing the second mapping feature and the third mapping feature to obtain a target mapping feature;
predicting the association relationship between the two text entity units based on the target mapping characteristics to obtain the association relationship between the two text entity units;
The relation information of the M text entity units comprises the association relation between the two text entity units.
In this embodiment, for each two text entity units in the M text entity units, the association relationship between the two text entity units may be predicted in the same manner.
For any two text entity units, the region features at the corresponding detection box positions in the first feature can be cropped using the ROI Pooling operation and feature-mapped, to obtain the second mapping feature and the third mapping feature. The second mapping feature and the third mapping feature are concatenated to obtain the target mapping feature, and based on the target mapping feature a fully connected layer fc performs binary classification to predict whether the two text entity units are associated, so as to obtain the association relation between the two units. In this way, the prediction of association relations between different text entity units among the M text entity units can be achieved based on the first feature, improving the robustness of form recognition.
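A minimal sketch of this pairwise prediction under assumed feature sizes follows; map_fc plays the role of the per-unit feature mapping and pair_fc the binary association classifier described above.

    import torch

    roi_dim, map_dim = 2048 * 7 * 7, 512          # illustrative sizes
    map_fc = torch.nn.Linear(roi_dim, map_dim)    # per-unit feature mapping
    pair_fc = torch.nn.Linear(2 * map_dim, 2)     # binary classifier: associated or not

    def predict_association(region_feat_i, region_feat_j):
        f2 = map_fc(region_feat_i.flatten())      # second mapping feature
        f3 = map_fc(region_feat_j.flatten())      # third mapping feature
        target = torch.cat([f2, f3])              # spliced target mapping feature
        return pair_fc(target).argmax().item() == 1   # True if an association exists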
Optionally, the step S105 specifically includes:
for each text entity unit, acquiring an association unit set of the text entity unit based on the relation information, and determining an association category between the text entity unit and each text entity unit in the association unit set based on the entity category of the text entity unit; the association unit set comprises other text entity units which are associated with the text entity units in the M text entity units;
And outputting form information of the target form based on the association category.
In this embodiment, form information of the target form may be identified and constructed in combination with entity category and relationship information.
In the implementation, M text entity units in the image to be identified can be traversed, and for each text entity unit, a text entity unit set associated with the text entity unit can be obtained from the relation information, so as to obtain an associated unit set of the text entity unit.
For a text entity unit p_i, the set of association units associated with p_i is searched from the relation information and may be represented by {(p_i, p_j)}, where p_j is a text entity unit associated with text entity unit p_i; the association unit set may include at least one text entity unit different from p_i.
The association category between a text entity unit and each text entity unit in the set of association units may be determined based on the entity category of the text entity unit, i.e. the association category of two text entity units for which an association exists. The association category may include the following:
A first association category may refer to the association category between two text entity units whose entity categories are key (denoted by k) and value (denoted by b) respectively, represented by R1 = {(k_i, b_j); i, j ∈ M*}, where R1 represents the first association category, k_i is a text entity unit whose entity category is key, and b_j is a text entity unit whose entity category is value.
A second association category may refer to the association category between two text entity units whose entity categories are both key, represented by R2 = {(k_i, k_j); i, j ∈ M*}, where R2 represents the second association category, and k_i and k_j are text entity units whose entity category is key.
A third association category may refer to the association category between two text entity units whose entity categories are header (denoted by h) and unit (denoted by c) respectively, represented by R3 = {(h_i, c_j); i, j ∈ M*}, where R3 represents the third association category, h_i is a text entity unit whose entity category is header, and c_j is a text entity unit whose entity category is unit.
A fourth association category may refer to the association category between two text entity units whose entity categories are both header, represented by R4 = {(h_i, h_j); i, j ∈ M*}, where R4 represents the fourth association category, and h_i and h_j are text entity units whose entity category is header.
A fifth association category may refer to the association category between two text entity units whose entity categories are both unit, represented by R5 = {(c_i, c_j); i, j ∈ M*}, where R5 represents the fifth association category, and c_i and c_j are text entity units whose entity category is unit.
Based on the entity category and position information of each text entity unit, combined with the association categories of the text entity units, a form in a preset format, such as XML, can be constructed and the form information output.
In the construction process, the association category of each text entity unit is taken into account: if the entity categories of two associated text entity units are key and value respectively, the two units are characterized as having a key-value relationship; if the entity categories of both units are unit, the two units are characterized as being in a same-row relationship; and if the entity categories of both units are header, the two units are characterized as being located in the same cell, which includes multiple lines of text.
Thus, the text content and the structure of the target form can be identified in a bottom-up mode, and the form information is output.
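A hedged sketch of this bottom-up assembly in plain Python follows; the dictionary layout of entities and relations is an assumption, and only the mapping from entity-category pairs to relationships is taken from the description above.

    def build_form(entities, relations):
        # entities: {i: {"cls": "key" | "value" | "header" | "unit", ...}};
        # relations: iterable of index pairs (i, j) predicted as associated.
        kv_pairs, same_row, same_cell = [], [], []
        for i, j in relations:
            cats = {entities[i]["cls"], entities[j]["cls"]}
            if cats == {"key", "value"}:
                kv_pairs.append((i, j))    # key-value relationship
            elif cats == {"unit"}:
                same_row.append((i, j))    # units in the same text row
            elif cats == {"header"}:
                same_cell.append((i, j))   # multi-line text in one cell
        return kv_pairs, same_row, same_cell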
Optionally, the entity categories of the M text entity units include a first entity category and a second entity category, where a key-value relationship exists between the first entity category and the second entity category, and the outputting, based on the association category, form information of the target form includes at least one of the following:
Outputting a first key value relation of the target form under the condition that the association category is a first association category, wherein the first association category represents that entity categories of two text entity units with association are respectively the first entity category and the second entity category, and the form information comprises the first key value relation;
and under the condition that the association category is a second association category, outputting first structural information of the target form, wherein the second association category represents that entity categories of two text entity units with association are the same entity category, the first structural information represents that the two text entity units with association of the second association category correspond to the same unit, and the form information comprises the first structural information.
In this embodiment, the first entity class may be a key, the second entity class may be a value, and the first association class refers to an association class between two text entity units whose entity classes are the key and the value, respectively.
In the case that the association categories of the M text entity units include the first association category, the output form information may include the first key-value relationship and further include the keys and values. The first key-value relationship may characterize the relationship between a key and a value; for example, the relationship may be indicated by a colon symbol ':' or represented by a positional relationship (keys on the left, values on the right).
When multiple values associated with the same key are found based on the relation information, the values associated with the key may be sorted from top to bottom and combined when the form information is output, indicating that the value consists of multiple lines of text, as shown in FIG. 2.
The second association category may refer to the association category between two text entity units whose entity categories are both key; in the case that the association categories of the M text entity units include the second association category, the output form information may include the first structure information. The first structure information characterizes that a key corresponds to multiple lines of text. When the form information is output, the text content of all text entity units having the second association category with a given text entity unit can be retained.
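A minimal sketch of merging several values associated with one key from top to bottom follows; the presence of "box" and "text" entries on each entity is an assumption carried over from the earlier sketches.

    def merge_values(key_idx, kv_pairs, entities):
        # Collect every value associated with the key, sort the lines by the
        # top edge (y1) of their detection boxes, and combine them.
        value_ids = [j for i, j in kv_pairs if i == key_idx]
        value_ids.sort(key=lambda j: entities[j]["box"][1])
        return "\n".join(entities[j]["text"] for j in value_ids)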
In this embodiment, the entity categories of the M text entity units include a first entity category and a second entity category, and the identification and construction of the form in the image to be identified are implemented by combining the first association category and the second association category, so as to implement the output of the form information.
Optionally, when the target form includes a table, the entity categories of the M text entity units further include a third entity category and a fourth entity category, where the third entity category and the fourth entity category have a key value relationship, and the fourth entity category is a value in the key value relationship, and the outputting, based on the association category, form information of the target form includes at least one of the following:
Outputting a second key value relation of the table under the condition that the association category is a third association category, wherein the third association category represents that entity categories of two text entity units with association are the third entity category and the fourth entity category respectively, and the form information comprises the second key value relation;
outputting second structural information of the table when the association category is a fourth association category, wherein the fourth association category represents that entity categories of two text entity units with association are both the third entity category, the second structural information represents that the two text entity units with association of the fourth association category correspond to the same cell, and the form information comprises the second structural information;
and outputting third structural information of the table under the condition that the association category is a fifth association category, wherein the fifth association category represents that entity categories of two text entity units with association are the fourth entity category, the third structural information represents whether the two text entity units with association of the fifth association category are in the same text row, and the form information comprises the third structural information.
In this embodiment, the third entity category may be a header and the fourth entity category may be a unit; the header and the unit may also be regarded as a special key-value relationship, and the third association category refers to the association category between two text entity units whose entity categories are header and unit respectively.
In the case that the association categories of the M text entity units include the third association category, the output form information may include the second key-value relationship and further include the headers and units. The second key-value relationship may represent the relationship between a header and a cell, which may be represented in a table by a positional relationship (e.g., header above and cell below, or header on the left and cell on the right).
The fourth association category may refer to the association category between two text entity units whose entity categories are both header; in the case that the association categories of the M text entity units include the fourth association category, the output form information may include the second structure information. The second structure information characterizes that the cell corresponding to a header includes multiple lines of text. When the form information is output, the text content of all text entity units having the fourth association category with a given text entity unit may be retained, as shown in FIG. 2.
The fifth association category may refer to an association category between two text entity units, where entity categories are units, and in a case where the association category of M text entity units includes the fifth association category, the output form information may include third structure information. The third structure information characterizes that the different units are in a peer relationship, i.e. in the same text row. In the outputting of the form information, all text entity units having the fifth association category with a text entity unit may be arranged in the same text row, as shown in fig. 4.
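A hedged sketch of arranging units into text rows follows: units connected by fifth-category associations are grouped with a small union-find, each group forming one row; this grouping strategy is an illustrative assumption rather than the patented procedure.

    def table_rows(unit_indices, same_row_pairs):
        # Union-find over units; each connected component is one text row.
        parent = {i: i for i in unit_indices}

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i

        for i, j in same_row_pairs:
            parent[find(i)] = find(j)

        rows = {}
        for i in unit_indices:
            rows.setdefault(find(i), []).append(i)
        return list(rows.values())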
In this embodiment, the entity categories of the M text entity units include the third entity category and the fourth entity category, and the recognition and construction of the table in the image to be identified are achieved by combining the third association category, the fourth association category and the fifth association category, so as to output the form information.
Second embodiment
As shown in fig. 5, the present disclosure provides a form recognition apparatus 500, including:
a first obtaining module 501, configured to obtain an image to be identified, where the image to be identified includes image content of a target form, and the target form includes M text entity units, where M is an integer greater than 1;
A second obtaining module 502, configured to obtain a first feature of the target form based on the image to be identified;
an entity classification module 503, configured to perform entity classification on the M text entity units based on the first feature, to obtain an entity class of each text entity unit;
a relationship prediction module 504, configured to predict, based on the first feature, an association relationship between different text entity units in the M text entity units, to obtain relationship information of the M text entity units, where the association relationship is used to characterize whether there is an association between the different text entity units;
and an output module 505, configured to output form information of the target form based on the entity class and the relationship information.
Optionally, the second obtaining module 502 includes:
the text recognition unit is used for carrying out text recognition on the image to be recognized to obtain the position information and text content of the M text entity units;
the feature extraction unit is used for extracting features of the image to be identified to obtain image features of the image to be identified;
the feature coding unit is used for carrying out feature coding on the text content based on the position information to obtain the text feature of the target form;
And the fusion unit is used for fusing the text features and the image features to obtain the first features of the target form.
Optionally, the fusion unit is specifically configured to:
splicing the text features and the image features to obtain second features;
and carrying out semantic enhancement on the second features to obtain first features of the target form.
Optionally, the text recognition unit is specifically configured to:
performing position prediction on each text entity unit in the image to be identified to obtain a detection frame position of each text entity unit, wherein the position information comprises the detection frame position;
and aiming at each text entity unit, intercepting an image of the detection frame position of the text entity unit in the image to be identified, and carrying out text identification to obtain the text content of the text entity unit.
Optionally, the feature encoding unit is specifically configured to:
performing feature coding on text content of each text entity unit to obtain a third feature of the text entity unit;
constructing a target tensor, wherein the size of the target tensor is the same as the size of the image characteristic;
And embedding the third characteristic of each text entity unit into the target tensor according to the position information to obtain the text characteristic of the target form.
Optionally, the entity classification module 503 is specifically configured to:
for each text entity unit, intercepting the regional characteristics of the text entity unit in the first characteristics to perform characteristic mapping to obtain first mapping characteristics of the text entity unit;
and carrying out entity classification on the text entity units based on the first mapping characteristics to obtain entity categories of the text entity units.
Optionally, the relationship prediction module 504 is specifically configured to:
aiming at every two text entity units in the M text entity units, intercepting the regional characteristics of each text entity unit in the two text entity units in the first characteristics to perform characteristic mapping to obtain second mapping characteristics and third mapping characteristics of the two text entity units;
splicing the second mapping feature and the third mapping feature to obtain a target mapping feature;
predicting the association relationship between the two text entity units based on the target mapping characteristics to obtain the association relationship between the two text entity units;
The relation information of the M text entity units comprises the association relation between the two text entity units.
Optionally, the output module 505 includes:
the acquiring unit is used for acquiring an association unit set of the text entity units based on the relation information aiming at each text entity unit; the association unit set comprises other text entity units which are associated with the text entity units in the M text entity units;
the associated category prediction unit is used for determining an associated category between the text entity unit and each text entity unit in the associated unit set based on the entity category of the text entity unit for each text entity unit;
and the output unit is used for outputting the form information of the target form based on the association category.
Optionally, the entity categories of the M text entity units include a first entity category and a second entity category, where a key-value relationship exists between the first entity category and the second entity category, and the output unit is specifically configured to at least one of the following:
outputting a first key value relation of the target form under the condition that the association category is a first association category, wherein the first association category represents that entity categories of two text entity units with association are respectively the first entity category and the second entity category, and the form information comprises the first key value relation;
And under the condition that the association category is a second association category, outputting first structural information of the target form, wherein the second association category represents that entity categories of two text entity units with association are the same entity category, the first structural information represents that the two text entity units with association of the second association category correspond to the same unit, and the form information comprises the first structural information.
Optionally, when the target form includes a table, the entity categories of the M text entity units further include a third entity category and a fourth entity category, where the third entity category and the fourth entity category have a key value relationship, and the fourth entity category is a value in the key value relationship, and the output unit is specifically configured to at least one of:
outputting a second key value relation of the table under the condition that the association category is a third association category, wherein the third association category represents that entity categories of two text entity units with association are the third entity category and the fourth entity category respectively, and the form information comprises the second key value relation;
outputting second structural information of the table when the association category is a fourth association category, wherein the fourth association category represents that entity categories of two text entity units with association are both the third entity category, the second structural information represents that the two text entity units with association of the fourth association category correspond to the same cell, and the form information comprises the second structural information;
And outputting third structural information of the table under the condition that the association category is a fifth association category, wherein the fifth association category represents that entity categories of two text entity units with association are the fourth entity category, the third structural information represents whether the two text entity units with association of the fifth association category are in the same text row, and the form information comprises the third structural information.
The form recognition apparatus 500 provided in the present disclosure can implement each process of the form recognition method embodiments and achieve the same beneficial effects; to avoid repetition, details are not described here again.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with the relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 6 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as a form recognition method. For example, in some embodiments, the form identification method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the form recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the form recognition method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described herein may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that, when executed by the processor or controller, it causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. A form identification method, comprising:
acquiring an image to be identified, wherein the image to be identified comprises image content of a target form, the target form comprises M text entity units, and M is an integer greater than 1; a text entity unit is a text line whose characters are uninterrupted, characters being regarded as interrupted by spaces, colons, or cell boundaries;
acquiring a first feature of the target form based on the image to be identified;
performing entity classification on the M text entity units based on the first feature to obtain an entity category of each text entity unit, wherein the entity categories include title, key, value, header, and unit;
predicting, based on the first feature, association relationships between different text entity units in the M text entity units to obtain relationship information of the M text entity units, wherein an association relationship represents whether an association exists between different text entity units;
outputting form information of the target form based on the entity categories and the relationship information, wherein the form information includes key-value relationships and/or structured information, and the structured information includes: whether different units in a table are in a same-row relationship, whether a key or a value spans multiple lines of text, and whether the cell corresponding to a table header contains multiple lines of text;
wherein the acquiring the first feature of the target form based on the image to be identified comprises:
performing text recognition on the image to be identified to obtain position information and text content of the M text entity units;
performing feature extraction on the image to be identified based on a convolutional neural network to obtain an image feature of the image to be identified;
performing feature encoding on the text content based on the position information to obtain a text feature of the target form;
and fusing the text feature and the image feature to obtain the first feature of the target form.
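As a reading aid only, the data flow recited in claim 1 can be pictured as the minimal Python orchestrator below; every injected callable (ocr, cnn, encode_text, fuse, classify, relate, assemble) is an assumed stand-in, not the patented implementation.

```python
# Schematic driver for the claimed pipeline; every injected callable is an
# assumed stand-in, and only the order of operations mirrors claim 1.
def recognize_form(image, ocr, cnn, encode_text, fuse, classify, relate,
                   assemble):
    boxes, texts = ocr(image)                  # position info + text content
    image_feat = cnn(image)                    # CNN image feature
    text_feat = encode_text(texts, boxes)      # position-aware text feature
    first_feat = fuse(text_feat, image_feat)   # "first feature" of the form
    categories = classify(first_feat, boxes)   # entity category per unit
    relations = relate(first_feat, boxes)      # pairwise association info
    return assemble(categories, relations)     # key-value pairs + structure
```

Any OCR engine, CNN backbone, and fusion module with these rough signatures could be slotted in; the claim constrains only the order of operations and the intermediate products.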
2. The method of claim 1, wherein the fusing the text feature and the image feature to obtain the first feature of the target form comprises:
splicing the text feature and the image feature to obtain a second feature;
and performing semantic enhancement on the second feature to obtain the first feature of the target form.
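One plausible reading of claim 2, sketched below: the two feature maps are concatenated along the channel axis, and "semantic enhancement" is stood in for by a single self-attention (Transformer encoder) layer over spatial positions. The shapes and the choice of enhancer are assumptions.

```python
import torch
import torch.nn as nn

d = 64
text_feat = torch.randn(1, d, 32, 32)   # toy text feature map (assumed shape)
img_feat = torch.randn(1, d, 32, 32)    # toy image feature map
# Splice along channels to form the "second feature".
second = torch.cat([text_feat, img_feat], dim=1)
# Stand-in for "semantic enhancement": one self-attention layer over the
# spatial positions of the spliced map.
tokens = second.flatten(2).transpose(1, 2)               # (1, 1024, 2d)
enhance = nn.TransformerEncoderLayer(d_model=2 * d, nhead=4, batch_first=True)
first = enhance(tokens)                                  # "first feature"
```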
3. The method of claim 1, wherein the performing text recognition on the image to be identified to obtain the position information and the text content of the M text entity units comprises:
performing position prediction on each text entity unit in the image to be identified to obtain a detection box position of each text entity unit, wherein the position information comprises the detection box positions;
and, for each text entity unit, cropping the image at the detection box position of the text entity unit from the image to be identified and performing text recognition on the cropped image to obtain the text content of the text entity unit.
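A compact sketch of the two-stage reading in claim 3, where detect_units and recognize_text are hypothetical stand-ins for any detector and recognizer:

```python
# Two-stage reading: predict a detection box per text entity unit, crop that
# region from the image, and run recognition on the crop. Boxes are
# (x0, y0, x1, y1) pixel coordinates on a NumPy-style image array.
def read_units(image, detect_units, recognize_text):
    units = []
    for x0, y0, x1, y1 in detect_units(image):
        crop = image[y0:y1, x0:x1]          # cut out the unit's region
        units.append({"box": (x0, y0, x1, y1), "text": recognize_text(crop)})
    return units
```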
4. The method of claim 1, wherein the performing feature encoding on the text content based on the position information to obtain the text feature of the target form comprises:
performing feature encoding on the text content of each text entity unit to obtain a third feature of the text entity unit;
constructing a target tensor, wherein the size of the target tensor is the same as the size of the image feature;
and embedding the third feature of each text entity unit into the target tensor according to the position information to obtain the text feature of the target form.
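Claim 4 can be pictured as a grid-style embedding: each unit's encoded text (its "third feature") is written into a zero tensor shaped like the image feature, at the cells its box covers. The feature-map coordinates and per-unit vectors below are assumptions.

```python
import torch

def build_text_feature(third_features, boxes, image_feat):
    # third_features: list of (d,) vectors, one per text entity unit
    # boxes: matching (x0, y0, x1, y1) boxes in feature-map coordinates
    # image_feat: (1, d, H, W) image feature the target tensor must match
    target = torch.zeros_like(image_feat)                 # "target tensor"
    for vec, (x0, y0, x1, y1) in zip(third_features, boxes):
        target[0, :, y0:y1, x0:x1] = vec[:, None, None]   # embed by position
    return target                          # text feature of the target form
```

Because the target tensor is spatially aligned with the image feature, the two can later be spliced channel-wise exactly as claim 2 recites.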
5. The method of claim 1, wherein the performing entity classification on the M text entity units based on the first feature to obtain the entity category of each text entity unit comprises:
for each text entity unit, cropping the region feature of the text entity unit from the first feature and performing feature mapping on the region feature to obtain a first mapping feature of the text entity unit;
and performing entity classification on the text entity unit based on the first mapping feature to obtain the entity category of the text entity unit.
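A minimal sketch of claim 5, assuming an average-pooled region crop and a small linear head; the pooling choice and layer sizes are illustrative, not the patented design.

```python
import torch
import torch.nn as nn

CATEGORIES = ["title", "key", "value", "header", "unit"]

class EntityClassifier(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.mapper = nn.Linear(d, d)              # -> first mapping feature
        self.head = nn.Linear(d, len(CATEGORIES))  # category logits

    def forward(self, first_feat, boxes):
        logits = []
        for x0, y0, x1, y1 in boxes:
            region = first_feat[0, :, y0:y1, x0:x1]   # crop region feature
            pooled = region.mean(dim=(1, 2))          # average-pool to (d,)
            logits.append(self.head(torch.relu(self.mapper(pooled))))
        return torch.stack(logits)                    # (M, 5) logits
```

Pooling each cropped region to a single vector keeps the classification head independent of box size.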
6. The method of claim 1, wherein the predicting, based on the first feature, the association relationships between different text entity units in the M text entity units to obtain the relationship information of the M text entity units comprises:
for every two text entity units in the M text entity units, cropping the region feature of each of the two text entity units from the first feature and performing feature mapping to obtain a second mapping feature and a third mapping feature of the two text entity units;
splicing the second mapping feature and the third mapping feature to obtain a target mapping feature;
and predicting, based on the target mapping feature, whether an association exists between the two text entity units to obtain the association relationship between the two text entity units;
wherein the relationship information of the M text entity units comprises the association relationships between every two text entity units.
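A sketch of the pairwise relation head in claim 6, under assumed pooling and a two-layer scorer; the shared mapping layer and the 0.5 decision threshold are illustrative choices.

```python
from itertools import combinations

import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.mapper = nn.Linear(d, d)             # shared mapping layer
        self.scorer = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                    nn.Linear(d, 1))

    def pool(self, first_feat, box):
        x0, y0, x1, y1 = box
        return first_feat[0, :, y0:y1, x0:x1].mean(dim=(1, 2))

    def forward(self, first_feat, boxes):
        links = {}
        for i, j in combinations(range(len(boxes)), 2):
            fi = self.mapper(self.pool(first_feat, boxes[i]))  # 2nd mapping
            fj = self.mapper(self.pool(first_feat, boxes[j]))  # 3rd mapping
            pair = torch.cat([fi, fj])            # target mapping feature
            links[(i, j)] = (torch.sigmoid(self.scorer(pair)) > 0.5).item()
        return links  # relationship info: associated or not, per pair
```

Scoring every pair is O(M^2) in the number of units, which is tractable for the tens of text lines a typical form contains.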
7. The method of claim 1, wherein the outputting the form information of the target form based on the entity categories and the relationship information comprises:
for each text entity unit, acquiring an association unit set of the text entity unit based on the relationship information, and determining an association category between the text entity unit and each text entity unit in the association unit set based on their entity categories, wherein the association unit set comprises the other text entity units, among the M text entity units, that are associated with the text entity unit;
and outputting the form information of the target form based on the association categories.
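The bookkeeping in claim 7 reduces to two small helpers, sketched below; the category names echo claims 8-9 and are illustrative only.

```python
# Build each unit's association set from the pairwise relationship
# information, then label associated pairs with an association category.
def association_sets(num_units, links):
    sets = {i: set() for i in range(num_units)}
    for (i, j), related in links.items():
        if related:
            sets[i].add(j)
            sets[j].add(i)
    return sets

def association_category(cat_a, cat_b):
    if {cat_a, cat_b} == {"key", "value"}:
        return "key-value"          # categories with a key-value relationship
    if cat_a == cat_b:
        return f"same-{cat_a}"      # same-category association
    return "other"
```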
8. The method of claim 7, wherein the entity categories of the M text entity units include a first entity category and a second entity category having a key-value relationship with each other, and the outputting the form information of the target form based on the association categories comprises at least one of:
outputting a first key-value relationship of the target form in a case where the association category is a first association category, wherein the first association category indicates that the entity categories of two associated text entity units are the first entity category and the second entity category, respectively, and the form information comprises the first key-value relationship;
and outputting first structured information of the target form in a case where the association category is a second association category, wherein the second association category indicates that the entity categories of two associated text entity units are the same entity category, the first structured information indicates that the two text entity units associated under the second association category correspond to the same unit, and the form information comprises the first structured information.
9. The method according to claim 7 or 8, wherein, in a case where the target form includes a table, the entity categories of the M text entity units further include a third entity category and a fourth entity category having a key-value relationship with each other, the fourth entity category being the value in that key-value relationship, and the outputting the form information of the target form based on the association categories comprises at least one of:
outputting a second key-value relationship of the table in a case where the association category is a third association category, wherein the third association category indicates that the entity categories of two associated text entity units are the third entity category and the fourth entity category, respectively, and the form information comprises the second key-value relationship;
outputting second structured information of the table in a case where the association category is a fourth association category, wherein the fourth association category indicates that the entity categories of two associated text entity units are both the third entity category, the second structured information indicates that the two text entity units associated under the fourth association category correspond to the same cell, and the form information comprises the second structured information;
and outputting third structured information of the table in a case where the association category is a fifth association category, wherein the fifth association category indicates that the entity categories of two associated text entity units are both the fourth entity category, the third structured information indicates whether the two text entity units associated under the fifth association category are in the same text row, and the form information comprises the third structured information.
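Claims 8-9 amount to a dispatch from association categories to output records, as in the sketch below; the output schema and the "header"/"value" names for the third/fourth entity categories are assumptions.

```python
# Turn association categories into form information: key/value links become
# key-value relationships; same-category links become structured information
# (same cell, same row, or a multi-line key/value).
def assemble_form(categories, links, texts):
    form = {"key_values": [], "structure": []}
    for (i, j), related in links.items():
        if not related:
            continue
        a, b = categories[i], categories[j]
        if {a, b} == {"key", "value"}:
            k, v = (i, j) if a == "key" else (j, i)
            form["key_values"].append((texts[k], texts[v]))
        elif a == b == "header":
            form["structure"].append(("same-header-cell", i, j))
        elif a == b == "value":
            form["structure"].append(("same-row-or-multiline", i, j))
        elif a == b:
            form["structure"].append((f"same-{a}-unit", i, j))
    return form
```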
10. A form identification device, comprising:
a first acquisition module configured to acquire an image to be identified, wherein the image to be identified comprises image content of a target form, the target form comprises M text entity units, and M is an integer greater than 1; a text entity unit is a text line whose characters are uninterrupted, characters being regarded as interrupted by spaces, colons, or cell boundaries;
a second acquisition module configured to acquire a first feature of the target form based on the image to be identified;
an entity classification module configured to perform entity classification on the M text entity units based on the first feature to obtain an entity category of each text entity unit, wherein the entity categories include title, key, value, header, and unit;
a relationship prediction module configured to predict, based on the first feature, association relationships between different text entity units in the M text entity units to obtain relationship information of the M text entity units, wherein an association relationship represents whether an association exists between different text entity units;
and an output module configured to output form information of the target form based on the entity categories and the relationship information, wherein the form information includes key-value relationships and/or structured information, and the structured information includes: whether different units in a table are in a same-row relationship, whether a key or a value spans multiple lines of text, and whether the cell corresponding to a table header contains multiple lines of text;
wherein the second acquisition module comprises:
a text recognition unit configured to perform text recognition on the image to be identified to obtain position information and text content of the M text entity units;
a feature extraction unit configured to perform feature extraction on the image to be identified based on a convolutional neural network to obtain an image feature of the image to be identified;
a feature encoding unit configured to perform feature encoding on the text content based on the position information to obtain a text feature of the target form;
and a fusion unit configured to fuse the text feature and the image feature to obtain the first feature of the target form.
11. The device according to claim 10, wherein the fusion unit is specifically configured to:
splice the text feature and the image feature to obtain a second feature;
and perform semantic enhancement on the second feature to obtain the first feature of the target form.
12. The device according to claim 10, wherein the text recognition unit is specifically configured to:
perform position prediction on each text entity unit in the image to be identified to obtain a detection box position of each text entity unit, wherein the position information comprises the detection box positions;
and, for each text entity unit, crop the image at the detection box position of the text entity unit from the image to be identified and perform text recognition on the cropped image to obtain the text content of the text entity unit.
13. The device of claim 10, wherein the feature encoding unit is specifically configured to:
perform feature encoding on the text content of each text entity unit to obtain a third feature of the text entity unit;
construct a target tensor, wherein the size of the target tensor is the same as the size of the image feature;
and embed the third feature of each text entity unit into the target tensor according to the position information to obtain the text feature of the target form.
14. The device of claim 10, wherein the entity classification module is specifically configured to:
for each text entity unit, crop the region feature of the text entity unit from the first feature and perform feature mapping on the region feature to obtain a first mapping feature of the text entity unit;
and perform entity classification on the text entity unit based on the first mapping feature to obtain the entity category of the text entity unit.
15. The device of claim 10, wherein the relationship prediction module is specifically configured to:
for every two text entity units in the M text entity units, crop the region feature of each of the two text entity units from the first feature and perform feature mapping to obtain a second mapping feature and a third mapping feature of the two text entity units;
splice the second mapping feature and the third mapping feature to obtain a target mapping feature;
and predict, based on the target mapping feature, whether an association exists between the two text entity units to obtain the association relationship between the two text entity units;
wherein the relationship information of the M text entity units comprises the association relationships between every two text entity units.
16. The device of claim 10, wherein the output module comprises:
an acquisition unit configured to acquire, for each text entity unit, an association unit set of the text entity unit based on the relationship information, wherein the association unit set comprises the other text entity units, among the M text entity units, that are associated with the text entity unit;
an association category prediction unit configured to determine, for each text entity unit, an association category between the text entity unit and each text entity unit in the association unit set based on their entity categories;
and an output unit configured to output the form information of the target form based on the association categories.
17. The device of claim 16, wherein the entity categories of the M text entity units include a first entity category and a second entity category having a key-value relationship with each other, and the output unit is specifically configured to perform at least one of:
outputting a first key-value relationship of the target form in a case where the association category is a first association category, wherein the first association category indicates that the entity categories of two associated text entity units are the first entity category and the second entity category, respectively, and the form information comprises the first key-value relationship;
and outputting first structured information of the target form in a case where the association category is a second association category, wherein the second association category indicates that the entity categories of two associated text entity units are the same entity category, the first structured information indicates that the two text entity units associated under the second association category correspond to the same unit, and the form information comprises the first structured information.
18. The device according to claim 16 or 17, wherein, in a case where the target form includes a table, the entity categories of the M text entity units further include a third entity category and a fourth entity category having a key-value relationship with each other, the fourth entity category being the value in that key-value relationship, and the output unit is specifically configured to perform at least one of:
outputting a second key-value relationship of the table in a case where the association category is a third association category, wherein the third association category indicates that the entity categories of two associated text entity units are the third entity category and the fourth entity category, respectively, and the form information comprises the second key-value relationship;
outputting second structured information of the table in a case where the association category is a fourth association category, wherein the fourth association category indicates that the entity categories of two associated text entity units are both the third entity category, the second structured information indicates that the two text entity units associated under the fourth association category correspond to the same cell, and the form information comprises the second structured information;
and outputting third structured information of the table in a case where the association category is a fifth association category, wherein the fifth association category indicates that the entity categories of two associated text entity units are both the fourth entity category, the third structured information indicates whether the two text entity units associated under the fifth association category are in the same text row, and the form information comprises the third structured information.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
Application CN202210419150.0A, filed 2022-04-20 (priority date 2022-04-20): Form identification method and device and electronic equipment — status: Active — granted as CN114724156B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210419150.0A | 2022-04-20 | 2022-04-20 | Form identification method and device and electronic equipment


Publications (2)

Publication Number | Publication Date
CN114724156A (en) | 2022-07-08
CN114724156B (en) | 2023-07-25

Family

ID=82246366

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210419150.0A (Active; granted as CN114724156B (en)) | Form identification method and device and electronic equipment | 2022-04-20 | 2022-04-20

Country Status (1)

Country | Link
CN (1) | CN114724156B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115497112B * | 2022-09-20 | 2023-10-13 | 北京百度网讯科技有限公司 | Form recognition method, form recognition device, form recognition equipment and storage medium
CN116503888B * | 2023-06-29 | 2023-09-05 | 杭州同花顺数据开发有限公司 | Method, system and storage medium for extracting form from image

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112883735A * | 2021-02-10 | 2021-06-01 | 海尔数字科技(上海)有限公司 | Form image structured processing method, device, equipment and storage medium
CN113569840A * | 2021-08-31 | 2021-10-29 | 平安医疗健康管理股份有限公司 | Form recognition method and device based on self-attention mechanism and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110796031A * | 2019-10-11 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Table identification method and device based on artificial intelligence and electronic equipment
CN110889341A * | 2019-11-12 | 2020-03-17 | 广州供电局有限公司 | Form image recognition method and device based on AI (Artificial Intelligence), computer equipment and storage medium
CN111967387B * | 2020-08-17 | 2023-05-26 | 北京市商汤科技开发有限公司 | Form recognition method, form recognition device, form recognition equipment and computer-readable storage medium
JP2022035594A * | 2020-08-21 | 2022-03-04 | 株式会社日立製作所 | Table structure recognition device and table structure recognition method
CN113297975B * | 2021-05-25 | 2024-03-26 | 新东方教育科技集团有限公司 | Table structure identification method and device, storage medium and electronic equipment
CN113627439A * | 2021-08-11 | 2021-11-09 | 北京百度网讯科技有限公司 | Text structuring method, processing device, electronic device and storage medium
CN113780098B * | 2021-08-17 | 2024-02-06 | 北京百度网讯科技有限公司 | Character recognition method, character recognition device, electronic equipment and storage medium




Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant