CN117765556A - Document image recognition method and device, electronic equipment and storage medium - Google Patents

Document image recognition method and device, electronic equipment and storage medium

Info

Publication number
CN117765556A
Authority
CN
China
Prior art keywords
document image, target document, image, semantic, information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311799594.2A
Other languages
Chinese (zh)
Inventor
范清
张�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN202311799594.2A
Publication of CN117765556A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a document image recognition method and device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring a target document image to be identified, the target document image comprising texts and connecting lines between the texts; extracting features of the target document image to obtain first multi-modal features of each semantic entity in the target document image, the first multi-modal features being generated according to image features of the target document image and text features of the semantic entities in the target document image; generating type information of each semantic entity according to the first multi-modal features, and determining relation information among the semantic entities; and generating a structured document corresponding to the target document image according to the type information and the relation information of the semantic entities. The embodiments of the disclosure can extract the relationships between semantic entities and perform structuring processing on a document image containing texts and connecting lines.

Description

Document image recognition method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of image processing, and in particular to a document image recognition method and apparatus, an electronic device, and a storage medium.
Background
In many cases, documents such as organizational charts, mind maps, flowcharts and knowledge graphs are stored as scanned images. Converting the text in such an image into a structured document facilitates document retrieval, document analysis, text editing and other intelligent services. For example, converting a picture of an enterprise's organizational chart into a structured document makes it easier to analyze the enterprise's organizational structure.
In the prior art, document information can be extracted manually, but this generally consumes a large amount of manpower and material resources for information comparison and processing. Rule-based semantic entity extraction can also be used to extract information from documents with fixed formats, such as identity cards and invoices, but this approach can generally only extract entity elements with fixed patterns and cannot extract relation information between semantic entities. In addition, for document images containing connecting lines between texts, existing technical solutions cannot extract the relationships between semantic entities by using image information such as text positions and connecting lines.
Therefore, there is a need for a document image recognition method capable of extracting relationship information between semantic entities from a document image containing connection lines between texts.
Disclosure of Invention
The disclosure provides a document image recognition method and device, electronic equipment and storage medium.
In a first aspect, the present disclosure provides a method for identifying a document image, including:
acquiring a target document image to be identified; the target document image comprises texts and connecting lines between the texts;
extracting features of the target document image to obtain first multi-modal features of each semantic entity in the target document image; the first multi-modal feature is generated according to the image feature of the target document image and the text feature of each semantic entity in the target document image;
generating type information of each semantic entity according to the first multi-modal features, and determining relation information among the semantic entities;
and generating a structured document corresponding to the target document image according to the type information of the semantic entities and the relation information among the semantic entities.
Optionally, the extracting features of the target document image to obtain a first multi-modal feature of each semantic entity in the target document image specifically includes:
Extracting image features of the target document image to obtain the image features of the target document image;
extracting text features of the target document image to obtain the text features of each semantic entity; the text features are used for representing the semantics of the semantic entity and the position of the semantic entity in the target document image;
generating a second multi-modal feature of the target document image according to the image features and the text features of each semantic entity;
and processing the second multi-modal feature based on a self-attention mechanism to obtain a first multi-modal feature of the semantic entity.
Optionally, the extracting the image feature of the target document image to obtain the image feature of the target document image specifically includes:
dividing the target document image to obtain a sub-image with a target size and first position information of the sub-image in the target document image;
performing visual coding processing on each sub-image based on a visual encoder to obtain a first sub-image characteristic of the sub-image;
generating a second sub-image feature of each sub-image according to the first position information and the first sub-image feature;
And performing stitching processing on the second sub-image features to obtain the image features of the target document image.
Optionally, extracting text features of the target document image to obtain the text features of each semantic entity specifically includes:
performing word recognition processing on the target document image to obtain text information in the target document image;
word segmentation processing is carried out on the text information to obtain each semantic entity in the target document image and second position information of each semantic entity in the target document image;
generating word vectors of each semantic entity according to the word vector model;
and generating the text characteristic of the semantic entity according to the word vector of the semantic entity and the second position information aiming at each semantic entity.
Optionally, the processing the second multi-modal feature based on the self-attention mechanism to obtain a first multi-modal feature of the semantic entity specifically includes:
and processing the second multi-modal feature by utilizing a multi-modal self-attention layer in the multi-modal model to obtain a first multi-modal feature of the semantic entity.
Optionally, the generating type information of each semantic entity according to the first multi-modal features and determining relation information among the semantic entities specifically includes:
based on a first classification module in the multi-modal model, classifying the first multi-modal features of the semantic entity to obtain type information of the semantic entity;
for each semantic entity, constructing a third multi-modal feature of the semantic entity according to the first multi-modal feature and the type information of the semantic entity;
and extracting the relation information among the semantic entities according to the third multi-modal characteristics of each semantic entity.
Optionally, the extracting the relationship information between the semantic entities according to the third multi-modal feature of each semantic entity specifically includes:
creating a candidate relation set comprising candidate relation information between any two semantic entities according to the third multi-modal characteristic of each semantic entity; the candidate relation information is generated according to the third multi-modal characteristics of any two semantic entities and candidate relation identifiers;
based on a second classification module in the multi-modal model, classifying the candidate relation information in the candidate relation set to obtain a classification result for the candidate relation information;
and if the classification result indicates that the candidate relation information is true, the candidate relation identification in the candidate relation information is used as the relation information between the two semantic entities corresponding to the candidate relation information.
Optionally, if the classification result indicates that the candidate relation information is true, using the candidate relation identifier in the candidate relation information as the relation information between the two semantic entities corresponding to the candidate relation information specifically includes:
performing line segment detection processing on the target document image to obtain line segment information in the target document image;
if the classification result indicates that the candidate relation information is true, judging whether a connecting line exists between two semantic entities corresponding to the candidate relation information according to the line segment information, and obtaining a first judgment result;
verifying the classification result according to the first judgment result;
And if the classification result passes the verification, the candidate relation identification in the candidate relation information is used as the relation information between the two corresponding semantic entities.
Optionally, the target document image is any one of an organizational chart, a mind map, a flow chart and a knowledge graph; the relation information is used for indicating whether the two semantic entities have a superior-subordinate relation or not and whether the two semantic entities belong to a parallel relation or not.
In a second aspect, the present disclosure provides an apparatus for recognizing a document image, including:
the image acquisition module is used for acquiring a target document image to be identified; the target document image comprises texts and connecting lines between the texts;
the feature extraction module is used for extracting features of the target document image to obtain first multi-modal features of each semantic entity in the target document image; the first multi-modal feature is generated according to the image feature of the target document image and the text feature of each semantic entity in the target document image;
the entity analysis module is used for generating type information of each semantic entity according to the first multi-mode characteristics and determining relation information among the semantic entities;
And the document creation module is used for generating a structured document corresponding to the target document image according to the type information of the semantic entities and the relation information of each semantic entity.
In a third aspect, the present disclosure provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executed by the at least one processor to enable the at least one processor to perform the above-described document image recognition method.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the above-described document image recognition method.
The present disclosure provides a document image recognition method and apparatus, an electronic device, and a storage medium. The scheme includes: extracting features of a target document image to be identified to obtain first multi-modal features generated according to image features of the target document image and text features of the semantic entities; generating type information of each semantic entity and relation information among the semantic entities according to the first multi-modal features; and finally generating a structured document corresponding to the target document image. In this way, through the image and text information of the target document image, the type information of each semantic entity is generated and the relation information among the semantic entities is extracted, so that a document image containing texts and connecting lines can be subjected to structuring processing.
It should be understood that the description in this section is not intended to identify key or critical features of the disclosed embodiments, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
FIG. 1 is an application scenario diagram of a method and an apparatus for recognizing a document image according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for recognizing a document image according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a multi-modal self-attention layer according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a second classification module provided by an embodiment of the disclosure;
FIG. 5 is a schematic diagram of another method for recognizing document images according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a multi-modal model provided by an embodiment of the present disclosure;
FIG. 7 is a block diagram of a document image recognition apparatus provided by an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical solutions of the present disclosure, exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In images of structured documents such as organizational charts, mind maps, flowcharts and knowledge graphs, in addition to the information recorded in the text itself, the positional relationships and connecting lines between texts record relation information between the texts. However, prior-art document image recognition methods often cannot identify the relation information between texts, so the images of such structured documents cannot be recognized effectively.
Fig. 1 is an application scenario diagram of a method and an apparatus for recognizing a document image according to an embodiment of the present disclosure.
As shown in fig. 1, after collecting a document image of a structured document based on a camera, a mobile phone, a scanner, and other devices, the recognition system may recognize the document image based on the document image recognition method provided by the embodiment of the present disclosure, so as to obtain a structured document that may be conveniently subjected to various processes such as subsequent editing, analysis, and retrieval.
Fig. 2 is a flowchart of a method for recognizing a document image according to an embodiment of the present disclosure. As shown in fig. 2, the process may include the steps of:
step 201, obtaining a target document image to be identified; the target document image contains text and connecting lines between the text.
In some embodiments, the target document image may be any one of an organizational chart, a mind map, a flow chart, and a knowledge graph. In the target document image, the positions of the texts and the connecting lines between the texts may represent the relationships between the texts.
Step 202, extracting features of the target document image to obtain first multi-modal features of each semantic entity in the target document image; the first multi-modal feature is generated according to the image features of the target document image and the text features of all semantic entities in the target document image.
The semantic entity may be an entity with a specific meaning in the text, and may be a word, a sentence, or a paragraph in the target document image. For example, if the target document image is an organizational chart, the semantic entity may be an organization name, a person name, a job position, a job level, a department, a product, a location, an event, an object, a concept, or the like.
In some embodiments, the image features may be extracted from the whole target document image, and not from the sub-images corresponding to the semantic entities.
In some embodiments, the text feature may be used to represent the semantics of the semantic entity and may also be used to represent the location of the semantic entity in the target document image.
In practical applications, multi-modal generally refers to information of multiple modalities, including text, images, video, audio, etc.
In some embodiments, the multi-modal model is used for fusing information of two modalities of an image and a text to obtain a first multi-modal feature of the semantic entity.
Step 203, generating type information of each semantic entity according to the first multi-modal features, and determining relation information among the semantic entities.
In some embodiments, the first multi-modal feature may carry the semantic information, image information and type information of the semantic entity, as well as its relation information with other semantic entities.
In some embodiments, the type information is used to represent a category to which the semantic entity belongs. For example, when the target document image is an organizational chart, the categories of semantic entities may include organization names, personnel names, positions, levels of positions, departments, products, locations, events, objects, concepts, and the like.
In some embodiments, the relationship information may represent a relationship between two of the semantic entities. For example, when the target document image is an organizational chart or a knowledge graph, the relationship information may be used to indicate whether there is a superior-inferior relationship between two semantic entities and whether the relationship belongs to a parallel relationship.
Step 204, generating a structured document corresponding to the target document image according to the type information of the semantic entities and the relation information among the semantic entities.
In some embodiments, the structured document may refer to a document that is organized and stored according to certain rules and formats, and may be composed of a logical structure of titles, chapters, paragraphs, and the like. The structured document may be an XML document or JSON file.
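By way of non-limiting illustration, the following is a minimal Python sketch of what such a JSON-style structured document could look like for the organizational-chart example discussed later in this disclosure; the field names, entity labels and types are assumptions made for illustration only, not a mandated output format.

# Illustrative structured-document layout (assumed fields, not prescribed by the method)
structured_document = {
    "document_type": "organizational_chart",
    "entities": [
        {"id": 0, "text": "chairman", "type": "position"},
        {"id": 1, "text": "general manager", "type": "position"},
        {"id": 2, "text": "development department", "type": "department"},
        {"id": 3, "text": "engineering department", "type": "department"},
    ],
    "relations": [
        {"head": 0, "tail": 1, "relation": "superior-subordinate"},
        {"head": 2, "tail": 3, "relation": "parallel"},
    ],
}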
In the embodiment of the disclosure, feature extraction is performed on a target document image to be identified, so as to obtain first multi-modal features generated according to image features of the target document image and text features of the semantic entities; type information of each semantic entity and relation information among the semantic entities are generated according to the first multi-modal features; and finally, a structured document corresponding to the target document image is generated. In this way, through the image and text information of the target document image, the type information of each semantic entity is generated and the relation information among the semantic entities is extracted, so that a document image containing texts and connecting lines can be subjected to structuring processing.
Based on the method in fig. 2, the examples of the present specification also provide some specific embodiments of the method, as described below.
Optionally, the extracting features of the target document image to obtain a first multi-modal feature of each semantic entity in the target document image specifically includes:
extracting image features of the target document image to obtain the image features of the target document image;
extracting text features of the target document image to obtain the text features of each semantic entity; the text features are used for representing the semantics of the semantic entity and the position of the semantic entity in the target document image;
generating a second multi-modal feature of the target document image according to the image features and the text features of each semantic entity;
and processing the second multi-modal feature based on a self-attention mechanism to obtain a first multi-modal feature of the semantic entity.
In some embodiments, the image feature may be obtained by extracting an image feature of the whole target document image, and not by extracting an image feature of a sub-image corresponding to the semantic entity.
In some embodiments, the text feature may be derived from a word vector of the semantic entity and location information of the semantic entity in the target document image.
In some embodiments, the image features and the text features of the semantic entities are spliced to obtain the second multimodal features of the target document image.
In the embodiment of the disclosure, the second multi-modal feature of the target document image is generated according to the image feature and the text feature of each semantic entity, and the text feature is used for representing the position of the semantic entity in the target document image, so that the multi-modal model is facilitated to perform fusion processing on information of two modes of the image and the text, and further the first multi-modal feature obtained by the fusion processing can reflect the association relationship between the semantic entities in terms of semantics, positions, connecting lines and the like, and global information of the image is reserved in the first multi-modal feature of the semantic entity.
In some embodiments, the extracting the image feature of the target document image to obtain the image feature of the target document image specifically includes: extracting image features of the target document image based on a visual feature extraction module to obtain the image features of the target document image; the visual feature extraction module is any one of a convolutional neural network model, a visual encoder and a SIFT algorithm. The visual feature extraction module may employ various image feature extraction models used in image recognition technology, and is not particularly limited herein. The convolutional neural network model may include VGG16, ResNet, and other models. The visual encoder may be a ViT (Vision Transformer) model.
Optionally, the extracting the image feature of the target document image to obtain the image feature of the target document image specifically includes:
dividing the target document image to obtain a sub-image with a target size and first position information of the sub-image in the target document image;
performing visual coding processing on each sub-image based on a visual encoder to obtain a first sub-image characteristic of the sub-image;
generating a second sub-image feature of each sub-image according to the first position information and the first sub-image feature;
and performing stitching processing on the second sub-image features to obtain the image features of the target document image.
In some embodiments, before the dividing process is performed on the target document image, the method may further include: and adjusting the size of the image to be identified through a resize function to obtain the target document image with the preset size. The preset size may be set as desired, for example 224x224.
In some embodiments, the target document image is divided according to a preset target size, so as to obtain a sub-image with the target size. For example, the target document image with a size of 224×224 may be divided into 4 sub-images with a preset size of 112×112.
In some embodiments, the first position information may be used to represent the position of the sub-image in the target document image; the first position information may specifically include a 1D position code representing a sequence number, and a 2D position code representing the area of the sub-image in the target document image. The sub-images obtained by the division are numbered in a preset order (for example, from left to right and from top to bottom) to obtain the 1D position codes of the sub-images; the sequence numbers of the 4 sub-images above may be 0, 1, 2 and 3, respectively. The 2D position code can be represented in various ways, such as by two diagonal corner points, or by a corner point together with a width and height; for example, the 2D position code may be expressed as (x, y, width, height), where x, y are the upper-left corner coordinates of the sub-image in the target document image, and width, height are the width and height of the sub-image, in pixels.
In some embodiments, the visual encoder may be a ViT (Vision Transformer) model. Each sub-image is input into the visual encoder to obtain the first sub-image feature of a preset dimension.
In some embodiments, for each of the sub-images, the first position information is appended to the first sub-image feature of the sub-image to obtain the second sub-image feature of the sub-image. The second sub-image features (for example, 1×768 dimensions each) of the sub-images are concatenated to obtain the image feature (for example, 4×768 dimensions) of the target document image.
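By way of non-limiting illustration, the following minimal PyTorch sketch reproduces the image branch described above, assuming a 224×224 input split into four 112×112 sub-images; the stand-in encoder, the 768-dimensional features and the position-encoding layers are assumptions for illustration and do not reproduce the actual ViT encoder.

import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    def __init__(self, dim=768, patch=112, num_patches=4):
        super().__init__()
        # Stand-in visual encoder: flattens a sub-image and projects it to `dim`.
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * patch * patch, dim))
        self.pos_1d = nn.Embedding(num_patches, dim)   # sequence-number (1D) position code
        self.pos_2d = nn.Linear(4, dim)                # (x, y, width, height) position code
        self.patch = patch

    def forward(self, image):                          # image: (3, 224, 224)
        feats = []
        for idx, (row, col) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
            y, x = row * self.patch, col * self.patch
            sub = image[:, y:y + self.patch, x:x + self.patch]            # sub-image
            f = self.encoder(sub.unsqueeze(0))                            # first sub-image feature, (1, dim)
            box = torch.tensor([[x, y, self.patch, self.patch]], dtype=torch.float)
            f = f + self.pos_1d(torch.tensor([idx])) + self.pos_2d(box)   # second sub-image feature
            feats.append(f)
        return torch.cat(feats, dim=0)                 # image feature of the target document image, (4, dim)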
In the embodiment of the disclosure, the visual encoder is used for extracting the image characteristics of the target document image, so that the image characteristics of the target document image can effectively express the global characteristics of the target document image, and more space information is reserved.
Optionally, extracting text features of the target document image to obtain the text features of each semantic entity specifically includes:
performing word recognition processing on the target document image to obtain text information in the target document image;
word segmentation processing is carried out on the text information to obtain each semantic entity in the target document image and second position information of each semantic entity in the target document image;
generating word vectors of each semantic entity according to the word vector model;
and generating the text characteristic of the semantic entity according to the word vector of the semantic entity and the second position information aiming at each semantic entity.
In some embodiments, text recognition processing is performed on the target document image based on an optical character recognition (OCR) model to obtain the text information in the target document image and the position information of each character.
In some embodiments, the word segmentation process may be implemented based on a segmentation tool such as Jieba, SnowNLP, PkuSeg, THULAC or HanLP. In some embodiments, the word segmentation may also be performed directly based on the spaces in the text information.
In some embodiments, the second position information may include a 1D position code representing the sequence number of the semantic entity in the target document image and a 2D position code representing the area of the semantic entity in the target document image.
In some embodiments, the semantic entities obtained by the word segmentation are numbered in order of their appearance, resulting in the 1D position code of each semantic entity.
In some embodiments, the text recognition processing performed on the target document image may also yield the position information of each character in the text information. The 2D position code of a semantic entity may be obtained from the position information of the characters included in the semantic entity, for example, as the minimum bounding box (for example, an AABB bounding box) enclosing the areas of those characters. The 2D position code can be represented in various ways, such as by the two diagonal corner points of the AABB bounding box, or by a corner point together with a width and height; for example, the 2D position code may be expressed as (x, y, width, height), where x, y are the upper-left corner coordinates of the semantic entity in the target document image, and width, height are the width and height of the bounding box, in pixels.
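By way of non-limiting illustration, the following minimal Python sketch derives an entity's (x, y, width, height) 2D position code as the minimal axis-aligned bounding box over its characters' OCR boxes; the per-character box format is an assumption for illustration.

def entity_bbox(char_boxes):
    """char_boxes: list of (x, y, width, height) tuples, one per character of the entity."""
    x_min = min(x for x, y, w, h in char_boxes)
    y_min = min(y for x, y, w, h in char_boxes)
    x_max = max(x + w for x, y, w, h in char_boxes)
    y_max = max(y + h for x, y, w, h in char_boxes)
    return (x_min, y_min, x_max - x_min, y_max - y_min)

# e.g. entity_bbox([(10, 20, 16, 16), (26, 20, 16, 16)]) -> (10, 20, 32, 16)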
In some embodiments, the word vector model may be a pre-trained BGE model or an OpenAI text embedding model. The word vector of each semantic entity is determined by the word vector model; the word vector may have a preset dimension.
In some embodiments, for each of the semantic entities, the second position information is appended to the word vector of the semantic entity, resulting in the text feature (e.g., 1×768 dimensions) of the semantic entity. The text features and the second sub-image features have the same number of dimensions. The image features and the text features of the semantic entities are concatenated to obtain the second multi-modal feature of the target document image.
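By way of non-limiting illustration, the following minimal PyTorch sketch shows how the second multi-modal feature can be assembled by concatenating the sub-image features and the entity text features along the sequence dimension; the shapes (four sub-images, six entities, 768 dimensions) are assumptions for illustration.

import torch

image_features = torch.randn(4, 768)   # second sub-image features of the 4 sub-images
text_features = torch.randn(6, 768)    # text features of 6 semantic entities
# Concatenate into one token sequence: the second multi-modal feature of the target document image.
second_multimodal_feature = torch.cat([image_features, text_features], dim=0)  # (10, 768)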
In the embodiment of the disclosure, the text feature may represent both semantic information of the semantic entity and position information of the semantic entity in the target document image, so as to facilitate fusion processing of information of two modes of an image and a text according to association between the text of the semantic entity and the information of the two modes of the image by the multimodal model; and the first multi-modal feature obtained by fusion processing can reflect the association relation among semantic entities in terms of semantics, positions, connecting lines and the like.
Optionally, the processing the second multi-modal feature based on the self-attention mechanism to obtain a first multi-modal feature of the semantic entity specifically includes:
and processing the second multi-modal feature by utilizing a multi-modal self-attention layer in the multi-modal model to obtain a first multi-modal feature of the semantic entity.
Fig. 3 is a schematic structural diagram of a multi-modal self-attention layer according to an embodiment of the present disclosure. As shown in Fig. 3, the multi-modal self-attention layer may use a multi-head attention mechanism, and may, for example, be composed of 12 Transformer encoder layers (i.e., 12-head self-attention).
In some embodiments, the multi-modal self-attention layer may be used to fuse the second multi-modal features.
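By way of non-limiting illustration, the following minimal PyTorch sketch approximates such a multi-modal self-attention layer with 12 Transformer encoder layers and 12 attention heads; the hidden size and the token layout (4 sub-image tokens followed by 6 entity tokens) are assumptions for illustration.

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
multimodal_self_attention = nn.TransformerEncoder(encoder_layer, num_layers=12)

tokens = torch.randn(1, 10, 768)              # second multi-modal feature: 4 sub-image + 6 entity tokens
fused = multimodal_self_attention(tokens)     # fused representation, (1, 10, 768)
entity_features = fused[:, 4:, :]             # first multi-modal features of the 6 semantic entities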
In some embodiments, the multi-modal model may also be trained based on a training dataset before the above document image recognition method is carried out. The sample pictures in the training dataset can be images such as organizational charts, mind maps, flowcharts and knowledge graphs; each sample image can carry label information such as the semantic entities, the type information of the semantic entities, and the relation information among the semantic entities. The multi-modal model is trained with the goal of minimizing the difference between the output of the multi-modal model and the label information, so as to obtain the trained multi-modal model.
In some embodiments, during training, the multi-modal model may be initialized using a pre-trained BERT model, with the remaining parameters initialized randomly. The parameters of the multi-modal model are optimized based on an Adam optimizer. The initial learning rate of the Adam optimizer may be set to 0.0001.
In some embodiments, two pre-training tasks, masked visual-language modeling and text-image alignment, may also be performed before the multi-modal model is trained. The process of masked visual-language modeling may be: randomly masking 15% of the semantic entities while masking the corresponding image regions, and letting the multi-modal model predict the masked semantic entities based on the remaining text and visual cues. The process of text-image alignment may be: randomly selecting and masking 15% of the semantic entities while masking the corresponding image regions; a classification layer is added to the multi-modal model during training and can be used to predict whether each semantic entity is masked or not. In this way, the multi-modal model learns the association between image features and text features.
In the embodiment of the disclosure, the multi-modal self-attention layer fuses the second multi-modal features according to the relationships between the text and the image, so that the resulting first multi-modal features can reflect the associations between semantic entities in terms of semantics, positions, connecting lines and the like, which can improve the accuracy and robustness of document image recognition.
Optionally, generating type information of each semantic entity according to the first multi-modal features, and determining relation information among the semantic entities, specifically includes:
based on a first classification module in the multi-modal model, classifying the first multi-modal features of the semantic entity to obtain type information of the semantic entity;
for each semantic entity, constructing a third multi-modal feature of the semantic entity according to the first multi-modal feature and the type information of the semantic entity;
and extracting the relation information among the semantic entities according to the third multi-modal characteristics of each semantic entity.
In some embodiments, the multi-modal model can include a backbone network for feature extraction (e.g., a multi-head attention module) and a first classification module (e.g., a classification head). The first classification module may be used to map a first multi-modal feature to a predefined set of categories. The first classification module may be a fully connected layer; the type information of the semantic entity may be a probability vector output by the first classification module, where each dimension of the probability vector corresponds to a category and the value of each element is the probability that the semantic entity belongs to the category corresponding to that dimension. The category with the highest probability is taken as the type of the semantic entity. The sum of all elements in the probability vector may be equal to 1.
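By way of non-limiting illustration, the following minimal PyTorch sketch shows a first classification module as a fully connected layer followed by Softmax; the category set is an assumption for illustration.

import torch
import torch.nn as nn

categories = ["organization", "person", "position", "department", "other"]  # assumed category set
classifier = nn.Sequential(nn.Linear(768, len(categories)), nn.Softmax(dim=-1))

entity_feature = torch.randn(1, 768)                    # first multi-modal feature of one entity
probs = classifier(entity_feature)                      # probability vector, elements sum to 1
entity_type = categories[probs.argmax(dim=-1).item()]   # category with the highest probability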
In some embodiments, the first multi-modal feature of the semantic entity and the type information are spliced to obtain a third multi-modal feature of the semantic entity.
In the embodiment of the disclosure, the third multi-modal feature carries semantic information, position information, global information and type information of the semantic entities, which is helpful for extracting the relationship information among the semantic entities, and improves the accuracy of the relationship information.
Optionally, the extracting the relationship information between the semantic entities according to the third multi-modal feature of each semantic entity specifically includes:
creating a candidate relation set comprising candidate relation information between any two semantic entities according to the third multi-modal characteristic of each semantic entity; the candidate relation information is generated according to the third multi-modal characteristics of any two semantic entities and candidate relation identifiers;
based on a second classification module in the multi-modal model, classifying the candidate relation information in the candidate relation set to obtain a classification result for the candidate relation information;
And if the classification result indicates that the candidate relation information is true, the candidate relation identification in the candidate relation information is used as the relation information between the two semantic entities corresponding to the candidate relation information.
In some embodiments, the candidate relationship information may be a triplet of third multimodal features of two semantic entities and candidate relationship identification therebetween; the candidate relationship set may then be a triplet set consisting of triples. The triples may be represented as (first semantic entity, second semantic entity, candidate relationship identification).
In some embodiments, if the target document image is an organizational chart, the relationship between two semantic entities may be a superior-subordinate relationship or a parallel relationship; in this case there may be four triples corresponding to the two semantic entities, respectively representing that the first semantic entity is the superior of the second semantic entity, the first semantic entity is the subordinate of the second semantic entity, the two semantic entities are in a parallel relationship, and the two semantic entities have no relationship.
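By way of non-limiting illustration, the following minimal Python sketch enumerates such candidate triples for every entity pair; in practice the triples would hold the entities' third multi-modal features rather than bare ids, and the identifier names are assumptions for illustration.

from itertools import combinations

candidate_identifiers = ["superior", "subordinate", "parallel", "none"]

def build_candidate_set(entity_ids):
    # One candidate triple per entity pair and per candidate relation identifier.
    return [(a, b, rel)
            for a, b in combinations(entity_ids, 2)
            for rel in candidate_identifiers]

# e.g. build_candidate_set([0, 1]) ->
# [(0, 1, "superior"), (0, 1, "subordinate"), (0, 1, "parallel"), (0, 1, "none")]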
In some embodiments, the second classification module may be configured to perform binary classification on the candidate relation information and determine whether the candidate relation information is true.
Fig. 4 is a schematic diagram of a second classification module according to an embodiment of the disclosure. As shown in Fig. 4, the second classification module may include two linear layers and one classification layer (Sigmoid).
After the candidate relation information (a first semantic entity, a second semantic entity, a candidate relation identifier) is input into the second classification module, the first linear layer performs a linear transformation on the third multi-modal feature of each of the two semantic entities in the triple to obtain the fourth multi-modal feature of that semantic entity; the second linear layer performs a linear transformation on the fourth multi-modal features of the two semantic entities and the candidate relation identifier to obtain the relation feature corresponding to the candidate relation between the two semantic entities, and the relation feature is input into the classification layer to obtain the confidence of the candidate relation information, i.e., the probability that the candidate relation information is true.
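By way of non-limiting illustration, the following minimal PyTorch sketch mirrors the structure just described (two linear layers and a Sigmoid classification layer); the use of a learned embedding for the candidate relation identifier and the dimensions are assumptions for illustration.

import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    def __init__(self, dim=768, num_relations=4):
        super().__init__()
        self.linear1 = nn.Linear(dim, dim)              # first linear layer: per-entity transform
        self.relation_embed = nn.Embedding(num_relations, dim)
        self.linear2 = nn.Linear(3 * dim, 1)            # second linear layer: head + tail + relation
        self.sigmoid = nn.Sigmoid()                     # classification layer

    def forward(self, head_feat, tail_feat, relation_id):
        h = self.linear1(head_feat)                     # fourth multi-modal feature of the first entity
        t = self.linear1(tail_feat)                     # fourth multi-modal feature of the second entity
        r = self.relation_embed(relation_id)            # embedding of the candidate relation identifier
        score = self.linear2(torch.cat([h, t, r], dim=-1))
        return self.sigmoid(score)                      # confidence that the candidate relation is true

# Usage sketch:
# confidence = RelationClassifier()(torch.randn(1, 768), torch.randn(1, 768), torch.tensor([0]))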
In the embodiment of the disclosure, candidate relationship information is generated according to the third multi-modal characteristics of any two semantic entities and the candidate relationship identifier, and then the relationship information between the two semantic entities is determined according to the candidate relationship information. Because the third multi-modal feature contains information of two modes of text and image and type information of the semantic entity, relationship classification is carried out on the candidate relationship information constructed based on the third multi-modal feature, and accuracy of the relationship information is improved.
Optionally, if the classification result indicates that the candidate relation information is true, using the candidate relation identifier in the candidate relation information as the relation information between the two semantic entities corresponding to the candidate relation information specifically includes:
performing line segment detection processing on the target document image to obtain line segment information in the target document image;
if the classification result indicates that the candidate relation information is true, judging whether a connecting line exists between two semantic entities corresponding to the candidate relation information according to the line segment information, and obtaining a first judgment result;
verifying the classification result according to the first judgment result;
and if the classification result passes the verification, the candidate relation identification in the candidate relation information is used as the relation information between the two corresponding semantic entities.
In some embodiments, the target document image includes connecting lines between texts. Generally, if no connecting line exists between two texts, it can be determined that the two corresponding semantic entities have no relationship; if a connecting line exists between two texts, the two corresponding semantic entities can be considered to be related. Therefore, the line segment information in the target document image can be used to correct the classification result output by the second classification module.
In some embodiments, the candidate relation identifiers in all candidate relation information between two semantic entities are added to a candidate relation subset; the candidate relation information in the candidate relation subset is classified by the second classification module; whether a connecting line exists between the two semantic entities corresponding to the candidate relation information is judged to obtain a first judgment result; and the relation information is determined from the candidate relation subset according to the classification result and the first judgment result. The first judgment result may be used to screen out erroneous candidate relation identifiers from the candidate relation subset.
Specifically, if the classification result indicates that there is no relationship between the two semantic entities while the first judgment result indicates that a connecting line exists between them, the verification result indicates that the classification result is false. In this case, the candidate relation identifier in the candidate relation information with the highest confidence may be used as the relation information.
Conversely, if the classification result indicates that there is a relationship (e.g., a superior-subordinate relationship or a parallel relationship) between the two semantic entities while the first judgment result indicates that no connecting line exists between them, the classification result is judged to be erroneous. In this case, the candidate relation identifier representing that the two semantic entities have no relationship may be used as the relation information.
In some embodiments, before the candidate relation set is created, whether a connecting line exists between the two semantic entities corresponding to the candidate relation information may be judged according to the line segment information, so as to obtain the first judgment result. The candidate relation identifiers in the candidate relation subset are screened according to the first judgment result, and the candidate relation information is created according to the screened candidate relation subset and the third multi-modal features of each semantic entity, thereby reducing the amount of computation in the relation extraction process by reducing the number of candidate relation information items in the candidate relation set.
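By way of non-limiting illustration, the following minimal Python sketch applies the two correction rules just described; the relation labels and the has_connecting_line flag (the first judgment result) are assumptions for illustration.

def verify_relation(predicted_relation, has_connecting_line, best_alternative):
    """Correct the classifier's prediction with the line-segment judgment result.

    predicted_relation: label output by the second classification module.
    has_connecting_line: first judgment result for the entity pair.
    best_alternative: highest-confidence candidate relation other than "none".
    """
    if predicted_relation == "none" and has_connecting_line:
        return best_alternative   # a connecting line exists, so some relation is assumed
    if predicted_relation != "none" and not has_connecting_line:
        return "none"             # no connecting line, so the predicted relation is rejected
    return predicted_relation     # classification result passes verification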
In the embodiment of the present disclosure, the classification result of the second classification module is corrected by the connection line between the texts in the target document image, which is helpful for improving the accuracy of the relationship information.
Optionally, the target document image is any one of an organizational chart, a mind map, a flow chart and a knowledge graph; the relation information is used for indicating whether the two semantic entities have a superior-subordinate relation or not and whether the two semantic entities belong to a parallel relation or not.
In some embodiments, the superior-subordinate relationship may mean that one semantic entity is subordinate to another semantic entity. For example, one position is subordinate to another position in an organizational chart, and one concept is subordinate to another in a knowledge graph.
In some embodiments, the parallel relationship may mean that two semantic entities are subordinate to the same semantic entity. For example, two positions subordinate to the same position in an organizational chart have a parallel relationship, and subordinate headings under the same heading in a mind map have a parallel relationship.
In the embodiment of the disclosure, the relationships of the semantic entities in the target document image are abstracted into superior-subordinate relationships and parallel relationships, so that the logical structure of a target document image such as an organizational chart, a mind map, a flow chart or a knowledge graph can be determined according to the relation information, and the structured document of the target document image can be generated.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principle and logic, which are not repeated in the present disclosure for brevity. It will be appreciated by those skilled in the art that in the above methods of the embodiments, the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same technical conception, the present disclosure also provides another document image recognition method.
FIG. 5 is a schematic diagram of another method for recognizing document images according to an embodiment of the present disclosure; FIG. 6 is a schematic diagram of a multi-modal model provided by an embodiment of the present disclosure; as shown in fig. 5 and 6, the method includes:
acquiring a target document image to be identified; the target document image comprises texts and connecting lines between the texts;
dividing the target document image to obtain a sub-image with a target size and first position information of the sub-image in the target document image;
performing visual coding processing on each sub-image based on a visual encoder to obtain a first sub-image characteristic of the sub-image;
generating a second sub-image feature of each sub-image according to the first position information and the first sub-image feature;
and performing stitching processing on the second sub-image features to obtain the image features of the target document image.
Performing word recognition processing on the target document image to obtain text information in the target document image;
word segmentation processing is carried out on the text information, so that each semantic entity (chairman, general manager, development department, engineering department) in the target document image and second position information of each semantic entity in the target document image are obtained;
Generating word vectors of each semantic entity according to the word vector model;
and generating the text characteristic of the semantic entity according to the word vector of the semantic entity and the second position information aiming at each semantic entity.
Generating a second multi-modal feature of the target document image according to the image features and the text features of each semantic entity;
and carrying out fusion processing on the second multi-modal features by utilizing a multi-modal self-attention layer in the multi-modal model to obtain first multi-modal features of the semantic entities.
Based on a first classification module in the multi-modal model, classifying the first multi-modal features of the semantic entity to obtain type information of the semantic entity;
for each semantic entity, constructing a third multi-modal feature of the semantic entity according to the first multi-modal feature and the type information of the semantic entity;
creating a candidate relation set comprising candidate relation information between any two semantic entities according to the third multi-modal characteristic of each semantic entity; the candidate relation information is generated according to the third multi-modal characteristics of any two semantic entities and candidate relation identifiers;
Based on a second classification module in the multi-modal model, classifying the candidate relation information in the candidate relation set to obtain a classification result for the candidate relation information;
performing line segment detection processing on the target document image to obtain line segment information in the target document image;
if the classification result indicates that the candidate relation information is true, judging whether a connecting line exists between two semantic entities corresponding to the candidate relation information according to the line segment information, and obtaining a first judgment result;
judging whether the classification result is true according to the first judgment result to obtain a verification result;
and if the verification result shows that the classification result is true, the candidate relation identifier in the candidate relation information is used as the relation information between the two corresponding semantic entities. For example, the chairman and the general manager are in a superior-subordinate relationship, and the development department and the engineering department are in a parallel relationship.
And generating a structured document corresponding to the target document image according to the type information of the semantic entity and the relation information of each semantic entity.
Based on the same technical conception, the present disclosure also provides a document image recognition device, an electronic device, and a computer readable storage medium, which can be used to implement any of the document image recognition methods provided by the present disclosure.
Fig. 7 is a block diagram of a document image recognition apparatus provided in an embodiment of the present disclosure. Referring to fig. 7, the document image recognition apparatus includes:
an image acquisition module 701, which may be configured to acquire a target document image to be identified; the target document image comprises texts and connecting lines between the texts;
the feature extraction module 702 may be configured to perform feature extraction on the target document image to obtain first multi-modal features of each semantic entity in the target document image; the first multi-modal feature is generated according to the image feature of the target document image and the text feature of each semantic entity in the target document image;
the entity analysis module 703 may be configured to generate type information of each semantic entity according to the first multimodal feature, and determine relationship information between each semantic entity;
the document creation module 704 may be configured to generate a structured document corresponding to the target document image according to the type information of the semantic entity and the relationship information of each semantic entity.
Based on the device in fig. 7, the present description example also provides some specific embodiments of the device, as described below.
Optionally, the feature extraction module 702 may specifically include:
the image feature extraction unit can be used for extracting image features of the target document image to obtain the image features of the target document image;
the text feature extraction unit can be used for extracting text features of the target document image to obtain the text features of each semantic entity; the text features are used for representing the semantics of the semantic entity and the position of the semantic entity in the target document image;
a first feature generating unit, configured to generate a second multi-modal feature of the target document image according to the image feature and the text feature of each semantic entity;
and the feature fusion unit can be used for processing the second multi-modal features based on a self-attention mechanism to obtain first multi-modal features of the semantic entity.
Optionally, the image feature extraction unit may specifically be configured to:
dividing the target document image to obtain a sub-image with a target size and first position information of the sub-image in the target document image;
performing visual coding processing on each sub-image based on a visual encoder to obtain a first sub-image feature of the sub-image;
generating a second sub-image feature of each sub-image according to the first position information and the first sub-image feature;
and performing stitching processing on the second sub-image features to obtain the image features of the target document image.
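A minimal sketch of this image-feature branch, assuming a fixed patch size, a learned positional embedding and a simple linear projection as the visual encoder (a ViT- or CNN-style encoder could equally be substituted), might look as follows.

```python
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    """Sketch of the image-feature branch: split the document image into
    fixed-size sub-images, encode each with a visual encoder, add a positional
    embedding derived from the sub-image location, and stitch the results into
    one sequence of image features. Patch size, dimensions and the linear
    "visual encoder" are assumptions of this sketch."""

    def __init__(self, patch_size=16, embed_dim=768, max_patches=4096):
        super().__init__()
        self.patch_size = patch_size
        # Assumed visual encoder: a linear projection of flattened RGB patches.
        self.visual_encoder = nn.Linear(3 * patch_size * patch_size, embed_dim)
        # First position information: index of the sub-image in raster order.
        self.pos_embedding = nn.Embedding(max_patches, embed_dim)

    def forward(self, image):          # image: (3, H, W), H and W divisible by patch_size
        p = self.patch_size
        c, h, w = image.shape
        # Divide the target document image into sub-images of the target size.
        patches = image.unfold(1, p, p).unfold(2, p, p)          # (3, H/p, W/p, p, p)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
        first_feats = self.visual_encoder(patches)               # first sub-image features
        positions = torch.arange(first_feats.size(0))
        # Second sub-image features: visual feature plus positional embedding.
        second_feats = first_feats + self.pos_embedding(positions)
        # "Stitching" here is simply keeping the patches as one sequence.
        return second_feats                                      # (num_patches, embed_dim)
```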
Optionally, the text feature extraction unit may be specifically configured to:
performing word recognition processing on the target document image to obtain text information in the target document image;
word segmentation processing is carried out on the text information to obtain each semantic entity in the target document image and second position information of each semantic entity in the target document image;
generating word vectors of each semantic entity according to the word vector model;
and for each semantic entity, generating the text feature of the semantic entity according to the word vector of the semantic entity and the second position information.
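A minimal sketch of this text-feature branch is given below; the `ocr` and `word_vectors` interfaces, the embedding dimension and the position normalization are assumptions of the sketch (any OCR engine and any pre-trained word vector model could stand behind them).

```python
import numpy as np

def extract_text_features(image, ocr, word_vectors, dim=300):
    """Sketch of the text-feature branch. `ocr(image)` is assumed to yield
    (token, (x0, y0, x1, y1)) pairs after word recognition and word
    segmentation, and `word_vectors[token]` is assumed to return a
    pre-trained word embedding for the token."""
    features = []
    for token, (x0, y0, x1, y1) in ocr(image):
        vec = word_vectors.get(token, np.zeros(dim))         # word vector of the semantic entity
        pos = np.array([x0, y0, x1, y1], dtype=np.float32)   # second position information
        # Text feature = word vector concatenated with the normalized position.
        features.append((token, np.concatenate([vec, pos / max(image.shape)])))
    return features
```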
Optionally, the feature fusion unit may be specifically configured to:
and processing the second multi-modal feature by utilizing a multi-modal self-attention layer in the multi-modal model to obtain a first multi-modal feature of the semantic entity.
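By way of illustration, the multi-modal self-attention fusion might be sketched as follows; a single multi-head attention layer stands in for the multi-modal self-attention layer of the multi-modal model, and the common feature dimension and head count are assumptions.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of the fusion step: the second multi-modal feature (image patch
    features concatenated with per-entity text features along the sequence
    dimension) passes through a self-attention layer, and the positions
    corresponding to semantic entities are read back out as the first
    multi-modal features. Text features are assumed to have been projected to
    the same dimension as the image features beforehand."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_feats, text_feats):
        # image_feats: (num_patches, dim); text_feats: (num_entities, dim)
        second = torch.cat([image_feats, text_feats], dim=0).unsqueeze(0)  # second multi-modal feature
        fused, _ = self.attn(second, second, second)                       # self-attention over both modalities
        # First multi-modal features: the fused vectors at the entity positions.
        return fused[0, image_feats.size(0):]                              # (num_entities, dim)
```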
Optionally, the entity analysis module 703 may specifically include:
the type classification unit can be used for classifying the first multi-modal features of the semantic entity based on a first classification module in the multi-modal model to obtain type information of the semantic entity;
the second feature generating unit may be configured to construct, for each semantic entity, a third multi-modal feature of the semantic entity according to the first multi-modal feature of the semantic entity and the type information;
and the relation extraction unit can be used for extracting the relation information among the semantic entities according to the third multi-modal features of each semantic entity.
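A hedged sketch of the entity-analysis steps is given below; the linear first classification module, the number of entity types and the concatenation used to build the third multi-modal feature are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EntityTyping(nn.Module):
    """Sketch of the entity-analysis step: a first classification module
    predicts the type of each semantic entity from its first multi-modal
    feature, and a third multi-modal feature is built by combining the feature
    with an embedding of the predicted type."""

    def __init__(self, dim=768, num_types=8):
        super().__init__()
        self.type_classifier = nn.Linear(dim, num_types)   # first classification module
        self.type_embedding = nn.Embedding(num_types, dim)

    def forward(self, first_feats):                        # (num_entities, dim)
        logits = self.type_classifier(first_feats)
        types = logits.argmax(dim=-1)                      # type information of each entity
        # Third multi-modal feature: first feature concatenated with type embedding.
        third_feats = torch.cat([first_feats, self.type_embedding(types)], dim=-1)
        return types, third_feats                          # (num_entities,), (num_entities, 2*dim)
```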
Optionally, the relationship extraction unit specifically includes:
the candidate set creating subunit may be configured to create a candidate relation set including candidate relation information between any two semantic entities according to the third multi-modal feature of each semantic entity; the candidate relation information is generated according to the third multi-modal features of any two semantic entities and candidate relation identifiers;
the relationship judging subunit can be used for classifying the candidate relation information in the candidate relation set based on a second classification module in the multi-modal model to obtain a classification result for the candidate relation information;
And the relation generation subunit may be configured to, if the classification result indicates that the candidate relation information is true, use the candidate relation identifier in the candidate relation information as the relation information between two semantic entities corresponding to the candidate relation information.
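For illustration, the candidate-set construction and the second classification module might be sketched as follows; the two candidate relation identifiers, the one-hot relation encoding and the linear scorer are assumptions of the sketch.

```python
import itertools
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Sketch of the relation-extraction step: every ordered pair of semantic
    entities combined with every candidate relation identifier forms one item
    of the candidate relation set, and a second classification module scores
    each item as true or false."""

    RELATIONS = ["superior_subordinate", "parallel"]      # candidate relation identifiers

    def __init__(self, entity_dim=1536):
        super().__init__()
        rel_dim = len(self.RELATIONS)
        # Second classification module: binary classifier over (head, tail, relation).
        self.scorer = nn.Linear(2 * entity_dim + rel_dim, 2)

    def forward(self, third_feats):                       # (num_entities, entity_dim)
        results = []
        n = third_feats.size(0)
        for i, j in itertools.permutations(range(n), 2):  # candidate relation set
            for r, name in enumerate(self.RELATIONS):
                rel_onehot = torch.zeros(len(self.RELATIONS))
                rel_onehot[r] = 1.0
                cand = torch.cat([third_feats[i], third_feats[j], rel_onehot])
                if self.scorer(cand).argmax().item() == 1:  # classification result: true
                    results.append((i, j, name))
        return results
```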
Optionally, the relationship generating subunit may specifically be configured to:
performing line segment detection processing on the target document image to obtain line segment information in the target document image;
if the classification result indicates that the candidate relation information is true, judging whether a connecting line exists between two semantic entities corresponding to the candidate relation information according to the line segment information, and obtaining a first judgment result;
verifying the classification result according to the first judgment result;
and if the classification result passes the verification, the candidate relation identifier in the candidate relation information is used as the relation information between the two corresponding semantic entities.
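A possible sketch of this verification step is shown below, using OpenCV's Canny edge detection and probabilistic Hough transform for line segment detection; the thresholds and the endpoint-nearness test are assumptions of the sketch.

```python
import cv2
import numpy as np

def verify_relations(image_gray, relations, entity_boxes):
    """Sketch of the verification step: detect line segments in the document
    image and keep a predicted relation only if some detected segment has one
    endpoint near each of the two entities' bounding boxes. image_gray is an
    8-bit grayscale image; entity_boxes maps entity index to (x0, y0, x1, y1)."""
    edges = cv2.Canny(image_gray, 50, 150)
    segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                               minLineLength=20, maxLineGap=5)
    segments = [] if segments is None else segments[:, 0, :]   # (N, 4) as x1, y1, x2, y2

    def near_box(x, y, box, margin=10):
        x0, y0, x1, y1 = box
        return x0 - margin <= x <= x1 + margin and y0 - margin <= y <= y1 + margin

    verified = []
    for i, j, name in relations:                # classification result says "true"
        connected = any(
            (near_box(x1, y1, entity_boxes[i]) and near_box(x2, y2, entity_boxes[j])) or
            (near_box(x1, y1, entity_boxes[j]) and near_box(x2, y2, entity_boxes[i]))
            for x1, y1, x2, y2 in segments)     # first judgment result
        if connected:                           # the classification result passes verification
            verified.append((i, j, name))
    return verified
```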
Optionally, the target document image is any one of an organizational chart, a mind map, a flow chart, and a knowledge graph; the relation information is used for indicating whether the two semantic entities have a superior-subordinate relationship and whether the two semantic entities have a parallel relationship.
The respective modules in the above-described document image recognition apparatus may be implemented in whole or in part by software, by hardware, or by a combination thereof. Each of the above modules may be embedded in, or independent of, a processor of the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the respective modules.
Fig. 8 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Referring to fig. 8, an embodiment of the present disclosure provides an electronic device including: at least one processor 801; at least one memory 802, and one or more I/O interfaces 803, coupled between the processor 801 and the memory 802; the memory 802 stores one or more computer programs executable by the at least one processor 801, and the one or more computer programs are executed by the at least one processor 801 to enable the at least one processor 801 to perform the above-described document image recognition method.
The various modules in the electronic device described above may likewise be implemented in whole or in part by software, by hardware, or by a combination thereof. Each of these modules may be embedded in, or independent of, a processor of the device in the form of hardware, or may be stored in a memory of the device in the form of software, so that the processor can invoke and execute the operations corresponding to the respective modules.
The disclosed embodiments also provide a computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor/processing core implements the above-described document image recognition method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when executed in a processor of an electronic device, performs the above-described document image recognition method.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable program instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Static Random Access Memory (SRAM), flash memory or other memory technology, portable Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field-Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
The computer program product described herein may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (12)

1. A method for recognizing a document image, comprising:
acquiring a target document image to be identified; the target document image comprises texts and connecting lines between the texts;
extracting features of the target document image to obtain first multi-modal features of each semantic entity in the target document image; the first multi-modal feature is generated according to the image feature of the target document image and the text feature of each semantic entity in the target document image;
generating type information of each semantic entity according to the first multi-modal features, and determining relation information among the semantic entities;
and generating a structured document corresponding to the target document image according to the type information of the semantic entity and the relation information of each semantic entity.
2. The method according to claim 1, wherein the extracting features of the target document image to obtain the first multi-modal feature of each semantic entity in the target document image specifically includes:
extracting image features of the target document image to obtain the image features of the target document image;
extracting text features of the target document image to obtain the text features of each semantic entity; the text features are used for representing the semantics of the semantic entity and the position of the semantic entity in the target document image;
generating a second multi-modal feature of the target document image according to the image features and the text features of each semantic entity;
and processing the second multi-modal feature based on a self-attention mechanism to obtain a first multi-modal feature of the semantic entity.
3. The method according to claim 2, wherein the extracting the image feature of the target document image to obtain the image feature of the target document image specifically includes:
dividing the target document image to obtain sub-images of a target size and first position information of each sub-image in the target document image;
performing visual coding processing on each sub-image based on a visual encoder to obtain a first sub-image feature of the sub-image;
generating a second sub-image feature of each sub-image according to the first position information and the first sub-image feature;
and performing stitching processing on the second sub-image features to obtain the image features of the target document image.
4. The method according to claim 2, wherein the extracting text features of the target document image to obtain the text features of each semantic entity specifically includes:
performing word recognition processing on the target document image to obtain text information in the target document image;
word segmentation processing is carried out on the text information to obtain each semantic entity in the target document image and second position information of each semantic entity in the target document image;
generating word vectors of each semantic entity according to the word vector model;
and for each semantic entity, generating the text feature of the semantic entity according to the word vector of the semantic entity and the second position information.
5. The method according to claim 2, wherein the processing the second multi-modal feature based on the self-attention mechanism to obtain the first multi-modal feature of the semantic entity specifically comprises:
and processing the second multi-modal feature by utilizing a multi-modal self-attention layer in the multi-modal model to obtain a first multi-modal feature of the semantic entity.
6. The method according to claim 1, wherein the generating type information of each semantic entity according to the first multi-modal feature and determining relationship information between each semantic entity specifically comprises:
based on a first classification module in the multi-modal model, classifying the first multi-modal features of the semantic entity to obtain type information of the semantic entity;
for each semantic entity, constructing a third multi-modal feature of the semantic entity according to the first multi-modal feature and the type information of the semantic entity;
And extracting the relation information among the semantic entities according to the third multi-modal characteristics of each semantic entity.
7. The method according to claim 6, wherein the extracting the relationship information between the semantic entities according to the third multi-modal feature of each semantic entity specifically comprises:
creating a candidate relation set comprising candidate relation information between any two semantic entities according to the third multi-modal characteristic of each semantic entity; the candidate relation information is generated according to the third multi-modal characteristics of any two semantic entities and candidate relation identifiers;
based on a second classification module in the multi-modal model, classifying the candidate relation information in the candidate relation set to obtain a classification result for the candidate relation information;
and if the classification result indicates that the candidate relation information is true, the candidate relation identifier in the candidate relation information is used as the relation information between the two semantic entities corresponding to the candidate relation information.
8. The method according to claim 7, wherein, if the classification result indicates that the candidate relation information is true, using the candidate relation identifier in the candidate relation information as the relation information between the two semantic entities corresponding to the candidate relation information specifically includes:
Performing line segment detection processing on the target document image to obtain line segment information in the target document image;
if the classification result indicates that the candidate relation information is true, judging whether a connecting line exists between two semantic entities corresponding to the candidate relation information according to the line segment information, and obtaining a first judgment result;
verifying the classification result according to the first judgment result;
and if the classification result passes the verification, the candidate relation identifier in the candidate relation information is used as the relation information between the two corresponding semantic entities.
9. The method according to claim 7, wherein the target document image is any one of an organizational chart, a mind map, a flow chart, and a knowledge graph; and the relation information is used for indicating whether the two semantic entities have a superior-subordinate relationship and whether the two semantic entities have a parallel relationship.
10. A document image recognition apparatus, comprising:
the image acquisition module is used for acquiring a target document image to be identified; the target document image comprises texts and connecting lines between the texts;
the feature extraction module is used for extracting features of the target document image to obtain first multi-modal features of each semantic entity in the target document image; the first multi-modal feature is generated according to the image feature of the target document image and the text feature of each semantic entity in the target document image;
the entity analysis module is used for generating type information of each semantic entity according to the first multi-modal features and determining relation information among the semantic entities;
and the document creation module is used for generating a structured document corresponding to the target document image according to the type information of the semantic entities and the relation information of each semantic entity.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the method of recognizing a document image according to any one of claims 1 to 9.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of recognizing a document image according to any one of claims 1 to 9.
CN202311799594.2A 2023-12-25 2023-12-25 Document image recognition method and device, electronic equipment and storage medium Pending CN117765556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311799594.2A CN117765556A (en) 2023-12-25 2023-12-25 Document image recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311799594.2A CN117765556A (en) 2023-12-25 2023-12-25 Document image recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117765556A true CN117765556A (en) 2024-03-26

Family

ID=90325367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311799594.2A Pending CN117765556A (en) 2023-12-25 2023-12-25 Document image recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117765556A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination