CN114639109A - Image processing method and device, electronic equipment and storage medium

Info

Publication number
CN114639109A
Authority
CN
China
Prior art keywords
modal
characters
structured
image
character
Prior art date
Legal status
Pending
Application number
CN202210167897.1A
Other languages
Chinese (zh)
Inventor
曹浩宇
郑岩
郭安泰
姜德强
刘银松
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210167897.1A
Publication of CN114639109A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The application discloses an image processing method, an image processing apparatus, an electronic device, and a storage medium, which relate to the field of computer technology. The method comprises the following steps: performing feature extraction on N characters contained in an image to be structured to obtain multi-modal features corresponding to the N characters, and obtaining initial reference information based on the obtained N multi-modal features; inputting the initial reference information and the N multi-modal features into a document multi-modal model, and performing the following operations in a loop-iteration manner until the structured data preceding a termination character has been output: reading one multi-modal feature and obtaining historical structured data; and obtaining, based on the current reference information, the multi-modal feature, and the historical structured data, the structured data of the character corresponding to the multi-modal feature and updated reference information. In this way, the document multi-modal model supports structuring documents of various layouts, so that the accuracy of the character structuring result is improved.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of computer technology, text detection and recognition can be performed in various scenarios based on Optical Character Recognition (OCR) technology. For example, an unstructured image can be structured into semi-structured data, such as text, based on OCR technology.
In the related art, when an image is structured based on optical character recognition technology, the following schemes are generally adopted:
Scheme 1: the electronic device receives the image to be structured, determines a template image corresponding to it, and maps the image to be structured onto the template image, so that the structured result of the corresponding field in the image to be structured is extracted according to the anchor point features in the template image and the recognition result is obtained. The anchor point features are, for example, fixed words, field distribution, and the like.
However, after the image to be structured is rotated or deformed, that is, after its layout changes, a matching template image may not be found with Scheme 1, so an accurate recognition result cannot be obtained.
Scheme 2: the electronic device receives the image to be structured, calls a dedicated text field detector to detect the positions of the required structured fields, and then calls a text recognizer to recognize the fields at the corresponding positions, so as to obtain a character recognition result.
However, when the layout of the image to be structured is complex and the field types are numerous, the positions of the required structured fields may not be located accurately with Scheme 2, so a highly accurate character structuring result cannot be obtained.
In summary, in the related art, when an electronic device structures an image, it cannot handle images with diverse layouts and diverse field types, and the accuracy of the character structuring result is therefore poor.
Disclosure of Invention
The embodiments of the application provide an image processing method, an image processing apparatus, an electronic device, and a storage medium, which provide a character structuring scheme that is not limited by layout type or field type, so that the structured data of the characters in an image can be obtained accurately.
In a first aspect, an image processing method is provided, including:
carrying out feature extraction on N characters contained in an image to be structured to obtain multi-modal features corresponding to the N characters; wherein each modality represents one character attribute, and N is a positive integer;
obtaining initial reference information based on the obtained N multi-modal features, wherein the reference information comprises the attention weight of each layer of a document multi-modal model and a comprehensive multi-modal feature corresponding to the image to be structured;
inputting the initial reference information and the N multi-modal features into the document multi-modal model, and performing the following operations in a loop-iteration manner until the structured data preceding a termination character has been output:
reading one multi-modal feature and obtaining historical structured data, wherein the historical structured data is the structured data of the character corresponding to the previously read multi-modal feature; if the multi-modal feature is the first one read, the historical structured data is empty;
and obtaining the structured data of the character corresponding to the multi-modal feature and the updated reference information based on the current reference information, the multi-modal feature, and the historical structured data.
In a second aspect, there is provided an image processing apparatus comprising:
the extraction unit is used for performing feature extraction on N characters contained in the image to be structured to obtain multi-modal features corresponding to the N characters; wherein each modality represents one character attribute, and N is a positive integer;
the determination unit is used for obtaining initial reference information based on the obtained N multi-modal features, wherein the reference information comprises the attention weight of each layer of the document multi-modal model and a comprehensive multi-modal feature corresponding to the image to be structured;
the processing unit is used for inputting the initial reference information and the N multi-modal features into the document multi-modal model, and performing the following operations in a loop-iteration manner until the structured data preceding a termination character has been output:
reading one multi-modal feature and obtaining historical structured data, wherein the historical structured data is the structured data of the character corresponding to the previously read multi-modal feature; if the multi-modal feature is the first one read, the historical structured data is empty;
and obtaining the structured data of the character corresponding to the multi-modal feature and the updated reference information based on the current reference information, the multi-modal feature, and the historical structured data.
Optionally, the apparatus further comprises a training unit, configured to:
determining an overall loss function; the overall loss function comprises a sequence-to-sequence task loss function;
after the preset document multi-modal model is trained, performing a convergence check on the trained preset document multi-modal model through the overall loss function;
and when the trained preset document multi-modal model is determined to be converged, obtaining the document multi-modal model.
In a third aspect, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the image processing method according to any one of the first aspect.
In a fourth aspect, a computer-readable storage medium is proposed, on which a computer program is stored, which computer program, when being executed by a processor, realizes the image processing method of any of the first aspect.
In a fifth aspect, a computer program product is proposed, comprising a computer program which, when executed by a processor, implements the image processing method of any of the first aspects described above.
The beneficial effects of this application are as follows:
in the embodiments of the application, an image processing method, an image processing apparatus, an electronic device, and a storage medium are provided, wherein feature extraction is performed on N characters contained in an image to be structured to obtain multi-modal features corresponding to the N characters, and initial reference information is obtained based on the obtained N multi-modal features; the initial reference information and the N multi-modal features are input into a document multi-modal model, and the following operations are performed in a loop-iteration manner until the structured data preceding the termination character has been output: reading one multi-modal feature and obtaining historical structured data; finally, the structured data of the character corresponding to the multi-modal feature and updated reference information are obtained based on the current reference information, the multi-modal feature, and the historical structured data. Therefore, the structuring method based on the document multi-modal model can support the structuring of various documents, and the accuracy of the character structuring result can be improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or related technologies, the drawings needed to be used in the description of the embodiments or related technologies are briefly introduced below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is an alternative schematic diagram of an application scenario in an embodiment of the present application;
FIG. 2 is a diagram illustrating an overall structure of a document multi-modal model in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a stacked layered cross-modal encoder according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a modal awareness mask module MAMM according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a process of training a document multimodal model in an embodiment of the present application;
FIG. 6 is a schematic diagram of OCR error correction in an embodiment of the present application;
FIG. 7 is a flowchart illustrating an image processing method according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating an implementation of an image processing method in an embodiment of the present application;
FIG. 9 is a diagram illustrating an image processing result according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating an exemplary configuration of an image processing apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic diagram of a hardware component structure of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
For the convenience of understanding the technical solutions provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained first:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision (CV) technology: computer vision is the science of studying how to make machines "see"; it uses cameras and computers instead of human eyes to perform machine vision tasks such as identification, tracking, and measurement of targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
NER-LM: an NER model pre-trained with a language model. Named Entity Recognition (NER), also referred to as entity recognition, entity chunking, and entity extraction, is a subtask of information extraction that aims to locate and classify named entities in text into predefined categories such as persons, organizations, locations, temporal expressions, quantities, monetary values, percentages, etc.
MD-Bert (Modal Decoupling BERT): a stacked layered cross-modal encoder, where BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional Transformer encoder that captures character-level, word-level, sentence-level, and even inter-sentence relational features.
Unidirectional LM: a left-to-right LM or a right-to-left LM, where a left-to-right LM can be understood as using the words to the left of a masked word to predict it, and a right-to-left LM can be understood as using the words to the right of a masked word to predict it.
Bidirectional LM: both the words to the left and to the right of the current masked word are visible.
Sequence-to-sequence LM: the inputs are a source sentence and a target sentence, and the masked word lies in the target sentence; the context visible to the current masked word consists of all words of the source sentence and the words to the left of the masked word in the target sentence.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to the technologies of document structuralization and the like in the computer vision of artificial intelligence, and the document multi-modal model is built and trained based on the machine learning technology. The embodiment of the invention can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic, auxiliary driving and the like.
The following briefly introduces the design concept of the embodiments of the present application:
as described above, the OCR structuring methods provided in the related art have the following problems: the template registration method based on image features has high requirements on image quality and character recognition results and has difficulty coping with rotation, perspective, distortion, and the like; and the customized detection method based on text features can reduce the dependence on image quality to a certain extent but performs poorly on complex layouts.
In addition, the above methods cannot deal with OCR structured scenes with complex formats or without formats, and cannot deal with OCR recognition error problems, so that the application range is limited.
In view of this, the present application provides an image processing method, an image processing apparatus, an electronic device, and a storage medium, based on which a generation-based OCR structuring method can be provided; that is, the image to be structured is processed by a document multi-modal model obtained through multi-modal sequence-to-sequence pre-training on a large-scale corpus, so as to obtain the structured data corresponding to the image to be structured. This addresses the problems that OCR structuring methods in the related art cannot handle, such as variable layouts, diverse field types, and OCR errors, and improves the accuracy of the character structuring result.
The preferred embodiments of the present application will be described in conjunction with the drawings of the specification, it should be understood that the preferred embodiments described herein are for purposes of illustration and explanation only and are not intended to limit the present application, and features of the embodiments and examples of the present application may be combined with each other without conflict.
Fig. 1 is a schematic diagram of a possible application scenario in the embodiment of the present application. The application scenario includes the terminal devices 110 (the terminal device 1101 of object 1, the terminal device 1102 of object 2, …, and the terminal device 110n of object n) and the electronic device 120.
In the embodiment of the present application, the terminal device 110 includes, but is not limited to, an electronic device such as a desktop computer, a mobile phone, a mobile computer, a tablet computer, a media player, a smart wearable device, a smart television, an in-vehicle device, and a Personal Digital Assistant (PDA).
The electronic device 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. And the electronic equipment can also be desktop computers, mobile phones, mobile computers, tablet computers and the like.
In the embodiment of the present application, a communication connection is established between the terminal device 110 and the electronic device 120 through a communication network by using a connection manner of wired connection or wireless connection.
In a possible technical solution of the present application, the object may determine the image to be structured on the terminal device 1101, and then send the image to be structured to the electronic device 120 through the terminal device 1101; alternatively, the electronic device 120 may send an acquisition request for acquiring the image to be structured to the terminal device 110, so as to receive the image to be structured sent by the terminal device 110. That is to say, in the embodiment of the present application, the image to be structured may be determined by the terminal device, or the image to be structured may be determined by the electronic device, which is not limited in this application.
The technical solution provided by the application can be applied to various OCR structuring scenarios with non-fixed or fixed layouts, such as optimal-image intelligent structured OCR and receipt recognition OCR services.
Specifically, the image to be structured is determined based on a determination rule corresponding to the object or the electronic device, where one or more images to be structured may be specifically determined according to actual processing requirements, and the following description takes the example of one image to be structured as an example.
Scenario 1: the image to be structured is determined from an image selected or input by the object on the terminal device 110.
Specifically, the object may select or input an image on the terminal device 110, and trigger a request for structuring processing of the image, so that the terminal device 110 may transmit the image to the electronic device 120, and the electronic device 120 may determine the image to be structured.
For example, after the object selects an image including receipt information on the terminal device 1102, the terminal device 1102 may transmit the image to the electronic device 120 as an image to be structured, and then the electronic device 120 may determine the image to be structured.
Scenario 2: the electronic device 120 determines the image to be structured based on a corresponding determination rule.
In particular, the electronic device 120 may determine the images to be structured in sequence based on corresponding determination rules, for example, in the order in which the images are input to the electronic device 120. For example, the electronic device 120 stores 30 images, and then determines an image with an order of 1 as an image to be structured in the storage order of the 30 images. Of course, other determination rules may be used, which are not limited in the embodiments of the present application.
In the embodiment of the application, after the electronic device 120 determines the image to be structured, the electronic device 120 performs feature extraction on the N characters included in the image to be structured to obtain multi-modal features corresponding to the N characters, and obtains initial reference information based on the obtained N multi-modal features; it inputs the initial reference information and the N multi-modal features into a document multi-modal model and performs the following operations in a loop-iteration manner until the structured data preceding the termination character has been output: reading one multi-modal feature and obtaining historical structured data; finally, the structured data of the character corresponding to the multi-modal feature and updated reference information are obtained based on the current reference information, the multi-modal feature, and the historical structured data. Therefore, the structuring method based on the document multi-modal model can support the structuring of various documents, and the accuracy of the character recognition result can be improved.
In order to better explain the image processing method provided in the present application, the training process of the document multi-modal model is first explained, taking as an example the case where the electronic device 120 alone completes the training of the model.
The structure of the document multi-modal model is first explained with reference to the accompanying drawings:
referring to fig. 2, a schematic diagram of an overall structure of a possible document multi-modal model in the embodiment of the present application is shown. The constructed document multi-mode model comprises two parts of a stacked layered cross-mode encoder and an iteration generation part.
In the embodiment of the application, the electronic device can obtain initial reference information from the obtained multi-modal features, wherein the reference information comprises historical memory states of each layer of the document multi-modal model and comprehensive multi-modal features corresponding to the image to be structured. Specifically, the initial reference information may be understood as M0 in fig. 2. The electronic device may then input the initial reference information and the multi-modal characteristics of the one character into the document multi-modal model, which then processes it to obtain the corresponding structured data and updated reference information, i.e., M1 in fig. 2.
In the embodiment of the present application, please refer to fig. 3, in which fig. 3 is a schematic structural diagram of a possible stacked layered cross-mode encoder in the embodiment of the present application. In particular, the stacked layered cross-modality encoder shown in FIG. 3 can be understood as part of the document multimodal model of FIG. 2 that processes the initial reference information and the multimodal features of one character.
In the embodiment of the present application, as shown in fig. 3, the stacked layered cross-modal encoder in fig. 3 includes a plurality of concat operations, which can be understood as the concat() method used to concatenate two or more arrays; this method does not change the existing arrays but returns a copy of the concatenated arrays.
In the embodiment of the present application, as shown in fig. 3, the stacked layered cross-modal encoder includes a plurality of modal decoupling blocks. Specifically, each modal decoupling block comprises a modal-aware mask module (MAMM), an Add & Norm layer connected to the MAMM, a Feed Forward layer connected to that Add & Norm layer, and another Add & Norm layer connected to the Feed Forward layer. Feed Forward can be understood as a feed-forward network layer or feed-forward neural network layer, and Add & Norm can be understood as a residual connection that smoothly integrates the input of a sub-layer with its output, followed by a Layer Norm layer whose normalization is computed over the last dimension. Moreover, the layer normalization prevents the values within the network layer from varying too much, which helps speed up training and improves generalization.
Specifically, in fig. 3, Multi-modal embeddings denote the multi-modal embedded features, i.e., the multi-modal features of one character, Token Prediction denotes the prediction of the next token, and Decoder output denotes the decoder output, from which the aforementioned structured data of one character is obtained. In fig. 3, t denotes the time step, which in this application can be understood as the interval between processing two characters; S denotes the text feature of the character, X and Y denote the layout features of the character, V denotes the visual information of the character, F denotes the historical structured data, and M denotes the reference information.
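As an illustration of the modal decoupling block just described (a MAMM followed by Add & Norm, Feed Forward, and another Add & Norm), a minimal PyTorch-style sketch is given below. It is only an approximation under assumed module names and dimensions, not the patent's exact implementation; the MAMM itself is sketched after its detailed description further below and is passed in here as a module.

```python
import torch.nn as nn

class ModalDecouplingBlock(nn.Module):
    """One block of the stacked layered cross-modal encoder:
    MAMM -> Add & Norm -> Feed Forward -> Add & Norm (names and sizes assumed)."""
    def __init__(self, d_model: int, d_ff: int, mamm: nn.Module):
        super().__init__()
        self.mamm = mamm                       # modal-aware mask module, sketched later
        self.norm1 = nn.LayerNorm(d_model)     # Layer Norm computed over the last dimension
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text, layout, visual, attn_mask=None):
        fused = self.mamm(text, layout, visual, attn_mask)   # modal-aware attention output
        h = self.norm1(text + fused)                         # residual connection + Layer Norm
        h = self.norm2(h + self.feed_forward(h))             # Feed Forward + residual + Layer Norm
        return h
```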
It can be seen that the stacked layered cross-modal encoder in the document multi-modal model provided in the present application takes the memory M, i.e., the historical memory state, and the historical structured data F as input, generates the output Dt, and updates the memory M, where the memory comprises the historical state of each network layer of the document multi-modal model and the previously embedded multi-modal features.
In a specific implementation, at the first time step, i.e., the interval between processing the zeroth character and processing the first character, M, the historical memory state, is empty (there is no historical memory state yet), the historical structured data is empty, and the multi-modal features of all characters are input, so that the initial reference information and the comprehensive multi-modal feature corresponding to the image to be structured can be obtained.
In a specific implementation process, at any time step after the first time step, the electronic device may input the output or the termination character of the previous time step to the document multi-modal model, so that the document multi-modal model may obtain the structured data of the character corresponding to any time step based on the output or the termination character of the previous time step and the updated reference information.
For example, referring to FIG. 2, at the third time step, the document multi-modal model takes M1, Date0, and the multi-modal features of one character as input, and outputs Date1 and M2.
Obviously, in the present application, in order to learn more common features and make full use of the pre-training data, a unified encoder-decoder module named MD-Bert, namely the aforementioned stacked layered cross-modal encoder, is proposed, which is composed of stacked layered multi-modal Transformer encoders.
Specifically, for the stacked layered cross-modal encoder, in the encoding stage the OCR texts can attend to each other, which corresponds to an auto-encoding model; in the decoding stage, the decoded text field can see the whole OCR input as well as the already-decoded text, which is equivalent to a unidirectional decoder.
In the present embodiment, it is considered that multi-modal models in the related art generally fuse the modalities by adding or concatenating the different modality features, which inevitably introduces unnecessary correlations between them. These correlations hinder the model's capability, because the different data modes, i.e., modalities, are orthogonal in nature; therefore, a customized processing flow is designed for each modality feature in the present application.
In particular implementations, to address the different concerns of the attention module, masks are used in this application to control the portion of the context that the token should pay attention to when computing its contextual representation. As shown in fig. 4, it can be seen that, in the present application, text features corresponding to characters, i.e., semantics of the characters, layout features, i.e., layouts of the characters, and visual features, i.e., input features of computer vision of the characters, are mapped to a hidden state.
Specifically, the present application proposes the MAMM (modal-aware masking module) encoder, which is a multi-modal Transformer model with a layered structure, that is, the different modalities are jointly modeled in a decoupled manner. Thus, when the modal content corresponding to the text features, layout features, and visual features is input to the MAMM, the MAMM first calculates the attention scores of each modality separately, then adds the attention scores together to obtain a fusion score, and finally uses the fusion score to apply masking and the subsequent operations to the semantic content.
In the present embodiment, the MAMM follows the design of the basic module in BERT but replaces multi-head attention with modal-aware multi-head attention. The MAMM also contains the Feed Forward (FF) layer, residual connections, and Layer Normalization (LN), while parameters are not shared across modalities.
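A minimal sketch of the modal-aware attention inside the MAMM, following the description above: the attention scores of each modality are computed separately, summed into a fusion score, and the fusion score (together with the mask) is applied to the semantic content. This is an illustrative single-head simplification; the projection names, the unshared per-modality parameters, and the masking convention are assumptions rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalAwareAttention(nn.Module):
    """Per-modality attention scores summed into a fusion score (single head for brevity)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.scale = d_model ** -0.5
        # separate (unshared) query/key projections per modality
        self.q_text, self.k_text = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.q_layout, self.k_layout = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.q_visual, self.k_visual = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.v_text = nn.Linear(d_model, d_model)   # values come from the semantic (text) content

    def forward(self, text, layout, visual, attn_mask=None):
        # attention scores computed separately for each modality, then added into a fusion score
        score = (self.q_text(text) @ self.k_text(text).transpose(-2, -1)
                 + self.q_layout(layout) @ self.k_layout(layout).transpose(-2, -1)
                 + self.q_visual(visual) @ self.k_visual(visual).transpose(-2, -1)) * self.scale
        if attn_mask is not None:                # mask controls which context each token may see
            score = score.masked_fill(attn_mask == 0, float("-inf"))
        weights = F.softmax(score, dim=-1)
        return weights @ self.v_text(text)       # masking and subsequent ops applied to semantics
```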
Referring to fig. 5, which is a schematic diagram of a training process of a document multi-modal model in an embodiment of the present application, the following describes a training process of the document multi-modal model in the embodiment of the present application with reference to fig. 5:
step 501: pre-training an initial document multi-modal model based on a preset task to obtain a preset document multi-modal model; the preset tasks comprise a one-way generation task, a two-way generation task and a sequence-to-sequence generation task.
In the embodiment of the application, the initial document multi-modal model can be pre-trained based on three cloze-style (fill-in-the-blank) tasks, namely the unidirectional LM, the bidirectional LM, and the sequence-to-sequence LM.
Specifically, in the aforementioned cloze tasks, some tokens in the input may be randomly selected and replaced with the special mark [MASK]. The corresponding output vectors computed by the Transformer network can then be fed into a softmax classifier to predict the masked tokens. That is, parameter learning of the document multi-modal model is performed by minimizing the cross-entropy loss, which is computed from the predicted tokens and the original tokens.
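As a concrete illustration of this pre-training objective (random [MASK] replacement, softmax prediction of the masked tokens, and cross-entropy minimization), a minimal sketch follows. The masking ratio, the ignore-index convention, and the encoder/projection interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(token_ids, encoder, vocab_proj, mask_token_id, mask_prob=0.15):
    """Randomly replace tokens with [MASK], predict them, and return the cross-entropy loss."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob  # positions to mask
    labels[~mask] = -100                                  # ignore unmasked positions in the loss
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id
    hidden = encoder(corrupted)                           # (batch, seq_len, d_model)
    logits = vocab_proj(hidden)                           # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```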
Step 502: and establishing a sample set according to the plurality of sample images and the marked structured data corresponding to the characters in the sample images.
In the embodiment of the present application, a large number of sample images may be selected based on the actual implementation process; optionally, billions of OCR texts may be used. Specifically, a large number of sample images may be selected first, and optical character recognition (OCR) processing may then be performed on the sample images to obtain a large number of OCR texts.
In a specific implementation process, images with different layouts and images from different fields can be selected from the databases corresponding to various service scenarios, and OCR processing is then performed on the selected images to obtain the OCR texts. Further, a sample set can be established based on the OCR texts and the labeled structured data corresponding to the characters in the OCR texts.
Step 503: inputting sample images in a sample set into a preset document multi-modal model for training to obtain a plurality of structured data, and comparing the plurality of structured data with corresponding marked structured data respectively to obtain a plurality of comparison results; each structured data is correspondingly determined based on the multi-modal characteristics of one character, the current reference information and the structured data of the character before the character.
In the embodiment of the application, the electronic device can input the sample images in the sample set into the preset document multi-modal model for training, so that the electronic device can obtain a plurality of corresponding pieces of structured data. Further, the electronic device may compare the plurality of pieces of structured data with the corresponding labeled structured data, respectively, to obtain a plurality of comparison results. That is, the electronic device can compare the predicted tokens, i.e., the obtained structured data, with the original tokens, i.e., the labeled structured data.
Step 504: and adjusting the preset document multi-modal model according to a plurality of comparison results to obtain the document multi-modal model meeting the preset convergence condition.
In this embodiment, the electronic device may adjust the preset document multi-modal model based on the determined multiple comparison results. Specifically, the electronic device may determine an overall loss function; wherein the overall loss function comprises a sequence-to-sequence task loss function. Furthermore, after the preset document multi-modal model is trained, the electronic equipment can perform convergence check on the trained preset document multi-modal model through an overall loss function; and when the electronic equipment determines that the trained preset document multi-modal model is converged, obtaining the document multi-modal model.
In particular, the NER-LM task extends the sequence-to-sequence LM to better constrain the integrity of entities. For example, continuing with FIG. 4, the text in the first segment (i.e., the source OCR text) can attend to itself from both directions within the segment, while the text of the second segment (i.e., the target structured data) can only attend to its own left context in the target and to the markers in the source segment. Specifically, given source segments s_{1}, s_{2}, the corresponding entity types n_{1}, n_{2}, and some background sentences B_{1}, B_{2} that contain the entity values, the input is formed as A = "[BEG] s_{1} B_{1} s_{2} B_{2} [SEP]" and B = "[BEG] n_{1} s_{1} n_{2} s_{2} [SEP]". Thus, each token in A has access to all other tokens of A, while each token in B has access to all tokens in A and the previous tokens in B. That is, the target entities in B are masked for prediction during training.
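A minimal sketch of the visibility mask implied by this sequence-to-sequence formulation: every token of segment A (the source OCR text) may attend to all of A, while every token of segment B (the target structured data) may attend to all of A plus only the preceding tokens of B. The boolean convention (1 = may attend) and the tensor layout are assumptions.

```python
import torch

def seq2seq_attention_mask(len_a: int, len_b: int) -> torch.Tensor:
    """Return an (len_a+len_b) x (len_a+len_b) mask where 1 means 'may attend'."""
    total = len_a + len_b
    mask = torch.zeros(total, total, dtype=torch.long)
    mask[:, :len_a] = 1                                     # everyone sees the whole source segment A
    tgt = torch.tril(torch.ones(len_b, len_b, dtype=torch.long))
    mask[len_a:, len_a:] = tgt                              # target segment B sees only its left context
    return mask

# e.g. a 3-token source and a 2-token target:
# print(seq2seq_attention_mask(3, 2))
```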
In the embodiment of the application, since billions of different versions and/or different fields of OCR texts are used in model pre-training, the document multi-modal model obtained by training has the context perception capability. In this way, automatic correction can be achieved when input containing typical OCR erroneous text is encountered, thereby improving the accuracy of character recognition.
For example, referring to FIG. 6, when a space character or a fixed field is missing in the input OCR text, the document multi-modal model in the present application can automatically correct it, thereby outputting structured data of the characters with higher accuracy.
In the embodiment of the present application, after the training process of the document multi-modal model is introduced, the image processing method provided in the present application is described.
Fig. 7 is a flowchart illustrating a possible image processing method according to an embodiment of the present application.
Step 701: performing feature extraction on the N characters contained in the image to be structured to obtain multi-modal features corresponding to the N characters; wherein each modality represents one character attribute, and N is a positive integer.
Optionally, before performing step 701, the electronic device may further perform an optical character recognition OCR process on the image to be structured to obtain a character recognition result.
In the embodiment of the application, after the electronic device obtains the character recognition result or determines the image to be structured, feature extraction can be performed on the image to be structured or the N characters included in the character recognition result, so as to obtain multi-modal features corresponding to the N characters.
In this embodiment of the present application, the electronic device may perform feature extraction on the N characters to obtain the multi-modal features corresponding to the N characters using, but not limited to, the following steps:
step a: and carrying out text information coding processing on the N characters included in the image to be structured or the character recognition result to obtain text characteristics corresponding to the N characters.
In a specific implementation process, the semantic content of each text segment can be acquired from the character recognition result, considering that the semantic content is the most reliable information for structured extraction.
In the embodiment of the present application, text information encoding processing is performed on characters, and the processing of obtaining text features corresponding to each of N characters may be understood as converting a text corresponding to a character recognition result into an input of a network.
In a specific implementation process, the electronic device may divide the text corresponding to the character recognition result according to a predetermined batch length L to determine at least one token sequence. That is to say, in the embodiment of the present application, the text corresponding to the character recognition result may be processed by using the token sequence as a processing unit. Therefore, in the specific implementation process, the structured data corresponding to one token sequence can be determined firstly, then the structured data corresponding to the next token sequence can be determined, and the structured data corresponding to the character recognition result can be determined orderly and efficiently.
In particular, a start indication tag [BEG] may be added at the front of each token sequence, and an end indication tag [SEP] is appended to the end of the sequence. Furthermore, when the length of the current token sequence differs from the predetermined batch length L, additional padding tags [PAD] may be added; as can be seen, these serve to align the length of the sequence with the predefined batch length L.
Based on the foregoing, the electronic device can obtain a token sequence input to the document multimodal model, which is expressed as:
S = [[BEG], w1, w2, …, wn, [SEP], [PAD], …], |S| = L.
It should be noted that, although a fixed length is used during training, in a specific implementation the document multi-modal model can effectively handle variable lengths longer than those used at training time, because the novel position embedding method Cord2Vec is adopted.
In particular, the text features corresponding to the characters included in the image to be structured in this application may be understood as w in the token sequence.
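A minimal sketch of assembling the token sequence S described in step a ([BEG], the character tokens w1…wn, [SEP], then [PAD] up to the batch length L). The tag strings follow the description above; the function name and the handling of over-long text are assumptions.

```python
def build_token_sequence(chars, batch_length):
    """Wrap the characters with [BEG]/[SEP] and pad the sequence to the batch length L."""
    seq = ["[BEG]"] + list(chars) + ["[SEP]"]
    if len(seq) > batch_length:                      # over-long text is split into further sequences
        raise ValueError("split the text into more token sequences of length L")
    seq += ["[PAD]"] * (batch_length - len(seq))     # align to the predefined batch length
    return seq

# e.g. build_token_sequence(["D", "a", "t", "e"], 8)
# -> ['[BEG]', 'D', 'a', 't', 'e', '[SEP]', '[PAD]', '[PAD]']
```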
Step b: carrying out layout information coding processing on N characters included in the image to be structured to obtain layout characteristics corresponding to the N characters; the layout features are used to characterize the spatial and sequence order information corresponding to the characters.
In the embodiment of the application, considering that the structured task is a typical two-dimensional scene, the relative position of the characters is crucial to the meaning of the characters, and the conventional sequence labeling method needs to perform reading sequence ordering processing, which is tedious and complex. Therefore, the present application provides a unified two-dimensional & one-dimensional coding scheme, based on which spatial and sequence order information can be coded simultaneously, thereby avoiding serialization operations.
In this embodiment of the application, the electronic device may perform normalization and discretization on coordinates corresponding to the N characters, respectively obtain coordinates after respective processing of the N characters, and then determine layout features corresponding to the N characters, respectively, based on the coordinates after respective processing of the N characters and the preset bounding box, where the layout features include at least one of an upper-left corner coordinate, a lower-right corner coordinate, and a width and a height of the preset bounding box.
In particular implementations, the electronic device may normalize and discretize the coordinates into integers within a range, for example [0, α], where α is the maximum scale of the input, i.e., the maximum scale of the coordinates of the input characters. In addition, the electronic device may represent the bounding box of a character by its vertex coordinates together with its width and height.
In an embodiment of the application, the electronic device may represent layout information of the character based on the tuples. Optionally, the layout information may be expressed as: (x0, x1, w), (y0, y1, h), wherein (x0, y0) and (x1, y1) are used to characterize the top left and bottom right coordinates of each character, and w, h are the width and height, respectively, used to characterize the character. Thus, the position information and word order information of the character can be correspondingly determined based on the tuple of the character.
Optionally, the electronic device may correspondingly determine the normalized bounding box based on the Cord2Vec method, so that tuple information corresponding to each character may be correspondingly determined based on the normalized bounding box, and then the layout feature corresponding to each character is determined.
In a specific implementation, the electronic device may assume that each character is distributed within a grid of [ W, H ], and that each character occupies a pixel having a width and a height of 1 on a line-first basis. Then, the x-axis feature and the y-axis feature can be embedded using two embedding layers, respectively, so that the normalized bounding box that can correspond to the determination of the ith character marker can be expressed as:
xi = PosEmb2Dx(x0, x1, w), yi = PosEmb2Dy(y0, y1, h).
Here, PosEmb2Dx and PosEmb2Dy are position embedding functions that take the coordinates as input, embed each input element separately, and then add the embeddings element-wise.
In a specific implementation process, in consideration of the possible presence of placeholders, the placeholders can be regarded as uniformly divided grid cells, so that the bounding box coordinates corresponding to the placeholders can be determined, and then the layout features corresponding to the placeholders can be determined. For example, an empty bounding box X_{PAD} = (0, 0, 0), Y_{PAD} = (0, 0, 0) is assigned to [PAD], and X_{SEP} = (0, w, w), Y_{SEP} = (0, h, h) is assigned to the other special tags [BEG] and [SEP].
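A minimal sketch of the layout encoding of step b: the character coordinates are normalized and discretized into [0, α], expressed as the tuples (x0, x1, w) and (y0, y1, h), and embedded by two embedding layers (PosEmb2Dx and PosEmb2Dy) whose element embeddings are summed. Embedding sizes, the exact normalization, and the summing of tuple elements are assumptions in the spirit of Cord2Vec.

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    """Embed (x0, x1, w) with one table and (y0, y1, h) with another, then sum element-wise."""
    def __init__(self, alpha: int, d_model: int):
        super().__init__()
        self.alpha = alpha
        self.emb_x = nn.Embedding(alpha + 1, d_model)   # plays the role of PosEmb2Dx
        self.emb_y = nn.Embedding(alpha + 1, d_model)   # plays the role of PosEmb2Dy

    def discretize(self, v, max_v):
        # normalize into [0, alpha] and round down to an integer index
        return torch.clamp((v / max_v * self.alpha).long(), 0, self.alpha)

    def forward(self, box, page_w, page_h):
        # box = (x0, y0, x1, y1) in absolute pixels for one character
        x0, y0, x1, y1 = box
        xt = self.discretize(torch.tensor([x0, x1, x1 - x0], dtype=torch.float), page_w)
        yt = self.discretize(torch.tensor([y0, y1, y1 - y0], dtype=torch.float), page_h)
        # embed each element of the tuple, then add the embeddings element-wise
        return self.emb_x(xt).sum(dim=0) + self.emb_y(yt).sum(dim=0)
```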
Step c: and carrying out image information coding processing on the N characters included in the image to be structured to obtain visual characteristics corresponding to the N characters.
In a particular implementation, ResNet-18 may be used as the base model of the visual encoder. Specifically, the electronic device may resize the image to be structured to a W × H image and input it into the ResNet-18 based visual encoder, which scales the feature map to a fixed size by average pooling, for example a width of W/n and a height of H/n. Finally, RoIAlign is applied to extract a fixed-size visual embedding for each token, so that the visual features corresponding to the N characters can be obtained. It should be noted that, since the CNN visual backbone cannot capture position information, in an actual implementation the electronic device additionally adds layout position embeddings to the image token embeddings, so that the visual features of each character can be determined accurately.
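A minimal sketch of the visual branch of step c: a ResNet-18 backbone produces a feature map, the map is average-pooled to a fixed grid, and RoIAlign extracts a fixed-size embedding per character box. torchvision is assumed as the library, and the grid size, output size, and scaling details are assumptions.

```python
import torch
import torchvision
from torchvision.ops import roi_align

# ResNet-18 trunk without its average-pooling and classification head
backbone = torch.nn.Sequential(*list(torchvision.models.resnet18().children())[:-2])

def char_visual_features(image, char_boxes, grid=(32, 32), out_size=1):
    """image: (3, H, W) float tensor; char_boxes: (N, 4) float boxes (x0, y0, x1, y1) in pixels."""
    feat = backbone(image.unsqueeze(0))                         # (1, 512, H/32, W/32)
    feat = torch.nn.functional.adaptive_avg_pool2d(feat, grid)  # pool the feature map to a fixed size
    sx = grid[1] / image.shape[-1]                              # map pixel boxes onto the pooled grid
    sy = grid[0] / image.shape[-2]
    boxes = char_boxes * torch.tensor([sx, sy, sx, sy])
    rois = roi_align(feat, [boxes], output_size=out_size)       # (N, 512, out_size, out_size)
    return rois.flatten(1)                                      # one fixed-size embedding per character
```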
In an actual implementation process, steps a, b, and c may be executed in parallel or separately, and their execution order is not limited.
Step d: and obtaining multi-modal characteristics corresponding to the N characters based on the text characteristics, the layout characteristics and the visual characteristics corresponding to the N characters.
In the embodiment of the application, after the electronic device obtains the text feature, the layout feature and the visual feature corresponding to each of the N characters, the multi-modal feature corresponding to each of the N characters can be determined based on the text feature, the layout feature and the visual feature corresponding to each of the N characters.
That is, the multimodal features to which each of the N characters corresponds can be understood as multimodal features including text features, layout features, and visual features. In addition, it should be noted that, in practical implementation, the multi-modal feature of the character may also include other features, which is not limited in this application.
Step 702: and obtaining initial reference information based on the obtained N multi-modal characteristics, wherein the reference information comprises attention weight of each layer of the document multi-modal model and comprehensive multi-modal characteristics corresponding to the image to be structured.
In this embodiment of the application, the electronic device may input the obtained N multimodal features into the document multimodal model, then fuse the N multimodal features through a normalized network layer of a first encoder in the document multimodal model, and process the fused features through a residual connection network layer of the first encoder to obtain initial reference information.
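A minimal sketch of obtaining the initial reference information M0 as described above: the N multi-modal features are fused through the first encoder (its normalization layer and residual connection) into a comprehensive multi-modal feature, while the per-layer states start out empty. The field names, the pooling used for the comprehensive feature, and the encoder interface are assumptions.

```python
import torch
import torch.nn as nn

def initial_reference_info(multimodal_feats: torch.Tensor, first_encoder: nn.Module):
    """multimodal_feats: (N, d_model), one multi-modal feature per character."""
    x = multimodal_feats.unsqueeze(0)           # add a batch dimension
    fused = first_encoder(x)                    # normalization + residual connection fuse the N features
    return {
        "layer_states": [],                     # per-layer memory/attention states, empty at the first step
        "global_feature": fused.mean(dim=1),    # comprehensive multi-modal feature of the image
    }
```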
Step 703: inputting the initial reference information and the N multi-modal features into the document multi-modal model, and performing the following operations in a loop-iteration manner until the structured data preceding a termination character has been output:
reading one multi-modal feature and obtaining historical structured data, wherein the historical structured data is the structured data of the character corresponding to the previously read multi-modal feature; if the multi-modal feature is the first one read, the historical structured data is empty;
and obtaining the structured data of the character corresponding to the multi-modal feature and the updated reference information based on the current reference information, the multi-modal feature, and the historical structured data.
In the embodiment of the application, the electronic device can input the initial reference information and the N multi-modal features into the document multi-modal model and execute the corresponding operations in a loop-iteration manner until the structured data preceding the termination character has been output.
Referring to fig. 2, the electronic device may read one multi-modal feature and obtain the historical structured data, and input the multi-modal feature, the current reference information, and the historical structured data into the document multi-modal model; the document multi-modal model then processes them to obtain the structured data of the character corresponding to the multi-modal feature and the updated reference information.
For example, suppose the multi-modal feature read by the electronic device is multi-modal feature 3 of the third character, i.e., the text feature w3, the layout feature (the tuple information), and the visual feature V3 of the third character, and that the historical structured data read is the structured data Date1 corresponding to the second character, denoted F2. The electronic device then inputs the text feature w3, the layout feature, the visual feature V3, F2, and the current reference information M2 into the document multi-modal model, so that the document multi-modal model can output the structured data Date2 and the updated reference information M3.
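A minimal sketch of the loop iteration described above and illustrated by this example: at each step one multi-modal feature, the previous step's structured output, and the current reference information are fed to the document multi-modal model, which returns the structured data of that character and the updated reference information; decoding stops at the termination character. The model interface and the termination token are assumptions.

```python
def generate_structured_data(model, multimodal_feats, initial_reference, end_token="[SEP]"):
    """Iteratively decode structured data character by character (interface assumed)."""
    reference = initial_reference          # M0
    history = None                         # historical structured data is empty at the first read
    outputs = []
    for feat in multimodal_feats:          # read one multi-modal feature per iteration
        structured, reference = model(feat, history, reference)   # e.g. (Date1, M2) from (w2, Date0, M1)
        if structured == end_token:        # stop once the termination character is produced
            break
        outputs.append(structured)
        history = structured               # becomes the historical structured data for the next step
    return outputs
```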
Therefore, the image processing method provided by the application can support structuring images with unknown layouts and unknown fields, and no pre-ordering of the word sequence is needed, so that the processing speed of image structuring is improved while the accuracy of character recognition in the image is improved, and the structuring effect for non-fixed-layout images in optimal-image OCR is further improved.
Fig. 8 is a schematic diagram illustrating an exemplary implementation of an image processing method according to an embodiment of the present disclosure. The characters included in the image to be structured shown in fig. 8 are: "Date: 11/12/96Attention: John J.Mulderigc/o Mike Baker Company: Philip Morris Management OCRp". The electronic device can perform OCR processing on the image to be structured to obtain the character recognition result. Further, the electronic device may perform feature extraction on the plurality of characters included in the character recognition result to obtain multi-modal features corresponding to the plurality of characters.
Specifically, the electronic device may input the multi-modal features corresponding to each of the plurality of characters, i.e., T0, into the document multi-modal model, so that the initial reference information, i.e., M0, is obtained. The initial reference information and the multi-modal feature of the first character are then input into the document multi-modal model to obtain the structured data, i.e., Date0, and the updated reference information, i.e., M1. Further, the multi-modal feature of the second character, the historical structured data, i.e., Date0, and the current reference information, i.e., M1, can be input into the document multi-modal model to obtain the structured data, i.e., Date1, and the updated reference information, i.e., M2. Execution continues in a loop-iteration manner until the structured data before the termination character is output is obtained, so that the structured data corresponding to the image to be structured is obtained, i.e., "Date: 11/12/96[DESP]Attention: John J.Mulderig c/o Mike Baker[DESP]Company: Philip Morris Management[SEP]".
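The iterative structuring procedure above can be pictured with a short Python sketch; the model call signature, the feature objects and the "[SEP]" termination token are illustrative assumptions rather than the patent's actual interface:

```python
# Illustrative decoding loop for the worked example above; `model` is assumed to
# map (multi-modal feature, current reference info, historical structured data)
# to (structured data, updated reference info).
def structure_image(model, multimodal_feats, initial_reference, end_token="[SEP]"):
    reference = initial_reference      # M0 from the first encoder
    history = None                     # no structured data before the first character
    outputs = []
    for feat in multimodal_feats:      # one multi-modal feature per character
        structured, reference = model(feat, reference, history)  # Date0/M1, Date1/M2, ...
        outputs.append(structured)
        if structured.endswith(end_token):   # stop once the termination character appears
            break
        history = structured           # becomes the historical structured data next round
    return outputs
```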
Compared with fixed-format schemes in the related art, the image processing method provided by the application does not need to restrict the layout style of the image or the style of the fields, can perform structured processing on OCR documents of various formats, enhances the universality of the image processing scheme, improves the accuracy of character recognition, and improves the user experience.
Fig. 9 is a schematic diagram illustrating an image processing result according to an embodiment of the present application. As shown in fig. 9, the image processing method provided by the application can clearly determine the structured data corresponding to each character, and the structured data corresponding to the image can be output or presented as key-value pairs of field and recognition result, which improves the user experience.
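As an illustration of such a key-value presentation, the structured string from the Fig. 8 example could be split into field/recognition-result pairs roughly as follows; the exact semantics of the [DESP] and [SEP] markers are assumed here:

```python
# Illustrative conversion of the structured string into field/value pairs;
# [DESP] is assumed to separate fields and ":" to separate a field name
# from its recognition result.
def to_key_value_pairs(structured: str) -> dict:
    structured = structured.replace("[SEP]", "")               # drop the termination marker
    fields = [f.strip() for f in structured.split("[DESP]") if f.strip()]
    pairs = {}
    for field in fields:
        key, _, value = field.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs

# to_key_value_pairs("Date: 11/12/96[DESP]Company: Philip Morris Management[SEP]")
# -> {"Date": "11/12/96", "Company": "Philip Morris Management"}
```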
Based on the same inventive concept, referring to fig. 10, which is a schematic diagram of a logical structure of an image processing apparatus in an embodiment of the present application, an image processing apparatus 1000 includes an extracting unit 1001, a determining unit 1002, and a processing unit 1003, wherein:
the extraction unit 1001 is configured to perform feature extraction on N characters included in an image to be structured, and obtain multi-modal features corresponding to the N characters; each mode represents a character attribute, and N is a positive integer;
a determining unit 1002, configured to obtain initial reference information based on the obtained N multi-modal features, where the reference information includes an attention weight of each layer in a document multi-modal model and a comprehensive multi-modal feature corresponding to the image to be structured;
the processing unit 1003 is configured to input the initial reference information and the N multi-modal features into the document multi-modal model, and perform the following operations in a loop iteration manner until structured data before a termination character is output is obtained:
reading a multi-modal feature and obtaining historical structured data, wherein the historical structured data is: the structured data of the character corresponding to the multi-modal feature read last time; if the multi-modal feature is read for the first time, the historical structured data is empty;
and obtaining the structured data of the character corresponding to the multi-modal feature and the updated reference information based on the current reference information, the multi-modal feature and the historical structured data.
Optionally, before performing feature extraction on each character included in the image to be structured, the apparatus further includes an obtaining unit, configured to:
and performing Optical Character Recognition (OCR) processing on the image to be structured to obtain a character recognition result.
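For illustration only, one possible way to obtain such a character recognition result (text plus boxes) is with the open-source pytesseract wrapper; the patent does not prescribe any particular OCR engine, and the file name below is hypothetical:

```python
# Run OCR on the image to be structured and keep each recognized token
# together with its bounding box.
from PIL import Image
import pytesseract

image = Image.open("document.png")   # hypothetical image to be structured
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

tokens = [
    {"text": t, "box": (l, tp, l + w, tp + h)}
    for t, l, tp, w, h in zip(ocr["text"], ocr["left"], ocr["top"],
                              ocr["width"], ocr["height"])
    if t.strip()
]
```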
Optionally, the extracting unit 1001 is configured to:
carrying out text information coding processing on N characters included in the image to be structured to obtain text characteristics corresponding to the N characters; and
carrying out layout information coding processing on N characters included in the image to be structured to obtain layout characteristics corresponding to the N characters, the layout features being used for representing space and sequence order information corresponding to the characters; and
carrying out image information coding processing on N characters included in the image to be structured to obtain visual features corresponding to the N characters;
obtaining multi-modal features corresponding to the N characters based on the text features, layout features and visual features corresponding to the N characters.
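A minimal sketch of how the three kinds of features might be combined into one multi-modal feature per character is shown below; the embedding sizes, the coordinate grid and the simple additive fusion are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiModalEmbedding(nn.Module):
    """Illustrative per-character fusion of text, layout and visual features."""

    def __init__(self, vocab_size=30522, dim=768, grid=1001):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)   # text information encoding
        self.x_emb = nn.Embedding(grid, dim)            # discretized x-coordinates
        self.y_emb = nn.Embedding(grid, dim)            # discretized y-coordinates
        self.visual_proj = nn.Linear(2048, dim)         # projects a visual patch feature

    def forward(self, token_ids, boxes, visual_feats):
        # token_ids: (N,) long; boxes: (N, 4) long, discretized (x0, y0, x1, y1);
        # visual_feats: (N, 2048) float
        text = self.text_emb(token_ids)
        layout = (self.x_emb(boxes[:, 0]) + self.y_emb(boxes[:, 1])
                  + self.x_emb(boxes[:, 2]) + self.y_emb(boxes[:, 3]))
        visual = self.visual_proj(visual_feats)
        return text + layout + visual   # one multi-modal feature per character
```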
Optionally, the extracting unit 1001 is specifically configured to:
carrying out standardization and discretization on the coordinates corresponding to the N characters respectively to obtain the coordinates of the N characters after respective processing;
determining layout features corresponding to the N characters respectively based on the coordinates after the N characters are processed respectively and a preset boundary box, wherein the layout features comprise at least one of upper left corner coordinates, lower right corner coordinates and width and height of the preset boundary box.
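For example, the normalization and discretization of the coordinates could look like the following sketch, where the 0–1000 grid and the rounding scheme are assumptions rather than values specified by the patent:

```python
# Normalize a character box to the page size, discretize it onto a fixed grid,
# and return the top-left/bottom-right corners plus width and height.
def discretize_box(box, page_width, page_height, grid=1000):
    x0, y0, x1, y1 = box
    nx0 = min(grid, max(0, round(x0 / page_width * grid)))
    ny0 = min(grid, max(0, round(y0 / page_height * grid)))
    nx1 = min(grid, max(0, round(x1 / page_width * grid)))
    ny1 = min(grid, max(0, round(y1 / page_height * grid)))
    # layout feature: top-left, bottom-right, width and height of the box
    return (nx0, ny0, nx1, ny1, nx1 - nx0, ny1 - ny0)
```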
Optionally, the determining unit 1002 is specifically configured to:
inputting the obtained N multi-modal characteristics into the document multi-modal model respectively;
and fusing the N multi-modal characteristics through a normalized network layer of a first encoder in the document multi-modal model, and processing the fused characteristics through a residual connection network layer of the first encoder to obtain initial reference information.
Optionally, the apparatus further comprises a training unit, configured to:
pre-training an initial document multi-modal model based on a preset task to obtain a preset document multi-modal model;
establishing a sample set according to a plurality of sample images and the marked structured data corresponding to the characters included in the sample images;
inputting the sample images in the sample set into a preset document multi-modal model for training to obtain a plurality of structured data, and comparing the plurality of structured data with corresponding marked structured data respectively to obtain a plurality of comparison results; each piece of structured data is correspondingly determined based on the multi-modal characteristics of one character, the current reference information and the structured data of the character before the character;
and adjusting the preset document multi-modal model according to the comparison results to obtain the document multi-modal model meeting the preset convergence condition.
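A rough fine-tuning sketch under these assumptions is given below; the model interface, the use of teacher forcing on the labeled structured data and the per-character cross-entropy comparison are illustrative choices, not the patent's actual training code:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample):
    # `model` is assumed to map (feature, reference, history) to token logits and
    # updated reference info; `sample` carries multi-modal features and the
    # labeled structured data for every character.
    reference = model.initial_reference(sample.multimodal_feats)
    history = None
    loss = 0.0
    for feat, target_ids in zip(sample.multimodal_feats, sample.labeled_structured):
        logits, reference = model(feat, reference, history)  # per-character prediction
        loss = loss + F.cross_entropy(logits, target_ids)    # compare with labeled data
        history = target_ids                                  # teacher forcing on labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```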
Optionally, the apparatus further comprises a training unit, configured to:
determining an overall loss function; the overall loss function comprises a sequence-to-sequence task loss function;
after the preset document multi-modal model is trained, carrying out convergence inspection on the trained preset document multi-modal model through the overall loss function;
and when the trained preset document multi-modal model is determined to be converged, obtaining the document multi-modal model.
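As a simple illustration of such a convergence check on the overall loss, one might track the loss across epochs and stop when it no longer improves; the tolerance and patience values below are assumptions:

```python
# Declare convergence once the overall (sequence-to-sequence) loss has stopped
# improving by more than `tol` over the last `patience` epochs.
def has_converged(loss_history, tol=1e-4, patience=3):
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return max(recent) - min(recent) < tol
```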
Having described the image processing method and apparatus of the exemplary embodiments of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
Based on the same inventive concept and the same technical concept as the above method embodiments, an embodiment of the present application provides a computer device, which may be the terminal device and/or the server shown in fig. 1. As shown in fig. 11, the computer device includes at least one processor 1101 and a memory 1102 connected to the at least one processor. The specific connection medium between the processor 1101 and the memory 1102 is not limited in the embodiment of the present application; in fig. 11, the processor 1101 and the memory 1102 are connected through a bus as an example. The bus may be divided into an address bus, a data bus, a control bus, and the like.
In the embodiment of the present application, the memory 1102 stores instructions executable by the at least one processor 1101, and the at least one processor 1101 may execute the steps of the image processing method by executing the instructions stored in the memory 1102.
The processor 1101 is the control center of the computer device, may connect various parts of the computer device by using various interfaces and lines, and carries out the image processing described above by running or executing the instructions stored in the memory 1102 and calling up the data stored in the memory 1102. Optionally, the processor 1101 may include one or more processing units, and the processor 1101 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 1101. In some embodiments, the processor 1101 and the memory 1102 may be implemented on the same chip, or, in some embodiments, they may be implemented separately on separate chips.
The processor 1101 may be a general purpose processor such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, configured to implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The memory 1102, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1102 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 1102 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1102 in the embodiments of the present application may also be a circuit or any other device capable of implementing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which, when the program is run on the computer device, causes the computer device to perform the steps of the above-described image processing method.
Based on the same inventive concept, the present application provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the steps of the above-mentioned image processing method.
It should be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (15)

1. An image processing method, characterized in that the method comprises:
carrying out feature extraction on N characters contained in an image to be structured to obtain multi-modal features corresponding to the N characters; each mode represents a character attribute, and N is a positive integer;
obtaining initial reference information based on the obtained N multi-modal characteristics, wherein the reference information comprises attention weight of each layer of a document multi-modal model and comprehensive multi-modal characteristics corresponding to the image to be structured;
inputting the initial reference information and the N multi-modal characteristics into the document multi-modal model, and executing the following operations in a loop iteration mode until structured data before a termination character is output is obtained:
reading a multi-modal feature and obtaining historical structured data, wherein the historical structured data is: the method comprises the steps that structural data of characters corresponding to multi-modal features are read last time, and if the multi-modal features are read for the first time, the historical structural data are null;
and obtaining the structured data of the character corresponding to the multi-modal feature and the updated reference information based on the current reference information, the multi-modal feature and the historical structured data.
2. The method of claim 1, wherein prior to feature extraction for each character contained in the image to be structured, the method further comprises:
and performing Optical Character Recognition (OCR) processing on the image to be structured to obtain a character recognition result.
3. The method of claim 1 or 2, wherein performing feature extraction on N characters contained in an image to be structured to obtain multi-modal features corresponding to the N characters respectively comprises:
carrying out text information coding processing on N characters included in the image to be structured to obtain text characteristics corresponding to the N characters; and
carrying out layout information coding processing on N characters included in the image to be structured to obtain layout characteristics corresponding to the N characters, the layout features being used for representing space and sequence order information corresponding to the characters; and
carrying out image information coding processing on N characters included in the image to be structured to obtain visual features corresponding to the N characters;
obtaining multi-modal features corresponding to the N characters based on the text features, layout features and visual features corresponding to the N characters.
4. The method of claim 3, wherein performing layout information encoding processing on N characters included in the image to be structured to obtain layout features corresponding to the N characters respectively comprises:
carrying out standardization and discretization on the coordinates corresponding to the N characters respectively to obtain the coordinates of the N characters after respective processing;
determining layout features corresponding to the N characters respectively based on the coordinates after the N characters are processed respectively and a preset boundary box, wherein the layout features comprise at least one of upper left corner coordinates, lower right corner coordinates and width and height of the preset boundary box.
5. The method of claim 1 or 2, wherein deriving initial reference information based on the obtained N multi-modal features comprises:
inputting the obtained N multi-modal characteristics into the document multi-modal model respectively;
and fusing the N multi-modal characteristics through a normalized network layer of a first encoder in the document multi-modal model, and processing the fused characteristics through a residual connecting network layer of the first encoder to obtain initial reference information.
6. The method of claim 1 or 2, wherein the document multimodal model is obtained based on training in the following way:
pre-training an initial document multi-modal model based on a preset task to obtain a preset document multi-modal model;
establishing a sample set according to a plurality of sample images and the marked structured data corresponding to the characters included in the sample images;
inputting the sample images in the sample set into the preset document multi-modal model for training to obtain a plurality of structured data, and comparing the plurality of structured data with corresponding marked structured data respectively to obtain a plurality of comparison results; each piece of structured data is correspondingly determined based on the multi-modal characteristics of one character, the current reference information and the structured data of the character before the character;
and adjusting the preset document multi-modal model according to the comparison results to obtain the document multi-modal model meeting the preset convergence condition.
7. The method of claim 6, wherein adjusting the preset document multimodal model according to the comparison results to obtain a document multimodal model satisfying a preset convergence condition comprises:
determining an overall loss function; the overall loss function comprises a sequence-to-sequence task loss function;
after the preset document multi-modal model is trained, carrying out convergence inspection on the trained preset document multi-modal model through the overall loss function;
and when the trained preset document multi-modal model is determined to be converged, obtaining the document multi-modal model.
8. An image processing apparatus, characterized in that the apparatus comprises:
the extraction unit is used for extracting the characteristics of N characters contained in the image to be structured to obtain multi-modal characteristics corresponding to the N characters; each mode represents a character attribute, and N is a positive integer;
the determination unit is used for obtaining initial reference information based on the obtained N multi-modal features, wherein the reference information comprises attention weight of each layer of the document multi-modal model and comprehensive multi-modal features corresponding to the image to be structured;
the processing unit is used for inputting the initial reference information and the N multi-modal characteristics into the document multi-modal model, and executing the following operations in a loop iteration mode until structured data before a termination character is output is obtained:
reading a multi-modal feature and obtaining historical structured data, wherein the historical structured data is: the structured data of the character corresponding to the multi-modal feature read last time; and if the multi-modal feature is read for the first time, the historical structured data is empty;
and obtaining the structured data of the character corresponding to the multi-modal feature and the updated reference information based on the current reference information, the multi-modal feature and the historical structured data.
9. The apparatus according to claim 8, wherein before feature extraction of each character included in the image to be structured, the apparatus further comprises an obtaining unit configured to:
and carrying out Optical Character Recognition (OCR) processing on the image to be structured to obtain a character recognition result.
10. The apparatus of claim 8 or 9, wherein the extraction unit is configured to:
carrying out text information coding processing on N characters included in the image to be structured to obtain text characteristics corresponding to the N characters; and
carrying out layout information coding processing on N characters included in the image to be structured to obtain layout characteristics corresponding to the N characters, the layout features being used for representing space and sequence order information corresponding to the characters; and
carrying out image information coding processing on N characters included in the image to be structured to obtain visual features corresponding to the N characters;
obtaining multi-modal features corresponding to the N characters based on the text features, layout features and visual features corresponding to the N characters.
11. The apparatus of claim 10, wherein the extraction unit is specifically configured to:
carrying out standardization and discretization on the coordinates corresponding to the N characters respectively to obtain the coordinates of the N characters after respective processing;
determining layout features corresponding to the N characters respectively based on the coordinates after the N characters are processed respectively and a preset boundary box, wherein the layout features comprise at least one of upper left corner coordinates, lower right corner coordinates and width and height of the preset boundary box.
12. The apparatus according to claim 8 or 9, wherein the determining unit is specifically configured to:
inputting the obtained N multi-modal characteristics into the document multi-modal model respectively;
and fusing the N multi-modal characteristics through a normalized network layer of a first encoder in the document multi-modal model, and processing the fused characteristics through a residual connection network layer of the first encoder to obtain initial reference information.
13. The apparatus according to claim 8 or 9, wherein the apparatus further comprises a training unit for:
pre-training an initial document multi-modal model based on a preset task to obtain a preset document multi-modal model;
establishing a sample set according to a plurality of sample images and the marked structured data corresponding to the characters included in the sample images;
inputting the sample images in the sample set into a preset document multi-modal model for training to obtain a plurality of structured data, and comparing the plurality of structured data with corresponding marked structured data respectively to obtain a plurality of comparison results; each piece of structured data is correspondingly determined based on the multi-modal characteristics of one character, the current reference information and the structured data of the character before the character;
and adjusting the preset document multi-modal model according to the comparison results to obtain the document multi-modal model meeting the preset convergence condition.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image processing method according to any of claims 1-7 when executing the program.
15. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements the image processing method of any of claims 1-7.
CN202210167897.1A 2022-02-23 2022-02-23 Image processing method and device, electronic equipment and storage medium Pending CN114639109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210167897.1A CN114639109A (en) 2022-02-23 2022-02-23 Image processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210167897.1A CN114639109A (en) 2022-02-23 2022-02-23 Image processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114639109A true CN114639109A (en) 2022-06-17

Family

ID=81946029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210167897.1A Pending CN114639109A (en) 2022-02-23 2022-02-23 Image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114639109A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821430A (en) * 2023-04-18 2023-09-29 上海百秋智尚网络服务有限公司 Method and system for realizing customer service matching by utilizing multi-mode algorithm
CN117573839A (en) * 2024-01-12 2024-02-20 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117573839B (en) * 2024-01-12 2024-04-19 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN114639109A (en) Image processing method and device, electronic equipment and storage medium
CN113705313A (en) Text recognition method, device, equipment and medium
CN113762309B (en) Object matching method, device and equipment
CN115438674B (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN115017911A (en) Cross-modal processing for vision and language
CN116431847A (en) Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
CN115269882A (en) Intellectual property retrieval system and method based on semantic understanding
CN112926700B (en) Class identification method and device for target image
CN114399775A (en) Document title generation method, device, equipment and storage medium
CN112966676B (en) Document key information extraction method based on zero sample learning
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN117018632A (en) Game platform intelligent management method, system and storage medium
CN115982363A (en) Small sample relation classification method, system, medium and electronic device based on prompt learning
CN113723111B (en) Small sample intention recognition method, device, equipment and storage medium
CN114692715A (en) Sample labeling method and device
CN112347738B (en) Bidirectional encoder characterization quantity model optimization method and device based on referee document
CN117540221B (en) Image processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination