CN116030469A - Processing method, processing device, processing equipment and computer readable storage medium - Google Patents


Info

Publication number
CN116030469A
CN116030469A (application CN202211691412.5A)
Authority
CN
China
Prior art keywords
text block
information
text
matching
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211691412.5A
Other languages
Chinese (zh)
Inventor
田秋雨
王敏
陈永洒
罗林锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202211691412.5A
Publication of CN116030469A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Character Discrimination (AREA)

Abstract

The embodiment of the application discloses a processing method, a processing device, processing equipment and a computer readable storage medium. The method comprises the following steps: obtaining an object to be processed, and performing character recognition on the object to be processed to obtain first information in the object to be processed; serializing the first information to obtain at least one text block and structural information of each text block, wherein the text block is a text set with correct and complete semantic information; classifying and matching at least one text block based on the first information, the at least one text block and the structure information of each text block to obtain a classification result and a matching result; and determining the classification result and the matching result as the processing result of the object to be processed.

Description

Processing method, processing device, processing equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a processing method, apparatus, device, and computer readable storage medium.
Background
In the field of visual document question answering (Document Visual Question Answering, DocVQA), determining keys and key values, or questions and answers, from a huge volume of scanned documents, and pairing each key with its corresponding value (or each question with its corresponding answer), are important document processing tasks.
In the related art, a machine learning model is generally adopted to process documents. However, during training, document formats are not uniform and the amount of document data available per format is small, so the generalization capability of the machine learning model is poor and the accuracy of the document processing result is difficult to guarantee.
Disclosure of Invention
In view of this, embodiments of the present application provide a processing method, apparatus, device, and computer-readable storage medium, capable of improving accuracy of a processing result of an object to be processed.
The technical scheme is realized as follows:
the embodiment of the application provides a processing method, which comprises the following steps:
obtaining an object to be processed, and performing character recognition on the object to be processed to obtain first information in the object to be processed;
serializing the first information to obtain at least one text block and structural information of each text block, wherein the text block is a text set with correct and complete semantic information;
based on the first information, the at least one text block and the structure information of each text block, carrying out classification processing and matching processing on the at least one text block to obtain a classification result and a matching result;
and determining the classification result and the matching result as the processing result of the object to be processed.
The embodiment of the application provides a processing device, which comprises:
the first acquisition module is used for acquiring an object to be processed, carrying out character recognition on the object to be processed and acquiring first information in the object to be processed;
the second acquisition module is used for carrying out serialization processing on the first information to obtain at least one text block and structural information of each text block, wherein the text block is a text set with correct and complete semantic information;
the processing module is used for carrying out classification processing and matching processing on the at least one text block based on the first information, the at least one text block and the structure information of each text block to obtain a classification result and a matching result;
and the first determining module is used for determining the classification result and the matching result as the processing result of the object to be processed.
An embodiment of the present application provides a processing apparatus, including:
a memory for storing executable processing instructions;
and the processor is used for realizing the processing method provided by the embodiment of the application when executing the executable processing instructions stored in the memory.
Embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions configured to perform the steps of the above-described processing method.
The embodiments of the present application provide a processing method, a processing device, processing equipment, and a computer readable storage medium. With the technical scheme, an object to be processed is first obtained, and character recognition is performed on it to obtain first information in the object to be processed. The first information is then serialized to obtain at least one text block and the structure information of each text block. Finally, each text block is classified and matched based on the first information, each text block, and the structure information of each text block, and the resulting classification result and matching result are determined as the processing result of the object to be processed. Serializing the first information in the object to be processed prevents the obtained text blocks from carrying incorrect semantic information and allows the structure information among the text blocks to be determined more accurately. Classifying and matching the text blocks based on the first information, the text blocks, and their structure information then exploits multiple features of the object to be processed and avoids over-fitting during classification and matching, thereby improving the accuracy of the processing result of the object to be processed.
Drawings
Fig. 1 is a schematic flow chart of a processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a document image processing method based on text serialization and structural information according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a document image processing flow based on text serialization and structural information according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a processing device according to an embodiment of the present application;
fig. 5 is a schematic diagram of a composition structure of a processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be further described with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments/other embodiments", which describes a subset of all possible embodiments. It is to be understood that "some embodiments/other embodiments" can be the same subset or different subsets of all possible embodiments, and can be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
When documents are processed, the document types are numerous (such as invoices, reimbursement forms, and the like), and the same type of document may come in many formats, so solving the document processing problem with a traditional machine learning method requires training different machine learning models for different formats. Because many document formats are specific to an enterprise (and often somewhat arbitrary), and the manual labeling required for training data is complex with high demands on labeling quality, it is often difficult to collect enough training data to achieve an ideal model processing effect.
In the related art, the result of optical character recognition (OCR) and the information of the image feature map (Feature Map) are directly fused and fed as input into a DocVQA model, which finally outputs a semantic entity recognition (Semantic Entity Recognition) result that can identify a Question (or Key), an Answer (or Value), and a Header, and performs relation extraction (Relation Extraction) to determine the pairing relation between Keys and Values. However, the OCR recognition result is non-serialized: it is usually output in order from left to right and top to bottom, while in many documents content wraps across lines. If the raw OCR result is fed into a cross-modal DocVQA model directly, it does not conform to the reading order, and the semantics and typesetting information are disturbed. In addition, because document formats are not uniform and the amount of document data per format is small, directly fusing the OCR result as text features with image features yields a model with poor generalization capability that cannot achieve a good document processing result.
To address the above problems in the related art, an embodiment of the present application provides a processing method that can improve the accuracy of the processing result of an object to be processed. Fig. 1 is a flow chart of a processing method according to an embodiment of the present application; the method includes the following steps:
s101, obtaining an object to be processed, and performing character recognition on the object to be processed to obtain first information in the object to be processed.
It should be noted that the object to be processed may be a graphic object; for example, it may be a scanned document (a scanned PDF or a picture) stored in an enterprise's history, such as a scan of an invoice, a reimbursement bill, or the like. Character recognition may recognize characters, symbols, graphics, etc. in the object to be processed, and the first information may include character information, position information, typesetting information of the entire object to be processed, and so on.
In some embodiments, the object to be processed may be pre-stored in a server, in which case it is obtained from the server; the object to be processed may also be obtained in real time, for example by scanning a paper document such as an invoice or a reimbursement bill with a scanning device. Character recognition of the object to be processed may be performed by OCR, and after the object to be processed is recognized, the first information in it, such as character information, position information, and typesetting information, can be obtained.
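As a concrete illustration of the "first information" described above, the following Python sketch models per-character recognition output. All field names (`char`, `bbox`, `font_size`, `line_spacing`) are assumptions for illustration, not structures defined in this application.

```python
# Hedged sketch of the "first information" that character recognition
# (step S101) could yield: per-character content, position, and typesetting.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CharInfo:
    char: str                          # recognised character
    bbox: Tuple[int, int, int, int]    # (x0, y0, x1, y1) position on the page
    font_size: int                     # typesetting information
    line_spacing: int

def recognize(raw_ocr):
    """Wrap raw OCR tuples into CharInfo records (the 'first information')."""
    return [CharInfo(c, box, size, spacing) for c, box, size, spacing in raw_ocr]

chars = recognize([("I", (10, 10, 20, 20), 12, 4),
                   ("n", (22, 10, 32, 20), 12, 4)])
```

A real OCR engine would also attach a confidence score per character; that is omitted here for brevity.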
S102, carrying out serialization processing on the first information to obtain at least one text block and structure information of each text block.
In some embodiments, serializing the first information may mean dividing the text in the object to be processed into text blocks according to semantic information, location information, and the like, so as to obtain one or more text blocks of the object to be processed and the structure information of each text block, where a text block is a text set with correct and complete semantic information, and the text may include Chinese, English, graphics, symbols, and the like; the structure information of each text block may be the position information, arrangement order, and so on corresponding to that text block.
S103, based on the first information, the at least one text block and the structure information of each text block, carrying out classification processing and matching processing on the at least one text block to obtain a classification result and a matching result.
In some embodiments, after the first information, each text block, and the structure information of each text block corresponding to the object to be processed are obtained, classification processing and matching processing may be performed on each text block accordingly to obtain a classification result and a matching result. The classification result may be the type of each text block, for example whether it is a Question, an Answer, or a Header; the matching result may be the correspondence between a text block of type Question and a text block of type Answer.
S104, determining the classification result and the matching result as the processing result of the object to be processed.
In some embodiments, the processing result of the object to be processed may include a classification result and a matching result, and after determining the classification result and the matching result corresponding to each text block, the processing result of the object to be processed is obtained.
In the embodiment of the present application, an object to be processed is first obtained, and character recognition is performed on it to obtain first information in the object to be processed. The first information is then serialized to obtain at least one text block and the structure information of each text block. Finally, each text block is classified and matched based on the first information, each text block, and the structure information of each text block, and the resulting classification result and matching result are determined as the processing result of the object to be processed. Serializing the first information prevents the obtained text blocks from carrying incorrect semantic information and allows the structure information among the text blocks to be determined more accurately. Classifying and matching the text blocks on this basis then exploits multiple features of the object to be processed and avoids over-fitting during classification and matching, thereby improving the accuracy of the processing result.
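The four steps S101 to S104 can be summarized as a pipeline skeleton. This is a hedged sketch: every function passed in is a placeholder standing in for a component the application describes only abstractly.

```python
# Hedged skeleton of the overall flow S101-S104; all callables are placeholders.
def process(obj,
            recognize,        # S101: character recognition -> first information
            serialize,        # S102: first info -> text blocks + structure info
            classify_match):  # S103: classification + matching processing
    first_info = recognize(obj)
    blocks, structure = serialize(first_info)
    classification, matching = classify_match(first_info, blocks, structure)
    # S104: the pair (classification, matching) is the processing result
    return classification, matching

# Toy stand-ins just to exercise the skeleton:
result = process(
    "scanned-doc",
    recognize=lambda o: ["Total:", "42"],
    serialize=lambda info: (info, list(range(len(info)))),
    classify_match=lambda info, b, s: ({"Total:": "Question", "42": "Answer"},
                                       {"Total:": "42"}),
)
```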
In some embodiments of the present application, the first information includes semantic information, location information and typesetting information of an object to be processed corresponding to each text, based on which serialization processing is performed on the first information to obtain at least one text block and structural information of each text block, that is, step S102 may be implemented through the following steps S1021 to S1023, which are described below.
S1021, semantic information, position information and typesetting information of the object to be processed corresponding to each word in the object to be processed are obtained.
In some embodiments, the semantic information may be the semantics represented by the text, the location information may be the location of the text in the object to be processed, and the typesetting information may include information such as the font size, line spacing, and word spacing of the text. In practice, after character recognition is performed on the object to be processed, for example by means of OCR, the semantics corresponding to each word, the position of each word, and the typesetting information of the object to be processed can be obtained.
S1022, dividing the characters in the object to be processed into at least one text block according to the semantic information, the position information and the typesetting information.
In some embodiments, by combining the semantics represented by each text, the position where each text is located, and the typesetting information corresponding to the object to be processed, one or more text blocks corresponding to the object to be processed can be obtained. In practice, each text belonging to the same text block has complete and correct semantic information, the positions of each text in the same text block are continuous, and the word sizes, line spacing, character spacing and the like of each text in the same text block are the same, so that each text in the object to be processed can be divided into one or more text blocks according to the semantic information, the position information and the typesetting information corresponding to each text.
S1023, determining the structure information of each text block based on the semantic information and the position information corresponding to each text block.
In some embodiments, after the semantic information and the position information corresponding to each text are obtained, the semantic information corresponding to the text block can be determined according to the semantic information of each text in the same text block; and determining the position information corresponding to the text block according to the position information of each text in the same text block. After semantic information and position information of all text blocks corresponding to the object to be processed are obtained, structural information of each text block can be determined according to the semantics corresponding to each text block and the position of each text block, and the structural information can comprise the arrangement sequence among the text blocks.
It can be understood that in the embodiment of the application, the text in the object to be processed is divided into one or more text blocks through the semantic information, the position information and the typesetting information of each text in the object to be processed, and the structure information corresponding to each text block is determined according to the semantic information and the position information of each text block, so that the correct division of each text block and the serialization processing of each text block are realized, and the guarantee is provided for the correctness of the classification result and the matching result of each subsequent text block.
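One simple way to realize the arrangement order in the structure information of step S1023 is to sort the text blocks top-to-bottom, then left-to-right, by their positions. The sorting rule and the `(x0, y0)` representation below are illustrative assumptions; the application leaves the exact ordering rule open.

```python
# Sketch of deriving structure information (reading order) from block
# positions, assuming each block carries its top-left (x0, y0) coordinate.
def reading_order(blocks):
    """blocks: list of (text, (x0, y0)); returns texts in reading order."""
    return [t for t, _ in sorted(blocks, key=lambda b: (b[1][1], b[1][0]))]

order = reading_order([("Amount", (200, 10)),
                       ("Invoice No.", (10, 10)),
                       ("2022-12-27", (10, 40))])
```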
In some embodiments of the present application, dividing the text in the object to be processed into at least one text block according to the semantic information, the location information, and the typesetting information, that is, step S1022, may be implemented through the following steps S1031 to S1032, each of which is described below.
S1031, obtaining first semantic information, first position information and first typesetting information corresponding to the ith character, and second semantic information, second position information and second typesetting information corresponding to the (i+1) th character.
In some embodiments, i = 1, …, N, where N is a positive integer representing the number of words in the object to be processed. The first semantic information may be the semantics corresponding to the i-th text, the first location information may be the location of the i-th text, and the first typesetting information may be the font size, font, color, etc. corresponding to the i-th text; the second semantic information may be the semantics corresponding to the (i+1)-th text, the second location information may be the location of the (i+1)-th text, and the second typesetting information may be the font size, font, color, etc. corresponding to the (i+1)-th text.
In some embodiments, before dividing the text in the object to be processed into one or more text blocks, the semantic information, position information, and typesetting information of the i-th text and of the (i+1)-th text may be obtained in turn, and the semantic, typesetting, and position information of each pair of adjacent texts is then analyzed to determine whether they belong to the same text block.
S1032, determining that the first semantic information and the second semantic information meet a first matching condition, the first typesetting information and the second typesetting information are the same, the first position information and the second position information meet a second matching condition, and determining that the ith character and the (i+1) th character belong to the same text block.
In some embodiments, the first matching condition may be that the semantics corresponding to the first semantic information and the second semantic information are coherent, e.g., the semantics of the i-th word and the (i+1)-th word join properly and match each other; the second matching condition may be that the positions corresponding to the first position information and the second position information are consecutive, e.g., the i-th character and the (i+1)-th character are adjacent, or the i-th character ends one row and the (i+1)-th character begins the next; the first typesetting information being the same as the second typesetting information may mean that the font, font size, color, and so on of the i-th text and the (i+1)-th text are correspondingly the same.
In some embodiments, if it is determined that the first semantic information and the second semantic information meet the first matching condition, the first typesetting information and the second typesetting information are the same, and the first location information and the second location information meet the second matching condition, which indicates that the semantics of the i-th text and the i+1th text are consistent, the locations are connected, and the typesetting is the same, the i-th text and the i+1th text can be divided into the same text block.
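The merging rule of steps S1031 to S1032 can be sketched as follows. The semantic-coherence check (the first matching condition) is stubbed out as always true, since it would require a language model; the typesetting and position checks follow the conditions above, with a hypothetical `max_gap` threshold as an assumption.

```python
# Sketch of steps S1031-S1032: walk adjacent characters and merge them into
# one text block when typesetting matches and positions are contiguous.
def merge_chars(chars, max_gap=4):
    """chars: list of (ch, x0, x1, font_size) on one text line."""
    blocks, current = [], chars[0][0]
    for prev, nxt in zip(chars, chars[1:]):
        _, _, prev_right, prev_font = prev
        ch, nxt_left, _, nxt_font = nxt
        semantically_coherent = True              # stubbed first matching condition
        same_typesetting = prev_font == nxt_font  # typesetting identical
        contiguous = (nxt_left - prev_right) <= max_gap  # second matching condition
        if semantically_coherent and same_typesetting and contiguous:
            current += ch
        else:
            blocks.append(current)
            current = ch
    blocks.append(current)
    return blocks

blocks = merge_chars([("T", 0, 8, 12), ("o", 9, 16, 12), ("t", 17, 24, 12),
                      ("4", 60, 68, 12), ("2", 69, 76, 12)])
```

The wide horizontal gap before "4" splits the line into two text blocks, mirroring how a label and its value are separated on a form.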
In some embodiments of the present application, performing classification processing and matching processing on the at least one text block based on the first information, the at least one text block, and the structure information of each text block to obtain a classification result and a matching result, that is, step S103, may be implemented through steps S1041 to S1043 described below.
S1041, fusing the first information, at least one text block and the structure information of each text block to obtain a first fusion processing result.
In some embodiments, after the first information of the object to be processed, the respective text blocks, and the structural information of the respective text blocks are obtained, fusion processing may be performed on the first information, the respective text blocks, and the structural information of the respective text blocks, for example, fusion processing is performed using a Multi-mode transform encoder (Multi-Modal Transformer Encoder) model, to obtain a result after the fusion processing, that is, a first fusion processing result.
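As a stand-in for the Multi-Modal Transformer Encoder fusion described above, the sketch below simply concatenates per-block feature vectors. This is only meant to illustrate what the fusion combines; a real implementation would feed token, layout, and order embeddings through a transformer rather than concatenate raw vectors.

```python
# Naive fusion sketch: concatenate each text block's modality vectors.
def fuse(text_features, structure_features):
    """Both arguments: one feature vector (list of floats) per text block."""
    assert len(text_features) == len(structure_features)
    return [t + s for t, s in zip(text_features, structure_features)]

fused = fuse([[0.1, 0.2], [0.3, 0.4]],   # per-block text features (assumed)
             [[1.0], [2.0]])             # per-block structure features (order index)
```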
S1042, predicting the first fusion processing result by using the trained first classification model to obtain the object type corresponding to each text block, and taking the object types corresponding to the text blocks as the classification result.
In some embodiments, the first classification model may be a semantic entity identification (Semantic Entity Recognition) model, and by predicting the first fusion processing result, an object type corresponding to each text block may be determined, where the object type may include at least a question type or an answer type, and in other embodiments, the object type may further include a title type, and so on.
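Since the trained Semantic Entity Recognition model is not specified here, the following sketch substitutes a naive heuristic: a block ending with a colon is labeled a Question, anything else an Answer. This heuristic is an assumption for illustration only, not the application's model.

```python
# Heuristic stand-in for step S1042's classification model.
def classify(blocks):
    """Map each text block to an object type (Question or Answer)."""
    return {b: ("Question" if b.rstrip().endswith((":", "：")) else "Answer")
            for b in blocks}

types = classify(["Invoice No.:", "INV-001", "Amount:", "42.00"])
```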
S1043, performing matching processing on the first text block of the question type and the second text block of the answer type to obtain a matching processing result.
In some embodiments, the matching result includes a correspondence between a first text block and a second text block, where the first text block may correspond to one or more second text blocks, and the correspondence between the first text block and the second text block may indicate that the content expressed by the second text block is an answer to a question corresponding to the first text block, that is, the first text block and the second text block have a question-answer relationship.
In some embodiments, after determining the object type to which each text block belongs, a text block whose object type is the question type may be determined as a first text block, and a text block whose object type is the answer type may be determined as a second text block. Then, each first text block is matched against the second text blocks to obtain the second text block corresponding to (or matched with) each first text block. In practice, each second text block may be matched against the m-th first text block in turn, so as to determine the second text block corresponding to the m-th first text block, where m = 1, 2, …, K, and K is an integer greater than 0 representing the number of first text blocks.
In some embodiments of the present application, after the first information is serialized to obtain at least one text block and structure information of each text block, i.e. step S102, the following steps S201 to S205 may be further performed, and each step is described below.
S201, obtaining integral image features and local image features corresponding to the object to be processed.
It should be noted that, the overall image features of the object to be processed may be global features of the image, such as color features, texture features, and shape features; the local image features may be local features of the image, which may be features extracted from local areas of the image, such as edges, corner points, lines, curves, areas of special properties, etc.
In some embodiments, the multi-scale image feature extraction model may be used to perform feature extraction on the object to be processed, so as to obtain one or more feature maps corresponding to the document image to be processed, where the feature maps include integral image features and local image features corresponding to the object to be processed.
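The idea of whole-image versus local features at multiple scales can be illustrated with a toy feature pyramid built by repeated 2x2 average pooling. A real system would use a multi-scale CNN feature extractor, so this pure-Python version is only a sketch of the concept.

```python
# Toy multi-scale feature pyramid: each pooling level captures coarser,
# more "global" structure of the input grid.
def avg_pool2x2(grid):
    return [[(grid[i][j] + grid[i][j+1] + grid[i+1][j] + grid[i+1][j+1]) / 4
             for j in range(0, len(grid[0]), 2)]
            for i in range(0, len(grid), 2)]

def pyramid(grid, levels=2):
    maps = [grid]                      # level 0: local detail
    for _ in range(levels):
        grid = avg_pool2x2(grid)
        maps.append(grid)              # coarser levels: global structure
    return maps

maps = pyramid([[1, 1, 3, 3],
                [1, 1, 3, 3],
                [5, 5, 7, 7],
                [5, 5, 7, 7]], levels=2)
```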
S202, fusion processing is carried out on the integral image characteristics, the local image characteristics, the first information, at least one text block and the structural information of each text block, and a second fusion processing result is obtained.
In some embodiments, after the integral image feature and the local image feature corresponding to the object to be processed are obtained, fusion processing may be performed on the integral image feature, the local image feature, the first information, each text block, and the structure information of each text block by using the multi-mode transform encoder model, so as to obtain a second fusion processing result.
S203, predicting a second fusion processing result by using the trained second classification model to obtain object types corresponding to the text blocks.
In some embodiments, the second classification model may be the same as, or different from, the first classification model. By predicting the second fusion processing result with the trained second classification model, each text block can be classified and the object type to which it belongs determined, where the object type includes at least a question type or an answer type; in other embodiments, the object type may further include a title type.
S204, matching the third text block of the question type and the fourth text block of the answer type to obtain a matching processing result.
In some embodiments, the matching processing result includes a correspondence of the third text block and the fourth text block. The third text block may correspond to one or more fourth text blocks, and the correspondence between the third text block and the fourth text block may indicate that the content expressed by the fourth text block is an answer of the question corresponding to the third text block, that is, the third text block and the fourth text block have a question-answer relationship.
In some embodiments, after determining the object type to which each text block belongs, a text block whose object type is a question type may be determined as a third text block, and a text block whose object type is an answer type may be determined as a fourth text block. And then, carrying out matching processing on each third text block and each fourth text block to obtain a fourth text block corresponding to (or matched with) each third text block. In practice, each fourth text block and each third text block may be matched in turn, so as to determine a fourth text block corresponding to each third text block.
S205, determining the object types and the corresponding relations corresponding to the text blocks as the processing results of the to-be-processed document.
In some embodiments, after obtaining the object type to which each text block belongs and the correspondence between the text block whose object type is the question type and the text block whose object type is the answer type, the object type and the correspondence corresponding to each text block may be used as a processing result of the document to be processed.
It can be understood that the overall image features and the local image features obtained in the embodiments of the present application take into account image information at each scale. Multi-modal feature fusion is realized by fusing the overall image features, the local image features, the first information, the at least one text block, and the structure information of each text block before performing the prediction processing, so the various features of the object to be processed are used more effectively, the over-fitting problem of the classification model is reduced, and the accuracy of the classification result is ensured.
In some embodiments of the present application, there is at least one third text block and at least one fourth text block. On this basis, the matching processing of the third text block of the question type and the fourth text block of the answer type to obtain a matching processing result, that is, step S204, may be implemented through steps S2041 to S2042, each of which is described below.
S2041, respectively carrying out matching processing on the j-th third text block and each fourth text block to obtain each matching value.
In some embodiments, j=1, …, M, where M is an integer greater than 0 representing the number of third text blocks. The magnitude of the matching value may represent the degree of match between the third text block and the fourth text block: the greater the matching value, the higher the degree of match. In practice, the j-th third text block may be respectively matched with each fourth text block, so as to obtain a matching value between each fourth text block and the j-th third text block.
S2042, determining a matching value with the largest score in the matching values, determining a fourth text block corresponding to the matching value with the largest score as a target text block matched with the j-th third text block, and taking the corresponding relation between the target text block and the j-th third text block as a matching processing result.
In some embodiments, since there are one or more fourth text blocks, one or more matching values may be obtained after matching the j-th third text block with each fourth text block. By determining the matching value with the largest score among the matching values and determining the fourth text block corresponding to that matching value as the target text block matched with the j-th third text block, the fourth text block with the highest degree of match with the j-th third text block (the target text block) is taken as the text block matched with the j-th third text block. The correspondence between this fourth text block and the j-th third text block is then taken as the matching processing result.
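Steps S2041 to S2042 can be sketched as follows. This is a minimal illustration only: the `match_value` word-overlap score is a hypothetical stand-in for the semantic matching described in steps S301 to S303, and all block contents are invented examples.

```python
# Sketch of S2041-S2042: match each question (third) text block to the
# answer (fourth) text block with the largest matching value.

def match_value(question: str, answer: str) -> float:
    # Hypothetical score: Jaccard overlap of words (illustration only).
    q_words, a_words = set(question.lower().split()), set(answer.lower().split())
    return len(q_words & a_words) / max(len(q_words | a_words), 1)

def match_questions_to_answers(questions, answers):
    """For each j-th question block, pick the answer block with the
    largest matching value (S2042) and record the correspondence."""
    result = {}
    for q in questions:
        scores = [match_value(q, a) for a in answers]  # S2041: one value per answer
        best = scores.index(max(scores))               # largest-scoring match
        result[q] = answers[best]
    return result

pairs = match_questions_to_answers(
    ["invoice date", "total amount"],
    ["date 2022-12-27", "amount total 98.00"],
)
```

A real implementation would replace the scoring function with the semantic matching of steps S301 to S303; the surrounding argmax selection is the part this sketch illustrates.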
In some embodiments of the present application, the j-th third text block is respectively matched with each fourth text block to obtain each matching value, that is, step S2041 may be implemented by the following steps S301 to S303, and each step is described below.
S301, third semantic information corresponding to a j-th third text block and fourth semantic information corresponding to each fourth text block are obtained.
In some embodiments, the third semantic information may be the semantics represented by the j-th third text block, and the fourth semantic information may be the semantics represented by each fourth text block. In practice, since the third text block and the fourth text block each include one or more words, the semantic information of the j-th third text block and the semantic information of each fourth text block can be obtained according to the semantic information corresponding to each word.
S302, the matching degree of the third semantic information and the fourth semantic information is respectively determined.
In some embodiments, semantic analysis processing may be performed on the third semantic information and the fourth semantic information, so as to determine the degree of matching between them. In practice, the degree of matching may be divided into a plurality of levels; for example, it may be divided into five levels, first through fifth, with lower levels representing a lower degree of match and higher levels a higher degree of match.
S303, determining the matching value corresponding to each matching degree based on the preset relation between the matching degree and the matching value.
In some embodiments, the preset relationship between the matching degree and the matching value may be obtained in advance, and the matching degree and the matching value may be in a one-to-one correspondence relationship. After the matching degree between the fourth semantic information corresponding to each fourth text block and the third semantic information of the j-th third text block is obtained, the matching value corresponding to each matching degree can be determined according to the preset relationship.
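Step S303 can be sketched as a simple table lookup. The level names and numeric values below are illustrative assumptions; the patent only requires a preset one-to-one relationship between matching degree and matching value.

```python
# Sketch of S303: a preset one-to-one mapping from matching degree
# (five hypothetical levels, per the example in S302) to a matching value.

PRESET_DEGREE_TO_VALUE = {
    "first": 0.1,   # low match
    "second": 0.3,
    "third": 0.5,   # medium match
    "fourth": 0.7,
    "fifth": 0.9,   # high match
}

def degrees_to_values(degrees):
    # Look up the matching value for each degree obtained in S302.
    return [PRESET_DEGREE_TO_VALUE[d] for d in degrees]

values = degrees_to_values(["fifth", "second", "third"])
```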
In the embodiments of the present application, an object to be processed is obtained, and character recognition is performed on it to obtain the first information in the object to be processed; the first information is serialized to obtain at least one text block and the structure information of each text block; each text block is then classified and matched based on the first information, each text block, and the structure information of each text block to obtain a classification result and a matching result, which are determined as the processing result of the object to be processed. By serializing the first information in the object to be processed, the obtained text blocks are prevented from carrying incorrect semantic information, and the structure information among the text blocks can be determined more accurately. Further, because the classification and matching of the text blocks are based on the first information, the text blocks, and their structure information, the various characteristics of the object to be processed are fully utilized and the over-fitting problem during classification and matching is avoided, thereby improving the accuracy of the processing result of the object to be processed.
Next, an implementation process of the embodiments of the present application in an actual application scenario is described.
In some embodiments, as shown in fig. 2, which is a flow chart of the document image processing method based on text serialization and structure information provided in the embodiments of the present application, the method may be implemented through the following steps S401 to S406, each of which is described below.
S401, acquiring a document image to be processed (equivalent to the object to be processed in other embodiments), and performing character recognition on the document image to be processed to acquire text information, position information and typesetting information of the document image to be processed (equivalent to 'acquiring first information in the object to be processed' in other embodiments).
In some embodiments, the document image to be processed may be a scanned electronic document, such as a form image of a ticket, a receipt, or the like. When character recognition is performed on the document image to be processed, an OCR model may be used, so as to obtain the text (token) information, position information, and layout information (the arrangement and positional relationship among tokens) of the document as a whole. The text information may include the text extracted from the document image to be processed, the position information may include the position of each text in the document image to be processed, and the typesetting information may include the font, font size, color, word spacing, line spacing, and so on of the text.
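The first information recognized in step S401 could be organized per token as below. The field names and values are illustrative assumptions for this sketch; the patent does not fix a concrete schema.

```python
# Sketch of a per-token record combining the three kinds of first
# information: text content, position, and typesetting/layout attributes.

from dataclasses import dataclass

@dataclass
class Token:
    text: str         # the recognized character or word
    bbox: tuple       # position: (x0, y0, x1, y1) in image coordinates
    font: str         # typesetting: font family
    size: float       # typesetting: font size
    color: str        # typesetting: text color

# A hypothetical token recognized from a form image.
tok = Token(text="Total", bbox=(40, 120, 92, 138),
            font="SimSun", size=10.5, color="#000000")
```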
S402, carrying out serialization processing on the text information to obtain at least one text block and structure information of each text block.
In some embodiments, the text information may be serialized using a trained text serialization model, which may be trained using the disclosed labeled dataset, to obtain a trained text serialization model. The serialization processing of the text information can be performed by analyzing the text semantic, position, typesetting and other information in the document image to be processed to determine different text blocks and the structure information of each text block.
It can be understood that by carrying out serialization processing on the text information, wrong reading sequence and structural information caused by content folding can be prevented, and obtaining of text blocks which do not accord with reading habit and have incorrect semantic information and structural information is avoided.
S403, acquiring image features of the document image to be processed, and carrying out fusion processing on the text information, the position information, the typesetting information, each text block, the structural information of each text block and the image features by utilizing a multi-mode transformation encoder model to acquire a fusion processing result.
In some embodiments, the image features of the document image to be processed may include global features and local features of the document image to be processed (equivalent to the global image features and local image features in other embodiments). After the image features of the document image to be processed are obtained, the text information, the position information, the typesetting information, each text block, the structure information of each text block, and the image features may be fused by using a multi-modal transformer encoder model, so as to obtain a processing result in which the various features are fused.
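The fusion input of step S403 can be pictured as combining per-token features from each modality into one vector. The sketch below shows fusion as plain concatenation, which is an assumption for illustration; the actual model uses learned embeddings and a spatial-aware self-attention encoder, not shown here.

```python
# Illustrative sketch of multi-modal fusion input: per-token feature
# vectors from text, position, layout, and image modalities are
# concatenated into one vector fed to the transformer encoder.

def fuse_token_features(text_feat, pos_feat, layout_feat, image_feat):
    """Concatenate per-token feature vectors from each modality."""
    return text_feat + pos_feat + layout_feat + image_feat

fused = fuse_token_features(
    text_feat=[0.2, 0.7],    # e.g. token embedding
    pos_feat=[0.1, 0.9],     # e.g. normalized bounding-box coordinates
    layout_feat=[0.0, 1.0],  # e.g. reading-order / typesetting encoding
    image_feat=[0.5, 0.5],   # e.g. local visual feature of the token region
)
```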
S404, classifying the fusion processing result by using a semantic entity recognition model to obtain the category (corresponding to the object type corresponding to each text block in other embodiments) to which each text block belongs.
In some embodiments, the semantic entity identification (Semantic Entity Recognition) model may enable determination of the category to which the text block belongs based on information such as semantics, and the category to which each text block belongs may include Header, question, answer, and the like. In practice, the semantic entity identification model can analyze each text block by combining text information, position information, typesetting information, image characteristics and the like, so that the respective corresponding category of each text block is determined.
S405, matching processing is carried out on the text blocks with the types of questions and the text blocks with the types of answers by using the relation extraction model, and a matching processing result is obtained.
In some embodiments, there may be one or more text blocks whose category is the question type and one or more text blocks whose category is the answer type. After the category to which each text block belongs is determined, matching processing may be performed on the text blocks of the question type and the text blocks of the answer type by using a Relation Extraction model (which corresponds to "matching the first text block of the question type and the second text block of the answer type" or "matching the third text block of the question type and the fourth text block of the answer type" in other embodiments), so as to obtain a matching processing result, which includes the matching relationship between the text blocks of the question type and the text blocks of the answer type.
S406, determining the category and the matching processing result of each text block as the processing result of the document image to be processed.
In some embodiments, after determining the category to which each text block belongs and the pairing relationship between the text block of the question type and the text block of the answer type, a processing result of the document image to be processed is obtained.
In some embodiments, the document image processing method based on text serialization and structure information provided in the embodiments of the present application may be implemented by using the flow shown in fig. 3, and the document image processing flow provided in the embodiments of the present application will be described below by taking fig. 3 as an example.
S1, an image feature extraction module extracts image features of the document image to be processed, and character recognition is performed on the document image to be processed by using an OCR model.
In some embodiments, the image features (including local features and global features) of the document image to be processed may be obtained by the image feature extraction module; through the OCR module, the text token information (text features), position information (text position features), and layout information (layout features) of the document image to be processed as a whole can be obtained.
And S2, carrying out serialization processing on the text blocks by using a text serialization model to obtain serialized text features.
In some embodiments, a generic text serialization model may be trained separately on the text token information (a publicly labeled dataset may be used for training), so as to prevent the incorrect reading order and structure information caused by content folding.
And S3, performing fusion processing on the image features, the serialized text features, the position features of the text, and the layout features of the document by using a Multi-Modal Transformer Encoder with Spatial-Aware Self-Attention, and obtaining a fusion processing result.
And S4, classifying the fusion processing result by utilizing a semantic entity identification (Semantic Entity Recognition) model to obtain the category corresponding to each text block.
In some embodiments, the Semantic Entity Recognition model performs the classification task. Three categories are available after classification: Header, Question, and Answer, and each text block is assigned to one of these three categories.
Step S5, a relation extraction and assembly module and a dual affine attention classifier (Biaffine Attention Classifier) module pair the text blocks with the category of Question and the text blocks with the category of Answer.
In some embodiments, relation Extraction is a relational extraction assembly module, pairs text blocks corresponding to the classified questions and text blocks corresponding to the Answer, and classifies the text blocks by Biaffine Attention Classifier module, and takes the pairing with high confidence as the final Question-Answer (or key-value) pairing.
It can be understood that the image features are extracted at multiple scales, so image information at each scale is taken into account, and the various features are used more effectively through multi-modal feature fusion, reducing the over-fitting problem. Serializing the text blocks avoids OCR output that does not accord with reading habits or that carries incorrect semantic and structure information. Adding the layout information of the document locates the relations among text blocks more accurately and provides good auxiliary features for the downstream Relation Extraction task, effectively alleviating the over-fitting problem and thereby improving the accuracy of the processing result of the document image to be processed.
The present application further provides a processing apparatus, and fig. 4 is a schematic structural diagram of the processing apparatus provided in the embodiment of the present application, as shown in fig. 4, where the processing apparatus 500 includes:
a first obtaining module 501, configured to obtain an object to be processed, perform character recognition on the object to be processed, and obtain first information in the object to be processed;
the second obtaining module 502 is configured to perform serialization processing on the first information to obtain at least one text block and structural information of each text block, where the text block is a text set with correct and complete semantic information;
A processing module 503, configured to perform classification processing and matching processing on the at least one text block based on the first information, the at least one text block, and structure information of each text block, to obtain a classification result and a matching result;
a first determining module 504, configured to determine the classification result and the matching result as a processing result of the object to be processed.
In some embodiments, the first information includes semantic information, position information and typesetting information of the object to be processed corresponding to each text; the second acquisition module 502 may include:
the first acquisition sub-module is used for acquiring semantic information, position information and typesetting information of the object to be processed corresponding to each word in the object to be processed;
dividing the characters in the object to be processed into at least one text block according to the semantic information, the position information and the typesetting information;
and the first determining submodule is used for determining the structure information of each text block based on the semantic information and the position information corresponding to each text block.
In some embodiments, the partitioning submodule includes:
the first obtaining unit is used for obtaining first semantic information, first position information and first typesetting information corresponding to the ith character, and second semantic information, second position information and second typesetting information corresponding to the (i+1) th character, wherein i=1, …, N and N are the numbers of characters in the object to be processed;
The first determining unit is configured to determine that the i-th character and the (i+1)-th character belong to the same text block when it is determined that the first semantic information and the second semantic information meet a first matching condition, the first typesetting information and the second typesetting information are the same, and the first position information and the second position information meet a second matching condition.
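The block-division rule can be sketched as below. The concrete conditions (a shared topic label for the semantic match, identical font and size for the typesetting match, a horizontal gap under a threshold for the position match) are illustrative assumptions; the patent only specifies that the two matching conditions and the typesetting equality must hold.

```python
# Sketch: adjacent characters are merged into one text block when their
# semantic information matches, their typesetting is identical, and their
# positions satisfy a proximity condition.

def same_block(ch_a, ch_b, max_gap=5):
    semantic_ok = ch_a["topic"] == ch_b["topic"]       # first matching condition
    layout_ok = (ch_a["font"], ch_a["size"]) == (ch_b["font"], ch_b["size"])
    position_ok = ch_b["x0"] - ch_a["x1"] <= max_gap   # second matching condition
    return semantic_ok and layout_ok and position_ok

def split_into_blocks(chars):
    """Scan the i-th and (i+1)-th characters, merging while the conditions hold."""
    blocks, current = [], [chars[0]]
    for prev, nxt in zip(chars, chars[1:]):
        if same_block(prev, nxt):
            current.append(nxt)
        else:
            blocks.append(current)
            current = [nxt]
    blocks.append(current)
    return blocks

chars = [
    {"text": "T", "topic": "q", "font": "A", "size": 10, "x0": 0, "x1": 8},
    {"text": "o", "topic": "q", "font": "A", "size": 10, "x0": 9, "x1": 16},
    {"text": "X", "topic": "a", "font": "B", "size": 12, "x0": 60, "x1": 70},
]
blocks = split_into_blocks(chars)
```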
In some embodiments, the processing module 503 includes:
the second acquisition sub-module is used for carrying out fusion processing on the first information, the at least one text block and the structure information of each text block to obtain a first fusion processing result;
the prediction sub-module is used for predicting the first fusion processing result by using a trained first classification model to obtain object types corresponding to each text block, wherein the object types corresponding to each text block are used as classification results, and the object types at least comprise question types or answer types;
the first matching processing sub-module is used for carrying out matching processing on the first text block of the question type and the second text block of the answer type to obtain a matching processing result, and the matching processing result comprises the corresponding relation between the first text block and the second text block.
In some embodiments, the processing device 500 further comprises:
the third acquisition module is used for acquiring the integral image characteristics and the local image characteristics corresponding to the object to be processed;
the fusion processing module is used for carrying out fusion processing on the integral image characteristics, the local image characteristics, the first information, the at least one text block and the structure information of each text block to obtain a second fusion processing result;
the prediction module is used for predicting the second fusion processing result by using the trained second classification model to obtain the object type corresponding to each text block, where the object type at least includes a question type or an answer type;
the matching processing module is used for carrying out matching processing on the third text block of the question type and the fourth text block of the answer type to obtain a matching processing result, wherein the matching processing result comprises the corresponding relation between the third text block and the fourth text block;
a second determining module, configured to determine the object types and the correspondences corresponding to the text blocks as the processing result of the document to be processed.
In some embodiments, the third text block and the fourth text block each include at least one; the matching processing module comprises:
The second matching processing submodule is used for respectively performing matching processing on the j-th third text block and each fourth text block to obtain each matching value, where j=1, …, M, and M is the number of third text blocks;
and the second determining submodule is used for determining the matching value with the largest score in the matching values, determining the fourth text block corresponding to the matching value with the largest score as a target text block matched with the j-th third text block, and taking the corresponding relation between the target text block and the j-th third text block as a matching processing result.
In some embodiments, the second matching processing submodule includes:
the second acquisition unit is used for acquiring third semantic information corresponding to the j-th third text block and fourth semantic information corresponding to each fourth text block;
a second determining unit, configured to respectively determine the degree of matching between the third semantic information and the fourth semantic information;
and a third determining unit, configured to determine the matching value corresponding to each matching degree based on a preset relationship between the matching degree and the matching value.
It should be noted that, the description of the processing device in the embodiment of the present application is similar to the description of the embodiment of the method described above, and has similar beneficial effects as the embodiment of the method, so that a detailed description is omitted. For technical details not disclosed in the embodiments of the present apparatus, please refer to the description of the embodiments of the method of the present application for understanding.
In the embodiments of the present application, if the processing method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may, in essence or in the part contributing to the related art, be embodied as a computer software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program code. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the processing method provided in the above embodiments.
The embodiment of the application also provides processing equipment. Fig. 5 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application, as shown in fig. 5, where the processing apparatus 600 includes: a memory 601, a processor 602, a communication interface 603 and a communication bus 604. Wherein the memory 601 is configured to store executable processing instructions; the processor 602 is configured to execute the executable processing instructions stored in the memory to implement the processing method provided in the above embodiment.
The description of the processing device and the storage medium embodiments above is similar to that of the method embodiments described above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the processing apparatus and the storage medium of the present application, please refer to the description of the method embodiments of the present application for understanding.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising at least one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
One of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
Alternatively, the integrated units described above may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in essence or in a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a product to perform all or part of the methods described in the various embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The foregoing is merely an embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art will easily think about changes or substitutions within the technical scope of the present application, and should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of processing, comprising:
obtaining an object to be processed, and performing character recognition on the object to be processed to obtain first information in the object to be processed;
serializing the first information to obtain at least one text block and structural information of each text block, wherein the text block is a text set with correct and complete semantic information;
based on the first information, the at least one text block and the structure information of each text block, carrying out classification processing and matching processing on the at least one text block to obtain a classification result and a matching result;
and determining the classification result and the matching result as the processing result of the object to be processed.
2. The method of claim 1, wherein the first information includes semantic information, position information and typesetting information of the object to be processed corresponding to each text;
Serializing the first information to obtain at least one text block and structure information of each text block, wherein the method comprises the following steps:
obtaining semantic information, position information and typesetting information of the object to be processed corresponding to each word in the object to be processed;
dividing the characters in the object to be processed into at least one text block according to the semantic information, the position information and the typesetting information;
and determining the structure information of each text block based on the semantic information and the position information corresponding to each text block.
3. The method of claim 2, wherein the dividing the text in the object to be processed into at least one text block according to the semantic information, the location information, and the typesetting information comprises:
obtaining first semantic information, first position information and first typesetting information corresponding to an i-th character, and second semantic information, second position information and second typesetting information corresponding to an (i+1)-th character, where i=1, …, N, and N is the number of characters in the object to be processed;
determining that the first semantic information and the second semantic information meet a first matching condition, the first typesetting information and the second typesetting information are the same, the first position information and the second position information meet a second matching condition, and determining that the ith text and the (i+1) th text belong to the same text block.
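The pairwise test of claim 3 can be sketched as follows. This is a minimal illustration only: the feature encodings, the `semantics_match`/`positions_match` condition tests, and the `max_gap` threshold are all assumptions, not the claimed matching conditions, which the patent leaves to a trained implementation.

```python
def semantics_match(a, b):
    # First matching condition (assumed): identical semantic labels.
    return a == b

def positions_match(pa, pb, max_gap=12):
    # Second matching condition (assumed): same line, small horizontal gap.
    (xa, ya), (xb, yb) = pa, pb
    return ya == yb and 0 <= xb - xa <= max_gap

def segment_into_blocks(chars):
    """chars: list of dicts with 'text', 'semantic', 'position', 'typesetting'."""
    blocks = []
    for ch in chars:
        if blocks:
            prev = blocks[-1][-1]
            # i-th and (i+1)-th characters join the same block only when all
            # three conditions of claim 3 hold.
            if (semantics_match(prev["semantic"], ch["semantic"])
                    and prev["typesetting"] == ch["typesetting"]
                    and positions_match(prev["position"], ch["position"])):
                blocks[-1].append(ch)
                continue
        blocks.append([ch])  # otherwise start a new text block
    return ["".join(c["text"] for c in blk) for blk in blocks]
```

A single left-to-right pass suffices because claim 3 only compares adjacent characters.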
4. The method of claim 1, wherein classifying and matching the at least one text block based on the first information, the at least one text block, and the structure information of each text block to obtain a classification result and a matching result comprises:
fusing the first information, the at least one text block, and the structure information of each text block to obtain a first fusion result;
predicting on the first fusion result with a trained first classification model to obtain an object type for each text block as the classification result, the object types comprising at least a question type and an answer type;
and matching a first text block of the question type with a second text block of the answer type to obtain a matching result, the matching result comprising a correspondence between the first text block and the second text block.
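The fuse–classify–match pipeline of claim 4 can be outlined as below. The fusion scheme, the colon heuristic standing in for the trained first classification model, and the order-based pairing are all illustrative assumptions; the patent's matching is score-based (claim 6).

```python
def fuse(first_info, blocks, structures):
    # Fusion (assumed): pair each text block with its structure information
    # and the document-level first information.
    return [{"block": b, "structure": s, "doc": first_info}
            for b, s in zip(blocks, structures)]

def classify(fused):
    # Stand-in for the trained first classification model: blocks ending in
    # ':' are labeled question type, all others answer type.
    return ["question" if f["block"].endswith(":") else "answer" for f in fused]

def match_questions_to_answers(blocks, labels):
    questions = [b for b, l in zip(blocks, labels) if l == "question"]
    answers = [b for b, l in zip(blocks, labels) if l == "answer"]
    # Naive pairing by reading order; yields the question-to-answer
    # correspondence that forms the matching result.
    return dict(zip(questions, answers))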
5. The method of claim 1, further comprising:
obtaining global image features and local image features corresponding to the object to be processed;
fusing the global image features, the local image features, the first information, the at least one text block, and the structure information of each text block to obtain a second fusion result;
predicting on the second fusion result with a trained second classification model to obtain an object type for each text block, the object types comprising at least a question type and an answer type;
matching a third text block of the question type with a fourth text block of the answer type to obtain a matching result, the matching result comprising a correspondence between the third text block and the fourth text block;
and determining the object type of each text block and the correspondence as the processing result of the object to be processed.
6. The method of claim 5, wherein there is at least one third text block and at least one fourth text block, and matching the third text block of the question type with the fourth text block of the answer type to obtain a matching result comprises:
matching a j-th third text block against each fourth text block to obtain respective matching values, where j = 1, …, M and M is the number of third text blocks;
and determining the largest of the matching values, determining the fourth text block corresponding to the largest matching value as the target text block matched with the j-th third text block, and taking the correspondence between the target text block and the j-th third text block as the matching result.
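The argmax selection of claim 6 can be sketched as follows. The Jaccard-style `match_value` score is a placeholder assumption; in the patent the matching value derives from the semantic matching degree of claim 7.

```python
def match_value(third_block, fourth_block):
    # Placeholder score (assumed): character-set overlap between blocks.
    common = set(third_block) & set(fourth_block)
    union = set(third_block) | set(fourth_block)
    return len(common) / max(len(union), 1)

def best_matches(third_blocks, fourth_blocks):
    result = {}
    for q in third_blocks:  # the j-th third text block, j = 1, ..., M
        scores = [match_value(q, a) for a in fourth_blocks]
        # Target text block = fourth block with the largest matching value.
        best = max(range(len(scores)), key=scores.__getitem__)
        result[q] = fourth_blocks[best]
    return result
```

Ties go to the earliest fourth text block, a detail the claim leaves unspecified.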
7. The method of claim 6, wherein matching the j-th third text block against each fourth text block to obtain respective matching values comprises:
obtaining third semantic information corresponding to the j-th third text block and fourth semantic information corresponding to each fourth text block;
determining the matching degree between the third semantic information and each fourth semantic information;
and determining the matching value corresponding to each matching degree based on a preset relation between matching degree and matching value.
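Claim 7's two-step mapping, a matching degree followed by a preset degree-to-value relation, might look like this. Cosine similarity over semantic vectors and the bucket thresholds are both assumptions; the patent does not specify either.

```python
import math

def cosine_similarity(u, v):
    # Matching degree (assumed): cosine similarity of two semantic vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def matching_value(degree):
    # Preset relation between matching degree and matching value (assumed):
    # bucket the continuous degree into discrete matching values.
    if degree >= 0.8:
        return 3
    if degree >= 0.5:
        return 2
    if degree >= 0.2:
        return 1
    return 0
```

Keeping the degree-to-value relation as a separate lookup, as the claim does, lets it be retuned without touching the similarity computation.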
8. A processing apparatus, comprising:
a first acquisition module configured to acquire an object to be processed, perform character recognition on the object to be processed, and acquire first information in the object to be processed;
a second acquisition module configured to serialize the first information to obtain at least one text block and structure information of each text block, wherein a text block is a set of text with correct and complete semantic information;
a processing module configured to classify and match the at least one text block based on the first information, the at least one text block, and the structure information of each text block to obtain a classification result and a matching result;
and a first determination module configured to determine the classification result and the matching result as the processing result of the object to be processed.
9. A processing apparatus, comprising:
a memory for storing executable processing instructions;
a processor configured to implement the processing method of any one of claims 1 to 7 when executing the executable processing instructions stored in said memory.
10. A computer readable storage medium storing processing instructions for causing a processor to perform the method of any one of claims 1 to 7.
CN202211691412.5A 2022-12-27 2022-12-27 Processing method, processing device, processing equipment and computer readable storage medium Pending CN116030469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211691412.5A CN116030469A (en) 2022-12-27 2022-12-27 Processing method, processing device, processing equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116030469A true CN116030469A (en) 2023-04-28

Family

ID=86073545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211691412.5A Pending CN116030469A (en) 2022-12-27 2022-12-27 Processing method, processing device, processing equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116030469A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824609A (en) * 2023-06-29 2023-09-29 北京百度网讯科技有限公司 Document format detection method and device and electronic equipment
CN116824609B (en) * 2023-06-29 2024-05-24 北京百度网讯科技有限公司 Document format detection method and device and electronic equipment

Similar Documents

Publication Publication Date Title
US10846553B2 (en) Recognizing typewritten and handwritten characters using end-to-end deep learning
CN111931664B (en) Mixed-pasting bill image processing method and device, computer equipment and storage medium
US10915788B2 (en) Optical character recognition using end-to-end deep learning
US8494273B2 (en) Adaptive optical character recognition on a document with distorted characters
CN112464781B (en) File image key information extraction and matching method based on graphic neural network
US11055327B2 (en) Unstructured data parsing for structured information
CN111340037B (en) Text layout analysis method and device, computer equipment and storage medium
Hazra et al. Optical character recognition using KNN on custom image dataset
CN112508011A (en) OCR (optical character recognition) method and device based on neural network
US11379690B2 (en) System to extract information from documents
CN112801084A (en) Image processing method and device, electronic equipment and storage medium
CN116152840A (en) File classification method, apparatus, device and computer storage medium
WO2023038722A1 (en) Entry detection and recognition for custom forms
CN113673294B (en) Method, device, computer equipment and storage medium for extracting document key information
CN116030469A (en) Processing method, processing device, processing equipment and computer readable storage medium
US11341760B2 (en) Form processing and analysis system
Al Ghamdi A novel approach to printed Arabic optical character recognition
CN111553361B (en) Pathological section label identification method
CN116822634A (en) Document visual language reasoning method based on layout perception prompt
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
Ashraf et al. An analysis of optical character recognition (ocr) methods
CN112101356A (en) Method and device for positioning specific text in picture and storage medium
Wattar Analysis and Comparison of invoice data extraction methods
EP4379678A1 (en) Image processing system, image processing method, and program
US20230140546A1 (en) Randomizing character corrections in a machine learning classification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination