CN112612911A - Image processing method, system, device and medium, and program product - Google Patents


Info

Publication number
CN112612911A
Authority
CN
China
Prior art keywords
document
image
corner points
classification result
image processing
Prior art date
Legal status
Pending
Application number
CN202011623342.0A
Other languages
Chinese (zh)
Inventor
黄永帅
潘乐萌
张资殷
都林
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45 Clustering; Classification
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval characterised by using metadata automatically derived from the content

Abstract

The application provides an image processing method applied to the field of artificial intelligence. The method comprises: obtaining an image comprising one or more document objects, extracting semantic features of each document object from the text in the document object, and obtaining a classification result of the document object according to the visual features and the semantic features of the document object. By combining semantic features with visual features, the method provides the classifier with more information, improves classification accuracy, avoids recognition errors in a downstream character recognition model caused by mismatched input types, and meets service requirements. Moreover, the method supports automatic extraction of the semantic features used for classification, so that end-to-end classification is realized without human intervention, which improves classification efficiency and reduces classification cost.

Description

Image processing method, system, device and medium, and program product
Technical Field
The present application relates to the field of Artificial Intelligence (AI), and in particular, to an image processing method, system, device, computer-readable storage medium, and computer program product.
Background
With the rapid development of optical character recognition (OCR) technology, applications that use OCR instead of manual work to recognize and process text information in images are becoming increasingly widespread. To realize batch and automatic processing, input images can first be classified into different types, such as cards and bills, and then a dedicated OCR engine corresponding to the type, such as a bill recognition engine, can be used for character recognition.
To this end, it has been proposed in the industry to automatically detect document objects such as cards, tickets, and mails in images, and to segment and classify the document objects using object detection and classification techniques.
Currently, a relatively common object detection and classification technique in the industry is the mask region-based convolutional neural network (Mask R-CNN). However, the accuracy of document object classification by techniques such as Mask R-CNN is low, and it is difficult to meet business requirements.
Disclosure of Invention
The application provides an image processing method that classifies a document object by combining semantic features with visual features. On the one hand, this improves classification accuracy and thus the recognition rate of the text in the document object; on the other hand, classification can be performed automatically without manual intervention, which improves classification efficiency. The application also provides a system, a device, a computer-readable storage medium, and a computer program product corresponding to the method.
In a first aspect, the present application provides an image processing method. The method may be performed by an image processing system. The image processing system may be deployed at a server, such as a cloud server or a physical server. The cloud server includes a server in a public cloud, a private cloud, or a hybrid cloud.
Specifically, the image processing system acquires an image, the image includes one or more document objects, and the image processing system may acquire semantic features of the document objects according to texts in the document objects and then acquire classification results of the document objects according to the visual features and the semantic features of the document objects. Wherein the classification result comprises a category label of the document object.
The method classifies the document targets by combining the semantic features on the basis of the visual features, provides more information for the classifier, can improve the classification accuracy, avoids the recognition error of a downstream character recognition model caused by input type mismatching, and meets the business requirements.
Moreover, the method can support automatic extraction of semantic features for classification, so that end-to-end classification is realized without human intervention, such as manual frame selection of classification keywords or manual classification, the classification efficiency is improved, and the classification cost is reduced.
In some possible implementations, the image processing system may obtain a corresponding confidence level when classifying the document objects based on the visual features. The confidence level is an empirically determined probability value that characterizes the degree of plausibility. Similarly, the image processing system may obtain a corresponding confidence level when classifying document objects based on semantic features. To improve classification accuracy, the image processing system may obtain a classification result for the document object based on the confidence determined by the visual features and the confidence determined by the semantic features.
In some possible implementations, the image processing system may classify the document object according to a visual feature of the document object, such as any one or more of a color feature, a texture feature, a shape feature, or a spatial relationship feature, to obtain a first class label of the document object. When the confidence corresponding to the first class label is smaller than a preset threshold, the credibility of the first class label is low, that is, the probability that the first class label is correct is low; the image processing system can then classify according to both the visual feature and the semantic feature to obtain the final classification result, thereby correcting the classification result with the semantic feature and improving classification accuracy.
Conversely, when the confidence corresponding to the first class label is greater than or equal to the preset threshold, the credibility of the first class label is high, that is, the probability that the first class label is correct is high, and the image processing system may directly use the first class label as the final classification result.
Therefore, on the premise of ensuring the accuracy of document target classification, additional classification operation is avoided, the classification efficiency is improved, and computing resources (such as computing resources required by text classification) are saved.
In some possible implementations, the image processing system may obtain a first confidence that the document object belongs to different categories according to the visual features of the document object, and obtain a second confidence that the document object belongs to different categories according to the semantic features of the document object, and then the image processing system may obtain the classification result of the document object according to the weighted operation result of the first confidence and the second confidence.
For convenience of description, the result of the weighted operation of the first confidence and the second confidence may be referred to as a third confidence. The result of the weighting operation may be a weighted average or the result of other weighting operations. The third confidence is obtained based on the visual feature and semantic feature classification, so that the third confidence can represent the comprehensive confidence of the document target belonging to different categories. In this way, the image processing system can obtain the classification result of the document object according to the comprehensive confidence, for example, the classification with the highest comprehensive confidence can be used as the final classification result.
In this method, the weights of the first confidence and the second confidence can be determined according to how much the visual features and the semantic features each contribute to the classification. Determined in this way, the weighted result of the first confidence and the second confidence accurately reflects the comprehensive confidence, and the final classification result determined from the comprehensive confidence is more reliable.
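As an illustration of the weighted fusion described above, the following Python sketch combines the per-class confidences produced by a visual classifier and a text classifier. The weights, the preset threshold, the class names, and the function name are assumptions chosen for illustration, not values from this application.

```python
import numpy as np

CLASSES = ["card", "ticket", "label", "mail", "document"]  # illustrative categories

def fuse_classification(visual_conf, semantic_conf,
                        visual_weight=0.6, semantic_weight=0.4,
                        threshold=0.8):
    """Combine visual and semantic per-class confidences (all values are assumptions).

    If the best visual confidence already reaches the threshold, the visual result
    is used directly; otherwise a weighted average (the "third confidence")
    determines the final class label.
    """
    visual_conf = np.asarray(visual_conf, dtype=float)
    semantic_conf = np.asarray(semantic_conf, dtype=float)

    best_visual = int(np.argmax(visual_conf))
    if visual_conf[best_visual] >= threshold:
        # Visual classification is already credible; skip text classification.
        return CLASSES[best_visual], float(visual_conf[best_visual])

    # Weighted operation over the first and second confidences.
    fused = visual_weight * visual_conf + semantic_weight * semantic_conf
    best = int(np.argmax(fused))
    return CLASSES[best], float(fused[best])

# Example: visual features alone are ambiguous between "ticket" and "document".
label, conf = fuse_classification([0.1, 0.45, 0.05, 0.05, 0.35],
                                  [0.02, 0.80, 0.03, 0.05, 0.10])
print(label, round(conf, 2))  # ticket 0.59
```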
In some possible implementations, the classification result further includes a confidence corresponding to the category label, and the image processing system may further output the category label and its corresponding confidence. In this way, more information is provided to the user, giving the user a reference for deciding whether to use the category label.
In some possible implementations, when the first confidence is smaller than a preset threshold, the image processing system may output a first category label determined according to the visual feature and a first confidence corresponding to the first category label, a second category label determined according to the semantic feature and a second confidence corresponding to the second category label, a third category label determined according to the visual feature and the semantic feature and a third confidence corresponding to the third category label. Further, when the first confidence is greater than or equal to the preset threshold, the image processing system may output a first category label determined according to the visual features and a first confidence corresponding to the first category label.
That is, whether the image processing system classified in conjunction with semantic features can also be determined from the number of output class labels or confidences. When three class labels or three confidences are output, the image processing system combined semantic features to classify the current document object; when one class label or one confidence is output, it did not.
In some possible implementations, when the image includes multiple document objects, the image processing system may further detect the document objects in the image according to their visual features, thereby obtaining bounding boxes of the multiple document objects. A bounding box is a line box that encloses a document object. The bounding box can be used to segment or crop a document object to obtain a partial image including the document object, and performing recognition based on the partial image can improve the recognition rate.
In some possible implementations, the bounding box is obtained from initial corner points based on regression of the visual features. Compared with the traditional rectangular frame, the bounding box obtained according to the initial corner points is closer to the outline of the document target, so that the document target can be accurately positioned. Character recognition based on the bounding box can avoid character interference outside the bounding box, and is helpful for improving recognition rate.
In some possible implementations, there are typically four initial corner points for the bounding box of a document object. The image processing system may perform edge regression based on the visual features to determine edge information of the bounding box. The edge information may in particular be represented by edge pixels (which typically form very narrow bands of pixels). Since edge pixels are difficult to express mathematically in a direct way, the image processing system may process the edge pixels in the whole image using a Hough transform or a similar method, specifically extracting straight lines from the edge pixels to obtain edge lines. The image processing system may correct the initial corner points using the intersection points of the edge lines to obtain corrected corner points; the number of corrected corner points is greater than or equal to four. The image processing system may then obtain the bounding box according to the corrected corner points, for example by connecting the corrected corner points in sequence.
Since the image processing system also takes the edge information into account when regressing the bounding box of the document object, the bounding box determined by this method is more accurate. For example, when a corner of a document object such as an invoice is missing from the image, the image processing system can regress a bounding box that reflects the missing corner, specifically a pentagonal bounding box.
In some possible implementations, the image processing system may determine the correction manner for the initial corner points according to the number of intersection points of the edge lines, and perform correction accordingly. Specifically, when the number of intersection points of the edge lines is greater than the number of initial corner points, for example greater than four, the image processing system may directly replace the initial corner points with the intersection points of the edge lines as the corrected corner points, and obtain the bounding box of the document object from the polygon, such as a pentagon, formed by the corrected corner points. When the number of intersection points of the edge lines is less than or equal to the number of initial corner points, the image processing system may search for intersection points of the edge lines within a preset radius around each initial corner point; if such an intersection point exists, the corrected corner point is obtained from the initial corner point and the intersection point of the edge lines, for example by taking the midpoint between the initial corner point and the intersection point as the corrected corner point.
In some possible implementations, the image processing system may further process the edge lines. For example, in the candidate region of the document object, all straight lines within a certain range of the candidate region are connected, filtered, and extended to obtain the intersection points of the edge lines. The image processing system may connect straight lines with similar orientations (e.g., an included angle smaller than 10°) within this range, filter out straight lines shorter than a preset length, and extend the remaining straight lines, thereby obtaining a plurality of intersection points.
The range of the candidate region may be an expanded range, for example, a range of expanding the length and width of the candidate region by 0.1 times. The image processing system expands the range of line selection to prevent the edge line from completely coinciding with the edge of the region or the resulting intersection point from being outside the current region.
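A sketch of how the edge-line extraction and corner correction could be implemented with OpenCV is given below; the Hough-transform parameters, the search radius, and the minimum line length are assumed values, and the function structure is illustrative rather than the implementation of this application.

```python
import cv2
import numpy as np

def line_intersection(l1, l2):
    """Intersection of two infinite lines given as (x1, y1, x2, y2), or None if parallel."""
    x1, y1, x2, y2 = l1
    x3, y3, x4, y4 = l2
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(d) < 1e-6:
        return None
    px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / d
    py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / d
    return (px, py)

def correct_corners(edge_map, initial_corners, search_radius=30, min_len=40):
    """Correct the four initial corner points using intersections of edge lines.

    edge_map: binary edge image regressed from the visual features (assumption).
    initial_corners: list of four (x, y) tuples.
    """
    # Extract straight lines from the edge pixels (Hough transform).
    lines = cv2.HoughLinesP(edge_map, 1, np.pi / 180, threshold=50,
                            minLineLength=min_len, maxLineGap=10)
    if lines is None:
        return initial_corners
    lines = [l[0] for l in lines]

    # Collect pairwise intersections of the (extended) edge lines; in practice these
    # would be limited to the expanded candidate region of the document object.
    intersections = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            p = line_intersection(lines[i], lines[j])
            if p is not None:
                intersections.append(p)

    if len(intersections) > len(initial_corners):
        # More intersections than initial corners: use them directly (e.g. a pentagon).
        return intersections
    # Otherwise, search near each initial corner and move it toward a nearby intersection.
    corrected = []
    for cx, cy in initial_corners:
        near = [(px, py) for px, py in intersections
                if (px - cx) ** 2 + (py - cy) ** 2 <= search_radius ** 2]
        if near:
            px, py = near[0]
            corrected.append(((cx + px) / 2, (cy + py) / 2))  # midpoint as corrected corner
        else:
            corrected.append((cx, cy))
    return corrected
```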
In some possible implementations, the text in the document target may be determined from the bounding box. In particular, the image processing system may determine the text in the document target based on the overlap area ratio between the bounding box and a text box in the image. The overlap area ratio may be the ratio of the area of the overlap between the bounding box and the text box to the area of the text box. When the overlap area ratio reaches a preset ratio, the image processing system may determine that the text in the text box belongs to the document target; when it does not reach the preset ratio, the image processing system may determine that the text in the text box does not belong to the document target. In this way, the text in the document target can be accurately obtained, and the semantic features extracted from this text improve classification accuracy.
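A minimal sketch of the overlap-ratio test is shown below; axis-aligned boxes and the value of the preset ratio are assumptions.

```python
def overlap_ratio(bbox, text_box):
    """Ratio of the overlap area between bounding box and text box to the text box area.

    Both boxes are axis-aligned (x_min, y_min, x_max, y_max), an assumption made here
    for simplicity.
    """
    ix_min = max(bbox[0], text_box[0])
    iy_min = max(bbox[1], text_box[1])
    ix_max = min(bbox[2], text_box[2])
    iy_max = min(bbox[3], text_box[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    text_area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    return inter / text_area if text_area > 0 else 0.0

def texts_in_document(bbox, text_boxes, texts, preset_ratio=0.7):
    """Collect the texts whose boxes overlap the document bounding box sufficiently.

    preset_ratio is an assumed threshold, not a value given in this application.
    """
    return [t for box, t in zip(text_boxes, texts)
            if overlap_ratio(bbox, box) >= preset_ratio]
```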
In some possible implementations, the image processing system may output bounding boxes of the plurality of document objects according to the position information of the corner points. The position information of the corner points comprises position information of initial corner points or position information of corrected corner points. When the initial corner point is corrected, the bounding box of the document target can be output according to the position information of the corrected corner point, and when the initial corner point is not corrected, the bounding box of the document target can be output according to the position information of the initial corner point.
The image processing system may output the bounding box in the form of a line box in the image, or may output the bounding box in the form of text (e.g., coordinates of corner points). The bounding box is output in a line frame mode, so that the method is more intuitive, and a user can conveniently and quickly judge whether the output is correct.
In some possible implementations, the document object in the image may be rotated because of the shooting angle and other factors, and accordingly the characters in the document object may differ in size. For example, when a presentation is photographed from the front left of a classroom, the characters at the lower left of the resulting image are relatively large and the characters at the upper right are relatively small. The image processing system can perform perspective transformation using a perspective transformation matrix determined from perspective information, for example the perspective information retained in the bounding box, to obtain the perspective-transformed text, in which the characters have the same size.
In some possible implementations, the document object in the image may be rotated because of the shooting angle and other factors, and the bounding box obtained by the image processing system may therefore have different widths at the top and the bottom. The image processing system can also perform perspective transformation on the bounding box according to the perspective information retained in the bounding box, so that the transformed bounding box is a rectangle. Unifying the bounding boxes into rectangles facilitates character recognition by the downstream character recognition model and improves the recognition rate.
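A minimal sketch of such a perspective correction using OpenCV is shown below; the corner ordering convention and the way the output size is computed are assumptions.

```python
import cv2
import numpy as np

def rectify_document(image, corners):
    """Warp the quadrilateral bounding box given by four corners into a rectangle.

    corners: four (x, y) points ordered top-left, top-right, bottom-right, bottom-left
    (an assumed convention).
    """
    src = np.array(corners, dtype=np.float32)
    # Use the longer pair of opposite edges as the target width and height.
    width = int(max(np.linalg.norm(src[0] - src[1]), np.linalg.norm(src[3] - src[2])))
    height = int(max(np.linalg.norm(src[0] - src[3]), np.linalg.norm(src[1] - src[2])))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(src, dst)  # perspective transformation matrix
    return cv2.warpPerspective(image, matrix, (width, height))
```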
In some possible implementations, the image processing system may also obtain structured information of the document object in the image based on the image and an optical character recognition model corresponding to the classification result. Therefore, the automatic extraction of the structured information can be realized, and further, the automatic input of the information, the auxiliary reading for the disabled or the filtering of forbidden words and the like can be realized.
In some possible implementations, the image includes a plurality of document objects. In this case, the image processing system may further obtain a plurality of partial images from the image, for example by cropping the image according to the bounding boxes of the plurality of document objects, so that the partial images correspond one-to-one to the document objects. The image processing system inputs each partial image into the optical character recognition model corresponding to its classification result and obtains the structured information of the plurality of document objects. This prevents images containing different types of document objects from being fed into the same optical character recognition model and improves the recognition rate.
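The routing of partial images to per-category OCR engines might look like the following sketch; the engine registry, the detection format, and the function names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a category label to a dedicated OCR engine.
OCR_ENGINES: Dict[str, Callable] = {
    "ticket": lambda img: {"engine": "bill recognition engine", "fields": {}},
    "card": lambda img: {"engine": "work card recognition engine", "fields": {}},
    "label": lambda img: {"engine": "label recognition engine", "fields": {}},
}

def extract_structured_info(image, detections: List[dict]) -> List[dict]:
    """Crop each document object and send it to the OCR engine of its category.

    detections: list of {"bbox": (x_min, y_min, x_max, y_max), "label": str},
    as produced by the detection and classification steps (assumed format,
    integer pixel coordinates).
    """
    results = []
    for det in detections:
        x_min, y_min, x_max, y_max = det["bbox"]
        partial = image[y_min:y_max, x_min:x_max]   # partial image of one document object
        engine = OCR_ENGINES.get(det["label"])
        if engine is not None:
            results.append(engine(partial))          # structured information
    return results
```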
In some possible implementations, the document objects include the following categories: card, ticket, label, mail, or document. The card category includes cards and certificates such as work cards, business cards, nameplates, identification cards, driver's licenses, passes, passports, and business licenses; the ticket category includes invoices, checks, drafts, receipts, and the like; the label category includes labels on commodity packages, commodity price labels on shelves, and the like. Mail refers specifically to email. Document refers to posters, presentations, text documents, table documents, and the like.
The image processing method supports classification of different types of document targets, is wide in application range and high in usability, and can meet the classification requirements of different types of document targets.
In a second aspect, the present application provides an image processing method. The method may be performed by an image processing system. The image processing system may be deployed at a terminal. Terminals include, but are not limited to, desktop computers, notebook computers, smart phones, and the like. Specifically, an image processing system receives an image input by a user, the image comprises one or more document objects, then the image processing system outputs a classification result of the document objects, the classification result comprises class labels of the document objects, the classification result is obtained according to visual features of the document objects and semantic features of the document objects, and the semantic features of the document objects are obtained according to texts in the document objects.
The method supports automatic extraction of semantic features for classification, so that end-to-end classification is realized without human intervention, such as manual frame selection of classification keywords or manual classification, the classification efficiency is improved, and the classification cost is reduced. Moreover, the semantic features are used for correcting the classification result based on the visual features, so that the classification accuracy can be improved.
In some possible implementations, the classification result further includes a confidence level corresponding to the category label.
In some possible implementations, the image processing system may output the classification result by:
presenting the classification result of the document target to the user; or,
and outputting the classification result of the document target to a result file.
In some possible implementations, the method further includes:
and outputting the bounding box of the document target, wherein the bounding box is obtained according to the visual characteristics of the document target.
In some possible implementations, the bounding box is obtained from initial corner points based on regression of the visual features.
In some possible implementations, there are four initial corner points, the bounding box is obtained from corrected corner points, the number of corrected corner points is greater than or equal to four, the corrected corner points are obtained by correcting the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted from edge information regressed from the visual features.
In some possible implementations, the method further includes:
and acquiring the structured information of the document target in the image according to the image and an optical character recognition model corresponding to the classification result.
In a third aspect, the present application provides an image processing system. The system comprises:
a communication unit for acquiring an image, the image including one or more document objects;
the feature extraction unit is used for acquiring semantic features of the document target according to the text in the document target;
and the classification unit is used for obtaining a classification result of the document target according to the visual characteristic and the semantic characteristic of the document target, wherein the classification result comprises a class label of the document target.
In some possible implementations, the classification unit is specifically configured to:
obtaining a classification result of the document object according to the confidence determined by the visual feature and the confidence determined by the semantic feature, wherein the confidence is a probability value which is determined according to experience and is used for representing the credibility.
In some possible implementations, the classification result further includes a confidence level corresponding to the class label, and the communication unit is further configured to:
and outputting the class label and the confidence corresponding to the class label.
In some possible implementations, the communication unit is specifically configured to:
outputting a first class label determined according to the visual features and a first confidence degree corresponding to the first class label, a second class label determined according to the semantic features and a second confidence degree corresponding to the second class label, a third class label determined according to the visual features and the semantic features and a third confidence degree corresponding to the third class label, wherein the first confidence degree is smaller than a preset threshold value.
In some possible implementations, the system further includes:
and the object detection unit is used for obtaining the bounding boxes of the plurality of document objects according to the visual characteristics of the document objects.
In some possible implementations, the bounding box is obtained from initial corner points based on regression of the visual features.
In some possible implementations, there are four initial corner points, the bounding box is obtained from corrected corner points, the number of corrected corner points is greater than or equal to four, the corrected corner points are obtained by correcting the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted from edge information regressed from the visual features.
In some possible implementations, the text in the document object is determined from the bounding box.
In some possible implementations, the communication unit is further configured to:
and outputting the bounding boxes of the plurality of document objects according to the position information of the corner points.
In some possible implementations, the system further includes:
and the perspective transformation unit is used for obtaining the text after perspective transformation according to the perspective information, and the size of the characters in the text after perspective transformation is the same.
In some possible implementations, the system further includes:
and the perspective transformation unit is used for obtaining the bounding box after perspective transformation according to the perspective information, and the bounding box after the perspective transformation is a rectangle.
In some possible implementations, the system further includes:
and the identification unit is used for acquiring the structured information of the document target in the image according to the image and the optical character recognition model corresponding to the classification result.
In some possible implementations, the image includes a plurality of document objects, and the identifying unit is specifically configured to:
obtaining a plurality of local images according to the image, wherein the local images correspond to the document targets one by one;
and inputting the plurality of local images into the optical character recognition model corresponding to the classification result to obtain the structured information of the plurality of document targets.
In some possible implementations, the document objects include the following categories: card, ticket, label, mail, or document.
In a fourth aspect, the present application provides an image processing system. The system comprises:
a communication unit for receiving an image input by a user, the image including one or more document objects;
the communication unit is further configured to output a classification result of the document object, where the classification result includes a category label of the document object, the classification result is obtained according to a visual feature of the document object and a semantic feature of the document object, and the semantic feature of the document object is obtained according to a text in the document object.
In some possible implementations, the classification result further includes a confidence level corresponding to the category label.
In some possible implementations, the communication unit is specifically configured to:
presenting the classification result of the document target to the user; or,
and outputting the classification result of the document target to a result file.
In some possible implementations, the communication unit is further configured to:
and outputting the bounding box of the document target, wherein the bounding box is obtained according to the visual characteristics of the document target.
In some possible implementations, the bounding box is obtained from initial corner points based on regression of the visual features.
In some possible implementations, there are four initial corner points, the bounding box is obtained from corrected corner points, the number of corrected corner points is greater than or equal to four, the corrected corner points are obtained by correcting the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted from edge information regressed from the visual features.
In some possible implementations, the system further includes:
and the identification unit is used for acquiring the structured information of the document target in the image according to the image and the optical character recognition model corresponding to the classification result.
In a fifth aspect, the present application provides an apparatus comprising a processor and a memory. The processor and the memory are in communication with each other. The processor is configured to execute the instructions stored in the memory to cause the apparatus to perform the image processing method as in the first aspect or any implementation manner of the first aspect.
In a sixth aspect, the present application provides a computer-readable storage medium having instructions stored therein, where the instructions instruct an apparatus to perform the image processing method according to the first aspect or any implementation manner of the first aspect.
In a seventh aspect, the present application provides a computer program product comprising instructions that, when run on a device, cause the device to perform the image processing method of the first aspect or any of the implementations of the first aspect.
The present application can further combine to provide more implementations on the basis of the implementations provided by the above aspects.
Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below.
Fig. 1 is a system architecture diagram of an image processing system according to an embodiment of the present application;
fig. 2 is a schematic hardware structure diagram of a server according to an embodiment of the present disclosure;
fig. 3 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a text selection provided in an embodiment of the present application;
fig. 5 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 6 is a diagram illustrating an enclosure of a document object according to an embodiment of the present application;
fig. 7 is a schematic input/output diagram of an image processing system according to an embodiment of the present application;
FIG. 8 is a schematic diagram of perspective transformation and angle transformation provided by an embodiment of the present application;
fig. 9 is a flowchart of an image processing method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image processing system according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an image processing system according to an embodiment of the present application.
Detailed Description
The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Some technical terms referred to in the embodiments of the present application will be first described.
An image is a similar and vivid description or portrayal of an objective object, and is the most commonly used information carrier in human social activities. An image may be obtained in a variety of ways. In some embodiments, the image may be obtained by a camera photographing an objective object. In other embodiments, the image may be obtained by capturing a screenshot of the objective object with a screenshot tool or by scanning the objective object with a scanning device such as a scanner.
An objective object depicted in an image may be referred to as a target. The objects in an image may be of different types, for example animal objects including people, cats, and dogs, plant objects including trees and flowers, or document objects including cards (e.g., work cards, bank cards, transportation cards, identification cards, driver's licenses), tickets (e.g., invoices, receipts), labels (e.g., labels on packages of goods, price labels on shelves), mails, and documents (e.g., presentations, posters). A document object includes text.
OCR specifically refers to recognizing characters included in a document object in an image, and converting the recognized characters into text. The image is usually an image obtained by optically converting a paper document or an electronic document into a dot matrix, such as a black-and-white dot matrix.
OCR can make the computer possess character detection recognition function similar to human eye, so it can be widely used in different image processing scenes. For example, OCR may be used in scenarios such as automatic entry of credential information, ticket information, assisted reading by disabled people, filtering of illicit words, and so on.
In order to realize batch and automatic processing, the input images can be classified into different types such as cards, bills and the like, and then a professional OCR engine corresponding to the types, such as a bill recognition engine, is used for character recognition. For this reason, it is proposed in the industry to automatically detect document objects such as cards, tickets, and mails in an image and to segment and classify the document objects by using an object detection and classification technique.
Currently, the more common target detection and classification technique in the industry is Mask R-CNN. However, Mask R-CNN and similar techniques typically classify the document object using only the visual features of the image. Visual features are features at the visual level and may include, for example, any one or more of color features, texture features, shape features, or spatial relationship features. The visual features of some document objects are not distinctive, for example an email in an email screenshot, a presentation in a presentation image taken by a user, or a ticket and a shopping receipt in a reimbursement-order image. For such objects, classification based on visual features alone has low accuracy, and it is difficult to meet business requirements.
In view of this, an embodiment of the present application provides an image processing method. The method may be performed by an image processing system. Specifically, the image processing system acquires an image, the image includes one or more document objects, and the image processing system may acquire semantic features of the document objects according to texts in the document objects and then acquire classification results of the document objects according to the visual features and the semantic features of the document objects.
The method classifies the document targets by combining the semantic features on the basis of the visual features, provides more information for the classifier, can improve the classification accuracy, avoids the recognition error of a downstream character recognition model caused by input type mismatching, and meets the business requirements. Moreover, the method can support automatic extraction of semantic features for classification, so that end-to-end classification is realized without human intervention, such as manual frame selection of classification keywords or manual classification, the classification efficiency is improved, and the classification cost is reduced.
When the image comprises a plurality of document targets, the method also supports the segmentation of the document targets included in the image, and the classification can be performed by combining the semantic features of the segmented image, so that the classification accuracy can be further improved. The image processing method provided by the embodiment of the application can realize end-to-end segmentation and classification, specifically, automatically acquire the bounding box and the category of each document target in the image, does not need human intervention, is beneficial to realizing the automation of recognition processes such as OCR (optical character recognition), and greatly improves the efficiency of image processing.
The image processing method provided by the embodiment of the application can be applied to different scenes. For example, the method can be used in the automatic entry scene of certificate information and bill information, or in the auxiliary reading scene of the disabled, or in the filtering scene of forbidden words.
In some possible implementations, the image processing method provided by the embodiments of the present application may be provided to a user in the form of a cloud service, such as software as a service (SaaS) or function as a service (FaaS). For example, an image processing system implementing the image processing method may be deployed to a public cloud to provide an externally published cloud service that classifies images and then inputs the document targets in the images into character recognition models of the corresponding categories for character recognition. When the image processing method is released externally as a service, uploaded data such as images may also be protected for security, for example by encrypting the images. In some embodiments, the image processing system implementing the image processing method may also be deployed to a private cloud, thereby providing a cloud service for internal use. Of course, the image processing system implementing the image processing method may also be deployed to a hybrid cloud, where a hybrid cloud refers to an architecture that includes at least one public cloud and at least one private cloud.
When the image processing method is provided to users in the form of a cloud service, the cloud service may provide an application programming interface (API) and/or a user interface. The user interface may be a graphical user interface (GUI) or a command user interface (CUI). In this way, a service caller may directly call the API provided by the cloud service to perform image processing, for example to classify images; the cloud service may also receive images submitted by users through the GUI or the CUI, classify the images, and return the classification results.
In other possible implementations, the image processing method provided by the embodiment of the present application may be provided to a user in a packaged software package. Specifically, after purchasing the software package, the user can install and use the software package in the running environment of the user. Of course, the software package described above may also be pre-installed on a computing device for image processing.
For convenience of understanding, the following describes an example of the technical solution of the present application in a scenario where certificate information and ticket information are automatically entered and image processing is performed in a cloud service manner.
Referring to the system architecture diagram of the image processing system shown in fig. 1, a server 10 and a terminal 20 establish a communication connection as shown in fig. 1. The server 10 may be a cloud server in a public cloud. In some embodiments, the server 10 may also be a cloud server in a private cloud or a hybrid cloud. The terminal 20 includes, but is not limited to, a smartphone, a tablet, or a scanner, etc.
The server 10 is disposed with an image processing system 100 for providing a cloud service for image processing. In particular, the image processing system 100 includes a classification subsystem 102. The terminal 20 may capture images by photographing, scanning, etc., for example, by placing an invoice, an identification card, a bank card, etc. on a plain paper, and then the terminal 20 performs a photographing operation to obtain an image of a document object including the invoice, the identification card, the bank card, etc. The terminal 20 may send the image to the cloud, for example, to the server 10, and the image processing system 100 (for example, the classification subsystem 102 in the image processing system 100) running in the server 10 may obtain semantic features of the document object according to the text in the document object included in the image, and obtain a classification result of the document object according to the visual features and the semantic features of the document object.
The classification result comprises a class label of the document object, and the class label is used for representing the class of the document object. In particular, the document object may include categories such as a card, ticket, label, mail, or file. In some possible implementations, the category of the document object may be further divided into sub-categories, for example, the card may be divided into sub-categories such as a work card, a bank card, a pass, a driver's license, and the like, and the ticket may include sub-categories such as a shopping receipt, a ticket, and the like. In some embodiments, the classification result may also include a confidence that the document object belongs to the corresponding category. The confidence level is an empirically determined probability value that characterizes the degree of plausibility. Confidence may be a value in the range of [0,1], with values closer to 1 indicating a higher confidence level and values closer to 0 indicating a lower confidence level.
Further, the image processing system 100 may also include a recognition subsystem 104. The recognition subsystem 104 may include a character recognition model corresponding to at least one category of document objects. After the classification subsystem 102 in the image processing system 100 classifies the document objects in the image, the partial images corresponding one-to-one to the document objects may be input into the optical character recognition models corresponding to their classification results. An optical character recognition model may specifically be a dedicated OCR engine, such as a bill recognition engine, a business card recognition engine, a work card recognition engine, or a label recognition engine, so that character recognition can be performed by the dedicated OCR engine to obtain structured information.
The structured information specifically refers to information logically expressed by a two-dimensional table structure. For example, for a document object of the class of work cards, the following structured information can be obtained by performing character recognition through a work card recognition engine:
name: zhang III;
job number: 2348798, respectively;
and (4) department: administration department.
The structured information obtained by recognition with the work card recognition engine enables automatic information entry, for example automatic entry of work card information, without requiring manual input by the user, which improves entry efficiency and accuracy. In addition, this simplifies user operations and improves the user experience.
The system architecture of the image processing system 100 is described above. Next, the server 10 that deploys the image processing system 100 will be described from the viewpoint of hardware instantiation.
Fig. 2 shows a schematic structural diagram of the server 10. It should be understood that fig. 2 only shows a part of the hardware structure and a part of the software modules in the server 10, and when the server 10 is implemented specifically, the server 10 may further include more hardware structures, such as indicator lights, buzzers, and the like, and more software modules, such as various application programs, and the like.
As shown in fig. 2, the server 10 includes a bus 201, a processor 202, a communication interface 203, and a memory 204. The processor 202, memory 204, and communication interface 203 communicate via a bus 201.
The bus 201 may be a Peripheral Component Interconnect (PCI) bus, a peripheral component interconnect express (PCIe) or Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 2, but it is not intended that there be only one bus or one type of bus.
The processor 202 may be any one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Micro Processor (MP), a Digital Signal Processor (DSP), and the like.
The communication interface 203 is used for communication with the outside. For example, the communication interface 203 is used for acquiring images, for example, receiving images sent by the terminal 20, and returning classification results of document objects in the images, for example, returning classification results of document objects in the images to the terminal 20.
Memory 204 may include volatile memory (volatile memory), such as Random Access Memory (RAM). The memory 204 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD). The RAM and the ROM are called memories, and the HDD and the SSD are called external memories.
The memory 204 stores programs or instructions, such as programs or instructions required for implementing the image processing method provided by the embodiment of the present application. The processor 202 executes the program or instructions to perform the image processing method described above.
In order to make the technical solution of the present application clearer and easier to understand, the following describes in detail an image processing method provided in an embodiment of the present application with reference to the drawings.
Referring to the flowchart of the image processing method shown in fig. 3, the method includes:
s302: the image processing system 100 acquires an image.
The image includes one or more document objects. A document object is an object of the document type included in the image. Unlike animal objects such as people, cats, and dogs, or plant objects such as trees and flowers, document objects may include cards, tickets, labels, mails, or documents. The card category includes cards and certificates such as work cards, business cards, nameplates, identification cards, driver's licenses, passes, passports, and business licenses; the ticket category includes invoices, checks, drafts, receipts, and the like; the label category includes labels on commodity packages, commodity price labels on shelves, and the like. Mail refers specifically to email. Document refers to posters, presentations, text documents, table documents, and the like.
Specifically, the image processing system 100 may acquire an image, such as by receiving the image from the terminal 20, and then input the image into a character detection model, which may be a generic OCR model that performs character detection on the input image. When the character is detected, it is indicated that the image includes a document object, whereby an image including one or more document objects can be obtained. When no character is detected, indicating that the image does not include a document object, the image processing system 100 may terminate the process flow for the image.
The character detection model is specifically used for detecting the position of a character in an image and returning a bounding box (BBox) of an area where the character is located (for example, a line where the character is located). The bounding box of the region where the character is located may be a quadrangle, which may be characterized by coordinates of four corner points. The return values of the character detection model may be expressed in the form of { (x1, y1), (x2, y2), (x3, y3), (x4, y4) }. When the quadrilateral is a rectangle, the bounding box may also be characterized by a center point and an offset of one of the corner points to the center point. Correspondingly, the form of the return value of the character detection model can be specifically expressed as { (x, y), (dx, dy) }.
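As an illustration, the two return-value forms described above might be represented by data structures such as the following; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class QuadBBox:
    """Bounding box of a text region given by its four corner points:
    {(x1, y1), (x2, y2), (x3, y3), (x4, y4)}."""
    corners: List[Tuple[float, float]]

@dataclass
class CenterOffsetBBox:
    """Rectangular bounding box given by its center point and the offset of
    one corner point to the center: {(x, y), (dx, dy)}."""
    center: Tuple[float, float]
    offset: Tuple[float, float]

    def to_corners(self) -> List[Tuple[float, float]]:
        (x, y), (dx, dy) = self.center, self.offset
        return [(x - dx, y - dy), (x + dx, y - dy), (x + dx, y + dy), (x - dx, y + dy)]
```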
In some possible implementations, the image processing system 100 may acquire the image when information entry is needed, or disabled people need to be assisted in reading, or forbidden words need to be filtered, so as to classify the image, and then input the image into a character recognition model of a corresponding category to perform character recognition, so as to implement automatic information entry, or transcribe a recognition result into voice-assisted reading, or filter the forbidden words in the image.
S304: the image processing system 100 obtains semantic features of the document object based on the text in the document object.
Specifically, when the image processing system 100 detects that the image includes the character through the character detection model, the image may be further clipped according to the bounding box of the area where the character is located, and then the clipped image is input into the character recognition model, and the character recognition model may recognize the content of the character to obtain the text in the document object included in the image.
Considering that text is unstructured data, the image processing system 100 may encode the text, for example with a word vector (word2vec) model, to obtain encoded text. The image processing system 100 may then extract semantic features of the text through a text classification network (e.g., the feature extraction layer of the text classification network). Semantic features are features that characterize the meaning of the text.
The text classification network may be constructed from a convolutional neural network (CNN) or a recurrent neural network (RNN). In some embodiments, the text classification network may be built using the fast text classification network FastText, the text convolutional neural network TextCNN, the text recurrent neural network TextRNN, or the dilated gated convolutional neural network (DGCNN).
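A minimal sketch of this semantic-feature extraction step, assuming a TextCNN-style network implemented in PyTorch, is shown below; the vocabulary size, embedding dimension, filter configuration, and pooling choice are illustrative assumptions rather than the configuration used in this application.

```python
import torch
import torch.nn as nn

class TextCNNFeatureExtractor(nn.Module):
    """Feature-extraction layers of a TextCNN-style text classification network."""

    def __init__(self, vocab_size=30000, embed_dim=128, num_filters=64,
                 kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word-vector encoding
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])

    def forward(self, token_ids):
        # token_ids: (batch, sequence_length) integer word indices
        x = self.embedding(token_ids).transpose(1, 2)          # (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                        # semantic feature vector

extractor = TextCNNFeatureExtractor()
dummy_tokens = torch.randint(0, 30000, (1, 50))                # one encoded text of 50 tokens
semantic_feature = extractor(dummy_tokens)                     # shape: (1, 192)
```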
S306: the image processing system 100 obtains the classification result of the document object according to the visual feature and the semantic feature of the document object.
Specifically, the image processing system 100 may extract, from the image, a visual feature of the document object, which may include any one or more of a color feature, a texture feature, a shape feature, or a spatial relationship feature of the document object in the image, and then the image processing system 100 may classify the document object according to the visual feature and the semantic feature, thereby obtaining a classification result of the document object.
The classification result includes a category label of the document object. The category tag is specifically used to identify the category to which the target of the document belongs. When the image includes a document object, the image processing system 100 may output a category tag of the document object. When the image includes a plurality of document objects, the image processing system 100 may output category labels for each of the plurality of document objects.
The image processing system 100 may extract visual features of document objects in an image through an image classification network (e.g., a feature extraction layer of the image classification network). The image classification network may adopt a convolutional neural network architecture. In some embodiments, the image classification network may be constructed according to a Visual Geometry Group (VGG) network or a residual network (ResNet).
For ease of description, the present example is illustrated with a 5-layer convolutional neural network to extract visual features. Specifically, the image processing system 100 inputs the image into 5 layers of convolutional neural networks, extracts convolutional layer outputs at different stages respectively, and then fuses the convolutional layer outputs at different stages to obtain a multilayer feature space. In consideration of memory consumption, the image processing system 100 may extract from the second layer and blend the outputs of the second layer convolutional layer to the fifth layer convolutional layer to obtain a multi-layer feature space.
Considering that the sizes and dimensions of the feature maps output at different stages are generally different, the image processing system 100 may upsample the feature maps from the higher stages and unify the sizes of the feature maps output at different stages into W × H, where W represents the width and H represents the height. The image processing system 100 may further use a 1 × 1 convolution kernel to unify the dimensions of all feature maps, for example, to a dimension C, and then combine all feature maps in sequence to obtain the visual features of the image, thereby obtaining the visual features of the document target in the image. In the example of the 5-layer convolutional neural network, the visual features may be represented as [F2, F3, F4, F5].
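The fusion described above could be sketched as follows, assuming PyTorch feature maps F2 to F5 from the convolutional stages; the target size W × H and the common dimension C are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStageFusion(nn.Module):
    """Fuse feature maps from several convolutional stages into one multi-layer feature space."""

    def __init__(self, in_dims, out_dim=256, out_size=(64, 64)):
        super().__init__()
        self.out_size = out_size
        # One 1x1 convolution per stage unifies the channel dimension to C = out_dim.
        self.projs = nn.ModuleList(nn.Conv2d(d, out_dim, kernel_size=1) for d in in_dims)

    def forward(self, stage_maps):        # e.g. [F2, F3, F4, F5]
        fused = []
        for fmap, proj in zip(stage_maps, self.projs):
            # Upsample every stage output to the common size W x H before projecting.
            resized = F.interpolate(fmap, size=self.out_size, mode="bilinear", align_corners=False)
            fused.append(proj(resized))
        return torch.stack(fused, dim=1)  # (batch, num_stages, C, H, W)
```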
In some possible implementations, the image processing system 100 classifies the document object according to the visual features of the document object and obtains a first classification result of the document object. Specifically, the image classification network may generate anchor frames of different sizes on each feature point of the visual features using a sliding window. For example, when the size of the visual features is W × H and each feature point generates k anchor frames, the image classification network may generate W × H × k anchor frames. When the visual features include M layers, the image classification network may generate S = M × W × H × k anchor frames. The image classification network may perform preliminary screening on the S anchor frames, filter the overlapping anchor frames, and discard the anchor frames that do not include the document target. For example, a non-maximum suppression (NMS) algorithm may be used to filter a large number of overlapping anchor frames to obtain the final s anchor frames, which are the candidate regions.
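One common way to perform this preliminary screening is non-maximum suppression; a plain NumPy version is sketched below under the usual IoU-threshold formulation, with the threshold value chosen only for illustration:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top-scoring box with every remaining box.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep only boxes that do not overlap the chosen box too much.
        order = order[1:][iou <= iou_threshold]
    return keep
```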
Because the s candidate regions generated by the image classification network differ in size, a pooling operation such as max pooling may be applied to unify their sizes, yielding s final feature vectors of size w × w × c (where w is the width and height after pooling and c is the number of channels). The image classification network classifies the s feature vectors to obtain, for each vector, a probability distribution over the classes, where the component P_i^v characterizes the confidence in the i-th class. The image classification network may output the class with the maximum confidence, argmax_i P_i^v, as the class label of the candidate region. Each candidate region corresponds to a document object, and thus the image processing system 100 may obtain the classification result of the document object according to the class label of the candidate region output by the image classification network. The image processing system 100 may output the category label and the confidence corresponding to the category label, so that the user can determine whether to adopt the classification result according to the confidence.
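The size unification and per-region classification can be sketched as follows, assuming adaptive max pooling and a linear classifier; the pooled size w, the channel count c, and the number of classes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionClassifier(nn.Module):
    """Pool variable-size candidate regions to w x w x c and classify each one."""

    def __init__(self, channels=256, pooled_size=7, num_classes=5):
        super().__init__()
        # Max pooling to a fixed w x w grid so every candidate region has the same size.
        self.pool = nn.AdaptiveMaxPool2d(pooled_size)
        self.fc = nn.Linear(channels * pooled_size * pooled_size, num_classes)

    def forward(self, region_features):
        """region_features: list of s tensors, each (c, h_i, w_i) with varying h_i, w_i."""
        pooled = torch.stack(
            [self.pool(r.unsqueeze(0)).flatten(1) for r in region_features]
        ).squeeze(1)                                   # (s, c * pooled_size^2)
        probs = F.softmax(self.fc(pooled), dim=-1)     # probability distribution over classes
        labels = probs.argmax(dim=-1)                  # class with the maximum confidence
        return labels, probs
```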
In order to distinguish class labels obtained by classifying semantic features based on a text, in the embodiment of the application, the class labels output by the image classification network based on the visual features are called as first class labels, and the class labels obtained by classifying the text classification network based on the semantic features are called as second class labels.
In some possible implementations, the image processing system 100 may also compare the confidence corresponding to the first category label to a preset threshold. The preset threshold may be set according to an empirical value, and may be set to 0.6, for example. When the confidence corresponding to the first class label is greater than or equal to the preset threshold, it indicates that the confidence of the first class label is higher, and the image processing system 100 may directly use the first class label as a final classification result. When the confidence corresponding to the first class label is smaller than the preset threshold, the image processing system 100 may further execute the above S304, start text classification, and obtain a final classification result according to the visual feature and the semantic feature of the document target.
Specifically, the image processing system 100 may obtain first confidence degrees that the document target belongs to different categories according to the visual features of the document target, obtain second confidence degrees that the document target belongs to different categories according to the semantic features of the document target, and then obtain the classification result of the document target according to the weighted operation result of the first confidence degrees and the second confidence degrees. See in particular the following formula:
P_i = λ · P_i^v + (1 − λ) · P_i^s

where P_i^v characterizes the confidence, determined based on the visual features, that the document target belongs to the i-th class; P_i^s characterizes the confidence, determined based on the semantic features, that the document target belongs to the i-th class; λ characterizes the weight; and P_i characterizes the confidence, determined based on both the visual features and the semantic features, that the document target belongs to the i-th class.
Based on this, the image processing system 100 may determine the classification result of the document object according to P_i. For example, the image processing system 100 may take the class with the largest P_i as the third class label of the document object, and the image processing system 100 may use this third class label as the final classification result.
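A minimal sketch of the weighted fusion, assuming the λ-weighted form reconstructed above and NumPy confidence vectors; the example values are made up:

```python
import numpy as np

def fuse_confidences(p_visual, p_semantic, lam=0.5):
    """p_visual, p_semantic: per-class confidence vectors; lam: the weight λ (assumed value)."""
    p = lam * np.asarray(p_visual) + (1.0 - lam) * np.asarray(p_semantic)
    third_label = int(np.argmax(p))        # class with the largest fused confidence P_i
    return third_label, float(p[third_label])

# Example: the visual features favour class 1 while the semantics favour class 2.
label, conf = fuse_confidences([0.2, 0.5, 0.3], [0.1, 0.2, 0.7], lam=0.4)
```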
Accordingly, the image processing system 100 may output a first category label determined from the visual features and a first confidence corresponding to the first category label, a second category label determined from the semantic features and a second confidence corresponding to the second category label, and a third category label determined from the visual features and the semantic features and a third confidence corresponding to the third category label.
Based on the above description, the embodiments of the present application provide an image processing method. In the method, the image processing system also combines the semantic features to classify the document targets in the image on the basis of the visual features, and provides more information for the classifier, so that the classification accuracy can be improved, the recognition error of a downstream character recognition model caused by the mismatching of input types is avoided, and the service requirement is met.
In some possible implementations, when multiple document objects are included in the image, the image processing system 100 may also obtain bounding boxes for the multiple document objects based on visual characteristics of the document objects in the image. The image processing system 100 may output bounding boxes of multiple document objects to crop an image from the bounding boxes to obtain multiple partial images. The plurality of partial images correspond to the plurality of document objects one to one.
It should be noted that the image processing system 100 may determine texts in the multiple document objects according to bounding boxes of the multiple document objects and text boxes (boxes corresponding to regions where texts are located) detected based on a character detection model, such as a general OCR model, then obtain semantic features of the multiple document objects according to the texts in the multiple document objects, and perform classification based on the visual features and the semantic features to obtain classification results. Because the text selected by the method is accurate and does not include the interference text, the classification accuracy can be further improved.
Referring to the text selection diagram shown in fig. 4, as shown in fig. 4, the image processing system 100 may determine an overlapping area ratio of the bounding box 402 and the text box 404, which may be specifically an area ratio of an overlapping portion of the bounding box 402 and the text box 404 to the text box 404. When the overlap area ratio reaches the preset ratio, the image processing system 100 may select the text in the text box 404 for classification based on the semantic features, and when the overlap area ratio does not reach the preset ratio, the image processing system 100 may not select the text in the text box 404 and the text in the unselected text box 404 is not used for classification based on the semantic features. The preset ratio may be set according to an empirical value, and may be set to 0.5, for example.
For ease of distinction, FIG. 4 uses different shading to identify selected text boxes 404 and unselected text boxes 404. In other possible implementations of embodiments of the present application, the image processing system 100 may distinguish the selected text box 404 from the unselected text boxes 404 in other ways.
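The overlap-area ratio used above to select text boxes can be sketched as follows for axis-aligned boxes given as [x1, y1, x2, y2]; the 0.5 threshold matches the example value mentioned above:

```python
def text_box_selected(doc_box, text_box, preset_ratio=0.5):
    """Select a text box when its overlap with the bounding box, divided by the text box area,
    reaches the preset ratio."""
    ix1, iy1 = max(doc_box[0], text_box[0]), max(doc_box[1], text_box[1])
    ix2, iy2 = min(doc_box[2], text_box[2]), min(doc_box[3], text_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    text_area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    return text_area > 0 and inter / text_area >= preset_ratio
```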
The process of the image processing system 100 for returning the bounding box of the document object and the classification result will be described in detail below with reference to specific embodiments.
Referring to the flowchart of the image processing method shown in fig. 5, the method includes:
s502: the image processing system 100 acquires an image.
S504: the image processing system 100 inputs the image into the character detection model, and executes S506 when the character detection model detects a character, and executes S512 when the character detection model does not detect a model.
S506: the image processing system 100 inputs the image into an image classification network, and obtains bounding boxes of a plurality of document objects in the image and first class labels of the plurality of document objects.
The image processing system 100 extracts visual features from an image by using an image classification network, and the specific implementation of obtaining the first class label of the document target by performing classification based on the visual features may refer to the description of the relevant content in the embodiment shown in fig. 3, which is not described herein again.
The image processing system 100 may further extract visual features from the image by using an image classification network, and perform corner regression based on the visual features to obtain initial corners. The image processing system 100 may obtain bounding boxes for multiple document objects based on the initial corner points.
Specifically, the initial corner points of the bounding box of one document object are typically four. The image processing system 100 may perform edge regression based on the visual features to determine the edge information of the bounding box. The edge information may in particular be represented by edge pixels (which are typically very narrow bands of pixels). Considering that it is difficult to express the edge pixels mathematically in a direct way, the image processing system 100 may process the edge pixels in the whole image by using a Hough Transform or the like, specifically, extract straight lines from the edge pixels, thereby obtaining edge lines.
The image processing system 100 may process the edge lines, for example, in the candidate region of the document object, connect, filter, and extend all the lines within a certain range of the candidate region, so as to obtain the intersection points of the edge lines. The image processing system 100 may connect straight lines with similar slopes (for example, straight lines whose included angle is less than 10°), filter out straight lines whose lengths are less than a preset length, and extend the remaining straight lines to obtain a plurality of intersection points.
The range of the candidate region may be an expanded range, for example, a range of expanding the length and width of the candidate region by 0.1 times. The image processing system 100 expands the range of line selection to prevent the edge line from completely coinciding with the edge of the region or the resulting intersection point from being outside the current region.
Further, the image processing system 100 may modify the initial corner point by using the intersection point to obtain a modified corner point. The corrected corner points may be greater than or equal to four. The image processing system 100 may obtain a bounding box of the document object according to the corrected corner points, where the bounding box may specifically be a box formed by sequentially connecting the corrected corner points.
In some possible implementations, the image processing system 100 may determine a correction manner for the initial corner points according to the number of intersections of the edge lines, and perform correction in the corresponding manner. Specifically, when the number of intersections of the edge lines is greater than the number of initial corner points, for example, greater than four, the image processing system 100 may directly replace the initial corner points with the intersections of the edge lines as the corrected corner points. The image processing system 100 obtains the bounding box of the document object from the polygon, such as a pentagon, formed by the corrected corner points. When the number of intersections of the edge lines is less than or equal to the number of initial corner points, the image processing system 100 may search for intersections of the edge lines within a preset radius r centered on each initial corner point, and if such an intersection exists, obtain the corrected corner point according to the initial corner point and the intersection of the edge lines. For example, the image processing system 100 may take the midpoint between the initial corner point and the intersection of the edge lines as the corrected corner point.
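An OpenCV-based sketch of the edge-line and corner-correction steps; the Canny thresholds, Hough parameters, and search radius are assumptions, and the intersection helper only handles non-parallel lines:

```python
import cv2
import numpy as np

def edge_lines(image_gray):
    """Extract straight lines from edge pixels with a probabilistic Hough transform."""
    edges = cv2.Canny(image_gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=40, maxLineGap=10)
    return [] if lines is None else [l[0] for l in lines]   # each line is (x1, y1, x2, y2)

def line_intersection(l1, l2):
    """Intersection of the two infinite lines through the segment endpoints; None if parallel."""
    x1, y1, x2, y2 = map(float, l1)
    x3, y3, x4, y4 = map(float, l2)
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(d) < 1e-9:
        return None
    px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / d
    py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / d
    return (px, py)

def correct_corner(initial_corner, intersections, radius=15.0):
    """Replace an initial corner with the midpoint to a nearby edge-line intersection, if any."""
    cx, cy = initial_corner
    for ix, iy in intersections:
        if (ix - cx) ** 2 + (iy - cy) ** 2 <= radius ** 2:
            return ((cx + ix) / 2.0, (cy + iy) / 2.0)
    return initial_corner
```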
S508: when the confidence of the first class label is smaller than the preset threshold, the image processing system 100 obtains semantic features of the text in the document target according to the bounding boxes of the plurality of document targets and the text box output by the character detection model.
S510: the image processing system 100 obtains a final classification result according to the visual features and semantic features of the document object.
Specific implementation of S508 and S510 can be described with reference to relevant content in the embodiment shown in fig. 3, and is not described herein again.
In some possible implementations, the image processing system 100 may output bounding boxes of multiple document objects and classification results of multiple document objects. The bounding box can be characterized by the coordinates of the corner points, and further, the bounding box can be characterized by the category of the corner points. The classification result may be characterized by the category with the highest confidence, and further, the classification result may further include the confidence of the category.
In some embodiments, the image processing system 100 may output the bounding box and classification results in the following format:
coordinates of corner points of the bounding box;
the category of corner points (which may be 1 or 2, for example);
class labels (which may be, for example, integers from 0 to N, including 1 or 3 values);
the confidence of the class (for example, a floating point number between 0 and 1, including 1 or 3 values).
When the corner point category value is 1, it indicates that the bounding box includes 4 corner points; when the corner point category value is 2, it indicates that the bounding box includes more than four corner points, specifically the intersection points generated by the edge lines. When the number of class labels is 1, it indicates that the classification result is obtained through the visual features only; in this case, the confidence corresponding to the class label is necessarily greater than or equal to the preset threshold (for example, greater than 0.6), and the number of confidences of the class label is 1. When the number of class labels is 3, the 3 class labels respectively represent the first class label determined based on the visual features, the second class label determined based on the semantic features, and the third class label determined based on both the visual features and the semantic features (the classification result corrected by the semantic features); the number of confidences of the class labels is also 3, and the confidence corresponding to the first class label is smaller than the preset threshold.
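Purely as an illustration of the output format above (the field names and values are made up for this example), one returned record might look like:

```python
# One document target whose visual-feature confidence fell below the threshold,
# so all three labels (visual, semantic, fused) are returned.
result = {
    "corner_points": [(120, 80), (860, 95), (845, 610), (110, 590)],
    "corner_category": 1,                 # 1: four corners; 2: more than four corners
    "class_labels": [3, 1, 1],            # first, second, and third class labels
    "confidences": [0.42, 0.81, 0.77],    # first confidence < preset threshold (e.g. 0.6)
}
```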
FIG. 6 also shows a schematic diagram of bounding boxes of document objects in an image. As shown in FIG. 6, an image 600 includes 2 document objects 602, and the 2 document objects 602 are tickets. The lower left corner of one ticket is outside the image; therefore, the image processing system 100 outputs the bounding box of that document object 602 as the bounding box formed by the intersections of the edge lines, specifically a pentagon. The pentagon may be characterized by five corner points 604, where the five corner points 604 are corner points obtained by modifying the initial corner points by using the intersections of the edge lines. The corner points of the other ticket are all within the image, so the image processing system 100 outputs the bounding box of that document object 602 as a quadrangle, which may be characterized by four corner points 604, where the four corner points 604 are corner points obtained by correcting the initial corner points by using the intersections of the edge lines.
S512: the image processing system 100 terminates the image processing flow and returns the prompt information.
Specifically, when the character detection model detects that the image does not include characters, indicating that the image is not an image to be processed, the image processing system 100 may terminate the image processing flow and return a prompt message to prompt that the current image does not need to be processed.
For ease of understanding, the embodiments of the present application provide examples of processing different images input to the image processing system 100, and the following detailed description is made in conjunction with the examples.
As shown in FIG. 7, image 702, image 704, and image 706 are input into the image processing system 100 respectively. Characters are detected in image 702 and image 704, so the image processing system 100 may perform the further classification procedure on image 702 and image 704. No characters are detected in image 706, so the image processing system 100 may terminate the processing procedure for image 706 and output prompt information; for example, the image processing system 100 may present a prompt box 712 to the user, and the prompt box 712 is used to prompt that no document object is included in image 706.
The image 702 includes a plurality of document objects, and the image processing system 100 classifies the plurality of document objects in the image 702 based on the visual characteristics of the document objects in the image 702, and also segments the plurality of document objects in the image based on the visual characteristics. Specifically, the image processing system 100 may perform corner regression and edge regression on a plurality of document targets in the image respectively according to the visual features to obtain initial corners and edge information of the plurality of document targets, and then the image processing system 100 may extract edge lines by using the edge information, correct the initial corners by using intersections of the edge lines, and output bounding boxes of the plurality of document targets according to the corrected corners. For each document object, when the confidence of the first class label determined based on the visual features is smaller than the preset threshold, the image processing system 100 further starts text classification, outputs the first class label determined based on the visual features and the confidence thereof, the second class label determined based on the semantic features and the confidence thereof, and the third class label determined based on the visual features and the semantic features and the confidence thereof, and when the confidence of the first class label determined based on the visual features is greater than or equal to the preset threshold, the image processing system 100 outputs the first class label and the confidence thereof, as shown in a result display box 708.
The image 704 includes a document object, and the image processing system 100 classifies the document object in the image 704 based on visual characteristics of the document object in the image 704 to obtain a first class tag of the document object in the image 704. The confidence of the first category label is smaller than the preset threshold, the image processing system 100 further starts text classification, and outputs the first category label determined based on the visual features and the confidence thereof, the second category label determined based on the semantic features and the confidence thereof, and the third category label determined based on the visual features and the semantic features and the confidence thereof, which are specifically shown in a result display box 710.
In some possible implementations, after the image processing system 100 obtains the classification result of the image, the structured information of the document object in the image can be obtained according to the image and the optical character recognition model corresponding to the classification result, such as a special OCR engine corresponding to the classification result.
When the image includes a plurality of document objects, the image processing system 100 may obtain a plurality of partial images from the image, for example, by cropping the image according to the bounding box to obtain a plurality of partial images, where the plurality of partial images correspond to the plurality of document objects one to one. The image processing system 100 may then input the plurality of partial images into the optical character recognition model corresponding to the classification result of the document object in the partial image, to obtain the structured information of the plurality of document objects.
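As an illustration of this routing from classification result to a category-specific recognizer, a sketch is given below; ocr_engines and its keys are hypothetical placeholders rather than an interface defined in this application:

```python
def recognize_documents(image, detections, ocr_engines):
    """detections: list of (bbox, class_label) with bbox = [x1, y1, x2, y2].

    ocr_engines: hypothetical mapping from class label to a callable OCR model.
    """
    results = []
    for (x1, y1, x2, y2), label in detections:
        partial = image[int(y1):int(y2), int(x1):int(x2)]   # crop one partial image per target
        engine = ocr_engines[label]                          # special OCR engine for this category
        results.append(engine(partial))                      # structured information of the target
    return results
```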
In some possible implementations, the image processing system 100 may obtain perspective information of the document object from the bounding box of the document object. For example, when the bounding box is a quadrangle with unequal widths, the image processing system 100 may determine the perspective information of the document object, for example, information such as a perspective transformation matrix of the document object, by using the position information of the corner points. The image processing system 100 may also perform perspective transformation on the text in the document object based on the perspective information.
As shown in fig. 8, the image 802 is an image obtained by shooting the presentation 804. The presentation 804 includes a text, and due to the shooting angle, the text as a whole is inclined toward the lower right corner, the characters in the upper left are relatively large, and the characters in the lower right are relatively small. The image processing system 100 processes the image 802 to obtain a bounding box 806, where the bounding box 806 is a quadrangle with unequal widths. The image processing system 100 may obtain perspective information, such as a perspective transformation matrix, by using the bounding box, and perform perspective transformation on the text according to the perspective transformation matrix to obtain the perspective-transformed text. The bounding box 806 is also subjected to perspective transformation along with the characters, yielding a perspective-transformed bounding box 808. As shown in fig. 8, the characters in the perspective-transformed text have the same size, and the perspective-transformed bounding box 808 is converted from a quadrangle with unequal widths into a rectangle. Inputting the corrected text into a special character recognition model can improve the accuracy of character recognition.
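A sketch of the perspective correction with OpenCV, assuming the bounding box is given as four corner points ordered top-left, top-right, bottom-right, bottom-left; the output size is derived from the box:

```python
import cv2
import numpy as np

def warp_document(image, corners):
    """Map a quadrilateral bounding box to an upright rectangle of uniform width."""
    src = np.asarray(corners, dtype=np.float32)     # tl, tr, br, bl
    width = int(max(np.linalg.norm(src[1] - src[0]), np.linalg.norm(src[2] - src[3])))
    height = int(max(np.linalg.norm(src[3] - src[0]), np.linalg.norm(src[2] - src[1])))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(src, dst)   # the perspective transformation matrix
    return cv2.warpPerspective(image, matrix, (width, height))
```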
Further, when the presentation 804 of the image 802 has a rotation angle, the image processing system 100 may further perform an angle transformation on the presentation 804 according to the rotation angle. Specifically, the image processing system 100 may perform the angle transformation before the perspective transformation, or perform the angle transformation after the perspective transformation. Fig. 8 illustrates an example in which perspective transformation is performed followed by angle transformation, and image processing system 100 performs angle transformation on the perspective-transformed text and perspective-transformed bounding box 808 to obtain angle-transformed text and angle-transformed bounding box 810. The inclination of characters in the text after angle transformation can be zero, so that a downstream character recognition model can recognize conveniently, and the accuracy of character recognition is improved.
The image processing method has been described above mainly from the viewpoint of deployment of the image processing system 100 in the server 10. In some possible implementations, the image processing system 100 may also be deployed at a terminal. Next, an image processing method provided in an embodiment of the present application will be described from the perspective of a terminal.
Referring to a flowchart of an image processing method shown in fig. 9, the method includes:
s902: the image processing system 100 receives an image input by a user.
A document object may be included in the image. The document objects specifically include the following categories: card, ticket, label, mail, or document. Multiple document objects may also be included in the image. The multiple document objects may be document objects of the same category, such as document objects of the same category owned by a plurality of entities (e.g., a plurality of persons). The multiple document objects may also be document objects of different categories, such as different categories of document objects owned by the same entity (e.g., one person).
S904: the image processing system 100 outputs a classification result of the document object, wherein the classification result of the document object is obtained according to the visual feature of the document object and the semantic feature of the document object, and the semantic feature of the document object is obtained according to the text in the document object.
The specific implementation of the image processing system 100 to classify document objects or to segment document objects in an image can be seen in the above description.
There are many implementations of the image processing system 100 outputting the classification result of the document object. In some possible implementations, the image processing system 100 may present the classification results of the document object to the user. In other possible implementations, the image processing system 100 may output the classification result of the document object to a result file. Further, the image processing system 100 may also output a bounding box of the document object. The determination manner of the bounding box can be described by reference to the related content above.
In some possible implementations, the image processing system 100 may further obtain the structured information of the document object in the image according to the image and an optical character recognition model corresponding to the classification result. Specifically, when the image includes a plurality of document objects, the image processing system 100 may further crop the image according to the bounding box to obtain a plurality of partial images, and then input the plurality of partial images into an optical character recognition model corresponding to the classification result of the document object in the partial image, such as a special OCR engine corresponding to the classification result, to perform character recognition, and obtain the structured information of the document object according to the recognition result.
The image processing method provided by the embodiment of the present application is described in detail above with reference to fig. 1 to 9, and the image processing system 100 provided by the embodiment of the present application is described below with reference to the drawings.
The image processing system 100 realizes image processing functions, such as classifying document targets in images and recognizing content in the document targets, through subsystems with different functions and units with different functions. The present embodiment does not limit the division manner of the subsystems and units in the image processing system 100, and the following description is made with reference to the exemplary division manners shown in fig. 10 and 11.
Referring to the schematic structural diagram of the image processing system shown in fig. 10, the image processing system 100 includes a classification subsystem 102, wherein the classification subsystem 102 may include a communication unit 1022, a feature extraction unit 1024, and a classification unit 1026.
A communication unit 1022 for acquiring an image including one or more document objects;
the feature extraction unit 1024 is configured to obtain semantic features of the document target according to the text in the document target;
a classifying unit 1026, configured to obtain a classification result of the document target according to the visual feature and the semantic feature of the document target, where the classification result includes a category label of the document target.
In some possible implementations, the classifying unit 1026 is specifically configured to:
obtaining a classification result of the document object according to the confidence determined by the visual feature and the confidence determined by the semantic feature, wherein the confidence is a probability value which is determined according to experience and is used for representing the credibility.
In some possible implementations, the classification result further includes a confidence corresponding to the class label, and the communication unit 1022 is further configured to:
and outputting the class label and the confidence corresponding to the class label.
In some possible implementations, the communication unit 1022 is specifically configured to:
outputting a first class label determined according to the visual features and a first confidence degree corresponding to the first class label, a second class label determined according to the semantic features and a second confidence degree corresponding to the second class label, a third class label determined according to the visual features and the semantic features and a third confidence degree corresponding to the third class label, wherein the first confidence degree is smaller than a preset threshold value.
In some possible implementations, the system 100 also includes a detection subsystem 104. The detection subsystem 104 includes a communication unit 1042 and an object detection unit 1044. The communication unit 1042 may be configured to obtain a visual feature of the document target, for example, the visual feature of the document target obtained from the classification subsystem 102. The object detection unit 1044 is configured to obtain bounding boxes of the plurality of document objects according to the visual features of the document objects.
In some possible implementations, the bounding box is obtained from initial corner points based on regression of the visual features.
In some possible implementation manners, the initial corner points include four, the bounding box is obtained according to corrected corner points, the corrected corner points are greater than or equal to four, the corrected corner points are obtained by correcting the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted based on edge information of the visual feature regression.
In some possible implementations, the text in the document object is determined from the bounding box.
In some possible implementations, the communication unit 1042 is further configured to:
and outputting the bounding boxes of the plurality of document objects according to the position information of the corner points.
In some possible implementations, the system 100 further includes an identification subsystem 106, and the identification subsystem 106 includes a communication unit 1062 and a perspective transformation unit 1064. The communication unit 1062 is configured to obtain the bounding box, and the perspective transformation unit 1064 is configured to obtain the text after perspective transformation according to the perspective information from the bounding box. Wherein, the size of the characters in the text after perspective transformation is the same.
In some possible implementations, the system 100 further includes an identification subsystem 106, and the identification subsystem 106 includes a communication unit 1062 and a perspective transformation unit 1064. The communication unit 1062 is configured to obtain the bounding box, and the perspective transformation unit 1064 is configured to obtain the bounding box after perspective transformation according to the perspective information from the bounding box. In this example, the bounding box after the perspective transformation is a rectangle.
In some possible implementations, the recognition subsystem 106 further includes:
the recognition unit 1066 is configured to obtain the structural information of the document object in the image according to the image and the optical character recognition model corresponding to the classification result.
In some possible implementations, the image includes a plurality of document objects, and the identifying unit 1066 is specifically configured to:
obtaining a plurality of local images according to the image, wherein the local images correspond to the document targets one by one;
and inputting the plurality of local images into the optical character recognition model corresponding to the classification result to obtain the structural information of the plurality of document targets.
In some possible implementations, the document objects include the following categories: card, ticket, label, mail, or document.
It should be noted that the image processing system 100 may not be divided into subsystems, such as a classification subsystem, a detection subsystem, and a recognition subsystem. Accordingly, the communication unit 1022, the communication unit 1042, and the communication unit 1062 may be the same communication unit.
The image processing system 100 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module/unit of the image processing system 100 are respectively for implementing corresponding flows of each method in the embodiments shown in fig. 3 and fig. 5, and are not described herein again for brevity.
Next, referring to a schematic configuration diagram of the image processing system shown in fig. 11, the image processing system 100 includes:
a communication unit 1102 for receiving an image input by a user, the image including one or more document objects;
the communication unit 1102 is further configured to output a classification result of the document object, where the classification result includes a category label of the document object, the classification result is obtained according to a visual feature of the document object and a semantic feature of the document object, and the semantic feature of the document object is obtained according to a text in the document object.
In some possible implementations, the classification result further includes a confidence level corresponding to the category label.
In some possible implementations, the communication unit 1102 is specifically configured to:
presenting the classification result of the document target to the user; or,
and outputting the classification result of the document target to a result file.
In some possible implementations, the communication unit 1102 is further configured to:
and outputting the bounding box of the document target, wherein the bounding box is obtained according to the visual characteristics of the document target.
In some possible implementations, the bounding box is obtained from initial corner points based on regression of the visual features.
In some possible implementation manners, the initial corner points include four, the bounding box is obtained according to corrected corner points, the corrected corner points are greater than or equal to four, the corrected corner points are obtained by correcting the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted based on edge information of the visual feature regression.
In some possible implementations, the system 100 further includes:
and the identifying unit 1104 is used for obtaining the structural information of the document target in the image according to the image and the optical character recognition model corresponding to the classification result.
The image processing system 100 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module/unit of the image processing system 100 are respectively for implementing the corresponding flow of each method in the embodiment shown in fig. 9, and are not described herein again for brevity.
The embodiment of the application also provides a computer readable storage medium. The computer-readable storage medium can be any available medium that a computing device can store or a data storage device, such as a data center, that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others. The computer readable storage medium includes instructions that instruct a computing device to perform the image processing method described above as applied to the image processing system 100.
Embodiments of the present application also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part.
The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website site, computer, or data center to another website site, computer, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.).
When the computer program product is executed by a computer, the computer performs any one of the aforementioned image processing methods. The computer program product may be a software installation package which may be downloaded and executed on a computer in the event that any of the aforementioned image processing methods needs to be used.
The description of the flow or structure corresponding to each of the above drawings has emphasis, and a part not described in detail in a certain flow or structure may refer to the related description of other flows or structures.

Claims (45)

1. An image processing method, characterized in that the method comprises:
acquiring an image, wherein the image comprises one or more document objects;
obtaining semantic features of the document target according to the text in the document target;
and obtaining a classification result of the document target according to the visual feature and the semantic feature of the document target, wherein the classification result comprises a class label of the document target.
2. The method according to claim 1, wherein obtaining the classification result of the document object according to the visual feature and the semantic feature of the document object comprises:
obtaining a classification result of the document object according to the confidence determined by the visual feature and the confidence determined by the semantic feature, wherein the confidence is a probability value which is determined according to experience and is used for representing the credibility.
3. The method of claim 1 or 2, wherein the classification result further includes a confidence level corresponding to the class label, the method further comprising:
and outputting the class label and the confidence corresponding to the class label.
4. The method of claim 3, wherein outputting the class label and the confidence level corresponding to the class label comprises:
outputting a first class label determined according to the visual features and a first confidence degree corresponding to the first class label, a second class label determined according to the semantic features and a second confidence degree corresponding to the second class label, a third class label determined according to the visual features and the semantic features and a third confidence degree corresponding to the third class label, wherein the first confidence degree is smaller than a preset threshold value.
5. The method according to any one of claims 1 to 4, further comprising:
and obtaining the bounding boxes of the plurality of document targets according to the visual characteristics of the document targets.
6. The method of claim 5, wherein the bounding box is obtained from initial corner points based on regression of the visual features.
7. The method according to claim 6, wherein the initial corner points include four corner points, the bounding box is obtained from modified corner points, the modified corner points are greater than or equal to four corner points, the modified corner points are obtained by modifying the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted based on edge information of the visual feature regression.
8. The method of any of claims 5 to 7, wherein the text in the document object is determined from the bounding box.
9. The method according to any one of claims 5 to 8, further comprising:
and outputting the bounding boxes of the plurality of document objects according to the position information of the corner points.
10. The method according to any one of claims 5 to 9, further comprising:
and obtaining a text after perspective transformation according to the perspective information, wherein the size of characters in the text after perspective transformation is the same.
11. The method according to any one of claims 5 to 10, further comprising:
and obtaining the bounding box after perspective transformation according to the perspective information, wherein the bounding box after the perspective transformation is a rectangle.
12. The method according to any one of claims 1 to 11, further comprising:
and acquiring the structural information of the document target in the image according to the image and an optical character recognition model corresponding to the classification result.
13. The method of claim 12, wherein the image comprises a plurality of document objects, and wherein obtaining the structured information of the document objects in the image according to the image and an optical character recognition model corresponding to the classification result comprises:
obtaining a plurality of local images according to the image, wherein the local images correspond to the document targets one by one;
and inputting the plurality of local images into the optical character recognition model corresponding to the classification result to obtain the structural information of the plurality of document targets.
14. The method of any of claims 1 to 13, wherein the document objects include the following categories: card, ticket, label, mail, or document.
15. An image processing method, characterized in that the method comprises:
receiving an image input by a user, wherein the image comprises one or more document objects;
and outputting a classification result of the document target, wherein the classification result comprises a class label of the document target, the classification result is obtained according to the visual characteristic of the document target and the semantic characteristic of the document target, and the semantic characteristic of the document target is obtained according to the text in the document target.
16. The method of claim 15, wherein the classification result further comprises a confidence level corresponding to the class label.
17. The method according to claim 15 or 16, wherein the outputting the classification result of the document object comprises:
presenting the classification result of the document target to the user; or,
and outputting the classification result of the document target to a result file.
18. The method of any one of claims 15 to 17, further comprising:
and outputting the bounding box of the document target, wherein the bounding box is obtained according to the visual characteristics of the document target.
19. The method of claim 18, wherein the bounding box is obtained from initial corner points based on regression of the visual features.
20. The method according to claim 19, wherein the initial corner points include four corner points, the bounding box is obtained from modified corner points, the modified corner points are greater than or equal to four corner points, the modified corner points are obtained by modifying the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted based on edge information of the visual feature regression.
21. The method of any one of claims 15 to 20, further comprising:
and acquiring the structural information of the document target in the image according to the image and an optical character recognition model corresponding to the classification result.
22. An image processing system, characterized in that the system comprises:
a communication unit for acquiring an image, the image including one or more document objects;
the feature extraction unit is used for acquiring semantic features of the document target according to the text in the document target;
and the classification unit is used for obtaining a classification result of the document target according to the visual characteristic and the semantic characteristic of the document target, wherein the classification result comprises a class label of the document target.
23. The system according to claim 22, wherein the classification unit is specifically configured to:
obtaining a classification result of the document object according to the confidence determined by the visual feature and the confidence determined by the semantic feature, wherein the confidence is a probability value which is determined according to experience and is used for representing the credibility.
24. The system of claim 22 or 23, wherein the classification result further comprises a confidence level corresponding to the class label, and wherein the communication unit is further configured to:
and outputting the class label and the confidence corresponding to the class label.
25. The system of claim 24, wherein the communication unit is specifically configured to:
outputting a first class label determined according to the visual features and a first confidence degree corresponding to the first class label, a second class label determined according to the semantic features and a second confidence degree corresponding to the second class label, a third class label determined according to the visual features and the semantic features and a third confidence degree corresponding to the third class label, wherein the first confidence degree is smaller than a preset threshold value.
26. The system of any one of claims 22 to 25, further comprising:
and the object detection unit is used for obtaining the bounding boxes of the plurality of document objects according to the visual characteristics of the document objects.
27. The system of claim 26, wherein the bounding box is obtained from initial corner points based on regression of the visual features.
28. The system according to claim 27, wherein the initial corner points comprise four, the bounding box is obtained from modified corner points, the modified corner points are greater than or equal to four, the modified corner points are obtained by modifying the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted based on edge information of the visual feature regression.
29. The system of any of claims 26 to 28, wherein text in the document object is determined from the bounding box.
30. The system of any one of claims 26 to 29, wherein the communication unit is further configured to:
and outputting the bounding boxes of the plurality of document objects according to the position information of the corner points.
31. The system of any one of claims 26 to 30, further comprising:
and the perspective transformation unit is used for obtaining the text after perspective transformation according to the perspective information, and the size of the characters in the text after perspective transformation is the same.
32. The system of any one of claims 26 to 31, further comprising:
and the perspective transformation unit is used for obtaining the bounding box after perspective transformation according to the perspective information, and the bounding box after the perspective transformation is a rectangle.
33. The system of any one of claims 22 to 32, further comprising:
and the identification unit is used for acquiring the structural information of the document target in the image according to the image and the optical character identification model corresponding to the classification result.
34. The system according to claim 33, wherein the image comprises a plurality of document objects, the recognition unit being specifically configured to:
obtaining a plurality of local images according to the image, wherein the local images correspond to the document targets one by one;
and inputting the plurality of local images into the optical character recognition model corresponding to the classification result to obtain the structural information of the plurality of document targets.
35. The system according to any one of claims 22 to 34, wherein the document objects include the following categories: card, ticket, label, mail, or document.
36. An image processing system, characterized in that the system comprises:
a communication unit for receiving an image input by a user, the image including one or more document objects;
the communication unit is further configured to output a classification result of the document object, where the classification result includes a category label of the document object, the classification result is obtained according to a visual feature of the document object and a semantic feature of the document object, and the semantic feature of the document object is obtained according to a text in the document object.
37. The system of claim 36, wherein the classification result further comprises a confidence level corresponding to the class label.
38. The system according to claim 36 or 37, wherein the communication unit is specifically configured to:
presenting the classification result of the document target to the user; or,
and outputting the classification result of the document target to a result file.
39. The system of any one of claims 36 to 38, wherein the communication unit is further configured to:
and outputting the bounding box of the document target, wherein the bounding box is obtained according to the visual characteristics of the document target.
40. The system of claim 39, wherein the bounding box is obtained from initial corner points based on regression of the visual features.
41. The system according to claim 40, wherein the initial corner points comprise four corner points, the bounding box is obtained from modified corner points, the modified corner points are greater than or equal to four corner points, the modified corner points are obtained by modifying the initial corner points according to intersection points of edge lines, and the edge lines are straight lines extracted based on edge information of the visual feature regression.
42. The system of any one of claims 36 to 41, further comprising:
and the identification unit is used for acquiring the structural information of the document target in the image according to the image and the optical character identification model corresponding to the classification result.
43. An apparatus, comprising a processor and a memory;
the processor is to execute instructions stored in the memory to cause the device to perform the method of any of claims 1 to 21.
44. A computer-readable storage medium comprising instructions that direct a device to perform the method of any of claims 1-21.
45. A computer program product, characterized in that it causes a computer to carry out the method according to any one of claims 1 to 21, when said computer program product is run on a computer.
CN202011623342.0A 2020-12-30 2020-12-30 Image processing method, system, device and medium, and program product Pending CN112612911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011623342.0A CN112612911A (en) 2020-12-30 2020-12-30 Image processing method, system, device and medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011623342.0A CN112612911A (en) 2020-12-30 2020-12-30 Image processing method, system, device and medium, and program product

Publications (1)

Publication Number Publication Date
CN112612911A true CN112612911A (en) 2021-04-06

Family

ID=75252960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011623342.0A Pending CN112612911A (en) 2020-12-30 2020-12-30 Image processing method, system, device and medium, and program product

Country Status (1)

Country Link
CN (1) CN112612911A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297951A (en) * 2021-05-20 2021-08-24 北京市商汤科技开发有限公司 Document processing method, device, equipment and computer readable storage medium
CN113255767A (en) * 2021-05-25 2021-08-13 深圳壹账通智能科技有限公司 Bill classification method, device, equipment and storage medium
CN113255767B (en) * 2021-05-25 2023-11-24 深圳壹账通智能科技有限公司 Bill classification method, device, equipment and storage medium
CN113723289A (en) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113723289B (en) * 2021-08-30 2024-03-08 平安科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN113723431B (en) * 2021-09-01 2023-08-18 上海云从汇临人工智能科技有限公司 Image recognition method, apparatus and computer readable storage medium
CN113723431A (en) * 2021-09-01 2021-11-30 上海云从汇临人工智能科技有限公司 Image recognition method, image recognition device and computer-readable storage medium
CN113887484A (en) * 2021-10-20 2022-01-04 前锦网络信息技术(上海)有限公司 Card type file image identification method and device
CN113850239A (en) * 2021-11-29 2021-12-28 北京世纪好未来教育科技有限公司 Multi-document detection method and device, electronic equipment and storage medium
CN114842482A (en) * 2022-05-20 2022-08-02 北京百度网讯科技有限公司 Image classification method, device, equipment and storage medium
CN114842482B (en) * 2022-05-20 2023-03-17 北京百度网讯科技有限公司 Image classification method, device, equipment and storage medium
CN115546790A (en) * 2022-11-29 2022-12-30 深圳智能思创科技有限公司 Document layout segmentation method, device, equipment and storage medium
CN116152841A (en) * 2023-04-20 2023-05-23 中国科学院自动化研究所 Document entity and relation extraction method, device and storage medium

Similar Documents

Publication Publication Date Title
CN112612911A (en) Image processing method, system, device and medium, and program product
US11138423B2 (en) Region proposal networks for automated bounding box detection and text segmentation
US9311531B2 (en) Systems and methods for classifying objects in digital images captured using mobile devices
JP6000899B2 (en) How to detect text automatically
US9626555B2 (en) Content-based document image classification
US8094947B2 (en) Image visualization through content-based insets
US20160307045A1 (en) Systems and methods for generating composite images of long documents using mobile video data
US10140510B2 (en) Machine print, hand print, and signature discrimination
US11367310B2 (en) Method and apparatus for identity verification, electronic device, computer program, and storage medium
CN108491866B (en) Pornographic picture identification method, electronic device and readable storage medium
CA3129608C (en) Region proposal networks for automated bounding box detection and text segmentation
US9245357B2 (en) Image processing apparatus, image processing method, and storage medium
US11893765B2 (en) Method and apparatus for recognizing imaged information-bearing medium, computer device and medium
CN111160395A (en) Image recognition method and device, electronic equipment and storage medium
CN110598703B (en) OCR (optical character recognition) method and device based on deep neural network
CN112232336A (en) Certificate identification method, device, equipment and storage medium
CN112926565B (en) Picture text recognition method, system, equipment and storage medium
CN111738979A (en) Automatic certificate image quality inspection method and system
CN114120305A (en) Training method of text classification model, and recognition method and device of text content
US11557108B2 (en) Polygon detection device, polygon detection method, and polygon detection program
WO2022006829A1 (en) Bill image recognition method and system, electronic device, and storage medium
CN114140649A (en) Bill classification method, bill classification device, electronic apparatus, and storage medium
CN114494678A (en) Character recognition method and electronic equipment
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium
US11928872B2 (en) Methods and apparatuses for recognizing text, recognition devices and storage media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination