CN112287763A - Image processing method, apparatus, device and medium


Info

Publication number
CN112287763A
CN112287763A (application CN202011035247.9A)
Authority
CN
China
Prior art keywords
image
text region
feature
text
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011035247.9A
Other languages
Chinese (zh)
Inventor
雷迅
刘玉升
谭竣方
邹颖
国宏志
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN202011035247.9A
Publication of CN112287763A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

Embodiments of the invention provide an image processing method, apparatus, device and medium. The method comprises: performing feature extraction on an image to be processed to obtain a feature map of the image to be processed; determining an attribute prediction value for each feature point on the feature map, where the attribute prediction value of a feature point represents the probability that the pixel points corresponding to that feature point in the image to be processed belong to a text region, the predicted distances between those pixel points and a plurality of position points of the text region to which they belong, and the predicted category of that text region; and marking the position and category of each text region on the image to be processed according to the attribute prediction values of the feature points on the feature map. With this technical solution, the accuracy of locating and classifying text regions in an image can be improved.

Description

Image processing method, apparatus, device and medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method, an image processing apparatus, an image processing device, and an image processing medium.
Background
In the field of Optical Character Recognition (OCR), a frequently encountered scenario is that text content recognition needs to be performed on a fixed-template document (such as an identity card, a driving license, a car purchase invoice, or other documents or articles with a fixed layout). When recognizing the text content of an image of a fixed template, it is usually necessary to locate the target text regions to be recognized (such as name, gender and nationality on an identity card), i.e. to mark the position of each target text region in the image, so as to output the recognition result of each text region to the user.
In the related art, text regions are generally classified and located by performing keyword matching or the like on the content in the image. However, this approach places a very high precision requirement on the keyword matching, and even a slight deviation in the matching easily causes classification and positioning errors, so the final classification accuracy is low.
Disclosure of Invention
In view of the above problems, embodiments of the present invention propose an image processing method, apparatus, device and medium to overcome, or at least partially solve, these problems.
In order to solve the above problem, a first aspect of the present invention discloses an image processing method, including:
performing feature extraction on an image to be processed to obtain a feature map of the image to be processed;
determining an attribute predicted value of each feature point on the feature map, wherein the attribute predicted value of one feature point represents the probability that a pixel point corresponding to the feature point in the image to be processed belongs to a text region, the predicted distance between the pixel point and a plurality of position points of the text region to which the pixel point belongs, and the predicted category of the text region to which the pixel point belongs;
and marking the position and the category of each text area on the image to be processed according to the attribute predicted value of each feature point on the feature map.
In a second aspect of the embodiments of the present invention, an image processing apparatus is further disclosed, including:
the characteristic extraction module is used for extracting the characteristics of the image to be processed to obtain a characteristic diagram of the image to be processed;
the predicted value determining module is used for determining an attribute predicted value of each feature point on the feature map, wherein the attribute predicted value of one feature point represents the probability that a pixel point corresponding to the feature point in the image to be processed belongs to a text region, the predicted distance between the pixel point and a plurality of position points of the text region to which the pixel point belongs, and the predicted category of the text region to which the pixel point belongs;
and the marking module is used for marking the position and the category of each text area on the image to be processed according to the attribute predicted value of each feature point on the feature map.
In a third aspect of the embodiments of the present invention, an electronic device is further disclosed, including:
one or more processors; and
one or more machine readable media having instructions stored thereon which, when executed by the one or more processors, cause the electronic device to perform the image processing method according to the embodiments of the first aspect of the invention.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is further disclosed, which stores a computer program for causing a processor to execute the image processing method according to the embodiment of the first aspect of the present invention.
The embodiment of the invention has the following advantages:
in the embodiments of the present invention, feature extraction may be performed on an image to be processed to obtain a feature map of the image to be processed, and an attribute prediction value may then be determined for each feature point contained in the feature map. The attribute prediction value may represent the probability that the pixel points corresponding to the feature point in the image to be processed belong to a text region, the predicted distances between those pixel points and a plurality of position points of the text region to which they belong, and the predicted category of that text region. The position and category of each text region are then marked on the image to be processed according to these attribute prediction values.
On one hand, in the embodiments of the invention, the attribute prediction value of a feature point represents the probability that the pixel points corresponding to that feature point in the image to be processed belong to a text region, the predicted distances between those pixel points and a plurality of position points of the text region to which they belong, and the predicted category of that text region. In other words, classification is carried out at the level of individual pixel points, so the classification granularity is finer than that of schemes that match keywords or coordinates.
On the other hand, on the basis of this pixel-level classification, information of three different dimensions is determined for the same feature point, namely the probability of belonging to a text region, the predicted distances to a plurality of position points of that text region, and the predicted category of that text region. This enriches the reference information used for classification, and from the viewpoint of text region classification the three kinds of information are closely associated, so the position and category of the pixel points corresponding to a feature point in the image to be processed can be located more accurately through this associated information. By integrating the information, the position and category of each text region can be located more accurately, making the classification result of the text regions more reliable.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of an image processing method of an embodiment of the present invention;
FIG. 2 is a flow chart of steps of a further image processing method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of image processing of an identification card image according to an embodiment of the invention;
FIG. 4 is a flow chart of steps of a further image processing method according to an embodiment of the invention;
FIG. 5 is a flowchart illustrating the steps for training a default model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of feature extraction performed on a template image sample according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a default model according to an embodiment of the present invention;
fig. 8 is a block diagram of the image processing apparatus in the embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying figures. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In the related art, when locating and classifying text regions in images of fixed templates, the following two methods are generally adopted:
one way is to identify the content of the detected text lines and finally classify the text lines in text fields by means of text content keyword matching and the like. For example, for the id card image, text lines such as "name", "gender", "id card number" and the like in the id card image are detected first, so as to identify the text content in the text lines, and when the text content of "third name" is identified, the text area of the "third name" can be obtained to be classified into the "name" category according to the matching of the keywords of the "name".
However, this method places a very high requirement on the accuracy of character recognition; if the recognized text content deviates even slightly, misclassification easily occurs.
Another way is to detect the position of the text and classify the text lines into text fields according to the detected positions. This way, however, requires high position detection precision, and once the coordinate detection has errors, the classification easily goes wrong.
Neither of the above two approaches can classify text regions with high accuracy. In view of this, the applicant proposes the following core technical concept: classify the image at the pixel level, i.e. determine, for each pixel point on the image, the probability that it belongs to a text region, its distances to the four vertices of that text region, and which category of text region it belongs to, and then consider these three kinds of information together to determine the category and position of each text region, thereby improving the accuracy of text region classification.
Referring to fig. 1, fig. 1 is a flow chart illustrating steps of an image processing method according to an embodiment of the present invention.
As shown in fig. 1, an image processing method of this embodiment may specifically include the following steps:
step S101: and performing feature extraction on the image to be processed to obtain a feature map of the image to be processed.
In this embodiment, the image to be processed may be an image taken of an article having a fixed layout, and may include a plurality of image areas whose positions are specified in advance. For example, an article with a standard specification such as an identity card or a bank card contains regions formed by text such as the name and the ID number, and each such region has a specified position range on the card, so the corresponding card image likewise has image areas formed by text such as the name and the ID number.
In this embodiment, the feature extraction performed on the image to be processed may refer to multi-scale feature extraction, with the resulting feature map being a fused feature map obtained by fusing the feature maps of the multiple scales. By adopting multi-scale feature extraction, the problem of inaccurate features caused by characters of different sizes on the image to be processed can be avoided, so that the image features of the image to be processed are described accurately as a whole.
Step S102: and determining an attribute predicted value of each feature point on the feature map, wherein the attribute predicted value of one feature point represents the probability that a pixel point corresponding to the feature point in the image to be processed belongs to the text region, the predicted distance between the pixel point and a plurality of position points of the text region to which the pixel point belongs, and the predicted category of the text region to which the pixel point belongs.
In this embodiment, the obtained feature map may include a plurality of pixel points, where the pixel points on the feature map may be referred to as feature points. Since the feature map may reflect the overall features of the image to be processed, the feature point on one feature map may correspond to a small region on the image to be processed, that is, one feature point may describe the overall features in a small region on the image to be processed. In practice, a small area on the image to be processed is composed of one or more pixel points, and therefore, each feature point on the feature map may correspond to one or more pixel points in the image to be processed.
In this embodiment, the attribute prediction value of the feature point may include information of three dimensions, which are respectively: the probability that the pixel point corresponding to the characteristic point belongs to the text region, the predicted distance between the pixel point corresponding to the characteristic point and the plurality of position points of the text region, and the predicted category of the text region. In practice, the predicted value of the attribute may be a three-dimensional vector value by which the above information is recorded.
For example, taking an identity card as an example, a feature point A in the feature map corresponds to a small region in the identity card image, and that region may contain one or more pixel points. Suppose the attribute prediction value of feature point A is {0.8, (3,2,1,3), 1}: 0.8 means the probability that the pixel points corresponding to A belong to a text region is 0.8; (3,2,1,3) means the predicted distances from those pixel points to the 4 position points of the text region are 3, 2, 1 and 3 respectively; and the predicted category of the text region to which they belong is 1, where 1 may, for example, indicate the "name" category.
In this embodiment, each feature point on the feature map may correspond to a plurality of pixel points in the image to be processed, so that by obtaining the attribute prediction value of each feature point, the position information and category information of the pixel points corresponding to that feature point in the image to be processed are predicted.
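Purely as an illustration of the data involved (the structure and names below are not part of the original disclosure), the attribute prediction value of a feature point can be thought of as a small record holding the three kinds of information:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AttributePrediction:
    """Hypothetical container for the attribute prediction value of one feature point."""
    text_prob: float                                  # probability the corresponding pixels belong to a text region
    vertex_dists: Tuple[float, float, float, float]   # predicted distances to the 4 position points (vertices)
    category: int                                     # predicted category of the text region (e.g. 1 = "name")

# The example feature point A from the description above:
point_a = AttributePrediction(text_prob=0.8, vertex_dists=(3, 2, 1, 3), category=1)
```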
Step S103: and marking the position and the category of each text area on the image to be processed according to the attribute predicted value of each feature point on the feature map.
In this embodiment, each feature point on the feature map may correspond to one or more pixel points in the image to be processed. Through the attribute prediction values of the feature points, the probability that each small block region in the image to be processed belongs to a text region, the predicted distances between that small region and the position points of the text region it belongs to, and the predicted category of that text region can therefore be obtained. Small regions that belong to a text region and share the same predicted category can then be grouped and merged according to their predicted distances to the position points of the text region, thereby marking the position and category of each text region on the image to be processed.
By way of example, still taking the identity card image: suppose feature point A corresponds to a small region C in the image and feature point B corresponds to a small region D, with attribute prediction values {0.8, (3,2,1,3), 1} for A and {0.78, (4,2.5,0.8,2.8), 1} for B. The probabilities of A and B belonging to a text region are close, their predicted distances to the position points of the text region are close, and their predicted categories are the same, so the small regions C and D can be marked as the same text region according to the attribute prediction values of A and B, and the category of that text region can be determined to be 1. In addition, the position of the text region in the identity card image can be determined from the predicted distances of regions C and D to the position points of the text region.
In practice, if two adjacent small regions in the image to be processed belong to the same text region, their probabilities of belonging to a text region will be very close, their predicted distances to the position points of that text region will also be very close, and their predicted categories will be the same. There is therefore a close association among the three kinds of information contained in the attribute prediction value of a feature point, so that when the position and category of a text region are marked according to these three kinds of information, the marked position and category are more reliable, improving the accuracy of locating and classifying text regions.
In the embodiment of the invention, the feature map reflects the overall image features of the image to be processed, each feature point on the feature map corresponds to one or more pixel points in the image, and the attribute prediction value of a feature point represents the probability that the corresponding pixel points belong to a text region, their predicted distances to a plurality of position points of that text region, and the predicted category of that text region. Classification of text regions therefore relies on the pixel points themselves, so the classification granularity is finer and the classification accuracy of text regions is improved.
Compared with matching keywords, positions or coordinate points as in the related art, this method makes full use of the three dimensions of information represented by each feature point on the feature map, thereby achieving high-precision text region classification.
Referring to fig. 2 and 3, fig. 2 is a flowchart illustrating steps of another image processing method according to an embodiment of the present invention, and fig. 3 is a schematic diagram illustrating image processing of an identification card image.
In this embodiment, the image to be processed may be an image obtained by rectifying an original image. Specifically, the original image contains an image of a target object, where the target object is an object with a fixed template, that is, an article with a fixed format, such as an identity card or a bank card.
In practice, when the original image is captured by photographing the target object, the camera's shooting frame often cannot be aligned with the edges of the target object. In particular when photographing a certificate, the target object is often tilted within the shooting frame, which causes deviations when marking the positions of text regions during subsequent recognition of the certificate image, and thus inaccurate classification.
In view of this, in the present embodiment, the obtained original image may be subjected to tilt correction. The process of performing the tilt correction on the original image may be as follows:
first, the position of the image of the target object on the original image is predicted.
Secondly, according to the position of the image of the target object on the original image, the image to be processed is extracted from the original image, and the image to be processed comprises the image of the target object.
In some embodiments, the positions of a plurality of preset position points of the image of the target object on the original image may be determined, and the image enclosed by those preset position points may be extracted from the original image according to their positions, thereby obtaining an image to be processed that contains the target object. The preset position points may be points on the edges of the target object; for an identity card image, for example, they may be points on the four edges of the card, which form a standard rectangular frame.
In some embodiments, the coordinates that a plurality of preset vertices of the target object should have in the original image may be predicted, and a perspective transformation may then be applied to the image of the target object according to these predicted coordinates and the actual coordinates of the preset vertices in the original image, so that after transformation the preset vertices lie at the predicted coordinates. The image of the target object is thereby rectified within the original image, and the rectified image of the target object is then extracted from the original image as the image to be processed.
For example, as shown in fig. 3, the leftmost image is the original image, in which the identity card is tilted. To correct the tilt, the coordinates of the four vertices of the identity card may first be predicted; the predicted coordinates are shown as the four vertices of the dashed box in the figure. The real coordinates of the four vertices of the identity card in the original image are then transformed to the predicted coordinates, thereby correcting the tilt of the identity card, i.e. aligning the edges of the identity card in the image with the dashed box (fig. 3), and thus obtaining a regular identity card image.
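A minimal sketch of such a four-vertex perspective correction, assuming OpenCV is available; the corner values and output size below are invented for illustration and are not taken from the disclosure:

```python
import cv2
import numpy as np

def rectify_card(original: np.ndarray,
                 detected_corners: np.ndarray,
                 out_size=(480, 300)) -> np.ndarray:
    """Warp the tilted card so its four detected corners map to an upright rectangle."""
    w, h = out_size
    # Destination corners of the upright (rectified) card: TL, TR, BR, BL.
    target_corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    matrix = cv2.getPerspectiveTransform(np.float32(detected_corners), target_corners)
    return cv2.warpPerspective(original, matrix, (w, h))

# detected_corners would come from the vertex-prediction step, e.g. (made-up values):
# corners = np.float32([[35, 60], [610, 42], [640, 390], [50, 410]])
# to_be_processed = rectify_card(original_image, corners)
```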
In some embodiments, the original image may also be corrected by the image transformation model, in which case, the original image may be input into the image transformation model to obtain the corrected image to be processed output by the image transformation model.
The image transformation model is used for transforming the target object according to the predicted coordinates of a plurality of vertexes of the target object, so that a transformed original image, namely the image to be processed, is output. The image transformation model is a model obtained by training a neural network model by taking a template image as a sample.
After the original image has been tilt-corrected to obtain the image to be processed, each text region in the image to be processed can be located and classified. As shown in fig. 2, the method may specifically include the following steps:
step S301: and performing feature extraction on the image to be processed to obtain a feature map of the image to be processed.
In this embodiment, the process of step S301 is similar to the process of step S101, and reference may be made to the description of step S101 for relevant points, which is not described herein again.
Step S302: and determining an attribute predicted value of each feature point on the feature map, wherein the attribute predicted value of one feature point represents the probability that a pixel point corresponding to the feature point in the image to be processed belongs to the text region, the predicted distance between the pixel point and a plurality of position points of the text region to which the pixel point belongs, and the predicted category of the text region to which the pixel point belongs.
In this embodiment, the process of step S302 is similar to the process of step S102, and reference may be made to the description of step S102 for relevant points, which is not described herein again.
In this embodiment, after the attribute prediction value of each feature point on the feature map has been obtained, the position and category of each text region may be marked on the image to be processed according to those attribute prediction values, specifically as described in steps S303 to S305.
Step S303: and merging the adjacent characteristic points belonging to the text regions according to the position relation among the characteristic points belonging to the text regions on the characteristic diagram to obtain the text regions on the image to be processed.
In this embodiment, the position relationship between the feature points may refer to whether the feature points are adjacent to each other, and then the adjacent feature points belonging to the text region are merged to obtain the text region.
In a specific implementation, the feature points on the feature map whose predicted probability is smaller than a preset probability may be filtered out, leaving a plurality of activated feature points in the feature map; then, according to the positional relationships among the activated feature points belonging to text regions, adjacent activated feature points belonging to text regions are merged to obtain the text regions on the image to be processed.
Since each feature point has a probability of belonging to a text region, this probability reflects how likely the feature point is to lie in a text region, and a higher probability indicates a higher likelihood. Therefore, in this embodiment, feature points whose probability of belonging to a text region is smaller than the preset probability may be filtered out, i.e. feature points that obviously do not belong to a text region are discarded; the remaining, unfiltered feature points are the activated feature points, all of which may belong to text regions.
Because feature points with probability below the preset probability have been filtered out, noise interference from feature points that have little relation to text regions is reduced, which improves the accuracy and efficiency of the subsequent classification and localization of text regions. After the activated feature points are obtained, adjacent activated feature points may be merged. Since one feature point corresponds to one or more pixel points in the image to be processed, merging adjacent activated feature points amounts to merging a number of adjacent pixel points in the image to be processed, which yields the text regions in the image to be processed.
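A rough sketch of the filter-and-merge step under the assumption that the text probabilities form a 2-D score map; grouping adjacent activated points with a connected-component labelling routine is one possible realisation, not necessarily the one intended here:

```python
import numpy as np
from scipy import ndimage

def merge_activated_points(text_prob: np.ndarray, preset_prob: float = 0.5):
    """Filter feature points below the preset probability, then merge adjacent survivors."""
    activated = text_prob >= preset_prob             # boolean mask of activated feature points
    labels, num_regions = ndimage.label(activated)   # adjacent activated points share a label
    regions = [np.argwhere(labels == i + 1) for i in range(num_regions)]
    return regions   # each entry: (row, col) coordinates of one candidate text region
```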
Step S304: and marking the position of the text region in the image to be processed according to the predicted distance between each characteristic point belonging to the text region and a plurality of position points of the text region.
In this embodiment, after the text regions are obtained, the position of each text region needs to be determined. Each text region is obtained by merging several adjacent activated feature points, so it contains multiple activated feature points, and the attribute prediction value of each activated feature point includes the predicted distances to the position points of the text region. The position of a text region in the image to be processed can therefore be determined from the predicted distances between the activated feature points in the region and the position points of that region.
The text region may be a rectangular region, the predicted distances between the feature point and the plurality of position points of the text region may be the predicted distances between the feature point and the four vertices of the text region, and the position of the text region in the image to be processed may be determined by the predicted distances between the plurality of feature points and the four vertices of the text region.
In one embodiment, when determining the position of a text region in the image to be processed, the predicted distances between each activated feature point belonging to the text region and the position points of that region are weighted according to the weight of each activated feature point, to obtain the position information of the position points of the text region; the position of each text region in the image to be processed is then marked according to the obtained position information. The weight of an activated feature point is the prediction probability or confidence corresponding to that point.
In this embodiment, the weight of each activation feature point may refer to a prediction probability or confidence corresponding to the activation feature point. The confidence may be a difference between a probability that the activation feature point belongs to the text region and the true situation, and the higher the confidence is, the smaller the difference between the probability that the activation feature point belongs to the text region and the true situation is. For example, if the probability that an activated feature point belongs to a text region is 0.8 and the confidence is 0.9, the activated feature point is characterized as a feature point belonging to the text region.
In a specific implementation, each activated feature point has predicted distances to the position points of its text region. For the activated feature points in one text region, these predicted distances may be combined in a weighted manner to obtain the position information of the position points of the text region; this position information may be the coordinates of those position points in the image to be processed, and the position of the text region in the image is then marked according to these coordinates.
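One way to read this weighted combination, assuming each activated feature point predicts offsets to the four vertices and is weighted by its own probability or confidence (array names and shapes are illustrative only):

```python
import numpy as np

def region_vertices(coords: np.ndarray,          # (K, 2) positions of the K activated points
                    vertex_offsets: np.ndarray,  # (K, 4, 2) predicted offsets to the 4 vertices
                    weights: np.ndarray          # (K,) prediction probability or confidence
                    ) -> np.ndarray:
    """Weighted average of the per-point vertex predictions for one text region."""
    # Each point predicts absolute vertex positions as its own position plus its offsets.
    per_point_vertices = coords[:, None, :] + vertex_offsets         # (K, 4, 2)
    w = weights / weights.sum()
    return (per_point_vertices * w[:, None, None]).sum(axis=0)       # (4, 2) vertex coordinates
```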
After the position of each text region is determined, the category of each text region can be determined according to the predicted category of the text region to which each feature point in the text region belongs. Specifically, as described in step S305:
step S305: and marking the category of the text region of the image to be processed according to the prediction category of each characteristic point belonging to the text region for each obtained text region.
In one implementation, when determining the category of each text region, note that a text region is obtained by merging adjacent activated feature points and therefore contains those points. For each obtained text region, the predicted categories of the activated feature points belonging to it may be weighted according to the weight of each of those points, to obtain the category information of the text region; the category of each text region is then marked on the image to be processed according to the obtained category information. The weight of an activated feature point is the prediction probability or confidence corresponding to that point.
In this embodiment, the prediction probability or the confidence of each activation feature point may refer to a difference between a prediction category and a true category of the text region to which the activation feature point belongs, and the higher the confidence or the prediction probability is, the smaller the difference between the prediction category and the true condition of the text region to which the activation feature point belongs is. For example, if the prediction category of the text region to which the activation feature point belongs is 1 and the confidence is 0.9, the category characterizing the text region to which the activation feature point belongs is 1.
In this embodiment, the image to be processed may contain text regions of several categories, for example a "name" category and a "gender" category, and different categories may be represented by different numeric values. In a specific implementation, each activated feature point in a text region has a predicted category and a corresponding confidence for the region it belongs to, so the predicted categories of the activated feature points in the region can be combined in a weighted average to obtain the category information of the region, i.e. a category value; the category to which the text region belongs is determined from this category information and then marked on the image to be processed.
For example, if a text region contains activated feature points A and B with weights 0.8 and 0.5 and predicted categories 1 and 1 respectively, the weighted average value is 0.65, which rounds to 1, so the category of the text region is marked as the category corresponding to 1.
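A small sketch reproducing the worked example above; whether the weighted value is rounded or thresholded to pick the final category is an implementation choice, so the rounding below is only an assumption:

```python
import numpy as np

def region_category(categories: np.ndarray,   # (K,) predicted category of each activated point
                    weights: np.ndarray        # (K,) prediction probability or confidence
                    ) -> int:
    """Weighted category vote: weights (0.8, 0.5) and categories (1, 1) give 0.65, taken as class 1."""
    weighted_value = float((categories * weights).mean())
    return int(round(weighted_value))

print(region_category(np.array([1, 1]), np.array([0.8, 0.5])))   # 0.65 rounds to 1
```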
Through the above steps, the category and position of each text region in the image to be processed are obtained, realizing pixel-level classification of text regions and greatly improving its accuracy. Moreover, because feature points whose prediction probability is below the preset probability are filtered out, i.e. feature points that obviously do not belong to text regions are removed, interference from them is avoided and the efficiency of text region classification is improved.
In practice, after text regions are classified at the pixel level, there may be some text regions whose position and category cannot be marked because of the high accuracy demanded by pixel-level classification. For example, if a text region contains activated feature points A and B with weights 0.4 and 0.5 and predicted categories 1 and 1, the weighted value is only 0.45 and the category of the text region cannot be determined.
In this embodiment, in order to ensure that the category and position of every text region in the image to be processed can be determined, the text regions whose position and category have not been marked may be located and classified a second time, as described in the following steps S306 to S308.
Step S306: and obtaining parameter values of the target text regions which are not marked with the categories in the text regions on the image to be processed.
Wherein the parameter values include: text within the target text region and/or a position of the target text region in the image to be processed.
Step S307: and comparing the parameter value of the target text area with the parameter value of each template text area in the template image.
Step S308: and marking the category of the template text area with the matched parameter value as the category of the target text area.
In this embodiment, the template image may refer to an image that uses the same type of fixed template as the image to be processed. For example, if the image to be processed is an identification card image, the template image is a template image of the identification card.
In this embodiment, the text contained in the target text region may be compared, by keyword, with the text in each template text region of the template image; if they match, the category of the target text region is the category of that template text region. Alternatively, the position of the target text region in the image to be processed may be compared with the positions of the template text regions, and the category of the target text region determined from the matching position in the template image.
Of course, in order to improve the classification accuracy, the position of the target text region in the template image may be determined first, so as to obtain the sequence of the target text region in the template image, then the text included in the target text region is matched with the keywords of the template text region, and the category of the target text region is determined comprehensively according to the obtained position sequence and the keyword matching result.
Illustratively, the right image in fig. 3 is a marked identity card image, in which each text region has been marked according to the attribute prediction values of the feature points. Suppose the part enclosed by the solid-line box 401 is a text region whose category was not determined. If the text recognition result for region 401 is "male", its position in the identity card template falls on the 2nd text line (one text line is one row, and a single text line may contain several text regions of different categories), and the coordinate centre of region 401 lies in the left half of the rectified identity card image, then region 401 can be considered to belong to the "gender" category.
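A highly simplified sketch of this fallback rule, assuming the template stores, for each category, a keyword together with an expected text-line index and left/right half; every field name here is invented for illustration:

```python
from typing import Optional

def classify_fallback(region_text: str, line_index: int, center_x: float,
                      image_width: float, template_fields: list) -> Optional[str]:
    """Second-pass classification for regions the pixel-level pass could not label.

    template_fields: list of dicts such as
        {"category": "gender", "keyword": "male", "line": 2, "half": "left"}
    """
    half = "left" if center_x < image_width / 2 else "right"
    for field in template_fields:
        keyword_ok = field["keyword"] in region_text
        position_ok = field["line"] == line_index and field["half"] == half
        if keyword_ok and position_ok:        # combine keyword match with position order
            return field["category"]
    return None   # still undetermined
```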
In this embodiment, after the category and the position of each text region in the image to be processed are marked, the text content in the image to be processed can be extracted and identified, specifically, the following steps S309 to S310:
step S309: and cutting the image to be processed according to the position of each text area marked on the image to be processed to obtain a plurality of text area images.
In this embodiment, the image to be processed is marked with the category information and position information of each text region, so it may be cut according to this information to cut out each text region.
Illustratively, in the marked identity card image on the right of fig. 3, 11 text regions are marked; the regions framed by the dashed and solid boxes are the determined text regions, each with its own position and category information, so 11 text region images can be segmented from the image to be processed.
Step S310: and respectively carrying out text recognition on the text area images to obtain the text contents in the text area images.
In this embodiment, the text in the multiple text region images may be recognized to obtain the text content of each image, where the content is in text format, so the text images are converted into text-format content. For example, as illustrated in fig. 3, performing text recognition on the "name" text region image yields the text-format content "name", and likewise recognizing the "Zhang San" text region image yields the text-format content "Zhang San", thus achieving the goal of converting text images into text characters.
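A minimal sketch of the cut-and-recognize flow, assuming axis-aligned boxes on the rectified image and an arbitrary recognizer passed in as a function; `recognize_text` is a placeholder, not a specific OCR API:

```python
def cut_and_recognize(image, marked_regions, recognize_text):
    """Cut each marked text region out of the rectified image and run recognition on it."""
    results = {}
    for region in marked_regions:           # e.g. {"category": "name", "box": (x1, y1, x2, y2)}
        x1, y1, x2, y2 = region["box"]
        crop = image[y1:y2, x1:x2]          # numpy-style crop of the text region image
        results[region["category"]] = recognize_text(crop)
    return results
```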
When the technical scheme of the embodiment of the invention is adopted, the following advantages are achieved:
first, since the image to be processed is an image obtained by performing tilt correction on the target object in the original image, the problem that the position prediction of the text region is biased due to the tilt of the image to be processed can be avoided, thereby improving the accuracy of classifying the text region.
Secondly, because the feature points in the feature map, the prediction probability of which is less than the preset probability, can be filtered, the interference caused by the features of the non-text region can be avoided, and the efficiency of classifying the text regions is improved.
Thirdly, because the target text regions not yet marked with a category are classified a second time using text-matching or position-matching rules, it can be ensured that the category and position of all text regions in the image to be processed are marked, which improves the accuracy of text region classification.
In some embodiments, in order to improve the efficiency of text region classification and make it more automatic, a fully convolutional neural network may be used to classify the image to be processed at the pixel level. The network outputs, for each feature point on the feature map, whether it belongs to a text region, its distances to the four vertices of that text region, and which category of text region it belongs to, and the position and category of each text region are determined from this output.
Referring to fig. 4, a flowchart illustrating steps of an image processing method according to the present embodiment is shown, and as shown in fig. 4, the method may specifically include the following steps:
step S501: a plurality of template image samples carrying annotations are obtained.
A template image sample carrying annotations comprises: the true probability that each pixel point of the template image sample belongs to a text region, the true distances between the pixel point and the position points of the text region it belongs to, and the true category of that text region.
In this embodiment, the carried annotations describe the true category and true position of each text region to be recognized in the template image sample. In practice, the true category and true position of each text region can be labeled manually; in particular, the positions of the four vertices of each text region, the category of each text region and the text content of each text region may be annotated.
The template image sample may be an image captured for each of a plurality of target objects under the same template, for example, if the template is an identity card, images of a plurality of different identity cards may be used as the template image sample.
Step S502: and training a preset model by taking the plurality of template image samples carrying the labels as training samples to obtain a prediction model.
Step S503: and performing feature extraction on the image to be processed to obtain a feature map of the image to be processed.
Step S504: and inputting the feature map into the prediction model to obtain the attribute prediction value of each feature point on the feature map.
In this embodiment, the preset model may be used to merge extracted feature maps of multiple scales, and may be used to output an attribute prediction value of each feature point in the feature map.
After obtaining the prediction model, in one mode, feature extraction may be performed on the image to be processed to obtain a feature map of the image to be processed, and the feature map is input into the prediction model to output an attribute prediction value of each feature point on the feature map. Of course, in another mode, the image to be processed may also be directly input into the prediction model, feature extraction is performed on the image to be processed through the prediction model, and the attribute prediction value of each feature point on the feature map output by the prediction model is obtained. In this manner, the predictive model may include a feature extraction portion structure.
Step S505: And marking the position and the category of each text area on the image to be processed according to the attribute predicted value of each feature point on the feature map.
The process of marking the position and the category of each text region on the image to be processed according to the attribute prediction value of each feature point on the feature map is similar to the above steps S303 to S305, and the relevant points may refer to the description of steps S303 to S305 in the above embodiment, and are not described herein again.
Referring to fig. 5, a flowchart illustrating steps of training a preset model to obtain a prediction model is shown, and as shown in fig. 5, the preset model may be trained through the following steps:
step S601: and performing feature extraction on each template image sample in the training samples to obtain a feature map of the template image sample.
In one implementation of this embodiment, when performing feature extraction on each template image sample in the training set, multi-scale feature extraction may be performed on the sample to obtain feature maps at multiple scales, and these multi-scale feature maps are then fused to obtain the feature map of the template image sample.
Specifically, feature extraction may be carried out at several different convolution scales: the feature map output by each convolution stage is reduced to half the size of its input, yielding several feature maps of different scales. Then, starting from the feature map produced by the deepest convolution layer, unpooling/upsampling operations are applied in turn, and each upsampled feature map is merged (concatenated) with the feature map produced by the preceding convolution layer; after several such merge-and-convolve operations, a multi-scale fused feature map is finally obtained.
In one embodiment, a neural network model may be used to perform feature extraction on the template image sample, as shown in the schematic diagram of fig. 6. In fig. 6, Input denotes the template image sample, which is fed into an initial convolution layer (16 channels, stride /2) and then through a sequence of convolution stages (conv stage 1 to conv stage 4, with channel widths such as 64, 128, 256 and 384, each halving the spatial size), producing four feature maps of different scales, Feature map1 to Feature map4. Then, starting from the deepest map, Feature map4, unpooling/upsampling operations are applied in turn, and each upsampled feature map is merged with the feature map from the previous layer; the feature merging part fuses the features of the multiple scales and finally produces Feature map5 of the template image sample.
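A compact PyTorch sketch of the merge pattern described above (successive downsampling stages followed by upsample-and-concatenate from the deepest map back up); the channel widths and stage count are placeholders rather than the exact configuration of fig. 6:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMerge(nn.Module):
    """Toy backbone + top-down merge: each stage halves resolution, merging restores it."""
    def __init__(self, channels=(16, 64, 128, 256)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)
        # 1x1 convs to fuse an upsampled deep map with the shallower map it is concatenated to
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels[i] + channels[i + 1], channels[i], 1)
            for i in range(len(channels) - 1))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                       # Feature map1 ... map4, finest to coarsest
        merged = feats[-1]
        for i in range(len(feats) - 2, -1, -1):   # start from the deepest map and move up
            merged = F.interpolate(merged, size=feats[i].shape[-2:], mode="nearest")
            merged = self.fuse[i](torch.cat([feats[i], merged], dim=1))
        return merged                             # fused multi-scale feature map
```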
In some embodiments, the neural network model used for feature extraction may be included in the preset model; the resulting preset model is shown in fig. 7 and comprises a feature extraction part, a feature merging part and an output part. The output part applies 1 × 1 convolutions to Feature map5 produced by the feature merging part to obtain the output result.
Step S602: and inputting the characteristic diagrams of the template image samples in the training samples into the preset model to obtain the output result of the preset model.
In this embodiment, the feature map of each template image sample may be input into the preset model to obtain an output result of the preset model, where the output result may be a feature map including a plurality of feature points, and each feature point on the feature map carries the following information: prediction probability of belonging to the text region, prediction distance to each position point of the belonging text region, and prediction category of the belonging text region.
As shown in fig. 7, the output of the output part consists of several channels. One channel represents the predicted probability that a point on the feature map belongs to a text region, shown as "Text score" in fig. 7; another 8 channels represent the predicted positions of the four vertices of the text box relative to that point, shown as the "Quad coordinates" output by the "1 × 1, 8" module; the remaining channels represent the predicted category of the text region the point belongs to, shown as the "Quad type" output by the "1 × 1, N" module followed by "softmax", where N in "1 × 1, N" denotes the number of categories, for example "name", "gender" and "address" in an identity card image.
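A sketch of the three output heads described here, each a 1 × 1 convolution on the merged feature map; the class and attribute names are assumptions, and the logits/softmax arrangement is one plausible reading of fig. 7:

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """1x1-conv heads on the merged feature map, mirroring the output part in Fig. 7."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.score_head = nn.Conv2d(in_channels, 1, kernel_size=1)            # text / non-text probability
        self.quad_head = nn.Conv2d(in_channels, 8, kernel_size=1)             # offsets to the 4 vertices (x, y each)
        self.class_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # per-category scores

    def forward(self, merged):
        text_score = torch.sigmoid(self.score_head(merged))
        quad_coords = self.quad_head(merged)
        quad_type = torch.softmax(self.class_head(merged), dim=1)
        return text_score, quad_coords, quad_type
```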
After obtaining the output result, the loss value of the preset model may be determined according to the output result of each template image sample and the label of the template image sample, where the process of determining the loss value of the preset model may be as described in steps S603 to S606.
Step S603: and determining a first loss value of the preset model according to the real probability carried by each template image sample in the training sample and the output prediction probability of the preset model.
Step S604: and determining a second loss value of the preset model according to the real distance carried by each template image sample in the training samples and the output predicted distance of the preset model.
Step S605: and determining a third loss value of the preset model according to the real class carried by each template image sample in the training sample and the output prediction class of the preset model.
Step S606: and obtaining the loss value of the preset model according to the first loss value, the second loss value, the third loss value and the respective weights of the first loss value, the second loss value and the third loss value.
In this embodiment, the output result for each template image sample contains prediction information of three dimensions (the prediction probability, the predicted distances and the predicted category), and the annotation of each template image sample likewise contains labels of three dimensions (the true probability, the true distances and the true category). When computing the loss of the preset model, the loss in each dimension can therefore be obtained with a different loss function, and the dimension losses are then given different weights to obtain the overall loss of the preset model.
In this way, independent loss calculations are performed for the different outputs, and the overall loss is finally obtained according to their respective weights, which can improve the accuracy of the loss calculation.
In a specific implementation, the loss function used may be composed of three parts, namely a binary cross entropy loss, a mean square error loss and a multi-class cross entropy loss. The first loss value of the preset model may be calculated with the binary cross entropy loss function: it is determined from the true probability carried by each template image sample and the prediction probability output by the preset model, and reflects the difference between the predicted probability that each feature point on the feature map of the template image sample belongs to a text region and the true probability that it belongs to a text region.
The second loss value, which may be calculated with the mean square error loss function, is determined from the true distances carried by each template image sample and the predicted distances output by the preset model, and reflects the difference between the predicted distances from each feature point on the feature map of the template image sample to the position points of the text region it belongs to and the corresponding true distances.
The third loss value, which may be calculated with the multi-class cross entropy loss function, is determined from the true category carried by each template image sample and the prediction category output by the preset model, and reflects the difference between the predicted category of the text region each feature point on the feature map belongs to and the true category of that text region.
In practice, after the first loss value, the second loss value and the third loss value are determined, they may be weighted and averaged according to their respective weights to obtain the loss value of the preset model. The weight of each of the three loss values can be preset according to actual requirements; for example, the weight ratio may be set to 1:1:1.
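A hedged sketch of such a three-part loss is given below, assuming PyTorch tensors with the shapes noted in the comments; restricting the distance term to text pixels and the exact reduction are illustrative assumptions, not details prescribed by the embodiment.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_score, pred_quad, pred_logits,
                  true_score, true_quad, true_cls,
                  weights=(1.0, 1.0, 1.0)):
    """Illustrative three-part loss.
    pred_score:  (B, 1, H, W) predicted text probabilities in [0, 1]
    pred_quad:   (B, 8, H, W) predicted distances to the four vertices
    pred_logits: (B, N, H, W) raw category logits
    true_cls:    (B, H, W) integer category labels
    """
    mask = (true_score > 0.5).float()                                   # pixels inside text regions
    loss_score = F.binary_cross_entropy(pred_score, true_score)         # first loss: binary cross entropy
    per_pixel = ((pred_quad - true_quad) ** 2).mean(dim=1, keepdim=True)
    loss_quad = (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)    # second loss: mean square error
    loss_cls = F.cross_entropy(pred_logits, true_cls)                   # third loss: multi-class cross entropy
    w1, w2, w3 = weights                                                 # e.g. the 1:1:1 ratio above
    return w1 * loss_score + w2 * loss_quad + w3 * loss_cls
```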
Step S607: and updating the preset model according to the loss value of the preset model to obtain the prediction model.
In this embodiment, the parameters of the preset model may be iteratively updated according to the loss value of the preset model until the loss value of the preset model is less than or equal to a preset loss value. When the loss value is less than or equal to the preset loss value, the differences between the predicted probability that each feature point belongs to a text region, the predicted distances to the position points of that text region and the predicted category of that text region, on the one hand, and the corresponding true values, on the other hand, are very small, which means the preset model can accurately predict these three kinds of information for each feature point; the preset model in this state is then saved as the prediction model.
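A minimal sketch of this iterative updating is shown below; it assumes the OutputHead and combined_loss sketches above, a hypothetical data loader that yields feature maps together with their labels, and an Adam optimizer chosen purely for illustration.

```python
import torch

def train(model, data_loader, preset_loss=0.01, max_epochs=100, lr=1e-3):
    """Iteratively update the preset model until its loss is <= the preset loss value."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for feature_map, (true_score, true_quad, true_cls) in data_loader:
            pred_score, pred_quad, pred_logits = model(feature_map)
            loss = combined_loss(pred_score, pred_quad, pred_logits,
                                 true_score, true_quad, true_cls)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= max(len(data_loader), 1)
        if epoch_loss <= preset_loss:                        # loss value <= preset loss value
            break
    torch.save(model.state_dict(), "prediction_model.pth")   # save as the prediction model
    return model
```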
In one embodiment, since the template image samples in the training samples carry labels, a labeling matrix can be obtained for each template image sample through the following steps before the loss values are determined; with such a labeling matrix available, the three loss values of the preset model can each be determined based on it. Specifically, the process of generating the labeling matrix corresponding to a template image sample may be as follows:
firstly, for the label carried by each template image sample in the training samples, generating and storing a corresponding labeling matrix;
and then, reading the stored labeling matrix to obtain the true probability, the true distance and the true category carried by each template image sample in the training sample.
In this embodiment, the size of the labeling matrix is the same as the size of the output result of the preset model, and the labeling matrix may be understood as a feature map in which each feature point carries the true probability that the feature point belongs to the text region, the true distances between the feature point and the multiple position points of the text region, and the true category of the text region. In practice, different template image samples may correspond to different labeling matrices.
In a specific implementation, since the template image sample carries the label, that is, the template image sample carries information such as the position and the attribute of each text region, the label matrix of the template image sample can be directly generated according to the information.
In practice, the labeling matrix may be persistently stored, so that when the first loss value, the second loss value and the third loss value of the preset model are obtained, the labeling matrix corresponding to the input template image sample may be read, and the three loss values may be determined according to the labeling matrix and the output result output by the preset model.
In a specific implementation, the labeling matrix can be understood as a feature map in which each feature point carries the true probability that it belongs to a text region, the true distances to a plurality of position points of that text region and the true category of that text region, while the output result of the preset model is a feature map in which each feature point carries the corresponding prediction probability, predicted distances and prediction category; the three loss values can therefore be calculated from these two feature maps.
Because the labeling matrices can be stored persistently, when a batch of template image samples is used for multiple rounds of training, the labeling matrices of the samples in the batch can be read directly at each training pass, which avoids recomputing the labeling matrices every time and thus improves training efficiency.
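Purely as an illustration of this step, a labeling matrix could be generated and persisted roughly as follows; the 10-channel layout, the NumPy .npy cache files and the simplification of filling the axis-aligned bounding box of each quad are assumptions made for the sketch, not requirements of the embodiment.

```python
import os
import numpy as np

def build_label_matrix(regions, height, width):
    """Build a labeling matrix the same size as the model output:
    channel 0    - true probability of belonging to a text region
    channels 1-8 - true distances (offsets) to the four vertices of that region
    channel 9    - true category index of that region
    `regions` is assumed to be a list of dicts with 'quad' (four (x, y) vertices)
    and 'category', taken from the sample's label."""
    label = np.zeros((10, height, width), dtype=np.float32)
    for region in regions:
        quad = np.asarray(region["quad"], dtype=np.float32)        # (4, 2)
        x0, y0 = quad.min(axis=0).astype(int)
        x1, y1 = quad.max(axis=0).astype(int)
        for y in range(max(y0, 0), min(y1 + 1, height)):           # simplified: fill the bounding box
            for x in range(max(x0, 0), min(x1 + 1, width)):
                label[0, y, x] = 1.0                               # true probability
                label[1:9, y, x] = (quad - [x, y]).reshape(-1)     # true distances to the vertices
                label[9, y, x] = region["category"]                # true category
    return label

def save_label_matrix(sample_id, label, cache_dir="label_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    np.save(os.path.join(cache_dir, f"{sample_id}.npy"), label)

def load_label_matrix(sample_id, cache_dir="label_cache"):
    # Read the stored matrix instead of recomputing it at every training pass.
    return np.load(os.path.join(cache_dir, f"{sample_id}.npy"))
```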
The technical solution of this embodiment also provides the following advantages:
first, since a prediction model for determining an attribute prediction value of each feature point on the feature map can be trained in advance, the efficiency of determining the attribute prediction value of each feature point on the feature map can be improved, thereby improving the efficiency of labeling the position and category of each text region on the image to be processed.
Secondly, because the neural network model has certain generalization performance, the generalization of processing the image to be processed can be improved.
Thirdly, when the prediction model is trained, loss values can be respectively determined according to the probability that pixel points corresponding to the feature points belong to the text regions, the prediction distances between the pixel points and the multiple position points of the text regions to which the pixel points belong and the prediction types of the text regions to which the pixel points belong, and weights are given to the loss values of all dimensions, so that the overall loss is determined.
Referring to fig. 8, a block diagram of an image processing apparatus according to an embodiment of the present invention is shown, and as shown in fig. 8, the apparatus may specifically include the following modules:
the feature extraction module 901 may be configured to perform feature extraction on an image to be processed to obtain a feature map of the image to be processed;
a predicted value determining module 902, configured to determine an attribute predicted value of each feature point on the feature map, where the attribute predicted value of one feature point represents a probability that a pixel point corresponding to the feature point in the to-be-processed image belongs to a text region, a predicted distance between the pixel point and a plurality of position points of the text region to which the pixel point belongs, and a predicted category of the text region to which the pixel point belongs;
the marking module 903 may be configured to mark a position and a category of each text region on the image to be processed according to the attribute prediction value of each feature point on the feature map.
Optionally, the marking module 903 may include the following units:
the region determining unit may be configured to merge adjacent feature points belonging to the text regions according to a position relationship between feature points belonging to the text regions on the feature map, so as to obtain each text region on the image to be processed;
the position determining unit may be used for, for each obtained text region, marking the position of the text region in the image to be processed according to the predicted distances between each feature point belonging to the text region and a plurality of position points of the text region;
and the category determining unit may be used for, for each obtained text region, marking the category of the text region of the image to be processed according to the prediction category of each feature point belonging to the text region.
Optionally, the area determining unit may include:
the feature point filtering subunit is configured to filter feature points, where a probability corresponding to the predicted feature map is smaller than a preset probability, to obtain a plurality of remaining activated feature points in the predicted feature map;
the merging subunit may be configured to merge the adjacent activated feature points belonging to the text region according to the position relationship between the activated feature points belonging to the text region, so as to obtain each text region on the image to be processed.
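A minimal sketch of this filtering-and-merging step is shown below; the use of SciPy's connected-component labelling as the merging rule and the threshold value are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage  # assumed available for connected-component labelling

def group_text_regions(score_map, preset_probability=0.8):
    """Filter out feature points whose predicted probability is below the preset
    probability, then merge adjacent activated feature points into text regions."""
    activated = score_map >= preset_probability          # remaining activated feature points
    labels, num_regions = ndimage.label(activated)       # merge neighbouring activated points
    # Each entry holds the (row, col) coordinates of one text region's activated points.
    return [np.argwhere(labels == i + 1) for i in range(num_regions)]
```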
Optionally, the position determining unit may include:
the feature point filtering subunit is configured to filter feature points, where a probability corresponding to the predicted feature map is smaller than a preset probability, to obtain a plurality of remaining activated feature points in the predicted feature map;
a weighting processing subunit, which may be configured to, for each obtained text region, perform weighting processing on the predicted distances from the activated feature points belonging to the text region to a plurality of position points of the text region, according to the respective weights of these activated feature points, so as to obtain position information of the plurality of position points of the text region; wherein the weight of one activated feature point is the prediction probability or confidence corresponding to the activated feature point;
and the position marking subunit may be configured to mark the position of each text region in the image to be processed according to the obtained position information of the plurality of position points of the text region.
Optionally, the category determining unit may include:
the feature point filtering subunit is configured to filter feature points, where a probability corresponding to the predicted feature map is smaller than a preset probability, to obtain a plurality of remaining activated feature points in the predicted feature map;
the weighting processing subunit is configured to, for each obtained text region, perform weighting processing on the prediction categories of each activation characteristic point belonging to the text region according to the respective weight of each activation characteristic point belonging to the text region, so as to obtain category information of the text region; wherein, the weight of one activation characteristic point is the prediction probability or confidence corresponding to the activation characteristic point;
and the category marking subunit may be configured to mark a category of each text region of the image to be processed according to the obtained category information of the text region.
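The weighted decoding performed by the position determining unit and the category determining unit above could be sketched as follows; it assumes the point groups produced by the merging sketch above and score, quad and category maps laid out as in the earlier output-head sketch, with the category map already passed through softmax.

```python
import numpy as np

def decode_region(points, score_map, quad_map, cls_map):
    """Weight each activated point's predictions by its score (confidence) to obtain
    one region's four vertices and its category.
    points:    (P, 2) (row, col) coordinates of the region's activated feature points
    score_map: (H, W) text probabilities
    quad_map:  (8, H, W) predicted offsets to the four vertices
    cls_map:   (N, H, W) per-class probabilities"""
    rows, cols = points[:, 0], points[:, 1]
    weights = score_map[rows, cols]
    weights = weights / weights.sum()                                # normalised confidences
    offsets = quad_map[:, rows, cols].reshape(4, 2, -1)              # (vertex, xy, point)
    xy = np.stack([cols, rows], axis=0)[None, :, :]                  # each point's own (x, y)
    quad = ((offsets + xy) * weights).sum(axis=-1)                   # weighted (4, 2) vertex positions
    category = int(np.argmax((cls_map[:, rows, cols] * weights).sum(axis=-1)))
    return quad, category
```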
Optionally, the apparatus may further include:
a target text region determining module, configured to obtain parameter values of a target text region that is not labeled with a category in each text region on the image to be processed, where the parameter values may include: the text in the target text region and/or the position of the target text region in the image to be processed;
the matching module can be used for comparing the parameter value of the target text region with the parameter value of each template text region in the template image;
and the marking module may be used for marking the category of the template text region with the matched parameter value as the category of the target text region.
Optionally, the apparatus may further include:
the image cutting module can be used for cutting the image to be processed according to the position of each text region marked on the image to be processed to obtain a plurality of text region images;
the identification module may be configured to perform text identification on the text region images, respectively, to obtain text contents in the text region images.
Optionally, the apparatus may further include:
the system comprises a sample obtaining module, a label obtaining module and a label analyzing module, wherein the sample obtaining module can be used for obtaining a plurality of template image samples carrying labels;
the model training module can be used for training a preset model by taking the template image samples carrying the labels as training samples to obtain a prediction model, and the labels carried by one template image sample can include: the real probability that each pixel point of the template image sample belongs to a text region, the real distances between the pixel points and a plurality of position points of the text region to which the pixel points belong, and the real category of the text region to which the pixel points belong;
the prediction value determining module may be specifically configured to input the feature map into the prediction model, so as to obtain an attribute prediction value of each feature point on the feature map.
Optionally, the model training module may include:
the feature extraction unit may be configured to perform feature extraction on each template image sample in the training samples to obtain a feature map of the template image sample;
the input unit may be configured to input a feature map of each template image sample in the training sample into the preset model, so as to obtain an output result of the preset model;
the first loss determining unit may be configured to determine a first loss value of the preset model according to a true probability carried by each template image sample in the training samples and a prediction probability of an output of the preset model;
the second loss determining unit may be configured to determine a second loss value of the preset model according to a real distance carried by each template image sample in the training samples and a predicted distance output by the preset model;
a third loss determining unit, configured to determine a third loss value of the preset model according to a real category carried by each template image sample in the training sample and a prediction category output by the preset model;
the loss determining unit may be configured to obtain a loss value of the preset model according to the first loss value, the second loss value, the third loss value and the respective weights of the first loss value, the second loss value and the third loss value;
and the updating unit can be used for updating the preset model according to the loss value of the preset model to obtain the prediction model.
Optionally, the feature extraction unit may include:
the multi-scale feature extraction unit can be used for extracting features of multiple scales of each template image sample in the training samples to obtain feature maps of the template image samples in the multiple scales;
and the fusion unit can be used for fusing the characteristic maps of the template image sample in multiple scales to obtain the characteristic map of the template image sample.
Optionally, the apparatus may further include:
the label matrix generation module can be used for generating and storing a corresponding label matrix aiming at each template image sample in the training sample carrying a label;
and the reading module can be used for reading the stored labeling matrix so as to obtain the true probability, the true distance and the true category carried by each template image sample in the training sample.
Optionally, the apparatus may further include:
an obtaining module operable to obtain an original image, which may include an image of a target object;
a prediction module operable to predict a position of an image of the target object on the original image;
the rectification module may be configured to extract the image to be processed from the original image according to a position of the image of the target object on the original image, where the image to be processed includes the image of the target object.
Embodiments of the present invention further provide an electronic device, which may include a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the image processing method according to the embodiments of the present invention.
Embodiments of the present invention further provide a computer-readable storage medium storing a computer program for causing a processor to execute the image processing method according to the embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The image processing method, the image processing apparatus, the image processing device and the storage medium provided by the present invention have been described in detail above. The principles and embodiments of the present invention are described herein by using specific examples, and the descriptions of the above examples are only used to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of the present specification should not be construed as a limitation of the present invention.

Claims (15)

1. An image processing method, characterized in that the method comprises:
performing feature extraction on an image to be processed to obtain a feature map of the image to be processed;
determining an attribute predicted value of each feature point on the feature map, wherein the attribute predicted value of one feature point represents the probability that a pixel point corresponding to the feature point in the image to be processed belongs to a text region, the predicted distance between the pixel point and a plurality of position points of the text region to which the pixel point belongs, and the predicted category of the text region to which the pixel point belongs;
and marking the position and the category of each text area on the image to be processed according to the attribute predicted value of each feature point on the feature map.
2. The method according to claim 1, wherein marking the position and the category of each text area on the image to be processed according to the attribute predicted value of each feature point on the feature map comprises:
merging adjacent feature points belonging to the text regions according to the position relation among the feature points belonging to the text regions on the feature map, to obtain the text regions on the image to be processed;
for each obtained text region, marking the position of the text region in the image to be processed according to the predicted distances between each feature point belonging to the text region and a plurality of position points of the text region;
and for each obtained text region, marking the category of the text region of the image to be processed according to the prediction category of each feature point belonging to the text region.
3. The method according to claim 2, wherein merging adjacent feature points belonging to a text region according to a positional relationship between the feature points belonging to the text region on the feature map to obtain each text region on the image to be processed, comprises:
filtering out feature points whose predicted probability on the prediction feature map is smaller than a preset probability, to obtain a plurality of remaining activated feature points in the prediction feature map;
and merging the adjacent activated feature points belonging to the text region according to the position relation among the activated feature points belonging to the text region, to obtain each text region on the image to be processed.
4. The method according to claim 2, wherein for each obtained text region, marking the position of the text region in the image to be processed according to the predicted distance between each feature point belonging to the text region and a plurality of position points of the text region respectively comprises:
filtering out feature points whose predicted probability on the prediction feature map is smaller than a preset probability, to obtain a plurality of remaining activated feature points in the prediction feature map;
for each obtained text region, weighting the predicted distances from each activated feature point belonging to the text region to a plurality of position points of the text region according to the respective weights of the activated feature points belonging to the text region, to obtain position information of the plurality of position points of the text region; wherein the weight of one activated feature point is the prediction probability or confidence corresponding to the activated feature point;
and marking the position of each text region in the image to be processed according to the obtained position information of the plurality of position points of the text region.
5. The method according to claim 2, wherein for each obtained text region, marking the category of the text region of the image to be processed according to the predicted category of each feature point belonging to the text region comprises:
filtering out feature points whose predicted probability on the prediction feature map is smaller than a preset probability, to obtain a plurality of remaining activated feature points in the prediction feature map;
for each obtained text region, according to the respective weight of each activated feature point belonging to the text region, carrying out weighting processing on the respective prediction categories of each activated feature point belonging to the text region to obtain category information of the text region; wherein, the weight of one activation characteristic point is the prediction probability or confidence corresponding to the activation characteristic point;
and marking the category of the text region of the image to be processed according to the obtained category information of each text region.
6. The method according to any one of claims 1-5, further comprising:
for a target text region which is not marked with a category in each text region on the image to be processed, obtaining parameter values of the target text region, wherein the parameter values comprise: the text in the target text region and/or the position of the target text region in the image to be processed;
comparing the parameter value of the target text region with the parameter value of each template text region in the template image;
and marking the category of the template text area with the matched parameter value as the category of the target text area.
7. The method according to any one of claims 1-5, further comprising:
cutting the image to be processed according to the position of each text area marked on the image to be processed to obtain a plurality of text area images;
and respectively carrying out text recognition on the text area images to obtain the text contents in the text area images.
8. The method according to any one of claims 1-5, further comprising:
obtaining a plurality of template image samples carrying labels;
and training a preset model by taking the template image samples carrying the labels as training samples to obtain a prediction model, wherein the labels carried by one template image sample comprise: the real probability that each pixel point of the template image sample belongs to a text region, the real distances between the pixel points and a plurality of position points of the text region to which the pixel points belong, and the real category of the text region to which the pixel points belong;
determining an attribute predicted value of each feature point on the feature map, including:
and inputting the feature map into the prediction model to obtain the attribute prediction value of each feature point on the feature map.
9. The method of claim 8, wherein training a preset model by using the template image samples carrying the labels as training samples to obtain a prediction model comprises:
performing feature extraction on each template image sample in the training samples to obtain a feature map of the template image sample;
inputting the feature maps of the template image samples in the training samples into the preset model to obtain the output result of the preset model;
determining a first loss value of the preset model according to the real probability carried by each template image sample in the training sample and the output prediction probability of the preset model;
determining a second loss value of the preset model according to the real distance carried by each template image sample in the training samples and the output prediction distance of the preset model;
determining a third loss value of the preset model according to the real category carried by each template image sample in the training sample and the output prediction category of the preset model;
obtaining a loss value of the preset model according to the first loss value, the second loss value, the third loss value and respective weights of the first loss value, the second loss value and the third loss value;
and updating the preset model according to the loss value of the preset model to obtain the prediction model.
10. The method of claim 9, wherein performing feature extraction on each template image sample in the training samples to obtain a feature map of the template image sample comprises:
for each template image sample in the training sample, carrying out multi-scale feature extraction on the template image sample to obtain multi-scale feature maps of the template image sample;
and fusing the characteristic graphs of the template image sample in multiple scales to obtain the characteristic graph of the template image sample.
11. The method of claim 9, wherein prior to determining the first loss value, the second loss value, and the third loss value, the method further comprises:
generating and storing a corresponding label matrix for labels carried by each template image sample in the training samples;
and reading the stored labeling matrix to obtain the true probability, the true distance and the true category carried by each template image sample in the training sample.
12. The method of any of claims 1-5 or 9-11, wherein prior to feature extraction of the image to be processed, the method further comprises:
obtaining an original image, the original image comprising an image of a target object;
predicting a position of an image of the target object on the original image;
and extracting the image to be processed from the original image according to the position of the image of the target object on the original image, wherein the image to be processed comprises the image of the target object.
13. An image processing apparatus characterized by comprising:
the characteristic extraction module is used for extracting the characteristics of the image to be processed to obtain a characteristic diagram of the image to be processed;
the predicted value determining module is used for determining an attribute predicted value of each feature point on the feature map, wherein the attribute predicted value of one feature point represents the probability that a pixel point corresponding to the feature point in the image to be processed belongs to a text region, the predicted distance between the pixel point and a plurality of position points of the text region to which the pixel point belongs, and the predicted category of the text region to which the pixel point belongs;
and the marking module is used for marking the position and the category of each text area on the image to be processed according to the attribute predicted value of each feature point on the feature map.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing implementing the image processing method according to any of claims 1-12.
15. A computer-readable storage medium storing a computer program for causing a processor to execute the image processing method according to any one of claims 1 to 12.