CN115631493B - Text region determining method, system and related device - Google Patents

Text region determining method, system and related device

Info

Publication number
CN115631493B
Authority
CN
China
Prior art keywords
text
probability
image
detected
text region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211423558.1A
Other languages
Chinese (zh)
Other versions
CN115631493A (en)
Inventor
许康
宁可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kingdee Software China Co Ltd
Original Assignee
Kingdee Software China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kingdee Software China Co Ltd filed Critical Kingdee Software China Co Ltd
Priority to CN202211423558.1A priority Critical patent/CN115631493B/en
Priority to PCT/CN2022/137159 priority patent/WO2024092957A1/en
Publication of CN115631493A publication Critical patent/CN115631493A/en
Application granted granted Critical
Publication of CN115631493B publication Critical patent/CN115631493B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • G06V30/147Determination of region of interest
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a text region determining method, system and related device, relating to artificial intelligence technology. The method comprises: obtaining a text image to be detected, and performing prediction processing on it through a target prediction model to obtain a first probability that each pixel point in the image belongs to a text region and a second probability that each pixel point and its adjacent pixel points belong to the same type; performing probability analysis on the first probability and the second probability of each pixel point to obtain a probability threshold for classifying text regions and non-text regions; and performing text region extraction on the image according to the probability threshold to obtain the text regions in the image. The method and device have strong universality in practical task scenarios: on the basis of the model's prediction processing, they can accurately locate the text regions of an image and effectively filter out the non-text regions that tend to impair text detection, thereby improving the accuracy of subsequent recognition of the image text.

Description

Text region determining method, system and related device
Technical Field
The embodiment of the application relates to the technical field of information processing, in particular to a text region determining method, a text region determining system and a related device.
Background
Text in images carries rich information, and extracting that information (i.e., text recognition) is of great significance for understanding the scenes in which images are captured, among other tasks.
Accordingly, text detection and recognition technology is widely applied in fields such as education, finance and medical care, benefiting industrial productivity and people's daily study and life. The technology comprises an early text detection stage (locating text positions) and a later text recognition stage (recognizing text content); both are indispensable, and text detection, as the prerequisite of text recognition, is especially critical.
However, existing text detection technology often uses a prediction algorithm alone to directly predict the location of text, so the detection result depends entirely on the performance of the network algorithm. As a result, text arranged along multiple directions is easily stuck together and output as one whole text block region, or non-text content (such as logos, stray symbols or binding trace points) is falsely detected and output as text content, leading to a poor text detection effect.
Disclosure of Invention
The embodiments of the present application provide a text region determining method, a text region determining system and a related device, which are used to improve the detection of text region positions in images.
An embodiment of the present application provides a text region determining method, including:
acquiring a text image to be detected;
performing prediction processing on the text image to be detected through a target prediction model to obtain a first probability that each pixel point in the text image to be detected belongs to a text region and a second probability that each pixel point and its adjacent pixel points belong to the same type, wherein the same type means that the two compared pixel points both belong to a text region or both belong to a non-text region;
performing probability analysis processing according to the first probability and the second probability of each pixel point in the text image to be detected to obtain a probability threshold value for classifying a text region and a non-text region;
and carrying out text region extraction processing on the text image to be detected according to the probability threshold value to obtain a text region in the text image to be detected.
The text region determining method described in the first aspect of the present application may be implemented by the text region determining system described in the second aspect below.
A second aspect of the embodiments of the present application provides a text region determining system, including:
the acquisition unit is used for acquiring the text image to be detected;
the processing unit is used for performing prediction processing on the text image to be detected through a target prediction model to obtain a first probability that each pixel point in the text image to be detected belongs to a text region and a second probability that each pixel point and its adjacent pixel points belong to the same type, wherein the same type means that the two compared pixel points both belong to the text region or both belong to a non-text region;
The processing unit is further used for carrying out probability analysis processing according to the first probability and the second probability of each pixel point in the text image to be detected, so as to obtain a probability threshold value for classifying the text region and the non-text region;
and the processing unit is also used for extracting the text region of the text image to be detected according to the probability threshold value to obtain the text region in the text image to be detected.
A third aspect of the embodiments of the present application provides an electronic device, including:
a central processing unit, a memory and an input/output interface;
the memory is a short-term memory or a persistent memory;
the central processor is configured to communicate with the memory and to execute instruction operations in the memory to perform the method described in the first aspect of the embodiments of the present application or any particular implementation of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method described in the first aspect of the embodiments of the present application or any specific implementation of the first aspect.
A fifth aspect of the embodiments of the present application provides a computer program product comprising instructions or a computer program which, when run on a computer, causes the computer to perform the method as described in the first aspect of the embodiments of the present application or any specific implementation of the first aspect.
From the above technical solutions, the embodiments of the present application have at least the following advantages:
a probability threshold is obtained through analysis of the first probability and the second probability of each pixel point in the text image to be detected, so that the likelihood that each pixel point truly belongs to a text region is judged comprehensively. This effectively reduces the rate at which image text regions are mislocated and promotes the accuracy of subsequent text recognition.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic view of an application environment according to an embodiment of the present application;
FIG. 2a is a schematic flow chart of a text region determining method according to an embodiment of the present application;
FIG. 2b is another flow chart of a text region determination method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of obtaining a mask map according to an embodiment of the present application;
FIG. 4 is another schematic diagram of obtaining a mask map according to an embodiment of the present application;
FIG. 5a is a schematic diagram of a convolution kernel according to an embodiment of the present disclosure;
FIG. 5b is a schematic diagram of a generating process of a mask result diagram according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a process of a mask diagram according to an embodiment of the present application;
FIG. 7 is a schematic diagram of calculating a center point position distance according to an embodiment of the present application;
FIG. 8 is a schematic diagram of text feature acquisition according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a text region determining system according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the following description, references to "one embodiment" or "an embodiment" describe a subset of all possible embodiments; it is to be understood that "one embodiment" or "an embodiment" may refer to the same subset or to different subsets of all possible embodiments, and they may be combined with one another where no conflict arises. In the following description, the term "plurality" means at least two. References herein to a value reaching a threshold, if any, may include the case where the former is greater than the latter.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
For ease of understanding and description, terms and expressions which are referred to in the embodiments of the present application will be described before further detailed description of the present application, and are applicable to the following explanations.
1. Optical Character Recognition (OCR): the process by which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting dark and bright patterns, and then translates those shapes into computer text through a character recognition method. That is, for printed characters, the characters in a paper document are converted optically into an image file of black-and-white dot matrices, and recognition software converts the characters in the image into a text format that computer products can identify, allowing further editing and processing by word-processing software and yielding the characters and layout information.
2. Deep Convolutional Neural Network (CNN): a network structure in deep learning, widely used in the field of computer vision with good results.
3. Transformer: a network structure based on the self-attention mechanism, good at extracting relevant features from an input word vector or image vector.
4. Image mask (mask): controlling an area or process of image processing by covering the processed image (fully or partially) with a selected image, graphic or object. The particular image or object used for the overlay is called a mask or template. In optical image processing, the mask may be a film, a filter, or the like; in digital image processing, the mask is a two-dimensional matrix array, and sometimes a multi-valued image. The mask image mentioned in the embodiments of the present application can be regarded as a converted form of the text image, so processing performed on the mask image can, in a certain sense, be understood as processing the text image.
For ease of understanding and explanation, fig. 1 shows a schematic view of an application environment suitable for the embodiments of the present application, in which the server 102, the terminal device 101 and the printing device 103 are communicatively connected to one another. The server 102 can provide functions such as scanning, image text region determination, and text detection and recognition, and is not specifically limited here. The terminal device 101 may be any of various electronic devices having a display screen and supporting data input, including but not limited to smart phones, tablet computers, laptop computers, desktop computers and wearable electronic devices. Specifically, the terminal device 101 may have a client application installed (such as an APP or a WeChat applet) through which the user communicates with the server 102 and the printing device 103. For example, the user may send a text image to be detected, intended for text detection and recognition, to the printing device 103 through the client application; images of the printed product can then undergo related text detection processing on the server 102, such as text region determination, so that the server 102, which supports OCR technology, can subsequently perform text recognition on the processed text regions and finally return the text information contained in the target text region to the terminal device 101.
It should be noted that, at least part of the operations provided in the embodiments of the present application may be implemented by the terminal device and the server together as described above, or may be implemented all at the server side, or may also be implemented all at the terminal device side, which may be specifically determined according to an actual application scenario, and this is not a limitation.
The present application will be described in further detail below.
Referring to fig. 2a, the first aspect of the present application provides an embodiment of a text region determining method, which includes steps 21 to 24:
21. Acquire the text image to be detected.
The text image to be detected is often referred to as an input image, at least part of which contains the text to be recognized, i.e. the text area required in the embodiments of the present application.
22. Perform prediction processing on the text image to be detected through the target prediction model to obtain the first probability and the second probability of each pixel point in the text image to be detected.
Specifically, the first probability that each pixel point in the text image to be detected belongs to a text region and the second probability that each pixel point and its adjacent pixel points belong to the same type are obtained, where the same type means that the two compared pixel points both belong to the text region or both belong to a non-text region. The second probability thus characterizes the type correlation between a pixel point and its adjacent pixel points; on top of the first probability, it helps judge comprehensively how likely each pixel point truly belongs to a text region, effectively reducing the rate at which non-text regions are misjudged as text regions.
23. Compute a probability threshold from the first probability and the second probability.
After the first probability and the second probability of each pixel point in the text image to be detected are obtained, probability analysis processing can be carried out on the probabilities so as to obtain a probability threshold value for classifying the text region and the non-text region.
24. Extract the text region from the text image to be detected according to the probability threshold.
In practical application, the probability threshold is determined by how the pixel points correlate with the text region (the first probability and the second probability), so text region extraction can be performed on the text image to be detected according to the probability threshold to determine the text region in it; non-text regions that tend to impair the text recognition effect can thus be effectively distinguished and eliminated.
Therefore, in the embodiments of the present application, a probability threshold is obtained by analyzing the first probability and the second probability of each pixel point in the text image to be detected, so that the likelihood that each pixel point truly belongs to a text region is judged comprehensively, the rate at which image text regions are mislocated is effectively reduced, and the accuracy of subsequent text recognition is promoted.
Referring to fig. 2b to 8, the present application provides another embodiment of a text region determining method, which includes steps 30 to 34:
In some specific examples, prior to step 32, the method may further include step 30 (obtaining a target prediction model), and specific operations of step 30 may include:
acquiring a sample text image, a first sample result of whether each pixel point in the sample text image belongs to a text region, and a second sample result of whether each pixel point in the sample text image and adjacent pixel points belong to the same type; carrying out prediction processing on the sample text image through an initial prediction model to obtain a first probability that each pixel point in the sample text image belongs to a text region and a second probability that each pixel point and adjacent pixel points belong to the same type; and training an initial prediction model according to the difference between the first probability of each pixel point in the sample text image and the first sample result and the difference between the second probability of each pixel point in the sample text image and the second sample result so as to obtain a target prediction model for performing prediction processing on the text image to be detected.
As a possible implementation, to match the input settings of the prediction model, train the required target prediction model more quickly and efficiently, and improve the universality of images in the network model and its response rate, the sample text images to be fed into the prediction model can be size-preprocessed before training starts. For example, to obtain text image samples whose dimensions are multiples of 32: if the original size of a text image sample is (h_original, w_original, 3), the input image may be size-preprocessed before model training and prediction as follows:
processed image height h = round(h_original / 32.0) × 32 + 32,
processed image width w = round(w_original / 32.0) × 32 + 32;
where round is a rounding function that removes the fractional part of its input, e.g., round(1.1) = 1. Both h and w after size preprocessing are therefore preset positive multiples of 32; in other words, the size of the text image sample used as a training sample falls within a preset size interval.
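For illustration, the size preprocessing above can be sketched as follows (a minimal sketch, assuming the truncating behaviour of round described in the text):

```python
# Minimal sketch of the size preprocessing. Per the text, round() truncates
# the fractional part, so plain integer division is used here.
def preprocess_size(h_original: int, w_original: int) -> tuple[int, int]:
    h = (h_original // 32) * 32 + 32   # nearest lower multiple of 32, plus 32
    w = (w_original // 32) * 32 + 32
    return h, w

print(preprocess_size(900, 1200))      # -> (928, 1216)
```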
Based on the above description of the size preprocessing, accordingly, in some specific examples, the size of the text image to be detected actually input into the target prediction model mentioned in the application may be specifically the same as the size of the text image sample after the size preprocessing, that is, the same as a preset positive multiple of 32, so as to adapt to the input setting of the target prediction model.
For ease of illustration and understanding, the initial prediction model referred to herein is exemplified by a convolutional neural network (CNN). As shown in fig. 3 and fig. 4, after each batch of text image samples is input into the CNN network model, a mask map corresponding to each text image sample is output, such as pixel_mask_prediction and center_mask_prediction. A loss value is calculated between the true vector pixel mask (which can be understood as the first sample result) and the predicted vector pixel_mask_prediction (which can be understood as the first probability), and another between the true vector center mask (which can be understood as the second sample result) and the predicted vector center_mask_prediction (which can be understood as the second probability). Model parameter values in the CNN network model are then updated with a back propagation algorithm, so that the loss between pixel mask and pixel_mask_prediction and the loss between center mask and center_mask_prediction both keep shrinking. Training is repeated until the two loss values meet the convergence condition, at which point training stops and the required target prediction model is obtained; specifically, the convergence condition is that both loss values remain stable within a preset loss interval for several consecutive iterations.
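As a hedged sketch of this training step — the patent does not name a concrete loss function, so binary cross-entropy is assumed here, and `model` stands for any CNN producing the two predicted mask maps:

```python
import torch
import torch.nn as nn

# Hypothetical training step for the description above; the loss function
# and tensor shapes are illustrative assumptions.
def train_step(model, optimizer, images, pixel_mask, center_mask):
    pixel_pred, center_pred = model(images)   # (B,1,h/2,w/2), (B,4,h/2,w/2)
    loss_pixel = nn.functional.binary_cross_entropy(pixel_pred, pixel_mask)
    loss_center = nn.functional.binary_cross_entropy(center_pred, center_mask)
    optimizer.zero_grad()
    (loss_pixel + loss_center).backward()     # back propagation updates parameters
    optimizer.step()
    return loss_pixel.item(), loss_center.item()
```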
Note that pixel_mask and center_mask may be referred to as the pixel mask map and center mask map of the text image sample, respectively. A mask map is in fact a vector, so the labels corresponding to a text image sample can also be regarded as two vectors representing probability results: the pixel mask (a two-dimensional vector) and the center mask (a three-dimensional vector). Besides the CNN model, the initial prediction model referred to herein may employ other deep learning models that can process visual images, such as the Long Short-Term Memory (LSTM) model, to process each input image into mask maps — such as the pixel mask map and the center mask map — containing (representing) the first and second probability information of the image.
Of course, besides the pixel mask map and the center mask map, the prediction model may output other mask maps, as long as the mask maps obtained by the prediction processing contain the first probability and the second probability information of the input image, so that the operation of step 23 can subsequently be executed to cut out and locate the target text region of the input image. This includes assisting in filtering out candidate text block regions that are actually non-text content falsely detected as text; such falsely detected text blocks carry text-like content such as stripes, shading, incomplete seals, casual graffiti, binding trace points, logo marks, redundant punctuation marks and the like.
31. Acquire the text image to be detected.
32. Perform prediction processing on the text image to be detected through the target prediction model to obtain the first probability and the second probability of each pixel point in the text image to be detected.
As shown in fig. 4, the text image to be detected is the model input image, with width w and height h; the target prediction model outputs mask maps of the image, such as a center mask map and a pixel mask map, whose width and height are each half of the original width and height of the text image to be detected. In the pixel mask map, the value (e.g., pixel value) corresponding to each pixel of the text image to be detected can be represented as a probability value between 0 and 1, namely the probability that the location of that pixel belongs to a text region (i.e., the first probability). In addition, the upper and lower white bars (or white text bars) in the pixel mask map of fig. 4 frame the approximate location of the text content. The pixel mask thus serves to preliminarily separate text regions from non-text regions in the image, mainly delineating the text regions roughly.
In the center mask map, the value (e.g., pixel value) corresponding to each pixel of the text image to be detected is actually a 4-element vector (left_ratio, right_ratio, top_ratio, down_ratio). The first element left_ratio is the probability that the pixel and the first pixel adjacent to its left belong to the same type, where the same type means the two pixels are both text region pixels or both non-text region pixels; similarly, the second element right_ratio is the probability that the pixel is of the same type as the first pixel adjacent to its right; the third element top_ratio is the probability that the pixel is of the same type as the pixel adjacent above it; and the fourth element down_ratio is the probability that the pixel is of the same type as the pixel adjacent below it. The center mask therefore contains the second probability that each pixel of the text image to be detected and its adjacent pixels belong to the same type (i.e., each element value of the 4-element vector). The center mask and its probability information indicate the possibility of erroneous adhesion within a text region (e.g., two adjacent pixels not belonging to the same type), and are mainly used later to assist in filtering text blocks more accurately and to ensure the target text region can be cut precisely for effective use by the text recognition module.
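Purely as an illustration of the data layout just described (the shapes and values here are stand-ins, not taken from the patent), reading the four same-type probabilities of one pixel might look like:

```python
import numpy as np

# Illustrative only: a center mask of shape (h//2, w//2, 4) with the channel
# order (left, right, top, down) described above.
center_mask = np.random.rand(64, 64, 4)        # stand-in for a model output
y, x = 10, 20
left_ratio, right_ratio, top_ratio, down_ratio = center_mask[y, x]
if left_ratio < 0.5:
    # a low value suggests (y, x) and its left neighbour straddle a
    # text / non-text boundary, i.e. possible erroneous adhesion
    pass
```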
Of course, in some specific examples, besides the above similarity prediction over the four adjacent pixels (up, down, left, right), other similarity prediction schemes may be used, such as one over the 8 surrounding pixels.
33. Compute a probability threshold from the first probability and the second probability.
In some specific examples, the specific implementation of step 33 may include:
combining the first probability and the second probability of each pixel point in the text image to be detected to obtain a first kernel corresponding to each pixel point; clustering the plurality of first kernels to obtain cluster center kernels; and determining, from the cluster center kernels, a probability threshold for classifying text regions and non-text regions.
Based on the above step explanation, and combining practical factors such as model training and historical experience: the first probability of each pixel point of the text image to be detected at the corresponding position (e.g., coordinates (x, y)) of the pixel mask map, i.e., its confidence, is denoted p_ratio; the second probability of the corresponding pixel point is recorded as the component element values of the 4-element vector at the corresponding position of the center mask, which can be written as a feature vector (a1, a2, a3, a4) of four element values. This feature vector can be understood as representing the probabilities that a pixel point and its adjacent pixel points are the same kind of text region pixel. From these five values — the first probability p_ratio and the four element values of the feature vector — a first kernel of size 3×3 as shown in fig. 5a can be formed, so that every pixel point in the text image to be detected generates a first kernel. The kernels mentioned in this application can be understood as convolution kernels.
In some specific examples, the cluster center kernels may include a first cluster center kernel kernel_1 corresponding to the text region type and a second cluster center kernel kernel_2 corresponding to the non-text region type, so that the probability threshold mentioned in step 33 (also called the pixel threshold threshold_pixel) can be calculated from kernel_1, kernel_2 and the maximum-value relationship with the first probabilities of the plurality of pixel points:
specifically, threshold_pixel can be selected adaptively: the obtained kernels are clustered using the K-means automatic clustering algorithm. Two kernels are first taken as the cluster center kernels, and clustering then proceeds in one dimension using the L2 norm (L2-norm), finally updating the required kernel_1 and kernel_2. The probability threshold threshold_pixel can then be calculated according to the following formula:
threshold_pixel = max(min((kernel_1 + kernel_2) / 2, p_ratio_max), p_ratio_min),
where p_ratio_max represents an upper bound on the confidence p_ratio and p_ratio_min a lower bound; the purpose of setting these bounds is to ensure that threshold_pixel is neither too large nor too small, keeping it stable within a range. For example, threshold_pixel = max(min((kernel_1 + kernel_2) × 0.5, 0.62), 0.42).
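A hedged sketch of this adaptive threshold selection follows. How each 3×3 first kernel is reduced to the one-dimensional value being clustered is not fully specified, so a per-kernel mean is assumed, and the bounds reuse the example values 0.42 and 0.62 above:

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_threshold(first_kernels, p_ratio_min=0.42, p_ratio_max=0.62):
    # first_kernels: array of shape (N, 3, 3), one kernel per pixel point.
    # Reduce each kernel to one value (assumption), then 2-means cluster.
    values = first_kernels.reshape(len(first_kernels), -1).mean(axis=1)
    km = KMeans(n_clusters=2, n_init=10).fit(values.reshape(-1, 1))
    kernel_1, kernel_2 = km.cluster_centers_.ravel()
    return max(min((kernel_1 + kernel_2) / 2.0, p_ratio_max), p_ratio_min)
```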
Of course, other clustering methods may also be used to cluster the resulting kernels, such as the Gaussian Mixture Model (GMM).
34. Extract the text region from the text image to be detected according to the probability threshold.
In some specific examples, the specific operational procedure of step 34 may include steps 341 through 344:
341. Combine the first probability of each pixel point in the text image to be detected with the first probabilities of its surrounding pixel points to obtain a second kernel corresponding to each pixel point.
Here, "surrounding" and "adjacent" may have the same meaning or may be different, and for example, "surrounding" may refer to eight directions of upper left, upper right, lower left, and "adjacent" may refer to four directions of upper right, lower right, and right left. As shown in fig. 5b, the first probability value of the current pixel with coordinates (x, y) is 0.96, and the first probability values of the surrounding pixels are 0.22, 0.61, … …, 0.58, 0.78, etc., so that the matrix formed in this way can be regarded as the second kernel corresponding to the current pixel.
342. Convolve the first kernel and the second kernel of each pixel point in the text image to be detected to obtain a convolution value.
Specifically, after the second kernel of each pixel point in the text image to be detected is obtained, the convolution value can be obtained by convolving it with the corresponding first kernel. The convolution value represents the element value of each pixel point at the corresponding position of a mask result map (the mask_pix map). In some specific examples, the convolution values can be regarded as decimal pixel values between 0 and 1, such as 0.01 or 0.2; these element values then form an entirely new mask map (the mask_pix map) for the text image to be detected, alongside the pixel mask and the others. For a text image to be detected of size (h, w), the resulting mask_pix map is actually a two-dimensional tensor of size (h/2, w/2), the same size as the pixel mask map. From this derivation of the mask_pix map out of the two mask maps, the value of each pixel of the text image to be detected in the mask_pix map depends not only on that pixel's first probability p_ratio in the pixel mask, but also on the second probability values corresponding to the adjacent pixels at its peripheral positions (the periphery).
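As an illustrative sketch of steps 341 and 342 for a single pixel — the exact convolution and the normalisation into [0, 1] are assumptions, since the patent does not spell them out:

```python
import numpy as np

def mask_pix_value(first_kernel, pixel_probs, y, x):
    # second kernel: the 3x3 patch of first probabilities around (y, x)
    second_kernel = pixel_probs[y - 1:y + 2, x - 1:x + 2]
    # "convolution" taken here as an element-wise product-and-sum (assumption)
    conv = float((first_kernel * second_kernel).sum())
    # squash into [0, 1] so the value behaves like a decimal pixel value
    return min(max(conv / first_kernel.size, 0.0), 1.0)
```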
343. Binarize the mask result map (i.e., the mask_pix map) according to the probability threshold.
The mask result map contains the convolution value of every pixel point of the text image to be detected. As can be seen from the above description, the values pointed to by the same position in the mask_pix map shown in fig. 6 and the pixel mask map shown in fig. 4 are not necessarily the same; for example, the white text bar area outlined by the box in the pixel mask map may show black background blocks in the mask_pix map, because non-text content or falsely detected components may sit at those black blocks.
The operations of step 343 may therefore be: select basic pixel block diagrams (which may be called tiles) with small bounding boxes in the mask_pix map shown in fig. 6 — i.e., first delimit small areas in the image — and binarize the entire mask_pix map over them. Specifically, for the value of each pixel point of the text image to be detected within a basic pixel block diagram: when the value is greater than threshold_pixel it becomes 1, indicating that the pixel displays text content, i.e., its position belongs to a text region; otherwise it becomes 0, indicating that the pixel displays black background, i.e., its position belongs to a non-text region. Since the mask_pix map and the pixel mask map have the same size, distinguishing text from non-text for the pixel content at each position of the mask_pix map by means of threshold_pixel is equivalent to distinguishing the region class of the pixel content at each position of the text image to be detected in the pixel mask map — that is, predicting whether the pixel at the corresponding position belongs to a text region or a non-text region.
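A minimal sketch of the binarization in step 343:

```python
import numpy as np

# Values above threshold_pixel become 1 (text content), everything else
# becomes 0 (black background / non-text).
def binarize(mask_pix: np.ndarray, threshold_pixel: float) -> np.ndarray:
    return (mask_pix > threshold_pixel).astype(np.uint8)
```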
344. Perform contour extraction on the binarized mask result map, and take the extracted areas as the text regions in the text image to be detected.
Taking the image blocks framed by the small bounding boxes above as an example, not every pixel value within an image block is 1, so position extraction must be performed below on the image blocks containing text content (i.e., containing 1s) to ensure text content is framed as completely as possible and no characters are missed. Thus, in some specific examples, the operations of step 344 may be: use the contour extraction function (findContours) in the software library OpenCV to extract the positions of image blocks containing the value 1. Specifically, outermost contour extraction can be performed on image blocks containing the value 1 in the CV_RETR_EXTERNAL mode to obtain text blocks, i.e., to mark out the text regions. As one possible example, repeated text regions selected by overlapping boxes may be merged and treated as one large, identical text block.
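A short sketch of this contour extraction with OpenCV's Python binding (cv2.RETR_EXTERNAL is the modern spelling of the CV_RETR_EXTERNAL mode named above):

```python
import cv2

# cv2.RETR_EXTERNAL keeps only the outer boundary of each connected block
# of 1-valued pixels; the bounding rectangles become candidate text blocks.
def extract_text_blocks(binary_mask):
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]   # (x, y, w, h) per block
```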
As another possible implementation, the position contour extraction on image blocks containing the value 1 (step 344) may specifically include: gradient-preprocessing the (binarized) image to extract the points where pixel values change abruptly; drawing contour lines from the position information of these abrupt points so as to filter out some confounding factors; and then drawing outline lines from the retained abrupt-point information, so that the positions of the text block areas (or text regions) can finally be fitted from the outline information.
Because the extracted text block areas may contain content falsely detected as characters, such as logos or random graffiti, the candidate text block areas need to be filtered to further avoid false detections and finally locate, within the text image to be detected, the truly valuable target text region containing the characters actually to be recognized. Thus, after step 34, the method may further include the following operation step 35 (filtering out part of the text regions):
determining various data information corresponding to the text region; performing feature fusion processing on the various data information to obtain a feature result corresponding to the text region; and filtering out the text regions whose feature result does not reach a feature selection threshold.
As a possible implementation manner, specific operations for determining various data information corresponding to the text region may be:
intercepting the position-mapped images of the text region from the text image to be detected and from the pixel mask map respectively, wherein the pixel mask map contains the first probability of each pixel point in the text image to be detected; performing text feature extraction on the two position-mapped images corresponding to the text region through a text feature extraction model to obtain the text feature (text_feature) of the text region; and taking the position information of the text region within the text image to be detected, the size of the text region, the size of the text image to be detected and the text feature of the text region as the data information of the text region.
For example, the filtering mechanism for candidate text blocks can be driven by data such as the center point position distance (part of the position information), the sizes and the text feature text_feature. Specifically, as shown in fig. 8, the process of acquiring text features includes:
according to the position of the candidate text block, cutting out the block diagrams at the corresponding positions — i.e., the position-mapped images — from the text image to be detected and the pixel mask map, and then feeding each position-mapped image into the pre-trained text feature extraction model to output the text feature of the candidate text block. To speed up processing and response, the position-mapped images can first be resized to a consistent size before being input to the text feature extraction model. In addition, it is also possible to (see the sketch after the following list):
A. as shown in fig. 7, a center point position distance ratio_distance between the center point of the candidate text block and the center point of the pixel mask map (or can be understood as a mask_pix map) is calculated;
B. the size of the candidate text block (width text_w and height text_h) and the size of the text image to be tested (width w and height h) are obtained.
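A hedged sketch gathering these data for one candidate block; the crop size of 224, the Euclidean distance and the `feature_model` interface are assumptions for illustration:

```python
import cv2
import numpy as np

def gather_block_data(block, image, pixel_mask, feature_model, size=224):
    x, y, text_w, text_h = block
    h, w = image.shape[:2]
    # crop the block from both images and resize to one consistent size
    crop_img = cv2.resize(image[y:y + text_h, x:x + text_w], (size, size))
    crop_mask = cv2.resize(
        pixel_mask[y // 2:(y + text_h) // 2, x // 2:(x + text_w) // 2],
        (size, size))                       # pixel mask is half resolution
    text_feature = feature_model(crop_img, crop_mask)
    # centre-point distance ratio between the block and the full map
    cx, cy = x + text_w / 2.0, y + text_h / 2.0
    ratio_distance = np.hypot(cx - w / 2, cy - h / 2) / np.hypot(w / 2, h / 2)
    return ratio_distance, text_w, text_h, w, h, text_feature
```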
In some specific examples, the process of training the aforementioned text feature extraction model (step 36) may include:
performing text feature extraction processing on the sample original image through the encoder model to obtain text features of the sample original image; performing text feature restoration processing on the text features through the decoder model to obtain restored images; training an encoder model and a decoder model according to the difference between the original image and the restored image of the sample; and taking the trained encoder model as a text feature extraction model.
Specifically, a network structure comprising an encoder and a decoder can first be designed, where both the encoder and the decoder are composed of Transformers. The encoder network encodes the input original picture (which may be resized to a certain size in advance) into a fixed-length vector (the image_feature). The decoder network then restores, from the image_feature vector, an image as similar to the original as possible. During training, the difference between the original picture and the restored picture is calculated — the smaller the better; after a specified number of training rounds, the difference tends to zero or stabilizes, at which point the encoder part of the network can be extracted as the required text feature extraction model. Moreover, through training the network retains the ability to encode and extract information from pictures, adapting it to practical feature extraction task scenarios.
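A hedged sketch of this encoder-decoder design; the patch size, model width, layer counts and L1 reconstruction loss are illustrative assumptions — the patent only states that both parts are built from Transformers and trained to minimise the original-versus-restored difference:

```python
import torch
import torch.nn as nn

class TextFeatureAutoencoder(nn.Module):
    def __init__(self, patch_dim=768, d_model=256, n_layers=2):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=4, batch_first=True)
        self.proj_in = nn.Linear(patch_dim, d_model)
        self.encoder = nn.TransformerEncoder(layer(), n_layers)
        self.decoder = nn.TransformerEncoder(layer(), n_layers)
        self.proj_out = nn.Linear(d_model, patch_dim)

    def forward(self, patches):                 # patches: (B, N, patch_dim)
        z = self.encoder(self.proj_in(patches))
        image_feature = z.mean(dim=1)           # fixed-length vector
        restored = self.proj_out(self.decoder(z))
        return image_feature, restored

model = TextFeatureAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
patches = torch.rand(8, 16, 768)                # dummy image patches
feature, restored = model(patches)
nn.functional.l1_loss(restored, patches).backward()
opt.step()                                      # keep the encoder after training
```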
It should be noted that the order in which the various data information — including but not limited to the center point position distance, the sizes and the text feature — is acquired is not limited.
As one possible implementation, the feature result can be calculated as follows: considering the data ratio_distance, text_w, text_h, w, h and text_feature together, the following function is set to compute the feature result y for the candidate text block (i.e., the candidate text region):
y = f(ratio_distance, text_w, text_h, w, h, text_feature);
the y value may in particular be between 0 and 1, for which purpose 0.5 may be chosen as the corresponding feature selection threshold (or filtering threshold).
The candidate text blocks whose feature results do not reach the feature selection threshold are then filtered out, and the regions covered by candidate text blocks whose feature results exceed the threshold are determined to be target text regions; that is, the final, truly high-value text regions remain after this further precise screening.
Specifically, on the basis of the calculated feature result y: when y is greater than 0.5, the candidate text block is retained and the area it covers can further be determined to be a target text region; conversely, when y < 0.5, the candidate text block contains content falsely detected as characters and needs to be filtered out.
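A minimal sketch of this final filtering, with `fuse_features` standing in for the unspecified fusion function f:

```python
# fuse_features maps the data of one candidate block to y in [0, 1].
def filter_blocks(blocks, block_data, fuse_features, threshold=0.5):
    target_regions = []
    for block, data in zip(blocks, block_data):
        y = fuse_features(*data)
        if y > threshold:
            target_regions.append(block)   # covered area: target text region
    return target_regions
```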
The operations of steps 31 to 34 are similar to those of steps 21 to 24 and are not described again in detail. The order of steps 30 and 31 is not limited, nor is the order of step 36 relative to any of steps 30 to 34 (they may even be performed simultaneously); the order can be set according to the actual scene requirements and is not limited here.
In summary, a text region determining method with strong robustness and high operating efficiency is provided. The finally determined target text region has high reliability, low-value non-text content and falsely detected content are avoided, and the positions of valid characters in the text image to be detected are located accurately, which effectively promotes the recognition of characters in the target text region by the subsequent OCR technology and enhances the user experience.
Referring to fig. 9, a second aspect of the present application provides an embodiment of a text region determining system, the embodiment including:
an acquiring unit 901, configured to acquire a text image to be detected;
the processing unit 902 is configured to perform prediction processing on the text image to be detected through a target prediction model to obtain a first probability that each pixel point in the text image to be detected belongs to a text region and a second probability that each pixel point and its adjacent pixel points belong to the same type, where the same type means that the two compared pixel points both belong to the text region or both belong to a non-text region;
The processing unit 902 is further configured to perform probability analysis processing according to the first probability and the second probability of each pixel point in the text image to be detected, so as to obtain a probability threshold value for classifying the text region and the non-text region;
the processing unit 902 is further configured to perform text region extraction processing on the text image to be tested according to the probability threshold value, so as to obtain a text region in the text image to be tested.
Optionally, the processing unit 902 is specifically configured to:
combining the first probability and the second probability of each pixel point in the text image to be detected to obtain a first kernel corresponding to each pixel point;
clustering the plurality of first kernels to obtain cluster center kernels;
a probability threshold for classifying text regions from non-text regions is determined based on the cluster center kernel.
Optionally, the cluster center kernels include a first cluster center kernel kernel_1 corresponding to the text region type and a second cluster center kernel kernel_2 corresponding to the non-text region type; the processing unit 902 is specifically configured to:
a probability threshold for classifying the text region and the non-text region is calculated based on a maximum relationship between the kernel_1, the kernel_2, and a first probability of the plurality of pixels.
Optionally, obtaining a mask map through prediction processing, wherein the mask map comprises first probability and second probability information of each pixel point in the text image to be detected; the processing unit 902 is specifically configured to:
combining the first probability of each pixel point in the text image to be detected with the first probability of surrounding pixel points to obtain a second kernel corresponding to each pixel point;
carrying out convolution processing on the first kernel and the second kernel of each pixel point in the text image to be detected to obtain a convolution value;
performing binarization processing on the mask result graph according to the probability threshold; the mask result graph comprises a convolution value of each pixel point in the text image to be detected;
and carrying out contour extraction processing on the mask result graph after binarization processing, and taking the extracted area as a text area in the text image to be detected.
Optionally, the number of text regions includes a plurality; the processing unit 902 is further configured to:
determining various data information corresponding to the text region;
performing feature fusion processing on various data information to obtain a feature result corresponding to the text region;
text regions for which the feature result does not reach the feature selection threshold are filtered.
Optionally, obtaining a mask map through prediction processing, wherein the mask map comprises first probability and second probability information of each pixel point in the text image to be detected; the processing unit 902 is specifically configured to:
Respectively intercepting corresponding position mapping images of a text region in a text image to be detected and a pixel mask map, wherein the pixel mask map comprises a first probability of each pixel point in the text image to be detected;
performing text feature extraction processing on the two position mapping images corresponding to the text region through a text feature extraction model to obtain text features of the text region;
and taking the position information of the text region in the text image to be detected, the size of the text region, the size of the text image to be detected and the text characteristics of the text region as data information of the text region.
Optionally, the processing unit 902 is further configured to:
performing text feature extraction processing on the sample original image through the encoder model to obtain text features of the sample original image;
performing text feature restoration processing on the text features through the decoder model to obtain restored images;
training an encoder model and a decoder model according to the difference between the original image and the restored image of the sample;
and taking the trained encoder model as a text feature extraction model.
Optionally, the processing unit 902 is further configured to:
acquiring a sample text image, a first sample result of whether each pixel point in the sample text image belongs to a text region, and a second sample result of whether each pixel point in the sample text image and adjacent pixel points belong to the same type;
Carrying out prediction processing on the sample text image through an initial prediction model to obtain a first probability that each pixel point in the sample text image belongs to a text region and a second probability that each pixel point and adjacent pixel points belong to the same type;
and training an initial prediction model according to the difference between the first probability of each pixel point in the sample text image and the first sample result and the difference between the second probability of each pixel point in the sample text image and the second sample result so as to obtain a target prediction model for performing prediction processing on the text image to be detected.
In this embodiment, the operations performed by each unit of the text region determining system are similar to those described in the foregoing first aspect or any specific method embodiment of the first aspect, and are not described herein in detail.
Referring to fig. 10, an electronic device 1000 of an embodiment of the present application may include one or more central processing units (CPUs, central processing units) 1001 and a memory 1005, where the memory 1005 stores one or more application programs or data.
Wherein the memory 1005 may be volatile storage or persistent storage. The program stored in the memory 1005 may include one or more modules, each of which may include a series of instruction operations in the electronic device. Still further, the central processor 1001 may be configured to communicate with the memory 1005, and execute a series of instruction operations in the memory 1005 on the electronic device 1000.
The electronic device 1000 can also include one or more power supplies 1002, one or more wired or wireless network interfaces 1003, one or more input/output interfaces 1004, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The cpu 1001 may perform the operations performed by the foregoing first aspect or any specific method embodiment of the first aspect, which are not described herein.
A computer readable storage medium is provided comprising instructions which, when run on a computer, cause the computer to perform a method as described in the first aspect or any specific implementation of the first aspect.
A computer program product comprising instructions or a computer program is provided which, when run on a computer, causes the computer to perform the method as described above in the first aspect or any one of the specific implementations of the first aspect.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the steps do not imply an order of execution; the execution order should be determined by the function and internal logic of the steps, and should not limit the implementation process of the embodiments of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (11)

1. A text region determining method, comprising:
acquiring a text image to be detected;
performing prediction processing on the text image to be detected through a target prediction model to obtain a first probability that each pixel point in the text image to be detected belongs to a text region and a second probability that each pixel point and its adjacent pixel points belong to the same type, wherein belonging to the same type means that the two compared pixel points both belong to text regions or both belong to non-text regions, and the adjacent pixel points of each pixel point comprise the pixel points in the four directions directly above, directly below, directly to the left of and directly to the right of the pixel point;
performing probability analysis processing according to the first probability and the second probability of each pixel point in the text image to be detected to obtain a probability threshold value for classifying a text region and a non-text region;
and carrying out text region extraction processing on the text image to be detected according to the probability threshold value to obtain a text region in the text image to be detected.
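To fix the shapes involved, the toy fragment below walks the three steps of this claim on random data; the fixed threshold and the simple mask are stand-ins for the cluster-derived threshold of claims 2-3 and the extraction of claim 4, which are sketched under those claims. Everything here is illustrative.

import numpy as np

rng = np.random.default_rng(0)
p1 = rng.random((32, 32))      # first probability: pixel belongs to a text region
p2 = rng.random((4, 32, 32))   # second probability: same type as each of the 4 adjacent pixels

threshold = 0.5                # stand-in for the cluster-derived probability threshold
text_mask = p1 > threshold     # stand-in for the text region extraction step
print("candidate text pixels:", int(text_mask.sum()))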
2. The text region determining method according to claim 1, wherein the performing probability analysis processing according to the first probability and the second probability of each pixel point in the text image to be detected to obtain a probability threshold for classifying a text region and a non-text region includes:
combining the first probability and the second probability of each pixel point in the text image to be detected to obtain a first kernel corresponding to each pixel point;
performing clustering processing on the plurality of first kernels to obtain cluster center kernels;
and determining a probability threshold value for classifying the text region and the non-text region according to the cluster center kernels.
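A hedged sketch of these steps, assuming scikit-learn's KMeans for the clustering and a two-element kernel per pixel (its first probability and the mean of its second probabilities); the final threshold formula is only one plausible reading of claim 3, since the exact calculation is not disclosed.

import numpy as np
from sklearn.cluster import KMeans

def probability_threshold(p1, p2):
    # p1: (H, W) first probabilities; p2: (4, H, W) second probabilities
    first_kernels = np.stack([p1.ravel(), p2.mean(axis=0).ravel()], axis=1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(first_kernels)
    k1, k2 = km.cluster_centers_      # cluster center kernels
    if k1[0] < k2[0]:                 # let kernel_1 be the text-region center
        k1, k2 = k2, k1
    p_max = p1.max()                  # highest first probability among the pixels
    # Illustrative reading of claim 3: midpoint of the centers' first-probability
    # components, rescaled by the highest observed first probability.
    return 0.5 * (k1[0] + k2[0]) * p_max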
3. The text region determining method according to claim 2, wherein the cluster center kernels include a first cluster center kernel_1 corresponding to the type representing text regions and a second cluster center kernel_2 corresponding to the type representing non-text regions;
the determining a probability threshold value for classifying the text region and the non-text region according to the cluster center kernels comprises:
calculating a probability threshold value for classifying the text region and the non-text region according to kernel_1, kernel_2 and the highest first probability among the pixel points.
4. The text region determining method according to claim 2, wherein the performing text region extraction processing on the text image to be detected according to the probability threshold value to obtain a text region in the text image to be detected comprises:
combining the first probability of each pixel point in the text image to be detected with the first probabilities of its surrounding pixel points to obtain a second kernel corresponding to each pixel point, wherein the surrounding pixel points of each pixel point comprise only the adjacent pixel points of the pixel point, or comprise only the pixel points in the four directions upper left, upper right, lower left and lower right of the pixel point together with the adjacent pixel points of the pixel point;
performing convolution processing on the first kernel and the second kernel of each pixel point in the text image to be detected to obtain a convolution value;
performing binarization processing on a mask result map according to the probability threshold, wherein the mask result map comprises the convolution value of each pixel point in the text image to be detected;
and performing contour extraction processing on the binarized mask result map, and taking the extracted regions as text regions in the text image to be detected.
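The fragment below sketches this extraction with OpenCV. Treating the "convolution" of the two kernels as a per-pixel dot product over kernel vectors padded to a common length is an interpretation, as is returning bounding boxes of the extracted contours; the claim does not pin down either operator.

import cv2
import numpy as np

def second_kernels(p1):
    # One variant allowed by the claim: the pixel's first probability
    # plus those of its four adjacent pixels.
    shifts = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return np.stack([np.roll(p1, s, axis=(0, 1)) for s in shifts], axis=-1)

def extract_text_regions(k1_maps, k2_maps, threshold):
    # k1_maps, k2_maps: (H, W, K) per-pixel first/second kernel vectors,
    # assumed padded to a common length K
    conv_map = np.einsum('hwk,hwk->hw', k1_maps, k2_maps)
    conv_map = conv_map / max(conv_map.max(), 1e-6)         # normalize to [0, 1]
    binary = (conv_map > threshold).astype(np.uint8) * 255  # binarize the mask result map
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]          # one box per extracted region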
5. The text region determination method according to any one of claims 1 to 4, wherein there are a plurality of text regions; after the text region extraction processing is performed on the text image to be detected according to the probability threshold value to obtain the text regions in the text image to be detected, the method further comprises:
determining a plurality of kinds of data information corresponding to each text region;
performing feature fusion processing on the plurality of kinds of data information to obtain a feature result corresponding to the text region;
and filtering out text regions whose feature results do not reach a feature selection threshold.
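As an illustrative sketch of the filtering, the fragment below fuses two pieces of per-region data into one score and drops low-scoring regions. The weighted-sum fusion, the weights, and the dictionary keys (produced by the claim-6 sketch further below) are assumptions.

import numpy as np

def filter_regions(regions, feature_selection_threshold=0.5):
    kept = []
    for r in regions:  # r: dict as returned by region_data_info() below
        relative_area = (r['width'] * r['height']) / (r['img_w'] * r['img_h'])
        feature_result = 0.7 * float(np.mean(r['text_features'])) + 0.3 * relative_area
        if feature_result >= feature_selection_threshold:
            kept.append(r)
    return kept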
6. The text region determining method according to claim 5, wherein the determining a plurality of kinds of data information corresponding to each text region comprises:
cropping the position-mapped images corresponding to the text region from the text image to be detected and from a pixel mask map respectively, wherein the pixel mask map comprises the first probability of each pixel point in the text image to be detected;
performing text feature extraction processing on the two position-mapped images corresponding to the text region through a text feature extraction model to obtain text features of the text region;
and taking the position information of the text region in the text image to be detected, the size of the text region, the size of the text image to be detected, and the text features of the text region as the data information of the text region.
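A minimal sketch of assembling this data information, assuming NumPy arrays and a callable encoder; the (x, y, w, h) box layout and the dictionary keys are illustrative, not the embodiment's format.

import numpy as np

def region_data_info(image, pixel_mask, box, encoder):
    # image: (H, W, 3) text image to be detected; pixel_mask: (H, W) first probabilities
    x, y, w, h = box
    image_crop = image[y:y + h, x:x + w]       # position-mapped image from the text image
    mask_crop = pixel_mask[y:y + h, x:x + w]   # position-mapped image from the pixel mask map
    text_features = encoder(image_crop, mask_crop)  # text feature extraction model
    return {
        'position': (x, y),
        'width': w, 'height': h,
        'img_h': image.shape[0], 'img_w': image.shape[1],
        'text_features': np.asarray(text_features),
    }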
7. The text region determining method according to claim 6, wherein before the performing text feature extraction processing on the two position-mapped images corresponding to the text region through the text feature extraction model to obtain text features of the text region, the method further comprises:
performing text feature extraction processing on a sample original image through an encoder model to obtain text features of the sample original image;
performing text feature restoration processing on the text features through a decoder model to obtain restored images;
training the encoder model and the decoder model according to the difference between the sample original image and the restored image;
and taking the trained encoder model as a text feature extraction model.
8. The text region determination method according to any one of claims 1 to 4, wherein before the prediction processing of the text image to be detected by the target prediction model, the method further comprises:
acquiring a sample text image, a first sample result of whether each pixel point in the sample text image belongs to a text region, and a second sample result of whether each pixel point in the sample text image and an adjacent pixel point belong to the same type;
carrying out prediction processing on the sample text image through an initial prediction model to obtain a first probability that each pixel point in the sample text image belongs to a text region and a second probability that each pixel point and adjacent pixel points belong to the same type;
and training the initial prediction model according to the difference between the first probability of each pixel point in the sample text image and the first sample result and the difference between the second probability of each pixel point in the sample text image and the second sample result, so as to obtain the target prediction model for performing prediction processing on the text image to be detected.
9. A text region determination system, comprising:
the acquisition unit is used for acquiring the text image to be detected;
the processing unit is configured to perform prediction processing on the text image to be detected through a target prediction model to obtain a first probability that each pixel point in the text image to be detected belongs to a text region and a second probability that each pixel point and its adjacent pixel points belong to the same type, wherein belonging to the same type means that the two compared pixel points both belong to text regions or both belong to non-text regions, and the adjacent pixel points of each pixel point comprise the pixel points in the four directions directly above, directly below, directly to the left of and directly to the right of the pixel point;
the processing unit is further configured to perform probability analysis processing according to the first probability and the second probability of each pixel point in the text image to be detected to obtain a probability threshold value for classifying the text region and the non-text region;
and the processing unit is further configured to perform text region extraction processing on the text image to be detected according to the probability threshold value to obtain the text region in the text image to be detected.
10. An electronic device, comprising:
a central processing unit, a memory and an input/output interface;
the memory is a short-term memory or a persistent memory;
the central processor is configured to communicate with the memory and to execute instruction operations in the memory to perform the method of any of claims 1 to 8.
11. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 8.
CN202211423558.1A 2022-11-04 2022-11-04 Text region determining method, system and related device Active CN115631493B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211423558.1A CN115631493B (en) 2022-11-04 2022-11-04 Text region determining method, system and related device
PCT/CN2022/137159 WO2024092957A1 (en) 2022-11-04 2022-12-07 Text area determination method and system, and related apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211423558.1A CN115631493B (en) 2022-11-04 2022-11-04 Text region determining method, system and related device

Publications (2)

Publication Number Publication Date
CN115631493A CN115631493A (en) 2023-01-20
CN115631493B true CN115631493B (en) 2023-05-09

Family

ID=84909951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211423558.1A Active CN115631493B (en) 2022-11-04 2022-11-04 Text region determining method, system and related device

Country Status (2)

Country Link
CN (1) CN115631493B (en)
WO (1) WO2024092957A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329761A (en) * 2021-01-05 2021-02-05 北京易真学思教育科技有限公司 Text detection method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment
CN108304835B (en) * 2018-01-30 2019-12-06 百度在线网络技术(北京)有限公司 character detection method and device
CN110610166B (en) * 2019-09-18 2022-06-07 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium
CN111652217B (en) * 2020-06-03 2022-05-03 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN111814794B (en) * 2020-09-15 2020-12-04 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112287924B (en) * 2020-12-24 2021-03-16 北京易真学思教育科技有限公司 Text region detection method, text region detection device, electronic equipment and computer storage medium
CN113343987B (en) * 2021-06-30 2023-08-22 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329761A (en) * 2021-01-05 2021-02-05 北京易真学思教育科技有限公司 Text detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2024092957A1 (en) 2024-05-10
CN115631493A (en) 2023-01-20

Similar Documents

Publication Publication Date Title
US11003941B2 (en) Character identification method and device
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
CN111428723B (en) Character recognition method and device, electronic equipment and storage medium
CN110598686B (en) Invoice identification method, system, electronic equipment and medium
CN110503054B (en) Text image processing method and device
CN108108731B (en) Text detection method and device based on synthetic data
CN109343920B (en) Image processing method and device, equipment and storage medium thereof
TW200529093A (en) Face image detection method, face image detection system, and face image detection program
CN111104813A (en) Two-dimensional code image key point detection method and device, electronic equipment and storage medium
CN113490947A (en) Detection model training method and device, detection model using method and storage medium
CN113177435A (en) Test paper analysis method and device, storage medium and electronic equipment
CN112307919A (en) Improved YOLOv 3-based digital information area identification method in document image
CN113436222A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN114005019B (en) Method for identifying flip image and related equipment thereof
Thillou et al. An embedded application for degraded text recognition
EP3660731A1 (en) Digitization of industrial inspection sheets by inferring visual relations
CN115631493B (en) Text region determining method, system and related device
CN116030472A (en) Text coordinate determining method and device
US9378428B2 (en) Incomplete patterns
CN116110066A (en) Information extraction method, device and equipment of bill text and storage medium
CN114494678A (en) Character recognition method and electronic equipment
CN115223173A (en) Object identification method and device, electronic equipment and storage medium
CN113221718B (en) Formula identification method, device, storage medium and electronic equipment
CN115688166A (en) Information desensitization processing method and device, computer equipment and readable storage medium
CN115512340A (en) Intention detection method and device based on picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant