CN117115824A - Visual text detection method based on stroke region segmentation strategy - Google Patents

Visual text detection method based on stroke region segmentation strategy

Info

Publication number: CN117115824A
Authority: CN (China)
Prior art keywords: text, graph, level, stroke, nodes
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202310617471.6A
Other languages: Chinese (zh)
Inventors: 袁春 (Yuan Chun), 李磊 (Li Lei)
Current Assignee: Shenzhen International Graduate School of Tsinghua University
Original Assignee: Shenzhen International Graduate School of Tsinghua University
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202310617471.6A priority Critical patent/CN117115824A/en
Publication of CN117115824A publication Critical patent/CN117115824A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

A visual text detection method based on a stroke region segmentation strategy comprises the following steps: S1, performing feature extraction and multi-level region prediction on an input text image through a front-end processing module based on a convolutional neural network; S2, extracting text-level and stroke-level region candidate boxes from the multi-level prediction results for the text region, and constructing hierarchical local graph structures from them; S3, through a back-end processing module based on a graph neural network, performing node feature aggregation and relational reasoning over the multi-level graph nodes of each local graph, inferring the relations between graph nodes of different levels, performing link prediction, and grouping the nodes to form the overall text instance detection result. Experiments on standard evaluation datasets widely adopted in visual text detection research verify the effectiveness, high accuracy, and good generalization ability of the method.

Description

Visual text detection method based on stroke region segmentation strategy
Technical Field
The invention relates to a visual text detection technology, in particular to a visual text detection method based on a stroke region segmentation strategy.
Background
Visual text detection techniques in realistic complex scenes, which aim to mark a closed region of arbitrary shape for each text instance in an input image, have been widely applied to related tasks in the field of multimedia signal processing, including image text editing, optical character recognition (OCR), and image text translation. With the vigorous development of convolutional neural network (CNN) models, the current mainstream text detectors are mainly extended from object detection or object segmentation frameworks in computer vision, and chiefly comprise regression-based and segmentation-based text detection methods. Regression-based methods are typically built on a generic object detector that locates text boxes by predicting offsets of anchor boxes or pixels. While such strategies are somewhat effective, these methods tend to be accompanied by complex anchor box configuration strategies and elaborate post-processing flows, which limit their ability to represent arbitrarily shaped text and prevent their large-scale application in real-world scenarios. Segmentation-based methods typically combine pixel-level prediction with post-processing steps to extract text instances from the text regions derived from the segmentation prediction. Compared with regression-based methods, segmentation-based methods tend to locate arbitrarily shaped text instances more accurately. However, such methods [4,12] typically require time-consuming post-processing steps and have difficulty efficiently discerning and separating multiple text instances that lie in close proximity to each other.
Recently, the research field has proposed some hybrid text detection methods that combine the core ideas of the two approaches. A hybrid text detection method generally performs pixel-level segmentation prediction to search for potential text regions, and on that basis adopts a bounding box regression strategy to guide the final text detection result. In this research branch, DeepReg predicts an offset from pixels in the text region to guide multi-directional text box regression prediction. Later, some efforts attempted to exploit the representational power of the graph neural network (GNN) to improve text detection performance by modeling and reasoning about text regions. GraphText introduces a deep relational reasoning graph network as a back-end network module in a text detection framework. In addition, Strokut first predicts a multi-level representation of each text region and then performs structural reasoning based on a hierarchical relational graph network model.
However, conventional methods often have difficulty in accurately locating text instances of arbitrary shape, and also in efficiently distinguishing and separating multiple text instances that are close to each other.
It should be noted that the information disclosed in the above background section is only for understanding the background of the application, and thus may include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention aims to overcome the defects of the background technology and provides a high-precision visual text detection method based on a stroke region segmentation strategy.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a visual text detection method based on a stroke region segmentation strategy comprises the following steps:
S1, performing feature extraction and multi-level region prediction on an input text image through a front-end processing module based on a convolutional neural network; the front-end processing module comprises a backbone image feature extraction network, a text region prediction network and a stroke region prediction network, and performs multi-level prediction related to the text region through a series of convolution layers stacked on the feature-pyramid backbone image feature extraction network;
S2, extracting text-level and stroke-level region candidate boxes from the multi-level prediction results for the text region, wherein the image regions represented by the candidate boxes serve as graph nodes to form a plurality of local graph structures, thereby constructing a hierarchical local graph structure;
S3, through a back-end processing module based on a graph neural network, performing node feature aggregation and relational reasoning over the multi-level graph nodes of each local graph, inferring the relations between graph nodes of different levels, performing link prediction, and grouping the nodes according to the link relations between text-level nodes to form the overall text instance detection result.
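The three steps above can be sketched as a minimal pipeline skeleton. All function names and the thresholding stand-ins below are illustrative assumptions, not the patent's actual implementation:

```python
import numpy as np

# Hypothetical skeleton of the three-step pipeline (S1-S3); simple
# thresholding stands in for the CNN front end, and one box per map
# stands in for proper connected-component candidate extraction.

def frontend_predict(image):
    """S1: front end produces multi-level region maps."""
    text_map = (image > 0.5).astype(float)    # coarser text-level prediction
    stroke_map = (image > 0.8).astype(float)  # finer stroke-level prediction
    return text_map, stroke_map

def extract_candidates(region_map):
    """S2: turn positive pixels into candidate boxes that become graph nodes."""
    ys, xs = np.nonzero(region_map)
    if len(xs) == 0:
        return []
    return [tuple(map(int, (xs.min(), ys.min(), xs.max(), ys.max())))]

image = np.zeros((8, 8)); image[2:5, 1:7] = 0.9   # a synthetic "text" blob
text_map, stroke_map = frontend_predict(image)
text_boxes = extract_candidates(text_map)
stroke_boxes = extract_candidates(stroke_map)
# S3 would build local graphs over these nodes and run GNN link prediction.
print(text_boxes, stroke_boxes)
```

In the actual framework, S3 then aggregates node features over the local graphs and groups text-level nodes by predicted links.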
Further, in step S1, the multi-level prediction comprises: classification confidence of the text-level rectangular region corresponding to each text instance; regression prediction of text-level attributes within each text instance, such as the text rotation angle and center line position; and character segmentation prediction at the corresponding stroke level within the prediction bounding box of each text region.
Further, in step S2, the extracting of text-level and stroke-level region candidate boxes from the multi-level prediction results for the text region comprises: extracting the corresponding multi-level candidate rectangular boxes from the multi-level prediction results for the text region, wherein a local graph containing only text-level or only stroke-level nodes is a homogeneous graph, and a local graph containing both text-level and stroke-level nodes is a heterogeneous graph.
Further, in step S1, the text region prediction network predicts attributes related to the text instance region, including: classification probability prediction for the text region TR and the text center region TCR, followed by regression value prediction of h1, h2, cos θ and sin θ, where h1 and h2 denote the distances from the current pixel to the upper and lower edges of TR respectively, the text instance height h is the sum of h1 and h2, and θ indicates the direction information of the text instance; the text center line corresponding to TR is estimated on the basis of the predicted potential TR region; the feature outputs of 2 channels are used to guide the classification probability prediction of TR and TCR; during training, the first feature channel is used to predict the background and the second feature channel is used to predict the foreground, i.e. the text instance; during testing, the foreground prediction result of the second feature channel is taken for subsequent processing; the output of 1 feature channel is used to predict each regression attribute value respectively.
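Decoding the regressed geometry values described here (h1, h2, cos θ, sin θ) is straightforward; a minimal sketch, with an illustrative function name:

```python
import numpy as np

# Recover text instance height and direction from per-pixel regression
# outputs: height = distance to upper edge + distance to lower edge of TR,
# and the angle is decoded from the (cos, sin) pair.

def decode_text_geometry(h1, h2, cos_t, sin_t):
    h = h1 + h2                                 # text instance height h
    norm = np.hypot(cos_t, sin_t)               # renormalize raw network outputs
    theta = np.arctan2(sin_t / norm, cos_t / norm)  # direction of the instance
    return h, theta

h, theta = decode_text_geometry(h1=3.0, h2=5.0, cos_t=1.0, sin_t=0.0)
print(h, theta)  # 8.0 0.0 for a horizontal text instance
```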
Further, in step S1, the stroke region prediction network separates the character content in each text region from the complex background, wherein fine stroke segmentation representations within the text region are generated by combining low-level and high-level image semantic information to guide the subsequent text detection process;
Preferably, the stroke region prediction network comprises a two-stage prediction process;
1) Extracting text-related features from the high-level feature representation of the input image acquired from the backbone network; specifically, the circumscribed rectangle OTR of TR is cut out of the input image, and a global pooling layer combined with successive convolution layers is used to extract the OTR region features acquired from the backbone network; a channel attention feature map of the input image is computed using several pooling layers, a multi-layer perceptron network, and the associated nonlinear activation functions, so as to discern and measure the relative contributions of different network layers in the backbone network to the text region representation; during this process, the extracted input feature map is up-sampled to the same resolution as the input image and then multiplied by the obtained channel attention feature map, so as to realize a semantic information distillation operation on the input image and obtain a semantic representation of the text image;
2) Finely modeling the stroke representation of the text region, enhancing the fine-grained stroke character segmentation representation by introducing orthogonal convolution networks along orthogonal directions; specifically, the 3-channel RGB raw input features of the text region circumscribed rectangle OTR serve as complementary low-level image semantic information and are fused with the obtained semantic representation of the text image; preferably, orthogonal convolution layers with kernel sizes 1×7 and 7×1 are introduced to compute attention coefficients along the spatial directions, and the resulting attention values are multiplied by the fused text feature map.
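The orthogonal attention idea in stage 2 can be sketched as follows. Fixed averaging kernels stand in for the learned 1×7 and 7×1 convolution layers, and the sigmoid combination is an assumption about how the two directional responses are merged:

```python
import numpy as np

# Sketch of orthogonal spatial attention: a 1x7 response aggregates
# horizontal context, a 7x1 response aggregates vertical context, and a
# sigmoid of their sum reweights the fused feature map.

def orthogonal_attention(feat, k=7):
    pad = k // 2
    padded = np.pad(feat, pad, mode="edge")
    H, W = feat.shape
    horiz = np.zeros_like(feat); vert = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            horiz[i, j] = padded[i + pad, j:j + k].mean()  # 1x7 window mean
            vert[i, j] = padded[i:i + k, j + pad].mean()   # 7x1 window mean
    attn = 1.0 / (1.0 + np.exp(-(horiz + vert)))           # sigmoid attention map
    return feat * attn                                     # reweighted features

feat = np.ones((8, 8))
out = orthogonal_attention(feat)
print(out.shape)
```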
Further, the front-end processing module pre-trains the stroke region prediction network using a dataset with stroke level segmentation map annotations as labels, and a mean square error loss function.
Further, in step S3, the node feature and the connection structure thereof are initialized, which specifically includes:
Initializing node features: two complementary feature representations, geometric embedding and content embedding, are adopted to initialize the features of the text-level and stroke-level nodes; for the geometric embedding, the geometric attributes of each predicted region candidate box are encoded into a high-dimensional space; for the content embedding, the content features of each graph node are obtained by feeding the predicted feature maps of the geometry-related attributes of each region candidate box to an RRoI-Align layer; the two obtained feature embeddings are concatenated to form the final graph node feature representation; preferably, when generating a local graph network, the initial feature representations of all nodes are normalized by subtracting the features of the center node;
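A minimal sketch of this initialization, with random embeddings standing in for the real geometry encoding and RRoI-Align content features:

```python
import numpy as np

# Per node: concatenate a geometric embedding with a content embedding,
# then normalize all node features in the local graph by subtracting the
# center node's features. Dimensions and inputs are illustrative.

rng = np.random.default_rng(0)
num_nodes, geo_dim, content_dim = 5, 8, 16
geo_embed = rng.normal(size=(num_nodes, geo_dim))          # geometric embedding
content_embed = rng.normal(size=(num_nodes, content_dim))  # content embedding

node_feats = np.concatenate([geo_embed, content_embed], axis=1)
center = 0                                    # index of the local graph's center node
node_feats = node_feats - node_feats[center]  # normalize against the center node
print(node_feats.shape)
```

After this normalization the center node's feature vector is all zeros, so neighbor features encode offsets relative to it.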
Generating an adjacency matrix: the topology formed by each local graph network is encoded in an adjacency matrix A ∈ R^(N×N), where A(c, n) = 1 if there is a connection between the central node c of the local graph and its neighbor node n; preferably, the method for generating the adjacency matrix specifically includes:
For homogeneous graphs, this covers the construction of the homogeneous stroke-level graph network and the homogeneous text-level graph network. For a homogeneous stroke graph containing only stroke-level graph nodes, a KNN nearest-neighbor algorithm based on Euclidean distance is adopted, and the 8 nearest nodes of each central node are selected as its 1-hop neighbor nodes to form the adjacency matrix A_s; for a homogeneous text graph containing only text-level graph nodes, the adjacency matrix construction differs from the homogeneous stroke graph in that each central node keeps only its 4 nearest direct neighbor nodes, forming the corresponding matrix A_t.
For the heterogeneous text graph network, graph nodes of both the text and stroke levels are included; this type of graph network is constructed according to the Euclidean distances between the center positions of the extracted region candidate boxes; specifically, each text-level region candidate box is regarded as a central node of the heterogeneous text graph, and the connection relations within the 1-hop and 2-hop neighborhoods of the central node are adopted to generate the adjacency matrix A_h of the heterogeneous graph; the 1-hop neighborhood of a central node contains its 4 nearest text-level graph neighbor nodes, while its 2-hop neighborhood contains an additional 4 nearest stroke-level graph neighbor nodes.
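The Euclidean-distance KNN construction of A_s (8 neighbors) and A_t (4 neighbors) can be sketched as below; the toy box centers are illustrative:

```python
import numpy as np

# Build a directed KNN adjacency over candidate-box centers: each center
# node connects to its k nearest nodes by Euclidean distance.

def knn_adjacency(centers, k):
    n = len(centers)
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    A = np.zeros((n, n))
    for c in range(n):
        order = np.argsort(d[c])
        neighbors = [j for j in order if j != c][:k]  # k nearest 1-hop neighbors
        A[c, neighbors] = 1
    return A

centers = np.array([[0, 0], [1, 0], [2, 0], [10, 0], [11, 0], [12, 0]], float)
A_t = knn_adjacency(centers, k=4)  # text-level graph: 4 neighbors per node
A_s = knn_adjacency(centers, k=8)  # stroke-level graph: capped at n-1 here
print(A_t.sum(axis=1))
```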
Further, in step S3, the back-end processing module performs, through a hierarchical neural network inference model, relationship inference and link prediction of nodes in the generated multiple local graph networks; the reasoning process based on the graph neural network comprises the following three stages:
First, the stroke-level node features are aggregated and updated by weighted averaging under the guidance of an attention mechanism; the weight information in the weighting process comes from two parts: the normalized adjacency matrix A_s, and the attention coefficient α_(v,u) between any two graph nodes v and u derived in the graph attention network (GAT); the weighted aggregation process of the first stage can be written as:

s'_k = σ( W · Σ_(u ∈ N(k)) fuse(Â_s(k, u), α_(k,u)) · s_u )

where σ is an activation function, W is a trainable weight parameter, s_k denotes the feature of the stroke-level graph node k, Â_s(k, u) comes from the normalized A_s, fuse(·) denotes a linear feature combination function, and N(k) is the neighbor set of node k;
If the center of a stroke-level graph node falls within the region of a text-level graph node, the updated representation of the stroke-level node is merged into the corresponding text-level node representation;
The second stage fuses the features of the two levels of graph nodes by stacking two Transformer encoder modules; specifically, the introduced Transformer encoder models and infers the hierarchical relations between nodes of the heterogeneous graph by capturing the attention coefficients between stroke-stroke, stroke-text, and text-text node pairs, expressed as:

Attention(Q, K, V) = softmax(Q K^T / √d) V, with Q = H W_Q, K = H W_K, V = H W_V

where H stacks all text (t) node features and stroke (s) node features of the corresponding layers; Attention(·) is the attention computation operation in the Transformer; Q, K and V denote the query, key and value matrices respectively; and W_Q, W_K and W_V are trainable weight parameters.
A graph inference network with an expanded neighborhood range is used: for each text-level graph node, the first layer of the designed graph network aggregates the feature representations of its 1-hop neighbors, which contain only text neighbor nodes, and subsequent layers aggregate the information of its 2-hop neighbors, which contain both text and stroke neighbor nodes; during this process, a dynamic graph convolution is adopted to adaptively adjust the structure of the heterogeneous graph, described as:

P = σ(M_(t,s), A_(t,s)(G(H_(t,s))) W)

where W is a trainable weight matrix, G(·) denotes the conventional information aggregation process on the graph network, and M_(t,s) and A_(t,s) respectively denote the cross-layer masking matrix and the cross-hop attention matrix in the introduced dynamic graph network;
Preferably, the cross-layer masking matrix M_(t,s) is further divided into M'_s, M'_t and M'_(t,s), respectively denoting the self-masking matrix between stroke-level graph nodes, the self-masking matrix between text-level graph nodes, and the mutual masking matrix between stroke-level and text-level graph nodes; the masking result of a stroke-level graph node is ultimately based on comparing a linear combination of M'_s and M'_(t,s) with a fixed threshold, and the masking result of a text-level graph node is ultimately based on comparing a linear combination of M'_t and M'_(t,s) with a fixed threshold;
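The threshold-based masking decision can be sketched as below. Per-node scalar scores stand in for the full masking matrices, and the 0.5/0.5 mix and the 0.4 threshold are assumed values, not taken from the patent:

```python
import numpy as np

# Stroke-level nodes are kept or masked by comparing a linear combination
# of the self-mask score (M'_s) and the mutual-mask score (M'_ts) with a
# fixed threshold; text-level nodes use M'_t and M'_ts analogously.

def mask_stroke_nodes(M_s, M_ts, mix=0.5, threshold=0.4):
    score = mix * M_s + (1 - mix) * M_ts  # linear combination of the two masks
    return score > threshold               # True = node kept in the aggregation

M_s = np.array([0.9, 0.2, 0.6])    # per-node self-mask scores (illustrative)
M_ts = np.array([0.8, 0.1, 0.3])   # per-node mutual-mask scores (illustrative)
keep = mask_stroke_nodes(M_s, M_ts)
print(keep)
```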
After the above three stages are completed, the output of the last graph network layer is used to predict the link relations between text graph nodes and to regress the text instance bounding box values.
Preferably, during training, the cross-entropy loss between the graph model prediction results and the corresponding ground-truth class labels guides the learning process of the whole detection framework.
Preferably, based on the graph node classification and link prediction results, the text-level nodes are grouped by a breadth-first search method and ordered by a minimum-path algorithm.
Preferably, the boundary of an arbitrarily shaped text instance is obtained by sequentially connecting the midpoints of the top and bottom edges of the candidate boxes corresponding to the sorted text nodes.
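The breadth-first grouping of linked text-level nodes into text instances can be sketched as follows; the example link list is illustrative:

```python
from collections import deque

# Group text-level nodes into connected components using the predicted
# link relations: each BFS component becomes one text instance.

def group_by_links(num_nodes, links):
    adj = {i: [] for i in range(num_nodes)}
    for a, b in links:                  # links predicted by the graph network
        adj[a].append(b); adj[b].append(a)
    seen, groups = set(), []
    for start in range(num_nodes):
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:                    # breadth-first search over links
            v = queue.popleft(); comp.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u); queue.append(u)
        groups.append(sorted(comp))
    return groups

groups = group_by_links(6, [(0, 1), (1, 2), (4, 5)])
print(groups)  # [[0, 1, 2], [3], [4, 5]]
```

Each resulting group would then be ordered (e.g. by the minimum-path step described above) before its box midpoints are connected into a boundary.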
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the visual text detection method based on a stroke region segmentation strategy.
The invention has the following beneficial effects:
the invention provides a visual text detection method based on a stroke region segmentation strategy, which can effectively realize high-precision visual text detection. Firstly, by introducing a lightweight stroke segmentation prediction network, only effective supplement of text region prediction can be realized for the current mainstream text detector, so that multi-level (text level and stroke level) representation of a detection model on a text region is realized. During this time, a visual image dataset (SceneText) may be introduced, each text instance in the image sample being labeled with a stroke-level segmentation label, i.e., a binarized stroke character segmentation map. The data set improves the prediction accuracy of the detection framework for the text region multi-level representation through a front-end processing module based on a convolutional neural network in the pre-training detection framework. Meanwhile, by introducing the graph neural network model as a back-end processing module in the constructed text detection frame, feature aggregation and relationship reasoning can be effectively performed on each part of the text region predicted by the front-end processing module, so that the improved graph model can be better suitable for a text detection task scene. The detection method of the invention performs experiments on a standard evaluation data set widely adopted in the field of visual text detection research, and verifies the effectiveness, high precision and good generalization capability of the method of the invention.
Drawings
Fig. 1 is a text detection framework and a processing flow chart of the embodiment of the invention.
FIG. 2 is (a) an original image in an embodiment of the present invention; (b) Text Region (TR) related attribute prediction; (c) an outer rectangle (OTR) corresponding to TR.
Fig. 3 is a schematic diagram of a network structure of a stroke area prediction network according to an embodiment of the present invention.
FIG. 4 illustrates pre-training the proposed stroke region prediction network with the introduced SceneText text image dataset and a mean square error loss function, after which the trained stroke prediction network is used online to predict stroke representations of text images in real scenes.
FIG. 5 is a diagram of a text detection result visualization according to an embodiment of the present invention; in the figure, the first column (a), the second column (b) and the third column (c) represent the input image, the stroke segmentation predicted by the proposed method and the final text detection result, respectively.
FIG. 6 is an example of an embodiment of the present invention guiding OCR translation (Chinese to English and English to French) tasks, including: an input image (a), predicted stroke segmentation (b), text detection results (c), and a translated image (d).
FIG. 7 is an overall process flow for OCR translation tasks using a front-end processing module in an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in detail. It should be emphasized that the following description is merely exemplary in nature and is in no way intended to limit the scope of the invention or its applications.
It will be understood that when an element is referred to as being "mounted" or "disposed" on another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for both a fixing action and a coupling or communication action.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are merely for convenience in describing embodiments of the invention and to simplify the description by referring to the figures, rather than to indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus are not to be construed as limiting the invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present invention, the meaning of "plurality" is two or more, unless explicitly defined otherwise.
In existing detection methods, the prediction accuracy of the convolutional-neural-network-based front-end processing module for text regions is limited, which greatly restricts the ability of the graph-neural-network-based back-end processing module to infer and generate the final text detection result. Furthermore, migrating classical graph neural networks or their variant models directly to the text detection domain without introducing effective optimization strategies makes it difficult for the graph network modules constructed in these methods to seamlessly accommodate text-detection-related tasks. The above drawbacks limit, to some extent, the application potential of graph models in the field of visual text detection.
Referring to fig. 1, an embodiment of the present invention provides a visual text detection method based on a stroke region segmentation strategy, including the following steps:
S1, performing feature extraction and multi-level region prediction on an input text image through a front-end processing module based on a convolutional neural network; the front-end processing module comprises a backbone image feature extraction network, a text region prediction network and a stroke region prediction network, and performs multi-level prediction related to the text region through a series of convolution layers stacked on the feature-pyramid backbone image feature extraction network;
S2, extracting text-level and stroke-level region candidate boxes from the multi-level prediction results for the text region, wherein the image regions represented by the candidate boxes serve as graph nodes to form a plurality of local graph structures, thereby constructing a hierarchical local graph structure;
S3, through a back-end processing module based on a graph neural network, performing node feature aggregation and relational reasoning over the multi-level graph nodes of each local graph, inferring the relations between graph nodes of different levels, performing link prediction, and grouping the nodes according to the link relations between text-level nodes to form the overall text instance detection result.
The invention provides a visual text detection method based on a stroke region segmentation strategy, which can effectively realize high-precision visual text detection. First, merely by introducing a lightweight stroke segmentation prediction network, an effective supplement to text region prediction can be provided for current mainstream text detectors, so that the detection model obtains a multi-level (text-level and stroke-level) representation of the text region. Meanwhile, by introducing a graph neural network model as the back-end processing module of the constructed text detection framework, feature aggregation and relational reasoning can be effectively performed on each part of the text region predicted by the front-end processing module, so that the improved graph model is better suited to the text detection task scenario. The detection method of the invention has been evaluated on standard evaluation datasets widely adopted in visual text detection research, verifying the effectiveness, high accuracy, and good generalization ability of the method.
For this purpose, a visual image dataset (SceneText) may be introduced, in which each text instance in an image sample is labeled with a stroke-level segmentation label, i.e. a binarized stroke character segmentation map. By pre-training the convolutional-neural-network-based front-end processing module of the detection framework on this dataset, the prediction accuracy of the detection framework for the multi-level representation of the text region is improved.
Specific embodiments of the present invention are described further below.
(1) Detection framework overview
The processing flow of the technical framework provided by the invention is shown in figure 1, and the processing flow consists of two main network modules, including a front-end processing module based on a convolutional neural network and a back-end processing module based on a graph neural network. Fig. 1 shows technical details of the text detection framework and a processing flow thereof according to the present invention.
First, the present invention employs a convolutional-neural-network-based text region detector as the front-end processing module, which performs feature extraction and multi-level region prediction on each input text image. Specifically, the adopted front-end processing module applies a series of convolution layers stacked on a feature-pyramid backbone image feature extraction network to perform multi-level prediction related to text regions, including: 1. classification confidence of the text-level rectangular region (segmentation-box level) corresponding to each text instance; 2. regression prediction of text-level attributes within each text instance, such as the text rotation angle and center line position; and 3. character segmentation prediction at the corresponding stroke level within the prediction bounding box of each text region. The present invention uses the introduced SceneText dataset to pre-train the front-end processing module.
Then, multi-level candidate rectangular boxes are extracted according to the multi-level prediction results for the text region. The image region represented by each candidate box is regarded as one graph node, and all generated graph nodes form a number of local graph structures. In particular, a local graph containing only text-level (or only stroke-level) nodes is referred to as a homogeneous text (stroke) graph, while a local graph containing both text-level and stroke-level nodes is referred to as a heterogeneous text graph. On this basis, the invention provides a back-end processing module based on a graph neural network that performs feature aggregation and relational reasoning over the multi-level graph nodes of each local graph. The proposed module infers the likelihood of edge links between text-level graph nodes, and finally groups the nodes according to the link relations between text-level nodes to form the overall text instance detection results. During training, the cross-entropy loss between the graph model predictions and the corresponding ground-truth class labels guides the learning of the whole detection framework.
(2) Front-end processing module
Text region prediction network: because the features extracted by the backbone feature extraction network preserve spatial resolution and contain rich image semantic information, the method applies a series of convolution layers stacked on the backbone network to predict attributes associated with the text instance region. Specifically, this stage predicts classification probabilities for the text region (denoted TR) and the text center region (denoted TCR), followed by regression values for h_1, h_2, cos θ and sin θ. TR denotes the area in which the text instance is located. On the basis of the predicted TR, the method estimates the text centerline corresponding to the TR: the two ends of the centerline are shrunk by 0.5 times the text-instance width at the ends, and the centerline region is then expanded by 0.3 times the text-instance height, yielding the final text center region (TCR). The network uses a 2-channel feature output to guide the classification probability prediction of TR and of TCR; both pass through a Softmax normalization layer, a threshold decision operation and similar post-processing to obtain the final prediction results. Specifically, during training the first feature channel predicts the background and the second channel predicts the foreground (i.e., the text instance); during testing, the invention directly takes the foreground prediction of the second channel for subsequent processing. In addition, h_1 and h_2 denote the distances from the current pixel to the upper and lower edges of TR, respectively; the text-instance height h mentioned above is the sum of h_1 and h_2; and θ indicates the orientation of the text instance.
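The end-shrinking step of the centerline estimation above can be illustrated with a small numerical sketch. The discrete sampling of the centerline into points and the per-point width array are assumptions made for illustration only:

```python
import numpy as np

def shrink_centerline(points, widths, end_ratio=0.5):
    """Given ordered centerline sample points (N, 2) and per-point text
    widths, drop samples within end_ratio * local width of either end of
    the polyline, mimicking the 0.5-width end shrinkage described above."""
    points = np.asarray(points, dtype=np.float64)
    widths = np.asarray(widths, dtype=np.float64)
    # cumulative arc length along the polyline
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    total = cum[-1]
    keep = (cum >= end_ratio * widths[0]) & (cum <= total - end_ratio * widths[-1])
    return points[keep]
```

The subsequent 0.3-height expansion would then dilate the kept centerline samples into the TCR band; that step depends on per-point height and orientation and is omitted here.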
The invention predicts each regression attribute value using the output of one feature channel. The relevant attributes are illustrated in Fig. 2, where (a) is the original image; (b) shows the Text Region (TR) related attribute prediction; and (c) shows the circumscribed (outer) rectangle (OTR) corresponding to TR.
Stroke area prediction network: on the basis of realizing text region prediction, the invention further explores text stroke representations with finer granularity, thereby realizing the separation of character content in each text region from complex background. The prediction process can be divided into two phases as shown in fig. 3.
Fig. 3 shows a network structure of the stroke area prediction network. Specifically, the present invention combines image low-level semantic and high-level semantic information to generate a fine stroke segmentation representation in a text region and thereby guide a subsequent text detection process.
First, the present invention extracts text-related features from the high-level feature representation of the input image acquired from the backbone network. Specifically, the method crops the circumscribed rectangle of TR (denoted OTR) from the input image and extracts the OTR region features acquired from the backbone network using a global pooling layer combined with successive convolution layers. The invention utilizes several pooling and multi-layer perceptron networks with associated nonlinear activation functions to compute a channel attention feature map of the input image, in order to discern and weigh the relative contributions of different backbone layers to the text region representation. In this process, the extracted input feature map is upsampled to the same resolution as the input image and then multiplied by the channel attention map obtained above, realizing a semantic-information distillation operation on the input image. This process enables the method to obtain rich text image semantic representations, including information such as text color, texture and edges.
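The pooling + MLP channel gating above resembles squeeze-and-excitation attention. A minimal functional sketch, assuming a two-layer MLP with explicit weight matrices (the exact MLP depth and reduction are not specified in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, W1, W2):
    """SE-style gating sketch of the pooling + MLP channel attention:
    global average pooling, a two-layer MLP (weights W1, W2), then
    per-channel sigmoid reweighting. feat has shape (C, H, W)."""
    squeezed = feat.mean(axis=(1, 2))            # global average pool -> (C,)
    hidden = np.maximum(W1 @ squeezed, 0.0)      # ReLU
    gate = sigmoid(W2 @ hidden)                  # per-channel weights in (0, 1)
    return feat * gate[:, None, None]            # reweight each channel
```

The resulting per-channel gate plays the role of the "relative contribution" measure described above; multiplying it into the upsampled features realizes the distillation step.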
Second, the present invention introduces a stroke enhancement operator to finely model the stroke representation of the text region. Considering that the stroke content in each text instance can be understood as a closed area surrounded by a series of edges, and inspired by existing edge-detection methods, the proposed method effectively enhances the fine-grained stroke character segmentation representation by introducing orthogonal convolution networks operating along orthogonal directions. Specifically, the method takes the 3-channel RGB input features of the text region circumscribed rectangle (OTR) as complementary low-level image semantic information and fuses them with the semantically rich text representation obtained previously. The method then introduces orthogonal convolution layers with kernel sizes of 1×7 and 7×1 to compute attention coefficients along the spatial directions, and multiplies the obtained attention values with the fused text feature map to suppress interference from noise and other irrelevant background information. This process strengthens the stroke-level representation of the text region, enabling the model to generate stroke content that is finer grained and preserves complex texture details.
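The orthogonal 1×7 / 7×1 attention can be sketched with plain box filters standing in for the learned kernels (an assumption for illustration; the invention's kernels are trained):

```python
import numpy as np

def orthogonal_attention(feat, k=7):
    """Sketch of the 1x7 / 7x1 orthogonal spatial attention: horizontal and
    vertical k-tap box filters produce a direction-aware gate that scales
    the fused features. feat is a single-channel (H, W) map."""
    pad = k // 2
    p = np.pad(feat, pad, mode="edge")
    H, W = feat.shape
    # 1 x k horizontal response and k x 1 vertical response
    horiz = np.stack([p[pad:pad + H, j:j + W] for j in range(k)]).mean(axis=0)
    vert = np.stack([p[i:i + H, pad:pad + W] for i in range(k)]).mean(axis=0)
    gate = 1.0 / (1.0 + np.exp(-(horiz + vert)))   # sigmoid spatial attention
    return feat * gate
```

Summing the two orthogonal responses before the sigmoid mirrors the idea of capturing stroke edges from both directions before gating.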
(3) Imported SceneText data set
The embodiment of the invention also introduces a novel text dataset annotated with stroke-level segmentation maps as labels, namely SceneText, which is used for pre-training the convolutional-neural-network-based front-end processing module (shown in Fig. 4). The SceneText dataset contains about 200K text images and covers multiple languages, including English, Chinese and Japanese.
Fig. 4 shows the method of the invention using an introduced SceneText text image dataset and a mean square error loss function to pre-train the proposed stroke area prediction network, after which the trained stroke prediction network is used for online prediction of the stroke representation of the text image in the real scene.
(4) Back-end processing module
The method constructs a plurality of local graph structure networks by extracting multi-level (text level, stroke level) region candidate boxes in each predicted text region and taking each candidate box as a graph node. For the text-level nodes and stroke-level nodes obtained based on the above method, node features and their connection structures are then initialized as follows.
Initializing node features: the method adopts two complementary feature representations, geometric embedding and content embedding, to initialize the features of text-level and stroke-level nodes. For geometric embedding, the method applies trigonometric functions to encode the predicted geometric attributes (including the center point coordinates, width, height and rotation angle of each region candidate box) into a high-dimensional space. For content embedding, the method obtains the content features of each graph node by feeding the feature maps predicted for the geometric attributes of each region candidate box into an RRoI-Align layer. The network then concatenates the two obtained embeddings to form the final feature representation of the graph node. Notably, when generating a local graph network, the method normalizes the initial feature representations of all nodes by subtracting the features of the center node.
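The trigonometric geometric embedding can be sketched with a sinusoidal encoding of the box attributes. The attribute ordering, the per-attribute dimension and the frequency base are assumptions; the text only specifies that trigonometric functions map the attributes into a high-dimensional space:

```python
import numpy as np

def geometric_embedding(box, dim=32, base=1000.0):
    """Sinusoidal encoding of a candidate box's geometric attributes
    (cx, cy, w, h, theta): each scalar is expanded into dim values using
    sin/cos at geometrically spaced frequencies, then flattened."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)        # (dim/2,)
    vals = np.asarray(box, dtype=np.float64)[:, None]    # (5, 1)
    ang = vals * freqs[None, :]                          # (5, dim/2)
    emb = np.concatenate([np.sin(ang), np.cos(ang)], axis=1)  # (5, dim)
    return emb.reshape(-1)                               # (5 * dim,)
```

The resulting vector would be concatenated with the RRoI-Align content embedding to form the initial node feature.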
Generating an adjacency matrix: the topology of each local graph network is encoded in an adjacency matrix A ∈ R^{N×N}, in which A(c, n) = 1 if an edge links the center node c of the local graph to its neighboring node n. Subsequently, the invention explores two methods of generating adjacency matrices and applies different information aggregation functions on the constructed adjacency structures to verify their impact on text detection performance.
The first category is isomorphic graphs, including the construction of isomorphic stroke-level graph networks and text-level graph networks. For isomorphic stroke graphs containing only stroke-level graph nodes, the invention adopts a KNN nearest-neighbor algorithm based on Euclidean distance, selecting the 8 nearest nodes of each center node as its direct (1-hop) neighbors, forming the adjacency matrix A_s. For isomorphic text graphs containing only text-level graph nodes, the invention adopts the same adjacency-matrix construction as above, except that each center node in the isomorphic text graph keeps only its 4 nearest direct neighbor nodes, forming the corresponding matrix A_t.
The second type addresses heterogeneous text graph networks that contain graph nodes at both the text and stroke levels. The invention constructs the graph network according to the Euclidean distance between the center positions of the extracted region candidate boxes. Specifically, each text-level region candidate box is considered a center node of the heterogeneous text graph, and the connection relationships within the 1-hop and 2-hop neighborhoods of the center node are employed to generate the heterogeneous-graph adjacency matrix A_h. In the arrangement of the present invention, the 1-hop neighborhood of a center node contains its 4 nearest text-level neighbor nodes, while its 2-hop neighborhood contains an additional 4 nearest stroke-level neighbor nodes. The advantage of this arrangement is that the limited number of neighbors facilitates efficient relational reasoning and effective feature learning, while the introduced higher-order (2-hop) neighbor nodes provide auxiliary information beyond the local structure of the graph.
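The KNN-based adjacency construction shared by both graph types can be sketched as follows (a brute-force illustration; any KNN index would do in practice):

```python
import numpy as np

def knn_adjacency(centers, k):
    """Build an N x N adjacency matrix linking each node to its k nearest
    neighbours by Euclidean distance between candidate-box centres."""
    centers = np.asarray(centers, dtype=np.float64)
    n = len(centers)
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-loops from the k choice
    A = np.zeros((n, n))
    for c in range(n):
        for nb in np.argsort(d[c])[:min(k, n - 1)]:
            A[c, nb] = 1.0
    return A
```

With k = 8 over stroke-level nodes this yields A_s for the isomorphic stroke graph, and with k = 4 over text-level nodes it yields A_t; the heterogeneous A_h additionally links each text center node to its nearest stroke-level nodes as 2-hop neighbors.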
Graph reasoning: based on the processing, the method simultaneously considers the text-level and stroke-level graph nodes, and further introduces a hierarchical graph neural network reasoning model, thereby realizing the relationship reasoning and the link prediction of the execution nodes in the generated multiple local graph networks. The graph neural network based reasoning process can be subdivided into three phases.
First, the inventive method uses a weighted-average approach, guided by an attention mechanism, to aggregate and update the stroke-level node features. The weights in the weighting process come from two parts: the normalized adjacency matrix Ã_s, and the attention coefficient α_{v,u} between any two graph nodes v and u derived from a graph attention network (GAT). In this way, the weighted aggregation process of the first stage can be described as:
s'_k = σ( Σ_{u ∈ N(k)} ω_{k,u} · W · s_u )

where σ is an activation function, W is a trainable weight parameter, and s_k denotes the features of stroke-level graph node k. In addition,

ω_{k,u} = fuse( Ã_s(k,u), α_{k,u} )

where Ã_s is obtained from A_s by normalization, and fuse(·) denotes a linear combination function over the features.
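The first-stage aggregation can be sketched numerically. The equal-weight form of fuse(·) and the ReLU activation are assumptions; the trainable projection W is folded away for clarity:

```python
import numpy as np

def aggregate_stroke_features(S, A_s, alpha, fuse=lambda a, b: 0.5 * a + 0.5 * b):
    """First-stage aggregation sketch: stroke node features S (N, d) are
    updated as a weighted average of neighbours, with weights fusing the
    row-normalized adjacency and GAT attention coefficients alpha (N, N)."""
    A = A_s / np.maximum(A_s.sum(axis=1, keepdims=True), 1e-8)  # normalize rows
    Wgt = fuse(A, alpha * A_s)      # restrict attention to existing edges
    H = Wgt @ S                     # weighted neighbour average
    return np.maximum(H, 0.0)       # ReLU as the activation sigma
```

Multiplying alpha elementwise by A_s keeps the attention confined to the KNN edges of the local graph before fusing it with the normalized adjacency.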
Thereafter, the updated representation of a stroke-level node is merged into the representation of the corresponding text-level node if the center of the former falls within the region of the latter. Considering that different stroke-level graph nodes carry information from different parts of the text region and contribute differently to the corresponding text-level graph node, the feature update strategy described above provides more discriminative and expressive stroke representations for the subsequent fusion of the two node levels. The second stage therefore fuses the features of the two-level graph nodes by stacking two Transformer encoder modules. Specifically, the introduced Transformer encoder achieves efficient modeling and reasoning of the hierarchical structural relationships among the heterogeneous graph nodes by capturing the attention coefficients between stroke-stroke, stroke-text and text-text node pairs. This process can be expressed as:
wherein,representing all text (t) node features and stroke(s) node features of the t-layer. Attention (& gt) is the Attention computation operation in the transducer, Q, K, V represents the query matrix, key matrix and value matrix, W, respectively Q(,K,V) Is a trainable weight parameter.
Based on the two-stage process above, the invention further provides a graph inference network with a larger neighborhood range. Specifically, for each text-level graph node, the first layer of the designed graph network aggregates the feature representations of its 1-hop neighbors (text neighbor nodes only), and the subsequent layer aggregates the information of its 2-hop neighbors (both text and stroke neighbor nodes). In this process, the method adopts a dynamic graph convolution structure that adapts itself to the heterogeneous graph, which can be described as:
P = σ( M_{t,s} ⊙ A_{t,s} · G(H_{t,s}) · W )
where W is a trainable weight matrix and G(·) denotes the conventional information aggregation process on the graph network, i.e., the feature update of each central node is a linear combination of its neighboring nodes' features. In addition, M_{t,s} and A_{t,s} denote the cross-layer mask matrix and the cross-hop attention matrix, respectively, of the introduced dynamic graph network. The former restricts the aggregation process to a dynamic sub-part of the whole graph, effectively eliminating irrelevant noise nodes during information aggregation and stabilizing training; the latter performs feature aggregation updates more effectively by recomputing the importance of each node within the receptive field captured at each layer.
Specifically, the cross-layer masking matrix, by excluding a subset of nodes from the information aggregation of the current layer, reduces interference from noisy nodes and makes the relational reasoning over the graph more discriminative; the cross-hop attention matrix, by adaptively adjusting the connection weights among the retained (unmasked) nodes, makes the feature update within each local neighborhood on the graph more expressive. Notably, the cross-layer masking matrix M_{t,s} in the proposed method can be further divided into M'_s, M'_t and M'_{t,s}, which denote the self-masking matrix among stroke-level graph nodes, the self-masking matrix among text-level graph nodes, and the mutual masking matrix between stroke-level and text-level graph nodes, respectively. The masking result for the stroke-level graph nodes is therefore ultimately obtained by comparing a linear combination of M'_s and M'_{t,s} against a fixed threshold, while the masking result for the text-level graph nodes is obtained by comparing a linear combination of M'_t and M'_{t,s} against a fixed threshold.
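One layer of the masked dynamic graph convolution can be sketched as below. Interpreting the mask as acting elementwise on the attention matrix, mean aggregation for G(·), and ReLU for σ are assumptions drawn from the description:

```python
import numpy as np

def masked_graph_layer(H, A, M, Att, W):
    """Sketch of P = sigma(M ⊙ Att · G(H) · W): G is mean aggregation over
    the adjacency A, the cross-hop attention Att reweights nodes, and the
    cross-layer mask M zeroes out nodes excluded from this layer."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    G = (A @ H) / deg                        # G(.): neighbour averaging
    P = np.maximum(((M * Att) @ G) @ W, 0.0) # mask ⊙ attention, linear, ReLU
    return P
```

Setting a row of M to zero removes that node from the current layer's aggregation entirely, which is the stabilizing effect attributed to the cross-layer mask.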
On the basis of completing the three stages, the invention uses the output of the last graph network layer for the prediction of the link relation between text graph nodes and the positioning of the regression value of the text instance bounding box.
(5) Overview of detection framework reasoning process
In the reasoning process of the constructed text detection framework, the multi-level prediction results for the text region are first obtained from the front-end processing module, and region candidate boxes of the two levels are extracted by setting corresponding thresholds, thereby constructing hierarchical (text-level and stroke-level) local graph structure networks. On this basis, the back-end processing module infers the relations among graph nodes of different levels and performs link prediction by executing hierarchical node feature aggregation and relational reasoning. According to the graph-node classification and link prediction results, the text-level nodes are grouped by breadth-first search and ordered by a minimum-path algorithm. Finally, the midpoints of the top and bottom edges of the candidate boxes corresponding to the sorted text nodes are connected in sequence to obtain the boundary of a text instance of arbitrary shape.
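The breadth-first grouping step can be sketched as connected-component search over the predicted links (node indexing and the link-pair format are illustrative assumptions):

```python
from collections import deque

def group_text_nodes(n, links):
    """Breadth-first grouping of n text-level nodes from predicted link
    pairs: each connected component becomes one text-instance candidate."""
    adj = [[] for _ in range(n)]
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    seen, groups = [False] * n, []
    for s in range(n):
        if seen[s]:
            continue
        comp, q = [], deque([s])
        seen[s] = True
        while q:
            v = q.popleft()
            comp.append(v)
            for u in adj[v]:
                if not seen[u]:
                    seen[u] = True
                    q.append(u)
        groups.append(sorted(comp))
    return groups
```

Each returned group would then be ordered by the minimum-path algorithm before connecting the top/bottom midpoints into the instance boundary.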
Experimental test and effect thereof
(1) Evaluating data sets and implementation details
The proposed method is experimentally verified on mainstream evaluation datasets in the text detection field, including CTW-1500, Total-Text and ICDAR 2015. Experiments were performed on a server configured with 4 NVIDIA GeForce GTX 1080Ti GPUs and implemented on the PyTorch deep learning framework. The backbone feature extraction network of the proposed method is pre-trained on the ImageNet dataset. In the experiments, a data augmentation strategy is adopted: the input image is resized to 640×640 and each image is randomly flipped with a probability of 0.5. The batch size during training is set to 64, i.e., 16 image samples are processed per GPU. All evaluation experiments were performed at a single image resolution.
(2) Experimental results
As shown in Table 1, the evaluation results of the proposed method on all text detection datasets are clearly superior to other existing methods. The text detection speed (FPS) of each method is also listed in the table to demonstrate that the proposed method strikes a good balance between accuracy and speed. First, to evaluate the performance of the proposed method in detecting text instances that are close to each other or of arbitrary shape, the method was evaluated and compared with existing mainstream text detection models on two datasets containing numerous curved text instances (Total-Text and CTW-1500). As Table 1 shows, the performance of the proposed method clearly surpasses models designed for curved text detection, such as TextSnake and CRAFT. In particular, the proposed text detection method achieves an Hmean score of 87.5% on CTW-1500 and 89.1% on Total-Text, both significantly exceeding the other compared methods. Owing to the introduced hierarchical graph reasoning network module, the method attains consistent, superior performance in detecting text instances that are close to each other and of arbitrary shape, and is especially robust for text instances of different curvatures.
Second, the proposed method is evaluated on the ICDAR 2015 dataset to verify its ability to detect tiny and low-resolution text instances. As shown in Table 1, the method achieves 89.7%, 91.7% and 90.7% in Recall, Precision and Hmean, respectively, significantly better than the related comparison models (including ReLaText, DRRG and StrokeNet). These results show that the multi-level region prediction network introduced by the method is good at capturing tiny and low-resolution stroke representations, which plays an important role in helping the detector model finer-grained text region representations effectively.
Table 1 shows the results of the evaluation of the proposed method on CTW-1500, total-Text and ICDAR 2015 data sets, and compares the performance with the existing mainstream methods in the field. Wherein the top two performance values are highlighted in bold. Furthermore, (ST) represents the use of the introduced SceneText stroke segmentation dataset to pre-train the convolutional neural network based front-end processing module in the detection framework. R: recall (%), P: precision (%), H: hmean (%).
TABLE 1
(3) Visualization of
The visualization of the text detection results in fig. 5 may also prove the effectiveness, high accuracy and good generalization ability of the proposed method in visual image text detection.
Fig. 5 shows a visual text detection result, in which a first column (a), a second column (b) and a third column (c) represent the input image, stroke segmentation predicted by the proposed method and the final text detection result, respectively.
Application example:
(1) Instance of a Stroke segmentation dataset
The invention first pre-trains the convolutional-neural-network-based front-end processing modules of existing mainstream text detectors using the SceneText dataset, to verify the effectiveness of the introduced stroke segmentation dataset and its contribution to improving the performance of other detectors in the text detection field. The results in Table 2 show that the introduced external dataset effectively improves the text-region prediction accuracy of the front-end processing module in existing mainstream text detectors, so that the potential of their graph-model-based back-end processing modules can be better exploited, ultimately improving the detection performance of related mainstream methods in the text detection field.
Table 2 is a quantized evaluation of the stroke segmentation dataset introduced by the present invention for performance improvement of a related detector in the text detection field. Where (ST) represents pre-training a convolutional neural network based front-end processing module in the detection framework using the introduced SceneText stroke segmentation dataset. R: recall (%), P: precision (%), H: hmean (%).
TABLE 2
(2) Embodiments of the inventive method in OCR translation applications
Based on the summary of the invention, an OCR translation tool was developed on top of the proposed text detection method, as shown in Fig. 6. The figure details the processing flow of the proposed text detection method in Chinese-to-English and English-to-French translation examples. The sub-figures represent the input image (a), the predicted stroke segmentation (b), the text detection result (c) and the translated image (d), respectively. Fig. 6 shows examples of OCR translation tasks (Chinese to English and English to French).
Based on the above, related applications based on OCR translation can be used as the downstream tasks of the text detection model provided by the invention. FIG. 7 illustrates the overall process flow of the proposed method on an OCR translation application, where an example of Chinese to English OCR translation is given. FIG. 7 shows the overall process flow for OCR translation tasks using the front-end processing module in the proposed method.
In this application scenario, the front-end processing module of the method is invoked first, outputting stroke-level and text-level detection results, which are fed into a text recognition module and a text erasure module, respectively. For the image erasure module, a built-in OpenCV erasure algorithm is simply used here. Before applying the erase operation, the invention uses a dilation operation from the built-in OpenCV functions to expand the predicted stroke segmentation area. Furthermore, the text color can be easily estimated by averaging the pixel values in the predicted stroke area. In practice, such a strategy improves the visual effect of image erasure (inpainting). The machine translation model runs by calling the Google Translate APIs, which include a language-identification API.
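The dilate-then-average color estimation above can be sketched without OpenCV, using a pure-numpy 4-neighbour dilation in place of cv2.dilate (the kernel shape and iteration count are illustrative assumptions):

```python
import numpy as np

def estimate_text_color(image, stroke_mask, dilate_iters=1):
    """Average the pixel values under the (dilated) stroke mask to estimate
    the text colour, mirroring the dilation-then-average step described
    above. image: (H, W, 3); stroke_mask: (H, W) boolean."""
    m = stroke_mask.astype(bool)
    for _ in range(dilate_iters):
        grown = m.copy()
        grown[1:, :] |= m[:-1, :]      # grow downward
        grown[:-1, :] |= m[1:, :]      # grow upward
        grown[:, 1:] |= m[:, :-1]      # grow rightward
        grown[:, :-1] |= m[:, 1:]      # grow leftward
        m = grown
    return image[m].mean(axis=0) if m.any() else None
```

In the actual pipeline the dilated mask would also be passed to the OpenCV erasure (inpainting) call, so that stroke borders are fully covered before repainting.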
The outstanding features of the embodiments of the invention include: a lightweight stroke segmentation prediction network is proposed, a novel graph neural network reasoning model is introduced, and remarkable results are obtained. The stroke segmentation prediction network gives the detection model a multi-level (text-level, stroke-level) representation of the text region. The graph neural network model, serving as the back-end processing module of the constructed text detection framework, effectively performs feature aggregation and relational reasoning over the parts of the text region predicted by the front-end processing module, so that the improved graph model is better adapted to the text detection task.
The method effectively addresses the difficulties of traditional methods in accurately locating image text instances of arbitrary shape, in distinguishing and separating multiple text instances that lie close to each other within the same image, and in excessive algorithm runtime. The effectiveness, high precision and good generalization capability of the visual text detection method have been fully verified.
In some embodiments, a visual text image dataset (SceneText) is introduced to pre-train the stroke region prediction network, facilitating improved prediction accuracy of the detection framework for multi-level representations of text regions. In particular, each instance of text in its image sample is labeled with a stroke-level segmentation label, i.e., a binarized stroke character segmentation map. The data set improves the prediction accuracy of the detection framework for the text region multi-level representation through a front-end processing module based on a convolutional neural network in the pre-training detection framework.
The text detection method provided by the invention can be used for front-end processing (text segmentation and detection) of a text image restoration system and front-end processing (text segmentation and detection) of an OCR image translation system.
The invention can execute stroke-level segmentation prediction on the text region, and the segmentation prediction result can meet the requirements of more business functions, such as image restoration, OCR translation and the like by combining with the text detection result of the traditional method. The method can be used for the tasks of character detection and recognition in visual text images such as OCR (text content recognition), image restoration, image translation, electronic receipts, invoices and the like, intelligent full automation of text image processing is realized, and information understanding and processing efficiency in the related fields of text image processing are improved.
The method has wide application prospect, for example, the efficiency and the reliability of OCR technology can be effectively improved by improving the text content recognition and text instance detection precision, so that the labor and resource cost of related business in enterprises is greatly reduced.
The embodiment of the invention also provides a storage medium for storing a computer program which, when executed, performs at least the visual text detection method based on the stroke region segmentation strategy as described above.
The embodiment of the invention also provides a control device, which comprises a processor and a storage medium for storing a computer program; wherein the processor is configured to execute at least the visual text detection method based on the stroke area segmentation strategy as described above when executing the computer program.
The embodiments of the present invention also provide a processor executing a computer program, at least performing the method as described above.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in embodiments of the present invention are intended to comprise, without being limited to, these and any other suitable types of memory.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only one logical functional division, and other divisions are possible in practice, such as combining multiple units or components, integrating them into another system, or omitting or not performing some features. In addition, the couplings, direct couplings or communication connections between the components shown or discussed may be realized through some interfaces; indirect couplings or communication connections between devices or units may be electrical, mechanical or in other forms.
The units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be carried out by hardware associated with program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes any medium that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Alternatively, if the above-described integrated units of the present invention are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The methods disclosed in the method embodiments provided by the invention can be combined arbitrarily, provided there is no conflict, to obtain new method embodiments.
The features disclosed in the several product embodiments provided by the invention can be combined arbitrarily, provided there is no conflict, to obtain new product embodiments.
The features disclosed in the method or device embodiments provided by the invention can be combined arbitrarily, provided there is no conflict, to obtain new method or device embodiments.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments; it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several equivalent substitutions and obvious modifications can be made without departing from the spirit of the invention, and these should be considered to fall within the scope of the invention.

Claims (10)

1. A visual text detection method based on a stroke region segmentation strategy is characterized by comprising the following steps:
S1, performing feature extraction and multi-level region prediction on an input text image through a front-end processing module based on a convolutional neural network; the front-end processing module comprises a backbone image feature extraction network, a text region prediction network, and a stroke region prediction network, and performs multi-level predictions related to the text region through a series of convolution layers stacked on the feature-pyramid backbone image feature extraction network;
S2, extracting text-level and stroke-level region candidate boxes according to the multi-level prediction results for the text region, wherein the image regions represented by the candidate boxes serve as graph nodes to form a plurality of local graph structures, from which a hierarchical local graph structure is constructed;
S3, performing, on each local graph through a back-end processing module based on a graph neural network, node feature aggregation and relation reasoning over the multi-level graph nodes, inferring the relations among graph nodes of different levels, conducting link prediction, and grouping the nodes according to the link relations among the text-level nodes, so as to form the detection result for whole text instances.
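Read as an algorithm, steps S1–S3 form a candidate-extraction, graph-construction, and link-grouping pipeline. The sketch below illustrates only that control flow, with loudly hypothetical stand-ins: random boxes replace the CNN predictions of S1, and every pivot-neighbor pair is treated as a positive link in place of the learned GNN of S3.

```python
import numpy as np

def extract_candidates(image):
    # Hypothetical stand-in for the CNN front end (S1): returns
    # text-level candidate boxes as (x, y, w, h) rows.
    rng = np.random.default_rng(0)
    return rng.uniform(0, 100, size=(5, 4))

def build_local_graphs(text_boxes, k=4):
    # S2: each text box becomes a pivot (center) node; its k nearest
    # text boxes (by center distance) form one local graph.
    centers = text_boxes[:, :2] + text_boxes[:, 2:] / 2
    graphs = []
    for i, c in enumerate(centers):
        d = np.linalg.norm(centers - c, axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # exclude the pivot itself
        graphs.append((i, nbrs.tolist()))
    return graphs

def link_and_group(graphs, n_nodes):
    # S3 (placeholder): treat every pivot-neighbor pair as a positive
    # link, then group nodes into text instances via union-find.
    parent = list(range(n_nodes))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for c, nbrs in graphs:
        for n in nbrs:
            parent[find(n)] = find(c)
    groups = {}
    for v in range(n_nodes):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

text_boxes = extract_candidates(None)
graphs = build_local_graphs(text_boxes)
instances = link_and_group(graphs, len(text_boxes))
print(len(instances))  # all 5 boxes link to each other -> 1 instance
```

With only 5 boxes and k=4, every node links to all others, so a single instance group emerges; real inputs would produce one group per text line.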
2. The method of visual text detection based on a stroke region segmentation strategy as recited in claim 1, wherein in step S1 the multi-level prediction comprises: obtaining the classification confidence of the text-level rectangular region corresponding to each text instance; regression prediction of text-level attributes, such as the text rotation angle and the center line position, in each text instance; and the corresponding stroke-level character segmentation prediction within the predicted bounding box of each text region.
3. The method of visual text detection based on a stroke region segmentation strategy as claimed in claim 1 or 2, wherein in step S2 the extracting of the text-level and stroke-level region candidate boxes according to the multi-level prediction results for the text region comprises: extracting the corresponding multi-level candidate rectangular boxes according to the multi-level prediction results for the text region, wherein a local graph containing only text-level or only stroke-level nodes is a homogeneous graph, and a local graph containing both text-level and stroke-level nodes is a heterogeneous graph.
4. The method of visual text detection based on a stroke region segmentation strategy as recited in any one of claims 1 to 3, wherein in step S1 the text region prediction network predicts attributes associated with text instance regions, comprising: classification probability prediction of the text region TR and the text center region TCR, and regression value prediction of h1, h2, cos θ and sin θ, where h1 and h2 represent the distances from the current pixel to the upper and lower edges of TR respectively, the text instance height h is the sum of h1 and h2, and θ indicates the direction of the text instance; estimating the text center line corresponding to TR on the basis of the predicted potential TR area; using the feature outputs of 2 channels to guide the classification probability predictions of TR and TCR, wherein during training the first feature channel predicts the background and the second feature channel predicts the foreground, i.e. the text instance, and during testing the foreground prediction of the second feature channel is taken for subsequent processing; and using the output of 1 feature channel each to predict the respective regression attribute values.
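The geometric regression in this claim can be made concrete: at a pixel on the center line, h1 and h2 extend toward the upper and lower text edges along the normal of the direction (cos θ, sin θ), and the local height is their sum. The helper below is an illustrative decoding sketch (the function name and the convention that the normal points toward the upper edge are assumptions, not the patent's specification):

```python
import numpy as np

def local_box_from_regression(px, py, h1, h2, cos_t, sin_t):
    # For a pixel (px, py) inside TCR, h1/h2 are the regressed distances
    # to the upper/lower edges of TR and (cos_t, sin_t) encode the text
    # angle theta. The local text height is simply h1 + h2.
    n = np.hypot(cos_t, sin_t)                 # renormalize predictions
    cos_t, sin_t = cos_t / n, sin_t / n
    # Unit normal to the text direction (assumed to point to the upper edge).
    nx, ny = -sin_t, cos_t
    top = (px + nx * h1, py + ny * h1)
    bottom = (px - nx * h2, py - ny * h2)
    return top, bottom, h1 + h2

# Horizontal text (theta = 0): edges sit straight above and below the pixel.
top, bottom, h = local_box_from_regression(10.0, 20.0, 3.0, 5.0, 1.0, 0.0)
print(top, bottom, h)
```

For θ = 0 the upper-edge point is 3 pixels above and the lower-edge point 5 pixels below the query pixel, giving a local height of 8.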
5. The method of visual text detection based on a stroke region segmentation strategy as recited in any one of claims 1 to 4, wherein in step S1 the stroke region prediction network separates the character content in each text region from the complex background, generating a fine stroke segmentation representation of the text region by combining low-level and high-level image semantic information to guide the subsequent text detection process;
Preferably, the stroke area prediction network comprises a two-stage prediction process;
1) Extracting text-related features from the high-level feature representation of the input image acquired from the backbone network; specifically, the circumscribed rectangle OTR of TR is cropped from the input image, and the OTR region features acquired from the backbone network are extracted using a global pooling layer combined with successive convolution layers; a channel attention feature map of the input image is calculated using a plurality of pooling layers, a multi-layer perceptron network, and associated nonlinear activation functions, so as to discern and measure the relative contributions of different backbone network layers to the text region representation; meanwhile, the extracted input feature map is up-sampled to the same resolution as the input image and then multiplied by the obtained channel attention feature map, realizing a semantic information distillation operation on the input image and thereby obtaining a text image semantic representation;
2) Finely modeling the stroke representation of the text region, enhancing the fine-grained stroke character segmentation representation by introducing orthogonal convolution networks acting along orthogonal directions; specifically, the 3-channel RGB raw input features of the circumscribed rectangle OTR of the text region are used as complementary low-level image semantic information and fused with the obtained text image semantic representation; preferably, orthogonal convolution layers with kernel sizes of 1×7 and 7×1 are introduced to calculate attention coefficients along the spatial directions, and the resulting attention values are multiplied with the fused text feature map.
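The orthogonal 1×7 / 7×1 attention step above can be sketched numerically: one kernel sweeps along rows, the other along columns, and a sigmoid of their summed responses reweights the fused feature map. This is a minimal single-channel NumPy illustration under assumed details (sigmoid gating and response summation are our choices; the patent only fixes the kernel shapes):

```python
import numpy as np

def conv1d_same(x, k, axis):
    # 'same'-padded 1-D convolution of a 2-D map along the given axis.
    pad = len(k) // 2
    out = np.zeros_like(x, dtype=float)
    xp = np.moveaxis(x, axis, 0)
    op = np.moveaxis(out, axis, 0)            # view: writes land in `out`
    padded = np.pad(xp, ((pad, pad), (0, 0)))
    for i in range(xp.shape[0]):
        op[i] = np.tensordot(k, padded[i:i + len(k)], axes=(0, 0))
    return out

def orthogonal_spatial_attention(feat, k_h, k_v):
    # A 1x7 kernel sweeps along rows and a 7x1 kernel along columns; a
    # sigmoid of the summed responses gives per-pixel attention values
    # that reweight the input feature map.
    a = conv1d_same(feat, k_h, axis=1) + conv1d_same(feat, k_v, axis=0)
    attn = 1.0 / (1.0 + np.exp(-a))            # sigmoid gate
    return feat * attn

out = orthogonal_spatial_attention(np.ones((5, 5)), np.ones(7) / 7, np.ones(7) / 7)
print(out.shape)
```

On a constant map the gate stays strictly between 0.5 and 1, since both directional responses are positive averages of the (zero-padded) input.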
6. A method of visual text detection based on a stroke region segmentation strategy as recited in any of claims 1 to 5, wherein the front-end processing module pre-trains the stroke region prediction network using a dataset with stroke level segmentation map annotations as labels and a mean square error loss function.
7. The method for detecting visual text based on a stroke region segmentation strategy as recited in any one of claims 1 to 6 wherein in step S3, node features and connection structures thereof are initialized first, specifically comprising:
initializing node features: two complementary feature representations, geometric embedding and content embedding, are adopted to initialize the features of the text-level and stroke-level nodes; for geometric embedding, the geometric attributes of each predicted region candidate box are encoded into a high-dimensional space; for content embedding, the content features of each graph node are obtained by feeding the predicted feature maps of the geometry-related attributes of each region candidate box into an RRoI-Align layer; the two obtained feature embeddings are concatenated to form the final graph node feature representation; preferably, when generating a local graph network, the initial feature representations of all nodes are normalized by subtracting the features of the center node;
generating an adjacency matrix: the topology formed by each local graph network is encoded in an adjacency matrix A ∈ R^(N×N), where A(c, n) = 1 if there is a connection between the center node c of the local graph and its neighbor node n; preferably, the method for generating the adjacency matrix specifically includes:
for homogeneous graphs, covering the construction of both homogeneous stroke-level and text-level graph networks: for a homogeneous stroke graph containing only stroke-level graph nodes, a KNN nearest-neighbor algorithm based on Euclidean distance is adopted, and the 8 nearest neighbor nodes of each center node are selected as its 1-hop neighbors to form the adjacency matrix A_s; for a homogeneous text graph containing only text-level graph nodes, the adjacency matrix construction differs from that of the homogeneous stroke graph in that each center node keeps only its 4 nearest direct neighbor nodes, forming the corresponding matrix A_t;
for the heterogeneous text graph network, which contains graph nodes of both the text and stroke levels, this type of graph network is built from the extracted region candidate boxes; specifically, each text-level region candidate box is regarded as a center node of the heterogeneous text graph, and the connection relations within the 1-hop and 2-hop neighborhoods of the center node are used to generate the adjacency matrix A_h of the heterogeneous graph; the 1-hop neighborhood of a center node contains the 4 text-level graph neighbor nodes nearest to it, while its 2-hop neighborhood contains an additional 4 nearest stroke-level graph neighbor nodes.
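The three adjacency constructions (8-NN stroke graph, 4-NN text graph, and the 1-hop/2-hop heterogeneous graph) can be sketched directly from box centers. The NumPy illustration below uses random centers and hypothetical node counts purely for demonstration; the real method would use the predicted candidate boxes:

```python
import numpy as np

def knn_adjacency(centers, k):
    # Directed adjacency: A[c, n] = 1 for the k nearest neighbors n of
    # each center node c (Euclidean distance, self excluded).
    n = len(centers)
    A = np.zeros((n, n))
    for c in range(n):
        d = np.linalg.norm(centers - centers[c], axis=1)
        for nb in np.argsort(d)[1:k + 1]:
            A[c, nb] = 1
    return A

rng = np.random.default_rng(1)
stroke_centers = rng.uniform(0, 50, (12, 2))
text_centers = rng.uniform(0, 50, (6, 2))

A_s = knn_adjacency(stroke_centers, 8)   # stroke graph: 8 nearest neighbors
A_t = knn_adjacency(text_centers, 4)     # text graph: 4 nearest neighbors

# Heterogeneous graph: per text center node, 1-hop = its 4 nearest text
# nodes, 2-hop = its 4 nearest stroke nodes (indices offset past the
# text-node block).
n_t = len(text_centers)
n_s = len(stroke_centers)
A_h = np.zeros((n_t + n_s, n_t + n_s))
A_h[:n_t, :n_t] = A_t
for c in range(n_t):
    d = np.linalg.norm(stroke_centers - text_centers[c], axis=1)
    for nb in np.argsort(d)[:4]:
        A_h[c, n_t + nb] = 1
print(A_s.sum(axis=1)[0], A_t.sum(axis=1)[0], A_h[:n_t].sum(axis=1)[0])
```

Each stroke node ends up with 8 outgoing edges, each text node with 4, and each heterogeneous center node with 8 (4 text plus 4 stroke).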
8. The method of visual text detection based on a stroke region segmentation strategy according to any one of claims 1 to 7, wherein in step S3 the back-end processing module performs relation reasoning and link prediction over the nodes of the generated local graph networks through a hierarchical graph neural network reasoning model; the graph-neural-network-based reasoning process comprises the following three stages:
in the first stage, the stroke-level node features are aggregated and updated in a weighted-average manner guided by an attention mechanism; the weight information in the weighting process comes from two parts: the normalized adjacency matrix A_s, and the attention coefficient α(v, u) between any two graph nodes v and u derived in a graph attention network (GAT). The weighted aggregation of the first stage can be described as:
s′_k = σ( W · Σ_{u ∈ N(k)} fuse( Ā_s(k, u), α(k, u) ) · s_u )
where σ is an activation function, W is a trainable weight parameter, s_k denotes the features of a stroke-level graph node k, Ā_s(k, u) is the normalized weight taken from A_s, and fuse(·) denotes a linear feature combination function;
if the center of a stroke-level graph node falls within the region of a text-level graph node, the updated representation of the stroke-level node is merged into the corresponding text-level node representation;
the second stage fuses the features of the two levels of graph nodes by stacking two Transformer encoder modules; specifically, the introduced Transformer encoder models and infers the hierarchical relations between the nodes of the heterogeneous graph by capturing the attention coefficients between stroke-stroke, stroke-text, and text-text nodes, expressed as:
Attention(Q, K, V) = softmax( Q · K^T / √d ) · V, with Q = H_{t,s} · W_Q, K = H_{t,s} · W_K, V = H_{t,s} · W_V
where H_{t,s} denotes all text (t) node features and stroke (s) node features at the current layer; Attention(·) is the attention computation operation in the Transformer; Q, K and V denote the query, key and value matrices respectively; and W_Q, W_K, W_V are trainable weight parameters;
in the third stage, a graph reasoning network with an expanded neighborhood range is used: for each text-level graph node, the feature representations of its 1-hop neighbors, containing only text neighbor nodes, are aggregated at the first layer of the designed graph network, and the information of its 2-hop neighbors, containing both text and stroke neighbor nodes, is aggregated at the subsequent layer; meanwhile, dynamic graph convolution is adopted to adaptively adjust the network structure of the heterogeneous graph, described as:
P = σ( M_{t,s} ⊙ A_{t,s}( G( H_{t,s} ) ) · W )
where W is a trainable weight matrix, G(·) represents the conventional information aggregation process on a graph network, and M_{t,s} and A_{t,s} respectively denote the cross-layer masking matrix and the cross-hop attention matrix of the introduced dynamic graph network;
preferably, the cross-layer masking matrix M_{t,s} is further divided into M′_s, M′_t and M′_{t,s}, which respectively denote the self-masking matrix among stroke-level nodes, the self-masking matrix among text-level nodes, and the mutual masking matrix between stroke-level and text-level graph nodes; the masking result for the stroke-level graph nodes is ultimately based on comparing a linear combination of M′_s and M′_{t,s} with a fixed threshold, and the masking result for the text-level graph nodes is ultimately based on comparing a linear combination of M′_t and M′_{t,s} with a fixed threshold;
after completing the above three stages, the output of the last graph network layer is used for predicting the link relations between text graph nodes and for locating the regression values of the text instance bounding boxes.
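The threshold-based masking decision in this claim can be illustrated with a toy example. The combination weight, the aggregation of each matrix row into a score, and the threshold value below are all assumptions made for demonstration; the claim only fixes that a linear combination of the two masking matrices is compared against a fixed threshold:

```python
import numpy as np

def node_mask(M_self, M_mutual, alpha=0.5, threshold=0.5):
    # Masking decision for one node family: a node is kept when a linear
    # combination of its self-masking and mutual-masking row scores
    # clears a fixed threshold (alpha and threshold are hypothetical
    # hyperparameters; rows are reduced by their mean here).
    score = alpha * M_self.mean(axis=1) + (1 - alpha) * M_mutual.mean(axis=1)
    return score > threshold

M_s = np.array([[0.9, 0.8], [0.1, 0.2]])    # stroke-stroke self-masking
M_ts = np.array([[0.7, 0.9], [0.3, 0.1]])   # stroke-text mutual masking
keep = node_mask(M_s, M_ts)
print(keep)  # first stroke node kept, second masked out
```

The first node's combined score (0.825) clears the threshold while the second's (0.175) does not, so only the first stroke-level node survives the mask.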
9. The method of visual text detection based on a stroke region segmentation strategy according to any one of claims 1 to 8, wherein during training the learning process of the whole detection framework is guided by the cross-entropy loss between the graph model prediction results and the corresponding ground-truth class labels; preferably, according to the classification and link prediction results of the graph nodes, the text-level nodes are grouped by a breadth-first search and ordered by a shortest-path algorithm; preferably, the boundary of a text instance of arbitrary shape is obtained by sequentially connecting the midpoints of the top and bottom edges of the candidate boxes corresponding to the sorted text nodes.
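The breadth-first grouping step of this claim can be sketched directly: predicted positive links between text-level nodes define an undirected graph, and each connected component found by BFS becomes one text instance (the per-component ordering and boundary tracing are omitted here):

```python
from collections import deque

def group_by_links(n_nodes, links):
    # Breadth-first grouping: text-level nodes connected by predicted
    # positive links end up in the same text instance.
    adj = {i: [] for i in range(n_nodes)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    seen, groups = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        comp, q = [], deque([start])
        seen.add(start)
        while q:
            v = q.popleft()
            comp.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    q.append(u)
        groups.append(sorted(comp))
    return groups

# 6 candidate nodes, 3 predicted links -> two multi-node instances plus a singleton.
print(group_by_links(6, [(0, 1), (1, 2), (4, 5)]))  # → [[0, 1, 2], [3], [4, 5]]
```

Singleton components (node 3 here) correspond to candidate boxes with no positive links and would typically be kept or discarded based on their classification confidence.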
10. A computer readable storage medium storing a computer program which, when executed by a processor, implements a method for visual text detection based on a stroke region segmentation strategy as claimed in any one of claims 1 to 9.
CN202310617471.6A 2023-05-29 2023-05-29 Visual text detection method based on stroke region segmentation strategy Pending CN117115824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310617471.6A CN117115824A (en) 2023-05-29 2023-05-29 Visual text detection method based on stroke region segmentation strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310617471.6A CN117115824A (en) 2023-05-29 2023-05-29 Visual text detection method based on stroke region segmentation strategy

Publications (1)

Publication Number Publication Date
CN117115824A true CN117115824A (en) 2023-11-24

Family

ID=88802718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310617471.6A Pending CN117115824A (en) 2023-05-29 2023-05-29 Visual text detection method based on stroke region segmentation strategy

Country Status (1)

Country Link
CN (1) CN117115824A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117634486A (en) * 2024-01-26 2024-03-01 厦门大学 Directional 3D instance segmentation method based on text information


Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN108804530B (en) Subtitling areas of an image
KR101896357B1 (en) Method, device and program for detecting an object
Lei et al. Region-enhanced convolutional neural network for object detection in remote sensing images
Kadam et al. Detection and localization of multiple image splicing using MobileNet V1
CN109685065B (en) Layout analysis method and system for automatically classifying test paper contents
CN113272827A (en) Validation of classification decisions in convolutional neural networks
WO2019058372A1 (en) Method and system for image content recognition
CN111126127B (en) High-resolution remote sensing image classification method guided by multi-level spatial context characteristics
Lin et al. Saliency detection via multi-scale global cues
Rothacker et al. Word hypotheses for segmentation-free word spotting in historic document images
CN111860309A (en) Face recognition method and system
He et al. Aggregating local context for accurate scene text detection
CN111325237A (en) Image identification method based on attention interaction mechanism
CN111738164B (en) Pedestrian detection method based on deep learning
Chen et al. Page segmentation for historical handwritten document images using conditional random fields
Wang et al. From object detection to text detection and recognition: A brief evolution history of optical character recognition
US20220292877A1 (en) Systems, methods, and storage media for creating image data embeddings to be used for image recognition
CN117115824A (en) Visual text detection method based on stroke region segmentation strategy
CN111414913B (en) Character recognition method, recognition device and electronic equipment
CN114332893A (en) Table structure identification method and device, computer equipment and storage medium
CN111582057B (en) Face verification method based on local receptive field
Lin et al. Radical-based extract and recognition networks for Oracle character recognition
Li et al. Deep neural network with attention model for scene text recognition
KR102026280B1 (en) Method and system for scene text detection using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination