CN111126389A - Text detection method and device, electronic equipment and storage medium

Info

Publication number: CN111126389A
Application number: CN201911330293.9A
Authority: CN (China)
Prior art keywords: detection, text, image, detected, classified
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 刘皓 (Liu Hao)
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911330293.9A
Publication of CN111126389A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The embodiment of the invention discloses a text detection method and apparatus, an electronic device and a storage medium. The text detection method comprises the following steps: acquiring an image to be detected; constructing a detection frame corresponding to each text element in the image to be detected; respectively extracting the texture features and geometric features of the region corresponding to each detection frame, and acquiring the association relations among the detection frames; classifying the detection frames according to the association relations, the texture features and the geometric features to obtain classified detection frames; and performing text detection on the image to be detected based on the classified detection frames.

Description

Text detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a text detection method and device, electronic equipment and a storage medium.
Background
Natural scene text detection has extremely important and wide applications in real life, such as text retrieval, guideboard recognition and intelligent test paper correction. However, detecting text in natural scenes remains a difficult task because of various uncontrollable interference factors in such scenes, such as shadow occlusion, shooting angle and occlusion by foreign objects, as well as the influence of some inherent attributes of the text, such as artistic, deformed or incomplete characters. Nevertheless, with the development of Artificial Intelligence (AI) technology in recent years, natural scene text detection technologies based on deep learning algorithms have made significant progress in performance.
At present, the most common text detection techniques are mainly regression-based methods. In the research and practice of the prior art, however, the inventor of the present invention found that current regression-based methods can only handle rectangular text lines: when the text has a curved shape, the predicted detection box cannot accurately cover all text regions. In addition, for a long text line, once the aspect ratio of the text line is greater than a preset prediction threshold, the box is lost or the prediction is incomplete, so the detection effect of existing text detection schemes is not good.
Disclosure of Invention
The embodiment of the invention provides a text detection method, a text detection device, electronic equipment and a storage medium, which can improve the accuracy of text detection.
The embodiment of the invention provides a text detection method, which comprises the following steps:
acquiring an image to be detected, wherein the image to be detected comprises a text to be detected, and the text to be detected comprises a plurality of text elements;
constructing a detection frame corresponding to each text element in the image to be detected;
respectively extracting texture features and geometric features of the corresponding region of each detection frame, and acquiring association relations among the detection frames;
classifying the detection frames according to the association relations, the texture features and the geometric features to obtain classified detection frames;
and performing text detection on the image to be detected based on the classified detection box.
Correspondingly, an embodiment of the present invention further provides a text detection apparatus, including:
the first acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises a text to be detected, and the text to be detected comprises a plurality of text elements;
the construction module is used for constructing a detection frame corresponding to each text element in the image to be detected;
the extraction module is used for respectively extracting the texture features and the geometric features of the corresponding region of each detection frame;
the second acquisition module is used for acquiring the association relation among the detection frames;
the classification module is used for classifying the detection frames according to the association relations, the texture features and the geometric features to obtain the classified detection frames;
and the detection module is used for carrying out text detection on the image to be detected based on the classified detection box.
Optionally, in some embodiments of the present invention, the classification module includes:
the calculating unit is used for calculating the similarity function corresponding to each detection frame according to the association relation;
and the classification unit is used for classifying the detection frames based on the texture features, the geometric features and the similarity function to obtain the classified detection frames.
Optionally, in some embodiments of the present invention, the classifying unit includes:
the construction subunit is used for respectively constructing a texture feature map corresponding to the image to be detected and a geometric feature map corresponding to the image to be detected according to the texture feature, the geometric feature and the similarity function;
and the classification subunit is used for classifying the detection frames based on the texture feature map and the geometric feature map to obtain the classified detection frames.
Optionally, in some embodiments of the present invention, the building subunit is specifically configured to:
calculating texture feature points corresponding to the image to be detected through texture features and a similarity function;
constructing a texture feature map corresponding to the image to be detected based on the texture feature points;
calculating a geometric feature point corresponding to the image to be detected through a geometric feature and a similarity function;
and constructing a geometric feature map corresponding to the image to be detected based on the geometric feature points.
Optionally, in some embodiments of the present invention, the classification subunit is specifically configured to:
fusing the texture feature map and the geometric feature map to obtain a fused feature map;
predicting the category of the detection frame through the fused feature map;
and classifying the detection frames based on the prediction result to obtain the classified detection frames.
Optionally, in some embodiments of the present invention, the detection module includes:
the determining unit is used for determining the classified detection frames belonging to the same category as a homologous group;
the construction unit is used for constructing a text box for text detection according to the classified detection boxes in the homologous group;
and the detection unit is used for carrying out text detection on the image to be detected based on the text box.
Optionally, in some embodiments of the present invention, the building unit is specifically configured to:
determining a central point corresponding to each classified detection frame in the homologous group;
obtaining the corresponding size of each classified detection frame in the homologous group;
a text box for text detection is constructed based on the center point and the size.
Optionally, in some embodiments of the present invention, the apparatus further includes an adjusting module, where the adjusting module is configured to adjust the edges of the text box to obtain an adjusted text box;
the detection module is specifically configured to: and performing text detection on the image to be detected based on the adjusted text box.
Optionally, in some embodiments of the present invention, the building module is specifically configured to:
performing semantic segmentation on the image to be detected to obtain target pixel points corresponding to each text element and pixel association information corresponding to each target pixel point;
and constructing a detection frame corresponding to each text element based on the pixel correlation information and the plurality of target pixel points.
After an image to be detected is obtained, where the image to be detected comprises a text to be detected and the text to be detected comprises a plurality of text elements, a detection frame corresponding to each text element is constructed in the image to be detected; then, the texture features and geometric features of the region corresponding to each detection frame are respectively extracted, and the association relations among the detection frames are acquired; next, the detection frames are classified according to the association relations, the texture features and the geometric features to obtain classified detection frames; finally, text detection is performed on the image to be detected based on the classified detection frames. Therefore, the scheme can effectively improve the accuracy of text detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1a is a scene schematic diagram of a text detection method according to an embodiment of the present invention;
FIG. 1b is a schematic flow chart of a text detection method according to an embodiment of the present invention;
fig. 1c is a schematic diagram of an 8-neighborhood of a target pixel point in the text detection method according to the embodiment of the present invention;
fig. 1d is a schematic diagram of a reference line constructed in the text detection method provided in the embodiment of the present invention;
fig. 1e is a schematic diagram of a text box constructed in the text detection method provided in the embodiment of the present invention;
fig. 1f is a schematic diagram illustrating a text box being adjusted in the text detection method according to the embodiment of the present invention;
FIG. 2a is another schematic flow chart of a text detection method according to an embodiment of the present invention;
fig. 2b is a schematic view of another scene of the text detection method according to the embodiment of the present invention;
fig. 2c to fig. 2e are further schematic diagrams of constructing a text box in the text detection method according to the embodiment of the present invention;
fig. 3a is a schematic structural diagram of a text detection apparatus according to an embodiment of the present invention;
FIG. 3b is a schematic structural diagram of a text detection apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a text detection method and device, electronic equipment and a storage medium.
The text detection apparatus may be specifically integrated in a terminal, and the terminal may include a mobile phone, a tablet computer or a personal computer (PC).
For example, referring to fig. 1a, suppose the text detection apparatus is integrated in a mobile phone that includes a camera and a display screen. When a user photographs a sign with the camera, the mobile phone acquires an image to be detected corresponding to the sign, where the image to be detected includes the text content of the sign (i.e., the text to be detected) and the text to be detected includes a plurality of text elements. The mobile phone then performs semantic segmentation on the image to be detected to obtain the target pixel points corresponding to each text element and the pixel association information corresponding to each target pixel point, and constructs a detection box corresponding to each text element based on the pixel association information and the plurality of target pixel points. Next, the mobile phone extracts the texture features and geometric features of the region corresponding to each detection box, acquires the association relations among the detection boxes, and classifies the detection boxes according to the association relations, the texture features and the geometric features to obtain classified detection boxes. Finally, the mobile phone performs text detection on the image to be detected based on the classified detection boxes and can recognize the text to be detected in the image according to the detection result; for example, the mobile phone can recognize a road name on a sign.
Compared with existing text detection schemes, this scheme can extract the texture features and geometric features of the region corresponding to each detection frame and acquire the association relations among the detection frames. When the text has a curved shape, the detection frames can be classified according to the association relations, the texture features and the geometric features, and text detection is then performed on the image to be detected based on the classified detection frames, which avoids the situation where the detection frames cannot accurately cover all text regions. In addition, for a long text line, because the scheme constructs a detection frame corresponding to each text element in the image to be detected, the problems of lost frames or incomplete prediction during detection can be avoided. Therefore, the scheme can improve the accuracy of text detection.
The following are detailed below. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
A text detection method comprises: acquiring an image to be detected; constructing a detection frame corresponding to each text element in the image to be detected; respectively extracting the texture features and geometric features of the region corresponding to each detection frame, and acquiring the association relations among the detection frames; classifying the detection frames according to the association relations, the texture features and the geometric features to obtain classified detection frames; and performing text detection on the image to be detected based on the classified detection frames.
Referring to fig. 1b, fig. 1b is a schematic flow chart of a text detection method according to an embodiment of the invention. The specific flow of the text detection method can be as follows:
101. Acquire an image to be detected.
The image to be detected can be pre-stored locally, pulled by accessing a network interface, or captured in real time by a camera, depending on the actual situation.
102. Construct a detection frame corresponding to each text element in the image to be detected.
For example, semantic segmentation may be performed on the image to be detected, and a detection frame corresponding to each text element may then be constructed in the image to be detected based on the semantic segmentation result. Semantic segmentation of an image performs pixel-level recognition and segmentation to obtain the category information and accurate position information of the objects in the image. It can be understood that, in the embodiment of the present invention, performing semantic segmentation on the image to be detected yields the pixel points corresponding to each text element to be detected (i.e., the target pixel points) and the pixel association information corresponding to the target pixel points. That is, optionally, in some embodiments, the step "constructing a detection frame corresponding to each text element in the image to be detected" may specifically include:
(11) performing semantic segmentation on an image to be detected to obtain target pixel points corresponding to each text element and pixel association information corresponding to each target pixel point;
(12) and constructing a detection frame corresponding to each text element based on the pixel correlation information and the plurality of target pixel points.
The pixel association information can be understood as pixel neighborhood information. In image processing, a neighborhood refers to the set of pixels adjacent to a given pixel and reflects the spatial relationship between pixels. The pixel association information can be pixel 4-neighborhood information, pixel diagonal-neighborhood information or pixel 8-neighborhood information, where the pixel 8-neighborhood information can be regarded as the fusion of the pixel 4-neighborhood information and the pixel diagonal-neighborhood information. When a pixel point lies on the image boundary, some of its neighborhood points can be considered to fall outside the image.
For example, the pixel points of the image to be detected may first be classified by a first classification sub-model in a preset text detection model to determine the pixel points belonging to text elements, that is, to obtain a plurality of target pixel points. The plurality of target pixel points are then classified by a second classification sub-model in the preset text detection model to obtain the classification confidence corresponding to each target pixel point. Finally, according to the classification confidence, it is determined whether each target pixel point has a connection relationship with its 8 neighborhoods (the upper, lower, left, right and diagonal neighborhoods, as shown in fig. 1c), and the pixel association information corresponding to the target pixel points is constructed according to these connection relationships.
Here, the pixel prediction value refers to the probability that each pixel point in the text feature image belongs to the region of the text to be detected, and the classification confidence refers to the probability that a target pixel point belongs to the text to be detected. First, a convolutional neural network such as an FPN (Feature Pyramid Network) can be used, as shown in fig. 1d, to perform feature extraction on the image to be detected: the image to be detected first passes through a convolutional neural network composed of nine convolutional layers, which outputs a feature map of size 32 with 512 channels; the feature map is then input into the FPN and, after 3 stages of up-sampling, a text feature map of size 256 with 32 channels is finally output. Next, the pixel points in the text feature image are classified by the first classification sub-model in the preset text detection model to obtain the pixel prediction value corresponding to each pixel point, and a plurality of target pixel points corresponding to the text to be detected are determined based on the pixel prediction values. The plurality of target pixel points are then classified by the second classification sub-model in the preset text detection model to obtain the classification confidence corresponding to each target pixel point.
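To make steps (11) and (12) concrete, below is a minimal sketch of how detection boxes could be assembled from the per-pixel text scores and 8-neighborhood link confidences described above; the thresholds, array layout and function names are illustrative assumptions rather than the patent's specification. Union-find is used so that pixels joined by any chain of confident links end up in the same detection box.

```python
import numpy as np

# 8-neighborhood offsets: up, down, left, right and the four diagonals
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]

def build_detection_boxes(text_prob, link_prob, text_thr=0.7, link_thr=0.7):
    """text_prob: (H, W) text/non-text scores from the first classification sub-model.
    link_prob: (H, W, 8) connection confidences towards the 8 neighbors, from the
    second classification sub-model. Returns axis-aligned boxes
    (x_min, y_min, x_max, y_max), one per connected group of text pixels."""
    H, W = text_prob.shape
    is_text = text_prob > text_thr
    parent = {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    def union(p, q):
        parent[find(p)] = find(q)

    for y, x in zip(*np.nonzero(is_text)):
        parent.setdefault((y, x), (y, x))
        for k, (dy, dx) in enumerate(OFFSETS):
            ny, nx = y + dy, x + dx
            # merge only if the neighbor lies inside the image, is text,
            # and the predicted link confidence is high enough
            if 0 <= ny < H and 0 <= nx < W and is_text[ny, nx] and link_prob[y, x, k] > link_thr:
                parent.setdefault((ny, nx), (ny, nx))
                union((y, x), (ny, nx))

    groups = {}
    for p in parent:
        groups.setdefault(find(p), []).append(p)
    boxes = []
    for pixels in groups.values():
        ys, xs = zip(*pixels)
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```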
103. Respectively extract the texture features and geometric features of the region corresponding to each detection frame, and acquire the association relations among the detection frames.
For example, the texture features of each detection box and the geometric features of each detection box may be respectively extracted through a convolutional neural network such as an FPN (Feature Pyramid Network), and the association relations among the detection boxes may be acquired, where an association relation may be the relative position relationship between detection boxes.
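As an illustration of step 103, the sketch below derives simple geometric features (center, width, height) for each detection box and a relative-position association for every pair of boxes; the exact feature definitions and the height-based normalization are assumptions, since the patent only states that the association relation may be a relative position relationship.

```python
import numpy as np

def geometric_features(boxes):
    """boxes: array of (x_min, y_min, x_max, y_max). Returns one feature
    vector h_i = (cx, cy, w, h) per box."""
    boxes = np.asarray(boxes, dtype=np.float32)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    return np.stack([cx, cy, w, h], axis=1)

def relative_positions(geo):
    """Pairwise association relation: offsets between box centers, normalized
    by the mean box height so the relation is roughly scale-invariant."""
    centers = geo[:, :2]
    scale = geo[:, 3].mean() + 1e-6
    return (centers[None, :, :] - centers[:, None, :]) / scale  # (N, N, 2)
```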
104. Classify the detection frames according to the association relations, the texture features and the geometric features to obtain the classified detection frames.
After the texture features and geometric features of the detection frames are obtained, the category to which each detection frame belongs may be predicted using the association relations among the detection frames. Specifically, a similarity function corresponding to each detection frame may be calculated according to the association relations, and the detection frames may then be classified according to the texture features, the geometric features and the similarity functions to obtain the classified detection frames. That is, optionally, in some embodiments, the step "classifying the detection frames according to the association relations, the texture features and the geometric features to obtain the classified detection frames" may specifically include:
(21) calculating a similarity function corresponding to each detection frame according to the association relation;
(22) classifying the detection frames based on the textural features, the geometric features and the similarity function to obtain the classified detection frames.
For example, the similarity function corresponding to each detection frame may be calculated according to the association relations, and the detection frames may then be classified by a preset graph convolutional neural network based on the texture features, the geometric features and the similarity functions to obtain the classified detection frames. It should be noted that, in the embodiments of the present invention, the similarity function may include a cosine similarity function, a Gaussian similarity function and/or a string similarity function. Preferably, in some embodiments, the similarity function includes all three; that is, classifying the detection frames by a preset graph convolutional neural network based on the texture features, the geometric features and the similarity functions may specifically include: classifying the detection frames by a preset graph convolutional neural network based on the texture features, the geometric features, the cosine similarity function, the Gaussian similarity function and the string similarity function to obtain the classified detection frames. In other words, the scheme for classifying the detection frames considers not only the texture features and geometric features corresponding to the detection frames, but also the similarity of the detection frames in each dimension, such as cosine similarity, similarity under a Gaussian distribution and similarity between character strings, which improves the accuracy of detection frame classification and facilitates subsequent text detection on the image to be detected based on the detection frames.
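The combined similarity described here (formalized as K = β1·K1 + β2·K2 + β3·K3 further below) could be computed as in the following sketch; the weight values, the Gaussian bandwidth and the source of the character strings are assumptions not fixed by the patent.

```python
import numpy as np
from difflib import SequenceMatcher

def combined_similarity(f_i, f_j, s_i, s_j, betas=(0.4, 0.4, 0.2), sigma=1.0):
    """K = b1*K1 + b2*K2 + b3*K3 with b1 + b2 + b3 = 1.
    f_i, f_j: feature vectors of the two detection regions.
    s_i, s_j: character strings assumed to be available for the two regions
    (e.g. from a recognizer); the patent does not specify their source."""
    b1, b2, b3 = betas
    assert abs(b1 + b2 + b3 - 1.0) < 1e-6, "weights must sum to 1"
    k1 = float(np.dot(f_i, f_j) /
               (np.linalg.norm(f_i) * np.linalg.norm(f_j) + 1e-9))       # cosine
    k2 = float(np.exp(-np.sum((f_i - f_j) ** 2) / (2 * sigma ** 2)))     # Gaussian
    k3 = SequenceMatcher(None, s_i, s_j).ratio()                         # string
    return b1 * k1 + b2 * k2 + b3 * k3
```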
Further, a texture feature map corresponding to the image to be detected and a geometric feature map corresponding to the image to be detected can be constructed by the preset graph convolutional neural network according to the texture features, the geometric features and the similarity function; the detection frames are then classified based on the texture feature map and the geometric feature map to obtain the classified detection frames. That is, in some embodiments, the step "classifying the detection frames based on the texture features, the geometric features and the similarity function to obtain the classified detection frames" may specifically include:
(41) respectively constructing a texture feature map corresponding to the image to be detected and a geometric feature map corresponding to the image to be detected according to the texture feature, the geometric feature and the similarity function;
(42) and classifying the detection frames based on the texture feature map and the geometric feature map to obtain the classified detection frames.
For example, after calculating the similarity function corresponding to each detection frame, texture feature points corresponding to the image to be detected may be calculated by using the texture features and the similarity function, and then, a texture feature map corresponding to the image to be detected is constructed based on the texture feature points, which may specifically be calculated by the following formula:
G1 = K(gi, gj), i, j ∈ {1, 2, ..., N}

where the texture feature map G1 characterizes the similarity between the texture feature gi of the i-th detection region and the texture feature gj of the j-th detection region under the similarity function K. It should be noted that the similarity function is K = β1·K1 + β2·K2 + β3·K3, where K1, K2 and K3 respectively represent the cosine similarity function, the Gaussian similarity function and the string similarity function, and β1, β2 and β3 are weight coefficients satisfying β1 + β2 + β3 = 1, which can be set according to the actual situation. Similarly, after the similarity function corresponding to each detection frame is calculated, the geometric feature points corresponding to the image to be detected can be calculated through the geometric features and the similarity function, and a geometric feature map corresponding to the image to be detected is then constructed based on the geometric feature points, specifically by the following formula:

G2 = K(hi, hj), i, j ∈ {1, 2, ..., N}

where the geometric feature map G2 characterizes the similarity between the geometric features hi and hj of the i-th and j-th detection regions. That is, in some embodiments, the step "respectively constructing a texture feature map corresponding to the image to be detected and a geometric feature map corresponding to the image to be detected according to the texture features, the geometric features and the similarity function" may specifically include:
(51) calculating texture feature points corresponding to the image to be detected through texture features and a similarity function;
(52) constructing a texture feature map corresponding to the image to be detected based on the texture feature points;
(53) calculating a geometric feature point corresponding to the image to be detected through a geometric feature and a similarity function;
(54) and constructing a geometric feature map corresponding to the image to be detected based on the geometric feature points.
After the texture feature map and the geometric feature map are obtained, they may be respectively processed by the preset graph convolutional neural network, where one layer of the preset graph convolutional neural network may be defined as Z = ReLU(LayerNorm(GXW)) + X. Here G is the texture feature map or geometric feature map mentioned above, X is the set of texture features (or geometric features) of the regions corresponding to the detection frames, W is a weight matrix, ReLU is a nonlinear layer, and LayerNorm denotes layer normalization. Optionally, in some embodiments, the texture feature map and the geometric feature map may each be processed by a graph convolutional neural network with 3 such layers. Finally, the outputs for the texture feature map and the geometric feature map are concatenated along the channel dimension, the concatenated feature map is processed by one convolutional layer to obtain the probability that detection region I and each other detection region belong to the same category, and the detection regions are classified based on this result to obtain the classified detection frames.
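A sketch of the layer Z = ReLU(LayerNorm(GXW)) + X and the two-branch, three-layer arrangement described above, assuming PyTorch; the pairwise head that turns the concatenated features into same-category probabilities is an assumption, since the patent only states that the concatenated feature map is processed by one further layer.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One layer Z = ReLU(LayerNorm(G @ X @ W)) + X, where G is an (N, N)
    feature map over N detection boxes and X is the (N, d) per-box
    feature set (texture or geometric)."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.norm = nn.LayerNorm(dim)

    def forward(self, G, X):
        return torch.relu(self.norm(G @ self.W(X))) + X  # residual connection

class PairClassifier(nn.Module):
    """Three graph conv layers per branch, channel concatenation, then an
    assumed pairwise head producing same-category probabilities."""
    def __init__(self, dim):
        super().__init__()
        self.tex = nn.ModuleList(GraphConvLayer(dim) for _ in range(3))
        self.geo = nn.ModuleList(GraphConvLayer(dim) for _ in range(3))
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, G1, X_tex, G2, X_geo):
        for layer in self.tex:
            X_tex = layer(G1, X_tex)
        for layer in self.geo:
            X_geo = layer(G2, X_geo)
        fused = torch.cat([X_tex, X_geo], dim=-1)          # (N, 2d)
        pair = fused.unsqueeze(0) + fused.unsqueeze(1)     # (N, N, 2d)
        return torch.sigmoid(self.head(pair)).squeeze(-1)  # P(i, j same category)
```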
Optionally, in some embodiments, when the probability is greater than a preset threshold, the two detection regions may be determined to belong to the same category. For example, if the preset threshold is 80%, the probability that detection region I and detection region A belong to the same category is 50%, the probability that detection region I and detection region B belong to the same category is 90%, and the probability that detection region I and detection region C belong to the same category is 35%, it can be determined that detection regions I and B belong to the same category. It should be noted that a category may be a text line, a phrase or a sentence, set according to the actual situation. That is, the step "classifying the detection frames based on the texture feature map and the geometric feature map to obtain the classified detection frames" may specifically include the following (a grouping sketch follows the list below):
(61) fusing the texture feature map and the geometric feature map to obtain a fused feature map;
(62) predicting the category of the detection frame through the fused feature map;
(63) and classifying the detection frames based on the prediction result to obtain the classified detection frames.
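Under the thresholding rule of the worked example above (threshold 80%), grouping the classified detection frames amounts to taking connected components over the pairwise same-category matrix, as in this sketch:

```python
import numpy as np

def group_boxes(same_prob, threshold=0.8):
    """same_prob: (N, N) probabilities that boxes i and j share a category.
    Returns index groups: connected components of the thresholded matrix."""
    N = same_prob.shape[0]
    adj = same_prob > threshold
    seen, groups = set(), []
    for start in range(N):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:  # depth-first search over linked boxes
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            comp.append(i)
            stack.extend(j for j in range(N) if adj[i, j] and j not in seen)
        groups.append(comp)
    return groups
```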
It should be noted that the graph convolutional neural network may be pre-established; that is, in some embodiments, establishing the graph convolutional neural network may specifically include:
(71) acquiring an image sample comprising a plurality of sample detection boxes, wherein each sample detection box comprises a plurality of text element samples marked with true values of the category to which the sample detection box belongs;
(72) inputting the image sample into a basic graph convolution network to obtain a category prediction value of each category of the sample detection frame;
(73) and converging the basic graph convolution network based on the category true value and the category predicted value to obtain the graph convolution neural network.
Convolutional layers: these are mainly used for feature extraction from an input image (such as a training sample or an image to be recognized), where the size of the convolution kernel can be determined according to the practical application; for example, the kernel sizes of the first to fourth convolutional layers may be (7, 7), (5, 5), (3, 3) and (3, 3). Optionally, in order to reduce the complexity of the calculation and improve the calculation efficiency, in this embodiment the kernel sizes of all four convolutional layers may be set to (3, 3), the activation functions all use ReLU (Rectified Linear Unit), and the padding modes are all set to "same", where the "same" padding mode can be simply understood as padding the edge with zeros, the number of zeros padded on the left side (upper side) being the same as or less than the number padded on the right side (lower side). Optionally, the convolutional layers may be directly connected to each other to accelerate the network convergence speed, and in order to further reduce the amount of computation, down-sampling may be performed after all, or any one or two, of the second to fourth convolutional layers; the down-sampling operation is substantially the same as convolution, except that the down-sampling kernel only takes the maximum value (max pooling) or the average value (average pooling) of the corresponding positions.
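A sketch of a backbone matching this description, assuming PyTorch: four convolutional layers with (3, 3) kernels, ReLU activations, "same" padding and max pooling after the second to fourth layers; the channel counts are assumptions.

```python
import torch.nn as nn

def make_backbone(in_ch=3, ch=64):
    """Four conv layers with (3, 3) kernels, 'same' padding and ReLU,
    with max pooling (down-sampling) after layers 2-4, as described above."""
    layers = []
    for i in range(4):
        layers += [nn.Conv2d(in_ch if i == 0 else ch, ch,
                             kernel_size=3, padding="same"),
                   nn.ReLU(inplace=True)]
        if i >= 1:  # down-sample after the second to fourth conv layers
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```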
It should be noted that, for convenience of description, in the embodiment of the present invention, both the layer where the activation function is located and the down-sampling layer (also referred to as a pooling layer) are included in the convolution layer, and it should be understood that the structure may also be considered to include the convolution layer, the layer where the activation function is located, the down-sampling layer (i.e., a pooling layer), and a full-connection layer, and of course, the structure may also include an input layer for inputting data and an output layer for outputting data, which are not described herein again.
Fully connected layer: this layer can map the learned features to the sample label space and mainly plays the role of a "classifier" in the whole convolutional neural network. Each node of the fully connected layer is connected to all nodes output by the previous layer (e.g., the down-sampling layer within the convolutional block); one node of the fully connected layer is called a neuron, and the number of neurons can be determined according to the requirements of the practical application. For example, in the text detection model, the number of neurons in each fully connected layer may be set to 512, or to 128, and so on. Similar to the convolutional layers, optionally, a nonlinear factor may be added to the fully connected layer through an activation function, for example the sigmoid function.
For example, an image sample may be acquired through multiple channels, where the image sample includes a plurality of sample detection boxes and each sample detection box contains a plurality of text element samples labeled with the true value of the category to which they belong. The image sample is then input into a base graph convolutional network, and the category of each sample detection box is predicted according to the texture features and geometric features of the region corresponding to each sample detection box and the association relations among the sample detection boxes, yielding a category prediction value for each sample detection box. Finally, the base graph convolutional network is converged based on the category true values and the category prediction values to obtain the graph convolutional neural network.
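A sketch of one convergence step under a common supervised setup, reusing the PairClassifier sketched earlier; the binary cross-entropy loss and the optimizer interface are assumptions, since the patent only states that the base network is converged on the category true values and prediction values.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, batch):
    """One convergence step: predict pairwise categories for the sample
    detection boxes and move the base network towards the true values.
    batch carries (G1, X_tex, G2, X_geo, labels), where labels is the
    (N, N) 0/1 float same-category ground truth."""
    G1, X_tex, G2, X_geo, labels = batch
    pred = model(G1, X_tex, G2, X_geo)  # (N, N) same-category probabilities
    loss = nn.functional.binary_cross_entropy(pred, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```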
105. Perform text detection on the image to be detected based on the classified detection frames.
For example, a text box for text detection may be constructed based on the classified detection boxes, and text detection may then be performed on the image to be detected according to the constructed text box; for instance, it may be detected that the image to be detected includes two text lines, where one text line includes five characters and the other includes three. That is, in some embodiments, the step "performing text detection on the image to be detected based on the classified detection boxes" may specifically include:
(81) determining the classified detection frames belonging to the same category as a homologous group;
(82) constructing a text box for text detection according to the classified detection boxes in the homologous group;
(83) and performing text detection on the image to be detected based on the text box.
Specifically, after the homologous group is determined, the center point corresponding to each classified detection box in the homologous group may be determined; the size corresponding to each classified detection box, such as its length and width, may then be acquired; finally, a text box for text detection may be constructed based on the center points and the sizes. That is, in some embodiments, the step "constructing a text box for text detection according to the classified detection boxes in the homologous group" may specifically include:
(91) determining a central point corresponding to each classified detection frame in the homologous group;
(92) obtaining the corresponding size of each classified detection frame in the homologous group;
(93) a text box for text detection is constructed based on the center point and the size.
For example, first, the center point corresponding to each classified detection box in the homologous group is determined. Then, following the arrangement of the classified detection boxes in the image to be detected, the center points are connected in sequence to obtain a reference line, as shown in fig. 1d. At the same time, the height values of the classified detection boxes within a preset range of each center point, for example the five classified detection boxes around the center point, are acquired, and the acquired height values are summed and averaged to obtain a reference height. A text box for text detection is then constructed based on the center points and the sizes: for example, if the constructed reference line is 5 cm long and the reference height is 3 cm, the reference line is translated upwards by 3 cm and downwards by 3 cm along the direction perpendicular to the reference line to obtain the text box. Because the reference line is constructed from the center points of the classified detection boxes, the boxes at the head and tail end points of the reference line would not be covered by the text box; therefore, the head and tail end points of the reference line also need to be translated outwards on both sides by a distance equal to the reference height, and the distance between the translated end points is taken as the length of the text box. At this time, the length of the text box is 16 cm and its height is 6 cm, as shown in fig. 1e.
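The reference-line construction just described could look like the following sketch, which returns the text box as a closed polygon; the neighborhood size for the reference height and the way the end points are extended are assumptions based on the example above.

```python
import numpy as np

def build_text_polygon(centers, heights, k=5):
    """centers: (N, 2) center points of the classified boxes in one
    homologous group, already in reading order (assumes N >= 2).
    heights: (N,) box heights. Sweeps the reference line up and down by
    the reference height and extends the two end points outwards,
    returning a closed polygon."""
    centers = np.asarray(centers, dtype=np.float32)
    ref_h = float(np.mean(heights[:k]))  # average over a preset range of boxes

    # unit direction of the reference line at the two ends, used to extend
    # the head and tail so the end boxes stay covered
    head_dir = centers[0] - centers[1]
    tail_dir = centers[-1] - centers[-2]
    head = centers[0] + ref_h * head_dir / (np.linalg.norm(head_dir) + 1e-6)
    tail = centers[-1] + ref_h * tail_dir / (np.linalg.norm(tail_dir) + 1e-6)
    line = np.vstack([head, centers, tail])

    # translate the reference line perpendicular to itself by the reference height
    seg = np.gradient(line, axis=0)
    normal = np.stack([-seg[:, 1], seg[:, 0]], axis=1)
    normal /= np.linalg.norm(normal, axis=1, keepdims=True) + 1e-6
    top = line + ref_h * normal
    bottom = line - ref_h * normal
    return np.vstack([top, bottom[::-1]])  # closed polygon
```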
It should be noted that, since the reference line is not necessarily a straight line segment, the constructed text box is not necessarily a rectangular box, which is inconvenient for subsequent text detection and recognition. Therefore, optionally, in some embodiments, after the step "constructing a text box for text detection based on the center points and the sizes", the method may further include: adjusting the edges of the text box to obtain an adjusted text box.
The step of performing text detection on the image to be detected based on the text box specifically includes: and performing text detection on the image to be detected based on the adjusted text box.
Specifically, the edges of the obtained text box may be smoothed by thin plate spline (TPS) interpolation and the shape of the text box adjusted to a rectangle, so as to facilitate subsequent text detection and text recognition, as shown in fig. 1f.
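A sketch of the smoothing step, assuming SciPy's RBFInterpolator with a thin-plate-spline kernel: the boundary of the curved text box is mapped onto an upright rectangle, after which the enclosed region can be resampled; the rectangle size and the point correspondence scheme are assumptions.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def tps_rectify_mapping(polygon, width, height):
    """Fit a thin plate spline that maps the (possibly curved) text box
    boundary onto an upright width x height rectangle. polygon: (2M, 2)
    closed boundary, first M points along the top edge, last M along the
    bottom (reversed), as produced by build_text_polygon above."""
    M = len(polygon) // 2
    top, bottom = polygon[:M], polygon[M:][::-1]
    xs = np.linspace(0, width, M)
    targets = np.vstack([np.stack([xs, np.zeros(M)], 1),           # top edge -> y = 0
                         np.stack([xs, np.full(M, height)], 1)])   # bottom -> y = height
    sources = np.vstack([top, bottom])
    # one TPS mapping for both output coordinates; evaluating it on pixel
    # coordinates yields the rectified sampling positions
    return RBFInterpolator(sources, targets, kernel="thin_plate_spline")
```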
In the embodiment of the invention, after the image to be detected is obtained, where the image to be detected comprises a text to be detected and the text to be detected comprises a plurality of text elements, a detection frame corresponding to each text element is constructed in the image to be detected; the texture features and geometric features of the region corresponding to each detection frame are then respectively extracted and the association relations among the detection frames acquired; next, the detection frames are classified according to the association relations, the texture features and the geometric features to obtain the classified detection frames; finally, text detection is performed on the image to be detected based on the classified detection frames. Compared with existing text detection schemes, this scheme extracts the texture features and geometric features of the region corresponding to each detection frame and acquires the association relations among the detection frames, so that when the text has a curved shape the detection frames can still be classified according to the association relations, the texture features and the geometric features, and text detection can then be performed on the image to be detected based on the classified detection frames, avoiding the situation where the detection frames cannot accurately cover all text regions. In addition, for a long text line, because the scheme constructs a detection frame corresponding to each text element in the image to be detected, the problems of lost frames or incomplete prediction during detection can be avoided. Therefore, the scheme can improve the accuracy of text detection.
The method according to the examples is further described in detail below by way of example.
In this embodiment, the text detection apparatus will be described by taking an example in which it is specifically integrated in a terminal.
Referring to fig. 2a, a text detection method may specifically include the following steps:
201. The terminal acquires an image to be detected.
The image to be detected can be pre-stored locally, pulled by accessing a network interface, or captured in real time by a camera, depending on the actual situation.
202. The terminal constructs a detection frame corresponding to each text element in the image to be detected.
For example, specifically, after the terminal performs semantic segmentation on the image to be detected, the terminal constructs a detection box corresponding to each text element in the image to be detected based on a semantic segmentation result.
203. The terminal respectively extracts the texture features and geometric features of the region corresponding to each detection frame and acquires the association relations among the detection frames.
For example, the terminal may extract the texture features of each detection box and the geometric features of each detection box through a convolutional neural network such as an FPN (Feature Pyramid Network), and may acquire the association relations among the detection boxes, where an association relation may be the relative position relationship between detection boxes.
204. The terminal classifies the detection frames according to the association relations, the texture features and the geometric features to obtain the classified detection frames.
For example, specifically, the terminal may calculate a similarity function corresponding to each detection frame according to the association relationship, and then classify the detection frames by using a preset graph convolutional neural network based on the texture features, the geometric features, and the similarity function to obtain the classified detection frames.
Preferably, in some embodiments, the similarity function includes a cosine similarity function, a Gaussian similarity function and a string similarity function; that is, the terminal classifying the detection frames by a preset graph convolutional neural network based on the texture features, the geometric features and the similarity function may specifically include: classifying the detection frames by a preset graph convolutional neural network based on the texture features, the geometric features, the cosine similarity function, the Gaussian similarity function and the string similarity function to obtain the classified detection frames. In other words, the scheme for classifying the detection frames considers not only the texture features and geometric features corresponding to the detection frames, but also the similarity of the detection frames in each dimension, such as cosine similarity, similarity under a Gaussian distribution and similarity between character strings, which improves the accuracy of detection frame classification and facilitates subsequent text detection on the image to be detected based on the detection frames.
205. The terminal performs text detection on the image to be detected based on the classified detection boxes.
For example, the terminal may construct a text box for text detection based on the classified detection boxes and then perform text detection on the image to be detected according to the constructed text box. Optionally, in some embodiments, the terminal may determine the classified detection boxes belonging to the same category as a homologous group, determine the center point corresponding to each classified detection box in the homologous group, acquire the size corresponding to each classified detection box, such as its length and width, and finally construct the text box for text detection based on the center points and the sizes.
To facilitate understanding of the text detection method provided by the embodiment of the present invention, the scene of a road sign is taken as an example for further description; please refer to fig. 2b. The text detection apparatus is integrated in a terminal. The terminal captures the road sign through a camera to obtain an image of the road sign (i.e., a scene image), and then performs feature extraction on the scene image to obtain a feature image corresponding to the scene image (i.e., the image to be detected). The terminal then constructs a detection box corresponding to each text element in the image to be detected, as shown in fig. 2c. Next, the terminal extracts the texture features and geometric features of the region corresponding to each detection box and classifies the detection boxes based on a preset graph convolutional neural network and the similarity functions corresponding to the detection boxes to obtain the classified detection boxes, as shown in fig. 2d. The terminal then adjusts the edges of the classified detection boxes and finally performs text detection on the image to be detected based on the adjusted detection boxes, as shown in fig. 2e. In this way, after the terminal acquires the image to be detected, where the image to be detected comprises a text to be detected and the text to be detected comprises a plurality of text elements, the terminal constructs a detection frame corresponding to each text element in the image to be detected, respectively extracts the texture features and geometric features of the region corresponding to each detection frame, acquires the association relations among the detection frames, classifies the detection frames according to the association relations, the texture features and the geometric features to obtain the classified detection frames, and finally performs text detection on the image to be detected based on the classified detection frames. Compared with existing text detection schemes, the terminal can extract the texture features and geometric features of the region corresponding to each detection frame and acquire the association relations among the detection frames; when the text has a curved shape, the detection frames can be classified according to the association relations, the texture features and the geometric features, and text detection is then performed on the image to be detected based on the classified detection frames, avoiding the situation where the detection frames cannot accurately cover all text regions. In addition, for a long text line, because the scheme constructs a detection frame corresponding to each text element in the image to be detected, the problems of lost frames or incomplete prediction during detection can be avoided. Therefore, the scheme can improve the accuracy of text detection.
In order to better implement the text detection method according to the embodiment of the present invention, an embodiment of the present invention further provides a text detection apparatus (detection apparatus for short) based on the foregoing text detection method. The meanings of the nouns are the same as those in the text detection method, and specific implementation details can refer to the description in the method embodiment.
Referring to fig. 3a, fig. 3a is a schematic structural diagram of a text detection apparatus according to an embodiment of the present invention. The detection apparatus may include a first obtaining module 301, a constructing module 302, an extracting module 303, a second obtaining module 304, a classifying module 305 and a detecting module 306, specifically as follows:
the first obtaining module 301 is configured to obtain an image to be detected.
The image to be detected includes a text to be detected, and the text to be detected includes a plurality of text elements, where a text element refers to any element in the text to be detected, such as a character or a symbol. The image to be detected may be obtained by the first obtaining module 301, for example by pulling it through a network interface.
A constructing module 302, configured to construct a detection box corresponding to each text element in the image to be detected.
For example, specifically, the construction module 302 may perform semantic segmentation on the image to be detected, and then construct a detection box corresponding to each text element in the image to be detected based on a result of the semantic segmentation.
Optionally, in some embodiments, the building module 302 is specifically configured to: and performing semantic segmentation on the image to be detected to obtain target pixel points corresponding to each text element and pixel association information corresponding to each target pixel point, and constructing a detection frame corresponding to each text element based on the pixel association information and the plurality of target pixel points.
And the extracting module 303 is configured to extract the texture feature and the geometric feature of the region corresponding to each detection frame.
For example, the extracting module 303 may respectively extract the texture features of each detection box and the geometric features of each detection box through a convolutional neural network such as an FPN (Feature Pyramid Network).
A second obtaining module 304, configured to obtain an association relationship between the detection frames.
The classification module 305 is configured to classify the detection frames according to the association relationship, the texture features, and the geometric features, so as to obtain the classified detection frames.
For example, specifically, the classification module 305 may predict the category to which each detection frame belongs by using the association relationship between the detection frames, where a similarity function corresponding to each detection frame may be calculated according to the association relationship, and then the detection frames are classified according to the texture features, the geometric features, and the similarity function, so as to obtain the classified detection frames.
Optionally, in some embodiments, the classification module 305 may specifically include:
the calculating unit is used for calculating the similarity function corresponding to each detection frame according to the association relation;
and the classification unit is used for classifying the detection frames based on the texture features, the geometric features and the similarity function to obtain the classified detection frames.
Optionally, in some embodiments, the classification unit may specifically include:
the construction subunit is used for respectively constructing a texture feature map corresponding to the image to be detected and a geometric feature map corresponding to the image to be detected according to the texture feature, the geometric feature and the similarity function;
and the classification subunit is used for classifying the detection frames based on the texture feature map and the geometric feature map to obtain the classified detection frames.
Optionally, in some embodiments, the building subunit may specifically be configured to: calculating texture feature points corresponding to the image to be detected through texture features and a similarity function, constructing a texture feature map corresponding to the image to be detected based on the texture feature points, calculating geometric feature points corresponding to the image to be detected through geometric features and the similarity function, and constructing a geometric feature map corresponding to the image to be detected based on the geometric feature points.
Optionally, in some embodiments, the classification subunit is specifically configured to: and fusing the texture feature map and the geometric feature map to obtain a fused feature map, predicting the category of the detection frame according to the fused feature map, and classifying the detection frame based on the prediction result to obtain a classified detection frame.
A detection module 306, configured to perform text detection on the image to be detected based on the classified detection boxes.
For example, specifically, the detection module 306 may construct a text box for text detection based on the classified detection box, and then perform text detection on the image to be detected according to the constructed text box.
Optionally, in some embodiments, the detection module may specifically include:
the determining unit is used for determining the classified detection frames belonging to the same category as a homologous group;
the construction unit is used for constructing a text box for text detection according to the classified detection boxes in the homologous group;
and the detection unit is used for performing text detection on the image to be detected based on the text box.
Optionally, in some embodiments, the construction unit is specifically configured to: determining a central point corresponding to each classified detection box in the homologous group, acquiring the size corresponding to each classified detection box in the homologous group, and constructing a text box for text detection based on the central point and the size.
Optionally, in some embodiments, referring to fig. 3b, the apparatus further includes an adjusting module 307, where the adjusting module 307 is configured to adjust the edges of the text box to obtain an adjusted text box. In this case, the detection module 306 is specifically configured to perform text detection on the image to be detected based on the adjusted text box.
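The description leaves open how the adjusting module 307 adjusts the edges. One illustrative choice, padding each edge by a small margin and clipping to the image bounds, is sketched below; both the margin and the clipping rule are assumptions.

import numpy as np

def adjust_text_box(box, margin=2.0, image_size=None):
    # Pad every edge of the text box by `margin` pixels, then clip to
    # the image bounds if a (width, height) pair is given.
    x1, y1 = box[0] - margin, box[1] - margin
    x2, y2 = box[2] + margin, box[3] + margin
    if image_size is not None:
        w, h = image_size
        x1, y1 = max(0.0, x1), max(0.0, y1)
        x2, y2 = min(float(w), x2), min(float(h), y2)
    return np.array([x1, y1, x2, y2])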
In the embodiment of the present invention, the first obtaining module 301 obtains an image to be detected, where the image to be detected includes a text to be detected and the text to be detected includes a plurality of text elements. The constructing module 302 constructs a detection frame corresponding to each text element in the image to be detected. The extracting module 303 then extracts the texture features and geometric features of the region corresponding to each detection frame, and the second obtaining module 304 obtains the association relations between the detection frames. Next, the classifying module 305 classifies the detection frames according to the association relations, the texture features and the geometric features to obtain the classified detection frames, and finally the detecting module 306 performs text detection on the image to be detected based on the classified detection frames. Compared with existing text detection schemes, the extraction module 303 of this apparatus extracts both the texture features and the geometric features of each detection frame's region, and the second obtaining module 304 obtains the association relations between the detection frames, so that when the text is curved, the classification module 305 can still classify the detection frames correctly and the detection module 306 can perform text detection based on the classified detection frames, avoiding the situation in which a detection box cannot accurately cover all text regions. In addition, for long text lines, since the scheme constructs a detection frame corresponding to each text element in the image to be detected, missing boxes and incomplete predictions during detection can be avoided; the scheme can therefore improve the accuracy of text detection.
In addition, an embodiment of the present invention further provides an electronic device. Fig. 4 shows a schematic structural diagram of the electronic device according to the embodiment of the present invention. Specifically:
the electronic device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the configuration shown in fig. 4 does not constitute a limitation of the electronic device: the electronic device may include more or fewer components than those shown, may combine certain components, or may use a different arrangement of components. Wherein:
the processor 401 is the control center of the electronic device and connects the various parts of the whole electronic device through various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, the processor 401 performs the various functions of the electronic device and processes data, thereby monitoring the electronic device as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles the operating system, user interfaces, and application programs, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption management functions are implemented through the power management system. The power supply 403 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring an image to be detected; constructing a detection frame corresponding to each text element in the image to be detected; respectively extracting the texture features and geometric features of the region corresponding to each detection frame; obtaining the association relations between the detection frames; classifying the detection frames according to the association relations, the texture features and the geometric features to obtain classified detection frames; and performing text detection on the image to be detected based on the classified detection frames.
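Stringing the helpers sketched earlier together, the functions the processor implements can be pictured as the following skeleton. Every callable passed in (detector, tex_extractor, geo_extractor, associate, classifier) is a hypothetical stand-in for a trained component; the skeleton only fixes the order of operations listed above.

import numpy as np

def detect_text(image, detector, tex_extractor, geo_extractor,
                associate, classifier):
    boxes = detector(image)             # one detection frame per text element
    tex = tex_extractor(image, boxes)   # texture features per frame region
    geo = geo_extractor(image, boxes)   # geometric features per frame region
    links = associate(boxes)            # association relations between frames
    sim = similarity_matrix(boxes, links)
    labels = classifier(tex, geo, sim)  # classified detection frames
    # One text box per homologous group, with adjusted edges.
    return [adjust_text_box(text_box_from_group(boxes[labels == c]),
                            image_size=(image.shape[1], image.shape[0]))
            for c in np.unique(labels)]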
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
After the image to be detected is obtained, where the image to be detected includes a text to be detected and the text to be detected includes a plurality of text elements, a detection frame corresponding to each text element is constructed in the image to be detected. The texture features and geometric features of the region corresponding to each detection frame are then extracted, and the association relations between the detection frames are obtained. Next, the detection frames are classified according to the association relations, the texture features and the geometric features to obtain the classified detection frames, and finally text detection is performed on the image to be detected based on the classified detection frames. Compared with existing text detection schemes, this scheme extracts both the texture features and the geometric features of each detection frame's region and obtains the association relations between the detection frames, so that when the text is curved, the detection frames can still be classified according to the association relations, the texture features and the geometric features, and text detection can then be performed on the image to be detected based on the classified detection frames, avoiding the situation in which a detection box cannot accurately cover all text regions. In addition, for long text lines, since the scheme constructs a detection frame corresponding to each text element in the image to be detected, missing boxes and incomplete predictions during detection can be avoided; the scheme can therefore improve the accuracy of text detection.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the text detection methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring an image to be detected; constructing a detection frame corresponding to each text element in the image to be detected; respectively extracting the texture features and geometric features of the region corresponding to each detection frame; obtaining the association relations between the detection frames; classifying the detection frames according to the association relations, the texture features and the geometric features to obtain classified detection frames; and performing text detection on the image to be detected based on the classified detection frames.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any text detection method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any text detection method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The text detection method and apparatus, electronic device, and storage medium provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (15)

1. A text detection method, comprising:
acquiring an image to be detected, wherein the image to be detected comprises a text to be detected, and the text to be detected comprises a plurality of text elements;
constructing a detection frame corresponding to each text element in the image to be detected;
respectively extracting texture features and geometric features of the corresponding region of each detection frame, and acquiring association relations among the detection frames;
classifying the detection frames according to the association relations, the texture features and the geometric features to obtain classified detection frames;
and performing text detection on the image to be detected based on the classified detection frames.
2. The method of claim 1, wherein the classifying the detection frames according to the association relations, the texture features and the geometric features to obtain classified detection frames comprises:
calculating a similarity function corresponding to each detection frame according to the association relations;
classifying the detection frames based on the texture features, the geometric features and the similarity function to obtain the classified detection frames.
3. The method of claim 2, wherein the classifying the detection frames based on the texture features, the geometric features and the similarity function to obtain the classified detection frames comprises:
respectively constructing a texture feature map corresponding to the image to be detected and a geometric feature map corresponding to the image to be detected according to the texture feature, the geometric feature and the similarity function;
and classifying the detection frames based on the texture feature map and the geometric feature map to obtain the classified detection frames.
4. The method according to claim 3, wherein the constructing a texture feature map corresponding to the image to be detected and a geometric feature map corresponding to the image to be detected according to the texture feature, the geometric feature and the similarity function respectively comprises:
calculating texture feature points corresponding to the image to be detected through the texture features and the similarity function;
constructing a texture feature map corresponding to the image to be detected based on the texture feature points;
calculating geometric feature points corresponding to the image to be detected through the geometric features and the similarity function;
and constructing a geometric feature map corresponding to the image to be detected based on the geometric feature points.
5. The method of claim 3, wherein the classifying the detection frames based on the texture feature map and the geometric feature map to obtain the classified detection frames comprises:
fusing the texture feature map and the geometric feature map to obtain a fused feature map;
predicting the category of the detection frame through the fused feature map;
and classifying the detection frames based on the prediction result to obtain the classified detection frames.
6. The method according to any one of claims 1 to 5, wherein the performing text detection on the image to be detected based on the classified detection frames comprises:
determining the classified detection frames belonging to the same category as a homologous group;
constructing a text box for text detection according to the classified detection boxes in the homologous group;
and performing text detection on the image to be detected based on the text box.
7. The method of claim 6, wherein constructing the text box for text detection according to the classified detection boxes in the homology group comprises:
determining a central point corresponding to each classified detection frame in the homologous group;
obtaining the corresponding size of each classified detection frame in the homologous group;
and constructing a text box for text detection based on the central point and the size.
8. The method of claim 7, wherein after the constructing a text box for text detection based on the central point and the size, the method further comprises:
adjusting the edge of the text box to obtain an adjusted text box;
the performing text detection on the image to be detected based on the text box comprises: performing text detection on the image to be detected based on the adjusted text box.
9. The method according to any one of claims 1 to 5, wherein constructing a detection box corresponding to each text element in the image to be detected comprises:
performing semantic segmentation on the image to be detected to obtain target pixel points corresponding to each text element and pixel association information corresponding to each target pixel point;
and constructing a detection frame corresponding to each text element based on the pixel correlation information and the plurality of target pixel points.
10. A text detection apparatus, comprising:
a first acquisition module, configured to acquire an image to be detected, wherein the image to be detected comprises a text to be detected, and the text to be detected comprises a plurality of text elements;
the construction module is used for constructing a detection frame corresponding to each text element in the image to be detected;
the extraction module is used for respectively extracting the texture features and the geometric features of the corresponding region of each detection frame;
the second acquisition module is used for acquiring the association relation among the detection frames;
the classification module is used for classifying the detection frames according to the association relation, the texture features and the geometric features to obtain the classified detection frames;
and the detection module is used for performing text detection on the image to be detected based on the classified detection frames.
11. The apparatus of claim 10, wherein the classification module comprises:
the calculating unit is used for calculating the similarity function corresponding to each detection frame according to the association relation;
and the classification unit is used for classifying the detection frames based on the texture features, the geometric features and the similarity function to obtain the classified detection frames.
12. The apparatus of claim 11, wherein the classification unit comprises:
the construction subunit is used for respectively constructing a texture feature map corresponding to the image to be detected and a geometric feature map corresponding to the image to be detected according to the texture feature, the geometric feature and the similarity function;
and the classification subunit is used for classifying the detection frames based on the texture feature map and the geometric feature map to obtain the classified detection frames.
13. The apparatus according to claim 12, wherein the construction subunit is specifically configured to:
calculate texture feature points corresponding to the image to be detected through the texture features and the similarity function;
construct a texture feature map corresponding to the image to be detected based on the texture feature points;
calculate geometric feature points corresponding to the image to be detected through the geometric features and the similarity function;
and construct a geometric feature map corresponding to the image to be detected based on the geometric feature points.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the text detection method according to any of claims 1-9 are implemented when the program is executed by the processor.
15. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, carries out the steps of the text detection method according to any one of claims 1 to 9.
CN201911330293.9A 2019-12-20 2019-12-20 Text detection method and device, electronic equipment and storage medium Pending CN111126389A (en)

Priority Applications (1)

CN201911330293.9A — priority and filing date 2019-12-20 — Text detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN111126389A (en) — Publication Date: 2020-05-08

Family

ID=70501539

Country Status (1)

Country: CN — CN111126389A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022017299A1 (en) * 2020-07-24 2022-01-27 北京字节跳动网络技术有限公司 Text inspection method and apparatus, electronic device, and storage medium
CN111738233A (en) * 2020-08-07 2020-10-02 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN113077484A (en) * 2021-03-30 2021-07-06 中国人民解放军战略支援部队信息工程大学 Image instance segmentation method
CN113033558A (en) * 2021-04-19 2021-06-25 深圳市华汉伟业科技有限公司 Text detection method and device for natural scene and storage medium
CN113033558B (en) * 2021-04-19 2024-03-19 深圳市华汉伟业科技有限公司 Text detection method and device for natural scene and storage medium
CN113361238A (en) * 2021-05-21 2021-09-07 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN113361238B (en) * 2021-05-21 2022-02-11 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN114511864A (en) * 2022-04-19 2022-05-17 腾讯科技(深圳)有限公司 Text information extraction method, target model acquisition method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination