CN114972947B

CN114972947B - Depth scene text detection method and device based on fuzzy semantic modeling

Info

Publication number: CN114972947B
Application number: CN202210882622.6A
Authority: CN
Inventors: 王芳芳; 徐晓刚; 李萧缘; 王军; 曹卫强
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2022-07-26
Filing date: 2022-07-26
Publication date: 2022-12-06
Anticipated expiration: 2042-07-26
Also published as: CN114972947A

Abstract

The invention discloses a depth scene text detection method and a device based on fuzzy semantic modeling, wherein the method comprises the following steps: acquiring a plurality of groups of image data sets with truth value labels for training scene text detection; secondly, performing feature learning and global feature fusion on the images in the data set to obtain a fused global feature map; step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, performing multi-branch joint optimization under full supervision, and completing construction of an end-to-end joint learning framework; using an end-to-end joint learning framework to predict fuzzy semantic information in the image, and obtaining a text attribute image by utilizing reliability analysis and fusion; and step five, carrying out binarization and communication domain extraction on the text attribute graph to obtain a final text detection result. The method is simple, flexible, robust and wide in application range.

Description

Depth scene text detection method and device based on fuzzy semantic modeling

Technical Field

The invention belongs to the field of computer vision, and relates to a depth scene text detection method and device based on fuzzy semantic modeling.

Background

Scene text detection is defined as the following problem: the text region positions of multi-direction, multi-language, curved or irregular shapes are found in the natural scene image. Due to the effectiveness of statistical modeling, current learning-based methods are increasingly applied to scene text detection tasks. The existing learning-based method mainly adopts a deep learning framework, inputs an image and outputs a detected text region.

In recent years, it has been widely used in the field of computer vision tasks such as scene understanding, image retrieval, and the like. The computer vision task has two key points: the first point is how to mine semantic information at the text pixel level from bottom to top so as to be able to adapt to a wide variety of text shapes; the second point is how to model semantic ambiguity of text region edges to solve the problem that adjacent instances are difficult to distinguish due to the clustering distribution between homogeneous textures inside the text and the text.

Disclosure of Invention

Aiming at the first point, the invention considers that a semantic segmentation frame is utilized, and the pixel-level semantic information can be effectively mined by performing global feature fusion and end-to-end feature learning through a feature extraction network and a feature pyramid network; aiming at the second point, the text instance boundary region is considered to have unique semantic characteristics, two semantic information of the text and the instance boundary are mined and semantic reliability analysis is carried out, so that the boundaries of different text targets can be found and distinguished more accurately, and the specific technical scheme is as follows:

a depth scene text detection method based on fuzzy semantic modeling comprises the following steps:

acquiring a plurality of groups of image data sets with truth value labels for training scene text detection;

step two, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map;

step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, performing multi-branch joint optimization under full supervision, and completing construction of an end-to-end joint learning framework;

step four, using an end-to-end joint learning frame to predict fuzzy semantic information in the image, and obtaining a text attribute map by utilizing reliability analysis and fusion;

and step five, carrying out binarization and communication domain extraction on the obtained text attribute graph to obtain a final text detection result.

Further, the second step specifically includes the following substeps:

(2.1) extracting depth features of each image on different scales by using a full convolution network and a feature pyramid network;

and (2.2) extracting and fusing the depth feature maps on different scales by using convolution operation and splicing operation to obtain a fused global feature map.

Further, the third step specifically includes the following sub-steps:

(3.1) fusion-based Global feature mapEstablishing 4 prediction branches with consistent structures, wherein each prediction branch comprises three layers of convolution operation, and predicting text semantic category score of each pixel position on the global feature map

Instance boundary semantic category score

Text reliability value

And example demarcation reliability value

；

(3.2) learning and optimizing the text segmentation graph, the example boundary segmentation graph, the text reliability graph and the example boundary reliability graph generated by the prediction branch so as to establish an end-to-end joint learning framework, wherein the overall framework loss function is as follows:

+

+

,

wherein

And

for the smooth L1 loss function,

and with

Is a normalized focal loss function.

Further, the fourth step is specifically:

inputting a to-be-predicted image based on the end-to-end joint learning framework established from the first step to the third step, and obtaining a text segmentation graph T and a text reliability graph through the minimum loss function learning

Example boundary segmentation graph S and example boundary reliability graph

And performing reliability analysis by using the four output graphs and fusing the four output graphs into a final text attribute graph M:

wherein

To balance the weighting coefficients of the branch intervals.

Further, the fifth step is specifically:

carrying out contour discovery, namely binarization and communication domain extraction on the text attribute graph output in the step four to obtain coordinate representation of the text region:

wherein,

in order to be a function of the binarization,

is communicated withAnd an extraction function, wherein each text instance uses a set of points

It is shown that,

and

respectively representing the abscissa and ordinate of the mth coordinate of a text region, and N represents the number of coordinate points.

Further, generating a text true value graph, an example boundary true value graph, a text reliability true value graph and an example boundary reliability true value graph by expanding, intersecting and truncating a distance function of the text region by utilizing the coordinates of the text region;

the text true value graph specifically includes: performing binarization filling by using truth value data of text region coordinates, wherein the inside filling of the text region is 1, the background is 0, and the truth value data is used as a truth value map of the text region

；

The example boundary truth diagram specifically includes: the text outline is adaptively expanded by taking 1/5 of line height as an expansion parameter according to the scale of the text outline, similar text instances after expansion are overlapped, and an overlapping area is defined as a truth map of an instance boundary area

；

The text reliability truth value graph and the example boundary reliability truth value graph are specifically as follows: calculating semantic reliability of pixel positions by a truncation function aiming at the edges of the text region and the example boundary region to obtain a reliability true value graph of the boundary between the text and the example

And

the cutoff function is:

wherein

Measuring pixel position

Semantic boundary closest to it

The Euclidean distance between the two electrodes,

is the pixel position

The label of the binary value at (a),

is a truncation threshold, and is also a normalization coefficient,

absolute value of (2) represents position

The reliability of the pixel is measured, and the sign is used to distinguish the semantic tendency.

A depth scene text detection device based on fuzzy semantic modeling comprises one or more processors and is used for realizing the depth scene text detection method based on fuzzy semantic modeling.

A computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the method for deep scene text detection based on fuzzy semantic modeling.

Compared with the existing scene text detection method, the method has the following beneficial effects:

firstly, the scene text detection method solves the problem that adjacent examples are difficult to distinguish due to the aggregation distribution between the internal homogenization textures of the texts and the texts in the scene text detection method in any form from bottom to top from the aspect of redundancy removal, and finds and explores two fuzzy semantics of the boundaries of the texts and the examples in natural images;

secondly, the reliability modeling is carried out on the fuzzy semantic boundary, the competition problem on the semantic boundary is solved through reliability analysis, and the final semantic attribute is judged, so that a clear and complete example boundary is obtained, and the scene text detection effect is improved;

finally, the scene text detection method is used as a simple and direct lightweight frame based on semantic segmentation, a final detection result is obtained at one time in a communication domain extraction mode, any iterative or other complex post-processing steps are not needed, and the effect of the method exceeds that of a plurality of multi-stage methods based on segmentation;

the method has good application value in scenes such as scene understanding, automatic driving and the like, for example, in an automatic driving task, texts in the scenes contain a large amount of information for helping to understand the scenes and assisting in driving, and the accurate detection of the positions of the texts is based on the scene text information.

Drawings

FIG. 1 is a schematic flow diagram of a depth scene text detection method based on fuzzy semantic modeling according to the present invention;

FIGS. 2 a-2 c are schematic diagrams of original images according to an embodiment of the present invention;

FIG. 3 is a framework diagram of a learning network of the present invention;

FIG. 4 is a diagram illustrating the detection effect of the semantic segmentation framework on the random form text in the natural scene image according to the embodiment of the present invention;

fig. 5 is a schematic structural diagram of a deep scene text detection device based on fuzzy semantic modeling according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.

In scene text detection, the geometric attributes of a text are variable and semantic boundaries are fuzzy, so that a pixel level text region from bottom to top is found based on a convolutional neural network, semantic information and reliability of a text and an example boundary are modeled, finally, network optimization is performed through an effective training strategy, and specifically, a deep scene text detection method and a deep scene text detection device based on fuzzy semantic modeling are provided, two mutually competing fuzzy boundary semantic categories, namely the text and the example boundary, are mined in a natural scene image, and scene text target detection in any shape is performed from the perspective of redundancy removal through pixel level multi-label classification and fuzzy semantic reliability analysis; the method uses a one-stage deep learning segmentation framework, utilizes the focus loss function at the cross-image pixel level to carry out network optimization, and has the advantages of simple and convenient implementation method, flexibility, robustness and wide application range.

In more detail, as shown in fig. 1, the method for detecting the text of the depth scene based on the fuzzy semantic modeling includes the following steps:

step one, acquiring a plurality of groups of image data sets with true value labels for training scene text detection. Specifically, the present invention is implemented on three data sets with truth labels, which are:

SCUT-CTW1500 dataset: the data set contains 1000 training images, 500 test images;

TotalText dataset: the data set contained 1255 training images, 300 test images;

ICDAR-ArT dataset: the data set contained 5603 training images, 4563 test images.

And step two, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map.

The specific implementation method of the step comprises the following steps:

And step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, and performing multi-branch joint optimization under full supervision to complete the construction of an end-to-end joint learning framework.

The specific implementation method of the step comprises the following steps:

(3.1) establishing 4 prediction branches with consistent structures based on the fused global feature map, wherein each prediction branch comprises three layers of convolution operation, and predicting text semantic category score of each pixel position on the global feature map

Instance boundary semantic category score

Text reliability value

And an example boundary reliability value

；

(3.2) learning and optimizing the text segmentation graph, the example boundary segmentation graph, the text reliability graph and the example boundary reliability graph generated by the prediction branches so as to establish an end-to-end joint learning framework, wherein the overall framework loss function is as follows:

+

+

,

wherein

And

for the smooth L1 loss function,

and

to normalize the focus loss function, take the segmentation loss function at the positive sample point as an example:

wherein

The total number of all sample pixel points in the current image,

the dynamic mean of the training weights of the positive sample points in all the images that are currently processed,

as the current position

The value of the reliability of the (d) is,

a probability value predicted for the current location,

for the number of all positive sample points of the current image, the weight

。

And step four, using an end-to-end joint learning framework to predict fuzzy semantic information in the image, and obtaining a text attribute map by utilizing reliability analysis and fusion.

The specific implementation method of the step comprises the following steps:

Example boundary segmentation graph S and example boundary reliability graph

wherein

To balance the weighting coefficients of the branch intervals.

The specific implementation method of the step comprises the following steps:

wherein,

in order to be a function of the binarization,

for the Unicom and extract function, each text instance uses a set of points

It is shown that,

and

respectively represent the abscissa and ordinate of the mth coordinate of one text region, and N represents the number of coordinate points.

Finally, generating four true value graphs of text, example boundary, text reliability and example boundary reliability by using the coordinate information of the text area and through modes of text area expansion intersection, distance truncation function and the like, wherein the four true value graphs are as follows:

performing binarization filling by using truth value data of text region coordinates, wherein the inside filling of the text region is 1, the background is 0, and the truth value data is used as a truth value map of the text region

；

The text outline is adaptively expanded by taking 1/5 of line height as an expansion parameter according to the scale of the text outline, similar text instances after expansion are overlapped, and an overlapping area is defined as a truth map of an instance boundary area

；

Calculating semantic reliability of pixel positions by a truncation function aiming at the edges of the text region and the example boundary region to obtain a reliability true value graph of the boundary between the text and the example

And

the cutoff function is:

wherein

Measuring pixel position

Semantic boundary closest to it

The euclidean distance between them,

is the pixel position

The label of the binary value at (a),

is a cut-off threshold value and is also a normalization coefficient, in the method experiment of the invention

= 10。

Represents a position by an absolute value of

The reliability of the pixel is determined, and the sign is used to distinguish the semantic tendency.

The above-described method is applied to specific embodiments below so that those skilled in the art can better understand the effects of the present invention.

The implementation method of this embodiment is as described above, and specific steps are not elaborated, and the effect is shown only for case data. The invention is implemented on three data sets with truth labels, which are respectively as follows:

SCUT-CTW1500 dataset: the data set contained 1000 training images, 500 test images.

TotalText dataset: the data set contained 1255 training images and 300 test images.

This example performs the experiment on each data set separately, and the images in the data sets are shown in fig. 2a to 2c, for example.

The main flow of text detection is as follows:

1) Extracting a multi-scale feature map of the image through a full convolution network and a feature pyramid structure network;

2) Extracting and fusing each scale feature map by using convolution operation and splicing operation to obtain a global feature map;

3) Four prediction branches are constructed on the global feature map, and the text semantic classification score, the instance boundary semantic classification score, the text semantic reliability and the instance boundary semantic reliability of each sample point on the feature map are respectively predicted;

4) Jointly optimizing semantic classification and reliability regression branches;

5) Predicting semantic and reliability information by using the learning framework to obtain a text attribute graph;

6) And (3) carrying out binarization and communication domain extraction on the text attribute graph to obtain a final text detection result, wherein the overall learning network framework is shown as fig. 3, and the detection effect graph of the random form text in the natural scene image is shown as fig. 4.

In order to comprehensively compare the effectiveness of the method, other advanced methods are compared, and effectiveness analysis is performed on three operations of example boundary segmentation, reliability analysis and normalized focal loss function provided by the method. The accuracy (precision), recall (recall) and overall performance (F-measure) of the detection results of this embodiment are shown in tables 1 to 3, where F-measure represents the balanced overall performance between accuracy and recall:

. The data in the table show the performance of the method on three indexes of precision, call and F-measure, and compared with other methods based on a semantic segmentation framework and other methods based on a regression framework, the method has further improvement on the whole.

Table 1 shows the evaluation indexes of the SCUT-CTW1500 data set in the present example:

table 2 shows the evaluation indexes of TotalText data set in this example:

table 3 shows the evaluation indexes of the ICDAR-ArT data set of the present example:

table 4 shows the example boundary segmentation, reliability analysis, and effectiveness analysis of the normalized focal loss function proposed in this embodiment:

through the technical scheme, the deep scene text detection method based on the fuzzy semantic modeling is provided based on the deep learning technology. The invention can mine the semantic information and semantic reliability of the boundary of the text and the instance on various real image data, thereby obtaining an accurate detection result.

Corresponding to the embodiment of the depth scene text detection method based on the fuzzy semantic modeling, the invention also provides an embodiment of a depth scene text detection device based on the fuzzy semantic modeling.

Referring to fig. 5, the depth scene text detection apparatus based on fuzzy semantic modeling provided in the embodiment of the present invention includes one or more processors, which are configured to implement the depth scene text detection method based on fuzzy semantic modeling in the above embodiment.

The depth scene text detection device based on fuzzy semantic modeling of the embodiment of the invention can be applied to any equipment with data processing capability, such as computers and other equipment or devices. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a device in a logical sense, a processor of any device with data processing capability reads corresponding computer program instructions in the nonvolatile memory into the memory for operation. From a hardware aspect, as shown in fig. 5, a hardware structure diagram of any device with data processing capability where the deep scene text detection apparatus based on fuzzy semantic modeling is located according to the present invention is shown, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 5, in the embodiment, any device with data processing capability where the apparatus is located may also include other hardware according to the actual function of the any device with data processing capability, which is not described again.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.

The embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the method for detecting a text in a deep scene based on fuzzy semantic modeling in the foregoing embodiments is implemented.

The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be an external storage device such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing capable device, and may also be used for temporarily storing data that has been output or is to be output.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and the like which come within the spirit and principles of the invention are desired to be protected.

Claims

1. A depth scene text detection method based on fuzzy semantic modeling is characterized by comprising the following steps:

secondly, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map;

step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, performing multi-branch joint optimization under full supervision, and completing construction of an end-to-end joint learning framework, wherein the method specifically comprises the following substeps:

Instance boundary semantic category score

Text reliability value

And an example boundary reliability value

；

+

+

,

wherein

And

for the smooth L1 loss function,

and with

As a function of normalized focal loss;

using an end-to-end joint learning framework to predict fuzzy semantic information in the image, and obtaining a text attribute image by utilizing reliability analysis and fusion, wherein the method specifically comprises the following steps: inputting a to-be-predicted image based on the end-to-end joint learning framework established from the first step to the third step, and obtaining a text segmentation graph T and a text reliability graph through the minimum loss function learning

Example boundary segmentation graph S and example boundary reliability graph

And then, carrying out reliability analysis and fusing the reliability analysis and the reliability analysis into a final text attribute graph M:

wherein

Is the weighting coefficient of the balanced branch interval;

step five, carrying out binarization and communication domain extraction on the obtained text attribute graph to obtain a final text detection result, which specifically comprises the following steps: and (3) carrying out contour discovery, namely binarization and communication domain extraction on the text attribute graph output in the step four to obtain coordinate representation of the text region:

wherein,

in order to be a function of the binarization,

for the Union and extraction function, each text instance uses a set of points

It is shown that the process of the present invention,

and

2. The method for detecting the text of the depth scene based on the fuzzy semantic modeling as claimed in claim 1, wherein the second step specifically comprises the following sub-steps:

3. The method as claimed in claim 1, wherein the text detection method based on fuzzy semantic modeling is characterized in that a text true value map, an example boundary true value map, a text reliability true value map and an example boundary reliability true value map are generated by expanding intersection and truncating distance functions of the text region by using coordinates of the text region;

the text true value graph specifically comprises the following steps: performing binarization filling by using truth value data of text region coordinates, wherein the inside filling of the text region is 1, the background is 0, and the truth value data is used as a truth value map of the text region

；

The example boundary truth diagram specifically includes: text outlines are adaptively expanded according to the scale of the text outlines, 1/5 of line height is used as an expansion parameter, similar text examples after expansion are overlapped, and an overlapping area is defined as a truth map of an example boundary area

；

The text reliability truth map and the example boundary reliability truth map are specifically as follows: calculating semantic reliability of pixel positions by a truncation function aiming at the edges of the text region and the example boundary region to obtain a reliability true value graph of the boundary between the text and the example

And

the cutoff function is:

wherein

Measuring pixel position

Semantic boundary closest to it

The Euclidean distance between the two electrodes,

is the pixel position

The label of the binary value at (a),

is a truncation threshold, and is also a normalization coefficient,

represents a position by an absolute value of

4. A deep scene text detection device based on fuzzy semantic modeling, which is characterized by comprising one or more processors and is used for realizing the deep scene text detection method based on fuzzy semantic modeling according to any one of claims 1 to 3.

5. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements the method for detecting text in a deep scene based on fuzzy semantic modeling according to any one of claims 1 to 3.