CN114972947B - Depth scene text detection method and device based on fuzzy semantic modeling - Google Patents

Depth scene text detection method and device based on fuzzy semantic modeling Download PDF

Info

Publication number
CN114972947B
CN114972947B CN202210882622.6A CN202210882622A CN114972947B CN 114972947 B CN114972947 B CN 114972947B CN 202210882622 A CN202210882622 A CN 202210882622A CN 114972947 B CN114972947 B CN 114972947B
Authority
CN
China
Prior art keywords
text
reliability
semantic
graph
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210882622.6A
Other languages
Chinese (zh)
Other versions
CN114972947A (en
Inventor
王芳芳
徐晓刚
李萧缘
王军
曹卫强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202210882622.6A priority Critical patent/CN114972947B/en
Publication of CN114972947A publication Critical patent/CN114972947A/en
Application granted granted Critical
Publication of CN114972947B publication Critical patent/CN114972947B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/28Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a depth scene text detection method and a device based on fuzzy semantic modeling, wherein the method comprises the following steps: acquiring a plurality of groups of image data sets with truth value labels for training scene text detection; secondly, performing feature learning and global feature fusion on the images in the data set to obtain a fused global feature map; step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, performing multi-branch joint optimization under full supervision, and completing construction of an end-to-end joint learning framework; using an end-to-end joint learning framework to predict fuzzy semantic information in the image, and obtaining a text attribute image by utilizing reliability analysis and fusion; and step five, carrying out binarization and communication domain extraction on the text attribute graph to obtain a final text detection result. The method is simple, flexible, robust and wide in application range.

Description

Depth scene text detection method and device based on fuzzy semantic modeling
Technical Field
The invention belongs to the field of computer vision, and relates to a depth scene text detection method and device based on fuzzy semantic modeling.
Background
Scene text detection is defined as the following problem: the text region positions of multi-direction, multi-language, curved or irregular shapes are found in the natural scene image. Due to the effectiveness of statistical modeling, current learning-based methods are increasingly applied to scene text detection tasks. The existing learning-based method mainly adopts a deep learning framework, inputs an image and outputs a detected text region.
In recent years, it has been widely used in the field of computer vision tasks such as scene understanding, image retrieval, and the like. The computer vision task has two key points: the first point is how to mine semantic information at the text pixel level from bottom to top so as to be able to adapt to a wide variety of text shapes; the second point is how to model semantic ambiguity of text region edges to solve the problem that adjacent instances are difficult to distinguish due to the clustering distribution between homogeneous textures inside the text and the text.
Disclosure of Invention
Aiming at the first point, the invention considers that a semantic segmentation frame is utilized, and the pixel-level semantic information can be effectively mined by performing global feature fusion and end-to-end feature learning through a feature extraction network and a feature pyramid network; aiming at the second point, the text instance boundary region is considered to have unique semantic characteristics, two semantic information of the text and the instance boundary are mined and semantic reliability analysis is carried out, so that the boundaries of different text targets can be found and distinguished more accurately, and the specific technical scheme is as follows:
a depth scene text detection method based on fuzzy semantic modeling comprises the following steps:
acquiring a plurality of groups of image data sets with truth value labels for training scene text detection;
step two, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map;
step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, performing multi-branch joint optimization under full supervision, and completing construction of an end-to-end joint learning framework;
step four, using an end-to-end joint learning frame to predict fuzzy semantic information in the image, and obtaining a text attribute map by utilizing reliability analysis and fusion;
and step five, carrying out binarization and communication domain extraction on the obtained text attribute graph to obtain a final text detection result.
Further, the second step specifically includes the following substeps:
(2.1) extracting depth features of each image on different scales by using a full convolution network and a feature pyramid network;
and (2.2) extracting and fusing the depth feature maps on different scales by using convolution operation and splicing operation to obtain a fused global feature map.
Further, the third step specifically includes the following sub-steps:
(3.1) fusion-based Global feature mapEstablishing 4 prediction branches with consistent structures, wherein each prediction branch comprises three layers of convolution operation, and predicting text semantic category score of each pixel position on the global feature map
Figure 884670DEST_PATH_IMAGE001
Instance boundary semantic category score
Figure 270652DEST_PATH_IMAGE002
Text reliability value
Figure 519231DEST_PATH_IMAGE003
And example demarcation reliability value
Figure 383282DEST_PATH_IMAGE004
(3.2) learning and optimizing the text segmentation graph, the example boundary segmentation graph, the text reliability graph and the example boundary reliability graph generated by the prediction branch so as to establish an end-to-end joint learning framework, wherein the overall framework loss function is as follows:
Figure 182609DEST_PATH_IMAGE005
+
Figure 208334DEST_PATH_IMAGE006
+
Figure 944209DEST_PATH_IMAGE007
,
wherein
Figure 611950DEST_PATH_IMAGE008
And
Figure 749671DEST_PATH_IMAGE009
for the smooth L1 loss function,
Figure 8614DEST_PATH_IMAGE010
and with
Figure 231785DEST_PATH_IMAGE011
Is a normalized focal loss function.
Further, the fourth step is specifically:
inputting a to-be-predicted image based on the end-to-end joint learning framework established from the first step to the third step, and obtaining a text segmentation graph T and a text reliability graph through the minimum loss function learning
Figure 437638DEST_PATH_IMAGE012
Example boundary segmentation graph S and example boundary reliability graph
Figure 429865DEST_PATH_IMAGE013
And performing reliability analysis by using the four output graphs and fusing the four output graphs into a final text attribute graph M:
Figure 100002_DEST_PATH_IMAGE014
wherein
Figure 30348DEST_PATH_IMAGE015
To balance the weighting coefficients of the branch intervals.
Further, the fifth step is specifically:
carrying out contour discovery, namely binarization and communication domain extraction on the text attribute graph output in the step four to obtain coordinate representation of the text region:
Figure 740815DEST_PATH_IMAGE016
wherein,
Figure DEST_PATH_IMAGE017
in order to be a function of the binarization,
Figure 953622DEST_PATH_IMAGE018
is communicated withAnd an extraction function, wherein each text instance uses a set of points
Figure 65934DEST_PATH_IMAGE019
It is shown that,
Figure 666680DEST_PATH_IMAGE020
and
Figure 864443DEST_PATH_IMAGE021
respectively representing the abscissa and ordinate of the mth coordinate of a text region, and N represents the number of coordinate points.
Further, generating a text true value graph, an example boundary true value graph, a text reliability true value graph and an example boundary reliability true value graph by expanding, intersecting and truncating a distance function of the text region by utilizing the coordinates of the text region;
the text true value graph specifically includes: performing binarization filling by using truth value data of text region coordinates, wherein the inside filling of the text region is 1, the background is 0, and the truth value data is used as a truth value map of the text region
Figure 412099DEST_PATH_IMAGE022
The example boundary truth diagram specifically includes: the text outline is adaptively expanded by taking 1/5 of line height as an expansion parameter according to the scale of the text outline, similar text instances after expansion are overlapped, and an overlapping area is defined as a truth map of an instance boundary area
Figure 877453DEST_PATH_IMAGE023
The text reliability truth value graph and the example boundary reliability truth value graph are specifically as follows: calculating semantic reliability of pixel positions by a truncation function aiming at the edges of the text region and the example boundary region to obtain a reliability true value graph of the boundary between the text and the example
Figure 117942DEST_PATH_IMAGE024
And
Figure 271843DEST_PATH_IMAGE025
the cutoff function is:
Figure 623189DEST_PATH_IMAGE026
wherein
Figure 444515DEST_PATH_IMAGE027
Measuring pixel position
Figure 652642DEST_PATH_IMAGE028
Semantic boundary closest to it
Figure 559418DEST_PATH_IMAGE029
The Euclidean distance between the two electrodes,
Figure 448877DEST_PATH_IMAGE030
is the pixel position
Figure 390288DEST_PATH_IMAGE028
The label of the binary value at (a),
Figure 471115DEST_PATH_IMAGE031
is a truncation threshold, and is also a normalization coefficient,
Figure 865187DEST_PATH_IMAGE032
absolute value of (2) represents position
Figure 292757DEST_PATH_IMAGE028
The reliability of the pixel is measured, and the sign is used to distinguish the semantic tendency.
A depth scene text detection device based on fuzzy semantic modeling comprises one or more processors and is used for realizing the depth scene text detection method based on fuzzy semantic modeling.
A computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the method for deep scene text detection based on fuzzy semantic modeling.
Compared with the existing scene text detection method, the method has the following beneficial effects:
firstly, the scene text detection method solves the problem that adjacent examples are difficult to distinguish due to the aggregation distribution between the internal homogenization textures of the texts and the texts in the scene text detection method in any form from bottom to top from the aspect of redundancy removal, and finds and explores two fuzzy semantics of the boundaries of the texts and the examples in natural images;
secondly, the reliability modeling is carried out on the fuzzy semantic boundary, the competition problem on the semantic boundary is solved through reliability analysis, and the final semantic attribute is judged, so that a clear and complete example boundary is obtained, and the scene text detection effect is improved;
finally, the scene text detection method is used as a simple and direct lightweight frame based on semantic segmentation, a final detection result is obtained at one time in a communication domain extraction mode, any iterative or other complex post-processing steps are not needed, and the effect of the method exceeds that of a plurality of multi-stage methods based on segmentation;
the method has good application value in scenes such as scene understanding, automatic driving and the like, for example, in an automatic driving task, texts in the scenes contain a large amount of information for helping to understand the scenes and assisting in driving, and the accurate detection of the positions of the texts is based on the scene text information.
Drawings
FIG. 1 is a schematic flow diagram of a depth scene text detection method based on fuzzy semantic modeling according to the present invention;
FIGS. 2 a-2 c are schematic diagrams of original images according to an embodiment of the present invention;
FIG. 3 is a framework diagram of a learning network of the present invention;
FIG. 4 is a diagram illustrating the detection effect of the semantic segmentation framework on the random form text in the natural scene image according to the embodiment of the present invention;
fig. 5 is a schematic structural diagram of a deep scene text detection device based on fuzzy semantic modeling according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
In scene text detection, the geometric attributes of a text are variable and semantic boundaries are fuzzy, so that a pixel level text region from bottom to top is found based on a convolutional neural network, semantic information and reliability of a text and an example boundary are modeled, finally, network optimization is performed through an effective training strategy, and specifically, a deep scene text detection method and a deep scene text detection device based on fuzzy semantic modeling are provided, two mutually competing fuzzy boundary semantic categories, namely the text and the example boundary, are mined in a natural scene image, and scene text target detection in any shape is performed from the perspective of redundancy removal through pixel level multi-label classification and fuzzy semantic reliability analysis; the method uses a one-stage deep learning segmentation framework, utilizes the focus loss function at the cross-image pixel level to carry out network optimization, and has the advantages of simple and convenient implementation method, flexibility, robustness and wide application range.
In more detail, as shown in fig. 1, the method for detecting the text of the depth scene based on the fuzzy semantic modeling includes the following steps:
step one, acquiring a plurality of groups of image data sets with true value labels for training scene text detection. Specifically, the present invention is implemented on three data sets with truth labels, which are:
SCUT-CTW1500 dataset: the data set contains 1000 training images, 500 test images;
TotalText dataset: the data set contained 1255 training images, 300 test images;
ICDAR-ArT dataset: the data set contained 5603 training images, 4563 test images.
And step two, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map.
The specific implementation method of the step comprises the following steps:
(2.1) extracting depth features of each image on different scales by using a full convolution network and a feature pyramid network;
and (2.2) extracting and fusing the depth feature maps on different scales by using convolution operation and splicing operation to obtain a fused global feature map.
And step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, and performing multi-branch joint optimization under full supervision to complete the construction of an end-to-end joint learning framework.
The specific implementation method of the step comprises the following steps:
(3.1) establishing 4 prediction branches with consistent structures based on the fused global feature map, wherein each prediction branch comprises three layers of convolution operation, and predicting text semantic category score of each pixel position on the global feature map
Figure 823096DEST_PATH_IMAGE001
Instance boundary semantic category score
Figure 576288DEST_PATH_IMAGE002
Text reliability value
Figure 457656DEST_PATH_IMAGE003
And an example boundary reliability value
Figure 954497DEST_PATH_IMAGE004
(3.2) learning and optimizing the text segmentation graph, the example boundary segmentation graph, the text reliability graph and the example boundary reliability graph generated by the prediction branches so as to establish an end-to-end joint learning framework, wherein the overall framework loss function is as follows:
Figure 401658DEST_PATH_IMAGE005
+
Figure 60173DEST_PATH_IMAGE006
+
Figure 933232DEST_PATH_IMAGE007
,
wherein
Figure 968184DEST_PATH_IMAGE008
And
Figure 473115DEST_PATH_IMAGE009
for the smooth L1 loss function,
Figure 302530DEST_PATH_IMAGE010
and
Figure 158491DEST_PATH_IMAGE011
to normalize the focus loss function, take the segmentation loss function at the positive sample point as an example:
Figure 997134DEST_PATH_IMAGE033
wherein
Figure 356571DEST_PATH_IMAGE034
The total number of all sample pixel points in the current image,
Figure DEST_PATH_IMAGE035
the dynamic mean of the training weights of the positive sample points in all the images that are currently processed,
Figure 888047DEST_PATH_IMAGE032
as the current position
Figure 464259DEST_PATH_IMAGE028
The value of the reliability of the (d) is,
Figure 106593DEST_PATH_IMAGE036
a probability value predicted for the current location,
Figure 586116DEST_PATH_IMAGE037
for the number of all positive sample points of the current image, the weight
Figure 22914DEST_PATH_IMAGE038
And step four, using an end-to-end joint learning framework to predict fuzzy semantic information in the image, and obtaining a text attribute map by utilizing reliability analysis and fusion.
The specific implementation method of the step comprises the following steps:
inputting a to-be-predicted image based on the end-to-end joint learning framework established from the first step to the third step, and obtaining a text segmentation graph T and a text reliability graph through the minimum loss function learning
Figure 587887DEST_PATH_IMAGE012
Example boundary segmentation graph S and example boundary reliability graph
Figure 502753DEST_PATH_IMAGE013
And performing reliability analysis by using the four output graphs and fusing the four output graphs into a final text attribute graph M:
Figure 836783DEST_PATH_IMAGE014
wherein
Figure 444482DEST_PATH_IMAGE015
To balance the weighting coefficients of the branch intervals.
And step five, carrying out binarization and communication domain extraction on the obtained text attribute graph to obtain a final text detection result.
The specific implementation method of the step comprises the following steps:
carrying out contour discovery, namely binarization and communication domain extraction on the text attribute graph output in the step four to obtain coordinate representation of the text region:
Figure 995287DEST_PATH_IMAGE016
wherein,
Figure 713844DEST_PATH_IMAGE017
in order to be a function of the binarization,
Figure 167959DEST_PATH_IMAGE018
for the Unicom and extract function, each text instance uses a set of points
Figure 743297DEST_PATH_IMAGE019
It is shown that,
Figure 282862DEST_PATH_IMAGE020
and
Figure 805111DEST_PATH_IMAGE021
respectively represent the abscissa and ordinate of the mth coordinate of one text region, and N represents the number of coordinate points.
Finally, generating four true value graphs of text, example boundary, text reliability and example boundary reliability by using the coordinate information of the text area and through modes of text area expansion intersection, distance truncation function and the like, wherein the four true value graphs are as follows:
performing binarization filling by using truth value data of text region coordinates, wherein the inside filling of the text region is 1, the background is 0, and the truth value data is used as a truth value map of the text region
Figure 848153DEST_PATH_IMAGE022
The text outline is adaptively expanded by taking 1/5 of line height as an expansion parameter according to the scale of the text outline, similar text instances after expansion are overlapped, and an overlapping area is defined as a truth map of an instance boundary area
Figure 797655DEST_PATH_IMAGE023
Calculating semantic reliability of pixel positions by a truncation function aiming at the edges of the text region and the example boundary region to obtain a reliability true value graph of the boundary between the text and the example
Figure 558937DEST_PATH_IMAGE024
And
Figure 383411DEST_PATH_IMAGE025
the cutoff function is:
Figure 343277DEST_PATH_IMAGE026
wherein
Figure 463680DEST_PATH_IMAGE027
Measuring pixel position
Figure 977838DEST_PATH_IMAGE028
Semantic boundary closest to it
Figure 841889DEST_PATH_IMAGE029
The euclidean distance between them,
Figure 859523DEST_PATH_IMAGE030
is the pixel position
Figure 150827DEST_PATH_IMAGE028
The label of the binary value at (a),
Figure 886702DEST_PATH_IMAGE031
is a cut-off threshold value and is also a normalization coefficient, in the method experiment of the invention
Figure 554444DEST_PATH_IMAGE031
= 10。
Figure 223323DEST_PATH_IMAGE032
Represents a position by an absolute value of
Figure 443783DEST_PATH_IMAGE028
The reliability of the pixel is determined, and the sign is used to distinguish the semantic tendency.
The above-described method is applied to specific embodiments below so that those skilled in the art can better understand the effects of the present invention.
The implementation method of this embodiment is as described above, and specific steps are not elaborated, and the effect is shown only for case data. The invention is implemented on three data sets with truth labels, which are respectively as follows:
SCUT-CTW1500 dataset: the data set contained 1000 training images, 500 test images.
TotalText dataset: the data set contained 1255 training images and 300 test images.
ICDAR-ArT dataset: the data set contained 5603 training images, 4563 test images.
This example performs the experiment on each data set separately, and the images in the data sets are shown in fig. 2a to 2c, for example.
The main flow of text detection is as follows:
1) Extracting a multi-scale feature map of the image through a full convolution network and a feature pyramid structure network;
2) Extracting and fusing each scale feature map by using convolution operation and splicing operation to obtain a global feature map;
3) Four prediction branches are constructed on the global feature map, and the text semantic classification score, the instance boundary semantic classification score, the text semantic reliability and the instance boundary semantic reliability of each sample point on the feature map are respectively predicted;
4) Jointly optimizing semantic classification and reliability regression branches;
5) Predicting semantic and reliability information by using the learning framework to obtain a text attribute graph;
6) And (3) carrying out binarization and communication domain extraction on the text attribute graph to obtain a final text detection result, wherein the overall learning network framework is shown as fig. 3, and the detection effect graph of the random form text in the natural scene image is shown as fig. 4.
In order to comprehensively compare the effectiveness of the method, other advanced methods are compared, and effectiveness analysis is performed on three operations of example boundary segmentation, reliability analysis and normalized focal loss function provided by the method. The accuracy (precision), recall (recall) and overall performance (F-measure) of the detection results of this embodiment are shown in tables 1 to 3, where F-measure represents the balanced overall performance between accuracy and recall:
Figure 666954DEST_PATH_IMAGE039
. The data in the table show the performance of the method on three indexes of precision, call and F-measure, and compared with other methods based on a semantic segmentation framework and other methods based on a regression framework, the method has further improvement on the whole.
Table 1 shows the evaluation indexes of the SCUT-CTW1500 data set in the present example:
Figure 872807DEST_PATH_IMAGE040
table 2 shows the evaluation indexes of TotalText data set in this example:
Figure 599455DEST_PATH_IMAGE041
table 3 shows the evaluation indexes of the ICDAR-ArT data set of the present example:
Figure 498141DEST_PATH_IMAGE042
table 4 shows the example boundary segmentation, reliability analysis, and effectiveness analysis of the normalized focal loss function proposed in this embodiment:
Figure 943029DEST_PATH_IMAGE043
through the technical scheme, the deep scene text detection method based on the fuzzy semantic modeling is provided based on the deep learning technology. The invention can mine the semantic information and semantic reliability of the boundary of the text and the instance on various real image data, thereby obtaining an accurate detection result.
Corresponding to the embodiment of the depth scene text detection method based on the fuzzy semantic modeling, the invention also provides an embodiment of a depth scene text detection device based on the fuzzy semantic modeling.
Referring to fig. 5, the depth scene text detection apparatus based on fuzzy semantic modeling provided in the embodiment of the present invention includes one or more processors, which are configured to implement the depth scene text detection method based on fuzzy semantic modeling in the above embodiment.
The depth scene text detection device based on fuzzy semantic modeling of the embodiment of the invention can be applied to any equipment with data processing capability, such as computers and other equipment or devices. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a device in a logical sense, a processor of any device with data processing capability reads corresponding computer program instructions in the nonvolatile memory into the memory for operation. From a hardware aspect, as shown in fig. 5, a hardware structure diagram of any device with data processing capability where the deep scene text detection apparatus based on fuzzy semantic modeling is located according to the present invention is shown, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 5, in the embodiment, any device with data processing capability where the apparatus is located may also include other hardware according to the actual function of the any device with data processing capability, which is not described again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the method for detecting a text in a deep scene based on fuzzy semantic modeling in the foregoing embodiments is implemented.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be an external storage device such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and the like which come within the spirit and principles of the invention are desired to be protected.

Claims (5)

1. A depth scene text detection method based on fuzzy semantic modeling is characterized by comprising the following steps:
acquiring a plurality of groups of image data sets with truth value labels for training scene text detection;
secondly, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map;
step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, performing multi-branch joint optimization under full supervision, and completing construction of an end-to-end joint learning framework, wherein the method specifically comprises the following substeps:
(3.1) establishing 4 prediction branches with consistent structures based on the fused global feature map, wherein each prediction branch comprises three layers of convolution operation, and predicting text semantic category score of each pixel position on the global feature map
Figure DEST_PATH_IMAGE002
Instance boundary semantic category score
Figure DEST_PATH_IMAGE004
Text reliability value
Figure DEST_PATH_IMAGE006
And an example boundary reliability value
Figure DEST_PATH_IMAGE008
(3.2) learning and optimizing the text segmentation graph, the example boundary segmentation graph, the text reliability graph and the example boundary reliability graph generated by the prediction branch so as to establish an end-to-end joint learning framework, wherein the overall framework loss function is as follows:
Figure DEST_PATH_IMAGE010
+
Figure DEST_PATH_IMAGE012
+
Figure DEST_PATH_IMAGE014
,
wherein
Figure DEST_PATH_IMAGE016
And
Figure DEST_PATH_IMAGE018
for the smooth L1 loss function,
Figure DEST_PATH_IMAGE020
and with
Figure DEST_PATH_IMAGE022
As a function of normalized focal loss;
using an end-to-end joint learning framework to predict fuzzy semantic information in the image, and obtaining a text attribute image by utilizing reliability analysis and fusion, wherein the method specifically comprises the following steps: inputting a to-be-predicted image based on the end-to-end joint learning framework established from the first step to the third step, and obtaining a text segmentation graph T and a text reliability graph through the minimum loss function learning
Figure DEST_PATH_IMAGE024
Example boundary segmentation graph S and example boundary reliability graph
Figure DEST_PATH_IMAGE026
And then, carrying out reliability analysis and fusing the reliability analysis and the reliability analysis into a final text attribute graph M:
Figure DEST_PATH_IMAGE028
wherein
Figure DEST_PATH_IMAGE030
Is the weighting coefficient of the balanced branch interval;
step five, carrying out binarization and communication domain extraction on the obtained text attribute graph to obtain a final text detection result, which specifically comprises the following steps: and (3) carrying out contour discovery, namely binarization and communication domain extraction on the text attribute graph output in the step four to obtain coordinate representation of the text region:
Figure DEST_PATH_IMAGE032
wherein,
Figure DEST_PATH_IMAGE034
in order to be a function of the binarization,
Figure DEST_PATH_IMAGE036
for the Union and extraction function, each text instance uses a set of points
Figure DEST_PATH_IMAGE038
It is shown that the process of the present invention,
Figure DEST_PATH_IMAGE040
and
Figure DEST_PATH_IMAGE042
respectively represent the abscissa and ordinate of the mth coordinate of one text region, and N represents the number of coordinate points.
2. The method for detecting the text of the depth scene based on the fuzzy semantic modeling as claimed in claim 1, wherein the second step specifically comprises the following sub-steps:
(2.1) extracting depth features of each image on different scales by using a full convolution network and a feature pyramid network;
and (2.2) extracting and fusing the depth feature maps on different scales by using convolution operation and splicing operation to obtain a fused global feature map.
3. The method as claimed in claim 1, wherein the text detection method based on fuzzy semantic modeling is characterized in that a text true value map, an example boundary true value map, a text reliability true value map and an example boundary reliability true value map are generated by expanding intersection and truncating distance functions of the text region by using coordinates of the text region;
the text true value graph specifically comprises the following steps: performing binarization filling by using truth value data of text region coordinates, wherein the inside filling of the text region is 1, the background is 0, and the truth value data is used as a truth value map of the text region
Figure DEST_PATH_IMAGE044
The example boundary truth diagram specifically includes: text outlines are adaptively expanded according to the scale of the text outlines, 1/5 of line height is used as an expansion parameter, similar text examples after expansion are overlapped, and an overlapping area is defined as a truth map of an example boundary area
Figure DEST_PATH_IMAGE046
The text reliability truth map and the example boundary reliability truth map are specifically as follows: calculating semantic reliability of pixel positions by a truncation function aiming at the edges of the text region and the example boundary region to obtain a reliability true value graph of the boundary between the text and the example
Figure DEST_PATH_IMAGE048
And
Figure DEST_PATH_IMAGE050
the cutoff function is:
Figure DEST_PATH_IMAGE052
wherein
Figure DEST_PATH_IMAGE054
Measuring pixel position
Figure DEST_PATH_IMAGE056
Semantic boundary closest to it
Figure DEST_PATH_IMAGE058
The Euclidean distance between the two electrodes,
Figure DEST_PATH_IMAGE060
is the pixel position
Figure 722912DEST_PATH_IMAGE056
The label of the binary value at (a),
Figure DEST_PATH_IMAGE062
is a truncation threshold, and is also a normalization coefficient,
Figure DEST_PATH_IMAGE064
represents a position by an absolute value of
Figure 967949DEST_PATH_IMAGE056
The reliability of the pixel is measured, and the sign is used to distinguish the semantic tendency.
4. A deep scene text detection device based on fuzzy semantic modeling, which is characterized by comprising one or more processors and is used for realizing the deep scene text detection method based on fuzzy semantic modeling according to any one of claims 1 to 3.
5. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements the method for detecting text in a deep scene based on fuzzy semantic modeling according to any one of claims 1 to 3.
CN202210882622.6A 2022-07-26 2022-07-26 Depth scene text detection method and device based on fuzzy semantic modeling Active CN114972947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210882622.6A CN114972947B (en) 2022-07-26 2022-07-26 Depth scene text detection method and device based on fuzzy semantic modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210882622.6A CN114972947B (en) 2022-07-26 2022-07-26 Depth scene text detection method and device based on fuzzy semantic modeling

Publications (2)

Publication Number Publication Date
CN114972947A CN114972947A (en) 2022-08-30
CN114972947B true CN114972947B (en) 2022-12-06

Family

ID=82968948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210882622.6A Active CN114972947B (en) 2022-07-26 2022-07-26 Depth scene text detection method and device based on fuzzy semantic modeling

Country Status (1)

Country Link
CN (1) CN114972947B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129456B (en) * 2023-02-09 2023-07-25 广西壮族自治区自然资源遥感院 Method and system for identifying and inputting property rights and interests information
CN117851883B (en) * 2024-01-03 2024-08-30 之江实验室 Cross-modal large language model-based scene text detection and recognition method

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288088A (en) * 2018-01-17 2018-07-17 浙江大学 A kind of scene text detection method based on end-to-end full convolutional neural networks
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110322495A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text dividing method based on Weakly supervised deep learning
CN110738609A (en) * 2019-09-11 2020-01-31 北京大学 method and device for removing image moire
CN111210518A (en) * 2020-01-15 2020-05-29 西安交通大学 Topological map generation method based on visual fusion landmark
CN111931763A (en) * 2020-06-09 2020-11-13 浙江大学 Depth scene text detection method based on random shape edge geometric modeling
CN112101165A (en) * 2020-09-07 2020-12-18 腾讯科技(深圳)有限公司 Interest point identification method and device, computer equipment and storage medium
CN112950645A (en) * 2021-03-24 2021-06-11 中国人民解放军国防科技大学 Image semantic segmentation method based on multitask deep learning
CN112966691A (en) * 2021-04-14 2021-06-15 重庆邮电大学 Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN112966697A (en) * 2021-03-17 2021-06-15 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN113343707A (en) * 2021-06-04 2021-09-03 北京邮电大学 Scene text recognition method based on robustness characterization learning
CN113591719A (en) * 2021-08-02 2021-11-02 南京大学 Method and device for detecting text with any shape in natural scene and training method
CN114202671A (en) * 2021-11-17 2022-03-18 桂林理工大学 Image prediction optimization processing method and device
CN114255464A (en) * 2021-12-14 2022-03-29 南京信息工程大学 Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework
CN114399497A (en) * 2022-01-19 2022-04-26 中国平安人寿保险股份有限公司 Text image quality detection method and device, computer equipment and storage medium
WO2022098203A1 (en) * 2020-11-09 2022-05-12 Samsung Electronics Co., Ltd. Method and apparatus for image segmentation
CN114494698A (en) * 2022-01-27 2022-05-13 北京邮电大学 Traditional culture image semantic segmentation method based on edge prediction
CN114495103A (en) * 2022-01-28 2022-05-13 北京百度网讯科技有限公司 Text recognition method, text recognition device, electronic equipment and medium
CN114565913A (en) * 2022-03-03 2022-05-31 广州华多网络科技有限公司 Text recognition method and device, equipment, medium and product thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11366968B2 (en) * 2019-07-29 2022-06-21 Intuit Inc. Region proposal networks for automated bounding box detection and text segmentation
CN112926372B (en) * 2020-08-22 2023-03-10 清华大学 Scene character detection method and system based on sequence deformation
CN112287931B (en) * 2020-12-30 2021-03-19 浙江万里学院 Scene text detection method and system

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288088A (en) * 2018-01-17 2018-07-17 浙江大学 A kind of scene text detection method based on end-to-end full convolutional neural networks
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110322495A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text dividing method based on Weakly supervised deep learning
CN110738609A (en) * 2019-09-11 2020-01-31 北京大学 method and device for removing image moire
CN111210518A (en) * 2020-01-15 2020-05-29 西安交通大学 Topological map generation method based on visual fusion landmark
CN111931763A (en) * 2020-06-09 2020-11-13 浙江大学 Depth scene text detection method based on random shape edge geometric modeling
CN112101165A (en) * 2020-09-07 2020-12-18 腾讯科技(深圳)有限公司 Interest point identification method and device, computer equipment and storage medium
WO2022098203A1 (en) * 2020-11-09 2022-05-12 Samsung Electronics Co., Ltd. Method and apparatus for image segmentation
CN112966697A (en) * 2021-03-17 2021-06-15 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN112950645A (en) * 2021-03-24 2021-06-11 中国人民解放军国防科技大学 Image semantic segmentation method based on multitask deep learning
CN112966691A (en) * 2021-04-14 2021-06-15 重庆邮电大学 Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN113343707A (en) * 2021-06-04 2021-09-03 北京邮电大学 Scene text recognition method based on robustness characterization learning
CN113591719A (en) * 2021-08-02 2021-11-02 南京大学 Method and device for detecting text with any shape in natural scene and training method
CN114202671A (en) * 2021-11-17 2022-03-18 桂林理工大学 Image prediction optimization processing method and device
CN114255464A (en) * 2021-12-14 2022-03-29 南京信息工程大学 Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework
CN114399497A (en) * 2022-01-19 2022-04-26 中国平安人寿保险股份有限公司 Text image quality detection method and device, computer equipment and storage medium
CN114494698A (en) * 2022-01-27 2022-05-13 北京邮电大学 Traditional culture image semantic segmentation method based on edge prediction
CN114495103A (en) * 2022-01-28 2022-05-13 北京百度网讯科技有限公司 Text recognition method, text recognition device, electronic equipment and medium
CN114565913A (en) * 2022-03-03 2022-05-31 广州华多网络科技有限公司 Text recognition method and device, equipment, medium and product thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Fuzzy Semantics for Arbitrary-shaped Scene Text Detection;Fangfang Wang 等;《IEEE Transactions on Image Processing》;20220830;全文 *
Proposing a Semantic Analysis based Sanskrit Compiler by mapping Sanskrit"s linguistic features with Compiler phases;Akshay Chavan 等;《2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC)》;20211231;全文 *
Semantic Genes and the Formalized Representation of Lexical Meaning;Dan Hu;《2010 International Conference on Asian Language Processing》;20101231;全文 *
深度卷积神经网络图像语义分割研究进展;青晨等;《中国图象图形学报》;20200616(第06期);全文 *

Also Published As

Publication number Publication date
CN114972947A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
Zhang et al. A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection
Lee et al. Simultaneous traffic sign detection and boundary estimation using convolutional neural network
CN114972947B (en) Depth scene text detection method and device based on fuzzy semantic modeling
US20200160124A1 (en) Fine-grained image recognition
CN107833213B (en) Weak supervision object detection method based on false-true value self-adaptive method
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN109960742B (en) Local information searching method and device
CN111091105A (en) Remote sensing image target detection method based on new frame regression loss function
CN111680678B (en) Target area identification method, device, equipment and readable storage medium
CN107798725B (en) Android-based two-dimensional house type identification and three-dimensional presentation method
CN113239818B (en) Table cross-modal information extraction method based on segmentation and graph convolution neural network
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111415373A (en) Target tracking and segmenting method, system and medium based on twin convolutional network
JP2023527615A (en) Target object detection model training method, target object detection method, device, electronic device, storage medium and computer program
CN114463603B (en) Training method and device for image detection model, electronic equipment and storage medium
JP2019185787A (en) Remote determination of containers in geographical region
Cao et al. Multi angle rotation object detection for remote sensing image based on modified feature pyramid networks
CN115115825A (en) Method and device for detecting object in image, computer equipment and storage medium
CN114168768A (en) Image retrieval method and related equipment
Park et al. Estimating the camera direction of a geotagged image using reference images
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
Zhou et al. Self-supervised saliency estimation for pixel embedding in road detection
CN113610856B (en) Method and device for training image segmentation model and image segmentation
Jia et al. Sample generation of semi‐automatic pavement crack labelling and robustness in detection of pavement diseases
CN114241470A (en) Natural scene character detection method based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant