CN114972947B - Deep scene text detection method and device based on fuzzy semantic modeling - Google Patents
Deep scene text detection method and device based on fuzzy semantic modeling
- Publication number: CN114972947B (application CN202210882622.6A)
- Authority: CN (China)
- Prior art keywords: text, reliability, semantic, graph, boundary
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/28—Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
- G06V10/764—Recognition using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/82—Recognition using pattern recognition or machine learning using neural networks
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06V30/153—Segmentation of character regions using recognition of characters or words
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
- Y02T10/40—Engine management systems
Abstract
The invention discloses a deep scene text detection method and device based on fuzzy semantic modeling. The method comprises the following steps: step one, acquiring a plurality of groups of image data sets with ground-truth labels for training scene text detection; step two, performing feature learning and global feature fusion on the images in the data sets to obtain a fused global feature map; step three, performing pixel-level semantic classification on the fused global feature map, predicting pixel-level semantic reliability through numerical regression, and performing multi-branch joint optimization under full supervision, completing the construction of an end-to-end joint learning framework; step four, using the end-to-end joint learning framework to predict the fuzzy semantic information in an image, and obtaining a text attribute map through reliability analysis and fusion; and step five, performing binarization and connected-component extraction on the text attribute map to obtain the final text detection result. The method is simple, flexible, robust, and widely applicable.
Description
Technical Field
The invention belongs to the field of computer vision, and relates to a deep scene text detection method and device based on fuzzy semantic modeling.
Background
Scene text detection is defined as the following problem: finding the positions of text regions of multi-oriented, multi-language, curved, or irregular shape in natural scene images. Owing to the effectiveness of statistical modeling, learning-based methods are increasingly applied to scene text detection tasks. Existing learning-based methods mainly adopt a deep learning framework that takes an image as input and outputs the detected text regions.
In recent years, scene text detection has been widely used in computer vision tasks such as scene understanding and image retrieval. The task has two key points: the first is how to mine semantic information at the text pixel level, bottom-up, so as to adapt to a wide variety of text shapes; the second is how to model the semantic ambiguity of text region edges, so as to solve the problem that adjacent instances are difficult to distinguish because of the homogeneous textures inside text and the clustered distribution of text instances.
Disclosure of Invention
Aiming at the first point, the invention uses a semantic segmentation framework: pixel-level semantic information can be effectively mined by performing global feature fusion and end-to-end feature learning through a feature extraction network and a feature pyramid network. Aiming at the second point, the invention observes that text instance boundary regions have distinctive semantic characteristics; by mining the two kinds of semantic information (text and instance boundary) and performing semantic reliability analysis, the boundaries of different text targets can be found and distinguished more accurately. The specific technical scheme is as follows:
a deep scene text detection method based on fuzzy semantic modeling comprises the following steps:
step one, acquiring a plurality of groups of image data sets with ground-truth labels for training scene text detection;
step two, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map;
step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, performing multi-branch joint optimization under full supervision, and completing construction of an end-to-end joint learning framework;
step four, using an end-to-end joint learning frame to predict fuzzy semantic information in the image, and obtaining a text attribute map by utilizing reliability analysis and fusion;
and step five, performing binarization and connected-component extraction on the obtained text attribute map to obtain the final text detection result.
Further, the second step specifically includes the following substeps:
(2.1) extracting depth features of each image on different scales by using a full convolution network and a feature pyramid network;
and (2.2) extracting and fusing the depth feature maps on different scales by using convolution operation and splicing operation to obtain a fused global feature map.
Further, the third step specifically includes the following sub-steps:
(3.1) establishing, based on the fused global feature map, 4 prediction branches with identical structure, each comprising three convolution layers, to predict at every pixel position on the global feature map a text semantic category score T, an instance boundary semantic category score S, a text reliability value F_T, and an instance boundary reliability value F_S;
(3.2) learning and optimizing the text segmentation map, the instance boundary segmentation map, the text reliability map, and the instance boundary reliability map generated by the prediction branches, so as to establish an end-to-end joint learning framework; the overall framework loss function is the sum of the four branch losses:
L = L_fl(T) + L_fl(S) + L_s1(F_T) + L_s1(F_S)
where L_fl denotes the normalized focal loss on the two segmentation maps and L_s1 denotes the smooth-L1 loss on the two reliability maps.
Further, the fourth step is specifically:
inputting an image to be predicted into the end-to-end joint learning framework established in steps one to three; through minimum-loss-function learning, a text segmentation map T, a text reliability map F_T, an instance boundary segmentation map S, and an instance boundary reliability map F_S are obtained; reliability analysis is then performed with these four output maps, which are fused into the final text attribute map M.
Further, the fifth step is specifically:
performing contour discovery, namely binarization and connected-component extraction, on the text attribute map output in step four to obtain the coordinate representation P of the text regions:
P = C(B(M))
where B(·) is the binarization function and C(·) is the connected-component extraction function; each text instance is represented by a set of points {(x_m, y_m)}, where x_m and y_m respectively denote the abscissa and ordinate of the m-th coordinate of a text region, and N denotes the number of coordinate points.
Further, a text truth map, an instance boundary truth map, a text reliability truth map, and an instance boundary reliability truth map are generated from the text region coordinates by expanding the text regions, intersecting the expansions, and applying a truncated distance function;
the text true value graph specifically includes: performing binarization filling by using truth value data of text region coordinates, wherein the inside filling of the text region is 1, the background is 0, and the truth value data is used as a truth value map of the text region;
The example boundary truth diagram specifically includes: the text outline is adaptively expanded by taking 1/5 of line height as an expansion parameter according to the scale of the text outline, similar text instances after expansion are overlapped, and an overlapping area is defined as a truth map of an instance boundary area;
The text reliability truth map and the instance boundary reliability truth map are generated as follows: for the edges of the text regions and of the instance boundary regions, the semantic reliability of each pixel position is computed with a truncation function, yielding the reliability truth maps of the text and the instance boundary; the truncation function is:
f(p) = (2·l_p - 1) · min(d(p), θ) / θ
where d(p) measures the Euclidean distance from pixel position p to its nearest semantic boundary, l_p is the binary label at p, and θ is the truncation threshold and also the normalization coefficient; the absolute value of f(p) represents the reliability of the pixel at position p, and its sign is used to distinguish the semantic tendency.
A deep scene text detection device based on fuzzy semantic modeling comprises one or more processors, configured to implement the above deep scene text detection method based on fuzzy semantic modeling.
A computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the method for deep scene text detection based on fuzzy semantic modeling.
Compared with the existing scene text detection method, the method has the following beneficial effects:
firstly, from the perspective of redundancy removal, the scene text detection method discovers and exploits two fuzzy semantics in natural images, text and instance boundary, thereby solving the problem in bottom-up arbitrary-shape scene text detection that adjacent instances are difficult to distinguish because of the homogeneous textures inside text and the clustered distribution of text instances;
secondly, the reliability modeling is carried out on the fuzzy semantic boundary, the competition problem on the semantic boundary is solved through reliability analysis, and the final semantic attribute is judged, so that a clear and complete example boundary is obtained, and the scene text detection effect is improved;
finally, as a simple and direct lightweight framework based on semantic segmentation, the scene text detection method obtains the final detection result in a single pass via connected-component extraction, without any iterative or other complex post-processing steps, and its effect exceeds that of many multi-stage segmentation-based methods;
the method has good application value in scenes such as scene understanding, automatic driving and the like, for example, in an automatic driving task, texts in the scenes contain a large amount of information for helping to understand the scenes and assisting in driving, and the accurate detection of the positions of the texts is based on the scene text information.
Drawings
FIG. 1 is a schematic flow diagram of the deep scene text detection method based on fuzzy semantic modeling according to the present invention;
FIGS. 2 a-2 c are schematic diagrams of original images according to an embodiment of the present invention;
FIG. 3 is a framework diagram of a learning network of the present invention;
FIG. 4 is a diagram illustrating the detection effect of the semantic segmentation framework on the random form text in the natural scene image according to the embodiment of the present invention;
fig. 5 is a schematic structural diagram of a deep scene text detection device based on fuzzy semantic modeling according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
In scene text detection, the geometric attributes of text are variable and its semantic boundaries are fuzzy. The invention therefore finds bottom-up, pixel-level text regions based on a convolutional neural network, models the semantic information and reliability of text and instance boundaries, and finally performs network optimization with an effective training strategy. Specifically, a deep scene text detection method and device based on fuzzy semantic modeling are provided: two mutually competing fuzzy boundary semantic categories, text and instance boundary, are mined in natural scene images, and scene text targets of arbitrary shape are detected from the perspective of redundancy removal through pixel-level multi-label classification and fuzzy semantic reliability analysis. The method uses a one-stage deep learning segmentation framework and optimizes the network with a cross-image pixel-level focal loss function; it is simple to implement, flexible, robust, and widely applicable.
In more detail, as shown in fig. 1, the deep scene text detection method based on fuzzy semantic modeling includes the following steps:
step one, acquiring a plurality of groups of image data sets with true value labels for training scene text detection. Specifically, the present invention is implemented on three data sets with truth labels, which are:
SCUT-CTW1500 dataset: the data set contains 1000 training images, 500 test images;
TotalText dataset: the data set contained 1255 training images, 300 test images;
ICDAR-ArT dataset: the data set contained 5603 training images, 4563 test images.
And step two, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map.
The specific implementation method of the step comprises the following steps:
(2.1) extracting depth features of each image on different scales by using a full convolution network and a feature pyramid network;
and (2.2) extracting and fusing the depth feature maps on different scales by using convolution operation and splicing operation to obtain a fused global feature map.
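The two substeps above can be sketched as follows. This is a minimal illustration that assumes nearest-neighbour upsampling and channel concatenation over four pyramid levels; the actual network's fusion operators, channel counts, and level strides are not specified at this level of detail, so all shapes here are hypothetical:

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_pyramid(feats):
    """Upsample each pyramid level to the finest resolution and concatenate.

    feats: list of (C, H_i, W_i) arrays, ordered fine -> coarse,
    each level a power-of-two fraction of the finest spatial size.
    """
    target_h = feats[0].shape[1]
    fused = [upsample_nearest(f, target_h // f.shape[1]) for f in feats]
    return np.concatenate(fused, axis=0)  # (sum of C_i, H, W) global feature map

# Toy pyramid: 4 levels, 8 channels each, strides 1, 2, 4, 8 relative to the finest
pyramid = [np.random.rand(8, 32 // s, 32 // s) for s in (1, 2, 4, 8)]
global_feat = fuse_pyramid(pyramid)  # fused global feature map, shape (32, 32, 32)
```

In the real framework the concatenation would be followed by further convolutions; the sketch only shows how multi-scale maps become a single global feature map.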
And step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, and performing multi-branch joint optimization under full supervision to complete the construction of an end-to-end joint learning framework.
The specific implementation method of the step comprises the following steps:
(3.1) establishing 4 prediction branches with identical structure based on the fused global feature map, each comprising three convolution layers, to predict at every pixel position on the global feature map a text semantic category score T, an instance boundary semantic category score S, a text reliability value F_T, and an instance boundary reliability value F_S;
(3.2) learning and optimizing the text segmentation map, the instance boundary segmentation map, the text reliability map, and the instance boundary reliability map generated by the prediction branches, so as to establish an end-to-end joint learning framework; the overall framework loss function is the sum of the four branch losses:
L = L_fl(T) + L_fl(S) + L_s1(F_T) + L_s1(F_S)
where L_s1(F_T) and L_s1(F_S) are smooth-L1 loss functions on the two reliability maps, and L_fl(T) and L_fl(S) are normalized focal loss functions on the two segmentation maps. Taking the segmentation loss at the positive sample points as an example:
L_+ = -(1 / (N_+ · w_avg)) · Σ_{p ∈ P_+} w_p · log(q_p), with w_p = r_p · (1 - q_p)^γ
where N is the total number of sample pixel points in the current image (the negative-sample loss is defined analogously over the remaining N - N_+ pixels), w_avg is the dynamic mean of the training weights of the positive sample points over all images processed so far, r_p is the reliability value at the current position p, q_p is the probability value predicted at that position, N_+ is the number of positive sample points in the current image, and w_p is the training weight of sample point p.
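A minimal executable sketch of such a normalized focal loss over the positive sample points follows. The focal exponent gamma = 2, the reliability weighting, and the normalization by the positive-sample count times a running mean weight are assumptions made for illustration; the function name and exact combination are not taken from the patent text:

```python
import math

def normalized_focal_loss_pos(probs, reliabilities, running_mean_w, gamma=2.0):
    """Focal cross-entropy over positive sample points, normalized by their
    count and by a dynamic mean of training weights (assumed form).

    probs: predicted probabilities q_p at positive pixels.
    reliabilities: reliability values r_p at the same pixels.
    running_mean_w: dynamic mean of positive-sample weights over images seen so far.
    """
    eps = 1e-7
    n_pos = max(len(probs), 1)
    total = 0.0
    for q, r in zip(probs, reliabilities):
        w = r * (1.0 - q) ** gamma           # per-pixel training weight
        total += -w * math.log(max(q, eps))  # focal cross-entropy term
    return total / (n_pos * running_mean_w)

loss = normalized_focal_loss_pos([0.9, 0.6, 0.8], [1.0, 0.5, 0.8], running_mean_w=0.5)
```

Note how confidently classified pixels (q near 1) contribute almost nothing, which is the point of the focal modulation: training focuses on hard pixels near fuzzy boundaries.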
And step four, using an end-to-end joint learning framework to predict fuzzy semantic information in the image, and obtaining a text attribute map by utilizing reliability analysis and fusion.
The specific implementation method of the step comprises the following steps:
inputting an image to be predicted into the end-to-end joint learning framework established in steps one to three; through minimum-loss-function learning, a text segmentation map T, a text reliability map F_T, an instance boundary segmentation map S, and an instance boundary reliability map F_S are obtained; reliability analysis is then performed with these four output maps, which are fused into the final text attribute map M.
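The fusion rule itself is not reproduced in this text, so the following is only one plausible reading of the reliability analysis: a pixel keeps the text attribute unless the instance boundary prediction wins the reliability competition there, which is what separates adjacent instances. The threshold and the competition rule are assumptions:

```python
import numpy as np

def fuse_attribute_map(T, F_T, S, F_S, thresh=0.5):
    """Resolve the text vs. instance-boundary competition per pixel (assumed rule).

    T, S: text and instance-boundary segmentation scores in [0, 1].
    F_T, F_S: the corresponding reliability maps.
    Returns a binary text attribute map M.
    """
    text = T > thresh
    boundary = (S > thresh) & (F_S > F_T)  # boundary wins where its reliability is higher
    return text & ~boundary

T   = np.array([[0.9, 0.9], [0.9, 0.2]])
F_T = np.array([[0.8, 0.3], [0.8, 0.1]])
S   = np.array([[0.1, 0.9], [0.2, 0.9]])
F_S = np.array([[0.1, 0.7], [0.1, 0.9]])
M = fuse_attribute_map(T, F_T, S, F_S)
```

In this toy example the top-right pixel is claimed by both branches, and the higher boundary reliability removes it from the text attribute map, carving a gap between instances.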
And step five, performing binarization and connected-component extraction on the obtained text attribute map to obtain the final text detection result.
The specific implementation method of the step comprises the following steps:
performing contour discovery, namely binarization and connected-component extraction, on the text attribute map output in step four to obtain the coordinate representation P of the text regions:
P = C(B(M))
where B(·) is the binarization function and C(·) is the connected-component extraction function; each text instance is represented by a set of points {(x_m, y_m)}, where x_m and y_m respectively denote the abscissa and ordinate of the m-th coordinate of a text region, and N denotes the number of coordinate points.
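The contour discovery step, binarize then extract connected domains, can be illustrated with a small pure-Python 4-connected flood fill; a production implementation would more likely use OpenCV's connectedComponents or findContours:

```python
from collections import deque

def connected_components(binary):
    """Extract 4-connected components from a binary map; each component
    is returned as a list of (x, y) points, i.e. one text instance."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                comp, queue = [], deque([(x, y)])
                seen[y][x] = True
                while queue:
                    cx, cy = queue.popleft()
                    comp.append((cx, cy))
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
                comps.append(comp)
    return comps

# Two separate text instances in a toy attribute map (already binarized)
attr = [[1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 1]]
instances = connected_components(attr)  # two point sets, one per text instance
```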
Finally, the four truth maps of text, instance boundary, text reliability, and instance boundary reliability are generated from the coordinate information of the text regions, through text region dilation and intersection and a truncated distance function, as follows:
Text truth map: binarization filling is performed using the ground-truth text region coordinates, with the interior of each text region filled with 1 and the background with 0, yielding the truth map of the text regions;
Instance boundary truth map: each text contour is adaptively dilated according to its scale, taking 1/5 of the line height as the dilation parameter; adjacent dilated text instances then overlap, and the overlapping area is defined as the truth map of the instance boundary region;
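The instance boundary truth construction, dilating each contour by one fifth of its line height and taking the overlap of adjacent dilated instances, can be illustrated on axis-aligned boxes. Real text contours are polygons; boxes are used here only to keep the sketch short:

```python
def dilate_box(box):
    """Expand an (x0, y0, x1, y1) text box by 1/5 of its line height on all sides."""
    x0, y0, x1, y1 = box
    pad = (y1 - y0) / 5.0
    return (x0 - pad, y0 - pad, x1 + pad, y1 + pad)

def boundary_region(box_a, box_b):
    """Overlap of two dilated boxes = instance boundary truth region (or None)."""
    a, b = dilate_box(box_a), dilate_box(box_b)
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

# Two neighbouring text lines of height 10; their dilations overlap between y=10 and y=12
region = boundary_region((0, 0, 50, 10), (0, 12, 50, 22))
```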
Text reliability truth map and instance boundary reliability truth map: for the edges of the text regions and of the instance boundary regions, the semantic reliability of each pixel position is computed with a truncation function, yielding the reliability truth maps of the text and the instance boundary; the truncation function is:
f(p) = (2·l_p - 1) · min(d(p), θ) / θ
where d(p) measures the Euclidean distance from pixel position p to its nearest semantic boundary, l_p is the binary label at p, and θ is the truncation threshold and also the normalization coefficient (θ = 10 in the experiments of this method); the absolute value of f(p) represents the reliability of the pixel at position p, and its sign is used to distinguish the semantic tendency.
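The truncation function can be sketched directly. The signed form used below, (2·l_p - 1) · min(d, θ) / θ, is an assumed reading consistent with the description (absolute value encodes reliability, sign encodes semantic tendency), with θ = 10 as stated:

```python
def reliability_truth(distance, label, theta=10.0):
    """Truncated, normalized signed reliability at a pixel (assumed form).

    distance: Euclidean distance d(p) to the nearest semantic boundary.
    label: binary label l_p in {0, 1} at the pixel.
    theta: truncation threshold, also the normalization coefficient (10 in the experiments).
    """
    magnitude = min(distance, theta) / theta  # reliability in [0, 1]
    sign = 1.0 if label == 1 else -1.0        # semantic tendency
    return sign * magnitude

# Reliability grows linearly with distance from the boundary and saturates at theta
values = [reliability_truth(d, 1) for d in (0.0, 5.0, 25.0)]
```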
The above-described method is applied to specific embodiments below so that those skilled in the art can better understand the effects of the present invention.
The implementation method of this embodiment is as described above, and specific steps are not elaborated, and the effect is shown only for case data. The invention is implemented on three data sets with truth labels, which are respectively as follows:
SCUT-CTW1500 dataset: the data set contained 1000 training images, 500 test images.
TotalText dataset: the data set contained 1255 training images and 300 test images.
ICDAR-ArT dataset: the data set contained 5603 training images, 4563 test images.
This example performs the experiment on each data set separately, and the images in the data sets are shown in fig. 2a to 2c, for example.
The main flow of text detection is as follows:
1) Extracting a multi-scale feature map of the image through a full convolution network and a feature pyramid structure network;
2) Extracting and fusing each scale feature map by using convolution operation and splicing operation to obtain a global feature map;
3) Four prediction branches are constructed on the global feature map, and the text semantic classification score, the instance boundary semantic classification score, the text semantic reliability and the instance boundary semantic reliability of each sample point on the feature map are respectively predicted;
4) Jointly optimizing semantic classification and reliability regression branches;
5) Predicting semantic and reliability information by using the learning framework to obtain a text attribute graph;
6) Binarization and connected-component extraction are performed on the text attribute map to obtain the final text detection result. The overall learning network framework is shown in fig. 3, and the detection results for arbitrarily shaped text in natural scene images are shown in fig. 4.
To comprehensively verify the effectiveness of the method, it is compared with other state-of-the-art methods, and an effectiveness analysis is performed on the three operations proposed by the method: instance boundary segmentation, reliability analysis, and the normalized focal loss function. The accuracy (precision), recall, and overall performance (F-measure) of the detection results of this embodiment are shown in tables 1 to 3, where the F-measure represents the balanced overall performance between precision and recall:
F-measure = 2 · precision · recall / (precision + recall)
The data in the tables show the performance of the method on the precision, recall, and F-measure indexes; compared with other methods based on a semantic segmentation framework and methods based on a regression framework, the method achieves a further overall improvement.
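As a quick check of the evaluation metric used in the tables, the F-measure is the harmonic mean of precision and recall:

```python
def f_measure(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

score = f_measure(0.9, 0.8)  # between the two, pulled toward the lower value
```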
Table 1 shows the evaluation indexes of the SCUT-CTW1500 data set in the present example:
table 2 shows the evaluation indexes of TotalText data set in this example:
table 3 shows the evaluation indexes of the ICDAR-ArT data set of the present example:
table 4 shows the example boundary segmentation, reliability analysis, and effectiveness analysis of the normalized focal loss function proposed in this embodiment:
through the technical scheme, the deep scene text detection method based on the fuzzy semantic modeling is provided based on the deep learning technology. The invention can mine the semantic information and semantic reliability of the boundary of the text and the instance on various real image data, thereby obtaining an accurate detection result.
Corresponding to the foregoing embodiment of the deep scene text detection method based on fuzzy semantic modeling, the invention also provides an embodiment of a deep scene text detection device based on fuzzy semantic modeling.
Referring to fig. 5, the deep scene text detection device based on fuzzy semantic modeling provided in the embodiment of the present invention includes one or more processors configured to implement the deep scene text detection method based on fuzzy semantic modeling of the above embodiment.
The deep scene text detection device based on fuzzy semantic modeling of the embodiment of the invention can be applied to any equipment with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, the device in the logical sense is formed by the processor of the equipment reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, fig. 5 shows a hardware structure diagram of the equipment on which the deep scene text detection device based on fuzzy semantic modeling is located; besides the processor, memory, network interface, and non-volatile memory shown in fig. 5, the equipment may also include other hardware according to its actual function, which is not described again here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the method for detecting a text in a deep scene based on fuzzy semantic modeling in the foregoing embodiments is implemented.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be an external storage device such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or that equivalent substitutions may be made for certain features thereof. All changes, equivalents, and the like that come within the spirit and principles of the invention are intended to be protected.
Claims (5)
1. A depth scene text detection method based on fuzzy semantic modeling is characterized by comprising the following steps:
step one, acquiring a plurality of groups of image data sets with truth-value labels for training scene text detection;
step two, performing feature learning and global feature fusion on the images in the data set by using a full convolution feature extraction network and a feature pyramid network to obtain a fused global feature map;
step three, performing pixel-level semantic classification on the fused global feature map, predicting the pixel-level semantic reliability through numerical regression, performing multi-branch joint optimization under full supervision, and completing construction of an end-to-end joint learning framework, wherein the method specifically comprises the following substeps:
(3.1) establishing 4 prediction branches with a consistent structure on top of the fused global feature map, each prediction branch comprising three convolution layers, and predicting, for each pixel position on the global feature map, a text semantic category score, an example boundary semantic category score, a text reliability value, and an example boundary reliability value;
(3.2) learning and optimizing the text segmentation graph, the example boundary segmentation graph, the text reliability graph and the example boundary reliability graph generated by the prediction branch so as to establish an end-to-end joint learning framework, wherein the overall framework loss function is as follows:
step four, using the end-to-end joint learning framework to predict fuzzy semantic information in the image, and obtaining a text attribute graph through reliability analysis and fusion, which specifically comprises the following steps: inputting an image to be predicted into the end-to-end joint learning framework established in steps one to three; obtaining, through minimum-loss-function learning, a text segmentation graph T, a text reliability graph, an example boundary segmentation graph S, and an example boundary reliability graph; and then carrying out reliability analysis and fusing the results into a final text attribute graph M:
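The fusion formula referenced above is rendered as an image in the original patent and is not recoverable here. As a purely illustrative sketch of reliability-weighted fusion (the weighting scheme below is an assumption, not the patented formula), the four predicted maps could be combined as follows:

```python
import numpy as np

def fuse_attribute_map(T, c_T, S, c_S):
    """Illustrative reliability-weighted fusion (assumed form, not the
    patent's elided formula): text evidence is weighted by its
    reliability and suppressed where a reliable instance boundary
    is present."""
    text_evidence = T * np.abs(c_T)       # reliability-weighted text score
    boundary_evidence = S * np.abs(c_S)   # reliability-weighted boundary score
    M = np.clip(text_evidence - boundary_evidence, 0.0, 1.0)
    return M
```

Any monotone combination that raises the attribute score where text is reliably detected and lowers it near reliable instance boundaries would serve the same role in this sketch.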
step five, carrying out binarization and connected-domain extraction on the obtained text attribute graph to obtain the final text detection result, which specifically comprises the following steps: carrying out contour discovery, namely binarization and connected-domain extraction, on the text attribute graph output in step four to obtain the coordinate representation of each text region:
wherein one function performs binarization and the other performs the union, i.e. connected-domain extraction; each text instance is represented by a set of points, whose elements respectively give the abscissa and ordinate of the m-th coordinate point of a text region, and N represents the number of coordinate points.
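Step five can be sketched as follows; the binarization threshold and the 4-connected labelling are assumptions, since the claim does not fix either choice:

```python
import numpy as np
from collections import deque

def extract_text_regions(attr_map, thresh=0.5):
    """Binarize the text attribute map and extract 4-connected
    components; each component is returned as a list of (x, y)
    coordinate points. thresh is an assumed binarization threshold."""
    B = attr_map > thresh                      # binarization
    h, w = B.shape
    seen = np.zeros_like(B, dtype=bool)
    regions = []
    for i in range(h):
        for j in range(w):
            if B[i, j] and not seen[i, j]:
                q, points = deque([(i, j)]), []
                seen[i, j] = True
                while q:                       # BFS over one connected domain
                    y, x = q.popleft()
                    points.append((x, y))      # (abscissa, ordinate)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and B[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                regions.append(points)
    return regions
```

In practice a library routine (e.g. a connected-component labelling function) would replace the hand-written BFS, but the output per instance is the same: a point set describing one text region.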
2. The method for detecting the text of the depth scene based on the fuzzy semantic modeling as claimed in claim 1, wherein the second step specifically comprises the following sub-steps:
(2.1) extracting depth features of each image on different scales by using a full convolution network and a feature pyramid network;
and (2.2) extracting and fusing the depth feature maps on different scales by using convolution operation and splicing operation to obtain a fused global feature map.
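Sub-steps (2.1)-(2.2) can be illustrated with a minimal numpy sketch; the real method uses a full convolution network and learned convolutions, which are replaced here by nearest-neighbour upsampling and channel concatenation purely for clarity:

```python
import numpy as np

def upsample_nearest(feat, scale):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(scale, axis=1).repeat(scale, axis=2)

def fuse_pyramid(feats):
    """Sketch of sub-step (2.2): bring pyramid levels of shape
    (C, H/2^k, W/2^k) to the finest resolution and concatenate them
    along the channel axis. The convolution and splicing operations of
    the actual method are simplified away here."""
    target_h = feats[0].shape[1]
    ups = [upsample_nearest(f, target_h // f.shape[1]) for f in feats]
    return np.concatenate(ups, axis=0)   # fused global feature map
```

The fused map keeps the finest spatial resolution while stacking the channels of every pyramid level, which is the shape the prediction branches in step three operate on.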
3. The method for detecting text in a depth scene based on fuzzy semantic modeling as claimed in claim 1, wherein a text truth-value map, an example boundary truth-value map, a text reliability truth-value map, and an example boundary reliability truth-value map are generated from the coordinates of the text regions by means of expansion-intersection and truncated distance functions;
the text truth-value map is generated specifically as follows: binarization filling is performed using the truth-value data of the text region coordinates, with the interior of each text region filled with 1 and the background with 0, and the result is used as the truth-value map of the text regions;
the example boundary truth-value map is generated specifically as follows: text outlines are adaptively expanded according to their scale, with 1/5 of the line height as the expansion parameter; adjacent text instances overlap after expansion, and the overlapping area is defined as the truth-value map of the example boundary region;
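The expansion-and-overlap construction of the example boundary truth map can be sketched as follows; the square structuring element and the per-pixel dilation loop are simplifications for illustration, not the patent's exact expansion operator:

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation by r pixels (square structuring element), a
    simple stand-in for the adaptive contour expansion in the claim."""
    out = mask.copy()
    for _ in range(r):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]
        grown[:-1, :] |= out[1:, :]
        grown[:, 1:] |= out[:, :-1]
        grown[:, :-1] |= out[:, 1:]
        out = grown
    return out

def boundary_truth(instance_masks, line_height):
    """Expand each instance mask by 1/5 of the line height; pixels
    covered by two or more expanded instances form the example
    boundary truth map."""
    r = max(1, line_height // 5)
    cover = sum(dilate(m, r).astype(int) for m in instance_masks)
    return cover >= 2
```

Applied to two adjacent instance masks, only the strip where their expanded outlines intersect is marked as boundary, matching the overlap definition above.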
the text reliability truth-value map and the example boundary reliability truth-value map are generated specifically as follows: for the edges of the text regions and the example boundary regions, the semantic reliability of each pixel position is computed by a truncation function, yielding the reliability truth-value maps of the text and the example boundary; the truncation function is:
wherein the distance term measures the Euclidean distance between a pixel position and its nearest semantic boundary, the binary label gives the semantic category at that pixel position, and the truncation threshold also serves as the normalization coefficient; the absolute value of the function measures the reliability of the pixel position, and its sign distinguishes the semantic tendency.
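The truncation function itself is rendered as an image in the original; the sketch below implements one plausible form consistent with the surrounding description (signed distance to the nearest semantic boundary, truncated at the threshold and normalised by it). It is an assumption, not the patented formula:

```python
import numpy as np

def reliability_truth(binary_mask, theta):
    """Assumed form of the elided truncation function: the Euclidean
    distance d(p) from each pixel to the nearest semantic boundary is
    truncated at theta and normalised by it; the sign comes from the
    binary label (+ inside, - outside), so the absolute value measures
    reliability and the sign the semantic tendency."""
    h, w = binary_mask.shape
    # boundary pixels: those with a 4-neighbour of the opposite class
    pad = np.pad(binary_mask, 1, mode='edge')
    boundary = np.zeros((h, w), dtype=bool)
    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        boundary |= pad[1 + dy:1 + dy + h, 1 + dx:1 + dx + w] != binary_mask
    by, bx = np.nonzero(boundary)
    ys, xs = np.mgrid[0:h, 0:w]
    # brute-force nearest-boundary Euclidean distance per pixel
    d = np.sqrt((ys[..., None] - by) ** 2 + (xs[..., None] - bx) ** 2).min(axis=-1)
    sign = np.where(binary_mask, 1.0, -1.0)
    return sign * np.minimum(d, theta) / theta
```

Under this form, pixels deep inside a region saturate at ±1 (high reliability) and pixels on a semantic boundary score 0, which matches the fuzzy-semantics intuition of the claims.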
4. A deep scene text detection device based on fuzzy semantic modeling, which is characterized by comprising one or more processors and is used for realizing the deep scene text detection method based on fuzzy semantic modeling according to any one of claims 1 to 3.
5. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements the method for detecting text in a deep scene based on fuzzy semantic modeling according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210882622.6A CN114972947B (en) | 2022-07-26 | 2022-07-26 | Depth scene text detection method and device based on fuzzy semantic modeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114972947A CN114972947A (en) | 2022-08-30 |
CN114972947B true CN114972947B (en) | 2022-12-06 |
Family
ID=82968948
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210882622.6A Active CN114972947B (en) | 2022-07-26 | 2022-07-26 | Depth scene text detection method and device based on fuzzy semantic modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114972947B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116129456B (en) * | 2023-02-09 | 2023-07-25 | 广西壮族自治区自然资源遥感院 | Method and system for identifying and inputting property rights and interests information |
CN117851883B (en) * | 2024-01-03 | 2024-08-30 | 之江实验室 | Cross-modal large language model-based scene text detection and recognition method |
Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108288088A (en) * | 2018-01-17 | 2018-07-17 | 浙江大学 | A kind of scene text detection method based on end-to-end full convolutional neural networks |
WO2019192397A1 (en) * | 2018-04-04 | 2019-10-10 | 华中科技大学 | End-to-end recognition method for scene text in any shape |
CN110322495A (en) * | 2019-06-27 | 2019-10-11 | 电子科技大学 | A kind of scene text dividing method based on Weakly supervised deep learning |
CN110738609A (en) * | 2019-09-11 | 2020-01-31 | 北京大学 | method and device for removing image moire |
CN111210518A (en) * | 2020-01-15 | 2020-05-29 | 西安交通大学 | Topological map generation method based on visual fusion landmark |
CN111931763A (en) * | 2020-06-09 | 2020-11-13 | 浙江大学 | Depth scene text detection method based on random shape edge geometric modeling |
CN112101165A (en) * | 2020-09-07 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Interest point identification method and device, computer equipment and storage medium |
CN112950645A (en) * | 2021-03-24 | 2021-06-11 | 中国人民解放军国防科技大学 | Image semantic segmentation method based on multitask deep learning |
CN112966691A (en) * | 2021-04-14 | 2021-06-15 | 重庆邮电大学 | Multi-scale text detection method and device based on semantic segmentation and electronic equipment |
CN112966697A (en) * | 2021-03-17 | 2021-06-15 | 西安电子科技大学广州研究院 | Target detection method, device and equipment based on scene semantics and storage medium |
CN113343707A (en) * | 2021-06-04 | 2021-09-03 | 北京邮电大学 | Scene text recognition method based on robustness characterization learning |
CN113591719A (en) * | 2021-08-02 | 2021-11-02 | 南京大学 | Method and device for detecting text with any shape in natural scene and training method |
CN114202671A (en) * | 2021-11-17 | 2022-03-18 | 桂林理工大学 | Image prediction optimization processing method and device |
CN114255464A (en) * | 2021-12-14 | 2022-03-29 | 南京信息工程大学 | Natural scene character detection and identification method based on CRAFT and SCRN-SEED framework |
CN114399497A (en) * | 2022-01-19 | 2022-04-26 | 中国平安人寿保险股份有限公司 | Text image quality detection method and device, computer equipment and storage medium |
WO2022098203A1 (en) * | 2020-11-09 | 2022-05-12 | Samsung Electronics Co., Ltd. | Method and apparatus for image segmentation |
CN114494698A (en) * | 2022-01-27 | 2022-05-13 | 北京邮电大学 | Traditional culture image semantic segmentation method based on edge prediction |
CN114495103A (en) * | 2022-01-28 | 2022-05-13 | 北京百度网讯科技有限公司 | Text recognition method, text recognition device, electronic equipment and medium |
CN114565913A (en) * | 2022-03-03 | 2022-05-31 | 广州华多网络科技有限公司 | Text recognition method and device, equipment, medium and product thereof |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11366968B2 (en) * | 2019-07-29 | 2022-06-21 | Intuit Inc. | Region proposal networks for automated bounding box detection and text segmentation |
CN112926372B (en) * | 2020-08-22 | 2023-03-10 | 清华大学 | Scene character detection method and system based on sequence deformation |
CN112287931B (en) * | 2020-12-30 | 2021-03-19 | 浙江万里学院 | Scene text detection method and system |
Non-Patent Citations (4)
Title |
---|
Fuzzy Semantics for Arbitrary-shaped Scene Text Detection; Fangfang Wang et al.; IEEE Transactions on Image Processing; 2022-08-30; full text *
Proposing a Semantic Analysis based Sanskrit Compiler by mapping Sanskrit's linguistic features with Compiler phases; Akshay Chavan et al.; 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC); 2021-12-31; full text *
Semantic Genes and the Formalized Representation of Lexical Meaning; Dan Hu; 2010 International Conference on Asian Language Processing; 2010-12-31; full text *
Research Progress of Deep Convolutional Neural Networks for Image Semantic Segmentation; Qing Chen et al.; Journal of Image and Graphics; 2020-06-16 (No. 06); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection | |
Lee et al. | Simultaneous traffic sign detection and boundary estimation using convolutional neural network | |
CN114972947B (en) | Depth scene text detection method and device based on fuzzy semantic modeling | |
US20200160124A1 (en) | Fine-grained image recognition | |
CN107833213B (en) | Weak supervision object detection method based on false-true value self-adaptive method | |
CN108734210B (en) | Object detection method based on cross-modal multi-scale feature fusion | |
CN109960742B (en) | Local information searching method and device | |
CN111091105A (en) | Remote sensing image target detection method based on new frame regression loss function | |
CN111680678B (en) | Target area identification method, device, equipment and readable storage medium | |
CN107798725B (en) | Android-based two-dimensional house type identification and three-dimensional presentation method | |
CN113239818B (en) | Table cross-modal information extraction method based on segmentation and graph convolution neural network | |
CN113487610B (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN111415373A (en) | Target tracking and segmenting method, system and medium based on twin convolutional network | |
JP2023527615A (en) | Target object detection model training method, target object detection method, device, electronic device, storage medium and computer program | |
CN114463603B (en) | Training method and device for image detection model, electronic equipment and storage medium | |
JP2019185787A (en) | Remote determination of containers in geographical region | |
Cao et al. | Multi angle rotation object detection for remote sensing image based on modified feature pyramid networks | |
CN115115825A (en) | Method and device for detecting object in image, computer equipment and storage medium | |
CN114168768A (en) | Image retrieval method and related equipment | |
Park et al. | Estimating the camera direction of a geotagged image using reference images | |
CN113704276A (en) | Map updating method and device, electronic equipment and computer readable storage medium | |
Zhou et al. | Self-supervised saliency estimation for pixel embedding in road detection | |
CN113610856B (en) | Method and device for training image segmentation model and image segmentation | |
Jia et al. | Sample generation of semi‐automatic pavement crack labelling and robustness in detection of pavement diseases | |
CN114241470A (en) | Natural scene character detection method based on attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||