CN113255669B - Method and system for detecting text of natural scene with any shape

Info

Publication number
CN113255669B
CN113255669B (application CN202110715820.9A)
Authority
CN
China
Prior art keywords
mask
candidate frame
frame
candidate
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110715820.9A
Other languages
Chinese (zh)
Other versions
CN113255669A (en)
Inventor
许信顺
刘新宇
罗昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202110715820.9A
Publication of CN113255669A
Application granted
Publication of CN113255669B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a method and a system for detecting natural scene text of any shape, comprising the following steps: acquiring a text image to be detected; inputting the image to be detected into a trained detection model to obtain final detection boxes; and post-processing the obtained final detection boxes to form text regions. The detection model screens the candidate detection boxes through the classification score and the mask score to obtain the final detection boxes. The invention also designs a mask attention module that connects the mask generation process and the mask quality scoring process; the mask attention module has a positive effect on the prediction of the mask score.

Description

Method and system for detecting text of natural scene with any shape
Technical Field
The invention relates to the technical field of natural scene text detection, in particular to a method and a system for detecting a natural scene text with any shape.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Text appears in every corner of our daily lives as the most direct means of information dissemination. Due to its potential application value in automatic driving, navigation for the blind, information retrieval and other areas, the text understanding task in natural scenes has received more and more attention. In general, the natural scene text understanding task involves two steps: text detection and text recognition. The first step locates text regions; the second step recognizes the content in those text regions. As the precursor of the text understanding task, text detection occupies a crucial position.
Traditional text detection methods are mainly based on connected-region analysis and sliding windows. Both rely on hand-crafted features: they can work in some simple scenes but fail in complex ones. In recent years, thanks to the improvement of computer performance, deep learning has developed rapidly, and text detection technology based on deep learning has improved accordingly. Deep-learning-based text detection can be mainly divided into two categories: regression-based methods and segmentation-based methods. Regression-based methods can detect horizontal or multi-oriented text, while segmentation-based methods can detect text of arbitrary shape, so segmentation-based methods currently dominate natural scene text detection.
One of the mainstream segmentation-based approaches is the instance-segmentation-based method. Such methods typically first use a horizontal candidate box (proposal) to locate a region; a classification score is then generated to determine whether the region enclosed by the candidate box belongs to text, and a segmentation mask is generated to delineate the text region. Such methods typically use the classification score as the only criterion for evaluating the quality of a predicted candidate box, which can lead to serious false positive problems. False positive problems can be divided into three categories:
(1) False positives caused by classification. As shown in fig. 2(a), some regions in natural scenes have features similar to text, such as graffiti on a wall, lines in a book, or cracks on a road surface; these regions may be mistakenly classified as text, resulting in false positive samples.
(2) False positives caused by regression. As shown in fig. 2(b), for long text or text with large character spacing (such as Chinese), a candidate box may contain only a partial text segment, and an incomplete text segment may cause ambiguity in the subsequent recognition module.
(3) False positives caused by segmentation. As shown in fig. 2(c), for irregular text, a horizontal candidate box may contain a large amount of background noise, so the final segmented mask may not represent the text region well.
For systems with high precision requirements, false positive samples can cause immeasurable losses; many systems would rather miss a detection than produce a wrong one, and false positive samples in the detection result can have a fatal influence on the recognition result. For example, in automatic driving, if only the second half of a 'no parking' sign is detected, the vehicle may be parked illegally; in information retrieval, if only the first half of a query word such as 'football' is detected, the retrieved results may differ greatly from the desired ones.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a system for detecting a natural scene text with any shape;
in a first aspect, the invention provides a method for detecting a text in a natural scene in any shape;
the method for detecting the text of the natural scene with any shape comprises the following steps:
acquiring a to-be-detected text image;
inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
In a second aspect, the invention provides a system for detecting text in a natural scene in any shape;
an arbitrarily shaped natural scene text detection system, comprising:
an acquisition module configured to: acquiring a to-be-detected text image;
a detection module configured to: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention analyzes and summarizes the false positive problems of traditional instance-segmentation-based text detection methods, and proposes a mask quality scoring mechanism to suppress false positive samples.
2. Based on the proposed mask quality scoring mechanism, the invention designs a new method for detecting natural scene text of any shape; the proposed method can suppress all types of false positive samples in a simple and uniform manner.
3. The invention designs a mask attention module that connects the mask generation process and the mask quality scoring process; the mask attention module has a positive effect on the prediction of the mask score.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2(a) is a graph of false positives resulting from classification according to the first embodiment;
FIG. 2(b) is a false positive resulting from the regression of the first embodiment;
FIG. 2(c) is a graph of false positives resulting from the segmentation of the first embodiment;
FIG. 2(d) is a sample of the true positive of the first embodiment;
FIG. 3 is a detailed structure of the MAM of the first embodiment;
fig. 4 is a detailed structure of the Mask head of the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in this embodiment are obtained legally and are used on the basis of compliance with laws and regulations and with user consent.
In order to uniformly solve the above false positive problems, the invention designs an arbitrary-shape text detection method based on a mask quality scoring mechanism. Unlike previous methods that use only the classification score of a candidate box to evaluate its quality, in the model designed by the invention whether a candidate box is kept is determined by both its classification score and its mask score. Based on this mechanism, the model can evaluate the quality of candidate boxes more reasonably, and false positive samples are more likely to be found and filtered out. The overall framework of the model is shown in fig. 1. The model designed by the invention consists of four parts: a skeleton network (Backbone), a Region Proposal Network (RPN), a bounding box module (Box head) and a mask module (Mask head). The bounding box module (Box head) comprises two fully-connected layers which are connected in sequence.
Example one
The embodiment provides a method for detecting a natural scene text in any shape;
as shown in fig. 1, the method for detecting a text in a natural scene with an arbitrary shape includes:
s1: acquiring a to-be-detected text image;
s2: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
Further, the S2: inputting the image to be detected into the trained detection model to obtain a final detection frame; the method specifically comprises the following steps:
s21: carrying out feature extraction on the image to be detected;
s22: constructing an initial candidate frame based on the extracted image features;
s23: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame;
s24: generating characteristics of the adjusted candidate frame for the adjusted candidate frame;
expanding the adjusted candidate frame to obtain an expanded candidate frame;
for the expansion candidate frame, generating an expansion candidate frame characteristic;
s25: generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score;
s26: and screening the adjusted candidate frames through the classification score and the mask score to form a final detection frame.
Further, the S21: carrying out feature extraction on the image to be detected; the method specifically comprises the following steps:
A deep residual network (ResNet) is adopted as the skeleton network Backbone to extract the features of the image to be detected, and a Feature Pyramid Network (FPN) is used to enhance the feature representation.
Further, the S22: constructing an initial candidate frame based on the extracted image features; the method specifically comprises the following steps:
The extracted features are input into a Region Proposal Network (RPN) to obtain the constructed initial candidate boxes.
Illustratively, the RPN outputs several horizontal candidate boxes, and a candidate box can be expressed as
$$b = (x, y, w, h)$$
where $(x, y)$ are the coordinates of the upper-left corner of the candidate box $b$, and $w$ and $h$ are its width and height, respectively.
further, the S23: generating an initial candidate frame feature based on the initial candidate frame; the method specifically comprises the following steps:
and generating initial candidate frame features by adopting a candidate region alignment operation RoIAlign based on the initial candidate frame and the extracted image features.
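As an illustration of this step, the following sketch extracts fixed-size candidate-box features with torchvision's roi_align; the feature-map stride (8), the 14x14 output size and the tensor shapes are illustrative assumptions, not values taken from the patent.

```python
import torch
from torchvision.ops import roi_align

# feature map from the backbone/FPN: (batch, channels, H, W)
features = torch.randn(1, 256, 100, 168)

# candidate boxes b as (x1, y1, x2, y2) in image coordinates, one image in the batch
boxes = [torch.tensor([[30.0, 40.0, 180.0, 90.0],
                       [200.0, 60.0, 420.0, 120.0]])]

# 14x14 pooled feature per box; spatial_scale maps image coords to feature-map coords
# (an assumed stride of 8 for this feature level -- an illustrative choice)
box_feats = roi_align(features, boxes, output_size=(14, 14),
                      spatial_scale=1.0 / 8, aligned=True)
print(box_feats.shape)  # torch.Size([2, 256, 14, 14])
```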
Further, the S23: predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; the method specifically comprises the following steps:
reducing the dimension of the initial candidate frame features through two full-connection layers, and simultaneously and respectively sending the dimension-reduced features to a classification branch and a regression branch;
the classification branch is a full-connection layer with two-dimensional vector output, and a classification score is obtained by calculation according to the output of the classification branch;
the regression branch is a full connection layer with four-dimensional vector output, and the initial candidate frame is subjected to frame regression according to the output of the regression branch.
Illustratively, the S23: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; specifically comprises the following steps:
S231: the candidate box $b$ is mapped by the candidate region alignment operation RoIAlign into a fixed-size candidate box feature $f_b$;
S232: the candidate box feature $f_b$ is reduced in dimension through two fully-connected layers;
S233: the reduced feature is passed through a fully-connected layer that outputs a two-dimensional vector $(o_{text}, o_{bg})$, where $o_{text}$ is the output for the text category and $o_{bg}$ is the output for the background category; the classification score $s_{cls}$ of a candidate box is used to determine whether the region enclosed by the candidate box belongs to text, and is computed from these two outputs as a normalized score in $[0, 1]$ (e.g. with a softmax);
S234: the reduced feature is simultaneously passed through another fully-connected layer that outputs a four-dimensional vector $t = (t_x, t_y, t_w, t_h)$ for bounding-box regression; the initial candidate box $b$ is adjusted in position and size by the box regression branch to form the adjusted candidate box $b'$.
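For concreteness, a minimal PyTorch sketch of a Box head of this kind is given below: two shared fully-connected layers followed by a two-way classification branch and a four-way regression branch. The hidden size, the pooled feature size and the softmax-based classification score are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoxHead(nn.Module):
    """Two shared FC layers, then a 2-way classification branch and a 4-way regression branch."""
    def __init__(self, in_channels=256, pool_size=14, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(in_channels * pool_size * pool_size, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.cls_branch = nn.Linear(hidden, 2)   # (o_text, o_bg)
        self.reg_branch = nn.Linear(hidden, 4)   # (t_x, t_y, t_w, t_h)

    def forward(self, box_feats):                # box_feats: (N, C, pool, pool)
        x = box_feats.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        logits = self.cls_branch(x)
        s_cls = F.softmax(logits, dim=1)[:, 0]   # probability of the text class
        deltas = self.reg_branch(x)              # offsets used to adjust the candidate box
        return s_cls, deltas

head = BoxHead()
s_cls, deltas = head(torch.randn(3, 256, 14, 14))
print(s_cls.shape, deltas.shape)  # torch.Size([3]) torch.Size([3, 4])
```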
Further, the S24: generating characteristics of the adjusted candidate frame for the adjusted candidate frame; the method specifically comprises the following steps:
The feature of the adjusted candidate frame is generated using the candidate region alignment operation RoIAlign, based on the adjusted candidate frame and the image features extracted in S21.
Further, the S24: expanding the adjusted candidate frame to obtain an expanded candidate frame; the method specifically comprises the following steps:
An extension operation is applied to the adjusted candidate frame to form an expanded candidate frame.
The extension operation keeps the center position of the candidate box unchanged and expands its width and height by an expansion factor $\alpha$; in practice $\alpha$ is typically set to 2.
Meanwhile, the extension operation also ensures that the candidate frame after expansion does not exceed the image boundary.
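A minimal sketch of such an extension operation is given below: the center is kept fixed, the width and height are scaled by the expansion factor, and the result is clipped to the image boundary. The function name and the (x, y, w, h) box layout are assumptions made for illustration.

```python
def expand_box(x, y, w, h, img_w, img_h, alpha=2.0):
    """Expand a box (x, y, w, h) about its center by factor alpha, clipped to the image."""
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = alpha * w, alpha * h
    x_e = max(0.0, cx - new_w / 2.0)
    y_e = max(0.0, cy - new_h / 2.0)
    # clip the far edge to the image boundary as well
    w_e = min(img_w, cx + new_w / 2.0) - x_e
    h_e = min(img_h, cy + new_h / 2.0) - y_e
    return x_e, y_e, w_e, h_e

# example: a 100x40 box near the right edge of a 640x480 image
print(expand_box(500, 200, 100, 40, img_w=640, img_h=480))
```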
Further, the S24: generating an expansion candidate frame characteristic for the expansion candidate frame; the method specifically comprises the following steps:
and for the expansion candidate frame, generating an expansion candidate frame characteristic by adopting a candidate region alignment operation RoIAlign.
Further, the S25: generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score; the method specifically comprises the following steps:
the adjusted features of the candidate box and the expanded features of the candidate box are input into a Mask module Mask head, and the Mask module Mask head comprises two workflows: mask generating stream and mask score stream; two workflows are connected through a Mask Attention Module (MAM).
The upper workflow in the Mask head of fig. 4 is a Mask generation flow, and outputs a Mask with the adjusted candidate box feature as an input. A mask generation stream comprising: the segmentation mask comprises a convolutional layer C1, a convolutional layer C2 and an deconvolution layer, wherein the characteristics of the deconvolution layer are subjected to dimensionality reduction through a convolutional layer C3 with a 1x1 convolutional kernel, and a segmentation mask is output.
The next workflow in the Mask head of fig. 4 is a Mask score stream, and outputs a Mask score with the adjusted candidate box features and the expanded candidate box features as inputs. The Mask score stream firstly adopts two layers of convolution layers and two Mask Attention Modules (MAM) to fuse the input characteristics, and the Mask score stream comprises: convolutional layer D1, convolutional layer D2, convolutional layer D3, full-link layer FC1, full-link layer FC2, and full-link layer FC 3.
The Mask of the feature and Mask generation stream output after the Mask Attention Module (MAM) in fig. 4 is stacked into the convolutional layer D3, the full connection layer FC1, the full connection layer FC2, and the full connection layer FC3, and outputs a predicted Mask score.
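A rough PyTorch sketch of this two-stream Mask head is given below: the mask generation stream C1-C2-deconvolution-C3 and the mask score stream D1-D3 plus FC1-FC3, with the generated mask stacked onto the score-stream features before D3. The two mask attention modules that link the streams are omitted here (a separate MAM sketch appears later); all channel counts, kernel sizes and the interpolation of the mask to the score-stream resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Two-stream mask head: a mask generation stream and a mask score stream.
    The two Mask Attention Modules linking the streams are omitted for brevity."""
    def __init__(self, in_channels=256):
        super().__init__()
        # mask generation stream: C1, C2, deconvolution, then 1x1 conv C3 -> segmentation mask
        self.c1 = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.c2 = nn.Conv2d(256, 256, 3, padding=1)
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.c3 = nn.Conv2d(256, 1, 1)
        # mask score stream: D1, D2 fuse adjusted + expanded box features; D3 + FC1-FC3 -> score
        self.d1 = nn.Conv2d(2 * in_channels, 256, 3, padding=1)
        self.d2 = nn.Conv2d(256, 256, 3, padding=1)
        self.d3 = nn.Conv2d(256 + 1, 256, 3, stride=2, padding=1)
        self.fc = nn.Sequential(nn.Linear(256 * 7 * 7, 1024), nn.ReLU(),
                                nn.Linear(1024, 1024), nn.ReLU(),
                                nn.Linear(1024, 1))

    def forward(self, f_box, f_ext):                      # both (N, C, 14, 14)
        g = F.relu(self.c2(F.relu(self.c1(f_box))))       # mask generation stream feature
        s = F.relu(self.d2(F.relu(self.d1(torch.cat([f_box, f_ext], dim=1)))))
        mask = torch.sigmoid(self.c3(F.relu(self.deconv(g))))            # (N, 1, 28, 28)
        m_small = F.interpolate(mask, size=s.shape[-2:], mode="bilinear",
                                align_corners=False)
        x = F.relu(self.d3(torch.cat([s, m_small], dim=1)))  # stack mask with score-stream feature
        score = torch.sigmoid(self.fc(x.flatten(1))).squeeze(1)          # predicted mask score
        return mask, score

head = MaskHead()
m, sc = head(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))
print(m.shape, sc.shape)  # torch.Size([2, 1, 28, 28]) torch.Size([2])
```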
Further, the S26: screening the adjusted candidate frame through the classification score and the mask score to obtain a final detection frame; the method specifically comprises the following steps:
S261: all adjusted candidate boxes are first deduplicated by non-maximum suppression (NMS);
S262: the quality of a candidate box is measured by the candidate-box score $s_{final}$, whose specific calculation formula is
$$s_{final} = s_{cls} \times s_{mask}$$
where $s_{cls}$ is the classification score and $s_{mask}$ is the predicted mask score; candidate boxes with $s_{final}$ smaller than 0.5 are filtered out, and the retained candidate boxes form the final detection boxes;
S263: the largest connected region in the mask of each retained detection box is selected as the final detection result.
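A minimal sketch of this screening step is given below, using torchvision's non-maximum suppression; the NMS IoU threshold and the use of the classification score for NMS are assumptions, while the 0.5 threshold on the candidate-box score comes from the text above.

```python
import torch
from torchvision.ops import nms

def screen_candidates(boxes, s_cls, s_mask, nms_iou=0.5, score_thresh=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); s_cls, s_mask: (N,) scores in [0, 1]."""
    keep = nms(boxes, s_cls, nms_iou)            # remove duplicate candidate boxes first
    boxes, s_cls, s_mask = boxes[keep], s_cls[keep], s_mask[keep]
    s_final = s_cls * s_mask                     # candidate-box score = product of the two scores
    keep = s_final >= score_thresh               # filter out boxes scoring below 0.5
    return boxes[keep], s_final[keep]

boxes = torch.tensor([[10., 10., 110., 40.], [12., 11., 108., 42.], [200., 50., 300., 90.]])
s_cls = torch.tensor([0.95, 0.90, 0.92])
s_mask = torch.tensor([0.80, 0.75, 0.30])        # the last box is a likely false positive
print(screen_candidates(boxes, s_cls, s_mask))
```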
Further, the model structure of the detection model includes:
a skeleton network Backbone, which is used for receiving the input text image to be detected;
the output end of the skeleton network Backbone is connected with the input end of a Region Proposal Network (RPN);
the output end of the Region Proposal Network (RPN) is connected with the input end of the RoIAlign layer; the output end of the RoIAlign layer is connected with the input end of the frame module Box head; the frame module Box head comprises two fully-connected layers which are connected in sequence;
the output end of the RoIAlign layer is also connected with the input end of the Mask head module.
Further, the Mask head module comprises: two parallel working branches: a first branch and a second branch;
wherein the first branch includes: a convolutional layer C1 and a convolutional layer C2 connected in sequence; the input end of convolutional layer C1 is used for inputting the adjusted candidate frame feature;
wherein the second branch includes: a convolutional layer D1 and a convolutional layer D2 connected in sequence; the input end of convolutional layer D1 is used for inputting the concatenation of the adjusted candidate frame feature and the expanded candidate frame feature;
the output terminal of convolutional layer C2 is connected to the first input terminal of the first Mask Attention Module (MAM);
the output end of the convolutional layer D2 is connected with the second input end of the first mask attention module;
the first output end of the first mask attention module is connected with the first input end of the second mask attention module;
the second output end of the first mask attention module is connected with the second input end of the second mask attention module;
a first output end of the second mask attention module is connected with an input end of the deconvolution layer, an output end of the deconvolution layer is connected with an input end of the convolution layer C3, and an output end of the convolution layer C3 generates a predicted mask;
the second output end of the second mask attention module is connected with the input end of the convolutional layer D3, the output end of the convolutional layer C3 is connected with the input end of the convolutional layer D3, the characteristics of the output end of the convolutional layer D3 are connected with three full-connection layers after being subjected to size adjustment, and the last full-connection layer outputs a mask score.
Illustratively, the Mask head module is divided into two workflows: the mask generates a stream and a masked score stream. The mask generation flow generates a corresponding mask for the candidate box feature by using the adjusted candidate box feature, and the mask scoring flow evaluates the mask quality by using the adjusted candidate box feature and the expanded candidate box feature, wherein the expanded candidate box comprises more peripheral information which is helpful for predicting the mask quality.
Illustratively, the Mask head module specifically works as follows:
Step (1): the adjusted candidate box feature $f_{b'}$ passes through two convolutional layers to form the mask generation stream feature $F_{gen}$;
Step (2): the adjusted candidate box feature $f_{b'}$ and the expanded candidate box feature $f_{e}$ are concatenated and passed through two convolutional layers to form the mask score stream feature $F_{score}$;
Step (3): $F_{gen}$ and $F_{score}$ are fed into the first Mask Attention Module (MAM); the first mask attention module MAM causes the mask score stream to focus on the regions contained in the mask, so that the mask quality can be predicted more accurately; the detailed structure of the MAM is shown in fig. 3;
Step (4): the features of the two workflows pass through the second mask attention module;
Step (5): the mask generation stream feature passes through a deconvolution layer and a convolutional layer to generate the predicted mask $M$;
Step (6): the mask score stream feature and $M$ are stacked and passed through a convolutional layer and three fully-connected layers to output the predicted mask score $s_{mask}$.
The specific process of the step (3) is as follows:
Step (3.1): the mask generation stream feature $F_{gen}$ passes through a convolutional layer to generate an intermediate (staged) mask that serves as an attention map:
$$A = \mathrm{Conv}_{a}(F_{gen})$$
where $\mathrm{Conv}_{a}$ denotes the convolutional layer and $A$ is the resulting attention map; regions of $A$ whose response values are higher than a set threshold represent the regions attended to in the segmentation process, while regions whose response values are lower than the threshold represent regions not attended to in the segmentation process;
Step (3.2): the representation of the regions of interest indicated by $A$ is enhanced on the mask score stream feature; the specific operation is
$$F' = \mathrm{expand}(A) \odot F_{score}$$
where $\mathrm{expand}(\cdot)$ is a function for expanding the number of channels of a feature map; in actual operation it copies $A$ so that its channel dimension is extended from a single channel to the number of channels of $F_{score}$, and the element-wise (dot) product with the expanded attention map yields the enhanced feature $F'$;
Step (3.3): the response values of the regions that $A$ does not attend to are usually very low, so in $F'$ the responses of these unattended regions are greatly suppressed; in order to prevent the loss of information about the whole region, the original feature is added back:
$$F'' = F' + F_{score}$$
Step (3.4): $F_{gen}$ and $F''$ each pass through a convolutional layer that fuses the features to obtain the output values of the module.
Wherein, the internal structures of the first mask attention module and the second mask attention module are the same.
Further, the first masked attention module includes:
convolutional layer E1; an input of the convolutional layer E1 is for connection with a first mask attention module first input; the output end of the convolutional layer E1 is used for being connected with a first output end of a first mask attention module;
a convolutional layer F1; an input of the convolutional layer F1 is for connection with a first mask attention module first input; the output end of the convolutional layer F1 is used for being connected with the input end of the multiplier;
the input end of the multiplier is also connected with the second input end of the first mask attention module; the output end of the multiplier is connected with the input end of the adder, and the input end of the adder is also connected with the second input end of the first mask attention module; the output of the adder is further adapted to be coupled to a second output of the first mask attention module via convolutional layer G1.
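A minimal PyTorch sketch of one such mask attention module, following the structure above (E1 on the generation-stream path, F1 producing the attention map that multiplies the score-stream feature, an addition of the original feature, and G1 fusing the result), is given below; the channel counts, kernel sizes and activation functions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskAttentionModule(nn.Module):
    """Links the mask generation stream and the mask score stream (a sketch)."""
    def __init__(self, channels=256):
        super().__init__()
        self.e1 = nn.Conv2d(channels, channels, 3, padding=1)  # generation-stream output path
        self.f1 = nn.Conv2d(channels, 1, 1)                    # staged mask used as attention map
        self.g1 = nn.Conv2d(channels, channels, 3, padding=1)  # fuses the enhanced score feature

    def forward(self, f_gen, f_score):
        attn = torch.sigmoid(self.f1(f_gen))            # A: single-channel attention map
        enhanced = attn.expand_as(f_score) * f_score    # expand channels, element-wise product
        enhanced = enhanced + f_score                   # add the original feature back
        out_gen = torch.relu(self.e1(f_gen))            # first output: generation-stream feature
        out_score = torch.relu(self.g1(enhanced))       # second output: score-stream feature
        return out_gen, out_score, attn

mam = MaskAttentionModule()
g, s, a = mam(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))
print(g.shape, s.shape, a.shape)
```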
Further, the training of the trained detection model comprises:
sa 1: constructing a training set, wherein the training set is an image of a known candidate frame label;
sa 2: inputting the training set into the detection model, training the detection model,
sa 3: carrying out feature extraction on the image of the known candidate frame tag;
sa 4: constructing an initial candidate frame based on the extracted features;
sa 5: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, generating a four-dimensional regression bias vector for the initial candidate frame based on the characteristics of the initial candidate frame;
sa 6: generating characteristics of the initial candidate frame for the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expansion candidate frame characteristic for the expansion candidate frame;
sa 7: generating a mask based on the initial candidate box feature and the expanded candidate box feature; evaluating the mask quality to obtain a mask score;
sa 8: and calculating a loss function according to the generated classification score, the regression bias vector, the mask score and the attention map generated in the step Sa7, and optimizing network parameters through back propagation to obtain a trained detection model.
Exemplarily, the Sa6: for the initial candidate frame, generating characteristics of the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expanded candidate frame characteristic; specifically comprises the following steps:
Sa61: the candidate box $b$ is mapped by RoIAlign into a fixed-size candidate box feature $f_b$;
Sa62: $b$ is expanded by the extension operation into the expanded candidate box $b_e$; the extension operation is specifically
$$b_e = \left(x - \tfrac{(\alpha - 1)w}{2},\; y - \tfrac{(\alpha - 1)h}{2},\; \alpha w,\; \alpha h\right)$$
where the first two components are the coordinates of the upper-left corner of $b_e$, the last two components are its width and height, respectively, and $\alpha$ represents the expansion factor, i.e. the multiple by which the candidate box is expanded;
Sa63: $b_e$ is likewise mapped by RoIAlign into a fixed-size expanded candidate box feature $f_e$.
Further, in Sa8, the loss function of the model is calculated and the parameters of the entire model are optimized by back propagation. The specific form of the loss function is as follows:
$$L = L_{rpn} + \lambda_1 L_{cls} + \lambda_2 L_{reg} + \lambda_3 L_{mask} + \lambda_4 L_{mam} + \lambda_5 L_{ms}$$
where $L$ is the loss function of the model and consists of six parts, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are used for balancing the importance between the loss functions;
$L_{rpn}$, $L_{cls}$ and $L_{reg}$ are the losses of the RPN network and the Box head and take the same form as in prior instance-segmentation-based methods: $L_{rpn}$ involves two parts, a two-class log loss and a bounding-box regression smooth-$L_1$ loss; $L_{cls}$ adopts the cross-entropy form and $L_{reg}$ adopts the smooth-$L_1$ form; they are not described in detail here.
$L_{mask}$ is the mask generation stream loss of the Mask head and adopts the cross-entropy form:
$$L_{mask} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log m_i + (1 - y_i)\log(1 - m_i)\Big]$$
where $m_i$ denotes the value of the $i$-th pixel in the mask $M$, $y_i$ denotes the mask label of the $i$-th pixel (obtained from the real labels of the training data), and $N$ denotes the total number of pixels in $M$.
$L_{mam}$ is the loss function for the mask attention maps in the MAMs, and its specific form is
$$L_{mam} = L_{mam}^{(1)} + L_{mam}^{(2)}$$
where $L_{mam}^{(1)}$ and $L_{mam}^{(2)}$ denote the losses of the two MAMs, respectively, and the two losses take the same form. In each, the loss is computed pixel by pixel between the mask attention map $A$ and a mask label: $a_i$ denotes the value of the $i$-th pixel of the mask attention map $A$, and $y^{a}_{i}$ denotes the mask label of that pixel, which is obtained from the mask label of the candidate box by interpolation, i.e. by resizing the full-resolution mask label to the resolution of the attention map.
$L_{ms}$ is the mask scoring loss of the Mask head and adopts the smooth-$L_1$ form:
$$L_{ms} = \mathrm{smooth}_{L_1}\!\left(s_{mask} - s^{*}_{mask}\right)$$
where $s_{mask}$ is the mask score predicted by the model and $s^{*}_{mask}$ is the true mask score, defined as the intersection-over-union between the generated mask and the true mask. $s^{*}_{mask}$ can be obtained through the following process:
if a candidate box has an intersection-over-union greater than 0.2 with a true horizontal text box, the true mask score of the candidate box is calculated as
$$s^{*}_{mask} = \mathrm{IoU}\!\left(\mathrm{bin}(M),\, M^{*}\right)$$
where the generated mask $M$ is first binarized, $\mathrm{bin}(M)$ denotes the binarized mask and $M^{*}$ denotes the true mask;
if a candidate box does not intersect any real text box, or its intersection-over-union with every real text box is less than 0.2, its true mask score $s^{*}_{mask}$ is directly set to 0.
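A minimal sketch of how the true mask score and the mask scoring loss could be computed is given below; the binarization threshold of 0.5 and the use of PyTorch's smooth-L1 loss function are illustrative assumptions, while the 0.2 box-IoU threshold comes from the text above.

```python
import torch
import torch.nn.functional as F

def target_mask_score(pred_mask, gt_mask, box_iou, box_iou_thresh=0.2, bin_thresh=0.5):
    """pred_mask, gt_mask: (H, W) tensors; gt_mask is binary; box_iou: IoU of the
    candidate box with its best-matching ground-truth horizontal text box."""
    if box_iou <= box_iou_thresh:
        return torch.tensor(0.0)                 # no sufficiently overlapping text box
    m = (pred_mask >= bin_thresh).float()        # binarize the generated mask
    inter = (m * gt_mask).sum()
    union = ((m + gt_mask) > 0).float().sum()
    return inter / union.clamp(min=1.0)          # IoU between binarized mask and true mask

pred = torch.rand(28, 28)
gt = (torch.rand(28, 28) > 0.5).float()
s_star = target_mask_score(pred, gt, box_iou=0.6)
s_pred = torch.tensor(0.55)
loss_ms = F.smooth_l1_loss(s_pred, s_star)       # mask scoring loss (smooth-L1 form)
print(s_star.item(), loss_ms.item())
```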
Fig. 2(a)-2(d) show three types of false positive samples and one true positive sample. The rectangular box represents the final detection box predicted by the model; the shaded portion in each box represents the segmentation mask of that detection box. Here cls-score is the classification score predicted by the model and ms-score is the mask score predicted by the model. Traditional methods screen candidate boxes only by the classification score, and the three types of false positive samples usually have high classification scores, so they are retained; the method proposed by the invention screens candidate boxes by both the classification score and the mask score: the mask scores of the three types of false positive samples are very low, so they can be filtered out, while both scores of the true positive sample are high, so it is retained in the final detection result.
Training process:
step (1): acquiring original pictures of a training set and original labels of text regions in each picture (generally, horizontal or multidirectional texts are labeled by quadrangles, and irregular texts are labeled by polygons), and generating mask labels and candidate frame labels of the text regions by using the original labels. Marking all pixels inside the quadrangle or the polygon as a text category (namely, marking the pixel value as 1), and marking all pixels outside the quadrangle or the polygon as a background category (namely, marking the pixel value as 0), and forming a text region mask; taking the minimum horizontal frame capable of surrounding the quadrangle or the polygon as a candidate frame label;
step (2): sending the pictures into the Backbone to extract features and constructing initial candidate boxes through the RPN;
step (3): the initial candidate box features and the expanded candidate box features are sent to the Box head and the Mask head to generate the classification score $s_{cls}$, the box offset $t$, the segmentation mask $M$ and the mask score $s_{mask}$; the MAM modules in the Mask head output the attention maps $A$;
And (4): calculating a loss function of the model, and optimizing the whole model through back propagation;
and (5): after the whole training set trains K epochs, the fixed model stores network parameters, and K is a positive integer in the range of 30-40.
The testing process comprises the following steps:
step (1): acquiring a picture to be tested;
step (2): sending the picture into the Backbone to extract features and constructing initial candidate boxes through the RPN;
step (3): the initial candidate box features are fed into the Box head to generate the classification score $s_{cls}$ and the box offset $t$, and $t$ is used to adjust the original candidate box;
step (4): the adjusted candidate box features and the expanded adjusted candidate box features are sent into the Mask head to generate the segmentation mask $M$ and the mask score $s_{mask}$;
step (5): non-maximum suppression is used to filter out duplicate candidate boxes; the classification score $s_{cls}$ and the mask score $s_{mask}$ are then used to calculate the candidate box score $s_{final}$, and candidate boxes with $s_{final}$ smaller than 0.5 are filtered out;
step (6): the largest connected region in the mask of each retained candidate box is selected as the final detection result.
Example two
The embodiment provides a natural scene text detection system with any shape;
an arbitrarily shaped natural scene text detection system, comprising:
an acquisition module configured to: acquiring a to-be-detected text image;
a detection module configured to: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
It should be noted that the above-mentioned acquiring module and detecting module correspond to steps S1 to S2 in the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiment is merely illustrative; the division of the above modules is merely a logical division, and in actual implementation there may be other divisions: for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not executed.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. The method for detecting the text of the natural scene in any shape is characterized by comprising the following steps:
acquiring a to-be-detected text image;
inputting the image to be detected into the trained detection model to obtain a final detection frame, wherein the method specifically comprises the following steps:
carrying out feature extraction on the image to be detected;
constructing an initial candidate frame based on the extracted image features;
generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame;
generating characteristics of the adjusted candidate frame for the adjusted candidate frame; expanding the adjusted candidate frame to obtain an expanded candidate frame; for the expansion candidate frame, generating an expansion candidate frame characteristic;
generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score;
screening the adjusted candidate frame by the product of the classification score and the mask score to form a final detection frame;
carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the product of the classification score and the mask score to obtain the final detection frame.
2. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 1, wherein the classification score of the candidate box is predicted based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; the method specifically comprises the following steps:
reducing the dimension of the initial candidate frame features through two full-connection layers, and simultaneously and respectively sending the dimension-reduced features to a classification branch and a regression branch;
the classification branch is a full-connection layer with two-dimensional vector output, and a classification score is obtained by calculation according to the output of the classification branch;
the regression branch is a full connection layer with four-dimensional vector output, and the initial candidate frame is subjected to frame regression according to the output of the regression branch.
3. The method for detecting text in a natural scene with an arbitrary shape as set forth in claim 1,
generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score; the method specifically comprises the following steps:
the adjusted features of the candidate box and the expanded features of the candidate box are input into a Mask module Mask head, and the Mask module Mask head comprises two workflows: mask generating stream and mask score stream;
a mask generation stream, which takes the adjusted candidate box characteristics as input and outputs a mask;
and the mask score stream takes the adjusted candidate box characteristics and the expanded candidate box characteristics as input and outputs a mask score.
4. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 1, wherein the model structure of the detection model comprises:
a skeleton network Backbone, which is used for receiving the input text image to be detected;
the output end of the Backbone network Backbone is connected with the input end of the candidate region generation network RPN;
the output end of the candidate region generation network RPN is connected with the input end of a RoIAlign layer; the output end of the RoIAlign layer is connected with the input end of the frame module Box head; the frame module Box head comprises two fully-connected layers which are connected in sequence;
the output end of the RoIAlign layer is also connected with the input end of the Mask head module.
5. The method for detecting the text of the natural scene with the arbitrary shape as claimed in claim 4, wherein the Mask head module comprises: two parallel working branches: a first branch and a second branch;
wherein the first branch includes: a convolutional layer C1 and a convolutional layer C2 connected in sequence; the input end of convolutional layer C1 is used for inputting the adjusted candidate frame feature;
wherein the second branch includes: a convolutional layer D1 and a convolutional layer D2 connected in sequence; the input end of convolutional layer D1 is used for inputting the concatenation of the adjusted candidate frame feature and the expanded candidate frame feature;
the output end of the convolutional layer C2 is connected to the first input end of the first mask attention module MAM;
the output end of the convolutional layer D2 is connected with the second input end of the first mask attention module;
the first output end of the first mask attention module is connected with the first input end of the second mask attention module;
the second output end of the first mask attention module is connected with the second input end of the second mask attention module;
a first output end of the second mask attention module is connected with an input end of the deconvolution layer, an output end of the deconvolution layer is connected with an input end of the convolution layer C3, and an output end of the convolution layer C3 generates a predicted mask;
the second output end of the second mask attention module is connected with the input end of the convolutional layer D3, the output end of the convolutional layer C3 is connected with the input end of the convolutional layer D3, the characteristics of the output end of the convolutional layer D3 are connected with three full-connection layers after being subjected to size adjustment, and the last full-connection layer outputs a mask score.
6. The method for detecting the text of the natural scene with the arbitrary shape as claimed in claim 4, wherein the Mask head module specifically works as follows:
the adjusted candidate box feature passes through two convolutional layers to form the mask generation stream feature;
the adjusted candidate box feature and the expanded candidate box feature are concatenated and passed through two convolutional layers to form the mask score stream feature;
the mask generation stream feature and the mask score stream feature are fed into the first mask attention module; the first mask attention module causes the mask score stream to focus on the regions contained in the mask;
the features of the two workflows pass through the second mask attention module;
the mask generation stream feature passes through a deconvolution layer and a convolutional layer to generate the predicted mask;
the mask score stream feature and the predicted mask are stacked and passed through a convolutional layer and three fully-connected layers to output the predicted mask score.
7. The method for detecting text in an arbitrarily-shaped natural scene as recited in claim 5, wherein the first masking attention module comprises:
convolutional layer E1; an input of the convolutional layer E1 is for connection with a first mask attention module first input; the output end of the convolutional layer E1 is used for being connected with a first output end of a first mask attention module;
a convolutional layer F1; an input of the convolutional layer F1 is for connection with a first mask attention module first input; the output end of the convolutional layer F1 is used for being connected with the input end of the multiplier;
the input end of the multiplier is also connected with the second input end of the first mask attention module; the output end of the multiplier is connected with the input end of the adder, and the input end of the adder is also connected with the second input end of the first mask attention module; the output of the adder is further adapted to be coupled to a second output of the first mask attention module via convolutional layer G1.
8. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 1, wherein the training step of the trained detection model comprises:
constructing a training set, wherein the training set is an image of a known candidate frame label;
inputting the training set into the detection model, training the detection model,
carrying out feature extraction on the image of the known candidate frame tag;
constructing an initial candidate frame based on the extracted features;
generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, generating a four-dimensional regression bias vector for the initial candidate frame based on the characteristics of the initial candidate frame;
generating characteristics of the initial candidate frame for the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expansion candidate frame characteristic for the expansion candidate frame;
generating a mask based on the initial candidate box feature and the expanded candidate box feature; evaluating the mask quality to obtain a mask score;
and calculating a loss function according to the generated classification score, the regression bias vector, the mask score and the generated attention map, and obtaining a trained candidate frame screening model by reversely propagating and optimizing network parameters.
9. A system for detecting arbitrary-shape natural scene text, characterized by comprising:
an acquisition module configured to: acquire a text image to be detected;
a detection module configured to: input the image to be detected into the trained detection model to obtain final detection frames, which specifically comprises:
performing feature extraction on the image to be detected;
constructing an initial candidate frame based on the extracted image features;
generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate frame based on the initial candidate frame feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and position of the initial candidate frame to obtain an adjusted candidate frame;
generating an adjusted candidate frame feature for the adjusted candidate frame; expanding the adjusted candidate frame to obtain an expanded candidate frame; generating an expanded candidate frame feature for the expanded candidate frame;
generating a mask for the adjusted candidate frame based on the adjusted candidate frame feature and the expanded candidate frame feature; evaluating the mask quality to obtain a mask score;
screening the adjusted candidate frames by the product of the classification score and the mask score to form the final detection frames;
performing post-processing on the obtained final detection frames to form text regions;
and the detection model screens the candidate detection frames through the product of the classification score and the mask score to obtain the final detection frames.
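The screening step in claim 9 keeps an adjusted candidate frame only when the product of its classification score and mask score is high enough. A minimal sketch, assuming NumPy arrays and a hypothetical threshold value:

```python
import numpy as np

def screen_candidates(boxes, cls_scores, mask_scores, threshold=0.5):
    """Keep candidate frames whose combined score
    (classification score x mask score) reaches the threshold;
    the threshold of 0.5 is an assumption, not taken from the patent."""
    combined = np.asarray(cls_scores) * np.asarray(mask_scores)
    keep = combined >= threshold
    kept_boxes = [box for box, flag in zip(boxes, keep) if flag]
    return kept_boxes, combined[keep]
```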
CN202110715820.9A 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape Active CN113255669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110715820.9A CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110715820.9A CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Publications (2)

Publication Number Publication Date
CN113255669A CN113255669A (en) 2021-08-13
CN113255669B (en) 2021-10-01

Family

ID=77189947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110715820.9A Active CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Country Status (1)

Country Link
CN (1) CN113255669B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN111754531A (en) * 2020-07-08 2020-10-09 深延科技(北京)有限公司 Image instance segmentation method and device
CN111950545A (en) * 2020-07-23 2020-11-17 南京大学 Scene text detection method based on MSNDET and space division
CN112183545A (en) * 2020-09-29 2021-01-05 佛山市南海区广工大数控装备协同创新研究院 Method for recognizing natural scene text in any shape
CN112446356A (en) * 2020-12-15 2021-03-05 西北工业大学 Method for detecting text with any shape in natural scene based on multiple polar coordinates

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275308B2 (en) * 2013-05-31 2016-03-01 Google Inc. Object detection using deep neural networks
JP2020181255A (en) * 2019-04-23 2020-11-05 国立大学法人 東京大学 Image analysis device, image analysis method, and image analysis program
CN110287960B (en) * 2019-07-02 2021-12-10 中国科学院信息工程研究所 Method for detecting and identifying curve characters in natural scene image
CN110895695B (en) * 2019-07-31 2023-02-24 上海海事大学 Deep learning network for character segmentation of text picture and segmentation method
CN110807422B (en) * 2019-10-31 2023-05-23 华南理工大学 Natural scene text detection method based on deep learning
CN112749704A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Text region detection method and device and server
CN112183322B (en) * 2020-09-27 2022-07-19 成都数之联科技股份有限公司 Text detection and correction method for any shape
CN112163634B (en) * 2020-10-14 2023-09-05 平安科技(深圳)有限公司 Sample screening method and device for instance segmentation model, computer equipment and medium
AU2020103585A4 (en) * 2020-11-20 2021-02-04 Sonia Ahsan CDN- Object Detection System: Object Detection System with Image Classification and Deep Neural Networks
CN112861855A (en) * 2021-02-02 2021-05-28 华南农业大学 Group-raising pig instance segmentation method based on confrontation network model
CN112989927B (en) * 2021-02-03 2024-03-05 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training

Also Published As

Publication number Publication date
CN113255669A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN110348445B (en) Instance segmentation method fusing void convolution and edge information
Varghese et al. ChangeNet: A deep learning architecture for visual change detection
CN109614979B (en) Data augmentation method and image classification method based on selection and generation
CN101971190B (en) Real-time body segmentation system
CN111461212B (en) Compression method for point cloud target detection model
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
Aggarwal et al. A robust method to authenticate car license plates using segmentation and ROI based approach
CN111767927A (en) Lightweight license plate recognition method and system based on full convolution network
CN114648665A (en) Weak supervision target detection method and system
CN112990282B (en) Classification method and device for fine-granularity small sample images
CN103198479A (en) SAR image segmentation method based on semantic information classification
CN110705412A (en) Video target detection method based on motion history image
CN112287941A (en) License plate recognition method based on automatic character region perception
CN112507876A (en) Wired table picture analysis method and device based on semantic segmentation
CN113255669B (en) Method and system for detecting text of natural scene with any shape
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
Tao et al. Contour-based smoky vehicle detection from surveillance video for alarm systems
CN113763364B (en) Image defect detection method based on convolutional neural network
CN114330234A (en) Layout structure analysis method and device, electronic equipment and storage medium
CN111832390B (en) Handwritten ancient character detection method
CN113496480A (en) Method for detecting weld image defects
CN110363198B (en) Neural network weight matrix splitting and combining method
CN111178275A (en) Fire detection method based on convolutional neural network
Li et al. An improved PCB defect detector based on feature pyramid networks
CN117115824A (en) Visual text detection method based on stroke region segmentation strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant