CN113033559A - Text detection method and device based on target detection and storage medium - Google Patents

Text detection method and device based on target detection and storage medium

Info

Publication number
CN113033559A
Authority
CN
China
Prior art keywords
candidate
text
character
image
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110417152.1A
Other languages
Chinese (zh)
Inventor
杨洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huahan Weiye Technology Co ltd
Original Assignee
Shenzhen Huahan Weiye Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huahan Weiye Technology Co ltd
Priority to CN202110417152.1A
Publication of CN113033559A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a text detection method and device based on target detection, and a storage medium. The text detection method comprises: acquiring an image to be detected; constructing, according to the image to be detected, candidate frames corresponding to each character, extracting feature information of the candidate frames, and generating candidate character components corresponding to each character; predicting connected regions of the candidate character components corresponding to each character to obtain the connection relations between the candidate character components; and merging the interconnected candidate character components, fusing them into candidate text lines, and generating text detection results for the candidate text lines. Because image features are extracted by constructing candidate frames and candidate character components are generated for each character, every character in a complex background image can be located individually by means of the candidate character components, so that candidate text lines can be fused from the components and the accuracy of text line positioning is enhanced.

Description

Text detection method and device based on target detection and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a text detection method and device based on target detection and a storage medium.
Background
Text detection in natural scenes has extremely wide application in real life and is an important precondition for text recognition tasks such as text retrieval, guideboard recognition and intelligent test paper correction; it can also be applied to fields such as image retrieval, machine translation and automatic driving. Owing to various uncontrollable interference factors in natural scenes, such as shadows, shooting angles and occlusion by foreign objects, as well as certain inherent attributes of text, such as artistic, deformed or incomplete characters, detecting text in natural scenes remains a difficult task. In order to improve efficiency in automated office work, automated identification and automated manufacturing, automatically generating documents from images is an important application; for example, automatically generating bill information from bill images can greatly reduce the labor intensity of bank workers.
At present, natural scene text detection is made difficult by several main factors: (1) diversity and variability: compared with the text content of a standard document, natural scene text may be multi-scale and multi-lingual and may differ in shape, direction, proportion and color, and these variations make detection harder; (2) complex backgrounds: scene text may appear against any background, including traffic signs, bricks, bushes or fences, and such backgrounds may have characteristics very similar to text, becoming noise that interferes with text judgment; moreover, occlusion by foreign objects may cause text to be lost, which also leads to potential detection errors; (3) uneven imaging quality: a text image may be distorted or out of focus because of the shooting angle or distance, or contain noise and shadows because of the illumination at shooting time, so the imaging quality of the text image cannot be guaranteed.
Natural scene text detection methods for complex backgrounds can be roughly divided into three categories: sliding-window-based methods, connected-component-based methods, and deep learning methods. Sliding-window-based methods do not fully exploit the inherent characteristics of text, so many non-text candidate regions are extracted besides the true text regions, the subsequent filtering of non-text regions becomes extremely complex, and the methods are sensitive to external factors such as illumination changes and image blur. Connected-component-based methods locate text information in multiple stages: MSER regions are first extracted from the R, G and B channels of the image, a classifier is then trained to remove repeated and non-text MSER regions to obtain candidate MSER regions, and the candidate text regions are finally connected into text bars and de-duplicated. Although such methods can detect and locate text regions, the procedure is cumbersome and split into multiple stages, the detection effect depends on the quality of the candidate regions generated by the MSERs, and the methods are constrained by hand-designed feature extraction. Deep learning methods mainly process rectangular text lines by regression; when the text is curved, the predicted detection box cannot accurately cover the whole text area, and for long text lines, frames are missed or predictions are incomplete once the aspect ratio of the text line exceeds the preset prediction threshold.
Disclosure of Invention
The invention mainly solves the technical problem of how to accurately detect text lines from a complex background image. In order to solve the technical problem, the present application provides a text detection method and apparatus based on target detection, and a storage medium.
In a first aspect, an embodiment provides a text detection method based on target detection, including: acquiring an image to be detected, wherein the image to be detected comprises one or more characters; constructing candidate frames corresponding to the characters according to the image to be detected, extracting characteristic information of the candidate frames and generating candidate character components corresponding to the characters; predicting a connected region of candidate character components respectively corresponding to each character to obtain a connection relation between the candidate character components; combining a plurality of candidate character components which are connected with each other, and fusing to obtain candidate text lines; and generating a text detection result of the candidate text line.
The constructing of candidate frames corresponding to the characters respectively according to the image to be detected, the extracting of feature information of the candidate frames and the generating of candidate character components corresponding to the characters respectively comprises the following steps: continuously and repeatedly performing convolution and downsampling operation on the image to be detected, and obtaining a first feature map with a corresponding scale after each operation; performing convolution and up-sampling operation on the first feature map with the minimum scale for multiple times continuously, and obtaining a second feature map with a corresponding scale after each operation; carrying out feature summation operation on the first feature map and the second feature map with the same scale to obtain a third feature map under the corresponding scale; constructing a candidate frame corresponding to each character on the third feature map of each scale, and obtaining feature information of the candidate frame through regression processing; mapping the characteristic information of the candidate frame corresponding to each character to the image to be detected, and carrying out non-maximum suppression on the mapping result to obtain candidate character components corresponding to each character respectively; the candidate character component includes location information and a confidence score for the character.
The predicting of the connected regions of the candidate character components respectively corresponding to each character to obtain the connection relations between the candidate character components comprises the following steps: acquiring the position information and confidence scores of the candidate character components corresponding to each character formed by the third feature map of each scale, wherein the position information of a candidate character component is represented as (x, y, w, h, θ) and its confidence score is represented as S_char, x and y being the coordinates or offsets on the X axis and Y axis respectively, w and h being the width and height in pixels respectively, and θ being the rotation angle relative to the X axis; predicting, according to the position information of the candidate character component corresponding to each character formed by the third feature map of each scale, a first connection score between that candidate character component and the other candidate character components in its surrounding neighborhood, the first connection score being expressed as S_neighbor; predicting, according to the position information of the candidate character components corresponding to each character formed by the third feature map of each scale, a second connection score between that candidate character component and the candidate character components formed by the third feature maps of other scales, the second connection score being expressed as S_level; and determining the connection relations between the candidate character components corresponding to each character by using the first connection score and the second connection score.
The merging of the interconnected candidate character components and fusing them into candidate text lines comprises the following steps: forming a candidate set C from the interconnected candidate character components, with C = {c_i}, where c_i is the i-th connected candidate character component, its corresponding rotation angle is θ_i, and i ranges from 1 to n; performing linear regression on a process parameter b according to the least squares method to obtain a regression line, wherein the regression line is represented as [formula], the objective function of the least squares method is expressed as [formula], in which [formula] and a_i is a preset weight coefficient; projecting onto the regression line according to the position information of the candidate character component corresponding to each character, and acquiring the projection points on the two sides, denoted (x_l, y_l) and (x_r, y_r) respectively; and determining the candidate text line from the acquired projection points on the two sides, where the position information of the candidate text line is represented as [formula], and w_i and h_i are the width and height pixels of the i-th candidate character component respectively.
The generating of the text detection result of the candidate text line includes: obtaining a text feature map corresponding to the candidate text line through feature transformation; establishing a second neural network and configuring its loss function as L = L_c + L_p, wherein L_c is the regression loss function of the candidate text confidence score and satisfies [formula], λ_2 is a preset weight coefficient, y' is the predicted candidate text confidence score and the corresponding annotated candidate text confidence score serves as its regression target; L_p is the regression loss function of the candidate text position and satisfies L_p = L_d + L_β, with [formula], where d represents the candidate text position and is expressed by the coordinates or coordinate offsets of the four corner points (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4), smoothed_L1() is an activation function about the L_1 norm, d_j is the predicted j-th candidate text position, β_j is the predicted j-th candidate text rotation angle, and the corresponding annotated position and rotation angle of the j-th candidate text serve as the regression targets; inputting the text feature map into the second neural network, and calculating the candidate text confidence score y' and candidate text position d when the loss function of the second neural network converges; and obtaining the text detection result of the candidate text line from the candidate text confidence score y' and the candidate text position d.
According to a second aspect, an embodiment provides a text detection apparatus, comprising: a camera for capturing images of a natural scene to form an image to be detected, the image to be detected comprising one or more characters; a processor connected to the camera and configured to process the image to be detected according to the text detection method of the first aspect to obtain a text detection result; and a display connected to the processor for displaying the image to be detected and/or the text detection result.
According to a third aspect, an embodiment provides a computer-readable storage medium, wherein the medium stores a program, and the program is executable by a processor to implement the text detection method in the first aspect.
The beneficial effects of this application are:
according to the above embodiment, a text detection method and apparatus based on target detection, and a storage medium are provided, wherein the text detection method includes: acquiring an image to be detected, wherein the image to be detected comprises one or more characters; constructing candidate frames corresponding to the characters according to the image to be detected, extracting characteristic information of the candidate frames and generating candidate character components corresponding to the characters; predicting a connected region of candidate character components respectively corresponding to each character to obtain a connection relation between the candidate character components; combining a plurality of candidate character components which are connected with each other, and fusing to obtain candidate text lines; and generating a text detection result of the candidate text line. Because the features are extracted by constructing the candidate frame and the candidate character components corresponding to the characters are generated, the individual positioning of each character in the complex background image is facilitated, technical conditions are provided for connecting the candidate character components into candidate text lines, and the accuracy of text line positioning can be enhanced.
In addition, when predicting the connection relations between the candidate character components, the position information is extended from the previous (x, y, w, h) to (x, y, w, h, θ), which fully considers the combined influence of coordinates, width and height pixels, and rotation angle on the positions of the characters; the connection scores of candidate character components within the same-layer feature map and across different-layer feature maps are also considered, so that the connection relations between candidate character components are characterized more accurately by means of the multi-dimensional output parameters. When multiple interconnected candidate character components are fused into candidate text lines, linear regression of the process parameter with the least squares method allows the boundary positions on the two sides of a text line formed by connected characters to be obtained accurately, further improving the positioning precision of the candidate text lines. When the text detection result of a candidate text line is generated, a neural network performs regression on the candidate text confidence score and position, realizing fine adjustment of the position of the candidate text line; the four-corner-point positioning mode can adapt to the positioning requirements of different text scales, further improving the adaptivity and accurate positioning capability of the text detection method with respect to text scale.
Drawings
FIG. 1 is a flow chart of a text detection method based on target detection in the present application;
FIG. 2 is a schematic diagram of a text detection method;
FIG. 3 is a flow diagram of generating candidate character components;
FIG. 4 is a flow diagram of obtaining a connection relationship of candidate character components;
FIG. 5 is a flow chart for obtaining candidate lines of text;
FIG. 6 is a flow chart for generating text detection results;
FIG. 7 is a schematic diagram of an extraction network for a feature map at multiple scales;
FIG. 8 is a schematic diagram of the concatenation of candidate character components into text lines;
FIG. 9 is a schematic structural diagram of a text detection device according to the present application;
FIG. 10a is a diagram of candidate character components in a merchandise tag;
FIG. 10b is a schematic diagram of candidate lines of text in a merchandise tag;
FIG. 10c is a diagram illustrating text detection results of candidate text lines in a merchandise tag;
FIG. 11 is a schematic diagram of a processor;
fig. 12 is a schematic structural diagram of a text detection device in another embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the accompanying drawings by way of specific embodiments. Wherein like elements in different embodiments are numbered with like associated elements. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted or replaced with other elements, materials, methods in different instances. In some instances, certain operations related to the present application have not been shown or described in detail in order to avoid obscuring the core of the present application from excessive description, and it is not necessary for those skilled in the art to describe these operations in detail, so that they may be fully understood from the description in the specification and the general knowledge in the art.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the various steps or actions in the method descriptions may be transposed or transposed in order, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required sequence unless otherwise indicated where such sequence must be followed.
The numbering of the components as such, e.g., "first", "second", etc., is used herein only to distinguish the objects as described, and does not have any sequential or technical meaning. The term "connected" and "coupled" when used in this application, unless otherwise indicated, includes both direct and indirect connections (couplings).
The technical solution of the present application will be specifically described with reference to the following examples.
The first embodiment
Referring to fig. 1, the present embodiment discloses a text detection method based on object detection, which mainly includes steps 110 to 150, described below.
Step 110, acquiring an image to be detected. The image to be detected may be an optical image of the surface of an object to be detected in a natural scene. Because the object to be detected is a type of object with text information on its surface (such as a billboard, an indication sign, a commodity label, a page, lettering or a handbook), it may include one or more characters (such as Chinese characters, letters, numbers or symbols), and one or more characters strung together form a text that carries a specific meaning.
It should be noted that a camera or a video camera may be used to capture an image of an object to be detected in a natural scene, so as to obtain an image to be detected containing one or more characters, and then the processor acquires the image to be detected from the camera or the video camera to perform further image processing. Besides texts formed by characters, the image to be detected can also contain some simple or complex background information (such as colors, lines, stains, flaws, patterns and the like), and at this time, the interference of the background information needs to be eliminated, the position of the text needs to be detected from the background information, and even the text content can be identified.
And 120, constructing candidate frames corresponding to the characters according to the image to be detected, extracting the characteristic information of the candidate frames and generating candidate character components corresponding to the characters.
Referring to fig. 2, in the technical solution of the present application, a neural network is mainly used to extract backbone features of the image to be detected and obtain corresponding feature maps; this mainly accounts for the different receptive fields of each character in the image, and feature maps of receptive fields at different scales are obtained through the backbone feature extraction. Then, taking the feature maps of different scales as input, a character localization technique (such as a sliding window technique) is adopted to determine the position of each character with candidate boxes on the feature maps, so as to extract the candidate character components.
For the detection task of each character in the image to be detected, it is necessary not only to return which characters the image contains but also to mark the position of each character in the image with a candidate frame. Extracting such candidate frames from the image quickly and well requires an appropriate character localization technique; a sliding window technique can directly extract candidate regions. For example, in the process of extracting candidate character components, the feature maps of different receptive fields extracted by the backbone are taken as input, and different numbers of candidate frames are used on feature maps of different scales. The number of candidate frames may follow a Gaussian-distribution-style design: 6 candidate frames are used on the middle-scale feature maps (for example, 3 candidate frame sizes, each combined with 2 aspect ratios), and 4 candidate frames are used on the small-scale and large-scale feature maps (for example, 2 candidate frame sizes, each combined with 2 aspect ratios). Further, the center coordinates, width and height pixels, and confidence scores of the candidate boxes can be obtained through regression with a neural network.
In one embodiment, the size of the candidate box can be adaptively matched to the size of the character; for example, if the width and height of the character are both 100 pixels, the width and height of the candidate box are both set to 150 pixels, i.e. 1.5 times the character size.
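To make the candidate-frame construction concrete, the following numpy sketch generates (cx, cy, w, h) candidate boxes for one feature-map scale from a set of base sizes and two aspect ratios; the strides, sizes and function names are illustrative assumptions rather than values fixed by the disclosure.

```python
import numpy as np

def candidate_boxes(feature_hw, stride, base_sizes, aspect_ratios=(0.5, 1.0)):
    """Generate candidate boxes (cx, cy, w, h) for one feature-map scale.

    feature_hw   : (H, W) of the feature map
    stride       : downsampling factor back to the input image
    base_sizes   : box side lengths in input-image pixels (assumed values)
    aspect_ratios: width / height ratios combined with every base size
    """
    H, W = feature_hw
    ys, xs = np.mgrid[0:H, 0:W]
    centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)  # (H, W, 2)

    shapes = []
    for s in base_sizes:
        for r in aspect_ratios:
            shapes.append((s * np.sqrt(r), s / np.sqrt(r)))   # (w, h) with area ~ s*s
    shapes = np.array(shapes)                                 # (K, 2)

    boxes = np.concatenate(
        [np.repeat(centers[:, :, None, :], len(shapes), axis=2),
         np.broadcast_to(shapes, (H, W, len(shapes), 2))],
        axis=-1)                                              # (H, W, K, 4): cx, cy, w, h
    return boxes.reshape(-1, 4)

# middle scale: 3 sizes x 2 ratios = 6 boxes per location; edge scales: 2 x 2 = 4
mid_boxes   = candidate_boxes((40, 40), stride=8, base_sizes=(16, 32, 64))
small_boxes = candidate_boxes((80, 80), stride=4, base_sizes=(8, 16))
```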
In this embodiment, the candidate character component corresponding to each character in the image to be detected mainly represents the position information and confidence score of that character, where the position information can be represented by the coordinates, the width and height scale, and the rotation angle, and the confidence score is the confidence level of whether the component is a character.
Step 130, predicting the connected regions of the candidate character components respectively corresponding to each character to obtain the connection relations between the candidate character components. By predicting the connected regions, the candidate character components can be fused according to the components and their connection relations to obtain candidate text lines. How to obtain the connection relations between the candidate character components is described below.
And step 140, combining the plurality of candidate character components which are connected with each other, and fusing to obtain a candidate text line. It can be understood that after all the candidate character components are detected, the candidate frame information and the connection relation information of the candidate character components output by each feature map may be fused, so as to obtain the information of the candidate text lines.
For example, as shown in FIG. 2, since the candidate character components can characterize the position information of the characters and the connection relations between the candidate character components are known, the corresponding characters can be connected into text lines by straight-line projection according to the position information contained in the interconnected candidate character components; that is, the candidate text lines are obtained by fusion. The position information of a candidate character component is represented by five parameters covering coordinates, scale and angle, namely (x, y, w, h, θ), where x and y are the coordinates or offsets on the X axis and Y axis respectively, w and h are the width and height in pixels respectively, and θ is the rotation angle relative to the X axis; the resulting candidate text line can also be represented by such parameters, for example its position information is represented as (x_t, y_t, w_t, h_t, θ_t). In addition, the confidence scores of the interconnected candidate character components can be averaged to obtain the confidence score of the candidate text line.
Step 150, generating a text detection result of the candidate text line. The position information and confidence score of the candidate text line obtained in step 140 are only preliminary results; in order to adapt to text regions of different shapes and obtain a more accurate detection result, regression fine-tuning may be performed on the position information and confidence score of the candidate text line. For example, after the candidate text line is obtained, feature transformation is performed through RoIPooling or RoIAlign to obtain a text feature map of a specific scale corresponding to the candidate text line; regression of the position information and confidence score of the candidate text line is then performed with the text feature map, and the candidate text confidence score and position obtained after regression form the text detection result.
It should be noted that RoIPooling performs a Pooling operation on an RoI, where RoI (region of interest) refers to a region of the image in which a target exists and Pooling refers to pooling of the image feature vectors; in a convolutional neural network, a pooling layer often follows a convolutional layer and reduces the feature vectors output by the convolutional layer. The specific operation of RoIPooling can be described as: map the RoI to the corresponding position of the feature map according to the input image, divide the mapped region into portions of the same size, and perform a max pooling or average pooling operation on each portion.
It should be noted that RoIAlign is an improvement on RoIPooling: it cancels the quantization operation and obtains image values at pixel points with floating-point coordinates by bilinear interpolation, so that the whole feature aggregation process becomes a continuous operation, which solves the region mismatch problem caused by the two quantizations in the RoIPooling operation. The specific operation of RoIAlign can be described as: traverse each candidate region in the image, keep the floating-point boundary without quantization, divide the candidate region into k × k units whose boundaries are likewise not quantized, compute four fixed coordinate positions in each unit, calculate the values of these four positions by bilinear interpolation, and then perform max pooling.
In this embodiment, for candidate text lines of large targets in the image, either RoIPooling or RoIAlign may be selected; for candidate text lines of smaller targets, RoIAlign is preferred, since its result is more accurate.
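As an illustration of this feature-transformation step, the sketch below uses torchvision's roi_pool and roi_align operators; it assumes axis-aligned candidate text lines (these operators do not take the rotation angle θ_t into account), and the tensor shapes and scale factor are made up for the example.

```python
import torch
from torchvision.ops import roi_align, roi_pool

# one feature map of the batch (N, C, H, W) and its scale relative to the input image
feats = torch.randn(1, 256, 64, 64)
spatial_scale = 1.0 / 8.0                       # the map is 1/8 of the input resolution

# axis-aligned candidate text lines in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0.0,  40.0,  60.0, 300.0, 110.0],
                     [0.0,  32.0, 200.0, 480.0, 240.0]])

aligned = roi_align(feats, rois, output_size=(7, 7), spatial_scale=spatial_scale,
                    sampling_ratio=2)           # bilinear sampling, no quantization
pooled  = roi_pool(feats, rois, output_size=(7, 7), spatial_scale=spatial_scale)
print(aligned.shape, pooled.shape)              # torch.Size([2, 256, 7, 7]) for both
```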
Those skilled in the art can easily understand that since the candidate frame is constructed to extract the features and generate the candidate character components corresponding to the characters in the embodiment, each character in the complex background image is favorably and individually positioned, technical conditions are provided for connecting the candidate character components into candidate text lines, and the accuracy of text line positioning can be enhanced.
In this embodiment, referring to fig. 3, the above step 120 mainly involves constructing candidate boxes for the characters, extracting the feature information of the candidate boxes, and generating candidate character components; it may specifically include steps 121 to 125, described respectively as follows.
Step 121, performing successive convolution and downsampling operations on the image to be detected multiple times, obtaining a first feature map of a corresponding scale after each operation.
Referring to fig. 7, the repeated convolution and downsampling of the image to be detected can be realized with a VGG16 network, each layer of which adopts a convolution-plus-pooling structure; this not only realizes the convolution and downsampling (Pooling) operations but also provides feature extraction over different receptive fields, and the forward pass of the network proceeds from bottom to top. In the forward pass, the size of the feature map changes after some layers but not after others; the layers that do not change the feature map size are grouped into a processing block (Block), and each extracted feature is the output of a block, thereby forming a feature pyramid. Four first feature maps of different resolutions are obtained from the feature pyramid and are denoted, from large scale to small, C2, C3, C4 and C5.
Step 122, performing successive convolution and upsampling operations on the first feature map of the smallest scale multiple times, obtaining a second feature map of a corresponding scale after each operation.
Referring to fig. 7, in the case of obtaining the first feature map C5 with the smallest scale, a top-down convolution and upsampling operation may be implemented by using the VGG16 network, and specifically, the operation of upsampling (Unpooling) may be implemented by using transposed convolution. The second feature map with the smallest scale is P5, and after a plurality of continuous convolution and up-sampling operations, the second feature maps P4, P3 and P2 with gradually increasing scales can be obtained.
It should be noted that, in order to ensure consistency of the upsampling and downsampling operations, the scale sizes (i.e., the width and height pixel sizes) of the first feature maps C2, C3, C4 and C5 and the second feature maps P2, P3, P4 and P5 are respectively the same.
Step 123, performing a feature summation operation on the first feature map and the second feature map of the same scale to obtain a third feature map at the corresponding scale. For example, in fig. 7, summing the first feature map C2 and the second feature map P2 gives a third feature map R2 with the same scale as C2, summing C3 and P3 gives a third feature map R3 with the same scale as C3, and summing C4 and P4 gives a third feature map R4 with the same scale as C4. Since the first feature map C5 and the second feature map P5 are both the smallest-scale feature maps and are neither downsampled nor upsampled, the second feature map P5 may be used directly as the third feature map R5; alternatively, the two feature maps may also be summed and the resulting feature map taken as the third feature map R5. The summation operation referred to here is the element-wise feature addition marked in fig. 7.
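A compact PyTorch sketch of the pyramid of fig. 7 is given below: bottom-up blocks produce C2 to C5, a top-down path upsamples and sums same-scale maps, and a 3 × 3 convolution smooths each result into R2 to R5. The channel counts, block depths and the class name FeaturePyramid are assumptions, not the exact VGG16 configuration of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    # one "Block": convolutions that keep the spatial size, followed by 2x downsampling
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.MaxPool2d(2))

class FeaturePyramid(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512), out_channels=128):
        super().__init__()
        cins = (3,) + channels[:-1]
        self.blocks  = nn.ModuleList(conv_block(ci, co) for ci, co in zip(cins, channels))
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in channels)
        self.smooth  = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in channels)

    def forward(self, x):
        cs = []                                  # C2..C5 (large -> small scale)
        for block in self.blocks:
            x = block(x)
            cs.append(x)
        p = self.lateral[-1](cs[-1])             # smallest scale: P5 (used as R5)
        rs = [self.smooth[-1](p)]
        for lat, smooth, c in zip(self.lateral[-2::-1], self.smooth[-2::-1], cs[-2::-1]):
            p = lat(c) + F.interpolate(p, scale_factor=2, mode="nearest")  # feature sum
            rs.append(smooth(p))                 # 3x3 conv to suppress upsampling aliasing
        return rs[::-1]                          # R2, R3, R4, R5

maps = FeaturePyramid()(torch.randn(1, 3, 256, 256))
print([m.shape for m in maps])                   # scales 1/2, 1/4, 1/8, 1/16 of the input
```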
In another embodiment, after obtaining the third feature map of each scale and before constructing the candidate box on the third feature map, that is, after step 123 and before step 124, the method further includes: and performing convolution processing (such as a 3 × 3 convolution processing procedure) on the third feature map at each scale, wherein the purpose of the convolution processing is to eliminate the feature aliasing effect generated by upsampling in the third feature map.
And step 124, constructing candidate frames corresponding to the characters on the third feature map of each scale, and obtaining feature information of the candidate frames through regression processing. In order to construct a candidate frame corresponding to each character, the candidate frame may be slid on the third feature map, and different numbers of candidate frames are selected on the third feature maps with different scales to extract character features, so as to obtain a candidate frame corresponding to each character. For example, the number of candidate boxes is designed in a gaussian distribution manner, 6 candidate boxes (3 different candidate box sizes and 2 different aspect ratio combinations) are adopted in the third feature maps (e.g., R3 and R4) in the middle scale, 4 candidate boxes (2 different candidate box sizes and 2 different aspect ratio combinations) are adopted in the third feature maps (e.g., R2 and R5) in the two-side scales, and then the center offset and the width-height offset of the candidate boxes and the confidence score (i.e., the confidence level of whether the character is included) are obtained in a regression manner.
In one implementation, to determine feature information (including location information and confidence scores) for the candidate boxes, a regression process of the third feature map may be implemented using a neural network.
First, a first neural network is established, for example with a structure of 2 fully connected layers, and its loss function is configured as L_rpn = L_cls + L_reg. In this formula, L_reg is the regression loss function of the candidate frame position information and satisfies [formula], where λ_1 is a preset weight coefficient, w* is the neural network weight to be regressed, the superscript T denotes transposition, i is the index of the third feature map and ranges from 2 to 5, Φ denotes the activation function applied to the feature vector of the third feature map, and u* denotes the transformation between the annotated position and the predicted position and satisfies u* = (u_x, u_y, u_w, u_h, u_θ). Here u_x, u_y, u_w, u_h and u_θ are the transformation relations of the candidate frame with respect to the X-axis coordinate, the Y-axis coordinate, the width pixel, the height pixel and the rotation angle about the X axis respectively, with u_x = (G_x - P'_x)/P'_w, u_y = (G_y - P'_y)/P'_h, u_w = log(G_w/P'_w), u_h = log(G_h/P'_h), u_θ = P'_θ - G_θ, where G = (G_x, G_y, G_w, G_h, G_θ) is the annotated position information and P' = (P'_x, P'_y, P'_w, P'_h, P'_θ) is the predicted position information. L_cls is the regression loss function of the character confidence score in the candidate frame and satisfies [formula], where α is a preset weight coefficient, p is the predicted confidence score (generally a value between 0 and 1, a larger value indicating higher character confidence), and the annotated confidence score is generally represented by a value of 0 or 1, with 0 denoting not a character and 1 denoting a character.
After the first neural network is trained by using the labeled image (the position information and the confidence score are labeled on each character in the image), the first neural network which is completely established can be obtained. Since the network training process is mature, this prior art will not be described in detail here.
And then inputting the third feature map of each scale into the first neural network, and determining candidate boxes corresponding to the characters respectively when the loss function of the first neural network converges, so that feature information of the candidate boxes, including the position information P' and the confidence score P, can be calculated.
It is understood that the candidate frame refers to a rectangular frame which is a region for defining characters, the confidence score P of the candidate frame reflects the confidence of whether the image feature in the rectangular frame is a character, and the position information P' of the candidate frame reflects the frame center coordinate offset, the width and height dimension, and the rotation angle of the rectangular frame in the image.
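The transformation relations u_x, u_y, u_w, u_h and u_θ defined above can be written as a small encode/decode pair; the numpy sketch below follows those formulas directly, with the function names being illustrative.

```python
import numpy as np

def encode_targets(P, G):
    """Regression targets u* between a predicted/prior box P and an annotated box G.

    P, G: arrays of shape (..., 5) holding (x, y, w, h, theta) as defined in the text.
    """
    ux = (G[..., 0] - P[..., 0]) / P[..., 2]
    uy = (G[..., 1] - P[..., 1]) / P[..., 3]
    uw = np.log(G[..., 2] / P[..., 2])
    uh = np.log(G[..., 3] / P[..., 3])
    ut = P[..., 4] - G[..., 4]
    return np.stack([ux, uy, uw, uh, ut], axis=-1)

def decode_targets(P, u):
    """Invert encode_targets: recover the box described by targets u relative to P."""
    gx = u[..., 0] * P[..., 2] + P[..., 0]
    gy = u[..., 1] * P[..., 3] + P[..., 1]
    gw = np.exp(u[..., 2]) * P[..., 2]
    gh = np.exp(u[..., 3]) * P[..., 3]
    gt = P[..., 4] - u[..., 4]
    return np.stack([gx, gy, gw, gh, gt], axis=-1)
```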
And step 125, mapping the characteristic information of the candidate frame corresponding to each character to the image to be detected, and performing non-maximum suppression on the mapping result to obtain candidate character assemblies corresponding to each character, wherein each candidate character assembly comprises the position information and the confidence score of the corresponding character.
It can be understood that the candidate boxes reflect the positions and confidences of the characters in the third feature map. To further determine the situation of each character in the image to be detected, each candidate box needs to be mapped into the image to be detected to determine the position of the corresponding character there. A resolution difference exists between the third feature map and the image to be detected; for example, if the width and height resolution of the image to be detected is 2 times that of the third feature map, the coordinates of the candidate frame are multiplied by 2 to obtain the coordinates of the corresponding pixel points in the image to be detected, and the width and height of the candidate frame are multiplied by 2 to obtain the width and height in pixels of the corresponding character in the image; the rotation angle of the candidate frame is kept unchanged when mapped to the image to be detected, and so is its confidence score. Non-maximum suppression is applied to the mapping result of each candidate frame to obtain the candidate character component of the corresponding character; the candidate character component is essentially a data set of the position information and confidence score of a character and reflects the position and confidence attributes of that character. The position information of each candidate character component can then be represented as (x, y, w, h, θ), where x and y are the coordinates or offsets on the X axis and Y axis respectively, w and h are the width and height in pixels respectively, and θ is the rotation angle relative to the X axis; the confidence score of each candidate character component can be represented as S_char, whose value ranges from 0 to 1.
It should be noted that Non-Maximum Suppression (NMS) is an existing image processing algorithm whose idea is to search for local maxima and suppress smaller values. For example, during target detection a large number of candidate frames may be generated at the same target position and these candidate frames may overlap with each other; non-maximum suppression is then needed to find the best target bounding box and eliminate the redundant bounding boxes.
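The mapping-plus-suppression of step 125 can be sketched as follows with torchvision's nms; because that operator works on axis-aligned boxes, the rotation angle is carried along but ignored during suppression, and the scale factor and threshold are example values.

```python
import torch
from torchvision.ops import nms

def map_and_suppress(boxes, scores, scale=2.0, iou_thresh=0.5):
    """Map candidate boxes from feature-map coordinates back to the image and keep
    only the locally best ones with non-maximum suppression.

    boxes  : (N, 5) float tensor of (x, y, w, h, theta) on the third feature map
    scores : (N,)   character confidence scores
    scale  : resolution ratio between the image to be detected and the feature map
    """
    mapped = boxes.clone()
    mapped[:, :4] *= scale                       # coordinates and sizes scale; theta does not

    # torchvision's nms expects axis-aligned (x1, y1, x2, y2) boxes, so the rotation
    # angle is ignored during suppression in this sketch.
    x, y, w, h = mapped[:, 0], mapped[:, 1], mapped[:, 2], mapped[:, 3]
    xyxy = torch.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], dim=1)
    keep = nms(xyxy, scores, iou_thresh)
    return mapped[keep], scores[keep]
```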
For example, in fig. 7, candidate frames corresponding to each character in the third feature map R2 are mapped to an image to be detected, an obtained candidate character component is D2, a parameter 2 included in the candidate character component is the number of channels of the confidence score (two channels respectively indicate yes and no, and form a binary result), and a parameter 5 is the number of channels of the position information (x, y, w, h, and θ each occupy one channel). Similarly, the parameters of the character candidate components D3, D4 and D5 can be known. Because the candidate character assemblies D2, D3, D4 and D5 are the mapping results based on different third feature maps (P2, P3, P4 and P5), obvious same-layer and different-layer relations exist between the candidate character assemblies respectively corresponding to the characters formed by the third feature maps of various scales, the candidate character assemblies formed by the same third feature map belong to the same-layer relation, and the candidate character assemblies formed by different third feature maps belong to the different-layer relation.
In this embodiment, referring to fig. 4, the above step 130 mainly involves predicting the connected regions and obtaining the connection relations of the candidate character components; it may specifically include steps 131 to 134, described respectively as follows.
And step 131, acquiring position information and confidence scores of candidate character assemblies corresponding to the characters formed by the third feature maps of each scale. It can be understood that, for the third feature map of any scale, if N candidate frames are formed, the N candidate character components will be obtained after the candidate frames are mapped into the image to be detected; due to the difference between the features of the third feature maps of various scales, the number of candidate frames which can be constructed by each third feature map is different, and further the number of formed candidate character components is different, that is, the number of candidate character components belonging to different layer relations is not necessarily the same.
For the candidate character component corresponding to each character, the position information of the candidate character component may be represented as (x, y, w, h, θ) and its confidence score may be represented as S_char, where x and y are the coordinates or offsets on the X axis and Y axis respectively, w and h are the width and height in pixels respectively, and θ is the rotation angle relative to the X axis.
Step 132, predicting, according to the position information of the candidate character component corresponding to each character formed by the third feature map of each scale, a first connection score between that candidate character component and the other candidate character components in its surrounding neighborhood, the first connection score being represented as S_neighbor.
For the candidate character components belonging to the same-layer relation, a connection relation of belonging to the same text may exist between them. Because each candidate character component carries the coordinates, scale and angle of its character, the connection relation between a candidate character component and the candidate character components in the 8 surrounding directions (up, down, left, right, upper-left, lower-left, upper-right and lower-right) can be judged from these quantities: if the coordinate distance between two candidate character components in a certain direction is smaller than a preset threshold and their scale difference or angle difference is smaller than a preset threshold, the two candidate character components are predicted to be connected; otherwise they are predicted not to be connected. Combining the connected/not-connected results in all directions gives the first connection score S_neighbor.
Referring to fig. 7, the parameter 16 of the candidate character component D3 represents the first connection score and comprises 2 × 8 = 16 quantities in total, where 8 represents the eight directions and 2 represents the number of channels in each direction (the two channels correspond to connected and not connected, forming a binary result).
Step 133, predicting, according to the position information of the candidate character components corresponding to each character formed by the third feature map of each scale, a second connection score between that candidate character component and the candidate character components formed by the third feature maps of other scales, the second connection score being represented as S_level.
For candidate character components belonging to a different-layer relation, a connection relation of belonging to the same character may exist between them. Since each candidate character component carries the coordinates, scale and angle of its character, the connection relation between any candidate character component in one layer and the candidate character component at the same position in each other layer can be judged from these quantities: if the coordinate difference between two candidate character components located in different layers is smaller than a preset threshold and their scale difference or angle difference is smaller than a preset threshold, the two candidate character components are predicted to be connected; otherwise they are not connected. The second connection score S_level is then obtained by combining the connected/not-connected results between every two layers.
Referring to fig. 7, the parameter 8 of the candidate character component D3 represents the second connection score and comprises 2 × 4 = 8 quantities in total, where 4 represents the four layers and 2 represents the number of channels for connection between layers (the two channels correspond to connected and not connected, forming a binary result).
Step 134, determining the connection relations between the candidate character components corresponding to the characters by using the first connection scores and the second connection scores. It can be understood that the first connection score reflects the connection relation between candidate character components in the same layer and the second connection score reflects the connection relation between candidate character components in different layers, so the connection relations between the candidate character components corresponding to all characters can be determined from the first and second connection scores.
Those skilled in the art will readily understand that, in the present embodiment, when predicting the connection relationship between the candidate character components, the position information is changed from the past (x, y, w, h) to (x, y, w, h, θ), the comprehensive influence of the coordinates, the width and height pixels, and the rotation angle on the positions where the characters are located is fully considered, and the connection scores of the candidate character components within the same-layer feature map and between different-layer feature maps are also considered, so as to more accurately characterize the connection relationship between the candidate character components by means of the multi-dimensional output parameters.
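A minimal sketch of the threshold rule described above is given below; in the disclosure the first and second connection scores are predicted by the network, so this function only mirrors the coordinate/scale/angle comparison used to decide whether two candidate character components are linked, and all threshold values are illustrative.

```python
import math

def connected(comp_a, comp_b, dist_thresh=40.0, scale_thresh=0.5, angle_thresh=math.pi / 8):
    """Decide whether two candidate character components (x, y, w, h, theta) are
    linked: centres close enough and scale or angle difference small enough."""
    xa, ya, wa, ha, ta = comp_a
    xb, yb, wb, hb, tb = comp_b
    close = math.hypot(xa - xb, ya - yb) < dist_thresh
    similar_scale = abs(math.log((wa * ha) / (wb * hb))) < scale_thresh
    similar_angle = abs(ta - tb) < angle_thresh
    return close and (similar_scale or similar_angle)

# first connection score: test a component against its 8 spatial neighbours in the
# same layer; second connection score: test it against the component at the same
# position in each of the other feature-map layers.
```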
In the present embodiment, referring to fig. 5, the above step 140 mainly involves merging candidate character components and fusing them into candidate text lines; it specifically includes steps 141 to 144, described respectively as follows.
Step 141, forming a candidate set C from the interconnected candidate character components, with C = {c_i}, where c_i is the i-th connected candidate character component, its corresponding rotation angle is θ_i, and i ranges from 1 to n.
It is understood that there may be more candidate sets C due to the existence of multiple sets of candidate character components connected to each other, and in this embodiment, for convenience of explanation, the process of forming candidate text lines by a single candidate set C is only taken as an example.
Step 142, performing linear regression on a process parameter b according to the least squares method to obtain a regression line, where the regression line is represented as [formula] and the objective function of the least squares method is expressed as [formula], in which [formula] and a_i is a preset weight coefficient; Huber, Tukey, Gauss or Drop weight coefficients may be adopted.
Step 143, projecting onto the regression line according to the position information of the candidate character component corresponding to each character, and acquiring the projection points on the two sides (i.e. the leftmost and rightmost projected points), denoted (x_l, y_l) and (x_r, y_r) respectively.
Step 144, determining the candidate text line from the acquired projection points on the two sides, where the position information of the candidate text line is represented as [formula], and w_i and h_i are the width and height pixels of the i-th candidate character component respectively. Here x_t and y_t are the coordinates of the middle pixel of the candidate text line, and w_t and h_t are its width and height pixel dimensions.
It can be understood that since the candidate set C = {c_i} includes a plurality of candidate character components c_i and each candidate character component has its own confidence score S_char, the confidence score of the candidate text line determined by these candidate character components can be represented by the mean of their confidence scores.
Those skilled in the art will readily understand that, in the present embodiment, when a plurality of candidate character assemblies are connected to each other and fused to obtain a candidate text line, the least square method is used to perform linear regression on the process parameters, so that the boundary positions on two sides of the text line formed by character connection can be obtained more accurately, and the positioning accuracy of the candidate text line is further improved.
For example, as shown in fig. 8, for an image to be detected containing C, R, A, O, C characters, the outer frame of each character represents a corresponding candidate character component, and the position of the corresponding character can be determined by the coordinates and dimensions of each candidate character component, so that it is known that C, R on the upper side belongs to the same text, and C-R is determined to form a candidate text line through the connection relationship. Similarly, the position of the corresponding character is determined by the coordinates and the scale of other candidate character components, so that the lower side A, C, R, O, C is known to belong to the same text, and the A-C-R-O-C is determined to form another candidate text line through the connection relation.
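The fusion of steps 141 to 144 can be sketched as a weighted line fit followed by projection of the component centres; the sketch below uses uniform weights in place of the preset coefficients a_i and approximates the text-line width and height, so it illustrates the procedure rather than reproducing the exact formulas of the disclosure.

```python
import numpy as np

def fuse_components(components):
    """Fuse connected candidate character components (x, y, w, h, theta) into one
    candidate text line (x_t, y_t, w_t, h_t, theta_t)."""
    comps = np.asarray(components, dtype=float)
    cx, cy, w, h = comps[:, 0], comps[:, 1], comps[:, 2], comps[:, 3]

    # least-squares regression line y = slope * x + b through the component centres
    slope, b = np.polyfit(cx, cy, 1, w=np.ones_like(cx))
    direction = np.array([1.0, slope]) / np.hypot(1.0, slope)

    # project the centres onto the regression line and keep the two extreme projections
    pts = np.stack([cx, cy], axis=1)
    t = (pts - np.array([0.0, b])) @ direction
    proj = np.array([0.0, b]) + np.outer(t, direction)
    i_l, i_r = int(np.argmin(t)), int(np.argmax(t))
    (xl, yl), (xr, yr) = proj[i_l], proj[i_r]

    x_t, y_t = (xl + xr) / 2.0, (yl + yr) / 2.0
    w_t = np.hypot(xr - xl, yr - yl) + (w[i_l] + w[i_r]) / 2.0   # pad by the end widths
    h_t = h.mean()
    theta_t = np.arctan(slope)
    return x_t, y_t, w_t, h_t, theta_t

# e.g. the upper "C-R" line of fig. 8 would be fused from its two components:
line = fuse_components([(50, 40, 30, 40, 0.05), (90, 42, 30, 40, 0.02)])
```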
In this embodiment, referring to fig. 6, the step 150 mainly relates to the process of generating a text detection result, and specifically includes steps 151 to 153, which are respectively described as follows.
Step 151, obtaining a text feature map corresponding to the candidate text lines through feature transformation, specifically, feature transformation may be performed on each candidate text line by using the above-described RoIPooling or RoIAlign, so as to obtain a corresponding text feature map.
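As a purely illustrative sketch, an axis-aligned RoIAlign over a shared feature map could be performed with torchvision as shown below; the feature-map size, channel count and box coordinates are made-up values, and the rotation angle of a candidate text line is ignored here (a rotated-RoI variant would be needed to respect it).

import torch
from torchvision.ops import roi_align

# Shared feature map from the backbone: (batch, channels, height, width).
features = torch.randn(1, 256, 64, 64)

# Candidate text lines as axis-aligned boxes (batch_index, x1, y1, x2, y2)
# expressed in feature-map coordinates.
boxes = torch.tensor([[0, 4.0, 10.0, 40.0, 18.0],
                      [0, 8.0, 30.0, 56.0, 38.0]])

# Fixed-size text feature map for every candidate text line, e.g. 7x7.
text_features = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1.0)
print(text_features.shape)   # torch.Size([2, 256, 7, 7])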
In step 152, in order to further accurately determine the positions and confidence degrees of the candidate text lines, a neural network may be used to implement regression processing of the text feature map.
First, a second neural network is established, and the loss function of the second neural network is configured as

L = L_c + L_p

In the above formula, L_c is the regression loss function of the candidate text confidence score, given by the corresponding formula (published as a formula image), in which λ_2 is a preset weight coefficient, y' is the predicted candidate text confidence score (with a value range of 0 to 1) and y* is the labeled candidate text confidence score (taking the value 0 or 1, where 0 indicates that the candidate text is not text and 1 indicates that it is text). L_p is the regression loss function of the candidate text position and satisfies L_p = L_d + L_β.
where

L_d = Σ_j smoothed_L1(d_j - d_j*)

L_β = Σ_j smoothed_L1(β_j - β_j*)

d represents a candidate text position and is represented by the coordinates or coordinate offsets of four corner points (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4), smoothed_L1() is an activation function related to the L_1 norm, d_j is the predicted j-th candidate text position, d_j* is the labeled j-th candidate text position, β_j is the predicted j-th candidate text rotation angle, and β_j* is the labeled j-th candidate text rotation angle.
In one embodiment, if x' is taken as the input parameter of the activation function smoothed_L1(), the activation function can be specifically expressed as

smoothed_L1(x') = 0.5(x')^2, if |x'| < 1
smoothed_L1(x') = |x'| - 0.5, otherwise
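A minimal PyTorch sketch of this loss is given below for illustration; the binary cross-entropy form of L_c weighted by λ_2 is an assumption (the exact formula of L_c is published only as an image), while the smoothed-L1 terms follow the definition above.

import torch
import torch.nn.functional as F

def smoothed_l1(x):
    # 0.5 * x^2 where |x| < 1, |x| - 0.5 elsewhere.
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x * x, absx - 0.5)

def second_network_loss(y_pred, y_true, d_pred, d_true, beta_pred, beta_true, lambda2=1.0):
    # L_c: confidence regression term (a lambda2-weighted binary cross-entropy is assumed).
    l_c = lambda2 * F.binary_cross_entropy(y_pred, y_true)
    # L_p = L_d + L_beta: smoothed-L1 regression of corner coordinates and rotation angle.
    l_d = smoothed_l1(d_pred - d_true).sum()
    l_beta = smoothed_l1(beta_pred - beta_true).sum()
    return l_c + l_d + l_beta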
After the second neural network is trained with labeled images (in which the text lines are labeled with position information and confidence scores), the fully established second neural network is obtained. Since the network training process is a mature prior art, it is not described in detail here.
The text feature map is input into the second neural network, and the candidate text confidence score y' and the candidate text position d are calculated when the loss function of the second neural network converges.
Step 153, obtaining the text detection result of the candidate text line by using the candidate text confidence score y' and the candidate text position d.
It can be understood that the candidate text confidence score y' is the result of regressing and fine-tuning the confidence score of the candidate text line with the second neural network, and the candidate text position d is the result of regressing and fine-tuning the position information of the candidate text line with the second neural network; compared with the initially determined confidence score and position information of the candidate text line, y' and d reflect the text detection situation of the corresponding candidate text line more accurately.
Those skilled in the art will readily understand that, in this embodiment, when the text detection result of a candidate text line is generated, the second neural network is used to perform regression on the candidate text confidence score and the candidate text position. This not only achieves position fine-tuning of the candidate text line, but also adapts to the positioning requirements of different text scales through the four-corner-point positioning mode, further improving the text scale adaptivity and accurate positioning capability of the text detection method.
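Purely as an illustration, a second-stage regression head of the kind described here might be organized as follows; the layer sizes, the pooled feature size and the fully connected trunk are assumptions, since the patent does not specify the network architecture.

import torch
import torch.nn as nn

class SecondStageHead(nn.Module):
    # Takes the pooled text feature map of a candidate text line and regresses
    # its confidence score, the four corner points (8 values) and the rotation angle.
    def __init__(self, in_channels=256, pooled_size=7):
        super().__init__()
        flat = in_channels * pooled_size * pooled_size
        self.trunk = nn.Sequential(nn.Flatten(), nn.Linear(flat, 1024), nn.ReLU())
        self.confidence = nn.Sequential(nn.Linear(1024, 1), nn.Sigmoid())   # y' in [0, 1]
        self.corners = nn.Linear(1024, 8)    # (x1, y1, ..., x4, y4) offsets
        self.angle = nn.Linear(1024, 1)      # rotation angle beta

    def forward(self, text_features):
        h = self.trunk(text_features)
        return self.confidence(h), self.corners(h), self.angle(h)

head = SecondStageHead()
y_pred, d_pred, beta_pred = head(torch.randn(2, 256, 7, 7))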
Example II,
Referring to fig. 9, based on the text detection method disclosed in the first embodiment, the present embodiment discloses a text detection apparatus, and the text detection apparatus 2 includes a camera 21, a processor 22 and a display 23, which are respectively described below.
The camera 21 is configured to image a natural scene and form an image to be detected. For the purpose of text detection, the captured image to be detected should include one or more characters, and these characters are strung together to form a text that expresses a specific meaning.
It should be noted that the image to be detected may be an optical image of the surface of an object to be detected in a natural scene, and the object to be detected may be a type of object (such as a billboard, an indication mark, a commodity label, a page, a lettering, a script, and the like) whose surface has text information, so that the image to be detected may include one or more characters (such as chinese characters, letters, numbers, symbols, and the like). Of course, the image to be detected may include some simple or complex background information (such as color, lines, stains, scratches, patterns, etc.) in addition to the text formed by the individual characters, and it is necessary that the processor 22 can eliminate the interference of the background information and detect the position of the text.
The processor 22 is connected to the camera 21 and is configured to process the image to be detected according to the text detection method described in the first embodiment to obtain a text detection result. It is understood that the processor 22 may be a CPU, a GPU, an FPGA, a microcontroller or a digital integrated circuit with a data processing function, as long as it can implement the text detection method of the above steps 110 to 150 according to its own logic instructions.
The display 23 is connected to the processor 22 and is configured to display an image to be detected and/or a text detection result. It is understood that the display 23 may be a screen with an image display function, and may be capable of displaying the image to be detected and the text detection result separately or together, and the specific screen type and display layout are not limited.
In the present embodiment, referring to FIG. 11, the processor 22 includes an image acquisition module 22-1, a component extraction module 22-2, a relationship prediction module 22-3, a character fusion module 22-4, and a result generation module 22-5, which are respectively described below.
The image acquisition module 22-1 may communicate with the camera 21 to acquire an image to be detected from the camera 21.
The component extraction module 22-2 is connected to the image acquisition module 22-1; it constructs candidate frames corresponding to the characters according to the image to be detected, extracts feature information of the candidate frames, and generates candidate character components corresponding to the characters. For example, the component extraction module 22-2 may perform multiple successive convolution and downsampling operations on the image to be detected, obtaining a first feature map of the corresponding scale after each operation; perform multiple successive convolution and upsampling operations on the first feature map of the smallest scale, obtaining a second feature map of the corresponding scale after each operation; then carry out a feature summation operation on the first feature map and the second feature map of the same scale to obtain a third feature map at the corresponding scale; then construct candidate frames corresponding to the characters on the third feature map of each scale and obtain the feature information of the candidate frames through regression processing; and finally map the feature information of the candidate frame corresponding to each character into the image to be detected and perform non-maximum suppression on the mapping result to obtain the candidate character components corresponding to the characters, where a candidate character component includes the position information and the confidence score of the character. For the functions of the component extraction module 22-2, reference may be made to steps 121 to 125 in the first embodiment, which are not described herein again.
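A highly simplified sketch of such a feature pyramid is given below for illustration; the channel counts, the number of scales and the use of nearest-neighbor upsampling are assumptions and are not taken from the patent, and the candidate-frame regression head is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePyramid(nn.Module):
    # Downsampling path -> first feature maps; upsampling path -> second feature maps;
    # element-wise sums of same-scale maps -> third feature maps used to build candidate frames.
    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.down = nn.ModuleList(nn.Conv2d(channels, channels, 3, stride=2, padding=1)
                                  for _ in range(3))
        self.up = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1)
                                for _ in range(3))

    def forward(self, image):
        first = [F.relu(self.stem(image))]
        for conv in self.down:                      # first feature maps, decreasing scale
            first.append(F.relu(conv(first[-1])))
        x = first[-1]                               # smallest-scale first feature map
        third = []
        for conv, skip in zip(self.up, reversed(first[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode="nearest")
            x = F.relu(conv(x))                     # second feature map at this scale
            third.append(x + skip)                  # third feature map = same-scale sum
        return third

feature_maps = SimplePyramid()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feature_maps])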
The relation prediction module 22-3 is connected with the component extraction module 22-2, and mainly predicts the connected regions of the candidate character components corresponding to each character to obtain the connection relations between the candidate character components. For example, the relation prediction module 22-3 obtains the position information and the confidence score of the candidate character component corresponding to each character formed by the third feature map of each scale; predicts, according to this position information, a first connection score S_neighbor between the candidate character component corresponding to each character and the other candidate character components in its surrounding neighborhood; predicts, likewise, a second connection score S_level between the candidate character component corresponding to each character and the candidate character components formed by the third feature maps of the other scales; and then determines the connection relations between the candidate character components by using the first connection score and the second connection score. For the functions of the relation prediction module 22-3, reference may be made to steps 131 to 134 in the first embodiment, which are not described herein again.
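An illustrative sketch of turning such connection scores into groups of mutually connected candidate character components follows; the 0.5 threshold and the union-find grouping are assumptions, since the patent only states that the connection relations are determined from S_neighbor and S_level.

def group_connected_components(num_components, links, threshold=0.5):
    # links: list of (i, j, score) tuples, where score is S_neighbor or S_level
    # predicted between candidate character components i and j.
    parent = list(range(num_components))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for i, j, score in links:
        if score >= threshold:              # treat the pair as connected
            parent[find(i)] = find(j)

    groups = {}
    for i in range(num_components):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())            # each group becomes one candidate set C

print(group_connected_components(5, [(0, 1, 0.9), (2, 3, 0.8), (3, 4, 0.7), (1, 2, 0.1)]))
# -> [[0, 1], [2, 3, 4]]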
The character fusion module 22-4 is connected with the relation prediction module 22-3, and is mainly used for merging a plurality of mutually connected candidate character components and fusing them into candidate text lines. For example, the character fusion module 22-4 constructs a candidate set C = {c_i} from the interconnected candidate character components, where c_i is the i-th connected candidate character component, θ_i is its corresponding rotation angle, and i ranges from 1 to n; performs linear regression processing on the process parameter b according to the least square method to obtain a regression line (given by the corresponding formula image); then projects onto the regression line according to the position information of the candidate character component corresponding to each character, and obtains the projection points on the two sides, denoted (x_l, y_l) and (x_r, y_r) respectively; and finally determines the candidate text line by using the obtained projection points on the two sides. For the functions of the character fusion module 22-4, reference may be made to steps 141 to 144 in the first embodiment, which are not described herein again.
The result generation module 22-5 is connected to the character fusion module 22-4, and is mainly used for generating the text detection results of the candidate text lines. For example, the result generation module 22-5 obtains a text feature map corresponding to the candidate text line through feature transformation; establishes and trains in advance a second neural network whose loss function is configured as L = L_c + L_p; then inputs the text feature map into the second neural network, and calculates the candidate text confidence score y' and the candidate text position d when the loss function of the second neural network converges; and finally obtains the text detection result of the candidate text line by using the candidate text confidence score y' and the candidate text position d. For the functions of the result generation module 22-5, reference may be made to steps 151 to 153 in the first embodiment, which are not described herein again.
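As a hypothetical illustration of the module decomposition only (class and method names are invented and not part of the disclosure), the five modules could be chained as follows.

class TextDetectionProcessor:
    # Mirrors the module decomposition 22-1 to 22-5 described above.
    def __init__(self, camera, extractor, predictor, fuser, generator):
        self.camera = camera          # image acquisition module (22-1)
        self.extractor = extractor    # component extraction module (22-2)
        self.predictor = predictor    # relation prediction module (22-3)
        self.fuser = fuser            # character fusion module (22-4)
        self.generator = generator    # result generation module (22-5)

    def detect(self):
        image = self.camera.grab()
        components = self.extractor(image)
        links = self.predictor(components)
        text_lines = self.fuser(components, links)
        return self.generator(image, text_lines)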
Take the merchandise label shown in fig. 10a, 10b and 10c as an example: the label contains a plurality of text lines formed by series of characters. The component extraction module 22-2 extracts each character (number, letter, symbol) in the merchandise label to obtain the candidate character components represented by the highlighted points in fig. 10 a. The relation prediction module 22-3 predicts the connected regions of the candidate character components to obtain the connection relations between the characters, and the character fusion module 22-4 fuses the candidate character components having connection relations to obtain the fusion result shown in fig. 10b, in which each bar-shaped region represents one candidate text line. The result generation module 22-5 performs regression fine-tuning on the candidate text lines, accurately calculates the coordinates of the four corner points of each candidate text line, and marks each candidate text line in the form of a detection box to form the detection result in fig. 10 c. At this point, the position detection of each text line in the merchandise label is completed.
In this embodiment, the disclosed text detection apparatus 2 realizes the functions of text image acquisition, text detection and result display. Since the processor adopts the text detection method designed in steps 110 to 150, text in natural scenes can be detected efficiently and accurately, the problem of multi-directional and variable-scale detection of text in natural scenes is solved, and the positioning accuracy of the text is greatly improved. In addition, because the text detection method predicts parameters of the text such as coordinates, dimensions and angles, the processor can detect not only horizontal or rotated text but also annularly distributed text, which effectively improves the applicability of the apparatus in various text detection occasions.
Example III,
Referring to fig. 11, the present embodiment discloses a text detection apparatus, and the text detection apparatus 3 mainly includes a memory 31 and a processor 32.
The memory 31 serves as a computer-readable storage medium and is mainly used for storing a program, where the program may be the program code corresponding to the text detection method in the first embodiment. The processor 32 is connected to the memory 31 and is configured to execute the program stored in the memory 31 to implement the text detection method. For the functions implemented by the processor 32, reference may be made to the processor 22 in the second embodiment, which are not described in detail here.
Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above embodiments are implemented by a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated in a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above embodiments may be implemented.
The present invention has been described in terms of specific examples, which are provided to aid understanding of the invention and are not intended to be limiting. For a person skilled in the art to which the invention pertains, several simple deductions, modifications or substitutions may be made according to the idea of the invention.

Claims (10)

1. A text detection method based on target detection is characterized by comprising the following steps:
acquiring an image to be detected, wherein the image to be detected comprises one or more characters;
constructing candidate frames corresponding to the characters according to the image to be detected, extracting characteristic information of the candidate frames and generating candidate character components corresponding to the characters;
predicting a connected region of candidate character components respectively corresponding to each character to obtain a connection relation between the candidate character components;
combining a plurality of candidate character components which are connected with each other, and fusing to obtain candidate text lines;
and generating a text detection result of the candidate text line.
2. The text detection method of claim 1, wherein the constructing candidate frames corresponding to the characters respectively according to the image to be detected, extracting feature information of the candidate frames and generating candidate character components corresponding to the characters respectively comprises:
continuously and repeatedly performing convolution and downsampling operation on the image to be detected, and obtaining a first feature map with a corresponding scale after each operation;
performing convolution and up-sampling operation on the first feature map with the minimum scale for multiple times continuously, and obtaining a second feature map with a corresponding scale after each operation;
carrying out feature summation operation on the first feature map and the second feature map with the same scale to obtain a third feature map under the corresponding scale;
constructing a candidate frame corresponding to each character on the third feature map of each scale, and obtaining feature information of the candidate frame through regression processing;
mapping the characteristic information of the candidate frame corresponding to each character to the image to be detected, and carrying out non-maximum suppression on the mapping result to obtain candidate character components corresponding to each character respectively; the candidate character component includes location information and a confidence score for the character.
3. The text detection method of claim 2, wherein after obtaining the third feature map at each scale and before constructing the candidate box on the third feature map, further comprising: and performing convolution processing on the third feature map of each scale to eliminate the feature aliasing effect generated by upsampling in the third feature map.
4. The text detection method of claim 2, wherein the constructing a candidate box corresponding to each character on the third feature map of each scale and obtaining feature information of the candidate box through regression processing comprise:
establishing a first neural network and configuring a loss function of the first neural network as
L_rpn = L_cls + L_reg
wherein L_reg is a regression loss function of the position information of the candidate box, given by the corresponding formula (published as a formula image), λ_1 represents a preset weight coefficient, w* represents the neural network weights to be regressed, the superscript T represents a transposition operation, Φ represents the activation function for activating the feature vector of the third feature map, u* represents the transformation relation between the labeled position and the predicted position and satisfies u* = (u_x, u_y, u_w, u_h, u_θ), where u_x, u_y, u_w, u_h and u_θ respectively represent the transformation relation of the candidate box with respect to the X-axis coordinate, the Y-axis coordinate, the width pixels, the height pixels and the rotation angle relative to the X axis, and u_x = (G_x - P'_x)/P'_w, u_y = (G_y - P'_y)/P'_h, u_w = log(G_w/P'_w), u_h = log(G_h/P'_h), u_θ = P'_θ - G_θ, where G = (G_x, G_y, G_w, G_h, G_θ) is the labeled position information and P' = (P'_x, P'_y, P'_w, P'_h, P'_θ) is the predicted position information; L_cls is a regression loss function of the confidence score of the character in the candidate box, given by the corresponding formula (published as a formula image), where α is a preset weight coefficient, p is the predicted confidence score and p* is the labeled confidence score;
inputting the third feature map of each scale into the first neural network, and determining the candidate box corresponding to each character when the loss function of the first neural network converges; the feature information of the candidate box includes the position information P' and the confidence score p.
5. The text detection method of claim 2, wherein the predicting the connected regions of the candidate character components corresponding to the respective characters to obtain the connection relationship between the candidate character components comprises:
acquiring the position information and the confidence score of the candidate character component corresponding to each character formed by the third feature map of each scale; the position information of the candidate character component is represented as (x, y, w, h, θ), and the confidence score of the candidate character component is represented as S_char, wherein x and y are respectively the coordinate or offset on the X axis and the Y axis, w and h are respectively the width pixels and the height pixels, and θ is the rotation angle relative to the X axis;
predicting, according to the position information of the candidate character component corresponding to each character formed by the third feature map of each scale, a first connection score between the candidate character component corresponding to each character and the other candidate character components in the surrounding neighborhood, wherein the first connection score is expressed as S_neighbor;
predicting, according to the position information of the candidate character components corresponding to each character formed by the third feature maps of each scale, a second connection score between the candidate character component corresponding to each character and the candidate character components formed by the third feature maps of the other scales, wherein the second connection score is expressed as S_level;
And determining the connection relation between candidate character components corresponding to each character by using the first connection score and the second connection score.
6. The text detection method of claim 5, wherein said merging the interconnected candidate character components to obtain a candidate text line comprises:
forming a candidate set C from the interconnected candidate character components, C = {c_i}, wherein c_i is the i-th connected candidate character component, θ_i is its corresponding rotation angle, and i ranges from 1 to n;
performing linear regression processing on a process parameter b according to a least square method to obtain a regression line; wherein the regression line and the objective function of the least square method are given by the corresponding formulas (published as formula images), and a_i is a preset weight coefficient;
projecting onto the regression line according to the position information of the candidate character component corresponding to each character, and acquiring the projection points on the two sides, denoted (x_l, y_l) and (x_r, y_r) respectively;
determining the candidate text line by using the acquired projection points on the two sides; the position information of the candidate text line is given by the corresponding formula (published as a formula image), wherein w_i and h_i are respectively the width pixels and the height pixels of the i-th candidate character component.
7. The text detection method of claim 6, wherein the generating text detection results for the candidate text lines comprises:
obtaining a text feature map corresponding to the candidate text line through feature transformation;
establishing a second neural network and configuring a loss function of the second neural network as
L = L_c + L_p
wherein L_c is a regression loss function of the candidate text confidence score, given by the corresponding formula (published as a formula image), in which λ_2 is a preset weight coefficient, y' is the predicted candidate text confidence score and y* is the labeled candidate text confidence score; L_p is a regression loss function of the candidate text position and satisfies L_p = L_d + L_β, with
L_d = Σ_j smoothed_L1(d_j - d_j*)
L_β = Σ_j smoothed_L1(β_j - β_j*)
wherein d represents a candidate text position and is represented by the coordinates or coordinate offsets of four corner points (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4), smoothed_L1() is an activation function related to the L_1 norm, d_j is the predicted j-th candidate text position, d_j* is the labeled j-th candidate text position, β_j is the predicted j-th candidate text rotation angle, and β_j* is the labeled j-th candidate text rotation angle;
inputting the text feature map into the second neural network, and calculating a candidate text confidence score y' and a candidate text position d when a loss function of the second neural network converges;
and obtaining a text detection result of the candidate text line by using the candidate text confidence score y' and the candidate text position d.
8. A text detection apparatus, comprising:
a camera for capturing an image of a natural scene to form an image to be detected, the image to be detected comprising one or more characters;
a processor connected with the camera and configured to process the image to be detected according to the text detection method of any one of claims 1 to 7 to obtain a text detection result;
and a display connected with the processor and configured to display the image to be detected and/or the text detection result.
9. The text detection apparatus of claim 8, wherein the processor comprises:
the image acquisition module is used for acquiring the image to be detected from the camera;
the component extraction module is used for constructing candidate frames corresponding to the characters according to the image to be detected, extracting the characteristic information of the candidate frames and generating candidate character components corresponding to the characters;
the relation prediction module is used for predicting the connected region of the candidate character components respectively corresponding to each character to obtain the connection relation between the candidate character components;
the character fusion module is used for merging a plurality of candidate character components which are connected with each other and fusing to obtain candidate text lines;
and the result generation module is used for generating a text detection result of the candidate text line.
10. A computer-readable storage medium, characterized in that the medium has stored thereon a program which is executable by a processor to implement the text detection method according to any one of claims 1-7.
CN202110417152.1A 2021-04-19 2021-04-19 Text detection method and device based on target detection and storage medium Pending CN113033559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110417152.1A CN113033559A (en) 2021-04-19 2021-04-19 Text detection method and device based on target detection and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110417152.1A CN113033559A (en) 2021-04-19 2021-04-19 Text detection method and device based on target detection and storage medium

Publications (1)

Publication Number Publication Date
CN113033559A true CN113033559A (en) 2021-06-25

Family

ID=76456817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110417152.1A Pending CN113033559A (en) 2021-04-19 2021-04-19 Text detection method and device based on target detection and storage medium

Country Status (1)

Country Link
CN (1) CN113033559A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673652A (en) * 2021-08-12 2021-11-19 维沃软件技术有限公司 Two-dimensional code display method and device and electronic equipment
CN113807351A (en) * 2021-09-18 2021-12-17 京东鲲鹏(江苏)科技有限公司 Scene character detection method and device
CN113807351B (en) * 2021-09-18 2024-01-16 京东鲲鹏(江苏)科技有限公司 Scene text detection method and device

Similar Documents

Publication Publication Date Title
CN108549893B (en) End-to-end identification method for scene text with any shape
CN109829893B (en) Defect target detection method based on attention mechanism
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN111723585B (en) Style-controllable image text real-time translation and conversion method
CN109635883B (en) Chinese character library generation method based on structural information guidance of deep stack network
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN105574524B (en) Based on dialogue and divide the mirror cartoon image template recognition method and system that joint identifies
CN114529925B (en) Method for identifying table structure of whole line table
CN113435240B (en) End-to-end form detection and structure identification method and system
CN110766020A (en) System and method for detecting and identifying multi-language natural scene text
US10713515B2 (en) Using multiple cameras to perform optical character recognition
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN113033558B (en) Text detection method and device for natural scene and storage medium
CN113033559A (en) Text detection method and device based on target detection and storage medium
Jiang et al. Linearized multi-sampling for differentiable image transformation
CN111027538A (en) Container detection method based on instance segmentation model
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
Chen et al. An adaptive deep learning framework for fast recognition of integrated circuit markings
CN112364863B (en) Character positioning method and system for license document
CN113034492A (en) Printing quality defect detection method and storage medium
Rest et al. Illumination-based augmentation for cuneiform deep neural sign classification
CN115909378A (en) Document text detection model training method and document text detection method
CN114708591A (en) Document image Chinese character detection method based on single character connection
JPH07168910A (en) Document layout analysis device and document format identification device
US11106931B2 (en) Optical character recognition of documents having non-coplanar regions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination