CN113033558B - Text detection method and device for natural scene and storage medium - Google Patents


Info

Publication number
CN113033558B
CN113033558B (application CN202110417133.9A)
Authority
CN
China
Prior art keywords
text
image
detected
feature map
text line
Prior art date
Legal status
Active
Application number
CN202110417133.9A
Other languages
Chinese (zh)
Other versions
CN113033558A (en)
Inventor
杨洋
Current Assignee
Shenzhen Huahan Weiye Technology Co ltd
Original Assignee
Shenzhen Huahan Weiye Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Huahan Weiye Technology Co., Ltd.
Priority to CN202110417133.9A
Publication of CN113033558A
Application granted
Publication of CN113033558B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words


Abstract

The application relates to a text detection method and device for natural scenes and a storage medium. The text detection method comprises the following steps: acquiring an image to be detected, constructing feature maps of multiple scales of the image to be detected, and predicting the text category and connection relation of each pixel point; forming text connected components from the pixel points whose text category and connection relation satisfy thresholding conditions, and obtaining the position information and confidence score of each text line in the image to be detected through instance segmentation; and performing regression processing on the position information and confidence scores to obtain a text detection result for each text line. In this technical scheme, binary classification and connectivity judgment are performed on each pixel in the feature map, instance segmentation is performed on the image to be detected according to the connectivity result, and the text lines are further refined by regression processing, so that accurate text line positions and text line confidence scores can be obtained and the adaptability of text detection in natural scenes is improved.

Description

Text detection method and device for natural scene and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for detecting text in a natural scene, and a storage medium.
Background
Text detection in natural scenes has extremely wide application in real life and is an important prerequisite for text recognition, for example in text retrieval, guideboard recognition and intelligent marking of test papers; it can also be applied to fields such as image retrieval, machine translation and automatic driving. Owing to various uncontrollable interference factors in natural scenes, such as light and shadow, shooting angle and occlusion by foreign objects, as well as inherent properties of the text itself, such as artistic, deformed or incomplete characters, detecting text in natural scenes remains a very difficult task. To improve the efficiency of office automation, automatic identification and automated manufacturing, automatically generating documents from images is an important application; for example, automatically generating bill information from bill images can greatly reduce the labor intensity of bank staff.
At present, the difficulty of natural scene text detection is mainly caused by factors in several aspects: (1) diversity and variability: compared with the text content in standard documents, natural scene text can be multi-scale and multi-lingual and can differ in shape, direction, proportion, color and other aspects, and all of these variations make text detection harder; (2) complex backgrounds: scene text can appear against any background, including signal signs, bricks, grass, fences and the like, and such backgrounds can have characteristics very similar to text, so background noise can interfere with the judgment of text, while text missing due to occlusion by foreign objects can lead to potential detection errors; (3) uneven imaging quality: text images may suffer distortion and defocus due to different shooting angles or distances, or exhibit noise and shadows due to different illumination during shooting, so the imaging quality of text images cannot be guaranteed.
Natural scene text detection methods for complex backgrounds can be roughly divided into three types: sliding-window-based methods, connected-component-based methods, and deep learning methods. Sliding-window-based methods do not make full use of the inherent characteristics of text, which leads to the extraction of many more non-text candidate regions than real text regions, so the subsequent filtering of non-text regions becomes extremely complex; such methods are also sensitive to external factors such as illumination changes and image blurring. Connected-component-based methods use a multi-stage pipeline to locate text information: MSER regions are first extracted from the three R, G, B channels of the image, a classifier is trained to remove duplicate MSER regions and non-text MSER regions to obtain candidate MSER regions, the candidate text regions are connected into text strips, and the obtained text strips are de-duplicated. Although such methods can detect and locate text regions, the process is complex and divided into multiple stages, the detection effect depends on the quality of the candidate regions generated by MSER, and it is also affected by the manually designed feature extraction. Deep learning methods mainly process rectangular text lines based on regression; when the text is curved, the predicted detection box cannot accurately cover the whole text region, and for long text lines, boxes may be lost or predicted incompletely once the aspect ratio of the text line exceeds a preset prediction threshold.
Disclosure of Invention
The technical problem mainly solved by this application is how to accurately detect text information from images with complex backgrounds. To solve this problem, the application provides a text detection method and device for natural scenes and a storage medium.
According to a first aspect, in one embodiment, there is provided a text detection method for a natural scene, including: acquiring an image to be detected, wherein the image to be detected comprises one or more text lines; constructing feature maps of multiple scales of the image to be detected, and predicting the text category and connection relation of each pixel point according to the feature maps of the multiple scales; forming text connected components from the pixel points whose text category and connection relation satisfy thresholding conditions so as to carry out instance segmentation on the image to be detected, and obtaining the position information and confidence score of each text line in the image to be detected; and performing regression processing on the position information and confidence score of each text line in the image to be detected to obtain a text detection result corresponding to each text line.
Constructing feature maps of multiple scales of the image to be detected and predicting the text category and connection relation of each pixel point according to the feature maps of the multiple scales comprises: performing convolution and downsampling operations on the image to be detected N times in succession, obtaining a first feature map I_i of a corresponding scale after each operation, i = 1, 2, …, N, wherein the larger the value of i, the smaller the scale of the corresponding first feature map I_i; convolving the first feature map I_N and the first feature map I_{N-1} respectively, performing a summation operation, and performing an upsampling operation on the summation result to obtain a second feature map of the same scale as the first feature map I_{N-2}; performing a summation operation on the convolved first feature map I_{N-2} and the second feature map of the same scale, followed by an upsampling operation, to obtain a second feature map of the same scale as the first feature map I_{N-3}, and so on, until a second feature map of the same scale as the first feature map I_1 is obtained; convolving the first feature map I_1 and summing the result with the second feature map of the same scale to obtain a third feature map; and performing pixel feature analysis on the third feature map to obtain the text category and connection relation of each pixel point in the third feature map.
Performing pixel feature analysis on the third feature map to obtain the text category and connection relation of each pixel point in the third feature map comprises: establishing a first neural network and configuring its loss function in terms of a preset weight coefficient α, the predicted confidence score p of the text category and connection relation of a pixel point, and the labeled confidence score p̂ of the text category and connection relation of that pixel point; inputting the third feature map into the first neural network, determining the confidence score p of the text category and connection relation of each pixel point in the third feature map when the loss function of the first neural network converges, and determining the text category and connection relation of the corresponding pixel point according to the confidence score p.
Forming text connected components from the pixel points whose text category and connection relation satisfy the thresholding conditions so as to carry out instance segmentation on the image to be detected and obtain the position information and confidence score of each text line in the image to be detected comprises: thresholding the predicted text category and connection relation of each pixel point, forming equivalent pairs from pixel points whose text category belongs to text and whose connection relation is positive, and storing the equivalent pairs in a preset set; connecting the corresponding pixel points according to the equivalent pairs in the set, constructing one or more connected components and forming corresponding connected regions; mapping each connected region into the image to be detected, and fitting a circumscribed rectangle to each mapped region in the image to be detected to obtain a text box corresponding to each text line; and determining the position information and confidence score of each text line according to the text boxes corresponding to the text lines in the image to be detected.
The method further comprises a noise filtering step after the text boxes corresponding to the text lines are obtained, wherein the noise filtering step comprises: determining the coordinate offset, width and height in pixels and rotation angle of each text box according to the coordinates of the pixel points in that text box in the image to be detected; configuring noise filtering conditions according to the coordinate offset, width and height in pixels and/or rotation angle of each text box in the image to be detected; and discarding the text boxes in the image to be detected that meet the noise filtering conditions so as to retain the remaining text boxes, and obtaining one or more candidate text lines from the remaining text boxes.
Determining the position information and confidence score of each text line according to the text boxes corresponding to the text lines in the image to be detected comprises: acquiring the coordinate offset, width and height in pixels and rotation angle of each text box in the image to be detected, and determining the position information of the corresponding text line from them; and obtaining the confidence scores of the pixel points in each text box in the image to be detected, and determining the confidence score of the corresponding text line from the confidence scores of the pixel points in that text box.
Performing regression processing on the position information and confidence score of each text line in the image to be detected to obtain the text detection result corresponding to each text line comprises: performing feature transformation on each text line to obtain a text feature map corresponding to that text line; establishing a second neural network and configuring its loss function as L = L_q + L_p, wherein L_q is a regression loss function for the text line confidence score, defined in terms of a preset weight coefficient λ, the predicted text line confidence score y' and the labeled text line confidence score ŷ; L_p is a regression loss function for the text line position information and satisfies L_p = L_d + L_β, where d denotes the text line position represented by the coordinates or coordinate offsets of the four corner points, d = (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4), L_d is computed by applying an activation function about the L_1 norm to the differences between the predicted j-th text line position d_j and the labeled j-th text line position d̂_j, and L_β is computed likewise from the predicted rotation angle β_j of the j-th text line and its labeled rotation angle β̂_j; inputting the text feature map into the second neural network, and calculating the text line confidence score y' and the text line position d when the loss function of the second neural network converges; and obtaining the text detection result of the corresponding text line using the text line confidence score y' and the text line position d.
According to a second aspect, in one embodiment there is provided a text detection device comprising: the camera is used for capturing images of a target scene and forming an image to be detected, wherein the image to be detected comprises one or more text lines; the processor is connected with the camera and is used for processing the image to be detected according to the text detection method in the first aspect to obtain a text detection result; and the display is connected with the processor and used for displaying the image to be detected and/or the text detection result.
The processor includes: an image acquisition module for acquiring the image to be detected from the camera; a text prediction module for constructing feature maps of multiple scales of the image to be detected and predicting the text category and connection relation of each pixel point according to the feature maps of the multiple scales; a text segmentation module for forming text connected components from the pixel points whose text category and connection relation satisfy the thresholding conditions so as to perform instance segmentation on the image to be detected and obtain the position information and confidence score of each text line in the image to be detected; and a result generation module for performing regression processing on the position information and confidence score of each text line in the image to be detected to obtain the text detection result corresponding to each text line.
According to a third aspect, an embodiment provides a computer readable storage medium having stored thereon a program executable by a processor to implement the text detection method described in the first aspect.
The beneficial effects of this application are:
According to the above embodiments, a text detection method and device for natural scenes and a storage medium are provided, wherein the text detection method comprises the following steps: acquiring an image to be detected, constructing feature maps of multiple scales of the image to be detected, and predicting the text category and connection relation of each pixel point according to the feature maps of the multiple scales; forming text connected components from the pixel points whose text category and connection relation satisfy thresholding conditions so as to perform instance segmentation on the image to be detected, and obtaining the position information and confidence score of each text line in the image to be detected; and performing regression processing on the position information and confidence score of each text line in the image to be detected to obtain a text detection result corresponding to each text line. In this technical scheme, binary classification and connectivity judgment are performed on each pixel in the feature map, instance segmentation is performed on the image to be detected according to the connectivity result, and the text lines are further refined by regression processing, so that accurate text line positions and text line confidences can be obtained and the adaptability of text detection in natural scenes is improved.
In addition, when predicting the text category and connection relation of the pixel points in the feature map, each pixel point is subjected by the first neural network to text category judgment and to judgment of its connection relations with the neighboring pixels, so the object of analysis is more fine-grained and technical support is provided for instance segmentation of the image. When instance segmentation is performed on the image to be detected, the text category and connection relation of each pixel point are fully considered, so the text boxes in the image to be detected can be obtained by constructing connected components and mapping them, and each text line can still be detected against a complex background. When the text detection result of a text line is generated, regression processing is applied by a neural network to the text line confidence score and the text line position, which realizes position fine-tuning of the candidate text lines; moreover, positioning by four corner points can adapt to the positioning requirements of different text scales, further improving the adaptive and accurate positioning capability of the text detection method with respect to text scale.
Drawings
FIG. 1 is a flow chart of a text detection method for natural scenes in the present application;
FIG. 2 is a schematic diagram of a text detection method;
FIG. 3 is a flow chart for predicting text class and connection relationships of each pixel;
FIG. 4 is a flow chart of a process for obtaining location information and confidence scores for text lines;
FIG. 5 is a flow chart of a process for obtaining text detection results;
FIG. 6 is a schematic diagram of a network structure for extracting image features;
FIG. 7 is a schematic diagram of constructing a text box in an image to be detected;
FIG. 8 is a schematic diagram of a text detection device according to the present application;
FIG. 9a is a schematic diagram of a text line connected region constructed in a key sign;
FIG. 9b is a schematic diagram of the location of a line of marked text in a key sign;
FIG. 10 is a schematic diagram of a processor;
fig. 11 is a schematic structural diagram of a text detection device according to another embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings by way of specific embodiments, wherein like elements in different embodiments are given associated like numerals. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that in different situations some of these features may be omitted or replaced by other elements, materials or methods. In some instances, some operations related to the present application are not shown or described in the specification in order to avoid obscuring the core of the present application; a detailed description of these operations is not necessary, since a person skilled in the art can fully understand them from the description herein and general technical knowledge in the field.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The numbering of the components itself, e.g. "first", "second", etc., is used herein merely to distinguish between the described objects and does not have any sequential or technical meaning. The terms "coupled" and "connected," as used herein, are intended to encompass both direct and indirect coupling (coupling), unless otherwise indicated.
The technical scheme of the application is specifically described below with reference to examples.
Embodiment 1
Referring to fig. 1, the present embodiment discloses a text detection method for a natural scene, which mainly includes steps 110-150, and is described below.
In step 110, an image to be detected is obtained. The image may be an optical image of the surface of an object to be detected in a natural scene. Since the object to be detected is a kind of object carrying text information on its surface (such as a billboard, a sign, a commodity label, a page, an engraving or a handbook), the image to be detected may include one or more characters (such as Chinese characters, letters, digits or symbols); several strings of characters form a text line and express a specific textual meaning, and of course the characters may also form multiple text lines expressing their respective meanings. That is, the image to be detected may include one or more text lines.
It should be noted that a camera or video camera may be used to capture an image of an object to be detected in a natural scene, so as to obtain an image to be detected including one or more text lines, and then the processor obtains the image to be detected from the camera or video camera to perform further image processing. In addition to one or more text lines, the image to be detected may also include some simple or complex background information (such as color, lines, stains, scratches, patterns, etc.), where interference of the background information needs to be eliminated and the position of the text line is detected, and even the content of the text line can be identified.
Step 120, constructing feature maps of multiple scales of the image to be detected, and predicting the text category and connection relation of each pixel point according to the feature maps of the multiple scales.
Referring to fig. 2, in the technical solution of the present application, a neural network is mainly used to perform backbone feature extraction on the image to be detected to obtain feature maps of multiple scales, and the text category and pixel connection relations are predicted for the pixel points of a feature map obtained by convolving and upsampling the feature maps of the multiple scales. Here, both the text category and the connection relation can be represented by a confidence: if a pixel belongs to some text, that pixel is considered to have a high confidence with respect to the text category (which can be quantized as a value between 0 and 1); if a pixel point and any pixel point in its neighborhood belong to the same text instance, the connection relation between the two pixel points is considered to have a high confidence (also quantized as a value between 0 and 1). It is understood that the text category here refers to the text/non-text classification, and the connection relation refers to the connected/disconnected classification.
And 130, constructing text connected components by using the text categories and the pixel points with the connection relations meeting the thresholding conditions to perform instance segmentation on the image to be detected, and obtaining the position information and the confidence scores of each text line in the image to be detected.
Here, thresholding is performed on the pixel classification (i.e. the confidence of the text category) and the connection classification (i.e. the confidence of the connection relation) predicted by the network in step 120, using two different thresholds, so as to obtain binarized classification results. For example, if the confidence of the text category of a pixel is greater than a first threshold (e.g. 50%), the pixel is considered to belong to some text and its text category is positive, otherwise negative; if the confidence of the connection relation between one pixel point and any pixel point in its neighborhood is greater than a second threshold (e.g. 50%), the two pixel points are considered to belong to the same text instance and the connection relation between them is positive, otherwise negative.
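As an illustration of this thresholding step, the following sketch (the array layout and threshold values are assumptions for illustration, not the patent's code) binarizes a per-pixel text-confidence map and an 8-direction connection-confidence map with two independent thresholds:

```python
import numpy as np

def threshold_predictions(text_conf, link_conf, text_thresh=0.5, link_thresh=0.5):
    """Binarize the per-pixel predictions.

    text_conf: (H, W) array, confidence that each pixel is text.
    link_conf: (H, W, 8) array, confidence that each pixel is connected
               to each of its 8 neighbours.
    Returns boolean maps text_mask (H, W) and link_mask (H, W, 8).
    """
    text_mask = text_conf > text_thresh      # positive text pixels
    link_mask = link_conf > link_thresh      # positive connection relations
    # A connection is only meaningful between text pixels, so links emanating
    # from non-text pixels are suppressed before equivalent pairs are formed.
    link_mask &= text_mask[..., None]
    return text_mask, link_mask
```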
For example, in fig. 2, after the text category and the connection relation of each pixel point are obtained by using the feature map, the pixels can be connected into connected regions based on the result of connection classification, and the connected regions are mapped into the image to be detected, so that the circumscribed rectangle of the mapped region in the image to be detected can be easily obtained, thereby forming the text box corresponding to each text line respectively. Since the text box reflects the specific position of the text line inside, the pixels inside the text box can be considered as text pixels, and the pixels outside the text box are non-text pixels, so that the position information and the confidence score of the corresponding text line can be obtained under the condition that the position and the confidence of the text box are known.
In this embodiment, the position information of each text line in the image to be detected may be represented by coordinates, width and height dimensions and a rotation angle, and the confidence score may be represented by the confidence of whether the line is text.
In this embodiment, instance segmentation is one of the more important processing methods in machine vision and can be regarded as the task of identifying object contours at the pixel level. The purpose of performing instance segmentation on the image to be detected is to obtain text regions directly by connecting pixels, so candidate boxes for individual characters do not need to be acquired, which improves the efficiency of text line detection. In some cases, the process of instance segmentation may also be described as: inputting an image, i.e. feeding the whole picture into a CNN for feature extraction; generating suggestion windows (proposals) with a region proposal network, a plurality of suggestion windows being generated per picture; mapping the suggestion windows onto the feature map of the last layer of the CNN; generating a fixed-size feature map for each RoI by RoIAlign; and finally performing classification with fully connected layers and obtaining the mask through box regression. It can be understood that the essence of instance segmentation is to detect an image, find the RoIs in it, correct the pixels of each RoI by RoIAlign, and then predict the class to which each instance belongs with a designed FCN for each RoI, so as to obtain the image instance segmentation result.
Step 140, performing regression processing on the position information and confidence score of each text line in the image to be detected to obtain the text detection result corresponding to each text line. The position information and confidence score of a text line obtained in step 130 are merely preliminary results; in order to adapt to text regions of different shapes and obtain more accurate detection results, regression fine-tuning may be performed on the position information and confidence score of each text line. For example, after the text lines are obtained, feature transformation is performed through RoIPooling or RoIAlign to obtain a text feature map of a specific size corresponding to each text line, the position information and confidence score of each text line are regressed using the text feature map, and the text line confidence score and text line position after regression form the text detection result.
For example, in fig. 2, after the corresponding text lines are respectively determined through the text boxes in the image to be detected, the text feature map corresponding to each text line can be obtained through a feature transformation mode, and then the text line confidence score and the text line position of each text line are determined through a regression fine tuning mode.
Note that RoIPooling refers to pooling of an RoI, where RoI (Region of Interest) refers to a region of the image in which an object is present and pooling refers to pooling of the image's feature vectors; in a convolutional neural network, a pooling layer often follows a convolutional layer and reduces the feature vectors output by the convolutional layer. The specific operation of RoIPooling can be described as: according to the input image, mapping the RoI to the corresponding position of the feature map, dividing the mapped region into parts of the same size, and performing a max pooling or average pooling operation on each part.
It should be noted that RoIAlign is an improvement on RoIPooling; the difference is that the quantization operation is cancelled and the image value at a position with floating-point coordinates is obtained by bilinear interpolation, so the whole feature aggregation process becomes a continuous operation, which solves the region mismatch caused by the two quantizations in the RoIPooling operation. The specific operation of RoIAlign can be described as: traversing each candidate region in the image while keeping the floating-point boundaries without quantization, dividing the candidate region into k × k units whose boundaries are likewise not quantized, fixing four sampling positions in each unit, computing the values of these four positions by bilinear interpolation, and then performing a max pooling operation.
In this embodiment, for a text line corresponding to a large target in the image, either RoIPooling or RoIAlign may be selected; for text lines of smaller targets in the image, RoIAlign may be preferred, and the processing result will be more accurate.
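To make the difference concrete, the simplified single-channel sketch below samples a floating-point RoI with bilinear interpolation, the key step that distinguishes RoIAlign from the quantizing RoIPooling; the output grid size and the single sample per cell are simplifying assumptions (RoIAlign as usually implemented averages several samples per cell):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a single-channel feature map feat (H, W) at float location (y, x)."""
    h, w = feat.shape
    y0, x0 = max(int(np.floor(y)), 0), max(int(np.floor(x)), 0)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out_size=7):
    """Pool an RoI (y1, x1, y2, x2) given in un-quantized feature-map
    coordinates into an out_size x out_size grid, one bilinear sample per
    cell centre (no rounding anywhere)."""
    y1, x1, y2, x2 = roi
    cell_h, cell_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.zeros((out_size, out_size), dtype=float)
    for i in range(out_size):
        for j in range(out_size):
            cy = y1 + (i + 0.5) * cell_h   # cell centre stays a float
            cx = x1 + (j + 0.5) * cell_w
            out[i, j] = bilinear_sample(feat, cy, cx)
    return out
```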
As can easily be understood by those skilled in the art, in the technical scheme of this application, binary classification and connectivity judgment are performed on each pixel in the feature map, the text pixels predicted as positive are connected through the connection relations predicted as positive, instance segmentation of the image is carried out according to the connectivity result, and the text lines are further refined by regression processing, so that accurate text line positions and text line confidences can be obtained and the adaptability of text detection in natural scenes is improved.
In this embodiment, referring to fig. 3, the above-mentioned step 120 mainly involves a process of constructing a feature map, predicting text types and connection relationships, and may specifically include steps 121-124, which are respectively described below.
Step 121, performing convolution and downsampling operations on the image to be detected N times in succession, and obtaining a first feature map I_i of a corresponding scale after each operation, i = 1, 2, …, N, where the larger i is, the smaller the scale of the corresponding first feature map I_i.
Referring to fig. 6, the N = 5 convolution and downsampling operations on the image to be detected can be implemented by means of a VGG16 network: the first layers of the network can adopt a convolution + pooling design and the last layers a pooling + fully-connected design, which realizes both the convolution and downsampling (pooling) operations and obtains feature extraction over different receptive fields; each downsampling operation is part of the forward pass of the network. During the forward pass, the size of the feature map changes after some layers and not after others; the layers that do not change the feature map size are grouped into a processing block (Block), so each extracted feature is the output of one block, and a feature pyramid is thus formed. Five first feature maps with different resolutions can be obtained from the feature pyramid, and they are ranked from large to small resolution as I_1, I_2, I_3, I_4, I_5.
In fig. 6, the image to be detected is processed by the first processing block (convolution a1 + max pooling b1 + convolution a2) to output the first feature map I_1; the first feature map I_1 passes through the second processing block (max pooling b2 + convolution a3) to output the first feature map I_2; the first feature map I_2 passes through the third processing block (max pooling b3 + convolution a4) to output the first feature map I_3; the first feature map I_3 passes through the fourth processing block (max pooling b4 + convolution a5) to output the first feature map I_4; and the first feature map I_4 passes through the fifth processing block (max pooling b5 + fully connected q1 + fully connected q2) to output the first feature map I_5. It will be appreciated that for the first feature maps I_i extracted in fig. 6, the subscript i ranges over i = 1, 2, …, 5.
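A minimal PyTorch sketch of such a backbone is given below; the channel widths, and the use of stride-1 pooling with 1×1 convolutions in place of the fully connected layers of the fifth block, are assumptions chosen so the five outputs mirror the block structure described for fig. 6, not the patent's actual configuration:

```python
import torch
import torch.nn as nn

def conv_relu(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class Backbone(nn.Module):
    """Five processing blocks producing feature maps I1..I5 of decreasing scale."""
    def __init__(self):
        super().__init__()
        # block 1: convolution a1 + max pooling b1 + convolution a2
        self.block1 = nn.Sequential(conv_relu(3, 64), nn.MaxPool2d(2), conv_relu(64, 128))
        # blocks 2 to 4: max pooling + convolution
        self.block2 = nn.Sequential(nn.MaxPool2d(2), conv_relu(128, 256))
        self.block3 = nn.Sequential(nn.MaxPool2d(2), conv_relu(256, 512))
        self.block4 = nn.Sequential(nn.MaxPool2d(2), conv_relu(512, 512))
        # block 5: pooling + fully connected layers in the description; assumed here
        # as stride-1 pooling and 1x1 convolutions so that I5 stays spatial and keeps
        # the same scale as I4, allowing the two to be summed directly in step 122.
        self.block5 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                    nn.Conv2d(512, 512, 1), nn.ReLU(inplace=True),
                                    nn.Conv2d(512, 512, 1))

    def forward(self, x):
        i1 = self.block1(x)    # 1/2 of the input resolution
        i2 = self.block2(i1)   # 1/4
        i3 = self.block3(i2)   # 1/8
        i4 = self.block4(i3)   # 1/16
        i5 = self.block5(i4)   # 1/16 (stride-1 pooling)
        return i1, i2, i3, i4, i5
```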
Step 122, convolving the first feature map I_N and the first feature map I_{N-1} respectively and then performing a summation operation, and performing an upsampling operation on the summation result to obtain a second feature map of the same scale as the first feature map I_{N-2}; performing a summation operation on the convolved first feature map I_{N-2} and the second feature map of the same scale, followed by an upsampling operation, to obtain a second feature map of the same scale as the first feature map I_{N-3}; and so on, until a second feature map of the same scale as the first feature map I_1 is obtained.
Referring to fig. 6, because the scales (or resolutions) of the first feature maps I_1 to I_5 decrease gradually, they also need to be upsampled so that the subsequent pixel feature analysis can be performed on a feature map containing rich feature information. For example, the result of processing the first feature map I_5 with convolution a10 and the result of processing the first feature map I_4 with convolution a9 are summed, and the summation result is upsampled (s1) to obtain a second feature map P_1 of the same scale as the first feature map I_3; then the first feature map I_3 after convolution a8 and the second feature map P_1 of the same scale are summed and upsampled (s2) to obtain a second feature map P_2 of the same scale as the first feature map I_2; next, the first feature map I_2 after convolution a7 and the second feature map P_2 of the same scale are summed and upsampled (s3) to obtain a second feature map P_3 of the same scale as the first feature map I_1. Through the convolution and upsampling iterations illustrated in fig. 6, a second feature map of the same scale as the first feature map I_1 can be obtained. The summation operations referred to here are those shown in fig. 6, and the convolutions a6, a7, a8, a9 and a10 can all adopt 1×1 convolution.
It should be noted that, to ensure consistency of the upsampling and downsampling operations, the scale sizes (i.e. the width and height in pixels) of the first feature maps I_1, I_2, I_3 correspond respectively to those of the second feature maps P_3, P_2, P_1. Likewise, the third feature map R_1 and the first feature map I_1 have the same scale.
Step 123, convolving the first feature map I_1 and performing a summation operation on the result and the second feature map of the same scale to obtain the third feature map. Referring to fig. 6, the first feature map I_1 after convolution a6 is summed with the second feature map P_3 of the same scale to obtain the third feature map R_1.
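Continuing the backbone sketch above, the top-down fusion of steps 122-123 can be written as follows (channel widths are again assumed, and a fixed 2× upsampling factor is assumed to match the pooling stride of the backbone):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNeck(nn.Module):
    """Fuses I5..I1 with 1x1 convolutions (a6..a10), element-wise sums and 2x
    upsampling (s1..s3), producing the third feature map R1 at the scale of I1."""
    def __init__(self, channels=(128, 256, 512, 512, 512), out_ch=128):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)

    def forward(self, feats):
        i1, i2, i3, i4, i5 = feats
        p1 = F.interpolate(self.lateral[4](i5) + self.lateral[3](i4), scale_factor=2)  # scale of I3
        p2 = F.interpolate(self.lateral[2](i3) + p1, scale_factor=2)                   # scale of I2
        p3 = F.interpolate(self.lateral[1](i2) + p2, scale_factor=2)                   # scale of I1
        return self.lateral[0](i1) + p3                                                # third feature map R1

# usage sketch (input sides divisible by 32 assumed):
# r1 = FusionNeck()(Backbone()(torch.zeros(1, 3, 64, 64)))
```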
And 124, performing pixel characteristic analysis on the third characteristic diagram to obtain text categories and connection relations of all pixel points in the third characteristic diagram. In one implementation, to achieve pixel feature analysis, a neural network may be utilized to achieve regression processing of the third feature map.
First, a first neural network needs to be established, for example with a structure formed by two fully connected layers, and its loss function is configured in terms of a preset weight coefficient α, the predicted confidence score p of the text category and connection relation of a pixel point, and the labeled confidence score p̂ of the text category and connection relation of that pixel point.
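The exact form of this loss is not specified above; a weighted binary cross-entropy is one standard form consistent with the quantities α, p and p̂ and is shown here purely as an assumed illustration, not as the patent's actual loss:

```python
import torch

def pixel_loss(p, p_hat, alpha=1.0, eps=1e-6):
    """Assumed weighted binary cross-entropy over per-pixel text/connection scores.

    p     : predicted confidence scores in (0, 1), any shape.
    p_hat : labeled scores (0 or 1), same shape as p.
    alpha : the preset weight coefficient described in the text.
    """
    p = p.clamp(eps, 1 - eps)
    bce = -(p_hat * torch.log(p) + (1 - p_hat) * torch.log(1 - p))
    return alpha * bce.mean()
```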
After the first neural network is trained using labeled images, a complete first neural network can be obtained. Each pixel point in an image can be labeled using a preset pixel generation rule and a preset inter-pixel connection rule. The pixel generation rule can be described as: each pixel within a text box is labeled positive (or 1); if there is overlapping text, only pixels within non-overlapping text boxes are labeled positive (or 1), otherwise they are labeled negative (or 0). The inter-pixel connection rule can be described as: given a pixel, if that pixel and a pixel in its neighborhood belong to the same text instance, the connection relation of the two pixels is labeled positive (or 1), otherwise it is labeled negative (or 0). It should be noted that each pixel point typically has 8 neighboring pixels (up, down, left, right, upper left, lower left, upper right, lower right), and adjacent pixels do not necessarily have a connection, because they may belong to different text instances. Since the network training process is already well established, this prior art will not be described in detail here.
Then, the third feature map (such as the third feature map R_1 in fig. 6) is input into the first neural network, the confidence score p of the text category and connection relation of each pixel point in the third feature map is determined when the loss function of the first neural network converges, and the text category and connection relation of the corresponding pixel point are determined according to the confidence score p. Since the confidence score p includes confidence results for both the text category and the connection relation, and the connection relation in turn covers the pixel adjacency in the eight directions of the neighborhood (up, down, left, right, upper left, lower left, upper right, lower right), the confidence score of each pixel can comprise 9 score values: 1 score value is the confidence of the text category (represented by a value between 0 and 1, a larger value meaning higher confidence), and the remaining 8 score values are the confidences of the pixel adjacency relations in the eight directions of the neighborhood (likewise 0 to 1, a larger value meaning higher confidence). It will be appreciated that since the confidence score of each pixel can be predicted, the location (e.g. coordinates) of that pixel is also known.
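The per-pixel scores can be produced by a small prediction head applied to the third feature map; the sketch below (layer sizes and the use of softmax are assumptions) outputs the 2 text channels and 16 connection channels mentioned for fig. 6 and converts them into the 1 + 8 confidence values per pixel discussed here:

```python
import torch
import torch.nn as nn

class PixelHead(nn.Module):
    """Predicts, for every pixel of the third feature map, a text/non-text score
    (2 channels) and eight neighbour-connection scores (16 channels)."""
    def __init__(self, in_ch=128):
        super().__init__()
        self.text_cls = nn.Conv2d(in_ch, 2, 1)    # text vs. non-text
        self.link_cls = nn.Conv2d(in_ch, 16, 1)   # 8 directions x (connected / not)

    def forward(self, r1):
        n, _, h, w = r1.shape
        text_logits = self.text_cls(r1)                         # (N, 2, H, W)
        link_logits = self.link_cls(r1).view(n, 8, 2, h, w)     # (N, 8, 2, H, W)
        # index 1 of each softmax pair is taken as the "positive" confidence
        text_conf = torch.softmax(text_logits, dim=1)[:, 1]     # (N, H, W)
        link_conf = torch.softmax(link_logits, dim=2)[:, :, 1]  # (N, 8, H, W)
        return text_conf, link_conf                              # 1 + 8 scores per pixel
```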
As can be appreciated by those skilled in the art, when predicting the text category and the connection relation of the pixel points in the feature map, the text category judgment and the connection relation judgment with the neighborhood pixels are performed on each pixel point through the first neural network, so that the analysis object is more refined, and technical support is provided for the example segmentation of the image.
In this embodiment, referring to fig. 4, the above-mentioned step 130 mainly relates to thresholding and text instance segmentation, and may specifically include steps 131-134, which are respectively described below.
Step 131, thresholding is performed on the text category and the connection relation of each predicted pixel point, the pixel points with the text category belonging to the text and the connection relation being positive form equivalent pairs, and the equivalent pairs are stored in a preset set.
Because the text category and connection relation of each pixel point in the third feature map are represented by confidence scores, the confidence of the text category and the confidence of the connection relation can be thresholded to obtain the corresponding classification results. In a specific embodiment, referring to fig. 6, after the confidence of each pixel point of the third feature map R_1 with respect to the text category is thresholded, 2 channels can be used to represent the text category of the pixel: the parameter 2 in fig. 6 denotes two channels, one representing text and the other non-text, and the distinction between text and non-text is the binarization result of the text-category confidence; for example, if the text-category confidence of a pixel is greater than 0.5, the text category of that pixel is considered positive, i.e. it belongs to text. Similarly, 16 channels can be used to represent the pixel adjacency between each pixel and the eight pixels in its neighborhood: the parameter 16 in fig. 6 denotes 16 channels, there being adjacency relations in 8 directions between each pixel point and its eight neighbors, each direction corresponding to two channels, one channel representing that the two pixel points are connected and the other that they are not; the connected/disconnected distinction is the binarization result of the connection-relation confidence, and, for example, if the confidence of the connection relation between a pixel point and any pixel point in its neighborhood is greater than 0.5, the connection relation between the two pixel points is considered positive, i.e. the two pixels are adjacent.
It will be understood that after the binarized text pixel result (2 channels) and the binarized pixel adjacency result (16 channels) are obtained, for each pixel p whose text category is positive (i.e. it belongs to text), every pixel q in its neighborhood is examined; if the connection relation between p and q is positive, p and q form an equivalent pair (i.e. the combination of the coordinates of pixels p and q), and the equivalent pair is added to the union-find set.
And step 132, connecting corresponding pixel points according to equivalent pairs in the set, constructing one or more connected components and forming corresponding connected regions.
Each equivalent pair stored in the union-find set records the coordinates of two pixels that belong to the text category and have a connection relation, so all text pixels predicted as positive can be connected into connected components using the connection relations predicted as positive; each connected component corresponds to a connected region, and that connected region is a detected text instance.
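A compact way to realise this grouping is a union-find (disjoint-set) structure over pixel coordinates; the sketch below is one possible implementation of the idea, not code taken from the patent:

```python
def connected_components(equiv_pairs, text_pixels):
    """Group positive text pixels into connected components.

    equiv_pairs : iterable of ((y1, x1), (y2, x2)) pixel pairs whose connection
                  relation was predicted positive.
    text_pixels : iterable of (y, x) pixels whose text category is positive.
    Returns a dict mapping each component root to the list of its pixels.
    """
    parent = {p: p for p in text_pixels}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in equiv_pairs:
        if a in parent and b in parent:
            union(a, b)

    components = {}
    for p in parent:
        components.setdefault(find(p), []).append(p)
    return components
```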
Step 133, mapping each connected region into the image to be detected, and fitting a circumscribed rectangle to each mapped region in the image to be detected to obtain the text box corresponding to each text line.
To understand the situation of each text line in the image to be detected, each connected region constructed in the third feature map needs to be mapped into the image to be detected to determine the position of the corresponding text line in the image to be detected. There is a resolution difference between the third feature map and the image to be detected; for example, if the width and height resolution of the image to be detected is 2 times that of the third feature map, then multiplying any pixel coordinate of a connected region by 2 gives the coordinates of the corresponding pixel point in the image to be detected, and the corresponding pixel points in the image to be detected are combined to form the mapped region in the image to be detected.
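Under the 2× resolution example above, the mapping can be as simple as scaling pixel coordinates; the sketch below (the scale factor of 2 is taken from the example, not fixed by the patent) also expands each feature-map pixel to its block of image pixels so the mapped region stays dense:

```python
def map_region_to_image(region_pixels, scale=2):
    """Map a connected region of the third feature map into the image to be
    detected; each feature-map pixel corresponds to a scale x scale block of
    image pixels, and the block is expanded so the mapped region stays dense."""
    mapped = set()
    for y, x in region_pixels:
        for dy in range(scale):
            for dx in range(scale):
                mapped.add((y * scale + dy, x * scale + dx))
    return mapped
```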
Referring to fig. 7, for an image to be detected containing the characters C, R, A, O, C, there are two mapped regions, one being the region of C-R and the other the region of A-C-R-O-C. Since a mapped region may have an irregular shape following the text line, a circumscribed rectangle must be fitted to the mapped region to obtain the text box corresponding to the text line: one rectangular text box determines the text line C-R, and the other rectangular text box determines the text line A-C-R-O-C.
In a specific embodiment, the size of the text box can be adaptively matched to the size of the text line; for example, if the width and height of a text line are 200 pixels and 100 pixels respectively, the text box can be set to 300 pixels by 150 pixels, i.e. 1.5 times the text line size.
It should be noted that, since the coordinates and confidence scores of the pixels in each connected region of the third feature map are known, the coordinates and confidence scores of the pixels in the mapped region in the image to be detected can be obtained through the mapping of the connected region. For the text box circumscribing a mapped region, the coordinates of each pixel point in the text box can be obtained by extending the coordinate values of the pixel points in the mapped region, so the center pixel coordinates and the width and height in pixels of the text box can be calculated, and the rotation angle of the text box can be calculated from the coordinates of its edge pixels. Likewise, the confidence score of each pixel point in the text box can be reasonably assigned according to the confidence scores of the pixel points in the mapped region, so the confidence score of the text box can be obtained by averaging the confidence scores of its pixel points. In this way the position information and confidence score of the text box are obtained: the position information can be represented by coordinates, width and height in pixels and a rotation angle, and the confidence score by the averaged text/non-text confidence.
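One concrete way to obtain these quantities is sketched below using OpenCV's minimum-area rectangle; treating the circumscribed rectangle as a rotated minimum-area rectangle and averaging pixel confidences are assumptions for illustration:

```python
import numpy as np
import cv2

def box_from_region(mapped_pixels, pixel_scores):
    """Fit a rotated circumscribed rectangle to a mapped region and average the
    confidence scores of its pixels.

    mapped_pixels : list of (y, x) image coordinates belonging to the region.
    pixel_scores  : dict mapping (y, x) to that pixel's text confidence.
    Returns ((cx, cy), (w, h), angle) and the averaged confidence score.
    """
    mapped_pixels = list(mapped_pixels)
    pts = np.array([(x, y) for (y, x) in mapped_pixels], dtype=np.float32)
    rect = cv2.minAreaRect(pts)   # centre, (width, height), rotation angle
    score = float(np.mean([pixel_scores[p] for p in mapped_pixels]))
    return rect, score
```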
Step 134, determining the position information and confidence score of each text line according to the text boxes corresponding to the text lines in the image to be detected. In a specific embodiment, the coordinate offset, width and height in pixels and rotation angle of each text box in the image to be detected are acquired, and the position information of the corresponding text line is determined from them; the confidence scores of the pixel points in each text box in the image to be detected are obtained, and the confidence score of the corresponding text line is determined from the confidence scores of the pixel points in that text box.
In the present embodiment, since an attempt is made to connect text pixels predicted to be positive by a connection relation predicted to be positive (see step 132), some noise results inevitably occur in this process, and a filtering operation is required in a subsequent step for each connected component obtained. For example, in another embodiment, after obtaining text boxes corresponding to the text lines respectively (for example, after step 133 and before step 134 in fig. 4), the method further includes a noise filtering step, where the noise filtering step includes:
(1) Determining the coordinate offset, width and height in pixels and rotation angle of each text box according to the coordinates of the pixel points in that text box in the image to be detected.
(2) Configuring noise filtering conditions according to the coordinate offset, width and height in pixels and/or rotation angle of each text box in the image to be detected.
(3) Discarding the text boxes in the image to be detected that meet the noise filtering conditions so as to retain the remaining text boxes, and obtaining one or more candidate text lines from the remaining text boxes; then, in step 134, only the position information and confidence scores of the one or more candidate text lines are determined.
A faster method is to filter the obtained text boxes using their geometric properties together with statistics computed on the training set, so as to remove noise. For example, geometric properties of a text box such as height, width, aspect ratio or area can be used, and a text box is discarded if its short side is less than a pixels long (or its area is less than b). The two numbers a and b are not manually specified values but thresholds counted on the training data set; for example, a is the value at the 90% position in the sorted set of short-side lengths of all text boxes, and b is the value at the 90% position in the sorted set of areas of all text boxes. Of course, the coordinate offset or the rotation angle of a text box may also be used as a noise filtering condition; for example, a text box is discarded when the coordinate offset of its center pixel is greater or less than some offset threshold, or when its rotation angle is greater or less than some angle threshold. It will be appreciated that the noise filtering conditions can be set freely according to actual needs; these are only some possible implementations and do not constitute a strict limitation of the noise filtering step.
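A statistics-based filter of this kind could look like the following sketch; the percentile convention used to turn the training-set statistics into the thresholds a and b is an assumption (the text above gives 90% as an example but does not fix the convention):

```python
import numpy as np

def filter_noise_boxes(boxes, train_short_sides, train_areas, keep_ratio=0.9):
    """Discard candidate boxes whose short side or area falls below thresholds
    counted on the training set.

    boxes             : list of dicts with keys 'w', 'h', 'area'.
    train_short_sides : short-side lengths of the annotated training boxes.
    train_areas       : areas of the annotated training boxes.
    The thresholds keep roughly keep_ratio of training boxes; this percentile
    convention is an assumption.
    """
    a = np.percentile(train_short_sides, (1 - keep_ratio) * 100)  # short-side threshold
    b = np.percentile(train_areas, (1 - keep_ratio) * 100)        # area threshold
    return [box for box in boxes if min(box['w'], box['h']) >= a and box['area'] >= b]
```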
Those skilled in the art can understand that when instance segmentation is performed on the image to be detected, the text category and connection relation of each pixel point are fully considered, so the text boxes in the image to be detected can be obtained by constructing connected components and mapping them, and each text line can still be detected against a complex background.
In this embodiment, referring to fig. 5, the above-mentioned step 140 mainly concerns the regression processing of the position information and the confidence scores, and may specifically include steps 141-143, which are described in turn below.
In step 141, feature transformation is carried out on each text line to obtain a text feature map corresponding to that text line. In one implementation, each candidate text line may be feature-transformed using the RoIPooling or RoIAlign operation described above to obtain the corresponding text feature map.
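Assuming a PyTorch implementation, the feature transformation of the candidate text lines could be sketched with torchvision's roi_align as follows; the output size, the feature-map stride, and the use of axis-aligned candidate boxes are simplifying assumptions of the sketch.

```python
import torch
from torchvision.ops import roi_align

def extract_text_features(feature_map, boxes_xyxy, out_size=(7, 7), stride=4):
    """feature_map : (1, C, H, W) feature map produced for the image
    boxes_xyxy  : (K, 4) axis-aligned boxes of the K candidate text lines,
                  given in image coordinates as (x1, y1, x2, y2)
    """
    # roi_align expects boxes in the form (batch_index, x1, y1, x2, y2).
    idx = torch.zeros((boxes_xyxy.shape[0], 1),
                      dtype=boxes_xyxy.dtype, device=boxes_xyxy.device)
    rois = torch.cat([idx, boxes_xyxy], dim=1)
    # spatial_scale maps image coordinates onto the feature-map grid.
    return roi_align(feature_map, rois, output_size=out_size,
                     spatial_scale=1.0 / stride, aligned=True)
```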
In step 142, to further refine the location and confidence of the text lines, a neural network may be utilized to implement regression processing of the text feature map.
First, a second neural network is established; it may adopt a structure of two fully-connected layers, and its loss function is configured as
L = L_q + L_p
wherein L_q is the regression loss function for the text line confidence score, λ is a preset weight coefficient, y' is the predicted text line confidence score (with value range 0-1), and ŷ is the annotated text line confidence score (either 0 or 1, where 0 means no text and 1 means text); L_p is the regression loss function for the text line position information and satisfies L_p = L_d + L_β, with L_d = Σ_j smoothed_L1(d_j - d̂_j) and L_β = Σ_j smoothed_L1(β_j - β̂_j), where d denotes a text line position expressed by the coordinates or coordinate offsets of its four corner points, d = (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4), smoothed_L1(·) is an activation function based on the L_1 norm, d_j is the predicted position of the j-th text line, d̂_j is the annotated position of the j-th text line, β_j is the predicted rotation angle of the j-th text line, and β̂_j is the annotated rotation angle of the j-th text line.
In one embodiment, if x' denotes the argument of the activation function, the activation function smoothed_L1(x') can be written as smoothed_L1(x') = 0.5·x'^2 when |x'| < 1, and smoothed_L1(x') = |x'| - 0.5 otherwise.
After the second neural network is trained with annotated images (each text line in an image having been annotated with its position information and confidence score), a trained second neural network is obtained. Since the network training process is well-established prior art, it is not described in detail here.
The text feature map is then input into a second neural network, and a text line confidence score y' and a text line position d are calculated when the loss function of the second neural network converges.
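A minimal PyTorch sketch of such a second neural network and of the loss L = L_q + L_p is shown below; the hidden width, the sigmoid output, and the squared-error form used for the confidence term L_q are assumptions, since the embodiment only fixes the two fully-connected layers and the decomposition L_p = L_d + L_β.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondNetwork(nn.Module):
    """Two fully-connected layers that regress, for each candidate text line,
    a confidence score y', four corner points d and a rotation angle beta."""
    def __init__(self, in_features, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        # 1 confidence score + 8 corner coordinates + 1 rotation angle
        self.fc2 = nn.Linear(hidden, 1 + 8 + 1)

    def forward(self, text_feat):
        x = F.relu(self.fc1(text_feat.flatten(1)))
        out = self.fc2(x)
        y = torch.sigmoid(out[:, 0])   # text line confidence score y' in [0, 1]
        d = out[:, 1:9]                # four corner points (x1, y1, ..., x4, y4)
        beta = out[:, 9]               # rotation angle
        return y, d, beta

def second_loss(y, y_gt, d, d_gt, beta, beta_gt, lam=1.0):
    """L = L_q + L_p with L_p = L_d + L_beta."""
    l_q = lam * F.mse_loss(y, y_gt)           # confidence term (form assumed)
    l_d = F.smooth_l1_loss(d, d_gt)           # corner-point term
    l_beta = F.smooth_l1_loss(beta, beta_gt)  # rotation-angle term
    return l_q + l_d + l_beta
```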
In step 143, the text detection result of the corresponding text line is obtained from the text line confidence score y' and the text line position d.
It can be understood that the text line confidence score y' is the result of regression fine-tuning, by the second neural network, of the confidence score of the candidate text line, and the text line position d is the result of regression fine-tuning of its position information; compared with the confidence score and position information initially determined for the text line, y' and d therefore reflect the text detection result of the corresponding text line more accurately.
Those skilled in the art will readily understand that, when the text detection result of a text line is generated, the regression processing of the text line confidence score and the text line position by a neural network not only fine-tunes the text line position but also, through the four-corner positioning scheme, adapts to the positioning requirements of different text scales, which further improves the adaptive and accurate scale-positioning capability of the text detection method.
Second embodiment,
Referring to fig. 8, a text detection device 2 is disclosed in this embodiment based on the text detection method disclosed in the first embodiment, and the text detection device 2 includes a camera 21, a processor 22 and a display 23, which are described below.
The camera 21 is used for capturing images of a natural scene to form an image to be detected. For the purpose of text detection, the image to be detected formed in this way should include one or more text lines, each text line being composed of one or more characters and conveying a specific textual meaning.
It should be noted that the image to be detected may be an optical image of the surface of an object in the natural scene; the object may be one carrying text information on its surface (such as a billboard, a sign, a product label, a page, lettering, a handbook, etc.), so the image to be detected may include one or more characters (such as Chinese characters, letters, numbers, symbols, etc.), and these characters form one or more text lines. Of course, besides the one or more text lines, the image to be detected may also include some simple or complex background information (such as colors, lines, stains, scratches, patterns, etc.), and the processor 22 is required to exclude the interference of this background information and detect the text line positions from the image.
The processor 22 is connected to the camera 21 and is configured to process the image to be detected according to the text detection method described in the first embodiment, so as to obtain a text detection result. It will be appreciated that the processor 22 may be a CPU, GPU, FPGA, microcontroller or digital integrated circuit with a data processing function, provided that it can implement the text detection method of steps 110-140 above according to its own logic instructions.
The display 23 is connected to the processor 22 and is used for displaying the image to be detected and/or the text detection result. It will be appreciated that the display 23 may be any screen with an image display function, capable of displaying the image to be detected and the text detection result either individually or together; the specific screen type and display layout are not limited.
In the present embodiment, referring to fig. 10, the processor 22 includes an image acquisition module 22-1, a text prediction module 22-2, a text segmentation module 22-3 and a result generation module 22-4, which are described in turn below.
The image acquisition module 22-1 may communicate with the camera 21 to acquire an image to be detected from the camera 21.
The text prediction module 22-2 is connected to the image acquisition module 22-1 and is mainly used for constructing feature maps of the image to be detected at a plurality of scales and predicting the text category and connection relation of each pixel point from these feature maps. For example, the text prediction module 22-2 may perform N successive convolution and downsampling operations on the image to be detected, obtaining after each operation a first feature map I_i of the corresponding scale, i = 1, 2, …, N, where the larger i is, the smaller the scale of the first feature map I_i. The first feature map I_N and the first feature map I_{N-1} are each convolved and then summed, and the summation result is upsampled to obtain a second feature map whose scale equals that of the first feature map I_{N-2}; the convolved first feature map I_{N-2} is summed with the second feature map of the same scale and the result is upsampled to obtain a second feature map whose scale equals that of the first feature map I_{N-3}, and so on, until a second feature map whose scale equals that of the first feature map I_1 is obtained. Next, the first feature map I_1 is convolved and the convolution result is summed with the second feature map of the same scale to obtain a third feature map; pixel feature analysis is then performed on the third feature map to obtain the text category and connection relation of each pixel point in the third feature map. For details, the text prediction module 22-2 may refer to steps 121-124 in the first embodiment, which are not repeated here.
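The feature construction and fusion just described can be sketched roughly as follows in PyTorch; the channel widths, kernel sizes, activation, and the way I_N is brought to the scale of I_{N-1} before the first summation are assumptions, since the embodiment does not fix these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatures(nn.Module):
    """Bottom-up conv/downsample pyramid followed by a top-down sum/upsample
    fusion, ending in the third feature map used for pixel feature analysis."""
    def __init__(self, n_levels=4, channels=64, out_channels=64):
        super().__init__()
        self.n = n_levels
        self.down = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else channels, channels, 3, stride=2, padding=1)
             for i in range(n_levels)])
        self.lateral = nn.ModuleList(
            [nn.Conv2d(channels, out_channels, 1) for _ in range(n_levels)])

    def forward(self, image):
        # N successive convolution + downsampling operations:
        # first feature maps I_1 ... I_N (I_N has the smallest scale).
        feats, x = [], image
        for conv in self.down:
            x = F.relu(conv(x))
            feats.append(x)

        # Top-down fusion: convolve, sum with the next larger first feature
        # map, upsample, and repeat until the scale of I_1 is reached; the
        # final sum with the convolved I_1 plays the role of the third feature map.
        fused = self.lateral[-1](feats[-1])
        for i in range(self.n - 2, -1, -1):
            fused = F.interpolate(fused, size=feats[i].shape[-2:], mode="nearest")
            fused = fused + self.lateral[i](feats[i])
        return fused
```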
The text segmentation module 22-3 is connected to the text prediction module 22-2 and is mainly used for forming text connected components from the pixel points whose text category and connection relation meet the thresholding conditions, so as to perform instance segmentation on the image to be detected and obtain the position information and confidence score of each text line in the image to be detected. For example, the text segmentation module 22-3 thresholds the predicted text category and connection relation of each pixel point, forms an equivalence pair for every two pixel points whose text category is text and whose connection relation is positive, and stores the equivalence pairs in a preset set; the corresponding pixel points are then connected according to the equivalence pairs in the set, one or more connected components are constructed, and the corresponding connected regions are formed; each connected region is mapped into the image to be detected, and a circumscribed rectangle is fitted to the mapped region in the image to be detected to obtain the text box corresponding to each text line. Finally, the position information and confidence score of each text line are determined according to the text boxes respectively corresponding to the text lines in the image to be detected. For details, the text segmentation module 22-3 may refer to steps 131-134 in the first embodiment, which are not repeated here.
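A rough Python sketch of this segmentation procedure is given below; the two threshold values, the ordering of the eight neighbour links, and the use of a union-find structure to build the connected components are implementation assumptions.

```python
import numpy as np
import cv2

def segment_text_instances(text_score, link_score, t_text=0.7, t_link=0.7):
    """text_score : (H, W) per-pixel text/non-text confidence
    link_score : (H, W, 8) per-pixel connection confidence towards the
                 8 neighbours, ordered as in `neighbours` below (an assumption)
    Returns one set of 4 corner points per detected text line."""
    H, W = text_score.shape
    text_pos = text_score >= t_text
    parent = np.arange(H * W)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    # Equivalence pairs: both pixels are text and the link between them is positive.
    for y in range(H):
        for x in range(W):
            if not text_pos[y, x]:
                continue
            for k, (dy, dx) in enumerate(neighbours):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W and text_pos[ny, nx] \
                        and link_score[y, x, k] >= t_link:
                    union(y * W + x, ny * W + nx)

    # Group pixels into connected components and fit a circumscribed
    # (minimum-area) rectangle to each component.
    comps = {}
    for y, x in zip(*np.nonzero(text_pos)):
        comps.setdefault(find(y * W + x), []).append((x, y))
    boxes = []
    for pts in comps.values():
        rect = cv2.minAreaRect(np.array(pts, dtype=np.float32))
        boxes.append(cv2.boxPoints(rect))  # 4 corner points of the text box
    return boxes
```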
The result generation module 22-4 is connected to the text segmentation module 22-3 and is mainly used for performing regression processing on the position information and confidence score of each text line in the image to be detected to obtain the text detection result corresponding to each text line. For example, the result generation module 22-4 performs feature transformation on each text line to obtain the corresponding text feature map; establishes a second neural network and configures its loss function as L = L_q + L_p; then inputs the text feature map into the second neural network and calculates the text line confidence score y' and the text line position d when the loss function of the second neural network converges; and finally obtains the text detection result of the corresponding text line from the text line confidence score y' and the text line position d. For details, the result generation module 22-4 may refer to steps 141-143 in the first embodiment, which are not repeated here.
Take, for example, the plurality of key identification plates shown in fig. 9a and 9b, each of which carries text lines formed by characters connected in series. The text prediction module 22-2 predicts the text category and connection relation of the pixels for the text lines on the key identification plates; the text segmentation module 22-3 forms text connected components from the predicted pixels and builds connected regions, finally obtaining the connected regions shown in fig. 9a, in which each bar-shaped connected region represents one text line. The result generation module 22-4 performs regression fine-tuning on each text line, accurately calculates its four corner coordinates, and marks each text line with a detection box to form the detection result in fig. 9b. At this point, the position detection of the text lines on the key identification plates is completed.
In this implementation, the disclosed text detection device 2 realizes the functions of text image acquisition, text detection and result display. Because the processor adopts the text detection method designed in steps 110-140, text detection is carried out efficiently and accurately in natural scenes, the problems of multi-directional and variable-scale text detection in natural scenes are solved, and the positioning precision of the text is greatly improved. In addition, since the text detection method predicts parameters such as the coordinates, dimensions and angles of the text, the processor can detect horizontal or rotated text as well as text distributed along a ring, which effectively improves the applicability of the device in various text detection scenarios.
Third embodiment,
Referring to fig. 11, the present embodiment discloses a text detection device, and the text detection device 3 mainly includes a memory 31 and a processor 32.
The memory 31 is a computer-readable storage medium mainly used for storing a program, and the program may be the program code corresponding to the text detection method in the first embodiment. The processor 32 is connected to the memory 31 and is configured to execute the program stored in the memory 31 so as to implement the text detection method. The functions implemented by the processor 32 may refer to the processor 22 in the second embodiment and are not described again here.
Those skilled in the art will appreciate that all or part of the functions of the methods in the above embodiments may be implemented by hardware or by a computer program. When all or part of the functions are implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include a read-only memory, a random access memory, a magnetic disk, an optical disc, a hard disk, and so on; the above functions are realized when the program is executed by a computer. For example, the program may be stored in the memory of the device, and all or part of the functions described above are realized when the program in the memory is executed by the processor. In addition, the program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disc, a flash disk or a removable hard disk, and downloaded or copied into the memory of a local device, or installed by updating the system version of the local device; when the program in the memory is executed by a processor, all or part of the functions in the above embodiments can be realized.
The foregoing description of specific examples has been presented only to aid in the understanding of the present application and is not intended to limit the present application. Several simple deductions, modifications or substitutions may also be made by the person skilled in the art to which the present application pertains, according to the idea of the present application.

Claims (10)

1. A text detection method for a natural scene, comprising:
acquiring an image to be detected, wherein the image to be detected comprises one or more text lines;
constructing feature maps of a plurality of scales of the image to be detected, and predicting the text category and connection relation of each pixel point by adopting a first neural network according to the feature maps of the plurality of scales, wherein the text category refers to the classification of a pixel point as text/non-text, and the connection relation refers to the classification of the connection between the pixel point and the pixel points in its adjacent area as connected/disconnected;
forming text connected components by using text categories and pixel points with connection relations meeting thresholding conditions so as to carry out instance segmentation on the image to be detected, and obtaining position information and confidence scores of each text line in the image to be detected;
and carrying out feature transformation on each text line in the image to be detected to obtain a text feature map corresponding to each text line, carrying out regression processing on the position information and the confidence score of the text line by using a second neural network by using the text feature map, and obtaining the confidence score and the text line position of the text line after the regression processing to form a text detection result.
2. The text detection method of claim 1, wherein constructing the feature map of the plurality of scales of the image to be detected, and predicting the text category and the connection relationship of each pixel point by using the first neural network according to the feature map of the plurality of scales comprises:
performing N successive convolution and downsampling operations on the image to be detected, and obtaining after each operation a first feature map I_i of the corresponding scale, i = 1, 2, …, N, wherein the larger i is, the smaller the scale of the corresponding first feature map I_i;
convolving the first feature map I_N and the first feature map I_{N-1} respectively and then performing a summation operation, and performing an upsampling operation on the summation result to obtain a second feature map whose scale equals that of the first feature map I_{N-2}; performing a summation operation on the convolved first feature map I_{N-2} and the second feature map of the same scale and then an upsampling operation to obtain a second feature map whose scale equals that of the first feature map I_{N-3}, and so on, until a second feature map whose scale equals that of the first feature map I_1 is obtained; convolving the first feature map I_1 and performing a summation operation on the convolution result and the second feature map of the same scale to obtain a third feature map;
and performing pixel feature analysis on the third feature map by adopting the first neural network to obtain the text category and connection relation of each pixel point in the third feature map.
3. The text detection method of claim 2, wherein the performing pixel feature analysis on the third feature map by using a first neural network to obtain text categories and connection relations of each pixel point in the third feature map includes:
establishing a first neural network and configuring a loss function of the first neural network as
wherein α is a preset weight coefficient, p is the predicted confidence score of the text category and connection relation of a pixel point, and p̂ is the annotated confidence score of the text category and connection relation of the pixel point;
inputting the third feature map to the first neural network, determining the confidence score p of the text category and the connection relation of each pixel point in the third feature map when the loss function of the first neural network is converged, and determining the text category and the connection relation of the corresponding pixel point according to the confidence score p.
4. The text detection method as claimed in claim 3, wherein the forming text connected components by using the text category and the pixel points with the connection relationship meeting the thresholding condition to perform instance segmentation on the image to be detected to obtain the position information and the confidence score of each text line in the image to be detected includes:
Thresholding the predicted text category and connection relation of each pixel point, forming equivalent pairs of pixel points with text category belonging to the text and positive connection relation, and storing the equivalent pairs into a preset set;
connecting corresponding pixel points according to equivalent pairs in the set, constructing one or more connected components and forming corresponding connected areas;
mapping each connected region into the image to be detected, and fitting a circumscribed rectangle to the mapped region in the image to be detected to obtain the text boxes respectively corresponding to the text lines;
and determining the position information and the confidence score of each text line according to the text boxes respectively corresponding to each text line in the image to be detected.
5. The text detection method of claim 4, further comprising a noise filtering step after obtaining text boxes corresponding to the text lines, respectively, the noise filtering step comprising:
determining the coordinate offset, the width and height in pixels, and the rotation angle of the corresponding text box according to the coordinates of the pixel points in each text box in the image to be detected;
configuring noise filtering conditions according to the coordinate offset, the width and height in pixels, and/or the rotation angle of each text box in the image to be detected;
discarding the text boxes in the image to be detected that meet the noise filtering conditions so as to retain the remaining text boxes, and obtaining one or more candidate text lines according to the remaining text boxes.
6. The text detection method as claimed in claim 5, wherein the determining the location information and the confidence score of each text line according to the text box corresponding to each text line in the image to be detected includes:
acquiring the coordinate offset, the width and height in pixels, and the rotation angle of each text box in the image to be detected, and determining the position information of the corresponding text line from the coordinate offset, the width and height in pixels, and the rotation angle of each text box;
and obtaining the confidence scores of the pixel points in each text box in the image to be detected, and determining the confidence scores of the corresponding text lines by utilizing the confidence scores of the pixel points in each text box.
7. The text detection method as claimed in any one of claims 1 to 6, wherein the performing feature transformation on each text line in the image to be detected to obtain a text feature map corresponding to each text line, performing regression processing on the position information and the confidence score of the text line by using the text feature map and using a second neural network, and obtaining the confidence score and the text line position of the text line after the regression processing to form a text detection result includes:
Carrying out feature transformation on each text line to obtain a text feature map corresponding to each text line;
establishing a second neural network and configuring a loss function of the second neural network as
L = L_q + L_p
wherein L_q is the regression loss function for the text line confidence score, λ is a preset weight coefficient, y' is the predicted text line confidence score, and ŷ is the annotated text line confidence score; L_p is the regression loss function for the text line position information and satisfies L_p = L_d + L_β, with L_d = Σ_j smoothed_L1(d_j - d̂_j) and L_β = Σ_j smoothed_L1(β_j - β̂_j), where d denotes a text line position expressed by the coordinates or coordinate offsets of its four corner points, d = (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4), smoothed_L1(·) is an activation function based on the L_1 norm, d_j is the predicted position of the j-th text line, d̂_j is the annotated position of the j-th text line, β_j is the predicted rotation angle of the j-th text line, and β̂_j is the annotated rotation angle of the j-th text line;
inputting the text feature map into the second neural network, and calculating a text line confidence score y' and a text line position d when a loss function of the second neural network converges;
and obtaining a text detection result of the corresponding text line by using the text line confidence score y' and the text line position d.
8. A text detection device, comprising:
the camera is used for capturing images of a target scene and forming an image to be detected, wherein the image to be detected comprises one or more text lines;
A processor, connected to the camera, for processing the image to be detected according to the text detection method of any one of claims 1-7, to obtain a text detection result;
and the display is connected with the processor and used for displaying the image to be detected and/or the text detection result.
9. The text detection device of claim 8, wherein the processor comprises:
the image acquisition module is used for acquiring the image to be detected from the camera;
the text prediction module is used for constructing feature maps of a plurality of scales of the image to be detected, and predicting the text category and connection relation of each pixel point by adopting a first neural network according to the feature maps of the plurality of scales, wherein the text category refers to the classification of a pixel point as text/non-text, and the connection relation refers to the classification of the connection between the pixel point and the pixel points in its adjacent area as connected/disconnected;
the text segmentation module is used for forming text connected components by using the text categories and the pixel points with the connection relations meeting the thresholding conditions so as to conduct instance segmentation on the image to be detected, and position information and confidence scores of each text line in the image to be detected are obtained;
the result generation module is used for carrying out feature transformation on each text line in the image to be detected to obtain a text feature map corresponding to each text line, carrying out regression processing on the position information and the confidence score of the text line by using the text feature map and adopting a second neural network, and obtaining the confidence score and the text line position of the text line after the regression processing to form a text detection result.
10. A computer-readable storage medium, wherein the medium has stored thereon a program executable by a processor to implement the text detection method of any of claims 1-7.
CN202110417133.9A 2021-04-19 2021-04-19 Text detection method and device for natural scene and storage medium Active CN113033558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110417133.9A CN113033558B (en) 2021-04-19 2021-04-19 Text detection method and device for natural scene and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110417133.9A CN113033558B (en) 2021-04-19 2021-04-19 Text detection method and device for natural scene and storage medium

Publications (2)

Publication Number Publication Date
CN113033558A CN113033558A (en) 2021-06-25
CN113033558B true CN113033558B (en) 2024-03-19

Family

ID=76457404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110417133.9A Active CN113033558B (en) 2021-04-19 2021-04-19 Text detection method and device for natural scene and storage medium

Country Status (1)

Country Link
CN (1) CN113033558B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343987B (en) * 2021-06-30 2023-08-22 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium
CN113807351B (en) * 2021-09-18 2024-01-16 京东鲲鹏(江苏)科技有限公司 Scene text detection method and device
CN114998906B (en) * 2022-05-25 2023-08-08 北京百度网讯科技有限公司 Text detection method, training method and device of model, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609549A (en) * 2017-09-20 2018-01-19 北京工业大学 The Method for text detection of certificate image under a kind of natural scene
CN110020676A (en) * 2019-03-18 2019-07-16 华南理工大学 Method for text detection, system, equipment and medium based on more receptive field depth characteristics
CN110569738A (en) * 2019-08-15 2019-12-13 杨春立 natural scene text detection method, equipment and medium based on dense connection network
CN111027563A (en) * 2019-12-09 2020-04-17 腾讯云计算(北京)有限责任公司 Text detection method, device and recognition system
CN111126389A (en) * 2019-12-20 2020-05-08 腾讯科技(深圳)有限公司 Text detection method and device, electronic equipment and storage medium
CN111753828A (en) * 2020-05-19 2020-10-09 重庆邮电大学 Natural scene horizontal character detection method based on deep convolutional neural network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of natural scene text detection and recognition based on deep learning; Wang Jianxin; Wang Ziya; Tian Xuan; Journal of Software; 2020-05-15 (05); full text *

Also Published As

Publication number Publication date
CN113033558A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN111325203B (en) American license plate recognition method and system based on image correction
CN113033558B (en) Text detection method and device for natural scene and storage medium
CN109726643B (en) Method and device for identifying table information in image, electronic equipment and storage medium
Zhang et al. Text extraction from natural scene image: A survey
WO2018018788A1 (en) Image recognition-based meter reading apparatus and method thereof
CN107491730A (en) A kind of laboratory test report recognition methods based on image procossing
RU2659745C1 (en) Reconstruction of the document from document image series
CN106960208B (en) Method and system for automatically segmenting and identifying instrument liquid crystal number
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN110766020A (en) System and method for detecting and identifying multi-language natural scene text
CN101122952A (en) Picture words detecting method
CN109409356B (en) Multi-direction Chinese print font character detection method based on SWT
Breuel Robust, simple page segmentation using hybrid convolutional mdlstm networks
CN110443235B (en) Intelligent paper test paper total score identification method and system
CN110598566A (en) Image processing method, device, terminal and computer readable storage medium
Malik et al. An efficient skewed line segmentation technique for cursive script OCR
CN111626145A (en) Simple and effective incomplete form identification and page-crossing splicing method
CN113065559B (en) Image comparison method and device, electronic equipment and storage medium
CN109508714B (en) Low-cost multi-channel real-time digital instrument panel visual identification method and system
CN114581928A (en) Form identification method and system
CN113033559A (en) Text detection method and device based on target detection and storage medium
CN117218672A (en) Deep learning-based medical records text recognition method and system
Lai et al. Binarization by local k-means clustering for Korean text extraction
CN115731550A (en) Deep learning-based automatic drug specification identification method and system and storage medium
CN114550176A (en) Examination paper correcting method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A text detection method and device for natural scenes, storage medium

Granted publication date: 20240319

Pledgee: Shenzhen hi tech investment small loan Co.,Ltd.

Pledgor: SHENZHEN HUAHAN WEIYE TECHNOLOGY Co.,Ltd.

Registration number: Y2024980034542
