CN114926849A - Text detection method, device, equipment and storage medium - Google Patents

Text detection method, device, equipment and storage medium

Info

Publication number
CN114926849A
CN114926849A
Authority
CN
China
Prior art keywords
offset
text
determining
target
text detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210429576.4A
Other languages
Chinese (zh)
Inventor
周源赣
章水鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Sanbaiyun Information Technology Co ltd
Original Assignee
Nanjing Sanbaiyun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Sanbaiyun Information Technology Co ltd filed Critical Nanjing Sanbaiyun Information Technology Co ltd
Priority to CN202210429576.4A priority Critical patent/CN114926849A/en
Publication of CN114926849A publication Critical patent/CN114926849A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Input (AREA)

Abstract

The invention discloses a text detection method, apparatus, device and storage medium. The method comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a pre-constructed contracted offset text detection model, and determining a target semantic segmentation feature map and a target offset feature map; determining a target expansion distance and a rectangular box to be expanded according to the two feature maps; and expanding the rectangular box outward by the target expansion distance to determine a target text detection box. The pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model. The technical scheme of the embodiment solves the problem that text detection models trained on contracted text instances ignore the offset produced during contraction, which causes slow and inaccurate detection in dense text regions; it reduces the computation needed to determine the target text detection box and improves text detection efficiency.

Description

Text detection method, device, equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a text detection method, apparatus, device, and storage medium.
Background
Text detection has a wide range of applications and is a front-end step of many computer vision tasks, such as image search, character recognition, identity authentication and visual navigation. The purpose of text detection is mainly to locate the positions of text lines in an image; however, in natural scenes, text lines vary in size, font, color, shape, direction and background, and adjacent lines often adhere to one another during recognition. With the rise of deep learning, research on text detection has gradually become a hotspot, and a large number of text detection methods have appeared.
In natural-scene text detection, semantic-segmentation-based text detection algorithms generally contract each text instance to generate masks in which text instances are separated from one another, and use these masks as ground-truth samples; algorithms that adopt this idea include EAST, PSENet and others.
However, after learning from contracted samples, the EAST algorithm directly regresses a set of distances to the quadrilateral from the mask positions of the contracted region to locate a text instance, while PSENet progressively expands several masks with different contraction offsets from the inside out to obtain an accurate mask and then computes its bounding box to locate the text instance. Neither considers the offset that may be produced when generating the contracted sample box, so when they are applied to dense text regions, detection is slow and the results are poor, which affects the efficiency of text detection.
Disclosure of Invention
The invention provides a text detection method, apparatus, device and storage medium, so as to learn from and detect text in scenes with contraction offsets, improve the efficiency and accuracy of text detection, and balance the precision and speed required for text detection.
In a first aspect, an embodiment of the present invention provides a text detection method, including:
acquiring an image to be detected;
inputting the image to be detected into a pre-constructed contracted offset text detection model, and determining a target semantic segmentation feature map and a target offset feature map;
determining a target outward expansion distance and a rectangular frame to be outward expanded according to the target semantic segmentation feature map and the target offset feature map;
expanding the rectangular box to be expanded outward by the target expansion distance to determine a target text detection box;
the pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model.
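The four-step flow above can be sketched as a minimal pipeline. Everything here is a hypothetical stand-in: the patent specifies the roles of the model and the post-processing, not their implementations, and the toy maps and box format are illustrative only.

```python
def detect_text(image, model, postprocess, expand):
    """Sketch of the claimed pipeline; `model`, `postprocess` and
    `expand` are hypothetical callables standing in for the patent's
    detection model, box/distance derivation, and expansion step."""
    seg_map, offset_map = model(image)                     # step 2: both feature maps
    boxes, distances = postprocess(seg_map, offset_map)    # step 3: boxes + distances
    return [expand(b, d) for b, d in zip(boxes, distances)]  # step 4: one expansion

# Toy usage with stand-in components (illustrative values only):
toy_model = lambda img: ([[1, 1], [0, 0]], [[0.5, 0.5], [0, 0]])
toy_post = lambda seg, off: ([(1, 1, 3, 2)], [1.0])
toy_expand = lambda box, d: (box[0] - d, box[1] - d, box[2] + d, box[3] + d)
print(detect_text(None, toy_model, toy_post, toy_expand))  # [(0.0, 0.0, 4.0, 3.0)]
```

The point of the sketch is the single forward pass producing two maps, followed by a single geometric expansion per box.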
Further, the training of the contracted offset text detection model comprises the following steps:
performing basic feature extraction on an image sample set in a training sample set of the contracted offset text to determine a basic feature sample set; the training sample set of the contracted offset text comprises an image sample set and a calibration sample set corresponding to the image sample set, wherein the calibration sample set comprises contracted segmentation labels and offset labels corresponding to all image samples;
inputting the basic characteristic sample set into an initial semantic segmentation sub-model, and extracting a semantic segmentation intermediate result;
inputting the basic feature sample set into an initial offset regression sub-model, and extracting an offset intermediate result;
determining a corresponding first loss function according to the semantic segmentation intermediate result and the corresponding contracted segmentation label;
determining a corresponding second loss function according to the offset intermediate result and the corresponding offset label;
and determining a total loss function according to the first loss function and the second loss function, and training the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met to obtain the contracted offset text detection model.
Further, performing basic feature extraction on the image sample set in the contracted offset text training sample set to determine a basic feature sample set includes:
inputting the image sample set in the contracted offset text training sample set into a feature extraction backbone network, and determining a first feature map set; the first feature map set comprises a plurality of feature maps of different resolutions extracted from the image sample set;
performing multi-scale feature extraction on the first feature map set, and determining a second feature map set;
and performing multi-feature fusion on the second feature map set, and determining the set of fused feature maps as the basic feature sample set.
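The fusion step above can be illustrated with a minimal sketch. The patent does not fix the fusion operator; real models would use learned convolutions (an FPN-style neck is a common choice), so nearest-neighbour upsampling to a common resolution followed by channel stacking is an assumed simplification.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def fuse(feature_maps):
    """Bring every map to the largest resolution, one 'channel' per scale.

    Stand-in for the patent's multi-feature fusion; assumes resolutions
    divide the largest one evenly.
    """
    target = max(len(f) for f in feature_maps)
    return [upsample_nearest(f, target // len(f)) for f in feature_maps]
```

For example, fusing a 2x2 map with a 1x1 map yields two 2x2 channels.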
Further, the step of determining the contracted segmentation labels comprises:
for each image sample, constructing a first two-dimensional matrix corresponding to the image sample according to its size, and determining the shortest side length of the image sample;
if the shortest side length is less than or equal to a preset minimum box length, setting each pixel at the position of the labeled text in the first two-dimensional matrix to a first preset value;
and if the shortest side length is greater than the preset minimum box length, determining a first contraction distance according to the size of the image sample, updating the position of the labeled text according to the first contraction distance, and setting each pixel at the updated position of the labeled text to the first preset value.
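A minimal sketch of this label construction for a single axis-aligned text box follows. The patent does not disclose the contraction-distance formula; a PSENet-style clip offset d = A(1 - r^2) / L is assumed here, and `min_side` and `fill` stand in for the "preset minimum frame length" and "first preset numerical value".

```python
def shrink_distance(w, h, r=0.4):
    """Assumed PSENet-style contraction distance d = A(1 - r^2) / L."""
    return w * h * (1 - r * r) / (2 * (w + h))

def seg_label(img_w, img_h, box, min_side=3, fill=1):
    """Contracted segmentation label for one axis-aligned box (x0, y0, x1, y1)."""
    label = [[0] * img_w for _ in range(img_h)]
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    if min(w, h) > min_side:            # contract only boxes that are big enough
        d = int(shrink_distance(w, h))
        x0, y0, x1, y1 = x0 + d, y0 + d, x1 - d, y1 - d
    for y in range(y0, y1):             # first preset value at the (updated) text position
        for x in range(x0, x1):
            label[y][x] = fill
    return label
```

For an 8x8 box the assumed formula gives d = 1, so the filled region shrinks from 64 to 36 pixels; a 2x2 box falls below `min_side` and is filled unchanged.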
Further, the step of determining the offset label comprises:
for each image sample, constructing a second two-dimensional matrix corresponding to the image sample according to its size, and determining the shortest side length of the image sample;
if the shortest side length is less than or equal to the preset minimum box length, setting each pixel at the position of the labeled text in the second two-dimensional matrix to the first preset value;
and if the shortest side length is greater than the preset minimum box length, determining a second contraction distance according to the size of the image sample, and updating and assigning the second two-dimensional matrix according to the second contraction distance and the shortest side length.
Further, updating and assigning the second two-dimensional matrix according to the second contraction distance and the shortest side length comprises:
if the second contraction distance is smaller than the shortest side length, determining an offset strength value according to the second contraction distance and a preset reference value, updating the position of the labeled text according to the second contraction distance, and setting each pixel at the updated position of the labeled text to the offset strength value;
and if the second contraction distance is greater than or equal to the shortest side length, setting each pixel at the position of the labeled text in the second two-dimensional matrix to the first preset value.
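The offset-label branches above can be sketched as follows. As before, the contraction-distance formula, the reference value `ref`, `min_side` and `fill` are assumptions standing in for quantities the patent leaves open; the patent only fixes the branching logic.

```python
def offset_label(img_w, img_h, box, min_side=3, ref=10.0, fill=1.0):
    """Offset label for one axis-aligned text box (x0, y0, x1, y1)."""
    label = [[0.0] * img_w for _ in range(img_h)]
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    short = min(w, h)
    if short <= min_side:
        value = fill                     # tiny box: first preset value, no contraction
    else:
        # Assumed PSENet-style contraction distance (patent: "derived from size").
        d = int(w * h * (1 - 0.4 ** 2) / (2 * (w + h)))
        if d < short:
            value = d / ref              # offset strength = distance / reference value
            x0, y0, x1, y1 = x0 + d, y0 + d, x1 - d, y1 - d
        else:
            value = fill                 # degenerate contraction: preset value instead
    for y in range(y0, y1):
        for x in range(x0, x1):
            label[y][x] = value
    return label
```

With an 8x8 box and d = 1 the 36 contracted pixels each hold 1 / 10 = 0.1, which the model can later multiply back by the reference value at inference time.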
Further, determining a corresponding first loss function according to the semantic segmentation intermediate result and the corresponding contracted segmentation label, including:
comparing the corresponding value of each pixel in the semantic segmentation intermediate result with the corresponding value of each pixel in the corresponding contracted segmentation label;
and determining a first loss function according to the comparison result.
Further, determining a corresponding second loss function according to the offset intermediate result and the corresponding offset label, including:
comparing the value corresponding to each pixel in the offset intermediate result with the value corresponding to each pixel in the corresponding offset label;
and determining a second loss function according to the comparison result.
Further, determining a total loss function according to the first loss function and the second loss function, and training the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met to obtain the contracted offset text detection model, includes:
weighting and summing the first loss function and the second loss function according to a preset weight value to determine a total loss function;
and adjusting the weight parameters in the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met to obtain the contracted offset text detection model.
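The two losses and their weighted sum can be sketched as below. The patent only says predictions and labels are compared pixel by pixel; Dice loss for the segmentation head (the usual choice for shrunk-text masks, e.g. in PSENet) and mean absolute error for the offset head are assumptions, as are the weights `w_seg` and `w_off`.

```python
def dice_loss(pred, target, eps=1e-6):
    """Assumed first loss: pixel-wise Dice on flattened mask values."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)

def l1_loss(pred, target):
    """Assumed second loss: mean absolute error on flattened offset values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(seg_pred, seg_gt, off_pred, off_gt, w_seg=1.0, w_off=0.5):
    """Weighted sum of the two losses; the weights are illustrative."""
    return w_seg * dice_loss(seg_pred, seg_gt) + w_off * l1_loss(off_pred, off_gt)
```

Training then backpropagates `total_loss` through both sub-models jointly until the convergence condition is met.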
Further, determining the target expansion distance and the rectangular box to be expanded according to the target semantic segmentation feature map and the target offset feature map includes:
performing thresholding on the target semantic segmentation feature map, and determining a label map according to a connected component labeling algorithm;
intersecting the label map with the target offset feature map, and updating the target offset feature map;
traversing the different label values in the intersected label map, and determining the circumscribed rectangular box of the labeled object corresponding to each label value as a rectangular box to be expanded;
determining the average pixel value of each labeled object in the updated target offset feature map as the predicted offset of that labeled object;
and determining the target expansion distance according to the predicted offset and a preset reference value.
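The post-processing chain above can be sketched end to end. The threshold `thresh` and reference value `ref` are assumed hyper-parameters, and a simple BFS-based 4-connected labeling stands in for whatever connected component algorithm an implementation would use.

```python
from collections import deque

def label_components(binary):
    """4-connected component labeling of a binary map (list of rows)."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] and not labels[sy][sx]:
                count += 1
                labels[sy][sx] = count
                q = deque([(sy, sx)])
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] \
                           and not labels[ny][nx]:
                            labels[ny][nx] = count
                            q.append((ny, nx))
    return labels, count

def boxes_and_distances(seg_map, offset_map, thresh=0.5, ref=10.0):
    """Threshold, label, bound each component, and turn its mean predicted
    offset back into an expansion distance (mean offset * reference value)."""
    binary = [[1 if v > thresh else 0 for v in row] for row in seg_map]
    labels, n = label_components(binary)
    boxes, dists = [], []
    for k in range(1, n + 1):
        pix = [(y, x) for y, row in enumerate(labels)
               for x, v in enumerate(row) if v == k]
        ys, xs = [p[0] for p in pix], [p[1] for p in pix]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
        dists.append(sum(offset_map[y][x] for y, x in pix) / len(pix) * ref)
    return boxes, dists
```

Multiplying the mean offset by the reference value inverts the normalisation assumed in the offset label (offset strength = distance / reference value).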
Further, the text detection method further comprises:
determining the confidence of the labeled object corresponding to a rectangular box to be expanded according to the average pixel value of that labeled object in the target semantic segmentation feature map;
and if the confidence is smaller than a preset confidence threshold, deleting the rectangular box to be expanded.
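This filtering step is a straightforward mean-score test per component; the threshold value here is an assumption, and the per-component pixel lists are whatever the labeling step produced.

```python
def confident_boxes(boxes, pixels_per_box, seg_map, conf_thresh=0.7):
    """Keep a candidate box only if the mean segmentation score over its
    component's pixels reaches the (assumed) confidence threshold."""
    kept = []
    for box, pix in zip(boxes, pixels_per_box):
        conf = sum(seg_map[y][x] for y, x in pix) / len(pix)
        if conf >= conf_thresh:
            kept.append(box)
    return kept
```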
Further, expanding the rectangular box to be expanded outward by the target expansion distance to determine a target text detection box includes:
sorting the corner coordinates of the rectangular box to be expanded clockwise, and determining the coordinates of its center point;
shifting each edge of the rectangular box to be expanded outward from the center point by the target expansion distance;
extending both ends of each edge of the rectangular box to be expanded outward by the target expansion distance, and determining the new endpoint coordinates of each edge;
determining the intersection points of the edges of the expanded rectangular box as the new intersection coordinates;
and if the new intersection coordinates are consistent with the corresponding new endpoint coordinates, determining the expanded rectangular box as the target text detection box; otherwise, returning the original coordinates of the rectangular box to be expanded.
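For an axis-aligned rectangle, the construction above (shift each edge away from the center, extend the endpoints, re-intersect) reduces to moving every corner away from the center by the expansion distance along both axes, which is what this sketch shows. Rotated boxes would need the full edge-line intersection construction, and corners lying exactly on a center axis are not handled.

```python
def expand_box(corners, d):
    """Single outward expansion of an axis-aligned box given as four
    (x, y) corners; simplified stand-in for the patent's construction."""
    cx = sum(x for x, _ in corners) / 4.0
    cy = sum(y for _, y in corners) / 4.0
    return [(x + d if x > cx else x - d,   # push each corner away from
             y + d if y > cy else y - d)   # the center along both axes
            for x, y in corners]
```

One call grows a box whose sides sit at distance d inside the text boundary back out to that boundary, matching the single-expansion claim.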
In a second aspect, an embodiment of the present invention further provides a text detection apparatus, including:
the image acquisition module is used for acquiring an image to be detected;
the feature map determining module is used for inputting the image to be detected into a pre-constructed contracted offset text detection model and determining a target semantic segmentation feature map and a target offset feature map;
the expansion rectangle determining module is used for determining a target expansion distance and a rectangular box to be expanded according to the target semantic segmentation feature map and the target offset feature map;
the detection box determining module is used for expanding the rectangular box to be expanded outward by the target expansion distance to determine the target text detection box;
wherein the pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model.
In a third aspect, an embodiment of the present invention further provides a text detection apparatus, where the text detection apparatus includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, so that the at least one processor can implement the text detection method of any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to, when executed by a processor, implement the text detection method according to any embodiment of the present invention.
The embodiment of the invention provides a text detection method, apparatus, device and storage medium. An image to be detected is acquired; the image is input into a pre-constructed contracted offset text detection model to determine a target semantic segmentation feature map and a target offset feature map; a target expansion distance and a rectangular box to be expanded are determined from the two feature maps; and the box is expanded outward by the target expansion distance to determine the target text detection box. The pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model. With this technical scheme, the image to be detected is processed by the semantic segmentation sub-model and the offset regression sub-model respectively to obtain the corresponding target semantic segmentation feature map and target offset feature map; the rectangular box to be expanded for the contracted text and the target expansion distance are determined from the segmentation and offset results; and the box is then expanded outward by that distance to obtain the final target text detection box.
This solves the problem that existing text detection models trained on contracted text instances do not consider the offset produced during contraction, which makes detection of dense text regions slow and ineffective. By processing the same image to be detected with semantic segmentation and offset regression simultaneously, the offsets produced during contraction by texts of different adhesion degrees and sizes are fully considered, so that the determined target text detection box matches the text boundary to be located more closely, improving the accuracy of text detection. Meanwhile, the rectangular box to be expanded only needs to be expanded outward once by the determined target expansion distance, which reduces the computation needed to determine the target text detection box and improves text detection efficiency.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a text detection method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a text detection method according to a second embodiment of the present invention;
FIG. 3 is a flowchart illustrating the training of the contracted offset text detection model according to a second embodiment of the present invention;
Fig. 4 is a flowchart illustrating basic feature extraction on the image sample set in the contracted offset text training sample set to determine a basic feature sample set, according to the second embodiment of the present invention;
FIG. 5 is a flowchart illustrating the step of determining a contracted segmentation label according to the second embodiment of the present invention;
FIG. 6 is a flowchart illustrating a step of determining an offset tag according to a second embodiment of the present invention;
fig. 7 is a schematic structural diagram of a text detection apparatus in a third embodiment of the present invention;
fig. 8 is a schematic structural diagram of a text detection device in the fourth embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a text detection method according to an embodiment of the present invention. The embodiment is applicable to the case where the box and offset of contracted text in an image to be detected are determined by a contracted offset text detection model trained on contracted-offset text images, and an outward expansion is then completed according to the determined box and offset, so as to determine the target text detection box corresponding to the text position in the image to be detected.
As shown in fig. 1, a text detection method provided in this embodiment specifically includes the following steps:
and S101, acquiring an image to be detected.
In this embodiment, the image to be detected may be specifically understood as an image including text information that needs to be recognized, for example, the image to be detected may be a still image acquired by a monitoring camera, and may also be an image frame acquired in a video, where the image frame includes text information such as a slogan and a symbol that needs to be recognized, which is not limited in this embodiment of the present invention.
Specifically, in practical applications, when text information in a captured image needs to be recognized, or text information needs to be extracted from a video or an image, the position of that text information in the image or video frame must first be determined, that is, it must be detected whether the text information to be extracted exists in the image. The image in which the position of text information needs to be detected is then determined as the image to be detected.
S102, inputting the image to be detected into a pre-constructed contracted offset text detection model, and determining a target semantic segmentation feature map and a target offset feature map.
The pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model.
In this embodiment, the contracted offset text detection model may be specifically understood as a neural network model composed of two sub-models, trained on text samples that were contracted after labeling and therefore carry a contraction offset; it extracts features from the input image to be detected and performs semantic segmentation and offset regression on the extracted feature maps. The semantic segmentation sub-model may be specifically understood as the neural network in the contracted offset text detection model used to perform semantic segmentation on the feature-extracted image to be detected. The offset regression sub-model may be specifically understood as the neural network used to determine the offset of the detected contracted text relative to the true text boundary. The target semantic segmentation feature map may be specifically understood as the feature image obtained by semantic segmentation after feature extraction, corresponding to the text positions whose boundaries have been contracted inward. The target offset feature map may be specifically understood as the feature image, obtained by the offset regression sub-model after feature extraction, that records the relative offset of the contracted positions.
Specifically, an image to be detected is input into a pre-constructed contracted offset text detection model, after feature extraction is carried out on the image to be detected, the image to be detected is respectively input into a semantic segmentation sub-model and an offset regression sub-model in the contracted offset text detection model, and a target semantic segmentation feature map and a target offset feature map are determined according to output results of the sub-models.
S103, determining a target outward expansion distance and a rectangular frame to be outward expanded according to the target semantic segmentation feature map and the target offset feature map.
In this embodiment, the rectangle to be expanded may be specifically understood as a rectangle that is determined according to the target semantic segmentation feature map and the target offset feature map and used for representing the target text inner contracted boundary. The target extension distance can be specifically understood as a distance required for extending a rectangular box to be extended corresponding to the contracted boundary to the boundary of the target text according to different offsets of different target texts.
Specifically, the processed target semantic segmentation feature map is labeled and divided according to its connectivity, so as to obtain a label map whose labels represent connectivity relations. The labeled object corresponding to each pixel in the intersection of the target semantic segmentation feature map and the target offset feature map is then determined according to the label map, so that the offset value of each pixel of each labeled object is known. The rectangular box of the contracted boundary of each labeled object is determined from its connectivity and taken as a rectangular box to be expanded; the offset of each pixel in each labeled object is determined from the target offset feature map, and from these offsets the target expansion distance from each rectangular box to be expanded out to the text boundary of its labeled object is determined.
And S104, expanding the target expanding distance of the rectangular frame to be expanded outwards to determine a target text detection frame.
Specifically, the rectangular box to be expanded is expanded outward by the target expansion distance through a predetermined expansion algorithm, thereby expanding the contracted boundary of the corresponding target text back outward; the rectangular box finally expanded to the target text boundary is determined as the target text detection box.
According to the technical scheme of this embodiment, the image to be detected is obtained; the image to be detected is input into the pre-constructed contracted offset text detection model, and the target semantic segmentation feature map and the target offset feature map are determined; the target expansion distance and the rectangular frame to be expanded are determined according to the target semantic segmentation feature map and the target offset feature map; and the rectangular frame to be expanded is expanded outwards by the target expansion distance to determine the target text detection frame. The pre-constructed contracted offset text detection model comprises a semantic segmentation submodel and an offset regression submodel. With this technical scheme, the image to be detected is input into the pre-constructed contracted offset text detection model, and the semantic segmentation submodel and the offset regression submodel in that model process it respectively to obtain the corresponding target semantic segmentation feature map and target offset feature map. The rectangular frame to be expanded of the contracted text and the target expansion distance are determined according to the semantic segmentation result and the offset calculation result, and the rectangular frame to be expanded is expanded outwards by the target expansion distance to obtain the final target text detection frame.
This solves the problem that existing text detection models trained on contracted text examples do not consider the offset generated during the text contraction process, which leads to slow detection and poor results when detecting dense text regions. By processing the same image to be detected simultaneously with semantic segmentation and offset regression, the influence of the offset generated during contraction on texts with different degrees of adhesion and different sizes is fully considered, so that the determined target text detection frame coincides better with the boundary of the text whose position needs to be determined, and the accuracy of text detection is improved. Meanwhile, the rectangular frame to be expanded only needs to be expanded outwards once by the determined target expansion distance, which reduces the amount of calculation for determining the target text detection frame and improves text detection efficiency.
Embodiment Two
Fig. 2 is a flowchart of a text detection method provided in the second embodiment of the present invention. The technical solution of the second embodiment is further refined on the basis of the above optional technical solutions. It further determines how to obtain the segmentation labels and the offset labels from the training sample set of the contracted offset text, and how to train the semantic segmentation submodel and the offset regression submodel respectively on that training sample set, so as to finally obtain the constructed contracted offset text detection model. It further determines the label map by thresholding the target semantic segmentation feature map and labeling its connected components, so as to determine, from the intersection of the label map and the target offset feature map, the identified rectangular frames to be expanded that share the same label value, together with the predicted offsets corresponding to these rectangular frames. The target expansion distance is then determined according to the predicted offsets and a preset reference value, each side of the rectangular frame to be expanded is extended by the target expansion distance, and the rectangular frames to be expanded that do not meet a preset confidence threshold are deleted according to their confidences, finally obtaining the detected target text detection frame. This fully considers the influence of the offset on the identified frame of the contracted target text; at the same time, by calculating the confidence of each rectangular frame to be expanded and deleting those with lower confidence, it improves the determination precision of the target text detection frame, reduces the amount of calculation for determining the target text detection frame, and improves text detection efficiency.
As shown in fig. 2, a text detection method provided in the second embodiment of the present invention specifically includes the following steps:
S201, obtaining an image to be detected.
S202, inputting the image to be detected into a pre-constructed contracted offset text detection model, and determining a target semantic segmentation characteristic diagram and a target offset characteristic diagram.
The pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model.
Specifically, an image to be detected is input into a pre-constructed contracted offset text detection model, basic feature extraction of the image to be detected is completed through a backbone network in the contracted offset text detection model and a network correspondingly used for completing multi-scale feature extraction and feature fusion, the extracted basic features are respectively input into a semantic segmentation sub-model and an offset regression sub-model in the contracted offset text detection model, the output result of the semantic segmentation sub-model is determined as a target semantic segmentation feature map, and the output result of the offset regression sub-model is determined as a target offset feature map.
Further, fig. 3 is a flowchart illustrating a training procedure of a text detection model with retracted offset according to a second embodiment of the present invention, as shown in fig. 3, specifically including the following steps:
S301, extracting basic features of the image sample set in the training sample set of the contracted offset text, and determining the basic feature sample set.
The training sample set of the contracted offset text comprises an image sample set and a calibration sample set corresponding to the image sample set, wherein the calibration sample set comprises contracted segmentation labels and offset labels corresponding to the image samples.
In this embodiment, the training sample set of the contracted offset text can be specifically understood as a set of training objects, formed by real images and calibration images that contain the text information to be detected, which is input into the untrained contracted offset text detection model to train the backbone network, the semantic segmentation submodel and the offset regression submodel of the model. Further, since the contracted offset text detection model in the present application is a neural network model that performs text semantic segmentation and contraction offset determination on the contracted text according to the input image, the training sample set should include an image sample set of the same kind as the images to be detected that are input subsequently, and a calibration sample set formed by the calibration images of each image sample, which contain the text range and contraction offset information of the text to be detected. Each calibration sample in the calibration sample set has a one-to-one correspondence with an image sample in the image sample set; during training of the contracted offset text detection model, the calibration samples are compared with the intermediate results of the semantic segmentation submodel and the offset regression submodel to generate the corresponding loss functions.
The calibration sample set comprises a contracted segmentation label and an offset label corresponding to each image sample. The contracted segmentation label may be specifically understood as a calibration image or pixel matrix, corresponding to the image sample, in which the contracted boundary position of the target text to be recognized in the image sample is marked; the offset label may be specifically understood as a calibration image or pixel matrix, corresponding to the image sample, in which the offset generated by the target text to be recognized during the contraction process is marked. The basic feature sample set may be specifically understood as the set of feature maps obtained by performing feature extraction and multi-feature fusion on each image sample.
Further, fig. 4 is a flowchart illustrating a process of extracting a basic feature from an image sample set in a training sample set of a scaled-in offset text and determining a basic feature sample set according to a second embodiment of the present invention, as shown in fig. 4, which specifically includes the following steps:
S401, inputting the image sample set in the training sample set of the contracted offset text into a feature extraction backbone network, and determining a first feature map set.
The first feature map set comprises a plurality of feature maps with different resolutions extracted by the image sample set.
Specifically, each image sample in an image sample set in the training set of the contracted offset text is input into a backbone network for image feature extraction, a plurality of feature maps with different resolutions corresponding to each image sample are obtained, and a set of each feature map is determined as a first feature map set.
Optionally, the backbone network may be resnet50, or another network that can implement image feature extraction, which is not limited in the embodiment of the present invention. For example, assuming that the size of the input image sample is (b, 3, h, w), where h and w are the height and width of the image sample and b is the input batch size, image feature extraction is performed using resnet50 as the backbone network, and 5 feature maps with different resolutions can be obtained, which can be represented as F_n, n ∈ (0, 5).
S402, multi-scale feature extraction is carried out on the first feature map set, and a second feature map set is determined.
Specifically, the first feature map set is input into a pre-selected multi-scale feature extraction network or method, and multi-scale feature extraction is performed to obtain a plurality of feature maps with different resolutions; the set of these feature maps is determined as the second feature map set.
Following the above example, taking one image sample as an example, the corresponding first feature map set F_n, n ∈ (0, 5), is input into a standard Feature Pyramid Network (FPN) for multi-scale feature extraction, obtaining a feature map set composed of 4 feature maps with different resolutions, which can be represented as F_i, i ∈ (0, 3).
And S403, performing multi-feature fusion on the second feature map set, and determining a set of the fused feature maps as a basic feature sample set.
Specifically, the feature maps in the second feature map set are unified in size, the feature maps with unified size are input into a pre-selected multi-feature fusion algorithm, a basic feature sample corresponding to the image sample is obtained after multi-feature fusion is performed, and the set of the basic feature samples corresponding to the image samples is determined as the basic feature sample set.
Following the above example, the feature maps in the second feature map set F_i, i ∈ (0, 3), may be unified in size to (b, C1, h/4, w/4) by bilinear interpolation and then input into a pre-selected multi-feature fusion algorithm or multi-feature fusion module to obtain a basic feature sample of size (b, C2, h/4, w/4); the set of the basic feature samples is determined as the basic feature sample set. Furthermore, the multi-feature fusion module may be formed by connecting, in parallel, three convolution blocks with different convolution kernels and a stride of 1, and the corresponding basic feature sample can be obtained by adding, element by element, the feature maps of unchanged resolution output by each convolution block. Optionally, the three convolution kernel sizes may be 3 × 3, 5 × 5 and 1 × 1, respectively, and each convolution block may be composed of a convolution layer, a batch normalization layer and an activation function layer.
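The described multi-feature fusion module can be sketched in PyTorch as follows. The text fixes only the kernel sizes (3 × 3, 5 × 5, 1 × 1), the stride of 1, the conv + batch-norm + activation block structure, and the element-wise addition; the channel widths and the choice of ReLU are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    """One parallel branch: convolution + batch norm + activation,
    with padding chosen so spatial resolution is unchanged (stride 1)."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride=1, padding=k // 2),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class MultiFeatureFusion(nn.Module):
    """Three parallel conv blocks (3x3, 5x5, 1x1) whose equal-resolution
    outputs are added element by element, as described in the text."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.branches = nn.ModuleList(
            FusionBranch(c_in, c_out, k) for k in (3, 5, 1))

    def forward(self, x):
        return sum(b(x) for b in self.branches)
```

For an input of size (b, C1, h/4, w/4), the output keeps the spatial resolution and changes only the channel count, matching the (b, C2, h/4, w/4) shape given above.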
Further, fig. 5 is a flowchart illustrating a step of determining an indented segmentation label according to a second embodiment of the present invention, and as shown in fig. 5, the method specifically includes the following steps:
S501, for each image sample, constructing a first two-dimensional matrix corresponding to the image sample according to the size of the image sample, and determining the shortest side length of the labeled text in the image sample.
Specifically, for each input image sample, a two-dimensional matrix whose size is consistent with that of the image sample is generated and determined as the first two-dimensional matrix, and the corresponding shortest side length is determined according to the number of pixels occupied by the labeled text in the image sample. Optionally, the size of the first two-dimensional matrix may be determined according to the number of pixels in the image sample; the image sample may include a label for the outermost boundary of the text to be recognized, the labeled text information may be used as the labeled text, and the pixel value corresponding to the shortest side length of the labeled text is determined according to the pixels corresponding to the position of the labeled text. Optionally, the constructed first two-dimensional matrix may be entirely filled with 0, or with another preset initial value, which is not limited in the embodiment of the present invention.
S502, if the shortest side length is smaller than or equal to the preset minimum frame length, setting each pixel corresponding to the position of the label text in the first two-dimensional matrix as a first preset numerical value.
In this embodiment, the preset minimum frame length may be specifically understood as a preset frame length value used for determining whether the labeled text to be recognized needs to be contracted inward.
Specifically, when the shortest side length is less than or equal to the preset minimum frame length, the labeled text to be recognized is considered small, and recognition would become difficult if it were contracted; in this case, each pixel corresponding to the position of the labeled text in the first two-dimensional matrix is directly set to the first preset numerical value, and the position of the labeled text does not need to be contracted or updated. Optionally, the first preset numerical value may be 1, or another preset value, which is not limited in the embodiment of the present invention.
S503, if the shortest side length is larger than the preset minimum frame length, determining a first retraction distance according to the size of the marked text, updating the position of the marked text according to the first retraction distance, and setting each pixel corresponding to the updated position of the marked text as a first preset numerical value.
Specifically, if the shortest side length is greater than the preset minimum frame length, the labeled text to be recognized is considered large and needs to be contracted inward to avoid adhesion with other texts to be recognized. The first retraction distance by which it needs to be contracted is determined according to the size of the labeled text, the position information corresponding to the labeled text is then updated according to the first retraction distance, and each pixel corresponding to the updated position of the labeled text is set to the first preset numerical value.
Optionally, the retraction distance may be determined according to the area and the perimeter of the labeled text and a preset retraction degree parameter, and may be calculated by the following formula:
d = A × (1 − r²) / L
wherein d is the retraction distance, A is the area of the labeled text, L is the perimeter of the labeled text, and r is the preset retraction degree parameter. Optionally, r in the embodiment of the present invention may be set to 0.4, and may also be adjusted according to the actual situation, which is not limited in the embodiment of the present invention.
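Assuming the standard polygon-shrink rule d = A(1 − r²)/L, which is the formula consistent with the area, perimeter, and retraction degree parameter defined above, the retraction distance can be computed as:

```python
def shrink_distance(area, perimeter, r=0.4):
    """Retraction distance d = A * (1 - r^2) / L, where A is the labeled
    text area, L its perimeter, and r the retraction degree parameter
    (r = 0.4 per the text)."""
    return area * (1.0 - r * r) / perimeter
```

For a 40 × 40 square (A = 1600, L = 160) with r = 0.4, this yields a retraction distance of 8.4 pixels.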
Further, if the first retraction distance is smaller than the shortest side length, the position of the label text can be updated according to the first retraction distance, and each pixel corresponding to the updated position of the label text is set as a first preset value; and if the first retraction distance is larger than or equal to the shortest side length, not updating the position of the marked text according to the first retraction distance, and directly setting each pixel corresponding to the position of the marked text in the first two-dimensional matrix as a first preset numerical value.
Furthermore, when the position of the labeled text is updated according to the first retraction distance, the corner point coordinates of the circumscribed rectangular frame of the labeled text region can be sorted clockwise, and the coordinates of the center point of the circumscribed rectangular frame determined; the two adjacent sides of the circumscribed rectangular frame are translated inwards by the first retraction distance relative to the center point coordinates; both ends of each side of the circumscribed rectangular frame are retracted inwards by the first retraction distance, and the new endpoint coordinates corresponding to each side and the new intersection point coordinates of each pair of adjacent sides after retraction are determined. If the new intersection point coordinates are consistent with the corresponding new endpoint coordinates, the area enclosed by the contracted circumscribed rectangular frame is determined as the updated position of the labeled text; otherwise, the original coordinates of the circumscribed rectangular frame are restored and the contraction of the labeled text is performed again, until the contraction of the labeled text by the first retraction distance is completed.
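A simplified sketch of the inward contraction for an axis-aligned circumscribed rectangle. The corner sorting and intersection-consistency check of the general procedure are omitted here; the collapse check stands in for the fallback to the original coordinates described above.

```python
def shrink_rect(rect, d):
    """Move each side of an axis-aligned rectangle (x0, y0, x1, y1)
    inward by d pixels. Return None when the shrink would collapse the
    rectangle (opposite sides would cross), which corresponds to keeping
    the un-shrunk position in the text's fallback case."""
    x0, y0, x1, y1 = rect
    if x1 - x0 <= 2 * d or y1 - y0 <= 2 * d:
        return None
    return (x0 + d, y0 + d, x1 - d, y1 - d)
```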
Further, fig. 6 is a flowchart illustrating a flow of a step of determining an offset label according to a second embodiment of the present invention, as shown in fig. 6, which specifically includes the following steps:
S601, for each image sample, constructing a second two-dimensional matrix corresponding to the image sample according to the size of the image sample, and determining the shortest side length of the labeled text in the image sample.
Specifically, for each input image sample, a two-dimensional matrix whose size corresponds to the number of pixels in the image sample is generated and determined as the second two-dimensional matrix, and the corresponding shortest side length is determined according to the number of pixels occupied by the labeled text in the image sample. Optionally, the constructed second two-dimensional matrix may be entirely filled with 0, or with another preset initial value, which is not limited in the embodiment of the present invention.
It can be understood that the construction of the second two-dimensional matrix in S601 is the same as that of the first two-dimensional matrix in S501, and the determination of the shortest side length of the labeled text is also the same, which will not be described in detail in the embodiment of the present invention.
And S602, if the shortest side length is less than or equal to the preset minimum frame length, setting each pixel corresponding to the position of the marked text in the second two-dimensional matrix as a first preset numerical value.
In this embodiment, the preset minimum frame length may be specifically understood as a preset frame length value used for determining whether the markup text to be recognized needs to be shrunk.
Specifically, when the shortest side length is less than or equal to the preset minimum frame length, the labeled text to be recognized is considered small; if it were contracted, the position occupied by the contracted text would be too small and difficult to recognize. In this case, each pixel corresponding to the position of the labeled text in the second two-dimensional matrix is directly set to the first preset numerical value, and the pixel range corresponding to the position of the labeled text does not need to be updated. Optionally, the first preset numerical value may be 1, or another preset value, which is not limited in the embodiment of the present invention.
S603, if the shortest side length is greater than the preset minimum frame length, determining a second retraction distance according to the size of the labeled text, and updating and assigning the second two-dimensional matrix according to the second retraction distance and the shortest side length.
Specifically, if the shortest side length is greater than the preset minimum frame length, the labeled text to be recognized is considered large and needs to be contracted to avoid adhesion with other texts to be recognized. At this time, the second retraction distance by which it needs to be contracted is determined according to the area and the perimeter of the labeled text and the preset retraction degree parameter, and the position information corresponding to the labeled text is updated according to the second retraction distance. Meanwhile, the offset of the labeled text during the contraction process is determined according to the second retraction distance and a preset reference value, whether the labeled text is contracted is determined according to the calculated second retraction distance and the shortest side length, and each pixel corresponding to the position of the contracted labeled text is then assigned the determined offset.
It should be clear that the calculation method of the second retraction distance is the same as that of the first retraction distance, and will not be described in detail in the embodiment of the present invention.
Further, if the second retraction distance is smaller than the shortest side length, determining an offset strength value according to the second retraction distance and a preset reference value, updating the position of the marked text according to the second retraction distance, and setting each pixel corresponding to the updated position of the marked text as the offset strength value; and if the second retraction distance is larger than or equal to the shortest side length, setting each pixel corresponding to the position of the marked text in the second two-dimensional matrix as a first preset numerical value.
In this embodiment, the preset reference value may be specifically understood as a preset value that can be adjusted by verifying the contraction effect according to the processed image visualization result. The offset strength value is specifically understood to be a value representing the degree of deviation of the current contraction from the original size of the labeled text.
Specifically, when the second retraction distance is smaller than the shortest side length, the labeled text can still be recognized after contraction, and a certain offset is generated by the contraction. At this time, the offset strength value of this contraction is determined according to the second retraction distance and the preset reference value, and the position information corresponding to the labeled text is updated according to the second retraction distance, that is, each side of the original labeled text is offset inwards by the second retraction distance and the result is used as the updated position information of the labeled text; meanwhile, each pixel corresponding to the updated position of the labeled text is set to the determined offset strength value. When the second retraction distance is greater than or equal to the shortest side length, it may be considered that, after being contracted by the second retraction distance, the labeled text would no longer exist or could not be recognized; the contraction is therefore abandoned, and each pixel corresponding to the position of the labeled text in the second two-dimensional matrix is set to the first preset numerical value.
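The assignment of the second two-dimensional matrix can be sketched as follows, assuming an axis-aligned labeled text box and a hypothetical reference value; the ratio d2 / ref stands in for the offset strength value derived from the second retraction distance and the preset reference value.

```python
import numpy as np

def offset_label(h, w, text_box, d2, shortest_side, ref=10.0):
    """Build the second 2-D matrix of size (h, w). If the second
    retraction distance d2 is smaller than the shortest side length,
    fill the shrunk text region with the offset strength d2 / ref
    (ref is a hypothetical preset reference value); otherwise fill the
    original region with the first preset numerical value 1."""
    m = np.zeros((h, w), dtype=np.float32)
    x0, y0, x1, y1 = text_box
    d = int(round(d2))
    if d2 < shortest_side:
        m[y0 + d:y1 - d, x0 + d:x1 - d] = d2 / ref
    else:
        m[y0:y1, x0:x1] = 1.0
    return m
```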
S302, inputting the basic feature sample set into the initial semantic segmentation sub-model, and extracting a semantic segmentation intermediate result.
In this embodiment, the initial semantic segmentation submodel may be specifically understood as an untrained semantic segmentation submodel whose neural network layer composition is completely consistent with that of the semantic segmentation submodel; it may consist of two 3 × 3 convolution blocks, each composed of a convolution layer, a batch normalization layer and an activation function layer, followed by one 1 × 1 convolution layer and one sigmoid activation function, but the weight parameters of each neural network layer have not yet been adjusted. The semantic segmentation intermediate result may be specifically understood as the intermediate result output after the untrained initial semantic segmentation submodel performs semantic segmentation on the input basic feature sample set.
Specifically, the basic feature sample set is input into an initial semantic segmentation submodel for training, and a plurality of different semantic segmentation intermediate results obtained by performing semantic segmentation on each basic feature sample in the basic feature sample set by the initial semantic segmentation submodel can be extracted in the training process.
And S303, inputting the basic feature sample set into the initial offset regression sub-model, and extracting an offset intermediate result.
In this embodiment, the initial offset regression submodel may be specifically understood as an untrained offset regression submodel whose neural network layer composition is completely consistent with that of the offset regression submodel; it may likewise consist of two 3 × 3 convolution blocks, each composed of a convolution layer, a batch normalization layer and an activation function layer, followed by one 1 × 1 convolution layer and one sigmoid activation function, but the weight parameters of each neural network layer have not yet been adjusted. The offset intermediate result may be specifically understood as the intermediate result output after the untrained initial offset regression submodel performs offset regression on the input basic feature sample set.
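The head structure shared by both submodels, as described above, can be sketched in PyTorch. Channel widths and the ReLU activation inside the conv blocks are assumptions; the text fixes only the two 3 × 3 conv blocks, the 1 × 1 convolution, and the sigmoid.

```python
import torch
import torch.nn as nn

def make_head(c_in, c_mid):
    """Prediction head used by both the semantic segmentation and the
    offset regression submodels, per the text: two 3x3 conv blocks
    (conv + batch norm + activation), one 1x1 convolution, one sigmoid.
    Channel widths are illustrative assumptions."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 3, padding=1),
        nn.BatchNorm2d(c_mid),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_mid, 3, padding=1),
        nn.BatchNorm2d(c_mid),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, 1, 1),
        nn.Sigmoid())
```

The sigmoid keeps every output pixel in [0, 1], suitable both as a segmentation probability and as a normalized offset strength.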
Specifically, the basic feature sample set is input into the initial offset regression submodel for training, and a plurality of different offset intermediate results, obtained after the initial offset regression submodel performs offset regression on each basic feature sample in the basic feature sample set, can be extracted in the training process.
S304, determining a corresponding first loss function according to the semantic segmentation intermediate result and the corresponding contracted segmentation label.
Specifically, each pixel value in the semantic segmentation intermediate result is compared with each pixel value in the corresponding contracted segmentation label, and then a first loss function corresponding to the pixel value is determined. Alternatively, the first loss function may be a smooth L1 loss function.
Further, determining a corresponding first loss function according to the semantic segmentation intermediate result and the corresponding contracted segmentation label, specifically comprising the following steps:
S3041, comparing the value corresponding to each pixel in the semantic segmentation intermediate result with the value corresponding to each pixel in the corresponding contracted segmentation label.
S3042, determining a first loss function according to the comparison result.
S305, determining a corresponding second loss function according to the offset intermediate result and the corresponding offset label.
Specifically, each pixel value in the offset intermediate result is compared with each pixel value in the corresponding offset label, and then a second loss function corresponding to the pixel value is determined. Optionally, the second loss function may be a dice loss function.
Further, determining a corresponding second loss function according to the offset intermediate result and the corresponding offset label, specifically comprising the following steps:
S3051, comparing the value corresponding to each pixel in the offset intermediate result with the value corresponding to each pixel in the corresponding offset label.
S3052, determining a second loss function according to the comparison result.
S306, determining a total loss function according to the first loss function and the second loss function, and training the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met to obtain the contracted offset text detection model.
In this embodiment, the predetermined convergence condition may be specifically understood as a condition for determining whether the trained initial network model enters a convergence state. Optionally, the preset convergence condition may include that a change of a weight parameter between two iterations of model training is smaller than a preset parameter change threshold, or the iteration exceeds a set maximum iteration number, or all training of the training samples of the shrinkage offset text is completed, and the embodiment of the present invention does not limit this.
Specifically, different weight parameters are set according to the different importance of the first loss function and the second loss function in the training process of the contracted offset text detection model, and the total loss function is then constructed jointly from the first loss function and the second loss function according to these weight parameters. Back propagation is carried out on the initial semantic segmentation submodel and the initial offset regression submodel according to the total loss function, so that the weight parameters of each neural network layer constituting the two submodels can be adjusted until the preset convergence condition is met, and the trained initial semantic segmentation submodel and initial offset regression submodel together constitute the contracted offset text detection model.
Further, a total loss function is determined according to the first loss function and the second loss function, the initial semantic segmentation sub-model and the initial offset regression sub-model are trained based on the total loss function until a preset convergence condition is met, and the contracted offset text detection model is obtained, and the method specifically comprises the following steps:
S3061, weighting and summing the first loss function and the second loss function according to a preset weight value, and determining a total loss function.
For example, the total loss function L_total can be represented by the following formula:
L_total = L_seg + λ × L_offset
where L_seg is the first loss function, L_offset is the second loss function, and λ is the weight of the weighted sum. Optionally, λ may be set to 10 in the embodiment of the present invention, and the weight may be adjusted according to experimental results, which the embodiment of the present invention does not limit.
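The weighted sum of S3061 is simple enough to sketch directly. A minimal illustration in Python (the function name is illustrative; the default λ = 10 comes from the embodiment and is tunable):

```python
def total_loss(l_seg, l_offset, lam=10.0):
    """Total loss as in the embodiment: L_total = L_seg + lambda * L_offset.

    lam defaults to the value 10 suggested above and may be adjusted
    according to experimental results.
    """
    return l_seg + lam * l_offset
```

In a real training loop the two terms would be tensors, and the summed loss would be back-propagated through both sub-models.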
S3062, adjusting weight parameters in the initial semantic segmentation sub-model and the initial offset regression sub-model based on a total loss function until preset convergence conditions are met, and obtaining the retracted offset text detection model.
S203, thresholding is carried out on the target semantic segmentation feature map, and a labeled map is determined according to a connected component labeling algorithm.
In this embodiment, the connected component labeling algorithm may be specifically understood as an algorithm that scans each pixel in a binary image, groups pixels that have the same pixel value and are connected to each other into the same component, and finally obtains all connected pixel components in the image.
Specifically, the value at each point of the target semantic segmentation feature map is thresholded: values exceeding a preset threshold are replaced by 1 and all others by 0, yielding a binary map containing only the elements 0 and 1. The binary map is then processed by the connected component labeling algorithm, which identifies components whose pixels share the same value and are connected, marks the pixels belonging to the same component with the same label value, and assigns different label values to different components. Pixels sharing a label value form one labeled object, and the result is a label map containing one or more labeled objects.
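The thresholding and labeling of S203 can be sketched in pure Python as follows (a stand-in for library routines such as cv2.connectedComponents or scipy.ndimage.label; the threshold value 0.5 is illustrative):

```python
from collections import deque

def threshold_and_label(score_map, thr=0.5):
    """Binarize a 2-D score map (list of lists), then run 4-connected
    component labeling by breadth-first search.

    Returns (binary, labels): foreground components receive label values
    1..K, background stays 0.
    """
    h, w = len(score_map), len(score_map[0])
    binary = [[1 if score_map[i][j] > thr else 0 for j in range(w)]
              for i in range(h)]
    labels = [[0] * w for _ in range(h)]
    current = 0
    for i in range(h):
        for j in range(w):
            if binary[i][j] == 1 and labels[i][j] == 0:
                current += 1          # start a new component
                labels[i][j] = current
                q = deque([(i, j)])
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = current
                            q.append((ny, nx))
    return binary, labels
```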
And S204, intersecting the mark graph with the target offset characteristic graph, and updating the target offset characteristic graph.
Specifically, since the target offset feature map may contain offset values at positions where no labeled text exists, the label map is intersected with the target offset feature map to remove the offsets at positions not covered by any labeled object, thereby updating the target offset feature map.
It should be clear that the update of the target offset feature map may be completed by intersecting the label map with the target offset feature map, or by intersecting the thresholded binary map with the target offset feature map.
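The intersection of S204 amounts to masking the offset feature map with the label map; a minimal sketch (pure-Python lists stand in for feature-map arrays):

```python
def mask_offset_map(labels, offset_map):
    """Intersect the label map with the offset feature map: offsets at
    positions with no labeled object (label 0) are zeroed out."""
    return [[off if lab > 0 else 0.0
             for lab, off in zip(lab_row, off_row)]
            for lab_row, off_row in zip(labels, offset_map)]
```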
S205, traversing different mark values in the intersected mark map, and determining the circumscribed rectangular frame of the mark object corresponding to the same mark value as the rectangular frame to be externally expanded.
Specifically, the different label values in the intersected label map are traversed, pixels sharing the same label value are determined to be one labeled object, and the circumscribed rectangular frame of each labeled object is determined as a rectangular frame to be expanded; this frame must be expanded outward to reach the outer boundary of the text to be recognized. That is, the rectangular frame to be expanded may be an inwardly contracted frame of the text to be recognized, or a frame that was not contracted, depending on the preset minimum frame length used during training.
Optionally, the circumscribed rotated rectangular frame of a labeled object may be computed through the minAreaRect function provided by OpenCV; this rotated rectangle is the rectangular frame to be expanded in the embodiment of the present invention.
Further, after determining the circumscribed rectangular frame of the marked object corresponding to the same mark value as the rectangular frame to be externally expanded according to different mark values in the traversed and intersected mark map, the method further comprises the following steps:
determining the confidence coefficient of the marking object corresponding to the rectangular frame to be externally expanded according to the average value of the pixel values of the marking object in the target semantic segmentation characteristic image; and if the confidence coefficient is smaller than a preset confidence coefficient threshold value, deleting the rectangular frame to be expanded.
Specifically, because the target semantic segmentation feature map output by the semantic segmentation sub-model may contain errors, the labeled objects determined from the connected components after thresholding that map may also be erroneous; the confidence check above therefore serves to filter out such unreliable rectangular frames before expansion.
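The confidence filtering described above can be sketched as follows (the threshold 0.7 is an illustrative value, not one given in the embodiment):

```python
def filter_by_confidence(labels, seg_map, conf_thr=0.7):
    """Confidence of each labeled object = mean of its pixel values in
    the semantic segmentation feature map; objects whose confidence is
    below conf_thr are dropped. Returns the surviving label values."""
    sums, counts = {}, {}
    for lab_row, seg_row in zip(labels, seg_map):
        for lab, score in zip(lab_row, seg_row):
            if lab > 0:
                sums[lab] = sums.get(lab, 0.0) + score
                counts[lab] = counts.get(lab, 0) + 1
    return [lab for lab in sorted(sums)
            if sums[lab] / counts[lab] >= conf_thr]
```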
And S206, taking the average value of the pixel values of each marking object in the updated target offset characteristic image as the predicted offset of the marking object.
Specifically, for each marked object, an average value of the pixels corresponding to the marked object in the updated target offset characteristic map is determined, and the average value is determined as the predicted offset corresponding to the marked object.
And S207, determining the target outward expansion distance according to the predicted offset and a preset reference value.
Specifically, the product of the predicted offset and a preset reference value is determined as the target extension distance of the rectangular frame to be extended corresponding to the predicted offset.
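Steps S206 and S207, per-object mean offset followed by multiplication with the preset reference value, can be sketched together (names are illustrative):

```python
def expansion_distances(labels, offset_map, reference=1.0):
    """For each labeled object: mean of its pixels in the updated offset
    feature map = predicted offset (S206); predicted offset multiplied
    by the preset reference value = target expansion distance (S207)."""
    sums, counts = {}, {}
    for lab_row, off_row in zip(labels, offset_map):
        for lab, off in zip(lab_row, off_row):
            if lab > 0:
                sums[lab] = sums.get(lab, 0.0) + off
                counts[lab] = counts.get(lab, 0) + 1
    return {lab: (sums[lab] / counts[lab]) * reference for lab in sums}
```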
S208, the coordinates of the corner points of the rectangular frame to be expanded are sorted clockwise, and the coordinate of the center point of the rectangular frame to be expanded is determined.
Specifically, for each rectangular frame to be expanded, it is determined whether its four corner points are sorted in clockwise order; if not, they are reordered clockwise to facilitate the subsequent expansion operation. Meanwhile, the first and third corner points of the clockwise-ordered corners are connected, as are the second and fourth; the intersection of these two lines is the center point of the rectangular frame to be expanded, and its coordinates can be determined from the coordinates of the four corner points.
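A minimal sketch of S208's corner ordering and center computation. Sorting by polar angle around the centroid is one common way to obtain a clockwise order in screen coordinates (y axis pointing down); the embodiment does not prescribe a specific method:

```python
import math

def sort_clockwise(corners):
    """Order four corner points clockwise (screen coordinates, y down)
    around their centroid; return the ordered corners and the center.

    The centroid equals the diagonals' intersection for a rectangle."""
    cx = sum(p[0] for p in corners) / 4.0
    cy = sum(p[1] for p in corners) / 4.0
    # With y pointing down, increasing atan2 angle sweeps clockwise on screen.
    ordered = sorted(corners, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    return ordered, (cx, cy)
```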
S209, translating the adjacent edges of the rectangular frames to be subjected to external expansion outwards by the target external expansion distance relative to the center point coordinate.
Specifically, each of the adjacent sides of the rectangular frame to be expanded is translated outward by the target expansion distance along the direction perpendicular to that side, away from the center point.
For example, assuming that the center point of the rectangular frame to be expanded is C and the target expansion distance is D, each edge of the frame is shifted outward by the distance D along its normal, away from C.
S210, extending the target outward-expanding distance outwards from two ends of each side in the rectangular frame to be outward-expanded, and determining new endpoint coordinates corresponding to each side.
Specifically, after each side of the rectangular frame to be subjected to outward expansion is subjected to outward expansion relative to the central point, the end points at the two ends of each side extend the target outward expansion distance to the two ends respectively, and two new end point coordinates corresponding to each side are determined.
S211, determining the intersection point of each side of the rectangular frame to be subjected to external expansion as a new intersection point coordinate.
Specifically, after each side in the rectangular frame to be expanded is expanded at two ends, the adjacent expanded sides are intersected, and the intersection point of the two adjacent sides is determined as a new intersection point coordinate.
S212, judging whether each new intersection point coordinate is consistent with the corresponding new endpoint coordinate, if so, executing a step S213; if not, go to step S214.
Specifically, whether each new intersection point coordinate is consistent with the corresponding new endpoint coordinate is judged, if yes, each edge after external expansion can be considered to form a complete rectangular frame, and the external expansion is successful, and then step S213 is executed; otherwise, it is considered that the edges do not intersect after the external expansion, or the intersection is not located at the end point of each edge, and the external expansion fails, and then step S214 is executed.
S213, determining the rectangle frame to be subjected to external expansion as a target text detection frame.
And S214, returning the original coordinates corresponding to the rectangular frame to be expanded.
Specifically, after confirming that the expansion of the rectangular frame failed, each side of the rectangular frame to be expanded is restored to its original coordinates, after which an error may be reported, or step S208 may be executed again until a target text detection frame corresponding to the rectangular frame is determined.
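Steps S209 to S212 can be sketched geometrically: shift each edge outward along its normal by the target expansion distance, then re-intersect adjacent edges. For a true rectangle, the re-intersection points coincide with the shifted edges' endpoints extended by the same distance, which is exactly the S213 consistency check. A sketch assuming clockwise corners in screen coordinates (function names are illustrative):

```python
import math

def _intersect(p1, d1, p2, d2):
    """Intersection of two infinite lines given as point + direction."""
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def expand_rect(corners, dist):
    """Expand a clockwise-ordered rectangle (screen coords, y down)
    outward by dist: translate every edge by dist along its outward
    normal (S209), then intersect adjacent shifted edges to obtain the
    new corners (S210-S212)."""
    n = len(corners)
    shifted = []
    for i in range(n):
        ax, ay = corners[i]
        bx, by = corners[(i + 1) % n]
        length = math.hypot(bx - ax, by - ay)
        ux, uy = (bx - ax) / length, (by - ay) / length
        nx, ny = uy, -ux  # outward normal for clockwise order, y-down coords
        shifted.append(((ax + dist * nx, ay + dist * ny), (ux, uy)))
    # corner i is the intersection of shifted edges i-1 and i
    return [_intersect(shifted[i - 1][0], shifted[i - 1][1],
                       shifted[i][0], shifted[i][1]) for i in range(n)]
```

For example, expanding the unit-layout square (0,0), (2,0), (2,2), (0,2) by 1 yields the square from (-1,-1) to (3,3).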
According to the technical scheme of this embodiment, for each image sample in the contracted offset text training sample set, the contracted segmentation label and the offset label are determined according to the shortest side length of the annotated text and the size of the image sample. The semantic segmentation sub-model and the offset regression sub-model are then trained with the image sample set, the contracted segmentation labels, and the offset labels, finally yielding the constructed contracted offset text detection model. A label map is determined by thresholding the target semantic segmentation feature map and applying connected component labeling; the rectangular frames to be expanded are obtained from the labeled objects sharing the same label value, and the predicted offset of each frame is obtained from the intersection of the label map and the target offset feature map. The target expansion distance is then determined from the predicted offset and a preset reference value, and each side of the rectangular frame to be expanded is translated and extended accordingly. Rectangular frames whose confidence does not reach the preset confidence threshold are deleted, and the detected target text detection frames are finally obtained. This fully considers the influence of the offset on the contracted frame of the recognized target text; at the same time, calculating the confidence of each rectangular frame to be expanded and deleting those with low confidence improves the precision of determining the target text detection frame, reduces the computation required to determine it, and improves text detection efficiency.
EXAMPLE III
Fig. 7 is a schematic structural diagram of a text detection apparatus according to a third embodiment of the present invention, where the text detection apparatus includes: an image acquisition module 71, a feature map determination module 72, an outward expansion rectangle determination module 73 and a detection frame determination module 74.
The image acquisition module 71 is configured to acquire an image to be detected; the feature map determining module 72 is configured to input the image to be detected to a pre-constructed contracted offset text detection model, and determine a target semantic segmentation feature map and a target offset feature map; an externally-expanding rectangle determining module 73, configured to determine a target externally-expanding distance and a rectangular frame to be externally expanded according to the target semantic segmentation feature map and the target offset feature map; a detection box determining module 74, configured to extend the target extension distance of the rectangular frame to be extended, and determine a target text detection box; the pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model.
According to the technical scheme of this embodiment, the problems of slow detection and poor results in dense text areas are solved, which arise because existing text detection models trained on contracted text instances ignore the offset introduced by the contraction process. By processing the same image to be detected with both semantic segmentation and offset regression, the influence of the offsets produced when texts of different adhesion degrees and sizes are contracted is fully considered, so that the determined target text detection box better matches the boundary of the text to be located, improving detection accuracy. Meanwhile, once the rectangular frame to be expanded is determined, only a single expansion by the determined target expansion distance is required, which reduces the computation needed to determine the target text detection box and improves text detection efficiency.
Further, the training step of the text detection model with the retracted offset comprises the following steps:
performing basic feature extraction on an image sample set in a training sample set of the contracted offset text to determine a basic feature sample set; the training sample set of the contracted offset text comprises an image sample set and a calibration sample set corresponding to the image sample set, wherein the calibration sample set comprises contracted segmentation labels and offset labels corresponding to all image samples;
inputting the basic characteristic sample set into an initial semantic segmentation sub-model, and extracting a semantic segmentation intermediate result;
inputting the basic feature sample set into an initial offset regression sub-model, and extracting an offset intermediate result;
determining a corresponding first loss function according to the semantic segmentation intermediate result and the corresponding contracted segmentation label;
determining a corresponding second loss function according to the offset intermediate result and the corresponding offset label;
and determining a total loss function according to the first loss function and the second loss function, and training the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met to obtain the contracted offset text detection model.
Further, performing basic feature extraction on an image sample set in the training sample set of the contracted offset text, and determining a basic feature sample set, including:
inputting an image sample set in a training sample set of the contracted offset text into a feature extraction backbone network, and determining a first feature image set; the first feature map set comprises a plurality of feature maps with different resolutions extracted by an image sample set;
performing multi-scale feature extraction on the first feature atlas, and determining a second feature atlas;
and performing multi-feature fusion on the second feature map set, and determining the fused set of each feature map as a basic feature sample set.
Further, the step of determining the segmentation labels comprises:
aiming at each image sample, constructing a first two-dimensional matrix corresponding to the image sample according to the size of the image sample, and determining the shortest side length of the image sample;
if the shortest side length is less than or equal to the preset minimum frame length, setting each pixel corresponding to the position of the marked text in the first two-dimensional matrix as a first preset numerical value;
and if the shortest side length is greater than the preset minimum frame length, determining a first retraction distance according to the size of the image sample, updating the position of the marked text according to the first retraction distance, and setting each pixel corresponding to the updated position of the marked text as a first preset value.
Further, the step of determining the offset label comprises:
for each image sample, constructing a second two-dimensional matrix corresponding to the image sample according to the size of the image sample, and determining the shortest side length of the image sample;
if the shortest side length is less than or equal to the preset minimum frame length, setting each pixel corresponding to the position of the marked text in the second two-dimensional matrix as a first preset numerical value;
and if the shortest side length is greater than the preset minimum frame length, determining a second shrinkage distance according to the size of the image sample, and updating and assigning the second two-dimensional matrix according to the second shrinkage distance and the shortest side length.
Further, updating and assigning the second two-dimensional matrix according to the second retraction distance and the shortest side length, and the method comprises the following steps:
if the second retraction distance is smaller than the shortest side length, determining an offset strength value according to the second retraction distance and a preset reference value, updating the position of the marked text according to the second retraction distance, and setting each pixel corresponding to the updated position of the marked text as the offset strength value;
and if the second retraction distance is larger than or equal to the shortest side length, setting each pixel corresponding to the position of the marked text in the second two-dimensional matrix as a first preset numerical value.
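The branch logic for assigning the offset label value can be sketched as below. Note one assumption: the text says the offset strength value is determined "according to the second retraction distance and a preset reference value" without specifying the operation; division is assumed here, since at inference the predicted offset is multiplied by the same reference value (S207) to recover a distance. All names and defaults are illustrative:

```python
def offset_value(shortest_side, shrink_dist, min_frame_len, reference,
                 first_val=1.0):
    """Value written at the (possibly shrunk) annotated-text position of
    the offset label's two-dimensional matrix."""
    if shortest_side <= min_frame_len:
        return first_val               # too small to shrink: first preset value
    if shrink_dist < shortest_side:
        # ASSUMPTION: offset strength = shrink distance / reference value
        return shrink_dist / reference
    return first_val                   # shrink would swallow the box entirely
```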
Further, determining a corresponding first loss function according to the semantic segmentation intermediate result and the corresponding contracted segmentation label, including:
comparing the corresponding numerical value of each pixel in the semantic segmentation intermediate result with the corresponding numerical value of each pixel in the corresponding contracted segmentation label;
and determining a first loss function according to the comparison result.
Further, determining a corresponding second loss function according to the offset intermediate result and the corresponding offset label, including:
comparing the value corresponding to each pixel in the offset intermediate result with the value corresponding to each pixel in the corresponding offset label;
and determining a second loss function according to the comparison result.
Further, determining a total loss function according to the first loss function and the second loss function, training the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met, and obtaining an invaginated offset text detection model, wherein the method comprises the following steps:
weighting and summing the first loss function and the second loss function according to a preset weight value to determine a total loss function;
and adjusting the weight parameters in the initial semantic segmentation submodel and the initial offset regression submodel based on the total loss function until a preset convergence condition is met to obtain the contracted offset text detection model.
Optionally, the outward rectangle determining module 73 includes:
the label map determining unit is used for thresholding the target semantic segmentation feature map and determining a label map according to a connected component labeling algorithm;
the offset map updating unit is used for intersecting the label map and the target offset characteristic map and updating the target offset characteristic map;
the rectangular frame determining unit is used for traversing different mark values in the intersected mark image and determining an external rectangular frame of the mark object corresponding to the same mark value as a rectangular frame to be externally expanded;
the offset determining unit is used for determining the average value of the pixel values of each marking object in the updated target offset characteristic image as the predicted offset of the marking object;
and the external expansion distance determining unit is used for determining the target external expansion distance according to the predicted offset and a preset reference value.
Optionally, the text detection apparatus further includes:
the confidence coefficient determining module is used for determining the confidence coefficient of the marked object corresponding to the rectangular frame to be externally expanded according to the average value of the pixel values of the marked object in the target semantic segmentation characteristic image; and if the confidence coefficient is smaller than a preset confidence coefficient threshold value, deleting the rectangular frame to be expanded.
Optionally, the detection frame determining module 74 includes:
the central point coordinate determining unit is used for sequencing the corner point coordinates of the rectangular frames to be subjected to external expansion in a clockwise manner and determining the central point coordinate of the rectangular frames to be subjected to external expansion;
the edge translation unit is used for translating the adjacent edges of the rectangular frames to be subjected to external expansion outwards by the target external expansion distance relative to the center point coordinate;
the end point coordinate determining unit is used for extending the target outward-expanding distance from the two ends of each side in the rectangular frame to be outward-expanded and determining new end point coordinates corresponding to each side;
the intersection point coordinate determination unit is used for determining the intersection point of each side of the rectangular frame to be subjected to external expansion as a new intersection point coordinate;
the detection frame determining unit is used for determining the rectangular frame to be subjected to external expansion after the external expansion as a target text detection frame if each new intersection point coordinate is consistent with the corresponding new endpoint coordinate; otherwise, returning the original coordinates corresponding to the rectangular frame to be expanded.
The text detection device provided by the embodiment of the invention can execute the text detection method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 8 is a schematic structural diagram of a text detection device according to a fourth embodiment of the present invention. The text detection device 80 may be an electronic device intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the text detection device 80 includes at least one processor 81 and a memory communicatively connected to the at least one processor 81, such as a read-only memory (ROM) 82 and a random access memory (RAM) 83. The memory stores a computer program executable by the at least one processor, and the processor 81 can perform various appropriate actions and processes according to the computer program stored in the ROM 82 or loaded from the storage unit 88 into the RAM 83. The RAM 83 can also store various programs and data necessary for the operation of the text detection device 80. The processor 81, the ROM 82, and the RAM 83 are connected to each other by a bus 84. An input/output (I/O) interface 85 is also connected to the bus 84.
A plurality of components in the text detection device 80 are connected to the I/O interface 85, including: an input unit 86 such as a keyboard, a mouse, and the like; an output unit 87 such as various types of displays, speakers, and the like; a storage unit 88 such as a magnetic disk, optical disk, or the like; and a communication unit 89 such as a network card, modem, wireless communication transceiver, etc. The communication unit 89 allows the text detection device 80 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 81 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 81 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 81 performs the various methods and processes described above, such as a text detection method.
In some embodiments, the text detection method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 88. In some embodiments, part or all of the computer program may be loaded and/or installed onto the text detection device 80 via the ROM 82 and/or the communication unit 89. When the computer program is loaded into RAM 83 and executed by processor 81, one or more steps of the text detection method described above may be performed. Alternatively, in other embodiments, the processor 81 may be configured to perform the text detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order; this is not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A text detection method, comprising:
acquiring an image to be detected;
inputting the image to be detected into a pre-constructed contracted offset text detection model, and determining a target semantic segmentation feature map and a target offset feature map;
determining a target expansion distance and a rectangular frame to be expanded according to the target semantic segmentation feature map and the target offset feature map;
expanding the rectangular frame to be expanded outward by the target expansion distance to determine a target text detection frame;
the pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model.
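As a rough illustration (not part of the claims), the claim-1 flow can be sketched with the model and post-processing stubbed out as hypothetical callables, since the claim fixes only their inputs and outputs:

```python
def detect_text(image, model, postprocess):
    """Claim-1 flow: the contracted offset text detection model yields a
    semantic segmentation feature map and an offset feature map;
    post-processing yields rectangles to be expanded plus a per-rectangle
    target expansion distance; expanding each rectangle outward gives the
    target text detection frames."""
    seg_map, offset_map = model(image)
    boxes, distances = postprocess(seg_map, offset_map)
    return [(x0 - d, y0 - d, x1 + d, y1 + d)
            for (x0, y0, x1, y1), d in zip(boxes, distances)]

# Stub components standing in for the trained sub-models (illustrative only).
stub_model = lambda img: ([[0.0]], [[0.0]])
stub_post = lambda seg, off: ([(2, 2, 5, 5)], [1.0])
print(detect_text(None, stub_model, stub_post))  # [(1.0, 1.0, 6.0, 6.0)]
```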
2. The method of claim 1, wherein the training step of the contracted offset text detection model comprises:
performing basic feature extraction on an image sample set in a training sample set of the contracted offset text to determine a basic feature sample set; the training sample set of the contracted offset text comprises an image sample set and a calibration sample set corresponding to the image sample set, wherein the calibration sample set comprises contracted segmentation labels and offset labels corresponding to the image samples;
inputting the basic feature sample set into an initial semantic segmentation sub-model, and extracting a semantic segmentation intermediate result;
inputting the basic feature sample set into an initial offset regression sub-model, and extracting an offset intermediate result;
determining a corresponding first loss function according to the semantic segmentation intermediate result and the corresponding contracted segmentation label;
determining a corresponding second loss function according to the offset intermediate result and the corresponding offset label;
and determining a total loss function according to the first loss function and the second loss function, and training the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met, to obtain the contracted offset text detection model.
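Claims 2 and 9 together describe joint training against a weighted sum of a segmentation loss and an offset-regression loss. A minimal numeric sketch, where binary cross-entropy and L1 are illustrative choices — the claims leave the concrete loss forms and weights open:

```python
import math

def total_loss(seg_pred, seg_label, off_pred, off_label, w_seg=1.0, w_off=0.5):
    """First loss: compare the semantic segmentation intermediate result with
    the contracted segmentation label (binary cross-entropy here).
    Second loss: compare the offset intermediate result with the offset label
    (L1 here). Total loss: weighted sum per preset weight values."""
    eps = 1e-7
    first = 0.0
    for p, y in zip(seg_pred, seg_label):
        p = min(max(p, eps), 1.0 - eps)
        first -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    first /= len(seg_pred)
    second = sum(abs(p - y) for p, y in zip(off_pred, off_label)) / len(off_pred)
    return w_seg * first + w_off * second
```

In a full training loop this scalar would back-propagate through both sub-models until the preset convergence condition is met.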
3. The method of claim 2, wherein performing basic feature extraction on the image sample set in the contracted offset text training sample set to determine a basic feature sample set comprises:
inputting the image sample set in the training sample set of the contracted offset text into a feature extraction backbone network, and determining a first feature atlas; wherein the first feature map set comprises a plurality of feature maps of different resolutions extracted from the image sample set;
performing multi-scale feature extraction on the first feature atlas, and determining a second feature atlas;
and performing multi-feature fusion on the second feature map set, and determining a set of the fused feature maps as a basic feature sample set.
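The claim-3 steps can be sketched with plain nested lists; nearest-neighbour upsampling to a common resolution and channel stacking are assumptions standing in for the multi-scale extraction and multi-feature fusion, whose concrete forms the claim leaves open:

```python
def upsample_nearest(fm, out_h, out_w):
    """Nearest-neighbour upsampling of a 2D feature map (list of lists)."""
    h, w = len(fm), len(fm[0])
    return [[fm[y * h // out_h][x * w // out_w] for x in range(out_w)]
            for y in range(out_h)]

def fuse_features(feature_maps, out_h, out_w):
    """Claim-3 sketch: the backbone yields feature maps of different
    resolutions (the first feature set); bringing them to a common size
    stands in for the multi-scale step, and stacking them as channels is a
    simple form of multi-feature fusion, yielding a basic feature sample."""
    return [upsample_nearest(fm, out_h, out_w) for fm in feature_maps]
```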
4. The method of claim 2, wherein the step of determining the contracted segmentation labels comprises:
for each image sample, constructing a first two-dimensional matrix corresponding to the image sample according to the size of the image sample, and determining the shortest side length of the annotated text in the image sample;
if the shortest side length is less than or equal to a preset minimum frame length, setting each pixel corresponding to the position of the annotated text in the first two-dimensional matrix to a first preset numerical value;
and if the shortest side length is greater than the preset minimum frame length, determining a first retraction distance according to the size of the annotated text, updating the position of the annotated text according to the first retraction distance, and setting each pixel corresponding to the updated position of the annotated text to the first preset numerical value.
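A sketch of the claim-4 label construction for an axis-aligned annotated-text box; the retraction formula (a fixed ratio of the shortest side) and all default values are assumptions — PSENet/DB-style pipelines derive the shrink distance from the polygon area and perimeter instead:

```python
def shrink_seg_label(img_h, img_w, box, min_side=8, shrink_ratio=0.25, fill=1):
    """Claim-4 sketch for an axis-aligned box (x0, y0, x1, y1): build the
    first two-dimensional matrix at image size; if the shortest side is at
    most the preset minimum frame length, fill the full text position with
    the first preset value; otherwise shrink the position inward by a first
    retraction distance before filling."""
    label = [[0] * img_w for _ in range(img_h)]
    x0, y0, x1, y1 = box
    shortest = min(x1 - x0, y1 - y0)
    d = 0 if shortest <= min_side else int(shortest * shrink_ratio)
    for y in range(y0 + d, y1 - d):
        for x in range(x0 + d, x1 - d):
            label[y][x] = fill
    return label
```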
5. The method of claim 2, wherein the step of determining the offset label comprises:
for each image sample, constructing a second two-dimensional matrix corresponding to the image sample according to the size of the image sample, and determining the shortest side length of an annotated text in the image sample;
if the shortest side length is less than or equal to a preset minimum frame length, setting each pixel corresponding to the position of the annotated text in the second two-dimensional matrix to a first preset numerical value;
and if the shortest side length is greater than the preset minimum frame length, determining a second retraction distance according to the size of the annotated text, and updating and assigning the second two-dimensional matrix according to the second retraction distance and the shortest side length.
6. The method of claim 5, wherein the updating and assigning the second two-dimensional matrix according to the second retraction distance and the shortest side length comprises:
if the second retraction distance is smaller than the shortest side length, determining an offset strength value according to the second retraction distance and a preset reference value, updating the position of the annotated text according to the second retraction distance, and setting each pixel corresponding to the updated position of the annotated text to the offset strength value;
and if the second retraction distance is larger than or equal to the shortest side length, setting each pixel corresponding to the position of the annotated text in the second two-dimensional matrix to a first preset numerical value.
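Claims 5 and 6 mirror the segmentation-label construction, except that shrunk pixels carry an offset strength value rather than a constant. A sketch under the same axis-aligned-box assumption; the strength formula (retraction distance over a preset reference value), the ratio, and the defaults are illustrative:

```python
def offset_label(img_h, img_w, box, min_side=8, shrink_ratio=0.25,
                 reference=10.0, fill=1.0):
    """Claims 5-6 sketch: build the second two-dimensional matrix; small text
    is filled with the first preset value as-is, otherwise the shrunk region
    is filled with the offset strength value d / reference."""
    label = [[0.0] * img_w for _ in range(img_h)]
    x0, y0, x1, y1 = box
    shortest = min(x1 - x0, y1 - y0)
    if shortest <= min_side:
        d, value = 0, fill                  # text too small: fill as-is
    else:
        d = shortest * shrink_ratio         # second retraction distance
        if d < shortest:
            value = d / reference           # offset strength value
            d = int(d)
        else:                               # degenerate box: fill as-is
            d, value = 0, fill
    for y in range(y0 + d, y1 - d):
        for x in range(x0 + d, x1 - d):
            label[y][x] = value
    return label
```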
7. The method of claim 2, wherein determining the corresponding first loss function according to the semantic segmentation intermediate result and the corresponding contracted segmentation label comprises:
comparing the value corresponding to each pixel in the semantic segmentation intermediate result with the value corresponding to each pixel in the corresponding contracted segmentation label;
and determining a first loss function according to the comparison result.
8. The method of claim 2, wherein determining a corresponding second penalty function based on the offset intermediate result and a corresponding offset label comprises:
comparing the value corresponding to each pixel in the offset intermediate result with the value corresponding to each pixel in the corresponding offset label;
and determining a second loss function according to the comparison result.
9. The method of claim 2, wherein determining a total loss function according to the first loss function and the second loss function, and training the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met to obtain a contracted offset text detection model comprises:
weighting and summing the first loss function and the second loss function according to a preset weight value to determine a total loss function;
and adjusting the weight parameters in the initial semantic segmentation sub-model and the initial offset regression sub-model based on the total loss function until a preset convergence condition is met to obtain a contracted offset text detection model.
10. The method according to claim 1, wherein the determining a target expansion distance and a rectangular frame to be expanded according to the target semantic segmentation feature map and the target offset feature map comprises:
performing thresholding on the target semantic segmentation feature map, and determining a label map according to a connected component labeling algorithm;
intersecting the label map with the target offset feature map, and updating the target offset feature map;
traversing the different label values in the intersected label map, and determining the circumscribed rectangular frame of the marked objects corresponding to the same label value as a rectangular frame to be expanded;
determining the average of the pixel values of each marked object in the updated target offset feature map as the predicted offset of that marked object;
and determining the target expansion distance according to the predicted offset and a preset reference value.
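The claim-10 post-processing can be sketched with a simple 4-connected labeling pass; the threshold and reference values are assumptions, and a production system would typically use an optimized connected-components routine instead:

```python
from collections import deque

def label_components(binary):
    """4-connected component labeling, standing in for the claim-10
    connected component labeling algorithm."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] and labels[sy][sx] == 0:
                count += 1
                q = deque([(sy, sx)])
                labels[sy][sx] = count
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            q.append((ny, nx))
    return labels, count

def boxes_and_distances(seg_map, offset_map, seg_thresh=0.5, reference=10.0):
    """Claim-10 sketch: threshold the segmentation map, label components,
    take each component's circumscribed box as the rectangle to be expanded,
    and set its target expansion distance to the component's mean predicted
    offset times a preset reference value."""
    binary = [[v > seg_thresh for v in row] for row in seg_map]
    labels, n = label_components(binary)
    results = []
    for k in range(1, n + 1):
        pts = [(y, x) for y, row in enumerate(labels)
               for x, v in enumerate(row) if v == k]
        ys = [y for y, _ in pts]
        xs = [x for _, x in pts]
        box = (min(xs), min(ys), max(xs), max(ys))
        mean_offset = sum(offset_map[y][x] for y, x in pts) / len(pts)
        results.append((box, mean_offset * reference))
    return results
```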
11. The method of claim 10, further comprising:
determining the confidence of the marked object corresponding to the rectangular frame to be expanded according to the average of the pixel values of that marked object in the target semantic segmentation feature map;
and if the confidence is smaller than a preset confidence threshold, deleting the rectangular frame to be expanded.
12. The method according to claim 1, wherein expanding the rectangular frame to be expanded outward by the target expansion distance to determine a target text detection frame comprises:
sorting the corner point coordinates of the rectangular frame to be expanded clockwise, and determining the center point coordinate of the rectangular frame to be expanded;
translating the adjacent edges of the rectangular frame to be expanded outward, relative to the center point coordinate, by the target expansion distance;
extending both ends of each edge of the rectangular frame to be expanded outward by the target expansion distance, and determining the new endpoint coordinates corresponding to each edge;
determining the intersection points of the edges of the rectangular frame to be expanded as new intersection point coordinates;
and if the new intersection point coordinates coincide with the corresponding new endpoint coordinates, determining the expanded rectangular frame as the target text detection frame; otherwise, restoring the rectangular frame to be expanded to its original coordinates.
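For the axis-aligned special case, the claim-12 expansion reduces to pushing each corner away from the center by the target expansion distance. A sketch — the general quadrilateral case in the claim also extends edge endpoints and verifies that the recomputed edge intersections coincide with them, but for an axis-aligned rectangle that check always holds, so it is omitted here:

```python
def expand_rect(corners, d):
    """Claim-12 sketch (axis-aligned): given clockwise-sorted corners, find
    the center point and move each corner away from the center by d along
    both axes, equivalent to translating each adjacent edge outward by d."""
    cx = sum(x for x, _ in corners) / len(corners)
    cy = sum(y for _, y in corners) / len(corners)
    return [(x + d if x > cx else x - d, y + d if y > cy else y - d)
            for x, y in corners]
```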
13. A text detection apparatus, comprising:
the image acquisition module is used for acquiring an image to be detected;
the characteristic diagram determining module is used for inputting the image to be detected into a pre-constructed contracted offset text detection model and determining a target semantic segmentation characteristic diagram and a target offset characteristic diagram;
the expansion rectangle determining module is used for determining a target expansion distance and a rectangular frame to be expanded according to the target semantic segmentation feature map and the target offset feature map;
the detection frame determining module is used for expanding the rectangular frame to be expanded outward by the target expansion distance to determine a target text detection frame;
the pre-constructed contracted offset text detection model comprises a semantic segmentation sub-model and an offset regression sub-model.
14. A text detection device characterized by comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the text detection method of any one of claims 1-12.
15. A computer-readable storage medium having stored thereon computer instructions which, when executed, cause a processor to implement the text detection method of any one of claims 1-12.
CN202210429576.4A 2022-04-22 2022-04-22 Text detection method, device, equipment and storage medium Pending CN114926849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210429576.4A CN114926849A (en) 2022-04-22 2022-04-22 Text detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210429576.4A CN114926849A (en) 2022-04-22 2022-04-22 Text detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114926849A (en) 2022-08-19

Family

ID=82807654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210429576.4A Pending CN114926849A (en) 2022-04-22 2022-04-22 Text detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114926849A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546790A (en) * 2022-11-29 2022-12-30 深圳智能思创科技有限公司 Document layout segmentation method, device, equipment and storage medium
CN117274438A (en) * 2023-11-06 2023-12-22 杭州同花顺数据开发有限公司 Picture translation method and system
CN117274438B (en) * 2023-11-06 2024-02-20 杭州同花顺数据开发有限公司 Picture translation method and system

Similar Documents

Publication Publication Date Title
US10762376B2 (en) Method and apparatus for detecting text
CN108764048B (en) Face key point detection method and device
WO2018108129A1 (en) Method and apparatus for use in identifying object type, and electronic device
CN112580623B (en) Image generation method, model training method, related device and electronic equipment
CN112560862B (en) Text recognition method and device and electronic equipment
CN109343920B (en) Image processing method and device, equipment and storage medium thereof
CN114926849A (en) Text detection method, device, equipment and storage medium
CN108197567B (en) Method, apparatus and computer readable medium for image processing
CN112989995B (en) Text detection method and device and electronic equipment
CN114677565B (en) Training method and image processing method and device for feature extraction network
CN112597837A (en) Image detection method, apparatus, device, storage medium and computer program product
CN113205041A (en) Structured information extraction method, device, equipment and storage medium
CN112802037A (en) Portrait extraction method, device, electronic equipment and storage medium
CN116168351B (en) Inspection method and device for power equipment
CN113780098A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN113378712A (en) Training method of object detection model, image detection method and device thereof
TW202011266A (en) Neural network system for image matching and location determination, method, and device
CN115546809A (en) Table structure identification method based on cell constraint and application thereof
CN113971728B (en) Image recognition method, training method, device, equipment and medium for model
CN113553428B (en) Document classification method and device and electronic equipment
CN114596431A (en) Information determination method and device and electronic equipment
CN113378958A (en) Automatic labeling method, device, equipment, storage medium and computer program product
CN113780040A (en) Lip key point positioning method and device, storage medium and electronic equipment
CN114972910B (en) Training method and device for image-text recognition model, electronic equipment and storage medium
CN115239776B (en) Point cloud registration method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination