CN113269280B - Text detection method and device, electronic equipment and computer readable storage medium - Google Patents

Text detection method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN113269280B
CN113269280B (application CN202110821804.8A)
Authority
CN
China
Prior art keywords
feature
group
point
feature maps
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110821804.8A
Other languages
Chinese (zh)
Other versions
CN113269280A (en)
Inventor
秦勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110821804.8A
Publication of CN113269280A
Application granted
Publication of CN113269280B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a text detection method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: acquiring a text image to be detected; acquiring a first feature map of the text image to be detected; inputting the first feature map into a feature pyramid enhancement module to generate a second feature map; performing first convolution processing on the second feature map to generate a central point score map of the predicted target; multiplying the central point score map and the first feature map channel by channel and point by point to obtain a group of combined feature maps; performing second convolution processing on the group of combined feature maps to generate a positioning score map of the predicted target; positioning the central point of the predicted target according to the positioning score map; and positioning the text box according to the central point. The present disclosure improves both the speed and the accuracy of dense text detection.

Description

Text detection method and device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular to a text detection method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Text detection has a wide range of applications and is a preliminary step in many computer vision tasks, such as image search, character recognition, identity authentication, and visual navigation. Its main purpose is to locate text lines or characters in an image, so accurate text localization is both important and challenging.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a text detection method including:
acquiring a text image to be detected;
acquiring a first feature mapping of the text image to be detected;
inputting the first feature mapping to a feature pyramid enhancement module to generate a second feature mapping;
performing first convolution processing on the second feature mapping to generate a central point score map of a predicted target;
multiplying the central point score map and the first feature map channel by channel and point by point to obtain a group of combined feature maps;
performing second convolution processing on the group of merged feature maps to generate a positioning score map of a predicted target;
positioning a central point of a predicted target according to the positioning score map;
and positioning the text box according to the center point.
According to another aspect of the present disclosure, there is provided a text detection apparatus including:
the first acquisition module is used for acquiring a text image to be detected;
the second acquisition module is used for acquiring a first feature mapping of the text image to be detected;
the first generation module is used for inputting the first feature mapping to the feature pyramid enhancement module and generating a second feature mapping;
the second generation module is used for performing first convolution processing on the second feature mapping to generate a central point score map of the prediction target;
the first processing module is used for multiplying the central point score map and the first feature map channel by channel and point by point to obtain a group of combined feature maps;
the second processing module is used for performing second convolution processing on the group of combined feature maps to generate a positioning score map of the predicted target;
the first positioning module is used for positioning the central point of the predicted target according to the positioning score map;
and the second positioning module is used for positioning the text box according to the central point.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the text detection method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the text detection method according to any one of the above aspects.
The technical solutions provided in the embodiments of the present application improve both the speed and the accuracy of dense text detection.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a flow diagram of a text detection method according to an example embodiment of the present disclosure;
FIG. 2 illustrates another flow diagram of a text detection method according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a schematic block diagram of a text detection apparatus according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
With the resurgence of deep learning in recent years, text detection has again become a major research hotspot; a large number of dedicated text detection methods have appeared and achieved good detection results. By technical approach, the currently popular text detection methods fall roughly into two categories. The first category is sliding-window-based text detection, which follows the idea of generic object detection: a large number of anchor boxes with different aspect ratios and sizes are set and used as sliding windows to traverse the image, or a feature map obtained by convolving the image, and each searched position is classified as containing text or not. The advantage of this category is that once a text box is confirmed, subsequent work can proceed without further post-processing; its disadvantages are an excessive amount of computation, heavy consumption of computing resources, and long running time. The second category is based on computing connected components, also called segmentation-based methods: a fully convolutional neural network first extracts image features, the feature map is then binarized and its connected components are computed, and finally scene-specific rules (i.e., rules tuned to different training data sets) determine the text line positions. The advantage of this category is fast, lightweight computation; its disadvantage is tedious post-processing involving a large amount of computation and tuning, which not only consumes much time but also makes the performance of the algorithm strictly dependent on whether the post-processing strategy is reasonable and effective.
For "effective and Accurate area-Shaped Text Detection with Pixel Aggregation Network" (PAN for short), "Real-time Scene Text Detection with differentiated binary" and "Objects as Points" (CenterNet for short), the PAN uses residual error Network Resnet18 as the basic Network skeleton to extract the features of texture, edge, corner and semantic information from the input image, and these features are represented by 4 sets of multi-channel feature maps with different sizes. Then, the extracted features are processed by a Feature Pyramid Enhancement Module (FPEM), the FPEM Module combines convolution, deconvolution and batch normalization, extracts features such as texture, edge, corner and semantic information again, and finally obtains a Feature map of 6 channels by performing up-sampling on an output Feature map, the Feature map of the first channel is a probability map representing a text line region, calculates a connected domain after binarization to obtain a specific text line region, the Feature map of the second channel is a probability map representing a text line region and a scaled-in text line region according to a certain rule, calculates the connected domain after binarization to obtain a specific scaled-in text line region, combines the remaining 4 channels to represent a Feature vector with a size of 4-dimensional Feature vector, and then uses a clustering method, and combining the text region map and the contracted text region map, and calculating the distance between the 4-dimensional characteristic vector of each pixel point position and the clustering center point to judge which text region the pixel points which appear in the text region but do not appear in the contracted text region belong to.
DB is also based on ResNet18. It extracts features from the input image, up-samples all the extracted feature maps to one quarter of the original image size and concatenates them, and then applies one convolution operation to obtain a 2-channel feature map as output. The first channel is a probability map of the shrunk text regions; the second channel is a threshold map of the text regions, i.e., the distance from each pixel to the ground-truth text region border, normalized to a number between 0 and 1. A differentiable binarization function is then designed whose parameter can be learned along with the network; a binary map of the image region is computed from the threshold map and the probability map, connected components are computed on the binary map to obtain the shrunk text regions, and the shrunk text regions are finally expanded outward according to certain rules and proportions to obtain the real text regions.
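For reference, the differentiable binarization mentioned above can be written compactly; this is a minimal NumPy sketch of the formula from the DB paper (k is an amplifying factor from that paper, not part of the present method):

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50):
    """Approximate, learnable step function from the DB paper:
    B = sigmoid(k * (P - T)), applied pixel-wise."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```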
CenterNet can be considered a regression-based method. Its general idea is to first set the total number of object classes N to be predicted; the final output then has N + 2 + 2 channels. It predicts only the center point of an object and outputs one score map per class, in which the value of each pixel lies between 0 and 1 and indicates the probability that the point is the center of an object of that class, so there are N score maps. Since the predicted center point cannot be guaranteed to be the real center point, and in practice an offset often occurs, two channels are used to predict the offset of the center point, one for the x axis and one for the y axis. The remaining two channels are used to predict the distances from the center point to the left and top borders of the rectangular box. The actual post-processing finds candidate object centers in the score map by thresholding, corrects each center by its corresponding x-y offset, and then directly obtains a rectangular box from the center point combined with the predicted width and height. The offset mentioned above works as follows: if the width and height of the original image are W and H and the finally predicted feature map has size W/4 and H/4, then a point (10, 10) on the original image corresponds to (2.5, 2.5) on the feature map; but the image is discrete and its coordinates are integers, so after rounding, (10, 10) corresponds to (2, 2), and the offset of the center point on the feature map relative to the original image is (0.5, 0.5).
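The offset arithmetic in this example is simple enough to state in code; the small sketch below (a hypothetical helper, stride 4 as in the example) reproduces the numbers given:

```python
def center_offset(x, y, stride=4):
    """Map an original-image point to feature-map coordinates and its offset."""
    fx, fy = x / stride, y / stride      # exact feature-map position
    ix, iy = int(fx), int(fy)            # discretized integer coordinates
    return (ix, iy), (fx - ix, fy - iy)  # quantized point and residual offset

print(center_offset(10, 10))             # ((2, 2), (0.5, 0.5))
```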
In summary, PAN and DB each have their advantages: PAN is faster than DB in the forward pass because of the FPEM module, while DB's post-processing is simpler than PAN's, so DB is faster in post-processing. On some open scene text detection data sets, with for example 4 to 5 text boxes per image, the detection speed and detection results of the two are substantially comparable. However, in practical scenarios with very dense text, for example an image with 100 text regions such as a pupil's arithmetic exercise book, the speed of both is strongly affected by the number of text boxes and decreases almost linearly as that number grows, so neither meets the speed requirements of such applications. Moreover, in practical dense text scenes the full capability of PAN and DB is unnecessary: dense text mostly consists of angled, slanted rectangular text rather than curved, irregular text. CenterNet, on the other hand, is a very fast generic object detection algorithm whose detection speed is hardly affected even by dense text scenes such as text images, but its accuracy on dense text images is very low, and it suffers in particular from unstable training and box drift, so it cannot be applied to dense scene text detection. The drift caused by CenterNet predicting only a single center point could be eliminated by predicting a center region instead; PAN and DB, for example, have no drift problem. But the main reason CenterNet is faster than other sliding-window-based or segmentation-based methods is precisely that each object is predicted by a single center point, which keeps post-processing very simple and fast.
This embodiment provides a text detection method that can run on smartphones, portable tablet computers (PADs), personal digital assistants (PDAs), and other intelligent devices (electronic devices) with display, processing, and networking capabilities, and that is aimed at practical scenarios with very dense text, such as in the education field. Fig. 1 shows a flowchart of a text detection method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the flow includes the following steps:
and step S101, acquiring a text image to be detected. For example, an image with 100 text regions, such as an arithmetic exercise book for pupils.
Step S102, acquiring a first feature map of the text image to be detected. In some optional embodiments, as in PAN, a ResNet18 network model is used as the basic network model: the text image to be detected is input into the ResNet18 model to obtain M groups of feature maps, and these M groups are taken as the first feature map, where the ResNet18 model is constructed by concatenating M blocks. Specifically, a large number of dense text images are collected and manually labeled with angled rectangles to form the training set of the network of this optional embodiment. A ResNet18 network is used as the backbone; it can be constructed by connecting 4 blocks in series, each block comprising several layers of convolution operations. The feature map output by the first block is 1/4 the size of the original image, the second block 1/8, the third block 1/16, and the fourth block 1/32, giving 4 groups of feature maps, which are taken as the first feature map. Those skilled in the art should understand that the way of obtaining the first feature map is not limited to this embodiment, and other ways of obtaining the first feature map according to actual needs fall within the scope of this embodiment.
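As an illustration of this step, the following sketch (assumed PyTorch/torchvision, not the authors' code) builds the four-block ResNet18 backbone and returns the 4 groups of feature maps at 1/4, 1/8, 1/16, and 1/32 of the input size:

```python
import torch
from torchvision.models import resnet18

class Backbone(torch.nn.Module):
    """Four serially connected ResNet18 blocks; outputs at 1/4, 1/8,
    1/16 and 1/32 of the input resolution (the 'first feature map')."""
    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.block1, self.block2 = net.layer1, net.layer2
        self.block3, self.block4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)       # already 1/4 of the input resolution
        c1 = self.block1(x)    # 1/4,  64 channels
        c2 = self.block2(c1)   # 1/8,  128 channels
        c3 = self.block3(c2)   # 1/16, 256 channels
        c4 = self.block4(c3)   # 1/32, 512 channels
        return c1, c2, c3, c4  # M = 4 groups of feature maps

feats = Backbone()(torch.randn(1, 3, 640, 640))
print([tuple(f.shape) for f in feats])
```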
Step S103, inputting the first feature map into a feature pyramid enhancement module to generate a second feature map. Specifically, the first feature map is passed through two FPEM modules to obtain a set of feature maps (called second-order feature maps), and the second feature map is then obtained by fusing the information in the second-order feature maps.
Step S104, performing first convolution processing on the second feature map to generate a central point score map of the predicted target. In some optional embodiments, after the second feature map is obtained, it is processed by five branches. The first branch performs one convolution operation on the second feature map and outputs a 1-channel feature map whose size is 1/4 of the original image, representing the central point score map of the predicted target. Unlike CenterNet, which for each predicted target takes only the pixel where the center lies as the center point, this embodiment takes all points in a Gaussian region around the center, with a radius of, for example, 3, as center points of the predicted target; of course, the radius may be chosen flexibly according to the actual situation.
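The Gaussian center region can be generated as a training target roughly as follows (a NumPy sketch; the function name and details are assumptions); every point within the given radius of the center receives a Gaussian weight, with the value 1 on the center pixel itself:

```python
import numpy as np

def gaussian_center_target(heatmap, cx, cy, radius=3):
    """Splat a Gaussian of the given radius onto heatmap at (cx, cy)."""
    sigma = (2 * radius + 1) / 6.0
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    g = np.exp(-(x * x + y * y) / (2 * sigma * sigma))
    h, w = heatmap.shape
    x0, x1 = max(0, cx - radius), min(w, cx + radius + 1)
    y0, y1 = max(0, cy - radius), min(h, cy + radius + 1)
    gx0, gy0 = x0 - (cx - radius), y0 - (cy - radius)
    heatmap[y0:y1, x0:x1] = np.maximum(
        heatmap[y0:y1, x0:x1],
        g[gy0:gy0 + (y1 - y0), gx0:gx0 + (x1 - x0)])
    return heatmap

target = gaussian_center_target(np.zeros((160, 160)), cx=80, cy=40)
```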
Step S105, multiplying the central point score map and the first feature map channel by channel and point by point to obtain a group of combined feature maps;
and step S106, performing second convolution processing on the combined feature mapping to generate a positioning score map of the prediction target. In an optional embodiment, the step is performed by a fifth branch, the central point score map is multiplied by the first feature map channel by point to obtain a set of merged feature maps, the set of merged feature maps is subjected to a second convolution processing, specifically, the set of merged feature maps is input into a neural network model to perform a first deconvolution operation and a second convolution operation, so as to obtain the positioning score map, the neural network model can be obtained by training parameters of a first deconvolution layer and a second convolution layer by using a two-class cross entropy loss function, the first deconvolution layer corresponds to the first deconvolution operation, and the second convolution layer corresponds to the second convolution operation. As some specific implementation manners, the fifth branch firstly changes all 4 groups of feature maps output by four block blocks of Resnet18 in the backbone network into 1/8 size of the original image in an up-down sampling manner, then serially superimposes the feature maps, then scales the output of the first branch to 1/8 size of the original image and multiplies the feature maps obtained after superimposing channel by channel to obtain a group of combined feature maps, then performs deconvolution and convolution operations on the combined feature maps for two times, outputs a 1-channel feature map with the size of 1/4 size of the original image, and represents a positioning score map, and the value of each pixel point on the map represents the probability that the pixel point is the only center point of an object (one center point of each object).
Step S107, positioning the central point of the predicted target according to the positioning score map. Specifically, pixels in the positioning score map whose values exceed a preset threshold are taken as center points.
Step S108, positioning the text box according to the center point.
Through the above steps, the center point is obtained by jointly using the positioning score map and the central point score map, which greatly improves the center-point prediction accuracy and thus improves both the speed and the accuracy of dense text detection.
In order to position the text box more accurately, some alternative embodiments, as shown in FIG. 2, further include the following steps:
step S201, performing a third convolution process on the second feature map to generate a central point offset of the prediction target. Specifically, the second branch performs a convolution operation on the second feature map, and outputs a 2-channel feature map with the size of the original map 1/4, which represents the offset of the center point, where each pixel point on channel 1 represents the offset of the center point of the corresponding position relative to the x direction, and each pixel point on channel 2 represents the offset of the center point of the corresponding position relative to the y direction.
Step S202, determining the coordinates of the center point according to the center-point offset. The text box may be located directly from the center-point coordinates; alternatively, to position the text box more accurately, step S203 is performed.
Step S203, performing fourth convolution processing on the second feature map to obtain the rotation angle of the predicted target. Specifically, the third branch performs one convolution operation on the second feature map and outputs a 1-channel feature map whose size is 1/4 of the original image, representing the rotation angle of the detected text box; that is, the value at each pixel indicates the angle of the corresponding target relative to the horizontal direction. This branch is added to handle oblique text.
Step S204, performing fifth convolution processing on the second feature map to obtain the height and the width of the predicted target. Specifically, the fourth branch performs one convolution operation on the second feature map and outputs a 2-channel feature map whose size is 1/4 of the original image, representing the height and width of the predicted target: the value of each pixel point on channel 1 represents the height of the detected text box, and the value of each pixel point on channel 2 represents its width.
Step S205, positioning the text box according to the coordinates of the center point, the rotation angle, the height and the width.
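The decoding described in steps S201 to S205 can be sketched as follows (a NumPy illustration; the function name and corner ordering are assumptions):

```python
import numpy as np

def decode_box(cx, cy, dx, dy, angle, h, w, stride=4):
    """Decode one rotated text box into original-image corner coordinates."""
    px, py = (cx + dx) * stride, (cy + dy) * stride   # offset-corrected center
    cos, sin = np.cos(angle), np.sin(angle)
    rot = np.array([[cos, -sin], [sin, cos]])
    corners = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                        [w / 2,  h / 2], [-w / 2,  h / 2]])
    return corners @ rot.T + np.array([px, py])

# Example: a center at feature-map cell (2, 2) with offset (0.5, 0.5),
# tilted 5 degrees, 20 pixels high and 80 pixels wide.
box = decode_box(2, 2, 0.5, 0.5, np.deg2rad(5.0), h=20, w=80)
print(box.round(1))
```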
In some optional embodiments, model training is divided into two stages. In the first stage, the fifth branch does not participate in training and only the first four branches are trained: as in CenterNet, the first branch is trained with Focal Loss and the other branches with Smooth-L1 Loss. When the first stage finishes, the second stage begins; it trains only the fifth branch, simultaneously updating only the parameters of the three neural network layers before the output, namely one deconvolution and two convolutions, using a binary cross-entropy loss function in which only the center point is positive and all other points are negative. In the forward stage, the output of the fifth branch is multiplied point by point with the output of the first branch to obtain the final central point score map; a threshold is set, and points above the threshold are considered center points. The coordinates of each center point on the original image are then determined from the offset, and the corresponding text detection box is obtained directly from the height, width, and inclination angle. All operations are completed in parallel on a Graphics Processing Unit (GPU), which greatly improves detection speed; at the same time, obtaining the center point by jointly using the positioning score map and the central point score map greatly improves the center-point prediction accuracy, thereby improving both the speed and the accuracy of dense text detection.
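A sketch of the two-stage loss composition described above (assumed PyTorch; the dictionary keys are illustrative, and the focal loss shown is the CenterNet-style penalty-reduced variant):

```python
import torch
import torch.nn.functional as F

def focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """CenterNet-style penalty-reduced focal loss on a Gaussian heatmap."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_term = -((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_term = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    return (pos_term + neg_term).sum() / pos.sum().clamp(min=1)

def stage1_loss(pred, target):
    """First stage: train branches 1-4; the fifth branch is frozen."""
    return (focal_loss(pred['center'], target['center'])
            + F.smooth_l1_loss(pred['offset'], target['offset'])
            + F.smooth_l1_loss(pred['angle'], target['angle'])
            + F.smooth_l1_loss(pred['size'], target['size']))

def stage2_loss(pred, target):
    """Second stage: train only the fifth branch; only the unique center
    pixel of each object is positive, every other pixel is negative."""
    return F.binary_cross_entropy(pred['positioning'], target['unique_center'])
```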
Step S103 above involves inputting the first feature map into the feature pyramid enhancement module to obtain the second feature map. In some optional embodiments, the N groups of multi-channel feature maps of different sizes in the first feature map are numbered, in descending order of size, as the forward 1st, 2nd, ..., Nth groups. The forward Nth group is taken as the reverse first group; the forward Nth group is up-sampled, added point by point by channel to the forward (N-1)th group, and subjected to sixth convolution processing to obtain the reverse second group. The reverse second group is up-sampled, added point by point by channel to the forward (N-2)th group, and subjected to seventh convolution processing to obtain the reverse third group; the same operation is applied in turn to each group of forward feature maps until the reverse Nth group is obtained. The reverse Nth group is then taken as the target first group; it is down-sampled, added point by point by channel to the reverse (N-1)th group, and subjected to eighth convolution processing to obtain the target second group. The target second group is down-sampled, added point by point by channel to the reverse (N-2)th group, and subjected to ninth convolution processing to obtain the target third group; the same operation is applied in turn to each group of reverse feature maps until the target Nth group is obtained, and the target Nth group is taken as the second feature map. N is a positive integer. In particular, 2 FPEM modules are selected in this optional embodiment, since 2 gave the best results in the inventors' experiments.
Each FPEM module performs the same processing. In detail, 4 groups of multi-channel feature maps of different sizes are obtained and, from large to small, are called the forward first, second, third, and fourth group feature maps. The forward fourth group is up-sampled by a factor of 2, i.e., its size is enlarged 2 times, and then added point by point by channel to the forward third group; the result undergoes one depthwise separable convolution operation followed by convolution, batch normalization, and activation, and the output is called the reverse second group. The same operation applied to the reverse second group and the forward second group gives the reverse third group, and applied again to the reverse third group and the forward first group gives the reverse fourth group; the forward fourth group is meanwhile regarded as the reverse first group, so that 4 groups of reverse feature maps are obtained. The reverse fourth group is then taken as the target first group; it is down-sampled by a factor of 2, i.e., its size is reduced 2 times, added point by point by channel to the reverse third group, and the result undergoes one depthwise separable convolution operation followed by convolution, batch normalization, and activation; the output is called the target second group. The same operation applied to the target second group and the reverse second group gives the target third group, and applied to the target third group and the reverse first group gives the target fourth group. The target first, second, third, and fourth groups are the output of the FPEM module. The 2nd FPEM module takes the output of the 1st FPEM module as input, and the same operation yields the output as the second feature map.
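One FPEM module as described can be sketched as follows (an interpretation in PyTorch, assuming all four groups have already been reduced to a common channel count, as in PAN; names are illustrative):

```python
import torch
import torch.nn.functional as F

class SeparableConvBNReLU(torch.nn.Module):
    """Depthwise separable convolution followed by BN and ReLU."""
    def __init__(self, ch):
        super().__init__()
        self.depthwise = torch.nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pointwise = torch.nn.Conv2d(ch, ch, 1)
        self.bn = torch.nn.BatchNorm2d(ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))

class FPEM(torch.nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.up_conv = torch.nn.ModuleList([SeparableConvBNReLU(ch) for _ in range(3)])
        self.dn_conv = torch.nn.ModuleList([SeparableConvBNReLU(ch) for _ in range(3)])

    @staticmethod
    def _resize_add(a, b):
        """Resize a to b's spatial size, then add point by point by channel."""
        return F.interpolate(a, size=b.shape[-2:], mode='bilinear',
                             align_corners=False) + b

    def forward(self, f1, f2, f3, f4):       # forward groups, large to small
        rev1 = f4                             # reverse first group
        rev2 = self.up_conv[0](self._resize_add(rev1, f3))   # up-scale pass
        rev3 = self.up_conv[1](self._resize_add(rev2, f2))
        rev4 = self.up_conv[2](self._resize_add(rev3, f1))
        tgt1 = rev4                           # target first group
        tgt2 = self.dn_conv[0](self._resize_add(tgt1, rev3)) # down-scale pass
        tgt3 = self.dn_conv[1](self._resize_add(tgt2, rev2))
        tgt4 = self.dn_conv[2](self._resize_add(tgt3, rev1))
        return tgt1, tgt2, tgt3, tgt4
```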
PAN and DB are segmentation-based algorithms designed specifically for text detection; they locate a text box by predicting the entire shrunk text region and therefore exhibit no text drift. The backbone of PAN is also specifically designed for text detection, unlike other text detection algorithms that reuse backbones designed for classification or recognition tasks. CenterNet is a generic anchor-free object detection method whose backbone was designed for human keypoint detection or natural scene object detection and classification; the detection boxes it outputs are mainly regular rectangles, and it cannot be used directly for dense, elongated, angled, slanted text. Moreover, CenterNet predicts only one center point per object, i.e., only one pixel serves as the object's center; although center-point drift (inaccurate prediction) can cause detection box drift when CenterNet detects generic targets, the drift amplitude there is small and does not affect detection quality. Based on this, this optional embodiment combines the advantages of all three. CenterNet could eliminate box drift by predicting a center region, but for the same object this yields multiple detection boxes and requires Non-Maximum Suppression (NMS); the NMS process is complex, must run on the CPU, and is time-consuming. This optional embodiment instead selects one center point from the center region by a classification operation, which is equivalent to adding a center-point preference to CenterNet, thereby achieving a dual improvement in accuracy and speed.
In this embodiment, a text detection apparatus is further provided, and the text detection apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used hereinafter, the term "module" is a combination of software and/or hardware that can implement a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The present embodiment provides a text detection apparatus 300, as shown in fig. 3, including:
the first obtaining module 301 is configured to obtain a text image to be detected;
a second obtaining module 302, configured to obtain a first feature map of the text image to be detected;
a first generating module 303, configured to input the first feature mapping to a feature pyramid enhancing module, and generate a second feature mapping;
a second generating module 304, configured to perform a first convolution processing on the second feature map to generate a central point score map of the predicted target;
a first processing module 305, configured to multiply the central point score map and the first feature map channel by channel and point by point to obtain a set of merged feature maps;
a second processing module 306, configured to perform second convolution processing on the set of merged feature maps to generate a positioning score map of the predicted target;
a first positioning module 307, configured to position the central point of the predicted target according to the positioning score map;
and a second positioning module 308, configured to position the text box according to the central point.
The text detection apparatus in this embodiment is presented in the form of functional units, where a unit may be an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, and/or another device that can provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 4, a block diagram of an electronic device 400, which may be a server or a client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the electronic device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data required for the operation of the device 400 can also be stored. The computing unit 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
A number of components in the electronic device 400 are connected to the I/O interface 405, including: an input unit 406, an output unit 407, a storage unit 408, and a communication unit 409. The input unit 406 may be any type of device capable of inputting information to the electronic device 400; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 407 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 408 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 401 may be any of a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 401 executes the respective methods and processes described above. For example, in some embodiments, the text detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. In some embodiments, the computing unit 401 may be configured to perform the text detection method in any other suitable way (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (8)

1. A text detection method, comprising:
acquiring a text image to be detected;
acquiring a first feature mapping of the text image to be detected;
inputting the first feature mapping to a feature pyramid enhancement module to generate a second feature mapping;
performing first convolution processing on the second feature mapping to generate a central point score map of a predicted target;
multiplying the central point score map and the first feature map channel by channel and point by point to obtain a group of combined feature maps;
performing second convolution processing on the combined feature mapping to generate a positioning score map of a prediction target, wherein the value of each pixel point on the positioning score map represents the probability that the pixel point is the only central point of the object;
positioning a central point of a predicted target according to the positioning score map;
positioning a text box according to the center point;
the obtaining of the first feature mapping of the text image to be detected includes: inputting the text image to be detected into a Resnet18 network model to obtain M groups of feature mappings, and taking the M groups of feature mappings as the first feature mappings; wherein the Resnet18 network model includes M block concatenation constructs;
the channel-by-channel point-by-point multiplying the central point score map and the first feature map to obtain a set of combined feature maps, including: and for the M groups of feature maps, all the feature maps are changed into 1/8 sizes of the text image to be detected in an up-and-down sampling mode and are superposed in series, the central point score map is zoomed to 1/8 sizes of the text image to be detected and is multiplied with the feature maps obtained after superposition channel by channel, and a group of combined feature maps are obtained.
2. The text detection method of claim 1, wherein the method further comprises:
performing third convolution processing on the second feature mapping to generate a central point offset of a predicted target;
taking the pixel points whose values in the positioning score map are greater than a preset threshold as the central points;
determining the coordinate of the central point according to the central point offset;
and positioning the text box according to the coordinates of the central point.
3. The text detection method of claim 2, wherein the method further comprises:
performing fourth convolution processing on the second feature mapping to obtain a rotation angle of the prediction target;
performing fifth convolution processing on the second feature mapping to obtain the height and the width of a predicted target;
and positioning the text box according to the coordinates of the central point, the rotation angle, the height and the width.
4. The text detection method of claim 1, wherein performing a second convolution process on the set of merged feature maps to generate a localization score map of the predicted target comprises:
inputting the group of combined feature maps into a neural network model to perform one deconvolution operation and two convolution operations to obtain the positioning score map;
the neural network model is obtained by training parameters of the deconvolution layer and the two convolution layers by using a binary cross entropy loss function; the deconvolution layer corresponds to the one deconvolution operation, and the two convolution layers correspond to the two convolution operations.
5. The text detection method of claim 1, wherein inputting the first feature mapping to a feature pyramid enhancement module, resulting in a second feature mapping comprises:
dividing the N groups of multi-channel feature maps with different sizes in the first feature map into forward 1st, 2nd, ..., Nth groups of feature maps in descending order of size;
taking the forward Nth group of feature maps as a reverse first group of feature maps; up-sampling the forward Nth group of feature maps, adding it to the forward (N-1)th group of feature maps point by point by channel, and performing sixth convolution processing to obtain a reverse second group of feature maps;
after up-sampling the reverse second group of feature maps, adding it to the forward (N-2)th group of feature maps point by point by channel and performing seventh convolution processing to obtain a reverse third group of feature maps; sequentially performing the same operation on each group of forward feature maps to obtain a reverse Nth group of feature maps;
taking the reverse Nth group of feature maps as a target first group of feature maps; down-sampling the reverse Nth group of feature maps, adding it to the reverse (N-1)th group of feature maps point by point by channel, and performing eighth convolution processing to obtain a target second group of feature maps;
after down-sampling the target second group of feature maps, adding it to the reverse (N-2)th group of feature maps point by point by channel and performing ninth convolution processing to obtain a target third group of feature maps; sequentially performing the same operation on each group of reverse feature maps to obtain a target Nth group of feature maps;
taking the target Nth set of feature maps as the second feature map;
N is a positive integer.
6. A text detection apparatus comprising:
the first acquisition module is used for acquiring a text image to be detected;
the second acquisition module is used for acquiring a first feature mapping of the text image to be detected; the obtaining of the first feature mapping of the text image to be detected includes: inputting the text image to be detected into a Resnet18 network model to obtain M groups of feature mappings, and taking the M groups of feature mappings as the first feature mapping; wherein the Resnet18 network model is constructed by concatenating M blocks;
the first generation module is used for inputting the first feature mapping to the feature pyramid enhancement module and generating a second feature mapping;
the second generation module is used for performing first convolution processing on the second feature mapping to generate a central point score map of the prediction target;
the first processing module is used for multiplying the central point score map and the first feature map channel by channel and point by point to obtain a group of combined feature maps; the multiplying the central point score map and the first feature map channel by channel and point by point to obtain a group of combined feature maps includes: resizing all of the M groups of feature maps to 1/8 the size of the text image to be detected by up- or down-sampling and superimposing them in series, scaling the central point score map to 1/8 the size of the text image to be detected, and multiplying it channel by channel with the superimposed feature maps to obtain a group of combined feature maps;
the second processing module is used for carrying out second convolution processing on the combined feature mapping to generate a positioning score map of the prediction target, and the value of each pixel point on the positioning score map represents the probability that the pixel point is the only central point of the object;
the first positioning module is used for positioning the central point of the predicted target according to the positioning score map;
and the second positioning module is used for positioning the text box according to the central point.
7. An electronic device, comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of claims 1-5.
8. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN202110821804.8A 2021-07-21 2021-07-21 Text detection method and device, electronic equipment and computer readable storage medium Active CN113269280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110821804.8A CN113269280B (en) 2021-07-21 2021-07-21 Text detection method and device, electronic equipment and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN113269280A CN113269280A (en) 2021-08-17
CN113269280B (en) 2021-10-08

Family

ID=77236892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110821804.8A Active CN113269280B (en) 2021-07-21 2021-07-21 Text detection method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113269280B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469878B (en) * 2021-09-02 2021-11-12 北京世纪好未来教育科技有限公司 Text erasing method and training method and device of model thereof, and storage medium
CN114118075B (en) * 2022-01-28 2022-04-22 北京易真学思教育科技有限公司 Text recognition method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709420B (en) * 2020-06-18 2022-06-24 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN112801228B (en) * 2021-04-06 2021-08-06 北京世纪好未来教育科技有限公司 Text recognition method, electronic equipment and storage medium thereof
CN112990204B (en) * 2021-05-11 2021-08-24 北京世纪好未来教育科技有限公司 Target detection method and device, electronic equipment and storage medium
CN113052162B (en) * 2021-05-27 2021-09-03 北京世纪好未来教育科技有限公司 Text recognition method and device, readable storage medium and computing equipment

Also Published As

Publication number Publication date
CN113269280A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113313083B (en) Text detection method and device
CN112990204B (en) Target detection method and device, electronic equipment and storage medium
CN112528976B (en) Text detection model generation method and text detection method
CN112990203B (en) Target detection method and device, electronic equipment and storage medium
CN113269280B (en) Text detection method and device, electronic equipment and computer readable storage medium
CN113139543B (en) Training method of target object detection model, target object detection method and equipment
EP3852008A2 (en) Image detection method and apparatus, device, storage medium and computer program product
CN114186632B (en) Method, device, equipment and storage medium for training key point detection model
JP2023527615A (en) Target object detection model training method, target object detection method, device, electronic device, storage medium and computer program
CN114004840A (en) Image processing method, training method, detection method, device, equipment and medium
CN110633717A (en) Training method and device for target detection model
CN115170815A (en) Method, device and medium for processing visual task and training model
CN114842035A (en) License plate desensitization method, device and equipment based on deep learning and storage medium
CN117315406A (en) Sample image processing method, device and equipment
CN113657396A (en) Training method, translation display method, device, electronic equipment and storage medium
US20230298324A1 (en) Image acquisition model training method and apparatus, image detection method and apparatus, and device
CN113850238B (en) Document detection method and device, electronic equipment and storage medium
US20230005171A1 (en) Visual positioning method, related apparatus and computer program product
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
CN113850239B (en) Multi-document detection method and device, electronic equipment and storage medium
CN113657466B (en) Pre-training model generation method and device, electronic equipment and storage medium
CN114140320A (en) Image migration method and training method and device of image migration model
CN110633595B (en) Target detection method and device by utilizing bilinear interpolation
CN112348021A (en) Text detection method, device, equipment and storage medium
CN107025106B (en) Graph drawing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant